ioHub key logging module is not unicode compliant on Windows #8

Closed · isolver opened this issue Aug 28, 2012 · 6 comments

isolver (Owner) commented Aug 28, 2012

Currently, only language sets / keyboards that work with the ASCII character set work with ioHub. We need to replace the module used on Windows for keyboard event hooking with a unicode version (or patch the existing module source so it uses the Unicode versions of the Windows functions).

peircej commented Apr 9, 2013

I'm not sure about /replace/ with a unicode version. The idea would be to know what key (by name) is pressed and also which character it represents. E.g. I could be pressing 'e' but using modifiers that give it an accent - some users want the key and others want the character.

isolver (Owner, Author) commented Apr 9, 2013

The intention, which may not be fully implemented yet, is that both are given.

The goal, I believe, is still to make the primary purpose of this type of
'scientific / experimental' key logger the creation of an event for every
actual physical key pressed on the keyboard. So if you have to press 3 keys
to get the intended 'output' character, then ioHub should report all three
key events, since that is what was pressed, but also give the information
needed to recover the user's end intent from the sequence of key presses.
This is not a text editor key event API, it is a 'the keyboard is being
used as an input device for my experiment' key event API. ;)

So, with that goal in mind, this is what some of the relevant fields of each
key event contain (see the illustrative sketch after this list):

  1. Scancode field: the OS-, locale-, and keyboard-dependent scan code for
    each key event. This is not going to be useful most of the time, but if
    errors in key mappings are found and someone wants to try to recover the
    meaningful data, this is needed.

  2. Char code / Key ID field: the charcode / key_id for each key press.
    This is the value given by each OS after the scan code has been run
    through the keyboard and locale mapping tables. The value is expected to
    have a one-to-one relationship to actual characters, or to key
    descriptions for non-visible keys. It is still OS dependent, though. On
    OSX, some of the time this code is actually the utf-8 encoded value for
    the key, but other times it is not, so this cannot be counted on. In
    Windows and Xlib it is definitely not the Unicode encoded key value.

  3. The uchar field: given the level 2 information, each OS has a way to
    get the utf-8 encoded value for the character, if it is a valid Unicode
    character, or if it has been assigned a utf-8 encoding by the OS vendor
    (Apple has done this for several dozen characters, for example many of
    the 'operational' function keys on the keyboard, even the Apple logo).
    If a given key event has a utf-8 encoded value available, then that
    value is all that is needed to present the actual key graphic to the
    user as well, as long as the text viewer they are using supports
    displaying utf-8 characters, which most do. The uchar field holds the
    Unicode code units of what you would normally see as hex in a utf-8
    encoded character (up to 3 or even 4 bytes can be used for one utf-8
    encoded char). If there is no 'real' unicode value for the key pressed,
    or it cannot be determined, fallbacks are used when possible, like
    OS-provided lookup tables. For example, on OSX, Apple reserved a block
    of the Unicode address space and used it to define encoding values for
    most / all of the non-visible keys / actions that can be triggered on
    the Mac. That is what the UnicodeChars subclass of the KeyboardConstants
    class is: a copy of part of the NSEvent.h file. If both these steps
    fail, then the key is given a 'Dead Char' assignment in this field. Some
    keys will be 'dead' keys, which have no representation at this level
    because they do not generate any input themselves in terms of a char /
    uchar. You gave an example of such a case. Even the 'modifier' keys we
    are used to can be classified as 'dead' characters in this sense (shift,
    alt, ctrl, etc.). These types of keys seem to always require a
    lookup-table approach to provide a human-readable / meaningful
    representation of the key. This is also OS dependent, which is why there
    is a fourth field of information for each key event: the 'key' field.

  4. The key field: this is the field most people will ever need to worry
    about. ;) This field holds the actual unicode char and can be viewed as
    intended. If it is a valid, visible, utf-8 encodable char we are dealing
    with, then the only difference between fields 3 and 4 is that field 4
    has been decoded so it is a unicode char, not a sequence of 8-bit chars,
    as far as the environment is concerned, so it can be displayed properly.
    If the key is not a visible key, or the OS does not provide a utf-8
    encoded version, then the key field holds the "text label" for the key.
    Common examples of this are the Page Up and Page Down keys, Escape, the
    arrow keys, all the common modifier keys, function keys, multimedia
    keys, etc. In these cases the key field holds the name / label for the
    key. This is very OS specific, so I have added the ability to define a
    mapping file between the 'standard' ioHub label for such keys and the
    label the current OS uses. This way, the idea is that even most dead /
    non-visible keys can be given a consistent label by ioHub across OSs.
    This is the idea anyhow; it is not complete yet for sure.
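
To make these four fields concrete, here is a minimal illustrative sketch of
a single key event; the dict keys and values below are hypothetical, chosen
for this example only, and are not the actual ioHub event attributes:

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch (Python 2 era): one key press producing an accented
# 'e'. Field names and values are illustrative, not the real ioHub API.
example_key_event = {
    # 1. Raw scan code: OS / locale / keyboard dependent; mainly useful
    #    when debugging broken key mappings.
    'scancode': 0x12,
    # 2. Char code / key id: the OS value after layout mapping; still OS
    #    dependent and not reliably a Unicode value.
    'key_id': 0x45,
    # 3. uchar: the utf-8 encoded value for the key, when one exists.
    'uchar': '\xc3\xa9',   # utf-8 bytes for u'\xe9' ('e' with acute accent)
    # 4. key: the decoded unicode char, or a text label for dead /
    #    non-visible keys (e.g. u'PAGE_UP', u'LSHIFT').
    'key': u'\xe9',
}
```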

I've learned a lot about this stuff over the last half week: why it has
taken so long to get the initial ports done, and why the design has evolved
from one OS to the next. Until I actually understood how each OS handles
this and what functions I had access to on each, I could not make a proper
overall plan that should be able to address the OS differences while keeping
things as consistent as possible for people, using this multi-stage
approach.

I also figured out how to actually display a unicode character properly in
a console window when a script outputs it, instead of showing the ascii
rendering of the utf-8 encoded text, or junk. I have not perfected it in
terms of handling all the use cases that can occur, but it is possible,
even when sending the unicode text over a network or pipe and displaying it
in the console of another process. I have started to fix the bugs in ioHub
in this regard as I run into them.

It seems to me that there are two main causes for a unicode character or
text not displaying correctly:

A) A piece of text that is already utf-8 encoded gets encoded with utf-8
again by another part of the program somewhere, so the doubly encoded text
becomes garbage and cannot be displayed correctly.

B) Unicode text that has been converted to a utf-8 encoded string does not
get decoded 'on the other side' back into a unicode string, or the decoding
process does not use utf-8 explicitly and therefore falls back to the OS or
pipe (like stdout) default encoding.

In the first case, you are shown the hex-looking string of text: the utf-8
encoded string representation of the unicode text. In the second case,
since a different encoding was used when decoding the utf-8 string back to
unicode text, you get garbage. The same encoding scheme (utf-8, IMO, is
what we should standardize on) must be used to both encode and decode the
data for things to work right. It is not really possible for software to
reliably determine a piece of data's encoding automatically, so the
encoding to be used should be explicitly specified. Inside the program,
just always use unicode strings / text; making sure things are unicode
strings instead of 8-bit char strings is easy, since the Python interface
supports all the same operations on unicode text as on 8-bit char strings.
Then, only when the program needs to 'output' or 'input' text do we need to
worry about ensuring a consistent encoding is used and that things are not
double encoded or decoded.
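
A minimal sketch of both failure modes, assuming Python 2 semantics (the
unicode vs. 8-bit str split that ioHub targeted at the time):

```python
# -*- coding: utf-8 -*-
text = u'\u00e9'                    # unicode text: 'e' with acute accent
encoded = text.encode('utf-8')      # correct utf-8 bytes: '\xc3\xa9'

# Failure A: double encoding. If already-encoded bytes are mistaken for
# text and encoded again, the result is garbage no single decode recovers.
double = encoded.decode('latin-1').encode('utf-8')   # '\xc3\x83\xc2\xa9'

# Failure B: decoding with the wrong codec (e.g. an OS / pipe default
# instead of explicit utf-8) produces the wrong characters entirely.
wrong = encoded.decode('latin-1')   # u'\xc3\xa9', displays as two chars

# The fix: the SAME explicit encoding on both sides of any output / input.
assert encoded.decode('utf-8') == text
```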

Anyhow, that was a bit of a dump, but it was good to get it down in
writing I think. ;) There is still lots to do and fix in this area, but
now is the time to do it, and I think I finally 'get it'. Well, mostly. ;)
I am able to actually send unicode chars from the ioHub process in an error
message, for example, send them over the network, and then have them
displayed as the correct unicode character graphics in a command prompt on
Windows or OSX, or in a python shell. I could never get the printing to
work right before, so I think I must be going in the generally right
direction.

Let me know if you see any issues, etc.
Thanks!


peircej commented Apr 9, 2013

Honestly, I haven't got time to read that amount of text properly! But
from a quick scan it sounds like you were already going for the concept
I had in my head (multiple fields so the user can extract different
forms of info). Great!


isolver (Owner, Author) commented Apr 9, 2013

I also just realized you were picking up on the bug in github. I need to
check that, actually. Thanks for the very relevant reminder. ;)

I had read that pyHook was not unicode compliant because it used the 'A'
versions of the Windows functions. However, I have done tests entering a
few French letters, since my keyboard supports that, and they seemed to be
stored and retrievable, decodable back to the right letter. I will try some
Chinese ones using the on-screen keyboard later and see how that goes.
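
For context, the 'A' vs. 'W' distinction refers to the ANSI vs.
wide-character (UTF-16) entry points Win32 exposes for most text-handling
functions. A quick illustration via ctypes (Windows only; GetKeyNameTextW
is a real Win32 call, but the scan code used here is just an example):

```python
import ctypes

user32 = ctypes.windll.user32

# Same logical function, two entry points. Anything built on the 'A'
# variant is limited to the current ANSI code page; the 'W' variant
# returns UTF-16 text, which is what full unicode support requires.
lparam = 0x1C << 16                  # scan code (Enter) in bits 16-23
buf = ctypes.create_unicode_buffer(64)
n = user32.GetKeyNameTextW(lparam, buf, len(buf))
print(buf.value[:n])                 # e.g. u'Enter'
```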


isolver (Owner, Author) commented Apr 14, 2013

I have looked at the pyHook code and I do not think there is an issue. pyHook does not 'give' you unicode characters, but using the information it does give, it should be possible to get them ourselves. The keycode and scancode should be valid and can be used in a call to the ToUnicode function. The 'twist' is that, since we are getting the events from a low-level (LL) hook in a separate process, we cannot use the GetKeyboardState() function to get the keyboard state that is also passed into the ToUnicode call. Each thread has a different keyboard state array in Windows, and there is no standard way for one process to get another process's keyboard state array. So the plan is to create and maintain a keyboard state array that is compatible with what needs to be handed to the ToUnicode function.
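
A rough ctypes sketch of that plan, as an untested outline (assuming the LL
hook reports virtual-key / scan code pairs); this is not the actual ioHub
implementation:

```python
import ctypes

user32 = ctypes.windll.user32

# Our own 256-entry keyboard state array, maintained from the LL hook
# events, since GetKeyboardState() only reflects the calling thread.
key_state = (ctypes.c_ubyte * 256)()

def update_key_state(vkey, pressed):
    # ToUnicode reads the high bit (0x80) as "key is down". Toggle state
    # (e.g. Caps Lock, low bit 0x01) would need extra handling.
    key_state[vkey] = 0x80 if pressed else 0x00

def vkey_to_unicode(vkey, scancode):
    # Translate a virtual key + scan code to unicode text using the state
    # array we maintain. Returns None if no translation (or a dead key).
    buf = ctypes.create_unicode_buffer(8)
    n = user32.ToUnicode(vkey, scancode, key_state, buf, len(buf), 0)
    return buf.value[:n] if n > 0 else None
```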

ghost assigned isolver Apr 14, 2013

peircej commented Apr 15, 2013

Unicode can be addressed later if needed. It is something that people have
requested before, but it isn't currently provided, so it isn't like anybody
would be losing anything.

Jon


isolver closed this as completed Apr 17, 2013