Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for UTF-8 encoding #21

Closed
ian-mcdowell opened this issue Feb 7, 2018 · 14 comments
Closed

Support for UTF-8 encoding #21

ian-mcdowell opened this issue Feb 7, 2018 · 14 comments

Comments

@ian-mcdowell
Copy link
Contributor

OpenTerm passes UTF-8 commands to ios_system, and expects UTF-8 output from commands. However, argument parsing & command parsing converts the string to ASCII, which causes data loss.

Try: echo 馃榾 in OpenTerm to see results.

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

That is a very good question. I'll look into it, but most Unix commands expect a C-string as input. There might not be a way to keep UTF8 characters through the entire pipeline.

At some point, the arguments have to be (char argc, char **argv). I do not know of a way to do UTF8 -> char -> UTF8 while keeping UTF8 information. I'll have to learn.

@ian-mcdowell
Copy link
Contributor Author

-[NSString cStringUsingEncoding:NSUTF8StringEncoding]?

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

Sorry, strike my previous answer. ios_system() operates on char*, because system() operates on char*. There is no operation on the encoding inside ios_system(). The conversion was done in CommandExecutor, where there is the line ios_system(command.utf8CString) which is equivalent to [command cStringUsingEncoding:NSUTF8StringEncoding] (but shorter).

We need to track the result of this command.utf8CString conversion, but my guess is that it removed the UTF-8 content to make it compatible with C-like strings. This conversion cannot be reversed.

@ian-mcdowell
Copy link
Contributor Author

ian-mcdowell commented Feb 7, 2018

Take a look at parseArgument. You鈥檙e converting the arguments to ASCII.

If you take a UTF-8 C string (char *), then call +[NSString initWithCString:string encoding:NSASCIIStringEncoding] to convert it to an NSString, you have lost the UTF-8 characters. Even if you later later call -[NSString cStringWithEncoding:NSUTF8StringEncoding], they will be gone.

@ian-mcdowell
Copy link
Contributor Author

All that should be necessary to fix this (I've verified locally) is to change all occurrences of NSASCIIStringEncoding to NSUTF8StringEncoding in ios_system.m

diff.txt

However, if we want to support other encodings than UTF-8 in the future, more work would be needed.

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

You're absolutely right. It works now. Gonna make a short PR for that.
The effects are... interesting. Not all commands work, but most of them do.
img_1160

@ian-mcdowell
Copy link
Contributor Author

Looks like ls fails, that's it? Overall, looks like a nice improvement to what we currently have.

@ian-mcdowell
Copy link
Contributor Author

ian-mcdowell commented Feb 7, 2018

Investigating ls. Looks like the relevant code is at file_cmds_ios/ls/util.c line 113. iswprint(wc) fails, so it prints a ?

Removing that iswprint check allows emoji to display, but I'm sure there's a reason that check is there.

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

According to the doc, it should reply true if the character is printable, according to the defined locale: http://en.cppreference.com/w/c/string/wide/iswprint

@ian-mcdowell
Copy link
Contributor Author

ian-mcdowell commented Feb 7, 2018

Do we need to define LANG, LC_ALL, etc in the environment?

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

Defining LC_ALL to en_US.UTF-8: doesn't work.
Calling setlocale(LC_ALL, "en_US.UTF-8") in iOS_system.m: doesn't work.
Calling setlocale(LC_ALL, "en_US.UTF-8") in ls.c: doesn't work.
It does work on my mac.

@ian-mcdowell
Copy link
Contributor Author

What does setlocale(LC_ALL, NULL) return? That should get you the current locale that C is using.

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

It says "C", which explains the behaviour of iswprint but is inconsistent with the LC_ALL.
Because setlocale doesn't like en_US.UTF-8.

Update: found a similar question on stackoverflow: https://stackoverflow.com/questions/32594377/setlocale-works-in-ios-simulator-but-fails-on-device

setlocale depends on access to /usr/share/locale, which is not possible in the sandbox. I'm going to deactivate all iswprint calls.

@holzschu
Copy link
Owner

holzschu commented Feb 7, 2018

Fixed with 6c22c28. I guess I can close this issue now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants