Incoming message non-English characters are shown as ** #128

Closed
selishta opened this Issue Aug 11, 2011 · 11 comments

Contributor

selishta commented Aug 11, 2011

When an authority replies to a request with characters like ë, ç, they get replaced with **
This does not happen with outgoing messages.

sebbacon Aug 11, 2011

Contributor

I am unable to reproduce this, either manually by sending emails directly, or via functional tests.

Considering this request: http://informatazyrtare.org/sq/request/special_characters

I note that the raw_email is correctly stored in encoded form: http://informatazyrtare.org/sq/admin/request/show_raw_email/28

If I download it (using the download link on that page), and then run:

cat 28 | ./script/mailin

The message is displayed correctly on my system (in the Holding Pen, since I don't have the original request).

Do you still get the asterisks if you do the same?

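It's interesting to note that there are two asterisks for each non-ASCII character, which suggests a UTF-8 encoding issue: ë and ç each take two bytes in UTF-8, so something is probably treating the text as a single-byte character set such as ASCII and replacing each byte it cannot represent.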
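The byte arithmetic behind that observation can be sketched in a few lines of Ruby. This is only an illustration, not code from Alaveteli or elinks: `asciify_bytes` is a hypothetical helper that mimics a converter which understands 7-bit ASCII only, and it assumes Ruby 1.9 or later (the server in this thread ran 1.8.7).

```ruby
# Hypothetical helper (not part of Alaveteli or elinks): mimic a converter
# that only understands 7-bit ASCII and emits "*" for every other byte.
def asciify_bytes(text)
  text.bytes.map { |byte| byte < 128 ? byte.chr : "*" }.join
end

# "Tiranë", written with a \u escape so the source file encoding is irrelevant.
sample = "Tiran\u00EB"

# Difference between byte count and character count: the extra byte used by "ë".
puts sample.bytesize - sample.length

# Each byte of the two-byte character becomes its own asterisk.
puts asciify_bytes(sample)
```

Because "ë" occupies two bytes in UTF-8, a byte-by-byte substitution yields two asterisks per character, which matches the symptom reported above.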

sebbacon Aug 11, 2011

Contributor

Also... what are the LC_ALL and LANG environment variables set to on your server? I'm not sure how that would be relevant yet, but it could be a useful data point.

sebbacon Aug 11, 2011

Contributor

...and you're running Ruby 1.8.x, right?

selishta Aug 12, 2011

Contributor

I am running Ruby 1.8.7.
The environment variables are LANG = en_US.UTF-8 and LC_ALL = ---- (unset).

I have downloaded the message and run cat .. ./script/mail, and the result is still the same:
http://informatazyrtare.org/sq/request/special_characters#incoming-33

sebbacon Aug 12, 2011

Contributor

So, the problem is that elinks is used to generate the plain-text version of the email from the HTML version. Elinks is receiving UTF-8 but treating it as some single-byte character set, perhaps ASCII. The following invocation tells it which input character set to assume:

elinks foo.html -force-html -dump-charset utf-8 -eval 'set document.codepage.assume = "utf-8"' -dump <filename>

https://github.com/sebbacon/alaveteli/blob/master/app/models/incoming_message.rb#L832 calls through to the plain-textify method without passing a charset.

In this particular example, however, the email wrongly declares its charset as iso-8859-1 when it is actually UTF-8, so perhaps passing the declared charset isn't the fix (note that the mailer in question is YahooMailWebService/0.8.113.313619).

It is also unclear why this fails on the IZ server but works everywhere else. The document.codepage.assume elinks setting mentioned above is documented as defaulting to the system locale, yet LC_ALL etc. are the same on the IZ server and elsewhere.

Possibly the correct thing to do these days is to assume UTF-8 in the first instance (given that the declared charset here is broken in any case).

sebbacon Aug 12, 2011

Contributor

Fixed in dbac412

(Though note that the test, while it failed on the IZ server, never failed for me anywhere else)

@sebbacon sebbacon closed this Aug 12, 2011

sebbacon Aug 16, 2011

Contributor

While looking into the performance issues on iz-srv-01, I found the cause of the problem on your server. LANG and the associated locale environment variables were set to en_GB.utf8, but this locale wasn't installed on the server:

izuser01@iz-srv-01:/iz/alaveteli/script$ locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
POSIX
en_US
en_US.utf8

Fixed with:

$ sudo locale-gen en_GB.utf8

Not sure how it ended up in this state, though.
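A check like the one above can be automated. The following is a minimal sketch, not part of Alaveteli: `normalize_locale` and `locale_installed?` are hypothetical helpers, and the list of available locales is copied from the `locale -a` output quoted in this comment rather than read from the live system. The normalization assumes that `en_GB.UTF-8` and `en_GB.utf8` name the same locale, which is how glibc treats them.

```ruby
# Hypothetical helper: make "en_GB.UTF-8" and "en_GB.utf8" compare equal by
# lower-casing the codeset part and dropping hyphens from it.
def normalize_locale(name)
  language, codeset = name.split(".", 2)
  return language if codeset.nil?
  "#{language}.#{codeset.downcase.delete('-')}"
end

# Hypothetical helper: is the wanted locale among the installed ones?
def locale_installed?(wanted, available)
  available.map { |name| normalize_locale(name) }.include?(normalize_locale(wanted))
end

# Locales reported by `locale -a` in the output quoted above.
AVAILABLE_LOCALES = %w[C POSIX en_US en_US.utf8].freeze

puts locale_installed?("en_GB.utf8", AVAILABLE_LOCALES)   # the configured but missing locale
puts locale_installed?("en_US.UTF-8", AVAILABLE_LOCALES)  # an installed locale, spelled differently
```

On a real server the list could instead be taken from the output of `locale -a`, and the wanted value from `ENV["LANG"]`.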

sebbacon Oct 7, 2011

Contributor

Reopening this, though I think the issue might be slightly different.

Incoming emails such as this one are still displaying two-byte characters as asterisks. The substitution happens during the elinks conversion.

The really puzzling thing is that if I load the incoming message from the Rails console and force it to regenerate the text version, it works:

>> asd = IncomingMessage.find(5)
>> asd.cached_main_body_text_unfolded = nil  # delete the old cached copy of the plain text version
>> asd.save!
>> asd.get_body_for_html_display

Viewing the request from the web browser now shows the correct characters.

However, if I reset the cache again as above, but omit the final step (asd.get_body_for_html_display), and then visit the request page in a browser, I get the asterisks. This is despite the fact that the request page simply calls get_body_for_html_display on the message anyway.

This is very confusing! I can only assume the difference lies between the environment the web server is running in and the environment my Rails console is running in (e.g. Apache is running as www-user but I'm running the Rails console as izuser01; could it be related to this?). Perhaps STDOUT in the Apache environment has a different encoding set somehow. (This hypothesis is hard to test given the lack of documentation on how Ruby 1.8.x handles encoding in IO streams.)
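One way to compare the two environments is to print the same small set of variables from the Rails console and from a request handled by the web server. The sketch below is an assumption-laden diagnostic, not code from Alaveteli: `locale_env` is a hypothetical helper, and including HOME (because elinks looks for a per-user config directory there) is a guess about what might matter. It needs Ruby 1.9 or later.

```ruby
# Hypothetical helper: pick out the variables most likely to influence how a
# child process such as elinks interprets text. Including HOME is an
# assumption, made because elinks keeps per-user configuration under it.
def locale_env(env = ENV)
  env.to_hash.select do |name, _value|
    name == "LANG" || name == "HOME" || name.start_with?("LC_")
  end
end

# Print the snapshot in a stable order so two runs can be diffed.
locale_env.sort.each { |name, value| puts "#{name}=#{value}" }
```

Running this once from the console and once from inside the web process (for example by logging the result) would show whether the two contexts really differ.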

sebbacon Oct 7, 2011

Contributor

I should add I am unable to reproduce this locally, so it appears to be something to do with the server settings, as was the case previously.

@sebbacon sebbacon reopened this Oct 7, 2011

sebbacon Oct 7, 2011

Contributor

Also, if I run the same elinks command that the IncomingMessage text processing executes, against exactly the same input, but from the command line, I get the correct output. It therefore seems likely that there is an encoding issue somewhere in the IO pipes that are constructed to execute the external command.

sebbacon Oct 10, 2011

Contributor

I just double-checked this. The elinks command receives identical UTF-8 input both when it works (i.e. calling get_body_for_html_display from the command line) and when it doesn't (calling it via the show-request logic from a browser). I had wondered whether we needed to pass the parsed charset to elinks, but it turns out (undocumented!) that TMail automatically converts to UTF-8 for you.

So the input to the command isn't the problem.

The command is always invoked the same way, viz. /usr/bin/elinks -eval 'set document.codepage.assume = "utf-8"' -dump-charset utf-8 -force-html -dump.

If I run that command from a bash shell against the UTF-8 input, I get UTF-8 output.

If I cause that command to be run from Rails, via the console using IncomingMessage.find(1).get_body_for_html_display, I get identical UTF-8 output.

If I cause that command to be run from Rails, via a web browser, I get the two-byte characters replaced by asterisks.

The asterisks can be reproduced by forcing elinks to assume ASCII for the input (/usr/bin/elinks -eval 'set document.codepage.assume = "ascii"' -dump-charset utf-8 -force-html -dump).

Passenger is running Rails as izuser01, so it's not to do with which user is executing elinks. I noticed that the .elinks config directory in /home/izuser01 was owned and only readable by root, which is strange.

However, some elinks-related setting is clearly overriding or ignoring the codepage we're trying to set from the command line, because I fixed this by adding set document.codepage.assume = "utf-8" to the global config file at /etc/elinks/elinks.conf.
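For anyone wanting to test the environment hypothesis further, the command could be run from Ruby with an explicitly pinned locale. This is a rough sketch, not how Alaveteli invokes elinks: `elinks_command` and `html_to_text` are hypothetical helpers, the flags are copied from the invocation quoted in this comment, and it assumes that elinks reads the document from standard input when no URL is given, as that invocation implies. Whether pinning LANG changes the outcome on the affected server is untested; the fix that actually worked here was the global config change described above.

```ruby
require "open3"

# Flags copied from the invocation quoted in this thread.
ELINKS_ARGS = [
  "-eval", 'set document.codepage.assume = "utf-8"',
  "-dump-charset", "utf-8",
  "-force-html",
  "-dump"
].freeze

# Hypothetical helper: build the argument vector. Passing an array to Open3
# bypasses the shell, so the quotes inside the -eval value need no escaping.
def elinks_command(binary = "/usr/bin/elinks")
  [binary, *ELINKS_ARGS]
end

# Hypothetical helper: feed HTML to elinks on stdin with LANG pinned for the
# child process only. Assumes elinks reads stdin when no URL is given.
def html_to_text(html, lang = "en_US.UTF-8")
  output, status = Open3.capture2({ "LANG" => lang }, *elinks_command, :stdin_data => html)
  raise "elinks exited with status #{status.exitstatus}" unless status.success?
  output
end
```

Calling `html_to_text` from both the console and a web request, with the same input, would show whether the result still differs once the child's locale is fixed.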
