Incoming message non-English characters are shown as ** #128

selishta · 2011-08-11T08:34:52Z

When an authority replies to a request with characters like ë, ç, they get replaced with **
This does not happen with outgoing messages.

sebbacon · 2011-08-11T12:52:27Z

I am unable to reproduce this, either manually by sending emails directly, or via functional tests.

Considering this request: http://informatazyrtare.org/sq/request/special_characters

I note that the raw_email is correctly stored in encoded form: http://informatazyrtare.org/sq/admin/request/show_raw_email/28

If I download it (using the download link on that page), and then run:

cat 28 | ./script/mailin

The message is displayed correctly on my system (in the Holding Pen, since I don't have the original request).

Do you still get the asterisks if you do the same?

It's interesting to note that it's two asterisks for each single unicode character, which suggests an encoding issue to do with UTF8 -- like something is trying to convert it to ASCII.
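The two-asterisks-per-character observation fits UTF-8, where ë and ç each take two bytes. The following is a minimal sketch in modern Ruby (2.0 or later, not the 1.8.x discussed in this thread). The helper name `ascii_view` is made up for illustration; it only mimics a consumer that replaces every non-ASCII byte with an asterisk, and is not how elinks is known to work internally.

```ruby
# Hypothetical helper: mimic a consumer that treats its input as ASCII
# and replaces every byte outside the 7-bit range with "*".
def ascii_view(text)
  text.bytes.map { |b| b < 128 ? b.chr : "*" }.join
end

sample = "\u00EB\u00E7"        # "ëç", written with escapes so the source stays ASCII
puts sample.bytesize           # 4 (two bytes per character)
puts ascii_view("p\u00EBr")    # p**r
```

One two-byte character becomes two asterisks, which matches what the request page shows.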

sebbacon · 2011-08-11T13:03:03Z

Also... what are LC_ALL and LANG environment variables on your server? Not sure how that would be relevant yet, but it could be a useful data point.

sebbacon · 2011-08-11T13:17:35Z

...and you're running Ruby 1.8.x, right?

selishta · 2011-08-12T07:56:23Z

I am running Ruby 1.8.7,
Environment variables are LANG = en_US.UTF-8 , LC_ALL = ---- (unset).

I downloaded the message and ran cat .. ./script/mail, and the result is still the same:
http://informatazyrtare.org/sq/request/special_characters#incoming-33

sebbacon · 2011-08-12T13:15:28Z

So, the problem is that elinks is used to generate the plain text email based on the HTML version. Elinks is receiving UTF8 but treating it as some single-byte character set, perhaps ASCII. The invocation elinks foo.html -force-html -dump-charset utf-8 -eval 'set document.codepage.assume = "utf-8"' -dump <filename> will tell it which input character set to assume.

https://github.com/sebbacon/alaveteli/blob/master/app/models/incoming_message.rb#L832 calls through to the plain-textify method without passing a charset.

In this particular example, however, the email supplied wrongly declares its charset to be iso-8859-1 when it's actually UTF-8, so perhaps this isn't the fix (note the mailer in question is YahooMailWebService/0.8.113.313619)

The reason this doesn't work on the IZ server and works everywhere else is also unclear. The document.codepage.assume elinks setting mentioned above is documented as having a default setting that defers to the system locale, yet LC_ALL etc are the same on the IZ server and elsewhere.

Possibly the correct thing to do these days is to assume UTF8 in the first instance (given that here we have a broken charset in any case).
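The "assume UTF8 first" idea could be sketched roughly as follows. This is not the change made in the Alaveteli code. It uses the string encoding API of modern Ruby (2.0 or later), which Ruby 1.8.x does not have, and the helper name `to_utf8` is hypothetical.

```ruby
# Hypothetical helper: prefer UTF-8 when the bytes are valid UTF-8,
# otherwise fall back to transcoding from the declared charset.
def to_utf8(raw_bytes, declared_charset)
  candidate = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  raw_bytes.dup.force_encoding(declared_charset).encode(Encoding::UTF_8)
end

mislabelled = "p\u00EBr".b                # UTF-8 bytes, wrongly declared as iso-8859-1
puts to_utf8(mislabelled, "iso-8859-1")   # për

latin1 = "p\xEBr".b                       # genuine iso-8859-1 bytes
puts to_utf8(latin1, "iso-8859-1")        # për
```

The check is a heuristic: a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so a mislabelled message can still slip through in the other direction.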

sebbacon · 2011-08-12T15:15:31Z

Fixed in dbac412

(Though note that the test, while it failed on the IZ server, never failed for me anywhere else)

sebbacon · 2011-08-16T09:17:58Z

While looking into the performance issues on iz-srv-01 I found out what the problem was on your server. The LANG and associated locale environment variables were set to en_GB.utf8, but this locale wasn't installed on the server:

izuser01@iz-srv-01:/iz/alaveteli/script$ locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
POSIX
en_US
en_US.utf8

Fixed with:

$ sudo locale-gen en_GB.utf8

Not sure how it ended up in this state, though.
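A check like the one below might catch this state earlier. It is only a sketch: the helper names are hypothetical, and it assumes that spellings such as en_GB.UTF-8 and en_GB.utf8 refer to the same locale.

```ruby
# Hypothetical helper: normalise "en_GB.UTF-8" and "en_GB.utf8" to one spelling.
def normalise_locale(name)
  name.downcase.delete("-")
end

# Hypothetical helper: is `wanted` among the locale names that `locale -a` printed?
def locale_installed?(wanted, installed)
  installed.map { |l| normalise_locale(l) }.include?(normalise_locale(wanted))
end

installed = %w[C POSIX en_US en_US.utf8]           # the list shown above
puts locale_installed?("en_GB.utf8", installed)    # false
puts locale_installed?("en_US.UTF-8", installed)   # true
```

On a live server the list could come from splitting the output of `locale -a` on newlines, and the wanted name from ENV["LANG"].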

sebbacon · 2011-10-07T13:42:41Z

Reopening this though I think the issue might be slightly different.

Incoming emails such as this one are still displaying double-byte characters as asterisks. It is part of the elinks conversion.

The really puzzling thing is that if I get the incoming message from the command line, and cause it to regenerate the text version, it works:

>> asd = IncomingMessage.find(5)
>> asd.cached_main_body_text_unfolded = nil  # delete the old cached copy of the plain text version
>> asd.save!
>> asd.get_body_for_html_display

Viewing the request from the web browser now shows the correct characters.

However, if I reset the cache again as above, but omit the final step (asd.get_body_for_html_display), and then visit the request page in a browser, I get the asterisks. This is despite the fact that the request page simply calls get_body_for_html_display on the message anyway.

This is very confusing! I can only assume the difference is somehow between the environment the web server is running in and the environment my rails console is running in (e.g. apache is running as www-user but I'm running the rails console as izuser01 ... could it be related to this?). Perhaps STDOUT in the apache environment has a different encoding set somehow. (This hypothesis is complicated by the lack of documentation regarding Ruby 1.8.x's handling of encoding in IO streams.)
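One way to test the environment hypothesis would be to log the locale-related variables from both processes and compare them. This is a sketch only: the helper names are hypothetical, and the sample values below are invented for illustration, not taken from the server.

```ruby
# Hypothetical helper: pick out the locale-related variables from an
# environment hash so two processes' settings can be compared.
def locale_env(env = ENV)
  env.to_h.select { |key, _| key == "LANG" || key == "LANGUAGE" || key.start_with?("LC_") }
end

# Hypothetical helper: names of the variables whose values differ.
def env_diff(a, b)
  (a.keys | b.keys).select { |k| a[k] != b[k] }.sort
end

# Invented sample values, standing in for what each process would log.
console = { "LANG" => "en_GB.utf8", "HOME" => "/home/example" }
web     = { "HOME" => "/var/www" }
p env_diff(locale_env(console), locale_env(web))   # ["LANG"]
```

Calling `locale_env` with no argument from inside a controller action and from the console would give the two hashes to compare.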

sebbacon · 2011-10-07T13:44:50Z

I should add I am unable to reproduce this locally, so it appears to be something to do with the server settings, as was the case previously.

sebbacon · 2011-10-07T13:53:35Z

Also, if I run the elinks command corresponding to that executed by the IncomingMessage text processing, against exactly the same input, but from the command line, I get the correct output. Therefore, it feels to me likely that there is an encoding issue somewhere between the various IO pipes that get constructed to execute the external command.

sebbacon · 2011-10-10T12:19:18Z

I just double checked this. The elinks command is receiving identical UTF-8 to work on, both when it works (i.e. calling get_body_for_html_display from the command line) and when it doesn't (calling it via the show-request logic via a browser). I had wondered if we needed to pass the parsed charset to elinks, but in fact it turns out (undocumented!) that TMail does automatic conversion to UTF8 for you.

So the input to the command isn't the problem.

The command is always invoked the same, viz. /usr/bin/elinks -eval 'set document.codepage.assume = "utf-8"' -dump-charset utf-8 -force-html -dump.

If I run that command from a bash shell against the utf-8 input, then I get utf-8 output.

If I cause that command to be run from Rails, via the console using IncomingMessage.find(1).get_body_for_html_display, I get identical utf-8 output.

If I cause that command to be run from Rails, via a web browser, I get the double byte characters replaced by asterisks.

The asterisks can be reproduced by forcing elinks to assume ASCII for the input (/usr/bin/elinks -eval 'set document.codepage.assume = "ascii"' -dump-charset utf-8 -force-html -dump).
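For anyone repeating the comparison, the two invocations could be driven from Ruby along these lines. This is a sketch, not the code in incoming_message.rb: the helper names are hypothetical, only the flags quoted above are used, and it assumes elinks reads the document from standard input when no file is named, as the command above appears to.

```ruby
require "open3"

# Hypothetical helper: build the argument list quoted in this thread.
# `assumed_codepage` is the charset elinks should assume for its input.
def elinks_args(assumed_codepage = "utf-8")
  [
    "/usr/bin/elinks",
    "-eval", %(set document.codepage.assume = "#{assumed_codepage}"),
    "-dump-charset", "utf-8",
    "-force-html",
    "-dump"
  ]
end

# Hypothetical helper: feed HTML on stdin and return the plain-text dump.
# Passing an argument array avoids a shell, so no quoting layer is involved.
def html_to_text(html, assumed_codepage = "utf-8")
  output, status = Open3.capture2(*elinks_args(assumed_codepage), stdin_data: html)
  raise "elinks exited with #{status.exitstatus}" unless status.success?
  output
end
```

Calling `html_to_text(html, "ascii")` and `html_to_text(html)` on the same input from both the console and a web request would show whether the command-line setting is being honoured in each environment.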

Passenger is running Rails as izuser01, so it's not to do with what user is executing elinks. I noticed that the .elinks config directory in /home/izuser01 was owned and only readable by root, which is strange.

However, it is clearly some kind of elinks-related setting that is overriding or ignoring the codepage we're trying to set from the command line, because I have fixed this by adding set document.codepage.assume = "utf-8" to the global config file at /etc/elinks/elinks.conf.


sebbacon closed this as completed Aug 12, 2011

sebbacon reopened this Oct 7, 2011

sebbacon closed this as completed Oct 10, 2011

sebbacon added a commit that referenced this issue Oct 10, 2011

Extra tests to sanity check UTF conversion (see issue #128 for background)

247a24c

sebbacon added a commit that referenced this issue Oct 10, 2011

Document workaround for elinks/UTF8 encoding issues (see issue #128)

9856816
