Problem when converting from byte to UTF8 #782

carragom · 2015-11-13T05:51:13Z

Hi there,

I have some data in my amavis database that's rendering modoboa unusable. All parts of the system throw an internal server error. I have traced the issue to the file modoboa/extensions/amavis/sql_conector.py. The problem happens with all queries that use the function convert_from(maddr.email, 'UTF8'). Turns out if I manually run the query directly in PostgreSQL it throws the error:

ERROR: invalid byte sequence for encoding "UTF8": 0xf1657340

There seems to be data in there that's not UTF8 compliant. At first I saw #677 and upgraded to 1.2.2 thinking it was the same problem. But this did not fixed the issue. In the end I replaced all convert_from functions in modoboa/extensions/amavis/sql_conector.py to use LATIN1 instead of UTF8 and the issue is now fixed.

In the end, the problematic character was a ñ. I know RFC6531 states that email addresses could be encoded in UTF8 but amavis does not seem to be playing by those rules; yet?. An immediate solution is to switch to LATIN1 and hope it's the right encoding.

As an alternate solution the data could be retrieved in binary form from the database and converted in python maybe even a django filter that would fallback to something like "invalid address" if the conversion fails. But this would force modoboa to retrive all record from the database and in my case that would be around 7K and counting.

Hope this helps.
Cheers.

The text was updated successfully, but these errors were encountered:

tonioo · 2015-11-16T16:35:54Z

Hi,

as for the Quarantine.mail_text field, we could try to use a BinaryField and to remove all calls to convert_from.

Do you think you could try it ?

tonioo · 2015-11-27T16:42:43Z

@carragom ping

carragom · 2015-12-18T19:39:51Z

Hi @tonioo sorry for the absence. At least in version 1.2.2 Quarantine.mail_text already is a BinaryField as shown here or am I looking at the wrong place ?. Maybe I did not understand what you meant ?

tonioo · 2015-12-18T20:19:26Z

Hi @carragom, your problem seems to be related to the email field:
https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/models.py#L20

carragom · 2015-12-18T22:25:39Z

@tonioo Yes that's the field causing the problems. Like I said, I see two options to fix this:

1- Keep the conversion in the database as it's now, but ask the database to convert to LATIN1 instead of UTF8 in all occurrences of the convert_from function here. I did this and it's working for me. I'm not sure if this is the right encoding but for sure it's better than using UTF8 that we already know it breaks.

2- Switch to using BinaryField for the Maddr.email and handle the conversion in python. Just keep in mind that the number of rows in the Maddr table grows fast, when I reported this a month ago the table was around 7K rows, right now it's sitting at 12K rows. So doing this conversion out of the database could be a performance issue.

I have been looking around here to see if there is any indication on what encoding it's actually used without luck. But it does mention that the option $sql_allow_8bit_address needs to be set to use this field as bytea. So LATIN1 sounds like a safe encoding to use.

If you ask me I would just go option number 1 which is simple to implement and have no performance issues.

I can provide a quick PR for option 1 if you decide to go with it.
Let me know.

tonioo · 2016-01-27T15:25:58Z

@carragom Using LATIN1 as encoding won't cover all cases. I guess we will encounter the same issues with another encoding soon. I still think a BinaryField is the right answer and I do hope Django uses the appropriate field when it generates queries. The manual conversion you see in the current code would also disappear.

tonioo · 2016-01-27T15:36:49Z

@carragom BTW, the right place for this issue is into the https://github.com/modoboa/modoboa-amavis repository.

carragom · 2016-01-27T23:40:40Z

@tonioo I agree that LATIN1 does not cover all cases and it's far from ideal. The one thing for sure is that this problem renders Modoboa unusable, all of it, not just the amavis module. So this should be fixed in any way necessary.

It might be possible that amavis does not intent for these fields to be used as text, from the amavis README.sql-pg.txt:

Upgrade note: field quarantine.mail_text should be of data type 'bytea'
and not 'text' as suggested in earlier documentation; this is to prevent
it from being unjustifiably associated with a character set, and to be
able to store any byte value; to convert existing field from type 'text'
to type 'bytea' the following clause may be used:
ALTER TABLE quarantine ALTER mail_text TYPE bytea
USING decode(replace(mail_text,'','\'),'escape');

Thanks a lot for your time.

tonioo · 2016-01-28T08:11:32Z

Please look at this thread (the end of the page is interesting):
https://code.djangoproject.com/ticket/2417
And this commit (django source code):
django/django@8ee1edd

And tell me what do you think :)

carragom · 2016-01-29T19:02:34Z

Yes using a BinaryField is definitively an option see here. But switching to BinaryField alone is not enough. Every custom query using convert_from needs to be replaced with something that fetches the data from the table and filter's it on the python side. This means probably rewriting this entire class.

In any case, it does not matter what type of field is used or where the conversion happens (db or app) at some point those bytes on the database will have to be converted to text in order to be useful and the conversion will require a character set. UTF8 is not the right charset for that data and currently breaks the entire application. The main objective here is to find a way where the application does not break even if the conversion fails.

Again I see two options:

1- Find a way to handle the conversion gracefully at the database level (maybe a stored procedure would help here or just use LATIN1 as charset which is working for me and seems to be what amavis is using).
2- Use a BinaryField and move the entire logic of converting/filtering the data to the web app which is inefficient and a lot of work and will still break if we keep trying to use UTF8 as charset.

Again thanks for your time, I hope I was a bit more clear this time.
Cheers.

tonioo · 2016-06-09T14:42:53Z

This issue was moved to modoboa/modoboa-amavis#35

tonioo mentioned this issue Jun 9, 2016

Problem when converting from byte to UTF8 modoboa/modoboa-amavis#35

Closed

tonioo closed this as completed Jun 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem when converting from byte to UTF8 #782

Problem when converting from byte to UTF8 #782

carragom commented Nov 13, 2015

tonioo commented Nov 16, 2015

tonioo commented Nov 27, 2015

carragom commented Dec 18, 2015

tonioo commented Dec 18, 2015

carragom commented Dec 18, 2015

tonioo commented Jan 27, 2016

tonioo commented Jan 27, 2016

carragom commented Jan 27, 2016

tonioo commented Jan 28, 2016

carragom commented Jan 29, 2016

tonioo commented Jun 9, 2016

Problem when converting from byte to UTF8 #782

Problem when converting from byte to UTF8 #782

Comments

carragom commented Nov 13, 2015

tonioo commented Nov 16, 2015

tonioo commented Nov 27, 2015

carragom commented Dec 18, 2015

tonioo commented Dec 18, 2015

carragom commented Dec 18, 2015

tonioo commented Jan 27, 2016

tonioo commented Jan 27, 2016

carragom commented Jan 27, 2016

tonioo commented Jan 28, 2016

carragom commented Jan 29, 2016

tonioo commented Jun 9, 2016