Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem when converting from byte to UTF8 #782

Closed
carragom opened this issue Nov 13, 2015 · 11 comments
Closed

Problem when converting from byte to UTF8 #782

carragom opened this issue Nov 13, 2015 · 11 comments

Comments

@carragom
Copy link
Contributor

Hi there,

I have some data in my amavis database that's rendering modoboa unusable. All parts of the system throw an internal server error. I have traced the issue to the file modoboa/extensions/amavis/sql_conector.py. The problem happens with all queries that use the function convert_from(maddr.email, 'UTF8'). Turns out if I manually run the query directly in PostgreSQL it throws the error:

ERROR: invalid byte sequence for encoding "UTF8": 0xf1657340

There seems to be data in there that's not UTF8 compliant. At first I saw #677 and upgraded to 1.2.2 thinking it was the same problem. But this did not fixed the issue. In the end I replaced all convert_from functions in modoboa/extensions/amavis/sql_conector.py to use LATIN1 instead of UTF8 and the issue is now fixed.

In the end, the problematic character was a ñ. I know RFC6531 states that email addresses could be encoded in UTF8 but amavis does not seem to be playing by those rules; yet?. An immediate solution is to switch to LATIN1 and hope it's the right encoding.

As an alternate solution the data could be retrieved in binary form from the database and converted in python maybe even a django filter that would fallback to something like "invalid address" if the conversion fails. But this would force modoboa to retrive all record from the database and in my case that would be around 7K and counting.

Hope this helps.
Cheers.

@tonioo
Copy link
Member

tonioo commented Nov 16, 2015

Hi,

as for the Quarantine.mail_text field, we could try to use a BinaryField and to remove all calls to convert_from.

Do you think you could try it ?

@tonioo
Copy link
Member

tonioo commented Nov 27, 2015

@carragom ping

@carragom
Copy link
Contributor Author

Hi @tonioo sorry for the absence. At least in version 1.2.2 Quarantine.mail_text already is a BinaryField as shown here or am I looking at the wrong place ?. Maybe I did not understand what you meant ?

@tonioo
Copy link
Member

tonioo commented Dec 18, 2015

Hi @carragom, your problem seems to be related to the email field:
https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/models.py#L20

@carragom
Copy link
Contributor Author

@tonioo Yes that's the field causing the problems. Like I said, I see two options to fix this:

1- Keep the conversion in the database as it's now, but ask the database to convert to LATIN1 instead of UTF8 in all occurrences of the convert_from function here. I did this and it's working for me. I'm not sure if this is the right encoding but for sure it's better than using UTF8 that we already know it breaks.

2- Switch to using BinaryField for the Maddr.email and handle the conversion in python. Just keep in mind that the number of rows in the Maddr table grows fast, when I reported this a month ago the table was around 7K rows, right now it's sitting at 12K rows. So doing this conversion out of the database could be a performance issue.

I have been looking around here to see if there is any indication on what encoding it's actually used without luck. But it does mention that the option $sql_allow_8bit_address needs to be set to use this field as bytea. So LATIN1 sounds like a safe encoding to use.

If you ask me I would just go option number 1 which is simple to implement and have no performance issues.

I can provide a quick PR for option 1 if you decide to go with it.
Let me know.

@tonioo
Copy link
Member

tonioo commented Jan 27, 2016

@carragom Using LATIN1 as encoding won't cover all cases. I guess we will encounter the same issues with another encoding soon. I still think a BinaryField is the right answer and I do hope Django uses the appropriate field when it generates queries. The manual conversion you see in the current code would also disappear.

@tonioo
Copy link
Member

tonioo commented Jan 27, 2016

@carragom BTW, the right place for this issue is into the https://github.com/modoboa/modoboa-amavis repository.

@carragom
Copy link
Contributor Author

@tonioo I agree that LATIN1 does not cover all cases and it's far from ideal. The one thing for sure is that this problem renders Modoboa unusable, all of it, not just the amavis module. So this should be fixed in any way necessary.

It might be possible that amavis does not intent for these fields to be used as text, from the amavis README.sql-pg.txt:

Upgrade note: field quarantine.mail_text should be of data type 'bytea'
and not 'text' as suggested in earlier documentation; this is to prevent
it from being unjustifiably associated with a character set
, and to be
able to store any byte value; to convert existing field from type 'text'
to type 'bytea' the following clause may be used:
ALTER TABLE quarantine ALTER mail_text TYPE bytea
USING decode(replace(mail_text,'','\'),'escape');

Thanks a lot for your time.

@tonioo
Copy link
Member

tonioo commented Jan 28, 2016

Please look at this thread (the end of the page is interesting):
https://code.djangoproject.com/ticket/2417
And this commit (django source code):
django/django@8ee1edd

And tell me what do you think :)

@carragom
Copy link
Contributor Author

Yes using a BinaryField is definitively an option see here. But switching to BinaryField alone is not enough. Every custom query using convert_from needs to be replaced with something that fetches the data from the table and filter's it on the python side. This means probably rewriting this entire class.

In any case, it does not matter what type of field is used or where the conversion happens (db or app) at some point those bytes on the database will have to be converted to text in order to be useful and the conversion will require a character set. UTF8 is not the right charset for that data and currently breaks the entire application. The main objective here is to find a way where the application does not break even if the conversion fails.

Again I see two options:

1- Find a way to handle the conversion gracefully at the database level (maybe a stored procedure would help here or just use LATIN1 as charset which is working for me and seems to be what amavis is using).
2- Use a BinaryField and move the entire logic of converting/filtering the data to the web app which is inefficient and a lot of work and will still break if we keep trying to use UTF8 as charset.

Again thanks for your time, I hope I was a bit more clear this time.
Cheers.

@tonioo
Copy link
Member

tonioo commented Jun 9, 2016

This issue was moved to modoboa/modoboa-amavis#35

@tonioo tonioo closed this as completed Jun 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants