Contributor forum content gets cut when there's an emoji #435

Closed
Tracked by #1372
kelimuttu opened this issue May 15, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@kelimuttu
Collaborator

kelimuttu commented May 15, 2020

Related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1723190

I posted a community announcement containing emoji in the contributor forum earlier today. Instead of displaying the full content, the post was cut off exactly where I put the emoji, although the preview displayed the emoji just fine. I'm not sure whether emoji were ever supported, but even if they aren't, I think it would be enough to remove the emoji and display the rest of the content instead of cutting the post off halfway.

@LeoMcA LeoMcA added this to the Python3 upgrade milestone May 18, 2020
@LeoMcA

LeoMcA commented May 18, 2020

@kelimuttu thanks for reporting this.

It seems to be caused by some weird interaction between Django and the DB. From the limited debugging I did, a string with an emoji in it seems to be passed to the DB properly, but on retrieval it is cut as you say: everything from the emoji onward is missing.

Our planned upgrade to Python 3 might magically fix this, so we'll revisit this after that.

@kelimuttu
Collaborator Author

Thanks for looking into it, Leo. Can you please update here once it lands on the staging site so I can test it out?

@akatsoulas akatsoulas modified the milestones: Python3 upgrade, KTLO Jul 21, 2020
@akatsoulas akatsoulas added this to Triage/Parking Lot in SUMO Engineering Board via automation Feb 11, 2021
@akatsoulas akatsoulas added the bug Something isn't working label Feb 11, 2021
@akatsoulas akatsoulas moved this from Triage/Parking Lot to Backlog in SUMO Engineering Board Feb 11, 2021
@akatsoulas
Collaborator

The Python 3 upgrade has been live in prod for a few months. I suspect that it's a limitation of our DB, but let's investigate whether that is the case.

@akatsoulas akatsoulas moved this from Backlog to Triage/Parking Lot in SUMO Engineering Board Feb 11, 2021
@akatsoulas akatsoulas removed this from Triage/Parking Lot in SUMO Engineering Board Jun 10, 2021
@akatsoulas
Collaborator

Blocked by #765

@LeoMcA LeoMcA self-assigned this Aug 6, 2021
@LeoMcA LeoMcA added this to Triage/Parking Lot in SUMO Engineering Board via automation Aug 6, 2021
@LeoMcA LeoMcA moved this from Triage/Parking Lot to In Progress in SUMO Engineering Board Aug 6, 2021
@LeoMcA

LeoMcA commented Aug 6, 2021

Rediscovered this after looking at a recently filed bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1723190

This looks to be a bizarre problem in MySQL: its utf8 charset isn't actually full UTF-8 and can only store characters that encode to at most 3 bytes. Any 4-byte character causes the rest of the content to be truncated. Converting our tables to use "proper" UTF-8 is a rather involved process, but for now it should be possible to strip these problematic characters before saving to the DB, which avoids the data loss.
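For illustration, stripping could look roughly like the sketch below. It relies on the fact that exactly the characters above U+FFFF need 4 bytes in UTF-8. The helper names `has_4byte_chars` and `strip_4byte_chars` are made up for this example and are not taken from the kitsune codebase.

```python
def has_4byte_chars(text: str) -> bool:
    """Return True if any character needs 4 bytes in UTF-8 (code point > U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def strip_4byte_chars(text: str) -> str:
    """Drop every character that a 3-byte-only charset cannot store."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


# Hypothetical forum post containing one emoji (U+1F389).
post = "Welcome aboard 🎉 and thanks for contributing"
cleaned = strip_4byte_chars(post)
```

The emoji is lost, but the text after it survives, which is the behaviour asked for in the original report.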

LeoMcA added a commit to LeoMcA/kitsune that referenced this issue Aug 6, 2021
currently our mysql db is set up to use utf8mb3 which cannot store 4-byte utf-8 characters. it'll truncate the field just before the offending character, which can lead to some pretty major data loss

this change adds a couple of fields which will strip out those characters, or in the case of utf8mb3TextField can be configured to store them as html numeric character references

all end-user modifiable fields have been changed to use these utf8mb3 fields, and in a few carefully tested cases will store 4-byte characters as html numeric references

mozilla/sumo#435
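As a rough sketch of the second option mentioned in the commit message, 4-byte characters can be rewritten as HTML numeric character references, which are plain ASCII and so fit in a 3-byte charset. This is not the code from the commit; the function name `to_numeric_refs` is hypothetical, and the approach assumes the stored value is later rendered as HTML.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Replace characters above U+FFFF with decimal HTML numeric character references."""
    return "".join(
        f"&#{ord(ch)};" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


# Hypothetical input containing one emoji (U+1F389, decimal 127881).
stored = to_numeric_refs("fixed 🎉")

# html.unescape from the standard library turns the reference back into the character.
restored = html.unescape(stored)
```

Unlike stripping, this keeps the character recoverable, but a field that is shown as plain text would display the raw reference, which may be why the commit limits it to carefully tested cases.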
@LeoMcA LeoMcA moved this from In Progress to Review in SUMO Engineering Board Aug 6, 2021
@akatsoulas akatsoulas moved this from Review to Comms/Dependencies/Blocked in SUMO Engineering Board Aug 13, 2021
@akatsoulas akatsoulas moved this from Projects, Epics and Blocked Items to Backlog in SUMO Engineering Board Jun 2, 2022
@emilghittasv
Collaborator

When I try to add the following content body to a new article (/kb/new):


Another test
🞇

🞇

ffw
ewwe
we

Sentry fires an unhandled Data Error (1366, "Incorrect string value: '\xF0\x9F\x9E\x87 t...' for column 'content' at row 1"): https://mozilla.sentry.io/issues/4104208612/events/b4360388166b4217b968b2b8b0d8e40e/
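The four escaped bytes in that error message can be decoded to confirm they form a single 4-byte character, which matches the 3-byte limit described earlier in the thread. A small sketch (the variable names are only for illustration):

```python
# The byte sequence quoted in the MySQL error 1366 message.
raw = b"\xF0\x9F\x9E\x87"

char = raw.decode("utf-8")                # one character, not four
code_point = f"U+{ord(char):04X}"         # its Unicode code point
byte_len = len(char.encode("utf-8"))      # bytes needed in UTF-8
fits_in_3_bytes = ord(char) <= 0xFFFF     # False for this character
```

Here `code_point` is `U+1F787`, which lies above U+FFFF, so the character cannot be stored in a column limited to 3 bytes per character.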

@kelimuttu
Collaborator Author

Thanks for testing this issue, @emilghittasv

@escattone
Contributor

This should no longer be an issue once we migrate to Postgres, which we're hoping to complete by the end of June 2023.

@escattone
Contributor

This has been fixed as of today! 🎉
