Contributor forum content gets cut when there's an emoji #435

kelimuttu · 2020-05-15T08:59:54Z

Related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1723190

I posted a community announcement containing emoji in the contributor forum earlier today. Instead of displaying the full content, the post got cut off exactly where I put the emoji, although the preview displayed the emoji just fine. I'm not sure whether emoji were ever supported, but even if they aren't, I think it would be enough to remove the emoji and display the rest of the content instead of cutting the content off halfway.

LeoMcA · 2020-05-18T10:46:29Z

@kelimuttu thanks for reporting this.

It seems to be happening because of some weird interaction between Django and the db. From the limited debugging I did, a string with an emoji in it seems to be properly passed to the db, but on retrieval it is cut as you say: everything from the emoji onward is missing.

Our planned upgrade to python 3 might magically fix this, so we'll revisit this after that.

kelimuttu · 2020-05-18T13:27:41Z

Thanks for looking into it, Leo. Can you please update here once it lands on the staging site so I can test it out?

akatsoulas · 2021-02-11T17:43:27Z

The Python 3 upgrade has been live in prod for a few months. I suspect that this is a limitation of our DB, but let's investigate whether that is the case.

akatsoulas · 2021-06-10T10:00:03Z

Blocked by #765

LeoMcA · 2021-08-06T11:39:08Z

Rediscovered this after looking at a recently filed bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1723190

This looks to be a bizarre problem in MySQL, where the utf8 charset isn't actually full utf-8: it can only store utf-8 chars of up to 3 bytes. Any 4-byte character will cause the rest of the content to get truncated. Converting our tables to use "proper" utf-8 is a rather involved process, but for now it should be possible to strip these problematic characters before saving to the DB, to avoid the data loss which can happen.
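To make the 3-byte limit concrete, here is a minimal Python sketch. It only mimics the truncation described in this comment; it does not talk to MySQL, and the helper names `utf8_len` and `truncate_at_first_4byte` are made up for illustration.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes a single character needs when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def truncate_at_first_4byte(text: str) -> str:
    """Mimic the reported behaviour: keep only the text before the first 4-byte character.

    This is a simulation of the symptom, not what the database driver actually runs.
    """
    for i, ch in enumerate(text):
        if utf8_len(ch) == 4:
            return text[:i]
    return text


if __name__ == "__main__":
    for ch in ("a", "é", "€", "🎉"):
        print(repr(ch), utf8_len(ch))
    print(repr(truncate_at_first_4byte("Hello 🎉 world")))
```

Most emoji sit above U+FFFF and therefore need 4 bytes, which is why they trigger the problem while accented letters and most other symbols do not.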

Currently our MySQL db is set up to use utf8mb3, which cannot store 4-byte utf-8 characters. It'll truncate the field just before the offending character, which can lead to some pretty major data loss.

This change adds a couple of fields which will strip out those characters, or in the case of utf8mb3TextField can be configured to store them as HTML numeric character references.

All end-user modifiable fields have been changed to use these utf8mb3 fields, and in a few carefully tested cases will store 4-byte characters as HTML numeric references.

mozilla/sumo#435
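The two strategies mentioned here (stripping 4-byte characters, or storing them as HTML numeric character references) can be sketched in plain Python as follows. This is not the code from the pull request: `strip_4byte_chars` and `to_numeric_refs` are hypothetical helpers, and how the real utf8mb3 fields hook into Django is not shown.

```python
import re

# Code points U+10000 and above are exactly the ones that need 4 bytes in UTF-8.
FOUR_BYTE_RE = re.compile("[\U00010000-\U0010FFFF]")


def strip_4byte_chars(text: str) -> str:
    """Drop every character that utf8mb3 cannot store (hypothetical helper)."""
    return FOUR_BYTE_RE.sub("", text)


def to_numeric_refs(text: str) -> str:
    """Replace each 4-byte character with an HTML numeric character reference,
    e.g. '&#127881;' (hypothetical helper)."""
    return FOUR_BYTE_RE.sub(lambda m: f"&#{ord(m.group())};", text)


if __name__ == "__main__":
    sample = "Release is out 🎉, thanks all!"
    print(strip_4byte_chars(sample))
    print(to_numeric_refs(sample))
```

Stripping loses the emoji but keeps the rest of the text. Numeric references keep the emoji recoverable, but only make sense for fields that are later rendered as HTML, which may be why the change limits them to a few carefully tested cases.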

emilghittasv · 2023-04-18T08:42:18Z

When I'm trying to add the following content body in a new article ( /kb/new ):


Another test
🞇

🞇

ffw
ewwe
we

Sentry fires an unhandled Data Error: (1366, "Incorrect string value: '\xF0\x9F\x9E\x87 t...' for column 'content' at row 1") -> https://mozilla.sentry.io/issues/4104208612/events/b4360388166b4217b968b2b8b0d8e40e/
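For reference, the escaped bytes in that error message can be decoded with a few lines of Python to confirm that they form a single 4-byte UTF-8 character. A small sketch; `describe_utf8_bytes` is a made-up helper name.

```python
def describe_utf8_bytes(raw: bytes) -> tuple[int, str]:
    """Decode bytes holding one UTF-8 character; return (byte length, 'U+XXXX' code point)."""
    ch = raw.decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return len(raw), f"U+{ord(ch):04X}"


if __name__ == "__main__":
    # The byte sequence quoted in the error message above.
    print(describe_utf8_bytes(b"\xF0\x9F\x9E\x87"))
```

The code point is above U+FFFF, so it falls in the range that a 3-byte-only charset cannot store, which fits the utf8mb3 explanation given earlier in the thread.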

kelimuttu · 2023-04-18T08:55:43Z

Thanks for testing this issue, @emilghittasv

escattone · 2023-04-18T17:03:35Z

This should no longer be an issue once we migrate to Postgres, which we're hoping to complete by the end of June 2023.

escattone · 2023-11-29T00:03:34Z

This has been fixed as of today! 🎉

LeoMcA added this to the Python3 upgrade milestone May 18, 2020

akatsoulas modified the milestones: Python3 upgrade, KTLO Jul 21, 2020

akatsoulas added this to Triage/Parking Lot in SUMO Engineering Board via automation Feb 11, 2021

akatsoulas added the bug label Feb 11, 2021

akatsoulas moved this from Triage/Parking Lot to Backlog in SUMO Engineering Board Feb 11, 2021

akatsoulas moved this from Backlog to Triage/Parking Lot in SUMO Engineering Board Feb 11, 2021

akatsoulas modified the milestones: KTLO, Upgrade Django and underlying libraries Jun 10, 2021

akatsoulas removed this from Triage/Parking Lot in SUMO Engineering Board Jun 10, 2021

LeoMcA self-assigned this Aug 6, 2021

LeoMcA added this to Triage/Parking Lot in SUMO Engineering Board via automation Aug 6, 2021

LeoMcA moved this from Triage/Parking Lot to In Progress in SUMO Engineering Board Aug 6, 2021

LeoMcA mentioned this issue Aug 6, 2021

elegantly handle 4-byte utf-8 characters in django mozilla/kitsune#4867

Closed

LeoMcA moved this from In Progress to Review in SUMO Engineering Board Aug 6, 2021

akatsoulas moved this from Review to Comms/Dependencies/Blocked in SUMO Engineering Board Aug 13, 2021

akatsoulas removed this from the Upgrade Django and underlying libraries milestone Mar 22, 2022

akatsoulas moved this from Projects, Epics and Blocked Items to Backlog in SUMO Engineering Board Jun 2, 2022

akatsoulas mentioned this issue Jul 17, 2023

Backend and pipeline improvements #1372

Open

escattone closed this as completed Nov 29, 2023
