exponential cancer-like growth of non-ASCII characters #72

Open
kforner opened this Issue Nov 11, 2010 · 1 comment

@kforner

We ran into problems when writing pages by copying and pasting from Word or PDF documents: our pages grew bigger, and the rendering showed long stretches of non-displayable characters.
At some point a page grew so large that Firefox could no longer render it.

I worked around it by dumping the page with the dump_content.pl script, fixing it in XEmacs, and then re-importing it with the import_content.pl script. After that I manually edited the database to remove the bad versions.

I was able to reproduce the problem:

Step1:
In Word I insert the alpha symbol, then copy and paste it into a new MojoMojo page.
At this point the rendering is OK: the rendered page displays the Greek letter alpha.
I dumped the content of the page: only 2 bytes
%od -c bug2.1.txt
0000000 316 261

Step2:
I open the new page for editing (the rendering is now wrong, just garbage),
then save it WITHOUT modifying anything, which creates a new version.
Here's the content: 4 bytes
%od -c bug2.2.txt
0000000 303 216 302 261

Step3: I edit and save once again: 8 bytes!
%od -c bug2.3.txt
0000000 303 203 305 275 303 202 302 261
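
The manual fix described earlier could in principle be scripted. The following is a minimal Python sketch, not part of MojoMojo: `undo_mojibake` is a hypothetical helper, and it assumes that each bad save misread the UTF-8 bytes as Windows-1252 (`cp1252`). It reverses one layer of damage at a time and stops as soon as a round trip fails or changes nothing.

```python
def undo_mojibake(text: str, wrong_charset: str = "cp1252", max_rounds: int = 10) -> str:
    """Peel off repeated 'UTF-8 misread as a single-byte charset' layers.

    Assumption: every damaged save decoded UTF-8 bytes as `wrong_charset`
    and re-encoded the result as UTF-8.
    """
    for _ in range(max_rounds):
        try:
            # Reverse one bad save: recover the bytes, then read them as UTF-8.
            candidate = text.encode(wrong_charset).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text is no longer a valid mojibake layer, so stop
        if candidate == text:
            break  # pure ASCII round-trips unchanged, nothing left to undo
        text = candidate
    return text


# The 8 bytes from the Step3 dump (octal 303 203 305 275 303 202 302 261).
damaged = bytes([0o303, 0o203, 0o305, 0o275, 0o303, 0o202, 0o302, 0o261]).decode("utf-8")
repaired = undo_mojibake(damaged)
print(repaired.encode("utf-8").hex(" "))  # ce b1, the original alpha
```

A heuristic like this can misfire on text that legitimately contains such character sequences, so it would be safer to run it on a dump and inspect the result before re-importing.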

MojoMojo Version 1.01
This is perl, v5.8.9 built for x86_64-linux
Linux 2.6.16.60-0.54.5-smp #1 SMP Fri Sep 4 01:28:03 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux

@tbull

This looks like a charset problem: the page is sent to the browser as UTF-8, but the form contents sent back are treated as being in a single-byte charset such as latin1.
To verify this suspicion, it would help if you dumped your files in hex rather than octal (use od -tx1).
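
The hypothesis can be checked against the octal dumps with a short simulation. This is a minimal Python sketch, not MojoMojo code: the `resave` helper is hypothetical, and it assumes the misread charset is Windows-1252 (`cp1252`) rather than strict latin1, because that variant maps byte 0x8E to U+017D and so yields the `305 275` pair reported in Step3.

```python
def resave(stored: bytes, wrong_charset: str = "cp1252") -> bytes:
    """Simulate one edit/save cycle in which stored UTF-8 bytes are
    misread as a single-byte charset and then re-encoded as UTF-8."""
    return stored.decode(wrong_charset).encode("utf-8")


step1 = "\u03b1".encode("utf-8")  # Greek small letter alpha
step2 = resave(step1)
step3 = resave(step2)

for label, data in (("Step1", step1), ("Step2", step2), ("Step3", step3)):
    octal = " ".join(f"{b:03o}" for b in data)
    print(f"{label}: hex {data.hex(' ')} | octal {octal}")
```

Under these assumptions the script prints:

```
Step1: hex ce b1 | octal 316 261
Step2: hex c3 8e c2 b1 | octal 303 216 302 261
Step3: hex c3 83 c5 bd c3 82 c2 b1 | octal 303 203 305 275 303 202 302 261
```

The octal columns match the three dumps in the report, and the byte count doubles on every save because each non-ASCII byte becomes a two-byte UTF-8 sequence.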
