Encoding does not handle Greek files generated by Excel #64

Closed
SheetJSDev opened this Issue Jul 16, 2014 · 10 comments

Projects

None yet

2 participants

@SheetJSDev

Example file: https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls.0.csv

The encoding is windows codepage 1253. The original XLS file is https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls. To reproduce, set the language for non-Unicode characters to "Greek".

What is the correct encoding setting to parse the file? Using "iso-8859-7" appears to have no effect on the result.

@mholt mholt added the under review label Jul 16, 2014
@mholt
Owner
mholt commented Jul 16, 2014

(Mac user here)

Hm, it's whatever encoding would be passed into readAsText on the HTML5 FileReader. Does "CP1253" work, maybe? Meanwhile, I'm taking a look at the file to reproduce on my machine if I can.

@SheetJSDev

@mholt unfortunately CP1253 didn't help either (you don't need Windows to test this).

FYI: you can always set up Windows via Boot Camp or a virtual machine (like virtualbox or fusion). Excel performance is somewhat slower in a VM but it is convenient for quick tests like this

@mholt
Owner
mholt commented Jul 16, 2014

Thanks for helping me through this -- I'm a bit inexperienced with encodings.

I think I'm having trouble downloading the file in its proper format. GitHub shows me this:

screen shot 2014-07-16 at 1 34 26 pm

And the "Raw" version appears as:

screen shot 2014-07-16 at 1 34 20 pm

I get the same result when I "Save Link As..." -- is that what it's supposed to look like? I want to make sure I have the data in the same format as you before continuing...

Incidentally, parsing the file as I have it now seems to work OK -- what exactly is the problem?

@SheetJSDev

pbcopy and other utilities may end up transparently converting to UTF-8. The safest way to do this is to pass a base64-encoded string, as follows:

$ curl https://raw.githubusercontent.com/SheetJS/test_files/master/biff5/number_format_greek.xls.0.csv | base64 | pbcopy

(pbcopy is a cool little program that copies data to the pasteboard)

Open up the javascript console on the demo page and type document.getElementById('input').value = atob(", then paste, then type "). That should be something like http://hastebin.com/raw/oxijuzuzow

To see what Excel shows in the correct codepage, I threw up a quick demo: http://sheetjs.com/demos/codepage.html. Select codepage 1253, paste the base64 text in the textarea, check the base64 box, and click Convert. You should see:

The demo uses the codepage library, but that is definitely inappropriate in this context. I was wondering if there was a way to process CSVs from excel using your encoding parameter (or maybe you can clarify what the encoding is supposed to do -- do you have a sample file with a different encoding?).

@mholt mholt changed the title from Encoding does not handle files generated by Excel to Encoding does not handle Greek files generated by Excel Jul 16, 2014
@mholt
Owner
mholt commented Jul 16, 2014

Okay, thanks. That helps me feel confident that I'm working with the same data now!

But, unfortunately, I'm still lost as to where things aren't working. You say it does not "handle" the file, but everything I'm trying seems to work fine. What's happening that's different from what's expected?

@SheetJSDev

If I run the data through the console, I don't see the same characters that excel shows:

The second column according to the parser is "Ïêô-33". When I open the CSV in Excel, I see that the second column is "Οκτ-33".

@mholt
Owner
mholt commented Jul 16, 2014

Okay. That helps a lot. Now I see the specific characters that you're talking about. Using your tool I was able to get the get the file saved with the correct encoding so that the data looks like "Οκτ-33" whereas before it was "Ïêô-33" or question marks. I used Sublime Text and saved the file with the Windows 1253 encoding and re-opened it to confirm that everything was preserved. It copied and pasted and again confirmed that the characters/encoding were preserved.

I then used the demo page to choose the file from my file system and parsed it using the default (UTF-8) encoding. On row 14 indeed, I saw what you see: "Ïêô-33"

Then I changed the encoding in the encoding textbox to "cp1253" and re-parsed. The same field appeared as "Οκτ-33". I also tried it using "iso-8859-7" and had the same result. Both seemed to work.

screen shot 2014-07-16 at 3 34 53 pm

As far as I can tell, this is working properly. Is the CSV file saved on your system as UTF-8 or Windows 1253? (From your instructions, it sounds like it is...) Does knowing the steps I took help?

@mholt mholt added question and removed under review labels Jul 16, 2014
@SheetJSDev

I see my confusion: the encoding parameter only applies to files, not to data pushed via the textarea.

This probably should be mentioned on the demo page.

@SheetJSDev SheetJSDev closed this Jul 16, 2014
@mholt
Owner
mholt commented Jul 16, 2014

Ahhhhh. I understand now. Sorry for the confusion. I'll clarify that on the demo page, indeed.

@mholt mholt added the invalid label Jul 19, 2014
@mholt
Owner
mholt commented Jul 28, 2014

As a sort of follow-up, this looks pretty promising for dealing with encodings when parsing strings: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/iWDqDWQ8mhs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment