Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding does not handle Greek files generated by Excel #64

Closed
SheetJSDev opened this issue Jul 16, 2014 · 11 comments
Closed

Encoding does not handle Greek files generated by Excel #64

SheetJSDev opened this issue Jul 16, 2014 · 11 comments

Comments

@SheetJSDev
Copy link

Example file: https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls.0.csv

The encoding is windows codepage 1253. The original XLS file is https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls. To reproduce, set the language for non-Unicode characters to "Greek".

What is the correct encoding setting to parse the file? Using "iso-8859-7" appears to have no effect on the result.

@mholt
Copy link
Owner

mholt commented Jul 16, 2014

(Mac user here)

Hm, it's whatever encoding would be passed into readAsText on the HTML5 FileReader. Does "CP1253" work, maybe? Meanwhile, I'm taking a look at the file to reproduce on my machine if I can.

@SheetJSDev
Copy link
Author

@mholt unfortunately CP1253 didn't help either (you don't need Windows to test this).

FYI: you can always set up Windows via Boot Camp or a virtual machine (like virtualbox or fusion). Excel performance is somewhat slower in a VM but it is convenient for quick tests like this

@mholt
Copy link
Owner

mholt commented Jul 16, 2014

Thanks for helping me through this -- I'm a bit inexperienced with encodings.

I think I'm having trouble downloading the file in its proper format. GitHub shows me this:

screen shot 2014-07-16 at 1 34 26 pm

And the "Raw" version appears as:

screen shot 2014-07-16 at 1 34 20 pm

I get the same result when I "Save Link As..." -- is that what it's supposed to look like? I want to make sure I have the data in the same format as you before continuing...

Incidentally, parsing the file as I have it now seems to work OK -- what exactly is the problem?

@SheetJSDev
Copy link
Author

pbcopy and other utilities may end up transparently converting to UTF-8. The safest way to do this is to pass a base64-encoded string, as follows:

$ curl https://raw.githubusercontent.com/SheetJS/test_files/master/biff5/number_format_greek.xls.0.csv | base64 | pbcopy

(pbcopy is a cool little program that copies data to the pasteboard)

Open up the javascript console on the demo page and type document.getElementById('input').value = atob(", then paste, then type "). That should be something like http://hastebin.com/raw/oxijuzuzow

To see what Excel shows in the correct codepage, I threw up a quick demo: http://sheetjs.com/demos/codepage.html. Select codepage 1253, paste the base64 text in the textarea, check the base64 box, and click Convert. You should see:

The demo uses the codepage library, but that is definitely inappropriate in this context. I was wondering if there was a way to process CSVs from excel using your encoding parameter (or maybe you can clarify what the encoding is supposed to do -- do you have a sample file with a different encoding?).

@mholt mholt changed the title Encoding does not handle files generated by Excel Encoding does not handle Greek files generated by Excel Jul 16, 2014
@mholt
Copy link
Owner

mholt commented Jul 16, 2014

Okay, thanks. That helps me feel confident that I'm working with the same data now!

But, unfortunately, I'm still lost as to where things aren't working. You say it does not "handle" the file, but everything I'm trying seems to work fine. What's happening that's different from what's expected?

@SheetJSDev
Copy link
Author

If I run the data through the console, I don't see the same characters that excel shows:

The second column according to the parser is "Ïêô-33". When I open the CSV in Excel, I see that the second column is "Οκτ-33".

@mholt
Copy link
Owner

mholt commented Jul 16, 2014

Okay. That helps a lot. Now I see the specific characters that you're talking about. Using your tool I was able to get the get the file saved with the correct encoding so that the data looks like "Οκτ-33" whereas before it was "Ïêô-33" or question marks. I used Sublime Text and saved the file with the Windows 1253 encoding and re-opened it to confirm that everything was preserved. It copied and pasted and again confirmed that the characters/encoding were preserved.

I then used the demo page to choose the file from my file system and parsed it using the default (UTF-8) encoding. On row 14 indeed, I saw what you see: "Ïêô-33"

Then I changed the encoding in the encoding textbox to "cp1253" and re-parsed. The same field appeared as "Οκτ-33". I also tried it using "iso-8859-7" and had the same result. Both seemed to work.

screen shot 2014-07-16 at 3 34 53 pm

As far as I can tell, this is working properly. Is the CSV file saved on your system as UTF-8 or Windows 1253? (From your instructions, it sounds like it is...) Does knowing the steps I took help?

@SheetJSDev
Copy link
Author

I see my confusion: the encoding parameter only applies to files, not to data pushed via the textarea.

This probably should be mentioned on the demo page.

@mholt
Copy link
Owner

mholt commented Jul 16, 2014

Ahhhhh. I understand now. Sorry for the confusion. I'll clarify that on the demo page, indeed.

@mholt
Copy link
Owner

mholt commented Jul 28, 2014

As a sort of follow-up, this looks pretty promising for dealing with encodings when parsing strings: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/iWDqDWQ8mhs

@mharn
Copy link

mharn commented Jun 30, 2017

FWIW I've had success with using Encoding.js (https://github.com/polygonplanet/encoding.js) to detect some encodings including Japanese text (e.g. SJIS) and then run Papaparse based on that result.

Be warned that Encoding.js says that it mainly supports Unicode and Japanese formats, and not other European-centric ones.

`

<script type="text/javascript" src="./js/encoding.min.js"></script>

function onFileSelect(event) {
var file = event.target.files[0];

var reader = new FileReader();
reader.onload = function(e) {
  var codes = new Uint8Array(e.target.result);
  var encoding = Encoding.detect(codes);
  console.log(encoding);
  var files = event.target.files;
  Papa.parse(files[0], {
    skipEmptyLines: true,  // need this or papaparse adds a blank entry (despite csv only have 5 lines, it gives a 6th empty string)
    header: true, // testing adding a source/target header
    encoding: encoding,
    complete: function(results) {
	      console.log(results.data);
     }
  });
};

reader.readAsArrayBuffer(file);

}

document.getElementById('encoding.js').addEventListener('change', onFileSelect, false);
`

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants