New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding does not handle Greek files generated by Excel #64

Closed
SheetJSDev opened this Issue Jul 16, 2014 · 11 comments

Comments

Projects
None yet
3 participants
@SheetJSDev

Example file: https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls.0.csv

The encoding is windows codepage 1253. The original XLS file is https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls. To reproduce, set the language for non-Unicode characters to "Greek".

What is the correct encoding setting to parse the file? Using "iso-8859-7" appears to have no effect on the result.

@mholt mholt added the under review label Jul 16, 2014

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 16, 2014

Owner

(Mac user here)

Hm, it's whatever encoding would be passed into readAsText on the HTML5 FileReader. Does "CP1253" work, maybe? Meanwhile, I'm taking a look at the file to reproduce on my machine if I can.

Owner

mholt commented Jul 16, 2014

(Mac user here)

Hm, it's whatever encoding would be passed into readAsText on the HTML5 FileReader. Does "CP1253" work, maybe? Meanwhile, I'm taking a look at the file to reproduce on my machine if I can.

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jul 16, 2014

@mholt unfortunately CP1253 didn't help either (you don't need Windows to test this).

FYI: you can always set up Windows via Boot Camp or a virtual machine (like virtualbox or fusion). Excel performance is somewhat slower in a VM but it is convenient for quick tests like this

@mholt unfortunately CP1253 didn't help either (you don't need Windows to test this).

FYI: you can always set up Windows via Boot Camp or a virtual machine (like virtualbox or fusion). Excel performance is somewhat slower in a VM but it is convenient for quick tests like this

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 16, 2014

Owner

Thanks for helping me through this -- I'm a bit inexperienced with encodings.

I think I'm having trouble downloading the file in its proper format. GitHub shows me this:

screen shot 2014-07-16 at 1 34 26 pm

And the "Raw" version appears as:

screen shot 2014-07-16 at 1 34 20 pm

I get the same result when I "Save Link As..." -- is that what it's supposed to look like? I want to make sure I have the data in the same format as you before continuing...

Incidentally, parsing the file as I have it now seems to work OK -- what exactly is the problem?

Owner

mholt commented Jul 16, 2014

Thanks for helping me through this -- I'm a bit inexperienced with encodings.

I think I'm having trouble downloading the file in its proper format. GitHub shows me this:

screen shot 2014-07-16 at 1 34 26 pm

And the "Raw" version appears as:

screen shot 2014-07-16 at 1 34 20 pm

I get the same result when I "Save Link As..." -- is that what it's supposed to look like? I want to make sure I have the data in the same format as you before continuing...

Incidentally, parsing the file as I have it now seems to work OK -- what exactly is the problem?

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jul 16, 2014

pbcopy and other utilities may end up transparently converting to UTF-8. The safest way to do this is to pass a base64-encoded string, as follows:

$ curl https://raw.githubusercontent.com/SheetJS/test_files/master/biff5/number_format_greek.xls.0.csv | base64 | pbcopy

(pbcopy is a cool little program that copies data to the pasteboard)

Open up the javascript console on the demo page and type document.getElementById('input').value = atob(", then paste, then type "). That should be something like http://hastebin.com/raw/oxijuzuzow

To see what Excel shows in the correct codepage, I threw up a quick demo: http://sheetjs.com/demos/codepage.html. Select codepage 1253, paste the base64 text in the textarea, check the base64 box, and click Convert. You should see:

The demo uses the codepage library, but that is definitely inappropriate in this context. I was wondering if there was a way to process CSVs from excel using your encoding parameter (or maybe you can clarify what the encoding is supposed to do -- do you have a sample file with a different encoding?).

pbcopy and other utilities may end up transparently converting to UTF-8. The safest way to do this is to pass a base64-encoded string, as follows:

$ curl https://raw.githubusercontent.com/SheetJS/test_files/master/biff5/number_format_greek.xls.0.csv | base64 | pbcopy

(pbcopy is a cool little program that copies data to the pasteboard)

Open up the javascript console on the demo page and type document.getElementById('input').value = atob(", then paste, then type "). That should be something like http://hastebin.com/raw/oxijuzuzow

To see what Excel shows in the correct codepage, I threw up a quick demo: http://sheetjs.com/demos/codepage.html. Select codepage 1253, paste the base64 text in the textarea, check the base64 box, and click Convert. You should see:

The demo uses the codepage library, but that is definitely inappropriate in this context. I was wondering if there was a way to process CSVs from excel using your encoding parameter (or maybe you can clarify what the encoding is supposed to do -- do you have a sample file with a different encoding?).

@mholt mholt changed the title from Encoding does not handle files generated by Excel to Encoding does not handle Greek files generated by Excel Jul 16, 2014

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 16, 2014

Owner

Okay, thanks. That helps me feel confident that I'm working with the same data now!

But, unfortunately, I'm still lost as to where things aren't working. You say it does not "handle" the file, but everything I'm trying seems to work fine. What's happening that's different from what's expected?

Owner

mholt commented Jul 16, 2014

Okay, thanks. That helps me feel confident that I'm working with the same data now!

But, unfortunately, I'm still lost as to where things aren't working. You say it does not "handle" the file, but everything I'm trying seems to work fine. What's happening that's different from what's expected?

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jul 16, 2014

If I run the data through the console, I don't see the same characters that excel shows:

The second column according to the parser is "Ïêô-33". When I open the CSV in Excel, I see that the second column is "Οκτ-33".

If I run the data through the console, I don't see the same characters that excel shows:

The second column according to the parser is "Ïêô-33". When I open the CSV in Excel, I see that the second column is "Οκτ-33".

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 16, 2014

Owner

Okay. That helps a lot. Now I see the specific characters that you're talking about. Using your tool I was able to get the get the file saved with the correct encoding so that the data looks like "Οκτ-33" whereas before it was "Ïêô-33" or question marks. I used Sublime Text and saved the file with the Windows 1253 encoding and re-opened it to confirm that everything was preserved. It copied and pasted and again confirmed that the characters/encoding were preserved.

I then used the demo page to choose the file from my file system and parsed it using the default (UTF-8) encoding. On row 14 indeed, I saw what you see: "Ïêô-33"

Then I changed the encoding in the encoding textbox to "cp1253" and re-parsed. The same field appeared as "Οκτ-33". I also tried it using "iso-8859-7" and had the same result. Both seemed to work.

screen shot 2014-07-16 at 3 34 53 pm

As far as I can tell, this is working properly. Is the CSV file saved on your system as UTF-8 or Windows 1253? (From your instructions, it sounds like it is...) Does knowing the steps I took help?

Owner

mholt commented Jul 16, 2014

Okay. That helps a lot. Now I see the specific characters that you're talking about. Using your tool I was able to get the get the file saved with the correct encoding so that the data looks like "Οκτ-33" whereas before it was "Ïêô-33" or question marks. I used Sublime Text and saved the file with the Windows 1253 encoding and re-opened it to confirm that everything was preserved. It copied and pasted and again confirmed that the characters/encoding were preserved.

I then used the demo page to choose the file from my file system and parsed it using the default (UTF-8) encoding. On row 14 indeed, I saw what you see: "Ïêô-33"

Then I changed the encoding in the encoding textbox to "cp1253" and re-parsed. The same field appeared as "Οκτ-33". I also tried it using "iso-8859-7" and had the same result. Both seemed to work.

screen shot 2014-07-16 at 3 34 53 pm

As far as I can tell, this is working properly. Is the CSV file saved on your system as UTF-8 or Windows 1253? (From your instructions, it sounds like it is...) Does knowing the steps I took help?

@mholt mholt added question and removed under review labels Jul 16, 2014

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jul 16, 2014

I see my confusion: the encoding parameter only applies to files, not to data pushed via the textarea.

This probably should be mentioned on the demo page.

I see my confusion: the encoding parameter only applies to files, not to data pushed via the textarea.

This probably should be mentioned on the demo page.

@SheetJSDev SheetJSDev closed this Jul 16, 2014

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 16, 2014

Owner

Ahhhhh. I understand now. Sorry for the confusion. I'll clarify that on the demo page, indeed.

Owner

mholt commented Jul 16, 2014

Ahhhhh. I understand now. Sorry for the confusion. I'll clarify that on the demo page, indeed.

@mholt

This comment has been minimized.

Show comment
Hide comment
@mholt

mholt Jul 28, 2014

Owner

As a sort of follow-up, this looks pretty promising for dealing with encodings when parsing strings: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/iWDqDWQ8mhs

Owner

mholt commented Jul 28, 2014

As a sort of follow-up, this looks pretty promising for dealing with encodings when parsing strings: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/iWDqDWQ8mhs

@mharn

This comment has been minimized.

Show comment
Hide comment
@mharn

mharn Jun 30, 2017

FWIW I've had success with using Encoding.js (https://github.com/polygonplanet/encoding.js) to detect some encodings including Japanese text (e.g. SJIS) and then run Papaparse based on that result.

Be warned that Encoding.js says that it mainly supports Unicode and Japanese formats, and not other European-centric ones.

`

<script type="text/javascript" src="./js/encoding.min.js"></script>

function onFileSelect(event) {
var file = event.target.files[0];

var reader = new FileReader();
reader.onload = function(e) {
  var codes = new Uint8Array(e.target.result);
  var encoding = Encoding.detect(codes);
  console.log(encoding);
  var files = event.target.files;
  Papa.parse(files[0], {
    skipEmptyLines: true,  // need this or papaparse adds a blank entry (despite csv only have 5 lines, it gives a 6th empty string)
    header: true, // testing adding a source/target header
    encoding: encoding,
    complete: function(results) {
	      console.log(results.data);
     }
  });
};

reader.readAsArrayBuffer(file);

}

document.getElementById('encoding.js').addEventListener('change', onFileSelect, false);
`

mharn commented Jun 30, 2017

FWIW I've had success with using Encoding.js (https://github.com/polygonplanet/encoding.js) to detect some encodings including Japanese text (e.g. SJIS) and then run Papaparse based on that result.

Be warned that Encoding.js says that it mainly supports Unicode and Japanese formats, and not other European-centric ones.

`

<script type="text/javascript" src="./js/encoding.min.js"></script>

function onFileSelect(event) {
var file = event.target.files[0];

var reader = new FileReader();
reader.onload = function(e) {
  var codes = new Uint8Array(e.target.result);
  var encoding = Encoding.detect(codes);
  console.log(encoding);
  var files = event.target.files;
  Papa.parse(files[0], {
    skipEmptyLines: true,  // need this or papaparse adds a blank entry (despite csv only have 5 lines, it gives a 6th empty string)
    header: true, // testing adding a source/target header
    encoding: encoding,
    complete: function(results) {
	      console.log(results.data);
     }
  });
};

reader.readAsArrayBuffer(file);

}

document.getElementById('encoding.js').addEventListener('change', onFileSelect, false);
`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment