Encoding does not handle Greek files generated by Excel #64
Comments
(Mac user here) Hm, it's whatever encoding would be passed into readAsText on the HTML5 FileReader. Does "CP1253" work, maybe? Meanwhile, I'm taking a look at the file to reproduce on my machine if I can. |
@mholt unfortunately CP1253 didn't help either (you don't need Windows to test this). FYI: you can always set up Windows via Boot Camp or a virtual machine (like VirtualBox or VMware Fusion). Excel performance is somewhat slower in a VM, but it is convenient for quick tests like this. |
Thanks for helping me through this -- I'm a bit inexperienced with encodings. I think I'm having trouble downloading the file in its proper format. GitHub shows me this: And the "Raw" version appears as: I get the same result when I "Save Link As..." -- is that what it's supposed to look like? I want to make sure I have the data in the same format as you before continuing... Incidentally, parsing the file as I have it now seems to work OK -- what exactly is the problem? |
pbcopy and other utilities may end up transparently converting to UTF-8. The safest way to do this is to pass a base64-encoded string, as follows:
(pbcopy is a cool little program that copies data to the pasteboard) Open up the JavaScript console on the demo page and type To see what Excel shows in the correct codepage, I threw up a quick demo: http://sheetjs.com/demos/codepage.html. Select codepage 1253, paste the base64 text in the textarea, check the base64 box, and click Convert. You should see: The demo uses the codepage library, but that is definitely inappropriate in this context. I was wondering if there was a way to process CSVs from Excel using your encoding parameter (or maybe you can clarify what the encoding is supposed to do -- do you have a sample file with a different encoding?). |
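For readers without access to the demo, the base64 round trip described above can be approximated with standard APIs. This is only a sketch: the six sample bytes are an assumption (the Windows-1253 encoding of the one field "Οκτ-33" quoted later in the thread, not the whole CSV), `base64ToBytes` and `decodeBase64As` are hypothetical helper names, and it relies on the runtime's `TextDecoder` supporting the `windows-1253` label, which it does not use the codepage library for.

```javascript
// Assumed sample: base64 of the bytes CF EA F4 2D 33 33,
// i.e. "Οκτ-33" as stored in a Windows-1253 file.
const sampleBase64 = "z+r0LTMz";

// atob returns a "binary string" (one char per byte); copy it into a byte array
// so that nothing along the way can silently re-encode the data as UTF-8.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hypothetical helper: decode base64 payload using a named legacy encoding.
function decodeBase64As(b64, label) {
  return new TextDecoder(label).decode(base64ToBytes(b64));
}

const greekField = decodeBase64As(sampleBase64, "windows-1253");
console.log(greekField);
```

Passing base64 rather than pasted text keeps the original bytes intact until the point where the codepage is chosen explicitly.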
Okay, thanks. That helps me feel confident that I'm working with the same data now! But, unfortunately, I'm still lost as to where things aren't working. You say it does not "handle" the file, but everything I'm trying seems to work fine. What's happening that's different from what's expected? |
Okay. That helps a lot. Now I see the specific characters that you're talking about. Using your tool I was able to get the file saved with the correct encoding so that the data looks like "Οκτ-33", whereas before it was "Ïêô-33" or question marks. I used Sublime Text and saved the file with the Windows 1253 encoding and re-opened it to confirm that everything was preserved. I copied and pasted it and again confirmed that the characters/encoding were preserved.

I then used the demo page to choose the file from my file system and parsed it using the default (UTF-8) encoding. On row 14, indeed, I saw what you see: "Ïêô-33". Then I changed the encoding in the encoding textbox to "cp1253" and re-parsed. The same field appeared as "Οκτ-33". I also tried it using "iso-8859-7" and had the same result. Both seemed to work.

As far as I can tell, this is working properly. Is the CSV file saved on your system as UTF-8 or Windows 1253? (From your instructions, it sounds like it is...) Does knowing the steps I took help? |
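The effect of the encoding choice on that one field can be reproduced outside the demo page. The sketch below assumes the field is stored as the bytes CF EA F4 2D 33 33 and uses the standard `TextDecoder` rather than Papa Parse or `FileReader`; `decodeWith` is a hypothetical helper name.

```javascript
// Assumed bytes of the row-14 field as saved by Excel in Windows-1253.
const fieldBytes = new Uint8Array([0xcf, 0xea, 0xf4, 0x2d, 0x33, 0x33]);

// Hypothetical helper: decode the same bytes under a given encoding label.
function decodeWith(bytes, label) {
  return new TextDecoder(label).decode(bytes);
}

// Invalid as UTF-8, so the Greek bytes become replacement characters.
const asUtf8 = decodeWith(fieldBytes, "utf-8");
// A Western European codepage gives the accented Latin letters from the thread.
const asWestern = decodeWith(fieldBytes, "windows-1252");
// Either Greek encoding recovers the intended text for these particular bytes.
const asCp1253 = decodeWith(fieldBytes, "windows-1253");
const asIso88597 = decodeWith(fieldBytes, "iso-8859-7");

console.log({ asUtf8, asWestern, asCp1253, asIso88597 });
```

Windows-1253 and ISO-8859-7 agree on the Greek alphabet but differ at a few other byte values, so the two labels are interchangeable for this field and not necessarily for every file.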
I see my confusion: the encoding parameter only applies to files, not to data pushed via the textarea. This probably should be mentioned on the demo page. |
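Why an encoding option can only matter for files is easier to see in code. The following is a simplified, hypothetical stand-in for a parser's input handling, not Papa Parse's actual implementation: `readInput` and the sample values are assumptions made for illustration.

```javascript
// Hypothetical input handling: strings are already decoded, bytes are not.
function readInput(input, encoding = "utf-8") {
  if (typeof input === "string") {
    // Text from a textarea is already a JavaScript string;
    // the encoding option has nothing left to act on.
    return input;
  }
  // Raw bytes (what a file contains): the encoding option decides the text.
  return new TextDecoder(encoding).decode(input);
}

// Assumed file content: "Οκτ-33" stored as Windows-1253 bytes.
const rawFileBytes = new Uint8Array([0xcf, 0xea, 0xf4, 0x2d, 0x33, 0x33]);
// Assumed textarea content: the same field after it was mis-decoded upstream.
const pastedText = "Ïêô-33";

const fromFile = readInput(rawFileBytes, "windows-1253");
const fromTextarea = readInput(pastedText, "windows-1253");

console.log(fromFile, fromTextarea);
```

In this sketch the file path recovers the Greek text, while the pasted string comes back unchanged no matter which encoding is requested.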
Ahhhhh. I understand now. Sorry for the confusion. I'll clarify that on the demo page, indeed. |
As a sort of follow-up, this looks pretty promising for dealing with encodings when parsing strings: https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/iWDqDWQ8mhs |
FWIW I've had success using Encoding.js (https://github.com/polygonplanet/encoding.js) to detect some encodings, including Japanese text (e.g. SJIS), and then running Papa Parse based on that result. Be warned that Encoding.js says that it mainly supports Unicode and Japanese formats, and not European-centric ones.

```
<script type="text/javascript" src="./js/encoding.min.js"></script>
function onFileSelect(event) {
}
document.getElementById('encoding.js').addEventListener('change', onFileSelect, false);
```
Example file: https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls.0.csv
The encoding is Windows codepage 1253. The original XLS file is https://github.com/SheetJS/test_files/blob/master/biff5/number_format_greek.xls. To reproduce, set the language for non-Unicode characters to "Greek".
What is the correct encoding setting to parse the file? Using "iso-8859-7" appears to have no effect on the result.