UTF-8 Character set/encoding of text stimuli not recognized in online experiment #2299

Closed
jvcasillas opened this issue Feb 22, 2019 · 8 comments · Fixed by psychopy/psychojs#137

@jvcasillas

Accented characters that work locally do not show up in online experiments when the stimuli are drawn from a conditions file in a loop. Example here: https://pavlovia.org/run/jvcasillas/lextale_sp_template/html/

This issue is referenced in the psychopy forums here: https://discourse.psychopy.org/t/including-utf-8-unicode-characters-in-online-experiments/6723

@jvcasillas
Author

I have resolved this issue by saving my conditions file as an Excel worksheet. I was using a .csv file and saving it with UTF-8 encoding in Sublime Text, but apparently that wasn't working as I thought it was. After moving the list to an Excel file, the accented characters now show up as they should.

@peircej
Member

peircej commented Mar 15, 2019

Thanks for the info. I think it suggests there's still something to fix here in our decoding of csv files but I'm glad it's now working for you.

@hsogo
Contributor

hsogo commented Jun 12, 2019

Hi, I've also experienced this issue.
I'm not familiar with JavaScript, but I guess the byte order mark (BOM) may have something to do with this. According to the following pages, at least when exporting to xlsx, a BOM is necessary for Excel to re-open the exported xlsx files.

I created a test experiment to confirm that a UTF-8 CSV file with a BOM can be imported correctly. It is a Japanese Stroop task with three conditions files. The conditions file can be selected in the expInfo dialog at the beginning of the experiment.

  1. cnd.xlsx: Conditions file saved as xlsx file.
  2. cnd.csv: Conditions file saved as UTF-8 CSV file (without BOM).
  3. cnd_with_bom.csv: Conditions file saved as UTF-8 CSV file with BOM.

The results were as follows. As jvcasillas reported, the xlsx file worked fine (1) while the UTF-8 CSV file without a BOM didn't (2). The UTF-8 CSV file with a BOM worked fine (3).

[Image: results for the three conditions files]

So, if we add a BOM to a CSV file that lacks one, the CSV file would be read correctly, I guess.

Another possible way to solve this issue would be to specify codepage (65001) when opening CSV file.
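The first idea above, adding a BOM to a CSV file that lacks one, can be sketched as follows. This is only an illustration operating on raw bytes, not PsychoJS code; `hasUtf8Bom` and `ensureUtf8Bom` are hypothetical helper names, and the sample CSV content is made up.

```javascript
// The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = [0xef, 0xbb, 0xbf];

// Hypothetical helper: true if the byte array starts with the UTF-8 BOM.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 && UTF8_BOM.every((b, i) => bytes[i] === b);
}

// Hypothetical helper: return the bytes unchanged if they already start
// with a BOM, otherwise return a new array with the BOM prepended.
function ensureUtf8Bom(bytes) {
  if (hasUtf8Bom(bytes)) return bytes;
  const out = new Uint8Array(bytes.length + UTF8_BOM.length);
  out.set(UTF8_BOM, 0);
  out.set(bytes, UTF8_BOM.length);
  return out;
}

// Made-up CSV content; TextEncoder produces UTF-8 without a BOM.
const csvBytes = new TextEncoder().encode("word,correct\ncanción,1\n");
const csvBytesWithBom = ensureUtf8Bom(csvBytes);
console.log(hasUtf8Bom(csvBytes), hasUtf8Bom(csvBytesWithBom));
```

Calling `ensureUtf8Bom` on bytes that already carry a BOM returns them as-is, so the helper is safe to apply more than once.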

Sorry that I'm not good enough at JavaScript to test this myself. I hope this information will be of some help.

@lnnrtwttkhn

lnnrtwttkhn commented May 27, 2020

I had the same issue and @hsogo's solution (saving the .csv file with UTF-8 and BOM) solved it! In my case, I create the conditions.csv file with pandas, so I could simply add encoding='utf-8-sig' when saving the pandas dataframe to .csv (e.g., df.to_csv('conditions.csv', encoding='utf-8-sig')). Thanks @hsogo!

@drakeasberry

drakeasberry commented Jun 8, 2020

@hsogo Thank you for the proposed solution; it is working for my online experiment. I was trying to understand the workings of the BOM a little better and noticed that the Python docs say that using a BOM with UTF-8 should be avoided.

Are there other side-effects that experimenters should be aware of when using BOM with utf-8 or is there a better alternative we should be using?

@hsogo
Contributor

hsogo commented Jun 11, 2020

Sorry, I'm not sure about potential issues of BOM with utf-8.

By the way, now that local debugging of PsychoJS works on my PC (Japanese Windows 10, PsychoJS 2020.1, Firefox 77.0.1), I tried to fix this problem. I found that line 297 of data-2020.1.js reads the conditions file.

const workbook = XLSX.read(new Uint8Array(resourceValue), { type: "array" });

After replacing this line with the following, Japanese characters in CSV files were read correctly regardless of the BOM.

const workbook = XLSX.read(new TextDecoder().decode(new Uint8Array(resourceValue)), { type: "string" });

However, this modification caused an error when reading xlsx files, so I added an if statement as follows. This worked with xlsx, CSV without BOM, and CSV with BOM in my environment.

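let workbook;
// CSV: decode the bytes as UTF-8 text first, then parse the string.
if (['csv'].indexOf(resourceExtension) > -1)
	workbook = XLSX.read(new TextDecoder().decode(new Uint8Array(resourceValue)), { type: "string" });
// Other formats (e.g. xlsx): pass the raw bytes as before.
else
	workbook = XLSX.read(new Uint8Array(resourceValue), { type: "array" });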
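One likely reason the `TextDecoder` route behaves the same with and without a BOM is that a default `TextDecoder` decodes as UTF-8 and drops a leading BOM (its `ignoreBOM` option defaults to `false`). The sketch below shows only that decoding step in isolation; it does not call `XLSX.read`, and `decodeCsvBytes` and the sample text are hypothetical.

```javascript
// Hypothetical helper: decode raw CSV bytes to a string as UTF-8.
// With the default options (ignoreBOM: false), a leading BOM is
// consumed by the decoder and does not appear in the result.
function decodeCsvBytes(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

// Made-up conditions content with Japanese characters.
const text = "word,color\n赤,red\n";
const bytesWithoutBom = new TextEncoder().encode(text);
const bytesWithBom = new Uint8Array([0xef, 0xbb, 0xbf, ...bytesWithoutBom]);

const decodedWithoutBom = decodeCsvBytes(bytesWithoutBom);
const decodedWithBom = decodeCsvBytes(bytesWithBom);

// Both inputs decode to the same string, with no U+FEFF at the start.
console.log(decodedWithoutBom === decodedWithBom);
```

The decoded string could then be handed to a CSV parser in string mode, which is what the `{ type: "string" }` branch above does.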

Unfortunately, I don't know how to test this on the Pavlovia server. @peircej What should I do?

@peircej
Member

peircej commented Jun 11, 2020

We had some discussion a while ago about whether using utf-8-sig was a problem for data files (#2166). In the end we implemented it as the default and it does not appear to have introduced any problems. @hoechenberger tested on a range of software and couldn't find anything that tripped up when the BOM was present. One thing that's interesting is that the BOM is not technically needed for its original purpose by UTF-8 (because the byte order is a part of the encoding) but it is nonetheless useful in helping the receiving application to detect that this is UTF-8.
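To illustrate the detection point, here is a minimal sketch of how a receiving application might guess the encoding from a leading BOM. `sniffBom` is a hypothetical helper, not part of PsychoPy or PsychoJS, and it is deliberately simplified: it covers only the UTF-8 and UTF-16 signatures and returns `null` when there is no BOM, in which case a real application would have to fall back to a default or a heuristic.

```javascript
// Hypothetical helper: guess an encoding from a leading byte order mark.
// UTF-8 has a single signature (EF BB BF); UTF-16 has two, one per byte order.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return "utf-16be";
  }
  return null; // no BOM: the encoding cannot be determined this way
}

// TextEncoder always produces UTF-8; U+FEFF encodes to EF BB BF.
const unsignedBytes = new TextEncoder().encode("é");
const signedBytes = new TextEncoder().encode("\uFEFFé");

console.log(sniffBom(unsignedBytes), sniffBom(signedBytes));
```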

Obviously it's ideal if @hsogo's fix means that people don't need BOM-encoded files. @hsogo, would you be able to submit a pull request with your fix to the https://github.com/psychopy/psychojs repository, so that @apitiot can review it and pull it in from there?

@hsogo
Contributor

hsogo commented Jun 12, 2020

I've sent a pull request. psychopy/psychojs#95
