Multilingual#934
… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)
…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf
… suggested by goehle
added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…
…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm
|
It shouldn't matter but I'll fix it. I can get user names with extended utf8 to work in my docker |
|
As it stands I can store Chinese characters correctly in the database. When I retrieve them they produce entries such as: Any ideas? The code is the same and the settings in mysql.cnf seem to be the same on the two machines. |
|
I have a conjecture below about what is causing the problem. Note: I added the Given that, I don't think there is any problem with the WW code side of things. @mgage What character set are the relevant database columns set to use? I suspect that the problem you ran into is that the database table for the course you tested is using the Here is why I suspect this:
It is possible that this data was put into the database before you had some of the UTF8 settings made, so that the database understood whatever was set to it as these six I suspect that "fixing" the database table so that the relevant columns use the Note: Reading what you are getting in the log file as if it was supposed to be UTF-8 (see https://en.wikipedia.org/wiki/UTF-8) would give:
Small comment: I'm guessing the warn line you included and the one used when the page was generated were a bit different... "user values" and apparently the "login name" show up in the page/log record and not the warn code given above. |
|
Thanks for the response Tani. It has helped give me some ideas, but I haven't found the culprit yet. At the moment I can't see anything wrong, but this is a good way to document what I've found so far. Perhaps you'll notice something I missed. I'm working on some more tests suggested by your analysis -- but I have a graduation to attend this afternoon so I may not get the results posted until late tonight or tomorrow. demo_mysql_variables_output.txt |
|
Here is the output showing some contents of the table chinese_language_course2_user on demo.webwork: The first name for herman has been retrieved to the editing page and then re-saved and is therefore garbled. The first names of gage and madhu are Chinese characters entered directly from the keyboard. |
|
I saved the source of this page to a file (using the browser), deleted almost everything except the first names in the listing you gave of the output of The Chinese character 呵 for the first name for madhu is encoded using three bytes:
The three characters � for the first name of herman seem to have the 7 bytes:
This strongly hints to me that the three byte sequence The big question is where exactly this second encoding to UTF-8 is happening and why it is happening. I took a look at I tried checking whether the HTML page of demo.webwork for the Note: The math4 template does this in one manner ( I am not having any problem with this Chinese character in my testing on a Docker based server with either math4 or math3. I also tried using browser tools to examine the data being sent by POST when a change is made. It looked fine to me. Thus, I suspect that the mangling is happening on the server side. Maybe have a look at the Apache configuration. Could there be an Apache setting forcing a different default charset instead of UTF8? |
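To make the "second encoding to UTF-8" idea concrete, here is a small Python sketch (not WeBWorK code, which is Perl). It encodes a character as UTF-8, misreads those bytes in a single-byte charset, and encodes the result as UTF-8 again. The two intermediate charsets tried here, `latin-1` and `cp1252`, are assumptions for illustration; nothing in this thread confirms which one, if either, was involved on demo.webwork. The helper names `double_encode` and `repair` are made up for this example.

```python
# Illustration only: simulate "double encoding" of UTF-8 text.
# The intermediate charsets (latin-1, cp1252) are assumptions, not
# something confirmed for the demo.webwork server.

def double_encode(text: str, wrong_charset: str) -> bytes:
    """Encode as UTF-8, misread the bytes as wrong_charset, re-encode as UTF-8."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode(wrong_charset)  # each byte becomes one character
    return misread.encode("utf-8")


def repair(mangled: bytes, wrong_charset: str) -> str:
    """Undo one round of double encoding (only valid if wrong_charset is right)."""
    return mangled.decode("utf-8").encode(wrong_charset).decode("utf-8")


original = "呵"
correct = original.encode("utf-8")
via_latin1 = double_encode(original, "latin-1")
via_cp1252 = double_encode(original, "cp1252")

for label, data in [("correct", correct),
                    ("via latin-1", via_latin1),
                    ("via cp1252", via_cp1252)]:
    # Byte length, number of characters when read as UTF-8, and the raw hex.
    print(f"{label:12} bytes={len(data)} chars={len(data.decode('utf-8'))} hex={data.hex()}")
```

In both mangled cases the stored value reads back as three characters instead of one. The byte counts differ between the two assumed charsets, so comparing them with what the database actually holds might help narrow down where the extra encoding step happens.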
|
Some light at the end of the tunnel? :-) With these lines The variants line prints correctly. So apparently I need to decode the return values from the database with decode_utf8(). This still doesn't tell me what configuration setting of the database allows decode_utf8() to be omitted from the docker version of the course. |
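As a rough analogy to calling decode_utf8() on values coming back from the database, the following Python sketch decodes raw bytes exactly once at the database boundary. The function name `decode_db_value` and the sample value are hypothetical; the real fix would be in the Perl database layer.

```python
# Sketch of "decode once at the database boundary" (Python analogy to
# Perl's decode_utf8). decode_db_value is a hypothetical helper name.

def decode_db_value(raw):
    """Return text for a fetched value.

    bytes are decoded as UTF-8; str values are assumed to be decoded
    already and are returned unchanged, so nothing is decoded twice.
    """
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw


raw_first_name = "呵".encode("utf-8")  # what an undecoded fetch might hand back
first_name = decode_db_value(raw_first_name)

print(len(raw_first_name))  # length in bytes
print(len(first_name))      # length in characters
```

If the driver already returns decoded text (which is what seems to happen in the Docker setup), the helper leaves the value alone, so the same code path works in both configurations.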
|
"Maybe have a look at the Apache configuration. Could there be an Apache setting forcing a different default charset instead of UTF8?"
That's a good idea. I'll look into it. |
|
for reference. On docker installation: mysql> select last_name, hex(last_name), length(last_name), char_length(last_name) from test_chinese_language_user; Welcome to the MySQL monitor. Commands end with ; or \g. Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved. |
|
Update: The bad Chinese characters below were caused by the terminal settings. On demo.webwork.rochester (with a similar but not identical class list database) I get: select first_name, hex(first_name), length(first_name), char_length(first_name) from chinese_language_course2_user; Welcome to the MySQL monitor. Commands end with ; or \g. Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its so the servers are not the same but it doesn't seem to me that they should be this far apart. mysql> show variables like 'character%'; |
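The queries above use hex(), length() and char_length(), which in MySQL report the stored bytes as hex, the length in bytes, and the length in characters. A minimal Python sketch of how those three columns could be read for a utf8/utf8mb4 column follows. The function `column_stats` is a hypothetical stand-in for the query, and the "mangled" sample is constructed here under the assumption that the bytes were misread as latin-1; it is not real output from either server.

```python
# Hypothetical stand-in for:
#   select hex(col), length(col), char_length(col) from some_table;
# assuming the column stores UTF-8 (utf8 / utf8mb4).

def column_stats(stored: bytes):
    """Return (hex string, byte length, character length) for stored UTF-8 bytes."""
    return stored.hex().upper(), len(stored), len(stored.decode("utf-8"))


good = "呵".encode("utf-8")
# Assumed mangling: UTF-8 bytes misread as latin-1 and encoded again.
mangled = good.decode("latin-1").encode("utf-8")

for label, stored in [("good", good), ("mangled", mangled)]:
    hex_value, byte_len, char_len = column_stats(stored)
    print(f"{label:8} hex={hex_value} length={byte_len} char_length={char_len}")
```

A single Chinese character stored correctly should show char_length 1 with length 3. A char_length of 3 for what was typed as one character suggests the value was encoded twice before it reached the table.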
|
Here are some ideas: The The version in Docker (from Ubuntu 16.04) is 4.033, which probably explains why utf8mb4 databases are working properly in Docker. It seems that many distributions have older versions of Based on the files changed in that PR for This might explain why we seem to have encountered "double encoding" on @mgage's tests on demo.webwork. It seems to me that for older versions of As far as I can tell, using the "wrong" switch is likely to force the connection charset to mySQL's limited (max 3 byte) |
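For context on the 3-byte limit mentioned above: UTF-8 needs four bytes for any character outside the Basic Multilingual Plane, so a connection or column restricted to three bytes per character cannot carry those characters. The Python sketch below checks sample strings against that limit. The helper name `fits_in_3_byte_utf8` and the sample strings are chosen for illustration only.

```python
# Which characters survive a 3-bytes-per-character UTF-8 limit?
# fits_in_3_byte_utf8 is a hypothetical helper for illustration.

def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "Hebrew": "שלום",
    "Chinese": "呵",
    "emoji": "\U0001F600",        # outside the Basic Multilingual Plane
    "CJK Ext. B": "\U00020000",   # outside the Basic Multilingual Plane
}

for label, text in samples.items():
    sizes = [utf8_len(ch) for ch in text]
    print(f"{label:10} bytes per char={sizes} fits={fits_in_3_byte_utf8(text)}")
```

Hebrew and common Chinese characters fit within three bytes, which may be why a 3-byte connection charset can appear to work in casual testing and only fail for rarer characters.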
…itch to SQL.pm to accommodate older browsers.
… multilingual # Conflicts: # lib/WeBWorK/DB/Driver/SQL.pm fixed merge conflict
|
OK. I think that with this change -- adding both mysql_enable_utf8 and mysql_enable_utf8mb4 We'll test this branch a bit more and then I'll merge it in to develop where we can test it further. A lot of work still needs to be done on translation dictionaries, sorting orders and many other things but that can be done on development which will become release 2.15 sometime this summer. |
# Conflicts: # Dockerfile
|
I'm ready to merge this PR (multilingual) into develop and to begin refining develop as release 2.15. This branch is up and running on the site: https://demo.webwork.rochester.edu/webwork2 . There are several language courses set up there. I can set up others (or give you access to existing language courses) if someone is interested. You have professor access to the courses UR101, UR102, HKUST101, HKUST102 using profa/profa if you want to try entering foreign names in class lists. -- mike (gage@math.rochester.edu) |
like Hebrew or Arabic. The data from lib/WeBWorK/ContentGenerator/Instructor/ShowAnswers.pm (using table formatting) and lib/WeBWorK/ContentGenerator/Instructor/StudentProgress.pm which are preformatted using a <pre> tag are not displayed properly when the overall HTML document is in RTL direction. Force this data to be displayed LTR using an HTML dir attribute. For StudentProgress data an enveloping SPAN tag was needed. no changes added to commit (use "git add" and/or "git commit -a") [tani@webwork webwork2]$ git commit lib/WeBWorK/ContentGenerator/Instructor/ShowAnswers.pm lib/WeBWorK/ContentGenerator/Instructor/StudentProgress.pm
…ebrew_fixes2 RTL display fixes and Hebrew translation fixes
textarea for "Edit" and in a (new) enveloping DIV around the <pre> for "View". Then in courses whose primary language uses the RTL text direction, lines which start in LTR languages (most of the lines in PG files, the set definition files, etc) will be left justified and formatted nicely. Without this, all lines would be right justified based on the main dir="rtl" which applies to the entire HTML page.
taniwallach left a comment
@mgage has indicated that he would like to merge this PR into develop, and fix remaining problems there, so that the code starts to get wider testing and feedback from additional people.
There are some loose ends (see below) but overall - I think that the UTF-8 / multilingual support is almost ready for public release, and should be ready for 2.15.
I agree that that approach is probably needed, as it seems that those people already using this and prior branches to provide UTF-8 support have reported and fixed whatever issues have been discovered.
I'm running the code from this PR on a production server (using Docker) which is serving courses in Hebrew. My database is using utf8mb4 as the character set. Overall, my system is working fine, including support for LTI with UTF-8 (Hebrew) characters in the student names.
The following are not yet working:
- The link to send email to a TA fails when the student name has wide UTF-8 characters:
  Failed to send message: Wide character in syswrite at /usr/local/share/perl/5.22.1/Net/Cmd.pm line 210.
  - However, wide characters in the body do not trigger the error, but arrive as gibberish.
  - The solution is most probably to apply encode_utf8() before sending data to the mailer, and to make sure that the email headers/settings are set up to support UTF-8 encoded text.
  - I have not looked into this yet or at any of the WW code which generates the emails.
  - My site currently has mailto: links in the course_info file for now and we told our students to use them, as the main email feature does not yet work with Hebrew.
- PDF generation with Hebrew question.
  - The LaTeX headers will need more configuration work, and additional LaTeX packages and fonts will need to be installed.
  - This sort of issue will apply to any language which requires additional LaTeX packages and/or fonts.
  - For Hebrew, Arabic, etc - the need to swap between RTL and LTR text directions makes the matter even more complex.
- Importing set definition files created with "bad" short time-zone codes (like "IDT") does not work.
  - See #928 where there is additional information.
  - Deleting the time zone code in the SDF file bypasses the problem.
- A different solution is proposed in #941.
- I don't think these are reasons to hold back on merging the PR into develop, but the first issue is something to be fixed before version 2.15 can be released.
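On the email issue in the list above, the general shape of the fix (encode text to bytes before it reaches the socket, and declare UTF-8 in the headers) can be sketched in Python with the standard library's email package. This is only an analogy: WeBWorK's mailer is Perl, I have not looked at that code, and the addresses and message text below are made up.

```python
# Sketch: build a message whose headers and body are safe to write to a
# byte-oriented socket. Addresses and text are invented for illustration.
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject                  # non-ASCII is encoded on output
    msg.set_content(body, charset="utf-8")    # declares charset in Content-Type
    return msg


msg = build_message(
    "student@example.org",
    "ta@example.org",
    "שאלה על תרגיל 3",
    "שלום, יש לי שאלה.\n",
)

wire_bytes = msg.as_bytes()  # bytes, never "wide characters"
header_part = wire_bytes.split(b"\n\n", 1)[0]
print(header_part.decode("ascii"))
```

The point of the sketch is that only bytes reach the transport, and the receiving client is told the body is UTF-8. The Perl equivalent would presumably combine encode_utf8() with suitable MIME headers.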
I just made mgage#28 to address another RTL-LTR issue. Edit and View in RTL language courses have issues without dir="auto" added, and that is done in mgage#28.
Some of the main features I have not tested:
- Generating scoring files (not needed yet), but probably tested by prior users of UTF-8 branches.
- Achievements
- Gateway quizzes
- Recent changes to the Docker config files, as I am using customized versions which added support for SSL and local configuration on my server.
I think it would be helpful for someone else to test Gateway quizzes and achievements with this branch before or after it gets merged into develop.
RTL fixes for View and Edit in FileManager.pm
|
@mgage I also added mgage/pg#10 for pg, adding macros to add |
|
OK!!! Here goes. Multilingual is merging into develop. Be cautious downloading develop for the next few weeks as we work out bugs. It works great on our machines!!! :-) There are outstanding (relatively minor) things to be checked up on if you actually use the extended utf8 characters. See #942, #943, #944, #945, #946, #947, |
combined multilingual PR. the names were getting too unwieldy