Multilingual#934

Merged
mgage merged 148 commits into openwebwork:develop from mgage:multilingual
May 24, 2019

Conversation

@mgage
Member

@mgage mgage commented Apr 15, 2019

Combined multilingual PR. The names were getting too unwieldy.

goehle and others added 30 commits June 20, 2016 14:13
… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)
…ch/webwork2 into locbug

Conflicts:
	courses.dist/modelCourse/course.conf
added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…
…lop_uft8_ver2

# Conflicts:
#	lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm
#	lib/WeBWorK/Utils.pm
@mgage
Member Author

mgage commented May 18, 2019

It shouldn't matter but I'll fix it. I can get user names with extended utf8 to work in my docker
examples but I can't get it to work yet with my other installations where the database and course
files already exist. So far I haven't found which variable is not set properly. The script above may not yet find all relevant variables.

@mgage
Member Author

mgage commented May 19, 2019

As it stands I can store Chinese characters correctly in the database. When I retrieve them they
print correctly to the log file but not to the HTML page:
e.g. around line 444 of UserList2.pm

	for (my $i = 0; $i < @Users; $i++) {
		my $User = $Users[$i];
		warn "UserList2: user  name, ".$User->first_name." ".$User->last_name."\n";
		# DBFIX we maybe already have the permission level from above (for use in sorting)
		my $PermissionLevel = $db->getPermissionLevel($User->user_id); # checked

produces entries such as:
[/webwork2/chinese_language_course2/instructor/users2/] user values gage \xc3\xa5\xc2\x91\xc2\xb5 Gage, in the log.
but
user values gage � Gage
is printed to the screen.

Any ideas?

The code is the same and the settings in mysql.cnf seem to be the same on the two machines.

@taniwallach
Member

I have a conjecture below about what is causing the problem.

Note: I added the warn line to my personal Docker setup. When I look at the user list, the warning on the HTML page displays the correct name (using Hebrew characters), and the log record in /var/log/apache2/error.log contains a similar sequence of back-slash codes (in my case 2 per Hebrew letter), apparently as that is how Apache writes the log files when it gets bytes with a high-order bit set, such as many sequences of UTF-8 encoded characters.

Warning messages
    UserList2: user name, פלוני תלמיד

[Sun May 19 12:35:10.013724 2019] [perl:warn] [pid 1450] [client 172.20.0.1:41938] [/webwork2/LinAlgDevelop/instructor/users2/] UserList2: user  name, \xd7\xa4\xd7\x9c\xd7\x95\xd7\xa0\xd7\x99 \xd7\xaa\xd7\x9c\xd7\x9e\xd7\x99\xd7\x93, referer: http://localhost:8080/webwork2/LinAlgDevelop/instructor/users2/?user=tani&key=DELETED&effectiveUser=tani&editMode=1&visible_users=stud

Given that, I don't think there is any problem with the WW code side of things.

@mgage What character set are the relevant database columns set to use?

I suspect that the problem you ran into is that the database table for the course you tested is using the latin1 encoding (the old default) and that the first name was set to a sequence of 6 bytes (those in the log message) whose hex codes are: C3 A5 C2 91 C2 B5. Thus, I suspect that mySQL is reading in those 6 bytes as a latin1 encoded string, encoding it to UTF-8 and sending it to WW encoded (due to the use of set NAMES), and then WW is decoding the UTF-8 and getting the 5 special characters and the "bad Unicode" mark.

Here is why I suspect this:

  1. The log record has the hex codes for these 6 bytes.
  2. Five of the 6 funny characters sent to the web page correspond to the relevant special/accented characters from iso-8859-1: all except for 91 which apparently is not used by iso-8859-1. See: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
  3. These 5 valid latin1 characters are correctly displayed in HTML: åµ and correspond to the latin1 characters expected.
  4. The last funny character, �, is often used when an invalid sequence of bytes is processed which cannot be treated as a UTF-8 byte sequence encoding a valid Unicode character. I'm guessing that mySQL triggered this when it read the byte 91 from a latin1 table, which is apparently not a valid character.

It is possible that this data was put into the database before you had some of the UTF8 settings made, so that the database understood whatever was sent to it as these six latin1 code-page bytes. However, I'm not certain of this. Maybe it was something you tried to save to the database recently, and only these 6 bytes were actually stored by the DB for some reason.

I suspect that "fixing" the database table so that the relevant columns use the utf8mb4 (or utf8) character set and then editing and saving the user name may help fix the problem.

Note: set NAMES in the SQL settings is almost certain to be needed, but was probably set via modified SQL config files.

Reading what you are getting in the log file as if it was supposed to be UTF-8 (see https://en.wikipedia.org/wiki/UTF-8) would give:

  • \xc3 should start a 2-byte UTF-8 sequence expected to be a "LATIN" character starting 00C0
  • \xa5 would be the continuation byte for 6 bits: 25 (in hex, with 2 high order 0 bits)
  • \xc2 should start a 2-byte UTF-8 sequence expected to be a "LATIN" character starting 0080
  • \x91 would be the continuation byte for 6 bits: 11 (in hex, with 2 high order 0 bits)
  • \xc2 should start a 2-byte UTF-8 sequence expected to be a "LATIN" character starting 0080
  • \xb5 would be the continuation byte for 6 bits: 35 (in hex, with 2 high order 0 bits)
    which would make sense if a sequence of three latin1 characters with the high bit set were encoded into UTF-8. If my manual calculations are correct, it would be the 3 "bytes" E5 91 B5 from latin1.
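
The manual calculation above can be checked mechanically. The following is a minimal sketch in Python (used here only for illustration; WeBWorK itself is Perl) that decodes the six logged bytes as UTF-8 and lists the resulting code points. The variable names are made up for the example.

```python
# The six bytes Apache wrote to the log, taken from the log line quoted above.
logged = bytes([0xC3, 0xA5, 0xC2, 0x91, 0xC2, 0xB5])

# Decoding them as UTF-8 yields three characters in the U+0080..U+00FF range.
chars = logged.decode("utf-8")
code_points = [f"{ord(c):02X}" for c in chars]
print(code_points)  # ['E5', '91', 'B5']

# Each of those code points fits in a single byte. Written out as single
# bytes, they form a valid 3-byte UTF-8 sequence for one Chinese character.
inner = chars.encode("latin-1")
print(inner.hex(), inner.decode("utf-8"))  # e591b5 呵
```

In other words, the logged bytes look like a correct UTF-8 string that was encoded to UTF-8 a second time.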

Small comment: I'm guessing the warn line you included and the one used when the page was generated was a bit different... "user values" and apparently the "login name" show up in the page/log record and not the warn code given above.

@mgage
Member Author

mgage commented May 19, 2019

Thanks for the response Tani. It has helped give me some ideas, but I haven't found the culprit yet.
I've attached three files.
mysql_show_columns_output_from_demo.txt -- has the results of using "show all columns" on several tables in the mysql database on demo.webwork.... (the one causing troubles).
demo_mysql_variables_output.txt - has the results of "show variables" and lists all of the database variables for the webwork database on demo.webwork.
docker2_mysql_variables_output.txt - has the "show variables" results from the mariadb database in (dbMB4) in docker.

At the moment I can't see anything wrong, but this is a good way to document what I've found so far. Perhaps you'll notice something I missed.

I'm working on some more tests suggested by your analysis -- but I have a graduation to attend this afternoon so I may not get the results posted until late tonight or tomorrow.

demo_mysql_variables_output.txt

docker2_mysql_variables_output.txt

mysql_show_columns_output_from_demo.txt

@mgage
Member Author

mgage commented May 19, 2019

Here is the output showing some contents of the table chinese_language_course2_user on demo.webwork:

mysql> select * from chinese_language_course2_user;
+-----------+------------+-----------+---------------------------+------------+--------+---------+------------+---------+-------------+----------------+-------------+----------------+----------------+
| user_id   | first_name | last_name | email_address             | student_id | status | section | recitation | comment | displayMode | showOldAnswers | useMathView | useWirisEditor | lis_source_did |
+-----------+------------+-----------+---------------------------+------------+--------+---------+------------+---------+-------------+----------------+-------------+----------------+----------------+
| madhu     | 呵         | madhu�    | kmadhu@ur.rochester.edu   | madhu      | C      | NULL    | NULL       | NULL    | MathJax     |              1 |        NULL |           NULL | NULL           |
| herman    | �        | herman    | herman@math.rochester.edu | herman     | C      | NULL    | NULL       | NULL    | MathJax     |              1 |        NULL |           NULL | NULL           |
| gage      | 呵         | Gage      | gage@math.rochester.edu   | gage       | C      | NULL    | NULL       | NULL    | NULL        |           NULL |        NULL |           NULL | NULL           |

The first name for herman was retrieved to the editing page and then re-saved, and is therefore garbled. The first names of gage and madhu are Chinese characters entered directly from the keyboard.

@taniwallach
Member

I saved the source of this page to a file (using the browser), deleted almost everything except the first names in the listing you gave of the output of select * from chinese_language_course2_user; above and ran the file through od -t x1 to see what the hex codes of the bytes encoding these characters are.

The Chinese character 呵 for the first name for madhu is encoded using three bytes: e5 91 b5 which is the correct UTF-8 byte sequence for that letter.

The three characters å�µ for the first name of herman seem to have the 7 bytes: c3 a5 ef bf bd c2 b5

  • c3 a5 is the UTF-8 encoding of the latin1 character whose single byte encoding in latin1 is e5.
  • ef bf bd is the hex representation of the 3 bytes for the UTF-8 replacement character (https://apps.timwhitlock.info/unicode/inspect?s=%EF%BF%BD) almost certainly put there as hex 91 is not a valid character in latin1 - so cannot be converted to a Unicode character.
  • c2 b5 is the UTF-8 encoding of the latin1 character whose single byte encoding in latin1 is b5.

This strongly hints to me that the three byte sequence e5 91 b5 which is already a UTF-8 encoded byte sequence is being treated as if it were a sequence of 8-bit characters in latin1 and that sequence is being encoded again into utf-8 giving the 7 byte sequence which appears as the first name of herman.
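
The suspected double encoding can be reproduced in a few lines. This is only a sketch in Python (for illustration; it is not WeBWorK code), and `reencode_as_if_latin1` is a hypothetical helper. It assumes that bytes in the range 0x80-0x9F are treated as unmappable and replaced by U+FFFD, because that is what the observed output suggests; which component actually did this on demo.webwork is not known at this point in the discussion.

```python
def reencode_as_if_latin1(data: bytes) -> bytes:
    """Treat each byte as one latin1 character, then encode the result as UTF-8.

    Assumption: bytes 0x80-0x9F are considered invalid and are replaced by the
    Unicode replacement character U+FFFD.
    """
    chars = []
    for b in data:
        chars.append("\ufffd" if 0x80 <= b <= 0x9F else chr(b))
    return "".join(chars).encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Format bytes the way od -t x1 shows them."""
    return " ".join(f"{b:02x}" for b in data)


stored = "呵".encode("utf-8")
mangled = reencode_as_if_latin1(stored)
print(hex_bytes(stored))   # e5 91 b5
print(hex_bytes(mangled))  # c3 a5 ef bf bd c2 b5
```

The second line of output matches the 7 bytes found for herman's first name.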

The big question is where exactly this second encoding to UTF-8 is happening and why it is happening.


I took a look at hebrew_language_webwork on demo.webwork, and unsurprisingly see the same sort of issue when trying special UTF8 characters in a name. However, there are also warning messages which have the string variants and which seem to show the correct characters I tried to save, even though the database seems to have saved the wrong thing.


I tried checking whether the HTML page of demo.webwork for the hebrew_language_webwork has the character-set set to utf-8 somewhere in the HTML head block, and how it is done. It seems fine to me and is using the math4 template.

Note: The math4 template does this in one manner (<meta charset='utf-8'>), and the older math3 template does it in a different manner (<meta http-equiv="content-type" content="text/html; charset=utf-8" />).

I am not having any problem with this Chinese character in my testing on a Docker based server with either math4 or math3.

I also tried using browser tools to examine the data being sent by POST when a change is made. It looked fine to me. Thus, I suspect that the mangling is happening on the server side.

Maybe have a look at the Apache configuration. Could there be an Apache setting forcing a different default charset instead of UTF8?

@mgage
Member Author

mgage commented May 19, 2019

Some light at the end of the tunnel? :-) With these lines

my $fieldValue = $User->$field;
warn "user values ".join(" ", $User->user_id,encode_utf8($User->first_name), encode_utf8($User->last_name))."\n";
warn "variants". join(" ",$User->user_id, decode_utf8($User->first_name), decode_utf8($User->last_name))."\n";

The variants line prints correctly. So apparently I need to decode the return values from the database with decode_utf8(). This still doesn't tell me what configuration setting of the database allows decode_utf8() to be omitted from the docker version of the course.
(About the minor point. There were two different spots where I was checking the contents of first_name -- the contents ended up being the same and I didn't include all of the data in the previous post. )
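
The effect described here, where the name only prints correctly after an explicit decode, can be modelled outside Perl. The sketch below uses Python purely for illustration. It assumes the driver hands back the raw UTF-8 bytes and that the application then treats each byte as one character (modelled with latin-1, which maps bytes to code points one-to-one). Whether this matches what DBD::mysql does internally on demo.webwork is an assumption.

```python
raw = bytes.fromhex("e591b5")  # bytes stored in the database for the name

# Case 1: the bytes are never decoded, so each byte is handled as a character.
# Encoding that string for output doubles up every high byte.
undecoded = raw.decode("latin-1")
double_encoded = undecoded.encode("utf-8")
print(double_encoded)  # b'\xc3\xa5\xc2\x91\xc2\xb5'

# Case 2: the bytes are decoded exactly once on the way in.
# The string is one character long and encodes back to the original bytes.
decoded = raw.decode("utf-8")
print(len(decoded), decoded.encode("utf-8") == raw)  # 1 True
```

Case 1 produces the same six bytes that appeared in the Apache log for the first name.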

@mgage
Member Author

mgage commented May 19, 2019

Maybe have a look at the Apache configuration. Could there be an Apache setting forcing a different default charset instead of UTF8?

That's a good idea. I'll look into it.

@mgage
Member Author

mgage commented May 21, 2019

for reference. On docker installation:

mysql> select last_name, hex(last_name), length(last_name), char_length(last_name) from test_chinese_language_user;
+-----------+----------------+-------------------+------------------------+
| last_name | hex(last_name) | length(last_name) | char_length(last_name) |
+-----------+----------------+-------------------+------------------------+
| Admin | 41646D696E | 5 | 5 |
| Gage | 47616765 | 4 | 4 |
| 呵 | E591B5 | 3 | 1 |
+-----------+----------------+-------------------+------------------------+
3 rows in set (0.00 sec)

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 5.5.5-10.1.40-MariaDB-1~bionic mariadb.org binary distribution

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.


@mgage
Member Author

mgage commented May 21, 2019

Update: The bad Chinese characters below were caused by the setting on the terminal
I was using at the time. It was set to latin1 and converted the utf8mb4 Chinese character to
a latin1 string. Fixing the terminal configuration let demo.webwork.rochester.edu properly
print the Chinese character for the first_name in the test below.

while on demo.webwork.rochester (and similar but not identical class list database) I get

select first_name, hex(first_name), length(first_name), char_length(first_name) from chinese_language_course2_user;
+------------+-----------------+--------------------+-------------------------+
| first_name | hex(first_name) | length(first_name) | char_length(first_name) |
+------------+-----------------+--------------------+-------------------------+
| � | E591B5 | 3 | 1 |
| � | E591B5 | 3 | 1 |
| � | E591B5 | 3 | 1 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| JANE | 4A414E45 | 4 | 4 |
| � | E591B5 | 3 | 1 |
+------------+-----------------+--------------------+-------------------------+
13 rows in set (0.00 sec)

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 512
Server version: 5.5.50 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

so the servers are not the same but it doesn't seem to me that they should be this far apart.

mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

@taniwallach
Member

Here are some ideas:

The mysql_enable_utf8mb4 switch is a relatively recent (2015) addition to DBD-mysql (mysql.pm), apparently added in version 4.032_02, while mysql_enable_utf8 is much older.

The version in Docker (from Ubuntu 16.04) is 4.033, which probably explains why utf8mb4 databases are working properly in Docker.

It seems that many distributions have older versions of DBD-mysql, and "even" RHEL 7 has only version 4.023, which is too old to have the utf8mb4 support.

See perl5-dbi/DBD-mysql#37

Based on the files changed in that PR for DBD-mysql, it seems that setting mysql_enable_utf8mb4 to 1 when the version of DBD-mysql is too old does nothing, so that DBD-mysql will not be running
sv_utf8_decode(sv); in dbdimp.c to decode the UTF-8 data into Perl characters.

This might explain why we seem to have encountered "double encoding" on @mgage's tests on demo.webwork.

It seems to me that for older versions of DBD-mysql, it might help to try using mysql_enable_utf8 instead when either utf8 or utf8mb4 is the database character set. At the very least that will trigger the use of sv_utf8_decode(sv); in dbdimp.c which may solve the problem.

As far as I can tell, using the "wrong" switch is likely to force the connection charset to mySQL's limited (max 3 byte) utf8 instead of utf8mb4, which also supports the 4-byte codes of full UTF-8, but that would be far better than having no utf-8 support at all. Most "normal" letters in most languages do not need the fourth byte of UTF-8, while other special symbols (such as emoji, if I recall) do need the extra byte.
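
To get a feel for which text would be affected by the 3-byte limit, one can check how many UTF-8 bytes each character needs. The sketch below is in Python (for illustration only), and `fits_three_byte_utf8` is a hypothetical helper, not part of WeBWorK or DBD-mysql.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


samples = ["Gage", "פלוני תלמיד", "呵", "😀"]
for sample in samples:
    widest = max(len(ch.encode("utf-8")) for ch in sample)
    print(repr(sample), widest, fits_three_byte_utf8(sample))
```

Hebrew letters need 2 bytes each and the Chinese character from the tests needs 3, so both fit; the emoji needs 4 bytes and would not.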

mgage added 3 commits May 21, 2019 10:48
…itch to SQL.pm to accommodate older browsers.
… multilingual

# Conflicts:
#	lib/WeBWorK/DB/Driver/SQL.pm

fixed merge conflict
@mgage
Member Author

mgage commented May 21, 2019

OK.

I think that with this change -- adding both mysql_enable_utf8 and mysql_enable_utf8mb4
we have a version that will work with both newer and older versions of perl and mysql.
Much thanks to Tani Wallach for his help in getting this part debugged!!!

We'll test this branch a bit more and then I'll merge it in to develop where we can test it further.
If the $ENABLE_UTF8MB4 switch in site.conf is set to zero then all current webwork sites should work, although they will not handle multibyte characters. With $ENABLE_UTF8MB4 enabled, multibyte characters will work.

A lot of work still needs to be done on translation dictionaries, sorting orders and many other things but that can be done on development which will become release 2.15 sometime this summer.

# Conflicts:
#	Dockerfile
@mgage
Member Author

mgage commented May 22, 2019

I'm ready to merge this PR (multilingual) into develop and to begin refining develop as release 2.15.
Are there any comments as I'm getting started? @xcompass, @Alex-Jordan , @pstaabp, @dpvc,@nmoller, @taniwallach, @dlglin, @dsteinmo, @siefkinj, @drandrew42, @dsteinmo, @heiderich, @aubreyja, @sean-fitzpartick, @drjt,@LFCheung.

This branch is up and running on the site: https://demo.webwork.rochester.edu/webwork2 . There are several language courses set up there. I can set up others (or give you access to existing language courses) if someone is interested. You have professor access to the courses UR101, UR102, HKUST101, HKUST102 using profa/profa if you want to try entering foreign names in class lists.

-- mike (gage@math.rochester.edu)

taniwallach and others added 3 commits May 23, 2019 17:05
like Hebrew or Arabic.
The data from
	lib/WeBWorK/ContentGenerator/Instructor/ShowAnswers.pm
   (using table formatting)
and	lib/WeBWorK/ContentGenerator/Instructor/StudentProgress.pm
   which are preformatted using a <pre> tag
are not displayed properly when the overall HTML document is in RTL direction.
Force this data to be displayed LTR using an HTML dir attribute.
For StudentProgress data an enveloping SPAN tag was needed.
no changes added to commit (use "git add" and/or "git commit -a")
[tani@webwork webwork2]$ git commit lib/WeBWorK/ContentGenerator/Instructor/ShowAnswers.pm lib/WeBWorK/ContentGenerator/Instructor/StudentProgress.pm
…ebrew_fixes2

RTL display fixes and Hebrew translation fixes
textarea for "Edit" and in a (new) enveloping DIV around the <pre> for
"View". Then in courses whose primary language uses the RTL text
direction, lines which start in LTR languages (most of the lines in PG files,
the set definition files, etc) will be left justified and formatted nicely.
Without this, all lines would be right justified based on the main dir="rtl"
which applies to the entire HTML page.
Member

@taniwallach taniwallach left a comment

@mgage has indicated that he would like to merge this PR into develop, and fix remaining problems there, so that the code starts to get wider testing and feedback from additional people.

There are some loose ends (see below) but overall - I think that the UTF-8 / multilingual support is almost ready for public release, and should be ready for 2.15.

I agree that that approach is probably needed, as it seems that those people already using this and prior branches to provide UTF-8 support have reported and fixed whatever issues have been discovered.


I'm running the code from this PR on a production server (using Docker) which is serving courses in Hebrew. My database is using utf8mb4 as the character set. Overall, my system is working fine, including support for LTI with UTF-8 (Hebrew) characters in the student names.

The following are not yet working:

  • The link to send email to a TA fails when the student name has wide UTF-8 characters:
    • Failed to send message: Wide character in syswrite at /usr/local/share/perl/5.22.1/Net/Cmd.pm line 210.
    • However, wide-characters in the body do not trigger the error, but arrive as gibberish.
    • The solution is most probably to apply encode_utf8() before sending data to the mailer, and to make sure that the email headers/settings are set up to support UTF-8 encoded text.
    • I have not looked into this yet or at any of the WW code which generates the emails.
    • My site currently has mailto: links in the course_info file for now and we told our students to use them, as the main email feature does not yet work with Hebrew.
  • PDF generation with Hebrew questions.
    • The LaTeX headers will need more configuration work, and additional LaTeX packages and fonts will need to be installed.
    • This sort of issue will apply to any language which requires additional LaTeX packages and/or fonts.
    • For Hebrew, Arabic, etc - the need to swap between RTL and LTR text directions makes the matter even more complex.
  • Importing set definition files created with "bad" short time-zone codes (like "IDT") does not work.
    • See #928 where there is additional information.
    • Deleting the time zone code in the SDF file bypasses the problem.
    • A different solution is proposed in #941.
  • I don't think these are reasons to hold back on merging the PR into develop, but the first issue is something to be fixed before version 2.15 can be released.
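
As a rough illustration of the fix suggested for the email item in the list above (encode the text before handing it to the mailer, and declare UTF-8 in the headers), here is a sketch using Python's standard email package. It is not WeBWorK code; `build_message`, the addresses and the sample text are all made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> bytes:
    """Build a message whose header and body are safe to write to a byte stream."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject                 # non-ASCII is RFC 2047 encoded on output
    msg.set_content(body, charset="utf-8")   # body is declared as UTF-8 text
    return msg.as_bytes()


subject = "שאלה מאת פלוני תלמיד"
body = "שלום"
wire = build_message("student@example.edu", "ta@example.edu", subject, body)

# The result is bytes, so no wide characters reach the socket write.
parsed = message_from_bytes(wire, policy=policy.default)
print(parsed["Subject"] == subject, parsed.get_content().strip() == body)
```

The point is that the mailer only ever sees bytes, and the receiving side is told how to decode them.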

I just made mgage#28 to address another RTL-LTR issue. Edit and View in RTL language courses have issues without dir="auto" added, and that is done in mgage#28.

Some of the main features I have not tested:

  • Generating scoring files (not needed yet), but probably tested by prior users of UTF-8 branches.
  • Achievements
  • Gateway quizzes
  • Recent changes to the Docker config files, as I am using customized versions which added support for SSL and local configuration on my server.

I think it would be helpful for someone else to test Gateway quizzes and achievements with this branch before or after it gets merged into develop.

RTL fixes for View and Edit in FileManager.pm
@taniwallach
Member

@mgage I also added mgage/pg#10 for pg, adding macros to add DIV and SPAN HTML tags into problems. The primary need is to set the HTML dir and lang attributes when swapping between RTL and LTR and switching languages. The code also allows setting class and style on the DIV and SPAN tags.

@mgage
Member Author

mgage commented May 24, 2019

OK!!! Here goes. Multilingual is merging into develop. Be cautious downloading develop for the next few weeks as we work out bugs. It works great on our machines!!! :-)

There are outstanding (relatively minor) things to be checked up on if you actually use the extended utf8 characters. See #942, #943, #944, #945, #946, #947.

@mgage mgage merged commit 075d8a8 into openwebwork:develop May 24, 2019

5 participants