Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: broken mysql after palm upgrade #890

Merged
merged 1 commit into from Aug 16, 2023

Conversation

regisb
Copy link
Contributor

@regisb regisb commented Aug 15, 2023

This fix is for a rather serious issue that affects users who upgrade from Olive to Palm. The client mysql charset and collation was incorrectly set to utf8mb4, while the server stil runs utf8mb3. Only users who run the mysql container are affected.

To resolve this issue, we explicitely configure the client to use the utf8mb3 charset/collation.

Important note: users who have somehow managed to upgrade from olive to Palm before may find themselves in an undefined state. They might have to fix their mysql data manually. Same thing for users who launched Palm from scratch; although, according to my preliinary tests, they should be able to downgrade their connection from utf8mb4 to utf8mb3 without issue.

Close #887.

@jmbowman
Copy link

jmbowman commented Aug 15, 2023

Note that we're actively recommending utf8mb4 for new installations, and trying to switch old installations (including edx.org) to it. utf8(mb3) is missing large parts of Unicode needed for many languages (and omits some popular newer emoji). See https://discuss.openedx.org/t/upgrading-mysql-charset-to-utf8mb4/7927 for more details.

@regisb
Copy link
Contributor Author

regisb commented Aug 15, 2023

I agree Jeremy and trust me: in general I'm all for upgrading software as much as possible.
But in this particular case I think we need to address an emergency: people running Palm today may be writing data in a format which is very inconsistent. According to my preliminary research, upgrading from utf8mb3 to utf8mb4 is far from trivial, and it's even more complex when some of the data is corrupt. I would like to publish a quick fix ASAP and then take the time to think about the migration later.

@nratineau
Copy link

nratineau commented Aug 16, 2023

@regisb After further testing, the data corruption I encountered during the upgrade happens because of two issues:
1 - The collation as you suspected.
2 - The MySQL version v8.0.33 (not sure which version introduced it) has bug with InnoDB which is fixed in v8.0.34
I think this is the one causing the second round of data corruption:

An in-place upgrade from MySQL 5.7 to MySQL 8.0, without a server restart, could result in unexpected errors when executing queries on tables. This fix eliminates the need to restart the server between the upgrade and queries. (Bug #35410528

This PR + setting the MySQLv8.0.34 allow an upgrade from v15.3.7 to v16.0.5 without any data corruption.

The PR should be updated to include the MySQL version v8.0.34

EDIT: source https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-34.html

@regisb regisb force-pushed the regisb/fix-mysql-upgrade-charset branch from f70f84f to f199ef5 Compare August 16, 2023 15:17
@regisb
Copy link
Contributor Author

regisb commented Aug 16, 2023

OK, I suggest we upgrade directly to MySQL 8.1.0, as this releases seems to fix a large number of bugs: https://dev.mysql.com/doc/relnotes/mysql/8.1/en/news-8-1-0.html

There are two categories of users:

  1. Those who managed to upgrade from Olive to Palm despite this bug. I don't know how they achieved that.
  2. Those who started running Open edX from Palm.

I don't know how to handle group 1. I want to check that this change does not break Open edX for users in group 2. If it doesn't, then I'll merge this PR quickly.

This fix is for a rather serious issue that affects users who upgrade
from Olive to Palm. The client mysql charset and collation was
incorrectly set to utf8mb4, while the server stil runs utf8mb3. Only
users who run the mysql container are affected.

To resolve this issue, we explicitely configure the client to use the
utf8mb3 charset/collation.

Important note: users who have somehow managed to upgrade from olive to
Palm before may find themselves in an undefined state. They might have
to fix their mysql data manually. Same thing for users who launched Palm
from scratch; although, according to my preliinary tests, they should be
able to downgrade their connection from utf8mb4 to utf8mb3 without
issue.

In addition, we upgrade to mysql 8.1.0. Among many other fixes, this
avoids a server restart after the upgrade:

> An in-place upgrade from MySQL 5.7 to MySQL 8.0, without a server
> restart, could result in unexpected errors when executing queries on
> tables. This fix eliminates the need to restart the server between the
> upgrade and queries. (Bug #35410528)

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-34.html

See also the 8.1.0 release notes:

https://dev.mysql.com/doc/relnotes/mysql/8.1/en/news-8-1-0.html

Close #887.
@regisb regisb force-pushed the regisb/fix-mysql-upgrade-charset branch from f199ef5 to 703a35f Compare August 16, 2023 16:56
@regisb
Copy link
Contributor Author

regisb commented Aug 16, 2023

I just tested an upgrade of an existing 16.0.5 platform, so I'm going to merge this now and release 16.1.0.

@regisb regisb merged commit 2a47100 into master Aug 16, 2023
1 check passed
@nratineau
Copy link

I tested from v16.0.1 with the PR changes and no issue with data corruption or the collation.

@regisb regisb deleted the regisb/fix-mysql-upgrade-charset branch August 16, 2023 17:09
@regisb
Copy link
Contributor Author

regisb commented Aug 16, 2023

Thanks for your help @nratineau!

@ormsbee
Copy link
Contributor

ormsbee commented Aug 24, 2023

@nratineau: Just so I'm clear: Does the corruption still happen if the connection is made with utf8mb4 but the database has been upgraded to MySQL 8.1.0 (or 8.0.34) beforehand?

@regisb
Copy link
Contributor Author

regisb commented Aug 28, 2023

Does the corruption still happen if the connection is made with utf8mb4 but the database has been upgraded to MySQL 8.1.0 (or 8.0.34) beforehand?

Yes. This is exactly the scenario that was occurring:

1. Upgrade MySQL to 8.1.0 (or 8.0.34)
2. Apply edx-platform Django migrations with a utf8mb4 client.

EDIT: This was wrong. See below.

@ormsbee
Copy link
Contributor

ormsbee commented Aug 28, 2023

@regisb: Please pardon my paranoid checking here, but I really want to make sure I understand this issue and what it means for our future migration plans.

mysql client run by Django defaults to a utf8mb4 character set and collation, but the mysql server still runs with utf8mb3

Why did Django connect using utf8mb4? Django shouldn't be using the MySQL database or client defaults–it sets its own default of utf8 (a.k.a. utf8mb3) as a connection string parameter, which it will later override with the explicitly set Django settings value. So unless we had utf8mb4 specified in the DATABASE->OPTIONS Django setting somewhere, it should have still been connecting via utf8, and this PR specifying utf8mb3 should not have made a difference. So as far as I can tell, this PR either didn't change the Django database connection behavior, or it did change it because we're specifying utf8mb4 somewhere else.

@regisb
Copy link
Contributor Author

regisb commented Aug 29, 2023

Please keep asking questions! I'm eager to get to the bottom of this, and I'm not sure I'm able to understand all the elements myself.

Why did Django connect using utf8mb4?

I'm not sure. I'll outline my discoveries here, though none of them are conclusive yet.

Evidence 1: ./manage.py lms dbshell uses the wrong charset

First of all, let's diagnose the issue by looking at the charset and collation used by the Django connection both in Olive and Palm. To do that, we open a mysql shell from django with:

tutor local run lms ./manage.py lms dbshell

Then we look at the connection charset and collation with the following query:

show variables where variable_name like 'character%' or variable_name like 'collat%';

In Olive:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci            |
| collation_database       | utf8_general_ci            |
| collation_server         | utf8_general_ci            |
+--------------------------+----------------------------+
11 rows in set (0.01 sec)

In Palm, with tutor v16.0.0 (slightly different queries but you get the idea):

mysql> show variables where variable_name like 'character%';
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb4                        |
| character_set_connection | utf8mb4                        |
| character_set_database   | utf8mb3                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb4                        |
| character_set_server     | utf8mb3                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)

mysql> show variables where variable_name like 'collat%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_0900_ai_ci |
| collation_database   | utf8mb3_general_ci |
| collation_server     | utf8mb3_general_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)

In Palm, with tutor v16.1.0:

+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb3                        |
| character_set_connection | utf8mb3                        |
| character_set_database   | utf8mb3                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb3                        |
| character_set_server     | utf8mb3                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.1/charsets/ |
| collation_connection     | utf8mb3_general_ci             |
| collation_database       | utf8mb3_general_ci             |
| collation_server         | utf8mb3_general_ci             |
+--------------------------+--------------------------------+
11 rows in set (0.00 sec)

This shows how the mysql --host=... command that is run by Django uses an incorrect client and connection charset. Indeed, when the "charset" option is not set in the database options dict, the mysql command line does not include the --default-character-set option: https://github.com/django/django/blob/fc42edd2e63e89a80e7ca81486291f74359ef8be/django/db/backends/mysql/client.py#L56

Thus, not setting the DATABASES["OPTIONS"]["charset"] breaks the dbshell command.

Evidence 2: ./manage.py lms shell uses the right charset

But after all we shouldn't be too concerned about dbshell, as Django should not be using this client to connect to mysql.

What we observe is that the mysql client used in regular python code uses the right character_set_client:

$ tutor local run lms ./manage.py lms shell
>> from django.db import connection
>> c = connection.cursor()
>> c.execute("show variables where variable_name = 'character_set_connection';")
1
>> c.fetchone()
('character_set_connection', 'utf8mb3')

So it looks like the client is behaving well.

Evidence 3: it looks like the problem comes from the client used in migrations

I think our problem really comes from the mysql client that is used during migrations. As highlighted by @nratineau, the problem does not occur when migrations are run on mysql 5.7, before upgrading the database mysql 8.0.

Evidence 4: some tables from "blockstore.bundles" are already using utf8mb4

Following this lead, I think I may have found a clue. Let's list all the columns that have a utf8mb4 charset in the "openedx" database:

mysql> select table_schema, table_name, column_name, character_set_name from information_schema.COLUMNS where character_set_name="utf8mb4" and table_schema="openedx";
+--------------+-----------------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME            | COLUMN_NAME        | CHARACTER_SET_NAME |
+--------------+-----------------------+--------------------+--------------------+
| openedx      | bundles_bundle        | description        | utf8mb4            |
| openedx      | bundles_bundle        | slug               | utf8mb4            |
| openedx      | bundles_bundle        | title              | utf8mb4            |
| openedx      | bundles_bundleversion | change_description | utf8mb4            |
| openedx      | bundles_collection    | title              | utf8mb4            |
| openedx      | bundles_draft         | name               | utf8mb4            |
+--------------+-----------------------+--------------------+--------------------+
6 rows in set (0.04 sec)

So maybe the bundles app is causing an issue during migrations? To disprove that, I faked the bundles migration:

$ tutor local run lms ./manage.py lms migrate --fake bundles
Operations to perform:                                                                                                                                                                                     
  Apply all migrations: bundles                                                                                                                                                                            
Running migrations:                                                                                                                                                                                        
  Applying bundles.0004_alter_bundle_collection... FAKED

And then I applied all other migrations:

$ tutor local do init
...
Operations to perform:                                                                                                                                                                                     
  Apply all migrations: admin, agreements, announcements, api_admin, assessment, auth, badges, blackboard, block_structure, bookmarks, branding, bulk_email, bulk_grades, bundles, calendar_sync, canvas, c
atalog, celery_utils, certificates, commerce, completion, consent, content_libraries, content_type_gating, contentserver, contenttypes, cornerstone, cors_csrf, course_action_state, course_apps, course_da
te_signals, course_duration_limits, course_goals, course_groups, course_home_api, course_live, course_modes, course_overviews, courseware, crawlers, credentials, credit, dark_lang, database_fixups, degre
ed, degreed2, demographics, discounts, discussions, django_celery_results, django_comment_common, django_notify, edx_name_affirmation, edx_proctoring, edx_when, edxval, email_marketing, embargo, enterpri
se, entitlements, experiments, external_user_ids, grades, instructor_task, integrated_channel, learner_pathway_progress, learning_sequences, lms_xblock, lti1p3_tool_config, lti_consumer, milestones, mobi
le_api, moodle, oauth2_provider, oauth_dispatch, organizations, outcome_surveys, program_enrollments, programs, redirects, rss_proxy, sap_success_factors, save_for_later, schedules, sessions, site_config
uration, sites, social_django, splash, split_modulestore_django, staffgrader, static_replace, status, student, submissions, super_csv, support, survey, survey_report, system_wide_roles, teams, theming, t
hird_party_auth, thumbnail, track, user_api, user_authn, user_tours, util, verify_student, video_config, video_pipeline, waffle, waffle_utils, wiki, workflow, xapi, xblock_django                         
Running migrations:                                                                                                                                                                                        
  Applying blackboard.0010_auto_20221021_0159... OK                                                                                                                                                        
  Applying blackboard.0011_auto_20221031_1855... OK          
...

This still corrupted the database. Thus, I don't think that the bundles migration is causing the issue.

Evidence 5: the error occurs after running the following migrations

After a couple of retries, I could reliably detect the issue after the following migrations have run:

blackboard.0010_auto_20221021_0159
blackboard.0011_auto_20221031_1855
blackboard.0012_move_and_recrete_completed_timestamp
blackboard.0013_alter_blackboardlearnerdatatransmissionaudit_index_together
blackboard.0014_auto_20230105_2122
blackboard.0015_auto_20230112_2002
bundles.0004_alter_bundle_collection
canvas.0025_auto_20221021_0159
canvas.0026_auto_20221031_1855
canvas.0027_move_and_recrete_completed_timestamp
canvas.0028_alter_canvaslearnerdatatransmissionaudit_index_together
canvas.0029_auto_20230105_2122
canvas.0030_auto_20230112_2002
content_libraries.0009_alter_contentlibrary_authorized_lti_configs
cornerstone.0023_auto_20221021_0159
cornerstone.0024_auto_20221031_1855
cornerstone.0025_alter_cornerstonelearnerdatatransmissionaudit_index_together
cornerstone.0026_auto_20230105_2122
cornerstone.0027_auto_20230112_2002
course_modes.0014_auto_20230207_1212
course_modes.0015_expiration_datetime_explicit_admin
course_overviews.0027_auto_20221102_1109
degreed.0024_auto_20221021_0159
degreed.0025_auto_20221031_1855
degreed.0026_move_and_recrete_completed_timestamp
degreed.0027_alter_degreedlearnerdatatransmissionaudit_index_together
degreed.0028_auto_20230105_2122
degreed.0029_auto_20230112_2002
degreed2.0016_auto_20221021_0159

So it seems like the problem is caused by a migration in one of these apps: blackboard, bundles, canvas, content_libraries, cornerstone, course_modes, course_overviews, degreed, degreed2.

I migrated these apps one by one:

./manage.py lms migrate blackboard
./manage.py lms migrate bundles
./manage.py lms migrate canvas
./manage.py lms migrate content_libraries
./manage.py lms migrate cornerstone
./manage.py lms migrate course_modes
./manage.py lms migrate course_overviews
./manage.py lms migrate degreed
./manage.py lms migrate degreed2

And then applied all remaining migrations:

./manage.py lms migrate

And the problem did not occur again... Somehow, running migrations for each app one by one seems to resolve the issue.

Conclusion

I'm not sure which part of the code base is causing the client to connect with an incorrect charset. What I'm fairly sure about is that the problem occurs during the migration step. I do not know how to troubleshoot the problem further.

@ormsbee
Copy link
Contributor

ormsbee commented Aug 29, 2023

In "Evidence 5", what is the exact version of MySQL being used?

@regisb
Copy link
Contributor Author

regisb commented Aug 29, 2023

I used 8.0.33 for all my tests.

@ormsbee
Copy link
Contributor

ormsbee commented Aug 29, 2023

@regisb: Django emits pre_migrate and post_migrate signals. You could make a function catch that signal and print out the connection's client charset each time.

When you try to run this same test with MySQL 8.0.34 or 8.1.0, does it break down in the same set of apps, or does it happen with a different set?

@ormsbee
Copy link
Contributor

ormsbee commented Aug 30, 2023

@adzuci: Wanted to make sure that you saw this, since this complicates plans to do incremental upgrades to utf8mb4 on prod systems.

@fghaas
Copy link
Contributor

fghaas commented Aug 31, 2023

Just taking the liberty to back-link the Discourse thread related to this subject. (Side note: it would be terrific if this could somehow be automated, that is, a bot creating a comment linking to any Discourse thread that a GitHub issue or PR is mentioned in.)

@regisb
Copy link
Contributor Author

regisb commented Sep 6, 2023

OK I've done some more research and I think I've finally found the root cause, which is not at all related to the utf8mb3 charset. Instead the issue is caused by that MySQL bug #35410528:

An in-place upgrade from MySQL 5.7 to MySQL 8.0, without a server restart, could result in unexpected errors when executing queries on tables. This fix eliminates the need to restart the server between the upgrade and queries.

The bug is described more extensively in the upstream commit for the fix: mysql/mysql-server@bbcb9bc

Thus, the following are all working fixes:

  1. Restarting the mysql 8.0.33 container just prior to applying migrations.
  2. Upgrading to mysql 8.0.34 or 8.1.0 before applying migrations. (in that case restarting the container is unnecessary)

For the record, I was wrong when I made that comment about 8.0.34 not fixing the issue. I'm not sure why I was led to believe that upgrading by itself did not resolve the issue. I must have made an error during my many experiments.

Still, I think that we want to keep the changes that were made in this PR. Being explicit about the client charset cannot hurt, as long as we don't force that charset for users who run their own MySQL cluster.

@ormsbee
Copy link
Contributor

ormsbee commented Sep 6, 2023

@regisb: Fantastic news! Thank you so much for tracking this all the way down. I learned a lot just from following your debugging descriptions.

@adzuci: It looks like this clears the way for us to at least start tiptoeing towards migrating larger installations in smaller chunks and experiment with connecting via utf8mb4.

@regisb
Copy link
Contributor Author

regisb commented Sep 7, 2023

Thanks for your help @ormsbee! For the record I did try the pre_migrate idea. I added the following piece of code to a random app.py module:

from django.db.models.signals import pre_migrate

def debug_pre_migrate(*args, **kwargs):
    print(args, kwargs)
    from django.db import connections
    for conn in connections.all():
        print(conn.connection.get_character_set_info())

pre_migrate.connect(debug_pre_migrate)

When it yielded only results with utf8mb3 charsets, I knews I had to look elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MySQL issue Tutor upgrade from v15.3.7 to v16.0.3
5 participants