Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Thread 3 issue restoring [...]: Table 'tablename' already exists while restoring tables #469

Closed
64kramsystem opened this issue Nov 2, 2021 · 31 comments · Fixed by #554
Labels

Comments

@64kramsystem
Copy link
Contributor

64kramsystem commented Nov 2, 2021

Describe the bug

While restoring tables, an error Table 'table1' already exists is reported.

To Reproduce

This seems to happen when restoring multiple tables. In my case, I've tried with 1 table and 3 tables; got no error with 1 table, and the error each time with 3 tables. Since this is likely a non-deterministic problem (due to threading), it may be hard to reproduce or pinpoint the necessary conditions.

Command executed:

mydumper \
  -h $host -u $user -p $pwd \
  --verbose 2 \
  --compress-protocol \
  --database $source_db \
  --regex $pattern \
  --where $condition

myloader \
  -u $user --socket $socket \
  --directory export-* \
  --verbose 3 \
  --source-db $source_db --database $dest_db \
  --innodb-optimize-keys \
  --overwrite-tables

Expected behavior

No errors should be reported.

Log

The log has been anonymized, however, the table names mapping has been kept:

** Message: 12:11:17.558: Fast index creation will be use for table: $dest_db.table1
** Message: 12:11:17.558: Fast index creation will be use for table: $dest_db.table2
** Message: 12:11:17.558: Thread 3 restoring table `$dest_db`.`table1` from export-20211102-121057/$source_db.table1-schema.sql
** Message: 12:11:17.558: Dropping table or view (if exists) `$dest_db`.`table1`
** Message: 12:11:17.558: Thread 2 restoring table `$dest_db`.`table2` from export-20211102-121057/$source_db.table2-schema.sql
** Message: 12:11:17.558: Thread 1 restoring table `$dest_db`.`table3` from export-20211102-121057/$source_db.table3-schema.sql
** Message: 12:11:17.558: Dropping table or view (if exists) `$dest_db`.`table2`
** Message: 12:11:17.558: Dropping table or view (if exists) `$dest_db`.`table3`
** Message: 12:11:17.563: Creating table `$dest_db`.`table3` from export-20211102-121057/$source_db.table3-schema.sql
** Message: 12:11:17.563: Creating table `$dest_db`.`table1` from export-20211102-121057/$source_db.table1-schema.sql

** (myloader:322529): CRITICAL **: 12:11:17.563: Thread 3 issue restoring export-20211102-121057/$source_db.table1-schema.sql: Table 'table1' already exists
** Message: 12:11:17.569: Creating table `$dest_db`.`table2` from export-20211102-121057/$source_db.table2-schema.sql
** Message: 12:11:17.574: Thread 2 restoring `$dest_db`.`table3` part 0 of 0 from $source_db.table3.00000.sql. Progress 1 of 2 .
** Message: 12:11:17.574: Thread 1 restoring `$dest_db`.`table1` part 0 of 0 from $source_db.table1.00000.sql. Progress 2 of 2 .
** Message: 12:11:17.574: Thread 3 restoring indexes `$dest_db`.`table2`

** (myloader:322529): CRITICAL **: 12:11:17.574: Thread 1 issue restoring $source_db.table1.00000.sql: 
** Message: 12:11:17.574: Thread 1 restoring indexes `$dest_db`.`table1`
** Message: 12:11:17.596: Import timings:
** Message: 12:11:17.596: Thread 1 restoring constraint `$dest_db`.`table2` from export-20211102-121057/$source_db.table2-schema.sql
** Message: 12:11:17.596: Thread 4 restoring constraint `$dest_db`.`table3` from export-20211102-121057/$source_db.table3-schema.sql
** Message: 12:11:17.596: Data      	| Index    	| Total   	| Table
** Message: 12:11:17.596: 0 00:00:00	| 0 00:00:00	| 0 00:00:00	| `$dest_db`.`table3`
** Message: 12:11:17.596: 0 00:00:00	| 0 00:00:00	| 0 00:00:00	| `$dest_db`.`table1`
** Message: 12:11:17.596: 0 00:00:00	| 0 00:00:00	| 0 00:00:00	| `$dest_db`.`table2`
** Message: 12:11:17.608: Starting table checksum verification

Backup

Should not be relevant.

Environment (please complete the following information):

  • OS version: Client = Ubuntu 20.04
  • MyDumper version: current regex branch (commit 0b14ec3)

Additional context

MyDumper rocks! 😃

@davidducos
Copy link
Member

Hi @64kramsystem,
does $dest_db.table1 exist after failure?

@64kramsystem
Copy link
Contributor Author

Hi @64kramsystem, does $dest_db.table1 exist after failure?

Hello!

I've reviewed today the issue. It seems to be harmless (although I can't confirm - considerations below), although it's very confusing, as CRITICAL errors definitely catch the attention. It also causes an exit with error, which is troublesome in scripts.

The table does exist after failure, and it contains records. In order to get a better idea, I've truncated the table, but not dropped it; the load inserts all the records. I can't confirm that the problem is harmless (as mentioned above) because the table records are only 8, and I don't know what happens for larger datasets.

I've further isolated the conditions for the problem to happen; they are:

  • running --overwrite-tables and
  • the tables already and
  • the number of threads is gteq the number of tables

I'm going to add some further debug information shortly.

@64kramsystem
Copy link
Contributor Author

64kramsystem commented Nov 8, 2021

I'm examining the myloader source (myloader.c).

In process_restore_job, table overwriting errors are swallowed in some cases, causing failures to fall through.

See the overwrite invocation here:

  • if there is an error,
  • and purge_mode is not TRUNCATE/DELETE, for example is DROP, the error won't be caught, and it will likely cause restore_data_in_gstring to fail.

I don't know if it's intentional or not, but if all errors happing on overwrite_table invocation were raised immediately, it would be clearer in situations like this what's actually happening.

(I'm not familiar with the code, so there's a chance I may be misunderstanding the logic).

@davidducos
Copy link
Member

Hi @64kramsystem,

I didn't check the line that you mention yet, but I just wanted to let you know, that I'm already working to better handling those errors, as it is very difficult to follow the thread that it is failing. You can check #393 and test that branch.

Please, let me know

@64kramsystem
Copy link
Contributor Author

I may have found some pointer.

If I force error printing after table dropping, applying this diff to the current master:

diff --git i/myloader.c w/myloader.c
index 1d4d947..725e5c3 100644
--- i/myloader.c
+++ w/myloader.c
@@ -1084,8 +1084,10 @@ void process_restore_job(MYSQL *thrconn,struct restore_job *rj, int thread_id, i
       g_message("Thread %d restoring table `%s`.`%s` from %s", thread_id, 
                 dbt->real_database, dbt->real_table, rj->filename);
       int truncate_or_delete_failed=0;
-      if (overwrite_tables)
+      if (overwrite_tables) {
         truncate_or_delete_failed=overwrite_table(thrconn,thread_id,dbt->real_database, dbt->real_table);
+        g_critical("Thread %d FORCED MESSAGE %s: %s",thread_id,dbt->real_table, mysql_error(thrconn));
+      }
       if ((purge_mode == TRUNCATE || purge_mode == DELETE) && !truncate_or_delete_failed){
         g_message("Skipping table creation `%s`.`%s` from %s", dbt->real_database, dbt->real_table, rj->filename);
       }else{

The output shows (filtered for the problematic table):

** Message: 16:11:04.000: Fast index creation will be use for table: my_dest_db.mytable
** Message: 16:11:04.000: Thread 1 restoring table `my_dest_db`.`mytable` from /path/to/export-20211108-151930/my_source_db.mytable-schema.sql
** Message: 16:11:04.000: Thread 1 dropping table or view (if exists) `my_dest_db`.`mytable`
** (myloader:421500): CRITICAL **: 16:11:04.003: Thread 1 FORCED MESSAGE mytable: 'my_dest_db.mytable' is not VIEW
** Message: 16:11:04.004: Thread 1 creating table `my_dest_db`.`mytable` from /path/to/export-20211108-151930/my_source_db.mytable-schema.sql
** (myloader:421500): CRITICAL **: 16:11:04.004: Thread 1 issue restoring /path/to/export-20211108-151930/my_source_db.mytable-schema.sql: Table 'mytable' already exists
** Message: 16:11:04.016: Thread 3 restoring `my_dest_db`.`mytable` part 0 of 0 from my_source_db.mytable.00000.sql. Progress 2 of 3 .
** Message: 16:11:04.016: Thread 3 restoring indexes `my_dest_db`.`mytable`
** Message: 16:11:04.031: 0 00:00:00	| 0 00:00:00	| 0 00:00:00	| `my_dest_db`.`mytable`%  

I wonder if, for some reason, a DROP VIEW is issued instead of a DROP TABLE, which explains the error, although it's not clear to me, if this is the case, why it'd happen only for one table.

@64kramsystem
Copy link
Contributor Author

Hi @64kramsystem,

I didn't check the line that you mention yet, but I just wanted to let you know, that I'm already working to better handling those errors, as it is very difficult to follow the thread that it is failing. You can check #393 and test that branch.

Please, let me know

Thanks. I will debug also using the branch mentioned 🙂.

@64kramsystem
Copy link
Contributor Author

64kramsystem commented Nov 8, 2021

Latest finding, based on an updated diff (which actually explains the is not VIEW error):

diff --git i/myloader.c w/myloader.c
index 1d4d947..2988756 100644
--- i/myloader.c
+++ w/myloader.c
@@ -966,9 +966,9 @@ int overwrite_table(MYSQL *conn,int thread_id,gchar * database, gchar * table){
     query = g_strdup_printf("DROP TABLE IF EXISTS `%s`.`%s`",
                             database, table);
     mysql_query(conn, query);
-    query = g_strdup_printf("DROP VIEW IF EXISTS `%s`.`%s`", database,
-                            table);
-    mysql_query(conn, query);
+    // query = g_strdup_printf("DROP VIEW IF EXISTS `%s`.`%s`", database,
+    //                         table);
+    // mysql_query(conn, query);
   } else if (purge_mode == TRUNCATE) {
     g_message("Truncating table `%s`.`%s`", database, table);
     query= g_strdup_printf("TRUNCATE TABLE `%s`.`%s`", database, table);
@@ -1084,8 +1084,10 @@ void process_restore_job(MYSQL *thrconn,struct restore_job *rj, int thread_id, i
       g_message("Thread %d restoring table `%s`.`%s` from %s", thread_id, 
                 dbt->real_database, dbt->real_table, rj->filename);
       int truncate_or_delete_failed=0;
-      if (overwrite_tables)
+      if (overwrite_tables) {
         truncate_or_delete_failed=overwrite_table(thrconn,thread_id,dbt->real_database, dbt->real_table);
+        g_critical("Thread %d FORCED MESSAGE %s: %s",thread_id,dbt->real_table, mysql_error(thrconn));
+      }
       if ((purge_mode == TRUNCATE || purge_mode == DELETE) && !truncate_or_delete_failed){
         g_message("Skipping table creation `%s`.`%s` from %s", dbt->real_database, dbt->real_table, rj->filename);
       }else{

The log is now (filtered):

** Message: 16:16:59.150: Fast index creation will be use for table: dest_db.mytable
** Message: 16:16:59.150: Thread 3 restoring table `dest_db`.`mytable` from /path/to/export-20211108-151930/source_db.mytable-schema.sql
** Message: 16:16:59.150: Thread 3 dropping table or view (if exists) `dest_db`.`mytable`
** (myloader:425457): CRITICAL **: 16:16:59.151: Thread 3 FORCED MESSAGE mytable: Deadlock found when trying to get lock; try restarting transaction
** Message: 16:16:59.151: Thread 3 creating table `dest_db`.`mytable` from /path/to/export-20211108-151930/source_db.mytable-schema.sql
** (myloader:425457): CRITICAL **: 16:16:59.155: Thread 3 issue restoring /path/to/export-20211108-151930/source_db.mytable-schema.sql: Table 'mytable' already exists
** Message: 16:16:59.166: Thread 2 restoring `dest_db`.`mytable` part 0 of 0 from source_db.mytable.00000.sql. Progress 2 of 3 .
** Message: 16:16:59.166: Thread 2 restoring indexes `dest_db`.`mytable`
** Message: 16:16:59.185: 0 00:00:00	| 0 00:00:00	| 0 00:00:00	| `dest_db`.`mytable`%

My educated guess is that it's the following race condition:

  • the records insertion threads starts
  • the drop/create thread starts after
  • the drop table fails, due to an insert being performed
  • the drop table error is swallowed
  • the create table fails, because the table has never been dropped

this conflicts with the actual timings, but the log timing may be asynchronous, so a few ms of difference may be ignored.

This is an educated guess, I could be wrong. I'll follow up later in the week.

@davidducos
Copy link
Member

Can you test with just 1 thread? -t 1. Can you confirm that you are only running myloader when you have this error?

@64kramsystem
Copy link
Contributor Author

Can you test with just 1 thread? -t 1. Can you confirm that you are only running myloader when you have this error?

The error doesn't happen with 1 or 2 threads. Tested around 7/8 times in total.

@davidducos
Copy link
Member

@64kramsystem why don't to test with G_MESSAGES_DEBUG=all as might be possible that the code that you are running has some DEBUG messages that might be useful.

@davidducos davidducos added this to the Release 0.11.5 milestone Nov 10, 2021
@64kramsystem
Copy link
Contributor Author

64kramsystem commented Nov 12, 2021

So, I dug a bit. I also used, as referred, #476, however, it doesn't help. Also using G_MESSAGES_DEBUG didn't show anything that I could use.

Now, this is an issue inside an issue :)

The error observed, Table 'table1' already exists is not the real error. The real error is mytable: Deadlock found when trying to get lock; try restarting transaction. This happens when trying to drop the table; since this error is swallowed, the next operation, table creation, causes the visible error (which is skipped).

As of now, I don't know what causes the deadlock. I've tried to remove almost everything from process_restore_job (even the create table operations!), but I still get the deadlock. I've enabled the general log, but there seems to be no explicit SQL statement that would cause locking on any table. Interestingly, even if I enable deadlocks printing in the MySQL server, no deadlocks are sent to the mysqld error log.

The obvious guess is that there's something sneaky that, before the table drop(s) in process_restore_job, that locks the tables. However (worse case), it could also be that some thread-unsafe API is accessed without exclusive locking. The latter hypothesis actually explains why mysqld doesn't report any deadlock or locking query - on the other hand, there may be API calls that don't execute queries, but still hold locks (e.g. table metadata access).

I've spent a considerable time, so I'm stopping here.

As a conclusion, I propose to rewrite Mydumper... in Perl 😂

@davidducos
Copy link
Member

Hi @64kramsystem, can you share the table structure? Do you foreign keys in other tables that point to this table?

@64kramsystem
Copy link
Contributor Author

Hi @64kramsystem, can you share the table structure? Do you foreign keys in other tables that point to this table?

Hello! I'll prepare, over this week, an anonymized, minimal test case, and post it.

@64kramsystem
Copy link
Contributor Author

64kramsystem commented Nov 14, 2021

Hi there! I've diagnosed the issue more precisely, and prepared the simplest dataset I could.

The issue is caused by a combination of two factors:

  • a FK on one of the (two) tables
  • the option --innodb-optimize-keys enabled

In order to reproduce, unzip the dataset to a directory, and execute:

function restore {
  # add credentials and connection options
  /path/to/myloader \
    --socket /tmp/mysql.sock \
    --directory /path/to/restore_files \
    --verbose 3 \
    --innodb-optimize-keys \
    --threads 3 \
    --overwrite-tables \
    2>&1 | grep already
}

mysql -e 'drop schema if exists mydb; create schema mydb'; while true; do restore; echo -n .; done

The message:

.** (myloader:216042): CRITICAL **: 22:15:57.143: Thread 3 issue restoring /path/to/restore_files/mydb.parents-schema.sql: Table 'parents' already exists

will be printed until the script is stopped. Sometimes the error message will refer to the table children.

Dataset: myloader_issue.zip

@davidducos
Copy link
Member

Hi @64kramsystem,
Thanks for sharing a test case and all the work that you have done so far! I will be reviewing it today and share with you what I have found!

@davidducos
Copy link
Member

Hi @64kramsystem, in what database version are you trying to restore?

@64kramsystem
Copy link
Contributor Author

Hi @64kramsystem, in what database version are you trying to restore?

mysql-8.0.25-linux-glibc2.17-x86_64-minimal

@davidducos
Copy link
Member

Oh! good that I asked, as this issue is not happening in 5.7.

@davidducos
Copy link
Member

Ok, this is an issue that can be reproduced only in MySQL/Percona Server 8, this is not happening in 5.7.
The test case is:
Session 1:

set foreign_key_checks=1;
DROP TABLE IF EXISTS `mydb`.`parents`; *1
set foreign_key_checks=0;
DROP TABLE IF EXISTS `mydb`.`parents`; *2

Session 2:

set foreign_key_checks=1;
select sleep(0.00001); DROP TABLE IF EXISTS `mydb`.`children`;  *1
set foreign_key_checks=0;
select sleep(0.00001); DROP TABLE IF EXISTS `mydb`.`children`;  *2

Statements with *1 and *2 are executed at the exact same time.
All the command in Session 2 will complete correctly.
In 5.7 you will get in Session 1:

ERROR 1217 (23000): Cannot delete or update a parent row: a foreign key constraint fails

on *1 and will succeed on *2.
In 8 you will get on Session 1:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

on all the cases.

@davidducos
Copy link
Member

davidducos commented Nov 15, 2021

@64kramsystem,
I reported this bug to MySQL, let's see what they answer. If they decide not to fix this issue, then we need to add option that drop the table in a single thread fashion. This issue was introduced when we started to create the tables in parallel #169

Ah, other interesting things, I was not able to see the deadlock found error in the innodb monitor output, so it is an deadlock that it is not reported as a deadlock... 🤔

@64kramsystem
Copy link
Contributor Author

@64kramsystem, I reported this bug to MySQL, let's see what they answer. If they decide not to fix this issue, then we need to add option that drop the table in a single thread fashion. This issue was introduced when we started to create the tables in parallel #169

Ah, other interesting things, I was not able to see the deadlock found error in the innodb monitor output, so it is an deadlock that it is not reported as a deadlock... thinking

Wow. I didn't think at all it was a MySQL issue. 😅 Thanks for following up 😄

@davidducos
Copy link
Member

@64kramsystem , looks like they don't want to fix it, actually, they didn't acknowledge the problem. Now that we know what is causing the issue and a test case, the solution is to create the tables with a single thread, but I would like to add an option to force using just one thread.
I will need to think the best way to implement it. I will keep you posted.

@davidducos
Copy link
Member

@64kramsystem
I think that this is related to https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html#foreign-key-locking which changed the behaviour from 5.7 to 8.0.

@64kramsystem
Copy link
Contributor Author

He! I was just now reading the manuals as well - the below were my observations.

@64kramsystem , looks like they don't want to fix it, actually, they didn't acknowledge the problem. Now that we know what is causing the issue and a test case, the solution is to create the tables with a single thread, but I would like to add an option to force using just one thread. I will need to think the best way to implement it. I will keep you posted.

Ok, well, the MySQL bug tracker issue is dead and buried now. Deadlocks are overrated, anyway 😂

At least they gave a hint, which allows understanding the nature of problem. I think that this is the key:

Metadata locks are extended, as necessary, to tables related by a foreign key constraint to prevent conflicting DML and DDL operations from executing concurrently on the related tables. When updating a parent table, a metadata lock is taken on the child table while updating foreign key metadata. Foreign key metadata is owned by the child table.

This specific paragraph, from the [8.0 MDL page](it's located at https://dev.mysql.com/doc/refman/8.0/en/metadata-locking.html) was indeed not present in the 5.7 version.

@64kramsystem
Copy link
Contributor Author

It does make sense, ultimately.

If DDLs/LOCK tables acquire locks over related tables, with one thread starting from the parent, and another thread starting from the child, then this is a typical deadlock.

It is consistent with the hypothesis the operations putting MDLs don't honor the FOREIGN_KEY_CHECKS variable. It's obscure whether this is intentional or not, although based on this:

MySQL extends metadata locks, as necessary, to tables that are related by a foreign key constraint. Extending metadata locks prevents conflicting DML and DDL operations from executing concurrently on related tables.

one may infer that it's intentional.

I agree with the single thread solution, which I think is the most solid. Using LOCK $table statements as alternative may not work, if the behave the same way as the DDLs used to drop the tables (but I haven't tried).

Thanks for the digging 🙂

@davidducos
Copy link
Member

I was thinking on (other) options to implement this workaround:

  1. Perform the DROP and CREATE single threaded with an option like: --single-threaded-table-creation
  2. As the problem is only at DROP stage, divide the DROP and CREATE in separate stages, then myloader executes the DROP single threaded and the CREATE can be multithreaded.
  3. When deadlock is detected, we can just retry, we can add an option to set a retry limit.

@64kramsystem what do you think about 3. ??

@64kramsystem
Copy link
Contributor Author

64kramsystem commented Nov 23, 2021

I was thinking on (other) options to implement this workaround:

1. Perform the DROP and CREATE single threaded with an option like: --single-threaded-table-creation

2. As the problem is only at DROP stage, divide the DROP and CREATE in separate stages, then myloader executes the DROP single threaded and the CREATE can be multithreaded.

3. When deadlock is detected, we can just retry, we can add an option to set a retry limit.

@64kramsystem what do you think about 3. ??

I'm personally not a big fan of basing workflows on error states 😁 However, I also recognize that of all the options, this is the only one that maintains concurrency as it is now, so I'm not against it (it really depends on how much one considers the tradeoff valid).

Of the options, my favourite is 2 (although I suppose is technically more complex than 1).

If speed is a concern (as the DROPs will execute sequentially) for some users and there is space for a new option, an approach I find interesting is to add a database drop functionality (option), which is the "as-fast-as-it-gets" way to drop the tables; it's also not unusual, as mysqldump/pump have an option to add the drop destination db statement. I'd use it all the time, at least! 😆

@davidducos
Copy link
Member

@64kramsystem I had just realized that you can use --purge-mode=TRUNCATE to avoid the issue and get the expected behavior. Unless the table structure is not the same... 🤔

@davidducos
Copy link
Member

Ok... from my analysis a can tell, that the only REAL solution is 1. Let me explain why I'm not considering 2 and 3 as a solution.
3- At the end, it might fail anyway, and in a scenario where you have a few amount of rows, the 2 threads will be competing each other. The only 'solution' will be add a sleep... but I don't want to do that.
2- This is based that the problem is only at DROP stage but I found that it is not true in all the cases that this issue happens.

So, as this only happens if there are foreign key, it might be good only in parent and childs tables. However, in the childs table structure, there is no info that it is referenced by another table. That is why it is all or none in a single threaded creation mode. But it will be optional, so users that are not using foreign keys, will have the opportunity to create tables in parallel.

@davidducos davidducos linked a pull request Jan 21, 2022 that will close this issue
@64kramsystem
Copy link
Contributor Author

Ok... from my analysis a can tell, that the only REAL solution is 1

Excellent, thanks for the fix! I've also realized that its implementation is simpler than I imagined 👍 Thanks again 😄

@davidducos
Copy link
Member

#922

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants