Integer sharding deferrer to improve parallelism & some mydumper options #1381

Merged
merged 13 commits into from
Jan 15, 2024

midenok
Collaborator

@midenok midenok commented Jan 3, 2024

--clear and --dirty control how mydumper treats the output directory. By default it does not touch a dirty output directory.
--skip-defer turns off the integer sharding deferrer.

Based on PR #1386

@midenok midenok force-pushed the fixes branch 9 times, most recently from b4db033 to d163a5d on January 6, 2024 11:23
Removed useless proxies:

  give_me_another_non_innodb_chunk_step()
  give_me_another_innodb_chunk_step()

Removed useless hooks in process_queue()

Some formatting, some initialization cleanups

Comments

update_files_on_table_job() cleanup
  - Instrumentation via /usr/bin/time;

  - Backtrace in bash to understand where you failed;

  - Fix Unknown database 'empty_db' in test mydumper#811 (empty_db is dropped
    between test groups);

  - Core limit print;

  - --case/-c option for repeated single case testing;

  - Different location for sakila-db.tar.gz (the MySQL doc site blocks
    downloads from CircleCI);

  - Use socket connection when possible (a little bit of refactoring
    to prettify the code);

  - MYLOADER_ARGS, MYDUMPER_ARGS env variables pass parameters from
    the shell.
The client library nominally supports some of these environment variables,
but that support is buggy and incomplete. For example, mysqldump doesn't
respect MYSQL_HOST, and setting MYSQL_TCP_PORT does not force it to use
the TCP protocol. We handle these variables in the upper layer so that
their behavior is guaranteed and understood.
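
To illustrate what handling the variables in the upper layer could look like, here is a minimal sketch. It is not the code from this PR: `conn_settings`, `proto_t`, `resolve_conn_settings()` and `resolve_conn_settings_from_env()` are hypothetical names, and the precedence rule (explicit values win over the environment) and the rule that a set port forces TCP are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not mydumper's real structures. */
typedef enum { PROTO_DEFAULT, PROTO_TCP } proto_t;

typedef struct {
  const char *host;   /* NULL: leave the choice to the client library */
  unsigned int port;  /* 0: not set */
  proto_t protocol;
} conn_settings;

/* Merge explicit (command-line) values with environment values.
 * Assumption: explicit values take precedence over the environment. */
conn_settings resolve_conn_settings(const char *cli_host, unsigned int cli_port,
                                    const char *env_host, const char *env_port) {
  conn_settings s = { cli_host, cli_port, PROTO_DEFAULT };

  /* Fall back to MYSQL_HOST only when no host was given explicitly. */
  if (s.host == NULL && env_host != NULL && env_host[0] != '\0')
    s.host = env_host;

  /* Fall back to MYSQL_TCP_PORT; ignore values that are not a valid port. */
  if (s.port == 0 && env_port != NULL) {
    char *end = NULL;
    unsigned long v = strtoul(env_port, &end, 10);
    if (end != env_port && *end == '\0' && v > 0 && v <= 65535)
      s.port = (unsigned int) v;
  }

  /* Assumption: a port is only meaningful over TCP, so request it explicitly. */
  if (s.port != 0)
    s.protocol = PROTO_TCP;

  assert(s.protocol == PROTO_DEFAULT || s.port != 0);
  return s;
}

/* Convenience wrapper that reads the real environment. */
conn_settings resolve_conn_settings_from_env(const char *cli_host,
                                             unsigned int cli_port) {
  return resolve_conn_settings(cli_host, cli_port,
                               getenv("MYSQL_HOST"), getenv("MYSQL_TCP_PORT"));
}
```

Keeping the merge logic in a function that takes the environment values as parameters makes the behavior easy to check without touching the process environment.
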

--debug enabled for any version. Tracing works ok on earlier versions.

Check error status of mysql_select_db().
After INTERMEDIATE_ENDED we may not get any more DATA from
refresh_db_queue, therefore we don't push THREAD into refresh_db_queue
and don't trigger SHUTDOWN.

Good sequence:

[DEBUG] - [CJ] Thread control_job_thread started
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> INTERMEDIATE_ENDED (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> DATA (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue <- THREAD
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] here_is_your_job <- SHUTDOWN (5 times)

Bad sequence:

[DEBUG] - [CJ] Thread control_job_thread started
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> DATA (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> THREAD (4 loaders waiting)
[DEBUG] - [CJ] refresh_db_queue -> INTERMEDIATE_ENDED (4 loaders waiting)

In the good sequence DATA triggers SHUTDOWN (see "if
(intermediate_queue_ended_local && giveup)" in wake_threads_waiting()).

In the bad sequence DATA doesn't trigger SHUTDOWN because
intermediate_queue_ended_local is still false.

Probably (only probably) we may omit first wake_threads_waiting():

  wake_threads_waiting(conf, &threads_waiting);
  intermediate_queue_ended_local = TRUE;

But that is not guaranteed (I don't understand all the possible
outcomes), so it is safest to keep both calls of
wake_threads_waiting().
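
The ordering problem can be replayed with a toy model. This is only a sketch and not mydumper's control flow: `event_t` and `replay_reaches_shutdown()` are hypothetical, the model assumes `giveup` is always true (no job is available for any waiting loader), and the wake-up check is reduced to a single flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds mirroring the messages in the debug log. */
typedef enum { EV_THREAD, EV_DATA, EV_INTERMEDIATE_ENDED } event_t;

/* Replay a sequence of refresh_db_queue messages and report whether
 * SHUTDOWN would be triggered in this simplified model.
 * wake_on_ended: also run the wake-up check right after the flag is set
 * when INTERMEDIATE_ENDED arrives. */
bool replay_reaches_shutdown(const event_t *events, size_t n, bool wake_on_ended) {
  bool intermediate_queue_ended_local = false;
  bool shutdown = false;

  assert(events != NULL || n == 0);

  for (size_t i = 0; i < n && !shutdown; i++) {
    switch (events[i]) {
    case EV_THREAD:
      /* A loader reports that it is waiting; nothing to decide yet. */
      break;
    case EV_DATA:
      /* Stand-in for the check in wake_threads_waiting():
       * give up only once the intermediate queue has ended. */
      if (intermediate_queue_ended_local)
        shutdown = true;
      break;
    case EV_INTERMEDIATE_ENDED:
      intermediate_queue_ended_local = true;
      if (wake_on_ended)
        shutdown = true;
      break;
    }
  }
  return shutdown;
}
```

In this model the good sequence (INTERMEDIATE_ENDED before DATA) reaches SHUTDOWN either way, while the bad sequence (DATA before INTERMEDIATE_ENDED) reaches it only when the check is also run on INTERMEDIATE_ENDED.
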
Without --clear, leftover chunks from a previous dump may cause
duplicate/wrong data!
Fail on a dirty directory if neither --clear nor --dirty is specified.
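
The rule above can be sketched as a small decision function. This is an illustration under assumptions, not the code from this PR: `outdir_action`, `dir_is_dirty()` and `decide_outdir_action()` are hypothetical names, "dirty" is assumed to mean "contains any entry", and letting --clear win when both options are given is a choice made for the example.

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical outcome of the output directory check. */
typedef enum { OUTDIR_USE, OUTDIR_CLEAR_FIRST, OUTDIR_FAIL } outdir_action;

/* Return true if `path` contains anything besides "." and "..".
 * A directory that cannot be opened is reported as not dirty. */
bool dir_is_dirty(const char *path) {
  assert(path != NULL);
  DIR *d = opendir(path);
  if (d == NULL)
    return false;

  bool dirty = false;
  struct dirent *e;
  while ((e = readdir(d)) != NULL) {
    if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0) {
      dirty = true;
      break;
    }
  }
  closedir(d);
  return dirty;
}

/* Decide what to do with the output directory.
 * Assumption: if both options are given, --clear takes precedence. */
outdir_action decide_outdir_action(bool is_dirty, bool opt_clear, bool opt_dirty) {
  if (!is_dirty)
    return OUTDIR_USE;          /* clean directory: nothing to decide */
  if (opt_clear)
    return OUTDIR_CLEAR_FIRST;  /* --clear: remove leftovers, then dump */
  if (opt_dirty)
    return OUTDIR_USE;          /* --dirty: dump into it as it is */
  return OUTDIR_FAIL;           /* default: refuse to touch a dirty dir */
}
```
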
An integer PK is the only kind of PK that allows parallelized sharding in
mydumper. Tables with a non-integer PK use only one thread per table, and
if they are processed last they leave some worker threads idle. To
mitigate this, integer tables are enqueued only after all non-integer
tables have been processed.
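
The resulting enqueue order can be sketched as a stable partition. This is a simplified illustration, not the deferrer implemented in this PR: `table_info` and `defer_integer_tables()` are hypothetical names, and the real code works on job queues rather than on an index array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table descriptor for illustration. */
typedef struct {
  const char *name;
  bool integer_pk;  /* true if the table can be sharded by integer PK */
} table_info;

/* Fill `order` (capacity n) with indexes into `tables`:
 * non-integer-PK tables first, deferred integer-PK tables after them.
 * The relative order inside each group is preserved. */
void defer_integer_tables(const table_info *tables, size_t n, size_t *order) {
  size_t out = 0;

  /* First pass: tables that can only use one thread each. */
  for (size_t i = 0; i < n; i++)
    if (!tables[i].integer_pk)
      order[out++] = i;

  /* Second pass: deferred tables that can be split across threads. */
  for (size_t i = 0; i < n; i++)
    if (tables[i].integer_pk)
      order[out++] = i;

  assert(out == n);
}
```

Since the single-threaded tables start early, the shardable integer tables are what remains at the end, and their chunks can keep all worker threads busy.
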
Integer sharding deferral may cause large RSS consumption with a huge
number of tables, since the queue grows until all non-integer tables are
processed. --skip-defer should not be used unless you have more than 100k
tables.
common.c cleanup: use mydumper_global.h instead of extern
declarations.
@davidducos davidducos merged commit 3830708 into mydumper:master Jan 15, 2024
35 checks passed