
[ Enhancement ] Improve --rows implementation #529

Closed
davidducos opened this issue Dec 20, 2021 · 12 comments · Fixed by #817, #859 or #878

Comments

@davidducos
Member

Currently, when you use --rows, mydumper is going to:

  • Determine the min and max PK.
  • Based on a heuristic and the value of --rows, determine the number of rows per chunk.
  • Create all the jobs and enqueue them.

This is simple and works well when there are no gaps in the PK. However, there might be cases where some jobs take a few milliseconds and others take hundreds of seconds. Another thing to take into account is the number of files that are going to be created: you just don't know in advance. A rough sketch of this static approach follows.
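For illustration only, here is a minimal C sketch (not mydumper's actual code) of that static behaviour, using the min/max PK values reported in a comment further down this thread: every fixed-size PK range between min and max becomes a job, so a sparse PK produces a huge number of mostly-empty chunks.

/* Minimal sketch, not mydumper's actual code: the static --rows chunking
 * described above creates one job per fixed PK range between min and max,
 * so large gaps in the PK produce near-empty chunks and an unpredictable
 * number of output files. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
  uint64_t min_pk = 52120659ULL, max_pk = 182862300321ULL;  /* from a comment below */
  uint64_t rows_per_chunk = 500000ULL;                      /* --rows */
  uint64_t jobs = 0;

  for (uint64_t from = min_pk; from <= max_pk; from += rows_per_chunk) {
    uint64_t to = from + rows_per_chunk - 1;
    if (to > max_pk) to = max_pk;
    /* mydumper would enqueue a job here covering pk BETWEEN from AND to */
    (void)to;
    jobs++;
  }
  printf("jobs created: %" PRIu64 "\n", jobs);  /* ~365k jobs for this PK range */
  return 0;
}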

So, my idea is to adjust the chunk size dynamically, taking into account the execution time of the chunk. The goal is to keep the time as close to 1 second as possible. We could start by halving the size if the chunk takes longer than 2 seconds and doubling it if it takes less than 0.5 seconds.

At this moment, I'm not 100% sure how to implement it. However, other issues that will be merged should make it simpler to implement.

@balusarakesh

We are seeing the above issue while dumping a table with a billion rows.
Our table has a min PK of 2 and a max PK of 132157838608762, and it has so many gaps that the total row count is just a billion, which is way smaller than the max PK value.

When we try to dump with --rows 100000, the command takes up all the RAM and gets killed by the system.

We don't really want to use the --chunk-filesize option because, no matter the value, it puts a lot of pressure on read IOPS for the DB, which we want to avoid at all costs.

Are there any alternative options you can suggest so we can limit the amount of read activity per operation?

@davidducos
Member Author

Hi @balusarakesh, if you want to decrease the reads, you should decrease the number of threads with -t.

@davidducos
Member Author

Btw what version of mydumper are you using?

@balusarakesh

@davidducos We are using the latest version of mydumper, the number of threads is set to just 1, and --rows is set to 100,000, yet we still see a big spike in read IOPS.

@balusarakesh

Not really sure if this is related to this issue, but we use an AWS RDS replica to perform the dump and we see an intermittent spike in **WRITE-IOPS**. Does anyone know why this happens?

[Screenshot: Screen Shot 2022-03-11 at 1 11 40 PM]

@davidducos
Member Author

Hi @balusarakesh, mydumper doesn't perform write operations on the source server. It is myloader that executes the inserts.

@jgurney-owneriq

jgurney-owneriq commented Apr 9, 2022

I'm also running into this problem with --rows since upgrading to 0.11.5. Prior to that I was using 0.9.1-5 from the Ubuntu 18.04 repo and the behaviour of --rows on this table was very fast.

When trying to back up the following table with 0.11.5, it basically never completes. I let it run for a few hours before giving up, by which time it had generated output files with numeric suffixes over 100000.

mysql> select min(id), max(id) from exampleTable;
+----------+--------------+
| min(id)  | max(id)      |
+----------+--------------+
| 52120659 | 182862300321 |
+----------+--------------+

I've switched to --chunk-filesize for now, with which this table backs up as follows:

# time mydumper --user="root" --password="" --database="[REDACTED]" --outputdir=. --tables-list "[REDACTED]" --threads=2 --chunk-filesize=200 --host localhost

real	0m19.191s
user	0m6.720s
sys	0m1.075s

For comparison, here's performance using --rows with 0.9.1-5:

# time mydumper --user="root" --password="[REDACTED]" --database="[REDACTED]" --outputdir=. --tables-list "[REDACTED]" --threads=2 --rows=500000 --host localhost

real	0m9.921s
user	0m5.192s
sys	0m0.883s

I would be happy to provide timing comparisons with any proposed changes on this table, to confirm that this use case is handled.

Using Server version: 5.7.36-39-log Percona Server (GPL), Release '39', Revision '305619d'

@xiaoxuanfeng

I also encountered the above problem. I hope the author can address this performance issue as soon as possible. Thank you.

@davidducos
Member Author

#492 (comment)

@davidducos davidducos linked a pull request Sep 9, 2022 that will close this issue
@davidducos
Member Author

davidducos commented Sep 9, 2022

Stage I should reduce the amount of memory used.

Stage II will be implemented in the next releases.

@hustegg

hustegg commented Sep 30, 2022

@davidducos
If we have a huge number of tables (e.g. 500,000 tables, most of them empty), mydumper queries each table when --rows is used:
SELECT min(c1), max(c1) FROM t1; EXPLAIN SELECT * FROM t1;
The FTWRL is held until the min/max of every table has been obtained, so binlog replay on the replica gets blocked.

Could we detect in advance whether a table needs to be split by rows when we detect the engine with SHOW TABLE STATUS (maybe not so accurate)? I think it would reduce the amount of time the global read lock is held, since only huge tables would be queried for min/max.
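A rough sketch of that idea, assuming the libmysqlclient C API (the helper name table_needs_chunking and the exact threshold are illustrative, not mydumper's actual implementation): consult the approximate Rows value from SHOW TABLE STATUS and only run the MIN()/MAX() queries for tables whose estimate exceeds --rows.

/* Rough sketch of the idea above using the libmysqlclient API. The helper
 * name and the threshold are illustrative; this is not mydumper's code.
 * SHOW TABLE STATUS gives an approximate "Rows" value (column 5), which is
 * enough to decide whether the MIN()/MAX() queries are worth running. */
#include <mysql.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if the table looks big enough to be chunked by --rows. */
int table_needs_chunking(MYSQL *conn, const char *table, unsigned long long rows_opt) {
  char query[512];
  snprintf(query, sizeof(query), "SHOW TABLE STATUS LIKE '%s'", table);
  if (mysql_query(conn, query) != 0) {
    fprintf(stderr, "query failed: %s\n", mysql_error(conn));
    return 1;                                   /* on error, fall back to chunking */
  }
  MYSQL_RES *res = mysql_store_result(conn);
  if (res == NULL)
    return 1;

  int needs = 0;
  MYSQL_ROW row = mysql_fetch_row(res);
  if (row != NULL && row[4] != NULL) {          /* column 5 is the "Rows" estimate */
    unsigned long long estimated_rows = strtoull(row[4], NULL, 10);
    needs = estimated_rows > rows_opt;
  }
  mysql_free_result(res);
  return needs;
}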

@davidducos
Member Author

davidducos commented Oct 26, 2022

The chunk builder has been added, and it gives a better understanding of the chunk that is going to be executed the next time we need to get a chunk of data from the table. So, what remains? The logic to dynamically increase or reduce the chunk size, as currently the step is static (chunk size and step are synonyms in this context). My idea is to use --rows to set the initial step and then add 2 more parameters for min and max, or use something like 1000:5000:100000 for min:initial:max (the default could be 10% of -r for min and 10x for max), which means 1000 for the minimal step, 5000 for the initial and 100000 for the maximal step. We are going to use the query time to determine whether we need to increase or reduce the chunk size: if the query took less than 1 second, we multiply the current step by 2, capped at max; if the query took more than 2 seconds, we divide it by 2, capped at min; and we do nothing if the time is between 1 and 2 seconds. A sketch of that adjustment logic follows.
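A minimal C sketch of the adaptive step logic described above, assuming the proposed min:initial:max configuration (the struct and function names are illustrative, not mydumper's internals):

/* Minimal sketch of the adaptive step logic described above, assuming the
 * proposed min:initial:max configuration. Names are illustrative, not
 * mydumper's internals. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct chunk_step {
  uint64_t min_step;   /* e.g. 1000: default could be 10% of --rows */
  uint64_t step;       /* e.g. 5000: the initial step (--rows)      */
  uint64_t max_step;   /* e.g. 100000: default could be 10x --rows  */
};

/* query_seconds is how long the last chunk query took:
 * < 1s  -> double the step, capped at max_step
 * > 2s  -> halve the step, capped at min_step
 * 1..2s -> leave it unchanged */
void adjust_step(struct chunk_step *cs, double query_seconds) {
  if (query_seconds < 1.0) {
    cs->step *= 2;
    if (cs->step > cs->max_step) cs->step = cs->max_step;
  } else if (query_seconds > 2.0) {
    cs->step /= 2;
    if (cs->step < cs->min_step) cs->step = cs->min_step;
  }
}

int main(void) {
  struct chunk_step cs = { 1000, 5000, 100000 };  /* min:initial:max */
  adjust_step(&cs, 0.4);   /* fast chunk -> step doubles to 10000 */
  adjust_step(&cs, 3.1);   /* slow chunk -> step halves back to 5000 */
  printf("current step: %" PRIu64 "\n", cs.step);
  return 0;
}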
