
[ Enhancement ] Improve --rows implementation #529

Closed
davidducos opened this issue Dec 20, 2021 · 12 comments · Fixed by #817, #859 or #878

Comments

@davidducos
Member

Currently, when you use --rows, mydumper is going to:

  • Determine the min and max PK.
  • Based on a heuristic and the value of --rows, determine the number of rows per chunk.
  • Create all the jobs and enqueue them.

This is simple and works well when there are no gaps in the PK. However, there might be cases where some jobs take a few milliseconds and others take hundreds of seconds. Another thing to take into account is the number of files that are going to be created: you just don't know in advance. A rough sketch of this static approach follows.
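For illustration only, here is a minimal C sketch (not mydumper's actual code) of that static behaviour, using the min/max PK values reported in a comment further down this thread: every fixed-size PK range between min and max becomes a job, so a sparse PK produces a huge number of mostly-empty chunks.

/* Minimal sketch, not mydumper's actual code: the static --rows chunking
 * described above creates one job per fixed PK range between min and max,
 * so large gaps in the PK produce near-empty chunks and an unpredictable
 * number of output files. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
  uint64_t min_pk = 52120659ULL, max_pk = 182862300321ULL;  /* from a comment below */
  uint64_t rows_per_chunk = 500000ULL;                      /* --rows */
  uint64_t jobs = 0;

  for (uint64_t from = min_pk; from <= max_pk; from += rows_per_chunk) {
    uint64_t to = from + rows_per_chunk - 1;
    if (to > max_pk) to = max_pk;
    /* mydumper would enqueue a job here covering pk BETWEEN from AND to */
    (void)to;
    jobs++;
  }
  printf("jobs created: %" PRIu64 "\n", jobs);  /* ~365k jobs for this PK range */
  return 0;
}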

So, my idea is to adjust the chunk size dynamically, taking into account the execution time of the chunk. The goal is to keep the time as close to 1 second as possible. We could start by halving the size if the chunk takes longer than 2 seconds and doubling it if it takes less than 0.5 seconds.

At this moment, I'm not 100% sure how to implement it. However, other issues that will be merged should make it simpler to implement.

@balusarakesh

We are seeing the above issue while dumping a table with a billion rows.
Our table has a min PK of 2 and a max PK of 132157838608762, and it has so many gaps that the total row count is just a billion, which is way smaller than the max PK value.

When we try to dump with --rows 100000, the command takes up all the RAM and gets killed by the system.

We don't really want to use the --chunk-filesize option because, no matter the value, it puts a lot of pressure on read IOPS for the DB, which we want to avoid at all costs.

Are there any alternative options you can suggest so we can limit the amount of read activity per operation?

@davidducos
Member Author

Hi @balusarakesh, if you want to decrease the reads, you should decrease the number of threads with -t.

@davidducos
Member Author

Btw what version of mydumper are you using?

@balusarakesh

@davidducos We are using the latest version of mydumper, the number of threads is set to just 1, and --rows is set to 100,000, yet we still see a big spike in read IOPS.

@balusarakesh

Not really sure if this is related to this issue, but we use an AWS RDS replica to perform the dump and we see an intermittent spike in **WRITE-IOPS**. Does anyone know why this happens?

[Screenshot: Screen Shot 2022-03-11 at 1 11 40 PM]

@davidducos
Member Author

Hi @balusarakesh, mydumper doesn't perform write operations on the source server. It is myloader that executes the inserts.

@jgurney-owneriq

jgurney-owneriq commented Apr 9, 2022

I'm also running into this problem with --rows since upgrading to 0.11.5. Prior to that I was using 0.9.1-5 from the Ubuntu 18.04 repo and the behaviour of --rows on this table was very fast.

When trying to back up the following table with 0.11.5, it basically never completes. I let it run for a few hours before giving up, by which time it had generated output files with numeric suffixes over 100000.

mysql> select min(id), max(id) from exampleTable;
+----------+--------------+
| min(id)  | max(id)      |
+----------+--------------+
| 52120659 | 182862300321 |
+----------+--------------+

I've switched to --chunk-filesize for now, with which this table backs up as follows:

# time mydumper --user="root" --password="" --database="[REDACTED]" --outputdir=. --tables-list "[REDACTED]" --threads=2 --chunk-filesize=200 --host localhost

real	0m19.191s
user	0m6.720s
sys	0m1.075s

For comparison, here's performance using --rows with 0.9.1-5:

# time mydumper --user="root" --password="[REDACTED]" --database="[REDACTED]" --outputdir=. --tables-list "[REDACTED]" --threads=2 --rows=500000 --host localhost

real	0m9.921s
user	0m5.192s
sys	0m0.883s

I would be happy to provide timing comparisons with any proposed changes on this table, to confirm that this use case is handled.

Using Server version: 5.7.36-39-log Percona Server (GPL), Release '39', Revision '305619d'

@xiaoxuanfeng

I also encountered the above problem. I hope the author can address this performance issue as soon as possible. Thank you.

@davidducos
Member Author

#492 (comment)

@davidducos davidducos linked a pull request Sep 9, 2022 that will close this issue
@davidducos
Member Author

davidducos commented Sep 9, 2022

Stage I should reduce the amount of memory used.

Stage II will be implemented in the next releases.

@hustegg

hustegg commented Sep 30, 2022

@davidducos
If we have a huge number of tables (e.g. 500,000 tables, most of them empty), mydumper queries each table when --rows is used:
SELECT min(c1), max(c1) FROM t1; EXPLAIN SELECT * FROM t1;
The FTWRL is held until the min/max of every table has been obtained, so binlog replay on the replica gets blocked.

Could we detect in advance whether a table needs to be split by rows when we detect the engine with SHOW TABLE STATUS (maybe not so accurate)? I think it would reduce the amount of time the global read lock is held, since only huge tables would be queried for min/max.
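A rough sketch of that idea, assuming the libmysqlclient C API (the helper name table_needs_chunking and the exact threshold are illustrative, not mydumper's actual implementation): consult the approximate Rows value from SHOW TABLE STATUS and only run the MIN()/MAX() queries for tables whose estimate exceeds --rows.

/* Rough sketch of the idea above using the libmysqlclient API. The helper
 * name and the threshold are illustrative; this is not mydumper's code.
 * SHOW TABLE STATUS gives an approximate "Rows" value (column 5), which is
 * enough to decide whether the MIN()/MAX() queries are worth running. */
#include <mysql.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if the table looks big enough to be chunked by --rows. */
int table_needs_chunking(MYSQL *conn, const char *table, unsigned long long rows_opt) {
  char query[512];
  snprintf(query, sizeof(query), "SHOW TABLE STATUS LIKE '%s'", table);
  if (mysql_query(conn, query) != 0) {
    fprintf(stderr, "query failed: %s\n", mysql_error(conn));
    return 1;                                   /* on error, fall back to chunking */
  }
  MYSQL_RES *res = mysql_store_result(conn);
  if (res == NULL)
    return 1;

  int needs = 0;
  MYSQL_ROW row = mysql_fetch_row(res);
  if (row != NULL && row[4] != NULL) {          /* column 5 is the "Rows" estimate */
    unsigned long long estimated_rows = strtoull(row[4], NULL, 10);
    needs = estimated_rows > rows_opt;
  }
  mysql_free_result(res);
  return needs;
}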

@davidducos
Member Author

davidducos commented Oct 26, 2022

The chunk builder has been added, and it gives a better understanding of the chunk that is going to be executed the next time we need to get a chunk of data from the table. So, what remains? The logic to dynamically increase or reduce the chunk size, as currently the step is static (chunk size and step are synonyms in this context). My idea is to use --rows to set the initial step and then add 2 more parameters for min and max, or use something like 1000:5000:100000 for min:initial:max (the default could be 10% of -r for min and 10x for max), which means 1000 for the minimal step, 5000 for the initial and 100000 for the maximal step. We are going to use the query time to determine whether we need to increase or reduce the chunk size: if the query took less than 1 second, we multiply the current step by 2, capped at max; if the query took more than 2 seconds, we divide it by 2, capped at min; and we do nothing if the time is between 1 and 2 seconds. A sketch of that adjustment logic follows.
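A minimal C sketch of the adaptive step logic described above, assuming the proposed min:initial:max configuration (the struct and function names are illustrative, not mydumper's internals):

/* Minimal sketch of the adaptive step logic described above, assuming the
 * proposed min:initial:max configuration. Names are illustrative, not
 * mydumper's internals. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct chunk_step {
  uint64_t min_step;   /* e.g. 1000: default could be 10% of --rows */
  uint64_t step;       /* e.g. 5000: the initial step (--rows)      */
  uint64_t max_step;   /* e.g. 100000: default could be 10x --rows  */
};

/* query_seconds is how long the last chunk query took:
 * < 1s  -> double the step, capped at max_step
 * > 2s  -> halve the step, capped at min_step
 * 1..2s -> leave it unchanged */
void adjust_step(struct chunk_step *cs, double query_seconds) {
  if (query_seconds < 1.0) {
    cs->step *= 2;
    if (cs->step > cs->max_step) cs->step = cs->max_step;
  } else if (query_seconds > 2.0) {
    cs->step /= 2;
    if (cs->step < cs->min_step) cs->step = cs->min_step;
  }
}

int main(void) {
  struct chunk_step cs = { 1000, 5000, 100000 };  /* min:initial:max */
  adjust_step(&cs, 0.4);   /* fast chunk -> step doubles to 10000 */
  adjust_step(&cs, 3.1);   /* slow chunk -> step halves back to 5000 */
  printf("current step: %" PRIu64 "\n", cs.step);
  return 0;
}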
