Python GCS Client library low performance on multi-thread #69
Comments
It might be interesting to log the HTTP traffic and inspect what is going on.

Python

Enabling HTTP request logging in the Python GCS library can be done with the standard logging module. In the following example, I'm enabling logging.DEBUG:
Ref: https://docs.python.org/3/library/logging.html

GSUTIL

gsutil's top-level -D option dumps HTTP requests along with other debug output. For example:
Ref: https://cloud.google.com/storage/docs/gsutil/addlhelp/TopLevelCommandLineOptions
Attachment: httplog_4workers_20files.txt

"Per Chris's request I ran the test program with HTTP logging turned on. Here is the output for a run with 1 worker retrieving one file."
First off, I made some test data by running this locally and uploading to a directory in storage:
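The generation snippet itself wasn't captured in this scrape; a hypothetical stand-in for producing a directory of small test files (the file count and size here are assumptions, not the reporter's actual values) could look like:

```python
import os
import tempfile

def make_test_files(dest_dir, count=20, size=1024):
    """Create `count` files of `size` random bytes each in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for i in range(count):
        with open(os.path.join(dest_dir, f"file_{i:04d}.bin"), "wb") as f:
            f.write(os.urandom(size))

# The resulting directory would then be uploaded to the bucket,
# e.g. with `gsutil -m cp dir/* gs://<bucket>/dir/`.
make_test_files(os.path.join(tempfile.gettempdir(), "gcs_issue_test_data"))
```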
I also made some small modifications to the code to make it a bit more flexible.
After a bit of investigation: testing from my network (Seattle, WA), running 8 workers on a machine with a quad-core Intel Core i7 (8 vCores), against a multi-region US bucket, I tracked metadata retrieval, downloading the 1 KB file, and setting metadata. Each operation takes right around 0.15 to 0.25 seconds; if one takes longer than 0.25 s I print a warning. The attached log has a single warning, from a metadata update that took 0.27 seconds. The code has changed slightly from above, as I added additional logging. Time to download using code.py is around 40 seconds (the timing captures the final sleep on the threads, so actual time is less).
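The per-operation timing described above can be sketched with a simple decorator. The 0.25 s threshold comes from the comment; the decorator itself and its names are an illustrative assumption, not the reporter's actual code:

```python
import time
from functools import wraps

WARN_THRESHOLD = 0.25  # seconds, per the comment above

def timed(label):
    """Print a warning when the wrapped call exceeds WARN_THRESHOLD."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            elapsed = time.monotonic() - start
            if elapsed > WARN_THRESHOLD:
                print(f"WARNING: {label} took {elapsed:.2f}s")
            return result
        return wrapper
    return decorator
```

Each tracked step (metadata get, download, metadata set) would be wrapped the same way, so slow calls surface immediately in the log.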
Closing this out, as the customer has been helped. I will open bugs to dig into specific things we can do to help folks avoid this in the future. It seems the threaded version of this code has some contention; moving to multiprocessing is much faster. Using threads: ~30 seconds
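The thread-to-process switch can be sketched as below. `download_one` is a hypothetical stand-in for the per-file download and the bucket name is made up; the relevant point is that each worker process constructs its own client rather than sharing one across threads:

```python
from concurrent.futures import ProcessPoolExecutor

def download_one(blob_name):
    # Stand-in for the real per-file work; in the actual script this
    # would be roughly:
    #   from google.cloud import storage
    #   storage.Client().bucket("my-bucket").blob(blob_name) \
    #       .download_to_filename(blob_name)
    return blob_name

def run_with_processes(blob_names, workers=8):
    # Separate processes sidestep the GIL and the client-level
    # contention observed with the threaded version.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_one, blob_names))
```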
Experiencing slow performance with a multi-threaded script on a GCE VM; the bucket and the VM are in the same region (us-east1). After upgrading the library to the latest version (1.25), performance improved, but the customer hit a bottleneck once using 10 threads or more.
Threads  Time GCP  Time AWS
5        48.4      118.0
10       25.1      58.6
15       22.5      41.3
20       24.1      30.9
25       24.5      25.3
The test data set consists of 114,750 files, ~25 GB in total.
The table compares the results against the same app hosted on a VM in AWS. I want the time to keep decreasing as the thread count increases.

- Is the library going through the internet instead of communicating inside the GCP network?
- Are there limitations that can be addressed by some configuration of the library?
- How can the performance be improved to avoid the bottleneck?
I checked the performance of the bucket with cp and perf-diag directly on the VM in GCE and the results were fine. This narrows the issue down to the library itself.
Just as a reference, these are the cp timings from VMs in GCE and AWS with SDK 1.20:

Source      Multi-thread App  gsutil -m cp
GCE VM      30+ min           8.5 min
AWS EC2 VM  25 min            26 min