
Python GCS Client library low performance on multi-thread #69

Closed
braussjoss opened this issue Feb 18, 2020 · 5 comments
Labels
api: storage, performance, priority: p1

Comments

@braussjoss

We are experiencing slow performance with a multi-threaded script on a GCE VM; the bucket and the VM are in the same region (us-east1). After upgrading the library to the latest version (1.25), performance improved, but we hit a bottleneck once we use 10 threads or more.
Threads | Time GCP | Time AWS
5       | 48.4     | 118.0
10      | 25.1     | 58.6
15      | 22.5     | 41.3
20      | 24.1     | 30.9
25      | 24.5     | 25.3

The test data set consists of 114,750 files of ~25GB in size.

We compared the results with the same app hosted on a VM in AWS. We want the time to keep decreasing as the number of threads increases.

Is the library going over the public internet instead of keeping the communication inside the GCP network?
Are there limitations that can be solved by some kind of configuration of the library?
How can we improve performance and avoid the bottleneck?

We checked the performance of the bucket with gsutil cp and perfdiag directly on the VM in GCE and the results were fine. This narrows the issue down to the library itself.

Just as a reference, these are the copy times from VMs in GCE and AWS with SDK 1.20:

Source     | Multi-thread app | gsutil -m cp
GCE VM     | 30+ min          | 8.5 min
AWS EC2 VM | 25 min           | 26 min
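
For context, the app follows roughly this pattern: a pool of worker threads downloading objects from the bucket. The sketch below is an assumption about that pattern, not the actual app code (the bucket name and worker count are placeholders):

from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

# Sketch only: download every object in a bucket with a pool of threads.
client = storage.Client()
bucket = client.bucket("my-test-bucket")  # placeholder bucket name

def download_one(blob_name):
    # Each task downloads one object into the current working directory.
    bucket.blob(blob_name).download_to_filename(blob_name.rsplit("/", 1)[-1])

blob_names = [b.name for b in client.list_blobs(bucket)]
with ThreadPoolExecutor(max_workers=10) as executor:  # 5-25 threads as in the table above
    list(executor.map(download_one, blob_names))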

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Feb 20, 2020
@product-auto-label product-auto-label bot added the api: storage label Feb 20, 2020
@crwilcox crwilcox added the priority: p1 and performance labels Feb 20, 2020
@crwilcox
Contributor

crwilcox commented Feb 20, 2020

It might be interesting to log the HTTP traffic and inspect what is going on.

Python

Logging HTTP requests in the Python GCS client library can be done with the logging module together with http.client's debug level. In the following example, I'm enabling logging.DEBUG:

import http.client  # Python3 required
import logging

from google.cloud import storage

# Print low-level HTTP request/response headers.
http.client.HTTPConnection.debuglevel = 5

# Necessary to turn on logging so the debug output is emitted.
logging.basicConfig(level=logging.DEBUG)

storage_client = storage.Client()

blobs = storage_client.list_blobs("anima-frank")
for blob in blobs:
    print(blob.name)

Ref: https://docs.python.org/3/library/logging.html
Ref: https://docs.python.org/2/library/logging.html

GSUTIL

gsutil has the --debug flag to enable HTTP request logs.

For example:

gsutil --debug ls gs://bucket-name

Ref: https://cloud.google.com/storage/docs/gsutil/addlhelp/TopLevelCommandLineOptions

@braussjoss
Author

httplog_4workers_20files.txt
httplog.txt

"Per Chris's request I ran the test program with HTTP logging turned on. Here is the output for a run with 1 worker retrieving one file."

@crwilcox
Contributor

crwilcox commented Feb 20, 2020

First off, I made some test data by running this locally and uploading to a directory in storage:

# Create 1,000 files of 1 KiB of random data each.
for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).data bs=1 count=1024
done

I also made some small modifications to the code to make it a bit more flexible.

  • I set defaults in the file to make running in a debugger easier. If you set your own, they will still be used.
  • The code didn't support objects without metadata. It now checks before assuming there is metadata to access (a sketch of that check follows below).
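
The check might look roughly like the following; this is a minimal sketch of the idea, not the attached code (the bucket name and metadata key are placeholders; blob.metadata is None for objects with no custom metadata):

from google.cloud import storage

client = storage.Client()

for blob in client.list_blobs("my-test-bucket"):  # placeholder bucket name
    metadata = blob.metadata or {}  # None when the object has no custom metadata
    metadata["processed"] = "true"  # placeholder key/value
    blob.metadata = metadata
    blob.patch()  # persist the metadata update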

code.txt

log_crw_100f_10t.txt

@crwilcox
Contributor

crwilcox commented Feb 21, 2020

After a bit of investigation: testing from my network (Seattle, WA), running 8 workers on a machine with a quad-core Intel Core i7 (8 vCores). The bucket is multi-region us.

I tracked metadata retrieval, downloading the 1 KB file, and setting metadata. Each takes right around 0.15-0.25 seconds. If it takes longer than 0.25 seconds I print a warning. The attached log has a single warning, from a metadata update that took 0.27 seconds.

The code has changed slightly from above, as I added additional logging.
code.txt
log_1000f_8w.txt
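
The timing-and-warning pattern described above is roughly the following; a minimal sketch under my assumptions, not the attached code (the 0.25 s threshold comes from the comment above; labels and paths are placeholders):

import time

SLOW_THRESHOLD = 0.25  # seconds; operations slower than this get a warning

def timed(label, fn, *args, **kwargs):
    # Run fn, measure wall-clock time, and warn when it exceeds the threshold.
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        print(f"WARNING: {label} took {elapsed:.2f}s")
    return result

# Usage with a google.cloud.storage Blob (names are placeholders):
# timed("metadata get", blob.reload)
# timed("download", blob.download_to_filename, "/tmp/out.data")
# timed("metadata set", blob.patch)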

Time to download using code.py is around 40 seconds (the timing captures the final sleep on the threads, so the actual time is less).

Using gsutil -m cp -r gs://bucket/demo-data/ I see it take 35.76 s.

@crwilcox
Contributor

Closing this out as the customer has been helped. Will open bugs to dig into specific things we can do to help folks avoid this in the future. It seems the threaded version of this code has some contention; moving to multiprocessing is much faster.

Using threads: ~30 seconds
After moving to multiprocessing:
  16 workers: 19.4 seconds
  32 workers: 13.2 seconds
  64 workers: 10.3 seconds
  128 workers: 9.1 seconds

multiprocessing_code.txt
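
For readers who cannot open the attachment, a minimal sketch of the multiprocessing approach (the attached multiprocessing_code.txt is the authoritative version; the bucket name, prefix, and worker count are placeholders):

from multiprocessing import Pool

from google.cloud import storage

_client = None  # one client per worker process, created lazily

def download_one(blob_name):
    global _client
    if _client is None:
        # Clients hold network connections, so each worker process builds its own.
        _client = storage.Client()
    bucket = _client.bucket("my-test-bucket")  # placeholder bucket name
    bucket.blob(blob_name).download_to_filename(blob_name.rsplit("/", 1)[-1])

if __name__ == "__main__":
    names = [b.name for b in storage.Client().list_blobs("my-test-bucket", prefix="demo-data/")]
    with Pool(processes=32) as pool:  # try 16/32/64/128 workers as in the numbers above
        pool.map(download_one, names)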
