
Python GCS Client library low performance on multi-thread #69

Closed
braussjoss opened this issue Feb 18, 2020 · 5 comments
Labels
api: storage, performance, priority: p1

Comments

@braussjoss

We are experiencing slow performance with a multi-threaded script on a GCE VM; the bucket and the VM are in the same region (us-east1). After upgrading the library to the latest version (1.25), performance improved, but we hit a bottleneck once we use 10 threads or more.
Threads | Time GCP | Time AWS
5       | 48.4     | 118.0
10      | 25.1     | 58.6
15      | 22.5     | 41.3
20      | 24.1     | 30.9
25      | 24.5     | 25.3

The test data set consists of 114,750 files of ~25GB in size.

We compared the results with the same app hosted on a VM in AWS. We want the time to keep decreasing as the number of threads increases.

Is the library going over the public internet instead of keeping the communication inside the GCP network?
Are there limitations that can be solved by some kind of configuration of the library?
How can we improve performance and avoid the bottleneck?

We checked the performance of the bucket with gsutil cp and perfdiag directly on the VM in GCE and the results were fine. This narrows the issue down to the library itself.

Just as a reference, these are the copy times from VMs in GCE and AWS with SDK 1.20:

Source     | Multi-thread app | gsutil -m cp
GCE VM     | 30+ min          | 8.5 min
AWS EC2 VM | 25 min           | 26 min
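
For context, the app follows roughly this pattern: a pool of worker threads downloading objects from the bucket. The sketch below is an assumption about that pattern, not the actual app code (the bucket name and worker count are placeholders):

from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

# Sketch only: download every object in a bucket with a pool of threads.
client = storage.Client()
bucket = client.bucket("my-test-bucket")  # placeholder bucket name

def download_one(blob_name):
    # Each task downloads one object into the current working directory.
    bucket.blob(blob_name).download_to_filename(blob_name.rsplit("/", 1)[-1])

blob_names = [b.name for b in client.list_blobs(bucket)]
with ThreadPoolExecutor(max_workers=10) as executor:  # 5-25 threads as in the table above
    list(executor.map(download_one, blob_names))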

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Feb 20, 2020
@product-auto-label product-auto-label bot added the api: storage label Feb 20, 2020
@crwilcox crwilcox added the priority: p1 and performance labels Feb 20, 2020
@crwilcox
Contributor

crwilcox commented Feb 20, 2020

It might be interesting to log the HTTP traffic and inspect what is going on.

Python

Logging HTTP requests in the Python GCS client library can be done with the logging module together with http.client's debug level. In the following example, I'm enabling logging.DEBUG:

import http.client  # Python3 required
import logging

from google.cloud import storage

# Print low-level HTTP request/response headers.
http.client.HTTPConnection.debuglevel = 5

# Necessary to turn on logging so the debug output is emitted.
logging.basicConfig(level=logging.DEBUG)

storage_client = storage.Client()

blobs = storage_client.list_blobs("anima-frank")
for blob in blobs:
    print(blob.name)

Ref: https://docs.python.org/3/library/logging.html
Ref: https://docs.python.org/2/library/logging.html

GSUTIL

gsutil has the --debug flag to enable HTTP request logs.

For example:

gsutil --debug ls gs://bucket-name

Ref: https://cloud.google.com/storage/docs/gsutil/addlhelp/TopLevelCommandLineOptions

@braussjoss
Author

httplog_4workers_20files.txt
httplog.txt

"Per Chris's request I ran the test program with HTTP logging turned on. Here is the output for a run with 1 worker retrieving one file."

@crwilcox
Contributor

crwilcox commented Feb 20, 2020

First off, I made some test data by running this locally and uploading to a directory in storage:

# Create 1,000 files of 1 KiB of random data each.
for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).data bs=1 count=1024
done

I also made some small modifications to the code to make it a bit more flexible.

  • I set defaults in the file to make running in a debugger easier. If you set your own, they will still be used.
  • The code didn't support objects without metadata. It now checks before assuming there is metadata to access (a sketch of that check follows below).
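
The check might look roughly like the following; this is a minimal sketch of the idea, not the attached code (the bucket name and metadata key are placeholders; blob.metadata is None for objects with no custom metadata):

from google.cloud import storage

client = storage.Client()

for blob in client.list_blobs("my-test-bucket"):  # placeholder bucket name
    metadata = blob.metadata or {}  # None when the object has no custom metadata
    metadata["processed"] = "true"  # placeholder key/value
    blob.metadata = metadata
    blob.patch()  # persist the metadata update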

code.txt

log_crw_100f_10t.txt

@crwilcox
Contributor

crwilcox commented Feb 21, 2020

After a bit of investigation: testing from my network (Seattle, WA), running 8 workers on a machine with a quad-core Intel Core i7 (8 vCores). The bucket is multi-region us.

I tracked metadata retrieval, downloading the 1 KB file, and setting metadata. Each takes right around 0.15-0.25 seconds. If it takes longer than 0.25 seconds I print a warning. The attached log has a single warning, from a metadata update that took 0.27 seconds.

The code has changed slightly from above, as I added additional logging.
code.txt
log_1000f_8w.txt
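
The timing-and-warning pattern described above is roughly the following; a minimal sketch under my assumptions, not the attached code (the 0.25 s threshold comes from the comment above; labels and paths are placeholders):

import time

SLOW_THRESHOLD = 0.25  # seconds; operations slower than this get a warning

def timed(label, fn, *args, **kwargs):
    # Run fn, measure wall-clock time, and warn when it exceeds the threshold.
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        print(f"WARNING: {label} took {elapsed:.2f}s")
    return result

# Usage with a google.cloud.storage Blob (names are placeholders):
# timed("metadata get", blob.reload)
# timed("download", blob.download_to_filename, "/tmp/out.data")
# timed("metadata set", blob.patch)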

Time to download using code.py is around 40 seconds (the timing captures the final sleep on the threads, so the actual time is less).

Using gsutil -m cp -r gs://bucket/demo-data/ I see it take 35.76 s.

@crwilcox
Contributor

Closing this out as the customer has been helped. Will open bugs to dig into specific things we can do to help folks avoid this in the future. It seems the threaded version of this code has some contention; moving to multiprocessing is much faster.

Using threads: ~30 seconds
After moving to multiprocessing:
  16 workers: 19.4 seconds
  32 workers: 13.2 seconds
  64 workers: 10.3 seconds
  128 workers: 9.1 seconds

multiprocessing_code.txt
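
For readers who cannot open the attachment, a minimal sketch of the multiprocessing approach (the attached multiprocessing_code.txt is the authoritative version; the bucket name, prefix, and worker count are placeholders):

from multiprocessing import Pool

from google.cloud import storage

_client = None  # one client per worker process, created lazily

def download_one(blob_name):
    global _client
    if _client is None:
        # Clients hold network connections, so each worker process builds its own.
        _client = storage.Client()
    bucket = _client.bucket("my-test-bucket")  # placeholder bucket name
    bucket.blob(blob_name).download_to_filename(blob_name.rsplit("/", 1)[-1])

if __name__ == "__main__":
    names = [b.name for b in storage.Client().list_blobs("my-test-bucket", prefix="demo-data/")]
    with Pool(processes=32) as pool:  # try 16/32/64/128 workers as in the numbers above
        pool.map(download_one, names)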
