DM-38215: Parallelize upload_hsc_rc2.py #82
Conversation
This factoring gives the program more flexibility in how upload is handled for large numbers of files.
This change allows the bucket to be re-initialized for each process (it's not picklable), without being re-initialized for each task (it's expensive).
upload_hsc_rc2.py needs to modify and upload roughly a hundred images for each visit, and doing so serially is much too slow: a full visit takes 1-2 minutes, which is slower than the cadence the complete system is supposed to sustain.
Experiments show that the main overhead comes from initializing the process pool itself, not from initializing the bucket. A non-optimal chunk size adds another 50-100% overhead.
This change minimizes the pool startup overhead, by making sure it's initialized exactly once.
Pool startup time increases with process count, but doing it in parallel with the "exposure" makes it a non-issue. Processing time is very sensitive to chunk size in non-obvious ways.
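The chunk-size sensitivity mentioned above is a general `Pool.map` behavior: `chunksize` sets how many tasks are handed to a worker at once. A small illustration (the workload here is a placeholder, not the script's upload function):

```python
import multiprocessing

def work(x):
    return x * x  # stand-in for per-file upload work

if __name__ == "__main__":
    items = list(range(1000))
    with multiprocessing.Pool(processes=4) as pool:
        # chunksize trades off IPC overhead (too small: one round trip
        # per task) against load imbalance (too large: idle workers at
        # the tail). The sweet spot depends on the workload, so it is
        # worth measuring rather than guessing.
        squares = pool.map(work, items, chunksize=25)
    print(squares[:3])
```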
This lets the uploader(s) run in parallel with the fake observations, eliminating overhead and letting the exposures be generated at exactly the advertised cadence.
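Overlapping the uploads with the "exposures" amounts to submitting work asynchronously and only blocking at the very end. A hedged sketch of that shape, with sleeps standing in for the real slew/expose/upload steps and all names hypothetical:

```python
import multiprocessing
import time

def upload_visit(visit):
    time.sleep(0.01)  # stand-in for uploading one visit's files
    return visit

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        pending = []
        for visit in range(3):
            time.sleep(0.01)  # stand-in for taking the "exposure"
            # Hand the upload to the pool; the main thread immediately
            # moves on to the next exposure at the advertised cadence.
            pending.append(pool.apply_async(upload_visit, (visit,)))
        # Only the last visit's upload is actually waited on at the end.
        done = [p.get() for p in pending]
    print(done)
```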
Looks good to me and the speedup is awesome!
I was a bit surprised that boto3 doesn't have a built-in multithreading/multiprocessing option, but a quick search suggests the same, and I'm sure you looked into it even more. The client, unlike resource, is thread safe, but it's not clear to me whether we could refactor the script to use just the client, or whether that would help.
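For reference, the usual workaround when a handle is not thread safe is one handle per thread via thread-local storage. A minimal sketch, assuming a `make_client` factory that stands in for something like creating a boto3 client from a per-thread session (the names and the threaded refactor itself are hypothetical, not what the script does):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def make_client():
    # Stand-in for constructing a real (possibly non-thread-safe) client.
    return object()

def get_client():
    # Lazily create exactly one client per thread.
    if not hasattr(_local, "client"):
        _local.client = make_client()
    return _local.client

def upload(key):
    client = get_client()  # safe: never shared across threads
    return key, id(client)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload, ["k1", "k2", "k3"]))
print(len(results))
```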
python/tester/upload_hsc_rc2.py
        The datasets to upload
    """
    try:
        max_processes = math.ceil(0.25*multiprocessing.cpu_count())
Curious why you chose 0.25 here?
It was pretty arbitrary -- I didn't want to use a large fraction of shared resources, even if it would only be in bursts of ~15 seconds. On the current rubin-devl, max_processes = 32.
Even if we could, we might not want to use it, since thread-safe objects slow down processing even if they're not shared. The deciding issue was not any explicit parallel-processing support, but the fact that …
The previous cycle was slew, send next_visit, expose, upload. The sequence next_visit, slew, expose, upload is more representative of real observing procedures.
Force-pushed from 06ff9c1 to 81b32cc.
This PR reorganizes the raw file upload in upload_hsc_rc2.py to use a process pool that runs in parallel with the main "exposure-taking" thread. The result is that I/O is no longer a bottleneck for this program; it now runs in the time needed to perform the virtual slew and exposure operations (plus a few seconds for the last visit's upload).