DM-38215: Parallelize upload_hsc_rc2.py #82
Conversation
This factoring gives the program more flexibility in how upload is handled for large numbers of files.
This change allows the bucket to be re-initialized for each process (it's not picklable), without being re-initialized for each task (it's expensive).
upload_hsc_rc2.py needs to modify and upload roughly a hundred images for each visit, and doing so serially is much too slow: a full visit takes 1-2 minutes, which is slower than the cadence the complete system is supposed to sustain.
Experiments show that the main overhead comes from initializing the process pool itself, not from initializing the bucket. A non-optimal chunk size adds another 50-100% overhead.
This change minimizes the pool startup overhead, by making sure it's initialized exactly once.
Pool startup time increases with process count, but doing it in parallel with the "exposure" makes it a non-issue. Processing time is very sensitive to chunk size in non-obvious ways.
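The chunk-size sensitivity mentioned above is a general `Pool.map` behavior: `chunksize` sets how many tasks are handed to a worker at once. A small illustration (the workload here is a placeholder, not the script's upload function):

```python
import multiprocessing

def work(x):
    return x * x  # stand-in for per-file upload work

if __name__ == "__main__":
    items = list(range(1000))
    with multiprocessing.Pool(processes=4) as pool:
        # chunksize trades off IPC overhead (too small: one round trip
        # per task) against load imbalance (too large: idle workers at
        # the tail). The sweet spot depends on the workload, so it is
        # worth measuring rather than guessing.
        squares = pool.map(work, items, chunksize=25)
    print(squares[:3])
```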
This lets the uploader(s) run in parallel with the fake observations, eliminating overhead and letting the exposures be generated at exactly the advertised cadence.
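Overlapping the uploads with the "exposures" amounts to submitting work asynchronously and only blocking at the very end. A hedged sketch of that shape, with sleeps standing in for the real slew/expose/upload steps and all names hypothetical:

```python
import multiprocessing
import time

def upload_visit(visit):
    time.sleep(0.01)  # stand-in for uploading one visit's files
    return visit

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        pending = []
        for visit in range(3):
            time.sleep(0.01)  # stand-in for taking the "exposure"
            # Hand the upload to the pool; the main thread immediately
            # moves on to the next exposure at the advertised cadence.
            pending.append(pool.apply_async(upload_visit, (visit,)))
        # Only the last visit's upload is actually waited on at the end.
        done = [p.get() for p in pending]
    print(done)
```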
Looks good to me and the speedup is awesome!
I was a bit surprised that boto3 doesn't have a built-in multithreading/multiprocessing option, but a quick search suggests the same, and I'm sure you looked into it even more. The client, unlike resource, is thread safe, but it's not clear to me whether we could refactor the script to use just the client, or whether that would help.
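For reference, the usual workaround when a handle is not thread safe is one handle per thread via thread-local storage. A minimal sketch, assuming a `make_client` factory that stands in for something like creating a boto3 client from a per-thread session (the names and the threaded refactor itself are hypothetical, not what the script does):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def make_client():
    # Stand-in for constructing a real (possibly non-thread-safe) client.
    return object()

def get_client():
    # Lazily create exactly one client per thread.
    if not hasattr(_local, "client"):
        _local.client = make_client()
    return _local.client

def upload(key):
    client = get_client()  # safe: never shared across threads
    return key, id(client)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload, ["k1", "k2", "k3"]))
print(len(results))
```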
python/tester/upload_hsc_rc2.py
        The datasets to upload
    """
    try:
        max_processes = math.ceil(0.25*multiprocessing.cpu_count())
Curious why you chose 0.25 here?
It was pretty arbitrary -- I didn't want to use a large fraction of shared resources, even if it would only be in bursts of ~15 seconds. On the current rubin-devl, max_processes = 32.
Even if we could, we might not want to use it, since thread-safe objects slow down processing even if they're not shared. The deciding issue was not any explicit parallel-processing support, but the fact that …
The previous cycle was slew, send next_visit, expose, upload. The sequence next_visit, slew, expose, upload is more representative of real observing procedures.
Force-pushed from 06ff9c1 to 81b32cc.
This PR reorganizes the raw file upload in upload_hsc_rc2.py to use a process pool that runs in parallel with the main "exposure-taking" thread. The result is that I/O is no longer a bottleneck for this program; it now runs in the time needed to perform the virtual slew and exposure operations (plus a few seconds for the last visit's upload).