-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Client for FAH resources, reference implementation for compute service #7
base: main
Are you sure you want to change the base?
Conversation
Uses a process pool for CPU-bound units, async/await for Fah units
…onds to a FAH RUN Building out API points in FahAdaptiveSamplingClient to support this unit, since it is largely responsible for managing its own state on the work server. Considering ways of making the work server support partial execution, or at least not compute the same FahOpenMMSimulationUnit twice from the same Task twice.
Adding in run files manipulation next, followed by clones, gens. I think the model we want to use will generate these imperatively, but will need to verify this will work with @jcoffland.
Trying to give clear, relatively atomic methods for file interactions, RUN, CLONE creation, etc.
@ianmkenney, since this is such a monster PR, can I get an initial review from you? This will help us identify gaps in testing, clarity, etc. as I finish up the test suite for the compute service and the CLI. It will also get you oriented enough with what's here to assist me in troubleshooting issues as I finish out those tests. |
The compute service appears to work as desired, though we aren't explicitly testing out things like certificate refreshes. May add these in the future, but moving on to CLI tests next.
Looks like CI is now pulling in |
CI passes! Now working on:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.
alchemiscale_fah/tests/integration/compute/protocols/test_protocolunit.py
Show resolved
Hide resolved
) | ||
self.fah_cert_update_thread.start() | ||
|
||
# check that heartbeat is still alive; if not, resurrect it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change comment contents
self.heartbeat_thread = threading.Thread(target=self.heartbeat, daemon=True) | ||
self.heartbeat_thread.start() | ||
|
||
def _refresh_cert_update_thread(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could probably extract this method (and the above heartbeat one) to a general thread refresher.
index: FahComputeServiceIndex, | ||
encryption_public_key: Optional[str] = None, | ||
) -> ProtocolDAGResult: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete parameter documentation.
context=fah_context, raise_error=raise_error, **inputs | ||
) | ||
|
||
# if this is a FahProtocolUnit, then we await its execution in-process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# if this is a FahProtocolUnit, then we await its execution in-process | |
# if this is a FahSimulationUnit, then we await its execution in-process |
run_id = 0 | ||
clone_id = 0 | ||
|
||
project_data = ProjectData( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
project creation is done many times, better to put in a fixture if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking fantastic! Just some questions and nitpicks mostly but good to go through and clarify.
- uses: conda-incubator/setup-miniconda@v2 | ||
with: | ||
auto-update-conda: true | ||
use-mamba: true | ||
python-version: ${{ matrix.python-version }} | ||
miniforge-variant: Mambaforge | ||
environment-file: devtools/conda-envs/test.yml | ||
activate-environment: alchemiscale-fah-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: better to use https://github.com/mamba-org/setup-micromamba but no biggie.
effort_func = NONBONDED_EFFORT[nonbonded_settings] | ||
effort = effort_func(n_atoms) | ||
credit = assign_credit(effort) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate core_id model side, also what is the format? hex: 0x24
or int: 36
? EDIT: I see it is hex format in models
.
# index = FahComputeServiceIndex(index_file) | ||
# index.set_project(project_id, fah_project) | ||
# index.db.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Index not being set?
if "scopes" in params_init: | ||
params_init["scopes"] = [ | ||
Scope.from_str(scope) for scope in params_init["scopes"] | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could fold into FahAsynchronousComputeServiceSettings
model as a validator if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, esp with the way alchemiscale
is structured, drawing parallels this makes me think that the API is a live service.
# if no tasks claimed, sleep and return | ||
if all([task_sk is None for task_sk in task_sks]): | ||
self.logger.info( | ||
"No tasks claimed; sleeping for %d seconds", self.sleep_interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fstring here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up to you ofc, but IMO better to be modern and consistent.
|
||
except KeyboardInterrupt: | ||
# if we "fail" due to a KeyboardInterrupt, we always want to raise | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise something concrete here to distinguish from other error paths that have raw raise.
Eg.
except KeyboardInterrupt as e
# if we "fail" due to a KeyboardInterrupt, we always want to raise
raise RuntimeError("Caught keyboard interrupt") from e
# TODO: add encryption of files here if enabled as a setting on the | ||
# service use configured public key | ||
if ctx.encryption_public_key: | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
) | ||
|
||
science_log_path = ctx.shared / "science.log" | ||
with open(ctx.shared / "science.log", "wb") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this async writing all the science.logs
to one spot? If so are the writes atomic or will they end up jumbled?
1, | ||
description="Either disable (1) or enable (0) separate PME stream (default: 1); warning, setting 0 may cause failures on some cards", | ||
) | ||
globalVarFilename: str = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can set (or stub) MWExclusionThreshold
as well if you want.
Thank you @ianmkenney and @hmacdope! These reviews are extremely helpful! I'm making my way through your recommendations! |
Note for self: need to make sure we add a file-based indicator to completed RUNs/GENs so that a separate archive/cleanup service can consume these. |
Also, during my deployment testing I realized we've made an oversight in the design of how we interface Fixing this isn't too difficult given how we've laid things out, but making this adjustment will require changes in a few places. |
Closes #1, #3.