Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client for FAH resources, reference implementation for compute service #7

Open
wants to merge 57 commits into
base: main
Choose a base branch
from

Conversation

dotsdl
Copy link
Member

@dotsdl dotsdl commented Sep 27, 2023

Closes #1, #3.

Uses a process pool for CPU-bound units, async/await for Fah units
…onds to a FAH RUN

Building out API points in FahAdaptiveSamplingClient to support this
unit, since it is largely responsible for managing its own state on the
work server.

Considering ways of making the work server support partial execution, or
at least not compute the same FahOpenMMSimulationUnit twice from the
same Task twice.
@dotsdl
Copy link
Member Author

dotsdl commented Apr 30, 2024

@ianmkenney, since this is such a monster PR, can I get an initial review from you? This will help us identify gaps in testing, clarity, etc. as I finish up the test suite for the compute service and the CLI. It will also get you oriented enough with what's here to assist me in troubleshooting issues as I finish out those tests.

The compute service appears to work as desired, though we aren't
explicitly testing out things like certificate refreshes.
May add these in the future, but moving on to CLI tests next.
@dotsdl
Copy link
Member Author

dotsdl commented May 30, 2024

Looks like CI is now pulling in openfe and gufe 1.0, which alchemiscale and feflow are not yet compatible with. I'll proceed with pinning our test and deployment env to what we're currently using in alchemiscale so we aren't blocked in doing deployment test cycles.

@dotsdl
Copy link
Member Author

dotsdl commented May 31, 2024

CI passes! Now working on:

  • CLI entrypoint for service startup
  • YAML config template for service
  • tests for CLI entrypoint
  • tests for cert renewal functionality

Copy link
Collaborator

@ianmkenney ianmkenney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments.

alchemiscale_fah/compute/client.py Show resolved Hide resolved
alchemiscale_fah/compute/client.py Show resolved Hide resolved
)
self.fah_cert_update_thread.start()

# check that heartbeat is still alive; if not, resurrect it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change comment contents

self.heartbeat_thread = threading.Thread(target=self.heartbeat, daemon=True)
self.heartbeat_thread.start()

def _refresh_cert_update_thread(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably extract this method (and the above heartbeat one) to a general thread refresher.

index: FahComputeServiceIndex,
encryption_public_key: Optional[str] = None,
) -> ProtocolDAGResult:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete parameter documentation.

context=fah_context, raise_error=raise_error, **inputs
)

# if this is a FahProtocolUnit, then we await its execution in-process
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if this is a FahProtocolUnit, then we await its execution in-process
# if this is a FahSimulationUnit, then we await its execution in-process

run_id = 0
clone_id = 0

project_data = ProjectData(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

project creation is done many times, better to put in a fixture if possible

Copy link
Collaborator

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking fantastic! Just some questions and nitpicks mostly but good to go through and clarify.

Comment on lines +35 to +42
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
use-mamba: true
python-version: ${{ matrix.python-version }}
miniforge-variant: Mambaforge
environment-file: devtools/conda-envs/test.yml
activate-environment: alchemiscale-fah-test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: better to use https://github.com/mamba-org/setup-micromamba but no biggie.

effort_func = NONBONDED_EFFORT[nonbonded_settings]
effort = effort_func(n_atoms)
credit = assign_credit(effort)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate core_id model side, also what is the format? hex: 0x24 or int: 36 ? EDIT: I see it is hex format in models.

Comment on lines +170 to +172
# index = FahComputeServiceIndex(index_file)
# index.set_project(project_id, fah_project)
# index.db.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index not being set?

Comment on lines +207 to +210
if "scopes" in params_init:
params_init["scopes"] = [
Scope.from_str(scope) for scope in params_init["scopes"]
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could fold into FahAsynchronousComputeServiceSettings model as a validator if you want.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, esp with the way alchemiscale is structured, drawing parallels this makes me think that the API is a live service.

# if no tasks claimed, sleep and return
if all([task_sk is None for task_sk in task_sks]):
self.logger.info(
"No tasks claimed; sleeping for %d seconds", self.sleep_interval
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fstring here and below

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to you ofc, but IMO better to be modern and consistent.


except KeyboardInterrupt:
# if we "fail" due to a KeyboardInterrupt, we always want to raise
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raise something concrete here to distinguish from other error paths that have raw raise.

Eg.

        except KeyboardInterrupt as e
            # if we "fail" due to a KeyboardInterrupt, we always want to raise
            raise RuntimeError("Caught keyboard interrupt") from e
            

# TODO: add encryption of files here if enabled as a setting on the
# service use configured public key
if ctx.encryption_public_key:
...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

)

science_log_path = ctx.shared / "science.log"
with open(ctx.shared / "science.log", "wb") as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this async writing all the science.logs to one spot? If so are the writes atomic or will they end up jumbled?

1,
description="Either disable (1) or enable (0) separate PME stream (default: 1); warning, setting 0 may cause failures on some cards",
)
globalVarFilename: str = Field(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can set (or stub) MWExclusionThreshold as well if you want.

@dotsdl
Copy link
Member Author

dotsdl commented Jun 25, 2024

Thank you @ianmkenney and @hmacdope! These reviews are extremely helpful! I'm making my way through your recommendations!

@dotsdl
Copy link
Member Author

dotsdl commented Jul 7, 2024

Note for self: need to make sure we add a file-based indicator to completed RUNs/GENs so that a separate archive/cleanup service can consume these.

@dotsdl
Copy link
Member Author

dotsdl commented Jul 9, 2024

Also, during my deployment testing I realized we've made an oversight in the design of how we interface ProtocolDAG execution with the FAH PRC(G) system. Instead of a Task mapping to a CLONE, we should instead have a ProtocolUnit map to a CLONE, since a given Task's ProtocolDAG may in principle feature any number of ProtocolUnits that are FahSimulationUnits. This means that for a single Task, it's possible multiple FAH CLONEs will be performed.

Fixing this isn't too difficult given how we've laid things out, but making this adjustment will require changes in a few places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FAHComputeService for execuing FAH-based Protocols via a Folding@Home work server
4 participants