Client for FAH resources, reference implementation for compute service #7

dotsdl · 2023-09-27T21:15:49Z

Closes #1, #3.

Uses a process pool for CPU-bound units, async/await for Fah units
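A minimal sketch of that split, not the service's actual code: CPU-bound work goes to an executor so it does not block the event loop, while FAH-style units that mostly wait on a remote server are awaited in-process. The names `cpu_bound_unit`, `fah_unit`, and `execute_units` are hypothetical stand-ins.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor


def cpu_bound_unit(n: int) -> int:
    """Hypothetical stand-in for a CPU-bound unit: sum of squares below n."""
    return sum(i * i for i in range(n))


async def fah_unit(name: str) -> str:
    """Hypothetical stand-in for a FAH unit, which mostly waits on a remote server."""
    await asyncio.sleep(0.01)  # placeholder for polling the work server
    return f"{name}: done"


async def execute_units(executor: Executor) -> list:
    loop = asyncio.get_running_loop()
    # CPU-bound work is handed to the pool so it does not block the event loop
    cpu_future = loop.run_in_executor(executor, cpu_bound_unit, 1000)
    # I/O-bound FAH-style work is awaited directly in-process
    return await asyncio.gather(cpu_future, fah_unit("fah-unit-0"))


def run_demo() -> list:
    # a thread pool keeps this demo portable; any Executor can be passed instead
    with ThreadPoolExecutor(max_workers=1) as pool:
        return asyncio.run(execute_units(pool))
```

Because `execute_units` accepts any `concurrent.futures.Executor`, a `ProcessPoolExecutor` can be swapped in for truly CPU-bound work, provided the submitted function is defined at module level so it can be pickled.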

…onds to a FAH RUN

Building out API points in FahAdaptiveSamplingClient to support this unit, since it is largely responsible for managing its own state on the work server. Considering ways of making the work server support partial execution, or at least not computing the same FahOpenMMSimulationUnit twice from the same Task.

@jcoffland

Adding in run files manipulation next, followed by clones, gens. I think the model we want to use will generate these imperatively, but will need to verify this will work with @jcoffland.

Trying to give clear, relatively atomic methods for file interactions, RUN, CLONE creation, etc.

dotsdl · 2024-04-30T15:28:39Z

@ianmkenney, since this is such a monster PR, can I get an initial review from you? This will help us identify gaps in testing, clarity, etc. as I finish up the test suite for the compute service and the CLI. It will also get you oriented enough with what's here to assist me in troubleshooting issues as I finish out those tests.

The compute service appears to work as desired, though we aren't explicitly testing out things like certificate refreshes. May add these in the future, but moving on to CLI tests next.

dotsdl · 2024-05-30T18:06:35Z

Looks like CI is now pulling in openfe and gufe 1.0, which alchemiscale and feflow are not yet compatible with. I'll proceed with pinning our test and deployment env to what we're currently using in alchemiscale so we aren't blocked in doing deployment test cycles.

dotsdl · 2024-05-31T05:06:24Z

CI passes! Now working on:

CLI entrypoint for service startup
YAML config template for service
tests for CLI entrypoint
tests for cert renewal functionality

ianmkenney

See comments.

alchemiscale_fah/protocols/feflow/nonequilibrium_cycling.py

alchemiscale_fah/tests/integration/compute/protocols/test_protocolunit.py

alchemiscale_fah/compute/client.py

ianmkenney · 2024-06-10T21:54:02Z

alchemiscale_fah/compute/service.py

+            )
+            self.fah_cert_update_thread.start()
+
+        # check that heartbeat is still alive; if not, resurrect it


change comment contents

ianmkenney · 2024-06-10T21:57:17Z

alchemiscale_fah/compute/service.py

+            self.heartbeat_thread = threading.Thread(target=self.heartbeat, daemon=True)
+            self.heartbeat_thread.start()
+
+    def _refresh_cert_update_thread(self):


Could probably extract this method (and the above heartbeat one) to a general thread refresher.
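One way such a general refresher could look, as a sketch only: `refresh_thread` is a hypothetical helper, not something in the PR. It returns the existing thread if it is still alive and otherwise starts a daemon replacement.

```python
import threading
from typing import Callable, Optional


def refresh_thread(
    thread: Optional[threading.Thread], target: Callable[[], object]
) -> threading.Thread:
    """Return `thread` if it is still alive; otherwise start and return a replacement."""
    if thread is not None and thread.is_alive():
        return thread
    # the old thread is missing or dead, so resurrect it with the same target
    new_thread = threading.Thread(target=target, daemon=True)
    new_thread.start()
    return new_thread
```

Under this assumption, both refresh methods would reduce to one-liners such as `self.heartbeat_thread = refresh_thread(self.heartbeat_thread, self.heartbeat)`.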

ianmkenney · 2024-06-10T21:58:11Z

alchemiscale_fah/compute/service.py

+    index: FahComputeServiceIndex,
+    encryption_public_key: Optional[str] = None,
+) -> ProtocolDAGResult:
+    """


Incomplete parameter documentation.

ianmkenney · 2024-06-10T21:58:39Z

alchemiscale_fah/compute/service.py

+                    context=fah_context, raise_error=raise_error, **inputs
+                )
+
+                # if this is a FahProtocolUnit, then we await its execution in-process


Suggested change

-                # if this is a FahProtocolUnit, then we await its execution in-process
+                # if this is a FahSimulationUnit, then we await its execution in-process

ianmkenney · 2024-06-10T22:00:36Z

alchemiscale_fah/tests/integration/compute/test_compute_client.py

+        run_id = 0
+        clone_id = 0
+
+        project_data = ProjectData(


Project creation is done many times; better to put it in a fixture if possible.

hmacdope

Looking fantastic! Just some questions and nitpicks mostly but good to go through and clarify.

hmacdope · 2024-06-17T00:37:57Z

.github/workflows/ci-integration.yml

+      - uses: conda-incubator/setup-miniconda@v2
+        with:
+            auto-update-conda: true
+            use-mamba: true
+            python-version: ${{ matrix.python-version }}
+            miniforge-variant: Mambaforge
+            environment-file: devtools/conda-envs/test.yml
+            activate-environment: alchemiscale-fah-test


NIT: better to use https://github.com/mamba-org/setup-micromamba but no biggie.

hmacdope · 2024-06-17T00:40:28Z

alchemiscale_fah/cli.py

+    effort_func = NONBONDED_EFFORT[nonbonded_settings]
+    effort = effort_func(n_atoms)
+    credit = assign_credit(effort)
+


Validate core_id on the model side. Also, what is the format: hex (0x24) or int (36)? EDIT: I see it is hex format in models.
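For illustration, a hex-format check could be as small as the sketch below. `parse_core_id` is a hypothetical helper, not the validation used in the models; it only shows that "0x24" and 36 are the same value and that a bare decimal string can be rejected.

```python
def parse_core_id(value: str) -> int:
    """Parse a hex core ID string such as "0x24" into an int (hypothetical helper)."""
    if not value.lower().startswith("0x"):
        raise ValueError(f"core_id must be a hex string like '0x24', got {value!r}")
    # int() with base 16 accepts the "0x" prefix
    return int(value, 16)
```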

hmacdope · 2024-06-17T00:43:25Z

alchemiscale_fah/cli.py

+    # index = FahComputeServiceIndex(index_file)
+    # index.set_project(project_id, fah_project)
+    # index.db.close()


Index not being set?

hmacdope · 2024-06-17T00:44:27Z

alchemiscale_fah/cli.py

+    if "scopes" in params_init:
+        params_init["scopes"] = [
+            Scope.from_str(scope) for scope in params_init["scopes"]
+        ]


Could fold into FahAsynchronousComputeServiceSettings model as a validator if you want.
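A rough sketch of the idea, using a stdlib dataclass with `__post_init__` in place of a real model validator so it stays dependency-free. `Scope` and `ServiceSettings` here are simplified stand-ins, and the "org-campaign-project" string format is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Scope:
    """Minimal stand-in for a scope with org, campaign, and project parts."""

    org: str
    campaign: str
    project: str

    @classmethod
    def from_str(cls, string: str) -> "Scope":
        # assumed string format: "org-campaign-project"
        org, campaign, project = string.split("-")
        return cls(org, campaign, project)


@dataclass
class ServiceSettings:
    """Stand-in settings model; strings in `scopes` are converted on construction."""

    scopes: List[Union[Scope, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # plays the role of the validator: callers may pass strings or Scope objects
        self.scopes = [
            Scope.from_str(s) if isinstance(s, str) else s for s in self.scopes
        ]
```

With the conversion living in the model, the CLI would no longer need its own `if "scopes" in params_init` branch.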

hmacdope · 2024-06-17T00:46:45Z

alchemiscale_fah/compute/api.py

Agree, especially given the way alchemiscale is structured; drawing parallels, this makes me think that the API is a live service.

hmacdope · 2024-06-17T01:42:19Z

alchemiscale_fah/compute/service.py

+        # if no tasks claimed, sleep and return
+        if all([task_sk is None for task_sk in task_sks]):
+            self.logger.info(
+                "No tasks claimed; sleeping for %d seconds", self.sleep_interval


Use an f-string here and below.

up to you ofc, but IMO better to be modern and consistent.
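For reference, the two styles side by side, as a sketch with a made-up logger name and interval. Both produce the same message; the difference is that %-style arguments are only formatted if a handler emits the record, while the f-string is built before the call.

```python
import logging

logger = logging.getLogger("example")  # hypothetical logger name
sleep_interval = 30  # hypothetical value

# %-style: formatting is deferred until a handler actually emits the record
logger.info("No tasks claimed; sleeping for %d seconds", sleep_interval)

# f-string: the message is built eagerly, before logging checks the level
logger.info(f"No tasks claimed; sleeping for {sleep_interval} seconds")
```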

hmacdope · 2024-06-17T01:48:42Z

alchemiscale_fah/protocols/protocolunit.py

+
+        except KeyboardInterrupt:
+            # if we "fail" due to a KeyboardInterrupt, we always want to raise
+            raise


Raise something concrete here to distinguish from other error paths that have raw raise.

E.g.

    except KeyboardInterrupt as e:
        # if we "fail" due to a KeyboardInterrupt, we always want to raise
        raise RuntimeError("Caught keyboard interrupt") from e

hmacdope · 2024-06-17T01:51:33Z

alchemiscale_fah/protocols/protocolunit.py

+            # TODO: add encryption of files here if enabled as a setting on the
+            # service use configured public key
+            if ctx.encryption_public_key:
+                ...


hmacdope · 2024-06-17T01:52:40Z

alchemiscale_fah/protocols/protocolunit.py

+        )
+
+        science_log_path = ctx.shared / "science.log"
+        with open(ctx.shared / "science.log", "wb") as f:


Is this async writing all the science.logs to one spot? If so are the writes atomic or will they end up jumbled?
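If atomicity of a single write turns out to matter, one common pattern is to write to a temporary file in the same directory and rename it into place. The sketch below assumes nothing about how `ctx.shared` is laid out, and `write_atomically` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, data: bytes) -> None:
    """Write `data` to `path` so readers never observe a partially written file."""
    # the temp file lives in the target directory so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_name, path)  # atomic rename within one filesystem
    except BaseException:
        # clean up the temporary file if anything went wrong
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This prevents interleaved or truncated content, but if several units share one path the last writer still wins; distinct per-unit paths would be needed to keep every log.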

hmacdope · 2024-06-17T01:54:55Z

alchemiscale_fah/settings/fah_settings.py

+        1,
+        description="Either disable (1) or enable (0) separate PME stream (default: 1); warning, setting 0 may cause failures on some cards",
+    )
+    globalVarFilename: str = Field(


Can set (or stub) MWExclusionThreshold as well if you want.

dotsdl · 2024-06-25T04:44:27Z

Thank you @ianmkenney and @hmacdope! These reviews are extremely helpful! I'm making my way through your recommendations!

dotsdl · 2024-07-07T01:06:00Z

Note for self: need to make sure we add a file-based indicator to completed RUNs/GENs so that a separate archive/cleanup service can consume these.
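A possible shape for such an indicator, sketched with a hypothetical marker filename and directory layout: the compute side touches an empty file when a RUN or GEN finishes, and the cleanup side scans for it.

```python
from pathlib import Path

MARKER_NAME = "COMPLETE"  # hypothetical marker filename


def mark_complete(directory: Path) -> Path:
    """Drop an empty marker file into a finished RUN/GEN directory."""
    marker = directory / MARKER_NAME
    marker.touch(exist_ok=True)
    return marker


def find_completed(root: Path) -> list:
    """Return every directory under `root` that contains the marker file."""
    return sorted(p.parent for p in root.rglob(MARKER_NAME))
```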

dotsdl · 2024-07-09T15:24:07Z

Also, during my deployment testing I realized we've made an oversight in the design of how we interface ProtocolDAG execution with the FAH PRC(G) system. Instead of a Task mapping to a CLONE, we should instead have a ProtocolUnit map to a CLONE, since a given Task's ProtocolDAG may in principle feature any number of ProtocolUnits that are FahSimulationUnits. This means that for a single Task, it's possible multiple FAH CLONEs will be performed.

Fixing this isn't too difficult given how we've laid things out, but making this adjustment will require changes in a few places.
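To make the intended mapping concrete, here is an in-memory sketch keyed on (Task, ProtocolUnit) pairs rather than on Tasks alone. `CloneIndex` and `FahClone` are hypothetical names and do not reflect how FahComputeServiceIndex is implemented.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class FahClone(NamedTuple):
    """Identifies a CLONE by its project, RUN, and CLONE numbers."""

    project: int
    run: int
    clone: int


class CloneIndex:
    """In-memory stand-in mapping (task key, unit key) pairs to FAH CLONEs."""

    def __init__(self) -> None:
        self._clones: Dict[Tuple[str, str], FahClone] = {}

    def set_clone(self, task_key: str, unit_key: str, clone: FahClone) -> None:
        self._clones[(task_key, unit_key)] = clone

    def get_clone(self, task_key: str, unit_key: str) -> Optional[FahClone]:
        # returns None when this unit has not been assigned a CLONE yet
        return self._clones.get((task_key, unit_key))

    def clones_for_task(self, task_key: str) -> list:
        # a single Task may own several CLONEs, one per FAH simulation unit
        return [c for (t, _), c in self._clones.items() if t == task_key]
```

Looking up `get_clone` before creating a new CLONE would also give a natural place to avoid running the same unit twice for the same Task.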

dotsdl added 8 commits September 27, 2023 14:14

Added initial client for FAH resources, init for compute service

c4bc367

Closes #1.

Added async_cycle, async_execute to compute service; cleaned up settings

8e2970a

Added custom execute_DAG to compute service

250cbd1

Uses a process pool for CPU-bound units, async/await for Fah units

Black!

82b559b

Added env file, moving some things around for live prototyping

2e7e075

Added compute models for FahAdaptiveSamplingClient

0149ae0

Added FahCoreSettings structure, still populating

690d69f

dotsdl mentioned this pull request Oct 17, 2023

FAHComputeService for executing FAH-based Protocols via a Folding@Home work server #1

Open

dotsdl added 21 commits October 23, 2023 21:56

Refining adaptive sampling client; added in project files manipulation

bd9a681

Adding in run files manipulation next, followed by clones, gens. I think the model we want to use will generate these imperatively, but will need to verify this will work with @jcoffland.

Adding more methods to adaptive sampling client

5ae1d7f

Trying to give clear, relatively atomic methods for file interactions, RUN, CLONE creation, etc.

Added openmm-core settings

395de86

Black!

23dd61f

Live testing, env file trimming

0a8db84

Fixes to urljoin usage

1b250a8

Small fix to list_projects

2049f75

Hammering away...

4d1c173

Additional model fixes for FahAdaptiveSamplingClient

f6c72bb

Added ability to create projects using CLI

24d3365

Added FahComputeServiceIndex; fleshing out execution codepaths

5194e8b

Almost finished with a workable protocolunit

6dedd5d

Few more additions

2420ad5

Adding in selection of Tasks based on eligible Protocols

1bd9170

Black!

688b887

Added mock WS RESTful API for test suite

1e95d6e

Moving some things around

e6df371

Additional test apparatus, more mock WS endpoints, etc.

b7efa69

Black!

d08155a

Added project file upload and download to mock API, tests

7550472

Black!

6a7c9d6

dotsdl added 3 commits April 12, 2024 15:33

Update conda envs, in particular add pytest-asyncio

04e656a

Added certificate updates to Fah client

ee41806

Black!

13d6340

dotsdl mentioned this pull request Apr 16, 2024

Decrease execution time of tests by switching to formal charges on SmallMoleculeComponents #8

Closed

dotsdl mentioned this pull request Apr 30, 2024

Changes needed in support of alchemiscale-fah choderalab/feflow#45

Merged

dotsdl requested review from ianmkenney and hmacdope May 28, 2024 15:51

dotsdl added 5 commits May 29, 2024 12:51

Added content for service integration tests, automatic cert renewal o…

f2a99b5

…n service

Black!

2e1c21b

Compute service integration tests passing!

86c4eef

The compute service appears to work as desired, though we aren't explicitly testing out things like certificate refreshes. May add these in the future, but moving on to CLI tests next.

Black!

20a61b2

Updated testing env to use feflow branch for now

a7daa27

Pin test env to use openfe 0.14.0, gufe 0.9.5, same as prod alchemiscale

d02e64d

dotsdl added 2 commits May 31, 2024 18:06

Added CLI point for service startup, template config file

743cdb5

Black!

c2bab40

ianmkenney requested changes Jun 10, 2024

dotsdl added 4 commits June 10, 2024 18:45

Reorganized integration tests with single conftest; finished CLI test…

be1610d

… coverage

Black!

732fc0a

Added coverage of cert refreshes in client, enabled in service

6abc0b9

Black!

1da6b33

hmacdope requested changes Jun 17, 2024

dotsdl added 2 commits June 25, 2024 00:44

Changes in response to @ianmkenney review

17abc4c

Merge branch 'main' into fahcompute

e7f3bd8

Added key and CSR generation entrypoints to CLI

0ff636f
