FAHComputeService for executing FAH-based Protocols via a Folding@Home work server #1
Evaluating options for
Pursuing option (3) for now, since this affords us the greatest flexibility and is conceptually simplest (and possibly simplest in implementation). If we encounter problems, then we can pursue other options. Thoughts on this @jchodera? Does option (3) present obvious problems to you? |
@dotsdl: You'll want to check with Joseph Coffland on the maximum number of RUNs, CLONEs, and GENs. I suspect these are

There is no real need to hew to the traditional conception of CLONEs as being different replicates---each CLONE can still be a completely different

We'll still want to have a variety of PROJECTs set up with different system sizes so we don't make volunteers too upset by run-to-run variation. We can start with a single PROJECT for debugging, but then perhaps set up a group of 10 projects of varying size to use. Happy to chat more online to help navigate the choices here. |
One other consideration may be that it might be inconvenient to end up with 65,536 files in each directory if you make everything a RUN. In fact, I'm not even sure |
Thank you for these insights @jchodera! We may be able to take advantage of CLONEs, then, to avoid filesystem issues and also give a larger space for jobs within a project. I'll consider this in my approach. I have several new questions as I'm moving forward in #7:
|
It would be good to work with @sukritsingh, who knows a whole lot about how this stuff works and can help resolve questions and debug things. |
Happy to help in any way I can! Just building off of John's comments about the questions asked:
This would probably be the "best" way to programmatically add projects, runs, clones, etc. However, like John said, it would be best to check with Joseph if you intend to add many multiples of projects, in case there are integer limitations.
I know the API has some endpoints for
There was also briefly a discussion about including a way to get custom Reporter or CustomCVForce outputs from the core, which may help here with appropriate changes to the integrator, state, and system, but I believe those are still in development on the OpenMM-core side. |
Thank you both! This has been very helpful in resolving our approach.
Here is my current proposal for how we will interface the
Please point out any problems you see with this approach, or any invalid assumptions. I also have several questions related to the above:
|
So the PROJECT, RUN, CLONE, GEN (PRCG) system is a way to track and identify any individual work unit in the FAH ecosystem. When you are setting up a

A practical example I often use: if I were running an equilibrium MD simulation dataset, I would set up a single PROJECT, where each RUN corresponds to a unique starting structure/pdb file/configuration of some kind. These are generally all meant to have the same/similar number of atoms, but may have small variations in conformations, ligand placement, etc. The value in

Each CLONE would then be a unique initialization of the corresponding RUN (i.e. unique initial velocities). The number of unique CLONEs per RUN is specified by the value in

Each CLONE then has a latest GEN that is the start of a fresh work unit (i.e. trajectory) that is sent to a client, who runs the complete trajectory; it is sent back and becomes the next GEN for the same CLONE. The number in

I think at some point I gave a group meeting presentation on this to the Chodera lab, if you are able to find it! I don't think I can link it here publicly at the moment, but I'll check!
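As a rough illustration of the hierarchy described above, a PRCG identifier can be modeled as a simple value type. This is only a sketch: the class and field names are ours for illustration, not part of any FAH library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PRCG:
    """Identifier of a single FAH work unit, per the description above."""

    project: int  # a set of similar systems (similar atom counts)
    run: int      # a unique starting structure/configuration
    clone: int    # an independent replicate, e.g. fresh initial velocities
    gen: int      # consecutive trajectory segment within a CLONE

    def next_gen(self) -> "PRCG":
        # when a work unit returns, the next GEN continues the same
        # trajectory for the same PROJECT/RUN/CLONE
        return PRCG(self.project, self.run, self.clone, self.gen + 1)
```

Each returned work unit thus deterministically names its successor, which is what makes the PRCG scheme a trajectory-continuation system rather than a generic job queue.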
Telling the WS to create new CLONEs means that you are telling the WS to add more trajectories for a specific RUN (i.e. in a case where you need more statistics). If you just update the

Some thoughts and clarifications I'm curious about:
I would double-check whether the API requires you to restart the WS (by restart I just mean restarting the WS service, to be clear, i.e. just running
Just making sure we're clear on the terminology translating between FAH and alchemiscale: In the context of a single free energy calculation dataset, I'm imagining CLONEs being just additional switching cycles for a single RUN/project (with each GEN identifying one of the unique cycles, as mentioned above).
I'm assuming by deleted you mean that you would be removing the files from the |
@sukritsingh and I met on Friday, 10/27, and discussed many of the points above. We agreed that there may be several functionality gaps in the adaptive sampling API to enable my proposed operation model above. We will seek to discuss directly with @jcoffland to see what is possible and report back here. |
More detailed notes from meeting with @sukritsingh last week:
|
From my understanding the AS can pick up changes in CLONEs, RUNs, and GENs automatically if you edit the corresponding project.xml or filesystem without a service restart, IIRC (could be wrong). Just thought I would add observations from recent projects I set up. |
This is intentional. For security reasons the API is not able to set any options that could be used for arbitrary remote command execution. You can set defaults for all projects per core in the WS'
You can delete an entire project. You can "restart" a CLONE or all the CLONEs in a RUN. This will not delete the files immediately but they will be replaced. I could add a delete CLONE/RUN API end point if you need it. Deleting each file would be tedious.
Runs do not need to exist before applying the "create" action. This takes parameters
The files needed by
The

Are you looking for a job queuing system that works something like this?:
You could treat the WS this way, but then we are shoehorning a more basic queuing system into F@H's traditional RUN/CLONE/GEN system. Also, downloading, analyzing, and then re-uploading the data for each WU is costly. It would be most effective if the bulk of the data analysis could be performed on the WS itself, or even on F@H clients. How often do you need to analyze result data? After every GEN? |
|
This will not work. A particular PRCG should exist only once in F@H. A PRCG should only be credited once. You could use |
You do not need to restart the WS when using the API.
You can programmatically renew the cert via the API; you just need to do so before it expires. An as-yet-unwritten Python API wrapper should do this automatically. |
Thank you so much for this @jcoffland! This has helped me understand a lot better what is actually possible here. Yes, we are effectively trying to build a simple job queue, as you describe, using the PRCG system. Would it be possible for the WS adaptive sampling API to expose an alternative scheme to PRCG, such as PJ (PROJECT, JOB)? This would remove the need for our compute service to externally shoehorn this pattern into the PRCG model.
If we did continue casting this behavior into the PRCG model on our own, I think it could still be done. A few more clarifying questions:
Misc. comments/questions:
|
I agree with sticking to just RUNs/CLONEs in this case. Since your code is running on the same server, there's no need to upload or download files via the API. The files will be in

Set

By calling

You should keep track of the RUN and CLONE of the last job created. Then just increment these values as you create new jobs. Your

```xml
<config>
  ...
  <core type="0x23">
    <gens v="1"/>
    <create-command v="/bin/true"/>
    <send>
      core.xml
      system.xml
      integrator.xml
      state.xml
    </send>
  </core>
</config>
```

Any core
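The bookkeeping suggested here (track the last RUN and CLONE, then increment as jobs are created) can be sketched in Python. The 65,536 per-RUN ceiling below is our assumption based on the 16-bit limits discussed earlier in the thread; confirm the real limit before relying on it.

```python
# Assumed 16-bit ceiling on CLONEs per RUN; confirm the actual limit
# with the WS maintainers before relying on it.
MAX_CLONES_PER_RUN = 65536


def next_prc(last_run: int, last_clone: int) -> tuple[int, int]:
    """Return the (RUN, CLONE) for the next job.

    Each new job gets a fresh CLONE, rolling over to a new RUN once the
    per-RUN CLONE space is exhausted.
    """
    run, clone = last_run, last_clone + 1
    if clone >= MAX_CLONES_PER_RUN:
        run, clone = run + 1, 0
    return run, clone
```

The compute service would persist the last-assigned pair so that job creation survives restarts without reusing a PRCG.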
You need to create a CSR (Certificate Signing Request) just like you do initially, then, using the credentials you've already acquired to access the WS/AS APIs, submit the CSR to the AS API endpoint:

```json
{"csr": "<CSR content here>"}
```

The response should look like this:

```json
{
  "certificate": "<New certificate>",
  "as-cert": "<AS certificate>"
}
```

Then in subsequent calls to the AS or WS use the new certificate. |
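A minimal standard-library sketch of this renewal flow. The AS URL, the `/certificate` endpoint path, and the use of a client certificate for authentication are our assumptions; only the JSON request/response shapes come from the example above.

```python
import json
import ssl
import urllib.request

AS_URL = "https://as.example.org/api"  # placeholder; use your AS's real URL


def renew_certificate(csr_pem: str, client_cert: str, client_key: str) -> dict:
    """Submit a CSR to the AS and return the parsed JSON response.

    The endpoint path and client-cert auth are assumptions; the request
    and response bodies follow the shapes shown above.
    """
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    req = urllib.request.Request(
        f"{AS_URL}/certificate",
        data=json.dumps({"csr": csr_pem}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())


def unpack_cert_response(body: dict) -> tuple[str, str]:
    """Split the response into (new client certificate, AS certificate)."""
    return body["certificate"], body["as-cert"]
```

A service would schedule this well before certificate expiry and swap the new certificate into its TLS context for subsequent AS/WS calls.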
Thanks @jcoffland for this! I've taken your feedback and created an updated proposal from the first one given above, and am sharing it here for visibility. I'm working to implement this now, and will follow up here if I hit any snags. @jchodera and @sukritsingh: if you see any issues with this, please let me know.
Here is my updated proposal for how we will interface the
Defaults defined in

```xml
<config>
  ...
  <core type="0x23">
    <gens v="1"/>
    <create-command v="/bin/true"/>
    <send>
      $home/RUNS/RUN$run/CLONE$clone/core.xml
      $home/RUNS/RUN$run/CLONE$clone/system.xml.bz2
      $home/RUNS/RUN$run/CLONE$clone/integrator.xml.bz2
      $home/RUNS/RUN$run/CLONE$clone/state.xml.bz2
    </send>
  </core>
</config>
```
|
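Given the `<send>` layout above, a small helper can stage each job's input files into the `RUNS/RUN$run/CLONE$clone` directory the WS expects. A sketch only: the `home` root and the set of file names are whatever the project actually uses.

```python
from pathlib import Path


def stage_job_files(home: Path, run: int, clone: int, files: dict[str, bytes]) -> Path:
    """Write one job's input files into the RUNS/RUN$run/CLONE$clone
    layout that the <send> block above points the WS at."""
    clone_dir = home / "RUNS" / f"RUN{run}" / f"CLONE{clone}"
    clone_dir.mkdir(parents=True, exist_ok=True)
    for name, data in files.items():
        (clone_dir / name).write_bytes(data)
    return clone_dir
```

Since the compute service runs on the same host as the WS, writing files directly like this avoids round-tripping them through the API, per jcoffland's earlier suggestion.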
@jcoffland I'm currently getting

Any insights as to why this might occur? The PROJECT exists, and I don't think it's necessary to explicitly create a RUN first (there is no adaptive sampling endpoint for creating a RUN, as far as I can tell). Sukrit (CC'd) mentioned that this may have something to do with the WS not creating some state for itself until a FAH client connects to it requesting work. Is this the case? |
@jcoffland regarding points: in our usage pattern detailed above, each FAH RUN corresponds to an

Is there a mechanism (like a multiplier?) that can be applied per-RUN on the base credit for the PROJECT? If so, any recommendation on how we might apply it based on the run length of the work units in a RUN? |
From looking at the code, there appear to be two ways in which you can get that message:
I think the first option is the most likely. Make sure you're really doing a PUT request. |
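To rule out the method mix-up described above, it helps to construct the request explicitly and assert its verb before sending. This standard-library sketch uses a hypothetical endpoint path and payload, not the real adaptive sampling API; the point is only that the HTTP method must be set explicitly, since `urllib` defaults to POST whenever a body is supplied.

```python
import json
import urllib.request


def create_clones(ws_url: str, project: int, run: int, clones: int) -> urllib.request.Request:
    """Build a PUT request for the WS "create" action.

    Endpoint path and JSON payload are illustrative placeholders; consult
    the WS API for the real ones. The method is set explicitly because
    urllib defaults to POST when data is given.
    """
    return urllib.request.Request(
        f"{ws_url}/projects/{project}/runs/{run}/create",
        data=json.dumps({"clones": clones}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Logging `req.get_method()` just before `urlopen` is a quick way to confirm the client really issues a PUT.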
@jcoffland thanks for this! Just checked: we are indeed doing a

Can you give guidance on what to try next? Also, if you're able to perform the |
It could be a version issue. The latest WS release is v10.3.4; you were running v10.3.1. I pushed an upgrade to your machine. |
Implement a FAH-oriented compute service that utilizes a Folding@Home work server to execute the simulation `ProtocolUnit`s of `ProtocolDAG`s produced by the FAH-specific protocols implemented in this library.

This compute service should efficiently execute multiple `Task`s at once, perhaps with a combination of `ProcessPoolExecutor`s and `asyncio`, `await`ing results from the work server and processing them as they arrive.
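The concurrency pattern proposed above can be sketched as blocking per-unit work dispatched to an executor pool while an event loop `await`s completions. Names here (`execute_unit`, `execute_all`) are illustrative; a real service would likely use a `ProcessPoolExecutor` for CPU-bound work, while this sketch uses a thread pool so it runs anywhere.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def execute_unit(unit_id: int) -> str:
    """Stand-in for executing one ProtocolUnit's work (blocking)."""
    return f"result-{unit_id}"


async def execute_all(unit_ids):
    """Dispatch all units to the pool; collect results as they finish."""
    loop = asyncio.get_running_loop()
    # swap in ProcessPoolExecutor for genuinely CPU-bound ProtocolUnits
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [loop.run_in_executor(pool, execute_unit, u) for u in unit_ids]
        return [await fut for fut in asyncio.as_completed(futures)]


results = asyncio.run(execute_all([1, 2, 3]))
```

Using `asyncio.as_completed` lets the service process each work server result as it arrives rather than waiting for the whole batch.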