…vendored-GenoFLU-multi subrepo: subdir: "ingest/vendored-GenoFLU-multi" merged: "2a548a9" upstream: origin: "https://github.com/moncla-lab/GenoFLU-multi" branch: "main" commit: "2a548a9" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Add rules to run the vendored GenoFLU on sequences. Creates a
final_metadata.tsv that has a new column `genoflu_genotype`.
This is currently not directly runnable via `nextstrain build` because
we do not have the GenoFLU dependencies in the Nextstrain runtimes.
I was able to run this locally by installing the genoflu dependencies
in my Nextstrain conda runtime with
```
mamba install -c conda-forge -c bioconda genoflu \
--prefix ~/.nextstrain/runtimes/conda/env/ \
--platform osx-64
```
Then I could run the new rules for the NCBI data with
```
nextstrain build --conda ingest \
joined-ncbi/results/final_metadata.tsv \
--configfile build-configs/ncbi/defaults/config.yaml
```
Installation
Upgrade GenoFLU to 1.06

GenoFLU is now 1.06 whereas genoflu-multi still uses/contains 1.05, and this version bump is important for D1.1. Since I have no idea how to upgrade the fork from within a subrepo, I ended up cloning the GenoFLU repo itself and modifying

```
sys.path.insert(0, '/Users/naboo/github/GenoFLU/bin')
import genoflu as gf
```

however the Moncla lab fork makes changes to genoflu itself, so this doesn't work. I think we need to rebase/merge the Moncla lab fork with its upstream and then subsequently bring those changes into our vendored subrepo.

Slow runtime

I didn't dive into what's happening but I saw very little CPU / memory usage while running.

Restrict to H5N1/2.3.4.4b

Fauna has 50k sequences, which (a) is very slow to run through GenoFLU and (b) includes sequences outside H5N1/2.3.4.4b, the only ones GenoFLU should be run on. For D1.1 purposes I did this in a very ad-hoc way via:

```
mv fauna/results fauna/results-pre-genoflu
mkdir fauna/results
for segment in mp np pa pb2 na ns pb1 ha; do
  augur filter \
    --min-date 2024 --query "gisaid_clade=='2.3.4.4b'" \
    --metadata fauna/results-pre-genoflu/metadata.tsv \
    --sequences fauna/results-pre-genoflu/sequences_${segment}.fasta \
    --output-metadata fauna/results/metadata.tsv \
    --output-sequences fauna/results/sequences_${segment}.fasta
done
```
Using files ingested via <#127> (see comments in that PR for how to run ingest). I then copied those files to skip the initial download steps in the phylo workflows via:

```
cp <ingest>/fauna/results/final_metadata.tsv data/gisaid/metadata.tsv
cp <ingest>/fauna/results/sequences_*fasta data/gisaid/
```

(similar for `<ingest>/joined-ncbi` to `data/ncbi`)

Note: Running NCBI requires switching the `s3_src` in `h5n1-d1.1.yaml`
Here are the steps I think we need to take to get this into our canonical ingested files:
- Upgrade GenoFLU to v1.06, and ideally document how to do this for next time we need to upgrade
- Add BLAST into the runtimes. I think starting with Docker-only would be ok, but 🤷
- Change the pipelines so that the results in `results/` have genoflu calls. There's a bunch of ways to do this, and one way would be:
  - Change `merge_segment_metadata` to output metadata to `data`, and
  - Filter segments into a subdirectory (probably in `data`) as I think genoflu will gather up all sequences in the given directory. The filtering for GISAID would be `gisaid_clade=='2.3.4.4b'`, unsure for NCBI / Andersen lab - maybe we call everything?
  - Run GenoFLU within this directory
  - Merge the genoflu TSV with the metadata (from `merge_segment_metadata`) and output to `results`
- Add a coloring to our auspice configs to export this
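
The "merge the genoflu TSV with the metadata" step could look roughly like the sketch below. It is only an illustration, not the rule in this PR: the join columns (`strain` in the metadata, `Strain` and `Genotype` in the GenoFLU output) are assumptions about the file layouts, and only the `genoflu_genotype` column name comes from the commit message. Strains with no GenoFLU call get an empty value.

```python
import csv
import io


def merge_genotypes(metadata_tsv, genoflu_tsv,
                    metadata_id="strain",      # assumed metadata ID column
                    genoflu_id="Strain",       # assumed GenoFLU ID column
                    genoflu_col="Genotype",    # assumed GenoFLU genotype column
                    new_col="genoflu_genotype"):
    """Return metadata TSV text with a genotype column appended."""
    # Build a strain -> genotype lookup from the GenoFLU table.
    genotypes = {
        row[genoflu_id]: row[genoflu_col]
        for row in csv.DictReader(io.StringIO(genoflu_tsv), delimiter="\t")
    }

    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=[*reader.fieldnames, new_col],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Left join: keep every metadata row, even without a genotype call.
        row[new_col] = genotypes.get(row[metadata_id], "")
        writer.writerow(row)
    return out.getvalue()
```

Keeping this as a left join on the metadata means sequences that were filtered out before GenoFLU still appear in `results`, just without a genotype.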
One comment here:
This will work for HA, but the genotype system includes assignments for segments that are not clade 2.3.4.4b associated. For example, D1.1 involves a low path NA gene, which is genetically quite distinct and was not previously associated with 2.3.4.4b. The clade annotations are currently assigned based on the HA sequence, but then all segments are colored by the HA clade. So as long as we are just inheriting the 2.3.4.4b designation for the HA segments, this should be fine. I hope that makes sense.
Added dependencies for GenoFLU in docker-base in nextstrain/docker-base#242. With that, I am able to run the workflow in the Docker runtime.
Combined with the previous commit we now have the ability to produce 'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls. This is config-controllable and is currently set up to run GenoFLU for NCBI & Andersen lab but not for Fauna. The number of threads for actually running GenoFLU has been increased to 12 because (from my testing) each thread has low CPU & memory usage, so setting a large number of threads (even threads >> cores) improves performance. We should revisit what exactly is happening here. Note that the filtering approach for fauna may not be correct as implemented here (see <#127 (comment)>); however, fauna is currently not run through GenoFLU.
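
As a generic illustration of why threads >> cores can help when each thread is mostly idle, here is a small sketch. It does not run GenoFLU and is not a diagnosis of what GenoFLU is doing: the task is a stand-in that only waits (`time.sleep`), and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_bound_task(i, delay=0.05):
    """Stand-in for a job that mostly waits rather than using the CPU."""
    time.sleep(delay)
    return i


def run(tasks, workers):
    """Run all tasks with the given number of threads; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(wait_bound_task, tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, serial = run(range(12), workers=1)
    _, threaded = run(range(12), workers=12)
    print(f"1 worker: {serial:.2f}s, 12 workers: {threaded:.2f}s")
```

If the real jobs behave like this (waiting on subprocesses or I/O), the wall-clock time scales with the number of workers rather than the number of cores, which would match the observation above.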
Closing this PR as I've cherry-picked these commits into #126. Having everything in one PR allows us to easily run the ingest locally and then the phylo workflow using the locally ingested files to generate the D1.1 build.
Description of proposed changes
Seems straightforward to add steps to include GenoFLU in the ingest workflow. The part that we'll have to work on is adding the dependencies for GenoFLU in the Nextstrain runtimes.

This PR vendors https://github.com/moncla-lab/GenoFLU-multi with `git subrepo` and then adds the new rules to include GenoFLU genotypes in the final metadata.

I was able to run this locally by first installing GenoFLU into my Nextstrain conda runtime with the `mamba install` command shown in the commit message above (`--platform osx-64` was needed to get `blast` installed).

Then I was able to run NCBI ingest w/ GenoFLU with the `nextstrain build --conda ingest` command shown there.
Related issue(s)
Related to #80
Checklist