
WIP: Ingest with GenoFLU #127

Closed

joverlee521 wants to merge 2 commits into master from ingest-with-genoflu

Conversation

@joverlee521
Contributor

Description of proposed changes

Seems straightforward to add steps to include GenoFLU in the ingest workflow. The part that we'll have to work on is adding the dependencies for GenoFLU to the Nextstrain runtimes. This PR vendors https://github.com/moncla-lab/GenoFLU-multi with git subrepo and then adds new rules to include GenoFLU genotypes in the final metadata.

I was able to run this locally by first installing GenoFLU into my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

(`--platform osx-64` was needed to get blast installed)

Then I was able to run NCBI ingest w/ GenoFLU with

```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```

Related issue(s)

Related to #80

Checklist

  • Checks pass

…vendored-GenoFLU-multi

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "2a548a9"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "2a548a9"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Add rules to run the vendored GenoFLU on sequences. Creates a
final_metadata.tsv that has a new column `genoflu_genotype`.

This is currently not directly runnable via `nextstrain build` because
we do not have the GenoFLU dependencies in the Nextstrain runtimes.

I was able to run this locally by installing the genoflu dependencies
in my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

Then I could run the new rules for the NCBI data with
```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```
@jameshadfield
Member

jameshadfield commented Feb 17, 2025

Installation

I was able to run this locally by first installing GenoFLU into my Nextstrain conda runtime

`genoflu-multi.py` naturally imports genoflu via the sister `genoflu.py` file within the genoflu-multi repo, so we don't need to additionally install genoflu via conda - unless you were doing so to get the dependencies?

--platform osx-64 was needed to get blast installed

`brew install blast` works for osx-arm64, although I realise that's not how our managed runtimes work. That was the only dependency I needed to install - everything else is included in an environment which can run augur.

Upgrade GenoFLU to 1.06

GenoFLU is now at 1.06 whereas genoflu-multi still uses/contains 1.05, and this version bump is important for D1.1. Since I have no idea how to upgrade the fork from within a subrepo, I ended up cloning the GenoFLU repo itself and modifying `genoflu-multi.py` to use that via:

```
sys.path.insert(0, '/Users/naboo/github/GenoFLU/bin')
import genoflu as gf
```

however the Moncla lab fork makes changes to genoflu itself so this doesn't work. I think we need to rebase/merge the Moncla lab fork with its upstream and then subsequently bring those changes in to our subrepo. cc @jordan-ort
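One possible route for that rebase-then-vendor step, sketched with git-subrepo (untested; `<upstream-url>` stands in for whichever repo the Moncla lab fork tracks, which isn't specified here):

```
# 1. In a clone of moncla-lab/GenoFLU-multi, merge in the upstream changes
#    (remote/branch names are assumptions):
git remote add upstream <upstream-url>
git fetch upstream
git merge upstream/main   # or rebase, then push the fork

# 2. Back in this repo, pull the updated fork into the vendored subrepo:
git subrepo pull ingest/vendored-GenoFLU-multi
```

`git subrepo pull` re-fetches the branch recorded in the subdir's `.gitrepo` file, so step 2 should pick up whatever lands on the fork's main branch.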

Slow runtime

I didn't dive into what's happening but I saw very little CPU / memory usage while running `python ./vendored-GenoFLU-multi/bin/genoflu-multi.py -f joined-ncbi/results/ -n 4`. Overall it completed in ~20min for 3.9k genomes (NCBI joined, 4 cores).

Restrict to H5N1/2.3.4.4b

Fauna has 50k sequences which is (a) very slow to run through GenoFLU and (b) GenoFLU should only be run on H5N1/2.3.4.4b. For D1.1 purposes I did this in a very ad-hoc way via:

```
mv fauna/results fauna/results-pre-genoflu
mkdir fauna/results
for segment in mp np pa pb2 na ns pb1 ha; do
    augur filter \
        --min-date 2024 --query "gisaid_clade=='2.3.4.4b'" \
        --metadata fauna/results-pre-genoflu/metadata.tsv \
        --sequences fauna/results-pre-genoflu/sequences_${segment}.fasta \
        --output-metadata fauna/results/metadata.tsv \
        --output-sequences fauna/results/sequences_${segment}.fasta
done
```

jameshadfield added a commit that referenced this pull request Feb 17, 2025
Using files ingested via <#127>
(see comments in that PR for how to run ingest). I then copied those
files to skip the initial download steps in the phylo workflows via:

```
cp <ingest>/fauna/results/final_metadata.tsv data/gisaid/metadata.tsv
cp <ingest>/fauna/results/sequences_*fasta data/gisaid/
```

(similar for `<ingest>/joined-ncbi` to `data/ncbi`)

Note: Running NCBI requires switching the `s3_src` in `h5n1-d1.1.yaml`
@jameshadfield
Member

jameshadfield left a comment

In terms of the steps we need to take to get this into our canonical ingested files, I think:

  • Upgrade GenoFLU to v1.06, and ideally document how to do this for next time we need to upgrade
  • Add BLAST into the runtimes. I think starting with Docker-only would be ok, but 🤷
  • Change the pipelines so that the results in results/ have genoflu calls. There's a bunch of ways to do this, and one way would be:
    • Change merge_segment_metadata to output metadata to data, and
    • Filter segments into a subdirectory (probably in data) as I think genoflu will gather up all sequences in the given directory. The filtering for GISAID would be gisaid_clade=='2.3.4.4b', unsure for NCBI / Andersen lab - maybe we call everything?
    • Run GenoFLU within this directory
    • Merge the genoflu TSV with the metadata (from merge_segment_metadata) and output to results
  • Add a coloring to our auspice configs to export this
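The merge step above could be sketched with pandas, assuming GenoFLU's output TSV has a `strain` ID column matching the metadata and a genotype column (both column names are assumptions, not taken from GenoFLU's actual output):

```python
import pandas as pd

# Toy stand-ins for the real metadata and GenoFLU TSVs; the actual
# GenoFLU column layout may differ.
metadata = pd.DataFrame({
    "strain": ["A/x/2024", "A/y/2024"],
    "clade": ["2.3.4.4b", "2.3.4.4b"],
})
genoflu = pd.DataFrame({"strain": ["A/x/2024"], "Genotype": ["B3.13"]})

# Left-join so sequences GenoFLU did not call are kept (genotype = NaN).
merged = metadata.merge(
    genoflu.rename(columns={"Genotype": "genoflu_genotype"}),
    on="strain",
    how="left",
)
print(merged["genoflu_genotype"].tolist())  # ['B3.13', nan]
```

A left join (rather than inner) matters here: only 2.3.4.4b sequences are run through GenoFLU, but the merged metadata written to `results/` must still contain every row.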

@lmoncla
Collaborator

lmoncla commented Feb 19, 2025

One comment here:

Filter segments into a subdirectory (probably in data) as I think genoflu will gather up all sequences in the given directory. The filtering for GISAID would be gisaid_clade=='2.3.4.4b', unsure for NCBI / Andersen lab - maybe we call everything?

This will work for HA, but the genotype system includes assignments for segments that are not clade 2.3.4.4b-associated. For example, D1.1 involves a low-path NA gene, which is really genetically distinct and was previously not associated with 2.3.4.4b. The clade annotations are currently assigned based on the HA sequence, but then all segments are colored by the HA clade. So as long as we are just inheriting the 2.3.4.4b designation from the HA segment, this should be fine. I hope that makes sense.
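The HA-based inheritance described above could be sketched as follows (column names are illustrative, not taken from the workflow):

```python
import pandas as pd

# Toy example: the clade is called on HA only, then every segment row for
# the same strain inherits that call - so a distinct low-path NA still
# carries its strain's HA-derived 2.3.4.4b label.
ha_clades = pd.DataFrame({"strain": ["A/x/2024"], "clade": ["2.3.4.4b"]})
segments = pd.DataFrame({
    "strain": ["A/x/2024"] * 3,
    "segment": ["ha", "na", "pb2"],
})

annotated = segments.merge(ha_clades, on="strain", how="left")
print(annotated["clade"].tolist())  # ['2.3.4.4b', '2.3.4.4b', '2.3.4.4b']
```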

@joverlee521
Contributor Author

Added dependencies for GenoFLU in docker-base in nextstrain/docker-base#242.

I am able to run the workflow in the Docker runtime with

```
nextstrain build --image nextstrain/base:branch-genoflu-deps ingest \
  joined-ncbi/results/final_metadata.tsv \
  --configfile build-configs/ncbi/defaults/config.yaml
```

jameshadfield added a commit that referenced this pull request Feb 20, 2025
Combined with the previous commit we now have the ability to produce
'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls.
This is config-controllable and is currently set up to run GenoFLU
for NCBI & Andersen lab but not for Fauna.

The number of threads for actually running GenoFLU has been increased to
12 as (from my testing) each thread has low CPU & memory usage, so
setting a large number of threads (even threads >> cores) improves
performance. We should revisit what exactly is happening here.

Note that the filtering approach for fauna may not be correct as
implemented here - see <#127 (comment)> - however fauna is currently
not run through GenoFLU.
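A config-controllable rule along these lines could implement the commit above (a hypothetical Snakemake sketch; rule, path, and config names are illustrative, not taken from this repository):

```
rule genoflu:
    input:
        # Directory holding only the sequences GenoFLU should be run on
        sequences=directory("data/{source}/for-genoflu/"),
    output:
        tsv="data/{source}/genoflu.tsv",
    threads: 12  # threads >> cores still helped in testing; per-thread CPU use is low
    shell:
        """
        python ./vendored-GenoFLU-multi/bin/genoflu-multi.py \
            -f {input.sequences} -n {threads}
        """
```

Whether the rule runs at all would then be gated on a per-data-source config flag (on for NCBI & Andersen lab, off for fauna), matching the behaviour described above.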
@jameshadfield
Member

Closing this PR as I've cherry-picked these commits into #126. Having everything in one PR allows us to easily run the ingest locally and then the phylo workflow using the locally ingested files to generate the D1.1 build.

@jameshadfield jameshadfield deleted the ingest-with-genoflu branch February 20, 2025 02:16
jameshadfield added a commit that referenced this pull request Feb 23, 2025