
WIP: Ingest with GenoFLU #127

Closed

joverlee521 wants to merge 2 commits into master from ingest-with-genoflu

Conversation

@joverlee521
Contributor

Description of proposed changes

Seems straightforward to add steps to include GenoFLU in the ingest workflow. The part that we'll have to work on is adding the dependencies for GenoFLU to the Nextstrain runtimes. This PR vendors https://github.com/moncla-lab/GenoFLU-multi with git subrepo and then adds new rules to include GenoFLU genotypes in the final metadata.

I was able to run this locally by first installing GenoFLU into my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

(`--platform osx-64` was needed to get blast installed)

Then I was able to run NCBI ingest w/ GenoFLU with

```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```

Related issue(s)

Related to #80

Checklist

  • Checks pass

…vendored-GenoFLU-multi

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "2a548a9"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "2a548a9"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Add rules to run the vendored GenoFLU on sequences. Creates a
final_metadata.tsv that has a new column `genoflu_genotype`.

This is currently not directly runnable via `nextstrain build` because
we do not have the GenoFLU dependencies in the Nextstrain runtimes.

I was able to run this locally by installing the genoflu dependencies
in my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

Then I could run the new rules for the NCBI data with
```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```
@jameshadfield
Member

jameshadfield commented Feb 17, 2025

Installation

I was able to run this locally by first installing GenoFLU into my Nextstrain conda runtime

`genoflu-multi.py` naturally imports genoflu via the sister `genoflu.py` file within the genoflu-multi repo, so we don't need to additionally install genoflu via conda - unless you were doing so to get the dependencies?

--platform osx-64 was needed to get blast installed

`brew install blast` works for osx-arm64, although I realise that's not how our managed runtimes work. That was the only dependency I needed to install - everything else is included in an environment which can run augur.

Upgrade GenoFLU to 1.06

GenoFLU is now at 1.06 whereas genoflu-multi still uses/contains 1.05, and this version bump is important for D1.1. Since I have no idea how to upgrade the fork from within a subrepo, I ended up cloning the GenoFLU repo itself and modifying `genoflu-multi.py` to use that via:

```
sys.path.insert(0, '/Users/naboo/github/GenoFLU/bin')
import genoflu as gf
```

however the Moncla lab fork makes changes to genoflu itself so this doesn't work. I think we need to rebase/merge the Moncla lab fork with its upstream and then subsequently bring those changes in to our subrepo. cc @jordan-ort
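One possible route for that rebase-then-vendor step, sketched with git-subrepo (untested; `<upstream-url>` stands in for whichever repo the Moncla lab fork tracks, which isn't specified here):

```
# 1. In a clone of moncla-lab/GenoFLU-multi, merge in the upstream changes
#    (remote/branch names are assumptions):
git remote add upstream <upstream-url>
git fetch upstream
git merge upstream/main   # or rebase, then push the fork

# 2. Back in this repo, pull the updated fork into the vendored subrepo:
git subrepo pull ingest/vendored-GenoFLU-multi
```

`git subrepo pull` re-fetches the branch recorded in the subdir's `.gitrepo` file, so step 2 should pick up whatever lands on the fork's main branch.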

Slow runtime

I didn't dive into what's happening but I saw very little CPU / memory usage while running `python ./vendored-GenoFLU-multi/bin/genoflu-multi.py -f joined-ncbi/results/ -n 4`. Overall it completed in ~20min for 3.9k genomes (NCBI joined, 4 cores).

Restrict to H5N1/2.3.4.4b

Fauna has 50k sequences which is (a) very slow to run through GenoFLU and (b) GenoFLU should only be run on H5N1/2.3.4.4b. For D1.1 purposes I did this in a very ad-hoc way via:

```
mv fauna/results fauna/results-pre-genoflu
mkdir fauna/results
for segment in mp np pa pb2 na ns pb1 ha; do
    augur filter \
        --min-date 2024 --query "gisaid_clade=='2.3.4.4b'" \
        --metadata fauna/results-pre-genoflu/metadata.tsv \
        --sequences fauna/results-pre-genoflu/sequences_${segment}.fasta \
        --output-metadata fauna/results/metadata.tsv \
        --output-sequences fauna/results/sequences_${segment}.fasta
done
```

jameshadfield added a commit that referenced this pull request Feb 17, 2025
Using files ingested via <#127>
(see comments in that PR for how to run ingest). I then copied those
files to skip the initial download steps in the phylo workflows via:

```
cp <ingest>/fauna/results/final_metadata.tsv data/gisaid/metadata.tsv
cp <ingest>/fauna/results/sequences_*fasta data/gisaid/
```

(similar for `<ingest>/joined-ncbi` to `data/ncbi`)

Note: Running NCBI requires switching the `s3_src` in `h5n1-d1.1.yaml`
@jameshadfield
Member

jameshadfield left a comment

In terms of the steps we need to take to get this into our canonical ingested files, I think:

  • Upgrade GenoFLU to v1.06, and ideally document how to do this for next time we need to upgrade
  • Add BLAST into the runtimes. I think starting with Docker-only would be ok, but 🤷
  • Change the pipelines so that the results in results/ have genoflu calls. There's a bunch of ways to do this, and one way would be:
    • Change merge_segment_metadata to output metadata to data, and
    • Filter segments into a subdirectory (probably in data) as I think genoflu will gather up all sequences in the given directory. The filtering for GISAID would be gisaid_clade=='2.3.4.4b', unsure for NCBI / Andersen lab - maybe we call everything?
    • Run GenoFLU within this directory
    • Merge the genoflu TSV with the metadata (from merge_segment_metadata) and output to results
  • Add a coloring to our auspice configs to export this
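The merge step above could be sketched with pandas, assuming GenoFLU's output TSV has a `strain` ID column matching the metadata and a genotype column (both column names are assumptions, not taken from GenoFLU's actual output):

```python
import pandas as pd

# Toy stand-ins for the real metadata and GenoFLU TSVs; the actual
# GenoFLU column layout may differ.
metadata = pd.DataFrame({
    "strain": ["A/x/2024", "A/y/2024"],
    "clade": ["2.3.4.4b", "2.3.4.4b"],
})
genoflu = pd.DataFrame({"strain": ["A/x/2024"], "Genotype": ["B3.13"]})

# Left-join so sequences GenoFLU did not call are kept (genotype = NaN).
merged = metadata.merge(
    genoflu.rename(columns={"Genotype": "genoflu_genotype"}),
    on="strain",
    how="left",
)
print(merged["genoflu_genotype"].tolist())  # ['B3.13', nan]
```

A left join (rather than inner) matters here: only 2.3.4.4b sequences are run through GenoFLU, but the merged metadata written to `results/` must still contain every row.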

@lmoncla
Collaborator

lmoncla commented Feb 19, 2025

One comment here:

Filter segments into a subdirectory (probably in data) as I think genoflu will gather up all sequences in the given directory. The filtering for GISAID would be gisaid_clade=='2.3.4.4b', unsure for NCBI / Andersen lab - maybe we call everything?

This will work for HA, but the genotype system includes assignments for segments that are not clade 2.3.4.4b-associated. For example, D1.1 involves a low-path NA gene, which is really genetically distinct and was previously not associated with 2.3.4.4b. The clade annotations are currently assigned based on the HA sequence, but then all segments are colored by the HA clade. So as long as we are just inheriting the 2.3.4.4b designation from the HA segment, this should be fine. I hope that makes sense.
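The HA-based inheritance described above could be sketched as follows (column names are illustrative, not taken from the workflow):

```python
import pandas as pd

# Toy example: the clade is called on HA only, then every segment row for
# the same strain inherits that call - so a distinct low-path NA still
# carries its strain's HA-derived 2.3.4.4b label.
ha_clades = pd.DataFrame({"strain": ["A/x/2024"], "clade": ["2.3.4.4b"]})
segments = pd.DataFrame({
    "strain": ["A/x/2024"] * 3,
    "segment": ["ha", "na", "pb2"],
})

annotated = segments.merge(ha_clades, on="strain", how="left")
print(annotated["clade"].tolist())  # ['2.3.4.4b', '2.3.4.4b', '2.3.4.4b']
```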

@joverlee521
Contributor Author

Added dependencies for GenoFLU in docker-base in nextstrain/docker-base#242.

I am able to run the workflow in the Docker runtime with

```
nextstrain build --image nextstrain/base:branch-genoflu-deps ingest \
  joined-ncbi/results/final_metadata.tsv \
  --configfile build-configs/ncbi/defaults/config.yaml
```

jameshadfield added a commit that referenced this pull request Feb 20, 2025
Combined with the previous commit we now have the ability to produce
'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls.
This is config-controllable and is currently set up to run GenoFLU
for NCBI & Andersen lab but not for Fauna.

The number of threads for actually running GenoFLU has been increased to
12 as (from my testing) each thread has low CPU & memory usage, so
setting a large number of threads (even threads >> cores) improves
performance. We should revisit what exactly is happening here.

Note that the filtering approach for fauna may not be correct as
implemented here - see <#127 (comment)> - however fauna is currently
not run through GenoFLU.
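A config-controllable rule along these lines could implement the commit above (a hypothetical Snakemake sketch; rule, path, and config names are illustrative, not taken from this repository):

```
rule genoflu:
    input:
        # Directory holding only the sequences GenoFLU should be run on
        sequences=directory("data/{source}/for-genoflu/"),
    output:
        tsv="data/{source}/genoflu.tsv",
    threads: 12  # threads >> cores still helped in testing; per-thread CPU use is low
    shell:
        """
        python ./vendored-GenoFLU-multi/bin/genoflu-multi.py \
            -f {input.sequences} -n {threads}
        """
```

Whether the rule runs at all would then be gated on a per-data-source config flag (on for NCBI & Andersen lab, off for fauna), matching the behaviour described above.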
@jameshadfield
Member

Closing this PR as I've cherry-picked these commits into #126. Having everything in one PR allows us to easily run the ingest locally and then the phylo workflow using the locally ingested files to generate the D1.1 build.

@jameshadfield jameshadfield deleted the ingest-with-genoflu branch February 20, 2025 02:16
jameshadfield added a commit that referenced this pull request Feb 23, 2025