Running issue nextstrain pipeline #149

sekhwal · 2024-02-15T03:03:22Z

I am trying to run the nextstrain pipeline, I did following steps but I am getting an error.

Step 1. I downloaded seasonal-flu repo and created a directory called "data/h3n2", and placed the fasta files. 1. raw_sequences_ha.fasta (downloaded from the gisaid database 2. raw_ha.fasta (my query H3N2 HA fasta file).

Step 2. build the "seasonal-flu/profiles/gisaid/build.yaml" file

Step 3. Install nextstrain cli on conda and run the following command.

nextstrain build . --configfile profiles/gisaid/builds.yaml --use-conda --conda-frontend mamba

I do not have lat-longs: "config/lat_longs.tsv" file so I am using the default and not have a metadata file in the data/h3n2 folder.

##error
Building DAG of jobs...
/home/mmk6053/.nextstrain/runtimes/conda/env/bin/bash: line 1: conda: command not found
Traceback (most recent call last):
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/init.py", line 792, in snakemake
success = workflow.execute(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 1078, in execute
dag.create_conda_envs(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 359, in create_conda_envs
env.create(dryrun)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 393, in create
pin_file = self.pin_file
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/init.py", line 217, in get
value = self.method(instance)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 103, in pin_file
f".{self.conda.platform}.pin.txt"
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/init.py", line 217, in get
value = self.method(instance)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 96, in conda
return Conda(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 653, in init
shell.check_output(self._get_cmd("conda info --json"), text=True)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/shell.py", line 61, in check_output
return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.

###builds.yaml
custom_rules:

profiles/gisaid/prepare_data.smk

metadata_fields:

Identifier
Collection_Date
Clade

renamed_metadata_fields:

strain
date
clade

lat-longs: "config/lat_longs.tsv"

segments:

ha

submission_date_field: date

recency:
date_bins: [7, 30, 90]
date_bin_labels: ["last week", "last month", "last quarter"]
upper_bin_label: older

builds:
"h3n2":
lineage: h3n2
reference: "config/h3n2/{segment}/reference.fasta"
annotation: "config/h3n2/{segment}/genemap.gff"
tree_exclude_sites: "config/h3n2/{segment}/exclude-sites.txt"
clades: "config/h3n2/ha/clades.tsv"
subclades: "config/h3n2/ha/subclades.tsv"
auspice_config: "config/h3n2/auspice_config.json"
enable_lbi: true
enable_glycosylation: true
subsamples:
global:
filters: "--group-by region year month --subsample-max-sequences 100"

joverlee521 · 2024-02-15T18:16:48Z

Sorry you ran into this error @sekhwal! Looks like we need to update the instructions in the Quickstart with GISAID data.

The --use-conda flag should not be used with the Nextstrain managed environments.
As @victorlin previously stated in a separate issue:

--use-conda falls under the advanced use case of managing Conda environments, which we recommend the ambient runtime for.

Please try running the workflow again with:

nextstrain build . --configfile profiles/gisaid/builds.yaml

sekhwal · 2024-02-16T01:50:36Z

When I run the following command on the "seasonal flu" repo, it runs and generates files in auspice folder. But it seems it does not include the fasta and metadata files that I provided. Please let me know how to run the pipeline with real data that I downloaded from gisaid and one sequence that I my own.

I am using nextstrain pipeline on the conda environment using "conda activate clade_env". I followed all the steps of "https://github.com/nextstrain/seasonal-flu". Created folder data/h3n2 and added metadata.xls and raw_sequences_ha.fasta.

nextstrain shell .
nextstrain build . --configfile profiles/gisaid/builds.yaml

nextstrain view auspice/

joverlee521 · 2024-02-16T18:52:56Z

Just to clarify, did you add your own sequence to metadata.xls and raw_sequences_ha.fasta? Or is it in separate files? The workflow currently does not support multiple inputs, so you would have to manually add your own data to the GISAID data.

sekhwal · 2024-02-16T19:55:40Z

Yes, I have added my own sequence information to metadata.xls that I downloaded from GISAID and the fasta sequence to raw_sequences_ha.fasta and placed these files in data/h3n2 folder. When I run the pipeline, it generates files "ha.fasta", metadata.tsv in data/h3n2 folder and in addition "auspice", "benchmarks", "builds" and "logs". Then I run the command "nextstrain view auspice/" so it generates the phylogenetic tree but I can not find my own sequence in the tree.
If possible can I email you the metadata.xls file and the "raw_sequences_ha.fasta".

sekhwal · 2024-02-16T19:58:43Z

Do you think that my own sequence belongs to clade "3C.2a1b.2a.2b" and in the metadata, other sequences that I downloaded from GISAID, none of them are not from this clade, so I am not getting it in the tree?

The `nextstrain build` command should be the same regardless of which runtime the user has installed as the default. Snakemake's `--use-conda` flag also causes an error when used with the Nextstrain managed Conda runtime as described in #149

joverlee521 · 2024-02-16T20:30:48Z

Ah, I see. Your sequences may be randomly removed during subsampling.

Subsampling parameters are defined in the profiles/gisaid/builds.yaml file:

seasonal-flu/profiles/gisaid/builds.yaml

Lines 45 to 47 in 13a4211

    
           subsamples: 
        
             global: 
        
                 filters: "--group-by region year month --subsample-max-sequences 100"

These parameters are being used by augur filter within the workflow. You can read more about augur filter to determine if you want to change the subsampling parameters.

For your specific case of wanting to include your own sequences in the tree, it is possible to force sequences to be included regardless of subsampling parameters by using the --include flag.

Create a text file that lists the sequences that you want to force include, with a single strain name per line.
This file must be added within the seasonal-flu directory, for example you can add it to profiles/gisaid/include.txt.
Edit the parameters in the profiles/gisaid/builds.yaml file to use your include.txt file.

filters: "--group-by region year month --subsample-max-sequences 100 --include profiles/gisaid/include.txt"

Re-run the workflow with the --forceall flag.

nextstrain build . --configfile profiles/gisaid/builds.yaml --forceall

sekhwal · 2024-02-16T20:49:18Z

Just little more clarification. What should I include in the include.txt, a fasta sequence of my own file or the ID of my sequence that I added it in raw_sequences_ha.fasta and metadata.xls. Is there any template of include.txt?

joverlee521 · 2024-02-16T20:56:25Z

The include.txt should have a single sequence ID per line, matching the sequence ID used in the metadata and FASTA file.

sekhwal · 2024-02-16T21:16:15Z

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

filters: "--group-by region year month --subsample-max-sequences 1000 --include profiles/gisaid/include.txt"

However, it is not showing in the tree.

##It is my seq header

A/Pennsylvania/01/2022 | EPI_ISL_170 | A / H3N2 | |3C.2a1b.2a.2b | | 2022-01-01

##metadata file screenshot

sekhwal · 2024-02-16T21:26:51Z

Here is the clades_h3n2.txt log file, I think there might be an issue of clade mismatching?

clades_h3n2.txt

joverlee521 · 2024-02-16T21:30:30Z

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

Sorry, when I said the sequence ID, I mean the first part of the FASTA header.
You should add the strain name "A/Pennsylvania/01/2022" to include.txt.

sekhwal · 2024-02-16T21:55:56Z

Sorry, even if I add this strain name "A/Pennsylvania/01/2022" to include.txt, it shows the following error.

##error

[Fri Feb 16 16:54:04 2024]
Error in rule annotate_epiweeks:
jobid: 17
input: builds/h3n2/metadata.tsv
output: builds/h3n2/epiweeks.json
log: logs/annotate_epiweeks_h3n2.txt (check log file(s) for error details)
conda-env: /home/mmk6053/Dropbox/influenza_data/seasonal-flu/.snakemake/conda/b2dc7dab0f55bf6c5699067912b18fca_
shell:

    python3 scripts/calculate_epiweek.py             --metadata builds/h3n2/metadata.tsv             --output-node-data builds/h3n2/epiweeks.json 2>&1 | tee logs/annotate_epiweeks_h3n2.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Fri Feb 16 16:54:04 2024]
Finished job 18.
5 of 26 steps (19%) done
[Fri Feb 16 16:54:04 2024]
Finished job 9.
6 of 26 steps (23%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-02-16T165400.558340.snakemake.log

joverlee521 · 2024-02-16T22:08:37Z

There should be more error details in the log file. Can you check the logs/annotate_epiweeks_h3n2.txt file?

sekhwal · 2024-02-20T00:44:53Z

I tried running the pipeline in different ways, but it does not include my sequence in the tree. Also when I filter the gisaid downloaded sequences such as one seq per clade to make less number of sequences, the pipeline does not allow this formatted file and generates an error. I think it is strictly designed to run GISAID formatted data only.

I will use multiple seq alignment software such as mafft and will run Treetime and auspice.

joverlee521 · 2024-02-22T19:14:34Z

It was nice to meet you in office hours today @sekhwal! In case you lose the link that @huddlej had shared during office hours, here's the example of the more complex subsampling scheme:

seasonal-flu/profiles/nextstrain-public.yaml

Lines 78 to 86 in d439f89

    
           subsamples: &subsampling-scheme 
        
             regions_except_europe: 
        
                 filters: --query "(passage_category != 'egg') & (region != 'Europe') & (ha == True) & (na == True)" --group-by region year month --subsample-max-sequences 2700 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg 
        
                 priorities: "titers" 
        
             europe: 
        
                 filters: --query "(passage_category != 'egg') & (region == 'Europe') & (ha == True) & (na == True)" --group-by country year month --subsample-max-sequences 300 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg 
        
                 priorities: "titers" 
        
             references: 
        
                 filters: --query "(is_reference == True)" --min-date {reference_min_date} --exclude {exclude} --exclude-where passage=egg

sekhwal · 2024-02-23T21:17:04Z

Thank you for sharing the sub-sampling scheme, I would like to discuss in our next week meeting to understand more about the scheme. When I download the data from 2018 to 2023 from GISAID, it comes a large datasets. How I can use the sub-sampling scheme to filter the data and get a decent number of dataset to make a tree.

sekhwal · 2024-02-23T21:27:52Z

Also, I tried running seasonal_flu repo with include.txt but still I am not able to find my seq in the analysis. I am using the filter "filters: "--group-by region year month --subsample-max-sequences 3300 --include profiles/gisaid/include.txt" in the builds.yaml.

huddlej · 2024-03-01T21:52:39Z

@sekhwal and I met today to walk through the workflow and figure out where his local sequence was getting dropped. We found that it was dropped at the "flag outliers" step because its collection date was later than expected by the defined clock rate. We added an include entry to the GISAID build config to keep his strain from getting added to those outliers. After this change, the strain got pruned by augur refine's clock filter. We figured that the sequence was dropped because it was a 2022 sequence in a tree with sequences collected up to December 2020, so the sequence would appear to be an outlier. We also realized that the collection date had an ambiguous month and day, so we modified the starting metadata to have a collection date of 2022-XX-XX. This change should force augur refine to infer the likely date of the sequence from the rest of the sequences in the tree.

@sekhwal is going to try the analysis again with more sequences from 2021 and 2022 so his sequence has more appropriate temporal context. Still, we may want to add an option to the workflow that allows users to disable the clock filter in augur refine.

sekhwal added the bug Something isn't working label Feb 15, 2024

joverlee521 transferred this issue from nextstrain/nextstrain.org Feb 15, 2024

joverlee521 mentioned this issue Feb 16, 2024

GISAID Quickstart: remove Conda runtime specific command #151

Merged

1 task

joverlee521 mentioned this issue Feb 16, 2024

Integrating FASTA Sequence into Influenza A/H3N2 Evolution Analysis and Visualizing in Nextclade #147

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running issue nextstrain pipeline #149

Running issue nextstrain pipeline #149

sekhwal commented Feb 15, 2024 •

edited

Loading

joverlee521 commented Feb 15, 2024

sekhwal commented Feb 16, 2024 •

edited

Loading

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024 •

edited

Loading

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 20, 2024 •

edited

Loading

joverlee521 commented Feb 22, 2024

sekhwal commented Feb 23, 2024

sekhwal commented Feb 23, 2024

huddlej commented Mar 1, 2024

Running issue nextstrain pipeline #149

Running issue nextstrain pipeline #149

Comments

sekhwal commented Feb 15, 2024 • edited Loading

I do not have lat-longs: "config/lat_longs.tsv" file so I am using the default and not have a metadata file in the data/h3n2 folder.

joverlee521 commented Feb 15, 2024

sekhwal commented Feb 16, 2024 • edited Loading

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024 • edited Loading

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 16, 2024

joverlee521 commented Feb 16, 2024

sekhwal commented Feb 20, 2024 • edited Loading

joverlee521 commented Feb 22, 2024

sekhwal commented Feb 23, 2024

sekhwal commented Feb 23, 2024

huddlej commented Mar 1, 2024

sekhwal commented Feb 15, 2024 •

edited

Loading

sekhwal commented Feb 16, 2024 •

edited

Loading

sekhwal commented Feb 16, 2024 •

edited

Loading

sekhwal commented Feb 20, 2024 •

edited

Loading