-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running issue nextstrain pipeline #149
Comments
Sorry you ran into this error @sekhwal! Looks like we need to update the instructions in the Quickstart with GISAID data. The
Please try running the workflow again with:
|
When I run the following command on the "seasonal flu" repo, it runs and generates files in auspice folder. But it seems it does not include the fasta and metadata files that I provided. Please let me know how to run the pipeline with real data that I downloaded from gisaid and one sequence that I my own. I am using nextstrain pipeline on the conda environment using "conda activate clade_env". I followed all the steps of "https://github.com/nextstrain/seasonal-flu". Created folder data/h3n2 and added metadata.xls and raw_sequences_ha.fasta. nextstrain shell . nextstrain view auspice/ |
Just to clarify, did you add your own sequence to |
Yes, I have added my own sequence information to metadata.xls that I downloaded from GISAID and the fasta sequence to raw_sequences_ha.fasta and placed these files in data/h3n2 folder. When I run the pipeline, it generates files "ha.fasta", metadata.tsv in data/h3n2 folder and in addition "auspice", "benchmarks", "builds" and "logs". Then I run the command "nextstrain view auspice/" so it generates the phylogenetic tree but I can not find my own sequence in the tree. |
Do you think that my own sequence belongs to clade "3C.2a1b.2a.2b" and in the metadata, other sequences that I downloaded from GISAID, none of them are not from this clade, so I am not getting it in the tree? |
The `nextstrain build` command should be the same regardless of which runtime the user has installed as the default. Snakemake's `--use-conda` flag also causes an error when used with the Nextstrain managed Conda runtime as described in #149
Ah, I see. Your sequences may be randomly removed during subsampling. Subsampling parameters are defined in the seasonal-flu/profiles/gisaid/builds.yaml Lines 45 to 47 in 13a4211
These parameters are being used by For your specific case of wanting to include your own sequences in the tree, it is possible to force sequences to be included regardless of subsampling parameters by using the
filters: "--group-by region year month --subsample-max-sequences 100 --include profiles/gisaid/include.txt"
|
Just little more clarification. What should I include in the include.txt, a fasta sequence of my own file or the ID of my sequence that I added it in raw_sequences_ha.fasta and metadata.xls. Is there any template of include.txt? |
The |
I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml filters: "--group-by region year month --subsample-max-sequences 1000 --include profiles/gisaid/include.txt" However, it is not showing in the tree. ##It is my seq header
|
Here is the clades_h3n2.txt log file, I think there might be an issue of clade mismatching? |
Sorry, when I said the sequence ID, I mean the first part of the FASTA header. |
Sorry, even if I add this strain name "A/Pennsylvania/01/2022" to include.txt, it shows the following error. ##error [Fri Feb 16 16:54:04 2024]
[Fri Feb 16 16:54:04 2024] |
There should be more error details in the log file. Can you check the |
I tried running the pipeline in different ways, but it does not include my sequence in the tree. Also when I filter the gisaid downloaded sequences such as one seq per clade to make less number of sequences, the pipeline does not allow this formatted file and generates an error. I think it is strictly designed to run GISAID formatted data only. I will use multiple seq alignment software such as mafft and will run Treetime and auspice. |
It was nice to meet you in office hours today @sekhwal! In case you lose the link that @huddlej had shared during office hours, here's the example of the more complex subsampling scheme: seasonal-flu/profiles/nextstrain-public.yaml Lines 78 to 86 in d439f89
|
Thank you for sharing the sub-sampling scheme, I would like to discuss in our next week meeting to understand more about the scheme. When I download the data from 2018 to 2023 from GISAID, it comes a large datasets. How I can use the sub-sampling scheme to filter the data and get a decent number of dataset to make a tree. |
Also, I tried running seasonal_flu repo with include.txt but still I am not able to find my seq in the analysis. I am using the filter "filters: "--group-by region year month --subsample-max-sequences 3300 --include profiles/gisaid/include.txt" in the builds.yaml. |
@sekhwal and I met today to walk through the workflow and figure out where his local sequence was getting dropped. We found that it was dropped at the "flag outliers" step because its collection date was later than expected by the defined clock rate. We added an @sekhwal is going to try the analysis again with more sequences from 2021 and 2022 so his sequence has more appropriate temporal context. Still, we may want to add an option to the workflow that allows users to disable the clock filter in augur refine. |
I am trying to run the nextstrain pipeline, I did following steps but I am getting an error.
Step 1. I downloaded seasonal-flu repo and created a directory called "data/h3n2", and placed the fasta files. 1. raw_sequences_ha.fasta (downloaded from the gisaid database 2. raw_ha.fasta (my query H3N2 HA fasta file).
Step 2. build the "seasonal-flu/profiles/gisaid/build.yaml" file
Step 3. Install nextstrain cli on conda and run the following command.
nextstrain build . --configfile profiles/gisaid/builds.yaml --use-conda --conda-frontend mamba
I do not have lat-longs: "config/lat_longs.tsv" file so I am using the default and not have a metadata file in the data/h3n2 folder.
##error
Building DAG of jobs...
/home/mmk6053/.nextstrain/runtimes/conda/env/bin/bash: line 1: conda: command not found
Traceback (most recent call last):
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/init.py", line 792, in snakemake
success = workflow.execute(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 1078, in execute
dag.create_conda_envs(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 359, in create_conda_envs
env.create(dryrun)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 393, in create
pin_file = self.pin_file
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/init.py", line 217, in get
value = self.method(instance)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 103, in pin_file
f".{self.conda.platform}.pin.txt"
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/init.py", line 217, in get
value = self.method(instance)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 96, in conda
return Conda(
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 653, in init
shell.check_output(self._get_cmd("conda info --json"), text=True)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/shell.py", line 61, in check_output
return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.
###builds.yaml
custom_rules:
metadata_fields:
renamed_metadata_fields:
lat-longs: "config/lat_longs.tsv"
segments:
submission_date_field: date
recency:
date_bins: [7, 30, 90]
date_bin_labels: ["last week", "last month", "last quarter"]
upper_bin_label: older
builds:
"h3n2":
lineage: h3n2
reference: "config/h3n2/{segment}/reference.fasta"
annotation: "config/h3n2/{segment}/genemap.gff"
tree_exclude_sites: "config/h3n2/{segment}/exclude-sites.txt"
clades: "config/h3n2/ha/clades.tsv"
subclades: "config/h3n2/ha/subclades.tsv"
auspice_config: "config/h3n2/auspice_config.json"
enable_lbi: true
enable_glycosylation: true
subsamples:
global:
filters: "--group-by region year month --subsample-max-sequences 100"
The text was updated successfully, but these errors were encountered: