phylogenetic build fails because of missing nextalign #2

Closed
genehack opened this issue May 14, 2024 · 5 comments

@genehack
Contributor

Current Behavior

# from repo root
nextstrain build ./ingest
# lots of output, things work

nextstrain build ./phylogenetic
# lots of output, things don't work; first error: 

/bin/bash: line 1: nextalign: command not found

Possible solution

Based on the archived repo, nextalign was moved into nextclade -- but the page linked for nextalign-cli 404s.

I'm guessing the right answer here is to update the Snakemake file to either replace the nextalign call with nextclade run (with some set of options) or, following the zika repo, convert things over to using augur for the alignment?

@kimandrews any insight you can provide would be appreciated!

@genehack genehack added the bug Something isn't working label May 14, 2024
@genehack genehack self-assigned this May 14, 2024
@kimandrews

kimandrews commented May 14, 2024

I used augur align for whole-genome alignment in the measles phylogenetic workflow, whereas I used nextclade run to align the shorter N450 region.

@victorlin
Member

The context here is that nextalign was bundled with Nextclade in v2 and removed in v3, which is probably the version you have. This pathogen repo seems to be written for Nextclade v2, though that dependency isn't stated anywhere.

I'm assuming nextalign was chosen for this repo intentionally, so potential fixes would be to (1) mention the Nextclade v2 dependency explicitly and set your environment up with that or (2) migrate to Nextclade v3 by using nextclade run as you've mentioned. (2) is probably the best move.
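
To tell which of the two situations applies on a given machine, it can help to check which executables are on the PATH. A minimal sketch, assuming a POSIX shell; the function name `find_aligner` is made up for illustration and is not part of any Nextstrain tooling:

```shell
# Hypothetical helper: report which alignment executable is available.
# Prints "nextalign", "nextclade", or "none".
find_aligner() {
    if command -v nextalign >/dev/null 2>&1; then
        echo "nextalign"   # a nextalign executable is still installed
    elif command -v nextclade >/dev/null 2>&1; then
        echo "nextclade"   # check the version with `nextclade --version`
    else
        echo "none"
    fi
}

find_aligner
```

If this prints `nextclade`, running `nextclade --version` should show whether the environment is on v2 or v3.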

@joverlee521

Ah, the workflow was created before Nextclade v3 was released.

I think we'd want to migrate it to nextclade3 run following Nextclade's migration guide.

@genehack
Contributor Author

> I think we'd want to migrate it to nextclade3 run following Nextclade's migration guide.

Thanks! I'll check out that guide.

genehack added a commit that referenced this issue May 14, 2024
* `output.insertions` will be a TSV file now
* `--reference` is now spelled `--input-ref`
* `--genemap` is now spelled `--input-annotation`
* `--retry-reverse-complement` is no longer supported
* `--output-insertions` is now spelled `--output-tsv`

Note: dropping `--retry-reverse-complement` is the one that I am most
unsure about, but this version completes this step.
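
Put together, the renames listed above suggest an invocation along the following lines. This is only a sketch: it prints the command instead of running it, every file path is a placeholder, and the use of `--output-fasta` for the aligned sequences is an assumption that does not come from the commit message. The function name `print_align_cmd` is hypothetical.

```shell
# Sketch only: print (do not run) a Nextclade v3 command that uses the
# renamed flags from the commit message. All file paths are placeholders.
print_align_cmd() {
    sequences=$1    # input FASTA (positional argument)
    reference=$2    # was --reference, now --input-ref
    annotation=$3   # was --genemap, now --input-annotation
    out_fasta=$4    # aligned sequences (assumed flag: --output-fasta)
    out_tsv=$5      # was --output-insertions, now --output-tsv
    printf 'nextclade run --input-ref %s --input-annotation %s --output-fasta %s --output-tsv %s %s\n' \
        "$reference" "$annotation" "$out_fasta" "$out_tsv" "$sequences"
}

print_align_cmd sequences.fasta reference.fasta genemap.gff aligned.fasta insertions.tsv
```

If the runtime exposes the executable as `nextclade3`, as mentioned earlier in the thread, the printed command would need that name instead.
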
genehack added a commit that referenced this issue May 14, 2024
Initially, the workflow failed with the following error:

```
Error:
   0: When reading genome annotation
   1: When reading file: "config/hku1/genemap.gff"
   2: Attempted to parse the genome annotation as JSON and as GFF, but both attempts failed:
      JSON error: invalid type: string "NC_006577.2\tfeature\tsource\t1\t29926\t.\t+\t.\tgene=nuc NC_006577.2\tfeature\tgene\t206\t13600
\t.\t+\t.\tgene=ORF1a NC_006577.2\tfeature\tgene\t13600\t21753\t.\t+\t.\tgene=ORF1b NC_006577.2\tfeature\tgene\t21773\t22933\t.\t+\t.\tg
ene=HE NC_006577.2\tfeature\tgene\t22942\t27012\t.\t+\t.\tgene=Spike NC_006577.2\tfeature\tgene\t22978\t25221\t.\t+\t.\tgene=S1 NC_00657
7.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=S2 NC_006577.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=ORF4 NC_006577.2\tfeature\tge
ne\t27373\t27621\t.\t+\t.\tgene=E NC_006577.2\tfeature\tgene\t27633\t28304\t.\t+\t.\tgene=M NC_006577.2\tfeature\tgene\t28320\t29645\t.\
t+\t.\tgene=N NC_006577.2\tfeature\tgene\t28342\t28959\t.\t+\t.\tgene=N2", expected struct GeneMap at line 2 column 1

      GFF3 error: When processing gene, 'N': When processing feature group 'N' ('N') of type 'gene': genes must consist of exactly one f
eature: Expected exactly one element, but found: 2
   2:

Location:
   /workdir/packages/nextclade/src/gene/gene_map.rs:56
```

While looking at the referenced file, and comparing it to the other
`genemap.gff` files in the config, I noticed that all the others used
`gene_name` for everything after the first `gene` line. I changed this
file to match, and the workflow got past the point where it was
previously erroring out.

I have no idea why this worked; hopefully somebody will explain in the
code review.
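
The comparison described above (which attribute key each `genemap.gff` uses on its gene rows) can be done mechanically. A rough sketch, assuming tab-separated GFF files with `key=value` attributes in column 9; the function name `gff_gene_attr_keys` is invented for this example, and the second demo row is made up to show the `gene_name` style. It only reports which keys are in use and does not explain why Nextclade accepts one form and rejects the other.

```shell
# Sketch: count which attribute keys (column 9) a GFF file uses on its
# "gene" rows, so that files can be compared at a glance.
# Assumes tab-separated columns and ";"-separated key=value attributes.
gff_gene_attr_keys() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }               # skip comments and short lines
        $3 == "gene" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] != "") count[kv[1]]++
            }
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | sort
}

# Demo on a throwaway file: the first row is taken from the error output,
# the second is an illustrative variant using gene_name.
demo=$(mktemp)
printf 'NC_006577.2\tfeature\tgene\t206\t13600\t.\t+\t.\tgene=ORF1a\n' > "$demo"
printf 'NC_006577.2\tfeature\tgene\t13600\t21753\t.\t+\t.\tgene_name=ORF1b\n' >> "$demo"
gff_gene_attr_keys "$demo"
rm -f "$demo"
```

Running the function over each `genemap.gff` in the config directory and comparing the outputs would show which file deviates from the rest.
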
@ivan-aksamentov
Member

ivan-aksamentov commented May 17, 2024

We should probably also document the Nextalign-like usage in the main Nextclade docs, i.e. using Nextclade v3 without a dataset and providing individual files using --input-* args instead. Invoking Nextclade v3 with individual args is mostly the same as, or very similar to, invoking Nextalign v2. And I believe that swapping the nextclade executable in place of nextalign should produce somewhat informative errors.

Documenting it better would allow for smoother transition for v2 users and also highlight that Nextclade v3 can be used as an aligner even where there's no dataset for a particular organism.

Upd: I created an issue: nextstrain/nextclade#1456
