Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allows for CDS (as well as gene) features to generate a new gene reference #55

Merged
merged 2 commits into from
Mar 12, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Mar 8, 2024

Description of proposed changes

It seems that we intend to use the newreference.py script as a template for creating gene trees for other viruses (e.g. dengue, measles). The newreference.py script processes a GenBank file, producing new gene reference files (GenBank and FASTA) which are required for creating the Nextstrain gene tree views.

When using the scripts/newreference.py to generate an E gene reference for Dengue:

curl "https://raw.githubusercontent.com/nextstrain/dengue/main/phylogenetic/config/reference_dengue_all.gb" > dengue_reference.gb
python scripts/newreference.py \
  --reference dengue_reference.gb \
  --output-fasta d.fa \
  --output-genbank d.gb \
  --gene E

Encountered error

Traceback (most recent call last):
  File "/Users/jchang3/github/nextstrain/rsv_branches/main/scripts/newreference.py", line 43, in <module>
    new_reference(args.reference, args.output_genbank, args.output_fasta, args.gene)
  File "/Users/jchang3/github/nextstrain/rsv_branches/main/scripts/newreference.py", line 18, in new_reference
    record = ref[startofgene:endofgene]
UnboundLocalError: local variable 'startofgene' referenced before assignment

The underlying issue was that the script is looking for a "gene" feature" and was ignoring "CDS" features and this PR extends the functionality to CDS.

Looking at the measles genbank reference, it will also require the CDS functionality to pull out the N gene.

Related issue(s)

Checklist

  • Checks pass

…rence

The newreference.py script processes a GenBank file, producing new gene
reference files (GenBank and FASTA) which are required for creating the
Nextstrain gene tree views.

As we intend to use this script as a template for other viruses (e.g. dengue, measles),
it was necessary to allow the use of CDS features in the GenBank file, not just the gene.
@rneher
Copy link
Member

rneher commented Mar 8, 2024

for generating whole genome references, this script might also be useful:

https://github.com/nextstrain/nextclade_data/blob/master/docs/example-workflow/scripts/generate_from_genbank.py

We use this to bootstrap nextclade datasets

for feature in ref.features:
if feature.type == 'source':
ref_source_feature = feature
if feature.type =='gene':
if feature.type =='gene' or feature.type == 'CDS':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd use a centralised GenBank parser, such as augur.utils.load_features. It seem like our parser only considers CDS feature types at the moment, but that could be easily extended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I'm open to using augur.utils.load_features. Ah, is there an example script that can be easily referenced?

For extending to "gene feature types" is this basically adding a duplicate code block like the following?

if feat.type=='gene':
    # ... copy internal code from "if feat.type=='CDS'"

I assume if both a "gene" and a "CDS" feature are named "E", one won't overwrite the nucleotide coordinates of the other (although hopefully they match)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for nextstrain/augur#1435! Upon review, and in order to not hold up progress here (and measles), I'd suggest using the code as it stands and then we can move to using a util function once we're happy with the changes there.

Copy link

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, left a single non-blocking code style suggestion.

scripts/newreference.py Outdated Show resolved Hide resolved
Currently, if the --gene option is used and the gene name is not found,
the script will use the entire genome which may cause some silent undesired
behaviors.

This commit changes that such that the script will error out if the gene
is not found in the GenBank file as this indicates the gene name may be
misspelled or the user may be using the wrong GenBank file.

If the --gene option is not used, the script will continue to process the
entire genome as expected.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
@j23414 j23414 force-pushed the allow-cds-in-new-gene-reference branch from 95b7446 to 8cd6a13 Compare March 12, 2024 22:19
@j23414 j23414 merged commit 1df2377 into master Mar 12, 2024
3 checks passed
@j23414 j23414 deleted the allow-cds-in-new-gene-reference branch March 12, 2024 22:27
@j23414
Copy link
Contributor Author

j23414 commented Mar 21, 2024

Just tagging that this caused some downstream bugs where "CDS" and "gene" coordinates did not match, subsequently fixed by #59 (comment)

In the future we should pick a preference for "gene" or "CDS" coordinates and update scripts accordingly.

@joverlee521
Copy link
Contributor

In the future we should pick a preference for "gene" or "CDS" coordinates and update scripts accordingly.

Agree, definitely better to make this choice explicit.
We can add a --feature-type flag and maybe rename --gene to --feature-name?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants