keeping terminal repeats #9

Closed
xvazquezc opened this issue Apr 19, 2024 · 2 comments

@xvazquezc

Hi there,

First of all, I must say this is the only program that managed to annotate many of the normal viral marker genes in a very weird virus we are working on.

Second, is there a way to keep the terminal repeats at both ends after re-circularisation? Quite a few downstream analysis tools, e.g. CheckV, don't like it when they are missing.

Thank you in advance,
Xabi

@mtisza1
Owner

mtisza1 commented Apr 19, 2024

Xabi,

Thanks for your kind words.

I hope using the flag --wrap F will do the trick for you. This flag will leave DTRs at the ends. However, it also means the contig will not be re-oriented/re-circularized.

If you still want the re-orientation behavior but want the DTRs as well, you'll have to come up with a custom downstream solution after running this tool with --wrap T. I suggest a script that reads the CT3 output *_virus_summary.tsv and *_virus_sequences.fna files, then adds a short (e.g. 25 nt) DTR to the end of contigs with the DTR label. Then pipe these modified .fna files to CheckV. You could use biopython in a python script or seqkit in a bash script to accomplish this.
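A rough sketch of such a script is below. It is only an illustration under several assumptions: that the summary table has a `contig` column matching the FASTA record IDs and an `end_feature` column whose value is `DTR` for the relevant contigs (check these names against your own *_virus_summary.tsv), and that "adding a DTR" means copying the first N bases of the wrapped sequence onto its end. It uses only the Python standard library; biopython's `SeqIO` could replace the small FASTA reader.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def dtr_contig_ids(summary_handle, id_col="contig", label_col="end_feature"):
    """Return the IDs of contigs labelled as DTR in the summary table.

    The column names are assumptions, not confirmed CT3 output names.
    """
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return {row[id_col] for row in reader if row[label_col] == "DTR"}


def add_dtr(records, dtr_ids, dtr_len=25):
    """Append the first dtr_len bases to the end of each DTR-labelled contig."""
    for header, seq in records:
        seq_id = header.split()[0] if header else ""
        if seq_id in dtr_ids and len(seq) > dtr_len:
            seq = seq + seq[:dtr_len]
        yield header, seq


def write_fasta(records, handle):
    """Write (header, sequence) pairs as unwrapped FASTA."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

To use it, open the two CT3 output files, pass the summary handle to `dtr_contig_ids`, feed `read_fasta` through `add_dtr`, and write the result to a new .fna file that can then be given to CheckV.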

Unrelatedly, if you notice that this tool is missing any hallmark genes on your very weird viruses, it would help me out if you could share those protein sequences (to be included in database updates). This would totally be up to you, of course.

Mike

@xvazquezc
Author

I completely misunderstood the wrap function 🤦

Without getting into details: although it is a jumbo virus by genome size, it is hard to say whether it belongs to any known class. cenotetaker3 only detected 3 hallmark genes out of nearly 600 (not surprising, based on results with other tools), and I'm not convinced by one of them (MCP): PF13252.5 matches about 120 aa of a ~1700 aa protein.

I noticed that much of the evidence coming from PDB derives from complexes, and so does the evidence_description. However, the evidence_accession points to specific chains in those complexes. For example, I got 3 instances of "DNA-directed-RNA-polymerase-II" that refer to PDB:4V1N_M. The accession 4V1N corresponds to "Architecture of the RNA polymerase II-Mediator core transcription initiation complex", but subunit M specifically corresponds to the "transcription initiation factor IIB" and not the DNA-directed-RNA-polymerase-II.

@mtisza1 mtisza1 closed this as completed Jun 28, 2024