Circularity detection and BlastN #15
Comments
Kailun,
Thanks for using the tool and for opening an issue. I think I have answers to both of your questions. I apologize for the slow reply; I haven't been at work most of this week, so I'm only now getting a chance to look at your question.
Regarding the differences with circularity detection: The problem is that your […]
Regarding the BLASTN step: The blastn DB needs to be downloaded locally. If the entirety of […]
Please let me know if this helps.
Mike
Hello Mike,
I really appreciate your response; it helped me fix the circularity detection issue. I basically just replaced "-" with "_". I guess some contigs annotated as "metagenomic plasmid" would simply not be labeled as DTR?
Also, I found that some of the circular contigs included in the "all_circular_contigs.fna" file are not summarized in the "CONTIG_SUMMARY.tsv" file; see all_circular_contigs_1003_01.fna.txt and 1003_01_CONTIG_SUMMARY.tsv.txt. I hope this is correct.
For BLASTN, I downloaded the nt files onto our virtual system, but it still did not detect any high-coverage hits. Here are my steps using Linux:

    module load miniconda3
    CONDA_BASE=$(conda info --base)
    source $CONDA_BASE/etc/profile.d/conda.sh
    # activate env
    conda activate /opt/apps/labs/gdlab/envs/cenote-taker2/2.1.3/cenote-taker2_env
    CENOTE_BASE="/opt/apps/labs/gdlab/envs/cenote-taker2/2.1.3/Cenote-Taker2"
    DATA_BASE="/scratch/gdlab/kailun/Phageome/NEC/metaspades"
    sample=`sed -n ${SLURM_ARRAY_TASK_ID}p ${DATA_BASE}/scaffold_list.txt`
    sample_out=`sed -n ${SLURM_ARRAY_TASK_ID}p ${DATA_BASE}/scaffold_list_new.txt`
    set -x
    time ${CENOTE_BASE}/run_cenote-taker2.py \
      -c ${DATA_BASE}/scaffolds/${sample}.fasta \
      -r ${sample_out} \
      -p True \
      -m 30 \
      --orf-within-orf True \
      --known_strains blast_knowns \
      --blastn_db /scratch/ref/gdlab/blast_db/nt_2021_10_07 \
      -t ${SLURM_CPUS_PER_TASK}
    RC=$?
    set +x
    if [ $RC -eq 0 ]
    then
      echo "Job completed successfully"
    else
      echo "Error Occured!"
      exit $RC
    fi

The log file:
I'm not sure whether there really are no conserved sequences or whether I did something wrong. It does not seem to be related to the format. In your experience, what rate of high-coverage hit detection would you expect?
Another question: are bacterial false positives removed during the Cenote-Taker2 analysis?
Thank you!
Best,
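The "-" to "_" replacement mentioned in this comment can be sketched as follows. The script above reads input names from scaffold_list.txt and run titles (passed to -r) from scaffold_list_new.txt, so this sketch assumes the second list is simply the first with dashes replaced; that relationship is inferred from the thread, not stated in it. The function name `make_run_titles` is hypothetical.

```shell
# Hypothetical helper (not part of Cenote-Taker2): write a copy of a
# sample list in which every "-" is replaced by "_", for use as run titles.
# $1: input list, one sample name per line
# $2: output list with dashes replaced
make_run_titles() {
  sed 's/-/_/g' "$1" > "$2"
}

# Example call (paths taken from the script above):
#   make_run_titles "${DATA_BASE}/scaffold_list.txt" "${DATA_BASE}/scaffold_list_new.txt"
```

Because both lists keep the same line order, the same `${SLURM_ARRAY_TASK_ID}` still selects matching input and output names.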
Kailun,
I'm glad we could fix some of your issues. Yes, plasmids are not identified by being circular contigs, but rather by their genetic content.
Regarding circular sequences that are not in the contig summary table: I actually suggest using […] A final note on this: you could use the option […]
Regarding BLASTN: I think you are specifying the BLASTN directory instead of the BLASTN database name. For example, I have a directory […] I would use […]
Regarding bacterial false positives: I'm not exactly sure what you mean, but I can say that Cenote-Taker 2 is not perfect and has some false-positive rate. Using […] On the other hand, if you are asking whether Cenote-Taker 2 removes flanking bacterial genes from contigs containing an integrated prophage, then yes, it does if you use […]
Let me know if this helps or if you have any other questions.
Mike
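The directory-versus-database-name distinction above can be checked before launching a run. The sketch below relies on the usual BLAST nucleotide database layout, in which a single-volume database has a `<prefix>.nsq` file and a multi-volume database such as nt has an alias file `<prefix>.nal`. The function name `looks_like_blastn_db` is hypothetical, and the actual database name inside the directory used in this thread is not known.

```shell
# Hypothetical helper: succeed only if "$1" looks like a BLAST nucleotide
# database prefix (directory plus database name) rather than a bare directory.
looks_like_blastn_db() {
  # A directory on its own is not a database name.
  if [ -d "$1" ]; then
    return 1
  fi
  # Single-volume databases ship <prefix>.nsq; multi-volume databases
  # ship an alias file <prefix>.nal that points at the volumes.
  [ -e "$1.nsq" ] || [ -e "$1.nal" ]
}

# Example call, with <db_name> standing in for the real database name:
#   looks_like_blastn_db /scratch/ref/gdlab/blast_db/nt_2021_10_07/<db_name>
```

If the check fails for the value given to `--blastn_db`, the path is probably a directory or is missing the database name. Running `blastdbcmd -db <path> -info` from BLAST+ is a more thorough test.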
Hello Mike,
BLASTN is now working for me. Thanks a lot for your detailed suggestions! Although it doubled the running time compared with running without BLAST, I'm really glad to get this information.
For the false negatives, you suggested inspecting genome maps. That makes a lot of sense when I look at the map of each sequence. However, I generated about 25k sequences, and I'm not sure I can look through all of them. Do you have any experience doing this more efficiently?
For identifying bacterial false positives, I'm using […]
Thank you for your help again! Please let me know if you have any good methods for map inspection.
Best,
Kailun,
Great to hear! Yes, the BLASTN step does take a while (when using all of nt as a database), and I haven't figured out how to speed it up any further.
25k sequences is certainly a lot to look through. (Also, please note that metagenomic plasmids and conjugative transposons should NOT be considered viruses.)
My unsolicited advice for this project is to use caution with such large data. Personally, I would discard "linear" (i.e. non-DTR/circular and non-ITR) contigs under 5 kb, and would again urge you to use […]
I've played around a little with the BUSCO approach described in the excellent study above. It's OK, but (in my opinion) there are 2 points that it fails to account for: […]
Best,
Mike
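The suggestion above to discard "linear" contigs under 5 kb could be applied to a summary table along these lines. This is only a sketch: it assumes a tab-separated file with a header row, and the column names `LENGTH` and `END_FEATURE` and the values `DTR`/`ITR` are assumptions based on the terms used in this thread, so they are passed as overridable parameters. The function name `filter_short_linear` is hypothetical.

```shell
# Hypothetical helper: print the header plus every row that is either at
# least min_len bases long or has a DTR/ITR end feature.
# $1: summary TSV with a header row
# $2: minimum length in bases (default 5000)
# $3: name of the length column (assumed default: LENGTH)
# $4: name of the end-feature column (assumed default: END_FEATURE)
filter_short_linear() {
  awk -F '\t' -v min="${2:-5000}" -v lcol="${3:-LENGTH}" -v ecol="${4:-END_FEATURE}" '
    NR == 1 {
      # Locate the two columns by name rather than by position.
      for (i = 1; i <= NF; i++) {
        if ($i == lcol) l = i
        if ($i == ecol) e = i
      }
      if (!l || !e) { print "expected columns not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($l + 0) >= (min + 0) || $e == "DTR" || $e == "ITR"
  ' "$1"
}
```

Looking columns up by header name keeps the filter usable if the column order differs from what is assumed here.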
Hi Mike,
Really good points about BUSCO. Just to be clear: because of point 1, I would probably remove some prophages, right?
I'll definitely study CheckV in depth. From my superficial understanding, sequences with a higher level of completeness or quality are more likely to be real viral genomes, and I need to take more care in judging sequences with lower completeness.
Thanks!
Best,
I think the BUSCO approach could remove a small number of prophages, but you could use the approach and just filter the prophage/mobile genetic element replication gene HMMs from your "hits". I don't recall which HMMs these were, however. I think there are several HMM sets you can choose from when using the BUSCO tool.
Yes, the main function of CheckV is to assign completeness to putative viral sequences, the second function is to prune prophage regions, and the third (anicalc/aniclust) is to cluster similar sequences. With a large-scale project like yours, you might consider setting a CheckV cutoff (e.g. 10% completeness).
Good luck!
Mike
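A completeness cutoff like the one suggested above could be applied to CheckV's quality summary table roughly as follows. The sketch assumes a tab-separated file with a header row containing `contig_id` and `completeness` columns, as in CheckV's quality_summary.tsv, and it drops rows whose completeness is `NA`; whether to keep or drop those is a judgment call not covered in the thread. The function name `checkv_keep_ids` is hypothetical.

```shell
# Hypothetical helper: print the IDs of contigs whose estimated
# completeness is at least the given percentage.
# $1: CheckV quality summary TSV with a header row
# $2: minimum completeness in percent (default 10)
checkv_keep_ids() {
  awk -F '\t' -v min="${2:-10}" '
    NR == 1 {
      # Locate the columns by name so column order does not matter.
      for (i = 1; i <= NF; i++) {
        if ($i == "contig_id") id = i
        if ($i == "completeness") c = i
      }
      if (!id || !c) { print "expected columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    # Rows without an estimate ("NA") are dropped in this sketch.
    $c != "NA" && ($c + 0) >= (min + 0) { print $id }
  ' "$1"
}
```

The resulting ID list can then be used to subset the sequences or the summary table with whatever tool is already part of the workflow.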
Got it. Thank you again for your answers and suggestions! They are really helpful.
Best,
Hello Mike,
I'm using Cenote-Taker2 to discover viral elements in my samples. I'm really glad that lots of viral genes were identified, but I have two questions and hope to get your suggestions.
I got different end_feature results when analyzing the same assembly data file, depending on whether I ran it separately (s00.1003single.txt) or together with other samples (s00.1003list.txt). In the latter case, I have a list of assemblies (scaffold_list.txt) from SPAdes and run the program following the list. Below are the scripts for both runs. The first sample in the list is 1003-01.fasta. What I found was that, when it was analyzed separately, the end features of some viral sequences were annotated as DTR (as shown in the picture above). However, when I ran it from the list, all of the metagenomic plasmids were labeled as NONE, though other information, such as LENGTH, was the same.
The outputs also differed. In the separate run, I got AA.fasta files (e.g., d00_ct1003single1033.AA.fasta.txt), but when running from the list, I got permuted.fa files (e.g., permuted.1033.fa.txt) instead. I know this is confusing, and I'm not sure whether these problems are connected. I have attached the log files for both runs; please let me know if you need more information to figure out the problem.
s00.1003single.txt
s00.1003list.txt
Log files for the separate run:
x_dna_test_33051552.txt
y_dna_test_33051552.txt
Log files for the list run:
x_dna_test_1.txt
y_dna_test_1.txt
Best,
Kailun