Linking rna_virus search to original contig #2

mihinduk · 2020-06-25T14:55:31Z

Hi,

Is it possible to have the original contig name in the output include the information after the space so that the name of the contig from the original contig dictionary would be captured?

For example, the header KM_ct2761 contig_1689 (from non_viral_domains_contigs.fna) would be reported as KM_ct2761 contig_1689, not just KM_ct2761. This would save a lookup step.

Thank you,
Kathie Mihindukulasuriya

mtisza1 · 2020-07-01T15:09:08Z

Hi Kathie,

My apologies for the slow response. I've been on vacation for the past week.

I think I understand your issue, and Cenote-Taker2 should be reporting the contigs in 'non_viral_domains_contigs.fna' exactly as you are requesting it. One consideration is that it will only retain header information before the first whitespace character. Is it possible that you included a whitespace directly after the '>' character?

If possible, please create a new .fasta file from "contig_1689", formatted like your original .fasta, and attach it. I'll see if I can recapitulate the error.

Best,

Mike

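P.S. Here's an example from the test contigs. Original .fasta input:

[image] https://user-images.githubusercontent.com/37546741/86260284-0f507200-bb8b-11ea-9b8c-53f4f798145c.png

From non_viral_domains_contigs.fna:

[image] https://user-images.githubusercontent.com/37546741/86260373-3018c780-bb8b-11ea-9e5a-b2d019d27fc7.png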

mihinduk · 2020-07-01T15:25:16Z

Hi Mike,

I hope you had a nice vacation. Thanks for getting back to me. I made 2 files:

contig_1689.fasta = the input I used, but just for contig 1689
contig_1689_non_viral_domains_contigs.fna = the output for contig 1689 only

that I got from the command:

conda activate /mnt/pathogen1/rrodgers/miniconda2/envs/cenote-taker2_env
MIN=1000
nohup python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
	--contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/KM_ct2/other_contigs/non_viral_domains_contigs.fna \
	--run_title KM_ct2_RNA \
	--template_file ../template.sbt \
	--mem 80 --cpu 20 \
	--virus_domain_db rna_virus \
	--prune_prophage FALSE \
	--filter_out_plasmids FALSE \
	--minimum_length_circular $MIN \
	--minimum_length_linear $MIN \
	--hhsuite_tool hhsearch \
	--handle_contigs_without_hallmark sketch_all > out.log 2>&1 &

Thank you for your help,
Kathie


mtisza1 · 2020-07-01T17:52:46Z

Kathie,

I'm not seeing the files you referred to. Could you try attaching them once more or send them to my email at michael.tisza@gmail.com?

Mike

mtisza1 · 2020-07-01T19:52:36Z

OK I got your files, and they look as I expected. I think I misunderstood your question.

I thought you just wanted the string 'contig_1689' present in the output files, which it is in the various .fna, .fsa, .gbf, and .tsv files.
Were you hoping that the files produced from the analysis of contig_1689 would contain the string 'contig_1689' in the file name? This would cause some issues for me and I wouldn't really want to do it. That said, you could write a bash script that renames the files after the run is over, such as the script below. Let me know if I'm understanding you correctly, and, if I'm not, please be more explicit about what you'd like from the output.


#!/bin/bash
# Prefix each output file with the original contig name, which is read from
# the "note=" field on the first line of the matching .fsa file.

for FSA in *.fsa ; do
	# Skip the literal pattern if no .fsa files are present
	[ -e "$FSA" ] || continue
	BASE="${FSA%.fsa}"
	ORIGINAL_TITLE=$( head -n1 "$FSA" | sed 's/.*note= \(.*\) ; .*/\1/' )
	echo "${BASE} to ${ORIGINAL_TITLE}"
	# Rename the .fsa file and its companion files, skipping any that are missing
	for EXT in fsa gbf cmt tbl val sqn ; do
		if [ -e "${BASE}.${EXT}" ] ; then
			mv "${BASE}.${EXT}" "${ORIGINAL_TITLE}_${BASE}.${EXT}"
		fi
	done
done

mihinduk · 2020-07-01T20:54:48Z

Hi Mike,

What I was hoping to capture was the original contig name in the output of the rna_virus search. So, when I submit my original file for the initial search:

MIN=1000
python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
	--contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/assembly/contig_dictionary/contig_dictionary.fasta \
	--run_title KM_ct2 \
	--template_file ../template.sbt \
	--mem 80 --cpu 20 \
	--prune_prophage FALSE \
	--filter_out_plasmids FALSE \
	--minimum_length_circular $MIN \
	--minimum_length_linear $MIN \
	--hhsuite_tool hhsearch \
	--handle_contigs_without_hallmark sketch_all

My infile consists of named contigs. Ex:

contig_1689
contig_191
contig_2107

and the outfile has the column "Cenote-taker contig name":

ct2_contig_dictionary1006
ct2_contig_dictionary1525
ct2_contig_dictionary1526

and the column "original contig name", which contains the contig names:

contig_191
contig_2385
contig_2386

The input fasta for the --virus_domain_db rna_virus run contains both these names, separated by a space:

KM_ct2761 contig_1689
KM_ct21225 contig_2107
KM_ct21484 contig_2347
KM_ct2188 contig_492

The output file from the --virus_domain_db rna_virus run has the column "Cenote-taker contig name":

KM_ct2_RNA1551_vs01
KM_ct2_RNA198_vs01
KM_ct2_RNA432_vs01
KM_ct2_RNA681_vs01

and the column "original contig name", which contains the contig names:

KM_ct2761
KM_ct21225
KM_ct21484
KM_ct2188

I would like the "original contig name" column to contain either:

KM_ct2761 contig_1689
OR
contig_1689

as I now have an added parsing step, where I have to go back to the input fasta for the --virus_domain_db rna_virus run (non_viral_domains_contigs.fna) and search for the "original contig name" to link the original contig name in my 1st fasta infile (contig_dictionary.fasta), which is how I have the contigs linked to my metadata. I was hoping to avoid the extra parsing step and be able to read the RNA output into R to link it with my metadata for analysis.

Thank you,
Kathie
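For readers following along, the extra lookup step described above could be sketched roughly as below. This is only an illustration, not part of Cenote-Taker2: it assumes headers of the form ">KM_ct2761 contig_1689", and the file names, temporary directory, and sequences are made up so that the snippet runs on its own.

```shell
#!/bin/bash
# Sketch: build a two-column lookup table (first header token -> second token)
# from a FASTA file whose headers look like ">KM_ct2761 contig_1689".
# All file names here are placeholders for illustration.

WORKDIR=$(mktemp -d)
FASTA="$WORKDIR/example_contigs.fna"
LOOKUP="$WORKDIR/header_lookup.tsv"

# Tiny stand-in for non_viral_domains_contigs.fna (sequences are invented).
printf '>KM_ct2761 contig_1689\nACGTACGT\n>KM_ct21225 contig_2107\nTTGACCA\n' > "$FASTA"

# Take header lines only, drop the leading ">", and print the first two
# whitespace-separated fields joined by a tab.
awk '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 }' "$FASTA" > "$LOOKUP"

cat "$LOOKUP"
```

The resulting table can then be joined to the "original contig name" column of the rna_virus output in R or any other tool.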


mtisza1 · 2020-07-02T15:17:45Z

OK so I was clearly misunderstanding your question before. I'm sorry about that.

There are a variety of reasons that I'm not willing to change the pipeline to preserve the fasta header information beyond the first whitespace character, so, unfortunately, you'll have to parse the 'non_viral_domains_contigs.fna' file before feeding it to another Cenote-Taker2 run. A quick manipulation would be to replace the space character in (from your example) 'KM_ct2761 contig_1689' with another character. For example, you could change it to an '@' character (KM_ct2761@contig_1689), so the information would all be preserved:

sed 's/ /@/g' non_viral_domains_contigs.fna > contigs_for_next_run.fna
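A slightly more conservative variant, sketched here as an assumption rather than a tested recommendation, restricts the substitution to header lines (those starting with '>') so that sequence lines are never touched. It also assumes that '@' does not already occur in any contig name. The temporary directory and the example sequence are placeholders so that the snippet runs on its own.

```shell
#!/bin/bash
# Sketch: replace spaces with "@" on FASTA header lines only.
# Paths and the example record are placeholders for illustration.

WORKDIR=$(mktemp -d)
IN="$WORKDIR/non_viral_domains_contigs.fna"
OUT="$WORKDIR/contigs_for_next_run.fna"

# One-record stand-in for the real file (sequence is invented).
printf '>KM_ct2761 contig_1689\nACGTACGT\n' > "$IN"

# The /^>/ address limits the substitution to header lines.
sed '/^>/ s/ /@/g' "$IN" > "$OUT"

cat "$OUT"
```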

I don't think this was the answer you were hoping for, but I'm hoping it won't be too onerous either.

Best,

Mike

mihinduk · 2020-07-02T15:21:08Z

Hi Mike,

Thanks for the reply. Now that I know the format of the output, you are right, the simplest fix on my end will be to replace the space with another character before submitting it, so I can parse it when I get the results.

Kathie
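As an illustration of that parsing step, the joined name could be split back into its two parts along these lines. The two-column table below is hypothetical (the real Cenote-Taker2 summary has its own layout, and the row name is invented); only the 'KM_ct2761@contig_1689' value comes from the thread.

```shell
#!/bin/bash
# Sketch: split a joined name such as "KM_ct2761@contig_1689" into two
# tab-separated columns. The input table is a made-up stand-in with the
# joined name assumed to be in column 2.

WORKDIR=$(mktemp -d)
RESULTS="$WORKDIR/example_summary.tsv"
SPLIT="$WORKDIR/example_summary_split.tsv"

printf 'example_RNA1_vs01\tKM_ct2761@contig_1689\n' > "$RESULTS"

# Split column 2 on "@"; if no "@" is present, the third column is left empty.
awk 'BEGIN { FS = OFS = "\t" }
     { n = split($2, parts, "@"); print $1, parts[1], (n > 1 ? parts[2] : "") }' \
    "$RESULTS" > "$SPLIT"

cat "$SPLIT"
```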


mtisza1 · 2020-07-02T15:33:41Z

OK, great. And, please make me aware of any additional issues you encounter.

I am now closing this issue.

mtisza1 closed this as completed Jul 2, 2020
