Linking rna_virus search to original contig #2
Hi Mike,
I hope you had a nice vacation. Thanks for getting back to me. I made 2 files:
contig_1689.fasta = the input I used, but just for contig 1689
contig_1689_non_viral_domains_contigs.fna = the output for contig 1689 only that I got from the command:
conda activate /mnt/pathogen1/rrodgers/miniconda2/envs/cenote-taker2_env
MIN=1000
nohup python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
--contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/KM_ct2/other_contigs/non_viral_domains_contigs.fna \
--run_title KM_ct2_RNA \
--template_file ../template.sbt \
--mem 80 --cpu 20 \
--virus_domain_db rna_virus \
--prune_prophage FALSE \
--filter_out_plasmids FALSE \
--minimum_length_circular $MIN \
--minimum_length_linear $MIN \
--hhsuite_tool hhsearch \
--handle_contigs_without_hallmark sketch_all > out.log 2>&1 &
Thank you for your help,
Kathie
…________________________________
From: Mike Tisza <notifications@github.com>
Sent: Wednesday, July 1, 2020 10:09 AM
To: mtisza1/Cenote-Taker2 <Cenote-Taker2@noreply.github.com>
Cc: Mihindukulasuriya, Kathie <mihindu@wustl.edu>; Author <author@noreply.github.com>
Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
* External Email - Caution *
Hi Kathie,
My apologies for the slow response. I've been on vacation for the past week.
I think I understand your issue, and Cenote-Taker2 should be reporting the contigs in 'non_viral_domains_contigs.fna' exactly as you are requesting it. One consideration is that it will only retain header information before the first whitespace character. Is it possible that you included a whitespace directly after the '>' character?
If possible, please create a new .fasta file from "contig_1689", formatted like your original .fasta, and attach it. I'll see if I can recapitulate the error.
Best,
Mike
P.S. Here's an example from the test contigs. Original .fasta input:
[image]<https://user-images.githubusercontent.com/37546741/86260284-0f507200-bb8b-11ea-9b8c-53f4f798145c.png>
From non_viral_domains_contigs.fna:
[image]<https://user-images.githubusercontent.com/37546741/86260373-3018c780-bb8b-11ea-9e5a-b2d019d27fc7.png>
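The whitespace question above can be checked directly on the input file. What follows is a minimal sketch, not part of Cenote-Taker2: the function names `retained_ids` and `count_blank_leading_headers` are hypothetical, and the truncation rule is taken only from the description in this thread (everything after the first whitespace character in a header is dropped).

```shell
#!/bin/bash
# Hypothetical helpers, not part of Cenote-Taker2. Both read FASTA on stdin.

# Print, for every header line, the text between '>' and the first
# whitespace character. A header such as "> contig_1689" yields an empty line.
retained_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p'
}

# Count header lines that have whitespace directly after '>'.
# grep -c exits non-zero when the count is 0, hence the '|| true'.
count_blank_leading_headers() {
    grep -c '^>[[:space:]]' || true
}
```

For example, `retained_ids < contig_1689.fasta` lists the names that would survive, and a non-zero result from `count_blank_leading_headers < contig_1689.fasta` would point to the problem described above.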
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#2 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ANDVLDI254XBFGV6IBFDIMDRZNGSHANCNFSM4OIOOM5Q>.
________________________________
The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.
|
Kathie, I'm not seeing the files you referred to. Could you try attaching them once more or send them to my email at michael.tisza@gmail.com? Mike |
|
Hi Mike,
What I was hoping to capture was the original contig name in the output of the rna_virus search. So, when I submit my original file for the initial search:
MIN=1000
python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
--contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/assembly/contig_dictionary/contig_dictionary.fasta \
--run_title KM_ct2 \
--template_file ../template.sbt \
--mem 80 --cpu 20 \
--prune_prophage FALSE \
--filter_out_plasmids FALSE \
--minimum_length_circular $MIN \
--minimum_length_linear $MIN \
--hhsuite_tool hhsearch \
--handle_contigs_without_hallmark sketch_all
My infile consists of named contigs:
Ex:
contig_1689
contig_191
contig_2107
and the outfile has the column "Cenote-taker contig name"
ct2_contig_dictionary1006
ct2_contig_dictionary1525
ct2_contig_dictionary1526
and the column "original contig name" which contains the contig names:
contig_191
contig_2385
contig_2386
The input fasta for the --virus_domain_db rna_virus run contains both these names, separated by a space:
KM_ct2761 contig_1689
KM_ct21225 contig_2107
KM_ct21484 contig_2347
KM_ct2188 contig_492
The output file from the --virus_domain_db rna_virus run has the column "Cenote-taker contig name"
KM_ct2_RNA1551_vs01
KM_ct2_RNA198_vs01
KM_ct2_RNA432_vs01
KM_ct2_RNA681_vs01
and the column "original contig name" which contains the contig names:
KM_ct2761
KM_ct21225
KM_ct21484
KM_ct2188
I would like the "original contig name" column to contain either:
KM_ct2761 contig_1689
OR
contig_1689
As it stands, I have an added parsing step: I have to go back to the input fasta for the --virus_domain_db rna_virus run (non_viral_domains_contigs.fna) and look up each "original contig name" to recover the name used in my 1st fasta infile (contig_dictionary.fasta), which is how I have the contigs linked to my metadata. I was hoping to avoid the extra parsing step and be able to read the RNA output straight into R to link it with my metadata for analysis.
Thank you,
Kathie
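The extra parsing step described in this message can be sketched as a small lookup table built from the headers of non_viral_domains_contigs.fna. This is only an illustration under the assumption that every header has the two-field form shown above (">KM_ct2761 contig_1689"); the function name `header_lookup` and the output file name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of Cenote-Taker2. Reads FASTA on stdin and
# prints "first_run_name<TAB>original_name" for each header with two or more fields.
header_lookup() {
    awk '/^>/ && NF >= 2 { sub(/^>/, "", $1); print $1 "\t" $2 }'
}

# Usage (file names as used in this thread; the output name is an assumption):
#   header_lookup < non_viral_domains_contigs.fna > name_lookup.tsv
```

The resulting two-column table can then be joined to the "original contig name" column of the rna_virus output, for instance in R.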
…________________________________
From: Mike Tisza <notifications@github.com>
Sent: Wednesday, July 1, 2020 2:52 PM
To: mtisza1/Cenote-Taker2 <Cenote-Taker2@noreply.github.com>
Cc: Mihindukulasuriya, Kathie <mihindu@wustl.edu>; Author <author@noreply.github.com>
Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
OK I got your files, and they look as I expected. I think I misunderstood your question.
I thought you just wanted the string 'contig_1689' to be present in the output files, which it is in the various .fna, .fsa, .gbf, and .tsv files.
Were you hoping that the files produced from the analysis of contig_1689 would contain the string 'contig_1689' in the file name? This would cause some issues for me and I wouldn't really want to do it. That said, you could write a bash script renaming the files after the run is over. Such as with the script below. Let me know if I'm understanding you correctly, and, if I'm not, please be more explicit about what you'd like from the output.
________________________________
#!/bin/bash
# Prefix each output file with the original contig name taken from its .fsa header.
for FSA in *.fsa ; do
    # Capture what follows 'note= ' up to the last ' ; ' on the first line.
    ORIGINAL_TITLE=$( head -n1 "$FSA" | sed 's/.*note= \(.*\) ; .*/\1/' )
    echo "${FSA%.fsa} to ${ORIGINAL_TITLE}"
    mv "$FSA" "${ORIGINAL_TITLE}${FSA}"
    mv "${FSA%.fsa}.gbf" "${ORIGINAL_TITLE}${FSA%.fsa}.gbf"
    mv "${FSA%.fsa}.cmt" "${ORIGINAL_TITLE}${FSA%.fsa}.cmt"
    mv "${FSA%.fsa}.tbl" "${ORIGINAL_TITLE}${FSA%.fsa}.tbl"
    mv "${FSA%.fsa}.val" "${ORIGINAL_TITLE}${FSA%.fsa}.val"
    mv "${FSA%.fsa}.sqn" "${ORIGINAL_TITLE}${FSA%.fsa}.sqn"
done
|
Hi Mike,
Thanks for the reply. Now that I know the format of the output, you are right, the simplest fix on my end will be to replace the space with a character before submitting it, so I can parse it when I get the results.
Kathie
…________________________________
From: Mike Tisza <notifications@github.com>
Sent: Thursday, July 2, 2020 10:18 AM
To: mtisza1/Cenote-Taker2 <Cenote-Taker2@noreply.github.com>
Cc: Mihindukulasuriya, Kathie <mihindu@wustl.edu>; Author <author@noreply.github.com>
Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
OK so I was clearly misunderstanding your question before. I'm sorry about that.
There are a variety of reasons that I'm not willing to change the pipeline to preserve the fasta header information beyond the first whitespace character, so, unfortunately, you'll have to parse the 'non_viral_domains_contigs.fna' file before feeding it to another cenote-taker2 run. A quick manipulation would be to remove the space character in (from your example) 'KM_ct2761 contig_1689'. For example you could change it to an '@' character (KM_ct2761@contig_1689), so the information would all be preserved:
sed 's/ /@/g' non_viral_domains_contigs.fna > contigs_for_next_run.fna
I don't think this was the answer you were hoping for, but I'm hoping it won't be too onerous either.
Best,
Mike
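As a possible refinement of the sed one-liner in this reply, the substitution can be limited to header lines, and the joined name can be split again once the results are in. This is a sketch under assumptions: the function names are hypothetical, and it presumes that '@' does not otherwise occur in the contig names and that the joined name comes through the run unchanged, as the suggestion in this reply implies.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Cenote-Taker2.

# Replace spaces with '@' on FASTA header lines only; sequence lines pass
# through unchanged. Reads FASTA on stdin.
join_header_names() {
    sed '/^>/ s/ /@/g'
}

# Split a joined name such as "KM_ct2761@contig_1689" into two
# tab-separated fields. Reads one name per line on stdin.
split_joined_name() {
    awk -F'@' '{ print $1 "\t" $2 }'
}

# Usage (file names as used in this thread):
#   join_header_names < non_viral_domains_contigs.fna > contigs_for_next_run.fna
```

Restricting the substitution to lines starting with '>' guards against touching sequence lines, although a well-formed FASTA file should not have spaces there in any case.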
|
OK, great. And, please make me aware of any additional issues you encounter. I am now closing this issue. |
Hi,
Is it possible to have the original contig name in the output include the information after the space so that the name of the contig from the original contig dictionary would be captured?
Thank you,
Kathie Mihindukulasuriya