translate.seqs #788

pschloss · 2021-08-26T16:47:22Z

This function would take...

- DNA sequences and translate them to amino acid sequences
  - By default it would use the first frame
  - The user could also specify the frame (1, 2, 3, -1, -2, -3) or possibly use all 6 frames
  - Another option would be stop=T/F. If T, translation stops just before the first stop codon it hits. If F, it returns the full translation with a * marking each stop codon
  - Output as *.aa#.fasta where # is the frame
- Amino acid sequences and translate them to DNA sequences
  - Because of degeneracies there will be non-ATGC IUPAC codes in the output sequence
  - Output as *.dna.fasta
- Unaligned DNA and unaligned/aligned amino acid sequences
  - Back translate the amino acid sequence to the DNA sequence so that the DNA is aligned. This should result in the DNA bases being clustered in groups of 3, one group per amino acid codon
  - Hopefully the DNA sequence and the amino acid sequence will be in the same frame
  - Output alignment as *.dna.align
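
The first item (DNA to amino acids, with a frame and a stop option) can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not mothur's implementation: the function names, the use of the standard genetic code, and the choice to emit X for codons containing ambiguous bases are all assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# proposal does not say which translation table would be used).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=1, stop=True):
    """Translate dna in one of the six frames (1, 2, 3, -1, -2, -3).

    stop=True  -> stop just before the first stop codon
    stop=False -> translate to the end, writing '*' for each stop codon
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    seq = dna.upper()
    if frame < 0:
        seq = reverse_complement(seq)
    start = abs(frame) - 1
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for unknown codons (assumption)
        if aa == "*" and stop:
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    for frame in (1, 2, 3, -1, -2, -3):
        print(frame, translate("AAATGCTGCAAA", frame=frame, stop=False))
```

Translating all six frames is then just a loop over the frame values, with one output file per frame.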

mothur-westcott · 2021-11-10T16:59:41Z

@pschloss, for the translation to DNA from amino acids, how do we choose between the multiple options in the compressed column?

For example, how is L translated?

CTN
TTR
CTY
YTR
YTN

| AminoAcid | DNA_codons | Compressed |
| --- | --- | --- |
| Leu, L | CTT, CTC, CTA, CTG; TTA, TTG | CTN, TTR; or CTY, YTR |
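
To make the ambiguity concrete, the candidate codes can be expanded back into plain codons and compared with the six leucine codons. The sketch below uses the standard IUPAC nucleotide codes; the `expand` helper is a hypothetical name introduced only for this illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

LEU_CODONS = {"CTT", "CTC", "CTA", "CTG", "TTA", "TTG"}


def expand(pattern):
    """Return the set of plain codons matched by an IUPAC pattern."""
    return {"".join(p) for p in product(*(IUPAC[c] for c in pattern))}


if __name__ == "__main__":
    # Both two-pattern encodings cover exactly the six Leu codons.
    print(expand("CTN") | expand("TTR") == LEU_CODONS)
    print(expand("CTY") | expand("YTR") == LEU_CODONS)
    # The single pattern YTN also matches TTT and TTC, which encode Phe.
    print(sorted(expand("YTN") - LEU_CODONS))
```

So the two-pattern options are exact but need two alternatives per leucine, while the single pattern YTN is compact but also matches the two phenylalanine codons.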

pschloss · 2021-11-15T17:56:23Z

For DNA to amino acids, if it runs into CTT (or CTC, CTA, CTG, TTA, TTG), then that codon is replaced with an L.

For amino acids to DNA... Hmm. I'm not sure what to do there. I think the main use of amino acid to DNA would be to align a DNA sequence the user already has. So if the amino acid sequence had an L, then we would look for CTT,CTC,CTA,CTG,TTA, or TTG in the DNA sequence to be aligned. Does that make sense?

Basically, I don't think we'd ever have an amino acid sequence without a DNA sequence, but we will likely have a DNA sequence without an amino acid sequence.

mothur-westcott · 2021-11-15T18:25:32Z

Hmm... so should we scrap this bullet point?

> - Amino acid sequences and translate it to a DNA sequence
>   - Because of degeneracies there will be non-ATGC IUPAC codes in the output sequence
>   - Output as *.dna.fasta

pschloss · 2021-11-15T18:28:56Z

Yeah that's probably best for `translate.seqs`



mothur-westcott · 2022-02-01T16:16:47Z

> - Unaligned DNA and unaligned/aligned Amino acid sequences
>   - Back translate the amino acid sequence to the DNA sequence so that the DNA is aligned. This should result in the DNA bases being clustered in groups of 3 corresponding to each amino acid codon
>   - Hopefully the DNA sequence and the amino acid sequence will be in the same frame
>   - Output alignment as *.dna.align

@pschloss, could you explain this part a bit more and correct any bad assumptions on my part?

Inputs: The user is providing a reference file of DNA sequences, and an input file containing amino acids.
For each amino acid sequence in the input file:
- translate amino acid into compressed DNA
- find closest sequence in reference to align to (kmer search)
- align compressed DNA to reference

Trivial example:

aminoAcidFragment
KC_CK

becomes

compressedDNA
AARTGY---TGYAAR

Perfect matches in reference database could include:

AAATGC---TGCAAA, AAGTGC---TGCAAA, AAATGT---TGCAAA, ... , AAGTGT---TGTAAG

alignedDNA
AAATGC---TGCAAA
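
The trivial example above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not mothur code: it uses the standard genetic code, collapses each amino acid's codons position by position into a single IUPAC pattern, and the names `compress`, `back_translate`, and `matches` are hypothetical. Per-position collapsing is exact for K and C, but it over-covers amino acids such as L (YTN also matches Phe codons).

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to its codons under the standard genetic code.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def compress(aa):
    """Collapse all codons of one amino acid into one IUPAC triplet."""
    codons = CODONS_FOR[aa]
    return "".join(CODE_FOR[frozenset(c[i] for c in codons)] for i in range(3))


def back_translate(protein, gap_chars="-._"):
    """Turn an (aligned) protein into compressed DNA; each gap becomes '---'."""
    return "".join("---" if aa in gap_chars else compress(aa) for aa in protein)


def matches(compressed, dna):
    """True if dna is a perfect match for the compressed pattern."""
    return len(compressed) == len(dna) and all(
        base in IUPAC.get(code, code) for code, base in zip(compressed, dna)
    )


if __name__ == "__main__":
    pattern = back_translate("KC_CK")
    print(pattern)
    print(matches(pattern, "AAATGC---TGCAAA"))
    print(matches(pattern, "AACTGC---TGCAAA"))
```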

pschloss · 2022-02-01T19:58:44Z

For this option, the user would provide a DNA sequence and an amino acid sequence. One of the two would be aligned. For example, the user could provide an aligned amino acid sequence and an unaligned DNA sequence. They'd want the DNA sequence to come out aligned...

>unaligned_dna
AAATGCTGCAAA

>aligned_amino_acid
KC_CK

Output would then be...

>aligned_dna
AAATGC---TGCAAA

If instead they provided...

>aligned_dna
AAATGC-TGCAAA

>unaligned_amino_acid
KCCK

then the output would be...

>aligned_amino_acid
KC-CK

Does that make sense? I don't think there's a need here for a compressed DNA alphabet.
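
The first case above (aligned amino acids plus unaligned DNA) needs no compressed alphabet at all: the gaps of the protein alignment are simply threaded into the DNA, three columns per amino acid column. A minimal Python sketch, assuming the DNA starts in frame 1 and that `-`, `.`, and `_` all count as gap characters; the function name is hypothetical, and a fuller version would also check that each codon really translates to the paired amino acid.

```python
def align_dna_to_protein(dna, aligned_protein, gap_chars="-._"):
    """Thread the gaps of an aligned protein into its unaligned DNA.

    Each gap column in the protein becomes '---' in the DNA; each amino
    acid consumes the next three bases of the DNA (frame 1 assumed).
    """
    aligned = []
    pos = 0
    for aa in aligned_protein:
        if aa in gap_chars:
            aligned.append("---")
            continue
        codon = dna[pos:pos + 3]
        if len(codon) < 3:
            raise ValueError("DNA is too short for the amino acid sequence")
        aligned.append(codon)
        pos += 3
    return "".join(aligned)


if __name__ == "__main__":
    print(align_dna_to_protein("AAATGCTGCAAA", "KC_CK"))
```

With the sequences from the example, this yields AAATGC---TGCAAA.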

mothur-westcott · 2022-02-01T20:05:53Z

That makes sense, thanks for clarifying. So there is no reference file, just a pair of files, like the fasta/qfile pairings. The command would look like:

mothur > translate.seqs(fasta=alignedOrUnalignedDNA, amino=alignedOrUnalignedAminoAcids)

This assumes the sequences are in the same order in both files.
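
The same-order assumption can be illustrated by reading the two files in lockstep and failing loudly if one runs out first. This Python sketch is not how mothur reads its files; `read_fasta`, `is_aligned`, and `paired_records` are hypothetical helpers, and treating "contains a gap character" as "is aligned" is an assumption made only for this example.

```python
import io
from itertools import zip_longest


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def is_aligned(seq, gap_chars="-._"):
    """Assumption: a sequence counts as aligned if it contains any gap."""
    return any(c in gap_chars for c in seq)


def paired_records(dna_handle, amino_handle):
    """Pair the i-th DNA record with the i-th amino acid record."""
    pairs = zip_longest(read_fasta(dna_handle), read_fasta(amino_handle))
    for dna_rec, amino_rec in pairs:
        if dna_rec is None or amino_rec is None:
            raise ValueError("fasta and amino files have different record counts")
        yield dna_rec, amino_rec


if __name__ == "__main__":
    dna = io.StringIO(">unaligned_dna\nAAATGCTGCAAA\n")
    amino = io.StringIO(">aligned_amino_acid\nKC_CK\n")
    for (dna_name, dna_seq), (aa_name, aa_seq) in paired_records(dna, amino):
        print(dna_name, is_aligned(dna_seq), aa_name, is_aligned(aa_seq))
```

Pairing is by position only, since the record names in the two files need not match.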

mothur-westcott added this to the Version 1.47.0 milestone Aug 30, 2021

mothur-westcott added Enhancement translate.seqs command labels Aug 30, 2021

mothur-westcott added a commit that referenced this issue Nov 8, 2021

Rough in translate.seqs command

a8b2bab

#788 #792

mothur-westcott added a commit that referenced this issue Nov 9, 2021

Adds createProcesses function in translate.seqs

f859581

#788 #792

mothur-westcott added a commit that referenced this issue Nov 16, 2021

Adds codon constructor and handling to amino acids class

c2d5bd5

#788 #792

mothur-westcott added a commit that referenced this issue Nov 16, 2021

Adds getProtein function to sequence class

ed1e71d

#792 #788

mothur-westcott added a commit that referenced this issue Nov 16, 2021

Removes amino acid to dna code

02a4cce

#788 #792

mothur-westcott added a commit that referenced this issue Dec 14, 2021

Adds trimming to translate.seqs

6abd4db

#788 #792

mothur-westcott modified the milestones: Version 1.47.0, 1.48.0 Jan 3, 2022

mothur-westcott added a commit that referenced this issue Feb 8, 2022

Work on translate.seeqs

cff5cbf

#788 #812

mothur-westcott added a commit that referenced this issue Feb 8, 2022

Adds isAligned functions to sequence and protein

ef2e737

#812 #788

mothur-westcott added a commit that referenced this issue Feb 8, 2022

Builds out parallel processing for translate.seqs

21e04f4

#788 #812

mothur-westcott added a commit that referenced this issue Feb 15, 2022

Adds align function for sequence / protein

1de67ab

#812 #788

mothur-westcott added a commit that referenced this issue Feb 15, 2022

Updates data structures for protein align

1e1f6e2

#812 #788

mothur-westcott added a commit that referenced this issue Feb 28, 2022

WIP

260b9ec

#812 #788

mothur-westcott added a commit that referenced this issue Feb 28, 2022

Finailizes translate.seqs

bc6205f

#788 #812

mothur-westcott closed this as completed Mar 1, 2022
