translate.seqs #788

Closed
pschloss opened this issue Aug 26, 2021 · 7 comments

Comments

@pschloss
Contributor

pschloss commented Aug 26, 2021

This function would take...

  • DNA sequences and translate them into amino acid sequences

    • By default it would use the first frame
    • The user could also specify the frame (1, 2, 3, -1, -2, -3) or possibly use all 6 frames
    • Another option would be stop=T/F. If T, translation ends at the first stop codon and that codon is left out of the output. If F, the full translation is returned with a * for each stop codon
    • Output as *.aa#.fasta where # is the frame
  • Amino acid sequences and translate them into DNA sequences

    • Because of degeneracies there will be non-ATGC IUPAC codes in the output sequence
    • Output as *.dna.fasta
  • Unaligned DNA and unaligned/aligned Amino acid sequences

    • Back translate the amino acid sequence to the DNA sequence so that the DNA is aligned. This should result in the DNA bases being clustered in groups of 3 corresponding to each amino acid codon
    • Hopefully the DNA sequence and the amino acid sequence will be in the same frame
    • Output alignment as *.dna.align
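
A minimal sketch of the first bullet (DNA to amino acids with `frame` and `stop` options). It is an illustration, not mothur's implementation: the function name `translate`, the use of the standard genetic code (NCBI translation table 1), and the `X` placeholder for codons containing non-ACGT characters are all assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=1, stop=False):
    """Translate unaligned DNA in one of the six frames (hypothetical helper).

    frame: 1, 2, 3 for the forward strand; -1, -2, -3 for the reverse complement.
    stop:  if True, end before the first stop codon; if False, emit '*' for stops.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    dna = dna.upper()
    if frame < 0:
        dna = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    start = abs(frame) - 1
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons (assumption)
        if aa == "*" and stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AAATGCTGCAAA"))             # KCCK
print(translate("ATGAAATAAGGG", stop=True))  # MK
print(translate("ATGAAATAAGGG"))             # MK*G
```

Trailing bases that do not fill a whole codon are dropped in this sketch.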
@mothur-westcott
Contributor

@pschloss, for the translation from amino acids to DNA, how do we choose between the multiple options in the compressed column?

For example, how is L translated?

  1. CTN
  2. TTR
  3. CTY
  4. YTR
  5. YTN

| Amino acid | DNA codons | Compressed |
|---|---|---|
| Leu, L | CTT, CTC, CTA, CTG; TTA, TTG | CTN, TTR; or CTY, YTR |
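
To make the trade-off between these options concrete, here is a small sketch that collapses a set of codons column by column into IUPAC codes and expands a compressed codon back again. The helper names `compress` and `expand` are hypothetical. It shows that a single compressed codon for Leu (YTN) also covers TTT and TTC, which encode Phe, so Leu needs two compressed codons to be represented exactly.

```python
from itertools import product

# IUPAC nucleotide codes keyed by the set of bases they stand for
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
EXPAND = {code: bases for bases, code in IUPAC.items()}


def compress(codons):
    """Collapse codons position by position into one IUPAC codon."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*codons))


def expand(codon):
    """Return every plain ACGT codon that an IUPAC codon stands for."""
    return {"".join(p) for p in product(*(sorted(EXPAND[c]) for c in codon))}


LEU = ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"]

print(compress(LEU))                        # YTN
print(sorted(expand("YTN") - set(LEU)))     # ['TTC', 'TTT']
print(compress(LEU[:4]), compress(LEU[4:])) # CTN TTR
```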

@pschloss
Contributor Author

For DNA to amino acids, if it runs into CTT (or CTC, CTA, CTG, TTA, TTG), then that codon is replaced with an L.

For amino acids to DNA... Hmm. I'm not sure what to do there. I think the main use of amino acid to DNA would be to align a DNA sequence the user already has. So if the amino acid sequence had an L, then we would look for CTT,CTC,CTA,CTG,TTA, or TTG in the DNA sequence to be aligned. Does that make sense?

Basically, I don't think we'd ever have an amino acid sequence without a DNA sequence, but we will likely have a DNA sequence without an amino acid sequence.

@mothur-westcott
Contributor

Hmm... so should we scrap this bullet point?

  • Amino acid sequences and translate them into DNA sequences

    • Because of degeneracies there will be non-ATGC IUPAC codes in the output sequence
    • Output as *.dna.fasta

@pschloss
Contributor Author

pschloss commented Nov 15, 2021 via email

@mothur-westcott
Contributor

  • Unaligned DNA and unaligned/aligned Amino acid sequences

    • Back translate the amino acid sequence to the DNA sequence so that the DNA is aligned. This should result in the DNA bases being clustered in groups of 3 corresponding to each amino acid codon
    • Hopefully the DNA sequence and the amino acid sequence will be in the same frame
    • Output alignment as *.dna.align

@pschloss, could you explain this part a bit more and correct any bad assumptions on my part?

  1. Inputs: The user is providing a reference file of DNA sequences, and an input file containing amino acids.
  2. For each amino acid sequence in the input file:
    • translate amino acid into compressed DNA
    • find closest sequence in reference to align to (kmer search)
    • align compressed DNA to reference

Trivial example:

aminoAcidFragment
KC_CK

becomes

compressedDNA
AARTGY---TGYAAR

Perfect matches in reference database could include:

AAATGC---TGCAAA, AAGTGC---TGCAAA, AAATGT---TGCAAA, ... , AAGTGT---TGTAAG

alignedDNA
AAATGC---TGCAAA
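
The "perfect match" test in this example can be sketched as a position-by-position check of a plain DNA sequence against the compressed one. The function name `matches` is hypothetical, and treating `-` as matching only `-` is an assumption.

```python
# Bases each IUPAC code stands for; a gap only matches a gap (assumption)
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-",
}


def matches(compressed, candidate):
    """True if every base of candidate is allowed by the IUPAC code at that position."""
    if len(compressed) != len(candidate):
        return False
    return all(base in IUPAC_BASES[code] for code, base in zip(compressed, candidate))


print(matches("AARTGY---TGYAAR", "AAATGC---TGCAAA"))  # True
print(matches("AARTGY---TGYAAR", "AAGTGT---TGTAAG"))  # True
print(matches("AARTGY---TGYAAR", "AACTGC---TGCAAA"))  # False
```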

@pschloss
Contributor Author

pschloss commented Feb 1, 2022

For this option, the user would provide a DNA sequence and an amino acid sequence. One of the two would be aligned. For example, the user could provide an aligned amino acid sequence and an unaligned DNA sequence. They'd want the DNA sequence to come out aligned...

>unaligned_dna
AAATGCTGCAAA

>aligned_amino_acid
KC_CK

Output would then be...

>aligned_dna
AAATGC---TGCAAA
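
A minimal sketch of this direction: walk the aligned amino acid sequence and emit one codon per residue or `---` per gap. The function name `align_dna_to_protein` is hypothetical. The sketch assumes the DNA starts in frame 1, accepts `-`, `.`, and `_` as gap characters, and does not check that each codon actually encodes its residue.

```python
def align_dna_to_protein(dna, aligned_protein, gap_chars="-._"):
    """Thread unaligned DNA onto an aligned amino acid sequence, codon by codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    out = []
    for residue in aligned_protein:
        if residue in gap_chars:
            out.append("---")  # one amino acid gap becomes three DNA gap columns
            continue
        codon = next(codons, None)
        if codon is None or len(codon) < 3:
            raise ValueError("DNA is too short for the amino acid sequence")
        out.append(codon)
    # Any codons left over (for example a trailing stop codon) are ignored here.
    return "".join(out)


print(align_dna_to_protein("AAATGCTGCAAA", "KC_CK"))  # AAATGC---TGCAAA
```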

If instead they provided...

>aligned_dna
AAATGC-TGCAAA

>unaligned_amino_acid
KCCK

then the output would be...

>aligned_amino_acid
KC-CK

Does that make sense? I don't think there's a need here for a compressed DNA alphabet.
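
The second direction can be sketched the same way, reading the aligned DNA three columns at a time. The function name `align_protein_to_dna` is hypothetical. This sketch assumes the DNA gaps span whole codons (three columns per gap), so it takes `AAATGC---TGCAAA` as input and rejects an alignment whose gaps split a codon, such as the single-column gap in the example above.

```python
def align_protein_to_dna(protein, aligned_dna, gap="-"):
    """Insert gaps into an unaligned amino acid sequence to mirror a codon-aligned DNA."""
    if len(aligned_dna) % 3:
        raise ValueError("aligned DNA length is not a multiple of 3")
    residues = iter(protein)
    out = []
    for i in range(0, len(aligned_dna), 3):
        column = aligned_dna[i:i + 3]
        if column == gap * 3:
            out.append(gap)  # a fully gapped codon becomes one amino acid gap
        elif gap in column:
            raise ValueError("a gap splits a codon")
        else:
            residue = next(residues, None)
            if residue is None:
                raise ValueError("amino acid sequence is too short for the DNA")
            out.append(residue)
    if next(residues, None) is not None:
        raise ValueError("amino acid sequence is too long for the DNA")
    return "".join(out)


print(align_protein_to_dna("KCCK", "AAATGC---TGCAAA"))  # KC-CK
```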

@mothur-westcott
Contributor

That makes sense, thanks for clarifying. So there is no reference file, just a pair of files, like the fasta/qfile pairings. The command would look like:

mothur > translate.seqs(fasta=alignedOrUnalignedDNA, amino=alignedOrUnalignedAminoAcids)

This assumes the sequences are in the same order in both files.
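
The same-order assumption could be handled roughly like this: read both files record by record and pair them positionally, failing if one file runs out first. The names `read_fasta` and `paired_records` are hypothetical and this is not mothur code. Sequence names are not compared, since the names in the examples above differ between the DNA and amino acid records.

```python
from itertools import zip_longest


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def paired_records(dna_lines, amino_lines):
    """Pair the i-th DNA record with the i-th amino acid record."""
    for dna, amino in zip_longest(read_fasta(dna_lines), read_fasta(amino_lines)):
        if dna is None or amino is None:
            raise ValueError("the two files contain different numbers of sequences")
        yield dna, amino


dna_file = [">unaligned_dna", "AAATGC", "TGCAAA"]
amino_file = [">aligned_amino_acid", "KC_CK"]
for (dna_name, dna_seq), (aa_name, aa_seq) in paired_records(dna_file, amino_file):
    print(dna_name, dna_seq, aa_name, aa_seq)
```

With real files, the two arguments would be open file handles rather than lists of strings.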
