### 1. What is Biopython?
The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

### 2. What can I find in the Biopython package?

The ability to parse bioinformatics files into Python-usable data structures, including support for the following formats:

- Blast output, both from standalone and WWW Blast
- Clustalw
- FASTA
- GenBank
- PubMed and Medline
- ExPASy files, like Enzyme and Prosite
- SCOP, including "dom" and "lin" files
- UniGene
- SwissProt

Files in the supported formats can be iterated over record by record, or indexed and accessed via a dictionary interface.
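
To make "record by record" and "dictionary interface" concrete, here is a minimal plain-Python sketch of both access styles for FASTA text. It does not use Biopython; the helper name `iter_fasta` and the two example records are made up for illustration, and real FASTA files have more edge cases than this handles.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text (hypothetical helper)."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up example data, not taken from a real database
example = StringIO(">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n")

records = list(iter_fasta(example))  # record-by-record access
index = dict(records)                # dictionary-style access by identifier
print(records)
print(index["seq2"])
```

Iterating keeps only one record in memory at a time, while the dictionary trades memory for direct lookup by identifier.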

In [None]:
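# !pip install biopython
# If the install fails, uninstall Biopython, restart the kernel, and install again:
# !pip uninstall biopython
# Note: the cells below use Bio.Alphabet, which was removed in Biopython 1.78, so they need an older release.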


## Working with Sequences

In [10]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq

Seq('AGTACACTGGT')

In [11]:
print(my_seq)

AGTACACTGGT


What we have here is a sequence object with a generic alphabet, reflecting the fact that we have not specified whether this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!).

In [14]:
# Complement and reverse complement of the sequence
print(my_seq.complement())  # complement: A-T, G-C, T-A, C-G
print(my_seq.reverse_complement())  # the complement, read in reverse order

TCATGTGACCA
ACCAGTGTACT


### PARSING SEQUENCE FILE FORMATS

https://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide

## Simple FASTA parsing example

In [18]:
from Bio import SeqIO
for seq_record in SeqIO.parse("datas/ls_orchid.fasta","fasta"):
    # the file lives in the datas/ directory
    print(seq_record.id)
    print(repr(seq_record)) #The repr() method returns a printable representational string of the given object.
    print(len(seq_record))


gi|2765658|emb|Z78533.1|CIZ78533
SeqRecord(seq=Seq('C', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
740
gi|2765657|emb|Z78532.1|CCZ78532
SeqRecord(seq=Seq('C', SingleLetterAlphabet()), id='gi|2765657|emb|Z78532.1|CCZ78532', name='gi|2765657|emb|Z78532.1|CCZ78532', description='gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
753
gi|2765656|emb|Z78531.1|CFZ78531
SeqRecord(seq=Seq('C', SingleLetterAlphabet()), id='gi|2765656|emb|Z78531.1|CFZ78531', name='gi|2765656|emb|Z78531.1|CFZ78531', description='gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
748
gi|2765655|emb|Z78530.1|CMZ78530
SeqRecord(seq=Seq('C', SingleLetterAlphabet()), id='gi|2765655|emb|Z78530.1|CMZ78530', name='gi|2765655|emb|Z78530.1|CMZ78530', description

## Simple GenBank parsing example 

In [21]:
from Bio import SeqIO
for seq_record in SeqIO.parse("datas/ls_orchid.gbk","genbank"):
    print(seq_record.id)
    print(seq_record.seq)  # try repr(seq_record.seq) to see the difference
    print(len(seq_record))

Z78533.1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC
740
Z78532.1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAG

## Chapter 3 
# Sequence objects

#### We'll use the IUPAC alphabets here to deal with some of our favorite objects: DNA, RNA and Proteins.

<p> Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides
the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein
class, but there is an additional ExtendedIUPACProtein class providing for the additional
elements "U" (or "Sec" for selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous symbols
"B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx" for glutamine or glutamic acid), "J" (or "Xle"
for leucine or isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've got choices of IUPACUnambiguousDNA,
which provides for just the basic letters, IUPACAmbiguousDNA (which provides for
ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified
bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.
The advantages of having an alphabet class are twofold. First, this gives an idea of the type of information
the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type
checking.
Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work.
You can create an ambiguous sequence with the default generic alphabet like this: </p>

In [2]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq

Seq('AGTACACTGGT')

In [3]:
# Alphabet
my_seq.alphabet

Alphabet()

However, where possible you should specify the alphabet explicitly when creating your sequence objects
- in this case an unambiguous DNA alphabet object:

In [4]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("AGTACACTGGT",IUPAC.unambiguous_dna)
my_seq

Seq('AGTACACTGGT', IUPACUnambiguousDNA())

In [5]:
my_seq.alphabet

IUPACUnambiguousDNA()

In [7]:
#Unless of course, this really is an amino acid sequence:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_prot = Seq("AGTACACTGGT", IUPAC.protein)
print(my_prot)
print(my_prot.alphabet)


AGTACACTGGT
IUPACProtein()


In [8]:
my_prot.alphabet

IUPACProtein()

#### Sequences act like strings

In [1]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
for index, letter in enumerate(my_seq):
    print("%i %s"%(index,letter))

0 G
1 A
2 T
3 C
4 G


In [4]:
print(len(my_seq))
my_seq

5


Seq('GATCG', IUPACUnambiguousDNA())

In [6]:
print(my_seq[0]) #First Letter
print(my_seq[1]) #Second Letter
print(my_seq[-1])  #Last letter
# indexing works the same way as for Python strings

G
A
G


#### The Seq object has a .count() method which, just like a string's, gives a non-overlapping count

In [8]:
from Bio.Seq import Seq
print("AAAA".count("AA"))
print(Seq("AAAA").count("AA"))

2
2


For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When
searching for single letters, this makes no difference:
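
As a minimal sketch of what an overlapping count could look like, the plain-Python helper below advances one position at a time instead of skipping past each match. The name `count_overlapping` is made up for illustration and is not part of Biopython.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing matches to overlap."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Restart the search one position later so overlapping hits are found
        start = sequence.find(pattern, start + 1)
    return count


print("AAAA".count("AA"))             # non-overlapping count
print(count_overlapping("AAAA", "AA"))  # overlapping count
```

The function works on plain strings, so a Seq object would first need converting with `str()`.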

In [19]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
len(my_seq)

32

In [14]:
my_seq.count("AT")  # try different search terms

6

In [16]:
100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)  # GC percentage computed by hand

46.875

While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module
has several GC functions already built. For example:

In [21]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqUtils import GC
my_seq  = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC",IUPAC.unambiguous_dna)
GC(my_seq)

46.875

Note that the Bio.SeqUtils.GC() function automatically copes with mixed case sequences and
the ambiguous nucleotide S, which means G or C.

Also note that, just like a normal Python string, the Seq object is in some ways "read-only". If you need
to edit your sequence, for example to simulate a point mutation, see the "Mutable Seq objects" section below
(Section 3.12 of the tutorial), which covers the MutableSeq object.
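
To illustrate the two points about GC content (mixed case and the ambiguous letter S), here is a plain-Python sketch of a GC percentage. It is an approximation written for this note, not Biopython's own implementation, and the name `gc_percent` is made up.

```python
def gc_percent(sequence):
    """Return the percentage of G, C and S (G or C) letters, ignoring case."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc = sum(upper.count(letter) for letter in "GCS")
    return 100.0 * gc / len(upper)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
print(gc_percent("gatcGATGggccTATATAggatCGAAAATCGC"))  # same letters, mixed case
```

Upper-casing once before counting keeps the function short; counting S as GC follows the behavior described above.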

## Slicing a sequence

In [23]:
# A more complicated example, let's get a slice of the sequence
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq[4:12]

Seq('GATGGGCC', IUPACUnambiguousDNA())

In [28]:
my_seq[0::3]

Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
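
The stride slice above picks every third letter starting at index 0, which is the first position of each codon. A short plain-string sketch, assuming the same example sequence, shows all three codon positions side by side:

```python
sequence = "GATCGATGGGCCTATATAGGATCGAAAATCGC"

# Slicing with a step of 3 from offsets 0, 1 and 2 gives the
# first, second and third codon positions respectively.
first_positions = sequence[0::3]
second_positions = sequence[1::3]
third_positions = sequence[2::3]

print(first_positions)
print(second_positions)
print(third_positions)
```

The same slices work on a Seq object, where each result is again a Seq.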

### Turning Seq objects into strings

If you really do just need a plain string, for example to write to a file or insert into a database,
then this is very easy to get:

In [36]:
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())

In [32]:
str(my_seq)  # convert the Seq object to a plain Python string

'GATCGATGGGCCTATATAGGATCGAAAATCGC'

In [37]:
print(my_seq)

GATCGATGGGCCTATATAGGATCGAAAATCGC


In [38]:
# You can also use a Seq object directly with a %s placeholder when using Python string formatting
# or the interpolation operator (%)

fasta_format_string = ">Name\n%s\n" % my_seq
print(fasta_format_string)

>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC



### Concatenating or adding sequences

Naturally, you can in principle add any two Seq objects together, just as you can concatenate Python strings.
However, you can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence:

from Bio.Alphabet import IUPAC <br>
from Bio.Seq import Seq <br>
protein_seq = Seq("EVERNAK", IUPAC.protein) <br>
dna_seq = Seq("ACGT",IUPAC.unambiguous_dna)  <br>
protein_seq + dna_seq


### The code above raises an error
If you really wanted to do this, you'd have to first give both sequences generic alphabets:


In [40]:
from Bio.Alphabet import generic_alphabet
protein_seq.alphabet = generic_alphabet
dna_seq.alphabet = generic_alphabet
protein_seq + dna_seq

Seq('EVERNAKACGT')

Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence:

In [43]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_nucleotide
from Bio.Alphabet import IUPAC
nuc_seq = Seq("GATCGATGC",generic_nucleotide)
dna_seq= Seq("ACGT", IUPAC.unambiguous_dna)
nuc_seq

Seq('GATCGATGC', NucleotideAlphabet())

In [44]:
dna_seq

Seq('ACGT', IUPACUnambiguousDNA())

In [47]:
nuc_seq+dna_seq

Seq('GATCGATGCACGT', NucleotideAlphabet())

In [48]:
# You may often have many sequences to add together, which can be done with a for loop like this:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
concatenated = Seq("", generic_dna)
for s in list_of_seqs:
    concatenated += s


In [49]:
concatenated

Seq('ACGTAACCGGTT', DNAAlphabet())

Or, a more elegant approach is to use the built-in sum function with its optional start value argument
(which otherwise defaults to zero):

In [51]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
list_of_seqs = [Seq("ACGT",generic_dna),Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
sum(list_of_seqs , Seq("", generic_dna))

Seq('ACGTAACCGGTT', DNAAlphabet())
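
Why the start value matters can be seen without Biopython. The sketch below uses a made-up `Fragment` class as a stand-in for any sequence type that only supports `+` with its own kind; it is an illustration of how `sum` behaves, not a model of how Seq is implemented.

```python
class Fragment:
    """Hypothetical stand-in for a sequence type that can only be added to itself."""

    def __init__(self, letters):
        self.letters = letters

    def __add__(self, other):
        if not isinstance(other, Fragment):
            return NotImplemented
        return Fragment(self.letters + other.letters)


fragments = [Fragment("ACGT"), Fragment("AACC"), Fragment("GGTT")]

try:
    sum(fragments)  # starts from the integer 0, so 0 + Fragment is attempted
    default_start_worked = True
except TypeError:
    default_start_worked = False

# Supplying an empty Fragment as the start value makes every addition valid
joined = sum(fragments, Fragment(""))
print(default_start_worked, joined.letters)
```

Passing an empty object of the right type as the start value is what lets `sum` concatenate instead of trying to add to zero.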

### 3.6 Changing Case

## Seq objects have upper() and lower() methods, just like Python strings

In [53]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
dna_seq = Seq("acgtACGT", generic_dna)
print(dna_seq)
print(dna_seq.upper())
print(dna_seq.lower())
print("GTAC" in dna_seq)
print("GTAC" in dna_seq.upper())
print("gtac" in dna_seq.lower()) 

acgtACGT
ACGTACGT
acgtacgt
False
True
True


### 3.7 Nucleotide sequences and (reverse) complements

For nucleotide sequences you can easily obtain the complement or reverse complement of a Seq object using its built-in methods:


In [55]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())

In [56]:
my_seq.complement() #Complement

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())

In [57]:
my_seq.reverse_complement()  # reverse complement

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
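
For readers who want to see what these two methods do to the letters, here is a plain-Python sketch for unambiguous DNA only. The helper names are made up for illustration; Biopython's own methods also handle ambiguity codes and alphabets, which this sketch does not.

```python
# Translation table mapping each unambiguous base to its partner (both cases)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap each base for its partner: A-T, C-G, G-C, T-A."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


example = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(example))
print(reverse_complement(example))
```

Applying `reverse_complement` twice returns the original sequence, which is a handy sanity check.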

##### An easy way to just reverse a Seq object (or a Python string) is to slice it with a step of -1:


In [64]:
name = "krishna"
name[::-1]    # reverse a plain Python string
my_seq[::-1]  # reverse a Seq object (no complement is taken)
# Proteins have no complement, so asking for one raises an error:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
protein_seq = Seq("EVERNAK", IUPAC.protein)
protein_seq.complement()

ValueError: Proteins do not have complements!

### Transcription 

DNA coding strand (aka Crick strand, strand +1)<br>
5' ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3'<br>
|||||||||||||||||||||||||||||||||||||||<br>
3' TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5'<br>
DNA template strand (aka Watson strand, strand -1)<br>
|<br>
Transcription<br>
↓<br>
5' AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3'<br>
Single stranded messenger RNA<br>

#### For the original diagram, see the Transcription section of the Biopython tutorial

The actual biological transcription process works from the template strand, doing a reverse complement
(TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically
work directly with the coding strand because this means we can get the mRNA sequence just by switching
T → U.<br>
Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the
coding and template DNA strands:<br>

In [70]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
print(coding_dna)
template_dna = coding_dna.reverse_complement()
print(template_dna)

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT


In [67]:
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

In [68]:
messenger_rna = coding_dna.transcribe()  # switches T to U
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T → U and adjust the alphabet.
If you do want to do a true biological transcription starting with the template strand, then this becomes
a two-step process:

In [72]:
template_dna.reverse_complement().transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

The Seq object also has a back_transcribe() method for going from the mRNA back to the coding strand of the DNA; this is a simple U → T substitution:

In [73]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
messenger_rna


Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

In [74]:
messenger_rna.back_transcribe()

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
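
The relationship between the coding strand, the template strand and the mRNA can be sketched with plain strings. The helper functions below are written for this note (they are not Biopython's) and assume unambiguous upper-case DNA.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna):
    """Bioinformatics-style transcription: coding strand to mRNA by switching T to U."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna):
    """Go from mRNA back to the coding strand by switching U to T."""
    return rna.replace("U", "T")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = reverse_complement(coding)

# "True" biological direction: start from the template strand (two steps)
mrna = transcribe(reverse_complement(template))
print(template)
print(mrna)
print(back_transcribe(mrna) == coding)
```

Both routes give the same mRNA, which is why working directly from the coding strand is the usual shortcut.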

### Translation

Sticking with the same example discussed in the transcription section above, let's now translate this mRNA into the corresponding protein sequence, again taking advantage of one of the Seq object's biological methods:


In [75]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

In [76]:
messenger_rna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [78]:
#You can also translate directly from the coding strand DNA sequence:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:

In [79]:
coding_dna.translate(table= "Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

You can also specify the table using the NCBI table number, which is shorter and is often included in the
feature annotation of GenBank files:

In [80]:
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Now, you may want to translate the nucleotides up to the first in-frame stop codon, and then stop (as happens in nature):

In [81]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [82]:
coding_dna.translate(to_stop = True)

Seq('MAIVMGR', IUPACProtein())
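
To show what the table and to_stop options are doing, here is a small plain-Python translator. It is a sketch for unambiguous DNA, not Biopython's implementation: the standard table is built from the usual TCAG ordering, and the vertebrate mitochondrial variant changes only the four codons that differ in the tables printed later in this notebook (TGA, ATA, AGA, AGG). All names are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
STANDARD = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
VERTEBRATE_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table=STANDARD, to_stop=False):
    """Translate codon by codon; with to_stop=True, halt before the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table.get(dna[i:i + 3], "X")  # X for anything not in the table
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding, to_stop=True))
print(translate(coding, table=VERTEBRATE_MITO))
```

Keeping the genetic code in a dictionary makes swapping tables a one-argument change, which mirrors how the `table` argument is used above.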

In [83]:
# You can even specify the stop symbol if you don't like the default asterisk:

In [84]:
coding_dna.translate(table=2, stop_symbol="@")


Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))

In general, given a complete coding sequence (CDS), the default translate method will do what you want (perhaps with the
to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria,
for example the **gene yaaX in E. coli K12**:

In [88]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
gene = Seq("GTGAAAAAGATGCAATCTATCDTACTATCGTACTAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" +\
"GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
"AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
"TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
"AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",generic_dna)


In [89]:
gene

Seq('GTGAAAAAGATGCAATCTATCDTACTATCGTACTAATCTATCGTACTCGCACTT...TAA', DNAAlphabet())

In [90]:
gene.translate(table="Bacterial")

Seq('VKKMQSIXLSY*SIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWD...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [91]:
gene.translate(table = "Bacterial", to_stop=True)

Seq('VKKMQSIXLSY', ExtendedIUPACProtein())

In [92]:
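# In the bacterial genetic code GTG is a valid start codon, and while it does normally encode valine, if used as
# a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a
# complete CDS:

In [94]:
gene.translate(table="Bacterial", cds=True)

TranslationError: Extra in frame stop codon found.

The sequence as entered above is not a valid complete CDS: it contains the ambiguous letter D and an internal in-frame stop codon (the * in VKKMQSIXLSY*), so cds=True raises a TranslationError instead of returning a protein that starts with M.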
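
The kind of checks implied by "complete CDS" can be sketched in plain Python. This is an illustration only, not how Biopython validates a CDS: the function name is made up, and the start codon set is a small illustrative subset rather than the full list for the bacterial table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: illustrative subset of bacterial start codons


def cds_problems(dna):
    """Return a list of reasons why dna is not a complete CDS (empty list if it looks fine)."""
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal in-frame stop codon")
    return problems


print(cds_problems("GTGAAAAAGTAA"))     # a tiny well-formed example
print(cds_problems("GTGTACTAATCTTAA"))  # has TAA in the middle, in frame
```

Returning the list of problems, rather than raising on the first one, makes it easier to see everything that is wrong with a sequence at once.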

#### 3.10 Translation Tables

In [95]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

# Alternatively, these tables are labeled with ID numbers 1 and 2, respectively

In [97]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
mito_table = CodonTable.unambiguous_dna_by_id[2]

# You can compare the actual tables visually by printing them:
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------

In [98]:
#Mitotable
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   

### You may find the following properties useful, for example if you are trying to do your own gene finding:

In [99]:
mito_table.stop_codons

['TAA', 'TAG', 'AGA', 'AGG']

In [100]:
mito_table.start_codons

['ATT', 'ATC', 'ATA', 'ATG', 'GTG']

In [101]:
mito_table.forward_table["ACG"]

'T'
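
As a rough idea of how start and stop codon lists could feed a simple gene finder, the sketch below scans the three forward reading frames for the first start-to-stop stretch. It uses the vertebrate mitochondrial codon lists shown in the outputs above; the function name and the example sequence are made up, and real gene finding needs far more than this.

```python
MITO_STOP_CODONS = {"TAA", "TAG", "AGA", "AGG"}
MITO_START_CODONS = {"ATT", "ATC", "ATA", "ATG", "GTG"}


def first_orf(dna, starts=MITO_START_CODONS, stops=MITO_STOP_CODONS):
    """Scan frames 0, 1, 2 in order; return the first start-to-stop stretch, or None."""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in stops:
                return dna[start:i + 3]
    return None


# Made-up sequence; note that TGA is not a stop codon in this table
print(first_orf("CCATGAAATGATAGCC"))
```

Because TGA codes for tryptophan in the vertebrate mitochondrial table, the reading frame in the example runs through it and ends at TAG instead.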

### Comparing Seq objects

Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal.

The basic problem is that the meaning of a letter depends on context: "A" could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of each<br>
Seq object to try to capture this information, so comparing two Seq objects could mean considering both<br>
the sequence strings and the alphabets.

In [104]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
seq2 = Seq("ACGT",IUPAC.ambiguous_dna)
print(str(seq1) == str(seq2))
print(str(seq1) == str(seq1))

True
True


In [108]:
seq1 == seq2, seq1 =="ACGT"

(True, True)

In [109]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna , generic_protein
dna_seq = Seq("ACGT", generic_dna)
prot_seq = Seq("ACGT", generic_protein)

dna_seq == prot_seq



True

### Mutable Seq objects

In [111]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",IUPAC.unambiguous_dna)

#Observe what happens if you try to edit the sequence:
my_seq[5]="G" #No item assignment

TypeError: 'Seq' object does not support item assignment

In [112]:
# However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything
mutable_seq = my_seq.tomutable()
mutable_seq

MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

#### Alternatively you can create a MutableSeq object directly from a string

In [115]:
from Bio.Seq import MutableSeq
from Bio.Alphabet import IUPAC
mutable_seq  = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
mutable_seq

MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

In [121]:
# Either way will give you a sequence object which can be changed:
mutable_seq[3] = "C"
mutable_seq  # the A at index 3 has been replaced by C

MutableSeq('GCCCTTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

In [124]:
mutable_seq.remove("T")  # removes only the first T
mutable_seq

MutableSeq('GCCCTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

In [126]:
mutable_seq.reverse()  # reverses the sequence in place
mutable_seq

MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGTCCCG', IUPACUnambiguousDNA())

In [128]:
# Once you have finished editing your MutableSeq object, it's easy to get back to a read-only Seq object
# should you need to:
new_seq = mutable_seq.toseq()
new_seq

Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGTCCCG', IUPACUnambiguousDNA())
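
The read-only versus mutable distinction has a close analogue in plain Python: `str` cannot be edited in place, while `bytearray` can. The sketch below is only an analogy for how MutableSeq behaves, using a short made-up sequence; the helper name is hypothetical.

```python
def point_mutation(sequence, position, new_base):
    """Return a copy of an immutable string with one letter replaced."""
    letters = list(sequence)
    letters[position] = new_base
    return "".join(letters)


original = "GCCATTG"
mutated = point_mutation(original, 5, "C")  # the original string is left untouched

# A bytearray supports in-place edits, much like a mutable sequence object
editable = bytearray(b"GCCATTG")
editable[5] = ord("C")       # item assignment
editable.remove(ord("T"))    # removes only the first T
editable.reverse()           # reverses in place
print(mutated, editable.decode())
```

As with the mutable sequence above, `remove` deletes only the first matching letter and `reverse` changes the object itself rather than returning a new one.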

### UnknownSeq objects

In [132]:
# The UnknownSeq object is a subclass of the basic Seq object and its purpose is to represent a sequence 
# where we know the length but not the actual letters making it up.
from Bio.Seq import UnknownSeq
unk = UnknownSeq(20)
print(unk)
unk

????????????????????


UnknownSeq(20, character='?')

In [133]:
# You can of course specify an alphabet, meaning that for nucleotide sequences the letter
# defaults to "N" and for proteins "X", rather than just "?"

from Bio.Seq import UnknownSeq
from Bio.Alphabet import IUPAC
unk_dna  = UnknownSeq(20, alphabet = IUPAC.ambiguous_dna)
print(unk_dna)
unk_dna

NNNNNNNNNNNNNNNNNNNN


UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')

In [136]:
unk_dna.complement()

UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')

In [137]:
unk_dna.reverse_complement()

UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')

In [138]:
unk_dna.transcribe()

UnknownSeq(20, alphabet=IUPACAmbiguousRNA(), character='N')

In [139]:
unk_protein = unk_dna.translate()
unk_protein

UnknownSeq(6, alphabet=ProteinAlphabet(), character='X')

In [140]:
print(unk_protein)

XXXXXX


In [141]:
len(unk_protein)

6

### Working with strings directly 

In [142]:
from Bio.Seq import reverse_complement, transcribe,  back_transcribe, translate
my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
reverse_complement(my_string)

'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'

In [143]:
transcribe(my_string)

'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'

In [144]:
back_transcribe(my_string)

'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'

In [145]:
translate(my_string)

'AVMGRWKGGRAAG*'

### Chapter 3 finished. Thank you very much!