# What is Biopython?

<img src="../images/biopy.png" style="width:500px;height:300px;">

The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. Python is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing. It is easy to learn, has a very clear syntax, and can easily be extended with modules written in C, C++ or FORTRAN.

The Biopython web site (http://www.biopython.org) provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research. The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Its features include parsers for various bioinformatics file formats (BLAST, Clustalw, FASTA, GenBank, ...), access to online services (NCBI, ExPASy, ...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS, ...), a standard sequence class, various clustering modules, a KD tree data structure, and even documentation.
Basically, we just like to program in Python.

### What's in the Biopython package?

- The ability to parse bioinformatics files into data structures that Python can work with, including support for the following formats:
    - Blast output, both from standalone and WWW Blast
    - Clustalw
    - FASTA
    - GenBank
    - PubMed and Medline
    - ExPASy files, like Enzyme and Prosite
    - SCOP, including 'dom' and 'lin' files
    - UniGene
    - SwissProt
- Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
- Code to deal with popular on-line bioinformatics destinations such as:
    - NCBI: Blast, Entrez and PubMed services
    - ExPASy: Swiss-Prot and Prosite entries, as well as Prosite searches
- Interfaces to common bioinformatics programs such as:
    - Standalone Blast from NCBI
    - Clustalw alignment program
    - EMBOSS command line tools
- A standard sequence class that deals with sequences, ids on sequences, and sequence features.
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate processes.
- GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list.
- Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.
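
To make the "iterate record by record, or index and access like a dictionary" idea concrete, here is a minimal plain-Python sketch for FASTA text. It only illustrates the access pattern and is not Biopython's implementation; the helper name `iter_fasta` and the sample records are made up for this example. In Biopython itself, `Bio.SeqIO.parse` and `Bio.SeqIO.to_dict` play these roles.

```python
import io


def iter_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA text, one record at a time."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first word after '>' (empty string if the header is bare)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Made-up example data (hypothetical records)
fasta_text = """>seq1 first toy record
CCGGGTT
ACGT
>seq2 second toy record
GATTACA
"""

# Record-by-record iteration
records = list(iter_fasta(io.StringIO(fasta_text)))

# Dictionary-style access by record ID
index = dict(records)
print(index["seq2"])
```

Because `iter_fasta` is a generator, a large file never has to be held in memory at once; building the dictionary trades that memory saving for direct lookup by ID.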

# Getting Biopython


In [3]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.70-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (2.1MB)
    100% |████████████████████████████████| 2.1MB 380kB/s eta 0:00:01
Installing collected packages: biopython
Successfully installed biopython-1.70


## Step 1: Understanding what Biopython can do for you

Bioinformatics, or computational biology, is all about alphabets. DNA, as most of you may have learned, uses a 4-letter alphabet, namely `(A, C, T, G)`, whereas proteins are built from 20 amino acids, each of them represented by a letter.

**NOTE**: 
- There are also special “alphabets” that account for ambiguous positions, that is, positions where more than one nucleotide may be present. For example, the letter `S` may represent the nucleotides `C` or `G`, and the letter `H` represents `A`, `C`, or `T`.
- In Biopython this ambiguous alphabet is called `ambiguous_dna`. For proteins there is also an extended alphabet, **ExtendedIUPACProtein**, which contains amino acids that are not normally found in proteins.
- Similarly, there is an extended alphabet for nucleotides (**ExtendedIUPACDNA**) that allows letters for modified bases.
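
The ambiguity codes in the note can be written out as a small lookup table. The sketch below is plain Python using the standard IUPAC nucleotide codes; the names `AMBIGUOUS_DNA` and `expand_ambiguous` are hypothetical helpers for this example, not Biopython API. Biopython keeps a comparable table in `Bio.Data.IUPACData.ambiguous_dna_values`.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes: each letter maps to the bases it may stand for
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def expand_ambiguous(seq):
    """Return every unambiguous sequence an ambiguous DNA string could represent."""
    choices = [AMBIGUOUS_DNA[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(AMBIGUOUS_DNA["S"])       # bases that S may stand for
print(AMBIGUOUS_DNA["H"])       # bases that H may stand for
print(expand_ambiguous("AST"))  # all concrete sequences behind "AST"
```

The number of expansions grows multiplicatively with each ambiguous position, so this kind of enumeration is only practical for short sequences.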

Let's revisit proteins: there is also a reduced alphabet that, taking into account common physicochemical properties, *lumps* several amino acids together into one letter!
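
The lumping idea can be sketched in a few lines of plain Python. The grouping below is invented purely for illustration: it is not one of the published reduced alphabets and not Biopython's own table, and the names `GROUPS` and `reduce_protein` are hypothetical.

```python
# Illustrative grouping only (an assumption for this sketch):
# representative letter -> amino acids lumped under it
GROUPS = {
    "L": "LVIMC",  # hydrophobic
    "F": "FWY",    # aromatic
    "K": "KRH",    # basic
    "E": "DE",     # acidic
    "S": "STNQ",   # polar, uncharged
    "A": "AG",     # small
    "P": "P",      # proline kept on its own
}

# Invert the table: every amino acid points to its representative letter
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_protein(seq):
    """Rewrite a protein sequence using the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq.upper())


print(reduce_protein("MKVLDE"))
```

Two sequences that differ only by substitutions within a group map to the same reduced string, which is the point of a reduced alphabet.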

We also have an alphabet that is neither DNA nor amino-acid based: **SecondaryStructure**. This alphabet represents domains like **H**elix, **T**urn, **S**trand and **C**oil.

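Another thing: the alphabets defined by IUPAC are stored in Biopython as classes of the **IUPAC** module. Its parent module is **Bio.Alphabet**, which holds the more general/generic cases. Note that `Bio.Alphabet` ships only with Biopython releases before 1.78, such as the 1.70 installed above.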

OK, enough boring theory, let's now play with the attributes of these alphabets:

In [1]:
import Bio.Alphabet as bioalp

In [5]:
print(bioalp.ThreeLetterProtein.letters)

['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr', 'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx']


In [6]:
from Bio.Alphabet import IUPAC as bioiupac

In [7]:
print(bioiupac.IUPACProtein.letters)

ACDEFGHIKLMNPQRSTVWY


In [8]:
# Unambiguous DNA letters
bioiupac.unambiguous_dna.letters

'GATC'

In [9]:
# Ambiguous DNA letters
bioiupac.ambiguous_dna.letters

'GATCRYWSMKHBVDN'

In [10]:
# Unambiguous RNA letters
bioiupac.unambiguous_rna.letters

'GAUC'

In [11]:
# Ambiguous RNA letters
bioiupac.ambiguous_rna.letters

'GAUCRYWSMKHBVDN'

In [12]:
# ExtendedIUPACProtein Letters
bioiupac.ExtendedIUPACProtein.letters

'ACDEFGHIKLMNPQRSTVWYBXZJUO'

In [14]:
# ExtendedIUPACDNA letters
bioiupac.ExtendedIUPACDNA.letters

'GATCBDSW'

### Sequences: alphabets are used to define the content of a sequence

**Question**: How would we know whether a sequence such as `CCGGGTT` is a small peptide made of cysteine, glycine and threonine residues, or a DNA fragment made of cytosine, guanine and thymine?

Fortunately, the **Seq** object helps us here: if the sequence were stored as a plain Python string, we wouldn't know what kind of sequence it is.

In [15]:
"""
Here the object is composed of the sequence itself and an alphabet
which defines the nature of the sequence
"""
from Bio.Seq import Seq
seq = Seq('CCGGGTT', bioiupac.unambiguous_dna)  # define this Seq as DNA

In [16]:
# Next, we use the transcribe and translate methods within the Seq
seq.transcribe()

Seq('CCGGGUU', IUPACUnambiguousRNA())

In [17]:
seq.translate()



Seq('PG', IUPACProtein())
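
To see what these two calls are doing, here is a plain-Python sketch of transcription and translation using the standard genetic code. It is an illustration under simplifying assumptions, not Biopython's implementation: the function names are hypothetical, stop codons are written as `*`, and a trailing incomplete codon is simply dropped, which is consistent with the two-letter result above for a seven-base input.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to RNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(seq):
    """Translate DNA or RNA codon by codon; a trailing partial codon is ignored."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(transcribe("CCGGGTT"))
print(translate("CCGGGTT"))
```

Reading `CCGGGTT` in steps of three gives the codons `CCG` (proline, `P`) and `GGT` (glycine, `G`), with one `T` left over.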

### But, can we transcribe an RNA sequence? 

Let's find out...