In [1]:
# general libraries
import json
import urllib.request
from copy import copy
# biopython libraries (this notebook targets Biopython < 1.78, which still ships Bio.Alphabet and Bio.SubsMat)
from Bio.Alphabet import generic_dna, generic_protein
## for sequence handling
from Bio import SeqIO
from Bio.Seq import Seq
## for alignments
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo
## for BLAST
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

import pandas as pd

from IPython.display import Image
from IPython.display import HTML

# DNA to protein sequences

## You will KNOW:
* What is a gene in biology
* How does a gene become a protein in the cell
* How a gene is represented in Ensembl
* What is a REST web-service
* Some magic in Jupyter notebooks


## You will BE ABLE TO:
* Retrieve gene sequences with a REST service
* Align biological sequences with the BioPython library
* Visualize alignments in Jalview
* Analyze a protein sequence launching InterProScan remotely
* Find similar sequences launching pHMMER remotely

## We will use
* Python3 (Jupyter in this case, but a plain Python 3 interpreter works too; `urllib.request` is not available in Python 2.7)
* BioPython library
* Jalview

### Install python and required libraries (Unix-like) (may require sudo)
`apt install python3`<br>
`apt install python3-pip`<br>
`pip install biopython`<br>
`pip install pandas` (just for visualization purposes) <br>
`pip install xmltramp2` (required by InterProScan API script) <br>

### If you want to use Jupyter 
`pip install jupyter`<br>
`jupyter notebook`

## What is a gene

A gene is a *discrete* *inheritable* unit that gives rise to observable *physical characteristics*.

* **It gives rise to physical characteristics** means that it produces an observable phenomenon (a.k.a. phenotype)
* **Inheritable** means that it can be transferred across generations
* **Discrete** means that each phenotype can be inherited independently of the others

## Discovery of DNA and invention of genetics

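1866: Gregor Mendel publishes the first description of discrete inheritable units

1905: William Bateson coins the term *genetics*; the name *gene* follows in 1909 from Wilhelm Johannsen

1928: Frederick Griffith observes that a "transforming principle" can pass traits between bacteria; in 1944 Avery, MacLeod and McCarty identify it as DNA

1941: Edward Lawrie Tatum and George Wells Beadle show that genes code for proteins (the *one gene, one enzyme* hypothesis), a precursor of the **central dogma of molecular biology**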

## The central dogma of molecular biology
<img width="80%" src="media/centraldogma.jpg">

A gene is first **transcribed** to RNA, then **translated** to protein
* DNA sits in the nucleus of the cell (in eukaryotes) and is made of Adenine (A), Thymine (T), Cytosine (C) and Guanine (G) **nucleotides**
* RNA is made in the nucleus and then exported to the cytoplasm. It uses Uracil (U) instead of Thymine (T)
* Proteins are made of **amino acids**, molecules completely different from nucleotides
* A **codon** (3 nucleotides) codes for 1 amino acid following the **genetic code**, which is nearly universal across all life forms.
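
The two steps can be sketched in a few lines of plain Python. This is only an illustration: the codon table below is deliberately partial (just the codons needed for the example), the function names `transcribe` and `translate` are hypothetical, and the trailing `TAA` stop codon was appended to the example sequence for demonstration purposes.

```python
# Partial codon table: only the codons used in this example (an assumption made
# for brevity; the full genetic code has 64 codons).
CODON_TABLE = {"AUG": "M", "GAC": "D", "CAG": "Q", "UAC": "Y", "UGC": "C", "UAA": "*"}


def transcribe(dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an mRNA codon by codon, stopping at the first stop codon (*)."""
    protein = []
    # ignore trailing nucleotides that do not form a complete codon
    for i in range(0, len(rna) - len(rna) % 3, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGACCAGTACTGCTAA"  # toy sequence ending with a hypothetical stop codon
rna = transcribe(dna)
protein = translate(rna)
print(rna)
print(protein)
```

Later in this notebook the same operations are performed with BioPython, which knows the complete genetic code.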

## The structure of DNA

<img align='center' width=80% src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/DNA_chemical_structure_2.svg/1200px-DNA_chemical_structure_2.svg.png">


In 1953 the structure of DNA was resolved as a double helix by Rosalind Franklin, James Watson and Francis Crick

It's a molecule composed of two chains (made of *nucleotides*) that coil around each other to form a double helix. 

Each chain is a sequence of nucleotides and the two chains are held together by interactions (hydrogen bonds) between complementary nucleotides: A pairs with T and C pairs with G
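
Because the two chains are complementary and run in opposite directions, the sequence of one chain determines the other. A minimal sketch of this relationship, assuming plain upper- or lower-case `ACGT` strings (the helper name `reverse_complement` is hypothetical):

```python
# Translation table mapping each nucleotide to its complementary one
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite chain, read in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


forward = "ATGGACCAG"
reverse = reverse_complement(forward)
print(forward)
print(reverse)
```

This matters in practice because a gene can sit on either chain: the gene retrieved later in this notebook is annotated on strand `-1`.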

## Genes are highly complex units
<img align='center' width=80% src="https://upload.wikimedia.org/wikipedia/commons/5/54/Gene_structure_eukaryote_2_annotated.svg">

Many portions of a gene have **regulatory functions**

The portion of a gene that is transcribed to RNA has its own regulatory sequences (UTRs, introns) and needs to be processed and trimmed (splicing) before becoming a **mature** messenger

Introns are removed and exons are joined to form a mature messenger RNA, which is then **translated** to protein

In [7]:
HTML('<iframe width="729" height="410" src="https://www.youtube.com/embed/gG7uCskUOrA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

In [8]:
HTML('<iframe width="688" height="410" src="https://www.youtube.com/embed/2BwWavExcFI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Some numbers
A ribosome translates approximately $20$ amino acids per second (reading $60$ nucleotides per second)

In a typical cell there are around $50\,000$ ribosomes

In a typical human there are around $37$ trillion $(37 \times 10^{12})$ cells

$$ 20 \times (5 \times 10^4) \times (37 \times 10^{12}) = 3.7 \times 10^{19}$$

That is roughly $3.7 \times 10^{19}$ amino acids added to proteins every second, accounting for translation events alone. Other operations going on at cellular level:
* DNA replication
* Chromosome breathing (folding/unfolding)
* DNA repair
* Epigenetic modifications (e.g. DNA methylation)
* RNA-mediated cell regulation
* Protein post-translational modifications
* Enzyme-mediated catalysis of chemical reactions
* ...
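
The back-of-the-envelope estimate in this section can be reproduced with a few lines of Python, using the three figures quoted in the text (20 amino acids per second per ribosome, 50 000 ribosomes per cell, 37 trillion cells). The variable names are only illustrative, and the figures themselves are rough orders of magnitude.

```python
# Figures quoted in the text (rough orders of magnitude, not measurements)
amino_acids_per_second_per_ribosome = 20
ribosomes_per_cell = 50_000
cells_per_human = 37 * 10**12

# Amino acids added to growing proteins every second in a whole human body
amino_acids_per_second = (
    amino_acids_per_second_per_ribosome * ribosomes_per_cell * cells_per_human
)

# Scientific notation makes the order of magnitude easy to read
print(f"{amino_acids_per_second:.1e} amino acids per second")
```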

## Genes in Ensembl
Genes are stored as sequences of letters representing nucleotides (A, C, G, T)

The central resource containing gene sequences is __[Ensembl](https://www.ensembl.org/index.html)__

Ensembl is an example of a **genome browser**, a tool to browse the genome.

It's highly organized and has an elaborate interface to retrieve and __[visualize data](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000139618;r=13:32315474-32400266)__

Ensembl, like many EBI services, offers a **REST API**

### What is a REST service
Representational State Transfer (**REST**) is a set of constraints to be used for creating web services.

RESTful web services provide **interoperability** between computer systems on the Internet.

In a RESTful web service, requests are made to a resource's URL (Uniform Resource Locator), colloquially known as a **web address**

Requests usually follow the **HTTP** protocol (the operations available are GET, POST, ...)

RESTful web services expose one or more **endpoints** to access different data. 
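
As a rough illustration of these concepts, the sketch below assembles a GET request from a base URL, an endpoint, a resource identifier and a set of query parameters. The helper `build_request` is hypothetical, and no request is actually sent: the code only builds the request object so that its parts can be inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, endpoint, resource_id, params):
    """Assemble (without sending) a GET request for a REST resource."""
    # urlencode turns a dict into 'key=value&key=value'; safe='/' keeps the slash
    # of 'application/json' readable instead of percent-encoding it
    query = urlencode(params, safe="/")
    url = f"{base_url}{endpoint}{resource_id}?{query}"
    return Request(url, method="GET")


request = build_request(
    "https://rest.ensembl.org/",
    "sequence/id/",
    "ENSG00000156345",
    {"content-type": "application/json", "type": "genomic"},
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual call.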

## Retrieve a gene sequence from Ensembl
* Use the RESTful API to retrieve a gene sequence from the Ensembl database.
* The gene we want to obtain is the *cyclin dependent kinase 20* (`ENSG00000156345`)
* Documentation for Ensembl endpoints is at __[https://rest.ensembl.org/](https://rest.ensembl.org/)__
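* There are different kinds of sequences we can ask for
    * **genomic**: the genomic DNA sequence, introns included
    * **cdna**: complementary DNA, the spliced transcript sequence (it corresponds to the mature messenger RNA, UTRs included)
    * **cds**: coding sequence, the part of the spliced transcript that is translated to protein (no UTRs)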
    * **protein**: protein sequence



In [4]:
base_url = 'https://rest.ensembl.org/'
endpoint = 'sequence/id/'
seq_id = 'ENSG00000156345'
parameters = '?content-type=application/json&type=genomic'
gdna_url = base_url + endpoint + seq_id + parameters

gdna = urllib.request.urlopen(gdna_url).read()
gdna

b'{"version":17,"desc":"chromosome:GRCh38:9:87966441:87974753:-1","seq":"GCTGTCATCGTTCCGTGGGCCCTGCTGCGGGCACGCTCTCGGCGCATGCGTTTTTTATGCGGGATTAAGCTTGCTGCTGCGTGACAGCGGAGGGCTAGGAAAAGGCGCAGTGGGGCCCGGAGCTGTCACCCCTGACTCGACGCAGCTTCCGTTCTCCTGGTGACGTCGCCTACAGGAACCGCCCCAGTGGTCAGCTGCCGCGCTGTTGCTAGGCAACAGCGTGCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGGAAGTTCAGGGGCACAGGGGCACAGGCCCACGACTGCAGCGGGATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACGGCATCGTCTTCAAGGCCAAGCACGTGGAGGTGAGGCTGGACCGCGGCCGGCAGCCTGGCGGGGGTGTGCCCCCGCCACCCTCCGGCTAACGCTCTAAACTGTTTCGGTTCCCTTTTTACATCCAGTACAGTTTTTAAAACCTACTCATATTCTAAACCTACTTTGGGCCGTTGCGCTTCCCTCCGCACAGCTGGCTTGGTCCCCTACCCCAGCGGCTGGGTCCCAGGCTAGTCCTAGACCCCCGAGGAGGGCCTCTGGCCGAGCCGGGGGCGCGTGTCTCTCTCTCAACCACCTTCCCCTCACCCACCTTCATCTCTCTTTCCCAGCCGAGGGTGGGCTGGCAGTGTCTGCCTTCTATCCTGCAGACTGGCGAGATAGTTGCCCTCAAGAAGGTGGCCCTAAGGCGGTTGGAGGACGGCTTCCCTAACCAGGCCCTGCGGGAGATTAAGGCTCTGCAGGAGATGGAGGACAATCAGTATGTGAGTAGGGGAGGGGGGGCATGGTATTCTCACCCTCAGTCGCTCGTTCCCACTTCTTTGTGCCTTCATTTTCCCAGCTACGCCTTCACCAGCTT

In [5]:
# import json
gdna = json.loads(gdna)
gdna

{'version': 17,
 'desc': 'chromosome:GRCh38:9:87966441:87974753:-1',
 'seq': 'GCTGTCATCGTTCCGTGGGCCCTGCTGCGGGCACGCTCTCGGCGCATGCGTTTTTTATGCGGGATTAAGCTTGCTGCTGCGTGACAGCGGAGGGCTAGGAAAAGGCGCAGTGGGGCCCGGAGCTGTCACCCCTGACTCGACGCAGCTTCCGTTCTCCTGGTGACGTCGCCTACAGGAACCGCCCCAGTGGTCAGCTGCCGCGCTGTTGCTAGGCAACAGCGTGCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGGAAGTTCAGGGGCACAGGGGCACAGGCCCACGACTGCAGCGGGATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACGGCATCGTCTTCAAGGCCAAGCACGTGGAGGTGAGGCTGGACCGCGGCCGGCAGCCTGGCGGGGGTGTGCCCCCGCCACCCTCCGGCTAACGCTCTAAACTGTTTCGGTTCCCTTTTTACATCCAGTACAGTTTTTAAAACCTACTCATATTCTAAACCTACTTTGGGCCGTTGCGCTTCCCTCCGCACAGCTGGCTTGGTCCCCTACCCCAGCGGCTGGGTCCCAGGCTAGTCCTAGACCCCCGAGGAGGGCCTCTGGCCGAGCCGGGGGCGCGTGTCTCTCTCTCAACCACCTTCCCCTCACCCACCTTCATCTCTCTTTCCCAGCCGAGGGTGGGCTGGCAGTGTCTGCCTTCTATCCTGCAGACTGGCGAGATAGTTGCCCTCAAGAAGGTGGCCCTAAGGCGGTTGGAGGACGGCTTCCCTAACCAGGCCCTGCGGGAGATTAAGGCTCTGCAGGAGATGGAGGACAATCAGTATGTGAGTAGGGGAGGGGGGGCATGGTATTCTCACCCTCAGTCGCTCGTTCCCACTTCTTTGTGCCTTCATTTTCCCAGCTACGCCTTCACC

In [6]:
# items() returns a view of (key, value) pairs
# the format spec {:>8} right-aligns the key in a field of 8 characters
# a conditional expression (ternary operator) appends the remaining length only for 'seq'
for k, v in gdna.items():
    print('{:>8}: {}{}'.format(k, str(v)[:40], '... {} more'.format(len(gdna['seq']) - 40) if k == 'seq' else ''))

 version: 17
    desc: chromosome:GRCh38:9:87966441:87974753:-1
     seq: GCTGTCATCGTTCCGTGGGCCCTGCTGCGGGCACGCTCTC... 8273 more
   query: ENSG00000156345
      id: ENSG00000156345
molecule: dna


## Retrieve all transcripts relative to a gene
By using the same endpoint with different parameters we have access to the transcript(s) of a gene. Remember that a single gene may have **multiple** transcripts.

* Sequence endpoint documentation: __[https://rest.ensembl.org/documentation/info/sequence_id](https://rest.ensembl.org/documentation/info/sequence_id)__
* Ensembl REST base url `https://rest.ensembl.org/`
* Sequence endpoint     `sequence/id/`
* Gene id               `ENSG00000156345`
* python function       `urllib.request.urlopen(URL).read()`
* Hint 1: Remember to set the `type` parameter to `cdna` and `cds`
* Hint 2: You need to account for the multiple sequences in the response by setting the parameter `multiple_sequences=1`

Try to answer these questions:
* How many cdna transcripts do you find? How many cds transcripts? 
* Can you imagine why the numbers are different? 
* Do you notice any differences in the sequences of cdna and cds transcripts?
* Why do you think they are different?

In [7]:
# import urllib.request, json
# import pandas as pd
params = '?content-type=application/json&type=cdna&multiple_sequences=1'
cdna_url = base_url + endpoint + seq_id + params
cdna = urllib.request.urlopen(cdna_url).read()
pd.DataFrame.from_dict(json.loads(cdna)).set_index('id')

| id | desc | molecule | query | seq | version |
|---|---|---|---|---|---|
| ENST00000459720 |  | dna | ENSG00000156345 | GCTTGCTGCTGCGTGACAGCGGAGGGCTAGGAAAAGGCGCAGTGGG... | 5 |
| ENST00000603475 |  | dna | ENSG00000156345 | CGTGCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGG... | 1 |
| ENST00000336654 |  | dna | ENSG00000156345 | GTCGCCTACAGGAACCGCCCCAGTGGTCAGCTGCCGCGCTGTTGCT... | 9 |
| ENST00000375883 |  | dna | ENSG00000156345 | GCTGTCATCGTTCCGTGGGCCCTGCTGCGGGCACGCTCTCGGCGCA... | 7 |
| ENST00000375871 |  | dna | ENSG00000156345 | AGTTCAGGGGCACAGGGGCACAGGCCCACGACTGCAGCGGGATGGA... | 8 |
| ENST00000605159 |  | dna | ENSG00000156345 | GGAGAAGTGGAGTTTGGAAGTTCAGGGGCACAGGGGCACAGGCCCA... | 5 |
| ENST00000325303 |  | dna | ENSG00000156345 | CTGTCATCGTTCCGTGGGCCCTGCTGCGGGCACGCTCTCGGCGCAT... | 8 |
| ENST00000486228 |  | dna | ENSG00000156345 | TTTTATGCGGGATTAAGCTTGCTGCTGCGTGACAGCGGAGGGCTAG... | 7 |
| ENST00000604175 |  | dna | ENSG00000156345 | GCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGGAAG... | 5 |
| ENST00000605591 |  | dna | ENSG00000156345 | GCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGGAAG... | 1 |


In [8]:
cds_url = base_url + endpoint + seq_id + params.replace('cdna', 'cds')
cds = json.loads(urllib.request.urlopen(cds_url).read())
pd.DataFrame.from_dict(cds).set_index('id')

| id | desc | molecule | query | seq | version |
|---|---|---|---|---|---|
| ENST00000336654 |  | dna | ENSG00000156345 | ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACG... | 9 |
| ENST00000375883 |  | dna | ENSG00000156345 | ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACG... | 7 |
| ENST00000375871 |  | dna | ENSG00000156345 | ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACG... | 8 |
| ENST00000605159 |  | dna | ENSG00000156345 | ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACG... | 5 |
| ENST00000325303 |  | dna | ENSG00000156345 | ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACG... | 8 |
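
One way to explore the differences between cdna and cds sequences is to look for a coding sequence inside the cDNA of the same transcript: whatever precedes it is the 5' UTR and whatever follows it is the 3' UTR. The sketch below uses short made-up sequences rather than the real Ensembl records, and `split_utrs` is a hypothetical helper.

```python
def split_utrs(cdna_seq, cds_seq):
    """Return the (5' UTR, 3' UTR) flanking a coding sequence inside its cDNA."""
    start = cdna_seq.find(cds_seq)
    if start == -1:
        raise ValueError("coding sequence not found inside the cDNA")
    end = start + len(cds_seq)
    return cdna_seq[:start], cdna_seq[end:]


# Toy sequences (not real Ensembl data): UTRs surround a short coding sequence
toy_cds = "ATGGACCAGTACTAA"
toy_cdna = "GGCACAGG" + toy_cds + "GCTCTCC"

five_prime_utr, three_prime_utr = split_utrs(toy_cdna, toy_cds)
print("5' UTR:", five_prime_utr)
print("3' UTR:", three_prime_utr)
```

With the real data, the same idea could be applied to the `seq` values of a cdna record and a cds record sharing the same transcript `id`.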


## Retrieve exons of a gene
The Ensembl API lets you get the list of exons found in a genomic region through the `overlap` endpoint. 

This endpoint finds sequence features overlapping a genomic region, in this case `feature=exon`.



In [9]:
exons_url = base_url + 'overlap/id/' + seq_id + '?content-type=application/json&feature=exon'
exons = urllib.request.urlopen(exons_url).read()
exons = json.loads(exons)
pd.DataFrame.from_dict(exons).set_index('id').head()

| id | Parent | assembly_name | constitutive | end | ensembl_end_phase | ensembl_phase | exon_id | feature_type | rank | seq_region_name | source | start | strand | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSE00001862627 | ENST00000459720 | GRCh38 | 0 | 87974685 | -1 | -1 | ENSE00001862627 | exon | 1 | 9 | havana | 87974372 | -1 | 1 |
| ENSE00003580906 | ENST00000459720 | GRCh38 | 0 | 87974035 | -1 | -1 | ENSE00003580906 | exon | 2 | 9 | havana | 87973922 | -1 | 1 |
| ENSE00003551454 | ENST00000459720 | GRCh38 | 0 | 87971335 | -1 | -1 | ENSE00003551454 | exon | 3 | 9 | havana | 87971147 | -1 | 1 |
| ENSE00001872570 | ENST00000459720 | GRCh38 | 0 | 87970897 | -1 | -1 | ENSE00001872570 | exon | 4 | 9 | havana | 87969194 | -1 | 1 |
| ENSE00001822512 | ENST00000459720 | GRCh38 | 0 | 87967659 | -1 | -1 | ENSE00001822512 | exon | 5 | 9 | havana | 87966441 | -1 | 1 |


In [10]:
len(exons)

55

There are so many exons because exons are redundant in Ensembl: each transcript has its own set of exons, even when they cover the exact same genomic region.

Each exon has a parent transcript, identified by its `Parent` key.

We can filter exons by their `Parent` key to select only those belonging to one (or all) transcript(s).

In [11]:
transcript = cds[0]['id']
exons_example = list(filter(lambda d: d['Parent']==transcript, exons))
pd.DataFrame.from_dict(exons_example).set_index('id')

| id | Parent | assembly_name | constitutive | end | ensembl_end_phase | ensembl_phase | exon_id | feature_type | rank | seq_region_name | source | start | strand | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSE00001890030 | ENST00000336654 | GRCh38 | 0 | 87974589 | 0 | -1 | ENSE00001890030 | exon | 1 | 9 | ensembl_havana | 87974372 | -1 | 1 |
| ENSE00001342208 | ENST00000336654 | GRCh38 | 0 | 87974074 | 0 | 0 | ENSE00001342208 | exon | 2 | 9 | ensembl_havana | 87973922 | -1 | 2 |
| ENSE00003501098 | ENST00000336654 | GRCh38 | 0 | 87971335 | 0 | 0 | ENSE00003501098 | exon | 3 | 9 | ensembl_havana | 87971147 | -1 | 1 |
| ENSE00003471288 | ENST00000336654 | GRCh38 | 0 | 87970897 | 2 | 0 | ENSE00003471288 | exon | 4 | 9 | ensembl_havana | 87970776 | -1 | 1 |
| ENSE00003508198 | ENST00000336654 | GRCh38 | 0 | 87969919 | 0 | 2 | ENSE00003508198 | exon | 5 | 9 | ensembl_havana | 87969796 | -1 | 1 |
| ENSE00003535476 | ENST00000336654 | GRCh38 | 0 | 87969349 | 0 | 0 | ENSE00003535476 | exon | 6 | 9 | ensembl_havana | 87969194 | -1 | 1 |
| ENSE00001902992 | ENST00000336654 | GRCh38 | 0 | 87967659 | -1 | 0 | ENSE00001902992 | exon | 7 | 9 | ensembl_havana | 87966446 | -1 | 1 |
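
The `start` and `end` columns are enough to compute the length of each exon. Ensembl coordinates are 1-based and inclusive, so the length is `end - start + 1`. The sketch below hard-codes the coordinates of the seven exons listed above instead of reading them from the `exons_example` list, so that it can run on its own.

```python
# (start, end) pairs copied from the table above, in rank order
exon_coords = [
    (87974372, 87974589),
    (87973922, 87974074),
    (87971147, 87971335),
    (87970776, 87970897),
    (87969796, 87969919),
    (87969194, 87969349),
    (87966446, 87967659),
]

# 1-based inclusive coordinates: both ends belong to the exon
exon_lengths = [end - start + 1 for start, end in exon_coords]
spliced_length = sum(exon_lengths)

print(exon_lengths)
print(spliced_length)
```

The sum of the exon lengths is the length of the spliced transcript (UTRs included), since introns are not part of any exon.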


### Exons usage
Not all transcripts use the same exons. 

Some transcripts are shorter because they are assembled from a different combination of exons.

Furthermore, exons can be assembled in different orders.

Luckily, not all combinations are viable.

In [12]:
exons_per_transcript = [[tr['id']] + [e['id'] for e in filter(lambda d: d['Parent'] == tr['id'], exons)] for tr in cds]
pd.DataFrame(exons_per_transcript).set_index(0).transpose()

|  | ENST00000336654 | ENST00000375883 | ENST00000375871 | ENST00000605159 | ENST00000325303 |
|---|---|---|---|---|---|
| 1 | ENSE00001890030 | ENSE00003592303 | ENSE00003553919 | ENSE00003641963 | ENSE00001468688 |
| 2 | ENSE00001342208 | ENSE00003694606 | ENSE00003694606 | ENSE00003694606 | ENSE00003694606 |
| 3 | ENSE00003501098 | ENSE00003501098 | ENSE00003501098 | ENSE00003501098 | ENSE00003501098 |
| 4 | ENSE00003471288 | ENSE00003471288 | ENSE00003471288 | ENSE00003471288 | ENSE00003471288 |
| 5 | ENSE00003508198 | ENSE00003508198 | ENSE00003658341 | ENSE00003508198 | ENSE00003532544 |
| 6 | ENSE00003535476 | ENSE00003535476 | ENSE00001701771 | ENSE00003535476 | ENSE00003508198 |
| 7 | ENSE00001902992 | ENSE00001902992 |  | ENSE00003527648 | ENSE00003535476 |
| 8 |  |  |  |  | ENSE00001902501 |
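
Python sets make it easy to ask which exons are shared by several transcripts. The sketch below hard-codes the exon IDs of three of the five transcripts from the table above (a subset chosen only to keep the example short); with the live data the same dictionary could be built from `exons_per_transcript`.

```python
# Exon IDs copied from the table above for three of the five transcripts
exons_by_transcript = {
    "ENST00000336654": [
        "ENSE00001890030", "ENSE00001342208", "ENSE00003501098", "ENSE00003471288",
        "ENSE00003508198", "ENSE00003535476", "ENSE00001902992",
    ],
    "ENST00000375871": [
        "ENSE00003553919", "ENSE00003694606", "ENSE00003501098", "ENSE00003471288",
        "ENSE00003658341", "ENSE00001701771",
    ],
    "ENST00000325303": [
        "ENSE00001468688", "ENSE00003694606", "ENSE00003501098", "ENSE00003471288",
        "ENSE00003532544", "ENSE00003508198", "ENSE00003535476", "ENSE00001902501",
    ],
}

# Exons present in every one of the selected transcripts
shared = set.intersection(*(set(ids) for ids in exons_by_transcript.values()))
print(sorted(shared))
```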


## Visualization of exon usage in Ensembl
This is an example of the information found on a genome browser. In this case, exon usage for different transcripts is shown in an intuitive way. 

<div style='overflow-y: scroll; max-height: 400px'><img align='left' width=100% src="media/Human_98796627587974919.png"></div>

## Biopython
### Writing exons in a FASTA file
Biologists love text files! One of the most frequently used is the FASTA format, a very simple format to store nucleotide or amino acid sequences. A file can contain a single or multiple sequence records. Each sequence record is composed of two parts:
* **Header**: a single line identified by its first character `>`
* **Sequence**: all the following lines, up to the next header
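
The format is simple enough to be parsed by hand. The function below is a minimal sketch, not a replacement for a real parser such as BioPython's `SeqIO` used later: `parse_fasta` is a hypothetical helper, and the two records in the example are made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # a sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Two made-up records; the first sequence is split over two lines
example = """>exon_A hypothetical record
GCTCTCCTCC
ATCAGTAC
>exon_B hypothetical record
ATGGACCAG
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```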

In [13]:
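with open('exons.fasta', 'w') as of:
    # exons_per_transcript[0] starts with the transcript ID followed by its exon IDs,
    # so the first record written to the file is the transcript itself
    for exon_id in exons_per_transcript[0]:
        exon_seq_url = base_url + 'sequence/id/' + exon_id + '?content-type=text/x-fasta'
        exon_fasta = urllib.request.urlopen(exon_seq_url).read().decode('utf-8')
        of.write(exon_fasta + '\n')
    # show the beginning of the last record fetched
    print(exon_fasta[:250] + '...')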

>ENSE00001902992.1 chromosome:GRCh38:9:87966446:87967659:-1
GCTCTCCTCCATCAGTACTTCTTCACAGCTCCCCTGCCTGCCCATCCATCTGAGCTGCCG
ATTCCTCAGCGTCTAGGGGGACCTGCCCCCAAGGCCCATCCAGGGCCCCCCCACATCCAT
GACTTCCACGTGGACCGGCCTCTTGAGGAGTCGCTGTTGAACCCAGAGCTGATTCGGCCC
TTCATCC...


In order to comply with biologists' love for text files, computer scientists came up with different solutions.

One of them is __[BioPython](https://biopython.org/)__, a Python library that lets you manage all kinds of biological file formats.

It also has some useful utilities to:
* align sequences
* launch BLAST remotely
* handle protein structures


In [14]:
# parse FASTA file with biopython module SeqIO
seq_records = list(SeqIO.parse("exons.fasta", "fasta"))
# record 0 is the transcript itself, so index 1 selects its first exon
an_exon = seq_records[1]
# print() uses the __str__ method of SeqRecord objects
print(an_exon)

ID: ENSE00001890030.1
Name: ENSE00001890030.1
Description: ENSE00001890030.1 chromosome:GRCh38:9:87974372:87974589:-1
Number of features: 0
Seq('GTCGCCTACAGGAACCGCCCCAGTGGTCAGCTGCCGCGCTGTTGCTAGGCAACA...GAG', SingleLetterAlphabet())


In [15]:
# print() uses the __str__ method of Seq objects
print(an_exon.seq)

GTCGCCTACAGGAACCGCCCCAGTGGTCAGCTGCCGCGCTGTTGCTAGGCAACAGCGTGCGAGCTCAGATCAGCGTGGGGTGGAGGAGAAGTGGAGTTTGGAAGTTCAGGGGCACAGGGGCACAGGCCCACGACTGCAGCGGGATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACGGCATCGTCTTCAAGGCCAAGCACGTGGAG


In [16]:
# transcription is the conversion of DNA to messenger RNA
print(an_exon.seq.transcribe())

GUCGCCUACAGGAACCGCCCCAGUGGUCAGCUGCCGCGCUGUUGCUAGGCAACAGCGUGCGAGCUCAGAUCAGCGUGGGGUGGAGGAGAAGUGGAGUUUGGAAGUUCAGGGGCACAGGGGCACAGGCCCACGACUGCAGCGGGAUGGACCAGUACUGCAUCCUGGGCCGCAUCGGGGAGGGCGCCCACGGCAUCGUCUUCAAGGCCAAGCACGUGGAG


In [17]:
# translation is the conversion of messenger RNA to an amino acid sequence ('*' marks a stop codon)
print(an_exon.seq.transcribe().translate())

VAYRNRPSGQLPRCC*ATACELRSAWGGGEVEFGSSGAQGHRPTTAAGWTSTASWAASGRAPTASSSRPSTW




In [18]:
def print_multiline_alignment(alignment, n=80):
    """Print two aligned sequences in blocks of n columns, marking identical positions with '|'."""
    # split both aligned sequences into chunks of n characters
    seq1 = [alignment[0][i:i+n] for i in range(0, len(alignment[0]), n)]
    seq2 = [alignment[1][i:i+n] for i in range(0, len(alignment[1]), n)]
    for s1, s2 in zip(seq1, seq2):
        print(s1)
        print(''.join(['|' if x == y else ' ' for x, y in zip(s1, s2)]))
        print(s2)
        print()

## Pairwise alignment
Biopython has functions to perform pairwise alignments. Their names follow a common pattern, for example:
* globalxx 
* globalms
* localds...

where `global` and `local` indicate the type of alignment and the two trailing letters select the match and gap-penalty parameters, in that order

The **match** parameters are:
* `x`     No parameters. Identical characters have score of 1, otherwise 0.
* `m`     You provide a match score for identical characters and a mismatch score for all the others.
* `d`     A dictionary returns the score of any pair of characters.

The **gap penalty** parameters are:
* `x`     No gap penalties.
* `s`     Same open and extend gap penalties for both sequences.
* `d`     The sequences have different open and extend gap penalties.
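
To make the match and gap-penalty parameters concrete, here is a minimal pure-Python sketch of global alignment scoring (Needleman-Wunsch with a single linear gap penalty). It is not Biopython's implementation: the function name `global_alignment_score` is hypothetical, and its defaults are chosen to mimic the `xx` scheme (match 1, mismatch 0, no gap penalty).

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=0):
    """Best global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score of aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# 'xx'-like scoring: only identical characters count
print(global_alignment_score("ACGT", "AGT"))
# 'ms'-like scoring: explicit match, mismatch and gap scores
print(global_alignment_score("ACGT", "AGT", match=2, mismatch=-1, gap=-2))
```

Changing the three scores changes which alignment is considered the best one, which is why the choice of parameters matters as much as the choice between global and local alignment.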



In [19]:
# from Bio import pairwise2
alignment = pairwise2.align.globalxx(cds[0]['seq'], an_exon.seq)[0]
# custom print function
print_multiline_alignment(alignment)

ATGGACCAGTACTGCATCCTGGGCCGCATCGGGGAGGGCGCCCACGGCATCGTCTTCAAGGCCAAGCACGTGGAGCCGAG
                                                                                
--------------------------------------------------------------------------------

GGTGGGCTGGCAGTGTCTGCCTTCTATCCTGCAGACTGGCGAGATAGTTGCCCTCAAGAAGGTGGCCCTAAGGCGGTTGG
                                                                                
--------------------------------------------------------------------------------

AGGACGGCTTCCCTAACCAGGCCCTGCGGGAGATTAAGGCTCTGCAGGAGATGGAGGACAATCAGTATGTGGTACAACTG
                                                        |    ||        |     |  
--------------------------------------------------------G----TC--------G-----C--

AAGGCTGTGTTCCCACACGGTGGAGGCTTTGTGCTGGCCTTTGAGTTCATGCTGTCGGATCTGGCCGAGGTGGTGCGCCA
    |     |   |||    |||                   |   |   | | |    |   || |    ||| |   
----C-----T---ACA----GGA-------------------A---C---C-G-C----C---CC-A----GTG-G---

TGCCCAGAGGCCACTAGCCCAGGC

## Aligning two sequences

Using the `pairwise2` module, correctly align the **cds** `ENST00000336654` and the **exon** `ENSE00001890030.1`

Retrieve the sequence of the two entities using the endpoint:<br>__[https://rest.ensembl.org/documentation/info/sequence_id](https://rest.ensembl.org/documentation/info/sequence_id)__

Find the correct way to align the two sequences. Documentation of the pairwise2 module is here: <br>__[http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html](http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html)__

* Hint 1: function to print alignments __[https://github.com/marnec/practical](https://github.com/marnec/practical)__
* Hint 2: local alignment is the way to go
* Hint 3: You probably want a substitution matrix (http://biopython.org/DIST/docs/api/Bio.SubsMat.MatrixInfo-module.html)
* Hint 4: `print(MatrixInfo.available_matrices)`
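
To see why a local alignment suits a short fragment contained in a longer sequence, the sketch below computes a local alignment score in pure Python (Smith-Waterman with a linear gap penalty). It is an illustration only, not the `pairwise2` solution to the exercise: `local_alignment_score`, its default scores and the toy sequences are all assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and any substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # scores never drop below 0: a local alignment can restart anywhere
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Toy data: the fragment is embedded in the middle of a longer sequence
long_sequence = "TTTTATGGACCAGTTTT"
fragment = "ATGGACCAG"
print(local_alignment_score(long_sequence, fragment))
```

Unlike a global alignment, the flanking regions of the longer sequence do not need to be paired with gaps, so they do not lower the score.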