In [1]:
import jupman
jupman.init()


# Practical 11

In this practical we will see some further functionalities of Biopython.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical11.pdf)

## Biopython

From the Biopython tutorial: The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology. The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.

In this practical we will see some features of Biopython but refer to [biopython documentation](http://biopython.org/wiki/Documentation) to discover all its features, recipes etc.

These notes are largely based on what is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf).



## BLAST

[BLAST (Basic Local Alignment Search Tool)](https://www.ncbi.nlm.nih.gov/pubmed/2231712) is a well known tool to find similarities between biological sequences. It compares DNA or protein sequences and calculates the statistical significance of the matches found.

The typical interaction with BLAST sees the user submit some sequences to the tool to get an alignment and then the hits are parsed to obtain information on the matches. Both these steps can be performed from within Biopython. Although it is possible to interact directly with a local installation of BLAST, in this practical we will work with the tool made available by NCBI (available [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi)). 

### The function qblast

The online version of blast can be accessed through the ```Bio.Blast.NCBIWWW.qblast()``` function.

Its basic syntax is the following (to use it we need to make the import ```from Bio.Blast import NCBIWWW```):

```
result_handle = NCBIWWW.qblast(blast_program, database, query_str)
```
where ```blast_program``` is the program used to perform the alignment. The options are **blastn, blastp, blastx, tblastn or tblastx**. ```database``` is the database to search against and ```query_str``` is a string containing the query to search for in the database. The query can be a sequence, a FASTA file entry or an identifier such as a GI number (NCBI's sequence identification number). Among others, some optional parameters are the output format (```format_type```, which by default is "XML", the most stable output format, although results can also be stored as text with "Text") and ```expect``` (the e-value threshold).

Some databases to search against are reported below:

![](img/pract11/blast_dbs.png)

The query string can be obtained in different ways, for example it is possible to load sequences from a fasta file with:

```
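from Bio.Blast import NCBIWWW
# Read the whole FASTA file into a string; the file is closed afterwards
with open("myfile.fasta") as fasta_file:
    fasta_string = fasta_file.read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)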
```

or we can give a SeqRecord:

```
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("myfile.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
```

It is also possible to restrict the search with the optional parameter ```entrez_query```; for example, we can limit the search to a specific organism with: ```entrez_query='"Malus Domestica" [Organism]'```.
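The parameters described above can be collected and checked in one place before the call. The following is only a sketch: the helper `build_qblast_args` and its `organism` argument are hypothetical (they are not part of Biopython), while `program`, `database`, `sequence`, `expect`, `format_type` and `entrez_query` are keyword names accepted by `NCBIWWW.qblast`. No request is sent to NCBI.

```python
# Hypothetical helper (not part of Biopython): collects and checks the
# arguments for NCBIWWW.qblast without contacting NCBI.
VALID_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}

def build_qblast_args(program, database, sequence,
                      expect=10.0, organism=None, format_type="XML"):
    if program not in VALID_PROGRAMS:
        raise ValueError("unknown BLAST program: {}".format(program))
    args = {
        "program": program,
        "database": database,
        "sequence": sequence,
        "expect": expect,
        "format_type": format_type,
    }
    if organism is not None:
        # Same syntax as the entrez_query example in the text
        args["entrez_query"] = '"{}" [Organism]'.format(organism)
    return args

args = build_qblast_args("blastn", "nt", "ACGTACGTACGT",
                         expect=0.001, organism="Malus Domestica")
print(args["entrez_query"])
```

With Biopython installed, the dictionary could then be passed on with `NCBIWWW.qblast(**args)`.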


### Parsing qblast output

Once the qblast call returns, it gives the results in a handle object ```result_handle``` that we can parse, or write to disk to avoid having to rerun the query later. If we expect a single result (one query sequence) we can use the method **read**; otherwise (if we have multiple query sequences) we should use the method **parse**:

```
blast_record = NCBIXML.read(result_handle)
```

or 

```
blast_records = NCBIXML.parse(result_handle)
```
Note that to use these methods we first need to import the ```NCBIXML``` module with ```from Bio.Blast import NCBIXML```.

These methods are analogous to what we have seen in the case of SeqIO and AlignIO. In the case of multiple entries we can loop through them with:

```
blast_records = NCBIXML.parse(result_handle)
for record in blast_records:
    # do something with each record...
    pass
```
or we can retrieve one record at a time with ```record = next(blast_records)```.
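Since **parse** hands out the records one at a time, they can be stepped through only once; to go over them again they have to be stored, for example in a list. The sketch below mimics this behaviour with a plain Python generator. `fake_parse` and its string "records" are stand-ins invented for the illustration, they are not Biopython objects.

```python
def fake_parse(names):
    """Stand-in for NCBIXML.parse: yields one 'record' at a time."""
    for name in names:
        yield "record for " + name

records = fake_parse(["query1", "query2", "query3"])

first = next(records)       # take a single record
rest = list(records)        # consume and keep the remaining ones
leftover = list(records)    # nothing is left: the iterator is exhausted

print(first)
print(rest)
print(leftover)
```

The same pattern, `blast_records = list(NCBIXML.parse(result_handle))`, keeps all the records in memory, which is fine for small result sets.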


### Saving results to file

To save the results present in the result_handle we can simply write them to file. In case we have only one entry we can read it and write it to file:

```
# The with statement closes the file even if writing fails
with open("my_blast_result.xml", "w") as out_f:
    out_f.write(result_handle.read())
result_handle.close()
```
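The handle contains the raw XML text for all the query sequences submitted, so the code above also works when there is more than one entry. The saved file can later be reopened and its entries looped through with **parse**:

```
from Bio.Blast import NCBIXML
with open("my_blast_result.xml") as in_f:
    for entry in NCBIXML.parse(in_f):
        print(entry.query)
```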
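Because a query can take minutes, it is convenient to run it only when no saved copy exists. The following is a minimal sketch of this idea using only the standard library; `cached_result` and `fake_query` are hypothetical names, and `fake_query` merely stands in for a slow call such as `NCBIWWW.qblast(...).read()`.

```python
import os
import tempfile

def cached_result(path, run_query):
    """Return the text stored in `path`, calling run_query() only if needed."""
    if not os.path.exists(path):
        with open(path, "w") as out_f:
            out_f.write(run_query())
    with open(path) as in_f:
        return in_f.read()

calls = []

def fake_query():
    # Placeholder for a slow remote query; records how often it is run
    calls.append(1)
    return "<BlastOutput></BlastOutput>"

path = os.path.join(tempfile.mkdtemp(), "my_blast_result.xml")
first = cached_result(path, fake_query)
second = cached_result(path, fake_query)   # served from disk
print("queries actually run:", len(calls))
```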

**Example:**

Let's BLAST the galactosidase alpha (GI number: 2717) against the ```nt``` database on NCBI and save the results to file. (**Note that it can take several seconds/minutes to run!**).

In [8]:
from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast("blastn", "nt", "2717")


with open("file_samples/blast_res.xml","w") as out_f:
    out_f.write(result_handle.read())

result_handle.close()

### Open a blast .XML file

A BLAST output file can be read by opening the file to get a handle and then parsing it with the **parse** method seen above:

```
from Bio.Blast import NCBIXML
result_handle = open("my_blast.xml")
blast_records = NCBIXML.parse(result_handle)
```
This returns an iterator over the BLAST records stored in the file.
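To get a feeling for what the parser works on, it can help to look at the XML directly. The sketch below reads a tiny hand-written fragment that imitates the layout of a BLAST XML file (it is not real BLAST output, the hit names and numbers are invented, and real files contain many more elements) using only Python's standard `xml.etree.ElementTree`. In practice `NCBIXML.parse` does this work for us and returns record objects.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the BLAST XML layout (invented values)
xml_text = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical hit A</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hypothetical hit B</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

root = ET.fromstring(xml_text)
hits = []
for hit in root.iter("Hit"):
    title = hit.findtext("Hit_def")
    for hsp in hit.iter("Hsp"):
        # One (title, e-value) pair per high scoring pair
        hits.append((title, float(hsp.findtext("Hsp_evalue"))))

for title, evalue in hits:
    print(title, evalue)
```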


### The BLAST record class

The ```Bio.Blast.Record.Blast``` class holds the results of the alignment. In particular, it contains the following two lists:

1. *Descriptions* : a list of Description objects. Each ```Description``` holds the following information:

    - ```Description.title``` : a string with the title of the hit;
    - ```Description.score``` : a float with the score of the alignment;
    - ```Description.num_alignments``` : an int with the number of alignments with the same subject;
    - ```Description.e``` : a float with the e-value of the alignment.
    
2. *Alignments* : a list of Alignment objects. Each ```Alignment``` holds the following information:

    - ```Alignment.title``` : a string with the title of the hit (identical to ```Description.title```);
    - ```Alignment.length``` : an int with the length of the alignment;
    - ```Alignment.hsps``` : a list of HSP objects (High Scoring Pair). Each ```HSP``` has the following info:
        
        - ```HSP.score``` : the BLAST score of the hit
        - ```HSP.bits``` : the bit score of the hit (a bit score of x means that, on average, 2^x random pairs have to be compared to find such a good hit by chance)
        - ```HSP.expect``` : the e-value of the hit
        - ```HSP.num_alignments``` : the number of alignments for the same subject
        - ```HSP.identities``` : the number of identities between query and subject
        - ```HSP.positives``` : the number of positions that are identical bases/amino acids or that have similar chemical properties
        - ```HSP.gaps``` : the number of gaps between query and subject
        - ```HSP.strand``` : a **tuple** with (query,subject) strands
        - ```HSP.frame``` : a **tuple** with the frame shifts
        - ```HSP.query/HSP.sbjct``` : query/subject sequence
        - ```HSP.query_start/HSP.sbjct_start``` : query/subject start point
        - ```HSP.match``` : the match sequence (basically "|" for matches and spaces for mismatches)
        - ```HSP.align_length``` : the alignment length.

More information on the BLAST record can be found [here](http://biopython.org/DIST/docs/api/Bio.Blast.Record-module.html).
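The HSP attributes listed above are typically combined to filter and summarise hits. The following sketch shows one way to do it, keeping only the HSPs below an e-value threshold and computing their percent identity. `FakeHSP` is a stand-in created for this example (it only carries three of the attributes of a real HSP), the threshold is arbitrary and the numbers are illustrative.

```python
from collections import namedtuple

# Minimal stand-in for an HSP, carrying only the attributes used below
FakeHSP = namedtuple("FakeHSP", ["expect", "identities", "align_length"])

def percent_identity(hsp):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * hsp.identities / hsp.align_length

def significant(hsps, max_expect=1e-10):
    """Keep only the HSPs whose e-value is at most max_expect."""
    return [h for h in hsps if h.expect <= max_expect]

hsps = [
    FakeHSP(expect=1.99393e-57, identities=131, align_length=133),
    FakeHSP(expect=0.5, identities=20, align_length=40),   # invented weak hit
]

kept = significant(hsps)
for h in kept:
    print("E-val: {} identity: {:.1f}%".format(h.expect, percent_identity(h)))
```

The two functions only rely on attribute names, so they should work unchanged on the objects found in ```Alignment.hsps```.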

**Example:**

Let's BLAST the serum albumin sequence (gi number [23307792](https://www.ncbi.nlm.nih.gov/nuccore/AF542069.1)) against human sequences and print all the information returned by BLAST (warning: it might take a while to run!).

In [54]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

result_handle = NCBIWWW.qblast("blastn", "nr", "23307792", 
                               entrez_query='"Homo Sapiens" [Organism]'
                               )



for res in NCBIXML.parse(result_handle):
    for d in res.descriptions:
        
        print("TITLE:{}\nSCORE:{}\nN.ALIGN:{}\nE-VAL:{}".format(
            d.title,d.score, d.num_alignments,d.e))
        
    for a in res.alignments:
        print("Align Title:{}\nAlign Len: {}".format(a.title, a.length))

        
    
        for h in a.hsps:
            s = h.score
            b = h.bits
            e = h.expect
            n = h.num_alignments
            i = h.identities
            p = h.positives
            g = h.gaps
            st = h.strand
            f = h.frame
            q = h.query
            sb = h.sbjct
            qs = h.query_start
            ss = h.sbjct_start
            qe = h.query_end
            se = h.sbjct_end
            m = h.match 
            al = h.align_length
            
            print("Score: {} Bits: {} E-val: {}".format(s,b,e))
            print("N.aligns:{} Ident:{} Pos.:{} Gaps:{} Align len:{}".format(
                n,i,p,g,al))
            print("Strand: {} Frame: {}".format(st,f))
            print("Query:", q, " start:", qs, " end:", qe)
            print("Match:",m)
            print("Subjc:",sb, " start:", ss, " end:", se)
            

result_handle.close()

TITLE:gi|23307792|gb|AF542069.1| Homo sapiens serum albumin (HSA) mRNA, complete cds
SCORE:4352.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|1046552723|ref|NM_000477.6| Homo sapiens albumin (ALB), mRNA
SCORE:4304.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|28591|emb|V00495.1| H.sapiens mRNA for serum albumin
SCORE:4252.0
N.ALIGN:2
E-VAL:0.0
TITLE:gi|7770116|gb|AF119840.1|AF119840 Homo sapiens PRO0903 mRNA, complete cds
SCORE:4062.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|25058738|gb|BC039235.1| Homo sapiens albumin, mRNA (cDNA clone IMAGE:4768004), containing frame-shift errors
SCORE:4056.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|23243417|gb|BC036003.1| Homo sapiens albumin, mRNA (cDNA clone MGC:32850 IMAGE:4724105), complete cds
SCORE:4052.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|158258946|dbj|AK292755.1| Homo sapiens cDNA FLJ78413 complete cds, highly similar to Homo sapiens albumin, mRNA
SCORE:4036.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|28589|emb|V00494.1| Human messenger RNA for serum albumin (HSA)
SCORE:4032.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|2170645

Match: ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Subjc: GTTTTTGTATGAATATGCAAGAAGGCATCCTGATTACTCTGTCGTGCTGCTGCTGAGACTTGCCAAGACATATGAAACCACTCTAGAGAAGTGCTGTGCCGCTGCAGATCCTCATGAATGCTATGCCAAAGTG  start: 12731  end: 12863
Score: 256.0 Bits: 232.118 E-val: 1.99393e-57
N.aligns: None Idents: 131 Positives: 131 Gaps: 0 Align len: 133
Strand: (None, None) Frame: (1, 1)
Query: ATACTTATATGAAATTGCCAGAAGACATCCTTACTTCTATGCCCCGGAACTCCTTTTCTTTGCTAAAAGGTATAAAGCTGCTTTTACAGAATGTTGCCAAGCTGCTGATAAGGCTGCCTGCCTGTTGCCAAAG  start: 516  end: 648
Match: |||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||
Subjc: ATACTTATATGAAATTGCCAGAAGACATCCTTACTTTTATGCCCCGGAACTCCTTTTCTTTGCTAAAAGGTATAAAGCTGCTTTTACAGAATGTTGCCAAGCTGCTGATAAAGCTGCCTGCCTGTTGCCAAAG  start: 7051  end: 7183
Score: 220.0 Bits: 199.657 E-val: 1.17852e-47
N.aligns: None Idents: 115 Posi

## Getting data from NCBI

Biopython provides a module (```Bio.Entrez```) to pull data off resources like PubMed or GenBank, and other repositories programmatically through [Entrez](http://www.ncbi.nlm.nih.gov/Entrez/). 

There are some limitations (mostly taken care of directly by Biopython) that you should be aware of when you use NCBI's services. Check [here](http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen) if you want to learn what these limitations are.
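One of these limitations is on the request rate: without an API key, NCBI asks for no more than three requests per second. Biopython enforces this for us, but the idea is easy to sketch with the standard library. The `Throttle` class below is a hypothetical illustration, not Biopython's implementation.

```python
import time

class Throttle:
    """Make sure that at most `rate` calls per second go through."""

    def __init__(self, rate):
        self.min_interval = 1.0 / rate
        self.last_call = None

    def wait(self):
        # Sleep just long enough to respect the minimum interval
        if self.last_call is not None:
            elapsed = time.monotonic() - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

throttle = Throttle(rate=3)     # three requests per second
start = time.monotonic()
for _ in range(2):
    throttle.wait()             # a request to NCBI would go here
total = time.monotonic() - start
print("2 calls took {:.2f} s".format(total))
```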

First of all we need to import the Entrez module with ```from Bio import Entrez```. Before we start interacting with Entrez we should also set ```Entrez.email``` (optional, but recommended) so that NCBI can contact us in case of problems.

In particular, the module (complete info on the [Entrez module is here](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)) provides, among others, the following functions:

1. ```res_handle = Entrez.einfo(db)``` returns a summary of the Entrez databases as a results handle. ```db``` is an optional parameter specifying the resource of interest;
2. ```res_handle = Entrez.esearch(db, term)``` returns the identifiers of the entries in ```db``` that match the query ```term```. Optional keyword arguments such as ```retmax``` limit the number of identifiers returned;
3. ```res_handle = Entrez.efetch(db, id, rettype, retmode)``` returns the full record corresponding to the identifier ``id`` from the database ``db`` formatted in ```rettype``` (eg. gb, fasta,... [complete list](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly)) and return mode ```retmode``` (eg. text); 
4. ```res_handle = Entrez.esummary(db, id)``` returns the summary of the entry ```id``` from the database ```db``` as a handle;
5. ```result = Entrez.read(res_handle)``` reads the information on the XML handle ```res_handle``` and stores them in a dictionary, list or string, depending on the case.  
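Behind the scenes, each of these functions sends a request to one of NCBI's E-utilities and hands back the response as a handle. To make this concrete, the sketch below builds (without opening it) the kind of URL that corresponds to a search. It is a simplification: `eutils_url` is a hypothetical helper, and Biopython adds further parameters (such as the e-mail address) and takes care of the actual transfer.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def eutils_url(tool, **params):
    """Build (but do not open) the URL of an E-utilities request."""
    return EUTILS_BASE + tool + ".fcgi?" + urlencode(params)

url = eutils_url("esearch", db="pubmed", term="Malus Domestica", retmax=3)
print(url)
```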


**Example:**

Let's get a list of all available databases in Entrez as a dictionary. Let's then get a summary of the entries in 'sra'.

In [72]:
from Bio import Entrez

Entrez.email = "my_email"
handle = Entrez.einfo()
res = Entrez.read(handle)
print(res)
print("")
print("As a list:")
print(res['DbList'])

res = Entrez.read(Entrez.einfo(db = "sra"))
#uncomment to see all the information captured
#print(res)
#for el in res["DbInfo"].keys():
#    print(el) 
print("")
print("Entries count:", res["DbInfo"]["Count"])
print("LastUpdate:", res["DbInfo"]["LastUpdate"])
print("Description:", res["DbInfo"]["Description"])




{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']}

As a list:
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosy

Note that **einfo** returns a handle that can be read by the **read** function, which produces a dictionary. In the first case this dictionary has a single key, "DbList", holding the list of available databases; when ```db``` is specified the key is "DbInfo".

**Example:**
Fetch the first three entries (retmax = 3) in pubmed that are related to the species "Malus Domestica" and report the title of the publication.

In [106]:
from Bio import Entrez

Entrez.email = "my_email"
handle = Entrez.esearch(db="pubmed", term="Malus Domestica", retmax = 3)
res = Entrez.read(handle)
for el in res.keys():
    print(el , " : ", res[el])

print("")
for ids in res["IdList"]:    
    print("Results for id:", ids)
    handle = Entrez.esummary(db="pubmed",  id = ids)
    # use a separate name so that the search result `res` is not overwritten
    summary = Entrez.read(handle)
    #uncomment to see all info
    #print(summary)
    for r in summary:
        print(r["Title"])
        print(r["Source"])
        print("")


Count  :  4727
RetMax  :  3
IdList  :  ['29078754', '29073205', '29063961']
TranslationSet  :  [{'To': '"malus"[MeSH Terms] OR "malus"[All Fields] OR ("malus"[All Fields] AND "domestica"[All Fields]) OR "malus domestica"[All Fields]', 'From': 'Malus Domestica'}]
QueryTranslation  :  "malus"[MeSH Terms] OR "malus"[All Fields] OR ("malus"[All Fields] AND "domestica"[All Fields]) OR "malus domestica"[All Fields]
RetStart  :  0
TranslationStack  :  [{'Term': '"malus"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '4086', 'Explode': 'Y'}, {'Term': '"malus"[All Fields]', 'Field': 'All Fields', 'Count': '4734', 'Explode': 'N'}, 'OR', {'Term': '"malus"[All Fields]', 'Field': 'All Fields', 'Count': '4734', 'Explode': 'N'}, {'Term': '"domestica"[All Fields]', 'Field': 'All Fields', 'Count': '5211', 'Explode': 'N'}, 'AND', 'GROUP', 'OR', {'Term': '"malus domestica"[All Fields]', 'Field': 'All Fields', 'Count': '630', 'Explode': 'N'}, 'OR', 'GROUP']

Results for id: 29078754
Comprehensive analysis 

**Example:**
Retrieve genbank formatted information of the Malus x domestica MYB domain class transcription factor (MYB1) mRNA complete cds (nucleotide database id:HM122614.1). Parse it as a SeqRecord, printing only the sequence (remember previous practical's SeqIO).

In [114]:
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "my_email"
handle = Entrez.efetch(db="nucleotide", id = "HM122614.1", rettype = "gb", retmode="text")
my_seq = SeqIO.read(handle, format = "genbank")
handle.close()  # SeqIO.read has already consumed the handle
print(my_seq)
print("")
print("SEQUENCE:")
print(my_seq.seq)


ID: HM122614.1
Name: HM122614
Description: Malus x domestica MYB domain class transcription factor (MYB1) mRNA, complete cds
Number of features: 3
/source=Malus domestica (apple)
/data_file_division=HTC
/organism=Malus domestica
/sequence_version=1
/references=[Reference(title='Transcription Factors in Apple', ...), Reference(title='The FruiTFul database; full length cDNAs of apple transcription factors', ...), Reference(title='Direct Submission', ...)]
/accessions=['HM122614']
/keywords=['HTC']
/date=15-AUG-2010
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'fabids', 'Rosales', 'Rosaceae', 'Maloideae', 'Maleae', 'Malus']
/topology=linear
/molecule_type=mRNA
Seq('TTTGGTCTGCTGGGTAGGTACTCATAAAAACAAACCAACCGAAGCCTCCGAACC...AAA', IUPACAmbiguousDNA())

SEQUENCE:
TTTGGTCTGCTGGGTAGGTACTCATAAAAACAAACCAACCGAAGCCTCCGAACCGACCACCAATGACGGCCCCAAACGGCGCCGTCCCCAAACAAGCC

## Getting data from ExPASy

Similarly to what we did with Entrez, it is possible to pull data out of ExPASy (https://www.expasy.org/) through the Bio.ExPASy module. We will not cover this in detail. All information can be found here: [Bio.ExPASy module](http://biopython.org/DIST/docs/api/toc-Bio.ExPASy-module.html).

As an example, we will see how to download a couple of SwissProt entries (the human and mouse P53 protein).

In [118]:
from Bio import ExPASy
from Bio import SwissProt

accessions = ["P04637", "P02340"] #the ids of the human and mouse proteins

for accession in accessions:
    handle = ExPASy.get_sprot_raw(accession)
    record = SwissProt.read(handle)
    print(record.entry_name)
    print(",".join(record.accessions))
    print(record.keywords)
    print(repr(record.organism))
    print(record.sequence[:30] + "...")



P53_HUMAN
P04637,Q15086,Q15087,Q15088,Q16535,Q16807,Q16808,Q16809,Q16810,Q16811,Q16848,Q2XN98,Q3LRW1,Q3LRW2,Q3LRW3,Q3LRW4,Q3LRW5,Q86UG1,Q8J016,Q99659,Q9BTM4,Q9HAQ8,Q9NP68,Q9NPJ2,Q9NZD0,Q9UBI2,Q9UQ61
['3D-structure', 'Acetylation', 'Activator', 'Alternative promoter usage', 'Alternative splicing', 'Apoptosis', 'Biological rhythms', 'Cell cycle', 'Complete proteome', 'Cytoplasm', 'Disease mutation', 'DNA-binding', 'Endoplasmic reticulum', 'Glycoprotein', 'Host-virus interaction', 'Isopeptide bond', 'Li-Fraumeni syndrome', 'Metal-binding', 'Methylation', 'Mitochondrion', 'Necrosis', 'Nucleus', 'Phosphoprotein', 'Polymorphism', 'Reference proteome', 'Repressor', 'Transcription', 'Transcription regulation', 'Tumor suppressor', 'Ubl conjugation', 'Zinc']
'Homo sapiens (Human).'
MEEPQSDPSVEPPLSQETFSDLWKLLPENN...
P53_MOUSE
P02340,Q9QUP3
['3D-structure', 'Acetylation', 'Activator', 'Apoptosis', 'Biological rhythms', 'Cell cycle', 'Complete proteome', 'Cytoplasm', 'Disease mutation', 'DNA-bindin

You can find [here](http://biopython.org/DIST/docs/api/toc-Bio.SeqRecord-module.html) all the details on how to deal with the SwissProt records.

## 3D structure and PDB

Biopython can also deal with data coming from the [Protein Data Bank database](https://www.rcsb.org/pdb/home/home.do). It stores structural information on the 3D shapes of proteins, nucleic acids, and complex assemblies. The database currently contains more than 134,000 total structures.

To deal with this kind of data we first need to import Biopython's module ```Bio.PDB``` with ```from Bio.PDB import *```. All the information on this module can be found [here](http://biopython.org/DIST/docs/api/toc-Bio.PDB-module.html). 

It is possible to download a structure directly from PDB by using a ```PDBList``` object that features a function called ```download_pdb_files``` having the basic syntax:

```
PDBList.download_pdb_files(pdb_codes, pdir, file_format)
```
This call downloads the structures whose four-character PDB identifiers are listed in ```pdb_codes```, in the format ```file_format```, and stores them in the directory ```pdir```. The safest ```file_format``` to use is "mmCif". The function does not download a structure more than once: if a file is already present in the specified directory, the message **Structure exists** is displayed. 

**Example:**

Let's programmatically download two different structures of the DNA polymerase: [3C2K](https://www.rcsb.org/pdb/explore/explore.do?structureId=3C2K) and [3C2L](https://www.rcsb.org/pdb/explore/explore.do?structureId=3C2L).

In [146]:
from Bio.PDB import *

pdbl = PDBList()
structures = ["3C2K", "3C2L"]
el = pdbl.download_pdb_files(structures, file_format = "mmCif", pdir = "file_samples/")


Structure exists: 'file_samples/3c2k.cif' 
Structure exists: 'file_samples/3c2l.cif' 


Macromolecular Crystallographic Information Files (mmCif, with extension .cif) consist of a paired collection of names (starting with "\_") and values. They also contain a description of the 3D placement of every crystallized atom of the structure. A detailed description of the format can be found [here](http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html).  
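The name/value pairing can be illustrated with a few lines of plain Python. The fragment below is hand-written and its values are invented; the reader function is a deliberately naive sketch that only understands single-line items, whereas real mmCif files rely heavily on loops and multi-line values. For real files Biopython offers `MMCIF2Dict` in `Bio.PDB.MMCIF2Dict`.

```python
import shlex

# Hand-written fragment in mmCif style (invented values)
cif_text = """\
data_DEMO
_entry.id            DEMO
_struct.title        'An invented title, for illustration only'
_cell.length_a       54.3
"""

def read_simple_items(text):
    """Collect single-line `_name value` pairs into a dictionary.

    Loops and multi-line values are ignored in this sketch.
    """
    items = {}
    for line in text.splitlines():
        if line.startswith("_"):
            parts = shlex.split(line)   # keeps quoted values together
            if len(parts) == 2:
                items[parts[0]] = parts[1]
    return items

items = read_simple_items(cif_text)
for name, value in items.items():
    print(name, "->", value)
```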

Once the structures are available locally, one can start parsing them to do something useful.
Parsing can be done through the ```MMCIFParser``` object:

```
parser = MMCIFParser()
```

The ```parser``` object has several methods to deal with structures. One of these is ```get_structure```, which creates a ```PDB.Structure.Structure``` object with all the data present in the structure file.

The basic syntax is:
```
structure = parser.get_structure(pdb_code, filename)
```

where ```pdb_code``` is the PDB code of the structure contained in the file ```filename```. The method returns a ```PDB.Structure.Structure``` that contains one or more models.  


A ```Structure``` consists of one or more ```Model``` objects (different 3D conformations of the very same structure). Each ```Model``` is a collection of ```Chain``` objects, each ```Chain``` is a collection of ```Residue``` objects and each ```Residue``` is a collection of ```Atom``` objects. Look in the documentation to get the information on each of these classes. This is the diagram of a structure:

![](img/pract11/structure1.png)
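The same hierarchy can be mimicked with nested dictionaries, which makes the containment relations explicit. This is only a toy model with invented content, not how Biopython stores the data internally, although Biopython objects can be indexed in a similar way (for example ```structure[0]["A"]``` gives chain A of the first model).

```python
# structure -> model -> chain -> residue -> atoms, as nested dictionaries
structure = {
    0: {                                  # model id
        "A": {                            # chain id
            1: {"name": "ALA", "atoms": ["N", "CA", "C", "O", "CB"]},
            2: {"name": "GLY", "atoms": ["N", "CA", "C", "O"]},
        },
        "B": {
            1: {"name": "SER", "atoms": ["N", "CA", "C", "O", "CB", "OG"]},
        },
    },
}

for model_id, model in structure.items():
    for chain_id, chain in model.items():
        n_atoms = sum(len(res["atoms"]) for res in chain.values())
        print("model", model_id, "chain", chain_id,
              "residues:", len(chain), "atoms:", n_atoms)

# Walk the whole hierarchy to count every atom
total_atoms = sum(len(res["atoms"])
                  for model in structure.values()
                  for chain in model.values()
                  for res in chain.values())
print("total atoms:", total_atoms)
```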

Given a ```Structure``` we can obtain iterators to **models**, **chains**, **residues** or **atoms** with:

```
Structure.get_models() 
Structure.get_chains()
Structure.get_residues()
Structure.get_atoms()
```

For each model obtained with ```structure.get_models()``` we can loop through its chains, residues and atoms. For each atom we can get the 3D coordinates with ```Atom.get_coord()```.
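Once the coordinates are available, a typical first calculation is the distance between two atoms. The sketch below does it with the standard library on two coordinate triples written by hand (rounded values for the N and CA atoms of an alanine residue); with Biopython the same distance can be obtained directly by subtracting two ```Atom``` objects.

```python
import math

# Coordinates (in angstroms) of the N and CA atoms of one alanine residue
n_coord = (36.611, 47.241, -9.623)
ca_coord = (37.365, 46.276, -8.823)

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = distance(n_coord, ca_coord)
print("N-CA distance: {:.2f}".format(d))
```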

**Example:**
Let's loop through all the models, chains, residues and atoms of the DNA polymerase structure 3C2L and print the 3D coordinates of each atom.

In [182]:

from Bio.PDB import *
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

parser = MMCIFParser(QUIET=True) #To disable warnings


filename = "file_samples/3c2l.cif"
structure = parser.get_structure("3c2l", filename)

print(structure.header)

for model in structure.get_models():
    print("model", model, "has {} chains".format(len(model)))
    
    for chain in model:
        print(" - chain ", chain, "has {} residues".format(len(chain)))
        
        for residue in chain:
            print ("      - residue", residue.get_resname(), "has {} atoms".format(len(residue)))
            
            for atom in residue:
                x,y,z = atom.get_coord()
                print("        - atom:", atom.get_name(), "x: {} y:{} z:{}".format(x,y,z))
                
                

Structure exists: 'file_samples/4fyw.cif' 
{}
model <Model id=0> has 4 chains
 - chain  <Chain id=A> has 519 residues
      - residue ALA has 5 atoms
        - atom: N x: 36.611000061035156 y:47.24100112915039 z:-9.623000144958496
        - atom: CA x: 37.3650016784668 y:46.2760009765625 z:-8.822999954223633
        - atom: C x: 36.3849983215332 y:45.34400177001953 z:-8.12600040435791
        - atom: O x: 35.191001892089844 y:45.606998443603516 z:-8.118000030517578
        - atom: CB x: 38.22600173950195 y:47.00600051879883 z:-7.800000190734863
      - residue ASN has 8 atoms
        - atom: N x: 36.87799835205078 y:44.26599884033203 z:-7.525000095367432
        - atom: CA x: 35.9900016784668 y:43.37799835205078 z:-6.804999828338623
        - atom: C x: 35.4739990234375 y:44.125 z:-5.571000099182129
        - atom: O x: 36.108001708984375 y:45.08700180053711 z:-5.113999843597412
        - atom: CB x: 36.70100021362305 y:42.066001892089844 z:-6.440000057220459
        - atom: CG x: 37.6

        - atom: O x: 53.34600067138672 y:37.944000244140625 z:23.083999633789062
        - atom: CB x: 54.082000732421875 y:40.33700180053711 z:21.84600067138672
        - atom: CG x: 54.46099853515625 y:41.45600128173828 z:20.926000595092773
        - atom: OD1 x: 55.27799987792969 y:41.22100067138672 z:20.013999938964844
        - atom: OD2 x: 53.95600128173828 y:42.58100128173828 z:21.128000259399414
      - residue SER has 6 atoms
        - atom: N x: 55.479000091552734 y:37.24300003051758 z:22.937000274658203
        - atom: CA x: 55.474998474121094 y:36.569000244140625 z:24.222000122070312
        - atom: C x: 56.20199966430664 y:37.47200012207031 z:25.216999053955078
        - atom: O x: 57.433998107910156 y:37.551998138427734 z:25.209999084472656
        - atom: CB x: 56.14699935913086 y:35.20000076293945 z:24.128999710083008
        - atom: OG x: 55.3120002746582 y:34.270999908447266 z:23.45400047302246
      - residue ALA has 5 atoms
        - atom: N x: 55.4119987487793 y:38

        - atom: OE1 x: 36.926998138427734 y:22.37299919128418 z:-0.5350000262260437
        - atom: OE2 x: 36.4119987487793 y:22.094999313354492 z:1.597000002861023
      - residue THR has 7 atoms
        - atom: N x: 31.496999740600586 y:20.600000381469727 z:2.4679999351501465
        - atom: CA x: 31.07699966430664 y:19.253999710083008 z:2.8559999465942383
        - atom: C x: 29.572999954223633 y:19.038000106811523 z:2.7790000438690186
        - atom: O x: 29.11199951171875 y:17.902999877929688 z:2.697000026702881
        - atom: CB x: 31.54199981689453 y:18.871000289916992 z:4.276000022888184
        - atom: OG1 x: 30.957000732421875 y:19.768999099731445 z:5.230000019073486
        - atom: CG2 x: 33.06999969482422 y:18.8700008392334 z:4.367000102996826
      - residue GLN has 9 atoms
        - atom: N x: 28.812000274658203 y:20.1200008392334 z:2.8529999256134033
        - atom: CA x: 27.357999801635742 y:20.003999710083008 z:2.86899995803833
        - atom: C x: 26.73200035095215 y

        - atom: N x: 14.597999572753906 y:37.50400161743164 z:16.476999282836914
        - atom: CA x: 14.222000122070312 y:38.60200119018555 z:15.593000411987305
        - atom: C x: 14.184000015258789 y:38.17399978637695 z:14.128000259399414
        - atom: O x: 13.357999801635742 y:38.65299987792969 z:13.359000205993652
        - atom: CB x: 15.208000183105469 y:39.75400161743164 z:15.786999702453613
        - atom: CG x: 14.890999794006348 y:41.005001068115234 z:15.015000343322754
        - atom: CD x: 15.673999786376953 y:42.191001892089844 z:15.529999732971191
        - atom: OE1 x: 16.538000106811523 y:41.983001708984375 z:16.41200065612793
        - atom: OE2 x: 15.418000221252441 y:43.321998596191406 z:15.062999725341797
      - residue LYS has 9 atoms
        - atom: N x: 15.065999984741211 y:37.25899887084961 z:13.739999771118164
        - atom: CA x: 15.11299991607666 y:36.80500030517578 z:12.35200023651123
        - atom: C x: 14.251999855041504 y:35.56999969482422 z:12.10

        - atom: CZ2 x: 36.387001037597656 y:16.469999313354492 z:4.072000026702881
        - atom: CZ3 x: 36.12799835205078 y:14.437000274658203 z:5.355000019073486
        - atom: CH2 x: 36.36199951171875 y:15.10200023651123 z:4.144999980926514
      - residue TYR has 12 atoms
        - atom: N x: 38.060001373291016 y:18.07200050354004 z:11.420000076293945
        - atom: CA x: 38.72100067138672 y:19.231000900268555 z:12.048999786376953
        - atom: C x: 40.099998474121094 y:19.54599952697754 z:11.47599983215332
        - atom: O x: 40.50699996948242 y:20.698999404907227 z:11.418999671936035
        - atom: CB x: 38.797000885009766 y:19.06800079345703 z:13.579000473022461
        - atom: CG x: 38.733001708984375 y:17.632999420166016 z:14.050000190734863
        - atom: CD1 x: 37.507999420166016 y:16.98900032043457 z:14.213000297546387
        - atom: CD2 x: 39.89899826049805 y:16.917999267578125 z:14.317999839782715
        - atom: CE1 x: 37.44200134277344 y:15.654999732971191 z:14

        - atom: CG x: 28.604000091552734 y:72.08000183105469 z:36.57500076293945
        - atom: CD1 x: 29.40399932861328 y:73.20899963378906 z:36.46500015258789
        - atom: CD2 x: 29.20599937438965 y:70.8550033569336 z:36.84199905395508
        - atom: CE1 x: 30.77899932861328 y:73.1240005493164 z:36.632999420166016
        - atom: CE2 x: 30.58300018310547 y:70.76599884033203 z:37.00400161743164
        - atom: CZ x: 31.3700008392334 y:71.90699768066406 z:36.902000427246094
      - residue LYS has 9 atoms
        - atom: N x: 26.58099937438965 y:69.91799926757812 z:34.1879997253418
        - atom: CA x: 27.033000946044922 y:68.72699737548828 z:33.470001220703125
        - atom: C x: 27.135000228881836 y:68.9990005493164 z:31.958999633789062
        - atom: O x: 28.108999252319336 y:68.61299896240234 z:31.31599998474121
        - atom: CB x: 26.11400032043457 y:67.54000091552734 z:33.7760009765625
        - atom: CG x: 26.599000930786133 y:66.20700073242188 z:33.2400016784668
     

        - atom: CD2 x: 27.899999618530273 y:58.805999755859375 z:27.288999557495117
      - residue VAL has 7 atoms
        - atom: N x: 32.14099884033203 y:57.02000045776367 z:27.8700008392334
        - atom: CA x: 33.30400085449219 y:56.16899871826172 z:28.117000579833984
        - atom: C x: 32.86399841308594 y:54.7130012512207 z:28.06399917602539
        - atom: O x: 31.881999969482422 y:54.345001220703125 z:28.69700050354004
        - atom: CB x: 33.935001373291016 y:56.457000732421875 z:29.5
        - atom: CG1 x: 35.07400131225586 y:55.48500061035156 z:29.784000396728516
        - atom: CG2 x: 34.42900085449219 y:57.90800094604492 z:29.597000122070312
      - residue CYS has 6 atoms
        - atom: N x: 33.5880012512207 y:53.87699890136719 z:27.32200050354004
        - atom: CA x: 33.1879997253418 y:52.472999572753906 z:27.195999145507812
        - atom: C x: 33.409000396728516 y:51.698001861572266 z:28.500999450683594
        - atom: O x: 34.50299835205078 y:51.715999603271484 

        - atom: C x: 69.75 y:55.630001068115234 z:73.875
        - atom: O x: 69.95800018310547 y:54.57899856567383 z:74.51200103759766
        - atom: CB x: 67.46600341796875 y:56.75400161743164 z:74.04100036621094
        - atom: OG1 x: 66.76599884033203 y:57.79499816894531 z:74.74299621582031
        - atom: CG2 x: 66.80500030517578 y:55.40700149536133 z:74.34400177001953
      - residue ALA has 5 atoms
        - atom: N x: 70.26599884033203 y:55.85900115966797 z:72.66899871826172
        - atom: CA x: 71.10199737548828 y:54.86399841308594 z:71.99199676513672
        - atom: C x: 72.37100219726562 y:54.56999969482422 z:72.78600311279297
        - atom: O x: 72.79499816894531 y:53.415000915527344 z:72.91400146484375
        - atom: CB x: 71.44599914550781 y:55.33700180053711 z:70.5790023803711
      - residue ALA has 5 atoms
        - atom: N x: 72.98999786376953 y:55.62300109863281 z:73.30599975585938
        - atom: CA x: 74.20500183105469 y:55.45100021362305 z:74.0790023803711
   

        - atom: CB x: 58.71500015258789 y:56.40800094604492 z:51.742000579833984
        - atom: CG x: 59.01900100708008 y:57.845001220703125 z:52.165000915527344
        - atom: OD1 x: 60.119998931884766 y:58.08599853515625 z:52.70899963378906
        - atom: OD2 x: 58.17399978637695 y:58.74100112915039 z:51.92900085449219
      - residue GLY has 4 atoms
        - atom: N x: 60.62900161743164 y:55.516998291015625 z:48.82699966430664
        - atom: CA x: 61.09600067138672 y:56.02000045776367 z:47.5369987487793
        - atom: C x: 62.323001861572266 y:56.89799880981445 z:47.7140007019043
        - atom: O x: 63.180999755859375 y:56.60200119018555 z:48.54100036621094
      - residue SER has 6 atoms
        - atom: N x: 62.39699935913086 y:58.00199890136719 z:46.97700119018555
        - atom: CA x: 63.542999267578125 y:58.89099884033203 z:47.10300064086914
        - atom: C x: 63.3120002746582 y:59.970001220703125 z:48.1609992980957
        - atom: O x: 63.96900177001953 y:61.0110015869

      - residue ALA has 5 atoms
        - atom: N x: 75.89900207519531 y:66.54100036621094 z:41.21099853515625
        - atom: CA x: 76.0719985961914 y:67.5719985961914 z:42.24599838256836
        - atom: C x: 74.97799682617188 y:67.56700134277344 z:43.31100082397461
        - atom: O x: 73.84100341796875 y:67.18000030517578 z:43.040000915527344
        - atom: CB x: 76.13200378417969 y:68.9469985961914 z:41.604000091552734
      - residue MET has 8 atoms
        - atom: N x: 75.31199645996094 y:68.03399658203125 z:44.512001037597656
        - atom: CA x: 74.29900360107422 y:68.25900268554688 z:45.542999267578125
        - atom: C x: 73.23699951171875 y:69.21099853515625 z:44.97700119018555
        - atom: O x: 73.59200286865234 y:70.24199676513672 z:44.4109992980957
        - atom: CB x: 74.9489974975586 y:68.87300109863281 z:46.78799819946289
        - atom: CG x: 73.99700164794922 y:69.03500366210938 z:47.9640007019043
        - atom: SD x: 73.45700073242188 y:67.4749984741211 z:48.

        - atom: N x: 85.90399932861328 y:57.652000427246094 z:58.39899826049805
        - atom: CA x: 85.3030014038086 y:57.41299819946289 z:57.084999084472656
        - atom: C x: 84.00399780273438 y:56.619998931884766 z:57.255001068115234
        - atom: O x: 83.99700164794922 y:55.50699996948242 z:57.814998626708984
        - atom: CB x: 86.26300048828125 y:56.66400146484375 z:56.13600158691406
        - atom: CG1 x: 85.5780029296875 y:56.39500045776367 z:54.768001556396484
        - atom: CG2 x: 87.55799865722656 y:57.46900177001953 z:55.94200134277344
      - residue LEU has 8 atoms
        - atom: N x: 82.90399932861328 y:57.21200180053711 z:56.79899978637695
        - atom: CA x: 81.58000183105469 y:56.582000732421875 z:56.88199996948242
        - atom: C x: 81.07599639892578 y:56.27399826049805 z:55.481998443603516
        - atom: O x: 81.52100372314453 y:56.867000579833984 z:54.494998931884766
        - atom: CB x: 80.5510025024414 y:57.46699905395508 z:57.604000091552734
    

        - atom: O x: 80.2959976196289 y:71.01699829101562 z:71.1259994506836
      - residue HOH has 1 atoms
        - atom: O x: 75.72699737548828 y:46.6510009765625 z:74.12000274658203
      - residue HOH has 1 atoms
        - atom: O x: 72.18399810791016 y:63.54600143432617 z:36.49399948120117
      - residue HOH has 1 atoms
        - atom: O x: 44.1870002746582 y:50.25299835205078 z:65.1760025024414
      - residue HOH has 1 atoms
        - atom: O x: 54.05099868774414 y:50.28200149536133 z:44.88600158691406
      - residue HOH has 1 atoms
        - atom: O x: 68.71900177001953 y:68.40899658203125 z:73.51599884033203
      - residue HOH has 1 atoms
        - atom: O x: 66.34300231933594 y:48.44300079345703 z:49.025001525878906
      - residue HOH has 1 atoms
        - atom: O x: 88.67500305175781 y:58.305999755859375 z:36.395999908447266
      - residue HOH has 1 atoms
        - atom: O x: 79.05699920654297 y:51.78499984741211 z:75.51899719238281
      - residue HOH has 1 atoms
   

        - atom: CB x: 38.696998596191406 y:84.09100341796875 z:56.20000076293945
        - atom: CG x: 39.972999572753906 y:84.83499908447266 z:55.84000015258789
        - atom: CD x: 40.66699981689453 y:85.41699981689453 z:57.055999755859375
        - atom: OE1 x: 40.07600021362305 y:85.37200164794922 z:58.15700149536133
        - atom: OE2 x: 41.803001403808594 y:85.91799926757812 z:56.9119987487793
      - residue ASP has 8 atoms
        - atom: N x: 39.02299880981445 y:81.24700164794922 z:54.790000915527344
        - atom: CA x: 39.895999908447266 y:80.27200317382812 z:54.14899826049805
        - atom: C x: 39.347999572753906 y:79.822998046875 z:52.79800033569336
        - atom: O x: 40.1150016784668 y:79.59100341796875 z:51.86000061035156
        - atom: CB x: 40.141998291015625 y:79.0530014038086 z:55.02899932861328
        - atom: CG x: 40.89400100708008 y:77.95999908447266 z:54.28799819946289
        - atom: OD1 x: 42.07699966430664 y:78.18399810791016 z:53.94300079345703
     

        - atom: C5' x: 29.643999099731445 y:94.2030029296875 z:44.108001708984375
        - atom: O5' x: 28.641000747680664 y:95.11399841308594 z:44.51499938964844
        - atom: PA x: 28.917999267578125 y:96.7040023803711 z:44.57899856567383
        - atom: O1A x: 29.993000030517578 y:96.94499969482422 z:45.6150016784668
        - atom: O2A x: 27.638999938964844 y:97.4229965209961 z:44.9370002746582
        - atom: O3A x: 29.464000701904297 y:97.15699768066406 z:43.124000549316406
        - atom: PB x: 30.93000030517578 y:97.8010025024414 z:42.875
        - atom: O1B x: 30.97599983215332 y:99.19999694824219 z:43.433998107910156
        - atom: O2B x: 31.295000076293945 y:97.83100128173828 z:41.40999984741211
        - atom: O3B x: 31.93899917602539 y:96.84600067138672 z:43.6879997253418
        - atom: PG x: 33.507999420166016 y:96.7760009765625 z:43.33300018310547
        - atom: O1G x: 33.66600036621094 y:96.30400085449219 z:41.90800094604492
        - atom: O2G x: 34.1650009155273

Once we have the coordinates of the atoms, we can compute distances and angles between them. We can also superimpose two structures, rotating and translating one onto the other to minimize the distance between them. We will not cover this, nor many other features provided by Biopython, such as phylogenetic analysis tools, an interface to KEGG pathways, clustering, etc. If you are interested, you can find all the available features in the [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf).
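
As a small illustration of the first point, the sketch below computes a bond length and a bond angle using only the Python standard library. The coordinates are those of the N, CA and C atoms of one LYS residue from the output above, rounded to three decimals; the helper names `distance` and `angle_deg` are ours and are not part of Biopython. With Biopython `Atom` objects, the same distance can be obtained by subtracting two atoms (`atom1 - atom2`).

```python
import math

# Coordinates (in angstrom) of three backbone atoms of a LYS residue,
# copied from the output above and rounded to three decimals.
n_coord = (26.581, 69.918, 34.188)
ca_coord = (27.033, 68.727, 33.470)
c_coord = (27.135, 68.999, 31.959)

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

def angle_deg(p, center, q):
    """Angle (in degrees) formed at `center` by the points p and q."""
    v1 = [a - b for a, b in zip(p, center)]
    v2 = [a - b for a, b in zip(q, center)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(cos_theta))

print("N-CA distance: {:.2f}".format(distance(n_coord, ca_coord)))
print("N-CA-C angle: {:.1f}".format(angle_deg(n_coord, ca_coord, c_coord)))
```

Both values should fall in the range expected for a protein backbone: an N-CA bond of roughly 1.5 angstrom and an N-CA-C angle of around 110 degrees.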

## Exercises


1. Write a python function ```retrieve_sequences(search_term, number, outfile)``` that retrieves the first ```number``` sequences matching the search term ```search_term``` from NCBI's "nucleotide" database (hint: use the term and retmax parameters of Entrez.esearch) and stores them in a fasta file ```outfile``` (hint: use SeqIO.write). Test your code by retrieving the first 5 entries having search term "starch AND Malus Domestica [Organism]".

<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [207]:
from Bio import Entrez
from Bio import SeqIO

def retrieve_sequences(search_term, number, outfile):
    Entrez.email = "my_email"  # replace with your own e-mail address
    # Ask for at most `number` matching ids
    handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=number)
    res = Entrez.read(handle)
    handle.close()
    records = []
    for el in res["IdList"]:
        # Fetch each entry in GenBank format and parse it into a SeqRecord
        handle = Entrez.efetch(db="nucleotide", id=el, rettype="gb", retmode="text")
        my_seq = SeqIO.read(handle, format="genbank")
        handle.close()
        records.append(my_seq)
    # SeqIO.write returns the number of records written
    N = SeqIO.write(records, outfile, "fasta")
    print("Search term was: ", search_term)
    print("{} sequences written to {}".format(N, outfile))
    
s_term = "starch AND Malus Domestica [Organism]"
retrieve_sequences(s_term, 5, "file_samples/starch_sequences.fasta")


Search term was:  starch AND Malus Domestica [Organism]
5 sequences written to file_samples/starch_sequences.fasta


In [None]:
</div>

2. Write a python function that aligns the sequences in the file created at point 1 ([here](file_samples/starch_sequences.fasta) you can find mine) against the NCBI nt (nucleotide collection) database, limiting the hits to the Malus Domestica organism (parameter entrez_query='"Malus Domestica" [Organism]' in qblast), and prints to screen the following info for each hsp:
    1. The title;
    2. Score and e-value;
    3. The number of alignments on the same subject, the number of identities and positives and the alignment length;
    4. The number of mismatches and the list of their positions (hint: you can use the match string and look for " ").
   

    
<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [217]:
%reset
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

# Read the whole multi-fasta file: qblast accepts it as a single query string
with open("file_samples/starch_sequences.fasta") as f:
    fasta_string = f.read()

res_handle = NCBIWWW.qblast("blastn", "nt", fasta_string,
                            entrez_query='"Malus Domestica" [Organism]'
                           )

# NCBIXML.parse yields one blast record per query sequence
for align in NCBIXML.parse(res_handle):
    for a in align.alignments:
        print("Align Title:{}".format(a.title))
        for h in a.hsps:
            s = h.score
            e = h.expect
            n = h.num_alignments
            i = h.identities
            p = h.positives
            m = h.match
            al = h.align_length
            # A blank in the match string marks a mismatch (or a gap)
            misM = [str(x) for x in range(len(m)) if m[x] == " "]
            print("Score: {} E-val: {}".format(s, e))
            print("N.aligns:{} Ident:{} Pos.:{} Align len:{}".format(
                n, i, p, al))
            if misM:
                print("Num mismatches:", len(misM))
                print("Mismatch pos:", ",".join(misM))
            else:
                print("No mismatches")
            print("")

res_handle.close()



Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Align Title:gi|1040975204|ref|NM_001328845.1| Malus domestica glucose-1-phosphate adenylyltransferase small subunit, chloroplastic/amyloplastic (LOC103421714), mRNA >gi|295684200|gb|GU983663.1| Malus x domestica ADP glucose pyrophosphorylase small subunit 1-like protein mRNA, complete cds
Score: 3630.0 E-val: 0.0
N.aligns:None Ident:1815 Pos.:1815 Align len:1815
No mismatches

Align Title:gi|1039864835|ref|XM_008344054.2| PREDICTED: Malus x domestica glucose-1-phosphate adenylyltransferase small subunit, chloroplastic/amyloplastic (LOC103405088), mRNA
Score: 3280.0 E-val: 0.0
N.aligns:None Ident:1744 Pos.:1744 Align len:1818
Num mismatches: 74
Mismatch pos: 688,704,824,836,860,867,890,942,944,1031,1044,1122,1157,1190,1277,1295,1367,1371,1376,1382,1388,1394,1412,1445,1457,1477,1583,1584,1585,1589,1610,1614,1615,1617,1622,1632,1633,1649,1650,1651,1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1663,1670,1671,1681,1683,1686

Num mismatches: 60
Mismatch pos: 2,5,13,16,19,22,25,34,43,46,52,53,55,61,66,70,71,73,79,95,97,110,111,114,115,118,124,127,133,142,143,148,152,154,155,157,160,161,162,163,166,169,170,187,190,193,202,204,205,214,215,217,218,219,221,222,224,236,237,239

Score: 150.0 E-val: 1.73451e-30
N.aligns:None Ident:96 Pos.:96 Align len:110
Num mismatches: 14
Mismatch pos: 5,11,17,23,26,33,38,39,44,59,71,92,95,104

Score: 148.0 E-val: 6.05405e-30
N.aligns:None Ident:146 Pos.:146 Align len:194
Num mismatches: 48
Mismatch pos: 2,5,11,15,16,17,18,25,30,32,33,41,44,46,47,48,53,56,58,59,62,65,66,67,72,74,77,91,92,93,95,101,109,110,112,113,116,117,132,134,149,152,155,164,176,180,182,191

Score: 132.0 E-val: 1.33349e-25
N.aligns:None Ident:105 Pos.:105 Align len:131
Num mismatches: 26
Mismatch pos: 12,14,20,23,26,29,41,44,47,50,53,56,59,60,68,71,72,74,77,86,92,95,113,119,120,128

Score: 108.0 E-val: 4.35921e-19
N.aligns:None Ident:72 Pos.:72 Align len:84
Num mismatches: 12
Mismatch pos: 5,11,14,17,32,47,50,

</div>

3. Write a python function that, given an organism and a term description, retrieves all the PubMed publications related to the organism and the term.

<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">
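
A possible solution is sketched below; it is one way to approach the exercise, not a reference implementation. It assumes that the query can simply be written as `"organism" AND (term)`, and the names `build_query` and `retrieve_publications`, as well as the `retmax` and `email` parameters, are our own choices. Since `Entrez.esearch` returns at most `retmax` ids per call, retrieving really all the publications would require raising this value.

```python
def build_query(organism, term):
    """Combine organism and term into a single PubMed query string."""
    return '"{}" AND ({})'.format(organism, term)

def retrieve_publications(organism, term, retmax=20, email="my_email"):
    """Return a list of (PMID, title) tuples for the matching PubMed entries."""
    # Imported here so that build_query can be tried without Biopython installed
    from Bio import Entrez, Medline

    Entrez.email = email  # replace with your own e-mail address
    handle = Entrez.esearch(db="pubmed", term=build_query(organism, term),
                            retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []
    # Fetch all the entries with one request, in MEDLINE text format
    handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                           rettype="medline", retmode="text")
    papers = [(rec.get("PMID", "?"), rec.get("TI", "?"))
              for rec in Medline.parse(handle)]
    handle.close()
    return papers

# Only the query is printed here: the retrieval itself needs a network connection
print(build_query("Malus domestica", "starch"))
```

Calling `retrieve_publications("Malus domestica", "starch")` contacts NCBI and returns at most `retmax` entries, each as a pair of PubMed id and title.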

</div>

4. Write some python code to retrieve the structure of two forms of the aspartate transcarbamoylase (PDB ids: 4FYW and 1D09). If you are interested, read more about the Aspartate Transcarbamoylase [here](http://pdb101.rcsb.org/motm/215). Write a function that gets the .cif file name and prints the number of chains, residues and atoms present in the file.

<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">
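
A possible solution is sketched below, under a few assumptions: the function names (`count_entities`, `print_structure_info`, `download_and_report`) and the output directory `file_samples` are our own choices, and the counts cover everything in the file, so water molecules and other hetero-residues are included (and, for files with several models, all models are counted).

```python
def count_entities(structure):
    """Return (chains, residues, atoms) counted over the whole structure."""
    n_chains = sum(1 for _ in structure.get_chains())
    n_residues = sum(1 for _ in structure.get_residues())
    n_atoms = sum(1 for _ in structure.get_atoms())
    return n_chains, n_residues, n_atoms

def print_structure_info(cif_file):
    """Parse an mmCIF file and print how many chains, residues and atoms it has."""
    # Imported here so that count_entities can be tried without Biopython installed
    from Bio.PDB.MMCIFParser import MMCIFParser

    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure("structure", cif_file)
    chains, residues, atoms = count_entities(structure)
    print("{}: {} chains, {} residues, {} atoms".format(
        cif_file, chains, residues, atoms))

def download_and_report(pdb_ids=("4FYW", "1D09"), outdir="file_samples"):
    """Download the given PDB entries in mmCIF format and report on each one."""
    from Bio.PDB import PDBList

    pdbl = PDBList()
    for pdb_id in pdb_ids:
        # retrieve_pdb_file returns the path of the downloaded file
        cif_file = pdbl.retrieve_pdb_file(pdb_id, pdir=outdir, file_format="mmCif")
        print_structure_info(cif_file)
```

Calling `download_and_report()` needs a network connection; `print_structure_info` can also be used on its own with any .cif file already on disk.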

</div>