
# Practical 11

In this practical we will see some other functionalities of Biopython.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical11.pdf)

## Biopython

From the Biopython tutorial: The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology. The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.

In this practical we will see some features of Biopython but refer to [biopython documentation](http://biopython.org/wiki/Documentation) to discover all its features, recipes etc.

These notes are largely based on what is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf).



## BLAST

[BLAST (Basic Local Alignment Search Tool)](https://www.ncbi.nlm.nih.gov/pubmed/2231712) is a well-known tool for finding similarities between biological sequences. It compares DNA or protein sequences and calculates the statistical significance of the matches found.

The typical interaction with BLAST sees the user submit some sequences to the tool to get an alignment and then the hits are parsed to obtain information on the matches. Both these steps can be performed from within Biopython. Although it is possible to interact directly with a local installation of BLAST, in this practical we will work with the tool made available by NCBI (available [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi)). 

### The function qblast

The online version of blast can be accessed through the ```Bio.Blast.NCBIWWW.qblast()``` function.

Its basic syntax is the following (first import the module with ```from Bio.Blast import NCBIWWW```):

```
result_handle = NCBIWWW.qblast(blast_program, database, query_str)
```
where ```blast_program``` is the program used to perform the alignment. The options are **blastn, blastp, blastx, tblastn or tblastx**. ```database``` is the database to search against and ```query_str``` is a string containing the query. The query can be a sequence, a FASTA file entry or an identifier such as a GI number (NCBI's sequence identification number). Among the optional parameters are the output format (```format_type```, which defaults to "XML", the most stable output format; results can also be stored as text with "Text") and ```expect``` (the e-value threshold).

Some databases to search against are reported below:

![](img/pract11/blast_dbs.png)

The query string can be obtained in different ways, for example it is possible to load sequences from a fasta file with:

```
from Bio.Blast import NCBIWWW
fasta_string = open("myfile.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
```

or we can give a SeqRecord:

```
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("myfile.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
```

It is also possible to specify some optional parameters in the ```entrez_query``` for example we can limit the search to specific organisms with: ```entrez_query='"Malus Domestica" [Organism]'```.
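
The arguments described so far can be collected in one place before calling `qblast`. The following is a minimal sketch and not part of Biopython: `build_qblast_args` is a hypothetical helper that checks the program name and builds the `entrez_query` string. The keyword names `program`, `database`, `sequence`, `entrez_query`, `expect` and `format_type` are assumed to match the `qblast` signature of the installed Biopython version.

```python
# Hypothetical helper (not part of Biopython): collect qblast arguments in a dict.
VALID_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_qblast_args(program, database, query, organism=None,
                      expect=10.0, format_type="XML"):
    """Return a dictionary of keyword arguments for NCBIWWW.qblast."""
    if program not in VALID_PROGRAMS:
        raise ValueError("Unknown BLAST program: {}".format(program))
    args = {
        "program": program,
        "database": database,
        "sequence": query,
        "expect": expect,            # e-value threshold
        "format_type": format_type,  # "XML" or "Text"
    }
    if organism is not None:
        # Restrict the hits to one organism through an Entrez query
        args["entrez_query"] = '"{}" [Organism]'.format(organism)
    return args


args = build_qblast_args("blastn", "nt", "GAGGGGTTTA", organism="Malus Domestica")
print(args["entrez_query"])
```

The dictionary can then be unpacked into the real call with `result_handle = NCBIWWW.qblast(**args)`, which is the step that actually contacts the NCBI servers.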


**Example:** Let's align the first 100 bases of the first entry of the file [contigs82.fasta](file_samples/contigs82.fasta) to the Malus Domestica genome.

**NOTE: this can take several minutes.**

In [1]:
from Bio.Blast import NCBIWWW
from Bio import SeqIO



records = SeqIO.parse("file_samples/contigs82.fasta", format="fasta")
rec = next(records)
seq = rec.seq[0:100]
print("Aligning {} [{}] to Malus Dom.".format(rec.id,
                                              seq[0:10]+"..."+seq[90:100]))
result_handle = NCBIWWW.qblast("blastn", "nt", 
                               seq,
                               entrez_query='"Malus Domestica" [Organism]'
                              )

Aligning MDC020656.85 [GAGGGGTTTA...TTGGCAGCAA] to Malus Dom.


Note that the previous code does not print any alignment: it only returns a ```result_handle```, which we need to parse to get some results.

### Parsing qblast output

Once the qblast call returns, the results are available through the handle object ```result_handle```, which we can parse directly or write to disk to avoid rerunning the query. If we expect a single record (one query sequence) we can use the function **read**; otherwise (multiple query sequences) we should use the function **parse**: 

```
blast_record = NCBIXML.read(result_handle)
```

or 

```
blast_records = NCBIXML.parse(result_handle)
```
Note that to use these methods we first need to import the ```NCBIXML``` module with ```from Bio.Blast import NCBIXML```.

These functions are analogous to those we saw for SeqIO and AlignIO. In the case of multiple entries we can loop through them with:

```
blast_records = NCBIXML.parse(result_handle)
for record in blast_records:
    pass  # do something with each record...
```
or we can retrieve one record at a time with ```record = next(blast_records)```.


### Saving results to file

To save the results present in the ```result_handle``` we can simply write them to file. If we have only one entry we can read it and write it to file:

```
out_f = open("my_blast_result.xml", "w")
out_f.write(result_handle.read())
out_f.close()
result_handle.close()
```
If the query contained more than one sequence, the handle still holds a single XML document covering all of them, so the same approach applies (here written with ```with```, so that the file is closed automatically):

```
with open("my_blast_result.xml", "w") as out_f:
    out_f.write(result_handle.read())
result_handle.close()
```
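
The saving step can be wrapped in a small reusable function. This is a sketch under stated assumptions: `save_handle` is a hypothetical helper, and an `io.StringIO` object with made-up content stands in for the handle returned by `qblast`, so the example runs without network access.

```python
import io
import os
import tempfile


def save_handle(handle, path):
    """Write everything the handle contains to path, then close the handle."""
    with open(path, "w") as out_f:   # the file is closed automatically
        n_chars = out_f.write(handle.read())
    handle.close()
    return n_chars


# Stand-in for a qblast result handle (made-up content, no network access)
fake_handle = io.StringIO("<BlastOutput>...</BlastOutput>")
out_path = os.path.join(tempfile.gettempdir(), "my_blast_result.xml")
written = save_handle(fake_handle, out_path)
print("{} characters written to {}".format(written, out_path))
```

Since a handle can be read only once, later analyses should open and parse the saved file rather than reuse the exhausted handle.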

**Example:**

Let's BLAST the galactosidase alpha (GI number: 2717) against the NCBI nt database and save the results to file. (**Note that it can take several seconds/minutes to run!**).

In [2]:
from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast("blastn", "nt", "2717")


with open("file_samples/blast_res.xml","w") as out_f:
    out_f.write(result_handle.read())

result_handle.close()

### Open a blast .XML file

A BLAST output file can be read by opening the file to get a handle and then parsing it with the function **parse** seen above:

```
from Bio.Blast import NCBIXML
result_handle = open("my_blast.xml")
blast_records = NCBIXML.parse(result_handle)
```
This yields an iterator over the BLAST records stored in the file.


### The BLAST record class

The ```Bio.Blast.Record.Blast``` class holds the results of the alignment. In particular, it contains the following information:

1. *Descriptions* : a list of Description objects. Each ```Description``` holds the following information:

    - ```Description.title``` : a string with the title of the hit;
    - ```Description.score``` : a float with the score of the alignment;
    - ```Description.num_alignments``` : an int with the number of alignments with the same subject;
    - ```Description.e``` : a float with the e-value of the alignment.
    
2. *Alignments* : a list of Alignment objects. Each ```Alignment``` holds the following information:

    - ```Alignment.title``` : a string with the title of the hit (identical to ```Description.title```);
    - ```Alignment.length``` : an int with the length of the alignment;
    - ```Alignment.hsps``` : a list of HSP objects (High Scoring Pair). Each ```HSP``` has the following info:
        
        - ```HSP.score``` : the BLAST score of the hit
        - ```HSP.bits``` : the bit score of the hit (a bit score of x means that, on average, 2^x pairs must be examined to find such a good hit by chance)
        - ```HSP.expect``` : the evalue of the hit
        - ```HSP.num_alignments``` : the number of alignments for the same subject
        - ```HSP.identities``` : the number of identities between query and subject
        - ```HSP.positives``` : the number of positions holding identical bases/amino acids or amino acids with similar chemical properties
        - ```HSP.gaps``` : the number of gaps between query and subject
        - ```HSP.strand``` : a **tuple** with (query,subject) strands
        - ```HSP.frame``` : a **tuple** with the frame shifts
        - ```HSP.query/HSP.sbjct``` : query/subject sequence
        - ```HSP.query_start/HSP.sbjct_start``` : query/subject start point
        - ```HSP.match``` : the match sequence (basically "|" for matches and spaces for mismatches)
        - ```HSP.align_length``` : the alignment length.

More information on the BLAST record can be found [here](http://biopython.org/DIST/docs/api/Bio.Blast.Record-module.html).
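
Once a record is available, the attributes listed above can be combined to keep only the most significant hits. The sketch below is an illustration and not part of Biopython: `percent_identity` and `significant_hsps` are hypothetical helpers, the default threshold is arbitrary, and the record is a made-up stand-in built with `SimpleNamespace` that only mimics the attribute names (`alignments`, `hsps`, `expect`, `identities`, `align_length`).

```python
from types import SimpleNamespace


def percent_identity(hsp):
    """Identities as a percentage of the alignment length."""
    return 100.0 * hsp.identities / hsp.align_length


def significant_hsps(blast_record, e_threshold=1e-10):
    """Return (title, hsp) pairs whose e-value is below e_threshold."""
    hits = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < e_threshold:
                hits.append((alignment.title, hsp))
    return hits


# Made-up record mimicking the attributes of a BLAST record (not real output)
record = SimpleNamespace(alignments=[
    SimpleNamespace(title="hit A", hsps=[
        SimpleNamespace(expect=1e-40, identities=95, align_length=100),
        SimpleNamespace(expect=0.5, identities=12, align_length=20),
    ]),
    SimpleNamespace(title="hit B", hsps=[
        SimpleNamespace(expect=2.0, identities=9, align_length=18),
    ]),
])

for title, hsp in significant_hsps(record):
    print(title, hsp.expect, "{:.1f}%".format(percent_identity(hsp)))
```

The same two functions should work unchanged on a record returned by `NCBIXML.read` or on each record yielded by `NCBIXML.parse`, since they rely only on the attributes described in this section.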

**Example:**

Let's BLAST the serum albumin sequence (GI number [23307792](https://www.ncbi.nlm.nih.gov/nuccore/AF542069.1)) against the human sequences and print all the information reported by BLAST (warning: it might take a while to run!).

In [3]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

result_handle = NCBIWWW.qblast("blastn", "nt", "23307792", 
                               entrez_query='"Homo Sapiens" [Organism]'
                               )



for res in NCBIXML.parse(result_handle):
    for d in res.descriptions:
        
        print("TITLE:{}\nSCORE:{}\nN.ALIGN:{}\nE-VAL:{}".format(
            d.title,d.score, d.num_alignments,d.e))
        
    for a in res.alignments:
        print("Align Title:{}\nAlign Len: {}".format(a.title, a.length))

        
    
        for h in a.hsps:
            s = h.score
            b = h.bits
            e = h.expect
            n = h.num_alignments
            i = h.identities
            p = h.positives
            g = h.gaps
            st = h.strand
            f = h.frame
            q = h.query
            sb = h.sbjct
            qs = h.query_start
            ss = h.sbjct_start
            qe = h.query_end
            se = h.sbjct_end
            m = h.match 
            al = h.align_length
            
            print("Score: {} Bits: {} E-val: {}".format(s,b,e))
            print("N.aligns:{} Ident:{} Pos.:{} Gaps:{} Align len:{}".format(
                n,i,p,g,al))
            print("Strand: {} Frame: {}".format(st,f))
            print("Query:", q, " start:", qs, " end:", qe)
            print("Match:",m)
            print("Subjc:",sb, " start:", ss, " end:", se)
            

result_handle.close()

TITLE:gi|23307792|gb|AF542069.1| Homo sapiens serum albumin (HSA) mRNA, complete cds
SCORE:4352.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|1046552723|ref|NM_000477.6| Homo sapiens albumin (ALB), mRNA
SCORE:4305.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|28591|emb|V00495.1| H.sapiens mRNA for serum albumin
SCORE:4253.0
N.ALIGN:2
E-VAL:0.0
TITLE:gi|7770116|gb|AF119840.1|AF119840 Homo sapiens PRO0903 mRNA, complete cds
SCORE:4062.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|25058738|gb|BC039235.1| Homo sapiens albumin, mRNA (cDNA clone IMAGE:4768004), containing frame-shift errors
SCORE:4056.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|23243417|gb|BC036003.1| Homo sapiens albumin, mRNA (cDNA clone MGC:32850 IMAGE:4724105), complete cds
SCORE:4052.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|158258946|dbj|AK292755.1| Homo sapiens cDNA FLJ78413 complete cds, highly similar to Homo sapiens albumin, mRNA
SCORE:4036.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|28589|emb|V00494.1| Human messenger RNA for serum albumin (HSA)
SCORE:4033.0
N.ALIGN:1
E-VAL:0.0
TITLE:gi|2170645

Subjc: ATACTTATATGAAATTGCCAGAAGACATCCTTACTTTTATGCCCCGGAACTCCTTTTCTTTGCTAAAAGGTATAAAGCTGCTTTTACAGAATGTTGCCAAGCTGCTGATAAAGCTGCCTGCCTGTTGCCAAAG  start: 75203  end: 75335
Score: 221.0 Bits: 200.559 E-val: 1.18952e-47
N.aligns:None Ident:115 Pos.:115 Gaps:0 Align len:118
Strand: (None, None) Frame: (1, 1)
Query: TCTCTTCTGTCAACCCCACGCGCCTTTGGCACAATGAAGTGGGTAACCTTTATTTCCCTTCTTTTTCTCTTTAGCTCGGCTTATTCCAGGGGTGTGTTTCGTCGAGATGCACACAAGA  start: 1  end: 118
Match: ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  ||||
Subjc: TCTCTTCTGTCAACCCCACACGCCTTTGGCACAATGAAGTGGGTAACCTTTATTTCCCTTCTTTTTCTCTTTAGCTCGGCTTATTCCAGGGGTGTGTTTCGTCGAGATGCACGTAAGA  start: 70143  end: 70260
Score: 200.0 Bits: 181.623 E-val: 3.19191e-42
N.aligns:None Ident:100 Pos.:100 Gaps:0 Align len:100
Strand: (None, None) Frame: (1, 1)
Query: GTTCGATGAATTTAAACCTCTTGTGGAAGAGCCTCAGAATTTAATCAAACAAAATTGTGAGCTTTTTGAGCAGCTTGGAGAGTACAAATTCCAGAATGCG  start: 1224  end: 1323
Match: 

**Example:** Let's align the first 100 bases of the first 5 entries of the file [contigs82.fasta](file_samples/contigs82.fasta) to the Malus Domestica genome, writing the results to the file apple_first5.xml.
Sample output is here: [apple_first5.xml](file_samples/apple_first5.xml).
**NOTE: this can take several minutes.**

## Getting data from NCBI

Biopython provides a module (```Bio.Entrez```) to pull data off resources like PubMed or GenBank, and other repositories programmatically through [Entrez](http://www.ncbi.nlm.nih.gov/Entrez/). 

There are some limitations (mostly taken care of directly by Biopython) that you should be aware of when you use NCBI's services. Check [here](http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen) if you want to learn what these limitations are.

First of all we need to import the Entrez module (```from Bio import Entrez```). Before interacting with Entrez we should also set a contact email through ```Entrez.email``` (optional, but recommended).

In particular, the module (complete information on the [Entrez module is here](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)) provides, among others, the following functions:

1. ```res_handle = Entrez.einfo(db)``` returns a summary of the Entrez databases as a results handle. ```db``` is an optional parameter specifying the resource of interest;
2. ```res_handle = Entrez.esearch(db, term,id)``` returns all the entries in ```db``` matching the query ```term```. It is also possible to specify an ```id``` to get the information relative to that resource id;
3. ```res_handle = Entrez.efetch(db, id, rettype, retmode)``` returns the full record corresponding to the identifier ``id`` from the database ``db``, formatted as ```rettype``` (e.g. gb, fasta,... [complete list](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly)) with return mode ```retmode``` (e.g. text); 
4. ```res_handle = Entrez.esummary(db, id)``` returns the summary of the entry ```id``` from the database ```db``` as a handle;
5. ```result = Entrez.read(res_handle)``` reads the information in the XML handle ```res_handle``` and stores it in a dictionary, list or string, depending on the case.  
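
The ```term``` argument of ```esearch``` accepts queries in which a value is restricted to a field by a tag in square brackets, such as ```[Organism]```. The sketch below shows one way to assemble such a term; `build_term` is a hypothetical helper and not part of Biopython, and the field names used in the example are only illustrative.

```python
def build_term(pairs):
    """Combine (value, field) pairs into one Entrez search term."""
    return " AND ".join("{} [{}]".format(value, field) for value, field in pairs)


# Illustrative values: restrict the organism and require a word in the title
term = build_term([("Malus Domestica", "Organism"), ("drought", "Title")])
print(term)
```

The resulting string can be passed to the real call, for example `Entrez.esearch(db="pubmed", term=term, retmax=3)`, which contacts the NCBI servers.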


**Example:**

Let's get a list of all available databases in Entrez as a dictionary. Let's then get a summary of the entries in 'sra'.

In [None]:
from Bio import Entrez

Entrez.email = "my_email"
handle = Entrez.einfo()
res = Entrez.read(handle)
print(res)
print("")
print("As a list:")
print(res['DbList'])

res = Entrez.read(Entrez.einfo(db = "sra"))
#uncomment to see all the information captured
#print(res)
#for el in res["DbInfo"].keys():
#    print(el) 
print("")
print("Entries count:", res["DbInfo"]["Count"])
print("LastUpdate:", res["DbInfo"]["LastUpdate"])
print("Description:", res["DbInfo"]["Description"])




Note that **einfo** returns a handle that can be read by the **read** function, which produces a dictionary. In the first case this dictionary has a single key, "DbList", holding the list of available databases; when ```db``` is specified the key is "DbInfo".

**Example:**
Fetch the first three entries (retmax = 3) in PubMed related to the species "Malus Domestica" and report the title and source of each publication.

In [None]:
from Bio import Entrez

Entrez.email = "my_email"
handle = Entrez.esearch(db="pubmed", term="Malus Domestica", retmax = 3)
res = Entrez.read(handle)
for el in res.keys():
    print(el , " : ", res[el])

print("")
for ids in res["IdList"]:    
    print("Results for id:", ids)
    handle = Entrez.esummary(db="pubmed",  id = ids)
    summaries = Entrez.read(handle) #separate name: keeps the search results in res
    #uncomment to see all info
    #print(summaries)
    for r in summaries:
        print(r["Title"])
        print(r["Source"])
        print("")


**Example:**
Retrieve genbank formatted information of the Malus x domestica MYB domain class transcription factor (MYB1) mRNA complete cds (nucleotide database id:HM122614.1). Parse it as a SeqRecord, printing only the sequence (remember previous practical's SeqIO).

In [None]:
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "my_email"
handle = Entrez.efetch(db="nucleotide", 
                       id = "HM122614.1", 
                       rettype = "gb", 
                       retmode="text")
my_seq = SeqIO.read(handle, format = "genbank")
handle.close() #SeqIO.read consumed the handle: reading it again would return nothing
print(my_seq)
print("")
print("SEQUENCE:")
print(my_seq.seq)

## Getting data from ExPASy

Similarly to what we did with Entrez, it is possible to pull data out of ExPASy (https://www.expasy.org/) through the ```Bio.ExPASy``` module. We will not cover this in detail. All information can be found here: [Bio.ExPASy module](http://biopython.org/DIST/docs/api/toc-Bio.ExPASy-module.html).

As an example, we will see how to download a couple of SwissProt entries (the human and mouse P53 protein).

In [None]:
from Bio import ExPASy
from Bio import SwissProt

accessions = ["P04637", "P02340"] #the ids of the human and mouse proteins

for accession in accessions:
    handle = ExPASy.get_sprot_raw(accession)
    record = SwissProt.read(handle)
    print(record.entry_name)
    print(",".join(record.accessions))
    print(record.keywords)
    print(repr(record.organism))
    print(record.sequence[:30] + "...")



You can find [here](http://biopython.org/DIST/docs/api/toc-Bio.SeqRecord-module.html) all the details on how to deal with the SwissProt records.

## 3D structure and PDB

Biopython can also deal with data coming from the [Protein Data Bank database](https://www.rcsb.org/pdb/home/home.do), a database of structural information on the 3D shapes of proteins, nucleic acids, and complex assemblies. At the time of writing, the database contained more than 134,000 structures.

To deal with this kind of data we first need to import Biopython's module ```Bio.PDB``` with ```from Bio.PDB import *```. All the information on this module can be found [here](http://biopython.org/DIST/docs/api/toc-Bio.PDB-module.html). 

It is possible to download a structure directly from PDB by using a ```PDBList``` object that features a function called ```download_pdb_files``` having the basic syntax:

```
PDBList.download_pdb_files(pdb_codes, pdir, file_format)
```
which downloads the structures listed in ```pdb_codes``` (a list of 4-character PDB structure ids) in the format ```file_format``` and stores them in the directory ```pdir```. The safest ```file_format``` to use is "mmCif". The function will not download a structure more than once: if a file is already present in the specified directory, the message **Structure exists** is displayed. 
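
Before parsing, it can be handy to know where a downloaded file is expected to be. The sketch below assumes the naming convention used in the rest of this practical (the lower-case id followed by `.cif`, e.g. `file_samples/3c2l.cif`); `local_cif_path` and `missing_structures` are hypothetical helpers, not Biopython functions.

```python
import os


def local_cif_path(pdb_code, pdir):
    """Path where the mmCif file of pdb_code is expected inside pdir."""
    return os.path.join(pdir, pdb_code.lower() + ".cif")


def missing_structures(pdb_codes, pdir):
    """Return the codes whose .cif file is not yet present in pdir."""
    return [code for code in pdb_codes
            if not os.path.exists(local_cif_path(code, pdir))]


print(local_cif_path("3C2K", "file_samples"))
```

Calling `missing_structures(["3C2K", "3C2L"], "file_samples/")` gives the list of ids that still have to be passed to `download_pdb_files`.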

**Example:**

Let's programmatically download two different structures of the DNA polymerase: [3C2K](https://www.rcsb.org/pdb/explore/explore.do?structureId=3C2K) and [3C2L](https://www.rcsb.org/pdb/explore/explore.do?structureId=3C2L).

In [None]:
from Bio.PDB import *

pdbl = PDBList()
structures = ["3C2K", "3C2L"]
el = pdbl.download_pdb_files(structures, 
                             file_format = "mmCif", 
                             pdir = "file_samples/")


Macromolecular Crystallographic Information Files (mmCif, with extension .cif) are paired collections of names (starting with "\_") and values. They also contain a description of the 3D placement of every crystallized atom of the structure. A detailed description of the format can be found [here](http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html).  

Once the structures are available locally, one can start parsing them to do something useful.
Parsing can be done through the ```MMCIFParser``` object:

```
parser = MMCIFParser()
```

The ```parser``` object has several methods to deal with structures. One of these is ```get_structure```, which creates a ```PDB.Structure.Structure``` object with all the data present in the structure file.

The basic syntax is:
```
structure = parser.get_structure(pdb_code, filename)
```

where ```pdb_code``` is the PDB code of the structure contained in the file ```filename```. The method returns a ```PDB.Structure.Structure``` that contains one or more models.  


A ```Structure``` consists of a collection of one or more ```Model``` objects (different 3D conformations of the very same structure). Each model is a collection of ```Chain``` objects, each chain is a collection of ```Residue``` objects and each residue is a collection of ```Atom``` objects. Look in the documentation to get the information on each of these classes. This is the diagram of a structure:

![](img/pract11/structure1.png)

Given a ```Structure``` we can obtain iterators to **models**, **chains**, **residues** or **atoms** with:

```
Structure.get_models() 
Structure.get_chains()
Structure.get_residues()
Structure.get_atoms()
```

For each model obtained with ```structure.get_models()``` function we can loop through its chains, residues and atoms. For atoms we can get the 3D coordinates with ```Atom.get_coord()```.

**Example:**
Let's loop through all the models, chains, residues and atoms of the DNA polymerase structure 3C2L. Print the 3D coordinates of each atom.

In [None]:
from Bio.PDB import *


parser = MMCIFParser(QUIET=True) #To disable warnings


filename = "file_samples/3c2l.cif"
structure = parser.get_structure("3c2l", filename)

for model in structure.get_models():
    print("model", model, "has {} chains".format(len(model)))
    
    for chain in model:
        print(" - chain ", chain, "has {} residues".format(len(chain)))
        
        for residue in chain:
            print ("      - residue", residue.get_resname(), "has {} atoms".format(len(residue)))
            
            for atom in residue:
                x,y,z = atom.get_coord()
                print("        - atom:", atom.get_name(), "x: {} y:{} z:{}".format(x,y,z))
                
                

Once we have the coordinates of an atom we can compute distances between atoms or angles. We can also align two structures by rototranslating them to minimize their distance. We will not cover this and many other features that are provided by Biopython, such as phylogenetic analysis tools, interfaces towards pathways in KEGG, clustering, etc. If you are interested, you can find all the available features in the [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf).
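
To make the idea of distances and angles concrete, here is a minimal sketch in plain Python. It works on (x, y, z) tuples such as those obtained from ```Atom.get_coord()```; the helper names `distance` and `angle` and the coordinates are made up for illustration.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def angle(p, q, r):
    """Angle in degrees at vertex q formed by the points p, q and r."""
    v1 = [a - b for a, b in zip(p, q)]
    v2 = [a - b for a, b in zip(r, q)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos_theta = dot / (distance(p, q) * distance(r, q))
    # Clamp to [-1, 1] to guard against rounding errors
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Made-up coordinates of three atoms
a1, a2, a3 = (3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (0.0, 2.0, 0.0)
print("distance a1-a2:", distance(a1, a2))
print("angle a1-a2-a3:", angle(a1, a2, a3))
```

In ```Bio.PDB``` the distance is also available directly: the minus operator between two ```Atom``` objects (```atom1 - atom2```) returns the distance between them.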

## Exercises


1. Write a python function ```retrieve_sequences(search_term, number, outfile)``` that retrieves the first ```number``` sequences from NCBI's "nucleotide" database matching the search term ```search_term``` (hint: use the term and retmax parameters of Entrez.esearch) and stores them in a fasta file ```outfile``` (hint: use SeqIO.write). Test your code retrieving the first 5 entries having search term "starch AND Malus Domestica [Organism]".

<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [None]:
%reset -f

from Bio import Entrez
from Bio import SeqIO

def retrieve_sequences(search_term, number, outfile):
    Entrez.email = "my_email"
    handle = Entrez.esearch(db="nucleotide", 
                            term=search_term, 
                            retmax=number) #retrieve the requested number of entries
    res = Entrez.read(handle)
    handle.close()
    records = []
    for el in res["IdList"]:
        handle = Entrez.efetch(db="nucleotide", 
                               id=el, 
                               rettype = "gb", 
                               retmode="text")
        my_seq = SeqIO.read(handle, format = "genbank")
        handle.close()
        records.append(my_seq)
    N = SeqIO.write(records, outfile, "fasta")
    print("Search term was: ", search_term)
    print("{} sequences written to {}".format(N,outfile))
    
s_term = "starch AND Malus Domestica [Organism]"
retrieve_sequences(s_term, 5, "file_samples/starch_sequences.fasta")


</div>

2. Write a python function that aligns the sequences in the file created at point 1. ([here](file_samples/starch_sequences.fasta) you can find mine) against the NCBI nt database limiting the hits to the Malus Domestica organism (parameter entrez_query='"Malus Domestica" [Organism]' in qblast) and prints to screen the following info for each hsp: 
    1. The title;
    2. Score and e-value;
    3. The number of alignments on the same subject, the number of identities and positives and the alignment length;
    4. The number of mismatches and the list of their positions (hint: you can use the match string and look for " ").
   

    
<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [None]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

fasta_string = open("file_samples/starch_sequences.fasta").read()

res_handle = NCBIWWW.qblast("blastn", "nt", fasta_string, 
                            entrez_query='"Malus Domestica" [Organism]'
                           )

for align in NCBIXML.parse(res_handle):
    
    for a in align.alignments:
        print("Align Title:{}".format(a.title))

        for h in a.hsps:
            s = h.score
            e = h.expect
            n = h.num_alignments
            i = h.identities
            p = h.positives
            m = h.match 
            al = h.align_length
            misM = [str(x) for x in range(len(m)) if m[x] == " "]
            print("Score: {} E-val: {}".format(s,e))
            print("N.aligns:{} Ident:{} Pos.:{} Align len:{}".format(
                n,i,p,al))
            if len(misM):
                print("Num mismatches:",len(misM))
                print("Mismatch pos:", ",".join(misM))
            else:
                print("No mismatches")
            print("")
            

res_handle.close()



</div>

3. Write a python function ```getPublicationInfo(title_term,other_term)``` that retrieves the first 20 pubmed publications having the ```title_term``` in the title and ```other_term``` somewhere else in the text (hint: use "[Title]" and "[Other Term]" in the esearch parameter term). For each publication print: 
    1. the title
    2. authors 
    3. journal 
    4. year of publication (hint: get and split properly the "PubDate" entry)
    5. a link to the pubmed entry (hint: it is the string "https://www.ncbi.nlm.nih.gov/pubmed/" followed by the pubmed id, i.e. the "eid" entry of the dictionary "ArticleIds"), e.g. https://www.ncbi.nlm.nih.gov/pubmed/26919684

Hint: to see how to combine search terms test them here: [https://www.ncbi.nlm.nih.gov/pubmed/advanced](https://www.ncbi.nlm.nih.gov/pubmed/advanced).

Test your code calling ```getPublicationInfo("apple","drought")```

<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [None]:

from Bio import Entrez

def getPublicationInfo(title_term,other_term): 
    Entrez.email = "my_email"
    s_term = title_term + " [Title] AND " + other_term + " [Other Term]"
    handle = Entrez.esearch(db="pubmed", term=s_term, retmax=20) #first 20 publications
    res = Entrez.read(handle)
#uncomment to see all info
#    for el in res.keys():
#        print(el , " : ", res[el])
#
#    print("")
    for ids in res["IdList"]:    
        handle = Entrez.esummary(db="pubmed",  id = ids)
        res = Entrez.read(handle)
        #uncomment to see all info
        #print(res)
        for r in res:
            print(r["Title"])
            print(",".join(r["AuthorList"]))
            print(r["Source"])
            print(r["PubDate"].split()[0])
            print("https://www.ncbi.nlm.nih.gov/pubmed/" + r["ArticleIds"]["eid"])
            print("")
            
getPublicationInfo("apple","drought")

</div>

4. Write some python code to retrieve the structure of two forms of the aspartate transcarbamoylase (PDB ids: 4FYW and 1D09). If you are interested, read more about the Aspartate Transcarbamoylase [here](http://pdb101.rcsb.org/motm/215). Write a function that gets the .cif file name and prints:

    1. the number of chains, residues and atoms present in the file;
    2. a histogram of the residues (plotting it with matplotlib) that are not water (encoded as "HOH");
    3. a link to an online tool to visualize the 3D structure. The link will be "http://www.rcsb.org/pdb/ngl/ngl.do?pdbid=" followed by the PDB id of the protein (e.g. 1d09).

<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [None]:
from Bio.PDB import *
import matplotlib.pyplot as plt

def printCifInfo(filename):
    
    parser = MMCIFParser(QUIET=True) #To disable warnings
    id = filename.split("/")[1].split(".")[0]

    structure = parser.get_structure(id, filename)
    chains = structure.get_chains()
    residues = structure.get_residues()
    
    atoms = structure.get_atoms()
    res_histo = {}
    resCnt = 0 #need this because while reading the residues 
               #I am pulling stuff out of the iterator
    for res in residues:
        rname = res.get_resname()
        if(rname != "HOH"):
            if( rname not in res_histo):
                res_histo[rname] = 1
            else:
                res_histo[rname] += 1
        resCnt += 1    
    plt.figure(figsize=(15,5))
    plt.bar(res_histo.keys(), res_histo.values())
    plt.show()
    print("Number of chains: {}".format(len(list(chains))))
    print("Number of residues: {}".format(resCnt))
    print("Number of atoms: {}".format(len(list(atoms))))
    print("http://www.rcsb.org/pdb/ngl/ngl.do?pdbid=" + id) 

pdbl = PDBList()
structures = ["1D09", "4FYW"]
el = pdbl.download_pdb_files(structures, file_format = "mmCif", pdir = "file_samples/")

printCifInfo("file_samples/1d09.cif")
printCifInfo("file_samples/4fyw.cif")

</div>