# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 7: BioPython - Alignments and BLAST

1. Sequence Alignment
2. Clustalw
    - Alignment I/O
    - `AlignInfo` module
3. BLAST
    - BLAST Queries Using BioPython

#### Requirements

- Python 2.7
- `Bio` (BioPython) module
- Clustalw command-line program 
    - [http://www.clustal.org/clustal2/](http://www.clustal.org/clustal2/)
- Data Files
    - `./data/egfr.fasta`

## Sequence Alignment

By aligning biological sequences we can identify functional, structural, or evolutionary relationships between sequences. 

- Pairwise alignment algorithms compare two sequences
    - Smith-Waterman (local alignment)
    - Needleman-Wunsch (global alignment)
- Multiple sequence alignment (MSA) algorithms compare three or more sequences
    - clustalw
    - muscle
    - tcoffee
    
BioPython supports both types of alignment through its `Align` module.
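To make the pairwise case concrete, here is a minimal pure-Python sketch of Needleman-Wunsch global alignment. It is not BioPython's implementation; the function name `needleman_wunsch` and the scoring values (match = 1, mismatch = -1, linear gap = -1) are assumptions chosen for illustration only.

```python
from __future__ import print_function


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""

    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # (mis)match
                score[i - 1][j] + gap,                                 # gap in b
                score[i][j - 1] + gap,                                 # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Smith-Waterman differs mainly in that cell scores are floored at zero and the traceback starts from the highest-scoring cell, which yields a local rather than a global alignment.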

## Clustalw

The Clustalw MSA program builds the alignment progressively, in three steps:

1. Perform pairwise alignment of all sequences
2. Construct a guide tree in which the most similar sequences share a branch
3. Align the sequences progressively, merging branches of the guide tree in order of decreasing similarity
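The order in which groups are merged can be sketched in a few lines of Python. This is only an illustration of the idea, not Clustalw's actual method: it assumes `difflib.SequenceMatcher.ratio()` as a stand-in for a real pairwise alignment score and uses a greedy average-similarity merge instead of a proper guide-tree algorithm. The names `pairwise_similarity`, `merge_order`, and the toy sequences are hypothetical.

```python
from __future__ import print_function, division
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(seqs):
    """Return {(name_a, name_b): similarity in [0, 1]} for every pair of sequences."""
    return {
        (a, b): SequenceMatcher(None, seqs[a], seqs[b]).ratio()
        for a, b in combinations(sorted(seqs), 2)
    }


def merge_order(seqs):
    """Greedily merge the two most similar groups until a single group remains."""
    sim = pairwise_similarity(seqs)
    groups = [(name,) for name in sorted(seqs)]
    order = []

    def group_similarity(g1, g2):
        # Average similarity over all cross-group sequence pairs
        pairs = [tuple(sorted((a, b))) for a in g1 for b in g2]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(groups) > 1:
        g1, g2 = max(combinations(groups, 2),
                     key=lambda pair: group_similarity(*pair))
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        order.append((g1, g2))
    return order


# Toy sequences (made up for this example)
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
order = merge_order(toy)
for step, (left, right) in enumerate(order, 1):
    print(step, left, "+", right)
```

With four input sequences such a procedure performs three merges, which corresponds to the "There are 3 groups" line in the Clustalw output.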

In [1]:
from Bio.Align.Applications import ClustalwCommandline

In [2]:
## Create the Clustalw command
command = ClustalwCommandline("clustalw2", infile="./data/egfr.fasta")
print command

clustalw2 -infile=./data/egfr.fasta


In [3]:
## Run the Clustalw command
stdout, stderr = command()

In [4]:
print stdout




 CLUSTAL 2.1 Multiple Sequence Alignments


Sequence format is Pearson
Sequence 1: sp|P00533|EGFR_HUMAN    1210 aa
Sequence 2: sp|P00533-2|EGFR_HUMAN   405 aa
Sequence 3: sp|P00533-3|EGFR_HUMAN   705 aa
Sequence 4: sp|P00533-4|EGFR_HUMAN   628 aa
Start of Pairwise alignments
Aligning...

Sequences (1:2) Aligned. Score:  99
Sequences (1:3) Aligned. Score:  89
Sequences (1:4) Aligned. Score:  99
Sequences (2:3) Aligned. Score:  99
Sequences (2:4) Aligned. Score:  99
Sequences (3:4) Aligned. Score:  99
Guide tree file created:   [./data/egfr.dnd]

There are 3 groups
Start of Multiple Alignment

Aligning...
Group 1: Sequences:   2      Score:13892
Group 2: Sequences:   2      Score:8903
Group 3: Sequences:   4      Score:14198
Alignment Score 20210

CLUSTAL-Alignment file created  [./data/egfr.aln]




In [5]:
print stderr




#### Passing parameters to Clustalw

Clustalw has several parameters that can significantly affect the alignment results (e.g. gap penalties).

[http://www.clustal.org/download/clustalw_help.txt](http://www.clustal.org/download/clustalw_help.txt)

Run the following at the command line to see all available options:

    clustalw2 -help

In [6]:
command = ClustalwCommandline("clustalw2", infile="egfr.fasta")
command.gapopen = 5
command.gapext = 3
print command

clustalw2 -infile=egfr.fasta -gapopen=5 -gapext=3
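As the printed command shows, the wrapper object only assembles a command-line string. If the wrapper class is not available in your BioPython installation, the same command can be assembled with the standard library. The helper `clustalw_args` below is hypothetical (not part of BioPython), and the sketch assumes that every option follows the `-name=value` form seen in the output above.

```python
from __future__ import print_function


def clustalw_args(executable, infile, options=()):
    """Build an argument list such as ['clustalw2', '-infile=...', '-gapopen=5']."""
    args = [executable, "-infile={0}".format(infile)]
    for name, value in options:
        args.append("-{0}={1}".format(name, value))
    return args


args = clustalw_args("clustalw2", "egfr.fasta",
                     options=[("gapopen", 5), ("gapext", 3)])
print(" ".join(args))

# To actually run the command (requires clustalw2 on the PATH):
# import subprocess
# output = subprocess.check_output(args)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.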


### Alignment I/O

BioPython stores the results of an MSA in `MultipleSeqAlignment` objects. These objects can be created by reading the results of an alignment from a file using the `AlignIO` module.

Supported formats: [http://biopython.org/wiki/AlignIO](http://biopython.org/wiki/AlignIO)

In [7]:
from Bio import AlignIO

## Read the alignment results from the ClustalW output file
egfr_alignment = AlignIO.read("./data/egfr.aln", "clustal")

## egfr_alignment is now a BioPython MultipleSeqAlignment object
print egfr_alignment

SingleLetterAlphabet() alignment with 4 rows and 1210 columns
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-3|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-4|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...IGA sp|P00533|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-2|EGFR_HUMAN


In [8]:
## MultipleSeqAlignment objects are iterable objects with one element per sequence
len(egfr_alignment)

4

In [9]:
## iterate through an alignment object
for seq in egfr_alignment:
    print seq.seq

MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYG---PGNE------------------SLKAMLFC-----LFK-----------------------LSSCNQSND-----------------------GSVSH--------------------QSGSPAA-----QES-----------------CLG----WIPSLLP----------------SEFQLGWG-----GCSHLHAWP------SASVIITASSCH------------------------------------------------------------------------------------------------------------------------------------------------------

In [10]:
## Use indices to retrieve sequences
egfr_alignment[0]

SeqRecord(seq=Seq('MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRM...---', SingleLetterAlphabet()), id='sp|P00533-3|EGFR_HUMAN', name='<unknown name>', description='sp|P00533-3|EGFR_HUMAN', dbxrefs=[])

In [11]:
## Use slice notation to retrieve columns of the alignment
## For instance, to see agreement at a specific location
egfr_alignment[:,750]

'G-T-'

In [12]:
print egfr_alignment[:,750:775]

SingleLetterAlphabet() alignment with 4 rows and 25 columns
GSPAA-----QES------------ sp|P00533-3|EGFR_HUMAN
------------------------- sp|P00533-4|EGFR_HUMAN
TSPKANKEILDEAYVMASVDNPHVC sp|P00533|EGFR_HUMAN
------------------------- sp|P00533-2|EGFR_HUMAN
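Column slicing treats the alignment as a grid of characters with one row per sequence. The idea can be sketched without BioPython by storing the aligned rows as plain strings; the toy rows and the helper names `column` and `conserved_columns` are assumptions made for this example.

```python
from __future__ import print_function

# Toy aligned rows (made up); all rows must have the same length
rows = ["MRPSG-TA",
        "MRPSGATA",
        "MRP-GATA"]


def column(rows, index):
    """Return the characters found at one alignment position, one per row."""
    return "".join(row[index] for row in rows)


def conserved_columns(rows):
    """Return indices of columns where every row has the same non-gap character."""
    return [i for i in range(len(rows[0]))
            if len(set(column(rows, i))) == 1 and rows[0][i] != "-"]


print(column(rows, 5))
print(conserved_columns(rows))
```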


#### Reading and Writing Alignments

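Similar to `SeqIO`, the `AlignIO` module provides a `parse()` function that returns a generator, so a file containing many alignments can be read one alignment at a time instead of being loaded into memory all at once. Alignments can also be written to a file in a variety of formats; `AlignIO.write()` returns the number of alignments written.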

In [13]:
## Use a generator to read an alignment file
alignments = AlignIO.parse("./data/egfr.aln", "clustal")

for alignment in alignments:
    print alignment

SingleLetterAlphabet() alignment with 4 rows and 1210 columns
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-3|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-4|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...IGA sp|P00533|EGFR_HUMAN
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTF...--- sp|P00533-2|EGFR_HUMAN


In [14]:
## Write an alignment to a file
AlignIO.write(egfr_alignment, "example.stk", "stockholm")

1

### `AlignInfo` module

The `AlignInfo` module can generate a consensus sequence from an alignment. The consensus sequence contains the most common letter at each alignment position.

The `SummaryInfo` class in `AlignInfo` provides two methods for generating a consensus sequence:

`dumb_consensus()` counts the residues at each position, ignoring gaps. If the most common residue's share of the non-gap characters exceeds the threshold, that residue is added to the consensus; otherwise the ambiguous character is added.

`gap_consensus()` works the same way, except that gap characters are counted like any other residue, so the consensus itself can contain gaps.
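The thresholding logic can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not BioPython's code: the function name `simple_consensus`, the `count_gaps` flag, the tie handling, and the toy rows are all hypothetical.

```python
from __future__ import print_function, division
from collections import Counter


def simple_consensus(rows, threshold=0.7, ambiguous="X", count_gaps=False):
    """Return a threshold-based consensus string for equal-length aligned rows."""
    consensus = []
    for i in range(len(rows[0])):
        col = [row[i] for row in rows]
        if not count_gaps:
            col = [c for c in col if c != "-"]  # ignore gaps entirely
        if not col:
            consensus.append(ambiguous)
            continue
        counts = Counter(col).most_common()
        top_char, top_count = counts[0]
        # A tie for first place means there is no single most common character
        tied = len(counts) > 1 and counts[1][1] == top_count
        if not tied and top_count / len(col) >= threshold:
            consensus.append(top_char)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Toy alignment (made up)
toy_rows = ["ACG-T-",
            "ACGAT-",
            "AC-AA-",
            "AT--AT"]
without_gaps = simple_consensus(toy_rows, threshold=0.6)
with_gaps = simple_consensus(toy_rows, threshold=0.6, count_gaps=True)
print(without_gaps)
print(with_gaps)
```

In the last toy column three of the four rows hold a gap: ignoring gaps yields the single remaining letter, while counting gaps yields `-`. The same pattern appears in the two EGFR consensus outputs in this section.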

In [15]:
from Bio.Align import AlignInfo

In [16]:
summary = AlignInfo.SummaryInfo(egfr_alignment)

In [17]:
help(summary.dumb_consensus)

Help on method dumb_consensus in module Bio.Align.AlignInfo:

dumb_consensus(self, threshold=0.7, ambiguous='X', consensus_alpha=None, require_multiple=0) method of Bio.Align.AlignInfo.SummaryInfo instance
    Output a fast consensus sequence of the alignment.
    
    This doesn't do anything fancy at all. It will just go through the
    sequence residue by residue and count up the number of each type
    of residue (ie. A or G or T or C for DNA) in all sequences in the
    alignment. If the percentage of the most common residue type is
    greater then the passed threshold, then we will add that residue type,
    otherwise an ambiguous character will be added.
    
    This could be made a lot fancier (ie. to take a substitution matrix
    into account), but it just meant for a quick and dirty consensus.
    
    Arguments:
        - threshold - The threshold value that is required to add a particular
          atom.
        - ambiguous - The ambiguous character to be added when the 

In [18]:
summary.dumb_consensus(threshold=0.6, ambiguous='X')

Seq('MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRM...IGA', SingleLetterAlphabet())

In [19]:
## View the dumb_consensus sequence
str(summary.dumb_consensus(threshold=0.6, ambiguous='X'))

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGXEGCPTNGPKIPSIATGMVGXLXXXLXXALGIGLFXRRRHIVRKRTLRRLLQERELVEPLXXXXXXXXQALLRILKETEFKKIKVLGSGAFGXVXXGLWIPEGEKVKIPVAIKELRXXXSPXANKEILXEXYVMASVDNPHVCRLLGICLXSTVQXIXXLXPFGCLLDYVREHKDNIGSXXXLXWXVQIAKGXXXLXXXXLVHRDLXAXXXXXXXXXHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYR

In [20]:
## View the gap_consensus sequence
str(summary.gap_consensus(threshold=0.6, ambiguous='X'))

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYG---XXXX------------------XXXXXXXX-----XXX-----------------------XXXXXXXXX-----------------------XXXXX--------------------XXXXXXX-----XXX-----------------XXX----XXXXXXX----------------XXXXXXXX-----XXXXXXXXX------XXXXXXXXXXXX-----------------------------------------------------------------------------------------------------------------------------------------------------

## BLAST: Basic Local Alignment Search Tool

- Developed in 1990 by researchers at NCBI, Penn State, and the University of Arizona (biologists and computer scientists)
- Emphasizes speed over accuracy, which is necessary when searching databases containing hundreds of millions of sequences
- [http://blast.ncbi.nlm.nih.gov/Blast.cgi](http://blast.ncbi.nlm.nih.gov/Blast.cgi)

#### The BLAST Algorithm

1. Break query sequence into "words" (substrings)
2. Search a list of indexed words for similarity
3. Use thresholding to eliminate poor matches
4. Search database for sequences associated with high-scoring "words"
5. Extend matching alignments and score the sequence's similarity to quantify high-scoring sequence pairs (HSPs)
6. Combine HSPs if mapping to the same database sequence
7. Display scores and Smith-Waterman alignment for high-scoring matches
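The first steps of the list above (splitting into words, indexing, and finding seed hits) can be sketched as follows. This is a simplification: it only finds exact word matches, whereas BLAST also scores similar words and applies a threshold. The function names and the toy database are assumptions made for illustration.

```python
from __future__ import print_function
from collections import defaultdict


def words(seq, k=3):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k=3):
    """Map each word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for offset, word in enumerate(words(seq, k)):
            index[word].append((name, offset))
    return index


def seed_hits(query, index, k=3):
    """Return (query offset, sequence name, database offset) for exact word matches."""
    hits = []
    for q_offset, word in enumerate(words(query, k)):
        for name, db_offset in index.get(word, []):
            hits.append((q_offset, name, db_offset))
    return sorted(hits)


# Toy database (made up)
database = {"db1": "MRPSGTAG", "db2": "TTTSGTAA"}
index = build_index(database)
hits = seed_hits("PSGT", index)
print(hits)
```

Each seed hit would then be extended in both directions to form a high-scoring sequence pair.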

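The result of the algorithm is an E-value for each matching sequence. The E-value is the number of alignments with an equal or better score that would be expected to occur by chance in a database of that size. It is not itself a probability, although for small values (below about 0.01) it is nearly equal to the corresponding p-value.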

Details about the sequence similarity scores can be found here: [http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html](http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html)
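For reference, the tutorial linked above gives the relationship between a raw score S and the E-value as E = K·m·n·e^(-λS), where m and n are the query and database lengths. A small sketch of that formula follows. The function names are hypothetical, and the values used for λ and K are illustrative placeholders rather than numbers taken from this lesson; BLAST itself also applies length corrections that are omitted here.

```python
from __future__ import print_function, division
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of seeing at least one chance alignment, given E."""
    return 1.0 - math.exp(-e)


# Illustrative parameter values (assumptions, not from this document)
LAMBDA, K = 0.267, 0.041
e = e_value(raw_score=100, query_len=1210, db_len=1e9, lam=LAMBDA, k=K)
print(e, bit_score(100, LAMBDA, K), p_value(e))
```

Because E scales linearly with the database length, the same alignment score is less significant when it is found in a larger database.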

### BLAST Queries Using BioPython

BioPython supports running BLAST queries either locally or over the internet. To run BLAST locally you will need to install the BLAST programs and sequence databases on your machine. The advantages of running queries locally are faster searches and the ability to use custom databases. Use the following wrapper class for local BLAST:

    from Bio.Blast.Applications import NcbiblastpCommandline
    
We will focus on running BLAST queries over the internet, which requires less initial setup and provides access to NCBI's sequence databases. The `qblast()` function in the `Bio.Blast.NCBIWWW` module can be used to submit BLAST queries. It takes three required arguments:

1. The blast program (indicates the type of query)
    - blastp: amino acid sequence queried against protein database
    - blastn: nucleotide sequence queried against nucleotide database
    - blastx: translated nucleotide sequence queried against protein database
2. The database to search
    - [http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml](http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml)
3. The query sequence (a string)

The `qblast()` function returns a file-like handle to the results, which are XML formatted by default.

In [21]:
from Bio.Blast import NCBIWWW

In [22]:
## Get a query sequence
seq = str(summary.dumb_consensus(threshold=0.6, ambiguous='X'))
seq

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGXEGCPTNGPKIPSIATGMVGXLXXXLXXALGIGLFXRRRHIVRKRTLRRLLQERELVEPLXXXXXXXXQALLRILKETEFKKIKVLGSGAFGXVXXGLWIPEGEKVKIPVAIKELRXXXSPXANKEILXEXYVMASVDNPHVCRLLGICLXSTVQXIXXLXPFGCLLDYVREHKDNIGSXXXLXWXVQIAKGXXXLXXXXLVHRDLXAXXXXXXXXXHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYR

In [23]:
## WARNING -- this may take a while to run
## Run a BLAST query
result_handle = NCBIWWW.qblast('blastp', 'nr', seq)

## Write XML results to file
with open("blast_results.xml", 'w') as fh:
    fh.write(result_handle.read())

#### Reading BLAST XML Results

In [24]:
## Import module for parsing BLAST XML results
from Bio.Blast import NCBIXML

## Read BLAST results
with open('blast_results.xml') as fh:
    blast_record = NCBIXML.read(fh)

In [25]:
## Get descriptions of matches
len(blast_record.descriptions)

50

In [26]:
for description in blast_record.descriptions[0:3]:
    print description.title
    print description.score
    print description.e, '\n'

gi|61354647|gb|AAX41033.1| epidermal growth factor receptor, partial [synthetic construct]
6046.0
0.0 

gi|29725609|ref|NP_005219.2| epidermal growth factor receptor isoform a precursor [Homo sapiens] >gi|2811086|sp|P00533.2|EGFR_HUMAN RecName: Full=Epidermal growth factor receptor; AltName: Full=Proto-oncogene c-ErbB-1; AltName: Full=Receptor tyrosine-protein kinase erbB-1; Flags: Precursor >gi|11494380|gb|AAG35789.1|AF288738_4 p170 epidermal growth factor receptor [Homo sapiens] >gi|46241840|gb|AAS83109.1| epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) [Homo sapiens] >gi|61354737|gb|AAX41052.1| epidermal growth factor receptor [synthetic construct] >gi|61354745|gb|AAX41053.1| epidermal growth factor receptor [synthetic construct] >gi|119571347|gb|EAW50962.1| epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian), isoform CRA_a [Homo sapiens]
6044.0
0.0 

gi|261860248|dbj|BAI46646.1| epider

In [27]:
## Get alignment info about matches
len(blast_record.alignments)

50

In [28]:
for alignment in blast_record.alignments[0:2]:
    print alignment.title
    for hsp in alignment.hsps:
        print hsp.score
        print "Query: ", hsp.query, '\n'
        print "Match: ", hsp.match, '\n'
        print "Subject: ", hsp.sbjct, '\n'

gi|61354647|gb|AAX41033.1| epidermal growth factor receptor, partial [synthetic construct]
6046.0
Query:  MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGXEGCPTNGPKIPSIATGMVGXLXXXLXXALGIGLFXRRRHIVRKRTLRRLLQERELVEPLXXXXXXXXQALLRILKETEFKKIKVLGSGAFGXVXXGLWIPEGEKVKIPVAIKELRXXXSPXANKEILXEXYVMASVDNPHVCRLLGICLXSTVQXIXXLXPFGCLLDYVREHKDNIGSXXXLXWXVQIAKGXXXLXXXXLVHRDLXAXXXXXXXXXHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQ

## References

- Python for Bioinformatics, Sebastian Bassi, CRC Press (2010)
- [http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [http://biopython.org/DIST/docs/api/](http://biopython.org/DIST/docs/api/)
- Peter Cock et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics, *Bioinformatics* (2009)

#### Last Updated: 22-Sep-2016