# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 7: BioPython - Sequence Objects

1. BioPython History
2. Alphabet Objects
3. Sequence Objects
4. Sequence Records
5. Sequence I/O
    - Reading Sequences
    - Writing Sequences
    - Converting Sequence Formats

#### Requirements

- Python 3.x (Biopython 1.76 was the last release to support Python 2.7)
- `Bio` (BioPython) module (`conda install biopython`)
- Data Files:
    - `./data/P00533.fasta`
    - `./data/egfr.fasta`
    - `./data/egfr.gb`

In [1]:
from __future__ import print_function, division

## BioPython History

BioPython is a collection of Python modules developed to address bioinformatics problems.

- BioPython is free
- First released in 1999 by Jeff Chang and Brad Chapman
- Original design goals for BioPython:
    - import and parse biological data from a variety of sources (Entrez, PubMed, fasta, etc.) into a computer-usable format
    - use Python's strengths in OOP to represent biological sequences
    - provide standardized tools for analyzing biological data (visualization, statistics, machine learning)



[http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)

## Alphabet Objects (Removed in Sep. 2020)

[https://biopython.org/wiki/Alphabet](https://biopython.org/wiki/Alphabet)

You can now record molecule type in the `annotations` attribute of `SeqRecord` objects.
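
A minimal sketch of this pattern for Biopython 1.78 or later is shown here. The `UNAMBIGUOUS_DNA` set and the `is_unambiguous_dna()` helper are hypothetical names introduced for illustration and are not part of Biopython; only the `molecule_type` annotation key is a Biopython convention.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical stand-in for the letters of the old IUPAC.unambiguous_dna alphabet
UNAMBIGUOUS_DNA = set("GATC")


def is_unambiguous_dna(seq):
    """Return True if every letter of seq is G, A, T or C (case-insensitive)."""
    return set(str(seq).upper()) <= UNAMBIGUOUS_DNA


# Record the molecule type in the annotations dictionary instead of an alphabet
record = SeqRecord(
    Seq("CCTATGT"),
    id="example1",  # made-up identifier
    annotations={"molecule_type": "DNA"},
)

print(record.annotations["molecule_type"])
print(is_unambiguous_dna(record.seq))
print(is_unambiguous_dna(Seq("CCNATGT")))  # N is an ambiguity code
```

Because the check is ordinary Python, it is up to the calling code to run it; `Seq` itself no longer constrains which letters a sequence may contain.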

**For Biopython versions < 1.78**

Alphabets are used to describe specific types of biological sequences. The `Bio.Alphabet.IUPAC` module provides basic definitions of DNA, RNA, and protein sequences.

Alphabet objects:
- Are based on the IUPAC (International Union of Pure and Applied Chemistry) codes for nucleotides and amino acids ([http://www.bioinformatics.org/sms/iupac.html](http://www.bioinformatics.org/sms/iupac.html))
- Constrain allowable sequence data
- Allow code to make safe assumptions about sequence content

In [None]:
from Bio.Alphabet import IUPAC
print("DNA: " + IUPAC.unambiguous_dna.letters)
print("Ambiguous DNA: " + IUPAC.ambiguous_dna.letters)
print("RNA: " + IUPAC.unambiguous_rna.letters)
print("Ambiguous RNA: " + IUPAC.ambiguous_rna.letters)
print("Protein: " + IUPAC.protein.letters)

## Sequence Objects

The `Seq` object is BioPython's core class for biological sequences. `Seq` objects behave similarly to strings but have additional methods specific to biological sequences.

In [2]:
from Bio.Seq import Seq
myseq = Seq("CCTATGT")
len(myseq)

7

In [3]:
myseq[0:3]

Seq('CCT')

In [4]:
str(myseq)

'CCTATGT'

In [5]:
rnaseq = myseq.transcribe()
rnaseq

Seq('CCUAUGU')

### Sequence Object Methods

<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center"><code>Seq.transcribe()</code></td><td>Returns the RNA sequence for a DNA coding-strand sequence (T is replaced by U)</td></tr>
<tr><td style="text-align:center"><code>Seq.translate()</code></td><td>Returns the amino acid sequence translated from a DNA or RNA sequence</td></tr>
<tr><td style="text-align:center"><code>Seq.complement()</code></td><td>Returns the complement of a DNA or RNA sequence</td></tr>
<tr><td style="text-align:center"><code>Seq.back_transcribe()</code></td><td>Returns a DNA sequence from an RNA sequence</td></tr>
<tr><td style="text-align:center"><code>Seq.reverse_complement()</code></td><td>Returns the reverse complement of a DNA or RNA sequence</td></tr>
<tr><td style="text-align:center"><code>Seq.find("CG")</code></td><td>Returns the index of the first match of the specified substring; behaves the same as the <code>find()</code> method for Python strings.</td></tr>
<tr><td style="text-align:center"><code>Seq.count("G")</code></td><td>Returns the number of non-overlapping matches</td></tr>
<tr><td style="text-align:center"><code>str(Seq)</code></td><td>Returns a string version of the sequence</td></tr>
</table>
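
The methods in the table that are not exercised in the cells of this section can be sketched as follows. The sequence is a short made-up coding sequence chosen only so that it starts with `ATG` and ends with a stop codon.

```python
from Bio.Seq import Seq

# Made-up coding sequence: ATG GCC ATT GTA ATG GGC CGC TGA
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

complement = coding_dna.complement()                  # base-by-base complement
reverse_complement = coding_dna.reverse_complement()  # complement, read in reverse
mrna = coding_dna.transcribe()                        # T replaced by U
protein = coding_dna.translate()                      # stop codon shown as "*"
protein_to_stop = coding_dna.translate(to_stop=True)  # stop at the first stop codon

print(complement)
print(reverse_complement)
print(mrna)
print(protein)
print(protein_to_stop)
```

By default `translate()` uses the standard genetic code and keeps translating past stop codons, which is why `to_stop=True` is often what you want for a single coding sequence.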

In [6]:
str(myseq)

'CCTATGT'

In [7]:
rnaseq.back_transcribe()

Seq('CCTATGT')

In [8]:
myseq.find("AT")

3

In [9]:
help(myseq.find)

Help on method find in module Bio.Seq:

find(sub, start=None, end=None) method of Bio.Seq.Seq instance
    Return the lowest index in the sequence where subsequence sub is found.
    
    With optional arguments start and end, return the lowest index in the
    sequence such that the subsequence sub is contained within the sequence
    region [start:end].
    
    Arguments:
     - sub - a string or another Seq or MutableSeq object to search for
     - start - optional integer, slice start
     - end - optional integer, slice end
    
    Returns -1 if the subsequence is NOT found.
    
    e.g. Locating the first typical start codon, AUG, in an RNA sequence:
    
    >>> from Bio.Seq import Seq
    >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
    >>> my_rna.find("AUG")
    3
    
    The next typical start codon can then be found by starting the search
    at position 4:
    
    >>> my_rna.find("AUG", 4)
    15



In [10]:
myseq.count("T")

3
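
Because `count()` only counts non-overlapping matches, it can under-report repeats. The short sketch below contrasts it with `count_overlap()`, which is available in recent Biopython releases; the sequence is made up for illustration.

```python
from Bio.Seq import Seq

repeat = Seq("AAAA")

# Non-overlapping matches of "AA": positions 0 and 2
non_overlapping = repeat.count("AA")

# Overlapping matches of "AA": positions 0, 1 and 2
overlapping = repeat.count_overlap("AA")

print(non_overlapping, overlapping)
```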

### Mutable `Seq` Objects

`Seq` objects are immutable, like Python strings. Use the `MutableSeq` class when a sequence must be edited in place.

In [11]:
from Bio.Seq import MutableSeq 
mutseq = MutableSeq(myseq)
mutseq

MutableSeq('CCTATGT')

In [12]:
mutseq[0] = "T"
mutseq.extend("AAATGC")
mutseq

MutableSeq('TCTATGTAAATGC')

In [13]:
## Create a Seq object from a MutableSeq object
newseq = Seq(mutseq)
newseq

Seq('TCTATGTAAATGC')
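
As a small sketch of the difference, the following tries the same item assignment on a `Seq` and on a `MutableSeq`. The variable names are illustrative only.

```python
from Bio.Seq import Seq, MutableSeq

fixed = Seq("CCTATGT")

# Item assignment on an immutable Seq is rejected
try:
    fixed[0] = "T"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# The same edit works on a MutableSeq, which is modified in place
editable = MutableSeq("CCTATGT")
editable[0] = "T"
editable.append("A")

# Convert back to an immutable Seq once editing is finished
frozen = Seq(editable)

print(assignment_failed)
print(editable)
print(frozen)
```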

## Sequence Records

`SeqRecord` objects support additional annotation information associated with a biological sequence (genomic annotation).

- Structural Annotations:
    - ORFs
    - Gene structure
    - Coding regions
    - Genomic location
    - Regulatory motifs
- Functional Annotations:
    - Biological/biochemical functions
    - Molecular interactions
    - Regulation
    - Expression
    - Pathways

#### `SeqRecord` Attributes
<br />
<table align="left">
<tr><td style="text-align:center"><b>Attribute</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center"><code>seq</code></td><td>The sequence itself, typically a <code>Seq</code> object</td></tr>
<tr><td style="text-align:center"><code>id</code></td><td>The primary sequence ID (a string)</td></tr>
<tr><td style="text-align:center"><code>name</code></td><td>The common name for the sequence (a string)</td></tr>
<tr><td style="text-align:center"><code>description</code></td><td>A human readable description of the sequence (a string)</td></tr>
<tr><td style="text-align:center"><code>letter_annotations</code></td><td>Per-letter annotations (e.g. quality scores). These annotations are stored as a dictionary, where keys describe the annotation and values are a sequence (list, tuple, string) of the same length as the <code>Seq</code> object.</td></tr>
<tr><td style="text-align:center"><code>annotations</code></td><td>A dictionary containing additional information about the sequence</td></tr>
<tr><td style="text-align:center"><code>features</code></td><td>A list containing <code>SeqFeature</code> objects. <code>SeqFeatures</code> allow structured annotation of specific locations within a biological sequence (e.g. exons, binding sites, etc.) (<a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec%3Aseq_features">http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec%3Aseq_features</a>)</td></tr>
<tr><td style="text-align:center"><code>dbxrefs</code></td><td>A list of database cross-references as strings</td></tr>
</table>

In [14]:
from Bio.SeqRecord import SeqRecord

In [15]:
## Create a sequence
simpleseq = Seq("AAAGCT")

## Create a sequence record
simpleseq_rec1 = SeqRecord(simpleseq)

In [16]:
## Set sequence record attributes
simpleseq_rec1.id = "4321"
simpleseq_rec1.description = "A simple sequence example"

## Alternatively, we can set attributes when we create the record
simpleseq_rec2 = SeqRecord(simpleseq, id="4321", description="A simple sequence example")

print(simpleseq_rec1)
print()
print(simpleseq_rec2)

ID: 4321
Name: <unknown name>
Description: A simple sequence example
Number of features: 0
Seq('AAAGCT')

ID: 4321
Name: <unknown name>
Description: A simple sequence example
Number of features: 0
Seq('AAAGCT')


In [17]:
## Set other SeqRecord attributes
simpleseq_rec1.annotations["Organism"] = "Homo sapiens"
simpleseq_rec1.annotations["molecule_type"] = "DNA"
print(simpleseq_rec1.annotations)

simpleseq_rec1.letter_annotations["qualities"] = [40,40,38,10,20,40]
print(simpleseq_rec1.letter_annotations)

{'Organism': 'Homo sapiens', 'molecule_type': 'DNA'}
{'qualities': [40, 40, 38, 10, 20, 40]}
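
The length requirement on per-letter annotations can be sketched as follows: a value of the wrong length is rejected, and slicing a record slices its per-letter annotations along with the sequence. The `too_short` key is a made-up name used only to trigger the check.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("AAAGCT"), id="4321")
record.letter_annotations["qualities"] = [40, 40, 38, 10, 20, 40]

# A per-letter annotation must have the same length as the sequence
try:
    record.letter_annotations["too_short"] = [40, 40]
    length_checked = False
except TypeError:
    length_checked = True

# Slicing a SeqRecord returns a new SeqRecord with matching annotations
first_three = record[0:3]

print(length_checked)
print(first_three.seq)
print(first_three.letter_annotations["qualities"])
```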


In [18]:
simpleseq_rec1.seq

Seq('AAAGCT')

In [19]:
## Print the SeqRecord in a specified format
print(simpleseq_rec1.format("fasta"))

>4321 A simple sequence example
AAAGCT
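
The `features` attribute is not used in the cells of this section, so here is a hedged sketch of how a `SeqFeature` might be attached to a record and used to pull out a subsequence. The sequence, identifier and feature coordinates are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Made-up sequence: GGG | ATG GCC TAA | GGG
record = SeqRecord(Seq("GGGATGGCCTAAGGG"), id="demo1", description="made-up example")

# Locations use Python-style coordinates: start is inclusive, end is exclusive
cds = SeqFeature(FeatureLocation(3, 12, strand=+1), type="CDS")
record.features.append(cds)

# extract() returns the part of the parent sequence covered by the feature
cds_seq = cds.extract(record.seq)
protein = cds_seq.translate()

print(len(record.features))
print(cds_seq)
print(protein)
```

Records parsed from annotated formats such as GenBank arrive with this list already populated.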



## Sequence I/O

The `SeqIO` module provides tools for working with various sequence file formats. This module allows you to read and parse sequence files, convert between file formats and write sequence records to a file.

Supported formats:
[http://biopython.org/wiki/SeqIO](http://biopython.org/wiki/SeqIO)

<b>Note: not all sequence formats are fully compatible. For instance, some formats require quality scores or enforce length limits on identifiers. Information may also be lost when converting to another format and then back to the original format.</b>
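
One concrete case of this incompatibility can be sketched with FASTQ, which requires per-letter quality scores. The record below is made up, and `io.StringIO` stands in for a real output file so the example does not touch the disk.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("AAAGCT"), id="4321", description="made-up record")

# Without quality scores the FASTQ writer refuses to write the record
try:
    SeqIO.write(record, StringIO(), "fastq")
    failed_without_qualities = False
except ValueError:
    failed_without_qualities = True

# FASTQ output needs PHRED scores stored under the "phred_quality" key
record.letter_annotations["phred_quality"] = [40, 40, 38, 10, 20, 40]
handle = StringIO()
SeqIO.write(record, handle, "fastq")
fastq_text = handle.getvalue()

print(failed_without_qualities)
print(fastq_text)
```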

### Reading Sequences

In [20]:
from Bio import SeqIO

## Open a sequence file; the with statement closes it automatically
with open("./data/P00533.fasta", 'r') as fh:
    ## SeqIO.read() parses a file containing exactly one record
    ## and returns a sequence record object
    fasta_rec = SeqIO.read(fh, "fasta")

print(fasta_rec)

ID: sp|P00533|EGFR_HUMAN
Name: sp|P00533|EGFR_HUMAN
Description: sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
Number of features: 0
Seq('MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRM...IGA')


In [21]:
## Read multiple sequences from a single file 
record_list = []

with open("./data/egfr.fasta") as fh:
    ## SeqIO.parse() returns an iterator that yields a SeqRecord for each sequence in the file
    records = SeqIO.parse(fh, "fasta")
    
    for record in records:
        record_list.append(record)

print("Number of records: ", len(record_list))
print(record_list[3])

Number of records:  4
ID: sp|P00533-4|EGFR_HUMAN
Name: sp|P00533-4|EGFR_HUMAN
Description: sp|P00533-4|EGFR_HUMAN Isoform 4 of Epidermal growth factor receptor OS=Homo sapiens GN=EGFR
Number of features: 0
Seq('MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRM...YGS')
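
When records need to be looked up by ID rather than by position, `SeqIO.to_dict()` is one option. The sketch below parses a made-up two-record FASTA string through `io.StringIO` instead of a file on disk.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content standing in for a multi-record file
fasta_text = (
    ">seq1 first made-up record\n"
    "ACGT\n"
    ">seq2 second made-up record\n"
    "GGCCAA\n"
)

# list() consumes the iterator and keeps the records in file order
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# to_dict() builds a dictionary keyed by record ID
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(len(records))
print(by_id["seq2"].seq)
```

Both approaches load every record into memory, so for very large files it is usually better to loop over the iterator directly.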


### Writing Sequences

In [22]:
## Write a SeqRecord to a file
with open('newseq.fasta', 'w') as fh:
    SeqIO.write(simpleseq_rec1, fh, "fasta")

In [23]:
## SeqIO.write() can also take a list of sequence records
with open('newseq2.fasta', 'w') as fh:
    SeqIO.write(record_list, fh, "fasta")
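
`SeqIO.write()` also accepts a filename in place of an open handle and returns the number of records written. The sketch below uses made-up records and a temporary directory so that no files are left behind.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up records for illustration
records = [
    SeqRecord(Seq("AAAGCT"), id="rec1", description="made-up record 1"),
    SeqRecord(Seq("GGGTTT"), id="rec2", description="made-up record 2"),
]

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "demo.fasta")

    # Passing a filename lets SeqIO open and close the file itself
    n_written = SeqIO.write(records, out_path, "fasta")

    # Read the file back to confirm the round trip
    n_read = sum(1 for _ in SeqIO.parse(out_path, "fasta"))

print(n_written, n_read)
```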

### Converting Between Sequence Formats

SeqIO file formats:
[http://biopython.org/wiki/SeqIO](http://biopython.org/wiki/SeqIO)

In [24]:
## Read a file in GenBank format
with open('./data/egfr.gb', 'r') as fh:
    egfr_rec = SeqIO.read(fh, "genbank")
print(egfr_rec)

ID: NM_005228.3
Name: NM_005228
Description: Homo sapiens epidermal growth factor receptor (EGFR), transcript variant 1, mRNA
Number of features: 70
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=08-NOV-2014
/accessions=['NM_005228']
/sequence_version=3
/gi=41327737
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Cellular migration and invasion uncoupled: increased migration is not an inexorable consequence of epithelial-to-mesenchymal transition', ...), Reference(title='EGF receptor uses SOS1 to drive constitutive activation of NFkappaB in cancer cells', ...), Reference(title='Associations between mutations and histologic patterns of mucin in lung adenocarcinoma: invasive mucinous pattern and extracellular mucin are ass

In [25]:
## View the record in fasta format
print(egfr_rec.format("fasta"))

>NM_005228.3 Homo sapiens epidermal growth factor receptor (EGFR), transcript variant 1, mRNA
CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCG
GCCGAGGCGGCCGGAGTCCCGAGCTAGCCCCGGCGGCCGCCGCCGCCCAGACCGGACGAC
AGGCCACCTCGTCGGCGTCCGCCCGAGTCCCCGCCTCGCCGCCAACGCCACAACCACCGC
GCACGGCCCCCTGACTCCGTCCAGTATTGATCGGGAGAGCCGGAGCGAGCTCTTCGGGGA
GCAGCGATGCGACCCTCCGGGACGGCCGGGGCAGCGCTCCTGGCGCTGCTGGCTGCGCTC
TGCCCGGCGAGTCGGGCTCTGGAGGAAAAGAAAGTTTGCCAAGGCACGAGTAACAAGCTC
ACGCAGTTGGGCACTTTTGAAGATCATTTTCTCAGCCTCCAGAGGATGTTCAATAACTGT
GAGGTGGTCCTTGGGAATTTGGAAATTACCTATGTGCAGAGGAATTATGATCTTTCCTTC
TTAAAGACCATCCAGGAGGTGGCTGGTTATGTCCTCATTGCCCTCAACACAGTGGAGCGA
ATTCCTTTGGAAAACCTGCAGATCATCAGAGGAAATATGTACTACGAAAATTCCTATGCC
TTAGCAGTCTTATCTAACTATGATGCAAATAAAACCGGACTGAAGGAGCTGCCCATGAGA
AATTTACAGGAAATCCTGCATGGCGCCGTGCGGTTCAGCAACAACCCTGCCCTGTGCAAC
GTGGAGAGCATCCAGTGGCGGGACATAGTCAGCAGTGACTTTCTCAGCAACATGTCGATG
GACTTCCAGAACCACCTGGGCAGCTGCCAAAAGTGTGATCCAAGCTGTCCCAATGGGAGC
TGCTGGGGTGCAGGAGAGGAGAACTGCCAGAAACTGACCAAAATCATCTGTG

In [26]:
## Write the record in fasta format
with open('egfr_mrna.fasta', 'w') as fh:
    SeqIO.write(egfr_rec, fh, "fasta")

In [27]:
## BioPython 1.52 introduced SeqIO.convert()
help(SeqIO.convert)

Help on function convert in module Bio.SeqIO:

convert(in_file, in_format, out_file, out_format, molecule_type=None)
    Convert between two sequence file formats, return number of records.
    
    Arguments:
     - in_file - an input handle or filename
     - in_format - input file format, lower case string
     - out_file - an output handle or filename
     - out_format - output file format, lower case string
     - molecule_type - optional molecule type to apply, string containing
       "DNA", "RNA" or "protein".
    
    **NOTE** - If you provide an output filename, it will be opened which will
    
    The idea here is that while doing this will work::
    
        from Bio import SeqIO
        records = SeqIO.parse(in_handle, in_format)
        count = SeqIO.write(records, out_handle, out_format)
    
    it is shorter to write::
    
        from Bio import SeqIO
        count = SeqIO.convert(in_handle, in_format, out_handle, out_format)
    
    Also, Bio.SeqIO.convert is fas

In [28]:
## Convert the sequence from GenBank to fasta using SeqIO.convert()
## Output file will be overwritten if it exists
## Returns the number of records converted
SeqIO.convert("./data/egfr.gb", "genbank", "egfr_mrna.fasta", "fasta")

1
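
Converting in the other direction, from FASTA to GenBank, is a case where the `molecule_type` argument matters: FASTA carries no molecule type, and the GenBank writer in Biopython 1.78 and later needs one. The sketch below uses a made-up record and in-memory handles rather than the course data files.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA input; StringIO stands in for real input and output files
fasta_in = StringIO(">demo1 made-up DNA sequence\nAAAGCTAAAGCT\n")
genbank_out = StringIO()

# molecule_type supplies the information that FASTA cannot provide
count = SeqIO.convert(fasta_in, "fasta", genbank_out, "genbank", molecule_type="DNA")

locus_line = genbank_out.getvalue().splitlines()[0]
print(count)
print(locus_line)
```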

## In-Class Exercises

In [29]:
## Exercise 1.
## Create a sequence record by reading from a GenBank file 
## (use './data/egfr.gb').
## Create a new sequence object holding the translated protein sequence.
## Write the protein sequence to a file in fasta format.
##


## References

- Python for Bioinformatics, Sebastian Bassi, CRC Press (2010)
- [http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [http://biopython.org/DIST/docs/api/](http://biopython.org/DIST/docs/api/)
- Peter Cock et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics, <i>Bioinformatics</i> (2009)

#### Last Updated: 15-Sep-2022