# II. Sequences

## Biology 
- Sequence ~ ordered collection of letters representing a directed molecular chain
- Sequence alphabet:
    - DNA: A, C, G, T
    - RNA: A, C, G, U
    - Proteins: 20 single-letter codes
  
    

## Bioinformatics
- String generally means a sequence of characters
- Python strings are often a good model

- Sequences are written using a standard alphabet
- The alphabets are defined by IUPAC
- There are separate alphabets for proteins and for DNA / RNA
- Biopython: an alphabet can be set explicitly on a sequence
- Excerpt of the IUPAC definitions for DNA:

| Symbol        | Description   |
|:-------------:|:-------------:| 
| A             | Adenine       | 
| C             | Cytosine      |   
| G             | Guanine       | 
| T             | Thymine       |
| N             | aNy base      |
| R             | A or G (puRine)|
| Y             | C or T (pYrimidine)|
| M             | A or C (aMino) |
| S             | C or G (Strong)|


- Biopython introduces the Seq object (Bio.Seq)
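Each ambiguity symbol in the IUPAC table above stands for a set of possible bases. A minimal plain-Python sketch of that idea follows. It does not use Biopython; the names `IUPAC_DNA_SUBSET` and `matches` are illustrative, and the dictionary covers only the symbols listed in the table.

```python
# Subset of the IUPAC DNA codes from the table above (illustrative, not the full alphabet).
IUPAC_DNA_SUBSET = {
    "A": {"A"},
    "C": {"C"},
    "G": {"G"},
    "T": {"T"},
    "N": {"A", "C", "G", "T"},  # any base
    "R": {"A", "G"},            # purine
    "Y": {"C", "T"},            # pyrimidine
    "M": {"A", "C"},            # amino
    "S": {"C", "G"},            # strong
}


def matches(pattern, sequence):
    """Return True if each base in `sequence` is allowed by the symbol at the same position in `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_DNA_SUBSET[symbol] for symbol, base in zip(pattern, sequence))


print(matches("ARYN", "AGCT"))  # True: R allows G, Y allows C, N allows T
print(matches("ARYN", "ACCT"))  # False: R does not allow C
```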


# Bio.Seq
- __data__: the underlying Python character string
- __alphabet__: DNA, RNA, protein, etc.:
    - two-fold importance:
        1. Type of information stored in the Seq object
        2. Information constraint, as a means of type checking
    - Bio.Alphabet.IUPAC: basic, extended, and customized definitions for DNA, RNA, and proteins
    - Initializing a Seq without an explicit alphabet gives it a generic alphabet
- READ-ONLY (IMMUTABLE)

In [1]:
from Bio.Seq import Seq
my_seq = Seq('AGTACACTGGT')
my_seq

Seq('AGTACACTGGT')

In [2]:
from Bio.Alphabet import IUPAC
dna = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPAC.unambiguous_dna) # basic DNA Alphabet
print(dna, " with the alphabet: ", dna.alphabet)

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG  with the alphabet:  IUPACUnambiguousDNA()


### a) The Seq object: Sequences act like strings 
--> supports most Python string methods:

In [3]:
# occurrences of a substring
# my_seq.count("GT")

# count() only counts non-overlapping occurrences:
"AAAA".count("AA")
#Seq("AAAA").count("AA")

# length of the sequence
# len(dna)

# finds the position of a substring
# dna.find("TATAT")

# concatenation
# dna[:12] + '---'+ dna[15:] 

# changing the case
#dna.lower()

# string output formatting:
for index, letter in enumerate(my_seq):
    print("index: {0} with letter:{1}".format(index, letter))



index: 0 with letter:A
index: 1 with letter:G
index: 2 with letter:T
index: 3 with letter:A
index: 4 with letter:C
index: 5 with letter:A
index: 6 with letter:C
index: 7 with letter:T
index: 8 with letter:G
index: 9 with letter:G
index: 10 with letter:T
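As the `"AAAA".count("AA")` line in the cell above points out, `count` reports non-overlapping occurrences only. If overlapping matches matter, a small helper built on `str.find` is one option. `count_overlapping` is a hypothetical name, not a Biopython function, and this sketch works on plain strings, so a `Seq` would be passed as `str(my_seq)`.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of `sub` in `sequence`, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must not be empty")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Advance by one position only, so the next match may overlap this one.
        start = sequence.find(sub, start + 1)
    return count


print("AAAA".count("AA"))               # non-overlapping: 2
print(count_overlapping("AAAA", "AA"))  # overlapping: 3
```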


## About strands...
__DNA coding strand (aka Crick strand, strand +1)__

5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG	3’
  
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC	5’

__DNA template strand (aka Watson strand, strand −1)__
 
 
__through Transcription__: __Single stranded messenger RNA__
 		 
 
5’	AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG	3’
 	 
 
- keep in mind: by convention, sequences are written in the 5' to 3' direction
- biological transcription: reads the template strand and builds its reverse complement (with U in place of T) to obtain the mRNA
- bioinformatics: works directly with the coding strand (only replaces T with U to get the mRNA)
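The relationship between the two strands and the mRNA can be sketched with plain Python strings. This only illustrates the idea and is not how Biopython implements it; the helper names are hypothetical, and only the unambiguous letters A, C, G, T are handled.

```python
# Translation table for complementing unambiguous DNA letters.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Complement each base, keeping the order (the result reads 3' to 5')."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Complement and reverse, so the result reads 5' to 3' again."""
    return complement(dna)[::-1]


def transcribe(coding_dna):
    """Bioinformatics shortcut: copy the coding strand and replace T with U."""
    return coding_dna.replace("T", "U")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = reverse_complement(coding)  # template strand, written 5' to 3'

# Biological route: start from the template strand, reverse complement, then T -> U.
mrna_biological = transcribe(reverse_complement(template))
# Shortcut: start from the coding strand directly.
mrna_shortcut = transcribe(coding)

print(template)
print(mrna_shortcut)
print(mrna_biological == mrna_shortcut)  # both routes give the same mRNA
```

Reverse complementing twice returns the original strand, which is why the shortcut on the coding strand gives the same mRNA as the biological route.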
 

### b) The Seq object: biologically relevant methods
--> supports some biology-specific methods:

In [4]:
print(dna) # coding strand 5' to 3'
print(dna.complement()) # the complementary strand, read 3' to 5'
print(dna.reverse_complement()) # the template strand, written 5' to 3'

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT


In [5]:
# transcription
rna = dna.transcribe() # uses the coding strand (T-> U)
print(rna)
print("rna's alphabet ", rna.alphabet, " consists of following letters: ", rna.alphabet.letters)

AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
rna's alphabet  IUPACUnambiguousRNA()  consists of following letters:  GAUC


In [6]:
# translation
protein = rna.translate()
print(protein, protein.alphabet)

MAIVMGR*KGAR* HasStopCodon(IUPACProtein(), '*')


In [7]:
dna2 = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPAC.unambiguous_dna)

#direct translation is possible
protein2 = dna2.translate()
print(protein2, protein2.alphabet)


MAIVMGR*KGAR* HasStopCodon(IUPACProtein(), '*')


- Biopython offers several translation tables (genetic codes)
- they are based on [NCBI's translation tables](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
- if you have a mitochondrial sequence, you should use a mitochondrial translation table


In [8]:
dna2.translate(table="Vertebrate Mitochondrial") # default is table 1 (NCBI table id); table 2 is Vertebrate Mitochondrial

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [9]:
# the sequence contains two stop codons; as in nature, translation should stop at the first one:
print(dna2.translate(to_stop=True))

print(dna2.translate(to_stop=True, table=2))

MAIVMGR
MAIVMGRWKGAR
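To see what a translation table and `to_stop` amount to, here is a plain-Python sketch. It is not Biopython's implementation; `make_table` and `translate` are hypothetical helpers, and the two amino-acid strings are the NCBI listings for tables 1 and 2, with the bases in each codon position ordered T, C, A, G.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of the bases within each codon position

# Amino acids for the 64 codons in NCBI order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def make_table(amino_acids):
    """Map each codon to its one-letter amino acid ('*' marks a stop codon)."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(dna, table, to_stop=False):
    """Translate the complete codons of `dna`; optionally stop before the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
standard = make_table(STANDARD)
mito = make_table(VERTEBRATE_MITO)

print(translate(dna, standard))                # MAIVMGR*KGAR*
print(translate(dna, mito))                    # MAIVMGRWKGAR*
print(translate(dna, standard, to_stop=True))  # MAIVMGR
print(translate(dna, mito, to_stop=True))      # MAIVMGRWKGAR
```

The only codon in this sequence that differs between the two tables is TGA: a stop codon in the standard code, tryptophan (W) in the vertebrate mitochondrial code.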


### Comparing sequences
- not always easy
- Biopython: Seq objects are compared by their sequence string; the alphabet is not taken into account (incompatible alphabets only trigger a warning):

In [10]:
# same sequences but different alphabets
seq1 = Seq("ACGT", IUPAC.unambiguous_dna) 
seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
seq3 = Seq("ACGT", IUPAC.unambiguous_dna) 

In [11]:
# compare the strings
print(str(seq1) == str(seq2))
print(str(seq1) == str(seq3))

True
True


In [12]:
# compare objects: Biopython only compares the strings and does not take the alphabet into account
print(seq1 == seq3)

# typical id function of python
print(id(seq1) == id(seq3)) # seq1 and seq3 are different objects

True
False


In [13]:
# compare different molecules - triggers a warning
from Bio.Alphabet import generic_dna, generic_protein
dna_seq = Seq("ACGT", generic_dna)
prot_seq = Seq("ACGT", generic_protein)
print(dna_seq == prot_seq)

True




# MutableSeq

- remember: Seq objects (Bio.Seq.Seq) are __IMMUTABLE__:

In [14]:
my_seq = Seq("ACCT", IUPAC.unambiguous_dna)
my_seq[1] = "G"

TypeError: 'Seq' object does not support item assignment

In [21]:
# modifiable Seq object is needed
my_mod_seq = my_seq.tomutable()
print(type(my_seq))
print(type(my_mod_seq))

from Bio.Seq import MutableSeq
# you can also directly create from a string
m = MutableSeq("ACCT", IUPAC.unambiguous_dna)
print(m, m.alphabet)

# modify sequence 
m[1] = "G"
print(m)
# print sequence in reverse order
print(m[::-1])

# convert it back to immutable seq
n = m.toseq()
print(n)

<class 'Bio.Seq.Seq'>
<class 'Bio.Seq.MutableSeq'>
ACCT IUPACUnambiguousDNA()
AGCT
TCGA
AGCT
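The split between `Seq` and `MutableSeq` mirrors plain Python, where a `str` is immutable and a `list` of characters is not. A small sketch of that analogy, without Biopython:

```python
seq = "ACCT"

try:
    seq[1] = "G"  # strings do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# Work on a mutable copy instead, then convert back to an immutable string.
mutable = list(seq)
mutable[1] = "G"
edited = "".join(mutable)

print(edited)        # AGCT
print(edited[::-1])  # TCGA
print(seq)           # ACCT, the original string is unchanged
```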


## Exercises

## Exercise 1:
Count the number of A's, C's, G's and T's in the sequence "seq" below. It contains the whole genome of *H. influenzae*. Compute the relative abundance of each nucleotide. What do you observe?

In [28]:
# only evaluate this cell ONCE !!!
from Bio import Entrez, SeqIO
from Bio.Seq import Seq

# we will see later how this works in detail
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.efetch(db="nucleotide", id="NC_000907", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")

# here is the sequence you should use for testing your function
seq = record.seq
print(len(seq))

1830138


In [42]:
#count A's, C's, G's and T's:
Ac = seq.count("A")
Cc = seq.count("C")
Gc = seq.count("G")
Tc = seq.count("T")

#print out absolute counts
print('A count: ', Ac)
print('C count: ', Cc)
print('G count: ', Gc)
print('T count: ', Tc)

#print out relative abundances
print("Rel. abundance A: ", Ac/len(seq))
print("Rel. abundance C: ", Cc/len(seq))
print("Rel. abundance G: ", Gc/len(seq))
print("Rel. abundance T: ", Tc/len(seq))

A count:  567623
C count:  350723
G count:  347436
T count:  564241
Rel. abundance A:  0.3101531141367482
Rel. abundance C:  0.19163746121877148
Rel. abundance G:  0.18984142179442207
Rel. abundance T:  0.3083051660585158
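The four counts above add up to 1,830,023, which is 115 less than the sequence length of 1,830,138, so the record appears to contain a few symbols other than A, C, G and T. Counting every symbol in a single pass makes such cases visible. The following is a sketch with `collections.Counter` on a short made-up string; the helper name `relative_abundances` and the toy sequence are illustrative.

```python
from collections import Counter


def relative_abundances(sequence):
    """Return {symbol: fraction} for every symbol that occurs in `sequence`."""
    counts = Counter(str(sequence))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {symbol: n / total for symbol, n in counts.items()}


toy = "AACGTTNA"  # made-up example containing one ambiguous base
for symbol, fraction in sorted(relative_abundances(toy).items()):
    print(symbol, fraction)
```

Because the function calls `str()` on its argument, it should accept a `Seq` object as well as a plain string.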


## Exercise 2:
Write a function to compute the GC content of a sequence (of type Bio.Seq.Seq). Compute the GC content of the following sequence and compare it with the result of Biopython's GC() function from the Bio.SeqUtils module.


In [43]:
def compute_GC_content(seq):
    gc = sum(seq.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's']) 
    try: 
        return gc * 100.0 / len(seq) 
    except ZeroDivisionError: 
        return 0.0

In [45]:
seq2 = Seq("CTAACCAGCAGCACGACSCACCCTTCCAACGACCCSATAACAGC", IUPAC.ambiguous_dna)
# call your function and compare with Bio.SeqUtils method
print(compute_GC_content(seq2))

import Bio.SeqUtils as utl
print(utl.GC(seq2))

59.09090909090909
59.09090909090909


## Exercise 3:
Given the following DNA sequence, how would true biological transcription work in Biopython? 
 

In [47]:
dna2 = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPAC.unambiguous_dna)
# your solution goes here:
# dna2 is treated as the template strand here: reverse complement, then T -> U
m_rna = dna2.reverse_complement().transcribe()
m_rna

Seq('CUAUCGGGCACCCUUUCAGCGGCCCAUUACAAUGGCCAU', IUPACUnambiguousRNA())