**Last modified on**: 30/10/2024

**Author**: Onur Serçinoğlu

**Credits**:

The following resources have been used to prepare this Jupyter notebook:

- A Jupyter notebook prepared by Prof. Ian Simpson for the Bioinformatics I course taught at University of Edinburgh (**https://github.com/tisimpson/bioinformatics1/blob/main/labs/notebooks/bio1_week2_part1.ipynb**)

- Pairwise sequence alignment exercises from the Introduction to Bioinformatics course taught at the Technical University of Denmark (**https://teaching.healthtech.dtu.dk/22111/index.php/ExPairwiseAlignment**)

# Introduction to Python and Jupyter Notebooks


This notebook will guide you through using Python for bioinformatics applications, using **Jupyter Notebooks** as the primary environment.

Jupyter Notebooks allow you to write and execute code interactively, making them a powerful tool for exploring data and performing analysis in a convenient, intuitive format.

Throughout this course, you'll see how Python can be applied to bioinformatics problems, such as working with DNA, proteins, and various forms of biological data.


No previous coding or Python experience? Hundreds of web resources teach Python in various ways.

**https://www.learnpython.org/** is one of these resources, and it introduces the fundamentals effectively to absolute beginners.

## How to Use Jupyter Notebooks

In a Jupyter Notebook, cells are used to contain text, code, or other interactive content. There are two main types of cells:

1. **Markdown Cells** - These cells (like this one) are used for explanatory text, documentation, or instructions.
2. **Code Cells** - These cells contain Python code that can be executed.

To run a code cell, simply press `Shift + Enter`, or click the "Run" button at the top of the notebook.

Let's get started with a simple example. Below, we'll create a Python variable representing a short DNA sequence and print it.


In [1]:
# Simple DNA sequence example
dna_sequence = "ATGCGTACGTTAG"
print(f"DNA Sequence: {dna_sequence}")

DNA Sequence: ATGCGTACGTTAG


## Python in Bioinformatics

Python is widely used in bioinformatics for handling biological data, performing analysis, and even creating tools for complex data manipulation. 

In this notebook, we'll work through several examples, such as:

1. Counting nucleotides in a DNA sequence.
2. Transcribing DNA to RNA.
3. Calculating GC content in a sequence.

Let's start with a basic nucleotide count function.


In [2]:
# Nucleotide counting function
def count_nucleotides(seq):
    return {
        'A': seq.count('A'),
        'T': seq.count('T'),
        'G': seq.count('G'),
        'C': seq.count('C')
    }

# Example usage
dna_sequence = "ATGCGTACGTTAG"
nucleotide_counts = count_nucleotides(dna_sequence)
print(f"Nucleotide counts: {nucleotide_counts}")

Nucleotide counts: {'A': 3, 'T': 4, 'G': 4, 'C': 2}
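
Real sequences often contain lowercase letters or ambiguity codes such as `N`, which the function above silently ignores. As a minimal sketch (the function name `count_symbols` and the test sequence are illustrative assumptions, not part of the course material), `collections.Counter` from the standard library counts every symbol it encounters:

```python
from collections import Counter

def count_symbols(seq):
    """Count every symbol in a sequence, ignoring case."""
    return Counter(seq.upper())

# Same example sequence as above, with two lowercase 'n' characters appended
counts = count_symbols("ATGCGTACGTTAGnn")
print(counts["A"], counts["T"], counts["G"], counts["C"], counts["N"])
# prints: 3 4 4 2 2
```

A `Counter` returns 0 for symbols that never occur, so looking up an absent nucleotide does not raise an error.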



### Transcribing DNA to RNA

In bioinformatics, one common task is converting DNA sequences to their corresponding RNA sequences. Here's a simple Python function to perform this transcription.


In [3]:
# DNA to RNA transcription function
def transcribe_dna_to_rna(seq):
    return seq.replace('T', 'U')

# Example usage
rna_sequence = transcribe_dna_to_rna(dna_sequence)
print(f"Transcribed RNA Sequence: {rna_sequence}")

Transcribed RNA Sequence: AUGCGUACGUUAG
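
The third task listed earlier, calculating GC content, can be sketched in plain Python in the same style. The function name `gc_content` is an illustrative choice, and returning 0.0 for an empty sequence is an assumption made here to avoid a division by zero:

```python
def gc_content(seq):
    """Return the GC content of a DNA or RNA sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        # Assumption: an empty sequence is reported as 0% rather than raising an error
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100 * gc / len(seq)

print(f"GC content: {gc_content('ATGCGTACGTTAG'):.2f}%")
# prints: GC content: 46.15%
```

The example sequence has 4 G and 2 C among 13 nucleotides, which gives 6/13, or about 46.15%.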


## Introduction to Biopython

**Biopython** is a powerful library that provides tools for working with biological data in Python. It is widely used in bioinformatics for tasks such as parsing bioinformatics file formats, interacting with biological databases, and running sequence analyses.

To find out more about Biopython, its full set of features, and a set of tutorials, please visit the official biopython website: 

*http://biopython.org*

### Why use Biopython?

Biopython offers a range of functionality that makes it easier to perform bioinformatics tasks such as:

**Reading and writing sequence files**: Biopython supports many popular formats like FASTA, GenBank, and others.

**Performing common operations**: Including sequence manipulation, motif finding, and gene annotation.

**Interacting with online databases**: You can fetch sequences directly from databases like NCBI using Biopython's built-in functions.

**Handling complex data types**: It simplifies working with biological data structures like sequences, alignments, and phylogenetic trees.

### Installing Biopython

You don't need to install Biopython if you're connected to our in-house JupyterHub server using the username and password provided to you by the course instructor.

If you wish to run the examples on your own device, Biopython is easily installed using either conda or pip:

**conda install -c conda-forge biopython**

**pip install biopython**

Once installed, you can import Biopython and start using its features.

### Example: Reading a FASTA File

Biopython makes reading sequence files very simple. Here's how to read a FASTA file:

In [4]:
from Bio import SeqIO

# Load and parse a FASTA file
for record in SeqIO.parse("P29600.fa", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")

ID: sp|P29600|SUBS_LEDLE
Sequence: AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLEWAGNNGMHVANLSLGSPSPSATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNHLKNTATSLGSTNLYGSGLVNAEAATR
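
To see what a FASTA parser has to do, here is a minimal plain-Python sketch. It is not how `SeqIO.parse` is implemented, and the function name `parse_fasta` and the two toy records are assumptions made for illustration. Each record starts with a `>` header line, and the sequence may be wrapped over several lines:

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record FASTA text, used here instead of a file on disk
example = io.StringIO(">seq1 first toy record\nATGC\nGTAC\n>seq2 second toy record\nGGCC\n")
for header, sequence in parse_fasta(example):
    record_id = header.split()[0]  # the ID is the first word of the header
    print(record_id, len(sequence), sequence)
```

In practice `SeqIO.parse` is preferable, because it also handles other formats and returns full record objects.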


### Example: Creating a DNA sequence

In [5]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)

# print the reverse complement of the sequence
print(my_seq.reverse_complement())

AGTACACTGGT
ACCAGTGTACT
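
For comparison, the same operation can be sketched without Biopython. This version assumes the sequence contains only the unambiguous bases A, C, G and T, and the function names are illustrative:

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return complement(seq)[::-1]

print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The second line matches the output of `my_seq.reverse_complement()` above.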


### Example: Basic operations on DNA sequences

In [7]:
#sequence length
print(len(my_seq),"nucleotides long")

#sequence %GC content
#NB the GC function was removed from Bio.SeqUtils in recent Biopython releases;
#gc_fraction returns a fraction between 0 and 1, so we multiply it by 100
from Bio.SeqUtils import gc_fraction
gc_percent = 100 * gc_fraction(my_seq)

#simple print
print("%GC content = ",gc_percent,"%")

#printing to two decimal places
print("%GC content = "+'%4.2f' % gc_percent+"%")

11 nucleotides long

**Why is GC content of DNA sequences important?**

In [None]:
#original sequence
print("original sequence",my_seq)

#sequence slicing NB this displays nucleotides 2-5
print("indexing from 1->5",my_seq[1:5])

#the sequence is indexed from 0
print("indexing from 0->5",my_seq[0:5])

In [None]:
#complement of the sequence
print(my_seq.complement())

#reverse complement of the sequence
print(my_seq.reverse_complement())

### Biopython also contains useful metadata!

In [None]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]

print(standard_table)

In [None]:
#and STOP codons
print(standard_table.stop_codons)
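
To illustrate what such a codon table contains, the standard genetic code can be rebuilt in a few lines of plain Python. This is a sketch, not Biopython's implementation: the 64-letter amino acid string follows the usual T, C, A, G ordering of the standard code (NCBI table 1), and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TTT, TTC, TTA, TTG, TCT, ... order ('*' marks a stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=False):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

stop_codons = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")
print(stop_codons)                              # ['TAA', 'TAG', 'TGA']
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))    # MAIVMGR*
```

This sketch only handles unambiguous DNA codons; a codon containing any other character raises a `KeyError`.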

## Using Entrez via Biopython to access NCBI GenBank data

As introduced above, Biopython contains useful methods to access popular bioinformatics databases, such as the NCBI GenBank.

We can make use of the Entrez search systems of NCBI to perform search queries in the GenBank database.

Biopython has a module specifically created for this purpose (Biopython Entrez).
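
Behind the scenes, an Entrez search is an HTTP request to NCBI's E-utilities service. The sketch below only builds such a URL and does not contact NCBI. The helper name `build_esearch_url` and the placeholder e-mail address are assumptions made for illustration:

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(db, term, retmax=20, email=None):
    """Build (but do not send) an ESearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax}
    if email:
        params["email"] = email
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

# Placeholder e-mail address; replace it with your own when querying NCBI
url = build_esearch_url("nucleotide", "Cypripedioideae", retmax=1000, email="you@example.org")
print(url)
```

Biopython's Entrez module builds the request for you, sends it, and parses the reply.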

In [None]:
from Bio import Entrez

Entrez.email = "onursercin@gmail.com"

In [None]:
# we're going to search for up to 1000 sequences and we're going to ask for the accession number for each
# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Cypripedioideae",retmax=1000,idtype='acc')
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
#lets fetch one
accession = record['IdList'][500]

handle = Entrez.efetch(db="nucleotide", id=accession, retmode="xml")
entry = Entrez.read(handle)
handle.close()

#print the whole entry (this is a GenBank record in XML format)
print(entry)

In [None]:
print(entry[0]['GBSeq_definition'])
print(entry[0]['GBSeq_organism'])

Let's print the same entry in a more user-friendly format (similar to what we see when we visit the NCBI web page).

In [None]:
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
print(handle.read())

### Bio.Seq module

We can use the Bio.SeqIO module to handle groups of records, and then create Bio.Seq.Seq sequence objects to store them for later analysis.

This is especially useful if we need to work with a lot of sequences, which is often the case in real bioinformatics tasks.

In [None]:
from Bio import SeqIO
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

for entry in records:
    sequence = entry.seq
    print(sequence)
    print(type(sequence))
    
print('complement',sequence.complement())
print('reverse_complement',sequence.reverse_complement())

Let's say we're looking for "Gene" entries with title "Pax6". In other words, we're looking for all genes named Pax6. 

We reckon that if the gene name is specific enough, the results will include the same gene but from different organisms or from different experiments (variants among members of the same species).

In [None]:
# we're going to limit this to 100 sequences; since we don't pass idtype='acc' this time, the search returns numeric GI identifiers

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Pax6[Gene]",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
# now let's fetch them all; to do this we extract the id list

gi_list = record['IdList']

#then turn it into a comma-separated string

gi_str = ",".join(gi_list)

handle = Entrez.efetch(db="nucleotide", id=gi_str, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

for record in records:
    print("%s, length %i, description: %s" % (record.name, len(record), record.description))
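
When an ID list grows long, one option is to split it into smaller comma-separated batches and fetch each batch separately. The helper below is a sketch under that assumption; the name `batched_ids`, the batch size and the toy IDs are all hypothetical:

```python
def batched_ids(ids, batch_size):
    """Yield comma-separated ID strings containing at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])

# Hypothetical IDs, used only to show the batching
toy_ids = [str(n) for n in range(1, 8)]
for id_str in batched_ids(toy_ids, 3):
    print(id_str)
# prints:
# 1,2,3
# 4,5,6
# 7
```

Each yielded string could then be passed as the `id` argument of a separate `Entrez.efetch` call.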

Now let's **specifically** get the full gene entry for human Pax6 from GenBank!

In [None]:
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
gb_entry = handle.read()
handle.close()

#NB this is just a straight string at this point (as we just read() it straight into a string object)
print(gb_entry)

Now let's extract the **coding sequence** from this gene, and translate it into the protein it encodes.

In [None]:
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")

if record.features:
    for feature in record.features:
        #this tag identifies the CoDingSequences from the record
        if feature.type == "CDS":
            print(feature.qualifiers["protein_id"])
            print(feature.location,'\n')
            current_sequence = feature.location.extract(record).seq
            print('Nucleotide Sequence')
            print(current_sequence,'\n')
            #translate the current sequence into protein
            print('Protein Sequence')
            print(current_sequence.translate(),'\n')

In [None]:
# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

# lets retrieve as XML format and use the Entrez parser to read it
handle = Entrez.efetch(db="gene", id=record['IdList'][:1], retmode="xml")
# this returns an array of records which are in Python dict format
records = Entrez.read(handle)
handle.close()

# look at the first record by iterating through the keys of the dict
# NB there's a lot of information in here
for feature in list(records[0]):
    print(feature,':',records[0][feature])

## Performing pairwise alignment using Biopython

Biopython also allows users to perform pairwise alignments using common amino-acid substitution/scoring matrices.

In [None]:
from Bio import SeqIO
from Bio import pairwise2 as pw
from Bio import AlignIO
from Bio import Align as al

In [None]:
# list the available scoring matrices from the Bio.Align.substitution_matrices module
print(al.substitution_matrices.load())

# the accession ids of human beta-globin and myoglobin proteins respectively
protein_ids = ['NP_000509.1','NP_005359.1']

handle = Entrez.efetch(db="protein", id=protein_ids, rettype='fasta',retmode="text")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()

# use these as sequence objects
beta_globin = records[0].seq
myoglobin = records[1].seq

The pairwise2 module has two main families of alignment functions, `local` and `global`. When you call them, you append two characters to the function name to define how the alignment is scored, for example `globalxx` or `localxx`. The first character defines the match parameters:

**x**     No parameters. Identical characters have score of 1, otherwise 0.

**m**     A match score is the score of identical chars, otherwise mismatch score.

**d**     A dictionary returns the score of any pair of characters.

**c**     A callback function returns scores.

The gap penalty parameters are:

**x**     No gap penalties.

**s**    Same open and extend gap penalties for both sequences.

**d**     The sequences have different open and extend gap penalties.

**c**     A callback function returns the gap penalties.

Further details can be found in the documentation of the Bio.pairwise2 module.

As an example, consider the call:

`pairwise2.align.globalms("ACCGT", "ACG", 2, -2, -.5, -.1)`

Here 'm' means that you specify the match (+2) and mismatch (-2) scores, and 's' means that you specify the gap-open (-0.5) and gap-extend (-0.1) scores.
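
To make the role of these scores concrete, here is a minimal plain-Python sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). It is a simplification, not the pairwise2 implementation: it uses a single linear gap penalty (assumed to be -1 per gap position) instead of separate gap-open and gap-extend penalties, and the function name is illustrative.

```python
def global_alignment_score(seq_a, seq_b, match=2, mismatch=-2, gap=-1):
    """Return the optimal global alignment score under a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and row: aligning a prefix against nothing costs one gap per residue
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]

print(global_alignment_score("ACCGT", "ACG"))  # 4
```

The best alignment has three matches (3 x 2) and two gap positions (2 x -1), giving a score of 4. With the gap-open and gap-extend penalties from the pairwise2 call, the numeric score would differ.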

In [None]:
# perform a pairwise local alignment using the pam250 substitution matrix
mx = al.substitution_matrices.load('PAM250')

alignments = pw.align.localds(beta_globin,myoglobin,mx, -10, -0.5)

# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than 
# one backtrace mark in the hand-drawn alignments)
print(len(alignments))

# in the result we can extract several score features

# the alignment score
print(alignments[0][2])

# the start of the alignment
print(alignments[0][3])

# the end of the alignment
print(alignments[0][4])

# note here we are using 'd' the pam250 scoring matrix and then 's' gap-open (-10) and gap-extend (-0.5)

# unfortunately pairwise2 output looks awful (but it's a good built-in alignment method for you to practice
# with sequence alignment matrices and scoring systems)

# so we're going to make a very basic fasta format file and use AlignIO to convert it into Clustal alignment
# format which is much nicer to look at

#create the fasta format
# > aligned seq 1
# SEQUENCE
# > aligned seq 2
# SEQUENCE

alignment_fasta = \
">"+records[0].name+" "+records[0].description+"\n"+alignments[0][0] \
+"\n"+ \
">"+records[1].name+" "+records[1].description+"\n"+alignments[0][1]

# write it to a file
fh = open('globin_alignment_pam250.fa','w')
fh.write(alignment_fasta)
fh.close()

# read in the file using AlignIO
alignment = AlignIO.read("globin_alignment_pam250.fa", "fasta")

# convert to clustal
print(format(alignment,'clustal'))
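
The FASTA-writing steps above are repeated for every alignment in this notebook, so they could be wrapped in a small helper. This is a sketch with a hypothetical function name and a toy alignment; it uses a `with` block so that the file is closed even if writing fails:

```python
import os
import tempfile

def write_aligned_pair(path, name_a, aligned_a, name_b, aligned_b):
    """Write two aligned (gapped) sequences to a FASTA file."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    with open(path, "w") as fh:
        fh.write(f">{name_a}\n{aligned_a}\n>{name_b}\n{aligned_b}\n")

# Toy alignment written to a temporary location (hypothetical file name)
demo_path = os.path.join(tempfile.gettempdir(), "toy_alignment.fa")
write_aligned_pair(demo_path, "seq_a", "ACCGT", "seq_b", "AC-G-")
with open(demo_path) as fh:
    print(fh.read())
```

The length check catches the case where the two strings do not come from the same alignment, which AlignIO would otherwise reject when reading the file.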

In [None]:
# perform a pairwise local alignment using the pam30 substitution matrix
mx = al.substitution_matrices.load('PAM30')

alignments = pw.align.localds(beta_globin,myoglobin,mx, -10, -0.5)

alignment_fasta = \
">"+records[0].name+" "+records[0].description+"\n"+alignments[0][0] \
+"\n"+ \
">"+records[1].name+" "+records[1].description+"\n"+alignments[0][1]

# write it to a file
fh = open('globin_alignment_pam30.fa','w')
fh.write(alignment_fasta)
fh.close()

# read in the file using AlignIO
alignment = AlignIO.read("globin_alignment_pam30.fa", "fasta")

# convert to clustal
print(format(alignment,'clustal'))

In [None]:
# perform a pairwise global alignment using the pam250 substitution matrix
mx = al.substitution_matrices.load('PAM250')

alignments = pw.align.globalds(beta_globin,myoglobin,mx, -10, -0.5)

# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than 
# one backtrace mark in the hand-drawn alignments)
print(len(alignments))

# in the result we can extract several score features

# the alignment score
print(alignments[0][2])

# the start of the alignment (NB global alignments must always start at 0)
print(alignments[0][3])

# the end of the alignment
print(alignments[0][4])

alignment_fasta = \
">"+records[0].name+" "+records[0].description+"\n"+alignments[0][0] \
+"\n"+ \
">"+records[1].name+" "+records[1].description+"\n"+alignments[0][1]

# write it to a file
fh = open('globin_alignment_pam250_global.fa','w')
fh.write(alignment_fasta)
fh.close()

# read in the file using AlignIO
alignment = AlignIO.read("globin_alignment_pam250_global.fa", "fasta")

# convert to clustal
print(format(alignment,'clustal'))

## Aligning two serine proteases

In [None]:
# Load and parse a FASTA file
subs_ledle = SeqIO.parse("P29600.fa", "fasta")
subs_ledle = next(subs_ledle)

In [None]:
elya_alkhc = SeqIO.parse("P41363.fa", "fasta")
elya_alkhc = next(elya_alkhc)

In [None]:
# perform a pairwise global alignment using the pam250 substitution matrix
mx = al.substitution_matrices.load('PAM250')

alignments = pw.align.globalds(subs_ledle.seq,elya_alkhc.seq,mx, -10, -0.5)

# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than 
# one backtrace mark in the hand-drawn alignments)
print(len(alignments))

# in the result we can extract several score features

# the alignment score
print(alignments[0][2])

# the start of the alignment (NB global alignments must always start at 0)
print(alignments[0][3])

# the end of the alignment
print(alignments[0][4])

# use the protease records for the FASTA headers (not the globin records fetched earlier)
alignment_fasta = \
">"+subs_ledle.name+" "+subs_ledle.description+"\n"+alignments[0][0] \
+"\n"+ \
">"+elya_alkhc.name+" "+elya_alkhc.description+"\n"+alignments[0][1]

# write it to a file
fh = open('protease_alignment_pam250_global.fa','w')
fh.write(alignment_fasta)
fh.close()

# read in the file using AlignIO
alignment = AlignIO.read("protease_alignment_pam250_global.fa", "fasta")

# convert to clustal
print(format(alignment,'clustal'))
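
Alignment scores depend on the scoring matrix and gap penalties, so a simpler summary such as percent identity is often reported alongside them. The sketch below uses one possible definition, identical residues divided by the number of columns without gaps; other definitions exist. The function name and the toy alignment are illustrative assumptions:

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues among the alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100 * matches / compared if compared else 0.0

# Toy alignment: three gap-free columns, two of them identical
print(f"{percent_identity('ACCGT', 'AT-G-'):.1f}% identity")
# prints: 66.7% identity
```

With the alignments computed above, the two gapped strings `alignments[0][0]` and `alignments[0][1]` could be passed to this function.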

## Aligning Savinase with tripeptidyl-peptidase 2

In [None]:
tri_pep = SeqIO.parse("P29144.fa", "fasta")
tri_pep = next(tri_pep)

In [None]:
# perform a pairwise global alignment using the pam250 substitution matrix
mx = al.substitution_matrices.load('PAM250')

alignments = pw.align.globalds(subs_ledle.seq,tri_pep.seq,mx, -10, -0.5)

# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than 
# one backtrace mark in the hand-drawn alignments)
print(len(alignments))

# in the result we can extract several score features

# the alignment score
print(alignments[0][2])

# the start of the alignment (NB global alignments must always start at 0)
print(alignments[0][3])

# the end of the alignment
print(alignments[0][4])

# use the Savinase and tripeptidyl-peptidase 2 records for the FASTA headers (not the globin records fetched earlier)
alignment_fasta = \
">"+subs_ledle.name+" "+subs_ledle.description+"\n"+alignments[0][0] \
+"\n"+ \
">"+tri_pep.name+" "+tri_pep.description+"\n"+alignments[0][1]

# write it to a file
fh = open('savinase_tpp2_alignment_pam250_global.fa','w')
fh.write(alignment_fasta)
fh.close()

# read in the file using AlignIO
alignment = AlignIO.read("savinase_tpp2_alignment_pam250_global.fa", "fasta")

# convert to clustal
print(format(alignment,'clustal'))