## Installing Biopython

If you use the Anaconda Distribution, Biopython can be installed from the Environments tab of Anaconda Navigator. Alternatively, use the following commands at your command prompt.

1) All supported versions of Python include the Python package management tool pip, which allows an easy installation from the command line on all platforms. Try:
**pip install biopython**

2) To update an older version of Biopython, try:
**pip install biopython --upgrade**
This removes older versions of Biopython and NumPy before installing the most recent versions.

3) Should you wish to uninstall Biopython:
**pip uninstall biopython**

4) If pip is not already installed you may need to update your Python, but first try:
**python -m ensurepip**

5) If you need to install under a specific version of Python, try something like this:
**python3.9 -m pip install biopython**
or
**pypy -m pip install biopython** 

6) On Windows, python and pip are not on the PATH by default. You can re-install Python and tick the option that adds it to the PATH, or give the full path instead. Try something like this, depending on where your copy of Python is installed:
**C:\Python39\Scripts\pip install biopython**

For more, visit https://biopython.org/wiki/Download
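
Once the installation finishes, it can be handy to confirm that Python actually sees the package. The sketch below is one way to do that. The helper name `biopython_installed` is made up for illustration, and it assumes the package is importable under its usual name `Bio`.

```python
import importlib.util


def biopython_installed() -> bool:
    """Return True if the 'Bio' package can be found in this environment."""
    return importlib.util.find_spec("Bio") is not None


if biopython_installed():
    import Bio
    print("Biopython version:", Bio.__version__)
else:
    print("Biopython is not installed; try: pip install biopython")
```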

## <u>Working with Biopython</u>

Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists.

It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology.

**A comprehensive guide**: https://mygoblet.org/sites/default/files/materrials/Biopython_ppt.pdf

In [4]:
from Bio.Seq import Seq     #importing Seq object from Bio.Seq module

A sequence is a string of letters that represents a DNA, RNA, or protein molecule.

Sequences in Biopython are usually handled by the **Seq** object defined in the **Bio.Seq** module. The Seq object has built-in methods such as **complement, reverse_complement, transcribe, back_transcribe and translate**. It also supports many string methods, such as **count(), find(), split() and strip()**.

More at https://www.geeksforgeeks.org/sequence-in-biopython-module/

https://biopython.org/docs/1.75/api/Bio.Seq.html

http://fenyolab.org/presentations/Introduction_Biostatistics_Bioinformatics_2015/pdf/IBB2015_homework4_biopython.pdf

### Find the complement of the following sequence

In [5]:
my_seq = Seq("AATGCTaaaaaaatgctgtATGC")
my_seq = my_seq.upper()
print("our sequence is", my_seq)
print("Its complement is", my_seq.complement())    #we use the .complement() method

our sequence is AATGCTAAAAAAATGCTGTATGC
Its complement is TTACGATTTTTTTACGACATACG


### Print the reverse complement of a given sequence

In [6]:
my_seq = Seq("ATGCATGCTGCATGACTCTGACT")
my_seq.reverse_complement()

Seq('AGTCAGAGTCATGCAGCATGCAT')
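
To see what these two methods do conceptually, the same results can be reproduced with plain Python strings. This is only an illustrative sketch, not how Biopython implements them. `complement` and `reverse_complement` below are hypothetical stand-alone helpers that handle only the unambiguous letters A, C, G and T.

```python
# Map each base to its Watson-Crick partner (unambiguous DNA letters only).
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT_TABLE)


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AATGCTAAAAAAATGCTGTATGC"))
print(reverse_complement("ATGCATGCTGCATGACTCTGACT"))
```

Both printed strings should agree with the Biopython outputs shown in the two cells above.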

### Transcribe and reverse-transcribe your DNA

In [7]:
#transcription
my_seq = Seq("ATGCATGCTGCATGACTCTGACT")
my_seq.transcribe()

Seq('AUGCAUGCUGCAUGACUCUGACU')

In [8]:
# reverse transcription: start from the RNA produced above
my_rna = Seq("AUGCAUGCUGCAUGACUCUGACU")
my_rna.back_transcribe()

Seq('ATGCATGCTGCATGACTCTGACT')
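
Biopython's `transcribe()` treats the sequence as the coding strand, so the result is simply the DNA with every T replaced by U, and `back_transcribe()` goes the other way. A minimal plain-string sketch of that idea follows. The two functions are hypothetical helpers for illustration, not part of Biopython.

```python
def transcribe(dna: str) -> str:
    """Coding-strand DNA to RNA: replace thymine with uracil."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna: str) -> str:
    """RNA back to coding-strand DNA: replace uracil with thymine."""
    return rna.replace("U", "T").replace("u", "t")


dna = "ATGCATGCTGCATGACTCTGACT"
rna = transcribe(dna)
print(rna)
print(back_transcribe(rna) == dna)  # the round trip recovers the original DNA
```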

### DNA to protein

In [9]:
my_seq = Seq("ATGCATGCTGCATACTCTGACATG")
my_seq.translate()

Seq('MHAAYSDM')
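
Translation reads the DNA three letters (one codon) at a time and looks each codon up in a genetic code table. The sketch below builds the standard table in plain Python to make that lookup visible. It is an illustration under simplifying assumptions (frame 0, unambiguous letters, standard code only), and the names `CODON_TABLE` and `translate` are chosen here rather than taken from Biopython.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the codon -> amino acid lookup; '*' marks a stop codon.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCATGCTGCATACTCTGACATG"))
```

A codon containing a letter outside A, C, G, T is missing from the table, so this sketch fails with a `KeyError` on such input.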

### Read the sequence from .fasta and translate it using Biopython.

In [10]:
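from Bio.SeqIO import parse

# Passing the filename lets Biopython open and close the file itself
for a in parse("file.fasta", "fasta"):
    seq_id = a.id
    my_seq = a.seq
    # translate() raises TranslationError if a codon contains invalid letters
    print(my_seq.translate())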

WDQSAEAACVRVRVRVCACVCVRLHLCRVGKEIEMGGQ*AQVPKALNPLVWSLLRAMGAIEKSEQGCV*M*GLEGSSREASSKAFAIIW*ENPARMDRQNGIEMSWQLKWTGFGTSLVVGSKQRRIWDSGGLAWGRRGCLRGWEG*E*DDTWWCLAGGGQG*LCEGTARATEAF*DPAVPEPGRQDLHCGRPGEHLA


TranslationError: Codon 'MEP' is invalid
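
The `TranslationError` above suggests that one record in `file.fasta` contains letters that are not nucleotides (for example, a protein sequence). One possible guard, sketched here under that assumption, is to check the letters before translating. `looks_like_dna` is a hypothetical helper, the allowed alphabet (A, C, G, T plus N) is a simplification, and the second test string is a made-up protein-like example.

```python
DNA_LETTERS = set("ACGTN")


def looks_like_dna(sequence: str) -> bool:
    """Return True if a non-empty sequence uses only A, C, G, T or N (case-insensitive)."""
    return bool(sequence) and set(sequence.upper()) <= DNA_LETTERS


for candidate in ["ATGCATGCTGCA", "MEPVDPRLEPWK"]:
    if looks_like_dna(candidate):
        print(candidate, "-> safe to translate")
    else:
        print(candidate, "-> skipped, not a nucleotide sequence")
```

Inside the loop above, the same check could be applied to `str(a.seq)` before calling `translate()`.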

In [None]:
from Bio.SeqIO import parse

for seq_record in parse("file.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
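
For context, a FASTA file is plain text: each record starts with a `>` header line, followed by one or more lines of sequence. The sketch below is a deliberately simplified reader, not Biopython's parser, that shows roughly what `parse` hands back for each record. `read_fasta` and the two example records are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header_words = line[1:].split()
            # The ID is the first word after '>'; an empty header gives an empty ID.
            record_id = header_words[0] if header_words else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


example = io.StringIO(">seq1 first example\nATGCATGC\nGGCC\n>seq2\nTTAA\n")
for record_id, sequence in read_fasta(example):
    print(record_id, sequence, len(sequence))
```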

### How many A, T, G, C are there in the sequence

In [None]:
my_seq = Seq("ATGCATGCTGCATACTCTGACATG")
print("Number of As:", my_seq.count("A"))
print("Number of Gs:", my_seq.count("G"))
print("Number of Cs:", my_seq.count("C"))
print("Number of Ts:", my_seq.count("T"))
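
When all four counts are needed at once, `collections.Counter` from the standard library can tally every letter in a single pass. This sketch works on a plain string. With a `Seq` object, converting it first with `str(my_seq)` works.

```python
from collections import Counter

sequence = "ATGCATGCTGCATACTCTGACATG"
counts = Counter(sequence)  # maps each letter to how often it occurs

for base in "ATGC":
    print("Number of " + base + "s:", counts[base])
```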

### Read the FASTA file (.fa file) and print its first record

In [None]:
from Bio.SeqIO import parse

seqs = []

for seq_record in parse("file.fasta", "fasta"):
    seqs.append(seq_record)
seqs[0]

### Find GC content

In [None]:
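from Bio.SeqUtils import gc_fraction   # Biopython 1.80+; older releases provide GC() instead
gc_fraction("ACTGNAAAGGGGTGCATGCTAGCTGGGATTC")   # returns a fraction between 0 and 1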
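
GC content is the share of G and C letters in a sequence. The plain-Python sketch below spells out that arithmetic. `gc_percent` is a hypothetical helper that counts the ambiguous letter N towards the total length, and Biopython's own helpers may treat ambiguous letters differently.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C letters over the full sequence length."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100.0 * gc_count / len(upper)


print(round(gc_percent("ACTGNAAAGGGGTGCATGCTAGCTGGGATTC"), 2))
```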

### Concatenate two sequences

In [None]:
seq1= Seq("ATGCATGC")
seq2= Seq("GATCGATC")

seq1 = str(seq1)         #this is using string methods
seq2 = str(seq2)
seq = seq1+seq2
seq

In [None]:
#using seq object

seq1= Seq("ATGCATGC")
seq2= Seq("GATCGATC")
seq3 = seq1+seq2         #simply add seq objects
seq3

In [None]:
seqs = [Seq("ATGC"), Seq("GATC"),Seq("ATGC")]

#index, save in separate variables, just add
seq1 = seqs[0]
seq2 = seqs[1]
seq3 = seqs[2]
concat_seq = seq1+seq2+seq3
print(concat_seq)


#alternate method
concat = Seq("")
for i in seqs:
    concat += i
    
concat

### Take a Seq object with mixed lower and upper case and convert it to upper case

In [None]:
my_seq = Seq("AATGCTaaaaaaatgctgtATGC")
my_seq = my_seq.upper()
print("our sequence is", my_seq)

### Take three Seq objects and insert a run of 10 Ns between them

In [None]:
Seqs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
filler_seq = Seq("NNNNNNNNNN")

my_seq = Seqs[0] + filler_seq + Seqs[1] + filler_seq + Seqs[2]
my_seq

In [None]:
# using join method

Seqs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")] 
filler = Seq("N"*10) 
filler.join(Seqs) 

### Make a dataframe as instructed

In [None]:
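from Bio.SeqIO import parse
import pandas as pd

# One row per record; these column names are an assumption, as the exact instructions are not shown here
rows = []
for seq_record in parse("file.fasta", "fasta"):
    rows.append({"id": seq_record.id, "sequence": str(seq_record.seq), "length": len(seq_record)})

df = pd.DataFrame(rows)
df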

### Save all the “short” sequences of less than 150 nucleotides to a new Fasta file

In [24]:
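from Bio import SeqIO

input_file = "file.fasta"
output_file = "converted_file.fasta"
total = 0
short_records = []

for seq_record in SeqIO.parse(input_file, "fasta"):
    total += 1
    if len(seq_record) < 150:
        short_records.append(seq_record)

# Write once: calling SeqIO.write() inside the loop would overwrite the file on every match
count = SeqIO.write(short_records, output_file, "fasta")   # returns the number of records written
print(str(count) + " records selected out of " + str(total))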

1 records selected out of 2


### Select 200 random nucleotides from sequence.fasta and randomly generate 500 sequences

In [None]:
from Bio import SeqIO
import random

input_file = "sequence.fasta"
output_file = "random.fasta"

