<a href="https://colab.research.google.com/github/mariyagolchin/Biopython/blob/main/Biopython1_Parsing_sequence_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Working with sequences

- The Seq Object
- Parsing sequence file formats
- Simple FASTA parsing example
- Simple GenBank parsing example

In [4]:
try:
    import google.colab
    # Running on Google Colab, so install Biopython first
    !pip install biopython
except ImportError:
    pass

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/76/02/8b606c4aa92ff61b5eda71d23b499ab1de57d5e818be33f77b01a6f435a8/biopython-1.78-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
     |████████████████████████████████| 2.3MB 7.2MB/s
Installing collected packages: biopython
Successfully installed biopython-1.78


In [5]:
import Bio
print("Biopython version:", Bio.__version__)

Biopython version: 1.78


# The Seq Object
The Seq object essentially combines a Python string with biological methods such as complement, transcription, and translation.

In [7]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq


Seq('AGTACACTGGT')

In [10]:
# The Seq object has a number of methods which act just like those of a Python string, for example the find method:
my_seq.find('ACT')

5

In [11]:
my_seq.find('AGT')

0

In [12]:
# There is a count method too:
my_seq.count('A')

3

In [17]:
print(my_seq)
my_seq.count('GT')

AGTACACTGGT


2
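Like Python's `str.count`, the `count` method reports non-overlapping matches only. The sketch below uses plain strings to show the difference; `count_overlapping` is a hypothetical helper written for this illustration, not part of Biopython, and the `AAAA` input is a made-up example.

```python
def count_overlapping(sequence, pattern):
    """Count every occurrence of pattern, including overlapping ones."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Resume one position after the last hit so overlaps are seen.
        start = sequence.find(pattern, start + 1)
    return count


poly_a = "AAAA"  # made-up example sequence
non_overlapping = poly_a.count("AA")           # str.count skips overlaps
overlapping = count_overlapping(poly_a, "AA")  # counts hits at 0, 1 and 2
print(non_overlapping, overlapping)
```

If overlapping matches matter for your analysis, check which counting convention your tool uses before comparing numbers.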

## Nucleotide *methods*
These methods are simple: complement and reverse_complement each return a new Seq object holding the appropriate sequence, leaving the original unchanged:

In [21]:
print("my_seq=>", my_seq)
my_seq.complement()

my_seq=> AGTACACTGGT


Seq('TCATGTGACCA')

In [19]:
my_seq.reverse_complement()

Seq('ACCAGTGTACT')
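To make the two operations concrete, here is a minimal pure-Python sketch of what they compute. It is not Biopython's implementation: `COMPLEMENT_TABLE`, `complement`, and `reverse_complement` are names chosen for this illustration, and the sketch handles only the four unambiguous bases.

```python
# Map each base to its Watson-Crick partner (upper and lower case).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The results match the Seq outputs shown above for the same input.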

# Transcription and back transcription

- Transcription converts DNA to RNA by replacing every thymine (T) with uracil (U); back transcription reverses the substitution.

In [30]:
my_dna = my_seq
print("my_dna=>", my_dna)
my_rna = my_dna.transcribe()  # replace every T with U
print("my_rna=>", my_rna)
bk_dna = my_rna.back_transcribe()  # replace every U with T
print("bk_dna=>", bk_dna)

my_dna=> AGTACACTGGT
my_rna=> AGUACACUGGU
bk_dna=> AGTACACTGGT
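Because the sequence is treated as the coding strand, transcription here amounts to a letter substitution. The following pure-Python sketch illustrates that idea; the function names are hypothetical stand-ins for this example and do not reproduce Biopython's code.

```python
def transcribe(dna):
    """Coding-strand DNA to RNA: swap T for U, preserving case."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """RNA back to coding-strand DNA: swap U for T, preserving case."""
    return rna.replace("U", "T").replace("u", "t")


dna = "AGTACACTGGT"
rna = transcribe(dna)
print(rna)                   # AGUACACUGGU
print(back_transcribe(rna))  # AGTACACTGGT
```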


# Translation
You can translate RNA into a protein sequence; each stop codon is shown as an asterisk (*):

In [32]:
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
messenger_rna.translate()

Seq('MAIVMGR*KGAR*')

In [33]:
# Translation also works directly from the coding-strand DNA:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
coding_dna.translate()

Seq('MAIVMGR*KGAR*')
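The translation above reads the sequence three bases at a time and looks each codon up in the standard genetic code. Below is a rough pure-Python sketch of that lookup, assuming the standard table and unambiguous bases only. The names `CODON_TABLE` and `translate` are illustrative, and the `to_stop` flag is modelled on the option of the same name in Biopython rather than copied from its implementation.

```python
from itertools import product

# Standard genetic code in the usual T, C, A, G ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, to_stop=False):
    """Translate DNA or RNA codon by codon; '*' marks a stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(translate(mrna))                # MAIVMGR*KGAR*
print(translate(mrna, to_stop=True))  # MAIVMGR
```

Stopping at the first stop codon is usually what you want for a single open reading frame; translating through it, as in the cells above, shows every in-frame stop.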

# Parsing sequence file formats

# Simple FASTA parsing example

In [34]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta


--2020-11-21 14:19:35--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76480 (75K) [text/plain]
Saving to: ‘ls_orchid.fasta’


2020-11-21 14:19:35 (6.12 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]



In [43]:
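from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    try:
        # SeqRecord has no count() method in Biopython 1.78, so this call raises
        # AttributeError; seq_record.seq.count('TAC') would count on the Seq itself.
        print(seq_record.count('TAC'))
    except AttributeError:
        print("IOF")  # fallback printed when count() is unavailable
    print(len(seq_record), "\n")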


gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
IOF
740 

gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
IOF
753 

gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
IOF
748 

gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
IOF
744 

gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
IOF
733 

gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
IOF
718 

gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
IOF
730 

gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
IOF
704 

gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
IOF
740 

g
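For intuition about what a FASTA parser does with a file like this, here is a simplified pure-Python sketch. It is an illustration under stated assumptions, not a replacement for `SeqIO.parse`: `parse_fasta` is a hypothetical helper, the two-record input is made up, and the sketch takes the first whitespace-delimited word of each header line as the record ID.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split(None, 1)
            record_id = header[0] if header else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)  # sequence may span several lines
    if record_id is not None:
        yield record_id, "".join(chunks)


# Made-up two-record input standing in for a real FASTA file.
example = io.StringIO(">seq1 first toy record\nACGT\nAC\n>seq2\nGGTA\n")
records = list(parse_fasta(example))
for record_id, sequence in records:
    print(record_id, len(sequence))
```

Because the function is a generator, records are produced one at a time, so a large file never has to be held in memory all at once.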

# Simple GenBank parsing example

In [44]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk

--2020-11-21 14:34:23--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 235482 (230K) [text/plain]
Saving to: ‘ls_orchid.gbk’


2020-11-21 14:34:24 (8.75 MB/s) - ‘ls_orchid.gbk’ saved [235482/235482]



In [45]:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
Z78532.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
Z78531.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
Z78530.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
Z78529.1
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
Z78527.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
Z78526.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
Z78525.1
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
Z78524.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
Z78523.1
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG')
709
Z78522.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG')
700
Z78521.1
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC')
726
Z78520.1
Seq('CGTAACAAGGTTTC