# Biopython

## Sequences and records

Load the required modules.

In [1]:
from Bio import SeqIO

Read a GenBank sequence file.

In [2]:
seq_records = SeqIO.parse('382544572.gbk', 'genbank')

`SeqIO.parse` returns an iterator over the records in the file. This file contains just a single record, so we take that one with `next`.

In [3]:
seq_record = next(seq_records)
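
For files known to hold exactly one record, Biopython also offers `SeqIO.read`, which returns the record directly and raises a `ValueError` when the file holds zero or several records. The sketch below compares the two approaches. To stay self-contained it parses a made-up FASTA string from memory instead of the GenBank file used in this notebook; the record name and sequence are assumptions for illustration only.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-record FASTA text, standing in for a real file.
fasta_text = ">toy_record made-up example\nACGTACGTAC\n"

# Approach used above: parse() gives an iterator, next() takes the first record.
record_a = next(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Alternative: read() expects exactly one record and returns it directly.
record_b = SeqIO.read(StringIO(fasta_text), "fasta")

print(record_a.id, len(record_a))
print(record_b.id, len(record_b))

# With more than one record, read() refuses instead of silently taking the first.
two_records = fasta_text + ">second_record\nGGCC\n"
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("read() raised ValueError:", err)
```

Using `next` on the iterator is fine here, but it would quietly ignore any further records, whereas `SeqIO.read` makes the single-record assumption explicit.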

Some metadata can be printed.

In [7]:
print('ID: ', seq_record.id)
print('Name: ', seq_record.name)
print('Description: ', seq_record.description)
print('Length: ', len(seq_record))

ID:  NG_007109.2
Name:  NG_007109
Description:  Homo sapiens mutL homolog 1 (MLH1), RefSeqGene (LRG_216) on chromosome 3.
Length:  79540


Check the number of features.

In [8]:
print('Number of features: ', len(seq_record.features))

Number of features:  36


Figure out which features we have in the record, and what index they have in the list.

In [10]:
features = dict()
for i, feature in enumerate(seq_record.features):
    if feature.type not in features:
        features[feature.type] = []
    features[feature.type].append(i)
# Use feature_type rather than type to avoid shadowing the built-in.
for feature_type in features:
    print('{0}: {1}'.format(feature_type, ','.join(str(x) for x in features[feature_type])))

exon: 6,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26
gene: 1,4,16,27
source: 0
CDS: 3,7,32,33,34,35
mRNA: 2,5,28,29,30,31
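
The grouping step above can also be written with `collections.defaultdict`, which removes the explicit membership check. This is only a sketch: the list `feature_types` is a stand-in for `[f.type for f in seq_record.features]` and contains just the first seven types, as implied by the indices printed above.

```python
from collections import defaultdict

# Stand-in for [f.type for f in seq_record.features]; first seven entries only.
feature_types = ["source", "gene", "mRNA", "CDS", "gene", "mRNA", "exon"]

# Map each feature type to the list of indices at which it occurs.
indices_by_type = defaultdict(list)
for i, feature_type in enumerate(feature_types):
    indices_by_type[feature_type].append(i)

for feature_type, indices in indices_by_type.items():
    print("{0}: {1}".format(feature_type, ",".join(str(x) for x in indices)))
```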


Which genes do we have?

In [11]:
for index in features['gene']:
    print(seq_record.features[index])

type: gene
location: [<0:4955](-)
qualifiers:
    Key: db_xref, Value: ['GeneID:9852', 'HGNC:HGNC:19735', 'MIM:607911']
    Key: gene, Value: ['EPM2AIP1']
    Key: note, Value: ['EPM2A interacting protein 1']

type: gene
location: [5000:62497](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:4292', 'HGNC:HGNC:7127', 'MIM:120436']
    Key: gene, Value: ['MLH1']
    Key: gene_synonym, Value: ['COCA2; FCC2; hMLH1; HNPCC; HNPCC2']
    Key: note, Value: ['mutL homolog 1']

type: gene
location: [28048:28685](-)
qualifiers:
    Key: db_xref, Value: ['GeneID:100131713', 'HGNC:HGNC:36905']
    Key: gene, Value: ['RPL29P11']
    Key: gene_synonym, Value: ['RPL29_2_362']
    Key: note, Value: ['ribosomal protein L29 pseudogene 11']
    Key: pseudo, Value: ['']

type: gene
location: [64276:>79540](-)
qualifiers:
    Key: db_xref, Value: ['GeneID:9209', 'HGNC:HGNC:6703', 'MIM:614043']
    Key: gene, Value: ['LRRFIP2']
    Key: gene_synonym, Value: ['HUFI-2']
    Key: note, Value: ['LRR binding FLII

We're interested in the MLH1 gene, so we look at this feature in detail. It is the second gene feature listed above, which sits at index 4 in the feature list.

In [13]:
mklh1_feature = seq_record.features[4]
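
The index 4 is specific to this file. A less brittle option is to look the feature up by its `gene` qualifier. The helper `find_gene` below is hypothetical (it is not part of Biopython), and the two toy features are built by hand with coordinates copied from the output above, so the sketch runs without the GenBank file.

```python
from Bio.SeqFeature import FeatureLocation, SeqFeature


def find_gene(features, gene_name):
    """Return (index, feature) of the first 'gene' feature named gene_name."""
    for index, feature in enumerate(features):
        # Qualifier values are lists of strings, e.g. {'gene': ['MLH1']}.
        if feature.type == "gene" and gene_name in feature.qualifiers.get("gene", []):
            return index, feature
    raise LookupError("no gene feature named {0!r}".format(gene_name))


# Hand-built stand-ins for two of the gene features in the record.
toy_features = [
    SeqFeature(FeatureLocation(0, 4955, strand=-1), type="gene",
               qualifiers={"gene": ["EPM2AIP1"]}),
    SeqFeature(FeatureLocation(5000, 62497, strand=1), type="gene",
               qualifiers={"gene": ["MLH1"]}),
]

index, gene_feature = find_gene(toy_features, "MLH1")
print(index, gene_feature.location)
```

With the real record, the call would be `find_gene(seq_record.features, 'MLH1')`.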

The location is of special interest if we want to slice the record.

In [16]:
start_pos = int(mklh1_feature.location.start)
end_pos = int(mklh1_feature.location.end)

In [17]:
mklh1_record = seq_record[start_pos:end_pos]

Note that the subsequence record has retained a number of features, but only those that lie entirely within the slice; their coordinates are shifted so that they are relative to the start of the subsequence.

In [18]:
len(mklh1_record.features)

23
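
The way slicing filters features can be checked on a small, hand-built record. Everything in this sketch (the 100-base sequence and the four features) is made up for illustration. It shows that a feature before the slice and a feature straddling the slice boundary are dropped, while features fully inside the slice are kept with shifted coordinates.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record of 100 bases with four hand-made features.
toy = SeqRecord(Seq("ACGT" * 25), id="toy")
toy.features = [
    SeqFeature(FeatureLocation(0, 20, strand=-1), type="gene"),  # before the slice
    SeqFeature(FeatureLocation(30, 80, strand=1), type="gene"),  # defines the slice
    SeqFeature(FeatureLocation(40, 50, strand=1), type="exon"),  # fully inside
    SeqFeature(FeatureLocation(70, 95, strand=1), type="exon"),  # straddles the end
]

# Slice the record at the boundaries of the second feature.
gene = toy.features[1]
sub = toy[int(gene.location.start):int(gene.location.end)]

# Collect (type, start, end) of the features that survived the slice.
kept = [(f.type, int(f.location.start), int(f.location.end)) for f in sub.features]
print(len(sub), kept)
```

Only the gene that defines the slice and the exon inside it remain, now at positions 0 to 50 and 10 to 20.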

In [19]:
for feature in mklh1_record.features:
    print(feature)

type: gene
location: [0:57497](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:4292', 'HGNC:HGNC:7127', 'MIM:120436']
    Key: gene, Value: ['MLH1']
    Key: gene_synonym, Value: ['COCA2; FCC2; hMLH1; HNPCC; HNPCC2']
    Key: note, Value: ['mutL homolog 1']

type: mRNA
location: join{[0:314](+), [3269:3360](+), [7605:7704](+), [11051:11125](+), [13641:13714](+), [15464:15556](+), [18470:18513](+), [18661:18750](+), [21082:21195](+), [24156:24250](+), [26960:27114](+), [32287:32658](+), [35434:35583](+), [46836:46945](+), [48918:48982](+), [54169:54334](+), [55167:55260](+), [55554:55668](+), [57136:57497](+)}
qualifiers:
    Key: db_xref, Value: ['GI:263191547', 'LRG:t1', 'GeneID:4292', 'HGNC:HGNC:7127', 'MIM:120436']
    Key: gene, Value: ['MLH1']
    Key: gene_synonym, Value: ['COCA2; FCC2; hMLH1; HNPCC; HNPCC2']
    Key: product, Value: ['mutL homolog 1, transcript variant 1']
    Key: transcript_id, Value: ['NM_000249.3']
Sub-Features
type: mRNA
location: [0:314](+)
qualifiers:

...

The length of this subsequence is of course less than that of the entire sequence.

In [20]:
len(mklh1_record)

57497
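
Slicing with the start and end positions always returns the sequence as stored, that is, the plus strand. That matches MLH1, which lies on the plus strand, but for a minus-strand gene such as EPM2AIP1 the coding orientation is the reverse complement. `SeqFeature.extract` takes the strand into account. The sketch below uses a made-up 12-base sequence and two hypothetical features to show the difference.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

# Made-up parent sequence; positions 3..9 hold "CCCTTT".
parent = Seq("AAACCCTTTGGG")

# Same coordinates, opposite strands.
plus_feature = SeqFeature(FeatureLocation(3, 9, strand=1), type="gene")
minus_feature = SeqFeature(FeatureLocation(3, 9, strand=-1), type="gene")

plain_slice = parent[3:9]                      # ignores the strand
plus_seq = plus_feature.extract(parent)        # same as the plain slice
minus_seq = minus_feature.extract(parent)      # reverse complement of the slice

print(plain_slice, plus_seq, minus_seq)
```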