## Read and write sequence files using Biopython

### Sequence objects

In [1]:
from Bio.Seq import Seq

A `Seq` instance is essentially a string of letters with some additional, biology-specific methods.

In [2]:
s = Seq("GATTACA")
s

Seq('GATTACA')

In [3]:
s.complement()

Seq('CTAATGT')

In [4]:
s.reverse_complement()

Seq('TGTAATC')

In [5]:
s.transcribe()

Seq('GAUUACA')

In [6]:
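s.translate()

Seq('DY')

Only the first six letters are translated here (`GAT` gives `D`, `TAC` gives `Y`); the trailing `A` is an incomplete codon, for which Biopython emits a warning.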

In [7]:
s + s

Seq('GATTACAGATTACA')

In [8]:
str(s)

'GATTACA'
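
The string-like side of `Seq` can be sketched in the same way. The snippet below is a minimal illustration, assuming a recent Biopython version; the variable names are chosen here for the example only.

```python
from Bio.Seq import Seq

s = Seq("GATTACA")

length = len(s)            # number of letters in the sequence
first_three = s[:3]        # slicing returns a new Seq object
n_adenine = s.count("A")   # count non-overlapping occurrences of a letter
has_motif = "TTA" in s     # substring test, as for an ordinary string

print(length, first_three, n_adenine, has_motif)
```

Because slicing returns a `Seq` rather than a plain string, methods such as `reverse_complement()` remain available on the result.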

### Sequence annotation objects

A sequence annotation object (`SeqRecord`) attaches further information to a `Seq` object.

Among others, it has the following attributes:

- `id`
- `name`
- `description`

In [9]:
from Bio.SeqRecord import SeqRecord

In [10]:
sr = SeqRecord(s, id="test")
sr

SeqRecord(seq=Seq('GATTACA'), id='test', name='<unknown name>', description='<unknown description>', dbxrefs=[])

In [11]:
sr.seq

Seq('GATTACA')

In [12]:
sr.id

'test'
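
As a small sketch of how these attributes are used, the record below sets all three of them and renders itself as FASTA text with `SeqRecord.format`. The values `"example"` and `"toy sequence"` are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All attribute values here are illustrative placeholders.
record = SeqRecord(
    Seq("GATTACA"),
    id="test",
    name="example",
    description="toy sequence",
)

# In FASTA output, the header line is built from id and description;
# the name attribute is not written.
fasta_text = record.format("fasta")
print(fasta_text)
```

This also explains why the records later in this notebook are created with `description=""`: otherwise the default placeholder description would end up in the FASTA header.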

### Sequence input/output

This section shows how to read sequences from, and write sequences to, FASTA files.

In [13]:
from Bio import SeqIO

To read sequences (as `SeqRecord`s) from a FASTA file, we can use the `SeqIO.parse(filename, 'fasta')` function. It returns an iterator of `SeqRecord`s.

One can use an iterator directly in a `for` loop, or retrieve the next element using `next`. To convert the iterator to a list, one can use the `list` function.
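
The three ways of consuming the iterator can be sketched as follows. To keep the example self-contained, it parses from an in-memory `StringIO` handle instead of a file on disk; the FASTA content is a made-up placeholder.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA content, held in memory instead of in a file.
fasta_text = ">seq1\nACTG\n>seq2\nGATTACA\n"

# 1. Use the iterator directly in a loop (here: a list comprehension).
ids = [record.id for record in SeqIO.parse(StringIO(fasta_text), "fasta")]

# 2. Retrieve only the first record with next().
first = next(SeqIO.parse(StringIO(fasta_text), "fasta"))

# 3. Read all records into a list.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(ids, first.seq, len(records))
```

Note that an iterator can only be consumed once, which is why each variant calls `SeqIO.parse` again.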

To write `SeqRecord` objects to a FASTA file, one can use the `SeqIO.write(records, filename, 'fasta')` function, which returns the number of records written. Here `records` can be a list or any other iterable, such as a generator expression; the latter does not require all the records to be in memory at once.
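
A minimal sketch of the generator-expression variant is shown below. It writes to an in-memory `StringIO` handle rather than a file so that it can run anywhere; the `sequences` dictionary is a hypothetical stand-in for a larger data source.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical input data: identifier -> sequence string.
sequences = {"seq1": "ACTG", "seq2": "GATTACA"}

# Generator expression: each SeqRecord is created only when SeqIO.write asks for it.
records = (
    SeqRecord(Seq(letters), id=name, description="")
    for name, letters in sequences.items()
)

handle = StringIO()
n_written = SeqIO.write(records, handle, "fasta")
fasta_text = handle.getvalue()

print(n_written)
print(fasta_text)
```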

In [14]:
seq1 = SeqRecord(Seq("ACTG"), id="seq1", description="")
seq2 = SeqRecord(Seq("GATTACA"), id="seq2", description="")
SeqIO.write([seq1, seq2], "/tmp/seq.fasta", "fasta")

2

In [15]:
# Avoid the name `iter`, which would shadow the Python built-in
records = SeqIO.parse("/tmp/seq.fasta", "fasta")
list(records)

[SeqRecord(seq=Seq('ACTG'), id='seq1', name='seq1', description='seq1', dbxrefs=[]),
 SeqRecord(seq=Seq('GATTACA'), id='seq2', name='seq2', description='seq2', dbxrefs=[])]

### Exercise

In notebook 02 you created a function that returns the amino-acid sequence of a given chain. Now we can write that sequence to a FASTA file.

Consider '9ds2' again.

- figure out the names of the heavy and the light chain (using ab_ag.tsv)
- copy the function from notebook 02 into a code cell below
- extract the amino-acid sequence of those two chains, up to residue number 109 for the light chain and 113 for the heavy chain
- write them into separate FASTA files, '/tmp/VH.fa' (heavy chain) and '/tmp/VL.fa' (light chain)

In [27]:
import os.path

from Bio.PDB.PDBParser import PDBParser

# Directory containing the <pdb_id>_chothia.pdb files
PDB_DIR = "../data/pdbs"

three2one = {
    'ALA': 'A',
    'ARG': 'R',
    'ASN': 'N',
    'ASP': 'D',
    'CYS': 'C',
    'GLN': 'Q',
    'GLU': 'E',
    'GLY': 'G',
    'HIS': 'H',
    'ILE': 'I',
    'LEU': 'L',
    'LYS': 'K',
    'MET': 'M',
    'PHE': 'F',
    'PRO': 'P',
    'SER': 'S',
    'THR': 'T',
    'TRP': 'W',
    'TYR': 'Y',
    'VAL': 'V',
    'SEC': 'U',  # Selenocysteine
    'PYL': 'O',  # Pyrrolysine
    'ASX': 'B',  # Aspartic acid or Asparagine
    'GLX': 'Z',  # Glutamic acid or Glutamine
    'XLE': 'J',  # Leucine or Isoleucine
    'UNK': 'X'   # Unknown
}


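def extract_aa_sequence(pdb_id, chain, limit):
    """Return the sequence of `chain` up to residue number `limit` as a SeqRecord."""
    filename = os.path.join(PDB_DIR, f"{pdb_id}_chothia.pdb")

    parser = PDBParser(PERMISSIVE=1)
    structure = parser.get_structure(pdb_id, filename)

    # Use the first model of the structure
    residues = structure[0][chain].get_residues()

    # Keep residues with a blank hetero flag (this drops waters and ligands)
    # whose residue number does not exceed `limit`
    aa_residues = [res for res in residues if res.id[0] == ' ' and res.id[1] <= limit]

    # Convert three-letter residue names to one-letter codes
    sequence = ''.join(three2one[res.get_resname()] for res in aa_residues)
    return SeqRecord(Seq(sequence), id=pdb_id, description='')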

vh = extract_aa_sequence("9ds2", "H", 113)
vl = extract_aa_sequence("9ds2", "L", 109)



In [29]:
SeqIO.write([vh], "/tmp/VH.fa", "fasta")
SeqIO.write([vl], "/tmp/VL.fa", "fasta")

1