# Biopython for working with PDB files

Useful links:

- Ultimate tutorial on Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html
- Navigation through documentation, case examples: https://biopython.org/wiki/Documentation
- Manual on PDB module: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

## 0. Installation

In [1]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 14.4 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.83


## 1. Parse PDB file into structure object

In [2]:
!wget https://files.rcsb.org/download/1brs.pdb

--2024-02-08 21:17:33--  https://files.rcsb.org/download/1brs.pdb
Resolving files.rcsb.org (files.rcsb.org)... 128.6.159.157
Connecting to files.rcsb.org (files.rcsb.org)|128.6.159.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘1brs.pdb’

1brs.pdb                [ <=>                ] 456.02K  --.-KB/s    in 0.1s    

2024-02-08 21:17:33 (3.17 MB/s) - ‘1brs.pdb’ saved [466965]



In [3]:
# Create a PDBParser object
from Bio.PDB.PDBParser import PDBParser
p = PDBParser(QUIET=True) # silence warnings

In [4]:
# Create the structure object
filename = '1brs.pdb'
# The first argument is a custom id for the structure. It is rarely needed,
# so a placeholder such as a single space is enough.
structure = p.get_structure(' ', filename)
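
`get_structure()` also accepts an open file handle instead of a filename, which is handy for quick experiments without downloading anything. The sketch below parses a made-up two-residue fragment from a string. The `atom_line` helper and the coordinates are invented for illustration; they are not part of Biopython or of 1BRS.

```python
from io import StringIO

from Bio.PDB.PDBParser import PDBParser


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column PDB ATOM record (hypothetical helper)."""
    return (
        f"ATOM  {serial:5d}  {name:<3s} {resname:>3s} {chain}{resseq:4d}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}"
        + " " * 10
        + f"{element:>2s}"
    )


# Made-up coordinates for two residues of a chain 'A'
pdb_text = "\n".join([
    atom_line(1, "N", "VAL", "A", 3, 16.783, 48.812, 26.447, "N"),
    atom_line(2, "CA", "VAL", "A", 3, 17.591, 48.101, 25.416, "C"),
    atom_line(3, "N", "ILE", "A", 4, 18.912, 49.920, 24.305, "N"),
    atom_line(4, "CA", "ILE", "A", 4, 19.604, 50.817, 23.391, "C"),
]) + "\n"

parser = PDBParser(QUIET=True)
toy = parser.get_structure("toy", StringIO(pdb_text))

print(toy.get_id())  # the custom id passed as the first argument
print(len(toy), "model(s)")
print([res.get_resname() for res in toy.get_residues()])
```

Here the first argument is set to a meaningful id (`"toy"`), which then shows up in `get_id()` and in the full ids of all child entities.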

## 2. PDB file header

In [5]:
# self-explanatory keys of header dictionary
structure.header.keys()

dict_keys(['name', 'head', 'idcode', 'deposition_date', 'release_date', 'structure_method', 'resolution', 'structure_reference', 'journal_reference', 'author', 'compound', 'source', 'has_missing_residues', 'missing_residues', 'keywords', 'journal'])

In [6]:
# method of structure determination
structure.header['structure_method']

'x-ray diffraction'

In [7]:
# print missing residues (chain, residue name, residue number), if any
if structure.header['has_missing_residues']:
    missres = structure.header['missing_residues']
    for res in missres:
        print(res['chain'], res['res_name'], res['ssseq'])

A ALA 1
A GLN 2
C ALA 1
C GLN 2
D GLU 64
D ASN 65
E LYS 1
E GLU 64
E ASN 65
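
Each entry of `header['missing_residues']` is a dictionary, so the list is easy to regroup. A minimal sketch that groups missing residues by chain, using a hand-copied subset of the entries printed above instead of a parsed file. Only the keys `chain`, `res_name`, and `ssseq` are used, `ssseq` is assumed to be an integer, and the name `missing_by_chain` is illustrative.

```python
from collections import defaultdict

# Hand-copied subset of the entries printed above, reduced to the three keys used there
missres = [
    {"chain": "A", "res_name": "ALA", "ssseq": 1},
    {"chain": "A", "res_name": "GLN", "ssseq": 2},
    {"chain": "D", "res_name": "GLU", "ssseq": 64},
    {"chain": "D", "res_name": "ASN", "ssseq": 65},
    {"chain": "E", "res_name": "LYS", "ssseq": 1},
]

# Collect (residue number, residue name) pairs per chain id
missing_by_chain = defaultdict(list)
for res in missres:
    missing_by_chain[res["chain"]].append((res["ssseq"], res["res_name"]))

for chain_id, residues in sorted(missing_by_chain.items()):
    print(chain_id, residues)
```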


## 3. Coordinate section

Hierarchy of elements in the structure object:

- A structure consists of models. X-ray and EM structures usually contain a single model, whereas NMR structures contain multiple models (conformers).
- A model consists of chains. Chains are usually named with a single capital letter.
- A chain consists of residues.
- A residue consists of atoms.

In [8]:
for model in structure:
    print(model)
    for chain in model:
        print(chain)
        for residue in chain:
            print(residue)
            # atoms are reached the same way: for atom in residue: ...
            break  # show only the first residue of each chain

<Model id=0>
<Chain id=A>
<Residue VAL het=  resseq=3 icode= >
<Chain id=B>
<Residue ALA het=  resseq=1 icode= >
<Chain id=C>
<Residue VAL het=  resseq=3 icode= >
<Chain id=D>
<Residue LYS het=  resseq=1 icode= >
<Chain id=E>
<Residue LYS het=  resseq=2 icode= >
<Chain id=F>
<Residue LYS het=  resseq=1 icode= >


The nested for-loops above only demonstrate how the structure object is organized. In practice, iterating through every level by hand is inconvenient; the `get_*` methods reach the components of interest directly:

In [9]:
atoms = structure.get_atoms()

for atom in atoms:
    print(atom.get_coord()) # get atom coordinates
    break

[16.783 48.812 26.447]
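
`get_coord()` returns a NumPy array, so ordinary array arithmetic applies. The sketch below computes the distance between the coordinates printed above and a second, made-up point. With real `Atom` objects, Biopython also supports `atom1 - atom2` as a shortcut for the same distance.

```python
import numpy as np

coord_a = np.array([16.783, 48.812, 26.447])  # coordinates printed above
coord_b = np.array([17.783, 48.812, 26.447])  # made-up point, shifted by 1 angstrom along x

# Euclidean distance between the two points, in angstroms
distance = np.linalg.norm(coord_a - coord_b)
print(round(float(distance), 3))  # 1.0
```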


In [10]:
for res in structure.get_residues():
    print(res.get_full_id(), res.get_resname())
    break

(' ', 0, 'A', (' ', 3, ' ')) VAL


Useful tip: to check methods and attributes of the objects, use `dir()`:

In [11]:
for res in structure.get_residues():
    print(dir(res))
    break

['__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_generate_full_id', '_id', '_reset_full_id', 'add', 'center_of_mass', 'child_dict', 'child_list', 'copy', 'detach_child', 'detach_parent', 'disordered', 'flag_disordered', 'full_id', 'get_atoms', 'get_full_id', 'get_id', 'get_iterator', 'get_level', 'get_list', 'get_parent', 'get_resname', 'get_segid', 'get_unpacked_list', 'has_id', 'id', 'insert', 'internal_coord', 'is_disordered', 'level', 'parent', 'resname', 'segid', 'set_parent', 'transform', 'xtra']


An alternative way to access structure components is indexing by id:

In [12]:
# chain A of the first model
chainA = structure[0]['A']

for atom in chainA.get_atoms():
    print(atom)
    break

<Atom N>


In [13]:
# residue number 4 of chain F
# note that 4 is the residue sequence number from the PDB file, not a Python index
res4 = structure[0]['F'][4]
res4

<Residue VAL het=  resseq=4 icode= >

In [14]:
res4.get_full_id()

(' ', 0, 'F', (' ', 4, ' '))

The full id of the residue is a tuple with the following items:

0. structure id, which you specified when loading the structure with the parser
1. model id
2. chain id
3. residue id, a tuple of (hetero flag, residue sequence number, insertion code)

In [15]:
res4.get_full_id()[3][1]

4
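
An integer index such as `chain[4]` is shorthand for the residue id `(' ', 4, ' ')`. Hetero residues have a non-blank hetero flag and must be addressed with the full residue id. The sketch below illustrates this on a made-up chain parsed from a string; the `pdb_line` helper, the coordinates, and the water numbered 201 are assumptions for illustration and are not taken from 1BRS.

```python
from io import StringIO

from Bio.PDB.PDBParser import PDBParser


def pdb_line(record, serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column ATOM/HETATM record (hypothetical helper)."""
    return (
        f"{record:<6s}{serial:5d}  {name:<3s} {resname:>3s} {chain}{resseq:4d}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}"
        + " " * 10
        + f"{element:>2s}"
    )


# Made-up chain 'F' with one amino acid and one water molecule
pdb_text = "\n".join([
    pdb_line("ATOM", 1, "N", "VAL", "F", 4, 10.0, 11.0, 12.0, "N"),
    pdb_line("ATOM", 2, "CA", "VAL", "F", 4, 11.2, 11.5, 12.3, "C"),
    pdb_line("HETATM", 3, "O", "HOH", "F", 201, 20.0, 21.0, 22.0, "O"),
]) + "\n"

chain = PDBParser(QUIET=True).get_structure("toy", StringIO(pdb_text))[0]["F"]

# Unpack the full id of a regular residue into its components
residue = chain[4]
structure_id, model_id, chain_id, (hetflag, resseq, icode) = residue.get_full_id()
print(structure_id, model_id, chain_id, repr(hetflag), resseq, repr(icode))

# Waters carry the hetero flag 'W', so a plain integer does not find them
water = chain[("W", 201, " ")]
print(water.get_resname(), 4 in chain, 201 in chain)
```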

## 4. Save structure object to PDB file

In [16]:
from Bio.PDB.PDBIO import PDBIO
io = PDBIO()

Pass the structure, or the part of it you want to save, to `.set_structure()`.

In [17]:
structure_to_save = structure[0]['F']
io.set_structure(structure_to_save)
name = '1brsF.pdb'
io.save(name)
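
For finer control than saving a whole chain, `save()` takes a `select` argument: an instance of a `Select` subclass whose `accept_*` methods decide what is written. A minimal sketch that keeps only alpha-carbon atoms, run on a made-up fragment and written to an in-memory buffer. The `atom_line` helper, the `CAOnly` class name, and the coordinates are illustrative assumptions.

```python
from io import StringIO

from Bio.PDB import PDBIO, PDBParser, Select


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column PDB ATOM record (hypothetical helper)."""
    return (
        f"ATOM  {serial:5d}  {name:<3s} {resname:>3s} {chain}{resseq:4d}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}"
        + " " * 10
        + f"{element:>2s}"
    )


class CAOnly(Select):
    """Accept only alpha-carbon atoms; all other atoms are skipped on output."""

    def accept_atom(self, atom):
        return atom.get_id() == "CA"


# Made-up two-residue chain standing in for a real structure
pdb_text = "\n".join([
    atom_line(1, "N", "VAL", "F", 4, 10.0, 11.0, 12.0, "N"),
    atom_line(2, "CA", "VAL", "F", 4, 11.2, 11.5, 12.3, "C"),
    atom_line(3, "N", "ILE", "F", 5, 12.4, 12.1, 13.0, "N"),
    atom_line(4, "CA", "ILE", "F", 5, 13.6, 12.8, 13.5, "C"),
]) + "\n"
toy = PDBParser(QUIET=True).get_structure("toy", StringIO(pdb_text))

pdb_io = PDBIO()
pdb_io.set_structure(toy)
buffer = StringIO()  # a filename can be passed here instead of a handle
pdb_io.save(buffer, select=CAOnly())
saved_text = buffer.getvalue()
print(saved_text)
```

`Select` also provides `accept_model`, `accept_chain`, and `accept_residue`, so the same pattern can filter at any level of the hierarchy.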

## 5. Extract sequence from structure

Extract the sequence either from the SEQRES records (`pdb-seqres`) or from the ATOM records (`pdb-atom`).

The SEQRES sequence is the sequence of the protein that was studied, while the ATOM sequence is what was actually resolved in the experiment (in the crystal, for X-ray structures). Some SEQRES residues may be absent from the ATOM records because they could not be resolved, for example due to high flexibility or experimental limitations. With `pdb-atom`, gaps inside a chain are filled with `X`. The SEQRES sequence may also differ from the sequence in reference databases such as UniProt, for instance when researchers introduced mutations to stabilize the protein for crystallization or to study a particular mutant.

Read more about the differences in SEQRES and ATOM sequences:

https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/primary-sequences-and-the-pdb-format \
https://www.wwpdb.org/documentation/file-format-content/format33/sect3.html#SEQRES

More examples with `SeqIO` module: https://biopython.org/docs/1.75/api/Bio.SeqIO.PdbIO.html

In [18]:
from Bio import SeqIO

In [19]:
for record in SeqIO.parse(filename, "pdb-seqres"):
    print(record.annotations['chain'], record.seq)

A AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
B AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
C AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
D KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
E KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
F KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS


In [20]:
for record in SeqIO.parse(filename, "pdb-atom"):
    print(record.annotations['chain'], record.seq)

A VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
B AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
C VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
D KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTXXGAESVLQVFREAKAEGADITIILS
E KAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTXXGAESVLQVFREAKAEGADITIILS
F KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
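
The `X` characters in chains D and E above mark residues that are missing inside the chain. When the two sequences have the same length and start at the same residue, as for chain D, the unresolved positions can be found by comparing them position by position. A sketch on a fragment of chain D (residues 59 to 70, copied from the outputs above); the variable names are illustrative.

```python
seqres_fragment = "SKQLTENGAESV"  # pdb-seqres output, chain D, residues 59-70
atom_fragment = "SKQLTXXGAESV"    # the same stretch in the pdb-atom output
first_residue_number = 59

# Positions where the ATOM sequence has a gap marker but SEQRES names a residue
unresolved = [
    (first_residue_number + i, expected)
    for i, (expected, observed) in enumerate(zip(seqres_fragment, atom_fragment))
    if observed == "X" and expected != "X"
]
print(unresolved)  # [(64, 'E'), (65, 'N')]
```

This agrees with the missing residues GLU 64 and ASN 65 listed in the header for chain D. Residues missing at the ends of a chain, such as the first two residues of chain A, are not padded with `X`, so this direct comparison does not apply there without first aligning the two sequences.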




In [21]:
for record in SeqIO.parse(filename, "pdb-atom"):
    if record.annotations['chain'] == 'C':
        seq = record.seq
        print(seq)

VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR




## 6. Save sequence to fasta file

In [22]:
for record in SeqIO.parse(filename, "pdb-seqres"):
    if record.annotations['chain'] == 'A':
        fasta = record
        print(record.seq)

AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR


In [23]:
SeqIO.write(fasta, "1brs_A.fasta", "fasta")

1
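
The `1` returned by `SeqIO.write()` is the number of records written. The FASTA header line is built from the record's `id` and `description`, so both can be adjusted before writing. A small sketch that writes a hand-made record to an in-memory handle; the id and description are made up, and the sequence is the first 20 residues of chain A from the SEQRES output above.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("AQVINTFDGVADYLQTYHKL"),      # first 20 residues of chain A (SEQRES)
    id="1brs_A",                      # made-up identifier
    description="chain A fragment",   # made-up description
)

handle = StringIO()  # a filename such as "1brs_A.fasta" works as well
n_written = SeqIO.write(record, handle, "fasta")
fasta_text = handle.getvalue()
print(n_written)
print(fasta_text)
```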

## 7. More sequence-related data

Apart from the sequence, the records returned by `SeqIO.parse()` carry other useful information:

In [24]:
for record in SeqIO.parse("1brs.pdb", "pdb-seqres"):
    print('What is record:', dir(record))
    print('Annotations:', record.annotations.keys())
    break

What is record: ['_AnnotationsDict', '_AnnotationsDictValue', '__add__', '__annotations__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'count', 'dbxrefs', 'description', 'features', 'format', 'id', 'islower', 'isupper', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
Annotations: dict_keys(['chain', 'molecule_type'])


In [25]:
for record in SeqIO.parse("1brs.pdb", "pdb-atom"):
    print('What is record:', dir(record))
    print('Annotations:', record.annotations.keys())
    break

What is record: ['_AnnotationsDict', '_AnnotationsDictValue', '__add__', '__annotations__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'count', 'dbxrefs', 'description', 'features', 'format', 'id', 'islower', 'isupper', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
Annotations: dict_keys(['molecule_type', 'model', 'chain', 'start', 'end', 'name', 'head', 'idcode', 'deposition_date', 'release_date', 'structure_method', 'resolution', 'structure_reference', 'journal_ref



In [26]:
# get the UniProt accession number
for record in SeqIO.parse(filename, "pdb-seqres"):
    if record.annotations['chain'] == 'A':
        ref = record.dbxrefs[0]
        print(ref)

UNP:P00648
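
Each cross-reference is a plain string of the form `database:accession`, so the accession can be separated with ordinary string methods. A minimal sketch on the value printed above; the variable names are illustrative.

```python
dbxref = "UNP:P00648"  # value printed above

# Split at the first colon into the database prefix and the accession
database, _, accession = dbxref.partition(":")
print(database, accession)
```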
