# Biopython for working with PDB files

## Useful links

Ultimate tutorial on Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html 

Navigation through documentation, case examples: https://biopython.org/wiki/Documentation

Manual on PDB module: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

## 0. Installation 

Install via `pip` or `conda`:

In [None]:
!pip install biopython

In [None]:
!conda install -c conda-forge biopython

## 1. Download PDB file

In [1]:
pdbid = '1brs'

In [2]:
# Create PDBList object
from Bio.PDB import PDBList
pdbl = PDBList()

In [3]:
# Download the structure in PDB format. Note that the file will have the .ent extension.
pdbl.retrieve_pdb_file(pdbid, file_format='pdb', pdir='.')

Downloading PDB structure '1brs'...


'./pdb1brs.ent'

Sometimes using `wget` is an easier way:

In [6]:
!wget https://files.rcsb.org/download/1brs.pdb

--2022-05-03 15:58:49--  https://files.rcsb.org/download/1brs.pdb
Resolving files.rcsb.org (files.rcsb.org)... 128.6.158.49
Connecting to files.rcsb.org (files.rcsb.org)|128.6.158.49|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: '1brs.pdb'

1brs.pdb                [       <=>          ] 455.94K   326KB/s    in 1.4s    

2022-05-03 15:58:51 (326 KB/s) - '1brs.pdb' saved [466884]



Or use `os.system()` to build the command from Python variables and pass it to the shell:

In [None]:
import os
os.system('wget https://files.rcsb.org/download/'+pdbid+'.pdb')
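
Where `wget` is not installed, the same download can be done with the standard library alone. The following is a minimal sketch, assuming the URL pattern used in the `wget` call above; `rcsb_url` and `download_pdb` are hypothetical helper names introduced here for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL pattern as in the wget call above
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdbid}.pdb"


def rcsb_url(pdbid: str) -> str:
    """Build the download URL for a PDB entry (hypothetical helper)."""
    return RCSB_DOWNLOAD_URL.format(pdbid=pdbid)


def download_pdb(pdbid: str, out_dir: str = ".") -> Path:
    """Download the entry into out_dir and return the local path (hypothetical helper)."""
    target = Path(out_dir) / f"{pdbid}.pdb"
    urlretrieve(rcsb_url(pdbid), target)  # needs network access
    return target
```

Calling `download_pdb(pdbid)` should leave `1brs.pdb` in the current directory, just like the `wget` command.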

## 2. Load structure file

In [7]:
# Create PDBParser object
from Bio.PDB.PDBParser import PDBParser
p = PDBParser()

In [8]:
# Create the structure object
filename = '1brs.pdb'
# The first argument is a custom id for the structure. It is rarely needed, so you may pass an empty string.
structure = p.get_structure('my_structure', filename)



Warnings about discontinuous chains are typically raised when residues are missing from the structure. To suppress these warnings, use the following:

In [9]:
import warnings
from Bio.PDB.PDBExceptions import PDBConstructionWarning 

with warnings.catch_warnings():
    warnings.simplefilter("ignore", PDBConstructionWarning)
    structure = p.get_structure('', filename)
    
structure

<Structure id=>

## 3. Access structure file sections and records

In [10]:
# self-explanatory keys of header dictionary
structure.header.keys()

dict_keys(['name', 'head', 'idcode', 'deposition_date', 'release_date', 'structure_method', 'resolution', 'structure_reference', 'journal_reference', 'author', 'compound', 'source', 'has_missing_residues', 'missing_residues', 'keywords', 'journal'])

In [11]:
# method of structure determination
structure.header['structure_method']

'x-ray diffraction'

In [12]:
# print missing residues (chain, residue name, residue number) if any
if structure.header['has_missing_residues']:
    missres = structure.header['missing_residues']
    for res in missres:
        print(res['chain'], res['res_name'], res['ssseq'])

A ALA 1
A GLN 2
C ALA 1
C GLN 2
D GLU 64
D ASN 65
E LYS 1
E GLU 64
E ASN 65
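
The flat list printed above is easier to read when grouped by chain. The sketch below assumes only the three dictionary keys used in the cell above (`chain`, `res_name`, `ssseq`). The sample entries are copied from the printed output, with residue numbers written as integers, which is an assumption about the stored type. `group_missing_by_chain` is a hypothetical helper name.

```python
from collections import defaultdict

# Sample entries copied from the output above; in a real session use
# structure.header['missing_residues'] instead.
# Residue numbers are written as integers here (an assumption).
missres = [
    {"chain": "A", "res_name": "ALA", "ssseq": 1},
    {"chain": "A", "res_name": "GLN", "ssseq": 2},
    {"chain": "C", "res_name": "ALA", "ssseq": 1},
    {"chain": "C", "res_name": "GLN", "ssseq": 2},
    {"chain": "D", "res_name": "GLU", "ssseq": 64},
    {"chain": "D", "res_name": "ASN", "ssseq": 65},
    {"chain": "E", "res_name": "LYS", "ssseq": 1},
    {"chain": "E", "res_name": "GLU", "ssseq": 64},
    {"chain": "E", "res_name": "ASN", "ssseq": 65},
]


def group_missing_by_chain(entries):
    """Map chain id -> list of (residue number, residue name) tuples."""
    grouped = defaultdict(list)
    for res in entries:
        grouped[res["chain"]].append((res["ssseq"], res["res_name"]))
    return dict(grouped)


missing_by_chain = group_missing_by_chain(missres)
for chain_id, residues in missing_by_chain.items():
    print(chain_id, residues)
```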


### Organization of the coordinate section

Hierarchy of elements in the structure object:

- A structure consists of models
- A model consists of chains
- A chain consists of residues
- A residue consists of atoms

In [None]:
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom)

You may access components of a certain level directly, without nested iteration:

In [None]:
atoms = structure.get_atoms()

for atom in atoms:
    print(atom.get_coord())

Or using indexing:

In [14]:
# chain A of the first model
chainA = structure[0]['A']

In [16]:
# residue number 4 of chain F
# note that 4 is the residue number from the PDB file, not a Python index
res4 = structure[0]['F'][4]
res4

<Residue VAL het=  resseq=4 icode= >

In [17]:
res4.get_full_id() 

('', 0, 'F', (' ', 4, ' '))

The full id of the residue is a tuple with the following items:
    
0. structure id that you passed to the parser when loading the structure
1. model id
2. chain id
3. residue id: a tuple of (hetero flag, residue number, insertion code)

In [18]:
res4.get_full_id()[3][1]

4
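
Tuple unpacking is often more readable than chained indexing. This small sketch reuses the value printed by `res4.get_full_id()` above, so it runs without a loaded structure. The variable names are chosen here for illustration.

```python
# Value copied from the res4.get_full_id() output above
full_id = ('', 0, 'F', (' ', 4, ' '))

# Unpack the four levels, then the residue id tuple itself
structure_id, model_id, chain_id, residue_id = full_id
hetero_flag, resseq, icode = residue_id

print(f"model={model_id} chain={chain_id} resseq={resseq} "
      f"hetero={hetero_flag!r} icode={icode!r}")
```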

## 4. Extract sequence from structure

Extract the sequence either from SEQRES record (`pdb-seqres`) or from ATOM record (`pdb-atom`). 

The SEQRES record holds the sequence of the protein that was studied, while the sequence derived from the ATOM records contains only what was actually captured in the experiment (for X-ray structures, the residues resolved in the crystal). Some SEQRES residues may be absent from the ATOM records because they could not be resolved, for example due to high flexibility or flaws of the experiment. The SEQRES sequence may also differ from the sequence in reference databases such as UniProt. This can happen when researchers introduce mutations into the protein, either to increase its stability and obtain a good crystal for structure determination or to study the structure of that particular mutant.

Read more about the differences in SEQRES and ATOM sequences:

https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/primary-sequences-and-the-pdb-format \
https://www.wwpdb.org/documentation/file-format-content/format33/sect3.html#SEQRES

More examples with `SeqIO` module: https://biopython.org/docs/1.75/api/Bio.SeqIO.PdbIO.html

In [19]:
from Bio import SeqIO

In [20]:
for record in SeqIO.parse(filename, "pdb-seqres"):
    print(record.annotations['chain'], record.seq)

A AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
B AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
C AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
D KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
E KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
F KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS


In [21]:
for record in SeqIO.parse(filename, "pdb-atom"):
    print(record.annotations['chain'], record.seq)

A VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
B AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
C VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
D KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTXXGAESVLQVFREAKAEGADITIILS
E KAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTXXGAESVLQVFREAKAEGADITIILS
F KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS
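
In the `pdb-atom` output above, unresolved residues inside a chain appear as `X`. The following sketch compares the two chain D sequences printed above to recover which residues are unresolved. It assumes the two strings are already aligned and of equal length. That holds for chain D, but not for chains that lack terminal residues (such as chain A). `unresolved_positions` is a hypothetical helper name.

```python
# Chain D sequences copied from the pdb-seqres and pdb-atom outputs above
seqres_d = "KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS"
atom_d = "KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTXXGAESVLQVFREAKAEGADITIILS"


def unresolved_positions(seqres, atom):
    """Return (1-based position, SEQRES letter) wherever the ATOM sequence has 'X'.

    Assumes both strings have the same length and are already aligned.
    """
    if len(seqres) != len(atom):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, s)
        for pos, (s, a) in enumerate(zip(seqres, atom), start=1)
        if a == "X"
    ]


gaps = unresolved_positions(seqres_d, atom_d)
print(gaps)
```

The positions match residue numbers here only because chain D is numbered from 1. In general, compare the result against `structure.header['missing_residues']`.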




In [22]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", PDBConstructionWarning)
    for record in SeqIO.parse(filename, "pdb-atom"):
        if record.annotations['chain'] == 'C':
            seq = record.seq
            print(seq)

VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR


## 5. Save sequence to fasta file

In [None]:
for record in SeqIO.parse(filename, "pdb-seqres"):
    if record.annotations['chain'] == 'A':
        fasta = record
        print(record.seq)

In [None]:
SeqIO.write(fasta, "1brs_A.fasta", "fasta")

## 6. More sections and records

Apart from the sequence, the records yielded by `SeqIO.parse()` carry other useful information:

In [23]:
for record in SeqIO.parse("1brs.pdb", "pdb-seqres"):
    print('What is record:', dir(record))
    print('Annotations:', record.annotations.keys())
    break

What is record: ['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
Annotations: dict_keys(['chain', 'molecule_type'])


In [24]:
for record in SeqIO.parse("1brs.pdb", "pdb-atom"):
    print('What is record:', dir(record))
    print('Annotations:', record.annotations.keys())
    break

What is record: ['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
Annotations: dict_keys(['molecule_type', 'model', 'chain', 'start', 'end', 'name', 'head', 'idcode', 'deposition_date', 'release_date', 'structure_method', 'resolution', 'structure_reference', 'journal_reference', 'author', 'compound', 'source', 'has_missing_residues', 'missing_residues', 'keywords', 



In [25]:
# get UniProt accession number
for record in SeqIO.parse(filename, "pdb-seqres"):
    if record.annotations['chain'] == 'A':
        ref = record.dbxrefs[0]
        print(ref)

UNP:P00648
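
The cross-reference is a single string with the database name and the accession separated by a colon. Here is a minimal sketch for splitting it, assuming that `database:accession` layout as seen in the output above. `split_dbxref` is a hypothetical helper name.

```python
def split_dbxref(ref):
    """Split a 'database:accession' string into its two parts."""
    database, _, accession = ref.partition(":")
    return database, accession


# Value copied from the output above
database, accession = split_dbxref("UNP:P00648")
print(database, accession)
```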


## 7. Save structure file

In [None]:
from Bio.PDB.PDBIO import PDBIO
io = PDBIO()

Pass the structure, or the part of it you want to save, to `.set_structure()`.

In [None]:
structure_to_save = structure[0]['F']
io.set_structure(structure_to_save)
name = '1brs_F.pdb'
io.save(name)