### Quick Start - PDB Object

Biopython provides the `Bio.PDB` module for working with macromolecular structures such as polypeptides. The PDB (Protein Data Bank) is the largest protein structure resource available online. It hosts a large number of distinct structures, including protein-protein, protein-DNA, and protein-RNA complexes. This lab uses files in the **PDB format** as examples.

In [1]:
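from Bio.PDB import PDBList, PDBParser, Selection  # requires Biopython (pip install biopython)

pdbl = PDBList()
# Download entry 2HUE in PDB format into the current directory (saved as pdb2hue.ent)
pdbl.retrieve_pdb_file('2HUE', pdir='.', file_format='pdb')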

In [None]:
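parser = PDBParser(PERMISSIVE=True)  # permissive mode tolerates minor format problems (with warnings)
structure_id = "2hue"
filename = "pdb2hue.ent"  # the file downloaded by retrieve_pdb_file
test_structure = parser.get_structure(structure_id, filename)  # parse the file into a Structure object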

The overall layout of a Structure object follows the so-called SMCRA (Structure/Model/Chain/Residue/Atom) architecture:
- A structure consists of models 
- A model consists of chains
- A chain consists of residues
- A residue consists of atoms
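
To make the four parent/child relationships concrete, the sketch below assembles a one-atom toy structure by hand from the `Bio.PDB` entity classes and then walks it in both directions. The ids, residue name, and coordinates are invented for illustration; in this lab the same kind of hierarchy comes from parsing a PDB file rather than from manual construction.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# One entity per SMCRA level (all values here are made up).
structure = Structure("toy")
model = Model(0)
chain = Chain("A")
# Residue id = (hetero flag, sequence number, insertion code); last argument is the segment id.
residue = Residue((" ", 80, " "), "THR", " ")
# Atom(name, coord, bfactor, occupancy, altloc, fullname, serial_number, element)
atom = Atom("CA", np.array([1.0, 2.0, 3.0]), 20.0, 1.0, " ", " CA ", 1, element="C")

# Attach each child to its parent, bottom-up.
residue.add(atom)
chain.add(residue)
model.add(chain)
structure.add(model)

# Top-down: each level is indexed by the id of its children.
same_atom = structure[0]["A"][80]["CA"]
print(same_atom is atom)

# Bottom-up: every entity knows its parent, and the full id spells out the path.
print(atom.get_parent().get_resname())
print(atom.get_full_id())
```

The same indexing pattern, `structure[model_id][chain_id][residue_id][atom_name]`, applies to a parsed structure.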

This is how many structural biologists and bioinformaticians think about structure, and it provides a simple but efficient way to work with structural data. Additional information is attached to these objects only when needed. To get a better understanding of this hierarchical architecture, run the code below.


In [None]:
for model in test_structure:  ## check Model, Chain
    for chain in model:
        print(model, chain)

In [None]:
# Uncomment the lines below to print every residue/atom pair of chain A in the first model.
#for residue in test_structure[0]['A']:
#    for atom in residue:
#        print(residue, atom)

When printed, a residue shows the following fields:
- `Residue` is followed by the three-letter residue name (for an amino acid, its three-letter code);
- `het` is the hetero flag, which indicates whether the residue is a hetero residue (it is blank for standard amino acids);
- `resseq` is the residue sequence number;
- `icode` is the insertion code, a string such as 'A'. The insertion code is sometimes used to preserve a desirable residue numbering scheme. For example, a Ser insertion mutant (with Ser inserted between Thr 80 and Asn 81) could be numbered Thr 80 A, Ser 80 B, Asn 81, so that the residue numbering stays in step with that of the wild-type structure.
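
As a small illustration of the insertion-code example above, the sketch below builds a chain holding just those three residues and reads back their ids. The chain is constructed by hand (no atoms, no file) purely for demonstration; the residue ids are the `(hetero flag, sequence number, insertion code)` tuples that `Bio.PDB` uses as keys.

```python
from Bio.PDB.Chain import Chain
from Bio.PDB.Residue import Residue

# Hypothetical insertion mutant: Thr 80 A, Ser 80 B, Asn 81.
chain = Chain("A")
for res_id, resname in [
    ((" ", 80, "A"), "THR"),
    ((" ", 80, "B"), "SER"),
    ((" ", 81, " "), "ASN"),
]:
    chain.add(Residue(res_id, resname, " "))  # last argument is the segment id

for residue in chain:
    hetero_flag, resseq, icode = residue.get_id()
    print(residue.get_resname(), repr(hetero_flag), resseq, repr(icode))

# With an insertion code present, the full id tuple is needed to index the chain.
inserted = chain[(" ", 80, "B")]
# A bare integer is shorthand for (" ", 81, " "), i.e. no hetero flag and no insertion code.
asn = chain[81]
print(inserted.get_resname(), asn.get_resname())
```

Note that `chain[80]` would fail here, because no residue has the id `(" ", 80, " ")`.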

### Hint!
Go to the RCSB page for this entry, https://www.rcsb.org/structure/2HUE, and compare the chains printed above with the information shown online.

## Selection
The **Bio.PDB.Selection module** unfolds an entity (or a list of entities) into a list of entities at another level of the hierarchy, for example:
```
res_list = Selection.unfold_entities(structure, 'R') 
```
This line gets all residues from a structure. To get all atoms from a chain instead:
```
atom_list = Selection.unfold_entities(chain, 'A')
```

The level codes are `A=atom`, `R=residue`, `C=chain`, `M=model`, `S=structure`. You can also use this to go up in the hierarchy, e.g. to get a list of (unique) Residue or Chain parents from a list of Atoms:
```
residue_list = Selection.unfold_entities(atom_list, 'R')
chain_list = Selection.unfold_entities(atom_list, 'C')
```

In [None]:
from Bio.PDB.Selection import unfold_entities
### unfold the structure to get all chains:
chain_list = unfold_entities(test_structure,'C')
print(chain_list)
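
Both directions of unfolding can be tried without downloading anything. The sketch below is a minimal example on invented data: the helper `atom_line` is a hypothetical function that formats ATOM records in the fixed PDB columns, and the two small chains are made up. Only `PDBParser` and `Selection.unfold_entities` come from `Bio.PDB`.

```python
from io import StringIO

from Bio.PDB import PDBParser, Selection


def atom_line(serial, name, resname, chain_id, resseq):
    """Format one ATOM record in fixed PDB columns (dummy coordinates)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {resname:>3} {chain_id}{resseq:>4}    "
        f"{float(serial):8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{20.0:6.2f}"
        f"          {name[0]:>2}"
    )


# Toy data: chain A has two residues, chain B has one; each residue has N and CA.
records = [
    ("N", "GLY", "A", 1), ("CA", "GLY", "A", 1),
    ("N", "ALA", "A", 2), ("CA", "ALA", "A", 2),
    ("N", "GLY", "B", 1), ("CA", "GLY", "B", 1),
]
pdb_text = "\n".join(
    atom_line(i, *record) for i, record in enumerate(records, start=1)
) + "\n"

toy = PDBParser(QUIET=True).get_structure("toy", StringIO(pdb_text))

atoms = Selection.unfold_entities(toy, "A")        # down: structure -> atoms
residues = Selection.unfold_entities(atoms, "R")   # up: atoms -> unique parent residues
chains = Selection.unfold_entities(atoms, "C")     # up: atoms -> unique parent chains
chain_a_atoms = Selection.unfold_entities(toy[0]["A"], "A")  # down from a single chain

print(len(atoms), len(residues), len(chains), len(chain_a_atoms))
```

Going up removes duplicates, which is why six atoms collapse to three residues and two chains.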

## Question:
Can you get all atoms from **Chain B** of the test structure? How many atoms are in **Chain B**?