# Initialization of the data

In [1]:
# Basic libraries
import os, sys, math
import numpy as np
from Bio.PDB import *
import biobb_structure_checking
import biobb_structure_checking.constants as cts
from biobb_structure_checking.structure_checking import StructureChecking

# Choose working directory
workdir = input("Input working directory: ")
base_path = os.path.join(workdir, "Structure_checking/")

# Load original structure 
pdb_path = os.path.join(base_path, "6M0J.pdb")
parser = PDBParser(QUIET=True)
structure = parser.get_structure("6M0J.pdb", pdb_path)

print("Initialization completed.")


Input working directory: /home/marti/Escriptori/BIOPHYSICS/Biophysics_Project_2025/Code_Data_Repo
Initialization completed.


# Preparation / Structure checking

## Methodology

### Step 1.1

#### The interface between the two chains can be defined as the list of residues on both chains that have at least one atom within a given distance of the other chain.
1. Using PyMOL, visually inspect the structure and choose a distance such that all contact residues are included. Add 1-2 Å to that distance so that adjacent residues are also considered.

Visual inspection

Open 6M0J.pdb in PyMOL or Chimera and inspect the interface.

Determine a suitable distance threshold for interface atoms/residues.

Typical contact distances are ~4-5 Å; add 1-2 Å to include neighboring residues.
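
One way to make the threshold choice less subjective is to count how many residues are picked up over a range of cutoffs and see where the count starts to grow because of second-shell neighbors. The sketch below is illustrative only: it runs on plain `(residue_number, (x, y, z))` tuples with made-up coordinates rather than on 6M0J, and `residues_within` is a hypothetical helper name, not part of Biopython.

```python
import math

def residues_within(atoms_a, atoms_b, cutoff):
    """Return residue numbers of chain A with any atom within `cutoff` Å of chain B.

    Each atom is a (residue_number, (x, y, z)) tuple.
    """
    close = set()
    for res_a, xyz_a in atoms_a:
        for _res_b, xyz_b in atoms_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                close.add(res_a)
                break  # one close partner is enough for this atom
    return sorted(close)

# Toy coordinates (hypothetical, not taken from 6M0J)
chain_a = [(10, (0.0, 0.0, 0.0)), (11, (0.0, 0.0, 3.0)), (12, (0.0, 0.0, 9.0))]
chain_b = [(501, (3.5, 0.0, 0.0)), (502, (6.0, 0.0, 3.0))]

# Scan several cutoffs and report how the selection changes
for cutoff in (4.0, 5.0, 6.0, 7.0):
    hits = residues_within(chain_a, chain_b, cutoff)
    print(f"{cutoff:.1f} Å -> {len(hits)} residue(s): {hits}")
```

With real data, the tuples could be built from each Biopython atom's `get_parent().id[1]` and `coord`.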


In [4]:
interface_distance = 5
print(f"Interface distance threshold set to: {interface_distance} Å")


Interface distance threshold set to: 5 Å


### Step 1.2

2. Prepare a Python script to define the list of interface residues on each chain

In [5]:
from Bio.PDB import PDBParser

# Load the original PDB structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("6M0J", os.path.join(base_path, "6M0J.pdb"))

def get_interface_residues(structure, dt):
    """Return the sorted residue numbers, merged over both chains, that have
    at least one atom within dt Å of an atom in the other chain."""
    model = structure[0]
    chains = list(model.get_chains())
    if len(chains) < 2:
        raise ValueError("Need at least 2 chains for interface calculation")

    # Only the first two chains of the first model are compared
    chain1_atoms = [atom for res in chains[0] for atom in res]
    chain2_atoms = [atom for res in chains[1] for atom in res]

    interface_residues = set()

    # Brute-force all-vs-all comparison; in Bio.PDB, atom1 - atom2 gives their distance
    for a1 in chain1_atoms:
        for a2 in chain2_atoms:
            if (a1 - a2) <= dt:
                interface_residues.add(a1.get_parent().id[1])
                interface_residues.add(a2.get_parent().id[1])

    return sorted(interface_residues)

interface_residues = get_interface_residues(structure, interface_distance)
print("Interface residues (residue numbers):", interface_residues)


Interface residues (residue numbers): [19, 24, 27, 28, 30, 31, 34, 35, 37, 38, 41, 42, 45, 79, 82, 83, 330, 353, 354, 355, 357, 393, 417, 446, 447, 449, 453, 455, 456, 473, 475, 476, 484, 486, 487, 489, 493, 496, 498, 500, 501, 502, 505]
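
Because residue numbering is defined per chain and the same number can occur in both chains, a single merged list does not say which chain each residue belongs to. A possible variant, sketched here on plain tuples with made-up coordinates, keeps one list per chain. The function name `interface_by_chain` and the input layout are assumptions for illustration, not the code used to produce the output above.

```python
import math

def interface_by_chain(chains, cutoff):
    """Map each chain ID to the sorted residue numbers in contact with the other chain.

    `chains` maps exactly two chain IDs to lists of
    (residue_number, (x, y, z)) atom tuples.
    """
    (id_a, atoms_a), (id_b, atoms_b) = chains.items()
    contacts = {id_a: set(), id_b: set()}
    for res_a, xyz_a in atoms_a:
        for res_b, xyz_b in atoms_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts[id_a].add(res_a)
                contacts[id_b].add(res_b)
    return {chain_id: sorted(res) for chain_id, res in contacts.items()}

# Toy input (hypothetical coordinates); residue 353 exists in both chains
toy = {
    "A": [(353, (0.0, 0.0, 0.0)), (400, (20.0, 0.0, 0.0))],
    "E": [(353, (30.0, 0.0, 0.0)), (501, (4.0, 0.0, 0.0))],
}
result = interface_by_chain(toy, cutoff=5.0)
print(result)
```

In the toy input, only residue 353 of chain A and residue 501 of chain E are reported, while residue 353 of chain E stays out of the interface even though it shares its number with a contact residue on chain A.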


### Step 1.3

3. Setup the initial protein structure as necessary

    1. Obtain the required structure from the PDB.
    2. Check in the PDB entry the composition of the “Biological assembly”. Remove all chains except those involved in the assembly, if necessary.
    3. Remove all heteroatoms.
    4. Perform a quality check on the structure, and add missing side chains, hydrogen atoms, and atom charges (use CMIP settings and prepare a PDBQT file), using the biobb_structure_checking module.

In [14]:
# Paths for output files
pdb_fixed_new = os.path.join(base_path, "6m0j_fixed_step1.pdb")
pdb_cif = os.path.join(base_path, "6m0j.cif")

# Default args
base_dir_path = biobb_structure_checking.__path__[0]
args = cts.set_defaults(base_dir_path, {'notebook': True})
args.update({
    'output_format': "pdb",  # only PDB
    'keep_canonical': False,
    'input_structure_path': pdb_cif,
    'output_structure_path': pdb_fixed_new,
    'time_limit': False,
    'nocache': False,
    'copy_input': False,
    'build_warnings': False,
    'debug': False,
    'verbose': False,
    'coords_only': False,
    'overwrite': True
})

# Initialize checking engine
st_c = StructureChecking(base_dir_path, args)

# Remove existing hydrogens and heteroatoms (waters, metals, ligands)
st_c.rem_hydrogen()
st_c.water("yes")
st_c.metals("All")
st_c.ligands("All")

# Fix structure and add missing atoms/hydrogens (no charges)
st_c.amide("All")
st_c.chiral("All")
st_c.backbone('--fix_atoms All --fix_chain none --add_caps none')
st_c.fixside("All")
st_c.add_hydrogen("auto")

# Save cleaned PDB
st_c._save_structure(args['output_structure_path'])
print(f"Cleaned structure saved at: {pdb_fixed_new}")


Structure /home/marti/Escriptori/BIOPHYSICS/Biophysics_Project_2025/Code_Data_Repo/Structure_checking/6m0j.cif loaded
 PDB id: 6M0J 
 Title: Crystal structure of 2019-nCoV spike receptor-binding domain bound with ACE2
 Experimental method: X-RAY DIFFRACTION
 Keywords: VIRAL PROTEIN/HYDROLASE
 Resolution (A): 2.4500

 Num. models: 1
 Num. chains: 2 (A: Protein, E: Protein)
 Num. residues:  876
 Num. residues with ins. codes:  0
 Num. residues with H atoms: 0
 Num. HETATM residues:  85
 Num. ligands or modified residues:  5
 Num. water mol.:  80
 Num. atoms:  6543
Metal/Ion residues found
 ZN A901
Small mol ligands found
NAG A902
NAG A903
NAG A904
NAG E601
Running rem_hydrogen.
No residues with Hydrogen atoms found
Running water. Options: yes
Detected 80 Water molecules
Removed 80 Water molecules
Running metals. Options: All
Found 1 Metal ions
  ZN A901.ZN 
Metal Atoms removed All (1)
Running ligands. Options: All
Detected 4 Ligands
 NAG A902
 NAG A903
 NAG A904
 NAG E601
Ligands removed

Description:
- We used the biobb_structure_checking module to process the structure file 6m0j.cif and generate a cleaned PDB file. The workflow included:

    1. Removing existing hydrogens and heteroatoms (waters, metals, ligands).

    2. Fixing amides, chiral centers, and backbone atoms.

    3. Rebuilding missing side chains.

    4. Adding hydrogens.

Results of our code:

    Cleaned PDB: 6m0j_fixed_step1.pdb (12,539 lines)

Reference files provided:

    Cleaned PDB: 12,512 lines

    PDBQT: 12,512 lines

Comparison:

    Our generated file contains 27 more lines than the reference (12,539 vs. 12,512), likely due to differences in hydrogen placement, terminal atoms, or formatting introduced during the automatic fixing steps.

    Our structure appears functionally equivalent to the reference, but it does not match the reference files line for line.
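
To find where extra lines come from instead of guessing, two files can be compared by record type and by atom identity. The following is a minimal sketch that relies only on the fixed-column layout of PDB ATOM/HETATM records. The helper names and the inline toy records are hypothetical; with real files, the lists would come from reading each file's lines from disk.

```python
from collections import Counter

def record_counts(lines):
    """Count PDB lines by record name (columns 1-6)."""
    return Counter(line[:6].strip() for line in lines if line.strip())

def atom_keys(lines):
    """Return the set of (chain, residue number, residue name, atom name) keys."""
    keys = set()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            keys.add((line[21], int(line[22:26]),
                      line[17:20].strip(), line[12:16].strip()))
    return keys

def toy_atom(serial, name, resname, chain, resseq):
    """Build a simplified fixed-column ATOM record for this example only."""
    return f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}"

# Toy records (hypothetical): "ours" has one extra hydrogen and a TER line
ours = [toy_atom(1, "N", "SER", "A", 19), toy_atom(2, "CA", "SER", "A", 19),
        toy_atom(3, "H1", "SER", "A", 19), "TER", "END"]
reference = [toy_atom(1, "N", "SER", "A", 19), toy_atom(2, "CA", "SER", "A", 19),
             "END"]

# Counter subtraction keeps only record types that are more frequent in "ours"
extra_records = record_counts(ours) - record_counts(reference)
extra_atoms = atom_keys(ours) - atom_keys(reference)
print("Extra records:", dict(extra_records))
print("Atoms only in our file:", sorted(extra_atoms))
```

Applied to the generated and reference files, a comparison of this kind would show whether the 27-line difference is made up of additional atoms, TER records, or header and footer lines.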

Decision:

    To ensure compatibility with downstream workflows and consistency with the assignment reference, we will use the provided files for all subsequent steps.

    Our generated files are kept as a backup for reference and verification.

## Step 2