# Two Molecule Analysis Notebook

#### Last Updated 08-10-2022

This notebook outlines how to perform the analysis from the Cubuk, Alston et al. (2022) paper.

## Data Used

This notebook uses a test data set that is a segment of 5 concatenated replicates of the simulations run for the aforementioned paper.

This consists of only two key pieces of data:

1. Start.pdb
2. __traj.xtc

In particular, these come from the NTDRBD simulations of RBD conformation 1 and rU25, replicate 1.

## Before Trajectory Analysis

Before the trajectory is analyzed, each of the five trajectories in each conformational replicate is PBC-corrected using the bash command:

lammpstools_pbcfix --pdb Start.pdb --xtc __traj.xtc

which outputs a corrected pdb and trajectory file: lammps_fixed_traj.pdb and lammps_fixed_traj.xtc

This corrects for periodic boundary condition (PBC) artifacts.
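As a rough illustration of the idea behind a periodic boundary correction, the minimum-image convention for a cubic box can be sketched as below. This is not the lammpstools_pbcfix implementation, whose internals are not described in this notebook; the function name is hypothetical, and the 30 nm box edge is taken from the simulation volume used later in the notebook.

```python
def minimum_image_delta(a, b, box_length):
    """Displacement from a to b under the minimum-image convention (cubic box)."""
    delta = []
    for ai, bi in zip(a, b):
        d = bi - ai
        # Shift by a whole number of box lengths so that |d| <= box_length / 2
        d -= box_length * round(d / box_length)
        delta.append(d)
    return delta


# Two points that look 29 nm apart along x in a 30 nm box are really 1 nm apart
print(minimum_image_delta([0.5, 0.0, 0.0], [29.5, 0.0, 0.0], 30.0))  # [-1.0, 0.0, 0.0]
```

Without this kind of correction, distances between molecules that sit on opposite sides of a box edge would be overestimated.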



In [1]:
import pandas as pd

import mdtraj as md
import itertools
import numpy as np

import soursop as ss
from soursop.sstrajectory import SSTrajectory
NA_EXTENSION = ['D5P', 'DPC',  'DPU',  'DPT',  'DPA',  'DPG',  'R5P',  'RPC' , 'RPU',  'RPT',  'RPA',  'RPG']

from pathlib import Path

In [2]:
path = str(Path().absolute())
path

'/work/alstonj/2022/Phase_Separation/LAMMPS/NTDRBD_PolyrU/NTDRBD_Titrate_Length_NoSalt/GitHub'

Necessary files are included in supportingdata/2022/cubuk_et_al_2022/data

In [4]:
#Next, take the PBC-corrected xtc and pdb and remove the beginning of the simulation (chosen by inspecting energy plots) to allow for equilibration

#PBC Corrected xtc and pdb
xtcfilename = path + '/lammps_fixed_traj.xtc'
print(xtcfilename)
pdbfilename = path + '/lammps_fixed_traj.pdb'
print(pdbfilename)

#Load into Soursop
print('Loading trajectory...')
TrajOb = SSTrajectory(xtcfilename, pdbfilename, extra_valid_residue_names=NA_EXTENSION)
print('Trajectory loaded!')

#Get total number of frames in simulation
nframes = TrajOb.n_frames
print(nframes)

#Number of pre-equilibration frames to discard (the first 0.2% of the trajectory)
n_discard = round(.002*nframes)
print(n_discard)

#Keep only the frames after equilibration
sliced = TrajOb.traj[n_discard:]

print(sliced)

#Save equilibrated xtc
sliced.save(path + "/equilibrated_traj.xtc")

/work/alstonj/2022/Phase_Separation/LAMMPS/NTDRBD_PolyrU/NTDRBD_Titrate_Length_NoSalt/GitHub/lammps_fixed_traj.xtc
/work/alstonj/2022/Phase_Separation/LAMMPS/NTDRBD_PolyrU/NTDRBD_Titrate_Length_NoSalt/GitHub/lammps_fixed_traj.pdb
Loading trajectory...
Trajectory loaded!
2991
6
<mdtraj.Trajectory with 2985 frames, 198 atoms, 198 residues, and unitcells>
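The frame bookkeeping in the cell above can be checked in isolation. The sketch below uses a hypothetical helper name and the frame count and 0.2% discard fraction from this notebook; it does not touch the trajectory itself.

```python
def equilibration_split(n_frames, discard_fraction):
    """Return (frames discarded, frames kept) for a given discard fraction."""
    n_discard = round(discard_fraction * n_frames)
    return n_discard, n_frames - n_discard


n_discard, n_kept = equilibration_split(2991, 0.002)
print(n_discard, n_kept)  # 6 2985
```

These values match the 6 discarded frames and the 2985-frame trajectory reported in the output above.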


## A note on xtc concatenation

In our workflow there are 5 replicates for each starting state. Once the equilibrated xtc files have been generated, they are concatenated using cattraj with the following bash command:

cattraj -n 5 --everything --pdb lammps_fixed_traj.pdb --xtc equilibrated_traj.xtc

which outputs a folder called "Full" containing full.pdb and full.xtc, the concatenated trajectories that include all amino acids and nucleotides.

These are then concatenated with the trajectories from each of the other starting RBD conformations that have been run, again using cattraj.

For this tutorial we will just be working with the test data set.

________________________________________________________________________________________________________________________________________________________________________________________________________________________________

# Binding Affinity Analysis

We want to calculate a second osmotic virial coefficient, $B_{ij}$, to correct the $K_{D}$ values from simulations of two molecules binding for finite-size effects.

This requires us to calculate $B_{ij}$ and $K_{D}$.

$B_{ij}$ can be estimated using the equation:  $$B_{ij}= -\frac{V}{2}\Bigg[\frac{1-\frac{v}{V}}{1-P_{u}(V)}-1\Bigg]$$

where $V$ is the simulation box volume, $v$ is the subvolume, and $P_{u}(V)$ is the probability that molecule j is in the subvolume of molecule i.

$K_{D}$ corrected for finite size effects can be estimated using the equation: $$K_{D} = \frac{1}{N_{A}P_{B}(V)(V-2B_{ij})}$$

where $P_{B}(V)$ is the fraction of frames in which the protein is bound and $N_{A}$ is Avogadro's number.
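The two equations above can be written as small functions. This is a minimal sketch rather than the code used for the paper: the function names are hypothetical, volumes are assumed to be in nm$^3$, and the conversion 1 nm$^3$ = $10^{-24}$ L is used so that $K_{D}$ comes out in mol/L. The $B_{ij}$ inputs are the values computed later in this notebook; the bound fraction of 0.5 is an illustrative placeholder.

```python
AVOGADRO = 6.022e23      # mol^-1, rounded as elsewhere in this notebook
NM3_TO_LITRE = 1e-24     # 1 nm^3 = 1e-27 m^3 = 1e-24 L


def second_virial(box_volume, subvolume, p_inside):
    """B_ij in the same volume units as the inputs (here nm^3)."""
    ratio = (1.0 - subvolume / box_volume) / (1.0 - p_inside)
    return -(box_volume / 2.0) * (ratio - 1.0)


def corrected_kd_molar(p_bound, box_volume_nm3, b_ij_nm3):
    """Finite-size-corrected K_D in mol/L for volumes given in nm^3."""
    effective_volume_litre = (box_volume_nm3 - 2.0 * b_ij_nm3) * NM3_TO_LITRE
    return 1.0 / (AVOGADRO * p_bound * effective_volume_litre)


# Box volume, subvolume and inside-probability as computed later in this notebook
b_ij = second_virial(box_volume=27000.0,
                     subvolume=5911.504919432612,
                     p_inside=233 / 2985)
print(b_ij)  # about 2063.02 nm^3

# p_bound = 0.5 is a hypothetical value, used only to show the call
print(corrected_kd_molar(p_bound=0.5, box_volume_nm3=27000.0, b_ij_nm3=b_ij))
```

A useful sanity check on the first formula: if molecule j samples the box uniformly, then $P_{u}(V) = v/V$ and $B_{ij}$ is zero.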

We will be using the subvolume method, defining a region around molecule i (NTDRBD) that corresponds to the interaction region between i and j (RNA).

For a rudimentary first pass we will define this subvolume as a sphere whose radius is half the sum of the largest dimensions of i and j, as estimated in the code below.

If the COM of j falls within the subvolume of i, then j is defined as potentially being bound.
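The geometric test described above amounts to comparing a center-of-mass distance against a cutoff and computing the volume of the corresponding sphere. The sketch below uses hypothetical function names and made-up coordinates in nm; it assumes the cutoff is used as the sphere radius, as in the volume calculation later in this notebook.

```python
import math


def sphere_volume(radius):
    """Volume of a sphere, in the cube of whatever length unit radius uses."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def is_inside_subvolume(com_i, com_j, cutoff):
    """True when the two centers of mass are closer than the cutoff distance."""
    return math.dist(com_i, com_j) < cutoff


# Hypothetical coordinates in nm, chosen only to exercise the functions
print(is_inside_subvolume((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), cutoff=10.0))   # True, distance is 5 nm
print(is_inside_subvolume((0.0, 0.0, 0.0), (12.0, 0.0, 0.0), cutoff=10.0))  # False, distance is 12 nm

# Fraction of a 30 nm cubic box occupied by a sphere of radius 10 nm
print(sphere_volume(10.0) / 30.0 ** 3)
```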


In [8]:
#read in necessary files

xtcfilename = path + '/equilibrated_traj.xtc'
#print(xtcfilename)
pdbfilename = path + '/lammps_fixed_traj.pdb'
#print(pdbfilename)
TrajOb = SSTrajectory(xtcfilename, pdbfilename, extra_valid_residue_names=NA_EXTENSION)

#Confirm that we have 2 molecules in simulation

print(TrajOb)

SSTrajectory (0x7f6764a032b0): 2 proteins and 2985 frames


In [51]:
#Calculate largest distance between residues and nucleotides individually to create a subvolume

#Here i is the length of your RNA (number of nucleotides)

i = 25

#Define the RNA and the protein; the protein in this case is 173 residues long

#Group 1 is the RNA nts, indexed directly after the protein residues
group_1 = [nucleotide + 173 for nucleotide in range(i)]
#print(group_1)

#Group 2 is the NTDRBD, this is always 173 residues

group_2 = [aminoacid for aminoacid in range(173)]
#print(group_2)
#print(group_2)

#For the RNA we use the distance between the first and last nucleotides (min and max residue index) as the largest distance
pairs = (min(group_1), max(group_1))
#print(pairs)

#compute the distance
cc = md.compute_contacts(TrajOb.traj, [pairs], ignore_nonprotein = False)

#Get RNA largest Diameter
RNA_Largest = []
for a in range(len(TrajOb.traj)):
    RNA_Largest.append(np.max(cc[0][a]))
    if a % 10000 == 0:
        print('RNA_Largest', a)
            
RNA_Diam = np.max(RNA_Largest)

#For the protein the largest distance isn't necessarily between the first and last residue, so we compute all pairwise distances and extract the largest
pairs = list(itertools.product(group_2, group_2))

#compute the pairwise distance for every frame protein
cc = md.compute_contacts(TrajOb.traj, pairs, ignore_nonprotein = False)

NTDRBD_Largest = []
for a in range(len(TrajOb.traj)): 
    NTDRBD_Largest.append(np.max(cc[0][a]))
    if a % 10000 == 0:
        print('NTDRBD_Largest', a)

NTDRBD_Diam = np.max(NTDRBD_Largest)

print("RNA Diameter in nm is: ", RNA_Diam)
print("NTDRBD Diameter in nm is: ", NTDRBD_Diam)

RNA_Largest 0
NTDRBD_Largest 0
RNA Diameter in nm is:  9.495762
NTDRBD Diameter in nm is:  12.937882


In [46]:
#Define the subvolume cutoff distance as 1/2 of (RNA_Diam + NTDRBD_Diam) in nm

subvolume = .5*(RNA_Diam + NTDRBD_Diam)
print("Subvolume distance cutoff is", subvolume)

Subvolume distance cutoff is 11.216821670532227


In [47]:
#Define the COM of the NTDRBD and RNA for each frame

NTDRBD_coord = TrajOb.proteinTrajectoryList[0].get_center_of_mass()
RNA_coord = TrajOb.proteinTrajectoryList[1].get_center_of_mass()

#Compute the distance between the COMs of the NTDRBD and RNA in each frame and store the distances for frames inside or outside the subvolume

Outside = []
Inside = []

for frame in range(len(NTDRBD_coord)):
    dist = np.linalg.norm(NTDRBD_coord[frame] - RNA_coord[frame])
    if dist >= subvolume:
        Outside.append(dist)
    else:
        Inside.append(dist)

print("# of Events outside Subvolume", len(Outside))
print("# of Events inside Subvolume", len(Inside))

# of Events outside Subvolume 2752
# of Events inside Subvolume 233
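The per-frame loop above can also be written as a single vectorized NumPy operation. This is a sketch with a hypothetical function name and toy coordinates; it assumes both center-of-mass arrays have shape (n_frames, 3) and, like the loop above, uses the plain Euclidean distance.

```python
import numpy as np


def count_inside_outside(com_i, com_j, cutoff):
    """Count frames whose COM-COM distance is below / at-or-above the cutoff."""
    com_i = np.asarray(com_i, dtype=float)
    com_j = np.asarray(com_j, dtype=float)
    # One distance per frame: norm taken over the x, y, z axis
    distances = np.linalg.norm(com_i - com_j, axis=1)
    inside = distances < cutoff
    return int(inside.sum()), int((~inside).sum())


# Toy data: three frames, molecule i fixed at the origin (coordinates in nm)
com_i = np.zeros((3, 3))
com_j = np.array([[1.0, 0.0, 0.0], [0.0, 20.0, 0.0], [3.0, 4.0, 0.0]])
print(count_inside_outside(com_i, com_j, cutoff=11.2))  # (2, 1)
```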


In [48]:
#Define the probability of the RNA being within the subvolume

PuV = len(Inside)/(len(Outside)+len(Inside))
print("Probability of RNA sampling within the subvolume", PuV)

#Define the simulation box volume: convert the box edge (300) to nm by dividing by 10, then cube to get nm^3

SimV = ((300/10)**3)

print("Simulation Volume in nm^3", SimV)

#Define the subvolume as the volume of a sphere of radius `subvolume`, in nm^3

SubV = 4/3*(np.pi)*(subvolume)**3

print("Subvolume Volume in nm^3", SubV)

print("Fraction of Simulation box taken up by the subvolume: ", SubV/SimV)

Probability of RNA sampling within the subvolume 0.07805695142378559
Simulation Volume in nm^3 27000.0
Subvolume Volume in nm^3 5911.504919432612
Fraction of Simulation box taken up by the subvolume:  0.21894462664565228


In [49]:
#Calculate second virial coefficient

term1 = -(SimV/2)
numerator = 1- (SubV/SimV)
denominator = 1 - PuV

Bij = term1*((numerator/denominator)-1)
print("Bij = ", Bij)

Bij =  2063.0163852664145


In [53]:
#Determine the fraction bound by generating lists of bound and unbound frames using a distance cutoff of 0.7 nm

i = 25

distance_cutoff = .7

#Group 1 is the RNA nts

group_1 = [nucleotide + 173 for nucleotide in range(i)]
#print(group_1)

#Group 2 is the NTDRBD, this is always 173 residues

group_2 = [aminoacid for aminoacid in range(173)]
#print(group_2)

pairs = list(itertools.product(group_1, group_2))
#len(pairs)

#compute the pairwise distance for every frame 
cc = md.compute_contacts(TrajOb.traj, pairs, ignore_nonprotein = False)

bound = []
unbound = []

for a in range(len(TrajOb.traj)):
    #A frame counts as bound if any RNA-protein residue pair is within the cutoff
    if np.any(cc[0][a] <= distance_cutoff):
        bound.append(a)
        if a % 10000 == 0:
            print('bound', a)

    else:
        unbound.append(a)
        if a % 10000 == 0:
            print('unbound', a)


#print(TrajOb.traj)
#print(len(bound))
#print(len(unbound))

unbound 0
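The bound/unbound classification can be expressed without an explicit loop. The sketch below uses a hypothetical function name and toy distances; it assumes a distance array of shape (n_frames, n_pairs), which is the layout of the first array returned by md.compute_contacts.

```python
import numpy as np


def bound_fraction(distances, cutoff):
    """Return (fraction of bound frames, indices of bound frames)."""
    distances = np.asarray(distances, dtype=float)
    # A frame is bound if at least one pair distance is within the cutoff
    bound_mask = (distances <= cutoff).any(axis=1)
    return float(bound_mask.mean()), np.flatnonzero(bound_mask)


# Toy data: four frames, two residue pairs each (distances in nm)
toy = [[1.2, 0.6], [0.9, 0.8], [0.7, 2.0], [3.0, 3.1]]
fraction, frames = bound_fraction(toy, cutoff=0.7)
print(fraction, frames)  # 0.5 [0 2]
```

The returned fraction plays the role of $P_{B}(V)$ in the $K_{D}$ equation.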


In [54]:
#Define terms and calculate

#Avogadro's number
Na = 6.022e23

#Probability of being in bound state
PbV = len(bound)/(len(bound)+len(unbound))

#Calculate Kd in M (the factor 10**24 converts nm^3 to L, since 1 nm^3 = 1e-24 L)
Kd = (10**24)/(Na*PbV*(SimV-(2*Bij)))

#Dividing 1/Kd (in M^-1) by 10**6 gives Ka in uM^-1, and its inverse is Kd in uM
Ka = ((1/Kd)/10**6)

KdinuM = 1/Ka

print("Kd in uM is ",KdinuM)
print("Ka in uM^-1 is ",Ka)

rows = [["KD", "KA", "Bij"],[KdinuM, Ka, Bij]]

#Write the results to a csv file using savetxt from the numpy module
np.savetxt("Binding_Affinity.csv", 
           rows,
           delimiter =", ", 
           fmt ='% s')

Kd in uM is  163.05612184200936
Ka in uM^-1 is  0.006132857746788152
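Because the box volume is in nm$^3$ while binding constants are reported as concentrations, it helps to make the unit conversions explicit. The sketch below uses hypothetical helper names and an illustrative $K_{D}$ value; it assumes the input is already in mol/L.

```python
UNIT_FACTORS = {"M": 1.0, "mM": 1e3, "uM": 1e6, "nM": 1e9}


def convert_kd(kd_molar, unit):
    """Convert a dissociation constant from mol/L to the requested unit."""
    return kd_molar * UNIT_FACTORS[unit]


def association_constant(kd):
    """K_A is the reciprocal of K_D, so its unit is the inverse of K_D's unit."""
    return 1.0 / kd


kd_molar = 2.5e-6  # hypothetical value in mol/L, for illustration only
print(convert_kd(kd_molar, "uM"), convert_kd(kd_molar, "nM"))
print(association_constant(convert_kd(kd_molar, "uM")))
```

Keeping the factor in a lookup table makes it easy to see which power of ten corresponds to which reported unit.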
