## Intro: This is a Markdown cell

Markdown is a simple language to add basic formatting to plain text. Cells in a Jupyter notebook can be changed into Markdown cells using the *Cell > Cell Type > Markdown* menu. You can double click within the cell to edit/view source, and **run** the cell to produce the formatted text. Take a look at the source for this cell, which includes bold/italic text, headers, links, and code blocks.

Next, we'll tackle a common problem in setting up molecular dynamics simulations from crystal or cryo-EM structures:
modeling missing loops where the density is unclear (perhaps due to flexibility or multiple conformations).

We'll use [next-generation KIC](https://www.rosettacommons.org/docs/latest/application_documentation/structure_prediction/loop_modeling/next-generation-KIC) ([paper](https://doi.org/10.1371/journal.pone.0063090))
within the [Rosetta loopmodel application](https://www.rosettacommons.org/docs/latest/application_documentation/structure_prediction/loop_modeling/loopmodel).

For loopmodel, we need to add initial coordinates for the missing atoms to the PDB file, and they need to be linearly independent (leaving every missing atom at the same point, such as the origin, will not work). Note: the remodel application does not require this, but with remodel we would not be able to use next-generation KIC or KIC with fragments, which perform better on benchmarks.

We will also need to generate a loop file. From the Rosetta documentation, it should contain:

    column1  "LOOP":     The loop file identify tag
    column2  "integer":  Loop start residue number
    column3  "integer":  Loop end residue number
    column4  "integer":  Cut point residue number, >=startRes, <=endRes. Default: 0 (let the loop modeling code 
                         choose the cut point)
                         Note: Setting the cut point outside the loop can lead to a segmentation fault. 
    column5  "float":    Skip rate. Default: 0 (never skip modeling this loop)
    column6  "boolean":  Extend loop (i.e. discard the native loop conformation and rebuild the loop from
                         scratch, idealizing all bond lengths and angles). Default: 0 (false)
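
The column layout above can be turned into a small helper. This is only a minimal sketch: it assumes the columns are separated by single spaces, and `format_loop_line` and the residue numbers in `example_loops` are hypothetical names and values chosen for illustration, not part of Rosetta.

```python
def format_loop_line(start, end, cut=0, skip_rate=0.0, extend=False):
    """Return one loop-file line in the column order described above."""
    # A cut point of 0 lets the loop modeling code choose; any other value
    # must lie inside the loop.
    if cut != 0 and not (start <= cut <= end):
        raise ValueError("cut point must be 0 or lie within the loop")
    return "LOOP %d %d %d %.1f %d" % (start, end, cut, skip_rate, int(extend))


# Hypothetical loops given as (start residue, end residue) pairs.
example_loops = [(191, 196), (260, 268)]
loop_lines = [format_loop_line(s, e, extend=True) for s, e in example_loops]
print("\n".join(loop_lines))
```

Raising an error for an out-of-range cut point is a design choice made here to catch that mistake before the file reaches Rosetta.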

And while we're at it, we might as well automatically generate the batch scripts we'll need to run loopmodel and submit jobs on our karplus cluster using [GNU parallel](https://www.gnu.org/software/parallel/) (I also have GNU parallel and Rosetta compiled on Savio).

In [1]:
# Let's start with a more manual method. We'll want to generate random numbers for the fake atom positions, so
# we'll import a uniform random number generator:
from random import uniform
# Note: we don't care about the numbers being actually random, we just need to change them all from (0,0,0), so we
# won't bother seeding the number generator.

# I've manually created an edited PDB file which has all of the missing atoms. The 'dummy' atoms have coordinates
# (0,0,0) and b-factors of 0.
with open("6vfx_dummy.pdb", 'r') as f:
    lines = f.readlines()

# Let's look at an example dummy atom:
print(lines[995])

newlines = []

for line in lines:
    # Here we'll identify the lines corresponding to dummy atoms based on their b-factors of 0:
    if line[0:4] == "ATOM" and line[60:66] == "  0.00":
        # We need a random number for x, y, and z:
        num = tuple(uniform(-96.0, 96.0) for i in range(3))
        # And we'll replace the coordinates in the line with these properly-formatted numbers:
        line = line[:30] + "%8.3f" * 3 % num + line[54:]
        # Try to figure out how the above is equivalent to
        # line = line[:30] + "%8.3f" % num[0] + "%8.3f" % num[1] + "%8.3f" % num[2] + line[54:]
    newlines.append(line)

# Let's look at the same dummy atom as before:
print(newlines[995])

# Now we'll write out our edited PDB:
with open("6vfx_dummy2.pdb", 'w') as f:
    f.writelines(newlines)

ATOM   2005  N   LYS C 192       0.000   0.000   0.000  1.00  0.00           N

ATOM   2005  N   LYS C 192     -73.707  13.359 -64.837  1.00  0.00           N
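
The slices used in the cell above rely on the fixed-column layout of PDB `ATOM` records: characters 30 to 53 hold x, y, and z (8 characters each) and characters 60 to 65 hold the b-factor. The sketch below checks this on the dummy atom printed above and shows why the two formatting expressions agree. `*` and `%` have the same precedence and are evaluated left to right, so the format string is tripled before the tuple is substituted. The coordinates are copied from the output above rather than drawn at random.

```python
# The dummy-atom record printed before randomization.
record = "ATOM   2005  N   LYS C 192       0.000   0.000   0.000  1.00  0.00           N"

# Fixed-column fields, using zero-based Python slices.
xyz_field = record[30:54]      # three 8-character coordinate fields
bfactor_field = record[60:66]  # 6-character b-factor field
print(repr(xyz_field), repr(bfactor_field))

coords = (-73.707, 13.359, -64.837)

# ("%8.3f" * 3) is evaluated first, giving "%8.3f%8.3f%8.3f", which then
# consumes all three values of the tuple.
combined = "%8.3f" * 3 % coords
separate = "%8.3f" % coords[0] + "%8.3f" % coords[1] + "%8.3f" % coords[2]
print(combined == separate)

# Splice the formatted coordinates back in; the record length is unchanged.
new_record = record[:30] + combined + record[54:]
print(new_record)
```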



In [8]:
# The above method required a lot of manual effort. Let's see if we can automate adding the dummy residues using
# Bio.PDB:
from Bio import PDB as pdb

# First we'll retrieve the full mmCIF file from the PDB as before:
pdbl = pdb.PDBList()
filename = pdbl.retrieve_pdb_file('6vfx')
mmcifp = pdb.MMCIFParser()
structure_6vfx = mmcifp.get_structure('6vfx', filename)

# We'll get a bunch of warnings about the 6 ClpX chains being discontinuous...

# We'll verify that we only have a single model in this structure. This example does not extend to multiple models.
print(list(structure_6vfx.get_models()))



Structure exists: '/home/kent/kuriyanlab_python-workshops/2/vf/6vfx.cif' 
[<Model id=0>]




In [3]:
# The Bio.PDB module itself does not have any tools for retrieving the SEQRES sequence from the header of
# PDB/mmCIF files (which should contain the sequence of regions that were not modeled as well). Instead, we'll use
# Bio.SeqIO.PdbIO, which is confusingly similar to Bio.PDB.PDBIO...

from Bio.SeqIO import PdbIO as pdbio

# We'll open the same mmCIF file as before and create a list of SeqRecords corresponding to individual chains:
with open(filename, 'r') as f:
    chainseqs = list(pdbio.CifSeqresIterator(f))
    
print(chainseqs)

[SeqRecord(seq=Seq('MSNENRTCSFCGKSKSHVKHLIEGENAFICDECVSNCIEILHEDGNDGTPSESA...FES', ProteinAlphabet()), id='6VFX:A', name='6VFX:A', description='UNP:A0A0Y4ZJG4 A0A0Y4ZJG4_NEIME', dbxrefs=['UNP:A0A0Y4ZJG4', 'UNP:A0A0Y4ZJG4_NEIME']), SeqRecord(seq=Seq('MSNENRTCSFCGKSKSHVKHLIEGENAFICDECVSNCIEILHEDGNDGTPSESA...FES', ProteinAlphabet()), id='6VFX:B', name='6VFX:B', description='UNP:A0A0Y4ZJG4 A0A0Y4ZJG4_NEIME', dbxrefs=['UNP:A0A0Y4ZJG4', 'UNP:A0A0Y4ZJG4_NEIME']), SeqRecord(seq=Seq('MSNENRTCSFCGKSKSHVKHLIEGENAFICDECVSNCIEILHEDGNDGTPSESA...FES', ProteinAlphabet()), id='6VFX:C', name='6VFX:C', description='UNP:A0A0Y4ZJG4 A0A0Y4ZJG4_NEIME', dbxrefs=['UNP:A0A0Y4ZJG4', 'UNP:A0A0Y4ZJG4_NEIME']), SeqRecord(seq=Seq('MSNENRTCSFCGKSKSHVKHLIEGENAFICDECVSNCIEILHEDGNDGTPSESA...FES', ProteinAlphabet()), id='6VFX:D', name='6VFX:D', description='UNP:A0A0Y4ZJG4 A0A0Y4ZJG4_NEIME', dbxrefs=['UNP:A0A0Y4ZJG4', 'UNP:A0A0Y4ZJG4_NEIME']), SeqRecord(seq=Seq('MSNENRTCSFCGKSKSHVKHLIEGENAFICDECVSNCIEILHEDGNDGTPSESA...FES
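
A SEQRES sequence like the ones above can be indexed by residue number to find which residue type belongs at an unresolved position. The following is a minimal sketch using only the standard library. It assumes that residue numbers increase by one along the sequence starting from `first_resseq`, with no insertion codes, which would need to be checked for 6VFX. `residue_name_at` and `THREE_LETTER` are hypothetical names introduced for this example.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def residue_name_at(seqres, resseq, first_resseq=1):
    """Return the three-letter name of residue number resseq.

    Assumes residue numbers follow the SEQRES order one by one, starting at
    first_resseq (an assumption, not something read from the file).
    """
    index = resseq - first_resseq
    if not 0 <= index < len(seqres):
        raise IndexError("residue %d is outside the SEQRES sequence" % resseq)
    return THREE_LETTER[seqres[index]]


# The first ten residues of the sequence printed above.
seqres_start = "MSNENRTCSF"
print(residue_name_at(seqres_start, 3))
```

With the real records, `str(chainseqs[0].seq)` would supply the `seqres` argument.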

In [21]:
# Now we want to iterate through the chains corresponding to ClpX (C,B,E,F,D,A), copying them to a new structure

# Create a new structure, and add a new model with id 0 to it
structure_6vfx_edited = pdb.Structure.Structure('6vfx')
structure_6vfx_edited.add(pdb.Model.Model(0))

# We'll make a list of chain lengths
chainlengths = [len(list(chain.get_residues())) for chain in structure_6vfx.get_chains()]
print(chainlengths)

# And an empty list to keep track of the loops we will ask Rosetta to build
loops = []

for chainnum, chain in enumerate(structure_6vfx.get_chains()):
    chainid = chain.get_id()
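    # Compare against a tuple of IDs: `in` on a string is a substring test, so it would also match
    # an empty or multi-character chain ID such as "CB".
    if chainid in ("C", "B", "E", "F", "D", "A"):
        # Residue.id (also returned by Residue.get_id()) is a tuple. From the source code:
        # (field - hetero flag; "W" for waters; "H" for hetero residues; otherwise blank,
        #  resseq - int; sequence identifier, icode - string; insertion code)
        resolvedlist = [residue.id[1] for residue in chain.get_residues()]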
        structure_6vfx_edited[0].add(pdb.Chain.Chain(chainid))
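        for resnum, residue in enumerate(chain.get_residues()):
            # Add every resolved residue, including the last one. We add a copy because Entity.add()
            # re-parents the residue, which would otherwise alter the original structure.
            structure_6vfx_edited[0][chainid].add(residue.copy())
            if resnum != chainlengths[chainnum] - 1: # if not the last resolved residue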
                if (residue.id[1] + 1) not in resolvedlist:
                    loopstart = residue.id[1]
                    buildid = loopstart + 1
                    while buildid not in resolvedlist:
                        # TODO: figure out next residue to build from sequence
                        # find first example of that residue in the full structure and copy it
                        # loop through the atoms and randomize the xyz, set b-factor to 0
                        # add residue to chain
                        buildid += 1
                    loops.append((loopstart, buildid))
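    elif chainid == "G":
        # Copy chain G with every residue renamed to ALA
        structure_6vfx_edited[0].add(pdb.Chain.Chain("G"))
        for residue in chain.get_residues():
            # Plain assignment would only create a second name for the same object, so we copy the
            # residue before renaming it to leave the original structure unchanged.
            tempresidue = residue.copy()
            tempresidue.resname = "ALA"
            structure_6vfx_edited[0]["G"].add(tempresidue)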

[335, 339, 343, 322, 341, 342, 7, 191, 191, 191, 191, 191, 191, 191, 191, 190, 192, 191, 193, 186, 191]
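
The gap search inside the loop above can also be written as a standalone function, which makes it easier to test without a structure. This is a minimal sketch rather than the notebook's method. `find_gaps` is a hypothetical helper and the residue numbers are made up. Like `loops` above, it returns the last resolved residue before each gap paired with the first resolved residue after it.

```python
def find_gaps(resolved):
    """Return (last resolved before gap, first resolved after gap) pairs."""
    ordered = sorted(set(resolved))
    gaps = []
    # Compare each resolved residue number with the next one.
    for before, after in zip(ordered, ordered[1:]):
        if after - before > 1:
            gaps.append((before, after))
    return gaps


# Hypothetical resolved residue numbers with two gaps (64-66 and 69-71).
example_resolved = [61, 62, 63, 67, 68, 72]
print(find_gaps(example_resolved))
```

Walking over neighbouring pairs avoids the explicit `while` loop, and the final residue needs no special case because it has no successor to compare against.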
