In [None]:
(c) Copyright Rosetta Commons Member Institutions.
(c) This file is part of the Rosetta software suite and is made available under license.
(c) The Rosetta software is developed by the contributing members of the Rosetta Commons.
(c) For more information, see http://www.rosettacommons.org. Questions about this can be
(c) addressed to University of Washington CoMotion, email: license@uw.edu.

## *This uses the older PyRosetta bindings!* c. 2015
---

## Is PyRosetta really the same as Rosetta???!!!

When I first started using PyRosetta, I kept asking myself this question: is the energy function in PyRosetta
the same as the one I get when I run Rosetta? I knew at the time that this was not (necessarily) the case for the scores
between Rosetta and Foldit. This is easily seen by taking output from Rosetta and loading it into Foldit.
What scores -500 REU in Rosetta proper might score a whopping 100 pts in Foldit!

So again, how do I know that the score I get from PyRosetta is the same as what I would get from Rosetta?
This was an important question at the time because I had just begun my quest to convert everyone and everything
in our lab over to using PyRosetta (yay Python!). 

In order to answer this question, I decided that I would score a protein structure under a constant seed value,
both under Rosetta using the score_jd2 binary and with PyRosetta until they gave me the same score.


Goal: show that the PyRosetta score and the score_jd2 score are the same (weights and sums)
    

Notes:

- When you pick a Rosetta scorefunction you are picking weights and which terms are turned on.

- Most scores are readily accessible in PyRosetta; however, the hbond terms must be dealt with specially.

So, assuming you have Rosetta proper installed in your home directory and compiled with gcc on Linux,
executing the next cell in the notebook (using Shift-Enter) will run Rosetta in the background and
generate a score file and a scored output pdb.

flags

-score:weights                                         specify the specific energy function weights file

-s                                                     path to the input structure

-out:pdb                                               outputs a pdb structure

-ignore_unrecognized_res T                             needed for older Rosetta builds to not crash on ligands

-constant_seed                                         needed to remove randomness

-jran 1                                                needed to remove randomness
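
The same flag set can also be assembled in Python before it is handed to the shell. The sketch below only builds and prints the command, it does not run Rosetta. `build_score_jd2_command` is a hypothetical helper, and the binary name, database location, and input path simply mirror the ones assumed in the next cell, so adjust them to your own install.

```python
import os

def build_score_jd2_command(rosetta_root, pdb_path, weights="./talaris2013"):
    """Return the score_jd2 call as an argument list (hypothetical helper)."""
    return [
        os.path.join(rosetta_root, "main", "source", "bin",
                     "score_jd2.default.linuxgccdebug"),
        "-score:weights", weights,                      # energy function weights
        "-database", os.path.join(rosetta_root, "main", "database"),
        "-s", pdb_path,                                 # input structure
        "-out:pdb",                                     # write a scored pdb
        "-ignore_unrecognized_res", "T",
        "-constant_seed",                               # remove randomness
        "-jran", "1",
        "-overwrite",
    ]

# Assumes Rosetta lives in ~/Rosetta, as in the rest of this notebook.
cmd = build_score_jd2_command(os.path.expanduser("~/Rosetta"),
                              "../resources/protein_structures/2jie.pdb")
print(" ".join(cmd))
```

Keeping the flags in one list makes it easy to see that both random-seed flags are present before comparing scores.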
    

In [None]:
!rm -f score.sc 2jie_0001.pdb
!~/Rosetta/main/source/bin/score_jd2.default.linuxgccdebug -score:weights ./talaris2013 -database ~/Rosetta/main/database -s ../resources/protein_structures/2jie.pdb -out:pdb -ignore_unrecognized_res T -constant_seed -jran 1 -overwrite

Excellent, now you should have a score.sc file and a file called 2jie_0001.pdb

You should see a line like the following, note the talaris2013
```
core.scoring.ScoreFunctionFactory: SCOREFUNCTION: talaris2013
```

We can quickly verify this by running shell commands (precede with !)

In [None]:
!ls

Ok, so now we can open the score.sc in a spreadsheet program, or we can use something called a pandas DataFrame. The next cell will read the score file.

If you have not installed Pandas before, try running
```
>pip install pandas
```
from the command line and restarting the notebook
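
For orientation, a Rosetta score file is whitespace-delimited text: the first line starts with `SEQUENCE:`, the second line names the columns, and every following line describes one scored structure, each prefixed with `SCORE:`. The sketch below parses a made-up miniature sample with plain Python to show that layout; the numbers and the `toy_0001` name are invented.

```python
# A made-up miniature score file; real files have many more columns.
sample = """SEQUENCE:
SCORE: total_score fa_atr description
SCORE: -1.500 -2.250 toy_0001
"""

lines = sample.splitlines()
header = lines[1].split()                    # line 0 is SEQUENCE:, line 1 names the columns
rows = [line.split() for line in lines[2:]]  # one row per scored structure

# Pair each column name with its value, skipping the leading "SCORE:" marker.
records = [dict(zip(header[1:], row[1:])) for row in rows]
print(records[0]["total_score"])
```

This is the layout that a whitespace separator and `header=1` account for when the file is read with pandas, and it is also where a leftover `SCORE:` column comes from.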

In [None]:
import pandas as pd
df_ref = pd.read_csv("score.sc", sep=r'\s+', header=1)  # header=1: column names are on the second line
df_ref

Looks good, except that first column.....

In [None]:
df_ref = df_ref.drop('SCORE:',axis=1)
df_ref

So, from above we see the total_score is 737.643 REU. Now we have a baseline. 

Next, we'll try to get PyRosetta to read in and score the same PDB and get the exact same value.
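
Since the score file stores values rounded to three decimal places while PyRosetta returns a full-precision float, "the same" in practice means equal at the score file's precision. A minimal sketch of such a check follows; `scores_match` is a hypothetical helper, and the second number is an invented stand-in for a PyRosetta result.

```python
def scores_match(reference, candidate, decimals=3):
    """True if candidate is within half a unit of the last printed decimal of reference."""
    return abs(reference - candidate) < 0.5 * 10 ** (-decimals)

rosetta_total = 737.643      # total_score read from score.sc above
pyrosetta_total = 737.6432   # invented value standing in for sfxn(p)

print(scores_match(rosetta_total, pyrosetta_total))
```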


### Side Note:
Rosetta stores all of the energies on a per energy term, per protein residue basis.

For example, if we have a protein that has 2 amino acids (GA) and we look at the scorefile, the term for fa_atr might be -30.0. This is the sum of the fa_atr energies of the G residue and the A residue, multiplied by the weight of the fa_atr term.

Ideally, a pdb can be passed in with a sfxn and a dataframe of the per-residue scores is returned, so that the totals are instantly accessible.
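
As a worked version of the two-residue example, the sketch below uses invented unweighted per-residue energies and an invented weight, chosen so that the weighted sum comes out to the -30.0 mentioned above. Plain dictionaries stand in for Rosetta's energy tables.

```python
# Invented unweighted fa_atr energies for a two-residue protein (G, then A).
unweighted = {1: {"fa_atr": -20.0}, 2: {"fa_atr": -40.0}}
weights = {"fa_atr": 0.5}   # invented weight, not taken from a real weights file

# Weighted per-residue energies: weight * unweighted value for each term.
weighted = {}
for resi, terms in unweighted.items():
    weighted[resi] = {term: weights[term] * value for term, value in terms.items()}

# The score file entry for a term is the sum of that term over all residues.
fa_atr_total = sum(terms["fa_atr"] for terms in weighted.values())
print(fa_atr_total)
```

Because the weight is the same for every residue, weighting each residue first and then summing gives the same total as summing first and weighting once.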

In [None]:
from rosetta import *
from rosetta.core.scoring.methods import EnergyMethodOptions
import pandas as pd

rosetta.init('-ignore_unrecognized_res T -constant_seed True -jran 1 ')

p = pose_from_file('../resources/protein_structures/2jie.pdb')  #import pose
sfxn = rosetta.core.scoring.ScoreFunctionFactory.create_score_function('talaris2013')
talaris2013_energy_methods = sfxn.energy_method_options() #have to copy the default energy methods from talaris first

emo = EnergyMethodOptions( talaris2013_energy_methods)    #must do this to get per res hbond_bb terms in breakdown
emo.hbond_options().decompose_bb_hb_into_pair_energies( True )  # set to true, defaults False
sfxn.set_energy_method_options( emo ) #set the sfxn up with the energy method options
print(sfxn(p))  #total score; compare with total_score in score.sc

This matches our original, Rosetta-run scorefile! Mission accomplished! PyRosetta & Rosetta use the exact same energy function. For seasoned developers, of course, this is no surprise, since the bindings for PyRosetta are built off of Rosetta's C++ source code. They are one and the same! However, I have found there to be a bit of misconception around this fact.

---
---

On to the next thing!!


Pandas Dataframes
---


So wouldn't it be neat if, during the course of a Rosetta simulation, we could query the pose and get back the residue energy of the ith residue? Or perhaps of a set of residues that we were interested in? Maybe, in fact, we would like to be able to define an active site by passing in a list of residue numbers. Imagine having a function that takes a pose and score function and gives back the "active site energy".
```
myactivesite_energy = active_site_energy(pose, sfxn)
```
This is treated in its entirety in another notebook (link active site energy calculator)
but for now, let's just build the structure. 

We want to be able to get the weighted, per energy term, per residue score for a pose given a scorefunction.

There is a Python library called pandas (http://pandas.pydata.org/) whose DataFrame is essentially a spreadsheet on steroids. Using a pandas DataFrame, we will construct an 'array' that gives us access to the energy terms and residue numbers by name and index. For example, to look at all of the protein's fa_atr score terms, we will be able to type
```
df['fa_atr']
```

and to look at all of the score terms for the 11th residue, we will be able to type
```
df.loc[[11]]
```

The following creates that:

In [None]:
score_types = []
for i in range(1, rosetta.core.scoring.end_of_score_type_enumeration+1):
    ii = rosetta.core.scoring.ScoreType(i)
    if p.energies().weights()[ii] != 0: score_types.append(ii)

listofseries = []
for j in range(1,p.total_residue()+1):
    mydict = {}
    for i in score_types:
        myweight = p.energies().weights()[i]
        mydict['%s' %core.scoring.ScoreTypeManager.name_from_score_type(i)] = myweight*p.energies().residue_total_energies(j)[i]

    listofseries.append( pd.Series(mydict))

df = pd.DataFrame(listofseries)
df.index +=1 #makes index start at 1, not 0. Now, each row refers to its proper residue number (ie resi =1 -> row1)
print(df)
print(df.sum())          #per-term totals over all residues
print(df.sum().sum())    #add another .sum() to get the total protein score

So what we got from that last section is the ability to easily access any score term for any residue. Let's say you really care about the fa_atr for residue 10 in your system.
```
print(df['fa_atr'][10])
```

and below in executable code

In [None]:
df['fa_atr'][10]

Well, after all of that work, let's put what we learned into a function so that we can reuse it in the future.

This function takes a rosetta Pose and a scorefunction and returns a pandas dataframe so that you can inspect and filter the energies on a per term, per residue basis all in one line!

In [None]:
from rosetta import *
from rosetta.core.scoring.methods import EnergyMethodOptions
import pandas as pd

#### This function is used to get a dataframe of all per residues scores in the pose
def dataframe_from_pose_and_sfxn(p, sfxn):
    '''
    Takes a Rosetta Pose and a scorefunction, then returns a dataframe
    of the per-residue weighted score terms, including hbond_bb (which get
    zeroed out by default)

    '''

    current_energy_methods_options = sfxn.energy_method_options() #keep current options
    emo = EnergyMethodOptions( current_energy_methods_options)    #must do this to get per res hbond_bb terms in breakdown
    emo.hbond_options().decompose_bb_hb_into_pair_energies( True )  # set to true, defaults False
    sfxn.set_energy_method_options( emo ) #set the sfxn up with the energy method options
    sfxn(p) #score only after the options are set, so the hbond_bb terms show up in the per-residue energies
    

    score_types = []
    for i in range(1, rosetta.core.scoring.end_of_score_type_enumeration+1):
        ii = rosetta.core.scoring.ScoreType(i)
        if p.energies().weights()[ii] != 0: score_types.append(ii)
        
    listofseries = []
    for j in range(1,p.total_residue()+1):
        mydict = {}
        for i in score_types:
            
            myweight = p.energies().weights()[i]
            mydict['%s' %core.scoring.ScoreTypeManager.name_from_score_type(i)] = myweight*p.energies().residue_total_energies(j)[i]

        listofseries.append( pd.Series(mydict))

    df = pd.DataFrame(listofseries)
    df.index +=1
    df = df.T  #transpose: rows are score terms, columns are residue numbers
    return df
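
As a usage sketch: because of the final transpose, the table returned by `dataframe_from_pose_and_sfxn` has score terms as rows and residue numbers as columns, so the energy of a group of residues is a column selection followed by a sum. The example below runs on a small invented table rather than a real pose, and `residue_subset_energy` is a hypothetical helper, not the active site calculator from the other notebook.

```python
import pandas as pd

# Invented weighted energies: rows are score terms, columns are residue numbers.
df = pd.DataFrame(
    {1: [-1.0, 0.25], 2: [-2.0, 0.5], 3: [-3.0, 0.75]},
    index=["fa_atr", "fa_rep"],
)

def residue_subset_energy(df, residues):
    """Sum every score term over the given residue numbers (hypothetical helper)."""
    return df[residues].sum().sum()

print(residue_subset_energy(df, [1, 3]))   # residues 1 and 3 only
print(df.loc["fa_atr"].sum())              # fa_atr summed over all residues
```

With a real pose, the same two lines would be applied to the DataFrame returned by the function above, with the residue list naming the positions of interest.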