# Adding Configurational Entropy (S$_{cnf}$) to Rosetta Descriptors

### Problem:
* RosettaCommons currently ignores S$_{cnf}$ as a modelling feature. 
* Its modelling power would be improved by explicitly including S$_{cnf}$ as a descriptor.

### Methods:
* Ran 80,000+ proteins through the Configurational Entropy Estimator [Popcoen](http://fmc.ub.edu/popcoen/)
    * stored in 'proteins.entropy' txt files
* Converted files in 'proteins.entropy' to tsv files 
* Converted tsv files to pandas DataFrames
    * stored in 'proteins.df'
* Merged the DataFrames with Hamed's aggregated data
    * created 'final_df.csv'
* Ran analytic methods on the data
    * scikit-learn's `RandomForestClassifier`, `LinearRegression`, and `LogisticRegression`
    * Kolmogorov-Smirnov test
    * etc.
* Separated the data by topology and re-ran the analytic methods
* Plotted figures
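
The Kolmogorov-Smirnov comparison listed above can be illustrated with a minimal pure-Python sketch. It computes only the two-sample KS statistic (the largest gap between two empirical CDFs), not a p-value. The function name `ks_statistic` and the sample values are made up for illustration; they are not taken from the analysis code or the dataset.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at an observed value, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Made-up S_cnf values for illustration only (not from the dataset).
stable_scnf = [-5.3, -2.1, -0.7, 0.4]
unstable_scnf = [1.2, 2.5, 3.8, 4.1]
print(ks_statistic(stable_scnf, unstable_scnf))  # prints 1.0 (no overlap)
```

A statistic near 0 means the two distributions look alike; a value near 1 means they barely overlap. For real analyses, `scipy.stats.ks_2samp` returns the statistic together with a p-value.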

### Results:
* S$_{cnf}$ does not offer a significant improvement when topology is not taken into account.
* Average alpha-helix S$_{cnf}$ mostly distinguishes between stable and unstable proteins.

-------------------

## Step 0: Obtaining S$_{cnf}$ Data

---------------

**Set up the client**
* Open a terminal and change directory to `Popcoen_Full_Version1`. 
* Run `popcoen_server.py`. All of the following steps assume the server is running.

__IMPORTANT__: All Popcoen files must be run with Python 2.x. Since Jupyter runs Python 3.x here,
I used an Anaconda virtual environment when running Popcoen.



**Run .pdb files through Popcoen from Python**
* `pdb_to_text` looks for a pdb directory (named by its first argument) inside `"~/Jan/PDB_Files/"`.
* It runs every file in that directory through Popcoen.
* It stores the outputs in `"~/Jan/proteins.entropy/"` under the name given by its second argument.

* This example uses the `Example` library, which contains only 5 files taken from the `Inna` library.
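
Since `helper_functions` is not shown in this notebook, here is a rough sketch of the directory loop that `pdb_to_text` is described as performing. It is an assumption, not the actual implementation: `run_library` is a hypothetical name, and `estimate_entropy` is a hypothetical callable standing in for whatever sends one pdb file to the running Popcoen server and returns its text output.

```python
import os

def run_library(pdb_dir, out_dir, estimate_entropy):
    """Run every .pdb file in pdb_dir through estimate_entropy.

    estimate_entropy is a hypothetical stand-in for the Popcoen call:
    it takes a pdb path and returns the text to store.
    """
    os.makedirs(out_dir, exist_ok=True)
    pdb_names = sorted(n for n in os.listdir(pdb_dir) if n.endswith(".pdb"))
    print("pdb list contains", len(pdb_names), "entries")
    written = []
    for name in pdb_names:
        result_text = estimate_entropy(os.path.join(pdb_dir, name))
        # One output per input: same base name, '.entropy' extension.
        out_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".entropy")
        with open(out_path, "w") as handle:
            handle.write(result_text)
        written.append(out_path)
    return written
```

Passing the Popcoen call in as a parameter keeps the loop testable without a running server, and it leaves open how the Python 2 side is actually invoked.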

In [1]:
from helper_functions import pdb_to_text

pdb_to_text('Example','Example.entropy')

path_name = /home/jupyter/tacc-work/Jan/PDB_Files/Example
('pdb list contains', 5, 'entries')


In [2]:
#Check that it worked!
import os
os.listdir("/home/jupyter/tacc-work/Jan/proteins.entropy/Example.entropy/")

['p1-14H-BBL-14H-GABBL-16H_0161_0001_0001.entropy',
 'p1-14H-BBL-14H-GABBL-15H_0076_0001_0001.entropy',
 'p1-14H-BBL-14H-GABBL-15H_0397_0001_0001.entropy',
 'p1-14H-BBL-14H-GABBL-16H_0212_0001_0001.entropy',
 'p1-14H-BBL-14H-GABBL-14H_0322_0001_0001.entropy']

----------

## Step 1: Converting from .entropy to .tsv

--------------

**Convert files stored in proteins.entropy to .tsv files**
* `entropy_to_tsv` reformats the text files so that pandas can read them.
* S$_{cnf}$ is stored in the `S_PC` column, named after Popcoen.
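
The layout of the `.entropy` files is not shown here, so the sketch below covers only the writing half of such a conversion. It assumes each file has already been parsed into a dict, and it writes tab-separated rows under the column names that appear in the output further down. `write_entropy_tsv` and `COLUMNS` are hypothetical names, not part of `helper_functions`.

```python
import csv

# Column names as they appear in the Example.tsv output.
COLUMNS = ["filename", "library", "S_PC", "reliability", "Si values"]

def write_entropy_tsv(records, tsv_path):
    """Write already-parsed Popcoen results (a list of dicts) as a .tsv file."""
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for record in records:
            # Non-string values (such as the per-residue list) are written via str().
            writer.writerow(record)
```

Because the per-residue values are written as the string form of a list, they have to be parsed back into numbers after loading.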

In [4]:
from helper_functions import entropy_to_tsv

entropy_to_tsv('Example.entropy','proteins.tsv','Example')

In [5]:
#Check that it worked!
import pandas as pd
pd.read_table('/home/jupyter/tacc-work/Jan/proteins.tsv/Example.tsv')

Unnamed: 0,filename,library,S_PC,reliability,Si values
0,p1-14H-BBL-14H-GABBL-16H_0161_0001_0001.pdb,Example,3.809,0.995,"['1.390', '1.220', '1.060', '0.940', '-0.561',..."
1,p1-14H-BBL-14H-GABBL-15H_0076_0001_0001.pdb,Example,-0.699,0.994,"['1.306', '1.278', '-0.714', '0.635', '-0.230'..."
2,p1-14H-BBL-14H-GABBL-15H_0397_0001_0001.pdb,Example,1.195,0.995,"['1.440', '1.005', '1.009', '0.789', '0.457', ..."
3,p1-14H-BBL-14H-GABBL-16H_0212_0001_0001.pdb,Example,-0.029,0.995,"['1.302', '-0.046', '0.761', '-0.468', '-0.736..."
4,p1-14H-BBL-14H-GABBL-14H_0322_0001_0001.pdb,Example,-5.291,0.994,"['1.339', '0.394', '0.922', '1.072', '-0.657',..."


------------

## Step 2: Merge with Hamed's Data

---------------

**Use Hamed's consistent data to add Rosetta Features to S$_{cnf}$ data**

In [7]:
#locate Hamed's file
Hamed = '/home/jupyter/sd2e-community/protein-design/data_v1_April_2018/aggregated_v1_data_Hamed_May_23_2018/data_v1_aggregated.csv'

#create a Dataframe from Hamed
Hamed_df = pd.read_csv(Hamed)

#load the Dataframe to merge
df = pd.read_table('/home/jupyter/tacc-work/Jan/proteins.tsv/Example.tsv')

#prepare both DataFrames for the merge: they need a shared 'name' column with matching values
from helper_functions import canonicalize_name
df.rename(columns={'filename':'name'},inplace=True)
Hamed_df.loc[:,'name'] = Hamed_df.loc[:,'name'].map(canonicalize_name)

#carry out the merge
merged_df = Hamed_df.merge(df,on='name')

merged_df

Unnamed: 0,name,library_x,description,sequence,dssp,stabilityscore_t,stabilityscore_c,stabilityscore,ec50_95ci_t,ec50_95ci_c,...,sum_best_frags,total_score,tryp_cut_sites,two_core_each,worst6frags,worstfrag,library_y,S_PC,reliability,Si values
0,p1-14H-BBL-14H-GABBL-14H_0322_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,DTEKLKEKVREILEKLSPDEARKYIERLYKEGKISDEQRKELERFL...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,2.168502,2.014133,2.014133,3.1,0.8,...,11.3702,-141.10361,14.0,1.0,5.2066,1.2177,Example,-5.291,0.994,"['1.339', '0.394', '0.922', '1.072', '-0.657',..."
1,p1-14H-BBL-14H-GABBL-15H_0076_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SQDQAREEVKKLESQLSPEQVKRKLEELRRKGKLDPKVLEEWQKRL...,LHHHHHHHHHHHHLLLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,-0.13475,-0.225559,-0.225559,2.2,2.2,...,11.6962,-130.848259,15.0,1.0,4.5311,0.8806,Example,-0.699,0.994,"['1.306', '1.278', '-0.714', '0.635', '-0.230'..."
2,p1-14H-BBL-14H-GABBL-15H_0397_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SQKEKWKKVEEKLRRLSPDEAEKLVRKIEKKGLLSPELIERAKEVV...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,2.502716,1.936421,1.936421,2.6,1.1,...,10.0659,-148.088147,16.0,1.0,3.2959,0.6785,Example,1.195,0.995,"['1.440', '1.005', '1.009', '0.789', '0.457', ..."
3,p1-14H-BBL-14H-GABBL-16H_0161_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SEEKVKEEVKKLRKKLSKEEARKVVEQLVRDGKLDPEELRVLKEWI...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,0.094105,0.051388,0.051388,2.9,2.6,...,12.0142,-136.877889,18.0,0.666667,5.2009,1.2392,Example,3.809,0.995,"['1.390', '1.220', '1.060', '0.940', '-0.561',..."
4,p1-14H-BBL-14H-GABBL-16H_0212_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SYKDVQKELEKVFKTLSPEEARKFVEKLERKGKIDESEIREAKKFV...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,0.164503,1.111317,0.164503,3.1,1.0,...,13.724,-130.393499,16.0,1.0,5.3308,1.1369,Example,-0.029,0.995,"['1.302', '-0.046', '0.761', '-0.468', '-0.736..."


Great! Now we have a dataframe that contains the S$_{cnf}$ for all of these proteins!
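
With the default inner join, `merge` silently drops any protein whose name does not match on both sides. A quick way to check for that is an outer merge with `indicator=True`, sketched below. `canonicalize_name_sketch` is only a guess at what `canonicalize_name` might do (strip any directory part and ensure a `.pdb` suffix); the real helper is not shown here, and `merge_report` is a hypothetical name.

```python
import os
import pandas as pd

def canonicalize_name_sketch(name):
    """Hypothetical stand-in: drop any directory part, ensure a '.pdb' suffix."""
    base = os.path.basename(str(name))
    return base if base.endswith(".pdb") else base + ".pdb"

def merge_report(rosetta_df, entropy_df):
    """Count rows matched in both frames versus rows found in only one."""
    merged = rosetta_df.merge(entropy_df, on="name", how="outer", indicator=True)
    return {
        key: int((merged["_merge"] == key).sum())
        for key in ("both", "left_only", "right_only")
    }
```

A non-zero `right_only` count would mean some S$_{cnf}$ rows have no Rosetta features, which usually points to a naming mismatch.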

----------------

## Step 3: Add more S$_{cnf}$ features

------------------

In [8]:
from helper_functions import dssp_to_bin, sum_entropies, avg_entropy, maxmin_entropy, add_Scnf_features
from RandomForestAlgorithm import to_float
#make column names standard
df = merged_df.rename(columns={'Si values':'Si_values'})

#convert the Si values from strings to floats
df['Si_values'] = df['Si_values'].map(to_float)

#add remaining Configurational Entropy features
df = add_Scnf_features(df)

df


Unnamed: 0,name,library_x,description,sequence,dssp,stabilityscore_t,stabilityscore_c,stabilityscore,ec50_95ci_t,ec50_95ci_c,...,Mean_res_entropy,H_max_entropy,L_max_entropy,E_max_entropy,H_min_entropy,L_min_entropy,E_min_entropy,H_range_entropy,L_range_entropy,E_range_entropy
0,p1-14H-BBL-14H-GABBL-14H_0322_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,DTEKLKEKVREILEKLSPDEARKYIERLYKEGKISDEQRKELERFL...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,2.168502,2.014133,2.014133,3.1,0.8,...,-0.03526,1.433,1.574,0.0,-2.306,-0.622,0.0,3.739,2.196,0.0
1,p1-14H-BBL-14H-GABBL-15H_0076_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SQDQAREEVKKLESQLSPEQVKRKLEELRRKGKLDPKVLEEWQKRL...,LHHHHHHHHHHHHLLLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,-0.13475,-0.225559,-0.225559,2.2,2.2,...,-0.004562,1.278,1.507,0.0,-1.69,-0.308,0.0,2.968,1.815,0.0
2,p1-14H-BBL-14H-GABBL-15H_0397_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SQKEKWKKVEEKLRRLSPDEAEKLVRKIEKKGLLSPELIERAKEVV...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,2.502716,1.936421,1.936421,2.6,1.1,...,0.007804,1.628,1.44,0.0,-1.64,-0.45,0.0,3.268,1.89,0.0
3,p1-14H-BBL-14H-GABBL-16H_0161_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SEEKVKEEVKKLRKKLSKEEARKVVEQLVRDGKLDPEELRVLKEWI...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,0.094105,0.051388,0.051388,2.9,2.6,...,0.024429,2.031,1.961,0.0,-2.111,-0.019,0.0,4.142,1.98,0.0
4,p1-14H-BBL-14H-GABBL-16H_0212_0001_0001.pdb,Inna,results/curate_designs_from_April_2016_chip/mo...,SYKDVQKELEKVFKTLSPEEARKFVEKLERKGKIDESEIREAKKFV...,LHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHLLLLHHHHHHHHHHH...,0.164503,1.111317,0.164503,3.1,1.0,...,-0.000179,1.966,1.583,0.0,-1.407,-0.182,0.0,3.373,1.765,0.0
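
The per-structure columns above are consistent with range = max - min (for the first row, 1.433 - (-2.306) = 3.739), and with 0.0 being used when a structure type is absent. The sketch below reproduces that behaviour under two assumptions: the `Si values` string holds one value per residue, and those values line up one-to-one with the `dssp` string. Both function names are hypothetical and are not the actual `to_float` or `add_Scnf_features` helpers.

```python
import ast

def parse_si_values(text):
    """Turn a string such as "['1.390', '-0.561']" into a list of floats."""
    return [float(value) for value in ast.literal_eval(text)]

def structure_entropy_features(dssp, si_values):
    """Max, min and range of per-residue entropy for H, L and E residues."""
    if len(dssp) != len(si_values):
        raise ValueError("dssp and Si values must have the same length")
    features = {}
    for code in "HLE":
        values = [s for d, s in zip(dssp, si_values) if d == code]
        # Assumption: an absent structure type yields 0.0, as in the table above.
        high = max(values) if values else 0.0
        low = min(values) if values else 0.0
        features[code + "_max_entropy"] = high
        features[code + "_min_entropy"] = low
        features[code + "_range_entropy"] = high - low
    return features
```

For example, `structure_entropy_features("LHHL", [1.0, -2.0, 0.5, 0.25])` gives an `H_range_entropy` of 2.5 and leaves all three `E_` features at 0.0.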


---------------

## Step 4: 

----------------