## Physicochemical Properties of Protein Tertiary Structure Data Set

### Description
This data set describes the physicochemical properties of protein tertiary structures. It is taken from CASP 5-9 and contains 45,730 decoys, with sizes ranging from 0 to 21 angstroms (Å).


### Attribute Information:

- RMSD - Size of the residue (root-mean-square deviation, in angstroms); used below as the regression target.
- F1 - Total surface area.
- F2 - Non polar exposed area.
- F3 - Fractional area of exposed non polar residue.
- F4 - Fractional area of exposed non polar part of residue.
- F5 - Molecular mass weighted exposed area.
- F6 - Average deviation from standard exposed area of residue.
- F7 - Euclidean distance.
- F8 - Secondary structure penalty.
- F9 - Spatial distribution constraints (N,K value).

In [1]:
import numpy as np
import pandas as pd

np.random.seed(0)  # fix the seed so the random train/test split below is reproducible

In [2]:
df = pd.read_csv("CASP.csv")
df.head()

|   | RMSD | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.284 | 13558.3 | 4305.35 | 0.31754 | 162.173 | 1872791.0 | 215.359 | 4287.87 | 102 | 27.0302 |
| 1 | 6.021 | 6191.96 | 1623.16 | 0.26213 | 53.3894 | 803446.7 | 87.2024 | 3328.91 | 39 | 38.5468 |
| 2 | 9.275 | 7725.98 | 1726.28 | 0.22343 | 67.2887 | 1075648.0 | 81.7913 | 2981.04 | 29 | 38.8119 |
| 3 | 15.851 | 8424.58 | 2368.25 | 0.28111 | 67.8325 | 1210472.0 | 109.439 | 3248.22 | 70 | 39.0651 |
| 4 | 7.962 | 7460.84 | 1736.94 | 0.2328 | 52.4123 | 1021020.0 | 94.5234 | 2814.42 | 41 | 39.9147 |


In [34]:
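from sklearn import preprocessing

# Rescale every column, including the target RMSD, to the [0, 1] range.
min_max_scaler = preprocessing.MinMaxScaler()
df_val = df.values
df_val = min_max_scaler.fit_transform(df_val)
# Rebuilding the frame from the raw array drops the column names,
# so the columns are labelled 0..9 from here on (0 = RMSD, 1..9 = F1..F9).
df = pd.DataFrame(df_val)
df.head()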

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.823087 | 0.296637 | 0.26172 | 0.463818 | 0.423008 | 0.301464 | 0.323758 | 0.040471 | 0.291429 | 0.294518 |
| 1 | 0.286728 | 0.100946 | 0.08181 | 0.349616 | 0.119996 | 0.093926 | 0.097508 | 0.03142 | 0.111429 | 0.581909 |
| 2 | 0.441688 | 0.141698 | 0.088727 | 0.269853 | 0.158712 | 0.146755 | 0.087955 | 0.028137 | 0.082857 | 0.588525 |
| 3 | 0.754845 | 0.160257 | 0.131787 | 0.388734 | 0.160226 | 0.172921 | 0.136765 | 0.030659 | 0.2 | 0.594843 |
| 4 | 0.379161 | 0.134655 | 0.089442 | 0.289165 | 0.117274 | 0.136153 | 0.110432 | 0.026564 | 0.117143 | 0.616045 |

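The scaling cell above fits `MinMaxScaler` on the full table before any split, so the test rows contribute to the minimum and maximum used for scaling. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown here. The synthetic array and its split are assumptions standing in for the CASP data, not the notebook's actual values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the CASP table (assumption): 100 rows, 3 columns.
data = rng.normal(loc=10.0, scale=3.0, size=(100, 3))

# Split first, using the same random-mask idea as the notebook.
mask = rng.random(len(data)) < 0.75
train, test = data[mask], data[~mask]

# Fit on the training rows only, then reuse that transform for the test rows.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Min-max scaling is (x - min) / (max - min), computed per column.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)
```

With this ordering, scaled test values may fall slightly outside [0, 1], which is expected: the test rows did not take part in choosing the range.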

In [35]:
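# Random mask: each row is assigned to the training set with probability 0.75.
msk = np.random.rand(len(df)) < 0.75

train_df = df[msk]
test_df = df[~msk]

train = train_df.values
test = test_df.values

# Column 0 is the target (scaled RMSD); columns 1..9 are the features F1..F9.
trainX = train[:,1:]
trainY = train[:, 0:1]

testX = test[:,1:]
testY = test[:, 0:1]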

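The random mask above sends each row to the training set with probability 0.75, so the split is only approximately 75/25 (34,113 of 45,730 rows in the run shown). If an exact split size is preferred, scikit-learn's `train_test_split` is one option. The sketch below uses a synthetic array in place of the scaled CASP data, so its shapes and values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): column 0 = target, columns 1..9 = features.
data = rng.random((100, 10))

X = data[:, 1:]
y = data[:, 0]  # kept one-dimensional

# Exactly 25% of the rows go to the test set; random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```

Keeping the target one-dimensional here is a design choice: it avoids shape mismatches when predictions, which are also one-dimensional, are later compared with it.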
In [36]:
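import statsmodels.api as sm

# With no family given, GLM uses a Gaussian family with identity link,
# which is equivalent to ordinary linear regression.
# No constant column is added here, so the model is fit without an intercept.
model = sm.GLM(trainY, trainX)
results = model.fit()
print(results.summary())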

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                34113
Model:                            GLM   Df Residuals:                    34104
Model Family:                Gaussian   Df Model:                            8
Link Function:               identity   Scale:                        0.061365
Method:                          IRLS   Log-Likelihood:                -796.39
Date:                Mon, 04 Nov 2019   Deviance:                       2092.8
Time:                        02:26:55   Pearson chi2:                 2.09e+03
No. Iterations:                     3   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             3.4590      0.253     13.690      0.000       2.964       3.954
x2             0.2585      0.081      3.210      0.0

In [37]:
y_pred = results.predict(testX)

In [40]:
from sklearn import metrics
score = metrics.r2_score(testY, y_pred)
print(score)

0.2772883655189693
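The score printed above is the coefficient of determination, R² = 1 − SS_res / SS_tot. A small sketch computing it by hand, together with the root-mean-square error, may make the number easier to interpret. The arrays below are made-up values, not the notebook's predictions.

```python
import numpy as np

# Made-up targets and predictions (assumption), both one-dimensional.
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.70])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2, rmse)
```

When applying this to the notebook's arrays, note that `testY` has shape `(n, 1)` while `y_pred` has shape `(n,)`. Subtracting them directly would broadcast to an `(n, n)` matrix, so flattening first, for example with `testY.ravel()`, is advisable.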
