# Load data

Chignolin is a small protein that can be found in a folded (ordered, closed) state and in an unfolded (disordered, open) state. Here we characterize it with four groups of features:

1. a set of distances between carbon atoms [ca]
2. a set of dihedral angles of the main chain of the protein (sin and cos) [cphi,sphi,cpsi,spsi]
3. a set of dihedral angles of the side chain (sin and cos) [cchi,schi]
4. a set of contacts between oxygen and nitrogen atoms where hydrogen bonds can form (more precisely, between O and the H bonded to N) [o-n]
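
The dihedral angles in groups 2 and 3 enter as sine/cosine pairs rather than as raw angles. A common motivation, assumed here rather than stated by the data files, is periodicity: two angles on either side of the -π/+π boundary describe nearly the same conformation but are numerically far apart. The sketch below uses made-up angle values to show that the (cos, sin) encoding keeps them close.

```python
import numpy as np

# Two illustrative dihedral angles (radians) that are physically almost
# identical but sit on opposite sides of the -pi/+pi boundary.
phi_a = np.pi - 0.01
phi_b = -np.pi + 0.01

# Difference of the raw angle values: close to 2*pi
raw_gap = abs(phi_a - phi_b)

# Euclidean distance between the (cos, sin) encodings: close to zero
enc_a = np.array([np.cos(phi_a), np.sin(phi_a)])
enc_b = np.array([np.cos(phi_b), np.sin(phi_b)])
enc_gap = np.linalg.norm(enc_a - enc_b)

print(f"raw gap:     {raw_gap:.3f}")
print(f"encoded gap: {enc_gap:.3f}")
```

The price of the encoding is two columns per angle, which is why each dihedral appears under both a `c…` and an `s…` prefix.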

To start with, a small dataset can be built from the features of the first group alone.

The files _folded.dat_ and _unfolded.dat_ contain frames from a trajectory of the protein in each of the two states. The files hold more columns than we need (for instance, both the dihedral angles and their sin/cos, of which we want only the latter), so we use pandas to select columns by their header names.

### Example of dataset import

In [None]:
import pandas as pd

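# state A (folded)
filename = 'train-set/folded.dat'
# read only the header line; its first two tokens are not column names, so drop them
headers = pd.read_csv(filename, sep=' ', skipinitialspace=True, nrows=0).columns[2:]
# read the data rows, skipping the header line and any '#' comment lines
df = pd.read_csv(filename, sep=' ', skipinitialspace=True, header=None, skiprows=1, names=headers, comment='#')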

In [5]:
df

Unnamed: 0,time,o-n1,o-n2,o-n3,o-n4,o-n5,o-n6,o-n7,o-n8,o-n9,...,ca-88_102,ca-88_109,ca-88_123,ca-88_147,ca-102_109,ca-102_123,ca-102_147,ca-109_123,ca-109_147,ca-123_147
0,0.0,3.365559,2.762824,5.285263,3.198980,2.323934,2.480256,2.304085,4.049950,2.086928,...,0.381254,0.554182,0.916527,1.117721,0.375618,0.688094,0.988150,0.381066,0.652290,0.381635
1,50.0,3.379572,2.744688,4.661482,3.012232,2.334897,2.484532,2.541566,3.909495,2.131110,...,0.385045,0.542727,0.884412,1.136157,0.379896,0.633754,0.968653,0.374616,0.643890,0.377043
2,100.0,3.256660,2.616842,4.891612,3.086377,2.247415,2.525398,2.548148,3.973970,2.056355,...,0.378272,0.539875,0.903319,1.125826,0.376559,0.650718,0.960272,0.385673,0.636797,0.374900
3,150.0,3.619743,2.687783,4.980612,2.790702,2.187083,2.244627,2.295603,4.172405,2.084816,...,0.385479,0.551803,0.903471,1.069470,0.377738,0.684110,0.969640,0.373553,0.611913,0.382958
4,200.0,3.550007,2.541721,4.949437,2.844430,2.358803,2.470524,2.316255,3.948478,2.056963,...,0.381618,0.528578,0.888199,1.060548,0.376982,0.693364,0.989131,0.376459,0.629492,0.392047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,499750.0,3.394749,2.696564,4.902963,2.929017,2.209103,2.482822,2.540227,3.997486,2.162467,...,0.392563,0.546995,0.907817,1.090265,0.379406,0.672605,0.965751,0.381340,0.604611,0.376125
9996,499800.0,3.457285,2.614702,4.897009,2.976728,2.338816,2.401391,2.286432,3.938829,1.973183,...,0.377473,0.545140,0.906116,1.100935,0.371626,0.655898,0.955925,0.382003,0.619622,0.388671
9997,499850.0,3.361701,2.688526,4.386531,3.236395,2.404412,2.428383,2.330824,3.984482,2.055548,...,0.378217,0.546044,0.910256,1.091156,0.373583,0.665028,0.947055,0.384554,0.630994,0.385986
9998,499900.0,3.166119,2.681709,5.083762,3.013725,2.374039,2.468818,2.460192,3.872121,2.061673,...,0.382755,0.540520,0.893357,1.125641,0.385416,0.658369,0.985072,0.368763,0.629254,0.382864
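
Since the same two-step read applies to both states, it can be wrapped in a small function. The sketch below is one possible way to do this; the function name `load_colvar` is hypothetical, and the path for the unfolded state is assumed by analogy with the folded one.

```python
import pandas as pd

def load_colvar(filename):
    """Read a space-separated trajectory file whose first line lists the column names."""
    # Read only the header line and drop its first two tokens, which are
    # markers rather than column names (same slicing as in the import cell).
    headers = pd.read_csv(filename, sep=' ', skipinitialspace=True, nrows=0).columns[2:]
    # Read the data rows, skipping the header line and any '#' comment lines.
    return pd.read_csv(filename, sep=' ', skipinitialspace=True, header=None,
                       skiprows=1, names=headers, comment='#')

# Possible usage (paths are assumptions, so the calls are left commented out):
# df_folded = load_colvar('train-set/folded.dat')
# df_unfolded = load_colvar('train-set/unfolded.dat')
```

Keeping the two states in separate DataFrames makes it straightforward to apply the same column selection to each.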


#### Small dataset (features #1)

In [19]:
# select the subset of columns whose name contains 'ca'
small_df = df.filter(regex='ca')

# save the feature names and the values as a NumPy array
names = small_df.columns
array = small_df.values

print('names:',names.shape)
print('array:',array.shape)

names: (45,)
array: (10000, 45)
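
Note that `DataFrame.filter(regex=...)` keeps every column whose name contains a match anywhere, not only at the start. For the column names shown above this makes no difference, but an anchored pattern is a safer habit. The toy frame below uses made-up columns, including a hypothetical `local-3` that merely contains the letters "ca", to illustrate the difference.

```python
import pandas as pd

# Toy frame with made-up column names that mimic the naming scheme above,
# plus one hypothetical column ('local-3') that only contains "ca" by accident.
toy = pd.DataFrame([[0.0, 0.38, 0.55, 0.9, 0.1]],
                   columns=['time', 'ca-88_102', 'ca-88_109', 'cphi-2', 'local-3'])

loose = toy.filter(regex='ca')      # matches "ca" anywhere in the name
strict = toy.filter(regex='^ca-')   # matches only names starting with "ca-"

print(list(loose.columns))
print(list(strict.columns))
```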


#### Full dataset (features #1,2,3,4)

In [20]:
# select the columns of all four feature groups by name
full_df = df.filter(regex='ca|cphi|sphi|cpsi|spsi|cchi|schi|o-n')

# save the feature names and the values as a NumPy array
names = full_df.columns
array = full_df.values

print('names:',names.shape)
print('array:',array.shape)

names: (122,)
array: (10000, 122)
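
A quick sanity check on the selection is to count how many columns each group contributes. The sketch below is one way to do this; the helper `count_per_group` and the group labels are hypothetical, and the toy frame uses made-up column names so the block runs on its own.

```python
import pandas as pd

# Regex for each of the four feature groups, taken from the list at the top.
groups = {
    'distances': 'ca',
    'backbone dihedrals': 'cphi|sphi|cpsi|spsi',
    'side-chain dihedrals': 'cchi|schi',
    'contacts': 'o-n',
}

def count_per_group(frame, groups):
    """Return {group name: number of columns whose name matches the group's regex}."""
    return {name: frame.filter(regex=pattern).shape[1]
            for name, pattern in groups.items()}

# Toy frame with made-up columns; with the real data, pass `df` instead.
toy = pd.DataFrame(columns=['time', 'o-n1', 'o-n2', 'cphi-2', 'sphi-2',
                            'cchi-3', 'ca-88_102'])
counts = count_per_group(toy, groups)
print(counts, 'total:', sum(counts.values()))
```

On the real DataFrame, the total should equal the number of columns in `full_df`, provided no column name matches more than one group's pattern.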
