# Get Features

This notebook demonstrates how to obtain features and other data from the EnsembleAnalysis class. It is a temporary demo and should be deleted afterwards.

In [1]:
import sys
sys.path.append('C:/Users/nikol/Documents/GitHub/EnsembleTools')

In [2]:
from dpet.ensemble_analysis import EnsembleAnalysis
import os
os.environ["LOKY_MAX_CPU_COUNT"] = "8"


In [3]:
ens_codes = [
    'PED00156e001',
    'PED00157e001',
    'PED00158e001'
]
data_dir = 'C:/Users/nikol/Documents/test_dir/ped'

analysis = EnsembleAnalysis(ens_codes, data_dir)
analysis.download_from_database(database='ped')

Ensemble PED00156e001 already downloaded. Skipping.
File PED00156e001.pdb already exists. Skipping extraction.
Ensemble PED00157e001 already downloaded. Skipping.
File PED00157e001.pdb already exists. Skipping extraction.
Ensemble PED00158e001 already downloaded. Skipping.
File PED00158e001.pdb already exists. Skipping extraction.


Each pipeline step now returns its own result. For example, generate_trajectories returns a dictionary of trajectories keyed by ensemble code:

In [4]:
analysis.generate_trajectories()

Trajectory already exists for ensemble PED00156e001. Loading trajectory.
Trajectory already exists for ensemble PED00157e001. Loading trajectory.
Trajectory already exists for ensemble PED00158e001. Loading trajectory.


{'PED00156e001': <mdtraj.Trajectory with 100 frames, 941 atoms, 59 residues, without unitcells at 0x19ed45bbf10>,
 'PED00157e001': <mdtraj.Trajectory with 100 frames, 939 atoms, 59 residues, without unitcells at 0x19ede5b7ee0>,
 'PED00158e001': <mdtraj.Trajectory with 88 frames, 939 atoms, 59 residues, without unitcells at 0x19ede47c1c0>}

The same dictionary is also available through the trajectories attribute:

In [5]:
analysis.trajectories

{'PED00156e001': <mdtraj.Trajectory with 100 frames, 941 atoms, 59 residues, without unitcells at 0x19ed45bbf10>,
 'PED00157e001': <mdtraj.Trajectory with 100 frames, 939 atoms, 59 residues, without unitcells at 0x19ede5b7ee0>,
 'PED00158e001': <mdtraj.Trajectory with 88 frames, 939 atoms, 59 residues, without unitcells at 0x19ede47c1c0>}
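
Since the result is a plain dictionary, it can be summarized with a comprehension. The sketch below assumes the values are mdtraj.Trajectory objects, which expose n_frames, n_atoms and n_residues. To keep it runnable without mdtraj or the downloaded data, it uses stand-in objects filled with the numbers printed above; summarize_trajectories is a hypothetical helper, not part of dpet.

```python
from types import SimpleNamespace


def summarize_trajectories(trajectories):
    """Return {ensemble_code: (n_frames, n_atoms, n_residues)}."""
    return {
        code: (traj.n_frames, traj.n_atoms, traj.n_residues)
        for code, traj in trajectories.items()
    }


# Stand-ins for mdtraj.Trajectory objects (assumption: only these three
# attributes are needed). Numbers are copied from the output above.
demo_trajectories = {
    'PED00156e001': SimpleNamespace(n_frames=100, n_atoms=941, n_residues=59),
    'PED00157e001': SimpleNamespace(n_frames=100, n_atoms=939, n_residues=59),
    'PED00158e001': SimpleNamespace(n_frames=88, n_atoms=939, n_residues=59),
}

summary = summarize_trajectories(demo_trajectories)
total_frames = sum(n_frames for n_frames, _, _ in summary.values())

for code, (n_frames, n_atoms, n_residues) in summary.items():
    print(code, n_frames, n_atoms, n_residues)
print('total frames:', total_frames)
```

With the real object, the same helper would presumably be called as summarize_trajectories(analysis.trajectories).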

extract_features also returns the data as a dictionary, mapping each ensemble code to an array of shape (n_frames, n_features):

In [6]:
analysis.extract_features(featurization='phi_psi')

Performing feature extraction for Ensemble: PED00156e001.
Transformed ensemble shape: (100, 116)
Performing feature extraction for Ensemble: PED00157e001.
Transformed ensemble shape: (100, 116)
Performing feature extraction for Ensemble: PED00158e001.
Transformed ensemble shape: (88, 116)
Feature names: ['GLU2-PHI', 'ALA3-PHI', 'ILE4-PHI', 'ALA5-PHI', 'LYS6-PHI', 'HIS7-PHI', 'ASP8-PHI', 'PHE9-PHI', 'SER10-PHI', 'ALA11-PHI', 'THR12-PHI', 'ALA13-PHI', 'ASP14-PHI', 'ASP15-PHI', 'GLU16-PHI', 'LEU17-PHI', 'SER18-PHI', 'PHE19-PHI', 'ARG20-PHI', 'LYS21-PHI', 'THR22-PHI', 'GLN23-PHI', 'ILE24-PHI', 'LEU25-PHI', 'LYS26-PHI', 'ILE27-PHI', 'LEU28-PHI', 'ASN29-PHI', 'MET30-PHI', 'GLU31-PHI', 'ASP32-PHI', 'ASP33-PHI', 'SER34-PHI', 'ASN35-PHI', 'TRP36-PHI', 'TYR37-PHI', 'ARG38-PHI', 'ALA39-PHI', 'GLU40-PHI', 'LEU41-PHI', 'ASP42-PHI', 'GLY43-PHI', 'LYS44-PHI', 'GLU45-PHI', 'GLY46-PHI', 'LEU47-PHI', 'ILE48-PHI', 'PRO49-PHI', 'SER50-PHI', 'ASN51-PHI', 'TYR52-PHI', 'ILE53-PHI', 'GLU54-PHI', 'MET55-PHI', 

{'PED00156e001': array([[ 1.1577249 , -1.217782  , -1.2799942 , ...,  2.5013933 ,
         -0.20402087,  1.3094579 ],
        [ 1.2899319 , -2.3262308 , -1.172357  , ..., -1.4751793 ,
         -3.1351974 ,  1.2594489 ],
        [-1.1285951 , -1.2928191 , -1.3119535 , ...,  2.0671933 ,
          2.170035  ,  2.564002  ],
        ...,
        [-1.8687296 , -1.1213598 , -1.401682  , ...,  1.3203958 ,
          1.3073406 , -0.5479945 ],
        [-1.3948162 , -2.7970872 , -2.1043155 , ...,  2.1060603 ,
         -0.5932995 , -1.2692754 ],
        [-1.5568932 , -1.4384656 , -1.2367293 , ..., -0.58454627,
         -1.4442189 , -0.16403243]], dtype=float32),
 'PED00157e001': array([[-1.1845857 , -1.162165  , -1.3434937 , ...,  2.500024  ,
          0.6335331 ,  2.0394804 ],
        [ 3.0242426 , -1.2952259 , -1.1229025 , ...,  2.4418864 ,
         -2.897522  , -0.6671064 ],
        [-2.8507354 , -2.7416358 , -1.2957426 , ..., -0.58928984,
          1.8301954 ,  2.174993  ],
        ...,
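
The printed feature names follow a residue-name, residue-number, angle pattern such as GLU2-PHI. The sketch below parses such names and checks the column count: a chain of 59 residues has 58 phi and 58 psi angles, which matches the 116 columns reported above. The PSI suffix is an assumption, since the name list above is cut off before any psi entries appear, and parse_feature_name is a hypothetical helper, not part of dpet.

```python
import re

# Pattern for names like 'GLU2-PHI'. The 'PSI' alternative is an assumption:
# only PHI names are visible in the truncated output.
FEATURE_NAME = re.compile(r'^([A-Z]{3})(\d+)-(PHI|PSI)$')


def parse_feature_name(name):
    """Split a feature name into (residue_name, residue_number, angle)."""
    match = FEATURE_NAME.match(name)
    if match is None:
        raise ValueError(f'Unrecognized feature name: {name!r}')
    residue, number, angle = match.groups()
    return residue, int(number), angle


# First three names copied from the output above.
names = ['GLU2-PHI', 'ALA3-PHI', 'ILE4-PHI']
parsed = [parse_feature_name(name) for name in names]
print(parsed)

# phi is undefined for the first residue and psi for the last one,
# so each angle type contributes n_residues - 1 columns.
n_residues = 59
expected_columns = 2 * (n_residues - 1)
print(expected_columns)
```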
       

The features can also be accessed through the features attribute:

In [7]:
analysis.features

{'PED00156e001': array([[ 1.1577249 , -1.217782  , -1.2799942 , ...,  2.5013933 ,
         -0.20402087,  1.3094579 ],
        [ 1.2899319 , -2.3262308 , -1.172357  , ..., -1.4751793 ,
         -3.1351974 ,  1.2594489 ],
        [-1.1285951 , -1.2928191 , -1.3119535 , ...,  2.0671933 ,
          2.170035  ,  2.564002  ],
        ...,
        [-1.8687296 , -1.1213598 , -1.401682  , ...,  1.3203958 ,
          1.3073406 , -0.5479945 ],
        [-1.3948162 , -2.7970872 , -2.1043155 , ...,  2.1060603 ,
         -0.5932995 , -1.2692754 ],
        [-1.5568932 , -1.4384656 , -1.2367293 , ..., -0.58454627,
         -1.4442189 , -0.16403243]], dtype=float32),
 'PED00157e001': array([[-1.1845857 , -1.162165  , -1.3434937 , ...,  2.500024  ,
          0.6335331 ,  2.0394804 ],
        [ 3.0242426 , -1.2952259 , -1.1229025 , ...,  2.4418864 ,
         -2.897522  , -0.6671064 ],
        [-2.8507354 , -2.7416358 , -1.2957426 , ..., -0.58928984,
          1.8301954 ,  2.174993  ],
        ...,
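
The phi_psi values shown all lie between -π and π, which is consistent with angles in radians (the unit mdtraj uses for dihedral angles). Assuming that unit, the sketch below converts a few values copied from the first row to degrees for easier reading; row_to_degrees is a hypothetical helper.

```python
import math


def row_to_degrees(row):
    """Convert an iterable of angles from radians to degrees."""
    return [math.degrees(angle) for angle in row]


# First three values of the first PED00156e001 frame, copied from the output.
# Assumption: these are radians.
first_values = [1.1577249, -1.217782, -1.2799942]
degrees = row_to_degrees(first_values)
print([round(value, 1) for value in degrees])
```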
       

Additionally, get_features performs feature extraction and returns the result without changing any data stored in the EnsembleAnalysis object:

In [8]:
analysis.get_features(featurization='ca_dist')

{'PED00156e001': array([[0.62181395, 0.8748281 , 1.0244368 , ..., 0.5381905 , 0.65995586,
         0.5500405 ],
        [0.6740468 , 0.94238734, 1.2103361 , ..., 0.696765  , 0.7199101 ,
         0.5397694 ],
        [0.54331213, 0.5967415 , 0.8770647 , ..., 0.6101002 , 0.9751156 ,
         0.68294644],
        ...,
        [0.69683045, 0.88360864, 1.1769351 , ..., 0.5458644 , 0.8227767 ,
         0.53999794],
        [0.58550507, 0.8509589 , 1.0959796 , ..., 0.5325418 , 0.539001  ,
         0.63170296],
        [0.5536332 , 0.6240836 , 0.865476  , ..., 0.65317243, 0.5724018 ,
         0.5386198 ]], dtype=float32),
 'PED00157e001': array([[0.5444462 , 0.7421228 , 0.8755765 , ..., 0.6166486 , 0.8442986 ,
         0.689368  ],
        [0.61922514, 0.89435095, 0.93977153, ..., 0.7174825 , 0.85039765,
         0.5540449 ],
        [0.69710505, 1.0575799 , 1.2049499 , ..., 0.57521105, 0.84521836,
         0.6335973 ],
        ...,
        [0.5811551 , 0.5352497 , 0.58276105, ..., 0.6435778 ,
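
The difference between the two calls can be illustrated with a toy class. This is only a sketch of the behaviour described above (one method stores and returns, the other only returns); ToyAnalysis and its 'double'/'triple' featurizations are hypothetical and say nothing about how dpet is implemented.

```python
class ToyAnalysis:
    """Hypothetical stand-in for illustration, not dpet's EnsembleAnalysis."""

    def __init__(self, data):
        self.data = data
        self.features = {}

    def _compute(self, featurization):
        # Placeholder featurizers: scale every value by a constant.
        scale = {'double': 2, 'triple': 3}[featurization]
        return {
            code: [scale * value for value in values]
            for code, values in self.data.items()
        }

    def extract_features(self, featurization):
        # Stores the result on the object and returns it.
        self.features = self._compute(featurization)
        return self.features

    def get_features(self, featurization):
        # Returns the result only; the stored features stay untouched.
        return self._compute(featurization)


toy = ToyAnalysis({'ens_a': [1, 2, 3]})
stored = toy.extract_features('double')
side_result = toy.get_features('triple')

print(toy.features)
print(side_result)
```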

Dimensionality reduction with reduce_features also returns its result: the reduced data of all ensembles, concatenated into a single array.

In [9]:
analysis.reduce_features(method='pca')

Concatenated featurized ensemble shape: (288, 116)
Reduced dimensionality ensemble shape: (100, 10)
Reduced dimensionality ensemble shape: (100, 10)
Reduced dimensionality ensemble shape: (88, 10)


array([[-3.928243  , -1.5126026 ,  3.6904237 , ..., -0.7948641 ,
        -4.1602144 , -0.1433025 ],
       [-2.6395655 , -0.09820706, -0.5210915 , ..., -5.0571675 ,
        -0.3417546 ,  0.9657626 ],
       [ 0.3107639 , -0.95393425, -0.11607962, ...,  5.596979  ,
         1.5136942 , -0.34727108],
       ...,
       [ 3.6289392 , -1.3870721 ,  0.2640261 , ...,  2.0198689 ,
        -2.4033306 ,  2.8421519 ],
       [-1.7796977 ,  1.3084623 , -1.8101686 , ...,  4.249404  ,
        -2.1751888 ,  1.3485581 ],
       [ 2.3454287 , -1.1246995 ,  2.7517998 , ..., -0.49983734,
         0.9402564 , -2.0259523 ]], dtype=float32)
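
The printed shapes are consistent with the ensembles being stacked row-wise (100 + 100 + 88 = 288 frames). Assuming the rows follow the order of ens_codes, which the output does not state explicitly, the row range of each ensemble in the concatenated array can be computed from the frame counts. row_slices is a hypothetical helper, not part of dpet.

```python
from itertools import accumulate


def row_slices(frame_counts):
    """Map each ensemble code to the slice of rows it occupies.

    Assumes ensembles are stacked in the iteration order of frame_counts.
    """
    codes = list(frame_counts)
    stops = list(accumulate(frame_counts[code] for code in codes))
    starts = [0] + stops[:-1]
    return {
        code: slice(start, stop)
        for code, start, stop in zip(codes, starts, stops)
    }


# Frame counts copied from the shapes printed above.
frame_counts = {
    'PED00156e001': 100,
    'PED00157e001': 100,
    'PED00158e001': 88,
}
slices = row_slices(frame_counts)

for code, rows in slices.items():
    print(code, rows.start, rows.stop)
```

Under the same assumption, indexing the concatenated array with slices['PED00158e001'] would select the 88 rows of the last ensemble.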

The concatenated result can also be accessed through the transformed_data attribute:

In [10]:
analysis.transformed_data

array([[-3.928243  , -1.5126026 ,  3.6904237 , ..., -0.7948641 ,
        -4.1602144 , -0.1433025 ],
       [-2.6395655 , -0.09820706, -0.5210915 , ..., -5.0571675 ,
        -0.3417546 ,  0.9657626 ],
       [ 0.3107639 , -0.95393425, -0.11607962, ...,  5.596979  ,
         1.5136942 , -0.34727108],
       ...,
       [ 3.6289392 , -1.3870721 ,  0.2640261 , ...,  2.0198689 ,
        -2.4033306 ,  2.8421519 ],
       [-1.7796977 ,  1.3084623 , -1.8101686 , ...,  4.249404  ,
        -2.1751888 ,  1.3485581 ],
       [ 2.3454287 , -1.1246995 ,  2.7517998 , ..., -0.49983734,
         0.9402564 , -2.0259523 ]], dtype=float32)

The per-ensemble results can be accessed through the reduce_dim_data attribute. This is only available for PCA and KPCA.

In [11]:
analysis.reduce_dim_data

{'PED00156e001': array([[-3.92824292e+00, -1.51260257e+00,  3.69042373e+00,
         -7.47175038e-01,  5.13347292e+00, -2.14487052e+00,
         -4.38553542e-01, -7.94864118e-01, -4.16021442e+00,
         -1.43302500e-01],
        [-2.63956547e+00, -9.82070565e-02, -5.21091521e-01,
         -4.96524572e-01, -4.11229670e-01,  3.57955503e+00,
         -2.12160873e+00, -5.05716753e+00, -3.41754586e-01,
          9.65762615e-01],
        [ 3.10763896e-01, -9.53934252e-01, -1.16079621e-01,
          3.61978102e+00, -4.91667330e-01, -5.43132401e+00,
         -4.11722994e+00,  5.59697914e+00,  1.51369417e+00,
         -3.47271085e-01],
        [-2.58515477e+00,  1.69309127e+00,  7.39073336e-01,
         -1.67521322e+00, -1.26205337e+00,  1.51760185e+00,
          4.33679223e-01, -7.30333388e-01, -5.76000690e-01,
         -5.54052401e+00],
        [-6.28407907e+00, -1.05129778e+00,  2.21357346e-01,
         -8.96127343e-01, -2.11869970e-01, -1.14502311e+00,
         -3.48400950e-01, -6.0485452