# Bioinformatics Project - Computational Drug Discovery - Influenza virus A matrix protein M2  
Michael Bahchevanov  
***

## Principal Component Analysis 🧩  
This notebook performs a **Principal Component Analysis (*PCA*)** to find which features, or combinations of features, are correlated in the training data set.

### 1. Data Scaling and Correlation Matrix Formation 🔎  
The current data consists of 3 main sets:
1. The **training** set - consisting of binders only  
2. The **test/validation** set - consisting of binders only  
3. The **decoy** set - consisting of molecules assumed to be non-binders only

#### 1.1 Importing Libraries and Tooling 🔨   
We will use *sklearn* for the analysis and other machine learning tasks, *pandas* for data wrangling, *numpy* for computation, and *matplotlib* for visualization.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.model_selection import train_test_split

In [8]:
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

np.random.seed(42)

#### 1.2 Loading in the Morgan Matrix, Decoy, and Training/Testing Data

In [5]:
morgan_matrix_train_test = pd.read_csv('./data/influenza_virus_A_matrix_M2_protein_06_morgan_matrix.csv')
morgan_matrix_decoy = pd.read_csv('./data/5HT1A_02_morgan_matrix.csv')
df_train_test = pd.read_csv('./data/influenza_virus_A_matrix_M2_protein_07_training_data.csv')
df_decoy = pd.read_csv('./data/5HT1A_03_decoy_data.csv')

#### 1.3 Feature Selection  
Here we select only the relevant features and binarize the *standard_value* column of the training/testing set at a threshold of 1000: values below the threshold are labelled 0 and all others 1. Every molecule in the decoy set is labelled 0.

In [7]:
smiles_train_test = df_train_test['canonical_smiles']
smiles_train_test = smiles_train_test.reset_index()['canonical_smiles']
affinity_train_test = df_train_test['standard_value']

binding_treshold = 1000
# Binarize: 0 if standard_value is below the threshold, 1 otherwise
affinity_train_test = affinity_train_test.apply(lambda x: 0 if x < binding_treshold else 1)

smiles_decoy = df_decoy['canonical_smiles']
smiles_decoy = smiles_decoy.reset_index()['canonical_smiles']
affinity_decoy = pd.Series([0 for i in range(len(smiles_decoy))])
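The labelling rule in the cell above can be checked on a few made-up values. This is a minimal sketch: `toy_affinity` and `toy_threshold` are hypothetical names and the numbers are not taken from the data set. It uses a vectorized comparison, which should give the same labels as the `apply` call.

```python
import pandas as pd

# Hypothetical affinity values (same units as standard_value); not real data.
toy_affinity = pd.Series([12.0, 999.9, 1000.0, 25000.0])
toy_threshold = 1000  # mirrors the cutoff used in the cell above

# Same rule as above: 0 below the threshold, 1 otherwise.
toy_labels = (toy_affinity >= toy_threshold).astype(int)
print(toy_labels.tolist())  # [0, 0, 1, 1]
```

Note that a value exactly equal to the threshold is labelled 1.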

#### 1.4 Splitting Data  
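We use *sklearn's* `train_test_split` to split the data into training and testing sets. With the default `test_size` of 0.25, the 14 molecules are split into 10 for training and 4 for testing.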

In [9]:
X_train, X_test, y_train, y_test = train_test_split(morgan_matrix_train_test, affinity_train_test)

In [10]:
X_train.shape

(10, 2048)

In [11]:
morgan_matrix_train_test.shape

(14, 2048)
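The 10/4 split shown above can be reproduced on stand-in data. The sketch below assumes 14 rows of random binary fingerprints and alternating labels (`toy_X` and `toy_y` are hypothetical, not the notebook's data). It also passes `random_state` and `stratify` explicitly, which the cell above does not do; these are optional arguments that may be worth considering with so few samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the fingerprint matrix and labels (14 molecules).
toy_X = np.random.default_rng(0).integers(0, 2, size=(14, 2048))
toy_y = np.array([0, 1] * 7)

# random_state pins the shuffle; stratify keeps the class ratio in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)
print(toy_X_train.shape, toy_X_test.shape)  # (10, 2048) (4, 2048)
```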

#### 1.5 Data Scaling  
Before running the **principal component analysis**, we scale every column of the training set to 0 **mean** and **unit standard deviation** using *sklearn's* `StandardScaler`. The scaler is fit on the training set only, and the same column **means** and **stds** are then used to scale the test and **decoy** sets. Finally, we delete from all matrices the columns whose **std** in the training set is 0, since a constant column carries no information.
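As a small illustration of this procedure, the sketch below applies it to a hypothetical 4x3 matrix in which one column is constant. All `toy_*` names and values are assumptions made for the example. It finds constant columns through the fitted scaler's `var_` attribute, which is one possible alternative to checking the std of the scaled training matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical binary fingerprints: column 1 is constant (all zeros) in the training rows.
toy_train = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]], dtype=float)
toy_other = np.array([[1, 1, 0]], dtype=float)

# Fit on the training rows only, then reuse the same means and stds elsewhere.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_other_scaled = toy_scaler.transform(toy_other)

# Columns with zero training variance are dropped from every matrix.
constant_cols = np.flatnonzero(toy_scaler.var_ == 0.0)
toy_train_scaled = np.delete(toy_train_scaled, constant_cols, axis=1)
toy_other_scaled = np.delete(toy_other_scaled, constant_cols, axis=1)
print(constant_cols, toy_train_scaled.shape, toy_other_scaled.shape)
```

The remaining training columns have mean 0 and std 1, while the other matrix is only shifted and rescaled by the training statistics.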

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
morgan_matrix_decoy = scaler.transform(morgan_matrix_decoy)

In [18]:
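# After scaling, a column that was constant in the training set is all zeros,
# so its std is exactly 0; collect the indices of those columns
cols_to_delete = []
for i in range(X_train.shape[1]):
    if X_train[:, i].std() == 0.0:
        cols_to_delete.append(i)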

In [20]:
len(cols_to_delete)

1961

In [21]:
def delete_columns(matrix, selection_cols):
    return np.delete(matrix, selection_cols, axis=1)

In [22]:
X_train = delete_columns(X_train, cols_to_delete)
X_test = delete_columns(X_test, cols_to_delete)
morgan_matrix_decoy = delete_columns(morgan_matrix_decoy, cols_to_delete)
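With the columns standardized and the constant ones removed, a correlation matrix can be formed directly from the scaled data. The sketch below uses hypothetical random data (`toy_X`, not the Morgan matrix) to show the relationship: because `StandardScaler` divides by the population std, `Z.T @ Z / n` should equal the Pearson correlation matrix of the original columns. It is an illustration of the idea, not necessarily the computation used later in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy_X = rng.normal(size=(50, 4))      # hypothetical data, not the Morgan matrix
toy_X[:, 1] += 0.8 * toy_X[:, 0]      # make two columns correlated

toy_Z = StandardScaler().fit_transform(toy_X)

# StandardScaler uses the population std (ddof=0), so Z^T Z / n is the correlation matrix.
toy_corr = toy_Z.T @ toy_Z / toy_Z.shape[0]
assert np.allclose(toy_corr, np.corrcoef(toy_X, rowvar=False))
```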