<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/DIMENSIONALITY_REDUCTION_FOR_MACHINE_LEARNING_PRINCIPAL_COMPONENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DIMENSIONALITY REDUCTION FOR MACHINE LEARNING - PRINCIPAL COMPONENT ANALYSIS


In this notebook, we will demonstrate how to use dimensionality reduction before training and evaluating Random Forest models. We will use the Principal Component Analysis (PCA) algorithm for dimensionality reduction. We will work on the Heart Failure dataset from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset).

# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of the Random Forest model.

In [None]:
import pandas as pd

# Data Preparation

**Clone the dataset Repository**

The prepared dataset after cleaning, removing outliers, and feature engineering can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_EDA.csv file. Read the input data into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_EDA.csv",sep=";")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 53659 records with 15 features for each record. Twelve features are numeric and the rest are objects (strings).

# Clean Data and Remove Outliers

This data has been processed in previous notebooks
- Data Cleaning: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb
- Feature Selection and Feature Engineering: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb

As we noticed from the sample of the dataset shown above, some features are highly correlated, such as the age and the age_year features. To demonstrate the effectiveness of PCA on Random Forest, we will keep both of these features. However, we will drop features that are not needed, such as the 'id' feature.

In [None]:
df.drop(['id'],axis=1, inplace=True)
df.head()

# Encode Categorical Data

We will use one-hot encoding through the get_dummies() method in pandas to encode the data in the 'gender' and 'smoke' features.

In [None]:
df = pd.get_dummies(df)
df.head()

Remember to drop one of the columns that resulted from the one-hot encoding of each feature. Also, check that the original features ('gender' and 'smoke') are no longer present; get_dummies() replaces them with the encoded columns.

In [None]:
df.drop(['gender_female','smoke_No'],axis=1,inplace=True)
df.head()
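
The two steps above (encoding, then dropping one column per encoded feature) can also be done in one call. The following is a minimal sketch on a made-up toy dataframe, not the notebook's dataset; it assumes that dropping the first category in sorted order ('female', 'No') is acceptable, which is what `drop_first=True` does.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are invented.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "smoke": ["No", "Yes", "No"],
    "weight": [62.0, 85.0, 70.0],
})

# drop_first=True removes the first (sorted) category of each encoded
# feature, so no separate drop() call is needed afterwards.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The original 'gender' and 'smoke' columns do not appear in the result, only 'gender_male' and 'smoke_Yes' next to the untouched numeric column.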

# Train And Evaluate Random Forest Classifier

**Train Random Forest Classifier**

We will start by specifying the independent variables and the dependent variable. The independent variables are the features that will be used to predict the target, and the dependent variable is the target feature itself (class, label).

In [None]:
# independent variables
X=df.drop(['cardio'],axis=1)
X.head()

In [None]:
# dependent variable (target feature, class, label)
Y=df.cardio
Y.head()

It is known that Random Forest does not perform well when some features are monotonic transformations of other features, such as 'age' and 'age_years' (this makes the trees of the forest less independent of each other). So we will use PCA to get rid of collinear features (all collinear features will end up in a single PCA component).

Now, we will reduce the dimensionality of the features dataframe by transforming the features using PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=13)
pca.fit(X)
X_PCA13 = pca.transform(X)
X_PCA13=pd.DataFrame(X_PCA13)

Display a snapshot of the PCA dataframe

In [None]:
X_PCA13.head()
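
To see how PCA handles collinear features, here is a small sketch on synthetic data (the values and the `weight` feature are assumptions for illustration only). One feature is an exact linear function of another, as with an age in days and the same age in years, so the three input columns carry only two independent directions of variation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
age_days = rng.uniform(14000, 24000, size=500)
age_years = age_days / 365.0            # exact linear function of age_days
weight = rng.normal(75, 12, size=500)   # unrelated feature

X_toy = np.column_stack([age_days, age_years, weight])

# svd_solver="full" keeps the computation deterministic for this demo.
pca_toy = PCA(n_components=3, svd_solver="full").fit(X_toy)
ratios = pca_toy.explained_variance_ratio_
print(ratios.round(6))
```

The last component should explain essentially no variance, which is why dropping trailing components can remove redundant information at little cost.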

Let us now use cross-validation to train and evaluate the Random Forest classifier. We will use the same parameters obtained through the automated hyperparameter tuning (grid search) in the notebook https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/PERFORMANCE_OF_MACHINE_LEARNING_SYSTEMS_%E2%80%93_K_FOLD_CROSS_VALIDATION.ipynb  

In [None]:
from sklearn.model_selection import cross_validate
from sklearn import ensemble
model_rf = ensemble.RandomForestClassifier(max_features=3,max_samples=2000,n_estimators=200)
cv_value = 10
score_rf_PCA13 = cross_validate(model_rf,X_PCA13,Y,cv = cv_value, return_train_score=True)

from prettytable import PrettyTable
t = PrettyTable(['PCA Components', 'train_score','test_score'])
t.add_row(['All PCA', score_rf_PCA13['train_score'].mean(),score_rf_PCA13['test_score'].mean()])
print(t)



Let us reduce the dimensionality of the input features by dropping the last PCA component (component no. 12)

In [None]:
X_PCA12=X_PCA13.drop([12], axis=1)
X_PCA12.head()

After dropping the last PCA component and reducing the dimensionality to 12, let us train and evaluate the random forest classifier using the reduced input features

In [None]:
score_rf_PCA12 = cross_validate(model_rf,X_PCA12,Y,cv = cv_value, return_train_score=True)

t = PrettyTable(['PCA Components', 'train_score','test_score'])
t.add_row(['All PCA (13)', score_rf_PCA13['train_score'].mean(),score_rf_PCA13['test_score'].mean()])

t.add_row(['First 12', score_rf_PCA12['train_score'].mean(),score_rf_PCA12['test_score'].mean()])
print(t)

As can be observed, dropping the last PCA component (component no. 12) has a negligible effect on the training and testing accuracy of the random forest. Let's drop one more PCA component (no. 11) and check the results.

In [None]:
X_PCA11=X_PCA13.drop([11,12], axis=1)
score_rf_PCA11 = cross_validate(model_rf,X_PCA11,Y,cv = cv_value, return_train_score=True)

t = PrettyTable(['PCA Components','train_score','test_score'])
t.add_row(['All PCA (13)', score_rf_PCA13['train_score'].mean(),score_rf_PCA13['test_score'].mean()])

t.add_row(['First 12', score_rf_PCA12['train_score'].mean(),score_rf_PCA12['test_score'].mean()])

t.add_row(['First 11', score_rf_PCA11['train_score'].mean(),score_rf_PCA11['test_score'].mean()])
print(t)

The results show that dropping the last two PCA components has a negligible effect on the training and testing accuracy of the random forest. Let's drop more PCA components, keeping only the first 10, 9, 7, 5, and 3 components in turn, and check the results.

In [None]:
X_PCA10=X_PCA13.drop([10,11,12], axis=1)
score_rf_PCA10 = cross_validate(model_rf,X_PCA10,Y,cv = cv_value, return_train_score=True)

X_PCA9=X_PCA13.drop([9,10,11,12], axis=1)
score_rf_PCA9 = cross_validate(model_rf,X_PCA9,Y,cv = cv_value, return_train_score=True)

X_PCA7=X_PCA13.drop([7,8,9,10,11,12], axis=1)
score_rf_PCA7 = cross_validate(model_rf,X_PCA7,Y,cv = cv_value, return_train_score=True)

X_PCA5=X_PCA13.drop([5,6,7,8,9,10,11,12], axis=1)
score_rf_PCA5 = cross_validate(model_rf,X_PCA5,Y,cv = cv_value, return_train_score=True)

X_PCA3=X_PCA13.drop([3,4,5,6,7,8,9,10,11,12], axis=1)
score_rf_PCA3 = cross_validate(model_rf,X_PCA3,Y,cv = cv_value, return_train_score=True)

t = PrettyTable(['PCA Components', 'train_score','test_score'])
t.add_row(['All PCA (13)',score_rf_PCA13['train_score'].mean(),score_rf_PCA13['test_score'].mean()])

t.add_row(['First 12', score_rf_PCA12['train_score'].mean(),score_rf_PCA12['test_score'].mean()])

t.add_row(['First 11', score_rf_PCA11['train_score'].mean(),score_rf_PCA11['test_score'].mean()])

t.add_row(['First 10', score_rf_PCA10['train_score'].mean(),score_rf_PCA10['test_score'].mean()])

t.add_row(['First 9', score_rf_PCA9['train_score'].mean(),score_rf_PCA9['test_score'].mean()])

t.add_row(['First 7', score_rf_PCA7['train_score'].mean(),score_rf_PCA7['test_score'].mean()])

t.add_row(['First 5', score_rf_PCA5['train_score'].mean(),score_rf_PCA5['test_score'].mean()])

t.add_row(['First 3', score_rf_PCA3['train_score'].mean(),score_rf_PCA3['test_score'].mean()])

print(t)
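
The repeated drop-and-evaluate steps above can be expressed as a loop. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `X` and `Y`, smaller forest settings so it runs quickly, and a pandas dataframe instead of PrettyTable for the summary. Selecting the first `k` columns with `iloc` is equivalent to dropping the trailing ones.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the notebook's X and Y (13 features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_demo_pca = pd.DataFrame(PCA(n_components=13).fit_transform(X_demo))

# Reduced settings for speed; the notebook uses its own tuned parameters.
model_demo = RandomForestClassifier(
    max_features=3, max_samples=100, n_estimators=20, random_state=0
)

rows = []
for k in [13, 10, 7, 5, 3]:
    # Keep only the first k PCA components.
    scores = cross_validate(
        model_demo, X_demo_pca.iloc[:, :k], y_demo, cv=5, return_train_score=True
    )
    rows.append({
        "components": k,
        "train_score": scores["train_score"].mean(),
        "test_score": scores["test_score"].mean(),
    })

summary = pd.DataFrame(rows)
print(summary)
```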

As we can observe, reducing the dimensionality of the model input has a negligible effect on accuracy. The number of components that we need to keep for model fitting depends on the dataset and the model used. In our case, we will keep the first 7 components.

After deciding which components to keep, we split the dataset and fit the model using the PCA components of the training split.

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the records for testing
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=200)

# Fit PCA on the training split only, then transform the training features
pca.fit(x_train)
x_train_PCA = pca.transform(x_train)
x_train_PCA = pd.DataFrame(x_train_PCA)
# Keep the first 7 components (columns 0 to 6)
x_train_PCA.drop([7,8,9,10,11,12],axis=1, inplace=True)
model_rf = ensemble.RandomForestClassifier(max_features=3,max_samples=2000,n_estimators=200)
model_rf.fit(x_train_PCA,y_train)
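
As a side note, fitting all 13 components and then dropping the last 6 columns gives the same features as asking PCA for 7 components directly. The sketch below checks this on random data (an assumption for illustration); `svd_solver="full"` is set explicitly in both cases so that the same exact decomposition is used, since other solvers may differ slightly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 13 random features with different variances.
X_demo = rng.normal(size=(200, 13)) * np.arange(1, 14)

# Approach used in the notebook: fit all components, keep the first 7 columns.
full = PCA(n_components=13, svd_solver="full").fit(X_demo)
first7_by_drop = full.transform(X_demo)[:, :7]

# Alternative: request 7 components directly.
direct = PCA(n_components=7, svd_solver="full").fit(X_demo)
first7_direct = direct.transform(X_demo)

same = np.allclose(first7_by_drop, first7_direct)
print(same)
```

With the direct form, the transform step already returns 7 columns, so no drop() call is needed at prediction time.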

# Saving and Loading Models

We will use the joblib library, as described in the scikit-learn model persistence guide (https://scikit-learn.org/stable/modules/model_persistence.html), to save and load the models. To save the random forest and PCA models, we use the dump() method:

In [None]:
import joblib as jb
jb.dump(model_rf, './Model_rf.joblib')

jb.dump(pca, './Model_pca.joblib')

To load the trained random forest and PCA models, we will use the load() method

In [None]:
model_rf_joblib = jb.load('./Model_rf.joblib')
model_pca_joblib = jb.load('./Model_pca.joblib')
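
A possible variation, not used in this notebook, is to combine the PCA step and the classifier in a scikit-learn `Pipeline` so that a single file holds both. The sketch below assumes synthetic data, a temporary directory, and the hypothetical file name `Model_pipeline.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0
)

# PCA keeps 7 components, then the forest is trained on them.
pipe = Pipeline([
    ("pca", PCA(n_components=7)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Save and reload the whole pipeline as one object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Model_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    pipe_loaded = joblib.load(path)

same_predictions = np.array_equal(pipe.predict(X_demo), pipe_loaded.predict(X_demo))
print(same_predictions)
```

The loaded pipeline applies the PCA transform and the classifier in one predict() call, which removes the risk of forgetting the transform or dropping the wrong columns on new data.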

# Predict New Values Using Models

To predict the target values for new data, we will use the loaded model

In [None]:
x_test.head()

Apply PCA to the test features using the loaded PCA model

In [None]:
x_test_PCA = model_pca_joblib.transform(x_test)
x_test_PCA = pd.DataFrame(x_test_PCA)

Then we drop the last 6 PCA components, as we did when we trained the model.

In [None]:
x_test_PCA.drop([7,8,9,10,11,12],axis=1,inplace=True)

Now, to predict the label, we apply the remaining PCA components to the Random Forest classifier

In [None]:
y_predict = model_rf_joblib.predict(x_test_PCA)
dfnew=x_test.copy()
dfnew['cardio_predict']=y_predict

For the test split, we have the actual value of the 'cardio', so we can add it to the new dataframe for comparison purposes.

In [None]:
dfnew['cardio_actual']=y_test
dfnew.head()

Based on the accuracy measured above, cardio_predict and cardio_actual should match in ~70% (the testing accuracy) of the records.

In [None]:
dfnew[dfnew['cardio_predict'] != dfnew['cardio_actual']]
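
Rather than inspecting the mismatched rows by eye, the match rate can be computed directly. This is a minimal sketch on a made-up five-row dataframe with the same column names as `dfnew`; the values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy predictions and labels (3 of 5 rows agree).
demo = pd.DataFrame({
    "cardio_predict": [1, 0, 1, 1, 0],
    "cardio_actual":  [1, 0, 0, 1, 1],
})

# Fraction of rows where prediction and actual label agree.
match_rate = (demo["cardio_predict"] == demo["cardio_actual"]).mean()

# The same quantity via scikit-learn.
acc = accuracy_score(demo["cardio_actual"], demo["cardio_predict"])
print(match_rate, acc)
```

Applied to `dfnew`, the same two lines would give the accuracy on the held-out test split.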