## Using PCA and DecisionTreeClassifier from SAS® Viya® with GridSearchCV on the Mushroom Data Set
### Source
This example is adapted from [ML_-Decision-Trees-GridSearchCV-PCA-](https://github.com/GalaRusina/ML_-Decision-Trees-GridSearchCV-PCA-/tree/main) by Galyna Rusina.

### Data Preparation
#### About the data set
This data set contains 8124 hypothetical samples from 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Although each species was originally identified as "edible", "poisonous", or "unknown", the last two are combined into one classification of "poisonous". The goal is to identify whether a sample is poisonous based on its 22 attributes.

In [None]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

from sasviya.ml.decomposition import PCA
from sasviya.ml.tree import DecisionTreeClassifier

import warnings
from sklearn.exceptions import FitFailedWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FitFailedWarning)

#### Importing the data

In [None]:
workspace=f'{os.path.abspath("")}/../data/'
df=pd.read_csv(workspace+'mushroom.csv')
df.head()

### Examining data characteristics
#### Basic information
Let's examine various characteristics of the data:
* Shape
* Names of variables
* Information about variables
* Basic statistics about the variable values

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df['target'].value_counts(normalize=True)

It appears that all variables are categorical and none have missing values.  In addition, "veil-type" has a constant value for all observations.  As a result, it can be dropped from the analysis.

In [None]:
df=df.drop('veil-type',axis=1)
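
A constant column such as "veil-type" can also be found programmatically rather than by reading the summary tables. The following is a minimal sketch on a toy array; the values and the three-column layout are made up for illustration and are not taken from `mushroom.csv`.

```python
import numpy as np

# Toy categorical data (hypothetical values): rows are samples, columns are features.
X = np.array([
    ["x", "p", "n"],
    ["b", "p", "y"],
    ["x", "p", "w"],
])
feature_names = ["cap-shape", "veil-type", "cap-color"]

# Count the distinct values in each column; a count of 1 means the column is constant.
n_unique = np.array([len(np.unique(X[:, j])) for j in range(X.shape[1])])
constant_features = [name for name, n in zip(feature_names, n_unique) if n == 1]

print(constant_features)
```

With a pandas DataFrame, `df.nunique()` returns the same per-column counts, so columns whose count is 1 are candidates for dropping.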

#### Examining correlations
Rather than a pure numeric view of the correlation matrix, we will apply a heatmap-like gradient to the matrix for better visibility.

In [None]:
corr_matrix = df.corr()
corr_matrix_styled = corr_matrix.style.background_gradient(cmap="coolwarm").format("{:.2f}")
corr_matrix_styled

"cap-shape" appears to be the least correlated with the other variables. A weakly correlated variable can carry information that the others do not, so it may be useful for classifying the observations. We will take a look at the relationship between "target" and the values of "cap-shape".

In [None]:
df_corr = df[['target','cap-shape']]
df_corr.head()

In [None]:
df_last_corr = df_corr.groupby('cap-shape').mean().sort_values(by='target', ascending = False)
df_last_corr

#### Standardizing the data
Standardizing data is commonly recommended for machine learning techniques, and it matters in particular for PCA, which is sensitive to the scale of each variable. scikit-learn provides the `StandardScaler` class to accomplish this.

In [None]:
scaler=StandardScaler()
df_X = df.drop(['target'], axis=1) # remove the dependent variable
scaler.fit(df_X) # compute the per-feature mean and standard deviation
X_scaled=scaler.transform(df_X) # standardize each feature to zero mean and unit variance
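
For reference, the transformation that `StandardScaler` applies can be written in a few lines of NumPy. This is a sketch on a hypothetical 4x2 matrix, not a replacement for the scaler; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrix (hypothetical values): 4 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

mean = X.mean(axis=0)       # per-feature mean
std = X.std(axis=0)         # per-feature population standard deviation (ddof=0)
X_std = (X - mean) / std    # z-scores: subtract the mean, divide by the std

print(X_std.mean(axis=0))   # approximately 0 for every feature
print(X_std.std(axis=0))    # approximately 1 for every feature
```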

In [None]:
df_X_scaled = pd.DataFrame(X_scaled)
y = df[['target']]
df_scaled = pd.concat([df_X_scaled, y], axis=1)
df_scaled.head()

#### Creating training and test data
We split the scaled data by putting 80% into the training set and 20% into the test set. 

In [None]:
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(df_X_scaled, y, random_state=42, stratify=df_scaled['target'], test_size=0.2)
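
The `stratify` argument keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates the idea with a hypothetical helper, `stratified_split`, on made-up labels; it is not how `train_test_split` is implemented internally.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly its share in both sets."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the row indices of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Hypothetical labels: 60 samples of class 0 and 40 samples of class 1.
y = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split(y)

# Share of class 1 in each subset; both match the overall share of 0.4.
print(y[train_idx].mean(), y[test_idx].mean())
```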

### Classifying the scaled data using DecisionTreeClassifier
Let's run `DecisionTreeClassifier` against the scaled training data. After fitting the model, we will display scores for accuracy, ROC AUC, and F1.

In [None]:
dec_tree = DecisionTreeClassifier(
                     criterion='gini',
                     max_depth=7)

In [None]:
dec_tree.fit(X_train_scaled,y_train)
y_predict = dec_tree.predict(X_test_scaled)

In [None]:
print('Accuracy: ', accuracy_score(y_test, y_predict)) # metrics for the model trained on all features
print('ROC AUC: ', roc_auc_score(y_test, y_predict))
print('F1: ', f1_score(y_test, y_predict))
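
Note that `roc_auc_score` is called here with hard 0/1 predictions rather than class probabilities. In that case the ROC curve has a single operating point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below works through this on made-up labels.

```python
import numpy as np

# Hypothetical true labels and hard (0/1) predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives

accuracy = (tp + tn) / len(y_true)
f1 = 2 * tp / (2 * tp + fp + fn)
tpr = tp / (tp + fn)            # true positive rate (recall)
tnr = tn / (tn + fp)            # true negative rate (specificity)
auc_hard = (tpr + tnr) / 2      # ROC AUC when predictions are hard labels

print(f"accuracy={accuracy:.3f} f1={f1:.3f} auc={auc_hard:.3f}")
```

If the estimator exposes class probabilities, passing those to `roc_auc_score` instead gives a threshold-independent value.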

### Applying Principal Component Analysis

Often applied to data sets with large numbers of variables, Principal Component Analysis (PCA) replaces the original variables with a smaller set of linear combinations (principal components) that retain most of the variance in the data. A key feature is that the principal components are uncorrelated with each other. In this case, we reduce the scaled features generated earlier (the original 22 attributes minus the dropped "veil-type") to 15 components.

For details about using the `PCA` class of the `sasviya` package, see the [PCA documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=n1hbrdco0inum2n1ddq5wv4ghifq.htm).
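
To make the idea concrete, here is a textbook PCA computed with NumPy's singular value decomposition on synthetic data. It is a sketch of the general technique under the assumption of centered input; it does not claim to reproduce how the `sasviya` `PCA` class is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 correlated features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                          # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
components = Vt[:n_components]                   # principal axes, one per row
scores = Xc @ components.T                       # data projected onto the axes

explained_variance = S**2 / (len(X) - 1)         # variance along each axis
explained_variance_ratio = explained_variance / explained_variance.sum()

# The projected columns are uncorrelated: off-diagonal covariance is ~0.
cov_scores = np.cov(scores, rowvar=False)
print(explained_variance_ratio.round(3))
print(cov_scores.round(6))
```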

In [None]:
pca=PCA(n_components=15)
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)

In [None]:
print("Shape of X_pca", X_pca.shape)
expl = pca.explained_variance_ratio_
print('Explained variance by component\n', [round(x, 3) for x in expl])
print(f'Sum of explained variance: {sum(expl[0:15]):.2f}')

In [None]:
# plot the cumulative explained variance in the new dimensions
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.grid()
plt.show()
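
The cumulative curve can also be used to choose the number of components numerically. The sketch below picks the smallest count that reaches a target share of variance; the ratios and the 0.85 threshold are assumptions for illustration, not values from this data set.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = np.cumsum(ratios)

threshold = 0.85  # assumed target share of variance to retain
# Index of the first cumulative value at or above the threshold, as a component count.
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(n_components)
```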

In [None]:
df_pca = pd.DataFrame(X_pca)
y_pca = df[['target']]
df_pca = pd.concat([df_pca, y_pca], axis=1)
df_pca.head()

#### Creating training and test data based on PCA data

In [None]:
X_pca = df_pca.drop('target',axis=1)
y_pca = df_pca['target']

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y_pca, random_state=42, stratify=df['target'], test_size=0.2)

### Creating a DecisionTreeClassifier Model From PCA Data
After fitting the model, we will display the scores for accuracy, ROC AUC, and F1.

In [None]:
dec_tree_pca = DecisionTreeClassifier(
                     criterion='gini',
                     max_depth=7)

In [None]:
dec_tree_pca.fit(X_train_pca,y_train_pca)

In [None]:
y_predict_pca = dec_tree_pca.predict(X_test_pca)

### Examining the Results


In [None]:
print(f'Accuracy: {accuracy_score(y_test_pca, y_predict_pca):.2f}') # metrics for the model trained on the PCA components
print(f'ROC AUC: {roc_auc_score(y_test_pca, y_predict_pca):.2f}')
print(f'F1: {f1_score(y_test_pca, y_predict_pca):.2f}')

In [None]:
print("Classification Report\n", classification_report(y_test_pca, y_predict_pca, target_names=['0','1']))

print("Confusion Matrix\n", confusion_matrix(y_test_pca, y_predict_pca))

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(
        dec_tree_pca,
        X_test_pca,
        y_test_pca,
        display_labels=np.sort(y_pca.unique()),  # sorted so the labels match the matrix's class order
        cmap=plt.cm.Blues,
    )
disp.ax_.set_title('Confusion matrix')
plt.show()
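
When reading the matrix, recall that `confusion_matrix` puts the true classes in rows and the predicted classes in columns. The following sketch builds the same layout by hand on made-up labels and derives precision and recall for class 1 from it.

```python
import numpy as np

# Hypothetical true labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0])

cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)   # rows: true class, columns: predicted class

tn, fp, fn, tp = cm.ravel()          # row-major order for the binary case
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that were found

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
```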

### Optimizing the model via GridSearchCV


In [None]:
params = {'criterion' : ['gini', 'entropy'],
          'max_depth' : range(10, 15),
          'min_samples_leaf' : range(1,5)
          }
cv = 3       # number of cross-validation folds
verbose = 1  # verbosity level of the search log

grid = GridSearchCV(
        estimator  = DecisionTreeClassifier(),
        param_grid = params,
        scoring    = 'accuracy',
        n_jobs     = -1,       # use all available CPU cores
        cv         = cv,
        verbose    = verbose,
        return_train_score = True,
       )

grid.fit(X = X_train_pca, y = y_train_pca)

# obtain the best parameters

In [None]:
print('Best parameters:', grid.best_params_)

print(f'Best accuracy score: {grid.best_score_:.2f}')
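
It helps to know how much work the grid defined above implies. The sketch below enumerates the same parameter grid with `itertools.product` and counts the candidate models and cross-validation fits; it mirrors the search space only and does not train anything.

```python
from itertools import product

params = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(10, 15),
    "min_samples_leaf": range(1, 5),
}
cv = 3

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
n_fits = len(candidates) * cv   # one fit per candidate per fold

print(len(candidates), n_fits)  # 40 120
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.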

In [None]:
# set these to the values printed from grid.best_params_
dec_tree_pca = DecisionTreeClassifier(
                 criterion='gini',
                 max_depth=14,
                 min_samples_leaf=1
)

In [None]:
dec_tree_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_predict_pca = dec_tree_pca.predict(X_test_pca)

In [None]:
print(f'Accuracy: {accuracy_score(y_test_pca, y_predict_pca):.2f}') # metrics for the optimized model
print(f'ROC AUC: {roc_auc_score(y_test_pca, y_predict_pca):.2f}')
print(f'F1: {f1_score(y_test_pca, y_predict_pca):.2f}')

In [None]:
# classification report of X_test
print('Classification Report\n', classification_report(y_test_pca, y_predict_pca, target_names=['0','1']))

print('Confusion Matrix\n', confusion_matrix(y_test_pca, y_predict_pca))

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(
        dec_tree_pca,
        X_test_pca,
        y_test_pca,
        display_labels=np.sort(y_pca.unique()),  # sorted so the labels match the matrix's class order
        cmap=plt.cm.Blues,
    )
disp.ax_.set_title('Confusion matrix')

plt.show()

Using PCA, we reduced the original 22 attributes to 15 principal components without a loss in model performance.