# Data modelling
**Part 3**

This is a classification problem where the target variable, edibility, is either 'e' (edible) or 'p' (poisonous). At this stage of the project, after feature selection and cleaning, each row holds roughly **100 one-hot encoded features** (the processed file loaded below has 102 columns, including an index column and the two one-hot target columns), all of which are used to classify the edibility of the specimen.

## Methodology
Considering the high number of predictor variables (about 100), the algorithms used for classifying edibility are decision tree, random forest, and support vector machine. These algorithms generally cope well with high-dimensional data, in both accuracy and computational cost, and are chosen here over other conventional classifiers such as stochastic gradient descent, k-nearest neighbours, and naive Bayes.

Logistic regression might be a better alternative for deployment.

Using more than one model gives me a basis for comparison, which reduces the risk of oversight in this experiment when determining what causes differences in model performance. I will tune these models with an exhaustive grid search over the hyperparameters, and compare each against a baseline model with the default hyperparameter values.

## Real-life considerations
In real-life applications, a binary classifier predicting the edibility of something isn't the safest or most useful model to use. If classification were performed for the production of consumables or medicine, a poisonous product wrongly classified as edible carries a severe risk for whoever consumes it. A logistic regression that returns the probability of toxicity is more useful.

However, binary classification still has some value when evaluated with the metrics of a confusion matrix.
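To make the point about probabilities concrete, here is a minimal sketch on synthetic data (not the mushroom dataset). It assumes class 1 stands in for "poisonous" and that 0.1 is a reasonable cautious threshold; both are illustrative choices, not values from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mushroom data: class 1 plays the role of "poisonous".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_poison = clf.predict_proba(X_test)[:, 1]  # column 1 = probability of class 1


def poisonous_missed(threshold):
    """Count poisonous samples labelled edible at a given decision threshold."""
    y_pred = (p_poison >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return fn


missed_default = poisonous_missed(0.5)   # the usual 0.5 cut-off
missed_cautious = poisonous_missed(0.1)  # flag as poisonous on weaker evidence
print(missed_default, missed_cautious)
```

Lowering the threshold can only keep or reduce the number of poisonous samples that slip through as edible, at the cost of rejecting more edible ones.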

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [2]:
data = pd.read_csv("agaricus-lepiota_processed.data")
print(data.shape)
data.head()

(7821, 102)


Unnamed: 0,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_b,cap-color_c,cap-color_e,cap-color_g,cap-color_n,cap-color_p,...,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w,edibility_e,edibility_p
0,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,1
1,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,0,0,1,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,1,0


## Modelling process
1. Wrangle the one hot encoded edibility columns into a single edibility column
2. Split the data into train and test with a 80-20 split
3. Set up and train the model objects
    * Decision tree + exhaustive grid search
    * Random forest + exhaustive grid search
    * Support vector machine + exhaustive grid search
4. Evaluate the models, compare: 
    * baseline model and grid search model
    * grid search model: training and testing data

In [3]:
def edibility(row):
    # Return the class letter ('e' or 'p') taken from the name of the one-hot column set to 1
    for c in row.index:
        if row[c] == 1:
            return c[-1]
        
data_copy = data.copy()

data_copy['edibility'] = data_copy.iloc[:,-2:].apply(edibility, axis=1)
data_copy = data_copy.drop(['edibility_e','edibility_p'], axis=1)
data_copy.head()

Unnamed: 0,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_b,cap-color_c,cap-color_e,cap-color_g,cap-color_n,cap-color_p,...,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w,edibility
0,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,p
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,e
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,e
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,p
4,0,0,1,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,e
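A row-wise apply works, but the same wrangling step can be done without a Python loop. The sketch below uses a small hypothetical frame (the `toy` name and its values are made up for illustration) with the same one-hot target layout.

```python
import pandas as pd

# Tiny hypothetical frame with the same one-hot target layout as the notebook.
toy = pd.DataFrame({
    "cap-surface_s": [1, 1, 0],
    "edibility_e": [0, 1, 0],
    "edibility_p": [1, 0, 1],
})

target_cols = ["edibility_e", "edibility_p"]
# idxmax returns the name of the column holding the 1 in each row;
# the last character of that name is the class label ('e' or 'p').
toy["edibility"] = toy[target_cols].idxmax(axis=1).str[-1]
toy = toy.drop(columns=target_cols)
print(toy["edibility"].tolist())
```

This assumes exactly one of the two target columns is 1 in every row, which is what one-hot encoding of a binary label produces.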


### Metric selection and configuring settings

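Because the target attribute is a binary class label, accuracy is used as the scoring metric here. It is not the only option: precision, recall, F1, and ROC AUC also apply to binary targets, and recall on the poisonous class matters most for safety. The CPU count is checked so that the number of parallel jobs for the grid search object can be set properly.

The cells below check that sklearn.metrics.accuracy_score is suitable for the string class labels.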

In [4]:
import os
print(os.cpu_count())

8


In [5]:
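from sklearn import metrics
# SCORERS has been removed from recent scikit-learn releases; there, use metrics.get_scorer_names()
sorted(metrics.SCORERS.keys())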

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'v_measure_score']

In [6]:
# simple test of accuracy_score on target variable
a1 = np.array(['p','p','p','p','p'])
a2 = np.array(['p','p','p','e','e'])
print(accuracy_score(a1, a2))

0.6
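Accuracy alone can hide the error that matters most here, a poisonous sample labelled edible. The sketch below uses made-up labels (not project data) to show class-specific metrics with 'p' treated as the positive class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 8 edible and 2 poisonous samples, one poisonous sample missed.
y_true = np.array(["e"] * 8 + ["p"] * 2)
y_pred = np.array(["e"] * 9 + ["p"] * 1)

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])
recall_p = recall_score(y_true, y_pred, pos_label="p")
precision_p = precision_score(y_true, y_pred, pos_label="p")

print(acc)          # 0.9
print(cm.tolist())  # [[8, 0], [1, 1]]
print(recall_p)     # 0.5
print(precision_p)  # 1.0
```

An accuracy of 0.9 looks healthy, yet half of the poisonous samples were classified as edible, which the recall on 'p' makes visible.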


In [7]:
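# Note: X still includes the 'Unnamed: 0' column, which looks like a saved row index rather than a mushroom feature
X = data_copy.iloc[:,:-1]
y = data_copy.iloc[:,-1:]
# stratify keeps the edible/poisonous proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y)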

In [8]:
print('training samples: '+str(len(X_train)))
print('testing samples: '+str(len(X_test)))

training samples: 6256
testing samples: 1565
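The split above does not fix a random seed, so the exact rows in each set change between runs. A minimal sketch of a reproducible stratified split follows, using hypothetical labels and an arbitrary seed of 42 (neither comes from this project).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 'e' and 30 'p'.
y = np.array(["e"] * 70 + ["p"] * 30)
X = np.arange(len(y)).reshape(-1, 1)

# The same random_state gives the same split every time.
split_a = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

y_test_a = split_a[3]
share_p = np.mean(y_test_a == "p")  # stratify keeps the 30% share of 'p'
same_split = np.array_equal(split_a[1], split_b[1])
print(share_p, same_split)
```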


## Decision tree

In [27]:
dt= DecisionTreeClassifier()

param_grid = {'criterion':['gini','entropy'], 
              'max_depth':list(range(30,90,10)), 
              'min_samples_split':[1, 0.0025, 0.005, 0.01, 0.02], # note: an integer min_samples_split must be at least 2, so the 1 here is not a valid setting
              'min_samples_leaf':[1, 0.0025, 0.005, 0.01, 0.02, 0.04], 
              'max_features':list(range(1,11,1))} #there are 100 features, square root of 100 is 10

gridsearch_dt_class = GridSearchCV(estimator = dt,
                     param_grid = param_grid,
                     scoring = 'accuracy',
                     n_jobs = 6,
                     cv = None, # None uses the default 5-fold cross-validation; it does not turn cross-validation off
                     refit= True, 
                     return_train_score = False)

In [28]:
gridsearch_dt_class.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(), n_jobs=6,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [30, 40, 50, 60, 70, 80],
                         'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'min_samples_leaf': [1, 0.0025, 0.005, 0.01, 0.02,
                                              0.04],
                         'min_samples_split': [1, 0.0025, 0.005, 0.01, 0.02]},
             scoring='accuracy')

In [29]:
print(gridsearch_dt_class.best_params_)

{'criterion': 'entropy', 'max_depth': 60, 'max_features': 8, 'min_samples_leaf': 1, 'min_samples_split': 0.0025}
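In GridSearchCV, `cv=None` selects the default 5-fold cross-validation and does not switch it off. The sketch below checks this on synthetic data with a deliberately tiny grid; the data and grid are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="accuracy",
    cv=None,  # None selects the default splitter (5-fold), it does not disable CV
)
search.fit(X, y)

print(search.n_splits_)                        # number of folds actually used
print(search.cv_results_["mean_test_score"])   # one mean CV score per candidate
print(search.best_score_)                      # mean CV score of the best candidate
```

Because best_params_ is chosen from these cross-validated scores, best_score_ is a useful number to report next to the test-set accuracy.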


In [30]:
# Comparing the accuracy of a baseline model and the gridsearch model
dt_control = DecisionTreeClassifier()
y_ = dt_control.fit(X_train, y_train).predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'baseline tree accuracy: {acc}')

y_2 = gridsearch_dt_class.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_2)
print(f'gridsearch tree accuracy: {acc}')

baseline tree accuracy: 1.0
gridsearch tree accuracy: 1.0


In [31]:
# gridsearch training set accuracy vs testing set accuracy
y_ = gridsearch_dt_class.predict(X_train)
acc = accuracy_score(y_train, y_)
print(f'training set gridsearch tree accuracy: {acc}')

y_ = gridsearch_dt_class.predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'testing set gridsearch tree accuracy: {acc}')

training set gridsearch tree accuracy: 0.9988810741687979
testing set gridsearch tree accuracy: 1.0


## Random forest

In [14]:
rf = RandomForestClassifier()

param_grid = {'criterion':['gini','entropy'], 
              'max_depth':list(range(30,90,10)), 
              'min_samples_split':[1, 0.0025, 0.005, 0.01, 0.02], # note: an integer min_samples_split must be at least 2, so the 1 here is not a valid setting
              'min_samples_leaf':[1, 0.0025, 0.005, 0.01, 0.02, 0.04], 
              'max_features':list(range(1,11,1))} #there are 100 features, square root of 100 is 10

gridsearch_rf_class = GridSearchCV(estimator = rf,
                     param_grid = param_grid,
                     scoring = 'accuracy',
                     n_jobs = 6,
                     cv = None, # None uses the default 5-fold cross-validation; it does not turn cross-validation off
                     refit= True, 
                     return_train_score = False)

In [15]:
gridsearch_rf_class.fit(X_train, y_train)


GridSearchCV(estimator=RandomForestClassifier(), n_jobs=6,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [30, 40, 50, 60, 70, 80],
                         'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'min_samples_leaf': [1, 0.0025, 0.005, 0.01, 0.02,
                                              0.04],
                         'min_samples_split': [1, 0.0025, 0.005, 0.01, 0.02]},
             scoring='accuracy')

In [16]:
print(gridsearch_rf_class.best_params_)

{'criterion': 'gini', 'max_depth': 30, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 0.0025}


In [17]:
# Comparing the accuracy of a baseline model and the gridsearch model
rf_control = RandomForestClassifier()
y_ = rf_control.fit(X_train.to_numpy().squeeze(), y_train.to_numpy().squeeze()).predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'baseline random forest accuracy: {acc}')

y_2 = gridsearch_rf_class.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_2)
print(f'gridsearch random forest accuracy: {acc}')

baseline random forest accuracy: 1.0
gridsearch random forest accuracy: 1.0


In [18]:
# gridsearch training set accuracy vs testing set accuracy
y_ = gridsearch_rf_class.predict(X_train)
acc = accuracy_score(y_train, y_)
print(f'training set gridsearch random forest accuracy: {acc}')

y_ = gridsearch_rf_class.predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'testing set gridsearch random forest accuracy: {acc}')

training set gridsearch random forest accuracy: 1.0
testing set gridsearch random forest accuracy: 1.0


## Support vector machine

In [19]:
svm = SVC()
              
param_grid = {'kernel':['linear','rbf','polynomial','sigmoid'], # note: SVC names its polynomial kernel 'poly'; 'polynomial' is not a valid kernel name
              'C':[.001, .1, 1, 10, 100]}

gridsearch_svm_class = GridSearchCV(estimator = svm,
                     param_grid = param_grid,
                     scoring = 'accuracy',
                     n_jobs = 6,
                     cv = None, # None uses the default 5-fold cross-validation; it does not turn cross-validation off
                     refit= True, 
                     return_train_score = False)

In [20]:
gridsearch_svm_class.fit(X_train.to_numpy().squeeze(), y_train.to_numpy().squeeze())

GridSearchCV(estimator=SVC(), n_jobs=6,
             param_grid={'C': [0.001, 0.1, 1, 10, 100],
                         'kernel': ['linear', 'rbf', 'polynomial', 'sigmoid']},
             scoring='accuracy')

In [21]:
print(gridsearch_svm_class.best_params_)

{'C': 1, 'kernel': 'linear'}
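By default a grid search scores a candidate whose fit fails as NaN and carries on, so an invalid value in the grid can go unnoticed. The sketch below, on synthetic data with a reduced grid (both are assumptions for illustration), uses `error_score="raise"` to surface such failures.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def run_search(kernels):
    """Run a small grid search that raises on any failed fit."""
    search = GridSearchCV(
        estimator=SVC(),
        param_grid={"kernel": kernels, "C": [0.1, 1]},
        scoring="accuracy",
        error_score="raise",  # fail loudly instead of scoring the candidate as NaN
    )
    return search.fit(X, y)


try:
    run_search(["linear", "polynomial"])  # 'polynomial' is not a valid SVC kernel
    invalid_kernel_failed = False
except Exception as exc:  # the exact exception type depends on the scikit-learn version
    invalid_kernel_failed = True
    print(type(exc).__name__)

# With the documented kernel names every candidate can be fitted.
ok = run_search(["linear", "poly", "rbf", "sigmoid"])
print(ok.best_params_)
```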


In [22]:
# Comparing the accuracy of a baseline model and the gridsearch model

svm_control = SVC()
y_ = svm_control.fit(X_train.to_numpy().squeeze(), y_train.to_numpy().squeeze()).predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'baseline SVM accuracy: {acc}')

y_2 = gridsearch_svm_class.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_2)
print(f'gridsearch SVM accuracy: {acc}')

baseline SVM accuracy: 1.0
gridsearch SVM accuracy: 1.0


In [23]:
gridsearch_svm_class.best_estimator_.predict(X_train)

array(['p', 'p', 'p', ..., 'e', 'p', 'p'], dtype=object)

In [24]:
# gridsearch training set accuracy vs testing set accuracy
y_ = gridsearch_svm_class.best_estimator_.predict(X_train)
acc = accuracy_score(y_train, y_)
print(f'training set gridsearch SVM accuracy: {acc}')

y_ = gridsearch_svm_class.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_)
print(f'testing set gridsearch SVM accuracy: {acc}')

training set gridsearch SVM accuracy: 1.0
testing set gridsearch SVM accuracy: 1.0


# Conclusions
Training samples: 6256

Testing samples: 1565

| Accuracy (fraction) | Baseline (test) | Gridsearch (test) | Gridsearch train | Gridsearch test |
|---:|:-------------|:-----------|:------|:------|
| Decision Tree | 1.0  | 1.0  | 0.9989 | 1.0     |
| Random Forest | 1.0  | 1.0    | 1.0   | 1.0     |
| Support Vector | 1.0  | 1.0    | 1.0   | 1.0     |


All classifiers reach accuracy values above 0.998 on both the training and testing data.

## Decision tree
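The baseline and grid search decision trees both reach an accuracy of 1.0 on the test data. The grid search tree scores slightly lower on the training data (about 0.9989), so tuning brought no measurable gain over the default hyperparameters.

An examination of the hyperparameters of the grid search model shows that min_samples_leaf has a value of 1. Integer values are absolute sample counts while float values are fractions of the training set, so 1 is the smallest leaf size in the parameter grid and the search settled on the edge of the grid. The range for this parameter needs to be revisited before its optimal value can be stated with more confidence.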
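The grid for min_samples_leaf mixes an integer with floats, which scikit-learn interprets differently: an integer is an absolute sample count, while a float is a fraction of the samples, rounded up. A short sketch of that arithmetic follows, using the 6256 training samples as an illustrative n (each cross-validation fold trains on fewer rows); the helper name is hypothetical.

```python
import math

n_train = 6256  # training samples reported in the notebook
grid = [1, 0.0025, 0.005, 0.01, 0.02, 0.04]


def effective_min_samples(value, n_samples):
    """Integers are absolute counts; floats are fractions of n_samples, rounded up."""
    if isinstance(value, int):
        return value
    return math.ceil(value * n_samples)


effective = {v: effective_min_samples(v, n_train) for v in grid}
print(effective)
```

Under this reading the integer 1 is the least restrictive entry in the grid, and the next entry (0.0025) already requires 16 samples per leaf.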

## Random forest 
The accuracy values of all the tests are at 1.0, meaning that the classifiers are fully accurate on the training data and the test data, with and without exhaustive grid search. However, I need to test lower values of the max_depth parameter, ideally on more data, because the selected value of 30 is the lowest in the grid and a shallower forest might work better on other data.

The random forest grid search used the same hyperparameter ranges as the decision tree, yet it retained full classification accuracy on both the training and test data, which mildly suggests that the accuracy of the random forest classifier is not harmed by suboptimal hyperparameters.

## Support Vector machine
The accuracy values of all the tests are at 1.0 meaning that the classifiers are fully accurate on the training data and the test data, with and without exhaustive gridsearch.

I would put either this or random forest into production due to the high accuracy of the untuned model and 
