# Human Interpretation of Machine Learning Models
**Project**: M3 Project 3b
**Author**: Jacob Buysse

In this project we interpret machine learning models that predict the edibility of mushrooms, using both a decision tree and logistic regression.

In this notebook we will be using...

In [20]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

## Part I

Let us load the mushrooms data set and view the head/info/describe results.

In [2]:
df = pd.read_csv('./csc5610-m4-mushrooms.csv')
df.head()

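| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |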


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [4]:
df.describe()

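| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |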


The describe results were truncated because there are so many columns.  Let us also look at the omitted columns.

In [5]:
df[['stalk-shape', 'stalk-root', 'stalk-surface-above-ring']].describe()

| | stalk-shape | stalk-root | stalk-surface-above-ring |
|---|---|---|---|
| count | 8124 | 8124 | 8124 |
| unique | 2 | 5 | 4 |
| top | t | b | s |
| freq | 4608 | 3776 | 5176 |


So there are 8124 rows with no missing values.  The `class` column is the output variable, with values `e` for `edible` and `p` for `poisonous`.  The other 22 columns are available as features, all of them single-letter categorical variables with between 1 and 12 distinct values (`veil-type` has only a single value, so it carries no information).
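
These observations can also be checked directly rather than read off the summary tables. Below is a minimal sketch, assuming a small hypothetical stand-in frame (`toy`) in place of the full mushrooms data; the values in it are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the mushrooms frame: single-letter categorical values.
toy = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'odor': ['p', 'a', 'l', 'p'],
    'veil-type': ['p', 'p', 'p', 'p'],
})

# Total count of missing values across every column (0 means none are missing).
missing_total = int(toy.isna().sum().sum())

# Number of distinct values per column, sorted so constant columns come first.
cardinality = toy.nunique().sort_values()

print(f'Missing values: {missing_total}')
print(cardinality)
```

Run against `df`, the same two calls should show the missing-value count and the cardinality of every column without the truncation seen in the `describe` output.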

Let us now convert all features to categorical.

In [6]:
for column in df.columns:
    df[column] = df[column].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   class                     8124 non-null   category
 1   cap-shape                 8124 non-null   category
 2   cap-surface               8124 non-null   category
 3   cap-color                 8124 non-null   category
 4   bruises                   8124 non-null   category
 5   odor                      8124 non-null   category
 6   gill-attachment           8124 non-null   category
 7   gill-spacing              8124 non-null   category
 8   gill-size                 8124 non-null   category
 9   gill-color                8124 non-null   category
 10  stalk-shape               8124 non-null   category
 11  stalk-root                8124 non-null   category
 12  stalk-surface-above-ring  8124 non-null   category
 13  stalk-surface-below-ring  8124 non-null   catego
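
The per-column loop above can likely be collapsed into a single call, since `DataFrame.astype` accepts a dtype for the whole frame. A minimal sketch on a hypothetical two-column frame (`toy` is an illustrative stand-in, not the mushrooms data):

```python
import pandas as pd

# Hypothetical stand-in frame with string columns.
toy = pd.DataFrame({
    'class': ['p', 'e', 'e'],
    'odor': ['p', 'a', 'l'],
})

# Convert every column in one call; equivalent to looping over toy.columns.
toy_cat = toy.astype('category')

print(toy_cat.dtypes)
print(list(toy_cat['class'].cat.categories))
```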

## Part II - Label Encoding

Let us convert our output variable to a numeric value using the `LabelEncoder` and inspect the first 10 rows.

In [7]:
top10 = df['class'][:10].values
print(f'First 10 classes: {top10}')
y_encoder = LabelEncoder()
y_encoder.fit(df['class'])
y = y_encoder.transform(df['class'])
print(f'First 10 encoded classes: {y[:10]}')

First 10 classes: ['p', 'e', 'e', 'p', 'e', 'e', 'e', 'e', 'p', 'e']
Categories (2, object): ['e', 'p']
First 10 encoded classes: [1 0 0 1 0 0 0 0 1 0]
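
The output above suggests that `e` maps to `0` and `p` maps to `1`. The fitted encoder can confirm the mapping and reverse it, which helps later when reading model output. A minimal sketch, assuming a short hypothetical list of labels in place of `df['class']`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of class labels ('e' = edible, 'p' = poisonous).
labels = ['p', 'e', 'e', 'p', 'e']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # fit and transform in one step

# classes_ lists the original labels; each label's position is its integer code.
print(list(encoder.classes_))
print(encoded)

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```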


## Part III - Feature Encoding

Now we want to convert all of our input features from categorical values to numerical values using one-hot encoding.

In [9]:
X_source = df.loc[:, df.columns != 'class']
x_encoder = OneHotEncoder()
x_encoder.fit(X_source)
X = x_encoder.transform(X_source)
X

<8124x117 sparse matrix of type '<class 'numpy.float64'>'
	with 178728 stored elements in Compressed Sparse Row format>

We have converted our original 22 categorical columns into 117 binary columns.  However, on rereading the instructions more carefully, I see that we were supposed to use the pandas `get_dummies` function for this encoding.

In [17]:
bdf = pd.get_dummies(X_source, dtype=float)
bdf.head()

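| | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_f | cap-surface_g | cap-surface_s | cap-surface_y | ... | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |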


In [18]:
bdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Columns: 117 entries, cap-shape_b to habitat_w
dtypes: float64(117)
memory usage: 7.3 MB


In [19]:
X = bdf.values
X

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

So `get_dummies` is a second way to convert the source categorical columns into a 117-column binary matrix.
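
Both encoders produced 117 columns, which suggests they yield the same matrix. That can be checked rather than assumed. A minimal sketch, using a hypothetical two-feature frame (`toy`) as a stand-in for `X_source`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X_source: two categorical feature columns.
toy = pd.DataFrame({
    'cap-shape': ['x', 'b', 'x', 'f'],
    'bruises': ['t', 't', 'f', 'f'],
})

# scikit-learn returns a sparse matrix by default, so densify it to compare.
sk_matrix = OneHotEncoder().fit_transform(toy).toarray()

# pandas returns a DataFrame with one column per (feature, value) pair.
pd_frame = pd.get_dummies(toy, dtype=float)

print(list(pd_frame.columns))
print(np.array_equal(sk_matrix, pd_frame.values))
```

One practical difference: a fitted `OneHotEncoder` remembers the categories it saw and can be reused on new data, whereas `get_dummies` derives its columns from whatever frame it is given. The pandas result does keep readable column names, which is convenient when interpreting a model.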

## Part IV - Experimental Setup

Now we want to divide our dataset into a training dataset (75%) and a testing dataset (25%).

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=777, stratify=df['class'])
print(f"X_train {len(X_train)}, X_test {len(X_test)}, y_train {len(y_train)}, y_test {len(y_test)}")

X_train 6093, X_test 2031, y_train 6093, y_test 2031


So we have 6,093 rows in our training dataset and 2,031 rows in our testing dataset.  The split is reproducible because of the fixed random seed (`777`), and it is stratified by class label, so both datasets should keep roughly the same proportion of `edible` and `poisonous` mushrooms.
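
The effect of `stratify` can be checked by comparing class proportions in the two splits. A minimal sketch, assuming hypothetical toy labels (`y_toy`) and a placeholder feature column (`X_toy`) rather than the mushroom data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical, slightly uneven labels: 52 of class 0 and 48 of class 1.
y_toy = np.array([0] * 52 + [1] * 48)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=777, stratify=y_toy
)

# With stratification, the share of class 1 should be close in both splits.
train_share = y_tr.mean()
test_share = y_te.mean()
print(f'train share of class 1: {train_share:.3f}')
print(f'test share of class 1:  {test_share:.3f}')
```

Applying the same comparison to `y_train` and `y_test` would confirm that the poisonous share is about the same in both.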