# Data preprocessing

The purpose of this notebook is to output a single dataframe with which to build a classification model.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("DATA/mushrooms.csv", na_values='?')

## Dealing with missing data

In [3]:
df.isnull().sum()

class                          0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64

Only the stalk-root feature has missing values (2480 rows). For simplicity's sake, I am going to drop those examples, which leaves 5644 of the original 8124 rows.

In [4]:
df = df.dropna()
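
Dropping rows is the simplest option, but it is worth knowing how much data it costs. The sketch below uses a small made-up frame (not the real mushroom data) to show how the lost share can be measured, and one possible alternative: keeping the rows and treating the missing value as its own category. The label `"missing"` is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-in for the mushroom data; the values are made up for illustration.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "stalk-root": ["b", None, "c", None],
})

# Share of rows that dropna() would discard because stalk-root is missing.
lost_share = toy["stalk-root"].isna().mean()

# Alternative to dropping: keep every row and mark the gap explicitly.
toy_filled = toy.fillna({"stalk-root": "missing"})

print(f"rows lost by dropna: {lost_share:.0%}")
print(toy_filled["stalk-root"].value_counts())
```

This notebook continues with the dropped-rows version; the alternative is only shown for comparison.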

## Feature Selection

We need to remove uninformative features in order to simplify our future model.

### Remove constant features

Constant features take a single value across all the observations of the dataset. These features provide no information that allows a machine learning model to discriminate or predict a target.

In [5]:
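from sklearn.preprocessing import LabelEncoder
# VarianceThreshold needs numeric input, so map each column's categories to integer codes.
# apply() refits the same encoder on every column; fine here, since only the codes are needed.
enc = LabelEncoder()
df_encoded = df.apply(enc.fit_transform)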

In [6]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0)
sel.fit(df_encoded)

In [7]:
# get_support() flags the columns that are kept; invert it to list the dropped ones.
constant_features = df_encoded.columns[~sel.get_support()].tolist()
constant_features

['veil-type']

We can see that the veil-type feature is constant: it takes a single value for every observation of the dataset. So we can drop this column without losing any information.
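
As a cross-check that does not need any encoding, constant columns can also be found by counting distinct values per column. This is a minimal sketch on a made-up frame, not the notebook's actual data:

```python
import pandas as pd

# Toy frame: veil-type has one value everywhere, cap-shape varies.
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "cap-shape": ["x", "b", "x", "f"],
})

# A column is constant when it has exactly one distinct value.
n_unique = toy.nunique()
constant = n_unique[n_unique == 1].index.tolist()
print(constant)  # ['veil-type']
```

On categorical data this should agree with `VarianceThreshold(threshold=0)`, since a column has zero variance exactly when all its encoded values are equal.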

In [8]:
df = df.drop(labels=constant_features, axis=1)
df_encoded = df_encoded.drop(labels=constant_features, axis=1)

### Remove quasi-constant features

Now, we aim to remove all features whose variance doesn't meet some threshold.

In [9]:
sel = VarianceThreshold(threshold=0.1) 
sel.fit(df_encoded)

In [10]:
# Same idea as before: the inverted support mask lists the low-variance columns.
quasiconstant_features = df_encoded.columns[~sel.get_support()].tolist()
quasiconstant_features

['gill-attachment', 'veil-color', 'ring-number']

In [11]:
df = df.drop(labels=quasiconstant_features, axis=1)
df_encoded = df_encoded.drop(labels=quasiconstant_features, axis=1)

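We can see that 3 features are almost constant. This means that these 3 variables show predominantly one value for ~90% of the observations of the dataset. The figure is approximate: for a binary feature encoded as 0/1 the variance is p(1 - p), so a threshold of 0.1 corresponds to one value covering roughly 89% of the rows, while for features with more than two categories the variance also depends on the integer codes that LabelEncoder happens to assign.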
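
Because variance on label codes is only a proxy for "one value dominates", a more direct check is to compute the share of the most frequent category in each column. The helper below is a hypothetical sketch (the name `dominant_share` and the 0.9 cutoff are assumptions for illustration), run on a made-up frame rather than the real data:

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of rows taken by the most frequent value."""
    # value_counts() sorts by frequency, so the first entry is the dominant value.
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Toy frame: gill-attachment is dominated by "f", cap-shape is evenly split.
toy = pd.DataFrame({
    "gill-attachment": ["f"] * 19 + ["a"],
    "cap-shape": ["x", "b"] * 10,
})

shares = dominant_share(toy)
# Flag columns where a single value covers at least 90% of the rows.
quasi_constant = shares[shares >= 0.9].index.tolist()
print(shares)
print(quasi_constant)
```

Unlike the variance of integer codes, this share does not change if the categories are encoded in a different order.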

### Other features to drop

In [12]:
df.columns

Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root',
       'stalk-surface-above-ring', 'stalk-surface-below-ring',
       'stalk-color-above-ring', 'stalk-color-below-ring', 'ring-type',
       'spore-print-color', 'population', 'habitat'],
      dtype='object')

I want this to be a simple model, so I am going to remove the features that would not be fun to play with in my web application. The app should be exciting, educational and entertaining, so I am keeping only the features that serve this purpose and also matter for the model.

In [13]:
df = df[["class", "cap-shape", "cap-surface", "cap-color", "gill-spacing", "gill-color", "stalk-shape", "stalk-root"]]
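
One way to sanity-check a hand-picked subset like this is to look for rows that share identical feature values but carry different class labels, since no classifier can separate those. The helper below is a hypothetical sketch (the name `ambiguous_rows` is not part of the notebook) demonstrated on a made-up frame:

```python
import pandas as pd

def ambiguous_rows(frame: pd.DataFrame, target: str = "class") -> pd.DataFrame:
    """Return rows whose feature combination appears with more than one target label."""
    features = [c for c in frame.columns if c != target]
    # Number of distinct target labels seen for each feature combination.
    labels_per_combo = frame.groupby(features)[target].transform("nunique")
    return frame[labels_per_combo > 1]

# Toy frame: the first two rows have the same features but different classes.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "cap-shape": ["x", "x", "b", "f"],
    "cap-color": ["n", "n", "w", "g"],
})

conflicts = ambiguous_rows(toy)
print(len(conflicts), "ambiguous rows out of", len(toy))
```

A large share of ambiguous rows would suggest that the reduced feature set caps the accuracy the model can reach, which may be an acceptable trade-off for a simpler app.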

## Now, we are ready to move on.

After this data preparation, our dataset is finally ready for the classification model.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5644 entries, 0 to 8114
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   class         5644 non-null   object
 1   cap-shape     5644 non-null   object
 2   cap-surface   5644 non-null   object
 3   cap-color     5644 non-null   object
 4   gill-spacing  5644 non-null   object
 5   gill-color    5644 non-null   object
 6   stalk-shape   5644 non-null   object
 7   stalk-root    5644 non-null   object
dtypes: object(8)
memory usage: 396.8+ KB


In [15]:
df.describe()

       class cap-shape cap-surface cap-color gill-spacing gill-color stalk-shape stalk-root
count   5644      5644        5644      5644         5644       5644        5644       5644
unique     2         6           4         8            2          9           2          4
top        e         x           y         g            c          p           t          b
freq    3488      2840        2220      1696         4620       1384        2880       3776


In [16]:
df.to_csv('DATA/mushrooms-final.csv',index=False)