# Mushroom classification - Part I: Preprocessing

In [1]:
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer
from sklearn.model_selection import train_test_split

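We fix a seed value for reproducibility. Note that `random.seed` only seeds Python's built-in `random` module; the same value is passed explicitly to `train_test_split` further down.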

In [2]:
num = 5  # seed value, reused later as random_state
random.seed(num)

In [3]:
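# Load the dataset and drop the 'stalk-root' column, which is not used as a feature
df = pd.read_csv('mushrooms_traintest.csv')
df.pop('stalk-root')
# List the remaining columns
df.columns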

Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
       'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
       'veil-color', 'ring-number', 'ring-type', 'spore-print-color',
       'population', 'habitat'],
      dtype='object')

### Features vs classes

We now separate the data into the input features and the output class. Since a mushroom is either poisonous or edible, we also convert the class labels into binary values.

In [4]:
df_class = df.pop('class')
df_class = LabelBinarizer().fit_transform(df_class)
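
To make the binarization step concrete, here is a minimal sketch on toy data. The labels `'e'` and `'p'` are an assumption made for illustration (the actual label values in the dataset are not shown above), and the names `toy_labels`, `lb` and `toy_binary` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels (assumed): 'e' for edible, 'p' for poisonous
toy_labels = np.array(['p', 'e', 'e', 'p'])

lb = LabelBinarizer()
toy_binary = lb.fit_transform(toy_labels)

print(lb.classes_)         # classes are sorted, so 'e' maps to 0 and 'p' to 1
print(toy_binary.ravel())  # [1 0 0 1]
print(toy_binary.shape)    # (4, 1): a column vector with one row per sample
```

With two classes, `LabelBinarizer` returns a single 0/1 column rather than one column per class, which is why `df_class` ends up two-dimensional.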

### Train vs test

We now split the dataset into a training set (75% of the entries) and a test set (25%). Because the data is shuffled before the split, we pass the seed defined earlier as `random_state`, so that the split is reproducible.

In [5]:
train_df, test_df, train_class, test_class = train_test_split(
    df, df_class, test_size=0.25, random_state=num, shuffle=True
)
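
The effect of `test_size` and `random_state` can be checked on a small example. This is only a sketch: the arrays `X` and `y` are made-up toy data, not the mushroom dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Two splits with the same random_state
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=5, shuffle=True
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.25, random_state=5, shuffle=True
)

print(X_tr.shape, X_te.shape)       # (6, 2) (2, 2)
print(np.array_equal(X_te, X_te2))  # True: same seed, same split
```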

## Encodings

As we can see in the data and in the attributes.txt file, the features are categorical, each taking values from a small set of labels. The first step is therefore to transform them into numerical values. The `OneHotEncoder` class from sklearn is well suited to this: it replaces each categorical column with one binary column per category.

In [6]:
# Fit on the full feature set so that every category is known to the encoder
OHE = OneHotEncoder().fit(df)

train_df = OHE.transform(train_df)
test_df = OHE.transform(test_df)
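
As an illustration of what the encoder produces, here is a minimal sketch on a toy frame. The column values below are invented for the example, and `toy`, `enc` and `encoded` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with two categorical columns (values are made up)
toy = pd.DataFrame({
    'cap-shape': ['x', 'b', 'x'],
    'odor': ['n', 'a', 'n'],
})

enc = OneHotEncoder().fit(toy)
encoded = enc.transform(toy)  # a SciPy sparse matrix by default

print(enc.categories_)        # sorted categories found in each column
print(encoded.toarray())
# [[0. 1. 0. 1.]
#  [1. 0. 1. 0.]
#  [0. 1. 0. 1.]]
```

Each input column expands into as many output columns as it has categories, so two columns with two categories each give four output columns.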

## Saving for later

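We now save the results to two files, `train.npy` and `test.npy`, to be loaded and processed in the next notebook. Each file holds the dense feature matrix followed by the class labels.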

In [7]:
with open('train.npy', 'wb') as f:
    np.save(f, train_df.toarray())  # dense feature matrix
    np.save(f, train_class)         # binary class labels

In [8]:
with open('test.npy', 'wb') as g:
    np.save(g, test_df.toarray())  # dense feature matrix
    np.save(g, test_class)         # binary class labels
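
Since each file contains two arrays written one after the other, they have to be read back in the same order. The following is a sketch of one way to do that, using a temporary file and made-up arrays (`features`, `labels` and `demo.npy` are hypothetical, not the files saved above); the next notebook may load the data differently.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for the feature matrix and the class labels
features = np.eye(3)
labels = np.array([[0], [1], [0]])

path = os.path.join(tempfile.mkdtemp(), 'demo.npy')

# Write two arrays back to back into the same file
with open(path, 'wb') as f:
    np.save(f, features)
    np.save(f, labels)

# Read them back in the order they were written
with open(path, 'rb') as f:
    loaded_features = np.load(f)
    loaded_labels = np.load(f)

print(np.array_equal(features, loaded_features))  # True
print(np.array_equal(labels, loaded_labels))      # True
```

Calling `np.load` on the file name instead of the open file handle would return only the first array.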