
# <center> Module 2: Lesson 5 Preprocessing for Machine Learning Part 1</center>
##  <center> Use Encoding to Prepare a Dataset for Machine Learning </center>
<center>by: Nicole Woodland, P. Eng. for RoboGarden Inc. </center>

---

This notebook showcases one of the four preprocessing steps that can be applied to a dataset where appropriate:
- Cleaning
- **Encoding**
- Scaling
- Feature (or Dimensionality) Reduction

We will look at the Mushroom Dataset used in the first Editor Mission.
Link: https://www.kaggle.com/uciml/mushroom-classification

In [15]:
import pandas as pd
# Read the mushroom CSV file downloaded from Kaggle into a dataframe called 'mushrooms'
mushrooms = pd.read_csv("mushrooms.csv")
mushrooms

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


This dataset is predominantly categorical data! We will need to do some encoding!

In [18]:
# View a summary of the dataframe - all categorical data
mushrooms.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [20]:
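# Make a subset dataframe (note the double square brackets; column names are case sensitive).
# .copy() makes the subset an independent dataframe, so later assignments do not raise a SettingWithCopyWarning.

mushrooms = mushrooms[['class', 'cap-shape', 'cap-surface',
                       'cap-color', 'bruises', 'odor',
                       'gill-attachment']].copy()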

mushrooms.head(5)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment
0,p,x,s,n,t,p,f
1,e,x,s,y,t,a,f
2,e,b,s,w,t,l,f
3,p,x,y,w,t,p,f
4,e,x,s,g,f,n,f


## Let's Encode the Label

In [25]:
# LabelEncoder maps the n unique values of a single column to the integers 0 to n-1.
# Either option below will work.

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

mushrooms.loc[:,'class'] = enc.fit_transform(mushrooms['class'])

# mushrooms['class'] = enc.fit_transform(mushrooms['class'])
mushrooms.head(3)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment
0,1,x,s,n,t,p,f
1,0,x,s,y,t,a,f
2,0,b,s,w,t,l,f
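
To see which integer each label received, the fitted encoder can be inspected. The following is a minimal sketch on hypothetical toy labels (`toy_labels` is not part of the mushroom data); it assumes only that `LabelEncoder` stores the sorted unique values in its `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the 'class' column
toy_labels = pd.Series(["p", "e", "e", "p", "e"])

label_enc = LabelEncoder()
encoded = label_enc.fit_transform(toy_labels)

# classes_ holds the original values in sorted order;
# the position of each value is its integer code
mapping = {str(label): code for code, label in enumerate(label_enc.classes_)}
print(mapping)

# inverse_transform converts the integer codes back to the original labels
decoded = label_enc.inverse_transform(encoded)
print(list(decoded))
```

Because the values are sorted before codes are assigned, 'e' (edible) becomes 0 and 'p' (poisonous) becomes 1, which matches the encoded output above.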


## Encode the Remaining Columns
There are two approaches to encoding the feature columns.
- Ordinal Encoder - encodes one or more columns, mapping the n unique values of each column to the integers 0 to n-1
- One Hot Encoding - replaces each column with n new 0/1 columns, one per unique value, where a 1 marks the 'hot' column for that row (the result will be MANY columns!)
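
The difference in output width between the two approaches can be sketched on a small example. The dataframe `toy` below is hypothetical data made up for illustration, not a slice of the mushroom dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical feature columns
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "k"],   # 3 unique values
    "odor": ["p", "a", "l", "n"],        # 4 unique values
})

# Ordinal encoding keeps one column per feature
ordinal = OrdinalEncoder().fit_transform(toy)

# One-hot encoding creates one column per unique value of each feature
one_hot = pd.get_dummies(toy, dtype=int)

print("Ordinal shape:", ordinal.shape)
print("One-hot shape:", one_hot.shape)
print("Total unique values:", toy.nunique().sum())
```

The number of one-hot columns equals the total number of unique values across the encoded features, so it grows quickly on columns with many categories.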

### Ordinal Encoder

In [None]:
# Exclude the 'class' column, as it is already encoded!
# This list of column names can be passed to the encoder to transform those columns of the dataframe.

mushrooms_ordinal = mushrooms.copy()

categorical_columns = mushrooms.columns.drop(['class'])
categorical_columns

In [None]:
# Because the encoder is a scikit-learn object and not a pandas method, there is no inplace=True option.
# Instead, assign the encoder's output back to the dataframe with .loc, selecting the columns by name.

from sklearn.preprocessing import OrdinalEncoder 
enc = OrdinalEncoder()

mushrooms_ordinal.loc[:, categorical_columns] = enc.fit_transform(mushrooms_ordinal.loc[:, categorical_columns])

In [None]:
print("Original Data:")
display(mushrooms.head(3))
print("\nEncoded Data:")
display(mushrooms_ordinal.head(3))
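
The integer assigned to each category can be read from the fitted encoder. This is a minimal sketch on hypothetical toy data (`toy` and `ord_enc` are illustrative names), assuming the default behaviour in which `OrdinalEncoder` sorts the categories of each column and stores them in `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for two mushroom feature columns
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x"],
    "bruises": ["t", "f", "t"],
})

ord_enc = OrdinalEncoder()
codes = ord_enc.fit_transform(toy)

# categories_ has one array per input column, in column order;
# the position of a category in its array is its encoded value
for column, categories in zip(toy.columns, ord_enc.categories_):
    print(column, {str(cat): i for i, cat in enumerate(categories)})
```

Note that the encoder returns floating-point values by default, and that the resulting order is alphabetical rather than meaningful for these features.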

### One Hot Encoding

In [None]:
#Drop the class column, it is already encoded!
mushrooms_one_hot = mushrooms.drop('class', axis = 1)

# Use pandas' get_dummies to one-hot encode all the remaining columns. drop() above returned a new dataframe, so the original data is not lost.
mushrooms_one_hot = pd.get_dummies(mushrooms_one_hot, dtype = int)

# Add the 'class' column back in
mushrooms_one_hot.insert(0, 'class', mushrooms['class'])
mushrooms_one_hot

Note: Best practice is to drop one column from each set of one-hot columns to reduce multicollinearity. Within each set exactly one column is 1 in every row, so the value of any one column can always be inferred from the others. That column is redundant and can be removed to save resources.

Let's break it down!

In [None]:
# Use only the 'cap-shape' column as an example:

mushrooms_one_hot = mushrooms['cap-shape']
mushrooms_one_hot = pd.get_dummies(mushrooms_one_hot, dtype=int, drop_first=False)
# Put the original values next to the encoded columns for comparison
mushrooms_one_hot.insert(0, 'original_cap_shape', mushrooms['cap-shape'])
print("Without dropping a column:")
display(mushrooms_one_hot.head(3))

# Repeat to show the difference with dropping
mushrooms_one_hot = mushrooms['cap-shape']
mushrooms_one_hot = pd.get_dummies(mushrooms_one_hot, dtype=int, drop_first=True)
mushrooms_one_hot.insert(0, 'original_cap_shape', mushrooms['cap-shape'])
print("\nDropping the first column:")
display(mushrooms_one_hot.head(3))
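
The redundancy of the dropped column can be checked directly. The sketch below uses a hypothetical toy series (`toy_shape`) rather than the mushroom data, and rebuilds the dropped column from the ones that remain.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy_shape = pd.Series(["x", "b", "k", "x", "b"], name="cap-shape")

# Full one-hot encoding: columns b, k, x
full = pd.get_dummies(toy_shape, dtype=int)

# drop_first=True removes the first category (b): columns k, x
reduced = pd.get_dummies(toy_shape, dtype=int, drop_first=True)

# Every row of the full encoding sums to 1, so the dropped column
# is 1 wherever all remaining columns are 0
reconstructed_b = 1 - reduced.sum(axis=1)

print((reconstructed_b == full["b"]).all())
```

Since the dropped column can be recovered exactly, removing it loses no information.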

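Selected columns of a dataframe can also be encoded with get_dummies by passing their names in the `columns` argument. The result is a new dataframe in which the selected columns are replaced by their one-hot columns and all other columns are left unchanged: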

In [None]:
columns_to_encode = ['cap-shape', 'cap-surface']

# Apply get_dummies() to selected columns
encoded_df = pd.get_dummies(mushrooms, columns=columns_to_encode, drop_first=True)
encoded_df
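
To make the resulting layout concrete, here is a small sketch on a hypothetical toy dataframe (`toy` is illustrative data, not the mushroom dataset). It shows how the new columns are named and where they are placed when `columns` is used.

```python
import pandas as pd

# Hypothetical toy data: an encoded label plus three categorical features
toy = pd.DataFrame({
    "class": [1, 0, 0],
    "cap-shape": ["x", "x", "b"],
    "cap-surface": ["s", "y", "s"],
    "odor": ["p", "a", "l"],
})

encoded = pd.get_dummies(
    toy,
    columns=["cap-shape", "cap-surface"],  # only these columns are encoded
    drop_first=True,                       # drop one column per encoded feature
    dtype=int,                             # 0/1 integers instead of booleans
)

# New columns are named <original column>_<category> and appended after
# the untouched columns; 'odor' stays as text because it was not selected
print(list(encoded.columns))
print(list(toy.columns))  # the original dataframe is not modified
```

Any categorical column left out of `columns`, such as `odor` here, still needs to be encoded before most models can use it.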