# Lab 04: One hot encoding
## Introduction

This lab demonstrates how to apply one hot encoding to categorical variables with pandas. At the end of the lab, you should be able to use `pandas` to:

- Encode categorical variables via one hot encoding.
- Modify a data frame to substitute the newly encoded variables for the categorical labels they were generated from.

### Getting started

Let's start by importing pandas in the usual way.

In [1]:
import pandas as pd

Next, let's load the data. Write the path to your iris.csv file in the cell below:

In [2]:
path_to_csv = "data/iris.csv"

Execute the cell below to load the data into a pandas data frame and index that data frame by the `sample_number` column:

In [3]:
df = pd.read_csv(path_to_csv, index_col=['sample_number'])
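
If you do not have the file to hand, the effect of `index_col` can be sketched on an in-memory CSV. This is a minimal illustration only: the two data rows are copied from the preview shown in this lab, while the column order of the real `iris.csv` and the names `csv_text` and `preview` are assumptions made for the example.

```python
import io

import pandas as pd

# Assumed file layout: the header order is a guess, the rows come from the lab's preview.
csv_text = """sample_number,species,sepal_length,sepal_width,petal_length,petal_width
1,setosa,5.1,3.5,1.4,0.2
2,setosa,4.9,3.0,1.4,0.2
"""

# index_col moves the named column out of the data and into the row index.
preview = pd.read_csv(io.StringIO(csv_text), index_col="sample_number")

print(preview.index.name)       # name of the row index
print(preview.columns.tolist())  # remaining data columns
```

Because `sample_number` becomes the index, it no longer appears among the data columns, so it will not be mistaken for a feature later on.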

Take a quick peek at the data:

In [4]:
df.head()

| sample_number | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 1 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| 2 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
| 4 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
| 5 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |


## One hot encoding

We can examine the type of the data in our data frame via the `dtypes` attribute, as follows:

In [5]:
df.dtypes

species          object
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
dtype: object

As you can see, we have four columns of numerical data (`float64`), corresponding to the physical measurements, and one column of text data (`object`), corresponding to the species labels. Let's take a closer look at the unique values in the species column:

In [6]:
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)
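
With wider data sets, checking each column by eye becomes tedious. One possible way to automate the inspection above is sketched below, assuming a small hypothetical data frame named `toy` rather than the real iris data; `select_dtypes` and `nunique` are standard pandas calls, but the approach itself is not part of the original lab.

```python
import pandas as pd

# Hypothetical miniature of the iris data frame, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica", "setosa"],
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
    }
)

# Columns that are not numeric are the candidates for encoding.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Count the distinct labels in each candidate column.
label_counts = {col: toy[col].nunique() for col in categorical_columns}
print(label_counts)
```

The number of distinct labels matters because one hot encoding creates one new column per label.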

If we wanted to use these labels as input to a machine learning algorithm, we would first need to convert them from text into some numerical format, so that the algorithm could process them. One way to do this would be to assign a numerical value to each species, e.g. `setosa = 0`, `versicolor = 1`, `virginica = 2`, but this would impose an artificial ordering: setosa is not "less than" versicolor or virginica in any mathematical sense, yet an algorithm would treat the codes as ordered quantities.
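
To make the problem concrete, here is a small sketch of the integer assignment described above. The series `labels` and the dictionary `mapping` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
labels = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# The integer assignment described in the text.
mapping = {"setosa": 0, "versicolor": 1, "virginica": 2}
codes = labels.map(mapping)

# Arithmetic on the codes now "works", even though it means nothing botanically.
mean_code = codes.mean()
midpoint = (mapping["setosa"] + mapping["virginica"]) / 2

print(codes.tolist())
print(mean_code)
print(midpoint == mapping["versicolor"])
```

Under this scheme, the midpoint of setosa and virginica is exactly versicolor, a relationship that exists only because of the arbitrary numbering.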

A better alternative is to create one new binary feature per label, so that every label is encoded on an equal footing and no label is ranked above another. This technique is known as one hot encoding, and it is supported in pandas via the [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function:

In [8]:
encoded_features = pd.get_dummies(df['species'])

encoded_features.head()  # Take a quick look at the result

| sample_number | setosa | versicolor | virginica |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 |


As you can see, pandas has encoded each label as a binary indicator variable, where a "1" represents the presence of the label and a "0" indicates the absence of the label.
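
Note that the output you see may depend on your pandas version: from pandas 2.0 onwards, `get_dummies` returns `True`/`False` values by default rather than 1/0. The sketch below, which uses a hypothetical series named `labels`, passes `dtype=int` to request integer indicators explicitly and checks that each row contains exactly one "1".

```python
import pandas as pd

# Hypothetical labels, for illustration only.
labels = pd.Series(
    ["setosa", "versicolor", "virginica", "setosa"], name="species"
)

# dtype=int requests 1/0 indicators regardless of the pandas default.
indicators = pd.get_dummies(labels, dtype=int)

# Each sample belongs to exactly one species, so every row should sum to 1.
row_sums = indicators.sum(axis="columns")

print(indicators)
print(row_sums.tolist())
```

The single "1" per row is where the name "one hot" comes from.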

We can use the [`concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to glue the new features to our existing data frame. Both data frames share the same `sample_number` index, so the rows line up correctly:

In [9]:
df = pd.concat([df, encoded_features], axis='columns')

df.head()

| sample_number | species | sepal_length | sepal_width | petal_length | petal_width | setosa | versicolor | virginica |
|---|---|---|---|---|---|---|---|---|
| 1 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
| 2 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
| 3 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
| 4 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
| 5 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |


Finally, we can use the [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html#pandas.DataFrame.drop) method to remove the original `species` column from the data frame, leaving only the numerical measurements and the new encoded features:

In [10]:
df = df.drop('species', axis='columns')

df.head()

| sample_number | sepal_length | sepal_width | petal_length | petal_width | setosa | versicolor | virginica |
|---|---|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |
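
The encode, concatenate and drop steps can also be combined into a single call by passing the whole data frame to `get_dummies` along with a `columns` argument. The sketch below is an alternative to the approach used in this lab, not a replacement for it; the data frame `toy` is hypothetical, and note that the new columns are prefixed with the original column name by default.

```python
import pandas as pd

# Hypothetical miniature of the iris data frame, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica"],
        "sepal_length": [5.1, 7.0, 6.3],
    },
    index=pd.Index([1, 2, 3], name="sample_number"),
)

# Encode the listed columns, append the indicators and drop the originals in one step.
encoded = pd.get_dummies(toy, columns=["species"], dtype=int)

print(encoded.columns.tolist())
```

The step-by-step version is useful for seeing what happens at each stage, while the single call is more compact when several columns need encoding.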
