# Lecture 4: Categorical Variables, One-hot Encoding

* How to use one-hot encoding for categorical variables.

## Setup

Imports:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()

We use the slightly expanded student data from *Lecture 3*:

In [2]:
df = pd.read_csv('students.csv')
df

,student,programme,enrolment
0,Bob,BIM,2008
1,Jake,MiM,2012
2,Lisa,IM,2004
3,Sue,BIM,missing
4,William,SCM,2008
5,James,BIM,2012
6,Harper,BIM,2004
7,Mason,IM,2009
8,Evelyn,IM,missing
9,Ella,SCM,2012


In the data, `enrolment` contains the year of enrolment, but also the *text* `missing` for missing observations. Thus, the data type of this column is `object`:

In [3]:
df.dtypes

student      object
programme    object
enrolment    object
dtype: object

We can fix this with the [`pandas.to_numeric()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) function:

In [4]:
df['enrolment'] = pd.to_numeric(df['enrolment'], errors='coerce')
df

,student,programme,enrolment
0,Bob,BIM,2008.0
1,Jake,MiM,2012.0
2,Lisa,IM,2004.0
3,Sue,BIM,
4,William,SCM,2008.0
5,James,BIM,2012.0
6,Harper,BIM,2004.0
7,Mason,IM,2009.0
8,Evelyn,IM,
9,Ella,SCM,2012.0


Setting `errors='coerce'` converts any value that cannot be parsed as a number to `np.nan`, which is how pandas and NumPy represent missing data (as we saw last lecture). Because `NaN` is a floating-point value, the data type of `enrolment` is now `float64`:

In [5]:
df.dtypes

student       object
programme     object
enrolment    float64
dtype: object
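
To see the effect of `errors='coerce'` in isolation, here is a minimal sketch on a toy `Series`. The variable names `raw` and `years` and the four values are made up for illustration and are not part of `students.csv`.

```python
import pandas as pd

# Hypothetical toy column mixing year strings with the text 'missing'
raw = pd.Series(['2008', '2012', 'missing', '2004'])

# errors='coerce' turns anything that cannot be parsed into NaN
years = pd.to_numeric(raw, errors='coerce')

print(years.dtype)         # dtype of the converted column
print(years.isna().sum())  # number of values that could not be parsed

# The default, errors='raise', stops at the first unparseable value instead
try:
    pd.to_numeric(raw)
except ValueError as err:
    print('ValueError:', err)
```

Coercing is convenient, but it silently hides typos such as `20O8`, so it can be worth checking how many values became `NaN`.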

In the homework exercises, you often have to print to the console. An easy way to do that with a `DataFrame` is the [`DataFrame.iterrows()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) method, which creates an iterator over the `DataFrame`'s rows. The iterator yields tuples: the first element is the row's index and the second is the row itself as a `Series` (recall from last lecture that a `Series` can represent either a column or a row).

In [7]:
for index, row in df.iterrows():
    print(row['student'])

Bob
Jake
Lisa
Sue
William
James
Harper
Mason
Evelyn
Ella
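
The structure of the tuples that `iterrows()` yields can be inspected directly. The sketch below uses a hypothetical three-row excerpt (`students`) rather than the full data, and also shows `DataFrame.itertuples()` as one possible alternative when only a few fields are needed.

```python
import pandas as pd

# Hypothetical three-row excerpt of the student data
students = pd.DataFrame({
    'student': ['Bob', 'Jake', 'Lisa'],
    'programme': ['BIM', 'MiM', 'IM'],
})

# iterrows() yields (index, Series) pairs, one per row
for index, row in students.iterrows():
    print(index, type(row).__name__, row['student'], row['programme'])

# itertuples() yields namedtuples; fields are accessed as attributes
names = [r.student for r in students.itertuples(index=False)]
print(names)
```

Both approaches loop in Python, so they suit printing and small data sets better than heavy computation.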


A student's `programme` is a categorical variable: each programme is one category.

In [6]:
df['programme'].unique()

array(['BIM', 'MiM', 'IM', 'SCM'], dtype=object)
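
Besides listing the categories, it is often useful to know how many observations fall into each one. The sketch below rebuilds the `programme` column from the table above as a standalone `Series` (the name `programme` for the variable is an assumption) and counts the categories. It also converts the column to pandas' dedicated `category` dtype, which is optional here.

```python
import pandas as pd

# The programme values from the table above, re-entered by hand
programme = pd.Series(['BIM', 'MiM', 'IM', 'BIM', 'SCM',
                       'BIM', 'BIM', 'IM', 'IM', 'SCM'])

# Number of students per programme, most frequent first
counts = programme.value_counts()
print(counts)

# Optional: store the column with the 'category' dtype
as_category = programme.astype('category')
print(as_category.cat.categories)
```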

## One-hot Encoding

We only need one function for one-hot encoding: [`pandas.get_dummies(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html). In statistics, one-hot encoded variables are often called "dummy variables," because they're a placeholder for the original variable. Hence the function's name "`get_dummies`".

Let's convert `programme` to a set of dummy variables:

In [8]:
programmes = pd.get_dummies(df['programme'])
programmes

,BIM,IM,MiM,SCM
0,1,0,0,0
1,0,0,1,0
2,0,1,0,0
3,1,0,0,0
4,0,0,0,1
5,1,0,0,0
6,1,0,0,0
7,0,1,0,0
8,0,1,0,0
9,0,0,0,1
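
Depending on the installed pandas version, `get_dummies()` may return `True`/`False` instead of `1`/`0` (the default became boolean in pandas 2.0). A minimal sketch, assuming we want integer columns as in the output above, passes `dtype=int` explicitly. The toy `Series` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy column with the four programmes
programme = pd.Series(['BIM', 'MiM', 'IM', 'SCM', 'BIM'], name='programme')

# dtype=int requests 0/1 integers regardless of the version's default
dummies = pd.get_dummies(programme, dtype=int)
print(dummies)

# Each row contains exactly one 1: the "hot" column marks the row's category
print(dummies.sum(axis=1).tolist())
```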


We can append the dummy variables to our original `DataFrame`, too:

In [9]:
df = pd.concat([df, programmes], axis=1)
df

,student,programme,enrolment,BIM,IM,MiM,SCM
0,Bob,BIM,2008.0,1,0,0,0
1,Jake,MiM,2012.0,0,0,1,0
2,Lisa,IM,2004.0,0,1,0,0
3,Sue,BIM,,1,0,0,0
4,William,SCM,2008.0,0,0,0,1
5,James,BIM,2012.0,1,0,0,0
6,Harper,BIM,2004.0,1,0,0,0
7,Mason,IM,2009.0,0,1,0,0
8,Evelyn,IM,,0,1,0,0
9,Ella,SCM,2012.0,0,0,0,1
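
As a possible shortcut, `get_dummies()` can also encode a column of a whole `DataFrame` in one step via its `columns` parameter. Note that this behaves slightly differently from the `concat` approach: the original column is replaced, and the new columns get a `programme_` prefix. The three-row `students` frame below is a hypothetical excerpt.

```python
import pandas as pd

# Hypothetical three-row excerpt of the student data
students = pd.DataFrame({
    'student': ['Bob', 'Jake', 'Lisa'],
    'programme': ['BIM', 'MiM', 'IM'],
})

# Encode 'programme' in place; other columns are passed through unchanged
encoded = pd.get_dummies(students, columns=['programme'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```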


## Baseline Category

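If we want to use dummy variables in an ML model and the categories are exhaustive (i.e., every observation is in exactly one category), we need to make sure to exclude one *baseline* category. Otherwise the dummy columns always sum to one, which makes them perfectly collinear with the intercept of a linear model.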
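
Why one category has to go can be checked numerically. The sketch below assumes a linear model with an intercept (the lecture does not name a specific model) and compares the rank of the feature matrix with and without the `SCM` dummy. The names `X_all` and `X_base` are illustrative.

```python
import numpy as np
import pandas as pd

# The programme values from the student data, re-entered by hand
programme = pd.Series(['BIM', 'MiM', 'IM', 'BIM', 'SCM',
                       'BIM', 'BIM', 'IM', 'IM', 'SCM'])
dummies = pd.get_dummies(programme, dtype=float)

intercept = np.ones((len(dummies), 1))

# Intercept plus all four dummies: 5 columns
X_all = np.hstack([intercept, dummies.to_numpy()])
# Intercept plus three dummies (SCM excluded as baseline): 4 columns
X_base = np.hstack([intercept, dummies.drop(columns='SCM').to_numpy()])

rank_all = np.linalg.matrix_rank(X_all)
rank_base = np.linalg.matrix_rank(X_base)
print(X_all.shape[1], rank_all)    # number of columns vs. rank
print(X_base.shape[1], rank_base)
```

With all four dummies the rank is lower than the number of columns, so the coefficients are not uniquely determined. After dropping one dummy the matrix has full column rank.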

We can choose which category to exclude from the estimation. It can be any of the four. The choice only affects the interpretation of the remaining dummy coefficients, which are measured relative to the excluded baseline. For example, if we exclude `SCM`, then the coefficient estimated for `BIM` is the difference between `BIM` and `SCM` students.
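
The interpretation relative to the baseline can be illustrated with a small least-squares fit. Everything in this sketch is an assumption made for illustration: the `grade` outcome is invented, and ordinary least squares via `numpy.linalg.lstsq` stands in for whatever model is used later in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'grade' is invented purely for illustration
data = pd.DataFrame({
    'programme': ['BIM', 'BIM', 'IM', 'IM', 'MiM', 'MiM', 'SCM', 'SCM'],
    'grade': [7.0, 8.0, 6.0, 7.0, 8.0, 9.0, 6.0, 6.5],
})

# Dummies with SCM excluded as the baseline; columns are BIM, IM, MiM
dummies = pd.get_dummies(data['programme'], dtype=float).drop(columns='SCM')
X = np.column_stack([np.ones(len(data)), dummies.to_numpy()])

# Ordinary least squares: coef = [intercept, BIM, IM, MiM]
coef, *_ = np.linalg.lstsq(X, data['grade'].to_numpy(), rcond=None)

means = data.groupby('programme')['grade'].mean()
print(coef[0], means['SCM'])                  # intercept vs. baseline mean
print(coef[1], means['BIM'] - means['SCM'])   # BIM coefficient vs. mean difference
```

In this setup the intercept matches the mean of the baseline group, and each dummy coefficient matches that group's mean minus the baseline mean.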

To exclude a baseline, simply do not select it when constructing the feature matrix:

In [10]:
X = df[['BIM', 'IM', 'MiM']]
X

,BIM,IM,MiM
0,1,0,0
1,0,0,1
2,0,1,0
3,1,0,0
4,0,0,0
5,1,0,0
6,1,0,0
7,0,1,0
8,0,1,0
9,0,0,0


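Alternatively, we can set `drop_first=True` when we call `get_dummies()`. Note that this drops the first category in sorted order (here `BIM`), so the baseline is then `BIM` rather than `SCM`.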
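
A short sketch comparing the two ways of excluding a baseline, using a made-up four-row `Series`. The variable names `first_dropped` and `scm_baseline` are illustrative.

```python
import pandas as pd

# Hypothetical toy column with one student per programme
programme = pd.Series(['BIM', 'MiM', 'IM', 'SCM'])

# drop_first=True removes the first category in sorted order
first_dropped = pd.get_dummies(programme, drop_first=True, dtype=int)
print(first_dropped.columns.tolist())

# To choose a specific baseline, drop its column explicitly instead
scm_baseline = pd.get_dummies(programme, dtype=int).drop(columns='SCM')
print(scm_baseline.columns.tolist())
```

Dropping the column by name keeps the choice of baseline explicit, which helps when interpreting the remaining coefficients.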

© 2023 Philipp Cornelius