# One-Hot Encoding

This gist shows three different methods of doing one-hot encoding:

* `pd.get_dummies`
* `sklearn.preprocessing.OneHotEncoder`
* `categoryencoders.OneHotEncoder`

It is _strongly_ recommended that you don't use `pd.get_dummies`, and instead use one of the other two methods. 

The `pd.get_dummies` method is included because it is often one of the first approaches introduced for one-hot encoding; we will use it to show its limitations.

Related blog articles:

* [Encoding categorical variables](): Different methods for encoding categorical variables
* [Are you getting burned by one-hot encoding?](): Discussion of one-hot encoding and the "dummy variable trap"

## Data set

We will look at a dataset that tries to predict interstate traffic volume on I-94 (outside St Paul, MN) based on the date and the weather. We see the description of the description of the weather is categorical.

In [4]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz')
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


Let's look at the overall distribution of the numeric columns:

In [8]:
df.describe()

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,traffic_volume
count,48204.0,48204.0,48204.0,48204.0,48204.0
mean,281.20587,0.334264,0.000222,49.362231,3259.818355
std,13.338232,44.789133,0.008168,39.01575,1986.86067
min,0.0,0.0,0.0,0.0,0.0
25%,272.16,0.0,0.0,1.0,1193.0
50%,282.45,0.0,0.0,64.0,3380.0
75%,291.806,0.0,0.0,90.0,4933.0
max,310.07,9831.3,0.51,100.0,7280.0


From the temperature values (272 - 310) it is clear they are measured in Kelvin.

Let's look at the categorical variables in the `holiday` and `weather_main` columns.

In [5]:
df['holiday'].value_counts()

None                         48143
Labor Day                        7
New Years Day                    6
Christmas Day                    6
Martin Luther King Jr Day        6
Thanksgiving Day                 6
Independence Day                 5
Memorial Day                     5
Washingtons Birthday             5
State Fair                       5
Columbus Day                     5
Veterans Day                     5
Name: holiday, dtype: int64

We can see we have five years of data here. While we could one-hot encode the holiday that we have, we only have 5 or 6 examples of each holiday. Instead, we will just make a binary feature `is_holiday`.

In [38]:
df['is_holiday'] = (df['holiday'] != 'None')
df = df.drop('holiday', axis=1)

KeyError: 'holiday'

In [39]:
df['weather_main'].value_counts()

Clouds          15164
Clear           13391
Mist             5950
Rain             5672
Snow             2876
Drizzle          1821
Haze             1360
Thunderstorm     1034
Fog               912
Smoke              20
Squall              4
Name: weather_main, dtype: int64

## Postprocessing

Let's save the post-processed file to a CSV:

In [31]:
columns = ['date_time', 'is_holiday', 'temp_F', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'traffic_volume']
df['temp_F'] = ((9/5)*(df['temp'] - 273.15) + 32)
df.drop(['temp', 'weather_description'], axis=1)[columns].to_csv('processed_traffic.csv', index=False)

## Restore processed

In [40]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

traffic = pd.read_csv('processed_traffic.csv', parse_dates=[0])
traffic.head()

Unnamed: 0,date_time,is_holiday,temp_F,rain_1h,snow_1h,clouds_all,weather_main,traffic_volume
0,2012-10-02 09:00:00,True,59.234,0.0,0.0,40,Clouds,5545
1,2012-10-02 10:00:00,True,61.178,0.0,0.0,75,Clouds,4516
2,2012-10-02 11:00:00,True,61.574,0.0,0.0,90,Clouds,4767
3,2012-10-02 12:00:00,True,62.564,0.0,0.0,90,Clouds,5026
4,2012-10-02 13:00:00,True,64.382,0.0,0.0,75,Clouds,4918


In [42]:
from sklearn.datasets import fetch_lfw_pairs
fetch_lfw_pairs()

Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012


KeyboardInterrupt: 

In [45]:
split_loc = int(0.9*len(traffic))
train, test = traffic[:split_loc], traffic[split_loc:]

In [46]:
train['weather_main'].value_counts()

Clouds          13903
Clear           12042
Mist             5355
Rain             4756
Snow             2872
Drizzle          1584
Haze             1251
Fog               821
Thunderstorm      777
Smoke              18
Squall              4
Name: weather_main, dtype: int64

In [47]:
test['weather_main'].value_counts()

Clear           1349
Clouds          1261
Rain             916
Mist             595
Thunderstorm     257
Drizzle          237
Haze             109
Fog               91
Snow               4
Smoke              2
Name: weather_main, dtype: int64

In [50]:
pd.get_dummies(train)

Unnamed: 0,date_time,is_holiday,temp_F,rain_1h,snow_1h,clouds_all,traffic_volume,weather_main_Clear,weather_main_Clouds,weather_main_Drizzle,weather_main_Fog,weather_main_Haze,weather_main_Mist,weather_main_Rain,weather_main_Smoke,weather_main_Snow,weather_main_Squall,weather_main_Thunderstorm
0,2012-10-02 09:00:00,True,59.234,0.0,0.0,40,5545,0,1,0,0,0,0,0,0,0,0,0
1,2012-10-02 10:00:00,True,61.178,0.0,0.0,75,4516,0,1,0,0,0,0,0,0,0,0,0
2,2012-10-02 11:00:00,True,61.574,0.0,0.0,90,4767,0,1,0,0,0,0,0,0,0,0,0
3,2012-10-02 12:00:00,True,62.564,0.0,0.0,90,5026,0,1,0,0,0,0,0,0,0,0,0
4,2012-10-02 13:00:00,True,64.382,0.0,0.0,75,4918,0,1,0,0,0,0,0,0,0,0,0
5,2012-10-02 14:00:00,True,65.426,0.0,0.0,1,5181,1,0,0,0,0,0,0,0,0,0,0
6,2012-10-02 15:00:00,True,68.036,0.0,0.0,1,5584,1,0,0,0,0,0,0,0,0,0,0
7,2012-10-02 16:00:00,True,69.278,0.0,0.0,1,6015,1,0,0,0,0,0,0,0,0,0,0
8,2012-10-02 17:00:00,True,69.782,0.0,0.0,20,5791,0,1,0,0,0,0,0,0,0,0,0
9,2012-10-02 18:00:00,True,67.910,0.0,0.0,20,4770,0,1,0,0,0,0,0,0,0,0,0


In [53]:
test['weather_main'].unique()

array(['Clear', 'Clouds', 'Thunderstorm', 'Snow', 'Haze', 'Rain', 'Mist',
       'Drizzle', 'Fog', 'Smoke'], dtype=object)