# Difference between OneHotEncoder and get_dummies

<img src="https://github.com/kevinprinsloo/My_Machine_Learning_Notebooks/blob/master/Full_End_to_End_ML_Project_House_Prices/images/Fig_2.png?raw=true?" 
        alt="Picture" 
        width="900" 
        height="400" 
        dpid="300" 
        style="display: block; margin: 0 auto" />

# Introduction

> Bottom line: Scikit-learn's `OneHotEncoder` and Pandas' `get_dummies` both achieve the same result. If you are building machine learning models, go for `OneHotEncoder`; for data analysis tasks, you can use either `OneHotEncoder` or `get_dummies`.

Most machine learning algorithms cannot work directly with categorical (non-numeric) variables, so you have to convert them to numeric features before building your model. This process of converting categorical features into numeric features is known as categorical data encoding. There are many encoding techniques available, but the most common and widely used is one-hot encoding.

Scikit-learn, a widely used machine learning library, provides the `OneHotEncoder` class for one-hot encoding. Pandas offers the `get_dummies()` function for the same purpose. The aim of this article is to provide a tutorial explaining how and when to use each.

You may have already come across the term “one-hot encoding”, but the sklearn documentation describes it as ***“Encode categorical integer features using a one-hot aka one-of-K scheme.”***, which is not very clear; well, to me it wasn't.

### What is One Hot Encoding anyway?

If a categorical column in a dataframe (df) has ***k*** unique categories, then applying one-hot encoding creates ***k*** new binary features, one per category. The original column is then dropped, and only the encoded features are used from that point on.

For example, let’s say, the categorical column ***Search Engine*** has 3 categories: ***Google***, ***Bing***, and ***Baidu***. The one-hot encoding will create 3 features as shown below.

<img src="https://github.com/kevinprinsloo/My_Machine_Learning_Notebooks/blob/master/Full_End_to_End_ML_Project_House_Prices/images/Fig_1.png?raw=true?" 
        alt="Picture" 
        width="800" 
        height="150" 
        dpid="300" 
        style="display: block; margin: 0 auto" />
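
As a rough illustration of the table above, the sketch below one-hot encodes a toy ***Search Engine*** column with `get_dummies`. The DataFrame `df_search` and its values are made up for this example, and `dtype=int` is passed only so that the result shows 0/1 regardless of the Pandas version.

```python
import pandas as pd

# Toy data (made up for illustration): one categorical column with k = 3 categories
df_search = pd.DataFrame({"Search Engine": ["Google", "Bing", "Baidu", "Google"]})

# One new 0/1 column per category; the original column is dropped
encoded = pd.get_dummies(df_search, columns=["Search Engine"], dtype=int)

print(encoded)
```

Each row has exactly one column set to 1, which is where the name "one-hot" comes from.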

## Let's download a Seaborn dataset

Using the ***tips*** dataset, we can work through a practical, hands-on demonstration of how one-hot encoding is done with Scikit-learn and Pandas.

In [89]:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

pd.options.mode.chained_assignment = None  # default='warn'

In [90]:
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'day', 'size']]

df.head()

Unnamed: 0,total_bill,tip,day,size
0,16.99,1.01,Sun,2
1,10.34,1.66,Sun,3
2,21.01,3.5,Sun,3
3,23.68,3.31,Sun,2
4,24.59,3.61,Sun,4


## Pandas get_dummies

The `get_dummies` function of Pandas is the simplest implementation of one-hot encoding.

`pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)`


- **data**: Series or DataFrame on which you want to apply one-hot encoding.
- **prefix**: String to prepend to the encoded column names. The default is None, in which case the original column name is used. For example, if ***'day'*** is the column name, passing `prefix=['day']` together with the default separator yields names such as ***day_Sat***. If more than one column is one-hot encoded, pass a list with one prefix per column.
- **prefix_sep**: The separator placed between the prefix and the category name. You can use **'-'** or **'_'** or any other string. The default is **'_'**.
- **dummy_na**: If set to **True**, an extra indicator column is added to flag missing values (NaN). The default is False.
- **columns**: List of columns to which one-hot encoding is applied.
- **sparse**: If set to True, the dummy columns are stored as sparse arrays; otherwise they are backed by regular NumPy arrays. The default is False.
- **drop_first**: If set to True, the first category is dropped, so the result has ***k-1*** columns for ***k*** categories. The default is False.
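
To make the parameters above more concrete, here is a small sketch on a made-up DataFrame (`toy` is not part of the tips data). It varies one option at a time; `dtype=int` is used only to get 0/1 output on any Pandas version.

```python
import pandas as pd

# Made-up data with one missing value in the categorical column
toy = pd.DataFrame({"day": ["Sat", "Sun", None, "Thur"], "size": [2, 3, 2, 4]})

# Default behaviour: one column per category, missing values become all zeros
basic = pd.get_dummies(toy, columns=["day"], dtype=int)

# Custom prefix and separator for the new column names
custom = pd.get_dummies(toy, columns=["day"], prefix="d", prefix_sep="-", dtype=int)

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(toy, columns=["day"], dummy_na=True, dtype=int)

# drop_first=True drops the first category (here 'Sat'), leaving k-1 columns
reduced = pd.get_dummies(toy, columns=["day"], drop_first=True, dtype=int)

print(list(basic.columns))
print(list(custom.columns))
print(list(reduced.columns))
```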

With the most basic implementation, all you need to do is pass the DataFrame and the names of the columns you'd like to one-hot encode. And that's it!

In [91]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [92]:
X_train = pd.get_dummies(X_train, columns=['day'])
X_test = pd.get_dummies(X_test, columns=['day'])

In [93]:
X_train.head()

Unnamed: 0,total_bill,size,day_Thur,day_Fri,day_Sat,day_Sun
234,15.53,2,0,0,1,0
227,20.45,4,0,0,1,0
180,34.65,4,0,0,0,1
5,25.29,4,0,0,0,1
56,38.01,4,0,0,1,0


Now that we've seen the full set, we can experiment with a small subset of the data and apply one-hot encoding to it. Let's have a look below.

In [94]:
cols = X_test.columns.tolist()

X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sat', 'Sun'], 'size': [2, 6]} )
pd.get_dummies(X_test_new, columns=['day']).head()

Unnamed: 0,total_bill,size,day_Sat,day_Sun
0,42,2,1,0
1,101,6,0,1


Below we can use `.reindex` to add the columns that are missing relative to the original set and fill them with zero.

In [95]:
X_test_new = pd.get_dummies(X_test_new, columns=['day'])
X_test_new.reindex(columns=cols).fillna(0) 

Unnamed: 0,total_bill,size,day_Thur,day_Fri,day_Sat,day_Sun
0,42,2,0.0,0.0,1,0
1,101,6,0.0,0.0,0,1
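
The added columns above come out as floats because `reindex` first inserts NaN and `fillna(0)` then replaces it. A possible variation, sketched here with a hard-coded `train_cols` list standing in for the saved training columns, is to pass `fill_value=0` to `reindex` so that no NaN is introduced in the first place.

```python
import pandas as pd

# Column layout of the encoded training data (hard-coded here for illustration)
train_cols = ["total_bill", "size", "day_Thur", "day_Fri", "day_Sat", "day_Sun"]

X_new = pd.DataFrame({"total_bill": [42, 101], "day": ["Sat", "Sun"], "size": [2, 6]})
X_new = pd.get_dummies(X_new, columns=["day"], dtype=int)

# fill_value=0 fills the added columns directly, so they stay integer-typed
aligned = X_new.reindex(columns=train_cols, fill_value=0)

print(aligned)
```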


## Scikit-learn One Hot Encoding

In order to use one-hot encoding from Scikit-learn, you need to import `OneHotEncoder` from `sklearn.preprocessing`.

`OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')`

- **categories**: The default value '**auto**' determines the unique categories of each column from the training data. You can also pass them explicitly as a list of lists, one per column.
- **drop**: If set to '**first**', the first category of each feature is dropped; if set to '**if_binary**', the first category is dropped only for features with exactly 2 categories. The default is None.
- **sparse**: If set to True, the result of one-hot encoding is returned as a sparse matrix; otherwise a NumPy array is returned. The default is True. Note that in Scikit-learn 1.2 this parameter was renamed to `sparse_output`, and the old `sparse` name was removed in 1.4, so the `sparse=False` calls in this article need `sparse_output=False` on recent versions.
- **dtype**: The data type of the result.
- **handle_unknown**: This is an important parameter. If '**error**' (the default), a category that was not seen during `fit` raises an error during `transform`. If '**ignore**', the unknown category is encoded as all zeros across the one-hot columns of that feature.
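
The effect of `handle_unknown` can be seen in a small sketch. The two DataFrames are made up, and the default sparse output is converted with `.toarray()` so the code does not depend on the `sparse` / `sparse_output` naming.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 'Mon' appears only in the new data
train = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun"]})
new = pd.DataFrame({"day": ["Sat", "Mon"]})

# handle_unknown='error' (the default): an unseen category raises a ValueError
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(new).toarray()

print(raised)
print(lenient.categories_[0])
print(encoded)
```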

Let's have a look at a hands-on example using our dataset. First, we need to create a `OneHotEncoder` object. Next, we call the `.fit()` method on the training dataset and then call the `.transform()` method on both the training and test sets to extract the encoded features.

It's worth noting that you must pass only the categorical features into `OneHotEncoder`; otherwise it will treat every numerical column as categorical and encode it as well.

> Only pass categorical data when calling its `fit_transform` method.
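
A quick way to see why this matters is to count the output columns. In this sketch (toy values, not the full tips data), fitting on a numeric column as well turns every distinct bill amount into its own column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01], "day": ["Sun", "Sun", "Sat"]})

# Passing everything: the numeric column is treated as categorical too
enc_all = OneHotEncoder().fit(toy)
n_cols_all = enc_all.transform(toy).shape[1]  # 3 distinct bills + 2 days

# Passing only the categorical column
enc_day = OneHotEncoder().fit(toy[["day"]])
n_cols_day = enc_day.transform(toy[["day"]]).shape[1]  # 2 days

print(n_cols_all, n_cols_day)
```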

In [96]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ohe = OneHotEncoder(sparse=False)
enc_ohe = ohe.fit_transform(X_train[['day']])
# Returns a NumPy array of encoded data

# Display the encoded features for the training data
train_disp = pd.DataFrame(enc_ohe, columns=list(ohe.categories_[0]))
train_disp.head(10)

Unnamed: 0,Fri,Sat,Sun,Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0
5,0.0,0.0,1.0,0.0
6,1.0,0.0,0.0,0.0
7,0.0,0.0,1.0,0.0
8,0.0,1.0,0.0,0.0
9,0.0,0.0,0.0,1.0


In [97]:
# Transform the test data with the encoder fitted on the training data
test_onehot = ohe.transform(X_test[['day']])

# Display the encoded features for the test data
test_disp = pd.DataFrame(test_onehot, columns=list(ohe.categories_[0]))
test_disp.head(10)

Unnamed: 0,Fri,Sat,Sun,Thur
0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0
7,0.0,0.0,1.0,0.0
8,0.0,0.0,1.0,0.0
9,0.0,0.0,1.0,0.0


> Implementing dummy encoding with Scikit-learn
> You can simply do this by specifying the drop parameter to ‘first’.<br />
> `ohe = OneHotEncoder(categories='auto', drop='first',sparse=False)`
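
As a small sketch of what `drop='first'` does (toy data; `.toarray()` is used so the snippet does not depend on the `sparse` argument), the first category in sorted order gets no column of its own and is represented by a row of zeros:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with k = 4 categories
days = pd.DataFrame({"day": ["Fri", "Sat", "Sun", "Thur"]})

# drop='first' removes the column of the first category ('Fri')
dummy_enc = OneHotEncoder(drop="first").fit(days)
dummies = dummy_enc.transform(days).toarray()

print(dummies.shape)  # k-1 = 3 columns
print(dummies)
```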

Below is a slightly different way of applying `OneHotEncoder` that might be useful for some people. It does pretty much the same thing, but wraps the encoding in a small helper function.

In [98]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ohe = OneHotEncoder(sparse=False)
ohe.fit(X_train[['day']])

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=False)

In [114]:
ohe.categories_[0]

array(['Fri', 'Sat', 'Sun', 'Thur'], dtype=object)

In [126]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(X_train[['day']])

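def get_ohe(df):
    # Encode 'day' with the fitted encoder; name the new columns after the learned categories
    df_tmp = pd.DataFrame(data=ohe.transform(df[['day']]), columns=list(ohe.categories_[0]))
    # Drop the original column without modifying the caller's DataFrame
    df = df.drop(columns=['day']).reset_index(drop=True)
    return pd.concat([df, df_tmp], axis=1)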

X_train = get_ohe(X_train)
X_test = get_ohe(X_test)

In [100]:
X_train.head(10) 

Unnamed: 0,total_bill,size,Fri,Sat,Sun,Thur
0,26.88,4,0.0,0.0,1.0,0.0
1,32.68,2,0.0,0.0,0.0,1.0
2,17.89,2,0.0,0.0,1.0,0.0
3,20.49,2,0.0,1.0,0.0,0.0
4,48.17,6,0.0,0.0,1.0,0.0
5,9.6,2,0.0,0.0,1.0,0.0
6,12.03,2,1.0,0.0,0.0,0.0
7,29.93,4,0.0,0.0,1.0,0.0
8,20.69,4,0.0,1.0,0.0,0.0
9,14.26,2,0.0,0.0,0.0,1.0


## Label encoding

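If you would like to keep the logical order of an ordinal categorical variable, you could use **label encoding** with `LabelEncoder()` instead of the two types of encoding that we’ve discussed above. Bear in mind that `LabelEncoder` assigns integer codes in sorted order of the values it sees, so the codes follow the logical order only when that order matches the sorted order.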

Let’s say we have a categorical variable **Sport Medals** with three categories called “**Bronze**”, “**Silver**” and “**Gold**”. The natural order of these categories is:

> Bronze < Silver < Gold => 0 < 1 < 2

In [141]:
df_med = pd.DataFrame( {'Medals Won': [50, 200, 100], 'Medal': ['Gold', 'Silver', 'Bronze'], 'Prize_Money': [1000, 500, 100]} )
pd.get_dummies(df_med, columns=['Medal']).head()

Unnamed: 0,Medals Won,Prize_Money,Medal_Bronze,Medal_Gold,Medal_Silver
0,50,1000,0,1,0
1,200,500,0,0,1
2,100,100,1,0,0


One advantage of label encoding is that it does not expand the feature space at all, as we just replace category names with numbers. Here, we do not use dummy variables.

The major disadvantage of label encoding is that machine learning algorithms may assume a numeric relationship between the encoded categories. For example, an algorithm may interpret **Gold** (2) as two times better than **Silver** (1), even though no such relationship exists between the categories.

To avoid this, label encoding should only be applied to target (y) values, not to input (X) values.

In [143]:
from sklearn.preprocessing import LabelEncoder

# Prize_Money is encoded here because its sorted order (100 < 500 < 1000)
# matches Bronze < Silver < Gold; the medal names themselves would be
# coded alphabetically (Bronze, Gold, Silver)
df_med['Medal_enc'] = LabelEncoder().fit_transform(df_med['Prize_Money'])
df_med.head()

Unnamed: 0,Medals Won,Medal,Prize_Money,Medal_enc
0,50,Gold,1000,2
1,200,Silver,500,1
2,100,Bronze,100,0


The new encoded data column (**Medal_enc**) has been added on the right. We can now remove the original **Medal** column.
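
If the goal is to encode an ordered input feature with an explicit order, one option not used in this article is Scikit-learn's `OrdinalEncoder`, which accepts the category order directly. The sketch below uses a made-up `medals` frame and contrasts it with `LabelEncoder`, which sorts the labels alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up data for illustration
medals = pd.DataFrame({"Medal": ["Gold", "Silver", "Bronze"]})

# LabelEncoder sorts the labels, so the codes follow alphabetical order:
# Bronze -> 0, Gold -> 1, Silver -> 2
label_codes = LabelEncoder().fit_transform(medals["Medal"])

# OrdinalEncoder takes an explicit order (one list per column):
# Bronze -> 0, Silver -> 1, Gold -> 2
ordinal = OrdinalEncoder(categories=[["Bronze", "Silver", "Gold"]])
ordinal_codes = ordinal.fit_transform(medals[["Medal"]]).ravel()

print(label_codes)
print(ordinal_codes)
```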

## Differences between OneHotEncoder and get_dummies


> Both `OneHotEncoder` and `get_dummies` give the same results. Both types of encoding can be used to encode ordinal and nominal categorical variables.

> But there are some important differences between them. 

**(1)** `get_dummies` can’t handle unknown categories during transformation natively. You have to apply a workaround yourself, which takes extra manual steps. On the other hand, `OneHotEncoder` handles unknown categories natively. All you need to do is pass `handle_unknown='ignore'` to `OneHotEncoder`.

For example, in the **tips** dataset we used above, the ***day*** column contains four unique values: ***Thur***, ***Fri***, ***Sat***, and ***Sun***. If the test dataset contains a new category, say Mon or Tue, then `get_dummies` will create a new column ***day_Mon*** or ***day_Tue***, which is inconsistent with the training data and will eventually cause the model to fail at prediction time.

**`get_dummies` example:** `get_dummies` doesn’t store any information about the categories in the training data, so it may produce inconsistent features for the train and test sets. In the example below, ***X_test_new*** contains only 2 categories, “**Sat**” and “**Mon**”. As expected, `get_dummies` creates only 2 columns, “**day_Mon**” and “**day_Sat**”, which is inconsistent with the training data columns.

In [108]:
X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sat', 'Mon'], 'size': [2, 6]} )
X_test_new

Unnamed: 0,total_bill,day,size
0,42,Sat,2
1,101,Mon,6


In [105]:
pd.get_dummies(X_test_new, columns=['day']).head()

Unnamed: 0,total_bill,size,day_Mon,day_Sat
0,42,2,0,1
1,101,6,1,0


Though `get_dummies` can’t handle unknown categories natively, you can get around this inconsistency with the technique below. Save the columns of the training set and load them when predicting on the test set. Then apply `reindex` with those columns and fill the missing values with zero; this gives you the same features as the training set and drops any column for a category that was not seen in training. Refer below.

In [106]:
X_test_new = pd.get_dummies(X_test_new, columns=['day'])
X_test_new.reindex(columns=cols).fillna(0)

Unnamed: 0,total_bill,size,day_Thur,day_Fri,day_Sat,day_Sun
0,42,2,0.0,0.0,1,0.0
1,101,6,0.0,0.0,0,0.0
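
Another way to get consistent columns from `get_dummies`, sketched here under the assumption that the training categories are known up front (the `train_days` list is hard-coded for illustration), is to convert the column to a categorical type with a fixed set of categories. `get_dummies` then emits one column per declared category, whether or not it occurs in the data.

```python
import pandas as pd

# Categories seen in the training data (hard-coded here for illustration)
train_days = ["Thur", "Fri", "Sat", "Sun"]

X_new = pd.DataFrame({"total_bill": [42, 101], "day": ["Sat", "Mon"], "size": [2, 6]})

# Replace categories that were not seen in training with NaN
known = X_new["day"].where(X_new["day"].isin(train_days))

# A fixed category list makes get_dummies produce the same columns every time
X_new["day"] = pd.Categorical(known, categories=train_days)
encoded = pd.get_dummies(X_new, columns=["day"], dtype=int)

print(encoded)
```

The unknown 'Mon' row ends up with zeros in every day column, which mirrors what `OneHotEncoder` does with `handle_unknown='ignore'`.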


In [101]:
X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sat', 'Sun'], 'size': [2, 6]} )

**OneHotEncoder example:** The `OneHotEncoder` object stores the information about categories from the training dataset. So, when it is created with `handle_unknown='ignore'` and encounters an unknown category during transformation on the test set, it will ignore it, and the number of features will remain the same as in the training data.

In the example below, even though there is an unknown category ‘**Mon**’ in `X_test_new`, `OneHotEncoder` ignores the new category and makes sure that the final features are the same as in the training data.

In [127]:
X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sat', 'Mon'], 'size': [2, 6]} )
get_ohe(X_test_new)

Unnamed: 0,total_bill,size,Fri,Sat,Sun,Thur
0,42,2,0.0,1.0,0.0,0.0
1,101,6,0.0,0.0,0.0,0.0


**(2)** If you want to put your machine learning model into production:

- Scikit-learn's `Pipeline` is invaluable for production code. However, `get_dummies` cannot be used in a pipeline directly; it requires you to write your own transformer.
- `OneHotEncoder`, on the other hand, is compatible with the Scikit-learn `Pipeline` out of the box.

Let's have a look below to get a hands-on example of how `OneHotEncoder` is used in a pipeline: 

<br/>

## Pipeline Implementation

### **OneHotEncoder**

As you can see from the below example, `OneHotEncoder` is compatible with Pipeline and is rather simple to implement. 

In [128]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

numeric_preprocessor = Pipeline(steps=[
    ("scaler", MinMaxScaler())
])

categorical_preprocessor = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("categorical", categorical_preprocessor, ["day"]),
    ("numerical", numeric_preprocessor, ["total_bill", "size"])
])

pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LinearRegression())
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# Output: 0.5706168878130049

X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sun', 'Mon'], 'size': [2, 6]} )
pipe.predict(X_test_new)


array([ 4.8734386 , 11.61902722])

### get_dummies 

In order to use `get_dummies` in a Scikit-learn pipeline, you have to write a custom transformer, `PreprocessorTransformer` in the example below. You also need to pass it the list of columns that the model should be built on, so that the test data can be aligned to them.

In [129]:
X = df.drop('tip', axis=1)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)

cols = ["total_bill", "size", "day_Fri", "day_Sat", "day_Sun", "day_Thur"]
from sklearn.base import BaseEstimator, TransformerMixin

class PreprocessorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        X = pd.get_dummies(X, columns=['day'])
        X = X.reindex(columns=self.cols).fillna(0)
        return X[self.cols]

preprocessor = Pipeline(steps=[
    ("preprocessor", PreprocessorTransformer(cols))
])

pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LinearRegression())
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# Output: 0.5706168878130053

X_test_new = pd.DataFrame( {'total_bill': [42, 101], 'day': ['Sun', 'Mon'], 'size': [2, 6]} )

pipe.predict(X_test_new)


array([ 4.71096253, 10.60358328])
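
The transformer above needs the column list to be written out by hand. A possible variation, shown here as a sketch rather than the approach used in this article, is to let the transformer record the encoded columns during `fit`. The class name `DummiesTransformer`, its `columns` argument, and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class DummiesTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that remembers the dummy columns seen in fit."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Record the encoded column layout of the training data
        self.columns_out_ = pd.get_dummies(X, columns=self.columns).columns.tolist()
        return self

    def transform(self, X):
        encoded = pd.get_dummies(X, columns=self.columns, dtype=float)
        # Add missing training columns as 0 and drop columns of unknown categories
        return encoded.reindex(columns=self.columns_out_, fill_value=0.0)


# Toy data, made up for illustration
X_toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "day": ["Sun", "Sat", "Thur", "Fri", "Sat", "Sun"],
    "size": [2, 3, 3, 2, 4, 4],
})
y_toy = [1.01, 1.66, 3.50, 3.31, 3.61, 4.71]

pipe = Pipeline(steps=[
    ("dummies", DummiesTransformer(columns=["day"])),
    ("regressor", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

# 'Mon' was never seen during fit; its column is dropped by the reindex step
X_new = pd.DataFrame({"total_bill": [42, 101], "day": ["Sun", "Mon"], "size": [2, 6]})
features = pipe.named_steps["dummies"].transform(X_new)
predictions = pipe.predict(X_new)

print(features)
```

Even with this change, the extra class is code you have to maintain yourself, which is the trade-off compared with `OneHotEncoder`.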

# Summary

One-hot encoding is one of the categorical data encoding techniques. There are two popular and commonly used implementations: Scikit-learn's `OneHotEncoder` and Pandas' `get_dummies`. Both achieve the same result. If you are building machine learning models, go for `OneHotEncoder`.

**Advantages of Pandas `get_dummies()` over Scikit-learn `OneHotEncoder()`**
- `get_dummies()` returns encoded data with variable names, which helps track the applied encoding. As mentioned before, we can also add a prefix to the dummy variables of each categorical variable.
- `get_dummies()` returns the entire dataset, including the numerical variables.
- For quick data cleaning and EDA, it makes sense to use Pandas `get_dummies()`.

**Advantages of Scikit-learn `OneHotEncoder()` over Pandas `get_dummies()`**
- New categories can be handled with the `handle_unknown='ignore'` parameter of `OneHotEncoder()`, whereas with `pd.get_dummies()` they cause a mismatch between the train and test features.
- `OneHotEncoder()` works directly inside a Scikit-learn `Pipeline`, while `get_dummies()` needs a custom transformer.
- If you plan to transform a categorical column into multiple binary columns for a machine learning model, it’s better to use `OneHotEncoder()`.

For data analysis tasks you can consider either OneHotEncoder or get_dummies.