# Goal
This post shows how to create one-hot-encoded features for categorical variables, using both scikit-learn and pandas.

**Reference**
* [scikit-learn documentation - sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

* [Insightsbot - Python One Hot Encoding with SciKit Learn](http://www.insightsbot.com/blog/McTKK/python-one-hot-encoding-with-scikit-learn)

* [pandas documentation - pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

# Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Create data for one-hot encoding

In [4]:
df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'], 
                       'size': ['large', 'medium', 'small','large', 'medium', 'small']})
df

| | fruit | size |
|---|---|---|
| 0 | apple | large |
| 1 | apple | medium |
| 2 | banana | small |
| 3 | orange | large |
| 4 | banana | medium |
| 5 | apple | small |


# Create one-hot encoded columns

## Using `OneHotEncoder` in `sklearn`

In [17]:
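encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# and prefixes each category with the input column name.
df_fruit_encoded = pd.DataFrame(encoder.fit_transform(df[['fruit']]).toarray(),
                                columns=encoder.get_feature_names_out())
df_fruit_encoded

| | fruit_apple | fruit_banana | fruit_orange |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 |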


## Using the `get_dummies` function in `pandas`

In [18]:
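# dtype=int keeps the 0/1 output shown below; pandas 2.0+ returns True/False by default.
pd.get_dummies(df['size'], dtype=int)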

| | large | medium | small |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 |
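
`get_dummies` can also encode several columns of a DataFrame in one call. The sketch below assumes the same sample data as above; the name `df_dummies` is illustrative. With the `columns` argument, pandas replaces each listed column with its dummy columns and prefixes them with the original column name.

```python
import pandas as pd

# Same sample data as in the post
df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'banana', 'orange', 'banana', 'apple'],
                        'size': ['large', 'medium', 'small', 'large', 'medium', 'small']})

# Encode both columns at once; dtype=int gives 0/1 instead of booleans
df_dummies = pd.get_dummies(df, columns=['fruit', 'size'], dtype=int)
print(df_dummies)
```

Unlike `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it suits quick exploration more than pipelines that must apply the same encoding to new data later.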
