## Sources

- https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/
- https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
- https://www.kaggle.com/code/alexisbcook/categorical-variables/tutorial
- https://www.kaggle.com/code/dansbecker/using-categorical-data-with-one-hot-encoding


[Dataset](https://www.kaggle.com/datasets/jasleensondhi/hair-eye-color?resource=download)


# Introduction

Ordinal encoding assigns an integer to each label.</br>

&emsp;&emsp;```For example, Red is set to 0, Green to 1 and Blue to 2.```</br>


One-hot encoding creates a new column for each category and marks it with 1 if the row belongs to that category and 0 if it does not.</br>

&emsp;&emsp;```If the value is Green, the columns will be Red 0, Green 1, Blue 0.```</br>

Dummy encoding is identical to one-hot encoding except that the first column is dropped, leaving N-1 columns for N categories.</br>

&emsp;&emsp;```If the value is Red, the columns will be Green 0 and Blue 0.```</br>
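
The three encodings can be sketched on a toy color column. This is a minimal illustration, not part of the original notebook: the `colors` series is made up, and the category order is fixed explicitly because pandas would otherwise sort the labels alphabetically (Blue 0, Green 1, Red 2).

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
colors = pd.Series(["Red", "Green", "Blue", "Green"], name="Color")

# Fix the category order so that Red -> 0, Green -> 1, Blue -> 2
color_dtype = pd.CategoricalDtype(categories=["Red", "Green", "Blue"])
colors_cat = colors.astype(color_dtype)

# Ordinal encoding: one integer per label
ordinal = colors_cat.cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(colors_cat, dtype=int)

# Dummy encoding: same as one-hot, but the first category (Red) is dropped
dummy = pd.get_dummies(colors_cat, drop_first=True, dtype=int)

print(ordinal.tolist())
print(one_hot)
print(dummy)
```

A Red row is then recognizable in the dummy encoding as the row where both remaining columns are 0.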


## Category Types

<img src="table.png" alt="Comparison Table" width="1400">


Encoding categories as integers can bias a model: it may give more weight to Blue because 2 > 0, which is usually undesirable when the categories have no inherent order.</br>
One-hot encoding avoids this problem.
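
One way to see the artificial ordering is to compare distances between categories under both encodings. The sketch below assumes the Red 0, Green 1, Blue 2 codes from the introduction; the helper names `code_distance` and `one_hot_distance` are invented for this illustration.

```python
import numpy as np

# Integer codes from the introduction: Red -> 0, Green -> 1, Blue -> 2
codes = {"Red": 0, "Green": 1, "Blue": 2}

# Matching one-hot vectors: row i of the identity matrix encodes code i
one_hot = {name: np.eye(3)[code] for name, code in codes.items()}


def code_distance(a: str, b: str) -> int:
    """Absolute difference between the integer codes of two labels."""
    return abs(codes[a] - codes[b])


def one_hot_distance(a: str, b: str) -> float:
    """Euclidean distance between the one-hot vectors of two labels."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))


# With integer codes, Blue looks twice as far from Red as Green does
print(code_distance("Red", "Green"), code_distance("Red", "Blue"))

# With one-hot vectors, every pair of distinct labels is equally far apart
print(one_hot_distance("Red", "Green"), one_hot_distance("Red", "Blue"))
```

Distance-based and linear models are sensitive to this difference; the integer codes suggest that Green lies "between" Red and Blue, while the one-hot vectors do not.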


> The **advantages** of using one hot encoding include:
>
> - It allows the use of categorical variables in models that require numerical input.
>
> - It can improve model performance by providing more information to the model about the categorical variable.
>
> - It can help to avoid the problem of ordinality, which can occur when a categorical variable has a natural ordering (e.g. “small”, “medium”, “large”).
>
> </br>
>
> The **disadvantages** of using one hot encoding include:
>
> - It can lead to increased dimensionality, as a separate column is created for each category in the variable. This can make the model more complex and slow to train.
>
> - It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.
>
> - It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.
>
> - One-hot-encoding is a powerful technique to treat categorical data, but it can lead to increased dimensionality, sparsity and overfitting. It is important to use it cautiously, and consider other methods such as ordinal encoding or binary encoding.
>
> _geeksforgeeks.org_
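
The dimensionality and sparsity points in the quote can be made concrete with a small sketch. The `toy` frame below is hypothetical and only serves to count columns and zeros after encoding.

```python
import pandas as pd

# Hypothetical toy frame: 'city' has 50 distinct values, 'fuel' only two
toy = pd.DataFrame({
    "city": [f"city_{i}" for i in range(50)],
    "fuel": ["gas", "diesel"] * 25,
})

encoded = pd.get_dummies(toy, dtype=int)

# Two original columns become 50 + 2 = 52 one-hot columns
n_cols = encoded.shape[1]

# Each row contains exactly two 1s, so the rest of the matrix is zeros
zero_share = (encoded.to_numpy() == 0).mean()

print(n_cols, round(zero_share, 3))
```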


We refer to categorical variables that have a clear ordering as ordinal variables.</br>
For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
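
For an ordinal variable, the integer codes only help if they follow the real ordering. A minimal sketch, assuming a made-up `sizes` column: pandas orders categories alphabetically by default, so the intended order has to be stated explicitly.

```python
import pandas as pd

# Hypothetical ordinal column
sizes = pd.Series(["medium", "small", "large", "small"])

# Default category order is alphabetical: large -> 0, medium -> 1, small -> 2
default_codes = sizes.astype("category").cat.codes.tolist()

# Explicit order: small -> 0, medium -> 1, large -> 2
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordered_codes = sizes.astype(size_dtype).cat.codes.tolist()

print(default_codes)
print(ordered_codes)
```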

---

One-hot encoding is useful for categories that have no ordering relationship to each other. </br>
Many machine learning algorithms treat the order of numbers as meaningful. </br>
In other words, they may read a higher number as better or more important than a lower number.

If the category values have no natural ranking, integer codes can therefore lead to poor predictions.


# Imports


In [1]:
import math
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from plotly.subplots import make_subplots
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Usage Methods


In [2]:
df = pd.read_csv('HairEyeColor.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.sample(5)


Unnamed: 0,Hair,Eye,Sex,Freq
6,Red,Blue,Male,10
9,Brown,Hazel,Male,25
3,Blond,Brown,Male,3
16,Black,Brown,Female,36
21,Brown,Blue,Female,34


In [3]:
df.dtypes

Hair    object
Eye     object
Sex     object
Freq     int64
dtype: object

In [4]:
df['Hair'].unique()


array(['Black', 'Brown', 'Red', 'Blond'], dtype=object)

In [5]:
hair_code = df['Hair'].astype('category').cat.codes.values
df_cat_hair = df.copy()
df_cat_hair['Code'] = hair_code
df_cat_hair.head()


Unnamed: 0,Hair,Eye,Sex,Freq,Code
0,Black,Brown,Male,32,0
1,Brown,Brown,Male,53,2
2,Red,Brown,Male,10,3
3,Blond,Brown,Male,3,1
4,Black,Blue,Male,11,0


In [6]:
df_one_hot_hair = pd.get_dummies(df, columns=['Hair'])
df_one_hot_hair.tail()


Unnamed: 0,Eye,Sex,Freq,Hair_Black,Hair_Blond,Hair_Brown,Hair_Red
27,Hazel,Female,5,0,1,0,0
28,Green,Female,2,1,0,0,0
29,Green,Female,14,0,0,1,0
30,Green,Female,7,0,0,0,1
31,Green,Female,8,0,1,0,0


In [7]:
df_dummy_hair = pd.get_dummies(df, columns=['Hair'], drop_first=True)
df_dummy_hair.tail()


Unnamed: 0,Eye,Sex,Freq,Hair_Blond,Hair_Brown,Hair_Red
27,Hazel,Female,5,1,0,0
28,Green,Female,2,0,0,0
29,Green,Female,14,0,1,0
30,Green,Female,7,0,0,1
31,Green,Female,8,1,0,0


In [8]:
df_one_hot = pd.get_dummies(df)
df_one_hot.tail()


Unnamed: 0,Freq,Hair_Black,Hair_Blond,Hair_Brown,Hair_Red,Eye_Blue,Eye_Brown,Eye_Green,Eye_Hazel,Sex_Female,Sex_Male
27,5,0,1,0,0,0,0,0,1,1,0
28,2,1,0,0,0,0,0,1,0,1,0
29,14,0,0,1,0,0,0,1,0,1,0
30,7,0,0,0,1,0,0,1,0,1,0
31,8,0,1,0,0,0,0,1,0,1,0


In [9]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# integer-encode the first 15 eye colors
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(df['Eye'][:15])
print(integer_encoded)

# one-hot encode the integer codes (OneHotEncoder expects a 2-D array)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(-1, 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

# invert the first example back to its label
inverted = label_encoder.inverse_transform([np.argmax(onehot_encoded[0, :])])
print(inverted)


[1 1 1 1 0 0 0 0 3 3 3 3 2 2 2]
[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
['Brown']
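
As a possible shortcut, `OneHotEncoder` can also be fitted on the string labels directly, which skips the `LabelEncoder` step. The sketch below uses a small made-up `eyes` array rather than the notebook's data, and converts the default sparse result to a dense array with `.toarray()`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample standing in for df['Eye']; the encoder expects a 2-D input
eyes = np.array(["Brown", "Blue", "Hazel", "Green", "Brown"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(eyes).toarray()  # sparse by default, so densify

# The learned categories are sorted, and each one maps to one output column
categories = encoder.categories_[0].tolist()

# inverse_transform maps one-hot rows back to the original labels
restored = encoder.inverse_transform(encoded[:1])[0, 0]

print(categories)
print(encoded)
print(restored)
```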


In [10]:
from keras.utils import to_categorical
data = df['Eye'].astype('category').cat.codes.values[:15]
print(list(data))
encoded = to_categorical(data)
print(encoded)


[1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 3, 2, 2, 2]
[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
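
If Keras is not installed, the same kind of matrix can be built with NumPy alone by indexing an identity matrix with the integer codes. This is a sketch on made-up `codes`, not a replacement for `to_categorical` in every case.

```python
import numpy as np

# Hypothetical integer codes, e.g. produced by .cat.codes
codes = np.array([1, 1, 0, 3, 2])

# Row i of the identity matrix is the one-hot vector for code i
num_classes = int(codes.max()) + 1
encoded = np.eye(num_classes)[codes]

print(encoded)
```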


# Real Life Examples


In [11]:
car_data = pd.read_csv(
    'https://raw.githubusercontent.com/hardenedcotton/ML_Project/main/CarPrice_Assignment.csv')
car_data


Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,201,-1,volvo 145e (sw),gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845.0
201,202,-1,volvo 144ea,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045.0
202,203,-1,volvo 244dl,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485.0
203,204,-1,volvo 246,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470.0


## Categorizing


### Small Data

In [12]:
cols_small = [
    'fueltype',
    'aspiration',
    'carbody',
    'enginesize',
    'horsepower',
    'citympg',
    'highwaympg'
]
cat_cols_small = [
    'fueltype',
    'aspiration',
    'carbody',
]


In [13]:
car_df_small = car_data[cols_small].copy()
car_df_small.name = 'Small Data'

In [14]:
car_ord_small = car_df_small.copy()
for col in cat_cols_small:
    # replace each categorical column with its integer category codes
    car_ord_small[col] = car_df_small[col].astype('category').cat.codes
car_ord_small.name = 'Ordinal Encoding'

car_OH_small = pd.get_dummies(car_df_small)
car_OH_small.name = 'One-Hot Encoding'

car_dummy_small = pd.get_dummies(car_df_small, drop_first=True)
car_dummy_small.name = 'Dummy Encoding'

disp = [
    car_df_small,
    car_ord_small,
    car_OH_small,
    car_dummy_small,
]
for d in disp:
    print(d.name)
    display(d.sample(5, random_state=0))


Small Data


Unnamed: 0,fueltype,aspiration,carbody,enginesize,horsepower,citympg,highwaympg
52,gas,std,hatchback,91,68,31,38
181,gas,std,wagon,161,156,19,24
5,gas,std,sedan,136,110,19,25
18,gas,std,hatchback,61,48,47,53
188,gas,std,sedan,109,100,26,32


Ordinal Encoding


Unnamed: 0,fueltype,aspiration,carbody,enginesize,horsepower,citympg,highwaympg
52,1,0,2,91,68,31,38
181,1,0,4,161,156,19,24
5,1,0,3,136,110,19,25
18,1,0,2,61,48,47,53
188,1,0,3,109,100,26,32


One-Hot Encoding


Unnamed: 0,enginesize,horsepower,citympg,highwaympg,fueltype_diesel,fueltype_gas,aspiration_std,aspiration_turbo,carbody_convertible,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon
52,91,68,31,38,0,1,1,0,0,0,1,0,0
181,161,156,19,24,0,1,1,0,0,0,0,0,1
5,136,110,19,25,0,1,1,0,0,0,0,1,0
18,61,48,47,53,0,1,1,0,0,0,1,0,0
188,109,100,26,32,0,1,1,0,0,0,0,1,0


Dummy Encoding


Unnamed: 0,enginesize,horsepower,citympg,highwaympg,fueltype_gas,aspiration_turbo,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon
52,91,68,31,38,1,0,0,1,0,0
181,161,156,19,24,1,0,0,0,0,1
5,136,110,19,25,1,0,0,0,1,0
18,61,48,47,53,1,0,0,1,0,0
188,109,100,26,32,1,0,0,0,1,0


## Original Data

In [15]:
drop_cols = [
    'car_ID',
    'CarName',
    'price'
]
cat_cols = [
    'fueltype',
    'aspiration',
    'doornumber',
    'carbody',
    'drivewheel',
    'enginelocation',
    'enginetype',
    'cylindernumber',
    'fuelsystem',
]

In [16]:
car_df = car_data.drop(drop_cols, axis=1).copy()
car_df.name = 'Original Data'
y = car_data['price']

In [20]:
car_ord = car_df.copy()
for col in cat_cols:
    # replace each categorical column with its integer category codes
    car_ord[col] = car_df[col].astype('category').cat.codes
car_ord.name = 'Ordinal Encoding'

car_OH = pd.get_dummies(car_df)
car_OH.name = 'One-Hot Encoding'

car_dummy = pd.get_dummies(car_df, drop_first=True)
car_dummy.name = 'Dummy Encoding'

disp = [
    car_ord,
    car_OH,
    car_dummy,
]
for d in disp:
    print(d.name)
    display(d.sample(5, random_state=0))

Ordinal Encoding


Unnamed: 0,symboling,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,...,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg
52,1,1,0,1,2,1,0,93.1,159.1,64.2,...,2,91,1,3.03,3.15,9.0,68,5000,31,38
181,-1,1,0,0,4,2,0,104.5,187.8,66.5,...,3,161,5,3.27,3.35,9.2,156,5200,19,24
5,2,1,0,1,3,1,0,99.8,177.3,66.3,...,1,136,5,3.19,3.4,8.5,110,5500,19,25
18,2,1,0,1,2,1,0,88.4,141.1,60.3,...,4,61,1,2.91,3.03,9.5,48,5100,47,53
188,2,1,0,0,3,1,0,97.3,171.7,65.5,...,2,109,5,3.19,3.4,10.0,100,5500,26,32


One-Hot Encoding


Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,...,cylindernumber_twelve,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
52,1,93.1,159.1,64.2,54.1,1905,91,3.03,3.15,9.0,...,0,0,0,1,0,0,0,0,0,0
181,-1,104.5,187.8,66.5,54.1,3151,161,3.27,3.35,9.2,...,0,0,0,0,0,0,0,1,0,0
5,2,99.8,177.3,66.3,53.1,2507,136,3.19,3.4,8.5,...,0,0,0,0,0,0,0,1,0,0
18,2,88.4,141.1,60.3,53.2,1488,61,2.91,3.03,9.5,...,0,0,0,1,0,0,0,0,0,0
188,2,97.3,171.7,65.5,55.7,2300,109,3.19,3.4,10.0,...,0,0,0,0,0,0,0,1,0,0


Dummy Encoding


Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,...,cylindernumber_three,cylindernumber_twelve,cylindernumber_two,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
52,1,93.1,159.1,64.2,54.1,1905,91,3.03,3.15,9.0,...,0,0,0,1,0,0,0,0,0,0
181,-1,104.5,187.8,66.5,54.1,3151,161,3.27,3.35,9.2,...,0,0,0,0,0,0,0,1,0,0
5,2,99.8,177.3,66.3,53.1,2507,136,3.19,3.4,8.5,...,0,0,0,0,0,0,0,1,0,0
18,2,88.4,141.1,60.3,53.2,1488,61,2.91,3.03,9.5,...,1,0,0,1,0,0,0,0,0,0
188,2,97.3,171.7,65.5,55.7,2300,109,3.19,3.4,10.0,...,0,0,0,0,0,0,0,1,0,0


In [21]:
print(df_one_hot.dtypes)

Freq          int64
Hair_Black    uint8
Hair_Blond    uint8
Hair_Brown    uint8
Hair_Red      uint8
Eye_Blue      uint8
Eye_Brown     uint8
Eye_Green     uint8
Eye_Hazel     uint8
Sex_Female    uint8
Sex_Male      uint8
dtype: object


## Splitting


In [457]:
df_dict = {}
keys = [
    'x_train',
    'x_test',
    'y_train',
    'y_test'
]
for d in disp:
    # the same random_state gives every encoding the identical row split
    split = train_test_split(d, y, test_size=0.25, random_state=0)
    df_dict[d.name] = dict(zip(keys, split))

In [463]:
for splits in df_dict.values():
    for part in splits.values():
        print(list(part.iloc[:3].index))

[163, 61, 75]
[52, 181, 5]
[163, 61, 75]
[52, 181, 5]
[163, 61, 75]
[52, 181, 5]
[163, 61, 75]
[52, 181, 5]
[163, 61, 75]
[52, 181, 5]
[163, 61, 75]
[52, 181, 5]
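
One possible refinement, not used in this notebook, is to split first and fit the encoder on the training rows only, so that categories appearing only in the test rows do not break the column layout. The sketch below assumes a small made-up `toy` frame and `toy_price` target; `handle_unknown="ignore"` encodes unseen categories as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the car data
toy = pd.DataFrame({
    "carbody": ["sedan", "wagon", "sedan", "hatchback",
                "sedan", "wagon", "hatchback", "convertible"],
    "enginesize": [109, 141, 136, 91, 130, 161, 61, 152],
})
toy_price = pd.Series([10, 14, 13, 9, 12, 16, 6, 15])  # made-up target values

x_train, x_test, y_train, y_test = train_test_split(
    toy, toy_price, test_size=0.25, random_state=0)

# One-hot encode 'carbody' and pass the numeric column through unchanged
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["carbody"])],
    remainder="passthrough",
)

# Fit on the training rows only, then apply the same mapping to the test rows
train_encoded = encoder.fit_transform(x_train)
test_encoded = encoder.transform(x_test)

print(train_encoded.shape, test_encoded.shape)
```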


## Testing


In [464]:
class Score:
    def __init__(self, x_train, x_test, y_train, y_test) -> None:
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test

    def forest(self) -> dict:
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(self.x_train, self.y_train)
        preds = model.predict(self.x_test)
        mae = mean_absolute_error(self.y_test, preds)
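        # root-mean-squared error
        rms = math.sqrt(mean_squared_error(self.y_test, preds))
        # for regressors, score() returns R^2; it is reported as 'Accuracy' in percent
        acc = model.score(self.x_test, self.y_test)*100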

        return {'name': 'Random Forest',
                'MAE': mae,
                'RMS': rms,
                'Accuracy': acc}

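    def dtree(self) -> dict:
        # no random_state is set, so the tree's scores can vary from run to run
        model = DecisionTreeRegressor()
        model.fit(self.x_train, self.y_train)
        preds = model.predict(self.x_test)
        mae = mean_absolute_error(self.y_test, preds)
        # root-mean-squared error
        rms = math.sqrt(mean_squared_error(self.y_test, preds))
        # for regressors, score() returns R^2; it is reported as 'Accuracy' in percent
        acc = model.score(self.x_test, self.y_test)*100

        return {'name': 'Decision Tree',
                'MAE': mae,
                'RMS': rms,
                'Accuracy': acc}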

    def get_scores(self, key) -> str:
        funcs = [
            self.forest(),
            self.dtree(),
        ]
        result = ''
        for func in funcs:
            result += (
                f'{key} {func["name"]}\n'
                f'{"Accuracy:":<12}' f'{"":>15}{func["Accuracy"]:.2f}\n'
                f'{"MAE:":<12}' f'{"":>15}{func["MAE"]:.2f}\n'
                f'{"RMS:":<12}' f'{"":>15}{func["RMS"]:.2f}\n\n')
        return result
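
The 'Accuracy' value above comes from `model.score`, which for scikit-learn regressors is the coefficient of determination R² rather than a classification accuracy. A minimal NumPy sketch of R² and RMSE on made-up numbers (the helper names `rmse` and `r2` are chosen for this illustration):

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root-mean-squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Hypothetical targets and predictions
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred) * 100)  # same scale as the 'Accuracy' value
```

Unlike an accuracy, R² can be negative when the model predicts worse than the mean of the targets.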

In [465]:
def return_args(method) -> tuple:
    args = (
        method['x_train'],
        method['x_test'],
        method['y_train'],
        method['y_test']
    )
    return args

In [466]:
def get_results() -> None:
    results = ''
    for key, value in df_dict.items():
        args = return_args(value)
        results += Score(*args).get_scores(key)
    print(results)

In [468]:
def get_scores_dict() -> dict:
    scores = {}
    for key, value in df_dict.items():
        scorer = Score(*return_args(value))
        scores[key] = {
            'forest': scorer.forest(),
            'dtree': scorer.dtree(),
        }
    return scores

In [470]:
get_results()

Ordinal Encoding Random Forest
Accuracy:                  91.15
MAE:                       1745.32
RMS:                       2568.34

Ordinal Encoding Decision Tree
Accuracy:                  87.61
MAE:                       1987.96
RMS:                       3040.02

One-Hot Encoding Random Forest
Accuracy:                  90.98
MAE:                       1776.71
RMS:                       2593.95

One-Hot Encoding Decision Tree
Accuracy:                  88.19
MAE:                       1958.95
RMS:                       2968.15

Dummy Encoding Random Forest
Accuracy:                  90.56
MAE:                       1801.64
RMS:                       2652.54

Dummy Encoding Decision Tree
Accuracy:                  84.49
MAE:                       2253.77
RMS:                       3401.06




In [471]:
pprint(get_scores_dict())

{'Dummy Encoding': {'dtree': {'Accuracy': 88.17862236435526,
                              'MAE': 1949.2596153846155,
                              'RMS': 2969.001178038215,
                              'name': 'Decision Tree'},
                    'forest': {'Accuracy': 90.56439499164053,
                               'MAE': 1801.6416496794873,
                               'RMS': 2652.5354724934155,
                               'name': 'Random Forest'}},
 'One-Hot Encoding': {'dtree': {'Accuracy': 88.1280723422208,
                                'MAE': 1974.4903846153845,
                                'RMS': 2975.3423582906325,
                                'name': 'Decision Tree'},
                      'forest': {'Accuracy': 90.97657913860105,
                                 'MAE': 1776.7121147435898,
                                 'RMS': 2593.951975143054,
                                 'name': 'Random Forest'}},
 'Ordinal Encoding': {'dtree': {'Accuracy': 87.946748

In [474]:
score_dict = {}
regressors = ('dtree', 'forest')
scores = ('Accuracy', 'MAE', 'RMS')
# train the models once; calling get_scores_dict() inside the loops refits them on every access
all_scores = get_scores_dict()
for reg in regressors:
    score_dict[reg] = {}
    for sc in scores:
        score_dict[reg][sc] = {enc: all_scores[enc][reg][sc] for enc in all_scores}
pprint(score_dict)

{'dtree': {'Accuracy': {'Dummy Encoding': 88.56037354631714,
                        'One-Hot Encoding': 89.57817021621678,
                        'Ordinal Encoding': 87.60243408157416},
           'MAE': {'Dummy Encoding': 2132.5865384615386,
                   'One-Hot Encoding': 2056.7596153846152,
                   'Ordinal Encoding': 2093.3173076923076},
           'RMS': {'Dummy Encoding': 3211.8641147075564,
                   'One-Hot Encoding': 3367.620789764466,
                   'Ordinal Encoding': 3081.181469045974}},
 'forest': {'Accuracy': {'Dummy Encoding': 90.56439499164053,
                         'One-Hot Encoding': 90.97657913860105,
                         'Ordinal Encoding': 91.15386731572326},
            'MAE': {'Dummy Encoding': 1801.6416496794873,
                    'One-Hot Encoding': 1776.7121147435898,
                    'Ordinal Encoding': 1745.323909679487},
            'RMS': {'Dummy Encoding': 2652.5354724934155,
                    'One-Hot Encod

In [475]:
fig = make_subplots(
    rows=3, cols=2,
    row_titles=('Accuracy', 'MAE', 'RMS'),
    column_titles=('Decision Tree', 'Random Forest')
)

# one column per regressor, one row per metric, one marker per encoding
for col, (reg, metrics) in enumerate(score_dict.items()):
    for row, (metric, val) in enumerate(metrics.items()):
        fig.add_trace(
            go.Scatter(x=list(val.keys()), y=list(val.values()), mode='markers'),
            row=row+1, col=col+1
        )

fig.update_layout(
    template="plotly_dark",
    showlegend=False,
    height=800
)