# Categorical Data

This Kaggle notebook gives a good overview of the different methods for handling categorical data:

https://www.kaggle.com/code/shahules/an-overview-of-encoding-techniques

The preprocessing module currently handles categorical data with 4 methods:
- Passthrough
- Ordinal (Label Encoding)
- One Hot Encoding (OHE)
- Target Encoding

The following methods are not implemented yet:
- Feature Hashing
- Cyclical Encoding

In [1]:
import os
import sys

import pandas as pd

os.chdir("../../")
sys.path.insert(0, os.getcwd())

In [2]:
from morai.forecast import preprocessors

We'll load the data into a `df` dataframe and then use the `sex` column to demonstrate each encoding.

In [5]:
df = pd.read_csv("tests/files/experience/simple_experience.csv")

In [6]:
df.head()

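|   | sex | smoker_status | smoker_status_encode | rate | sex_rate | smoker_rate | exposed | expected |
|---|-----|---------------|----------------------|------|----------|-------------|---------|----------|
| 0 | F   | NS            | 0                    | 0.72 | 0.8      | 0.9         | 50      | 0.36     |
| 1 | F   | S             | 1                    | 0.88 | 0.8      | 1.1         | 50      | 0.44     |
| 2 | M   | NS            | 0                    | 1.08 | 1.2      | 0.9         | 100     | 1.08     |
| 3 | M   | S             | 1                    | 1.32 | 1.2      | 1.1         | 100     | 1.32     |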


## Passthrough

In [8]:
preprocess_dict = preprocessors.preprocess_data(
    df,
    feature_dict={
        "target": ["rate"],
        "weight": [],
        "passthrough": ["sex"],
    },
    standardize=False,
    add_constant=False,
)
preprocess_dict["X"]

2024-05-15 22:43:22 | morai.forecast.preprocessors | INFO | model target: ['rate']
2024-05-15 22:43:22 | morai.forecast.preprocessors | INFO | passthrough - (generally numeric): ['sex']


|   | sex |
|---|-----|
| 0 | F   |
| 1 | F   |
| 2 | M   |
| 3 | M   |


## Ordinal

In [9]:
preprocess_dict = preprocessors.preprocess_data(
    df,
    feature_dict={
        "target": ["rate"],
        "weight": [],
        "ordinal": ["sex"],
    },
    standardize=False,
    add_constant=False,
)
preprocess_dict["X"]

2024-05-15 22:44:13 | morai.forecast.preprocessors | INFO | model target: ['rate']
2024-05-15 22:44:13 | morai.forecast.preprocessors | INFO | ordinal - ordinal encoded: ['sex']


|   | sex |
|---|-----|
| 0 | 0   |
| 1 | 0   |
| 2 | 1   |
| 3 | 1   |
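
The result above maps `F` to 0 and `M` to 1. As a rough illustration of the idea, here is a minimal sketch in plain Python. It assumes categories are numbered in sorted order, which matches this output but is not confirmed to be the rule `morai` applies; the variable names are illustrative only.

```python
# Ordinal (label) encoding sketch: each category is replaced by an integer code.
# Assumption: codes follow the sorted order of the distinct categories.
values = ["F", "F", "M", "M"]

categories = sorted(set(values))                        # ['F', 'M']
mapping = {cat: code for code, cat in enumerate(categories)}
codes = [mapping[v] for v in values]

print(mapping)  # {'F': 0, 'M': 1}
print(codes)    # [0, 0, 1, 1]
```

The integer codes imply an ordering between categories, so this encoding tends to suit features with a natural order or models that do not rely on numeric distance.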


## One Hot Encoding (OHE)

In [10]:
preprocess_dict = preprocessors.preprocess_data(
    df,
    feature_dict={
        "target": ["rate"],
        "weight": [],
        "ohe": ["sex"],
    },
    standardize=False,
    add_constant=False,
)
preprocess_dict["X"]

2024-05-15 22:44:26 | morai.forecast.preprocessors | INFO | model target: ['rate']
2024-05-15 22:44:26 | morai.forecast.preprocessors | INFO | nominal - one hot encoded: ['sex']


|   | sex_F | sex_M |
|---|-------|-------|
| 0 | 1     | 0     |
| 1 | 1     | 0     |
| 2 | 0     | 1     |
| 3 | 0     | 1     |
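
The output above has one indicator column per category. The following plain-Python sketch shows how such columns could be built. The `sex_<category>` naming is copied from the output; everything else is an illustrative assumption rather than the `morai` implementation.

```python
# One hot encoding sketch: one 0/1 indicator column per distinct category.
values = ["F", "F", "M", "M"]

categories = sorted(set(values))  # ['F', 'M']
one_hot = {
    f"sex_{cat}": [1 if v == cat else 0 for v in values]
    for cat in categories
}

print(one_hot)  # {'sex_F': [1, 1, 0, 0], 'sex_M': [0, 0, 1, 1]}
```

Each row has exactly one indicator set to 1, so the columns carry no implied ordering between categories.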


## Target (Nominal)

In [11]:
preprocess_dict = preprocessors.preprocess_data(
    df,
    feature_dict={
        "target": ["rate"],
        "weight": [],
        "nominal": ["sex"],
    },
    standardize=False,
    add_constant=False,
)
preprocess_dict["X"]

2024-05-15 22:44:56 | morai.forecast.preprocessors | INFO | model target: ['rate']
2024-05-15 22:44:56 | morai.forecast.preprocessors | INFO | nominal - weighted average of target encoded: ['sex']


|   | sex |
|---|-----|
| 0 | 0.8 |
| 1 | 0.8 |
| 2 | 1.2 |
| 3 | 1.2 |
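
The log line describes this encoding as a weighted average of the target per category. The sketch below reproduces the values 0.8 and 1.2 from the `rate` column in plain Python. The helper `target_encode` is hypothetical, and treating an empty weight list as equal weights is an assumption about how `morai` behaves.

```python
from collections import defaultdict


def target_encode(categories, targets, weights=None):
    """Replace each category with the weighted mean of the target in that category.

    Hypothetical helper for illustration; equal weights are assumed when none are given.
    """
    if weights is None:
        weights = [1.0] * len(categories)

    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for cat, target, weight in zip(categories, targets, weights):
        weighted_sum[cat] += target * weight
        weight_total[cat] += weight

    means = {cat: weighted_sum[cat] / weight_total[cat] for cat in weighted_sum}
    return [means[cat] for cat in categories]


sex = ["F", "F", "M", "M"]
rate = [0.72, 0.88, 1.08, 1.32]

encoded = target_encode(sex, rate)
print([round(v, 2) for v in encoded])  # [0.8, 0.8, 1.2, 1.2]
```

Passing the `exposed` column (50, 50, 100, 100) as `weights` gives the same result for this data, because exposure is constant within each `sex` group.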
