<a href="https://colab.research.google.com/github/kyook17/UIUC_BADM/blob/main/BADM576_DS/576_Categorical_Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Categorical Encoding

* One Hot Encoding
> - OneHotEncoder
> - OneHotEncoder + ColumnTransformer
> - OneHotEncoder + SimpleImputer in a pipeline

* Ordinal Encoding
> - Ordinal Encoder

* Target Encoding
> - Target Encoding or Mean Encoding

* One Hot Encoding for frequent categories

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

from sklearn.impute import SimpleImputer



In [None]:
# Read dataset
df = pd.read_csv("https://raw.githubusercontent.com/ashish-cell/BADM-211-FA21/main/Data/medical_cost/insurance.csv")

In [None]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In this dataset, `charges` is the outcome variable.


In [None]:
X = df.drop(columns = ["charges"])

y = df["charges"]

In [None]:
# let's separate into training and testing set

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
train_X.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
dtype: object

In [None]:
cat_vars = [var for var in df.columns if df[var].dtype == 'O']

In [None]:
cat_vars

['sex', 'smoker', 'region']

# One hot encoding

In [None]:
# To keep things simple, fill missing values with fillna here. A later section shows how to handle them in a pipeline.
# Reassigning (instead of inplace=True) avoids modifying a slice of the original dataframe.
train_X = train_X.fillna("Missing")
test_X = test_X.fillna("Missing")

In [None]:
train_cat = train_X[cat_vars]

test_cat = test_X[cat_vars]

In [None]:
train_cat

Unnamed: 0,sex,smoker,region
560,female,no,northwest
1285,female,no,northeast
1142,female,no,southeast
969,female,no,southeast
486,female,no,northwest
...,...,...,...
1095,female,no,northeast
1130,female,no,southeast
1294,male,no,northeast
860,female,yes,southwest


In [None]:
# Initialize the encoder

ohe = OneHotEncoder(
    categories="auto",
    drop="first",  # drop the first category of each feature; use drop=None to keep all categories
    sparse_output=False,  # return a dense array instead of a sparse matrix (requires scikit-learn >= 1.2; older versions use `sparse`)
    handle_unknown="error"  # raise an error if transform sees a category that was not present during fit
)

ohe.set_output(transform= "pandas")

In [None]:
# Use fit method on the object "ohe". This will learn column names to be created from the categorical columns

ohe.fit(train_cat)

In [None]:
# Here are the categories that the encoder learned from the train data

ohe.categories_

[array(['female', 'male'], dtype=object),
 array(['no', 'yes'], dtype=object),
 array(['northeast', 'northwest', 'southeast', 'southwest'], dtype=object)]

In [None]:
# Here are the names of the columns that will be created in the new data frame. Note that there are fewer columns here than the output in the cell above. Why?
ohe.get_feature_names_out()

array(['sex_male', 'smoker_yes', 'region_northwest', 'region_southeast',
       'region_southwest'], dtype=object)

In [None]:
test_cat

Unnamed: 0,sex,smoker,region
764,female,no,northeast
887,female,no,northwest
890,female,yes,northwest
1293,male,no,northwest
259,male,yes,northwest
...,...,...,...
109,male,yes,southeast
575,female,no,northwest
535,male,no,northeast
543,female,yes,southeast


In [None]:
# fit transform the training data

train_ohe_enc = ohe.fit_transform(train_cat)


In [None]:
train_ohe_enc  # the values shown here are 0.0 and 1.0. We could have set the dtype = "int" while initializing the encoder to make them 0 and 1

Unnamed: 0,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
560,0.0,0.0,1.0,0.0,0.0
1285,0.0,0.0,0.0,0.0,0.0
1142,0.0,0.0,0.0,1.0,0.0
969,0.0,0.0,0.0,1.0,0.0
486,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...
1095,0.0,0.0,0.0,0.0,0.0
1130,0.0,0.0,0.0,1.0,0.0
1294,1.0,0.0,0.0,0.0,0.0
860,0.0,1.0,0.0,0.0,1.0


In [None]:
test_ohe_enc = ohe.transform(test_cat) # transform the test data

In [None]:
test_ohe_enc

Unnamed: 0,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
764,0.0,0.0,0.0,0.0,0.0
887,0.0,0.0,1.0,0.0,0.0
890,0.0,1.0,1.0,0.0,0.0
1293,1.0,0.0,1.0,0.0,0.0
259,1.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...
109,1.0,1.0,0.0,1.0,0.0
575,0.0,0.0,1.0,0.0,0.0
535,1.0,0.0,0.0,0.0,0.0
543,0.0,1.0,0.0,1.0,0.0


# What if we encounter a new category during the test/inference stage?

In [None]:
new_test = test_cat.copy()  # copy so that the original test set is not modified

# Add a category to the region column that was not seen during training
new_test.iloc[len(new_test) - 2, new_test.columns.get_loc("region")] = "lost"

In [None]:
ohe.transform(new_test)



Unnamed: 0,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
764,0.0,0.0,0.0,0.0,0.0
887,0.0,0.0,1.0,0.0,0.0
890,0.0,1.0,1.0,0.0,0.0
1293,1.0,0.0,1.0,0.0,0.0
259,1.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...
109,1.0,1.0,0.0,1.0,0.0
575,0.0,0.0,1.0,0.0,0.0
535,1.0,0.0,0.0,0.0,0.0
543,0.0,1.0,0.0,0.0,0.0


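With `handle_unknown="error"`, transforming data that contains the unseen category `lost` raises an error. We can use the `Explain Error` option to understand what the problem is. However, this may not be very helpful.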

In [None]:
# Initialize the encoder with handle_unknown = "ignore"

ohe = OneHotEncoder(
    categories="auto",
    drop="first",  # drop the first category of each feature; use drop=None to keep all categories
    sparse_output=False,  # return a dense array instead of a sparse matrix (requires scikit-learn >= 1.2; older versions use `sparse`)
    handle_unknown="ignore"  # encode a category not seen during fit as all zeros for that feature
)

ohe.set_output(transform= "pandas")

In [None]:
train_ohe_enc = ohe.fit_transform(train_cat)


In [None]:
ohe.transform(new_test)



Unnamed: 0,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
764,0.0,0.0,0.0,0.0,0.0
887,0.0,0.0,1.0,0.0,0.0
890,0.0,1.0,1.0,0.0,0.0
1293,1.0,0.0,1.0,0.0,0.0
259,1.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...
109,1.0,1.0,0.0,1.0,0.0
575,0.0,0.0,1.0,0.0,0.0
535,1.0,0.0,0.0,0.0,0.0
543,0.0,1.0,0.0,0.0,0.0


## Alternatively, we can set up an infrequent category when initializing `OneHotEncoder`, and set `handle_unknown` to `infrequent_if_exist`.

In [None]:
# Initialize the encoder with an infrequent category

ohe = OneHotEncoder(
    categories="auto",
    drop="first",  # drop the first category of each feature; use drop=None to keep all categories
    sparse_output=False,  # return a dense array instead of a sparse matrix (requires scikit-learn >= 1.2; older versions use `sparse`)
    handle_unknown="infrequent_if_exist",  # map a category not seen during fit to the infrequent category, if one exists
    max_categories=4  # upper limit on output categories per feature, counting the infrequent category
)

ohe.set_output(transform= "pandas")

In [None]:
train_ohe_enc = ohe.fit_transform(train_cat)

In [None]:
train_ohe_enc

Unnamed: 0,sex_male,smoker_yes,region_southeast,region_southwest,region_infrequent_sklearn
560,0.0,0.0,0.0,0.0,1.0
1285,0.0,0.0,0.0,0.0,0.0
1142,0.0,0.0,1.0,0.0,0.0
969,0.0,0.0,1.0,0.0,0.0
486,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
1095,0.0,0.0,0.0,0.0,0.0
1130,0.0,0.0,1.0,0.0,0.0
1294,1.0,0.0,0.0,0.0,0.0
860,0.0,1.0,0.0,1.0,0.0


In [None]:
ohe.transform(new_test)



Unnamed: 0,sex_male,smoker_yes,region_southeast,region_southwest,region_infrequent_sklearn
764,0.0,0.0,0.0,0.0,0.0
887,0.0,0.0,0.0,0.0,1.0
890,0.0,1.0,0.0,0.0,1.0
1293,1.0,0.0,0.0,0.0,1.0
259,1.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...
109,1.0,1.0,1.0,0.0,0.0
575,0.0,0.0,0.0,0.0,1.0
535,1.0,0.0,0.0,0.0,0.0
543,0.0,1.0,0.0,0.0,1.0


## Whether to raise an error, ignore, or map to an infrequent category when a new category appears depends on what we expect during inference.

For example, if we only get requests from specific countries and a new region or country suddenly shows up, we may want to raise an `error` as a data validation check. If we expect that new countries can show up, we may want to use `ignore` or an infrequent category.

# Setting different schemes for different categorical columns

So far, we created a list of categorical columns and ran the encoder on a subset of the dataframe containing only those columns.

What if there were separate sets of variables, and we wanted to use one encoder for one set and a different encoder for another?


`ColumnTransformer` gives us the flexibility to run multiple encoders, each on its own list of variables.

It takes an argument `transformers`, to which we pass a `list` of `tuples`, where each `tuple` has three elements: `name`, `transformer`, and `columns`.

We can also specify what happens to the other columns in the dataset using the `remainder` parameter, which takes `drop` (the default), `passthrough`, or an estimator to apply to the remaining columns.


* `transformers`: list of tuples
List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.

* `remainder`: {'drop', 'passthrough'} or estimator, default 'drop'

### Use Column Transformer to specify columns for encoding

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# Initialize the encoder

ohe = OneHotEncoder(
    categories="auto",
    drop="first",  # drop the first category of each feature; use drop=None to keep all categories
    sparse_output=False,  # return a dense array (the older `sparse` argument was removed in recent scikit-learn versions)
    handle_unknown="error",  # raise an error if transform sees a category that was not present during fit
)


In [None]:
# In this instance, we only give one transformer (just to keep things simple).

ct = ColumnTransformer(
    [("encd", ohe, cat_vars)], remainder="passthrough") # Columns that are passed through get the prefix "remainder__"

ct.set_output(transform= "pandas")

In [None]:
# train encoder

ct.fit(train_X)



In [None]:
# One hot encode the variables

X_train_enc = ct.transform(train_X)
X_test_enc =  ct.transform(test_X)



In [None]:
X_train_enc

Unnamed: 0,encd__sex_male,encd__smoker_yes,encd__region_northwest,encd__region_southeast,encd__region_southwest,remainder__age,remainder__bmi,remainder__children
560,0.0,0.0,1.0,0.0,0.0,46,19.950,2
1285,0.0,0.0,0.0,0.0,0.0,47,24.320,0
1142,0.0,0.0,0.0,1.0,0.0,52,24.860,0
969,0.0,0.0,0.0,1.0,0.0,39,34.320,5
486,0.0,0.0,1.0,0.0,0.0,54,21.470,3
...,...,...,...,...,...,...,...,...
1095,0.0,0.0,0.0,0.0,0.0,18,31.350,4
1130,0.0,0.0,0.0,1.0,0.0,39,23.870,5
1294,1.0,0.0,0.0,0.0,0.0,58,25.175,0
860,0.0,1.0,0.0,0.0,1.0,37,47.600,2


`ColumnTransformer` allows us to apply multiple transformations in parallel and independently.

### Imputing Missing Values and Encoding Categorical

### Pipeline allows us to apply multiple transformers in a sequence

In [None]:
from sklearn.pipeline import Pipeline # Pipeline allows us to apply multiple transformers in a sequence

In [None]:
X = df.drop(columns = ["charges"])

y = df["charges"]

In [None]:
# Separate into training and testing set

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
# check for missing data

train_X.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
dtype: int64

In [None]:
# Initialize imputer
imputer =  SimpleImputer(strategy="constant", fill_value="missing")

imputer.set_output(transform= "pandas")

In [None]:
# Initialize the one-hot encoder
ohe = OneHotEncoder(
    categories="auto",
    drop="first",  # drop the first category of each feature; use drop=None to keep all categories
    sparse_output=False,  # return a dense array instead of a sparse matrix
    handle_unknown="error",  # raise an error if transform sees a category that was not present during fit
)

ohe.set_output(transform= "pandas")

In [None]:
ord = OrdinalEncoder(  # note: this name shadows Python's built-in ord()
    categories="auto",
    handle_unknown="error",  # raise an error if transform sees a category that was not present during fit
)

ord.set_output(transform ="pandas")

In [None]:
# set up encoder and imputation in pipeline
# we only want to impute categorical variables

pipe_ohe = Pipeline(
    [("imputer", imputer),
        ("ohe", ohe),
    ]
)

pipe_ohe.set_output(transform= "pandas")

In [None]:
# set up encoder and imputation in pipeline
# we only want to impute categorical variables

pipe_ord = Pipeline(
    [("imputer", imputer),
        ("ord", ord),
    ]
)

pipe_ord.set_output(transform= "pandas")

In [None]:
# select the variables to encode

ct = ColumnTransformer(
    [
    ("ohe", pipe_ohe, ["region"]),
    ("ord", pipe_ord, ["smoker", "sex"])
    ],
    remainder="passthrough") # Columns that are passed through get the prefix "remainder__"

ct.set_output(transform= "pandas")

In [None]:
ct.transformers

[('ohe',
  Pipeline(steps=[('imputer',
                   SimpleImputer(fill_value='missing', strategy='constant')),
                  ('ohe', OneHotEncoder(drop='first', sparse_output=False))]),
  ['region']),
 ('ord',
  Pipeline(steps=[('imputer',
                   SimpleImputer(fill_value='missing', strategy='constant')),
                  ('ord', OrdinalEncoder())]),
  ['smoker', 'sex'])]

In [None]:
# fit the column transformer (impute + encode) on the training data

ct.fit(train_X)

In [None]:
# transform data

X_train_fin = ct.transform(train_X)
X_test_fin = ct.transform(test_X)



In [None]:
X_test_fin

Unnamed: 0,ohe__region_northwest,ohe__region_southeast,ohe__region_southwest,ord__smoker,ord__sex,remainder__age,remainder__bmi,remainder__children
764,0.0,0.0,0.0,0.0,0.0,45,25.175,2
887,1.0,0.0,0.0,0.0,0.0,36,30.020,0
890,1.0,0.0,0.0,1.0,0.0,64,26.885,0
1293,1.0,0.0,0.0,0.0,1.0,46,25.745,3
259,1.0,0.0,0.0,1.0,1.0,19,31.920,0
...,...,...,...,...,...,...,...,...
109,0.0,1.0,0.0,1.0,1.0,63,35.090,0
575,1.0,0.0,0.0,0.0,0.0,58,27.170,0
535,0.0,0.0,0.0,0.0,1.0,38,28.025,1
543,0.0,1.0,0.0,1.0,0.0,54,47.410,0


# Target Encoding

In [None]:
!pip install category_encoders



In [None]:
import category_encoders

In [None]:
from category_encoders.target_encoder import TargetEncoder

In [None]:
mean_enc = TargetEncoder(smoothing=8   # smoothing controls how strongly each category mean is blended with the overall mean; higher values mean stronger regularization. The sample-count threshold is set separately with min_samples_leaf.
)

mean_enc.set_output(transform = "pandas")

In [None]:
mean_enc.fit(train_X, train_y) # Note that it takes both X and y while fitting

In [None]:
mean_enc.mapping

{'sex': sex
  1    12646.882679
  2    14012.122736
 -1    13346.089736
 -2    13346.089736
 dtype: float64,
 'smoker': smoker
  1     8578.322548
  2    31767.008418
 -1    13346.089736
 -2    13346.089736
 dtype: float64,
 'region': region
  1    12622.514246
  2    13333.008791
  3    14698.242993
  4    12611.500973
 -1    13346.089736
 -2    13346.089736
 dtype: float64}
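In the mapping above, the `-1` and `-2` entries appear to be the encoder's placeholders for unknown and missing categories, and both carry the overall training mean. To make the idea of blending a category mean with the overall mean concrete, here is a sketch in plain pandas on made-up data. The sigmoid weighting shown is one common formulation and is an assumption here; the exact formula used by `category_encoders` may differ between versions, and `min_samples_leaf = 1` is a hypothetical setting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "charges": [30000.0, 34000.0, 8000.0, 9000.0, 10000.0],
})

prior = toy["charges"].mean()  # overall mean of the target
stats = toy.groupby("smoker")["charges"].agg(["mean", "count"])

min_samples_leaf = 1  # assumed value for illustration
smoothing = 8.0

# Weight grows toward 1 as a category has more observations
weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))

# Blend of the category mean and the overall mean
encoding = weight * stats["mean"] + (1.0 - weight) * prior
print(encoding)
```

Each encoded value sits between the raw category mean and the overall mean, and categories with fewer observations are pulled closer to the overall mean.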

In [None]:
mean_enc.cols

['sex', 'smoker', 'region']

In [None]:
train_X = mean_enc.transform(train_X)
test_X = mean_enc.transform(test_X)

In [None]:
train_X.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
560,46,12646.882679,19.95,2,8578.322548,12622.514246
1285,47,12646.882679,24.32,0,8578.322548,13333.008791
1142,52,12646.882679,24.86,0,8578.322548,14698.242993
969,39,12646.882679,34.32,5,8578.322548,14698.242993
486,54,12646.882679,21.47,3,8578.322548,12622.514246


# One Hot Encoding for Frequent Categories

In [None]:
ohe_enc = OneHotEncoder(
    handle_unknown="infrequent_if_exist",  # unseen categories will be treated like the less frequent ones
    max_categories=5,  # upper limit on output categories per feature, counting the infrequent category
    sparse_output=False,  # pandas output requires a dense array
)

ohe_enc.set_output(transform="pandas")

ohe_enc.fit(train_X)