In [1]:
# Install category_encoders using pip (not conda, which is
# on an old version)
#!pip install category_encoders

# Introduction to category encoders

Let's read in the example data used in the article ["Encoding Categorical Variables"](https://kiwidamien.github.io/encoding-categorical-variables.html). We deliberately use a small dataset so that it is easy to see what the encoders are doing.

In [2]:
import category_encoders as ce
import pandas as pd

print(f"You are using category encoders version {ce.__version__}")
if int(ce.__version__.split('.')[0]) < 2:
    print("Install version 2.0.0 or higher!")
    
df_train = pd.read_csv('https://raw.githubusercontent.com/kiwidamien/StackedTurtles/master/content/preprocessing/simple_loan_example.csv')
df_train

You are using category encoders version 2.0.0


Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,medical,A,True
1,130000,0.5,13800,medical,C,False
2,220000,0.4,33500,medical,B,False
3,65000,0.25,2000,refinance,B,False
4,60000,0.2,2200,refinance,B,True
5,45000,0.312,5500,auto,D,True
6,75000,0.111,2000,auto,B,True
7,24000,0.4,500,other,C,False


## Ordinal encoder

* Used for ordered categories (e.g. grade, where `A` is better than `B`, `B` is better than `C`, etc.)
* The actual values used **don't** matter for tree-based models; only the order matters.
* The actual values used **do** matter for linear, coefficient-based models.
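
To make the last two bullets concrete, here is a small sketch in plain Python (no encoder library involved). The helper `threshold_splits` is a hypothetical name introduced for illustration: it lists every way a single threshold, the kind of split a decision tree makes, can divide the rows. Two encodings that preserve the order of the grades produce identical splits, while an encoding that scrambles the order does not.

```python
# Grades from the example data, in row order.
grades = ['A', 'C', 'B', 'B', 'B', 'D', 'B', 'C']

def threshold_splits(values):
    """Return the set of row groups that fall on the left of each possible threshold."""
    splits = set()
    # The largest value is skipped: a threshold there leaves nothing on the right.
    for threshold in sorted(set(values))[:-1]:
        splits.add(frozenset(i for i, v in enumerate(values) if v <= threshold))
    return splits

evenly_spaced = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
unevenly_spaced = {'A': 1, 'B': 3, 'C': 5, 'D': 10}   # same order, different gaps
scrambled = {'A': 1, 'C': 2, 'B': 3, 'D': 4}          # order not preserved

splits_even = threshold_splits([evenly_spaced[g] for g in grades])
splits_uneven = threshold_splits([unevenly_spaced[g] for g in grades])
splits_scrambled = threshold_splits([scrambled[g] for g in grades])

print(splits_even == splits_uneven)     # True: same order, same candidate splits
print(splits_even == splits_scrambled)  # False: different order, different splits
```

A linear model, by contrast, multiplies the encoded value by a coefficient, so moving `D` from `4` to `10` changes its predictions.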

Let's start with the default encoding:

In [3]:
encoder_grade = ce.OrdinalEncoder(cols=['grade'], return_df=True)
encoder_grade.fit_transform(df_train)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,medical,1,True
1,130000,0.5,13800,medical,2,False
2,220000,0.4,33500,medical,3,False
3,65000,0.25,2000,refinance,3,False
4,60000,0.2,2200,refinance,3,True
5,45000,0.312,5500,auto,4,True
6,75000,0.111,2000,auto,3,True
7,24000,0.4,500,other,2,False


What happens if we have a new grade (e.g. `E`) that we didn't see in training?

In [4]:
df_test = df_train.copy()
df_test.loc[0, 'grade'] = 'E'

encoder_grade.transform(df_test)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,medical,-1.0,True
1,130000,0.5,13800,medical,2.0,False
2,220000,0.4,33500,medical,3.0,False
3,65000,0.25,2000,refinance,3.0,False
4,60000,0.2,2200,refinance,3.0,True
5,45000,0.312,5500,auto,4.0,True
6,75000,0.111,2000,auto,3.0,True
7,24000,0.4,500,other,2.0,False


Note that `E` was mapped to the value `-1`.
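
The behaviour above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not the library's implementation: it assumes labels are handed out in order of first appearance (which is what the output above suggests) and that any unseen level falls back to `-1`. The names `fit_ordinal` and `transform_ordinal` are made up for this example.

```python
def fit_ordinal(levels):
    """Assign 1, 2, 3, ... to levels in the order they first appear."""
    mapping = {}
    for level in levels:
        if level not in mapping:
            mapping[level] = len(mapping) + 1
    return mapping

def transform_ordinal(levels, mapping, unknown=-1):
    """Look each level up, falling back to `unknown` for levels not seen in training."""
    return [mapping.get(level, unknown) for level in levels]

train_grades = ['A', 'C', 'B', 'B', 'B', 'D', 'B', 'C']
mapping = fit_ordinal(train_grades)

test_grades = ['E'] + train_grades[1:]   # first row replaced by an unseen grade
encoded = transform_ordinal(test_grades, mapping)
print(encoded)   # [-1, 2, 3, 3, 3, 4, 3, 2]
```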

### Custom map

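By default, the encoder labels the levels seen in training with consecutive integers starting at `1`. In the output above the labels follow the order in which the levels first appear (`A` &rightarrow; `1`, `C` &rightarrow; `2`, `B` &rightarrow; `3`, `D` &rightarrow; `4`), which does not respect the ordering of the grades.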

Let's say we wanted `A` &rightarrow; `1`, `B` &rightarrow; `3`, `C` &rightarrow; `5`, and anything worse than `C` to map to `10`. We can implement our own map using a function:

In [5]:
def custom_grade(grade):
    encoding = {'A': 1, 'B': 3, 'C': 5}
    return encoding.get(grade, 10)

encoder_grade = ce.OrdinalEncoder(mapping=[{'col': 'grade', 'mapping': custom_grade}], return_df=True)
encoder_grade.fit_transform(df_train)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,medical,1,True
1,130000,0.5,13800,medical,5,False
2,220000,0.4,33500,medical,3,False
3,65000,0.25,2000,refinance,3,False
4,60000,0.2,2200,refinance,3,True
5,45000,0.312,5500,auto,10,True
6,75000,0.111,2000,auto,3,True
7,24000,0.4,500,other,5,False


This is particularly useful when neither the default labels nor a lexicographic sort match your intended ordering (e.g. `A+`, `A`, `A-` would not sort the way you typically want).
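
As a quick illustration of the `A+`/`A`/`A-` case, the sketch below compares Python's string sort with a hand-written order. The list `intended_order` is an assumption made for this example; the resulting dictionary could be used inside a function like `custom_grade` above.

```python
# Best grade first. This ordering is chosen by hand for illustration.
intended_order = ['A+', 'A', 'A-', 'B+', 'B', 'B-']

# Sorting the strings puts 'A' before 'A+', which is not the order we want.
print(sorted(intended_order))   # ['A', 'A+', 'A-', 'B', 'B+', 'B-']

# Build the encoding from the hand-written order instead.
grade_encoding = {grade: rank for rank, grade in enumerate(intended_order, start=1)}
print(grade_encoding['A+'], grade_encoding['A'], grade_encoding['A-'])   # 1 2 3
```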

## One Hot Encoder

One hot encoding is used for non-ordered categories if there are only a few levels. 

In this case, `purpose` has only 4 different levels, as we can see with `value_counts`:

In [6]:
df_train['purpose'].value_counts()

medical      3
refinance    2
auto         2
other        1
Name: purpose, dtype: int64

Each level of the `purpose` feature gets its own column:

In [7]:
encoder_purpose = ce.OneHotEncoder(cols='purpose', use_cat_names=True, return_df=True)
encoder_purpose.fit_transform(df_train)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose_medical,purpose_refinance,purpose_auto,purpose_other,grade,repaid
0,120000,0.1,3500,1,0,0,0,A,True
1,130000,0.5,13800,1,0,0,0,C,False
2,220000,0.4,33500,1,0,0,0,B,False
3,65000,0.25,2000,0,1,0,0,B,False
4,60000,0.2,2200,0,1,0,0,B,True
5,45000,0.312,5500,0,0,1,0,D,True
6,75000,0.111,2000,0,0,1,0,B,True
7,24000,0.4,500,0,0,0,1,C,False


Rows with unknown levels get zeros in all of the one hot columns, as we can see by setting the `purpose` of the first row to `"tuition"`:

In [8]:
df_test = df_train.copy()
df_test.loc[0, 'purpose'] = "tuition"

encoder_purpose.transform(df_test)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose_medical,purpose_refinance,purpose_auto,purpose_other,grade,repaid
0,120000,0.1,3500,0,0,0,0,A,True
1,130000,0.5,13800,1,0,0,0,C,False
2,220000,0.4,33500,1,0,0,0,B,False
3,65000,0.25,2000,0,1,0,0,B,False
4,60000,0.2,2200,0,1,0,0,B,True
5,45000,0.312,5500,0,0,1,0,D,True
6,75000,0.111,2000,0,0,1,0,B,True
7,24000,0.4,500,0,0,0,1,C,False
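
Conceptually, one hot encoding compares a value against each level learned during training. The plain-Python sketch below (with a hypothetical helper `one_hot`) shows why an unseen level such as `"tuition"` ends up as a row of zeros: it matches none of the known levels. It mirrors the behaviour shown above rather than the library's actual code.

```python
# Levels seen during training, in the column order used above.
known_levels = ['medical', 'refinance', 'auto', 'other']

def one_hot(value, levels):
    """Return one 0/1 indicator per known level."""
    return [int(value == level) for level in levels]

print(one_hot('auto', known_levels))      # [0, 0, 1, 0]
print(one_hot('tuition', known_levels))   # [0, 0, 0, 0]
```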


## Target Encoder

`TargetEncoder` replaces each level with the average value of the target within that level. In our case the target (`repaid`) is binary, so the average for a level is the fraction of that level that repaid.

By default, it smooths the value between the overall average and the average of the group. This helps prevent overfitting by giving more weight to the overall average when we only have a few examples in that level. You should keep this smoothing in actual problems, but we will turn it off here (`smoothing=0.0`) as it makes it easier to see what the `TargetEncoder` is doing.
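
To give a feel for the smoothing, here is a sketch of the blend as we understand it for category_encoders 2.x: the level mean and the overall mean are mixed with a sigmoid weight that depends on the number of examples in the level, and a level with a single example falls back to the overall mean. Treat the formula as an assumption rather than a specification (other versions may differ). The function name `smoothed_level_mean` is made up for this example. With the default `smoothing=1.0` and `min_samples_leaf=1` it gives the `purpose` values that show up in the multi-column section further down.

```python
import math

def smoothed_level_mean(level_mean, level_count, overall_mean,
                        min_samples_leaf=1, smoothing=1.0):
    """Blend a level's target mean with the overall mean (requires smoothing > 0)."""
    if level_count == 1:
        # A single example tells us very little, so use the overall mean.
        return overall_mean
    # The weight approaches 1 as the level gets more examples.
    weight = 1.0 / (1.0 + math.exp(-(level_count - min_samples_leaf) / smoothing))
    return overall_mean * (1.0 - weight) + level_mean * weight

overall = 4 / 8   # four of the eight loans were repaid

print(round(smoothed_level_mean(1 / 3, 3, overall), 4))   # medical: 0.3532
print(round(smoothed_level_mean(1.0, 2, overall), 6))     # auto: 0.865529
print(round(smoothed_level_mean(0.5, 2, overall), 6))     # refinance: 0.5
print(round(smoothed_level_mean(0.0, 1, overall), 6))     # other: 0.5
```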

First, let's show what fraction of each `purpose` ended up repaying their loan:

In [9]:
df_train.groupby(['purpose'])['repaid'].mean()

purpose
auto         1.000000
medical      0.333333
other        0.000000
refinance    0.500000
Name: repaid, dtype: float64

Here is the original dataframe again, so that the fractions above are easy to verify:

In [10]:
df_train

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,medical,A,True
1,130000,0.5,13800,medical,C,False
2,220000,0.4,33500,medical,B,False
3,65000,0.25,2000,refinance,B,False
4,60000,0.2,2200,refinance,B,True
5,45000,0.312,5500,auto,D,True
6,75000,0.111,2000,auto,B,True
7,24000,0.4,500,other,C,False


Now look at the encoding:

In [11]:
encoder_purpose = ce.TargetEncoder(cols='purpose', smoothing=0.0, return_df=True)
encoder_purpose.fit_transform(df_train, df_train.repaid)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,0.333333,A,True
1,130000,0.5,13800,0.333333,C,False
2,220000,0.4,33500,0.333333,B,False
3,65000,0.25,2000,0.5,B,False
4,60000,0.2,2200,0.5,B,True
5,45000,0.312,5500,1.0,D,True
6,75000,0.111,2000,1.0,B,True
7,24000,0.4,500,0.5,C,False


With the exception of the `other` category, each `purpose` was replaced with the average repayment rate for that purpose. If a level has only one example (as `other` does), or is new at test time, it is replaced with the overall average (here 0.5, since four of the eight loans were repaid).
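
The unsmoothed encoding above can be reproduced by hand. The sketch below uses plain Python and follows the behaviour described in this section: each level is replaced by its mean target, except that levels with a single example, and levels never seen in training, fall back to the overall mean. The helper names are invented for illustration and this is not the library's code.

```python
from collections import defaultdict

purposes = ['medical', 'medical', 'medical', 'refinance',
            'refinance', 'auto', 'auto', 'other']
repaid = [True, False, False, False, True, True, True, False]

def fit_target_means(levels, targets):
    """Return (per-level mean target, overall mean) without any smoothing."""
    grouped = defaultdict(list)
    for level, target in zip(levels, targets):
        grouped[level].append(target)
    overall = sum(targets) / len(targets)
    means = {}
    for level, values in grouped.items():
        # One example is not enough to trust, so use the overall mean.
        means[level] = overall if len(values) == 1 else sum(values) / len(values)
    return means, overall

means, overall = fit_target_means(purposes, repaid)

def encode(level):
    """Unseen levels also fall back to the overall mean."""
    return means.get(level, overall)

print([round(encode(p), 6) for p in purposes])
# [0.333333, 0.333333, 0.333333, 0.5, 0.5, 1.0, 1.0, 0.5]
print(encode('tuition'))   # 0.5
```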

### Warning:

When using the target encoder, you are using the values of the output. When doing cross-validation, it is critical that you fit the encoder on each set of training folds, rather than encoding everything and then doing cross-validation. Otherwise your cross-validation will "know" about the hold-out fold, making your cross-validation scores higher than they will be on the test set (and on new data).
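
To see how the leak arises, consider holding out the first row (`medical`, repaid) as a one-row validation fold. The plain-Python sketch below, with hypothetical helper names and no smoothing, compares encoding that row with statistics from all rows against encoding it with statistics from the training rows only.

```python
purposes = ['medical', 'medical', 'medical', 'refinance',
            'refinance', 'auto', 'auto', 'other']
repaid = [True, False, False, False, True, True, True, False]

def level_means(levels, targets):
    """Mean target per level, computed only from the rows passed in."""
    totals = {}
    for level, target in zip(levels, targets):
        total, count = totals.get(level, (0, 0))
        totals[level] = (total + target, count + 1)
    return {level: total / count for level, (total, count) in totals.items()}

valid_idx = [0]
train_idx = [i for i in range(len(purposes)) if i not in valid_idx]

# Leaky: the held-out row's own target contributes to its encoding.
leaky = level_means(purposes, repaid)['medical']

# Correct: fit the encoding on the training fold only.
fold_means = level_means([purposes[i] for i in train_idx],
                         [repaid[i] for i in train_idx])
clean = fold_means['medical']

print(round(leaky, 6), clean)   # 0.333333 0.0
```

The leaky value is pulled towards the held-out row's own label; with a real model and many folds, this is what inflates the validation score.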

## Hash Encoder

The hash encoder maps each feature value to `n_components` binary columns. Because it doesn't memorize the levels during training, it can be a good choice if you have a **lot** of categories. It can also encode new (unseen) levels at test time.

It helps with tree-based models, because roughly half the levels will have a 0 or 1 in each column, so if there are relationships between levels the hope is that some of the columns will have common values for the related levels.

Drawbacks:

* It is hard to get interpretable results from a `HashingEncoder` column.
* If you choose a small number of components, or are unlucky, you can get _collisions_, where distinct levels are mapped to the same encoding. Below we see that `medical` and `refinance` are both mapped to `(0, 0, 1)`.

In [12]:
encoder_purpose = ce.HashingEncoder(n_components=3, cols=['purpose'])
encoder_purpose.fit_transform(df_train)

Unnamed: 0,col_0,col_1,col_2,annual_income,debt_to_income,loan_amount,grade,repaid
0,0,0,1,120000,0.1,3500,A,True
1,0,0,1,130000,0.5,13800,C,False
2,0,0,1,220000,0.4,33500,B,False
3,0,0,1,65000,0.25,2000,B,False
4,0,0,1,60000,0.2,2200,B,True
5,1,0,0,45000,0.312,5500,D,True
6,1,0,0,75000,0.111,2000,B,True
7,0,1,0,24000,0.4,500,C,False
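
The idea behind hashing can be sketched with the standard library. This is an illustration of the hashing trick in general, under the assumption that we hash the string form of each level with MD5 and take the result modulo the number of columns; it is not claimed to reproduce the exact columns that `HashingEncoder` produces above. The function `hash_encode` is a made-up name.

```python
import hashlib

def hash_encode(value, n_components=3):
    """Set a single 1 in the column chosen by hashing the value."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % n_components
    row = [0] * n_components
    row[bucket] = 1
    return row

levels = ['medical', 'refinance', 'auto', 'other']
encoded = {level: tuple(hash_encode(level)) for level in levels}
for level, row in encoded.items():
    print(level, row)

# Four levels squeezed into three columns must collide at least once.
print(len(set(encoded.values())) < len(levels))   # True

# No fitting was needed, so an unseen level can be encoded too.
print(hash_encode('tuition'))
```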


## Encoding multiple columns

Let's encode 

* `grade` using `OneHotEncoder` (usually you would use `OrdinalEncoder`)
* `purpose` using `TargetEncoder`

We will do it in two steps, then use a pipeline, to ensure that we are able to do cross-validation correctly:

## Two steps

First, let's do it _incorrectly_:

In [13]:
encoder_grade = ce.OneHotEncoder(cols=['grade'], return_df=True).fit(df_train)
encoder_purpose = ce.TargetEncoder(cols=['purpose'], return_df=True).fit(df_train, df_train['repaid'])

In [14]:
# Can we encode?
df_train_grade_encoded = encoder_grade.transform(df_train)
df_train_grade_encoded

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade_1,grade_2,grade_3,grade_4,repaid
0,120000,0.1,3500,medical,1,0,0,0,True
1,130000,0.5,13800,medical,0,1,0,0,False
2,220000,0.4,33500,medical,0,0,1,0,False
3,65000,0.25,2000,refinance,0,0,1,0,False
4,60000,0.2,2200,refinance,0,0,1,0,True
5,45000,0.312,5500,auto,0,0,0,1,True
6,75000,0.111,2000,auto,0,0,1,0,True
7,24000,0.4,500,other,0,1,0,0,False


Now let's encode the `purpose` column of this dataframe...

In [15]:
df_train_all = encoder_purpose.transform(df_train_grade_encoded)

ValueError: Unexpected input dimension 9, expected 6

What happened?

Our `encoder_purpose` was trained on `df_train`, which had only 6 columns. Here we asked it to transform _after_ the one hot encoder had expanded to 9 columns! If we reverse the order, however, we are fine:

In [16]:
# This encoding doesn't change the number of columns
df_train_purpose_encoded = encoder_purpose.transform(df_train)
df_train_purpose_encoded

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade,repaid
0,120000,0.1,3500,0.3532,A,True
1,130000,0.5,13800,0.3532,C,False
2,220000,0.4,33500,0.3532,B,False
3,65000,0.25,2000,0.5,B,False
4,60000,0.2,2200,0.5,B,True
5,45000,0.312,5500,0.865529,D,True
6,75000,0.111,2000,0.865529,B,True
7,24000,0.4,500,0.5,C,False


.... so we _can_ pass this along to one hot encoding:

In [17]:
df_train_all_encoded = encoder_grade.transform(df_train_purpose_encoded)
df_train_all_encoded

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade_1,grade_2,grade_3,grade_4,repaid
0,120000,0.1,3500,0.3532,1,0,0,0,True
1,130000,0.5,13800,0.3532,0,1,0,0,False
2,220000,0.4,33500,0.3532,0,0,1,0,False
3,65000,0.25,2000,0.5,0,0,1,0,False
4,60000,0.2,2200,0.5,0,0,1,0,True
5,45000,0.312,5500,0.865529,0,0,0,1,True
6,75000,0.111,2000,0.865529,0,0,1,0,True
7,24000,0.4,500,0.5,0,1,0,0,False


## A better way: pipelines!

That was really annoying! Fortunately there is a better way: use a pipeline. A pipeline trains all of its steps at once, with each step fitted on the output of the previous step, so we don't need to keep track of which step comes first. Pipelines also work nicely with `GridSearchCV` and `cross_val_score`: the encoding is refit on each set of training folds, so we know there is no data leakage into the validation fold.

Let's do an example with the `OneHotEncoder` as the first step:

In [18]:
from sklearn.pipeline import Pipeline

# We can put these in either order, the second one
# fits/transforms on the output of the first!
encoding_pipeline = Pipeline([
    ('encode_grade', encoder_grade),
    ('encode_purpose', encoder_purpose)
])

encoding_pipeline.fit(df_train, df_train['repaid'])

Pipeline(memory=None,
     steps=[('encode_grade', OneHotEncoder(cols=['grade'], drop_invariant=False, handle_missing='value',
       handle_unknown='value', return_df=True, use_cat_names=False,
       verbose=0)), ('encode_purpose', TargetEncoder(cols=['purpose'], drop_invariant=False, handle_missing='value',
       handle_unknown='value', min_samples_leaf=1, return_df=True,
       smoothing=1.0, verbose=0))])

In [19]:
# Note we don't pass in the target values!
encoding_pipeline.transform(df_train)

Unnamed: 0,annual_income,debt_to_income,loan_amount,purpose,grade_1,grade_2,grade_3,grade_4,repaid
0,120000,0.1,3500,0.3532,1,0,0,0,True
1,130000,0.5,13800,0.3532,0,1,0,0,False
2,220000,0.4,33500,0.3532,0,0,1,0,False
3,65000,0.25,2000,0.5,0,0,1,0,False
4,60000,0.2,2200,0.5,0,0,1,0,True
5,45000,0.312,5500,0.865529,0,0,0,1,True
6,75000,0.111,2000,0.865529,0,0,1,0,True
7,24000,0.4,500,0.5,0,1,0,0,False
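
To make the "each step is trained on the output of the previous step" idea concrete, here is a toy pipeline in plain Python. It is a sketch under simplifying assumptions: rows are dictionaries, the two step classes (`OneHotStep`, `MeanTargetStep`) are hypothetical stand-ins for the real encoders, and the target step uses unsmoothed means. It is not how scikit-learn implements `Pipeline`.

```python
class OneHotStep:
    def __init__(self, col):
        self.col = col

    def fit(self, rows, y=None):
        # Remember the levels in order of first appearance.
        self.levels_ = list(dict.fromkeys(row[self.col] for row in rows))
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            new_row = {k: v for k, v in row.items() if k != self.col}
            for level in self.levels_:
                new_row[f"{self.col}_{level}"] = int(row[self.col] == level)
            out.append(new_row)
        return out


class MeanTargetStep:
    def __init__(self, col):
        self.col = col

    def fit(self, rows, y):
        totals = {}
        for row, target in zip(rows, y):
            total, count = totals.get(row[self.col], (0, 0))
            totals[row[self.col]] = (total + target, count + 1)
        self.means_ = {level: t / c for level, (t, c) in totals.items()}
        self.overall_ = sum(y) / len(y)
        return self

    def transform(self, rows):
        return [{**row, self.col: self.means_.get(row[self.col], self.overall_)}
                for row in rows]


def fit_pipeline(steps, rows, y):
    """Fit each step on the output of the previous one."""
    for step in steps:
        rows = step.fit(rows, y).transform(rows)
    return steps


def transform_pipeline(steps, rows):
    """Apply the fitted steps in order; no target values are needed."""
    for step in steps:
        rows = step.transform(rows)
    return rows


rows = [{'purpose': p, 'grade': g} for p, g in [
    ('medical', 'A'), ('medical', 'C'), ('medical', 'B'), ('refinance', 'B'),
    ('refinance', 'B'), ('auto', 'D'), ('auto', 'B'), ('other', 'C')]]
repaid = [True, False, False, False, True, True, True, False]

grade_first = fit_pipeline([OneHotStep('grade'), MeanTargetStep('purpose')], rows, repaid)
purpose_first = fit_pipeline([MeanTargetStep('purpose'), OneHotStep('grade')], rows, repaid)

# Either order works, because every step is fitted on the columns it will actually see.
print(transform_pipeline(grade_first, rows) == transform_pipeline(purpose_first, rows))
```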


For more information on pipelines, see the article ["An introduction to pipelines"](https://kiwidamien.github.io/introduction-to-pipelines.html).