# Best Practices in Feature Engineering for Tabular Data With GPU Acceleration #

# Install 

https://docs.rapids.ai/install/

```
docker run --gpus all --pull always --rm -it \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    --volume /mnt/d/repos/nvidia:/home/rapids/notebooks/nvidia \
    nvcr.io/nvidia/rapidsai/notebooks:25.08-cuda12.9-py3.13
```

## Part 1: Target Encoding ##
Most models cannot accept categorical columns as-is. A categorical column is typically a column of strings (or unordered numbers), and we need to convert it into a numeric representation before we can feed it to a model. Common techniques are OHE (one-hot encoding) and LE (label encoding). More advanced techniques are TE (target encoding) and CE (count encoding). In this notebook, we discuss TE.
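
To make the abbreviations above concrete, here is a minimal sketch of LE, CE, and OHE on a toy column. The data and the `LE_`/`CE_`/`OHE_` column names are made up for illustration and are not part of the lab dataset.

```python
import pandas as pd

# Toy data; values and column names are hypothetical.
df = pd.DataFrame({"brand": ["acme", "zen", "acme", "orbit", "acme", "zen"]})

# Label encoding: map each distinct string to an integer code
# (codes are assigned in order of first appearance).
codes, uniques = pd.factorize(df["brand"])
df["LE_brand"] = codes

# Count encoding: replace each category with how often it occurs.
df["CE_brand"] = df["brand"].map(df["brand"].value_counts())

# One-hot encoding: one indicator column per distinct category.
ohe = pd.get_dummies(df["brand"], prefix="OHE")

print(df)
print(ohe.columns.tolist())
```

None of these three encodings looks at the target column, which is the main difference from TE.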

[1]: https://rapids.ai/cudf-pandas/
[2]: https://docs.rapids.ai/install/

In this lab, we use the speed of GPUs to create new columns quickly. Specifically, we use [cuDF-Pandas][1] zero-code-change GPU acceleration. After loading the extension with the magic command `%load_ext cudf.pandas`, all subsequent Pandas calls use [RAPIDS cuDF][2] and run on the GPU instead of the CPU!

**Table of Contents**
<br>
This notebook shows how to perform target encoding. It covers the following sections:

1. [GPU Accelerating Pandas with Zero Code Change](#GPU-Accelerating-Pandas-with-Zero-Code-Change)
    * [Load Data](#Load-Data)
    * [Target Encoding Technique](#Target-Encoding-Technique)
    * [Smoothing](#Smoothing)
    * [Compare ACC (Accuracy) Errors](#Compare-ACC-(Accuracy)-Errors)
    * [Improve Target Encoding with Nested Folds](#Improve-Target-Encoding-with-Nested-Folds)
    * [Target Encoding Summary](#Target-Encoding-Summary)
2. [CPU-GPU Comparison](#CPU-GPU-Comparison)
    * [Sample Data](#Sample-Data)
    * [Enlarge Data](#Enlarge-Data)
3. [Summary](#Summary)

## GPU Accelerating Pandas with Zero Code Change
After loading the extension with the magic command `%load_ext cudf.pandas`, all subsequent Pandas calls use [RAPIDS cuDF][1] and run on the GPU instead of the CPU!

[1]: https://rapids.ai/cudf-pandas/

In [None]:
%load_ext cudf.pandas
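
The `%load_ext` magic only works inside IPython/Jupyter. For a plain Python script, cuDF-Pandas can be enabled with `python -m cudf.pandas script.py` or by calling `cudf.pandas.install()` before Pandas is imported. The sketch below is an illustration, not part of the lab: it assumes you want a silent fallback to CPU Pandas when cuDF is unavailable, and the `backend` variable and toy dataframe are hypothetical.

```python
# Try to enable cuDF-Pandas; fall back to CPU Pandas if that is not possible.
try:
    import cudf.pandas
    cudf.pandas.install()  # must run before pandas is imported
    backend = "cudf.pandas (GPU)"
except Exception:  # cuDF not installed, or no usable GPU
    backend = "pandas (CPU)"

import pandas as pd

# The same Pandas code runs on either backend.
df = pd.DataFrame({"brand": ["a", "b", "a"], "label": [1, 0, 0]})
means = df.groupby("brand")["label"].mean()
print(backend, means.to_dict())
```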

### Load Data
 **Amazon product data dataset** : https://jmcauley.ucsd.edu/data/amazon/


**Description**<br>
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

**Citation**<br>
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf)

First, we load the data and fill NaNs in the categorical column `brand` with the string `UNKNOWN`.

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

# LOAD DATA
PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet') 
df_valid = pd.read_parquet(f'{PATH}valid.parquet')
df_test = pd.read_parquet(f'{PATH}test.parquet')

# FILL NAN
df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')
df_test['brand'] = df_test['brand'].fillna('UNKNOWN')

print("Train data shape:",df_train.shape)
df_train.head()

### Target Encoding Technique

`Target Encoding` is a technique for creating new features that the model can use for training. Its advantage is that it processes the categorical features and makes them more accessible to the model during training and validation.

Tree-based models may need to create a split for each categorical value (depending on the exact model). `Target Encoding` makes it easier for the model to locate important values without creating many splits. In particular, when `Target Encoding` is applied to multiple columns, it significantly reduces the number of splits needed. The model can operate directly on the probabilities/averages and create splits based on them.

Another advantage is that some boosted-tree libraries, such as XGBoost, only offer experimental categorical feature handling, so the library may require `One Hot Encoding`. Categorical features with large cardinality (e.g. >100) are inefficient to store as `One Hot`.
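
As a rough illustration of the storage argument, the sketch below compares the width of a one-hot encoding with a target encoding for a synthetic column with 200 distinct values. The data is generated on the spot and the names `n_rows`, `n_cats`, `ohe`, and `te` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 5,000 rows, 200 distinct category values (hypothetical).
n_rows, n_cats = 5_000, 200
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "brand": (np.arange(n_rows) % n_cats).astype(str),
    "label": rng.integers(0, 2, size=n_rows),
})

# One-hot encoding needs one column per distinct value.
ohe = pd.get_dummies(df["brand"])

# Target encoding needs a single numeric column: the per-category mean of the label.
te = df.groupby("brand")["label"].transform("mean")

print("one-hot shape:", ohe.shape)
print("target-encoded shape:", te.shape)
```

The one-hot frame grows with the cardinality of the column, while the target-encoded feature stays a single column regardless of how many categories exist.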

Deep learning models often apply embedding layers to categorical features. An embedding layer can overfit quickly: categorical values with low frequencies receive only a few gradient descent updates, so the model tends to memorize the training data.

#### Encode Single Categorical Column

`Target Encoding (TE)` calculates the statistics from a target variable grouped by the unique values of one or more categorical features.

For example, in a binary classification problem, it calculates the probability that the target is true for each category value, which is a simple mean. In the first code cell below, we list all unique values of the column `brand` together with their proportion of true targets. In the second code cell, we merge `TE` onto the original dataframe, creating a new `TE` column.

In [None]:
cat = 'brand'
te = df_train[[cat, 'label']].groupby(cat).mean()
te

In [None]:
te = te.reset_index()
te.columns = [cat, 'TE_' + cat]
df_train.merge(te, how='left', on=cat)[['userID', 'productID', cat, 'TE_' + cat]]

#### Encode Group of Categorical Columns
Similarly, we can apply `Target Encoding` to a group of categorical features.

In [None]:
te = df_train[['brand', 'cat_2', 'label']].groupby(['brand', 'cat_2']).mean()
te

In [None]:
te = te.reset_index()
te.columns = ['brand', 'cat_2', 'TE_brand_2']
df_train.merge(te, how='left', on=['brand', 'cat_2'])

### Smoothing
The `Target Encoding` introduced so far is a good first step, but it does not generalize well and tends to overfit. Let's look at `Target Encoding` together with the observation count. Notice that some `brands` have only a few rows in the train data (see the table and histogram below). If a brand has only a few observations, can we be confident that the observed proportion of true targets will apply to new data?

In [None]:
dd = df_train[[cat, 'label']].groupby(cat).agg(['mean', 'count'])
dd

In [None]:
# Number of brands (y) that have a given observation count (x)
counts = dd['label']['count'].value_counts()
plt.bar(counts.index.to_numpy(), counts.to_numpy())
plt.xlim(0,50)
plt.title("Histogram of Brands and their Observation Count")
plt.show()

We can observe that the observation count for some categories is 1. This means that we have only one data point to calculate the average, so `Target Encoding` overfits to these values. Therefore, we need to adjust the calculation:
* if the number of observations is **high**, we want to use the **mean of this category value**
* if the number of observations is **low**, we want to use the **global mean**

A simple way is to calculate a weighted average of the `category value mean` and the `global mean`.

We add a smoothing weight `w`. A bigger `w` encourages the `Target Encoding` to be closer to the `global mean`.  
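
A small worked example may help to see how the weighted average behaves. The helper `smoothed_te` and the numbers below are hypothetical; they only illustrate the formula `(mean * count + global_mean * w) / (count + w)` for a rare and a frequent category.

```python
def smoothed_te(cat_mean, cat_count, global_mean, w=20):
    """Weighted average of the category mean and the global mean."""
    return (cat_mean * cat_count + global_mean * w) / (cat_count + w)

global_mean = 0.2

# A category seen once with target = 1 is pulled strongly toward the global mean.
rare = smoothed_te(cat_mean=1.0, cat_count=1, global_mean=global_mean)

# A category seen 1000 times keeps an encoding close to its own mean.
common = smoothed_te(cat_mean=1.0, cat_count=1000, global_mean=global_mean)

print(round(rare, 3), round(common, 3))
```

With `w=20`, the single-observation category ends up near the global mean, while the frequent category stays near its own mean of 1.0.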


* Use a smoothing factor of `w=20`
* Target Encode the columns `feat = ['brand', 'cat_2']`

In [None]:
feat = ['brand', 'cat_2']
w = 20
mean_global = df_train.label.mean()
te = df_train.groupby(feat)['label'].agg(['mean','count']).reset_index()
te['TE_brand_cat_2'] = ((te['mean']*te['count'])+(mean_global*w))/(te['count']+w)

df_train = df_train.merge(te, on=feat, how='left')
df_valid = df_valid.merge( te, on=feat, how='left' )
df_test = df_test.merge( te, on=feat, how='left' )
df_valid['TE_brand_cat_2'] = df_valid['TE_brand_cat_2'].fillna(mean_global)
df_test['TE_brand_cat_2'] = df_test['TE_brand_cat_2'].fillna(mean_global)

#### Exploring the Effect of Smoothing

A tree-based or deep learning model cannot easily capture the idea of smoothing on its own. To show the positive effect of smoothing, we compare `Target Encoding` with and without smoothing on the validation data.

#### TargetEncoding Without Smoothing

In [None]:
cat = ['weekday', 'cat_2', 'brand']
te = df_train.groupby(cat).label.agg(['mean', 'count']).reset_index()
te.columns = cat + ['TE_mean', 'TE_count']

In [None]:
df_valid = df_valid.merge(te, on=cat, how='left')
df_valid['error'] = (df_valid['label'] - (df_valid['TE_mean']>=0.5)).abs()

In [None]:
mean_global = df_train.label.mean()
df_valid['TE_mean'] = df_valid['TE_mean'].fillna(mean_global)

#### TargetEncoding With Smoothing

In [None]:
w = 20
df_valid['TE_mean_smoothed'] = ((df_valid['TE_mean']*df_valid['TE_count'])+(mean_global*w))/(df_valid['TE_count']+w)
df_valid['TE_mean_smoothed'] = df_valid['TE_mean_smoothed'].fillna(mean_global)

In [None]:
df_valid['error_smoothed'] = (df_valid['label'] - (df_valid['TE_mean_smoothed']>=0.5)).abs()

### Compare ACC (Accuracy) Errors
Let's look at the error as a function of the number of observations. We can see that the categorical values with a low observation count (1, 2, 3) have a lower error rate with smoothing than without smoothing.

In [None]:
print("ACC errors without smoothing:")
df_valid[['TE_count', 'error']].groupby('TE_count').error.mean().sort_index()

In [None]:
print("ACC errors with smoothing:")
df_valid[['TE_count', 'error_smoothed']].groupby('TE_count').error_smoothed.mean().sort_index()

#### Compare AUC (Area under ROC curve) Errors
We can look at the roc_auc values as well:

In [None]:
from sklearn.metrics import roc_auc_score

print("AUC without smoothing:")
roc_auc_score(df_valid['label'].astype(int).values, 
              df_valid['TE_mean'].values)

In [None]:
print("AUC with smoothing:")
roc_auc_score(df_valid['label'].astype(int).values, 
              df_valid['TE_mean_smoothed'].values)

### Improve Target Encoding with Nested Folds

We can still improve our `Target Encoding` function and make it generalize better by applying an **out-of-fold calculation**.

In our current definition, we use the full training dataset to `Target Encode` both the training dataset and the validation/test dataset. We will therefore likely overfit slightly on the training dataset, because each row's own label contributes to its encoding. A better strategy is to use **out of fold**:
* use the full training dataset to encode the validation/test dataset
* split the training dataset into k folds and encode the i-th fold using all folds except the i-th one

The following figure visualizes the strategy for k=5:

The k folds can be generated by a random split or by timestamp, depending on the dataset.
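
The two ways of generating folds, and the out-of-fold rule itself, can be sketched on a toy dataframe. This is only an illustration under assumed names: the `timestamp` column, the `fold_random`/`fold_time` columns, and the toy values are hypothetical and do not come from the lab data.

```python
import numpy as np
import pandas as pd

# Toy data with a daily timestamp (hypothetical).
df = pd.DataFrame({
    "brand": ["a", "a", "a", "b", "b", "b", "a", "b", "a", "b"],
    "label": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
    "timestamp": pd.date_range("2014-01-01", periods=10, freq="D"),
})
k = 5

# Random folds: shuffle the row positions, then assign them round-robin.
rng = np.random.default_rng(42)
df["fold_random"] = rng.permutation(len(df)) % k

# Time-based folds: cut the rows into k contiguous, equally sized time blocks.
df["fold_time"] = pd.qcut(df["timestamp"].rank(method="first"), k, labels=False)

# Out-of-fold encoding for fold i: statistics come only from the other folds.
i = 0
in_fold = df["fold_random"] == i
oof_means = df[~in_fold].groupby("brand")["label"].mean()
df["TE_brand"] = df["brand"].map(oof_means).where(in_fold)

print(df)
```

Only the rows of fold `i` receive a value here; repeating the last step for every fold fills the whole column without any row seeing its own label.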

#### Target Encode with Nested Folds and Smoothing
We now restart the session, load data, and perform target encoding with nested folds and smoothing using zero code change GPU acceleration with cuDF-Pandas.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

#### Zero Code GPU Acceleration

In [None]:
%load_ext cudf.pandas
#!nvidia-smi

#### Load Data

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet') 
df_valid = pd.read_parquet(f'{PATH}valid.parquet')

df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')

print("Original train data and valid data shape:")
df_train.shape, df_valid.shape

#### Enlarge Data
The training and validation datasets are small for real-world use cases. We artificially increase the dataset size by duplicating the datasets 10 times to make it more similar to a real-world dataset.

In [None]:
df_train = pd.concat([df_train]*10).reset_index(drop=True)
df_valid = pd.concat([df_valid]*10).reset_index(drop=True)
print("Enlarged train data and valid data shape:")
df_train.shape, df_valid.shape

In [None]:
def target_encode(train, valid, col, target, kfold=5, smooth=20):
    """
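        train:  train dataset
        valid:  validation dataset
        col:    list of column names to encode together (e.g. ['weekday', 'cat_2', 'brand'])
        target: target column which will be used to calculate the statistic
        kfold:  number of folds used for the out-of-fold encoding of train
        smooth: smoothing weight w; larger values pull the encoding toward the global mean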
    """
    
    # We assume that the train dataset is shuffled
    train['kfold'] = ((train.index) % kfold)
    # We create the output column, we fill with 0
    col_name = '_'.join(col)
    train['TE_' + col_name] = 0.
    for i in range(kfold):
        ###################################
        # filter for out of fold
        # calculate the mean/counts per group category
        # calculate the global mean for the oof
        # calculate the smoothed TE
        # merge it to the original dataframe
        ###################################
        
        df_tmp = train[train['kfold']!=i]
        mn = df_tmp[target].mean()
        df_tmp = df_tmp[col + [target]].groupby(col).agg(['mean', 'count']).reset_index()
        df_tmp.columns = col + ['mean', 'count']
        df_tmp['TE_tmp'] = ((df_tmp['mean']*df_tmp['count'])+(mn*smooth)) / (df_tmp['count']+smooth)
        df_tmp_m = train[col + ['kfold', 'TE_' + col_name]].merge(df_tmp, how='left', left_on=col, right_on=col)
        df_tmp_m.loc[df_tmp_m['kfold']==i, 'TE_' + col_name] = df_tmp_m.loc[df_tmp_m['kfold']==i, 'TE_tmp']
        train['TE_' + col_name] = df_tmp_m['TE_' + col_name].fillna(mn).values

    
    ###################################
    # calculate the mean/counts per group for the full training dataset
    # calculate the global mean
    # calculate the smoothed TE
    # merge it to the original dataframe
    # drop all temp columns
    ###################################    
    
    df_tmp = train[col + [target]].groupby(col).agg(['mean', 'count']).reset_index()
    mn = train[target].mean()
    df_tmp.columns = col + ['mean', 'count']
    df_tmp['TE_tmp'] = ((df_tmp['mean']*df_tmp['count'])+(mn*smooth)) / (df_tmp['count']+smooth)
    df_tmp_m = valid[col].merge(df_tmp, how='left', left_on=col, right_on=col)
    valid['TE_' + col_name] = df_tmp_m['TE_tmp'].fillna(mn).values
    
    train = train.drop('kfold', axis=1)
    return train, valid

In [None]:
%%time
df_train, df_valid = target_encode(df_train, df_valid, ['weekday', 'cat_2', 'brand'], 'label')

In [None]:
df_train.head()

In [None]:
df_valid.head()

### Target Encoding Summary

* `Target Encoding` calculates statistics of a target column given one or more categorical features
* `Target Encoding` smooths the statistics as a weighted average of the category value and the global statistic
* `Target Encoding` uses an out-of-fold strategy to prevent overfitting to the training dataset.
    
We can see the advantage of using `Target Encoding` as a feature engineering step. 

## CPU-GPU Comparison
Let's compare the runtime between `CPU Pandas` and `GPU cuDF-Pandas`. All the code is written in Pandas, so we can execute it on both CPU and GPU by choosing to activate GPU acceleration or not.

We restart the session, load data, and perform `target encoding` with nested folds and smoothing. This time we will not use the magic command `%load_ext cudf.pandas` and subsequently our code will run using CPU Pandas instead of GPU `cuDF-Pandas`. When running with GPU above, it took about `3 seconds` to add a new TE column. Let's see how long CPU takes...

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Sample Data

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet') 
df_valid = pd.read_parquet(f'{PATH}valid.parquet')

df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')

print("Original train data and valid data shape:")
df_train.shape, df_valid.shape

### Enlarge Data
The training and validation datasets are small for real-world use cases. We artificially increase the dataset size by duplicating the datasets 10 times to make it more similar to a real-world dataset.

In [None]:
df_train = pd.concat([df_train]*10).reset_index(drop=True)
df_valid = pd.concat([df_valid]*10).reset_index(drop=True)
print("Enlarged train data and valid data shape:")
df_train.shape, df_valid.shape

In [None]:
def target_encode(train, valid, col, target, kfold=5, smooth=20):
    """
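        train:  train dataset
        valid:  validation dataset
        col:    list of column names to encode together (e.g. ['weekday', 'cat_2', 'brand'])
        target: target column which will be used to calculate the statistic
        kfold:  number of folds used for the out-of-fold encoding of train
        smooth: smoothing weight w; larger values pull the encoding toward the global mean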
    """
    
    # We assume that the train dataset is shuffled
    train['kfold'] = ((train.index) % kfold)
    # We create the output column, we fill with 0
    col_name = '_'.join(col)
    train['TE_' + col_name] = 0.
    for i in range(kfold):
        ###################################
        # filter for out of fold
        # calculate the mean/counts per group category
        # calculate the global mean for the oof
        # calculate the smoothed TE
        # merge it to the original dataframe
        ###################################
        
        df_tmp = train[train['kfold']!=i]
        mn = df_tmp[target].mean()
        df_tmp = df_tmp[col + [target]].groupby(col).agg(['mean', 'count']).reset_index()
        df_tmp.columns = col + ['mean', 'count']
        df_tmp['TE_tmp'] = ((df_tmp['mean']*df_tmp['count'])+(mn*smooth)) / (df_tmp['count']+smooth)
        df_tmp_m = train[col + ['kfold', 'TE_' + col_name]].merge(df_tmp, how='left', left_on=col, right_on=col)
        df_tmp_m.loc[df_tmp_m['kfold']==i, 'TE_' + col_name] = df_tmp_m.loc[df_tmp_m['kfold']==i, 'TE_tmp']
        train['TE_' + col_name] = df_tmp_m['TE_' + col_name].fillna(mn).values

    
    ###################################
    # calculate the mean/counts per group for the full training dataset
    # calculate the global mean
    # calculate the smoothed TE
    # merge it to the original dataframe
    # drop all temp columns
    ###################################    
    
    df_tmp = train[col + [target]].groupby(col).agg(['mean', 'count']).reset_index()
    mn = train[target].mean()
    df_tmp.columns = col + ['mean', 'count']
    df_tmp['TE_tmp'] = ((df_tmp['mean']*df_tmp['count'])+(mn*smooth)) / (df_tmp['count']+smooth)
    df_tmp_m = valid[col].merge(df_tmp, how='left', left_on=col, right_on=col)
    valid['TE_' + col_name] = df_tmp_m['TE_tmp'].fillna(mn).values
    
    train = train.drop('kfold', axis=1)
    return train, valid

In [None]:
%%time
df_train, df_valid = target_encode(df_train, df_valid, ['weekday', 'cat_2', 'brand'], 'label')

## Summary
In this notebook, the GPU-accelerated code computed and added a new target-encoded column in about 3 seconds, while the CPU code took about 60 seconds. We observe a speedup of `20x using GPU versus CPU`, wow!

Additionally, our implementation can still be improved. As the dataset gets larger, the speedup tends to increase further, because GPUs are most efficient when they process lots of data at once. Furthermore, we can optimize our solution with `dask` and `dask_cudf` to use multiple GPUs. See our RecSys 2020 solution writeup for details!

Please execute the cell below to shut down the kernel when you are done. Also do not forget to stop the running instance.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(False)