# Best Practices in Feature Engineering for Tabular Data With GPU Acceleration #

## Part 2: Count Encoding ##
Most models cannot accept categorical columns as is. A categorical column is typically a column of strings (or unordered numbers), and we need to convert it into a numeric representation before we can input it into our model. Common techniques are `OHE (one hot encoding)` and `LE (label encoding)`. More advanced techniques are `TE (target encoding)` and `CE (count encoding)`. In this notebook, we will discuss CE.

[1]: https://rapids.ai/cudf-pandas/
[2]: https://docs.rapids.ai/install/

In this lab, we will use the speed of GPUs to help us create new columns (features) quickly. Specifically, we will use [cuDF-Pandas][1] zero code change GPU acceleration. After running the magic command `%load_ext cudf.pandas`, all of our subsequent Pandas calls will use [RAPIDS cuDF][2] and run on the GPU where supported, instead of on the CPU with Pandas!

**Table of Contents**
<br>
This notebook shows how to perform count encoding and covers the sections below:

1. [GPU Accelerating Pandas with Zero Code Change](#GPU-Accelerating-Pandas-with-Zero-Code-Change)
    * [Load Data](#Load-Data)
    * [Count Encoding Technique](#Count-Encoding-Technique)
    * [Apply Count Encoding](#Apply-Count-Encoding)
2. [CPU-GPU Comparison](#CPU-GPU-Comparison)
    * [Sample Data](#Sample-Data)
    * [Enlarge Data](#Enlarge-Data)
3. [Summary](#Summary)

## GPU Accelerating Pandas with Zero Code Change
We run the magic command `%load_ext cudf.pandas` before importing Pandas. From then on, our Pandas calls use [RAPIDS cuDF][1] and run on the GPU instead of on the CPU.

[1]: https://rapids.ai/cudf-pandas/

In [None]:
%load_ext cudf.pandas

### Load Data
**Amazon product data**: https://jmcauley.ucsd.edu/data/amazon/

**Description**<br>
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

**Citation**<br>
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf)

First, we load the data and fill missing values (`NaN`) in the categorical column `brand` with the string `UNKNOWN`.

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

# LOAD DATA
PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet') 
df_valid = pd.read_parquet(f'{PATH}valid.parquet')
df_test = pd.read_parquet(f'{PATH}test.parquet')

# FILL NAN
df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')
df_test['brand'] = df_test['brand'].fillna('UNKNOWN')

print("Train data shape:",df_train.shape)
df_train.head()

### Count Encoding Technique

`Count Encoding` creates a new feature that the model can use for training. It replaces each category with its frequency in the training data, so categorical values with similar frequencies are grouped together.

For example:
* Users who have only 1 interaction in the dataset are encoded with 1. Instead of having one data point per user, the model can now learn a behavior pattern for all of these users at once.
* Products that have many interactions in the dataset are encoded with a high number. The model can learn to see them as top sellers and treat them accordingly.

The advantage of Count Encoding is that category values are grouped together based on behavior. This helps particularly for categories with only a few observations, where a decision tree is not able to create a split and neural networks receive only a few gradient descent updates for these values.
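To make the grouping effect concrete, here is a minimal sketch on a toy interaction log. The data and the name `toy` are made up for illustration and are not part of the Amazon dataset. Users `u1` and `u2` each appear once, so they receive the same encoded value and the model can treat them alike.

```python
import pandas as pd

# Toy interaction log (hypothetical data, not the Amazon dataset)
toy = pd.DataFrame({
    "userID": ["u1", "u2", "u3", "u3", "u3"],
    "productID": ["p1", "p1", "p1", "p2", "p3"],
})

# Number of rows in which each userID appears, broadcast back to every row
toy["CE_userID"] = toy.groupby("userID")["userID"].transform("size")

print(toy)
```

The two one-time users share the value 1, while the active user `u3` gets 3 on each of its rows.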

#### Note
In competitions, we could count encode the categories for the datasets in different ways:
* Count Encode the training dataset and apply it to the validation dataset<br>
* Count Encode the training dataset and Count Encode the validation dataset, separately<br>
* Merge the training dataset and validation dataset, Count Encode the concatenated dataset and apply to both datasets.

Our focus is on industry applications, so only the first approach is a valid real-world solution: at prediction time, the data to be scored is not available when the encoding is computed. We could collect statistics as a stream and update the characteristics of our dataset, but it is probably cleaner to increase the training frequency of our recommender models.
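The difference between the first and the third approach can be sketched on toy data. The frames and the helper `counts_of` below are hypothetical and only illustrate the idea. With training counts only, a user who never appears in the training data gets 0. With combined counts, the validation rows themselves contribute to the feature.

```python
import pandas as pd

# Toy data (hypothetical): user "d" never appears in the training set
train = pd.DataFrame({"userID": ["a", "a", "b", "c"]})
valid = pd.DataFrame({"userID": ["a", "b", "d"]})

def counts_of(df, col):
    """Return one row per category with its row count in df."""
    return df.groupby(col).size().reset_index(name="CE_" + col)

# Approach 1: counts come from the training data only
train_only = valid.merge(counts_of(train, "userID"), how="left", on="userID")
train_only["CE_userID"] = train_only["CE_userID"].fillna(0).astype("int64")

# Approach 3: counts come from train and valid combined
combined = valid.merge(
    counts_of(pd.concat([train, valid]), "userID"), how="left", on="userID"
)

print(train_only["CE_userID"].tolist())
print(combined["CE_userID"].tolist())
```

In the first approach the unseen user `d` is encoded as 0, which tells the model that it has no history for this user.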

#### Encode Single Categorical Column
`Count Encoding (CE)` calculates the frequency of each value of one or more categorical features in the training dataset.

For example, we can think of `Count Encoding` as the popularity of an item or the activity of a user. In the first code cell below, we list all unique values of the column `productID` together with their frequency counts. In the second code cell, we merge the counts onto the original dataframe, creating a new CE column.

In [None]:
cat = 'productID'
ce = df_train[[cat, 'label']].groupby(cat).count()
ce

In [None]:
ce = ce.reset_index()
ce.columns = [cat, 'CE_' + cat]
df_train.merge(ce, how='left', on=cat)[['userID', 'productID', 'CE_productID']]

#### Encode Group of Categorical Columns
Similarly, we can apply `Count Encoding` to a group of categorical features.

In [None]:
ce = df_train[['cat_2', 'brand', 'label']].groupby(['cat_2', 'brand']).count()
ce

In [None]:
ce = ce.reset_index()
ce.columns = ['cat_2', 'brand', 'CE_cat_2_brand']
df_train.merge(ce, how='left', on=['cat_2', 'brand'])[['productID', 'userID', 'brand', 'cat_2', 'CE_cat_2_brand']]
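One detail worth checking: the cells above use `.count()` on the `label` column, which counts non-null labels per group. That equals the number of rows only if `label` has no missing values, which we assume here but have not verified. The sketch below uses made-up data to show how `.count()` and `.size()` differ when a label is missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing label in the (Books, A) group
toy = pd.DataFrame({
    "cat_2": ["Books", "Books", "Books", "Toys"],
    "brand": ["A", "A", "B", "A"],
    "label": [1.0, np.nan, 0.0, 1.0],
})

keys = ["cat_2", "brand"]
by_count = toy.groupby(keys)["label"].count()  # non-null labels per group
by_size = toy.groupby(keys).size()             # rows per group

print(by_count)
print(by_size)
```

If the counted column can contain missing values, `.size()` is the safer choice for count encoding.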


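Next, we Count Encode the column `col = 'userID'`. The counts come from the training dataset and are applied to both the training and the validation dataset. Users that do not appear in the training dataset get a count of 0.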

In [None]:
col = 'userID'
train_tmp = df_train[col].value_counts().reset_index()
#train_tmp = df_train[[col,'label']].groupby(col).count().reset_index()
train_tmp.columns = [col, 'CE_' + col]
df_train = df_train.merge(train_tmp, how='left', on=col)
df_train['CE_' + col] = df_train['CE_' + col].fillna(0).values
df_valid = df_valid.merge(train_tmp, how='left', on=col)
df_valid['CE_' + col] = df_valid['CE_' + col].fillna(0).values
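As an alternative to the merge, the same feature can be built with `Series.map`, which looks up each value in the count table and keeps the original index. This is a sketch on toy data (the frames are hypothetical), not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy data (hypothetical): user "u9" is not in the training set
train = pd.DataFrame({"userID": ["u1", "u2", "u2", "u3", "u3", "u3"]})
valid = pd.DataFrame({"userID": ["u3", "u9", "u1"]})

col = "userID"
counts = train[col].value_counts()  # Series: category -> frequency in train

train["CE_" + col] = train[col].map(counts)
# Unseen categories map to NaN, so fill them with 0 before casting to int
valid["CE_" + col] = valid[col].map(counts).fillna(0).astype("int64")

print(valid)
```

Because `map` aligns on the existing rows, there is no need to re-attach the result by position.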

### Apply Count Encoding
We now restart the session, load data, and perform `Count Encoding` using zero code change GPU acceleration with cuDF-Pandas.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

**Zero Code GPU Acceleration**

In [None]:
%load_ext cudf.pandas
#!nvidia-smi

#### Load Data

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet')
df_valid = pd.read_parquet(f'{PATH}valid.parquet')

df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')

print("Original train data and valid data shape:")
df_train.shape, df_valid.shape

#### Enlarge Data
The training and validation datasets are small for real-world use cases. We artificially increase the dataset size by duplicating the datasets 10 times to make it more similar to a real-world dataset.

In [None]:
%%time
df_train = pd.concat([df_train]*10).reset_index(drop=True)
df_valid = pd.concat([df_valid]*10).reset_index(drop=True)
print("Enlarged train data and valid data shape:")
df_train.shape, df_valid.shape

In [None]:
def count_encode(train, valid, col):
    """
    Count encode `col` with counts from `train` and apply them to both datasets.

        train:  train dataset
        valid:  validation dataset
        col:    column which will be count encoded (for example 'userID')
    """

    # Frequency of each category in the training data
    train_tmp = train[col].value_counts().reset_index()
    train_tmp.columns = [col, 'CE_' + col]

    # The left merge returns one row per input row; .values assigns by position
    df_tmp = train[[col]].merge(train_tmp, how='left', on=col)
    train['CE_' + col] = df_tmp['CE_' + col].fillna(0).values

    # Categories that are unseen in train get a count of 0
    df_tmp = valid[[col]].merge(train_tmp, how='left', on=col)
    valid['CE_' + col] = df_tmp['CE_' + col].fillna(0).values

    return train, valid

In [None]:
%%time
df_train, df_valid = count_encode(df_train, df_valid, 'userID')

In [None]:
df_train.head()

In [None]:
df_valid.head()

## CPU-GPU Comparison
Let's compare the runtime between `CPU Pandas` and `GPU cuDF-Pandas`. All the code is written in Pandas, so we can execute it on both CPU and GPU by choosing to activate GPU acceleration or not.

We restart the session, load data, and perform `count encoding`. This time we do not run the magic command `%load_ext cudf.pandas`, so our code will run using CPU Pandas instead of GPU cuDF-Pandas. When running on the GPU above, it took about 0.5 seconds to add a new CE column. Let's see how long the CPU takes...

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Sample Data

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

PATH = "./data/"
df_train = pd.read_parquet(f'{PATH}train.parquet')
df_valid = pd.read_parquet(f'{PATH}valid.parquet')

df_train['brand'] = df_train['brand'].fillna('UNKNOWN')
df_valid['brand'] = df_valid['brand'].fillna('UNKNOWN')

print("Original train data and valid data shape:")
df_train.shape, df_valid.shape

### Enlarge Data
The training and validation datasets are small for real-world use cases. We artificially increase the dataset size by duplicating the datasets 10 times.

In [None]:
%%time
df_train = pd.concat([df_train]*10).reset_index(drop=True)
df_valid = pd.concat([df_valid]*10).reset_index(drop=True)
print("Enlarged train data and valid data shape:")
df_train.shape, df_valid.shape

In [None]:
def count_encode(train, valid, col):
    """
    Count encode `col` with counts from `train` and apply them to both datasets.

        train:  train dataset
        valid:  validation dataset
        col:    column which will be count encoded (for example 'userID')
    """

    # Frequency of each category in the training data
    train_tmp = train[col].value_counts().reset_index()
    train_tmp.columns = [col, 'CE_' + col]

    # The left merge returns one row per input row; .values assigns by position
    df_tmp = train[[col]].merge(train_tmp, how='left', on=col)
    train['CE_' + col] = df_tmp['CE_' + col].fillna(0).values

    # Categories that are unseen in train get a count of 0
    df_tmp = valid[[col]].merge(train_tmp, how='left', on=col)
    valid['CE_' + col] = df_tmp['CE_' + col].fillna(0).values

    return train, valid

In [None]:
%%time
df_train, df_valid = count_encode(df_train, df_valid, 'userID')

## Summary
In this notebook, the GPU accelerated code computed and added a new Count Encoded column in about `0.5 seconds`, while the CPU code took about `7.5 seconds`. We observe a speed up of about `15x using GPU versus CPU`, wow!
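The speed up is simply the ratio of the two wall times. The sketch below plugs in the approximate timings reported above. Your own `%%time` output will differ with hardware and data size, so treat these numbers as placeholders.

```python
# Approximate wall times reported in this notebook (replace with your own)
cpu_seconds = 7.5
gpu_seconds = 0.5

speedup = cpu_seconds / gpu_seconds
print(f"Speed up: {speedup:.0f}x")  # prints "Speed up: 15x"
```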

Additionally, our implementation can still be improved. As the dataset gets larger, we expect the speed up to increase, because GPUs are most efficient when they process lots of data at once. Furthermore, we can optimize our solution with `dask` and `dask_cudf` to use multiple GPUs. See our Recsys 2020 solution writeup for details!

Please execute the cell below to shut down the kernel when you are done. Also do not forget to stop the running instance.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(False)