# Best Practices in Feature Engineering for Tabular Data With GPU Acceleration #

## Part 3: Train Models with GPUs, cuDF, and cuML ##
Training a model is the process of learning patterns from the features of the train data in order to predict targets. Previously we learned how feature engineering transforms raw features into new columns to help our models recognize patterns. Now we will learn how to train models quickly, which allows us to evaluate different feature engineering experiments and build the most accurate models.

The secret to building the most accurate models is to evaluate as many experiments as possible and keep the features, models, and ideas that perform the best. To run the most experiments, we need to run everything as fast as possible with GPUs.

**Table of Contents**
<br>
In this notebook, we will use the speed of GPUs to help us evaluate which feature engineering creates the most accurate model. Specifically, we will use `RAPIDS cuDF` for fast feature engineering, `XGBoost on GPU` for fast model training, and `cuML`, which provides `GPU`-accelerated versions of `Scikit-Learn` algorithms, for fast model training. This notebook covers the sections below:

1. [XGBoost with Categorical Feature Engineering](#XGBoost-with-Categorical-Feature-Engineering)
    * [Load Data with cuDF](#Load-Data-with-cuDF)
    * [EDA and Find Categorical Columns](#EDA-and-Find-Categorical-Columns)
    * [XGBoost with Built-In Enable Categorical](#XGBoost-with-Built-In-Enable-Categorical)
    * [XGBoost with Label Encoding](#XGBoost-with-Label-Encoding)
    * [XGBoost with Target Encoding](#XGBoost-with-Target-Encoding)
2. [XGB Feature Importance](#XGB-Feature-Importance)
    * [Summary of XGB Experiments](#Summary-of-XGB-Experiments)
    * [CPU-GPU Comparison for XGB](#CPU-GPU-Comparison-for-XGB)
3. [SVC](#SVC)
    * [SVC with Target Encoding](#SVC-with-Target-Encoding)
    * [SVC with TF-IDF Feature Engineering](#SVC-with-TF-IDF-Feature-Engineering)
    * [CPU-GPU Comparison for SVC](#CPU-GPU-Comparison-for-SVC)
4. [Summary of Experiments](#Summary-of-Experiments)

## XGBoost with Categorical Feature Engineering
[XGBoost](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) uses gradient boosted decision trees to build accurate models quickly for tabular datasets. To optimize model accuracy, we must experiment with different feature engineering. Let's explore the effects of different categorical column encodings and demonstrate the process of improving model accuracy.

### Load Data with cuDF
For this course, we are using the **Amazon product data dataset**. 

**Description**<br>
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

**Citation**<br>
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf)

We load our dataframes on the GPU using RAPIDS cuDF for fast processing and feature engineering.

In [None]:
#import pandas as pd, numpy as np
import cudf, cupy
import matplotlib.pyplot as plt

# LOAD DATA
PATH = "./data/"
train0 = cudf.read_parquet(f'{PATH}train.parquet').reset_index(drop=True)
valid0 = cudf.read_parquet(f'{PATH}valid.parquet').reset_index(drop=True)
test0 = cudf.read_parquet(f'{PATH}test.parquet').reset_index(drop=True)

# FEATURES NOT USED
not_used = ['timestamp']
train = train0.drop(not_used,axis=1)
test = test0.drop(not_used,axis=1)
valid = valid0.drop(not_used,axis=1)

print("Train data shape, valid data shape, test data shape:")
train.shape, valid.shape, test.shape

### EDA and Find Categorical Columns
When training models with XGBoost, an effective type of feature engineering is encoding the categorical columns. We learned two techniques, `Target Encoding` and `Count Encoding`, in our previous tutorials. Popular encoding techniques are:
* Built-In Encoding
* One Hot Encoding
* Label Encoding
* Target Encoding
* Count Encoding
* Other Encodings (like tfidf)

Let's identify the categorical columns and then experiment with some of these techniques to discover which produces the most accurate model for our dataset.
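Two of the simpler techniques in the list above, `Count Encoding` and `One Hot Encoding`, can be sketched in plain Python before we move to the GPU versions. This is only a toy illustration on a made-up `brand` column (the values are hypothetical, and `None` stands in for NaN). It is not how cuDF implements these encodings.

```python
from collections import Counter

# Toy categorical column; None stands in for a missing value (NaN).
brand = ["acme", "zeta", "acme", None, "acme", "zeta"]

# Treat missing values as their own category so no row is dropped.
filled = [b if b is not None else "NONE" for b in brand]

# Count Encoding: replace each category with how often it occurs.
counts = Counter(filled)
count_encoded = [counts[b] for b in filled]

# One Hot Encoding: one 0/1 indicator column per distinct category.
categories = sorted(counts)
one_hot = [[int(b == cat) for cat in categories] for b in filled]

print(categories)
print(count_encoded)
print(one_hot[0])
```

Count Encoding keeps a single column no matter how many categories exist, while One Hot Encoding adds one column per category, which becomes expensive for high-cardinality columns.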

In [None]:
print("ALL COLUMNS:")
print( list(train.columns) )

In [None]:
print("CATEGORICAL COLUMNS:")
CAT_COLS = []
for c in train.columns:
    if train[c].dtype=="object":
        u = train[c].nunique()
        n = 100 * train[c].isna().mean()
        print(f"{c} (is categorical) with number unique = {u}, with nan = {n:.1f}%")
        CAT_COLS.append(c)

### XGBoost with Built-In Enable Categorical
Most GBDT libraries (gradient boosted decision trees, e.g. XGB, LGB, CAT) have built-in algorithms for dealing with categorical columns (read more [here](https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html)). We will use XGBoost's built-in categorical handling to train a model. To use it, we convert the categorical columns to dtype `category` and set the flag `enable_categorical=True`. Additionally, we can use the parameter `min_child_weight` to prevent overfitting if necessary.

We observe that on this particular dataset, the built-in categorical handling achieves a good train AUC score but doesn't generalize well to valid and test data.
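Since every experiment in this notebook is judged by AUC, it helps to see what the metric measures. AUC equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. Below is a minimal pure-Python sketch of that definition. The helper name `auc_score` is our own, and the pairwise loop is only practical for toy data. In the notebook we use `cuml.metrics.roc_auc_score` instead.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(n_pos * n_neg), so toy data only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 means the model ranks rows no better than chance, and 1.0 means every positive is ranked above every negative.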

In [None]:
import warnings, xgboost as xgb
print("XGBoost version",xgb.__version__)
warnings.filterwarnings("ignore", category=FutureWarning, module="xgboost.data")

In [None]:
# MARK CATEGORICAL COLUMNS AS CATEGORY FOR XGBOOST BUILT-IN
for c in train.columns:
    if c in CAT_COLS:
        train[c] = train[c].astype('category')
        test[c] = test[c].astype('category')
        valid[c] = valid[c].astype('category')
    else:
        train[c] = train[c].astype('float32')
        test[c] = test[c].astype('float32')
        valid[c] = valid[c].astype('float32')

In [None]:
# XGBOOST PARAMETERS
params = {
        'objective': 'binary:logistic', 
        'learning_rate':0.1,
        'tree_method': 'hist', 
        'max_depth': 7, 
        'subsample':0.8,
        'eval_metric': 'auc',
        'colsample_bytree': 0.8,
        'min_child_weight':1,
        'device':'cuda',}

In [None]:
%%time
# TRAIN XGBOOST
dtrain = xgb.DMatrix(data=train.drop('label',axis=1), label=train['label'], enable_categorical=True)
dvalid = xgb.DMatrix(data=valid.drop('label',axis=1), label=valid['label'], enable_categorical=True)
dtest = xgb.DMatrix(data=test.drop('label',axis=1), enable_categorical=True)
watchlist = [(dtrain, 'train'),(dvalid, 'eval')]
clf = xgb.train(params, dtrain=dtrain,
                num_boost_round=2000,evals=watchlist,
                early_stopping_rounds=50,
                verbose_eval=100)

In [None]:
#from sklearn.metrics import roc_auc_score
from cuml.metrics import roc_auc_score

In [None]:
# COMPUTE VALIDATION SCORES
yp = clf.predict(dvalid)
valid_auc_XE = roc_auc_score(valid['label'],yp)
yp = clf.predict(dtest)
test_auc_XE = roc_auc_score(test['label'],yp)
print(f"Valid AUC = {valid_auc_XE:.3f}, Test AUC = {test_auc_XE:.3f}")

### XGBoost with Label Encoding
We will now convert the categorical strings to numbers using basic `Label Encoding`. We will fill NAN with a new categorical value of `NONE`. Doing this instead of imputing an existing category will allow XGBoost to decide how to use NAN. 

One advantage of basic `Label Encoding` is that it overfits the train data less than other encodings, because the integer codes are arbitrary and the categorical values are mixed together along a single numeric axis. We observe that `Label Encoding` achieves a similar train AUC score to `Built-In Encoding` but generalizes better, achieving a better valid AUC score and test AUC score.
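To make the idea concrete, here is a small pure-Python sketch of `Label Encoding` with missing values mapped to a new `NONE` category. The helper names (`fit_label_encoder`, `fill_missing`) and the toy colors are hypothetical, and assigning codes in sorted order is a choice of this sketch rather than a statement about cuML internals.

```python
def fill_missing(column, token="NONE"):
    """Replace missing entries (None) with a dedicated category."""
    return [token if v is None else v for v in column]

def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

train_col = fill_missing(["red", None, "blue", "red"])
valid_col = fill_missing(["green", "blue", None])

# Fit on all splits together so a category seen only in valid still gets a code.
mapping = fit_label_encoder(train_col + valid_col)

train_enc = [float(mapping[v]) for v in train_col]
valid_enc = [float(mapping[v]) for v in valid_col]

print(mapping)
print(train_enc, valid_enc)
```

Fitting on the concatenated splits only uses the category names, never the labels, so it does not leak target information.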

In [None]:
#from sklearn.preprocessing import LabelEncoder
from cuml.preprocessing import LabelEncoder

In [None]:
#LABEL ENCODING
print("Processing: ",end="")
for c in CAT_COLS:
    print(f"{c}, ",end="")

    # CONVERT CATEGORY COLUMN BACK TO OBJECT
    train[c] = train[c].astype('object')
    test[c] = test[c].astype('object')
    valid[c] = valid[c].astype('object')

    # IMPUTE NAN WITH NEW CATEGORY
    train[c] = train[c].fillna('NONE') 
    test[c] = test[c].fillna('NONE')
    valid[c] = valid[c].fillna('NONE')

    # FIT LABEL ENCODER
    values = cudf.concat([train[c],test[c],valid[c]])
    LE = LabelEncoder()
    LE.fit( values )

    # LE TRANSFORM TRAIN, VALID, TEST
    train[c] = LE.transform(train[c]).astype("float32")
    test[c] = LE.transform(test[c]).astype("float32")
    valid[c] = LE.transform(valid[c]).astype("float32")

In [None]:
%%time
# TRAIN XGBOOST

dtrain = xgb.DMatrix(data=train.drop('label',axis=1), label=train['label'])
dvalid = xgb.DMatrix(data=valid.drop('label',axis=1), label=valid['label'])
dtest = xgb.DMatrix(data=test.drop('label',axis=1))
watchlist = [(dtrain, 'train'),(dvalid, 'eval')]
clf = xgb.train(params, dtrain=dtrain,
                num_boost_round=2000,evals=watchlist,
                early_stopping_rounds=50,
                verbose_eval=100)

In [None]:
# COMPUTE VALIDATION SCORES
yp = clf.predict(dvalid)
valid_auc_LE = roc_auc_score(valid['label'],yp)
yp = clf.predict(dtest)
test_auc_LE = roc_auc_score(test['label'],yp)
print(f"Valid AUC = {valid_auc_LE:.3f}, Test AUC = {test_auc_LE:.3f}")

### XGBoost with Target Encoding
We will now add new columns of `Target Encoding`. We have a choice: keep the existing `Label Encoding` columns and add new `Target Encoding` columns, or replace the `Label Encoding` columns with `Target Encoding` columns. In this notebook, we will add new columns. We `Target Encode` with `smoothing=10` and `kfold=5`. We will encode the target's `mean`, but we can compute other statistics too, like `median`, `max`, `min`, etc. We wrote this code in our previous tutorial, but for convenience we will use the `cuML` implementation of `Target Encoding`.

We observe that `Target Encoding` improves the valid AUC score and test AUC score. Woohoo!
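As a reminder of what smoothing and k-fold do, here is a simplified pure-Python sketch of mean `Target Encoding`. It assumes a common smoothing formula, `(sum + smooth * global_mean) / (count + smooth)`, and assigns folds by row position. The cuML implementation may differ in both details, and all function names and toy values below are hypothetical.

```python
from collections import defaultdict

def smoothed_means(cats, targets, smooth):
    """Return ({category: smoothed target mean}, global mean)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, targets):
        sums[c] += t
        counts[c] += 1
    table = {c: (sums[c] + smooth * global_mean) / (counts[c] + smooth)
             for c in counts}
    return table, global_mean

def target_encode_train(cats, targets, smooth=10, n_folds=5):
    """Out-of-fold encoding: each row is encoded without using its own fold."""
    n = len(cats)
    encoded = [0.0] * n
    for fold in range(n_folds):
        inside = [i for i in range(n) if i % n_folds == fold]
        outside = [i for i in range(n) if i % n_folds != fold]
        table, gm = smoothed_means([cats[i] for i in outside],
                                   [targets[i] for i in outside], smooth)
        for i in inside:
            # Categories unseen in the other folds fall back to the global mean.
            encoded[i] = table.get(cats[i], gm)
    return encoded

def target_encode_new(cats, train_cats, train_targets, smooth=10):
    """Encode valid/test rows using statistics from the full train data."""
    table, gm = smoothed_means(train_cats, train_targets, smooth)
    return [table.get(c, gm) for c in cats]

users = ["u1", "u1", "u2", "u2", "u1", "u3", "u2", "u1", "u3", "u2"]
label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

train_te = target_encode_train(users, label, smooth=10, n_folds=5)
test_te = target_encode_new(["u1", "u2", "u9"], users, label, smooth=10)
print(test_te)
```

With `smooth=10`, user `u1` has four positive rows yet is encoded as 9/14 rather than 1.0, because the estimate is pulled toward the global mean of 0.5. The unseen user `u9` receives the global mean itself.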

In [None]:
#from sklearn.preprocessing import TargetEncoder
from cuml.preprocessing import TargetEncoder

In [None]:
# TARGET ENCODE WITH SMOOTHING=10, KFOLD=5, STAT=MEAN
print("Processing: ",end="")
for c in CAT_COLS:
    print(f"{c}, ",end="")

    # ADD TARGET ENCODE FEATURES
    TE = TargetEncoder(smooth=10, n_folds=5, stat="mean")
    train[f"TE_{c}"] = TE.fit_transform(train[c],train['label']).astype("float32")
    test[f"TE_{c}"] = TE.transform(test[c]).astype("float32")
    valid[f"TE_{c}"] = TE.transform(valid[c]).astype("float32")

In [None]:
%%time
# TRAIN XGBOOST GPU
dtrain = xgb.DMatrix(data=train.drop('label',axis=1), label=train['label'])
dvalid = xgb.DMatrix(data=valid.drop('label',axis=1), label=valid['label'])
dtest = xgb.DMatrix(data=test.drop('label',axis=1))
watchlist = [(dtrain, 'train'),(dvalid, 'eval')]
clf = xgb.train(params, dtrain=dtrain,
                num_boost_round=2000,evals=watchlist,
                early_stopping_rounds=50,
                verbose_eval=100)

In [None]:
# COMPUTE VALIDATION SCORES
yp_te_v = clf.predict(dvalid)
valid_auc_TE = roc_auc_score(valid['label'],yp_te_v)
yp_te_t = clf.predict(dtest)
test_auc_TE = roc_auc_score(test['label'],yp_te_t)
print(f"Valid AUC = {valid_auc_TE:.3f}, Test AUC = {test_auc_TE:.3f}")

## XGB Feature Importance
We can use XGBoost feature importance to see which features are the most helpful. We see that the most helpful features are our `Target Encode` features. In particular `TE_userID` and `TE_productID` are the two strongest features. The model learns that some users and some products are more likely to have positive targets. And our model uses this knowledge to predict future users and products accurately. Note the `Label Encoded` columns of `userID` and `productID` are not as helpful as the `Target Encoded` versions.
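The ranking depends on which `importance_type` we choose. In XGBoost, `weight` counts how many splits use a feature, `total_gain` sums the loss reduction of those splits, and `gain` is the average loss reduction per split. The sketch below recomputes the three quantities from a hypothetical list of splits (the gain values and the `price` feature are made up for illustration) to show that they can rank features differently.

```python
from collections import defaultdict

# Hypothetical splits collected over all trees: (feature used, gain of the split).
splits = [
    ("TE_userID", 12.0), ("TE_productID", 9.0), ("price", 1.5),
    ("TE_userID", 8.0), ("price", 0.5), ("price", 1.0),
]

weight = defaultdict(int)        # number of splits that use the feature
total_gain = defaultdict(float)  # summed gain over those splits
for feature, split_gain in splits:
    weight[feature] += 1
    total_gain[feature] += split_gain

# 'gain' is the average gain per split that uses the feature.
gain = {f: total_gain[f] / weight[f] for f in weight}

by_weight = sorted(weight, key=weight.get, reverse=True)
by_gain = sorted(gain, key=gain.get, reverse=True)
print("ranked by weight:", by_weight)
print("ranked by gain:  ", by_gain)
```

In this toy example `price` is used most often but contributes little per split, so it ranks first by `weight` and last by `gain`. It can be worth plotting more than one importance type before drawing conclusions.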

In [None]:
# PLOT TOP 25 FEATURES BY IMPORTANCE
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))  # Adjust the figure size if needed
xgb.plot_importance(
    clf,
    ax=ax,
    max_num_features=25,  # Display only the top 25 features
    importance_type="weight",  # Options: 'weight', 'gain', 'cover', 'total_gain', 'total_cover'
)
plt.title("XGB Top 25 Feature Importances")
plt.show()

### Summary of XGB Experiments
Let's pause and compare our three XGB experiments side by side. We see that using `Label Encoding` and `Target Encoding` worked best for this dataset. 

In [None]:
fig = plt.figure(figsize=(10, 3), dpi=180)
ax = fig.add_axes([0, 0, 1, 1])

x = ['valid_auc_XE', 'test_auc_XE',
     'valid_auc_LE', 'test_auc_LE',
     'valid_auc_TE', 'test_auc_TE']

y = [valid_auc_XE, test_auc_XE,
     valid_auc_LE, test_auc_LE,
     valid_auc_TE, test_auc_TE]

ax.bar(x, y, color=['g', 'g',  # XE pair
                    'b', 'b',  # LE pair
                    'r', 'r'])  # TE pair

ax.set_title('AUC higher is better')
plt.xticks(rotation=45, ha='right')  
plt.show()

### CPU-GPU Comparison for XGB
Let's compare the runtime between CPU XGBoost and GPU XGBoost. To toggle between GPU and CPU, we change the parameter `device` from `cuda` to `cpu`. We observe that GPU XGBoost is roughly 10x faster than CPU XGBoost. Above, GPU took about 1 second, and below, CPU takes about 10 seconds for this small dataset. This means that we can explore 10x more feature engineering in the same amount of time using GPU versus CPU. (When training models with bigger datasets, the speedup will be even larger because GPUs are most efficient when doing lots of work and processing lots of data.)
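One caveat when toggling devices in a notebook: assigning to `params['device']` changes the shared dictionary, so re-running an earlier GPU cell afterwards would train on CPU. A possible alternative, sketched below, is to copy the parameters into a separate dictionary. The names `params_cpu` and `timed` are hypothetical, and the timed workload is only a stand-in for the `xgb.train` call.

```python
import time

# GPU parameters as defined earlier in this notebook.
params = {
    "objective": "binary:logistic",
    "learning_rate": 0.1,
    "tree_method": "hist",
    "max_depth": 7,
    "subsample": 0.8,
    "eval_metric": "auc",
    "colsample_bytree": 0.8,
    "min_child_weight": 1,
    "device": "cuda",
}

# Copy instead of mutating, so the GPU configuration stays untouched.
params_cpu = {**params, "device": "cpu"}

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be the xgb.train call.
_, seconds = timed(sum, range(1_000_000))
print(f"elapsed: {seconds:.4f} s")
```

Inside a notebook, the `%%time` cell magic used here gives the same information without a helper function.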

In [None]:
params['device'] = 'cpu'

In [None]:
%%time
# TRAIN XGBOOST CPU
dtrain = xgb.DMatrix(data=train.drop('label',axis=1), label=train['label'])
dvalid = xgb.DMatrix(data=valid.drop('label',axis=1), label=valid['label'])
dtest = xgb.DMatrix(data=test.drop('label',axis=1))
watchlist = [(dtrain, 'train'),(dvalid, 'eval')]
clf = xgb.train(params, dtrain=dtrain,
                num_boost_round=2000,evals=watchlist,
                early_stopping_rounds=50,
                verbose_eval=100)

## SVC
We will now explore the performance of a different model. We will investigate Support Vector Machines (SVM). SVM will find a decision boundary to separate the two classes of target. 

### SVC with Target Encoding
Unlike GBDT, SVMs do not have built-in categorical support. Therefore we need to transform all categorical features with an encoding. Let's try using `Target Encoding`. Also, SVMs prefer numerical inputs to have `mean=0` and `std=1` per column, so let's use `StandardScaler` to normalize all numeric columns.
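The scaling step can be sketched in plain Python. The statistics are fit on the train column only and then reused for valid and test. Missing values are set to 0 after scaling, which is the train mean in the scaled space. The sketch uses the population standard deviation, as scikit-learn's `StandardScaler` does. The helper names and the toy `price` values are hypothetical.

```python
import math

def fit_scaler(values):
    """Mean and population std of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    var = sum((v - mean) ** 2 for v in present) / len(present)
    # Guard against constant columns, where the std would be 0.
    return mean, math.sqrt(var) or 1.0

def transform(values, mean, std):
    """Standardize; missing values become 0.0, the mean after scaling."""
    return [0.0 if v is None else (v - mean) / std for v in values]

train_price = [10.0, 20.0, None, 30.0]
valid_price = [20.0, None, 40.0]

# Fit on train only, then apply the same statistics to every split.
mean, std = fit_scaler(train_price)
train_scaled = transform(train_price, mean, std)
valid_scaled = transform(valid_price, mean, std)
print(train_scaled)
print(valid_scaled)
```

Reusing the train statistics keeps valid and test on the same scale the model saw during training.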

In order to train SVC in a reasonable amount of time, we will need to use GPU, specifically **RAPIDS cuML Support Vector Machine Classifier** [(SVC)](https://docs.rapids.ai/api/cuml/stable/api/). Inspired by Scikit-learn’s implementation, the SVC classifier in cuML is designed to be a drop-in replacement for scikit-learn’s SVC module. Also, we will train SVC on a subset of the train data, which allows us to perform more experiments quickly.



Image source: https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe

In [None]:
#from sklearn.svm import SVC
from cuml.svm import SVC

#from sklearn.preprocessing import StandardScaler
from cuml.preprocessing import StandardScaler

In [None]:
print("Processing: ",end="")
for c in train.columns:
    if c=="label": continue
    print(f"{c}, ",end="")

    # STANDARD SCALER FIT TRANSFORM
    SS = StandardScaler().fit(train[[c]])
    train[c] = SS.transform( train[[c]] )
    test[c] = SS.transform( test[[c]] )
    valid[c] = SS.transform( valid[[c]] )

    # IMPUTE NAN WITH MEAN (WHICH IS 0 AFTER STANDARD SCALING)
    train[c] = train[c].fillna(0).astype('float32')
    test[c] = test[c].fillna(0).astype('float32')
    valid[c] = valid[c].fillna(0).astype('float32')

In [None]:
# SUBSAMPLE TRAIN
import numpy as np
np.random.seed(42) 
idx = np.random.randint(0,len(train),15*1024)

# TRAIN SVC
clf = SVC(C=1,probability=True,kernel="rbf",cache_size=8*1024)
_ = clf.fit( train.drop(["label"],axis=1).iloc[idx], train["label"].iloc[idx] )

In [None]:
# COMPUTE VALIDATION SCORES
yp_svc = clf.predict_proba(valid.drop(["label"],axis=1))
valid_auc_SVC = roc_auc_score(valid['label'],yp_svc[1].to_numpy())
yp = clf.predict_proba(test.drop(["label"],axis=1))
test_auc_SVC = roc_auc_score(test['label'],yp[1].to_numpy())
print(f"Valid AUC {valid_auc_SVC:.3f}, Test AUC {test_auc_SVC:.3f}")

### SVC with TF-IDF Feature Engineering
We will now create new columns of feature engineering. We will combine the columns `['brand','cat_0','cat_1','cat_2','cat_3']` into a single product string. Then we will use `TF-IDF` to transform this string into features for our model. We use the [cuML TfidfVectorizer](https://github.com/rapidsai/cuml/blob/branch-25.04/python/cuml/cuml/feature_extraction/_tfidf_vectorizer.py), which is largely based on scikit-learn's `TfidfVectorizer` code. To speed up experiments, we fit and transform `TF-IDF` on the GPU with `RAPIDS cuML`. These new features improve our valid AUC score and test AUC score.
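To see what `TF-IDF` computes, here is a small pure-Python sketch with unigrams and bigrams, matching `ngram_range=(1, 2)`. It assumes the smoothed IDF formula used by scikit-learn's defaults, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. Tokenization is a simple whitespace split, which is a simplification, and the product strings and helper names are hypothetical.

```python
import math
from collections import Counter

def tokens(text, ngram_range=(1, 2)):
    """Lowercase, split on whitespace, and build word n-grams."""
    words = text.lower().split()
    lo, hi = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(words) - n + 1)]

def fit_idf(docs, min_df=1):
    """Smoothed inverse document frequency for terms in at least min_df docs."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(tokens(doc)))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0
            for t, d in df.items() if d >= min_df}

def transform(doc, idf):
    """Term frequency times IDF, L2-normalized; unknown terms are ignored."""
    tf = Counter(t for t in tokens(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

products = ["acme electronics audio", "acme electronics video", "zeta toys games"]
idf = fit_idf(products, min_df=1)
vec = transform("acme electronics audio", idf)
for term, w in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term!r}: {w:.3f}")
```

Terms that appear in fewer product strings, such as `audio`, receive a higher weight than common ones, such as `acme`. In the notebook, `min_df=10` additionally drops terms that appear in fewer than 10 product strings.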

In [None]:
#from sklearn.feature_extraction.text import TfidfVectorizer
from cuml.feature_extraction.text import TfidfVectorizer

In [None]:
# FEATURE ENGINEER - COMBINE COLUMNS
train_product = (train0["brand"].fillna("") + " " + train0["cat_0"] + " " + 
                         train0["cat_1"] + " " + train0["cat_2"] + " " + train0["cat_3"])
valid_product = (valid0["brand"].fillna("") + " " + valid0["cat_0"] + " " + 
                         valid0["cat_1"] + " " + valid0["cat_2"] + " " + valid0["cat_3"])
test_product = (test0["brand"].fillna("") + " " + test0["cat_0"] + " " + 
                         test0["cat_1"] + " " + test0["cat_2"] + " " + test0["cat_3"])

In [None]:
# FEATURE ENGINEER WITH TFIDF
from cupy import hstack
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=10) 
tfidf.fit(train_product)
Xtrain1 = train.drop(["label"],axis=1).iloc[idx].values
Xtrain2 = tfidf.transform(train_product.iloc[idx]).todense()
Xtrain = hstack([Xtrain1,Xtrain2])
print("Train data shape:")
Xtrain.shape

In [None]:
%%time
# TRAIN SVC GPU
clf = SVC(C=1,probability=True,kernel="rbf",cache_size=8*1024)
_ = clf.fit( Xtrain, train["label"].iloc[idx].values )

In [None]:
# COMPUTE VALIDATION SCORES
Xvalid1 = valid.drop(["label"],axis=1).values
Xvalid2 = tfidf.transform(valid_product).todense()
Xvalid = hstack([Xvalid1,Xvalid2])
yp = clf.predict_proba(Xvalid)
valid_auc_SVC_tfidf = roc_auc_score(valid['label'],yp[:,1].get())

Xtest1 = test.drop(["label"],axis=1).values
Xtest2 = tfidf.transform(test_product).todense()
Xtest = hstack([Xtest1,Xtest2])
yp = clf.predict_proba(Xtest)
test_auc_SVC_tfidf = roc_auc_score(test['label'],yp[:,1].get())

print(f"Valid AUC {valid_auc_SVC_tfidf:.3f}, Test AUC {test_auc_SVC_tfidf:.3f}")

### CPU-GPU Comparison for SVC
Training on GPU is 100x or more faster than on CPU. The SVC above trained on GPU in a few seconds, while the SVC on CPU below takes 1+ hours to train. Wow!
Because it runs so long, you may want to skip the cell below for now and test it yourself later!

In [None]:
from sklearn.svm import SVC as SVC_cpu

In [None]:
%%time
# TRAIN SVC CPU
clf = SVC_cpu(C=1,probability=True,kernel="rbf",cache_size=8*1024)
_ = clf.fit( Xtrain.get(), train["label"].iloc[idx].values.get() )

## Summary of Experiments
In this notebook we observed the typical model building strategy. We try different models and we try different feature engineering ideas and keep what works best for our particular dataset. We can also try ensembling our different models and we can try doing more feature engineering. Performing fast experiments is key to finding the most accurate model quickly.
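As one possible next step, the ensembling idea mentioned above can be sketched as a weighted average of two models' predictions, with the weight chosen on validation data. Everything below is illustrative: the labels and predictions are made-up toy values, and `blend` and `auc_score` are hypothetical helpers, not part of the notebook.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def blend(pred_a, pred_b, weight_a):
    """Weighted average of two prediction lists."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]

# Toy validation labels and hypothetical predictions from two models.
valid_labels = [1, 0, 1, 0, 1, 0]
xgb_pred = [0.9, 0.2, 0.4, 0.5, 0.8, 0.1]
svc_pred = [0.6, 0.3, 0.7, 0.2, 0.5, 0.6]

# Try a few weights and keep the one with the best validation AUC.
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best_weight = max(
    candidates,
    key=lambda w: auc_score(valid_labels, blend(xgb_pred, svc_pred, w)),
)
best_auc = auc_score(valid_labels, blend(xgb_pred, svc_pred, best_weight))
print(f"best weight = {best_weight}, valid AUC = {best_auc:.3f}")
```

Because the weight is tuned on the validation data, the blended score should be confirmed on the test data before it is trusted.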

In [None]:
fig = plt.figure(figsize=(10, 3), dpi=180)
ax = fig.add_axes([0, 0, 1, 1])

x = ['valid_auc_XE', 'test_auc_XE',
     'valid_auc_LE', 'test_auc_LE',
     'valid_auc_TE', 'test_auc_TE',
     'valid_auc_SVC_TE', 'test_auc_SVC_TE',
     'valid_auc_SVC_tfidf', 'test_auc_SVC_tfidf',]

y = [valid_auc_XE, test_auc_XE,
     valid_auc_LE, test_auc_LE,
     valid_auc_TE, test_auc_TE,
     valid_auc_SVC, test_auc_SVC,
     valid_auc_SVC_tfidf, test_auc_SVC_tfidf]

ax.bar(x, y, color=['g','g','b','b','r','r',
                    'y','y','m','m'])

ax.set_title('AUC higher is better')
plt.xticks(rotation=45, ha='right')  
plt.show()

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)