# Target Encoding Implementation for 30 Days of ML Challenge, by Juan Torres

#### Based on Abhishek Thakur's tutorials and notebooks:

https://www.youtube.com/watch?v=2Yx2Y545yBk

https://www.kaggle.com/abhishek/competition-part-3-target-encoding

A quick word about our previous notebook. We switched from XGBoost to LightGBM but obtained worse results. We will see after parameter optimization which model works best, but for now we'll switch back to XGBoost.

For this notebook, we will implement target encoding, a technique that replaces each category of a categorical feature with the mean target value observed for that category. A more detailed explanation of this technique can be found at https://maxhalford.github.io/blog/target-encoding/. Because the encoding is computed from the target itself, we compute it fold by fold, which requires changing the way the data is preprocessed.
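
As a minimal sketch of the basic idea, assume a hypothetical toy DataFrame with one categorical column `cat0` and a numeric `target` (the values are made up for illustration and are not taken from the competition data). Each category is replaced by the mean target observed for it:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "B", "A"],
    "target": [1.0, 4.0, 3.0, 6.0, 2.0],
})

# Mean target per category: A -> (1 + 3 + 2) / 3, B -> (4 + 6) / 2
category_means = toy.groupby("cat0")["target"].mean()

# Replace each category with its mean target value.
toy["tar_enc_cat0"] = toy["cat0"].map(category_means)
print(toy)
```

Note that this naive version computes the means on the same rows that receive them, so each row's own target leaks into its feature. The loop later in this notebook avoids that by taking the means from the other folds only.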

## 1. Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# For one-hot encoding categorical variables
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing

# For the construction of the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For training the XGBoost model
from xgboost import XGBRegressor

# For the mean squared error needed to calculate our scores
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/train-folds-30-days-of-ml/train_folds.csv
/kaggle/input/30-days-of-ml/sample_submission.csv
/kaggle/input/30-days-of-ml/train.csv
/kaggle/input/30-days-of-ml/test.csv


## 2. Loading and preparing data and pipeline construction

In [2]:
# Load the training and test data. 
X_full = pd.read_csv("../input/train-folds-30-days-of-ml/train_folds.csv")
X_test_full = pd.read_csv("../input/30-days-of-ml/test.csv")

With the data loaded, we now modify it to add the target-encoded features:

In [3]:
# We select all features except "id", "target" and "kfold", as these are not predictors of our target.
useful_features = [c for c in X_full.columns if c not in ("id", "target", "kfold")]

# Select numerical columns
num_cols = [col for col in useful_features if 'cont' in col]

# We select categorical columns. Note that we dropped the cardinality check.
object_cols = [col for col in useful_features if 'cat' in col]

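# We build X_test out of X_test_full, but only selecting the useful features.
# We take an explicit copy so that adding columns to X_test below does not trigger a SettingWithCopyWarning.
X_test = X_test_full[useful_features].copy()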

# Next up, we set up the for loop which will perform the target encoding:
for col in object_cols: 
    temp_X_full = [] # We create a temporary list to store the dataframes.
    temp_test_feature = None # We create a temporary feature for the test set.
    
    for fold in range(5): # We loop across all folds
        X_train = X_full[X_full.kfold != fold].reset_index(drop=True) 
        X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) 
        feat = X_train.groupby(col)["target"].agg("mean") # We group the training rows by category and take the mean of "target" within each category.
        feat = feat.to_dict() # We convert the resulting Series into a dictionary (category -> mean target).
        X_valid.loc[:, f"tar_enc_{col}"] = X_valid[col].map(feat) # We map the mean values to a new column in X_valid.
        temp_X_full.append(X_valid) # We append X_valid to our temporary list.
        
        if temp_test_feature is None: # If we don't have a temp_test_feature...
            temp_test_feature = X_test[col].map(feat) # ...we assign it this value.
            
        else: # If it's not None (for folds above 0)...
            temp_test_feature = temp_test_feature + X_test[col].map(feat) # ...we add the current fold's values to it.
            
    temp_test_feature = temp_test_feature/5 # We divide by the number of folds to get the average.
    X_test.loc[:, f"tar_enc_{col}"] = temp_test_feature # We assign the averaged values to a new column in X_test.
    X_full = pd.concat(temp_X_full) # We build the new X_full dataframe with the new target encoding columns.
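
The fold logic of the loop above can be sketched on a toy example. This is only an illustration under stated assumptions: the helper name `oof_target_encode` and the toy data are hypothetical, and unlike the loop above it keeps the original row order instead of concatenating the folds.

```python
import pandas as pd

def oof_target_encode(df, col, n_folds, target="target", fold_col="kfold"):
    """Hypothetical helper: out-of-fold target encoding of `col`, aligned to df's index."""
    encoded = pd.Series(index=df.index, dtype="float64")
    for fold in range(n_folds):
        in_fold = df[fold_col] == fold
        # Category means are computed only from rows outside the current fold.
        means = df.loc[~in_fold].groupby(col)[target].mean()
        # Rows inside the current fold receive those out-of-fold means.
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means)
    return encoded

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "cat0":   ["A", "A", "B", "B", "A", "B"],
    "target": [1.0, 3.0, 4.0, 6.0, 5.0, 8.0],
    "kfold":  [0, 0, 0, 1, 1, 1],
})
toy["tar_enc_cat0"] = oof_target_encode(toy, "cat0", n_folds=2)
print(toy)
```

Each row's encoded value never uses its own target. A category that does not appear in the other folds would map to NaN, which is also how `.map` behaves in the loop above.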
    

Ok, let's take a peek at the reworked X_full:

In [4]:
X_full.head()

| | id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | ... | tar_enc_cat0 | tar_enc_cat1 | tar_enc_cat2 | tar_enc_cat3 | tar_enc_cat4 | tar_enc_cat5 | tar_enc_cat6 | tar_enc_cat7 | tar_enc_cat8 | tar_enc_cat9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | B | B | A | A | B | D | A | F | A | ... | 8.246824 | 8.204869 | 8.246016 | 8.277317 | 8.241689 | 8.251699 | 8.241846 | 8.300596 | 8.233186 | 8.241516 |
| 1 | 6 | A | A | A | C | B | D | A | E | A | ... | 8.240298 | 8.278027 | 8.246016 | 8.237701 | 8.241689 | 8.251699 | 8.241846 | 8.241403 | 8.233186 | 8.253157 |
| 2 | 8 | B | A | A | A | B | D | A | E | C | ... | 8.246824 | 8.278027 | 8.246016 | 8.277317 | 8.241689 | 8.251699 | 8.241846 | 8.241403 | 8.281416 | 8.260024 |
| 3 | 10 | A | B | A | C | B | D | A | E | G | ... | 8.240298 | 8.204869 | 8.246016 | 8.237701 | 8.241689 | 8.251699 | 8.241846 | 8.241403 | 8.252384 | 8.224612 |
| 4 | 18 | B | A | A | C | B | D | A | E | A | ... | 8.246824 | 8.278027 | 8.246016 | 8.237701 | 8.241689 | 8.251699 | 8.241846 | 8.241403 | 8.233186 | 8.233269 |


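As we can see, we produced a new target-encoded column for each of the categorical columns in our data. Now we can use these new features for our model, but first we have to redefine the object columns and numerical columns. The previous name-based checks would treat the "tar_enc_cat" columns as object columns, because their names contain "cat", and would leave them out of the numerical columns, because their names do not contain "cont":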

In [5]:
# We select all features except "id", "target" and "kfold", as these are not predictors of our target.
useful_features = [c for c in X_full.columns if c not in ("id", "target", "kfold")]

# Select numerical columns by data type, not by column name
num_cols = [col for col in X_full[useful_features] if X_full[col].dtype in ['int64', 'float64']]

# We select categorical columns by name prefix, so that the "tar_enc_cat" columns are excluded. Note that we dropped the cardinality check.
object_cols = [col for col in useful_features if col.startswith("cat")]
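
A small sketch of why the selection rule had to change, using a hypothetical short list of column names in the style of this dataset: the substring test also matches the new encoded columns, while the prefix test does not.

```python
# Hypothetical column names, shortened for illustration.
columns = ["cat0", "cat1", "cont0", "tar_enc_cat0", "tar_enc_cat1"]

# Old rule: any name containing "cat" counts as categorical.
by_substring = [c for c in columns if "cat" in c]

# New rule: only names that start with "cat" count as categorical.
by_prefix = [c for c in columns if c.startswith("cat")]

print(by_substring)  # ['cat0', 'cat1', 'tar_enc_cat0', 'tar_enc_cat1']
print(by_prefix)     # ['cat0', 'cat1']
```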

In [6]:
# Preprocessing for numerical data, we use a StandardScaler to apply standardization.
numerical_transformer = preprocessing.StandardScaler()

# Preprocessing for categorical data and one-hot encoding.
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, num_cols),('cat', categorical_transformer, object_cols)])

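# Define the model (trained on GPU). Note: in XGBoost 2.0 and later these GPU arguments are deprecated in favor of tree_method='hist' together with device='cuda'.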
model = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor")

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

In [7]:
# We set up a list to store the final predictions.
final_predictions = []

# We set up a list for storing the root mean squared error (RMSE) scores.
scores = []

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy X_test so that the original is not altered or overwritten.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid = my_pipeline.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test = my_pipeline.predict(X_test_copy) # We instruct the pipeline to make predictions on the copy of X_test.
    final_predictions.append(preds_test) # We append each of the test predictions to our final_predictions list.
    rmse = np.sqrt(mean_squared_error(y_valid, preds_valid)) # We store the root mean squared error (RMSE) in a variable. Taking the square root ourselves avoids the squared=False argument, which is deprecated in newer scikit-learn releases.
    print(fold, rmse) # Print the fold number and the RMSE for each fold.
    scores.append(rmse) # We append the rmse value to the scores list.
    
print(np.mean(scores), np.std(scores)) # Print the average RMSE across folds and its standard deviation.

0 0.7227030406324084
1 0.7306875651326146
2 0.7264380957570341
3 0.7285041616809392
4 0.7446395076435606
0.7305944741693114 0.007499216522979915
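
For reference, the score reported per fold is the root mean squared error. The sketch below computes it by hand on made-up toy values, then summarizes the five fold scores printed above in the same way as the notebook does.

```python
import numpy as np

# Toy values, made up for illustration only.
y_true = np.array([8.0, 9.0, 7.0, 10.0])
y_pred = np.array([7.0, 9.0, 9.0, 10.0])

# RMSE: square the errors, average them, then take the square root.
errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)

# The per-fold scores printed above, summarized with mean and standard deviation.
fold_scores = [0.7227030406324084, 0.7306875651326146, 0.7264380957570341,
               0.7285041616809392, 0.7446395076435606]
print(np.mean(fold_scores), np.std(fold_scores))  # np.std defaults to ddof=0
```

Because `np.std` defaults to `ddof=0`, the reported spread is the population standard deviation of the five scores, not the sample estimate.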


As always, let's stack the test predictions from each fold, average them, and build the output file.

In [8]:
predictions = np.mean(np.column_stack(final_predictions), axis=1)
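
To make the shape handling in the line above concrete, here is a small sketch with hypothetical predictions from three folds for four test rows (the numbers are made up). `np.column_stack` places each fold's predictions in its own column, and averaging over `axis=1` yields one blended value per test row.

```python
import numpy as np

# Hypothetical per-fold test predictions, made up for illustration only.
fold_preds = [
    np.array([8.0, 8.5, 7.5, 9.0]),
    np.array([8.2, 8.3, 7.7, 9.2]),
    np.array([8.1, 8.4, 7.6, 8.8]),
]

stacked = np.column_stack(fold_preds)  # shape (4, 3): rows are test rows, columns are folds
blended = stacked.mean(axis=1)         # one averaged prediction per test row
print(stacked.shape, blended)
```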

In [9]:
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test_full.id, 'target': predictions})
output.to_csv('submission.csv', index=False)