# Target encoding and training a LightGBM model

- In this section I apply ordinal encoding to the 'size' variable and target encoding to the categorical features 'brand', 'material', 'style' and 'color'. I chose ordinal encoding for 'size' because its categories have a natural order (Small < Medium < Large). The remaining categorical variables are target encoded with TargetEncoder from the category_encoders library.
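
To make the two encodings concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset). It does plain, unsmoothed mean target encoding by hand with pandas. TargetEncoder from category_encoders additionally blends each category mean with the global mean, so its output will differ somewhat from this hand-rolled version.

```python
import pandas as pd

# Toy data, purely illustrative
toy = pd.DataFrame({
    'brand': ['A', 'A', 'B', 'B', 'B'],
    'size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
    'Price': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Ordinal encoding: integers that follow the natural order of the sizes
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
toy['size_encoded'] = toy['size'].map(size_order)

# Plain (unsmoothed) target encoding: each category becomes its mean Price
brand_means = toy.groupby('brand')['Price'].mean()
toy['brand_encoded'] = toy['brand'].map(brand_means)

print(toy[['brand', 'brand_encoded', 'size', 'size_encoded']])
```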


In [22]:
%pip install category_encoders

Note: you may need to restart the kernel to use updated packages.


In [23]:
%pip install optuna

Note: you may need to restart the kernel to use updated packages.


In [24]:
%pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [25]:
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import optuna

In [26]:
df = pd.read_csv('cat_backpack.csv')

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  300000 non-null  int64  
 1   brand               300000 non-null  object 
 2   material            300000 non-null  object 
 3   size                300000 non-null  object 
 4   compartments        300000 non-null  int64  
 5   laptop_compartment  300000 non-null  bool   
 6   waterproof          300000 non-null  bool   
 7   style               300000 non-null  object 
 8   color               300000 non-null  object 
 9   weight_cap          300000 non-null  float64
 10  Price               300000 non-null  float64
dtypes: bool(2), float64(2), int64(2), object(5)
memory usage: 21.2+ MB


In [28]:
display(df.sample(5))

            id     brand  material    size  compartments  laptop_compartment  waterproof      style  color  weight_cap      Price
222806  222806      Puma     Nylon   Large            10               False        True    Unknown   Pink   29.457056  125.96579
155198  155198    Adidas    Canvas  Medium             9               False       False  Messenger   Gray    8.432289   85.44422
298917  298917  Jansport     Nylon   Small             6               False       False       Tote  Black   20.077096  128.88365
15641    15641    Adidas   Leather   Large             3                True        True   Backpack  Black   17.483045   27.85988
62573    62573      Puma   Leather   Small             5               False       False  Messenger   Gray   22.898382   40.00212


- Ordinal encoding for the 'size' feature.

In [29]:
# Map size categories to integers that preserve their natural order
size_mapping = {
    'Small': 0,
    'Medium': 1,
    'Large': 2,
    'Unknown': 3  # Has no natural rank; -1 or another distinct value would also work
}

# Apply the mapping to the 'size' column
df['size_encoded'] = df['size'].map(size_mapping)

# Drop the original 'size' column, which is no longer needed
df = df.drop('size', axis=1)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  300000 non-null  int64  
 1   brand               300000 non-null  object 
 2   material            300000 non-null  object 
 3   compartments        300000 non-null  int64  
 4   laptop_compartment  300000 non-null  bool   
 5   waterproof          300000 non-null  bool   
 6   style               300000 non-null  object 
 7   color               300000 non-null  object 
 8   weight_cap          300000 non-null  float64
 9   Price               300000 non-null  float64
 10  size_encoded        300000 non-null  int64  
dtypes: bool(2), float64(2), int64(3), object(4)
memory usage: 21.2+ MB
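
One thing worth checking with a dictionary-based mapping: `Series.map` returns NaN for any label that is missing from the dictionary. The `df.info()` output above shows no nulls in `size_encoded`, so every label in this dataset was covered. The sketch below uses a made-up Series (the `'XL'` label is hypothetical) to show how unmapped labels could be sent to a sentinel value such as -1 instead of being ranked above 'Large'.

```python
import pandas as pd

# Illustrative labels only; 'XL' is a hypothetical value not present in the dataset
sizes = pd.Series(['Small', 'Large', 'Unknown', 'Medium', 'XL'])

# Only the labels with a natural order are listed
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}

# Labels missing from the dictionary come back as NaN
mapped = sizes.map(size_order)

# Replace NaN with a sentinel and restore an integer dtype
size_encoded = mapped.fillna(-1).astype(int)
print(size_encoded.tolist())
```

Whether a sentinel or a top rank works better for 'Unknown' is an empirical question; tree-based models can split on either.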


- Target encoding for 'brand', 'material', 'style' and 'color'.

In [31]:
# Define features (X) and target (y); 'id' carries no predictive information
X = df.drop(['Price', 'id'], axis=1)
y = df['Price']

# List of categorical features to encode
categorical_features = ['brand', 'material', 'style', 'color']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the TargetEncoder
encoder = TargetEncoder(cols=categorical_features)

# Fit the encoder on the training data and transform both training and testing data
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

# Now X_train_encoded and X_test_encoded have the categorical features target encoded

In [32]:
print(X_train_encoded.sample(5))

            brand   material  compartments  laptop_compartment  waterproof  \
272417  81.956967  82.028371             7                True        True   
200398  81.858243  82.028371             1                True        True   
275984  81.956967  80.479359            10                True        True   
284883  81.956967  81.072777             3               False        True   
150286  81.956967  81.072777             7               False       False   

            style      color  weight_cap  size_encoded  
272417  81.432036  81.803978   22.315614             2  
200398  81.430891  81.675616   28.567489             0  
275984  81.432036  82.409704   24.579628             2  
284883  81.417545  81.803978   14.279998             1  
150286  81.417545  82.010883   26.893251             2  
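
Fitting the encoder on the training split only keeps test targets out of the encoding, but each training row's own price still contributes to its category mean. A common variation is out-of-fold encoding, where each row is encoded with means computed on the other folds. The sketch below is a hand-rolled pandas/NumPy version of that idea, not what TargetEncoder does internally. The function name `out_of_fold_target_encode` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)

    # Assign every row to one of n_splits folds at random
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(categories)) % n_splits

    encoded = pd.Series(np.nan, index=categories.index)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Category means from the rows that are NOT in the held-out fold
        means = target[~held_out].groupby(categories[~held_out]).mean()
        # Categories unseen in the other folds fall back to their overall mean
        fallback = target[~held_out].mean()
        encoded[held_out] = categories[held_out].map(means).fillna(fallback)
    return encoded

# Toy usage with made-up values
demo = pd.DataFrame({'brand': list('aaaabbbb'), 'Price': [1.0] * 4 + [5.0] * 4})
encoded = out_of_fold_target_encode(demo['brand'], demo['Price'], n_splits=2)
print(encoded.tolist())
```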


In [33]:
# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler to the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

# Create a LightGBM regressor object
model = lgb.LGBMRegressor()

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions using the scaled test data
y_pred = model.predict(X_test_scaled)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Base Root Mean Squared Error (RMSE): {rmse}")



Base Root Mean Squared Error (RMSE): 38.922589913117605
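
For context on this figure, a useful reference point is a model that always predicts the training mean: its RMSE shows how much of the error the features actually remove. The sketch below computes RMSE by hand and evaluates such a mean predictor on synthetic numbers (not the competition data). The helper name `rmse` and the demo arrays are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic targets, purely illustrative
rng = np.random.default_rng(0)
y_train_demo = rng.uniform(15, 150, size=1000)
y_test_demo = rng.uniform(15, 150, size=250)

# Constant prediction: the mean of the training targets
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_rmse = rmse(y_test_demo, baseline_pred)
print(f"Mean-predictor RMSE: {baseline_rmse:.3f}")
```

In the notebook, the same comparison would be `rmse(y_test, np.full(len(y_test), y_train.mean()))`, placed next to the LightGBM score.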




In [None]:
# X_train_scaled and X_test_scaled are reused from the baseline cell above

def objective(trial):
    # Define the hyperparameters to tune
    # (suggest_float replaces the deprecated suggest_uniform / suggest_loguniform)
    param = {
        'objective': 'regression',
        'metric': 'rmse',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }

    # Create a LightGBM regressor object with the hyperparameters
    model = lgb.LGBMRegressor(**param)

    # Train the model
    model.fit(X_train_scaled, y_train)

    # Make predictions using the scaled test data
    # (each trial is scored on the test split, so the best value is an optimistic estimate)
    y_pred = model.predict(X_test_scaled)

    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Run the Optuna study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
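
The objective above scores every trial on `X_test_scaled`, so the best trial is selected on the same data that is later used to report the final RMSE. One common alternative, sketched here under the assumption that giving up some training rows is acceptable, is to carve out a separate validation split for tuning and keep the test split for a single final evaluation. The helper `three_way_split` is hypothetical and uses only NumPy.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Return shuffled, disjoint index arrays for train, validation and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

# Demo on 1000 hypothetical rows
train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

In the notebook this would mean fitting the encoder and scaler on `X.iloc[train_idx]` only, computing the objective's RMSE on the `val_idx` rows, and touching the `test_idx` rows once, after the study has finished.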


- Results recorded from a separate run of the study (the study is not seeded, so the values differ between runs):
  - Final Root Mean Squared Error (RMSE): 38.90398143180675
  - Best trial: 38.90398143180675
  - Best hyperparameters: {'lambda_l1': 1.6229992273231575e-06, 'lambda_l2': 0.0005641642076268367, 'num_leaves': 16, 'feature_fraction': 0.42662487262370474, 'bagging_fraction': 0.931976901981435, 'bagging_freq': 5, 'min_child_samples': 9, 'learning_rate': 0.015392905483155068, 'n_estimators': 766}

In [21]:
print(f"Best trial: {study.best_trial.value}")
print(f"Best hyperparameters: {study.best_trial.params}")

# Train the final model with the best hyperparameters
best_params = study.best_trial.params

# Create a LightGBM regressor object with the best hyperparameters
model = lgb.LGBMRegressor(**best_params)

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions using the scaled test data
y_pred = model.predict(X_test_scaled)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Final Root Mean Squared Error (RMSE): {rmse}")

Best trial: 38.9081140448483
Best hyperparameters: {'lambda_l1': 1.4543158394093417e-07, 'lambda_l2': 1.2972059993778709e-05, 'num_leaves': 21, 'feature_fraction': 0.48555572721680923, 'bagging_fraction': 0.6752773982027428, 'bagging_freq': 7, 'min_child_samples': 88, 'learning_rate': 0.01001847056153441, 'n_estimators': 922}




Final Root Mean Squared Error (RMSE): 38.9081140448483


- The RMSE obtained by the tuned model in this run was 38.9081, which still fell short of the competition winner's score (38.82013).