# Training with XGBoost

- In this section I will attempt to minimize RMSE by using the XGBoost regressor and fine-tuning it with the Optuna library.

In [1]:
%pip install category_encoders

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install optuna

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Install XGBoost
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [4]:
#import necessary libraries
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import optuna

- Categorical features will be encoded using TargetEncoder, as was done in the previous test.

In [5]:
df = pd.read_csv('cat_backpack.csv')

In [6]:
display(df.sample(5))

Unnamed: 0,id,brand,material,size,compartments,laptop_compartment,waterproof,style,color,weight_cap,Price
201208,201208,Puma,Unknown,Large,2,True,True,Messenger,Blue,6.129039,39.78459
159152,159152,Adidas,Leather,Small,1,True,False,Messenger,Blue,16.287537,75.76766
243205,243205,Adidas,Polyester,Small,7,True,False,Messenger,Blue,20.834982,82.36465
15998,15998,Under Armour,Canvas,Small,1,False,False,Unknown,Gray,22.101734,111.00893
139804,139804,Adidas,Canvas,Medium,3,True,False,Messenger,Black,28.382303,65.77721


- Ordinal encoding for the 'size' feature.

In [7]:
# Create a dictionary to map size categories to numerical values
size_mapping = {
    'Small': 0,
    'Medium': 1,
    'Large': 2,
    'Unknown': 3  # Or you can assign it -1 or another distinct value
}

# Apply the mapping to the 'size' column
df['size_encoded'] = df['size'].map(size_mapping)

# Drop the original 'size' column (optional)
df.drop('size', axis=1, inplace=True)
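
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. A minimal sketch of a check for that, using a small made-up frame (`toy` and its values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Illustrative data only; 'XL' is deliberately absent from the mapping
toy = pd.DataFrame({'size': ['Small', 'Large', 'XL', 'Unknown']})

size_mapping = {'Small': 0, 'Medium': 1, 'Large': 2, 'Unknown': 3}

toy['size_encoded'] = toy['size'].map(size_mapping)

# Values missing from the dictionary are mapped to NaN; list them explicitly
unmapped = toy.loc[toy['size_encoded'].isna(), 'size'].unique().tolist()
print(unmapped)
```

If the list is empty on the real data, every size category was covered by the mapping.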

- Target encoding for 'brand', 'material', 'style' and 'color'.

In [8]:
# Define features (X) and target (y)
X = df.drop(['Price', 'id'], axis=1)  # Exclude the target and the 'id' column
y = df['Price']

# List of categorical features to encode
categorical_features = ['brand', 'material', 'style', 'color']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the TargetEncoder
encoder = TargetEncoder(cols=categorical_features)

# Fit the encoder on the training data and transform both training and testing data
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

# Now X_train_encoded and X_test_encoded have the categorical features target encoded

In [9]:
display(X_train_encoded.sample(5))

Unnamed: 0,brand,material,compartments,laptop_compartment,waterproof,style,color,weight_cap,size_encoded
251350,81.956967,80.479359,2,True,False,81.432036,80.511231,26.893252,0
207531,81.858243,80.479359,7,False,False,81.430891,80.511231,6.764248,1
64532,81.858243,82.028371,4,True,False,81.432036,80.985014,13.097618,0
113674,81.956967,82.170161,10,False,False,81.417545,80.985014,7.809743,0
92351,81.333835,81.072777,5,True,True,82.240102,81.014828,10.879323,0
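
For intuition about the encoded columns above: target encoding replaces each category with a statistic of the target for that category. The sketch below computes the plain per-category mean on made-up data. It is a simplification, since `category_encoders.TargetEncoder` additionally blends each category mean with the global mean (smoothing). The frame `toy`, the brands and the prices are assumptions for illustration.

```python
import pandas as pd

# Illustrative training data, not the backpack dataset
toy = pd.DataFrame({
    'brand': ['A', 'A', 'B', 'B', 'B'],
    'Price': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean target per category (in practice computed on the training split only)
brand_means = toy.groupby('brand')['Price'].mean()

# Replace each category by its mean; unseen categories fall back to the global mean
global_mean = toy['Price'].mean()
new_rows = pd.Series(['A', 'B', 'C'])
encoded = new_rows.map(brand_means).fillna(global_mean)
print(encoded.tolist())
```

Because every category is mapped to a mean Price, encoded values stay close to the overall mean Price when categories differ little, which would be consistent with the values near 81 in the sample above.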


- Train and evaluate an XGBoost regressor.

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)


In [15]:

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train_scaled, label=y_train)
dtest = xgb.DMatrix(X_test_scaled, label=y_test)

# Set XGBoost parameters. The number of trees is controlled by num_boost_round
# below; 'n_estimators' belongs to the scikit-learn wrapper and is ignored by xgb.train.
params = {
    'objective': 'reg:squarederror',
    'max_depth': 3,
    'learning_rate': 0.1
}

# Train the XGBoost model with 100 boosting rounds
model = xgb.train(params, dtrain, num_boost_round=100)

# Predict on the test set
y_pred = model.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5

print(f"Root Mean Squared Error: {rmse}")

Root Mean Squared Error: 38.91025938703038
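
To judge whether an RMSE is good, it helps to compare it with a trivial baseline that always predicts the mean of the training target. A minimal sketch with made-up numbers (the arrays and the helper `rmse` below are illustrative assumptions, not the backpack data):

```python
import numpy as np

# Illustrative targets only
y_train_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_test_toy = np.array([20.0, 30.0])

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline: predict the training mean for every test row
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = rmse(y_test_toy, baseline_pred)
print(baseline_rmse)
```

On the real data the same comparison would use `y_train.mean()` and `y_test`. A model RMSE close to this baseline would suggest the features carry little signal about Price.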


- Fine-tune the hyperparameters of the XGBoost model using Optuna.