# Backpack Kaggle Competition
### W207 Final Project - Spring 2025

Team: Perry Gabriel, Aurelia Yang

University of California, Berkeley

## Description

In this competition, participants are challenged to develop machine learning models to predict the price of a backpack based on various features. This is a great opportunity to test your skills, learn new techniques, and compete with others in the data science community.

## Evaluation

Submissions are evaluated on the root mean squared error between the predicted and actual price of the backpack.

RMSE is defined as:
$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

where $y_i$ is the actual price of the backpack and $\hat{y}_i$ is the predicted price of the backpack.
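
As a quick numeric check of the formula, the sketch below computes RMSE by hand with NumPy. The helper name `rmse` and the sample prices are illustrative and not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted prices
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

# Errors are -10, 10, -20, so the mean squared error is (100 + 100 + 400) / 3 = 200
score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, the single 20-unit miss contributes more to the score than the two 10-unit misses combined.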

## Data Description

The data consists of the following columns:

- `id`: A unique identifier for the backpack.
- `Brand`: The brand of the backpack.
- `Material`: The material of the backpack.
- `Size`: The size of the backpack.
- `Compartments`: The number of compartments in the backpack.
- `Laptop Compartment`: Whether the backpack has a laptop compartment.
- `Waterproof`: Whether the backpack is waterproof.
- `Style`: The style of the backpack.
- `Color`: The color of the backpack.
- `Weight Capacity (kg)`: The weight capacity of the backpack in kilograms.
- `Price`: The price of the backpack.

## Data Splits
The labeled training data is split into three parts (see the `train_test_split` calls with `test_size=0.3` followed by `test_size=0.5` in Section 2):
- **Train**: The training set contains 70% of the data and is used to train the model.
- **Validation**: The validation set contains 15% of the data and is used to tune the model.
- **Test**: The held-out test set contains 15% of the data and is used to evaluate the model's performance.

## Important Notes about the Dataset
- There are four datasets: `train`, `train_extra` (file `training_extra.csv`), `test`, and `sample_submission`.
- The `train` dataset contains the training data with the target variable `Price`.
- The `train_extra` dataset contains additional training data that can be used to improve the model's performance.
- The `test` dataset contains the test data without the target variable `Price`.
- The `sample_submission` dataset contains a sample submission file with the correct format.
- The `train` and `train_extra` datasets are combined to create a larger training set.
- The `train_extra` dataset was provided by the competition organizers and is not part of the original dataset.

## Submission File

For each `id` in the test set, you must predict the price of the backpack. The file should contain a header and have the following format:

```csv
id,Price
1,100
2,200
3,300
```
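
A file in this format can be produced with pandas. This is a minimal sketch: the ids and prices below are placeholders, and in the notebook the predictions would come from a trained model applied to `test.csv`.

```python
import io
import pandas as pd

# Placeholder ids and predictions; real values come from test.csv and a model
test_ids = [1, 2, 3]
predicted_prices = [100.0, 200.0, 300.0]

submission = pd.DataFrame({"id": test_ids, "Price": predicted_prices})

# Write to an in-memory buffer here; pass a file path such as
# "submission.csv" to to_csv() to create the actual file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters: without it pandas writes the row index as an extra unnamed column, which does not match the expected header.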

## Timeline

- **Start Date** - February 1, 2025
- **Entry Deadline** - Same as the Final Submission Deadline
- **Team Merger Deadline** - Same as the Final Submission Deadline
- **Final Submission Deadline** - February 28, 2025

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

## Acknowledgements

This dataset was created by [Kaggle](https://www.kaggle.com/datasets/souradippal/student-bag-price-prediction-dataset) for the purpose of hosting a competition.

## Team Members

- [Perry Gabriel](https://www.kaggle.com/prgabriel)
- [Aurelia Yang](https://www.kaggle.com/aureliayang)

## Sections

1. [Exploratory Data Analysis](#1.-Exploratory-Data-Analysis)
2. [Data Preprocessing](#2.-Data-Preprocessing)
3. [Modeling](#3.-Modeling)
4. [Evaluation](#4.-Evaluation)
5. [Model Optimization](#5.-Model-Optimization)
6. [Final Submission](#6.-Final-Submission)
7. [Conclusion](#7.-Conclusion)

## References
[Backpack Kaggle Competition Link](https://www.kaggle.com/competitions/playground-series-s5e2)

[Backpack Kaggle Competition Dataset](https://www.kaggle.com/datasets/souradippal/student-bag-price-prediction-dataset)


## 0. Setup
Uncomment the lines below to download the data from Kaggle and install the required libraries. This assumes you have the Kaggle API installed and configured.

In [None]:
# !kaggle competitions download -c playground-series-s5e2
# !unzip playground-series-s5e2 -d ../data/raw/
# !pip install -r ../requirements.txt
# !rm -rf playground-series-s5e2.zip

In [None]:
import os
import mlflow
import numpy as np
import pandas as pd
import xgboost as xgb
import mlflow.sklearn
import mlflow.xgboost
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore", category=UserWarning)  # suppress the TensorFlow version warning

In [None]:
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(experiment_name='E2E_Kaggle_Backpack_Project')
mlflow.autolog()

In [None]:
raw_data_path = '../data/raw/'

In [None]:
def check_and_import_colab():
    global raw_data_path
    global processed_path
    try:
        from google.colab import drive  
        import google.colab
        print("Running on Google Colab")
        # Import necessary libraries for Google Colab
        drive.mount('/content/drive')

        # define paths
        raw_data_path = "/content/drive/MyDrive/Kaggle_Backpack/data/raw/"
        processed_path = "/content/drive/MyDrive/Kaggle_Backpack/data/processed/"
        return True

    except ImportError:
        print("Not running on Google Colab")
        os.makedirs(raw_data_path, exist_ok=True)
        print("Created 'raw_data_path' directory for non-Colab Environment.")
        return False

on_colab = check_and_import_colab()

if on_colab:
    print("Google Colab environment detected. Paths have been set accordingly.")
else:
    print("Local environment detected. Paths have been set accordingly.")

## 1. Exploratory Data Analysis

In this section, we will explore the data to understand its structure and identify any patterns or trends that may be present.


### 1.1 Load the Data

Let's start by loading the data and taking a look at the first few rows.

In [None]:
train_df = pd.read_csv(filepath_or_buffer=os.path.join(raw_data_path, 'train.csv'), index_col=0, header=0, sep=',')
test_df = pd.read_csv(filepath_or_buffer=os.path.join(raw_data_path, 'test.csv'), index_col=0, header=0, sep=',')
train_extra_df = pd.read_csv(filepath_or_buffer=os.path.join(raw_data_path, 'training_extra.csv'), index_col=0, header=0, sep=',')

train_df.head()

In [None]:
test_df.head()

In [None]:
train_extra_df.head()

### 1.2 Data Summary

Next, let's take a look at the summary statistics of the data.


In [None]:
# Display the summary statistics of the training data
train_df.describe()

In [None]:
test_df.describe()

In [None]:
train_extra_df.describe()

Let's see the data types of each column.

In [None]:
print(f"Data types of columns in training dataset\n{train_df.dtypes}\n")
print(f"Data types of columns in training extra dataset\n{train_extra_df.dtypes}\n")
print(f"Data types of columns in testing dataset\n{test_df.dtypes}\n")

Let's get the shape of the data.

In [None]:
# Display the shape of the dataset.
print(f"Shape of training data: {train_df.shape}")
print(f"Shape of training extra data: {train_extra_df.shape}")
print(f"Shape of testing data: {test_df.shape}")

#### Combine the train and train_extra datasets

Now, let's combine the train and train_extra datasets to create a larger training set.

In [None]:
# Combine train and train_extra datasets
train_df = pd.concat([train_df, train_extra_df], axis=0).reset_index(drop=True)

# Display the shape of the combined dataset
print(f"Shape of combined training data: {train_df.shape}")


# Display the first few rows of the combined dataset
train_df.head()


Let's capture the categories of the categorical variables.

In [None]:
cat_columns = train_df.columns[:-2].tolist()  # Dropped the last two columns since we know that these are numerical columns
print(f'There are {len(cat_columns)} categorical columns:')
print(cat_columns)

num_columns = [train_df.columns[-2]]
print(f'There are {len(num_columns)} numerical column:')
print(num_columns)

### 1.3 Data Visualization

We created visualizations to better understand the data.


In [None]:
# For example, plot a histogram of the price column
plt.hist(train_df['Price'], bins=20, edgecolor='black', color='skyblue', rwidth=0.8)
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Histogram of Price (train_df)')
plt.grid(axis='y', linestyle='--', alpha=0.9)
plt.show()

### 1.4 Correlation Matrix

Next, let's create a correlation matrix to see how the numeric features are related to each other.


In [None]:
# Select only the numeric columns
numeric_cols = train_df.select_dtypes(include=['float64', 'int64'])

# Create a correlation matrix
corr = numeric_cols.corr()

# Display the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Observations so far:

- `train_extra` has significantly more records (3.69M) than `train` (300k), which should help model training.
- Some categorical columns have substantial missing values:
    - `Brand`: 9,705 missing in `train`, 117,000 missing in `train_extra`
    - `Material`, `Style`, and `Color` also contain missing values
- `train_extra` has a higher proportion of missing values.
- Imputation strategies under consideration:
    - Mode imputation for categorical columns
    - Mean/median imputation for numerical columns
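
The imputation options listed above can be sketched on a toy frame. The values here are made up; the point is that the mode and median are computed on the training split only and then reused for any other split.

```python
import numpy as np
import pandas as pd

# Toy training and validation frames with missing values (made-up data)
train = pd.DataFrame({
    "Brand": ["Nike", "Adidas", "Nike", np.nan],
    "Weight Capacity (kg)": [10.0, np.nan, 20.0, 30.0],
})
valid = pd.DataFrame({
    "Brand": [np.nan, "Puma"],
    "Weight Capacity (kg)": [np.nan, 12.0],
})

# Count missing values per column
print(train.isna().sum())

# Learn the fill values from the training split only
brand_mode = train["Brand"].mode().iloc[0]
weight_median = train["Weight Capacity (kg)"].median()
fill_values = {"Brand": brand_mode, "Weight Capacity (kg)": weight_median}

# Apply the same fill values to every split
train_filled = train.fillna(fill_values)
valid_filled = valid.fillna(fill_values)
print(valid_filled)
```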


### 1.5 Feature Distribution

Let's take a look at the distribution of the features.

In [None]:
# outlier boxplots
plt.figure(figsize=(10, 5))
sns.boxplot(x=train_df["Price"])
plt.title("Boxplot of Backpack Prices")
plt.show()

In [None]:
# categorical feature distribution
for col in cat_columns:
    plt.figure(figsize=(10, 4))
    sns.countplot(y=train_df[col], order=train_df[col].value_counts().index, hue=train_df[col], palette="coolwarm", legend=False)
    plt.title(f"Count Plot of {col}")
    plt.show()


## 2. Data Preprocessing

In this section, we will preprocess the data to prepare it for modeling.

### 2.1 Feature Engineering

In this section, we will create new features that may help improve the performance of our models.


#### Creation of Combined (Combined_list) Features

For each original categorical column, a new feature is generated by combining it with `Weight Capacity (kg)`.

Each categorical column is first label-encoded to an integer code. The new feature is then computed as the encoded value multiplied by 100 plus the weight capacity. As long as the weight capacity stays below 100, every (category, weight capacity) pair maps to a distinct number, so the new feature captures the interaction between the categorical feature and the weight capacity of the backpack.
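
To make the encoding concrete, here is a small sketch of the combination performed in the code cell that follows: each category is mapped to an integer code, and the combined value is `code * 100 + weight capacity`. The toy data is made up, and `pd.factorize(..., sort=True)` stands in for the notebook's `LabelEncoder`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["Nike", "Adidas", "Puma", "Adidas"],
    "Weight Capacity (kg)": [12.5, 20.0, 12.5, 7.25],
})

# Sorted integer codes: Adidas -> 0, Nike -> 1, Puma -> 2
codes, categories = pd.factorize(toy["Brand"], sort=True)
toy["Brand_code"] = codes

# The hundreds place carries the category, the remainder the weight capacity
toy["Brand_Weight_Capacity_Combined"] = (
    toy["Brand_code"] * 100 + toy["Weight Capacity (kg)"]
)
print(toy)
```

Rows 0 and 2 share a weight capacity of 12.5 but belong to different brands, so they receive different combined values.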

In [None]:
combined_list = []
label_encoders = {}

for c in cat_columns:  # Use 'cat_columns' as defined earlier in the notebook
    # Initialize and fit a LabelEncoder for the current column
    le = LabelEncoder()
    combined = pd.concat([train_df[c], test_df[c]], axis=0)
    le.fit(combined)
    label_encoders[c] = le  # Store the encoder for potential future use

    # Transform the train and test data
    train_df[c] = le.transform(train_df[c])
    test_df[c] = le.transform(test_df[c])

    # Create a new column combining the encoded value and weight capacity
    new_col = f"{c}_Weight_Capacity_Combined"
    train_df[new_col] = train_df[c] * 100 + train_df["Weight Capacity (kg)"]
    test_df[new_col] = test_df[c] * 100 + test_df["Weight Capacity (kg)"]

    # Append the new column name to the combined_list list
    combined_list.append(new_col)

print(f"We now have {len(combined_list)} new columns")
print(combined_list)

In [None]:
input_variables_le = cat_columns + num_columns + combined_list
print(f"We now have {len(input_variables_le)} columns:")
print(input_variables_le)

In [None]:
input_variables = cat_columns + num_columns
print(f"We now have {len(input_variables)} columns:")
print(input_variables)

In [None]:
# Define the input features and target variable
X_le = train_df[input_variables_le]
y = train_df['Price']

# Split the data into training (70%), validation (15%), and test (15%) sets
X_train_le, X_temp_le, y_train_le, y_temp_le = train_test_split(X_le, y, test_size=0.3, random_state=42)
X_valid_le, X_test_le, y_valid_le, y_test_le = train_test_split(X_temp_le, y_temp_le, test_size=0.5, random_state=42)

# For the test dataset (test.csv), ensure it only contains the input features.
# It gets its own name so that the X_test_final defined in the next cell does not overwrite it.
X_test_final_le = test_df[input_variables_le]

# Display the shapes of the datasets
print(f"X_train_le shape: {X_train_le.shape}")
print(f"y_train_le shape: {y_train_le.shape}")
print(f"X_valid_le shape: {X_valid_le.shape}")
print(f"y_valid_le shape: {y_valid_le.shape}")
print(f"X_test_le shape: {X_test_le.shape}")
print(f"y_test_le shape: {y_test_le.shape}")
print(f"X_test_final_le shape: {X_test_final_le.shape}")

In [None]:
# Define the input features and target variable
X = train_df[input_variables]
y = train_df['Price']

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# For the test dataset (test.csv), ensure it only contains the input features
X_test_final = test_df[input_variables]

# Display the shapes of the datasets
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_valid shape: {X_valid.shape}")
print(f"y_valid shape: {y_valid.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"X_test_final shape: {X_test_final.shape}")
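
The two-stage split above first holds out 30% of the rows and then divides that held-out part in half. The resulting fractions can be verified with a few lines of arithmetic; the helper name `split_fractions` is hypothetical.

```python
def split_fractions(first_test_size, second_test_size):
    """Return (train, valid, test) fractions for a two-stage split.

    first_test_size: share of all rows held out by the first split.
    second_test_size: share of the held-out rows assigned to the test set.
    """
    train = 1.0 - first_test_size
    test = first_test_size * second_test_size
    valid = first_test_size - test
    return train, valid, test

train_frac, valid_frac, test_frac = split_fractions(0.3, 0.5)
print(f"train={train_frac:.2f}, valid={valid_frac:.2f}, test={test_frac:.2f}")
```

With `test_size=0.3` followed by `test_size=0.5`, this gives a 70/15/15 split.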

## 3. Modeling

In this section, we will select and train machine learning models to predict the price of the backpack.


In [None]:
# Convert the data into DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train_le, label=y_train_le)
dvalid = xgb.DMatrix(X_valid_le, label=y_valid_le)
dtest = xgb.DMatrix(X_test_le)

# Define the parameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'learning_rate': 0.02,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Train the XGBoost model
evals = [(dtrain, 'train'), (dvalid, 'valid')]
le_xgb_model = xgb.train(params, dtrain, num_boost_round=1_000, evals=evals, early_stopping_rounds=50, verbose_eval=15)

# Make predictions on the validation set
le_y_pred_valid = le_xgb_model.predict(dvalid)

# Make predictions on the test set
le_y_pred_test = le_xgb_model.predict(dtest)

In [None]:
from sklearn.impute import SimpleImputer, KNNImputer

# y_pred_lr = 0

def evaluate_imputation_strategy(imputer, X_train, X_valid, y_train, y_valid):
    """
    Fits imputer on X_train, transforms X_valid,
    encodes categoricals, trains a linear regression,
    and returns validation MAE and RMSE.
    """
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, root_mean_squared_error
    from copy import deepcopy

    # deep copy
    X_train_copy = deepcopy(X_train)
    X_valid_copy = deepcopy(X_valid)

    # separate numeric and categorical columns
    numeric_cols = X_train_copy.select_dtypes(include=[np.number]).columns
    cat_cols = X_train_copy.select_dtypes(exclude=[np.number]).columns

    # 1. numeric imputation
    X_train_copy[numeric_cols] = imputer.fit_transform(X_train_copy[numeric_cols])
    X_valid_copy[numeric_cols] = imputer.transform(X_valid_copy[numeric_cols])

    # 2. fill categorical missing values
    X_train_copy[cat_cols] = X_train_copy[cat_cols].fillna("Missing")
    X_valid_copy[cat_cols] = X_valid_copy[cat_cols].fillna("Missing")

    # 3. one-hot encode categorical columns
    X_train_copy = pd.get_dummies(X_train_copy, columns=cat_cols)
    X_valid_copy = pd.get_dummies(X_valid_copy, columns=cat_cols)

    # 4. align columns (to match dummy columns between sets)
    X_train_copy, X_valid_copy = X_train_copy.align(X_valid_copy, join='left', axis=1)
    X_valid_copy = X_valid_copy.fillna(0)

    # 5. train and evaluate
    model = LinearRegression()
    model.fit(X_train_copy, y_train)
    y_pred = model.predict(X_valid_copy)
    mae = mean_absolute_error(y_valid, y_pred)
    rmse = root_mean_squared_error(y_valid, y_pred)  # this function takes no 'squared' argument

    return mae, rmse

simple_median_imputer = SimpleImputer(strategy='median')
mae_median, rmse_median = evaluate_imputation_strategy(simple_median_imputer, X_train, X_valid, y_train, y_valid)

knn_imputer = KNNImputer(n_neighbors=5)
mae_knn, rmse_knn = evaluate_imputation_strategy(knn_imputer, X_train, X_valid, y_train, y_valid)

print("MAE with Median Imputer:", mae_median)
print("RMSE with Median Imputer:", rmse_median)
print("MAE with KNN Imputer:", mae_knn)
print("RMSE with KNN Imputer:", rmse_knn)

In [None]:
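# Handle missing values by filling them with the training-set median.
# The medians are computed once on the training split so that no
# validation or test statistics leak into preprocessing.
train_medians_le = X_train_le.median()
X_train_le = X_train_le.fillna(train_medians_le)
X_valid_le = X_valid_le.fillna(train_medians_le)
X_test_le = X_test_le.fillna(train_medians_le)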

# Train the Linear Regression model
le_lr_model = LinearRegression()
le_lr_model.fit(X_train_le, y_train_le)

# Make predictions on the validation set
le_y_pred_lr = le_lr_model.predict(X_valid_le)

# Make predictions on the test set
le_y_pred_test_lr = le_lr_model.predict(X_test_le)

# Display the first few predictions
print("First few predictions on the test set:", le_y_pred_test_lr[:5])

In [None]:
# Count missing cells in numeric columns
missing_counts = X_train.select_dtypes(include=[np.number]).isna().sum()
print(missing_counts[missing_counts > 0])

The median and KNN imputers yield the same validation MAE and RMSE, so we will go ahead with median imputation, as it is simpler and faster at scale.

In [None]:
# Separate numeric vs. categorical columns
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns

# Fill numeric columns with X_train's median
numeric_imputer = SimpleImputer(strategy='median')
X_train[numeric_cols] = numeric_imputer.fit_transform(X_train[numeric_cols])
X_valid[numeric_cols] = numeric_imputer.transform(X_valid[numeric_cols])
X_test[numeric_cols] = numeric_imputer.transform(X_test[numeric_cols])

# Fill categorical columns with "Missing"
X_train[cat_cols] = X_train[cat_cols].fillna("Missing")
X_valid[cat_cols] = X_valid[cat_cols].fillna("Missing")
X_test[cat_cols] = X_test[cat_cols].fillna("Missing")

Now we will standardize numeric features.

In [None]:
# numeric standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit on X_train numeric features
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_valid[numeric_cols] = scaler.transform(X_valid[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])

# (Optional) If we use X_test_final for Kaggle submission
#X_test_final[numeric_cols] = scaler.transform(X_test_final[numeric_cols])
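
Standardization uses statistics from the training split only, so validation and test rows are scaled with the training mean and standard deviation. The NumPy sketch below mirrors what `StandardScaler` does in `fit_transform` and `transform`; the arrays are made up.

```python
import numpy as np

train_vals = np.array([[10.0], [20.0], [30.0]])
valid_vals = np.array([[20.0], [40.0]])

# "fit": statistics come from the training split only
mean = train_vals.mean(axis=0)
std = train_vals.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to every split
train_scaled = (train_vals - mean) / std
valid_scaled = (valid_vals - mean) / std
print(train_scaled.ravel(), valid_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1 by construction. The validation column generally does not, which is expected.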

And perform one-hot encoding on the categorical variables.

In [None]:
# One Hot Encoding
from sklearn.preprocessing import OneHotEncoder

# instantiate the encoder
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fit on train's categorical columns
ohe.fit(X_train[cat_cols])

# transform each dataset's categorical columns
X_train_ohe = ohe.transform(X_train[cat_cols])
X_valid_ohe = ohe.transform(X_valid[cat_cols])
X_test_ohe  = ohe.transform(X_test[cat_cols])

# convert numeric columns to arrays
X_train_numeric = X_train[numeric_cols].values
X_valid_numeric = X_valid[numeric_cols].values
X_test_numeric  = X_test[numeric_cols].values

# concatenate numeric and encoded categorical data
X_train_final = np.concatenate([X_train_numeric, X_train_ohe], axis=1)
X_valid_final = np.concatenate([X_valid_numeric, X_valid_ohe], axis=1)
X_test_final_2 = np.concatenate([X_test_numeric, X_test_ohe], axis=1)

# reminder: if we want to make final Kaggle submissions on test_df,
# we need to transform X_test_final the same way, if it exists.
# e.g.
# X_test_final_ohe = ohe.transform(X_test_final[cat_cols])
# X_test_final_numeric = X_test_final[numeric_cols].values
# X_test_final_new = np.concatenate([X_test_final_numeric, X_test_final_ohe], axis=1)

**IMPORTANT NOTE TO SELF:**
Make the same changes for XGBoost as well, or any other model you train. For example:

- `xgb_regressor.fit(X_train_final, y_train, eval_set=[(X_train_final, y_train), (X_valid_final, y_valid)], verbose=15)`
- `y_pred_test_xgb = xgb_regressor.predict(X_test_final_2)`

Otherwise, XGBoost would be trained on data that is not one-hot encoded.
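
Every split has to pass through the same fitted encoder because the set of dummy columns is determined by the training data. The pandas sketch below illustrates the idea with made-up data; `get_dummies` plus `reindex` stands in for the notebook's `OneHotEncoder(handle_unknown='ignore')`.

```python
import pandas as pd

train = pd.DataFrame({"Color": ["Red", "Blue", "Red"]})
valid = pd.DataFrame({"Color": ["Blue", "Green"]})  # "Green" is unseen in train

# The column layout is fixed by the training split
train_ohe = pd.get_dummies(train, columns=["Color"], dtype=int)

# Re-use the training columns: unseen categories become all-zero rows,
# and categories missing from this split become zero columns
valid_ohe = pd.get_dummies(valid, columns=["Color"], dtype=int).reindex(
    columns=train_ohe.columns, fill_value=0
)
print(valid_ohe)
```

Both frames end up with identical columns in identical order, which is what a model trained on `train_ohe` expects at prediction time.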

### Linear Regression Model

In [None]:
# Train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_final, y_train)

# Make predictions on the validation set
y_pred_lr = lr_model.predict(X_valid_final)

# Make predictions on the test set
y_pred_test_lr = lr_model.predict(X_test_final_2)

# Display the first few predictions
print("First few predictions on the test set:", y_pred_test_lr[:5])

### XGBoost Model

In [None]:
# Convert the one-hot encoded data into DMatrix format
dtrain = xgb.DMatrix(X_train_final, label=y_train)
dvalid = xgb.DMatrix(X_valid_final, label=y_valid)
dtest = xgb.DMatrix(X_test_final_2)

# Define the parameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'learning_rate': 0.02,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Train the XGBoost model
evals = [(dtrain, 'train'), (dvalid, 'valid')]
xgb_model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=15
)

# Make predictions
y_pred_valid = xgb_model.predict(dvalid)
y_pred_test = xgb_model.predict(dtest)

print("First few predictions on the test set:", y_pred_test[:5])

### XGBRegressor Model

In [None]:
# Initialize the XGBRegressor model
xgb_regressor = XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse',
    learning_rate=0.02,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    random_state=42
)

# Train the model on the training dataset
xgb_regressor.fit(
    X_train_final,
    y_train,
    eval_set=[(X_train_final, y_train), (X_valid_final, y_valid)],
    verbose=15
)

# Make predictions on the validation set
y_pred_valid_xgb = xgb_regressor.predict(X_valid_final)

# Make predictions on the test set
y_pred_test_xgb = xgb_regressor.predict(X_test_final_2)

# Display the first few predictions
print("First few predictions on the test set:", y_pred_test_xgb[:5])

### Random Forest Regressor Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np


rf_model = RandomForestRegressor(
    n_estimators=30,
    max_depth=10,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='log2',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_final, y_train)

#predictions
y_pred_valid_rf = rf_model.predict(X_valid_final)
y_pred_test_rf  = rf_model.predict(X_test_final_2)

# evaluate
rf_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid_rf))
print(f"Random Forest Validation RMSE: {rf_rmse_valid:.3f}")


In [None]:
# increase n estimators and depth
rf_model2 = RandomForestRegressor(
    n_estimators=100,
    max_depth=14,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='log2',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

rf_model2.fit(X_train_final, y_train)

#predictions
y_pred_valid_rf = rf_model2.predict(X_valid_final)
y_pred_test_rf  = rf_model2.predict(X_test_final_2)

# evaluate
rf_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid_rf))
print(f"Random Forest Validation RMSE: {rf_rmse_valid:.3f}")

Increasing `n_estimators` from 30 to 100 and `max_depth` from 10 to 14 yields only a marginal improvement in validation RMSE, so the larger forest offers no meaningful gain.

In [None]:
import tensorflow as tf
from tensorflow import keras

# Ensure a GPU is available (Runtime ▸ Change runtime type ▸ GPU)
print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))

INPUT_DIM = X_train_final.shape[1]

# ── Build the network ─────────────────────────────────────────
def make_model(input_dim):
    return keras.Sequential([
        keras.Input(shape=(input_dim,)),  # 'input_shape' on InputLayer is deprecated in newer Keras
        keras.layers.Dense(256, activation='relu'),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.2),

        keras.layers.Dense(128, activation='relu'),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.2),

        keras.layers.Dense(1)   # linear output for regression
    ])

model = make_model(INPUT_DIM)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=[keras.metrics.RootMeanSquaredError(name='rmse')]
)

# ── Train ─────────────────────────────────────────────────────
EPOCHS      = 20
BATCH_SIZE  = 1024   # fits easily in GPU memory
early_stop  = keras.callbacks.EarlyStopping(
    patience=3, restore_best_weights=True, monitor='val_rmse'
)

history = model.fit(
    X_train_final, y_train,
    validation_data=(X_valid_final, y_valid),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[early_stop],
    verbose=2
)

# ── Evaluate ──────────────────────────────────────────────────
y_pred_nn = model.predict(X_valid_final, batch_size=4096).squeeze()
nn_rmse   = np.sqrt(mean_squared_error(y_valid, y_pred_nn))
print(f"Neural‑Net Validation RMSE: {nn_rmse:.3f}")

# ── Predict test if desired ──────────────────────────────────
y_pred_test_nn = model.predict(X_test_final_2, batch_size=4096).squeeze()


In [None]:
# Initialize the XGBRegressor model
le_xgb_regressor = XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse',
    learning_rate=0.02,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50, 
    random_state=42
)

# Train the model on the training dataset
le_xgb_regressor.fit(
    X_train_le, 
    y_train_le, 
    eval_set=[(X_train_le, y_train_le), (X_valid_le, y_valid_le)], 
    verbose=15
)

# Make predictions on the validation set
le_y_pred_valid_xgb = le_xgb_regressor.predict(X_valid_le)

# Make predictions on the test set
le_y_pred_test_xgb = le_xgb_regressor.predict(X_test_le)

# Display the first few predictions
print("First few predictions on the test set:", le_y_pred_test_xgb[:5])

## 4. Evaluation

In this section, we will evaluate the performance of our models on the validation set using RMSE, the competition metric.


In [None]:
# Evaluate the xgb model using RMSE
xgb_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid))
print(f"Baseline - XGBoost Validation RMSE: {xgb_rmse_valid}")

# Calculate RMSE for the validation set
xgbreg_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid_xgb))
print(f"Baseline - XGBoost Regressor Validation RMSE: {xgbreg_rmse_valid}")

# Evaluate the linear regression model using RMSE
rmse_lr = np.sqrt(mean_squared_error(y_valid, y_pred_lr))
print(f"Baseline - Linear Regression Validation RMSE: {rmse_lr}")

# Evaluate the random forest model using RMSE
# (y_pred_valid_rf holds the predictions of rf_model2, the larger forest)
rf_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid_rf))
print(f"Baseline - Random Forest (rf_model2) Validation RMSE: {rf_rmse_valid}")

# Evaluate the neural network model using RMSE
nn_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_nn))
print(f"Baseline - Neural Network Validation RMSE: {nn_rmse_valid}")

# Evaluate the LE xgb model using RMSE
le_xgb_rmse_valid = np.sqrt(mean_squared_error(y_valid_le, le_y_pred_valid))
print(f"Baseline - LE XGBoost Validation RMSE: {le_xgb_rmse_valid}")

# Calculate RMSE for the LE xgb regressor validation set
le_xgbreg_rmse_valid = np.sqrt(mean_squared_error(y_valid_le, le_y_pred_valid_xgb))
print(f"Baseline - LE XGBoost Regressor Validation RMSE: {le_xgbreg_rmse_valid}")

# Evaluate the LE linear regression model using RMSE
le_rmse_lr = np.sqrt(mean_squared_error(y_valid_le, le_y_pred_lr))
print(f"Baseline - LE Linear Regression Validation RMSE: {le_rmse_lr}")
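
With this many models, it can help to collect the validation scores in one sorted table instead of separate print statements. A sketch of that idea follows; the helper `rmse_table` and the toy predictions are hypothetical, and in the notebook the dictionary would hold arrays such as `y_pred_lr` or `y_pred_valid_xgb`.

```python
import numpy as np
import pandas as pd

def rmse_table(y_true, predictions):
    """Return a Series of RMSE values, one per model, sorted best first."""
    y_true = np.asarray(y_true, dtype=float)
    scores = {
        name: float(np.sqrt(np.mean((y_true - np.asarray(pred, dtype=float)) ** 2)))
        for name, pred in predictions.items()
    }
    return pd.Series(scores, name="validation_rmse").sort_values()

# Toy targets and predictions (made-up numbers)
y_true = [50.0, 80.0, 120.0]
table = rmse_table(y_true, {
    "model_b": [50.0, 80.0, 150.0],   # one error of 30
    "model_a": [55.0, 75.0, 125.0],   # every error is 5
})
print(table)
```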

In [None]:
import mlflow.sklearn
import mlflow.xgboost
import mlflow.tensorflow

# Log the XGBoost Booster model
mlflow.xgboost.log_model(le_xgb_model, artifact_path="le_xgb_booster_model", registered_model_name="le_xgb_booster_model")

# Log the XGBoost Regressor model
mlflow.sklearn.log_model(le_xgb_regressor, artifact_path="le_xgb_regressor_model", registered_model_name="le_xgb_regressor_model")

# Log the Linear Regression model (Label Encoded)
mlflow.sklearn.log_model(le_lr_model, artifact_path="le_linear_regression_model", registered_model_name="le_linear_regression_model")

# Log the XGBoost Booster model (One-Hot Encoded)
mlflow.xgboost.log_model(xgb_model, artifact_path="xgb_booster_model", registered_model_name="xgb_booster_model")

# Log the XGBoost Regressor model (One-Hot Encoded)
mlflow.sklearn.log_model(xgb_regressor, artifact_path="xgb_regressor_model", registered_model_name="xgb_regressor_model")

# Log the Linear Regression model (One-Hot Encoded)
mlflow.sklearn.log_model(lr_model, artifact_path="linear_regression_model", registered_model_name="linear_regression_model")

# Log the Random Forest model
mlflow.sklearn.log_model(rf_model, artifact_path="random_forest_model", registered_model_name="random_forest_model")

# Log the Random Forest model with increased depth and estimators
mlflow.sklearn.log_model(rf_model2, artifact_path="random_forest_model_v2", registered_model_name="random_forest_model_v2")

# Log the Neural Network model
mlflow.tensorflow.log_model(model, artifact_path="neural_network_model", registered_model_name="neural_network_model")

print("All models logged successfully as MLflow models.")


In [None]:
# # Save the XGBoost Booster model locally (a native Booster uses the xgboost flavor)
# mlflow.xgboost.save_model(xgb_model, path='../models/baseline/xgb_booster_model')

# # Save the XGBoost Regressor model locally
# mlflow.sklearn.save_model(xgb_regressor, path='../models/baseline/xgb_rgr_model')

# # Save the Linear Regression model locally using mlflow
# mlflow.sklearn.save_model(lr_model, path='../models/baseline/lr_model')

# print("Models saved locally.")


In [None]:

mlflow.end_run()



## 5. Model Optimization

In this section, we will optimize the hyperparameters of our models to improve their performance.


The next cell performs hyperparameter tuning for an ElasticNet regression model using `GridSearchCV`. It defines a parameter grid for `alpha` and `l1_ratio`, trains the model on the training dataset (`X_train_le` and `y_train_le`), and evaluates it on the validation dataset (`X_valid_le` and `y_valid_le`). The best parameters and validation RMSE are logged using MLflow, and the trained model is saved. Finally, predictions are made on the test dataset (`X_test_le`).
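
Grid search fits one model per parameter combination and fold. For a grid of five `alpha` values and five `l1_ratio` values (copied from the cell that follows) with 3-fold cross-validation, the number of fits can be counted like this:

```python
from itertools import product

param_grid = {
    "alpha": [0.01, 0.05, 0.1, 0.5, 1.0],
    "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
}
cv_folds = 3

# Every (alpha, l1_ratio) pair is one candidate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
print(candidates[0])
```

`GridSearchCV` then refits the best candidate once more on the full training split, since `refit=True` is its default.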

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
import mlflow.sklearn

# Start an MLflow run
with mlflow.start_run(run_name="ElasticNet Hyperparameter Tuning"):
    # Define the parameter grid for ElasticNet
    elasticnet_param_grid = {
        'alpha': [0.01, 0.05, 0.1, 0.5, 1.0],
        'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
    }

    # Initialize the ElasticNet model
    elasticnet_model = ElasticNet()

    # Perform GridSearchCV to find the best hyperparameters
    grid_search_en = GridSearchCV(estimator=elasticnet_model, param_grid=elasticnet_param_grid, scoring='neg_mean_squared_error', cv=3, verbose=2, n_jobs=-1)
    grid_search_en.fit(X_train_le, y_train_le)

    # Get the best model and parameters
    elasticnet_model = grid_search_en.best_estimator_
    best_params_en = grid_search_en.best_params_
    print(f"Best Parameters for ElasticNet: {best_params_en}")

    # Log the best parameters
    mlflow.log_params(best_params_en)

    # No extra fit is needed here: with refit=True (the default), GridSearchCV
    # has already refit best_estimator_ on the full training dataset

    # Make predictions on the validation set
    y_pred_valid_en = elasticnet_model.predict(X_valid_le)

    # Evaluate the model using RMSE
    en_rmse_valid = np.sqrt(mean_squared_error(y_valid_le, y_pred_valid_en))
    print(f"ElasticNet Validation RMSE: {en_rmse_valid}")

    # Log the RMSE metric
    mlflow.log_metric("Validation RMSE", en_rmse_valid)

    # Make predictions on the test set
    y_pred_test_en = elasticnet_model.predict(X_test_le)

    # Log the ElasticNet model
    mlflow.sklearn.log_model(elasticnet_model, artifact_path="le_elasticnet_model", registered_model_name="le_elasticnet_model")

    # Display the first few predictions
    print("First few predictions on the test set:", y_pred_test_en[:5])

print("ElasticNet model training and tracking completed.")
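Because the search above is scored with `neg_mean_squared_error`, `best_score_` is a negated MSE rather than an RMSE. The following is a minimal sketch, on synthetic data rather than the backpack dataset, of how that cross-validated score could be converted to the competition metric. It also illustrates that `best_estimator_` is already fitted when `refit=True` (the scikit-learn default). The names `X_demo`, `y_demo`, `demo_search`, and `cv_rmse` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the label-encoded training data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    estimator=ElasticNet(),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]},
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds: negate it, then take the root.
# This is the root of the mean fold MSE, not the mean of per-fold RMSEs.
cv_rmse = np.sqrt(-demo_search.best_score_)

# With refit=True (default), best_estimator_ is already fitted on all of X_demo.
demo_preds = demo_search.best_estimator_.predict(X_demo[:5])

print(f"Best parameters: {demo_search.best_params_}")
print(f"Cross-validated RMSE: {cv_rmse:.3f}")
```

Logging this cross-validated RMSE next to the validation RMSE would give two estimates of the same quantity, which can help flag an unlucky validation split.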

The next cell performs hyperparameter tuning for a `GradientBoostingRegressor` model using `GridSearchCV`. Here's a summary of what it does:

1. **Define Parameter Grid**: It specifies a grid of hyperparameters for the `GradientBoostingRegressor`: `learning_rate`, `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `subsample`, `max_features`, `loss`, and `criterion`. Only `learning_rate` takes more than one value (0.04 and 0.045), so the search compares two candidates.

2. **Initialize the Model**: It initializes a `GradientBoostingRegressor` instance.

3. **Grid Search with Cross-Validation**: It uses `GridSearchCV` to search for the best combination of hyperparameters based on the negative mean squared error metric. The search is performed using 3-fold cross-validation.

4. **Train the Model**: It fits the grid search on the training dataset (`X_train` and `y_train`); `GridSearchCV` then refits the best candidate on the full training set by default.

5. **Log Best Parameters**: It logs the best hyperparameters found during the grid search using MLflow.

6. **Evaluate the Model**: It makes predictions on the validation dataset (`X_valid`) and calculates the RMSE to evaluate the model's performance.

7. **Log the Model**: It logs the trained `GradientBoostingRegressor` model to MLflow for tracking and reproducibility.

8. **Make Predictions**: It makes predictions on the test dataset (`X_test`) for further analysis.

This cell is part of the model optimization process to improve the performance of the `GradientBoostingRegressor` by finding the best hyperparameters.

In [None]:
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GradientBoostingRegressor
param_grid = {
    'learning_rate': [0.04, 0.045],
    'n_estimators': [50],
    'max_depth': [6],
    'min_samples_split': [0.00075],
    'min_samples_leaf': [0.0074],
    'subsample': [0.6],
    'max_features': [0.15],
    'loss': ['huber'],
    'criterion': ['squared_error']
}

# Initialize the GradientBoostingRegressor
gbr = GradientBoostingRegressor()

# Start an MLflow run
with mlflow.start_run(run_name="GradientBoostingRegressor Hyperparameter Tuning"):
    # Perform GridSearchCV to find the best hyperparameters
    grid_search_gbr = GridSearchCV(estimator=gbr, param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, verbose=2, n_jobs=-1)
    grid_search_gbr.fit(X_train, y_train)

    # Get the best model and parameters
    gbr_model = grid_search_gbr.best_estimator_
    best_params_gbr = grid_search_gbr.best_params_
    print(f"Best Parameters for GradientBoostingRegressor: {best_params_gbr}")

    # Log the best parameters
    mlflow.log_params(best_params_gbr)

    # Make predictions on the validation set
    y_pred_valid_gbr = gbr_model.predict(X_valid)

    # Evaluate the model using RMSE
    gbr_rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid_gbr))
    print(f"GradientBoostingRegressor Validation RMSE: {gbr_rmse_valid}")

    # Log the RMSE metric
    mlflow.log_metric("Validation RMSE", gbr_rmse_valid)

    # Log the GradientBoostingRegressor model
    mlflow.sklearn.log_model(gbr_model, artifact_path="gradient_boosting_model", registered_model_name="gradient_boosting_model")

    # Make predictions on the test set
    y_pred_test_gbr = gbr_model.predict(X_test)

    # Save the predictions for further analysis
    # test_predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_test_gbr})
    # test_predictions.to_csv("../data/processed/gbr_test_predictions.csv", index=False)

    # Log the predictions file
    # mlflow.log_artifact("../data/processed/gbr_test_predictions.csv")

print("GradientBoostingRegressor model training and tracking completed.")
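As a quick check on the size of this search, the sketch below counts the candidates with scikit-learn's `ParameterGrid`. The grid is copied from the cell above; `n_candidates` and `n_cv_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GradientBoostingRegressor cell above.
param_grid = {
    'learning_rate': [0.04, 0.045],
    'n_estimators': [50],
    'max_depth': [6],
    'min_samples_split': [0.00075],
    'min_samples_leaf': [0.0074],
    'subsample': [0.6],
    'max_features': [0.15],
    'loss': ['huber'],
    'criterion': ['squared_error'],
}

# The number of candidates is the product of the list lengths.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=3 in the cell above).
n_cv_fits = n_candidates * 3

print(f"Candidates: {n_candidates}, cross-validation fits: {n_cv_fits}")
# prints "Candidates: 2, cross-validation fits: 6"
```

Since only `learning_rate` varies, widening any other list multiplies the candidate count, so it may be worth checking this number before launching a larger search.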



## 6. Final Submission

In this section, we will select the best model and make final predictions on the test set.
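A minimal sketch of how the submission file could be assembled is shown below. It assumes the submission format pairs each test `id` with a predicted `Price` (the column names come from the data description; the exact layout of `sample_submission` should be confirmed). The helper `build_submission`, the placeholder values, and the output path are hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(test_ids, predictions):
    """Pair each test id with its predicted price (hypothetical helper)."""
    test_ids = np.asarray(test_ids)
    predictions = np.asarray(predictions, dtype=float)
    if test_ids.shape[0] != predictions.shape[0]:
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": test_ids, "Price": predictions})


# Placeholder ids and prices for illustration only; these are not model outputs.
demo_submission = build_submission([0, 1, 2], [10.0, 20.0, 30.0])

# Hypothetical output path; uncomment to write the file.
# demo_submission.to_csv("submission.csv", index=False)

print(demo_submission)
```

In the notebook, the placeholder arguments would be replaced by the test set's `id` column and the predictions of whichever model is selected.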


## 7. Conclusion

In this section, we will summarize our findings and discuss the implications of our results.

New Shuri Baseline Edits

**DONE**:
- One-hot encoding of categorical features
- Median imputation for numerical features and a "missing" category for categorical features
  - Compared KNN imputation vs. median imputation for numerical features (same MAE on a simple linear model)
- Verified/removed the "multiplying weight capacity by 100" feature
- Set center = 0 in the heatmap to highlight negative vs. positive correlations more clearly
- Checked what missing values mean -> we don't have domain-specific knowledge (confirm this?)
- Standardized numeric features (inserted a `StandardScaler` step after imputing numeric columns, before OHE)

Note: I moved your XGBoost cell after the OHE for cohesion and updated the variable names accordingly.

**TO DO**:

- Do hyperparameter tuning (thought you could edit your code for this, @perry)
  - Instead of a single set of XGBoost parameters, run a small grid search or `RandomizedSearchCV` over `learning_rate`, `max_depth`, `n_estimators`, etc., and evaluate on the validation set.
- Try more models (neural networks, random forest)
- Create the final comparison table
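The randomized search in the first to-do item could look roughly like the sketch below. To keep the example free of extra dependencies, it swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost and uses synthetic data; `XGBRegressor` accepts parameters with the same three names, so the estimator could be substituted. All variable names and candidate values here are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the preprocessed backpack features (assumption).
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Candidate values to sample from (illustrative, not recommendations).
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 6],
    "n_estimators": [50, 100, 200],
}

# Sample 5 of the 48 possible combinations instead of trying all of them.
random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)
random_search.fit(X_tr, y_tr)

# Evaluate the refit best estimator on the held-out validation split.
val_preds = random_search.best_estimator_.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_preds))

print(f"Best parameters: {random_search.best_params_}")
print(f"Validation RMSE: {val_rmse:.3f}")
```

Compared with an exhaustive grid, `n_iter` caps the number of candidates, which keeps the cost predictable as more parameters are added.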


