# Backpack Kaggle Competition
### W207 Final Project - Spring 2025

Team: Perry Gabriel, Aurelia Yang

University of California, Berkeley

## Description

In this competition, participants are challenged to develop machine learning models to predict the price of a backpack based on various features. This is a great opportunity to test your skills, learn new techniques, and compete with others in the data science community.

## Evaluation

Submissions are evaluated on the root mean squared error (RMSE) between the predicted and actual price of the backpack.

RMSE is defined as:
$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

where $y_i$ is the actual price of the backpack and $\hat{y}_i$ is the predicted price of the backpack.
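
As a quick sanity check on the formula, here is a minimal pure-Python sketch of the metric. The function name `rmse` and the sample prices are made up for illustration; this is not the competition's scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Squared difference for each (actual, predicted) pair
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(rmse(actual, predicted))  # sqrt(200 / 3), about 8.165
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.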

## Data Description

The data consists of the following columns:

- `id`: A unique identifier for the backpack.
- `Brand`: The brand of the backpack.
- `Material`: The material of the backpack.
- `Size`: The size of the backpack.
- `Compartments`: The number of compartments in the backpack.
- `Laptop Compartment`: Whether the backpack has a laptop compartment.
- `Waterproof`: Whether the backpack is waterproof.
- `Style`: The style of the backpack.
- `Color`: The color of the backpack.
- `Weight Capacity (kg)`: The weight capacity of the backpack in kilograms.
- `Price`: The price of the backpack.

## Submission File

For each `id` in the test set, you must predict the price of the backpack. The file should contain a header and have the following format:

```csv
id,Price
1,100
2,200
3,300
```
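
A file in this layout can be produced with pandas. The sketch below assumes the predictions live in a DataFrame indexed by `id`, as they would if derived from a test set loaded with `index_col=0`. The values are made up, and the in-memory buffer stands in for a real path such as `submission.csv` (a hypothetical file name).

```python
import io

import pandas as pd

# Hypothetical predictions keyed by test-set id (values are made up)
submission = pd.DataFrame(
    {"Price": [100.0, 200.0, 300.0]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Write to an in-memory buffer; pass a file path instead to create the file
buffer = io.StringIO()
submission.to_csv(buffer, index=True)  # the index name becomes the 'id' header
submission_text = buffer.getvalue()
print(submission_text)
```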

## Timeline

- **Start Date** - February 1, 2025
- **Entry Deadline** - Same as the Final Submission Deadline
- **Team Merger Deadline** - Same as the Final Submission Deadline
- **Final Submission Deadline** - February 28, 2025

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

## Acknowledgements

This dataset was created by [Kaggle](https://www.kaggle.com/datasets/souradippal/student-bag-price-prediction-dataset) for the purpose of hosting a competition.

## Team Members

- [Perry Gabriel](https://www.kaggle.com/prgabriel)
- [Aurelia Yang](https://www.kaggle.com/aureliayang)

## Sections

1. [Exploratory Data Analysis](#1.-Exploratory-Data-Analysis)
2. [Data Preprocessing](#2.-Data-Preprocessing)
3. [Modeling](#3.-Modeling)
4. [Evaluation](#4.-Evaluation)
5. [Model Optimization](#5.-Model-Optimization)
6. [Final Submission](#6.-Final-Submission)
7. [Conclusion](#7.-Conclusion)


## 1. Exploratory Data Analysis

In this section, we will explore the data to understand its structure and identify any patterns or trends that may be present.


### 1.1 Load the Data

Let's start by loading the data and taking a look at the first few rows.

In [None]:
import os

raw_data_path = '../data/raw/'
os.makedirs(raw_data_path, exist_ok=True)

Uncomment to download the data from Kaggle. This assumes you have the Kaggle API installed and configured.

In [None]:
# !kaggle competitions download -c playground-series-s5e2
# !unzip playground-series-s5e2 -d ../data/raw/
# !pip install -r ../requirement.txt
# !rm -rf playground-series-s5e2.zip

In [None]:
import pandas as pd
import os

train_df = pd.read_csv(filepath_or_buffer=os.path.join(raw_data_path, 'train.csv'), index_col=0, header=0, sep=',')
test_df = pd.read_csv(filepath_or_buffer=os.path.join(raw_data_path, 'test.csv'), index_col=0, header=0, sep=',')

train_df.head()

In [None]:
test_df.head()

### 1.2 Data Summary

Next, let's take a look at the summary statistics of the data.


In [None]:
# Display the summary statistics of the training data
train_df.describe()

In [None]:
test_df.describe()

Let's see the data types of each column.

In [None]:
print(f"Data types of columns in training dataset\n{train_df.dtypes}\n")
print(f"Data types of columns in testing dataset\n{test_df.dtypes}")

Let's get the shape of the data.

In [None]:
# Display the shape of the dataset.
print(f"Shape of training data: {train_df.shape}")
print(f"Shape of testing data: {test_df.shape}")

### 1.3 Data Visualization

We can also create visualizations to better understand the data.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a pairplot of the training data
sns.pairplot(train_df)
plt.show()

In [None]:
# For example, plot a histogram of the price column
plt.hist(train_df['Price'], bins=20, edgecolor='black')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Histogram of Price')
plt.show()

### 1.4 Correlation Matrix

Finally, let's create a correlation matrix to see how the features are related to each other.


In [None]:
# Select only the numeric columns
numeric_cols = train_df.select_dtypes(include=['float64', 'int64'])

# Create a correlation matrix
corr = numeric_cols.corr()

# Display the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
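
The heatmap values are Pearson correlation coefficients, which is the default method of `DataFrame.corr`. As a reference for reading them, here is a minimal pure-Python sketch of the coefficient. The helper name `pearson` and the toy numbers are illustrative assumptions, not values from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: the second list rises linearly with the first, the third falls
compartments = [1, 2, 3, 4, 5]
rising = [20.0, 40.0, 60.0, 80.0, 100.0]
falling = [100.0, 80.0, 60.0, 40.0, 20.0]

print(pearson(compartments, rising))   # perfectly linear, increasing
print(pearson(compartments, falling))  # perfectly linear, decreasing
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. Categorical columns are excluded from this matrix because they are not numeric.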

## 2. Data Preprocessing

In this section, we will preprocess the data to prepare it for modeling.

### 2.1 Missing Values

First, let's check for missing values in the data and decide how to handle them.

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

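Let's fill in missing values using the forward-fill method, standardize the numerical features, and convert the categorical features to the `category` dtype, for both the train and test data.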

In [None]:
# Handle missing values (forward fill copies the previous row's value)
train_df.ffill(inplace=True)

# Standardize numerical features, leaving the target `Price` on its original scale
numerical_cols = train_df.select_dtypes(include=['float64', 'int64']).columns.drop('Price', errors='ignore')
train_mean = train_df[numerical_cols].mean()
train_std = train_df[numerical_cols].std()
train_df[numerical_cols] = (train_df[numerical_cols] - train_mean) / train_std

# Convert categorical features to category type
categorical_cols = train_df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    train_df[col] = train_df[col].astype('category')

In [None]:
# Handle missing values
test_df.ffill(inplace=True)

# Standardize numerical features with the training-set statistics,
# so that train and test features share the same scale
test_df[numerical_cols] = (test_df[numerical_cols] - train_mean) / train_std

# Convert categorical features to category type
categorical_cols = test_df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    test_df[col] = test_df[col].astype('category')

In [None]:
# Check to see the changes
train_df.isnull().sum()


In [None]:
test_df.isnull().sum()
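
Re-checking the null counts matters because forward fill copies the value from the previous row, so a gap in the very first row has nothing to copy from and stays missing. The toy frame below (made-up values, not rows from the dataset) is a small sketch of that behavior.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the first and third rows (values are made up)
toy = pd.DataFrame({
    "Brand": [None, "A", None, "B"],
    "Weight Capacity (kg)": [np.nan, 12.5, np.nan, 20.0],
})

filled = toy.ffill()
print(filled)

# The third row is filled from the second; the first row is still missing
remaining = filled.isnull().sum()
print(remaining)
```

If any nulls remain after this step, one option is a follow-up `bfill()` or a per-column default; which one is appropriate depends on the column.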

Now, let's save the data to new CSV files under the `../data/processed` folder.

In [None]:
# Create the output directory if it does not already exist
processed_file_path = '../data/processed'
os.makedirs(processed_file_path, exist_ok=True)

# Save the transformed training data
train_df.to_csv(os.path.join(processed_file_path, 'train_processed.csv'), index=True)
processed_train_df = train_df.copy()

# Save the transformed testing data
test_df.to_csv(os.path.join(processed_file_path, 'test_processed.csv'), index=True)
processed_test_df = test_df.copy()

#TODO: Make a new notebook for the next steps

### 2.2 Feature Engineering

In this section, we will create new features that may help improve the performance of our models.


## 3. Modeling

In this section, we will select and train machine learning models to predict the price of the backpack.


## 4. Evaluation

In this section, we will evaluate the performance of our models using various metrics.


## 5. Model Optimization

In this section, we will optimize the hyperparameters of our models to improve their performance.


## 6. Final Submission

In this section, we will select the best model and make final predictions on the test set.


## 7. Conclusion

In this section, we will summarize our findings and discuss the implications of our results.