# Backpack Price Prediction using GradientBoostingRegressor
## Problem Statement
This notebook builds a machine learning model to predict backpack prices from features such as brand, material, and size. Several regressors are compared, and GradientBoostingRegressor is used for the final predictions.
### Objective
- Build a predictive model for backpack prices
- Evaluate model performance using RMSE
- Generate predictions for the test dataset

## Data Preparation

### Import Required Libraries
We'll use the following libraries:
- pandas & numpy: For data manipulation and numerical operations
- scikit-learn: For machine learning models and preprocessing
- xgboost: For the XGBoost regressor
- matplotlib & seaborn: For data visualization
- scipy.stats: For statistical analysis

In [39]:

import numpy as np 
import pandas as pd
import os

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

import pylab 
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns

import xgboost as xgb 


## Data Loading

### Load CSV Files
We'll load three datasets:
1. Training data (`train.csv`): Contains labeled data with known prices
2. Test data (`test.csv`): Contains unlabeled data for predictions
3. Sample submission (`sample_submission.csv`): Template for submission format

In [40]:
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
submission = pd.read_csv('input/sample_submission.csv')

In [None]:
train

In [None]:
train.info()

In [None]:
train.isna().sum()

In [None]:
train.dropna(inplace=True)
train.shape

In [None]:
test

In [None]:
test.isna().sum()

In [None]:
submission

## Data Preprocessing
 
### Remove Unnecessary Columns
The 'id' column is dropped from both the training and test sets. It is only a row identifier and carries no information that helps predict the price.

In [48]:
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

## Exploratory Data Analysis (EDA)
### Analyze Training Data
We'll examine:
- Distribution of categorical variables
- Feature relationships
- Missing values
- Data quality issues

In [None]:
print('Training Data Shape:', train.shape)

print('\nTest Data Shape:', test.shape)

## Target Variable Analysis
 
### Examine Price Distribution
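The cells below first list the unique values of each categorical feature and plot their frequencies, then plot the distribution of the target variable (Price). Understanding the Price distribution is crucial for: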
- Detecting potential outliers
- Identifying if transformations are needed
- Validating model assumptions
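The checks listed above can be made concrete with a small numeric summary. The following is a minimal sketch, not part of the original notebook: the helper `summarize_target` and the toy prices are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd


def summarize_target(prices: pd.Series) -> dict:
    """Return skewness and an IQR-based outlier count for a numeric target."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Values beyond 1.5 * IQR from the quartiles are flagged as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = prices[(prices < lower) | (prices > upper)]
    return {
        "skew": float(prices.skew()),
        "n_outliers": int(outliers.size),
        "bounds": (float(lower), float(upper)),
    }


# Toy prices for illustration only, not taken from the competition data
toy_prices = pd.Series([20.0, 35.0, 50.0, 65.0, 80.0, 400.0])
summary = summarize_target(toy_prices)
print(summary)
```

In the notebook, calling `summarize_target(train['Price'])` would give a quick indication of whether a transformation (for example a log transform) might be worth trying; a skew close to zero and few flagged values would suggest it is not needed.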

In [None]:
for col in train:
    if train[col].dtype == 'object':
        print(col,train[col].unique() )

In [None]:
plt.hist(train['Brand'])
plt.show()

In [None]:
plt.hist(train['Material'])
plt.show()

In [None]:
plt.hist(train['Size'])
plt.show()

In [None]:
plt.hist(train['Laptop Compartment'])
plt.show()

In [None]:
plt.hist(train['Waterproof'])
plt.show()

In [None]:
plt.hist(train['Style'])
plt.show()

In [None]:
plt.hist(train['Color'], bins=6, rwidth=0.8)
plt.show()

In [None]:
sns.displot(train['Price'], kde=True)

### Handle Null Values in Test Set
Strategy:
- Categorical variables: Replace with 'not listed'
- Numerical variables: Replace with -1
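As a minimal sketch of this strategy on a toy frame (the helper `fill_missing`, the `Capacity` column, and all values are hypothetical and only serve as illustration):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df: pd.DataFrame, cat_fill: str = "not listed", num_fill: float = -1) -> pd.DataFrame:
    """Return a copy with numeric NaNs set to num_fill and all other NaNs to cat_fill."""
    out = df.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(num_fill)
        else:
            out[col] = out[col].fillna(cat_fill)
    return out


# Toy frame: "Capacity" is a made-up numeric column for this example
toy = pd.DataFrame({
    "Brand": ["A", None, "B"],
    "Capacity": [3.0, np.nan, 5.0],
})
filled = fill_missing(toy)
print(filled)
```

Using `is_numeric_dtype` rather than comparing against the strings `'int'` and `'float'` avoids depending on the platform's default integer width.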



In [None]:
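for col in test:
    if test[col].dtype == 'object':
        # Categorical column: mark missing entries explicitly
        test[col] = test[col].fillna('not listed')
    elif pd.api.types.is_numeric_dtype(test[col]):
        # Numerical column: use -1 as a sentinel for missing values
        test[col] = test[col].fillna(-1)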

test.isna().sum().sum()

## Feature Engineering

### Encode Categorical Variables
We use OrdinalEncoder to convert categorical variables to numerical format. This:
- Maps categories not seen in the training data to -1
- Maintains consistency between train and test data
- Enables the models to process categorical features
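The handling of unknown categories can be seen on a toy column. This is a small sketch with made-up `Material` values, not output from the actual dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same encoder settings as in the notebook
enc_demo = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Made-up values for illustration
toy_train = pd.DataFrame({"Material": ["Canvas", "Leather", "Nylon"]})
toy_test = pd.DataFrame({"Material": ["Leather", "not listed"]})

enc_demo.fit(toy_train)
encoded = enc_demo.transform(toy_test)

print(enc_demo.categories_)  # categories learned from the training column
print(encoded)               # "not listed" was never seen, so it becomes -1
```

Because the encoder is fitted on the training column only, any placeholder introduced in the test set (such as 'not listed') is encoded as -1 rather than raising an error.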

In [60]:
# Categories that were not seen during fit are encoded as -1
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

for col in train:
    if train[col].dtype == 'object':
        # Fit on the training column, then reuse the same mapping for test
        train[col] = enc.fit_transform(train[[col]]).ravel()
        test[col] = enc.transform(test[[col]]).ravel()


In [None]:
train.info()

In [None]:
test.info()

#### Define dependent and independent variables

In [63]:
y = train.pop('Price')
X = train
X_test = test

#### Split into training and validation sets

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=42)
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape

## Model Definition and Training

### Initialize and Train the Models

We're comparing several regression models:

Linear Models:
- Linear Regression (baseline)
- Ridge regression (L2 regularization)
- Lasso regression (L1 regularization)
- ElasticNet (combines L1 and L2 regularization)
  - Helps prevent overfitting and handles multicollinearity

Advanced Models:
- Random Forest (ensemble learning with decision trees)
- Support Vector Regression (SVR with RBF kernel)
- Gradient Boosting (sequential ensemble learning)
- XGBoost (optimized gradient boosting implementation)
- Neural Network (MLP with 2 hidden layers: 100 and 50 neurons)

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Initialize different models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'ElasticNet': ElasticNet(random_state=42)
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_val)
    
    # Calculate RMSE
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    results[name] = rmse

# Display results
for name, rmse in results.items():
    print(f'{name} RMSE: {rmse:.2f}')

# Visualize results
plt.figure(figsize=(10, 6))
plt.bar(results.keys(), results.values())
plt.title('Model Comparison - RMSE Scores')
plt.xticks(rotation=45)
plt.ylabel('RMSE')
plt.tight_layout()
plt.show()

In [None]:
# Initialize different models
models = {
    'Support Vector Regression': SVR(kernel='rbf'),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42),
    'Neural Network': MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_val)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    results[name] = {'RMSE': rmse}

# Display results
for name, metrics in results.items():
    print(f'{name}:')
    print(f'  RMSE: {metrics["RMSE"]:.2f}')

# Visualize RMSE results
plt.figure(figsize=(12, 6))
plt.bar(list(results), [metrics['RMSE'] for metrics in results.values()])
plt.title('Model Comparison - RMSE Scores')
plt.xticks(rotation=45)
plt.ylabel('RMSE')
plt.tight_layout()
plt.show()

In [None]:
# Refit the selected model; score() returns R^2, here measured on the training data
model = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
model.score(X_train, y_train)

In [None]:
y_pred = model.predict(X_val)
y_pred

In [None]:
df = pd.DataFrame({'y_val':y_val, 'y_pred':y_pred})
df

In [None]:
stats.probplot(y_pred, dist="norm", plot=pylab)
pylab.show()

In [None]:
fig, ax = plt.subplots()
ax.scatter(y_val, y_pred, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

## Model Evaluation

### Generate Predictions and Calculate Metrics
- Make predictions on validation set
- Calculate RMSE and model score
- Evaluate model performance
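The steps above can be sketched as a small helper. This is an illustrative example on synthetic data so that it runs on its own; the function `validation_report` and the `*_demo` names are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def validation_report(model, X_val, y_val) -> dict:
    """Compute RMSE and R^2 for a fitted regressor on held-out data."""
    y_pred = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, y_pred)))
    return {"RMSE": rmse, "R2": float(r2_score(y_val, y_pred))}


# Synthetic stand-in data (assumption: replaces the backpack features here)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
report = validation_report(gbr, X_va, y_va)
print(report)
```

In the notebook itself, `validation_report(model, X_val, y_val)` would report the same metrics for the fitted GradientBoostingRegressor on the held-out validation split, which is a fairer estimate than the training score.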

In [None]:
pred = model.predict(X_test)
pred

## Generate Test Predictions and Create Submission

### Format and Save Predictions
1. Generate predictions for test set
2. Format according to submission template
3. Save to CSV file for submission
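Before writing the file, it can help to check that the predictions line up with the template. A minimal sketch follows, assuming the template has an `id` column and a `Price` column; the helper `build_submission` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(template: pd.DataFrame, predictions, target: str = "Price") -> pd.DataFrame:
    """Return a copy of the template with the target column replaced by predictions."""
    predictions = np.asarray(predictions, dtype=float)
    # Guard against a row-count mismatch between test predictions and template
    if len(predictions) != len(template):
        raise ValueError(f"expected {len(template)} predictions, got {len(predictions)}")
    if np.isnan(predictions).any():
        raise ValueError("predictions contain NaN")
    out = template.copy()
    out[target] = predictions
    return out


# Toy template and predictions, for illustration only
toy_template = pd.DataFrame({"id": [300000, 300001, 300002], "Price": [0.0, 0.0, 0.0]})
toy_submission = build_submission(toy_template, [81.5, 79.2, 83.0])
print(toy_submission)
```

The returned frame can then be saved with `to_csv('submission.csv', index=False)`, as in the cell that follows.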

In [None]:

submission['Price'] = pred
submission.to_csv('submission.csv', index=False)
submission = pd.read_csv('submission.csv')
submission