# Big Mart Sales Prediction

## 1. Problem Statement

The data scientists at BigMart have collected sales data from 2013 for 1559 products across 10 stores in different cities. Along with this, certain attributes of each product and store have been defined. The goal of this data science project is to build a predictive model that estimates the sales of each product at a given store.

- **Business Goal:**  
  The objective is to use the model to understand the factors (properties of products and stores) that are key drivers in increasing sales.

- **Problem Type:**  
  This is a supervised regression problem: the target is a continuous sales value.

- **Target Feature:**  
  The target variable for prediction is `Item_Outlet_Sales`.

The notebook is organized into the following sections:

1. **Problem Statement**
2. **Loading Packages and Data**
3. **Exploratory Data Analysis (EDA)**
4. **Handling Missing Values and Data Cleaning**
5. **Feature Engineering**
6. **Encode Categorical Features**
7. **Label Encoding for Ordinal Columns**
8. **One-Hot Encoding**
9. **Aligning Train and Test Data**
10. **Split Data into Features and Target**
11. **Model Definition and Training**
12. **Making Final Predictions**
13. **Hyperparameter Tuning (Optional)**
14. **Randomized Search CV for Hyperparameter Tuning**
15. **XGBoost Hyperparameter Tuning**
16. **Evaluate the Tuned Models**
17. **Final Predictions Using Best Model**
18. **GridSearchCV for Hyperparameter Tuning**
19. **Evaluate the GridSearchCV Tuned Models**
20. **Final Predictions Using Best GridSearchCV Model**

---

## Data Dictionary

We have two datasets: **train** and **test**. The **train dataset** contains both input and output variables, while the **test dataset** only includes input variables for which sales need to be predicted.

### Train Dataset (8523 rows)

| Variable                  | Description                                                              |
|---------------------------|--------------------------------------------------------------------------|
| `Item_Identifier`          | Unique product ID                                                        |
| `Item_Weight`              | Weight of the product                                                     |
| `Item_Fat_Content`         | Whether the product is low fat or not                                    |
| `Item_Visibility`          | Percentage of total display area allocated to the product in a store    |
| `Item_Type`                | The category the product belongs to                                      |
| `Item_MRP`                 | Maximum Retail Price (list price) of the product                        |
| `Outlet_Identifier`        | Unique store ID                                                          |
| `Outlet_Establishment_Year`| The year in which the store was established                              |
| `Outlet_Size`              | The size of the store in terms of ground area covered                    |
| `Outlet_Location_Type`     | The type of city in which the store is located                           |
| `Outlet_Type`              | Whether the outlet is a grocery store or a supermarket                   |
| `Item_Outlet_Sales`        | Sales of the product in the store (target variable)                     |

### Test Dataset (5681 rows)

| Variable                  | Description                                                              |
|---------------------------|--------------------------------------------------------------------------|
| `Item_Identifier`          | Unique product ID                                                        |
| `Item_Weight`              | Weight of the product                                                     |
| `Item_Fat_Content`         | Whether the product is low fat or not                                    |
| `Item_Visibility`          | Percentage of total display area allocated to the product in a store    |
| `Item_Type`                | The category the product belongs to                                      |
| `Item_MRP`                 | Maximum Retail Price (list price) of the product                        |
| `Outlet_Identifier`        | Unique store ID                                                          |
| `Outlet_Establishment_Year`| The year in which the store was established                              |
| `Outlet_Size`              | The size of the store in terms of ground area covered                    |
| `Outlet_Location_Type`     | The type of city in which the store is located                           |
| `Outlet_Type`              | Whether the outlet is a grocery store or a supermarket                   |

---

## Submission File Format

To submit your predictions, the output file should be structured as follows:

| Variable               | Description                                                   |
|------------------------|---------------------------------------------------------------|
| `Item_Identifier`       | Unique product ID                                             |
| `Outlet_Identifier`     | Unique store ID                                               |
| `Item_Outlet_Sales`     | Predicted sales of the product in the particular store       |
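
A quick format check before writing the file can catch a missing column or an invalid prediction early. The sketch below is not part of the original notebook: `check_submission` is a hypothetical helper, and the two-row frame stands in for real predictions.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Item_Identifier", "Outlet_Identifier", "Item_Outlet_Sales"]


def check_submission(df):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(df.columns) != REQUIRED_COLUMNS:
        problems.append(f"expected columns {REQUIRED_COLUMNS}, got {list(df.columns)}")
        return problems
    if df["Item_Outlet_Sales"].isna().any():
        problems.append("missing predictions")
    if (df["Item_Outlet_Sales"] < 0).any():
        problems.append("negative predictions")
    return problems


# Toy stand-in for the real prediction frame (values are illustrative)
toy_submission = pd.DataFrame({
    "Item_Identifier": ["FDW58", "FDW14"],
    "Outlet_Identifier": ["OUT049", "OUT017"],
    "Item_Outlet_Sales": [1621.92, 1634.37],
})

print(check_submission(toy_submission))
```

An empty list means the frame has the three expected columns in order and no missing or negative sales values.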


## 2. Loading Packages and Data

In [30]:
# %% 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Load datasets
train = pd.read_csv(r'H:\MY WORK SPACE\PROJECT GITHUB\NOTES\ML\AV\train.csv')
test = pd.read_csv(r'H:\MY WORK SPACE\PROJECT GITHUB\NOTES\ML\AV\test.csv')

# Save the original identifiers before processing
original_test_identifiers = test[['Item_Identifier', 'Outlet_Identifier']].copy()


## 3. Exploratory Data Analysis (EDA)

In [31]:
# Display first few rows of the training data
train.head()

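|   | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|-----------------|-------------|------------------|-----------------|-----------|----------|-------------------|---------------------------|-------------|----------------------|-------------|-------------------|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |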
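
The preview already shows a missing `Outlet_Size` in row 3. Below is a minimal sketch of two checks that motivate the cleaning in the next section. It runs on a small hand-made frame, not the real data, and the toy values are illustrative assumptions. With the notebook loaded, the same methods can be called on `train`.

```python
import numpy as np
import pandas as pd

# Hand-made frame that mimics the kinds of issues seen in the data
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Item_Fat_Content": ["Low Fat", "LF", "reg", "Regular"],
    "Outlet_Size": ["Medium", "Medium", None, "High"],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# List the distinct labels to spot inconsistent spellings of the same category
fat_labels = sorted(toy["Item_Fat_Content"].unique())
print(fat_labels)
```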


## 4. Handling Missing Values and Data Cleaning

In [32]:
# Handle missing values
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
test['Outlet_Size'] = test['Outlet_Size'].fillna(test['Outlet_Size'].mode()[0])
test['Item_Weight'] = test['Item_Weight'].fillna(test['Item_Weight'].mean())

# Clean Item_Fat_Content
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace({'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'})
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace({'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'})
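
The cell above imputes `test` from its own mode and mean. A common alternative is to compute the fill values on `train` only and reuse them for `test`, so both frames are treated identically. Here is a minimal sketch of that variant on toy frames. The toy values and the `fill_values` name are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "Item_Weight": [10.0, 20.0, np.nan],
    "Outlet_Size": ["Medium", "Medium", None],
})
toy_test = pd.DataFrame({
    "Item_Weight": [np.nan, 5.0],
    "Outlet_Size": [None, "High"],
})

# Learn the fill values from the training frame only
fill_values = {
    "Item_Weight": toy_train["Item_Weight"].mean(),
    "Outlet_Size": toy_train["Outlet_Size"].mode()[0],
}

# Apply the same values to both frames (fillna accepts a column -> value dict)
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```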


## 5. Feature Engineering

In [33]:
# Create Outlet_Age feature (age as of 2013, the year the sales data was collected)
train['Outlet_Age'] = 2013 - train['Outlet_Establishment_Year']
test['Outlet_Age'] = 2013 - test['Outlet_Establishment_Year']
train.drop('Outlet_Establishment_Year', axis=1, inplace=True)
test.drop('Outlet_Establishment_Year', axis=1, inplace=True)

## 6. Encode Categorical Features

In [34]:
# Encode categorical variables
train['Outlet_Size'] = train['Outlet_Size'].map({'Small': 1, 'Medium': 2, 'High': 3}).astype(int)
test['Outlet_Size'] = test['Outlet_Size'].map({'Small': 1, 'Medium': 2, 'High': 3}).astype(int)

train['Outlet_Location_Type'] = train['Outlet_Location_Type'].str[-1:].astype(int)
test['Outlet_Location_Type'] = test['Outlet_Location_Type'].str[-1:].astype(int)

# Extract the two-letter prefix of 'Item_Identifier' (e.g. FD, DR, NC) as a broad item category
train['Item_Identifier_Categories'] = train['Item_Identifier'].str[0:2]
test['Item_Identifier_Categories'] = test['Item_Identifier'].str[0:2]


## 7. Label Encoding for Ordinal Columns

In [35]:
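# Label-encode the remaining categorical columns.
# LabelEncoder assigns integer codes in sorted order of the class labels;
# fit_transform refits it for each column, so a single instance can be reused.
# 'Outlet_Location_Type' is already an integer (1-3) from the previous step
# and is simply remapped to 0-2 here.
encoder = LabelEncoder()
ordinal_features = ['Item_Fat_Content', 'Outlet_Type', 'Outlet_Location_Type']
for feature in ordinal_features:
    train[feature] = encoder.fit_transform(train[feature])
    test[feature] = encoder.transform(test[feature])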
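
`LabelEncoder` numbers classes in sorted (alphabetical) order, which matches a meaningful ranking only by coincidence. The small sketch below shows what that does to size labels, next to the explicit mapping used for `Outlet_Size` in the previous section. The four-element series is a toy example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(["Small", "Medium", "High", "Medium"])

# Alphabetical codes: High=0, Medium=1, Small=2 (reverses the size order)
encoder = LabelEncoder()
alphabetical_codes = list(encoder.fit_transform(sizes))
print(list(encoder.classes_), alphabetical_codes)

# Explicit mapping keeps the intended order: Small < Medium < High
explicit_codes = list(sizes.map({"Small": 1, "Medium": 2, "High": 3}))
print(explicit_codes)
```

For columns without a natural order this difference is harmless for tree models, but linear models will read the integer codes as magnitudes.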


## 8. One-Hot Encoding

In [36]:
# One-Hot Encoding
train = pd.get_dummies(train, columns=['Item_Type', 'Item_Identifier_Categories', 'Outlet_Identifier'], drop_first=True)
test = pd.get_dummies(test, columns=['Item_Type', 'Item_Identifier_Categories', 'Outlet_Identifier'], drop_first=True)

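# Drop 'Item_Identifier' from train: its two-letter prefix is already one-hot encoded above.
# The test copy is removed when the columns are aligned in the next step; the identifiers
# needed for the submission file were saved earlier in original_test_identifiers.
train.drop(labels=['Item_Identifier'], axis=1, inplace=True)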


## 9. Aligning Train and Test Data

In [None]:
# Align the test DataFrame with the training DataFrame
missing_cols = set(train.columns) - set(test.columns)
for col in missing_cols:
    test[col] = 0  # Add missing columns with default value 0

# Keep only the training feature columns (everything except the target), in the same order.
# This also drops 'Item_Identifier' from test.
test = test[train.columns.drop('Item_Outlet_Sales')]
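
The same alignment can be written in one step with `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the column order at once. This is a minimal sketch on toy dummy-encoded frames. The column names and values are made up for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Item_MRP": [100.0, 50.0],
    "Item_Type_Dairy": [1, 0],
    "Item_Type_Meat": [0, 1],
    "Item_Outlet_Sales": [2000.0, 900.0],
})
# The toy test frame lacks one dummy column, has an extra column, and a different order
toy_test = pd.DataFrame({
    "Item_Identifier": ["FDA15"],
    "Item_Type_Dairy": [1],
    "Item_MRP": [120.0],
})

# Feature columns are all training columns except the target
feature_columns = toy_train.columns.drop("Item_Outlet_Sales")

# Missing columns are created and filled with 0; columns not listed are dropped
toy_test_aligned = toy_test.reindex(columns=feature_columns, fill_value=0)
print(toy_test_aligned)
```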


## 10. Split Data into Features and Target

In [None]:
# Define features and target
X = train.drop('Item_Outlet_Sales', axis=1)
y = train['Item_Outlet_Sales']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 11. Model Definition and Training

In [None]:
# Define models
models = {
    "LinearRegression": Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('model', LinearRegression())
    ]),
    "Ridge": Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('model', Ridge(alpha=7, fit_intercept=True))
    ]),
    "Lasso": Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('model', Lasso(alpha=0.2, fit_intercept=True))
    ]),
    "RandomForest": RandomForestRegressor(),
    "XGBoost": XGBRegressor()
}

# Train and evaluate models
best_model = None
best_score = float('-inf')
model_performance = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    r2 = r2_score(y_test, predictions)
    model_performance[name] = r2
    
    if r2 > best_score:
        best_score = r2
        best_model = model

# Print model performances
print("Model Performance (R² scores):")
for model, score in model_performance.items():
    print(f"{model}: {score:.4f}")


Model Performance (R² scores):
LinearRegression: 0.6094
Ridge: 0.6096
Lasso: 0.6103
RandomForest: 0.5615
XGBoost: 0.5299
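
One thing to keep in mind when reading these scores: `PolynomialFeatures(degree=2)` expands `n` input columns into `(n + 1)(n + 2) / 2` columns (bias, linear, squared, and pairwise terms). The three linear pipelines therefore fit far more coefficients than there are original features. The sketch below checks the count on a toy matrix with 5 columns. The toy shape is an assumption; the real `X_train` has more columns after one-hot encoding.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 5
X_toy = np.random.default_rng(0).normal(size=(10, n_features))

# Degree-2 expansion with the default bias column
poly = PolynomialFeatures(degree=2)
X_expanded = poly.fit_transform(X_toy)

# Closed-form count of monomials of degree <= 2 in n variables
expected_columns = (n_features + 1) * (n_features + 2) // 2
print(X_expanded.shape[1], expected_columns)
```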


## 12. Making Final Predictions

In [None]:
# Use the best model for final predictions
final_predictions = best_model.predict(test)

# Ensure no negative sales values
final_predictions = np.maximum(final_predictions, 0)

# Format the final output
final_output = original_test_identifiers.copy()  # Use the original identifiers
final_output['Item_Outlet_Sales'] = final_predictions

# Save to CSV in the requested format
final_output.to_csv('final_prediction.csv', index=False)

# Display first few rows of the output
print(final_output.head())


  Item_Identifier Outlet_Identifier  Item_Outlet_Sales
0           FDW58            OUT049        1621.924407
1           FDW14            OUT017        1634.367658
2           NCN55            OUT010         553.055585
3           FDQ58            OUT017        2631.382221
4           FDY38            OUT027        5808.189893


## 13. Hyperparameter Tuning (Optional)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Candidate hyperparameter values for each model; RandomizedSearchCV samples combinations from these lists
param_distributions = {
    'Ridge': {
        'model__alpha': [0.1, 1, 10, 100]
    },
    'Lasso': {
        'model__alpha': [0.1, 0.2, 0.5, 1.0]
    },
    'RandomForest': {
        'n_estimators': [50, 100, 150, 200],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    'XGBoost': {
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 9],
        'subsample': [0.8, 0.9, 1.0]
    }
}
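
`RandomizedSearchCV` also accepts distributions in place of fixed lists, which lets it sample values between the listed candidates. The hedged sketch below tunes a plain `Ridge` on synthetic data, drawing `alpha` from `scipy.stats.loguniform`. The synthetic data, the range 0.1 to 100, and the bare `Ridge` without the notebook's pipeline are all assumptions made to keep the example small.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (not the BigMart data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# alpha is drawn log-uniformly from [0.1, 100] instead of from a fixed list
search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(0.1, 100)},
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Inside a pipeline the parameter key would carry the step prefix, as in `model__alpha` above.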


## 14. Randomized Search CV for Hyperparameter Tuning

In [None]:
# Update models with a more generalized pipeline for RandomizedSearchCV (excluding XGBoost)
models_tuned = {
    "Ridge": Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('model', Ridge(fit_intercept=True))
    ]),
    "Lasso": Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('model', Lasso(fit_intercept=True))
    ]),
    "RandomForest": RandomForestRegressor(),
}

# Perform RandomizedSearchCV for hyperparameter tuning with cross-validation (excluding XGBoost)
best_models = {}
for name, model in models_tuned.items():
    print(f"Tuning {name}...")
    random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions[name], 
                                       n_iter=10, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', random_state=42)
    random_search.fit(X_train, y_train)
    
    best_models[name] = random_search.best_estimator_
    print(f"Best parameters for {name}: {random_search.best_params_}")


Tuning Ridge...
Best parameters for Ridge: {'model__alpha': 100}
Tuning Lasso...
Best parameters for Lasso: {'model__alpha': 1.0}
Tuning RandomForest...
Best parameters for RandomForest: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 5}


## 15. XGBoost Hyperparameter Tuning

In [None]:
# Tune XGBoost separately with scikit-learn's RandomizedSearchCV
# (XGBRegressor follows the scikit-learn estimator API)
print("Tuning XGBoost...")

xgb_model = XGBRegressor()

# Perform RandomizedSearchCV with XGBoost's hyperparameters
xgb_random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_distributions['XGBoost'],
                                       n_iter=10, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', random_state=42)
xgb_random_search.fit(X_train, y_train)

best_models['XGBoost'] = xgb_random_search.best_estimator_
print(f"Best parameters for XGBoost: {xgb_random_search.best_params_}")


Tuning XGBoost...
Best parameters for XGBoost: {'subsample': 0.9, 'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1}


## 16. Evaluate the Tuned Models

In [46]:
# %%
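# Evaluate the tuned models on the held-out split.
# best_estimator_ was already refit on X_train by the search (refit=True is the default),
# so the model.fit call below repeats that fit.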
model_performance = {}
for name, model in best_models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    r2 = r2_score(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    
    model_performance[name] = {
        'R²': r2,
        'MSE': mse,
        'RMSE': rmse
    }

# Print performance of the tuned models
print("\nTuned Model Performance:")
for model, metrics in model_performance.items():
    print(f"{model}: R² = {metrics['R²']:.4f}, MSE = {metrics['MSE']:.4f}, RMSE = {metrics['RMSE']:.4f}")



Tuned Model Performance:
Ridge: R² = 0.6110, MSE = 1057249.2994, RMSE = 1028.2263
Lasso: R² = 0.6123, MSE = 1053805.9981, RMSE = 1026.5505
RandomForest: R² = 0.6173, MSE = 1040154.5184, RMSE = 1019.8797
XGBoost: R² = 0.6138, MSE = 1049633.3852, RMSE = 1024.5162
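
For reference, the three reported metrics are related: RMSE is the square root of MSE, and R² compares the model's squared error with that of always predicting the mean. The small sketch below computes them by hand on toy numbers and checks the results against scikit-learn. The toy arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Mean squared error and its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```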


## 17. Final Predictions Using Best Model

In [48]:
# %%
# Select the best model by R² on the held-out split.
# best_model_name is the dictionary key (a string); the fitted estimator lives in best_models.
best_model_name = max(model_performance, key=lambda name: model_performance[name]['R²'])
print(f"\nBest Model: {best_model_name} with R² = {model_performance[best_model_name]['R²']:.4f}")

# Use the best model for final predictions
final_predictions = best_models[best_model_name].predict(test)

# Ensure no negative sales values
final_predictions = np.maximum(final_predictions, 0)

# Format the final output
final_output = original_test_identifiers.copy()  # Use the original identifiers
final_output['Item_Outlet_Sales'] = final_predictions

# Save to CSV in the requested format
final_output.to_csv('final_prediction_randomized_tuned.csv', index=False)

# Display first few rows of the output
print("\nFinal Prediction Output:")
print(final_output.head())



Best Model: RandomForest with R² = 0.6173

Final Prediction Output:
  Item_Identifier Outlet_Identifier  Item_Outlet_Sales
0           FDW58            OUT049        1583.792881
1           FDW14            OUT017        1448.255155
2           NCN55            OUT010         552.911318
3           FDQ58            OUT017        2466.141009
4           FDY38            OUT027        6243.348774


## 18. GridSearchCV for Hyperparameter Tuning

In [49]:
# %%
from sklearn.model_selection import GridSearchCV

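# Define the hyperparameter grid for each model (same candidate values as the randomized search).
# The 'XGBoost' entry is not used below because models_tuned does not include XGBoost.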
param_grid = {
    'Ridge': {
        'model__alpha': [0.1, 1, 10, 100]
    },
    'Lasso': {
        'model__alpha': [0.1, 0.2, 0.5, 1.0]
    },
    'RandomForest': {
        'n_estimators': [50, 100, 150, 200],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    'XGBoost': {
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 9],
        'subsample': [0.8, 0.9, 1.0]
    }
}

# Perform GridSearchCV for hyperparameter tuning with cross-validation
best_grid_models = {}
for name, model in models_tuned.items():
    print(f"Tuning {name} using GridSearchCV...")
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid[name], 
                               cv=5, n_jobs=-1, scoring='neg_mean_squared_error', verbose=1)
    grid_search.fit(X_train, y_train)
    
    best_grid_models[name] = grid_search.best_estimator_
    print(f"Best parameters for {name}: {grid_search.best_params_}")


Tuning Ridge using GridSearchCV...
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters for Ridge: {'model__alpha': 100}
Tuning Lasso using GridSearchCV...
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters for Lasso: {'model__alpha': 1.0}
Tuning RandomForest using GridSearchCV...
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best parameters for RandomForest: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 150}
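
The candidate counts in the log follow directly from the grid sizes: the number of combinations is the product of the list lengths. This short sketch uses scikit-learn's `ParameterGrid` to count them for the RandomForest grid defined above. The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# 4 * 4 * 3 * 3 combinations, each fitted once per cross-validation fold
n_candidates = len(ParameterGrid(rf_grid))
cv_folds = 5
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)
```

By the same arithmetic, the randomized search with `n_iter=10` tries only a small subset of the RandomForest grid, while Ridge and Lasso have just 4 candidates each.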


## 19. Evaluate the GridSearchCV Tuned Models

In [50]:
# %%
# Evaluate the models after GridSearchCV tuning
grid_model_performance = {}
for name, model in best_grid_models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    r2 = r2_score(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)

    grid_model_performance[name] = {
        'R²': r2,
        'MSE': mse,
        'RMSE': rmse
    }

# Print performance of the GridSearchCV tuned models
print("\nGridSearchCV Tuned Model Performance:")
for model, metrics in grid_model_performance.items():
    print(f"{model}: R² = {metrics['R²']:.4f}, MSE = {metrics['MSE']:.4f}, RMSE = {metrics['RMSE']:.4f}")



GridSearchCV Tuned Model Performance:
Ridge: R² = 0.6110, MSE = 1057249.2994, RMSE = 1028.2263
Lasso: R² = 0.6123, MSE = 1053805.9981, RMSE = 1026.5505
RandomForest: R² = 0.6174, MSE = 1039783.3210, RMSE = 1019.6977


## 20. Final Predictions Using Best GridSearchCV Model

In [51]:
# %%
# Select the best model from GridSearchCV based on R²
best_grid_model = max(grid_model_performance, key=lambda x: grid_model_performance[x]['R²'])
print(f"\nBest GridSearchCV Model: {best_grid_model} with R² = {grid_model_performance[best_grid_model]['R²']:.4f}")

# Use the best model for final predictions
final_predictions_grid = best_grid_models[best_grid_model].predict(test)

# Ensure no negative sales values
final_predictions_grid = np.maximum(final_predictions_grid, 0)

# Format the final output
final_output_grid = original_test_identifiers.copy()  # Use the original identifiers
final_output_grid['Item_Outlet_Sales'] = final_predictions_grid

# Save to CSV in the requested format
final_output_grid.to_csv('final_prediction_grid_search.csv', index=False)

# Display first few rows of the output
print("\nFinal Prediction Output with GridSearchCV:")
print(final_output_grid.head())



Best GridSearchCV Model: RandomForest with R² = 0.6174

Final Prediction Output with GridSearchCV:
  Item_Identifier Outlet_Identifier  Item_Outlet_Sales
0           FDW58            OUT049        1608.551442
1           FDW14            OUT017        1436.188902
2           NCN55            OUT010         551.496070
3           FDQ58            OUT017        2492.406512
4           FDY38            OUT027        6294.357435
