<a href="https://colab.research.google.com/github/ibudeX/Product_Sale_Prediction/blob/main/Regression_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression

### Our Project: Predicting Product Sales

In this notebook, we will build machine learning models to predict how many units of a product a store will sell. This problem is called product sales prediction, or sales volume prediction.

Sales volume is the number of units of a product sold during an observed period. For organizations, a reliable estimate of sales volume helps them:
- Plan stock levels before demand arrives
- Decide how much to spend on advertising and promotions
- Set prices and discounts with competitors in mind
- Reduce the costs of overstocking and stockouts

Because sales volume is a number rather than a category, this is a regression problem.

We will learn and implement several machine learning algorithms, compare their performance, and understand which algorithm works best for this problem.



## Step 1: Import Required Libraries

Libraries are collections of pre-written code that help us perform specific tasks. Instead of writing everything from scratch, we use libraries to make our work easier and faster.

Here are the libraries we will use:

- pandas: For loading and manipulating data in tables
- numpy: For numerical calculations and working with arrays
- matplotlib and seaborn: For creating visualizations and charts
- sklearn: The main machine learning library that contains all the algorithms we need
- xgboost and lightgbm: Advanced machine learning libraries for gradient boosting algorithms

In [None]:
!pip install xgboost lightgbm


In [None]:
# Import libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning algorithms (regressors, because the target is numeric)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Import tools for data preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder

# Import tools for model evaluation
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Import tools for visualizing decision trees
from sklearn.tree import plot_tree

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Set random seed for reproducibility (ensures we get the same results every time)
np.random.seed(42)

# Ignore warnings to keep output clean
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 2: Load the Dataset

Now we will load our product data from a CSV file. CSV stands for Comma Separated Values, which is a simple file format for storing tabular data.

We use pandas to read the CSV file and store it in a DataFrame. A DataFrame is like a spreadsheet or table where data is organized in rows and columns.

In [None]:
# Load the dataset from CSV file
df = pd.read_csv('Product_Sales.csv')

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()



First 5 rows of the dataset:


Unnamed: 0,ProductID,ProductCategory,ProductPrice,ProductAge,AdvertisingSpend,DiscountPercentage,HasPromotion,SocialMediaMentions,CompetitorPrice,NumCompetitors,Month,Quarter,IsHolidaySeason,DayOfWeek,StoreLocation,StoreSize,StockLevel,SalesVolume
0,PROD00001,Clothing,51.49,54,9146.65,10,Yes,158,47.8,10,10,Q4,No,Sunday,Rural,Medium,High,784
1,PROD00002,Books,24.2,31,5751.13,15,No,142,21.35,1,8,Q3,No,Wednesday,Urban,Medium,Medium,503
2,PROD00003,Sports,131.82,35,7381.92,25,Yes,146,139.6,7,9,Q3,No,Friday,Rural,Small,Low,209
3,PROD00004,Home & Garden,87.8,49,4901.19,20,No,159,91.47,4,4,Q2,No,Sunday,Suburban,Medium,Low,354
4,PROD00005,Electronics,454.38,36,288.58,10,No,140,482.9,7,10,Q4,No,Wednesday,Urban,Medium,Medium,10


In [None]:
# Display basic information about the dataset
print("\nDataset shape (rows, columns):", df.shape)
print("\nColumn names and data types:")
df.info()


Dataset shape (rows, columns): (5000, 18)

Column names and data types:
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ProductID            5000 non-null   str    
 1   ProductCategory      5000 non-null   str    
 2   ProductPrice         5000 non-null   float64
 3   ProductAge           5000 non-null   int64  
 4   AdvertisingSpend     5000 non-null   float64
 5   DiscountPercentage   5000 non-null   int64  
 6   HasPromotion         5000 non-null   str    
 7   SocialMediaMentions  5000 non-null   int64  
 8   CompetitorPrice      5000 non-null   float64
 9   NumCompetitors       5000 non-null   int64  
 10  Month                5000 non-null   int64  
 11  Quarter              5000 non-null   str    
 12  IsHolidaySeason      5000 non-null   str    
 13  DayOfWeek            5000 non-null   str    
 14  StoreLocation        5000 

## Step 3: Understanding the Dataset

Before building any predictive or analytical models, it is important to understand the structure of the dataset and the meaning of each variable. This dataset contains information about products, pricing, promotions, store characteristics, and sales performance.

### Column Descriptions

The dataset consists of **5,000 records**, where each row represents a product observed over a specific time period. Below is a description of each column and what it represents:

**ProductID**
A unique identifier assigned to each product. This helps distinguish one product from another in the dataset.

**ProductCategory**
The category to which the product belongs (for example, Electronics, Clothing, or Food). Product category can influence pricing strategies and sales volume.

**ProductPrice**
The selling price of the product. Price is a key factor that directly affects customer demand and sales volume.

**ProductAge**
The age of the product in months or years since its introduction to the market. Older products may experience declining sales compared to newer ones.

**AdvertisingSpend**
The amount of money spent on advertising for the product. Higher advertising spend is often associated with increased visibility and higher sales.

**DiscountPercentage**
The percentage discount applied to the product price. Discounts can stimulate demand and boost short-term sales.

**HasPromotion**
Indicates whether the product is currently under a promotional campaign.

* Yes = Product has an active promotion
* No = No promotion applied

**SocialMediaMentions**
The number of times the product is mentioned on social media platforms. This serves as a proxy for product popularity and public interest.

**CompetitorPrice**
The average price of similar competing products in the market. This helps assess the product’s price competitiveness.

**NumCompetitors**
The number of competing products available in the market. Higher competition may reduce market share and sales volume.

**Month**
The month of the year (1–12) when the data was recorded. This helps capture monthly trends and seasonality.

**Quarter**
The quarter of the year (Q1, Q2, Q3, Q4). Quarterly patterns are useful for understanding seasonal sales behavior.

**IsHolidaySeason**
Indicates whether the observation falls within a holiday season.

* Yes = Holiday period
* No = Non-holiday period

Sales often increase during holiday seasons.

**DayOfWeek**
The day of the week when sales were recorded (e.g., Monday, Tuesday). Sales volume may vary across weekdays and weekends.

**StoreLocation**
The geographical location of the store (such as urban, suburban, or rural). Location can influence customer traffic and purchasing power.

**StoreSize**
The size of the store (for example, small, medium, or large). Larger stores often have higher sales capacity.

**StockLevel**
The availability level of the product in the store (e.g., low, medium, high). Low stock levels may limit sales even if demand is high.

**SalesVolume**
This is the **target variable** of the dataset. It represents the number of units sold for a product during the observed period. The goal of analysis or modeling is often to understand and predict this value based on the other features.

### What Are We Trying to Predict?

Our goal is to predict the SalesVolume column. We want to build a model that can look at the other columns and estimate how many units of a product will be sold. This is called a regression problem because the target is a numeric value rather than one of a fixed set of categories.

In [None]:

print("Statistical Summary of Numerical Columns:")
df.describe()



Statistical Summary of Numerical Columns:


Unnamed: 0,ProductPrice,ProductAge,AdvertisingSpend,DiscountPercentage,SocialMediaMentions,CompetitorPrice,NumCompetitors,Month,SalesVolume
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,116.768962,29.7786,5024.273202,12.293,149.968,119.193308,5.5302,6.4326,489.2802
std,120.014916,17.539587,2890.162434,12.126276,12.492101,123.051875,2.853825,3.474978,303.683002
min,5.02,0.0,2.52,0.0,110.0,5.04,1.0,1.0,10.0
25%,33.275,15.0,2511.0375,0.0,141.0,33.595,3.0,3.0,275.0
50%,70.315,29.0,5058.965,10.0,150.0,71.36,6.0,6.0,431.0
75%,151.84,45.0,7545.2725,20.0,158.0,153.615,8.0,9.0,632.0
max,499.72,60.0,9999.25,50.0,198.0,578.25,10.0,12.0,2185.0


In [None]:
# Check how often each sales volume value occurs
print("\nDistribution of Target Variable (Sales Volume):")
df['SalesVolume'].value_counts()



Distribution of Target Variable (Sales Volume):


SalesVolume
10      27
345     15
297     15
456     15
262     14
        ..
1274     1
851      1
1389     1
730      1
59       1
Name: count, Length: 1176, dtype: int64

In [None]:
print("\nPercentage:")
df['SalesVolume'].value_counts(normalize=True) * 100


Percentage:


SalesVolume
10      0.54
345     0.30
297     0.30
456     0.30
262     0.28
        ... 
1274    0.02
851     0.02
1389    0.02
730     0.02
59      0.02
Name: proportion, Length: 1176, dtype: float64
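
Because SalesVolume takes over a thousand distinct values, counting each value separately says little about the overall shape of the target. One possible alternative is to group the values into ranges first. The sketch below is illustrative only: it uses a synthetic series as a stand-in for `df['SalesVolume']`, and the bin edges are an arbitrary assumption rather than anything taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['SalesVolume'] (hypothetical values, not the real data)
rng = np.random.default_rng(42)
sales = pd.Series(rng.integers(10, 2186, size=1000), name="SalesVolume")

# Group the numeric target into ranges and count the rows in each range
# (bin edges are an arbitrary choice for illustration)
bins = [0, 250, 500, 750, 1000, 2500]
binned_counts = pd.cut(sales, bins=bins).value_counts().sort_index()

print(binned_counts)
```

On the real data, replacing `sales` with `df['SalesVolume']` would give a compact view of where most products fall.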

## Step 4: Checking for Missing Values

Missing values are empty cells in our dataset where data is absent. For example, if a product's age is not recorded, that cell would be empty or contain a special value like NaN (Not a Number).

Missing values can cause problems when training machine learning models, so we need to check if our dataset has any missing values and handle them appropriately.

In [None]:
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Check total missing values
total_missing = df.isnull().sum().sum()
print(f"\nTotal missing values in dataset: {total_missing}")

if total_missing == 0:
    print("Great! Our dataset has no missing values.")
else:
    print("We need to handle missing values before proceeding.")

Missing values in each column:
ProductID              0
ProductCategory        0
ProductPrice           0
ProductAge             0
AdvertisingSpend       0
DiscountPercentage     0
HasPromotion           0
SocialMediaMentions    0
CompetitorPrice        0
NumCompetitors         0
Month                  0
Quarter                0
IsHolidaySeason        0
DayOfWeek              0
StoreLocation          0
StoreSize              0
StockLevel             0
SalesVolume            0
dtype: int64

Total missing values in dataset: 0
Great! Our dataset has no missing values.
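
Our dataset needs no repair, but it may help to see what handling missing values could look like. The following is a minimal sketch on a hypothetical toy table (the values and the choice of median and most-frequent-value filling are assumptions for illustration, not steps this notebook performs).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps (the real dataset has none)
toy = pd.DataFrame({
    "ProductPrice": [51.49, np.nan, 131.82, 87.80],
    "StoreSize": ["Medium", "Small", None, "Medium"],
})

# Numeric column: fill gaps with the column median
toy["ProductPrice"] = toy["ProductPrice"].fillna(toy["ProductPrice"].median())

# Text column: fill gaps with the most frequent value
toy["StoreSize"] = toy["StoreSize"].fillna(toy["StoreSize"].mode()[0])

print(toy)
print("Remaining missing values:", toy.isnull().sum().sum())
```

Other strategies, such as dropping incomplete rows, can be more appropriate depending on how much data is missing and why.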


## Step 5: Data Preprocessing

Data preprocessing is the process of preparing our data for machine learning. Raw data often needs to be cleaned and transformed before we can use it to train models.

### Why Do We Need Preprocessing?

Most machine learning algorithms work only with numerical data. However, our dataset contains several columns with text (categorical) values such as product categories, store characteristics, and seasonal indicators. These text values need to be converted into numerical form before they can be used by machine learning models.
In addition, some columns may not contribute useful information for prediction and should be removed to avoid unnecessary complexity.

### Identifying Columns to Remove

Not all columns are useful for predicting sales performance. Below are the columns identified for removal:

ProductID: This is a unique identifier assigned to each product. While it helps with record tracking, it does not contain any meaningful information that can influence sales volume and therefore has no predictive value.

### Encoding Categorical Variables

Several columns in the dataset contain categorical (text-based) data and must be converted into numerical form. This process is known as encoding.

The following columns require encoding: ProductCategory, HasPromotion, Quarter, IsHolidaySeason, DayOfWeek, StoreLocation, StoreSize, StockLevel

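We will use Label Encoding to transform these categorical values into numbers. Label Encoding assigns a unique integer to each category within a column, making the data suitable for machine learning algorithms. Keep in mind that the integers follow the alphabetical order of the category names, not any natural ranking (for example, StoreSize becomes Large = 0, Medium = 1, Small = 2), so a model may read an ordering into the numbers that the categories do not have.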
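
To make the behavior of Label Encoding concrete, here is a small sketch on a hypothetical four-value column. It shows that `LabelEncoder` numbers categories in sorted order, and contrasts this with an explicit mapping. The mapping (`size_order`) is an assumption added for illustration and is not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical example column
sizes = pd.Series(["Small", "Medium", "Large", "Medium"])

# LabelEncoder sorts the category names before numbering them
le = LabelEncoder()
encoded = le.fit_transform(sizes)
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(encoded.tolist())

# Alternative (an assumption, not the notebook's method):
# an explicit mapping that preserves Small < Medium < Large
size_order = {"Small": 0, "Medium": 1, "Large": 2}
ordinal = sizes.map(size_order)
print(ordinal.tolist())
```

For columns with a real ranking, such as StoreSize or StockLevel, an explicit mapping like this may give models a more meaningful signal.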

In [None]:
# Create a copy of the dataframe to preserve the original
df_processed = df.copy()

# Remove columns that are not useful for prediction
columns_to_remove = ['ProductID']
df_processed = df_processed.drop(columns=columns_to_remove)
print(f"Removed columns: {columns_to_remove}")

# Display categorical columns that need encoding
print("\nCategorical columns to encode:")
categorical_columns = df_processed.select_dtypes(include=['object']).columns.tolist()
print(categorical_columns)

# Create label encoders for each categorical column
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])
    label_encoders[col] = le

    print(f"\n{col} encoding:")
    for i, label in enumerate(le.classes_):
        print(f"  {label} -> {i}")



Removed columns: ['ProductID']

Categorical columns to encode:
['ProductCategory', 'HasPromotion', 'Quarter', 'IsHolidaySeason', 'DayOfWeek', 'StoreLocation', 'StoreSize', 'StockLevel']

ProductCategory encoding:
  Books -> 0
  Clothing -> 1
  Electronics -> 2
  Food -> 3
  Home & Garden -> 4
  Sports -> 5

HasPromotion encoding:
  No -> 0
  Yes -> 1

Quarter encoding:
  Q1 -> 0
  Q2 -> 1
  Q3 -> 2
  Q4 -> 3

IsHolidaySeason encoding:
  No -> 0
  Yes -> 1

DayOfWeek encoding:
  Friday -> 0
  Monday -> 1
  Saturday -> 2
  Sunday -> 3
  Thursday -> 4
  Tuesday -> 5
  Wednesday -> 6

StoreLocation encoding:
  Rural -> 0
  Suburban -> 1
  Urban -> 2

StoreSize encoding:
  Large -> 0
  Medium -> 1
  Small -> 2

StockLevel encoding:
  High -> 0
  Low -> 1
  Medium -> 2


In [None]:
# Display the processed dataset
print("\nProcessed dataset (first 5 rows):")
df_processed.head()




Processed dataset (first 5 rows):


Unnamed: 0,ProductCategory,ProductPrice,ProductAge,AdvertisingSpend,DiscountPercentage,HasPromotion,SocialMediaMentions,CompetitorPrice,NumCompetitors,Month,Quarter,IsHolidaySeason,DayOfWeek,StoreLocation,StoreSize,StockLevel,SalesVolume
0,1,51.49,54,9146.65,10,1,158,47.8,10,10,3,0,3,0,1,0,784
1,0,24.2,31,5751.13,15,0,142,21.35,1,8,2,0,6,2,1,2,503
2,5,131.82,35,7381.92,25,1,146,139.6,7,9,2,0,0,0,2,1,209
3,4,87.8,49,4901.19,20,0,159,91.47,4,4,1,0,3,1,1,1,354
4,2,454.38,36,288.58,10,0,140,482.9,7,10,3,0,6,2,1,2,10


## Step 6: Splitting the Data into Features and Target

In machine learning, we separate our data into two parts:

1. Features (X): These are the input columns that we use to make predictions. Features are the information we know about each product, such as its price, advertising spend, and discount percentage.

2. Target (y): This is the output column that we want to predict. In our case, it's the SalesVolume column that tells us how many units of a product were sold.

Think of it like this: Features are the clues, and the target is the answer we're trying to guess.

In [None]:
# Separate features (X) and target (y)
X = df_processed.drop('SalesVolume', axis=1)  # All columns except Sales Volume
y = df_processed['SalesVolume']  # Only the Sales Volume column

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)



Features (X) shape: (5000, 16)
Target (y) shape: (5000,)


In [None]:
print("\nFeature columns:")
print(X.columns.tolist())




Feature columns:
['ProductCategory', 'ProductPrice', 'ProductAge', 'AdvertisingSpend', 'DiscountPercentage', 'HasPromotion', 'SocialMediaMentions', 'CompetitorPrice', 'NumCompetitors', 'Month', 'Quarter', 'IsHolidaySeason', 'DayOfWeek', 'StoreLocation', 'StoreSize', 'StockLevel']


In [None]:
print("\nFirst few rows of features:")
X.head()




First few rows of features:


Unnamed: 0,ProductCategory,ProductPrice,ProductAge,AdvertisingSpend,DiscountPercentage,HasPromotion,SocialMediaMentions,CompetitorPrice,NumCompetitors,Month,Quarter,IsHolidaySeason,DayOfWeek,StoreLocation,StoreSize,StockLevel
0,1,51.49,54,9146.65,10,1,158,47.8,10,10,3,0,3,0,1,0
1,0,24.2,31,5751.13,15,0,142,21.35,1,8,2,0,6,2,1,2
2,5,131.82,35,7381.92,25,1,146,139.6,7,9,2,0,0,0,2,1
3,4,87.8,49,4901.19,20,0,159,91.47,4,4,1,0,3,1,1,1
4,2,454.38,36,288.58,10,0,140,482.9,7,10,3,0,6,2,1,2


In [None]:
print("\nFirst few values of target:")
y.head()



First few values of target:


0    784
1    503
2    209
3    354
4     10
Name: SalesVolume, dtype: int64

## Step 8: Train-Test Split

### What is Train-Test Split?

Before we train our machine learning models, we need to split our data into two parts:

1. Training Set: This is the data we use to teach the model. The model learns patterns from this data.

2. Testing Set: This is the data we use to evaluate how well the model performs. The model has never seen this data during training.

### Why Do We Need This Split?

Imagine you are studying for an exam. You practice with sample questions (training data), and then you take the actual exam with different questions (testing data). If the exam only had the exact same questions you practiced, you might do well but it wouldn't truly test your understanding. Similarly, we test our model on new data it hasn't seen to check if it really learned the patterns or just memorized the training data.

This concept is called generalization. We want our model to generalize well, meaning it should perform well on new, unseen data, not just the data it was trained on.
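
The gap between memorizing and generalizing can be seen directly by scoring a model on both its training data and held-out data. The sketch below uses synthetic data and an unrestricted decision tree; every name in it (`X_demo`, `y_demo`, `deep_tree`) is hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus noise (not the product dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A tree with no depth limit can fit the training noise exactly
deep_tree = DecisionTreeRegressor(random_state=42)
deep_tree.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, deep_tree.predict(X_tr))
test_r2 = r2_score(y_te, deep_tree.predict(X_te))

print(f"Train R²: {train_r2:.4f}")
print(f"Test  R²: {test_r2:.4f}")
```

A training score far above the test score is the usual sign that a model has memorized rather than generalized.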

### The 80-20 Split

We typically use 80% of our data for training and 20% for testing. This is a common practice in machine learning that gives the model enough data to learn from while keeping sufficient data to evaluate its performance.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape[0], "products")
print("Testing set size:", X_test.shape[0], "products")



Training set size: 4000 products
Testing set size: 1000 products


## Step 9: Understanding Evaluation Metrics

Before we start building models, we need to understand how to measure their performance. Just like students get grades on exams, machine learning models get evaluated using specific metrics.

### The Three Key Metrics

We will use three main metrics to evaluate our models. In the formulas below, "actual" is the true sales volume, "predicted" is the model's estimate, and n is the number of predictions.

#### 1. R² Score
R² Score (the coefficient of determination) tells us what share of the variation in the target the model explains. It's calculated as:

R² Score = 1 - Σ(actual - predicted)² / Σ(actual - mean of actual)²

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean. The score can be negative if the model does worse than that. For example, an R² score of 0.85 means the model explains 85% of the variation in sales volume.

#### 2. MAE
MAE (Mean Absolute Error) tells us: On average, how far are our predictions from the actual values?

MAE = Σ|actual - predicted| / n

MAE is measured in the same units as the target, so an MAE of 100 means predictions are off by 100 units sold on average. Lower is better.

#### 3. RMSE
RMSE (Root Mean Squared Error) also measures the typical size of the prediction error, but it squares each error before averaging.

RMSE = sqrt(Σ(actual - predicted)² / n)

Because of the squaring, RMSE penalizes large errors more heavily than MAE. RMSE is always at least as large as MAE, and a wide gap between the two points to a few large misses. Lower is better.

### Prediction Errors

Every prediction has an error (also called a residual), which is the actual value minus the predicted value:

- Positive error: The model predicted too low (under-prediction).
- Negative error: The model predicted too high (over-prediction).
- Zero error: The prediction matched the actual value exactly.

MAE and RMSE both ignore the sign of the error and measure only its size.
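
For a regression target such as SalesVolume, the metrics that scikit-learn reports (R², MAE, and RMSE) can be reproduced by hand with a few lines of NumPy. The sketch below uses the first five actual sales volumes shown earlier and a set of made-up predictions; the `y_pred` values are hypothetical and do not come from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([784, 503, 209, 354, 10])   # first five SalesVolume values
y_pred = np.array([700, 550, 250, 300, 60])   # hypothetical predictions

errors = y_true - y_pred

# MAE: average size of the errors, ignoring their sign
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (squared error of the model / squared error of predicting the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE  by hand: {mae:.4f}   sklearn: {mean_absolute_error(y_true, y_pred):.4f}")
print(f"RMSE by hand: {rmse:.4f}   sklearn: {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
print(f"R²   by hand: {r2:.4f}   sklearn: {r2_score(y_true, y_pred):.4f}")
```

The hand-computed values and the scikit-learn values should agree, which confirms what each metric measures.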

Let's create a function to display all these metrics in an organized way.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

def evaluate_model(model_name, y_true, y_pred):
    """
    Evaluates a regression model and displays key performance metrics.

    Parameters:
    - model_name: Name of the model being evaluated
    - y_true: Actual target values
    - y_pred: Predicted target values
    """

    print(f"\n{'='*60}")
    print(f"Evaluation Results for {model_name}")
    print(f"{'='*60}\n")

    # Calculate metrics
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))

    # Display metrics
    print(f"R² Score : {r2:.4f}")
    print(f"MAE      : {mae:.4f}")
    print(f"RMSE     : {rmse:.4f}\n")

    return r2, mae, rmse


print("Evaluation function created successfully!")
print("We will use this function to evaluate all regression models.")

Evaluation function created successfully!
We will use this function to evaluate all regression models.


# Machine Learning Algorithms



## Algorithm 1: Linear Regression


In [None]:
# Create a Linear Regression model
linear_model = LinearRegression()

# Train the model on training data
# This is where the model learns patterns from the data
print("Training Linear Regression model...")
linear_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions on test data
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model
lr_r2, lr_mae, lr_rmse = evaluate_model("Linear Regression", y_test, y_pred_linear)



Training Linear Regression model...
Training completed!

Evaluation Results for Linear Regression

R² Score : 0.8083
MAE      : 101.6684
RMSE     : 137.0120



## Algorithm 2: Decision Tree



In [None]:
# Create a Decision Tree REGRESSION model
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=20,
    random_state=42
)

# Train the model
print("Training Decision Tree Regression model...")
tree_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions
y_pred_tree = tree_model.predict(X_test)

# Evaluate the model (Regression metrics)
dt_r2, dt_mae, dt_rmse = evaluate_model(
    "Decision Tree Regression",
    y_test,
    y_pred_tree
)



Training Decision Tree Regression model...
Training completed!

Evaluation Results for Decision Tree Regression

R² Score : 0.7257
MAE      : 125.0824
RMSE     : 163.8784



## Algorithm 3: Random Forest


In [None]:
# Create a Random Forest REGRESSION model


from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1  # use all CPU cores for faster training
)

# Train the model
print("Training Random Forest Regression model...")
print("This may take a moment as we're training 100 trees...")
rf_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model (Regression metrics)
rf_r2, rf_mae, rf_rmse = evaluate_model(
    "Random Forest Regression",
    y_test,
    y_pred_rf
)

Training Random Forest Regression model...
This may take a moment as we're training 100 trees...
Training completed!

Evaluation Results for Random Forest Regression

R² Score : 0.8604
MAE      : 86.4011
RMSE     : 116.9216



## Algorithm 4: Gradient Boosting


In [None]:

from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    random_state=42
)

# Train the model
print("Training Gradient Boosting Regression model...")
print("This may take a moment as trees are built sequentially...")
gb_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model (Regression metrics)
gb_r2, gb_mae, gb_rmse = evaluate_model(
    "Gradient Boosting Regression",
    y_test,
    y_pred_gb
)



Training Gradient Boosting Regression model...
This may take a moment as trees are built sequentially...
Training completed!

Evaluation Results for Gradient Boosting Regression

R² Score : 0.9288
MAE      : 62.7052
RMSE     : 83.5276



## Algorithm 5: XGBoost (Extreme Gradient Boosting)



In [None]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
    eval_metric='rmse',
    n_jobs=-1
)

# Train the model
print("Training XGBoost Regression model...")
xgb_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model (Regression metrics)
xgb_r2, xgb_mae, xgb_rmse = evaluate_model(
    "XGBoost Regression",
    y_test,
    y_pred_xgb
)



Training XGBoost Regression model...
Training completed!

Evaluation Results for XGBoost Regression

R² Score : 0.9265
MAE      : 63.4158
RMSE     : 84.8444



## Algorithm 6: LightGBM (Light Gradient Boosting Machine)


In [None]:
from lightgbm import LGBMRegressor

lgbm_model = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
    verbose=-1
)

# Train the model
print("Training LightGBM Regression model...")
lgbm_model.fit(X_train, y_train)
print("Training completed!")

# Make predictions
y_pred_lgbm = lgbm_model.predict(X_test)

# Evaluate the model (Regression metrics)
lgbm_r2, lgbm_mae, lgbm_rmse = evaluate_model(
    "LightGBM Regression",
    y_test,
    y_pred_lgbm
)




Training LightGBM Regression model...
Training completed!

Evaluation Results for LightGBM Regression

R² Score : 0.9282
MAE      : 63.2684
RMSE     : 83.8698



In [None]:
# Create a comparison dataframe
results_df = pd.DataFrame({
    "Model": [
        "Linear Regression",
        "Decision Tree",
        "Random Forest",
        "Gradient Boosting",
        "XGBoost",
        "LightGBM"
    ],
    "R² Score": [lr_r2, dt_r2, rf_r2, gb_r2, xgb_r2, lgbm_r2],
    "MAE": [lr_mae, dt_mae, rf_mae, gb_mae, xgb_mae, lgbm_mae],
    "RMSE": [lr_rmse, dt_rmse, rf_rmse, gb_rmse, xgb_rmse, lgbm_rmse]
})

# Sort by R² Score (descending)
results_df = results_df.sort_values("R² Score", ascending=False)

# Display comparison
print("\n" + "="*80)
print("MODEL COMPARISON - TRAIN-TEST SPLIT RESULTS")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

# Find best model
best_model_idx = results_df["R² Score"].idxmax()
best_model_name = results_df.loc[best_model_idx, "Model"]
best_r2 = results_df.loc[best_model_idx, "R² Score"]

print(f"\nBased on R² Score, the best performing model is: {best_model_name}")
print(f"R² Score: {best_r2:.4f}")
print("\nNotice how the R² scores reflect how well each model predicts the target variable.")



MODEL COMPARISON - TRAIN-TEST SPLIT RESULTS
            Model  R² Score        MAE       RMSE
Gradient Boosting  0.928751  62.705162  83.527556
         LightGBM  0.928166  63.268371  83.869825
          XGBoost  0.926486  63.415829  84.844380
    Random Forest  0.860392  86.401063 116.921608
Linear Regression  0.808293 101.668413 137.011998
    Decision Tree  0.725739 125.082421 163.878398

Based on R² Score, the best performing model is: Gradient Boosting
R² Score: 0.9288

Notice how the R² scores reflect how well each model predicts the target variable.


## Introduction to K-Fold Cross-Validation

### What is Cross-Validation?

So far, we've evaluated our models using a single train-test split. While this gives us useful information, it has a limitation: our results depend on which specific 20% of data happened to end up in the test set. If we had chosen different data for testing, we might get slightly different results.

Cross-validation is a more robust way to evaluate model performance. It gives us a better estimate of how well our model will perform on new, unseen data.

### What is K-Fold Cross-Validation?

K-Fold Cross-Validation is a specific type of cross-validation. Here's how it works:

1. Divide the entire dataset into K equal parts (called "folds"). K is typically 5 or 10.
2. Perform K rounds of training and testing:
   - In round 1: Use fold 1 as test set, train on folds 2-K
   - In round 2: Use fold 2 as test set, train on folds 1 and 3-K
   - Continue until each fold has been used as a test set exactly once
3. Calculate the average performance across all K rounds
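
The three steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data, not the notebook's dataset; the names `X_demo`, `y_demo`, and `fold_scores` are hypothetical, and `shuffle=True` is a choice made here for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data (not the product dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=200)

# Step 1: divide the data into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Step 2: each fold takes one turn as the test set
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    score = r2_score(y_demo[test_idx], model.predict(X_demo[test_idx]))
    fold_scores.append(score)
    print(f"Fold {fold}: R² = {score:.4f}")

# Step 3: average the K scores
print(f"Mean R²: {np.mean(fold_scores):.4f}")
```

In practice, `cross_val_score` from scikit-learn wraps this loop in a single call.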

### An Analogy

Imagine you want to evaluate an employee's performance. Instead of giving just one project review, you evaluate them on five different projects throughout the year and calculate the average score. This average gives you a more reliable estimate of their ability than a single project would.

### Why Use K-Fold Cross-Validation?

1. More Reliable: We get multiple performance estimates instead of just one
2. Better Use of Data: Every data point gets used for both training and testing
3. Reduces Variance: Results are less dependent on one particular split
4. Detects Overfitting: Large differences between folds can indicate overfitting

### Choosing K

Common choices for K:
- K=5: Good balance between computational cost and reliability
- K=10: More computationally expensive but often provides better estimates
- K=number of samples: Called Leave-One-Out Cross-Validation, very expensive but maximally thorough

For our tutorial, we'll use K=5 as it provides a good balance.

Let's now apply K-Fold Cross-Validation to all our models.

In [None]:
# Dictionary of all regression models
models = {
    'Linear Regression': linear_model,
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model,
    'XGBoost': xgb_model,
    'LightGBM': lgbm_model
}


from sklearn.model_selection import cross_val_score

# Store cross-validation results
cv_results = {}

print("Performing 5-Fold Cross-Validation for all models...\n")

for name, model in models.items():
    print(f"Evaluating {name}...")

    # 5-Fold CV for regression
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

    # Store results
    cv_results[name] = {
        'scores': cv_scores,
        'mean': cv_scores.mean(),
        'std': cv_scores.std()
    }

    # Display results for each fold
    for i, score in enumerate(cv_scores, start=1):
        print(f"  Fold {i}: {score:.4f}")
    print(f"  Mean R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})\n")

print("Cross-validation completed!")


Performing 5-Fold Cross-Validation for all models...

Evaluating Linear Regression...
  Fold 1: 0.8198
  Fold 2: 0.7977
  Fold 3: 0.8069
  Fold 4: 0.8014
  Fold 5: 0.8173
  Mean R² Score: 0.8086 (+/- 0.0087)

Evaluating Random Forest...
  Fold 1: 0.8691
  Fold 2: 0.8523
  Fold 3: 0.8574
  Fold 4: 0.8365
  Fold 5: 0.8582
  Mean R² Score: 0.8547 (+/- 0.0106)

Evaluating Gradient Boosting...
  Fold 1: 0.9354
  Fold 2: 0.9308
  Fold 3: 0.9248
  Fold 4: 0.9243
  Fold 5: 0.9324
  Mean R² Score: 0.9295 (+/- 0.0043)

Evaluating XGBoost...
  Fold 1: 0.9358
  Fold 2: 0.9263
  Fold 3: 0.9261
  Fold 4: 0.9238
  Fold 5: 0.9335
  Mean R² Score: 0.9291 (+/- 0.0047)

Evaluating LightGBM...
  Fold 1: 0.9348
  Fold 2: 0.9284
  Fold 3: 0.9259
  Fold 4: 0.9231
  Fold 5: 0.9321
  Mean R² Score: 0.9289 (+/- 0.0042)

Cross-validation completed!


In [None]:
import joblib

# Save the best model (Gradient Boosting)
best_model = gb_model
joblib.dump(best_model, "best_model_gb.pkl")

print("Best model saved as 'best_model_gb.pkl'")

# To load it later (same filename as used when saving)
loaded_model = joblib.load("best_model_gb.pkl")
y_pred_loaded = loaded_model.predict(X_test)


Best model saved as 'best_model_gb.pkl'


## Understanding Cross-Validation Results

For each model, we now have five R² scores, one for each fold. Let's understand what these numbers tell us:

### Mean R² Score
The mean (average) of the five scores gives us an overall estimate of how well the model performs. This is more reliable than the single R² score we got from our train-test split.

### Standard Deviation
The standard deviation tells us how much the scores vary across folds:
- Low standard deviation: The model performs consistently across different subsets of data. This is good.
- High standard deviation: The model's performance varies a lot. This might indicate that the model is unstable or overfitting.

### What to Look For
- High mean R² score: The model performs well on average
- Low standard deviation: The model is stable and consistent
- Ideally, we want both high mean and low standard deviation
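
As a worked check, the mean and standard deviation can be recomputed from the per-fold scores printed in the cross-validation output. The sketch below copies the rounded fold scores for Gradient Boosting and Random Forest from that output; the variable names are hypothetical.

```python
import numpy as np

# Per-fold R² scores copied from the cross-validation output (rounded to 4 decimals)
gb_folds = np.array([0.9354, 0.9308, 0.9248, 0.9243, 0.9324])
rf_folds = np.array([0.8691, 0.8523, 0.8574, 0.8365, 0.8582])

for name, scores in [("Gradient Boosting", gb_folds), ("Random Forest", rf_folds)]:
    # np.std defaults to the population standard deviation, as used by cv_scores.std()
    print(f"{name}: mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

Gradient Boosting has both the higher mean and the lower spread of the two, which is the combination we are looking for.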

Let's create a comparison of cross-validation results.