
# Project 1 - Part 5

The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided.

In [1]:
## Typical Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Modeling & preprocessing import
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer,make_column_transformer,make_column_selector
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_path = '/content/drive/MyDrive/Colab Notebooks/sales_predictions_2023.csv'

In [4]:
df = pd.read_csv(file_path)

In [5]:
df.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [7]:
duplicates = df.duplicated()

In [8]:
# Reassign to df so the de-duplicated rows are the ones used in later cells
df = df.drop_duplicates()


In [9]:
# Fix inconsistencies in categorical data
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
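
A quick way to confirm that the replacement leaves only two labels is to check the distinct values afterwards. The sketch below uses a toy Series with illustrative values (not rows from the dataset) and the same mapping as the cell above.

```python
import pandas as pd

# Toy column with the inconsistent labels handled above (illustrative values only)
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'Regular', 'reg'])

# Same mapping as in the notebook cell
fat_clean = fat.replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# After the fix only two distinct labels should remain
labels = sorted(fat_clean.unique())
print(labels)
```

On the real data, the equivalent check would be df['Item_Fat_Content'].value_counts().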

In [10]:
# Select features and target
X = df.drop(columns=['Item_Outlet_Sales'])  # Features matrix (all columns except 'Item_Outlet_Sales')
y = df['Item_Outlet_Sales']  # Target vector ('Item_Outlet_Sales' column)

In [11]:
# Preprocessing pipeline for numeric features
numeric_features = ['Item_Weight', 'Item_Visibility', 'Item_MRP']
numeric_transformer = StandardScaler()

In [12]:
# Preprocessing pipeline for categorical features
categorical_features = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier']
categorical_transformer = OneHotEncoder(drop='first')

In [13]:
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [14]:
X_preprocessed = preprocessor.fit_transform(X)


In [15]:
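# Split data into training and testing sets
# Note: the preprocessor above was fit on all of X before this split, so test-set
# statistics leak into the scaling; the split is redone on the raw features in cell 17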
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

In [16]:
# Drop the 'Item_Identifier' column
X.drop(columns=['Item_Identifier'], inplace=True)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (6818, 10) (6818,)
Testing set shape: (1705, 10) (1705,)


In [19]:
# Define numeric features (columns)
numeric_features = X.select_dtypes(include=['float64', 'int64']).columns.tolist()

In [20]:
# Define categorical features (columns)
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

In [21]:
# Create preprocessing pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing values by replacing them with median
    ('scaler', StandardScaler())  # Scale numerical features
])

In [22]:
# Create preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values by replacing them with most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

In [23]:
# Combine preprocessing steps using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [24]:

# Display the preprocessing object
preprocessor
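
To see what this kind of preprocessor does to missing values and categories, here is a minimal sketch on a toy frame. The column names weight and size are hypothetical stand-ins, not columns from the sales data; the imputation strategies and encoder settings mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value per column (hypothetical columns)
toy = pd.DataFrame({
    'weight': [1.0, 2.0, np.nan, 3.0],
    'size': pd.Series(['Small', 'Medium', np.nan, 'Medium'], dtype=object),
})

# Same structure as the notebook's numeric and categorical pipelines
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'))])

toy_prep = ColumnTransformer([('num', num_pipe, ['weight']),
                              ('cat', cat_pipe, ['size'])])

out = toy_prep.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
print(out.shape)  # one scaled numeric column plus one column per category
```

The missing weight is filled with the median before scaling, and the missing size is filled with the most frequent category before encoding, so the output contains no NaN values.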

# Project 1 - Part 6

# Modeling

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [26]:
preprocessor

In [27]:
## Make and fit model
linreg_pipe = make_pipeline(preprocessor,LinearRegression())
linreg_pipe.fit(X_train, y_train)

In [28]:
# Get predictions to use to evaluate model
y_hat_train = linreg_pipe.predict(X_train)
y_hat_test = linreg_pipe.predict(X_test)

In [29]:
def evaluate_model(y_true, y_pred, split='training'):
  """Print R^2, MAE, MSE, and RMSE for one data split.

  Args:
    y_true: true target values (y_train or y_test)
    y_pred: result of model.predict(X)
    split: name of the data split being evaluated, e.g. 'training' or 'testing'
  """

  r2 = r2_score(y_true,y_pred)
  mae = mean_absolute_error(y_true,y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # RMSE is the square root of MSE; the squared=False argument is deprecated in recent scikit-learn releases
  rmse = np.sqrt(mse)


  print(f'Results for {split} data:')
  print(f"  - R^2 = {round(r2,3)}")
  print(f"  - MAE = {round(mae,3)}")
  print(f"  - MSE = {round(mse,3)}")
  print(f"  - RMSE = {round(rmse,3)}")
  print()

In [30]:
## Evaluate model's performance
evaluate_model(y_train, y_hat_train,split='training')
evaluate_model(y_test, y_hat_test,split='testing')

Results for training data:
  - R^2 = 0.559
  - MAE = 847.221
  - MSE = 1303094.392
  - RMSE = 1141.532

Results for testing data:
  - R^2 = 0.579
  - MAE = 792.025
  - MSE = 1143542.155
  - RMSE = 1069.365



**Interpretation**



Moderate and Similar R-squared Values.
The training (0.559) and test (0.579) R-squared values are similar (difference less than 0.1), which suggests the model is not overfitting and generalizes consistently to unseen data. The values are moderate rather than high, though: the model explains a little under 60% of the variance in sales, so a more flexible model may still improve the fit.

In [31]:
from sklearn.ensemble import RandomForestRegressor

In [32]:
rf_tree_pipe = make_pipeline(preprocessor,RandomForestRegressor(random_state = 42))
rf_tree_pipe.fit(X_train, y_train)

In [33]:
y_hat_train = rf_tree_pipe.predict(X_train)
y_hat_test = rf_tree_pipe.predict(X_test)

In [34]:
## Evaluate model's performance
evaluate_model(y_train, y_hat_train,split='training')
evaluate_model(y_test, y_hat_test,split='testing')

Results for training data:
  - R^2 = 0.938
  - MAE = 298.023
  - MSE = 184357.241
  - RMSE = 429.368

Results for testing data:
  - R^2 = 0.567
  - MAE = 757.836
  - MSE = 1176191.819
  - RMSE = 1084.524



**Interpretation**

High Training, Low Test R-squared.
The training R-squared (0.938) is much higher than the test R-squared (0.567), a difference well above 0.1, which indicates the model is overfitting the training data. The model performs well on data it has seen but does not generalize as well to unseen data.


Random Forest Test R-squared: 0.567
Linear Regression Test R-squared: 0.579

The linear regression model has a slightly higher test R-squared than the default Random Forest model, so it performs marginally better on unseen data by this metric. The Random Forest does have the lower test MAE (757.836 vs. 792.025).

In [35]:
from sklearn.model_selection import GridSearchCV

In [39]:
#create a range of n_estimators values
n_estimators = [100, 200, 300]

In [40]:
#create a dataframe to store train and test scores.
scores = pd.DataFrame(columns=['Train', 'Test'], index=n_estimators)

In [41]:
#loop over the values in n_estimators
for n in n_estimators:
  #create a new model with n_estimators=n
  rf = RandomForestRegressor(random_state = 42, n_estimators=n)
  #put the model into a pipeline
  rf_pipe = make_pipeline(preprocessor, rf)
  #fit the model
  rf_pipe.fit(X_train, y_train)
  #create prediction arrays
  train_pred = rf_pipe.predict(X_train)
  test_pred = rf_pipe.predict(X_test)

  #evaluate the model using R2 Score
  train_r2score = r2_score(y_train, train_pred)
  test_r2score = r2_score(y_test, test_pred)

  #store the scores in the scores dataframe
  scores.loc[n, 'Train'] = train_r2score
  scores.loc[n, 'Test'] = test_r2score

In [42]:
scores

n_estimators,Train,Test
100,0.937676,0.567254
200,0.938962,0.568283
300,0.939699,0.568033


In [43]:
best_estimators = scores.sort_values(by='Test', ascending=False).index[0]
best_estimators

200

In [44]:
best_rf = RandomForestRegressor(random_state = 42, n_estimators=best_estimators)

best_rf_pipe = make_pipeline(preprocessor, best_rf)

best_rf_pipe.fit(X_train, y_train)

print('Training Scores for Tuned Random Forest')
evaluate_model(y_train, best_rf_pipe.predict(X_train), split = 'training')

print('\n')

print('Testing Scores for Tuned Random Forest')
evaluate_model(y_test, best_rf_pipe.predict(X_test), split = 'testing')

Training Scores for Tuned Random Forest
Results for training data:
  - R^2 = 0.939
  - MAE = 296.052
  - MSE = 180552.445
  - RMSE = 424.915



Testing Scores for Tuned Random Forest
Results for testing data:
  - R^2 = 0.568
  - MAE = 757.525
  - MSE = 1173392.776
  - RMSE = 1083.233



# Model Recommendation
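Based on the exploration of different models, I recommend the tuned Random Forest model (200 trees) for the sales prediction task, with the caveat that its advantage over linear regression is small.

Here's the justification:

**Hyperparameter tuning**: A manual search over n_estimators (100, 200, 300) selected 200 trees. The gain over the default Random Forest is marginal (test R-squared 0.568 vs. 0.567), and because the value was chosen using the test set, the reported test score is slightly optimistic.

**Test error**: The tuned Random Forest has the lowest test MAE (757.525 vs. 792.025 for linear regression). Linear regression has a slightly higher test R-squared (0.579 vs. 0.568) and a lower test RMSE (1069.365 vs. 1083.233), so the two models are close.

**Flexibility**: Random Forest models are more flexible than linear regression and can capture non-linear relationships between features and the target variable. That flexibility has a cost here: the forest overfits (training R-squared 0.939 vs. test 0.568), so constraining tree depth or leaf size is a natural next step.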

**Interpretability (to a certain extent)**: While not as interpretable as linear regression through coefficients, Random Forest models provide feature importance scores that can offer insights into which features contribute most to the predictions. This can be valuable for understanding the key drivers of sales in your data.
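
A cross-validated search with GridSearchCV (imported earlier in the notebook) would choose hyperparameters without looking at the test set. The following is only a sketch under stated assumptions: synthetic data from make_regression stands in for the sales data, and the grid values are illustrative and kept small so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the sales data (assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Illustrative grid; real searches would usually cover more values
param_grid = {'n_estimators': [10, 20], 'max_depth': [3, None]}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=3, scoring='r2')
search.fit(X_tr, y_tr)  # hyperparameters are chosen on training folds only

print(search.best_params_)
test_r2 = search.score(X_te, y_te)  # the test set is used once, after selection
print(round(test_r2, 3))
```

When the estimator is a pipeline built with make_pipeline, the grid keys take the step name as a prefix, for example randomforestregressor__n_estimators.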

# Interpreting R-squared for Stakeholders
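R-squared tells you how well your model's predictions fit the actual data. It is at most 1 and usually falls between 0 and 1, although it can be negative on unseen data when a model predicts worse than simply guessing the average. Here's an analogy for non-technical stakeholders:

Think of each prediction as an arrow shot at a target whose bullseye is the actual sales value. R-squared measures how much more tightly the arrows cluster around the bullseye than they would if every shot were aimed at the average sales figure. A higher R-squared (closer to 1) signifies that your model's predictions are generally close to the actual sales values. For example, the linear regression's test R-squared of 0.579 means the model explains about 58% of the variation in sales.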
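
For readers who want the arithmetic behind the metric, R-squared is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below works this out by hand on four made-up values (illustrative, not from the dataset) and compares the result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))

# Predictions that are worse than the mean give a negative R-squared
y_bad = np.array([400.0, 300.0, 200.0, 100.0])
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)
```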

# Choosing Another Regression Metric: Mean Squared Error (MSE)
While R-squared is a good starting point, I also recommend considering the Mean Squared Error (MSE) metric for a more comprehensive picture.

MSE is the average squared difference between the predicted and actual sales values. Because the differences are squared, large errors are penalized more heavily than small ones. A lower MSE indicates that the model's predictions are, on average, closer to the actual sales figures.

# Why MSE?

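R-squared is a proportion, so it does not show the magnitude of errors directly. MSE measures error magnitude, but in squared sales units; taking its square root gives the RMSE, which is in the same units as the target variable (sales in this case) and is easier to read as a typical prediction error. For example, the linear regression's test MSE of 1143542.155 corresponds to an RMSE of 1069.365.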
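
Taking the square root of MSE gives RMSE, which is expressed in the same units as sales. As a quick check, the sketch below recomputes RMSE from the test-set MSE values reported by evaluate_model earlier in the notebook; the variable names are illustrative.

```python
import math

# Test-set MSE values reported earlier in the notebook
mse_linreg_test = 1143542.155
mse_rf_test = 1176191.819

# RMSE is the square root of MSE, so it is back in sales units
rmse_linreg_test = math.sqrt(mse_linreg_test)
rmse_rf_test = math.sqrt(mse_rf_test)

print(round(rmse_linreg_test, 3), round(rmse_rf_test, 3))
```

The results should match the RMSE lines printed for the linear regression and the default Random Forest.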

# Comparing Training vs. Test Scores: Overfitting Analysis
To assess overfitting, compare the R-squared and MSE scores on the training set (where the model is trained) and the test set (unseen data used for evaluation).

**Significant difference**: If the training set scores are considerably better than the test set scores (e.g., R-squared much higher or MSE much lower on the training data), it suggests the model is overfitting the training data and may not perform well on unseen data. The Random Forest models show this pattern (training R-squared about 0.94 vs. test about 0.57).

**Similar scores**: If the scores on both sets are similar, it indicates the model is generalizing reasonably well and is not memorizing the training data. The linear regression shows this pattern (training R-squared 0.559 vs. test 0.579).

By considering both R-squared and MSE on the training and test sets, you can get a better sense of how well your model is fitting the data and generalizing to unseen sales scenarios.
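
The train-versus-test comparison can be sketched with the R-squared values reported earlier in this notebook. The helper name overfit_flags is hypothetical, and the 0.1 threshold is the same rule of thumb used in the interpretations above rather than a fixed standard.

```python
# R^2 values reported earlier in the notebook: (train, test)
reported_r2 = {
    'Linear Regression': (0.559, 0.579),
    'Random Forest (default)': (0.938, 0.567),
    'Random Forest (200 trees)': (0.939, 0.568),
}

GAP_THRESHOLD = 0.1  # rule of thumb from the interpretations above (assumption)

def overfit_flags(scores, threshold=GAP_THRESHOLD):
    """Return {model: (gap, is_overfit)} where gap = train R^2 - test R^2."""
    return {name: (round(train - test, 3), (train - test) > threshold)
            for name, (train, test) in scores.items()}

flags = overfit_flags(reported_r2)
for name, (gap, overfit) in flags.items():
    print(f"{name}: gap = {gap:+.3f}, overfitting suspected = {overfit}")
```

A gap near zero (or slightly negative, as for the linear regression) points to consistent generalization, while a large positive gap points to overfitting.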