<a href="https://colab.research.google.com/github/kritp144/Sales-Prediction/blob/main/SalesPredictionModel6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SALES PREDICTION PROJECT

KRIT PATEL 

In this project, we will create two models to predict sales. The first model uses linear regression, while the second uses a regression tree. We will then evaluate each model on its R2 score, MSE, and RMSE, and recommend which model should be used.

LINEAR REGRESSION - A linear regression predicts the value of a target (Sales) as a weighted sum of the feature values (the columns used to predict the target).

REGRESSION TREE - A regression tree is a series of yes/no questions about the features that ends in a predicted continuous value. For this data set, the tree learns which splits of the feature columns best separate high and low values of the target column.
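To make the "series of questions" idea concrete, here is a minimal sketch on made-up data (the values in `X_toy` and `y_toy` are hypothetical and not taken from the sales data). A tree limited to depth 1 asks a single question and predicts the mean target on each side of it.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: one feature, two clearly separated groups of targets.
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

# A depth-1 tree asks exactly one question about the feature.
stump = DecisionTreeRegressor(max_depth=1, random_state=42)
stump.fit(X_toy, y_toy)

# Print the learned question and the value predicted on each side of it.
print(export_text(stump, feature_names=["x"]))

# Each prediction is the mean target of the training rows in the same leaf.
toy_preds = stump.predict([[2], [11]])
print(toy_preds)
```

Deeper trees simply chain more questions, which is why the depth of the tree controls how closely it can follow the training data.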

R2 SCORE - The R2 score represents the proportion of the variance in our target that is explained by our features.

MEAN SQUARED ERROR (MSE) - The mean squared error is the average of the squared errors. Because every error is squared, larger errors are exaggerated, which shows us the impact of big errors.

ROOT MEAN SQUARED ERROR (RMSE) - The root mean squared error is the square root of the MSE. It is expressed in the same units as the target, and it still weights larger errors more heavily.
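The three metrics can be worked through by hand on a tiny example. The numbers in `actual` and `predicted` below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical actual and predicted values, chosen only for illustration.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 360.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]  # -10, 10, -20, 40

# MSE: average of the squared errors. The single error of 40 contributes
# 1600 of the total 2200, so it dominates the result.
mse = sum(e ** 2 for e in errors) / n

# RMSE: square root of the MSE, back in the units of the target.
rmse = math.sqrt(mse)

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

Here the MSE is 550, the RMSE is about 23.45, and the R2 score is 0.956. A model that always predicted the mean of `actual` would score an R2 of 0.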

Below we import all the libraries required for this project.

In [66]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn import set_config

# This shows the pipeline diagram
set_config(display= 'diagram') 

Below we read the CSV file at the given path into a DataFrame and view its first five rows.

In [67]:
# Define the file path
sales_path = '/content/drive/MyDrive/DATA SCIENCE/CODES/project 1/sales_predictions.csv'

# Read the data into a DataFrame
sales_df = pd.read_csv(sales_path)
sales_df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


We can now identify each feature as numeric, categorical, or ordinal.

# IDENTIFY EACH FEATURE

NUMERICAL FEATURES: Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year


ORDINAL FEATURES: Outlet_Size, Outlet_Location_Type. 


NOMINAL FEATURES: Item_Type, Outlet_Type, Item_Fat_Content

# ORDINAL ENCODING

Since Outlet_Size and Outlet_Location_Type store string values that describe an order, we can assign them integer values that describe the same order. This removes the need to one-hot encode these columns.

Check value_counts() to determine which categories are present in Outlet_Size.

In [68]:
sales_df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

Assign numeric values to each category present in Outlet Size

In [69]:
#Define a dictionary to ordinal encode Outlet Size
outlet_size = {'High': 2,
               'Medium': 1,
               'Small': 0}
sales_df['Outlet_Size'] = sales_df['Outlet_Size'].replace(outlet_size)

sales_df['Outlet_Size'].value_counts() 

1.0    2793
0.0    2388
2.0     932
Name: Outlet_Size, dtype: int64
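The encoded values in the output above appear as 1.0, 0.0 and 2.0 rather than as integers, and the counts add up to fewer rows than the Outlet_Location_Type counts further down. Both follow from missing values in Outlet_Size, which stay missing after encoding. The sketch below illustrates this on a hypothetical four-row column. It uses `Series.map` instead of the `replace` call above, which is an assumption made for the illustration.

```python
import pandas as pd

# Hypothetical mini-column with one missing entry, mimicking Outlet_Size.
sizes = pd.Series(['Medium', None, 'High', 'Small'], dtype='object')

size_order = {'Small': 0, 'Medium': 1, 'High': 2}

# map() looks each value up in the dictionary; missing values stay missing.
encoded = sizes.map(size_order)

print(encoded.tolist())      # the missing entry is still NaN
print(encoded.dtype)         # float64, because NaN is a float value
print(encoded.isna().sum())  # number of values left for an imputer to fill
```

The remaining NaN values are what the numeric imputer in the pipeline fills in later.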

Check the value_counts() to determine how many categories are present in Outlet Location Type

In [70]:
sales_df['Outlet_Location_Type'].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

Assign numeric values to each category present in Outlet_Location_Type

In [71]:
#Define a dictionary to ordinal encode Outlet Tier
outlet_tier = {'Tier 3': 2,
               'Tier 2': 1,
               'Tier 1': 0}
sales_df['Outlet_Location_Type'] = sales_df['Outlet_Location_Type'].replace(outlet_tier)

sales_df['Outlet_Location_Type'].value_counts() 

2    3350
1    2785
0    2388
Name: Outlet_Location_Type, dtype: int64

# CLEANING

Some cleaning can be done on the full data set without causing data leakage, while missing values are imputed later through a pipeline. Below we check the nominal features to see if their categories need any cleaning.

In [72]:
sales_df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

We notice that low fat is represented in three different ways (Low Fat, LF, low fat) and regular in two (Regular, reg). We can clean this so that only two categories remain: Low Fat and Regular.

In [73]:
# Define a dictionary to replace Item_Fat_Content
item_fat_content = {'Low Fat' : 'Low Fat',
                    'Regular' : 'Regular',
                    'LF' : 'Low Fat',
                    'reg' : 'Regular',
                    'low fat' : 'Low Fat'}

sales_df['Item_Fat_Content'] = sales_df['Item_Fat_Content'].replace(item_fat_content)

sales_df['Item_Fat_Content'].value_counts() 





Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

We can now split our data into features (X) and the target (y). We also drop the identifier columns, since they identify individual items and outlets and may have a very large number of distinct values.

In [74]:
X = sales_df.drop(columns= ['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'])
y = sales_df['Item_Outlet_Sales']   

We can now perform a train/test split, which divides the features and the target into a training set (X_train, y_train) and a testing set (X_test, y_test).

In [75]:
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=42)  
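The call above does not pass `test_size`, so scikit-learn falls back to its default of 25% of the rows for testing. A small sketch on hypothetical stand-in data (100 rows, not the sales data) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# No test_size is given, so scikit-learn uses its default of 0.25.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))
```

With 100 rows this gives 75 training rows and 25 testing rows. Fixing `random_state` makes the same rows land in each set on every run.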

Below we create two column selectors.

The cat_selector selects all columns with dtype object.

The num_selector selects all columns with a numeric dtype (int or float).


In [76]:
# Instantiating Column Selectors

#Categorical Column Selector
cat_selector = make_column_selector(dtype_include= 'object')

# Numerical Column Selector
num_selector = make_column_selector(dtype_include= 'number')
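A column selector is a callable that takes a DataFrame and returns the matching column names. The sketch below runs both selectors on a hypothetical two-row frame (`demo_df` is made up for illustration). Because Outlet_Size and Outlet_Location_Type were ordinal encoded earlier, they now count as numeric columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical mini-frame with one text column and two numeric columns.
demo_df = pd.DataFrame({
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype='object'),
    'Item_MRP': [249.8, 141.6],
    'Outlet_Size': [1.0, 2.0],
})

# Calling a selector on a DataFrame returns a list of column names.
cat_cols = make_column_selector(dtype_include='object')(demo_df)
num_cols = make_column_selector(dtype_include='number')(demo_df)

print(cat_cols)
print(num_cols)
```

In this sketch the encoded Outlet_Size column is picked up by the numeric selector, so in a preprocessor like the one built below it would be mean-imputed and scaled rather than one-hot encoded.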

Below we create the transformers: two imputers that fill in missing values, a scaler, and a one-hot encoder.

The freq_imputer fills missing values with the most frequently occurring value in that column. We chose this because it suits categorical data.

The mean_imputer fills missing values with the mean of that column. We chose this because it suits numerical data.

In [77]:
# Instantiating Transformers

# Frequency Imputers
freq_imputer = SimpleImputer(strategy= 'most_frequent')

# Mean Imputer
mean_imputer = SimpleImputer(strategy= 'mean')

# Scaler
scaler= StandardScaler()

# One Hot Encoder (dense output; scikit-learn versions before 1.2 use sparse=False instead)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
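The `handle_unknown='ignore'` option matters when the testing data contains a category that never appeared in the training data. The sketch below uses hypothetical category values to show what happens in that case. It keeps the encoder's default sparse output and converts it with `toarray()`, so it does not depend on the name of the dense-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test categories; 'Seafood' never appears in training.
train_cats = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
test_cats = np.array([['Meat'], ['Seafood']], dtype=object)

demo_ohe = OneHotEncoder(handle_unknown='ignore')
demo_ohe.fit(train_cats)

# toarray() converts the default sparse result into a dense NumPy array.
encoded_test = demo_ohe.transform(test_cats).toarray()

print(demo_ohe.categories_)
print(encoded_test)
```

The unseen category is encoded as a row of zeros instead of raising an error, so the model simply receives no signal from that column for that row.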

We can now create a pipeline that imputes missing values with the column mean and then scales the data.

In [78]:
#Instantiating Pipelines

# Numerical Pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe 

We can now create a pipeline that imputes missing values with the most frequently occurring value in each column and then one-hot encodes the data.

In [79]:
# Categorical Pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

The numeric_pipe and num_selector are stored in one tuple, while the categorical_pipe and cat_selector are stored in another. Both tuples are passed to a column transformer.

In [80]:
# Instantiating Column Transformer

# Tuples for Column Transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# Column Transformer
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor

We can now fit the preprocessor on the training set.

In [81]:
# Fit the preprocessor on the training data
preprocessor.fit(X_train, y_train)  

We can now use the fitted preprocessor to impute, scale, and one-hot encode the values in X_train and X_test.

In [82]:
# Transforming Training and Testing Data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test) 
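Fitting on the training set only and then transforming both sets is what keeps information from the testing data out of the preprocessing. The sketch below illustrates this with a hypothetical single-column example: the mean used to fill the missing test value comes from the training rows alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric feature with one missing value in each split.
train_col = np.array([[10.0], [20.0], [np.nan], [30.0]])
test_col = np.array([[np.nan], [100.0]])

demo_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
demo_pipe.fit(train_col)  # statistics are learned from the training rows only

# The imputer fills the missing test value with the training mean (20.0),
# and the scaler then centres it using the training mean as well.
learned_mean = demo_pipe.named_steps['simpleimputer'].statistics_[0]
test_scaled = demo_pipe.transform(test_col)

print(learned_mean)
print(test_scaled.ravel())
```

The test value of 100 does not influence the learned mean, which is the behaviour wanted from a leakage-free preprocessor.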

# LINEAR REGRESSION

Below we create an unfitted linear regression model.

In [83]:
# Creating an empty Linear Regression Model

lin_reg = LinearRegression()


We can now fit the linear regression model on the training set.

In [84]:
# Fit the linear Regression Model with the Processed X_train data and y_train data

lin_reg.fit(X_train_processed, y_train)

We can now generate predictions with the fitted linear regression model.

In [85]:
# Generate predictions for the training and testing sets

train_pred_lin = lin_reg.predict(X_train_processed)
test_pred_lin = lin_reg.predict(X_test_processed)

# EVALUATION OF LINEAR REGRESSION

R2 SCORE

In [86]:
# Obtain the R2 scores 

lin_train_r2 = r2_score(y_train, train_pred_lin)
lin_test_r2 =  r2_score(y_test, test_pred_lin)

print('Linear Regression Training R2 Score:', lin_train_r2)
print('Linear Regression Testing R2 Score:', lin_test_r2)

Linear Regression Training R2 Score: 0.5605641946335693
Linear Regression Testing R2 Score: 0.5658139226347666


The R2 scores are about 0.56 on both the training and testing data, so the model explains only a little over half of the variance in sales. Low scores on both sets indicate that the model is underfit, which means high bias. The model also performs slightly better on the testing data than on the training data.

MSE

In [87]:
# Obtain the Mean Square Error

lin_train_mse = mean_squared_error(y_train, train_pred_lin)
lin_test_mse = mean_squared_error(y_test, test_pred_lin)

print('Linear Regression Training Mean Square Error:', lin_train_mse)
print('Linear Regression Testing Mean Square Error:', lin_test_mse)

Linear Regression Training Mean Square Error: 1300490.8009649059
Linear Regression Testing Mean Square Error: 1197909.526132542


The MSE is expressed in squared units of the target, so a value above one million does not mean errors of that size. It does indicate that some predicted sales are far from the actual sales, because squaring makes the large errors dominate the average.

RMSE

In [88]:
# Obtain the Root Mean Square Error
lin_train_rmse = np.sqrt(lin_train_mse)
lin_test_rmse = np.sqrt(lin_test_mse)


print('Linear Regression Training Root Mean Squared Error:', lin_train_rmse)
print('Linear Regression Testing Root Mean Squared Error:', lin_test_rmse)

Linear Regression Training Root Mean Squared Error: 1140.3906352495649
Linear Regression Testing Root Mean Squared Error: 1094.49053268292


The RMSE gives the typical size of the gap between predicted and actual sales, with larger errors weighted more heavily. On the testing data the discrepancy is about $1,094.
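To see how the RMSE weights larger errors, it can be compared with the mean absolute error (MAE). MAE is not used elsewhere in this project and is brought in here only as a reference point. The two error lists below are hypothetical.

```python
import math

# Two hypothetical sets of absolute errors with the same total (40).
even_errors = [10.0, 10.0, 10.0, 10.0]
uneven_errors = [0.0, 0.0, 0.0, 40.0]

def mae(errors):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: squaring weights large errors more."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae(even_errors), rmse(even_errors))
print(mae(uneven_errors), rmse(uneven_errors))
```

Both lists have an MAE of 10, but the RMSE rises from 10 to 20 when the same total error is concentrated in one prediction.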

# REGRESSION TREE

Below we create a loop to determine the depth at which our decision tree model performs best. It tries every max_depth from 1 through 37, stores the training and testing R2 scores in a DataFrame, and then picks the depth with the highest testing score.

In [89]:
# CREATE A RANGE OF max_depth values
depths = range(1,38)

# CREATE A DATAFRAME TO STORE TRAIN AND TEST SCORES
scores = pd.DataFrame(columns= ['Train', 'Test'], index = depths)

#  LOOP OVER THE VALUES IN THE DEPTH RANGE
for depth in depths:

  # INSTANTIATE A NEW MODEL WITH THE CURRENT DEPTH
  dec = DecisionTreeRegressor(max_depth=depth, random_state=42)

  # FIT THE MODEL
  dec.fit(X_train_processed, y_train)

  # CREATE A PREDICTION ARRAY
  train_pred = dec.predict(X_train_processed)
  test_pred = dec.predict(X_test_processed)

  # EVALUATE THE MODEL USING R2 SCORE
  train_r2 = r2_score(y_train, train_pred)
  test_r2 = r2_score(y_test, test_pred)

  # STORE THE SCORES IN A DATAFRAME
  scores.loc[depth, 'Train'] = train_r2
  scores.loc[depth, 'Test'] = test_r2
  
# SORT BY TEST SCORE (DESCENDING) AND KEEP THE BEST DEPTH
best_depth = scores.sort_values(by= 'Test', ascending= False).index[0]
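The loop above selects the depth by its score on the testing set, which means the testing score reported later is slightly optimistic. One common alternative, sketched here under assumptions, is to pick the depth with cross-validation on the training data only. The synthetic data, the depth range of 1 to 10, and the five folds are all hypothetical choices for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for X_train_processed and y_train.
rng = np.random.default_rng(42)
X_demo_train = rng.uniform(0, 10, size=(200, 3))
y_demo_train = 3 * X_demo_train[:, 0] + rng.normal(0, 1, size=200)

cv_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    # Mean R2 over 5 folds of the training data; no test set is involved.
    fold_scores = cross_val_score(tree, X_demo_train, y_demo_train,
                                  cv=5, scoring='r2')
    cv_scores[depth] = fold_scores.mean()

# Keep the depth with the highest mean cross-validated R2.
best_cv_depth = max(cv_scores, key=cv_scores.get)
print(best_cv_depth, cv_scores[best_cv_depth])
```

With this approach the testing set is used once, at the end, to report how the chosen model performs on unseen data.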

The depth with the best testing R2 score is passed as max_depth to a new, unfitted decision tree.

In [90]:
# Creating an empty Decision Tree Model 
 
dec_tree = DecisionTreeRegressor(random_state= 42, max_depth= best_depth)

The decision tree is then fit on the training set.

In [91]:
# Fit the Decision Tree on the training set

dec_tree.fit(X_train_processed, y_train) 

Predictions are generated with the fitted decision tree.

In [92]:
# Generate predictions for the training and testing sets

train_pred_dec = dec_tree.predict(X_train_processed) 
test_pred_dec = dec_tree.predict(X_test_processed)

# EVALUATION OF REGRESSION TREE

R2 SCORE

In [93]:
# Obtain the R2 Score


dec_train_r2 = r2_score(y_train, train_pred_dec)
dec_test_r2 =  r2_score(y_test, test_pred_dec)

print('Regression Tree Training R2 Score:', dec_train_r2)
print('Regression Tree Testing R2 Score:', dec_test_r2)

Regression Tree Training R2 Score: 0.6039397477322956
Regression Tree Testing R2 Score: 0.5947099753159972


The R2 score shows a small improvement over the linear regression model (about 0.59 versus 0.57 on the testing data). The score is still relatively low, so this model also has high bias and underfits the data set.

MSE

In [94]:
# Obtain the Mean Squared Error


dec_train_mse = mean_squared_error(y_train, train_pred_dec)
dec_test_mse = mean_squared_error(y_test, test_pred_dec)

print('Regression Tree Training Mean Square Error:', dec_train_mse)
print('Regression Tree Testing Mean Square Error:', dec_test_mse)

Regression Tree Training Mean Square Error: 1172122.7729098853
Regression Tree Testing Mean Square Error: 1118185.973077762


The MSE is lower than for the linear regression model on both the training and testing data, which indicates that the larger errors have been slightly reduced.

RMSE

In [95]:
# Obtain the Root Mean Squared Error


dec_train_rmse = np.sqrt(dec_train_mse)
dec_test_rmse = np.sqrt(dec_test_mse)


print('Regression Tree Training Root Mean Squared Error:', dec_train_rmse)
print('Regression Tree Testing Root Mean Squared Error:', dec_test_rmse)

Regression Tree Training Root Mean Squared Error: 1082.6461900869947
Regression Tree Testing Root Mean Squared Error: 1057.4431299496734


The RMSE indicates that the typical discrepancy between predicted and actual sales dropped compared to the linear regression model. On the testing data it is about $1,057, compared to about $1,094 for the linear regression model.

# MODEL CHOICE

Both models underfit the data. This could be because there is not enough data or because the features are only weakly related to the target. To address this, we can either add more data or add new features. For this data set, the underfitting appears to come from the weak relationship between the features and the target, so additional features may be required for both models to fit better.

Out of the two models presented above, I would recommend the Regression Tree Model as it performed slightly better than the Linear Regression Model.
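The size of the difference between the two models can be quantified from the testing metrics printed earlier. The short sketch below copies those reported values into a dictionary. The variable names are chosen here for illustration.

```python
# Test-set metrics copied from the outputs reported above.
results = {
    'Linear Regression': {'r2': 0.5658139226347666, 'rmse': 1094.49053268292},
    'Regression Tree': {'r2': 0.5947099753159972, 'rmse': 1057.4431299496734},
}

lin = results['Linear Regression']
tree = results['Regression Tree']

# Absolute gain in R2 and relative reduction in RMSE for the tree.
r2_gain = tree['r2'] - lin['r2']
rmse_reduction_pct = 100 * (lin['rmse'] - tree['rmse']) / lin['rmse']

print(f'R2 gain: {r2_gain:.3f}')
print(f'RMSE reduction: {rmse_reduction_pct:.1f}%')
```

The regression tree gains about 0.029 in R2 and lowers the RMSE by about 3.4% on the testing data, which supports describing the improvement as slight.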