
# Sales prediction project, Part 5

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer,make_column_transformer,make_column_selector
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer

In [None]:
path = "/content/drive/MyDrive/AXSOSACADEMY/02-MachineLearning/Week06/sales_predictions_2023.csv"
df=pd.read_csv(path)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### Cleaning Data


In [None]:
# Checking the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
# Checking Duplicates
df.duplicated().sum()

0

In [None]:
# Checking Null Values
df.isna().sum()

Unnamed: 0,0
Item_Identifier,0
Item_Weight,1463
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,2410
Outlet_Location_Type,0


These missing values will be imputed later, after the train-test split, so that the imputation statistics are learned from the training data only.
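Imputing after the split matters because a median computed on the full dataset would carry information from the test rows into training. The sketch below illustrates the idea on a made-up column; the `weights` array and its values are assumptions for illustration, not data from this project.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy column with missing values (hypothetical numbers, not project data)
weights = np.array([[9.3], [np.nan], [17.5], [19.2], [np.nan], [8.9], [12.0], [6.1]])

train, test = train_test_split(weights, train_size=0.75, random_state=42)

# Learn the median from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)

# Reuse the training median on the test rows; the imputer is never refit on test data
test_filled = imputer.transform(test)

print("median learned from train:", imputer.statistics_[0])
```

Placing the imputer inside a pipeline, as done later in this notebook, gives the same guarantee automatically because the pipeline is only fit on `X_train`.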

In [None]:
# Inspecting the value counts of each categorical column
# to spot inconsistent labels and high-cardinality features

cat_cols = df.select_dtypes("object").columns

for col in cat_cols:
    print(df[col].value_counts())
    print("\n")

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64


Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: count, dtype: int64


Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: count, dtype: int64


Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: cou

In [None]:
# Dropping Outlet_Identifier column
df.drop(columns="Outlet_Identifier",inplace=True)

In [None]:
# Replacing inconsistent categories in Item_Fat_Content for encoding later
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'})
df['Item_Fat_Content'].value_counts()

Item_Fat_Content,count
Low Fat,5517
Regular,3006


### Defining X and y

In [None]:
# Defining X and y
# The target is Item_Outlet_Sales

X = df.drop(columns="Item_Outlet_Sales")
y = df["Item_Outlet_Sales"]
X.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,1999,Medium,Tier 1,Supermarket Type1
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,2009,Medium,Tier 3,Supermarket Type2
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,1999,Medium,Tier 1,Supermarket Type1
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,1998,,Tier 3,Grocery Store
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,1987,High,Tier 3,Supermarket Type1


### Train Test Split

In [None]:
# Performing a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, train_size = 0.7)

In [None]:
# Creating a list for numeric features
num_cols = make_column_selector(dtype_include='number')
num_cols(X_train)


['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

In [None]:
# Pipeline for numeric features
num_imputer = SimpleImputer(strategy='median')
# Creating a scaler to scale the features
scaler = StandardScaler()
# Creating Pipeline
num_pipe = make_pipeline(num_imputer, scaler)
num_tuple = ('num',num_pipe, num_cols)

In [None]:
# Creating a list for ordinal data
ord_col=['Outlet_Location_Type','Outlet_Size','Item_Fat_Content']

In [None]:
# Creating an ordinal pipeline
ord_col = ['Outlet_Location_Type', 'Outlet_Size', 'Item_Fat_Content']
ord_impute = SimpleImputer(strategy='most_frequent')
# Category orders, lowest to highest, listed in the same order as ord_col
qual_ordinal_location = ('Tier 1', 'Tier 2', 'Tier 3')
qual_ordinal_size = ('Small', 'Medium', 'High')
qual_ordinal_fat = ('Low Fat', 'Regular')
ordinal_category_orders = [qual_ordinal_location, qual_ordinal_size, qual_ordinal_fat]
ord_scaler = StandardScaler()
ord_encoder = OrdinalEncoder(categories=ordinal_category_orders)
ord_pipe = make_pipeline(ord_impute, ord_encoder, ord_scaler)
ord_tuple = ('ord', ord_pipe, ord_col)
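To make the effect of the explicit category order concrete, here is a small sketch of what `OrdinalEncoder` returns for a single column. The `sizes` array is made up for illustration; only the category order mirrors the one used for `Outlet_Size` above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order, lowest to highest; one list per encoded column
size_order = ["Small", "Medium", "High"]
encoder = OrdinalEncoder(categories=[size_order])

# Hypothetical values for a single Outlet_Size-like column
sizes = np.array([["Medium"], ["High"], ["Small"], ["Medium"]], dtype=object)
codes = encoder.fit_transform(sizes)

print(codes.ravel())  # [1. 2. 0. 1.]
```

Without the `categories` argument the encoder would sort the labels alphabetically (High, Medium, Small), which would not reflect the intended ranking.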

In [None]:
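# Creating a list for categorical features
# Note: this selects every object column, so it also includes the ordinal
# columns in ord_col and the high-cardinality Item_Identifier
cat_cols = X_train.select_dtypes('object').columns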

In [None]:

# Creating categorical preprocessing objects
impute_cat = SimpleImputer(strategy='constant', fill_value='NA')
encoder = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
# Creating a pipeline for categorical features
cat_pipe = make_pipeline(impute_cat,encoder)
cat_tuple = ('cat',cat_pipe, cat_cols)
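Because `X_train.select_dtypes('object')` returns every text column, the list passed to the one-hot pipeline also contains the three ordinal columns and `Item_Identifier`. The sketch below shows one possible way to restrict the one-hot encoder to nominal columns only. It is an illustration on a made-up frame, not the notebook's actual method: `X_toy`, `high_cardinality`, and `nominal_cols` are hypothetical names, and treating `Item_Identifier` as an ID rather than a feature is an assumption.

```python
import pandas as pd

# Toy frame with the same column roles as X_train (values are made up)
X_toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Type": ["Dairy", "Soft Drinks"],
    "Outlet_Type": ["Supermarket Type1", "Grocery Store"],
    "Outlet_Size": ["Medium", "High"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3"],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Item_MRP": [249.8, 48.3],
})

ord_col = ["Outlet_Location_Type", "Outlet_Size", "Item_Fat_Content"]
high_cardinality = ["Item_Identifier"]  # assumption: treated as an ID, not a feature

# All non-numeric columns, minus those handled elsewhere or excluded
text_cols = X_toy.select_dtypes(exclude="number").columns
nominal_cols = [c for c in text_cols if c not in ord_col + high_cardinality]

print(nominal_cols)  # ['Item_Type', 'Outlet_Type']
```

With a list like this, each column would pass through exactly one transformer. Columns left out of every transformer are dropped by `ColumnTransformer` under its default `remainder='drop'`.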

In [None]:
# Combining the three pipelines in a single column transformer
preprocessor  = ColumnTransformer([num_tuple, cat_tuple, ord_tuple],
                                  verbose_feature_names_out=False)
preprocessor

In [None]:
# Fitting on train data
preprocessor.fit(X_train)

# Part 6

## CRISP-DM Phase 4 - Modeling

### Linear Regression Model

In [None]:
# Build a linear regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
# Combine the preprocessor and the model in one pipeline, then fit on the training data
linreg_pipe = make_pipeline(preprocessor, lin_reg)
linreg_pipe.fit(X_train, y_train)

In [None]:
# Get predictions
y_hat_train = linreg_pipe.predict(X_train)
y_hat_test = linreg_pipe.predict(X_test)

In [None]:
# Custom evaluation function
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def evaluate_model(y_true, y_pred, split='training'):
  """Prints R^2, MAE, MSE, and RMSE, including which data split was evaluated.

  Args:
    y_true: y_train or y_test
    y_pred: result of model.predict(X)
    split: which data split is being evaluated ['training','testing']
  """

  r2 = r2_score(y_true, y_pred)
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # Take the square root directly; the squared=False argument is
  # deprecated in recent scikit-learn releases
  rmse = np.sqrt(mse)


  print(f'Results for {split} data:')
  print(f"  - R^2 = {round(r2,3)}")
  print(f"  - MAE = {round(mae,3)}")
  print(f"  - MSE = {round(mse,3)}")
  print(f"  - RMSE = {round(rmse,3)}")

In [None]:
## Evaluate model's performance
evaluate_model(y_train, y_hat_train,split='training')
evaluate_model(y_test, y_hat_test,split='testing')

Results for training data:
  - R^2 = 0.675
  - MAE = 730.773
  - MSE = 961298.775
  - RMSE = 980.458
Results for testing data:
  - R^2 = -2.2137526981458333e+20
  - MAE = 2388885071705.636
  - MSE = 6.200706978685387e+26
  - RMSE = 24901218802872.656




Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?


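*   The model does not generalize at all. The training R² of 0.675 is moderate, but the testing R² of about -2.2e20 (RMSE of about 2.5e13) means the predictions on unseen data are numerically meaningless. A failure this extreme points to unstable coefficients rather than ordinary overfitting. A likely cause is the feature matrix: the one-hot encoder also receives the high-cardinality `Item_Identifier` column, and the ordinal columns are encoded twice, which makes the features highly collinear.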
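The testing R² reported above is far below zero, which is possible because R² has no lower bound: it compares the model's squared error with that of always predicting the mean. The worked example below uses made-up numbers (not project data) to show how a single exploding prediction drives R² strongly negative.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0]       # made-up sales values
good_pred = [110.0, 190.0, 310.0]    # close to the truth
mean_pred = [200.0, 200.0, 200.0]    # always predict the mean
wild_pred = [100.0, 200.0, 30000.0]  # one exploding prediction

print(round(r2(y_true, good_pred), 3))  # 0.985
print(r2(y_true, mean_pred))            # 0.0
print(r2(y_true, wild_pred))            # -44103.5
```

A score of 0 means the model is no better than the mean, and anything below 0 means it is worse.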




## Random Forest model

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Pipeline and fitting the random forest
rf_tree_pipe = make_pipeline(preprocessor,RandomForestRegressor(random_state = 42))
rf_tree_pipe.fit(X_train, y_train)

# Getting predictions for training and test data
y_hat_train = rf_tree_pipe.predict(X_train)
y_hat_test = rf_tree_pipe.predict(X_test)

In [None]:
# Evaluate Performance
evaluate_model(y_train, y_hat_train,split='training')
evaluate_model(y_test, y_hat_test,split='testing')

Results for training data:
  - R^2 = 0.936
  - MAE = 301.239
  - MSE = 188081.314
  - RMSE = 433.683
Results for testing data:
  - R^2 = 0.546
  - MAE = 782.768
  - MSE = 1272843.53
  - RMSE = 1128.204




Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?


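*   The model fits the training data very well (R² = 0.936) but performs noticeably worse on the testing data (R² = 0.546). A gap this large indicates overfitting (high variance).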

Compare this model's performance to the linear regression model: which model has the best test scores?

*   Even though the random forest overfits, its test scores (R² = 0.546, RMSE = 1128.204) are far better than those of the linear regression model.



## GridSearch CV

In [None]:
# Use GridSearchCV to tune at least two hyperparameters; first list the tunable parameters
rf_tree_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('num',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7dbd5304ee30>),
                                   ('cat',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(fill_value='NA',
                                                                   strategy='constant')),
                                                    ('onehotencoder',
                                                     OneHotEncoder(h...
         dtype='object')),
                          

In [None]:
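# Defining the hyperparameter grid
# Note: oob_score only adds an out-of-bag estimate and does not change the fitted trees,
# so including it doubles the number of fits without affecting the scores
params = {'randomforestregressor__max_depth': [None,10,15,20],
          'randomforestregressor__n_estimators':[10,100,150,200],
          'randomforestregressor__min_samples_leaf':[2,3,4],
          'randomforestregressor__max_features':['sqrt','log2',None],
          'randomforestregressor__oob_score':[True,False],
          }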

In [None]:
# Importing the GridSearchCV class from sklearn.model_selection
from sklearn.model_selection import GridSearchCV

# Instantiating the grid search (cv defaults to 5 folds)
gridsearch = GridSearchCV(rf_tree_pipe, params, n_jobs=-1, verbose=1)
gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
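The figures in this log follow directly from the grid: the number of candidates is the product of the option counts per hyperparameter, and each candidate is fit once per fold. A short check, reusing the grid from above:

```python
from math import prod

# Same grid as above; only the number of options per key matters here
params = {
    "randomforestregressor__max_depth": [None, 10, 15, 20],
    "randomforestregressor__n_estimators": [10, 100, 150, 200],
    "randomforestregressor__min_samples_leaf": [2, 3, 4],
    "randomforestregressor__max_features": ["sqrt", "log2", None],
    "randomforestregressor__oob_score": [True, False],
}

n_candidates = prod(len(values) for values in params.values())
n_folds = 5  # GridSearchCV's default number of folds

print(n_candidates, n_candidates * n_folds)  # 288 1440
```

Every extra option multiplies the total, so removing a hyperparameter that does not affect predictions (such as `oob_score`) would halve the search to 144 candidates and 720 fits.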


In [None]:
# Obtain best parameters
gridsearch.best_params_

In [None]:
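# Retrieve the best model (GridSearchCV has already refit it on the full training set)
best_rf = gridsearch.best_estimator_
evaluate_model(y_train, best_rf.predict(X_train), split='training')
evaluate_model(y_test, best_rf.predict(X_test), split='testing')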

## CRISP-DM Phase 5 - Evaluation



## Recommended Model

- Overall, which model do you recommend?

The random forest model shows a strong training R² of 0.936, meaning it captures most of the variance in the training data.
Although its testing R² is lower at 0.546, it still far outperforms the linear regression model, whose testing R² is hugely negative. This suggests the random forest model has better predictive capability on new data.

## Explanation for Stakeholders

- Understanding R-squared

The testing R² of 0.546 for the random forest model indicates that the model explains approximately 54.6% of the variability in `Item_Outlet_Sales` on unseen data. The model is therefore moderately effective, with room for further improvement.

- Selected metric: RMSE

Reason for choosing RMSE: the root mean squared error (RMSE) expresses the model's prediction errors in the same units as the target variable. The random forest model has an RMSE of 1128.204 on the testing set, so a typical prediction misses the actual sales value by roughly 1128 units. Because errors are squared before averaging, large misses weigh more heavily in the RMSE than in the MAE (782.768 on the testing set).
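The difference between RMSE and MAE can be seen on a tiny example. The error values below are made up for illustration: three small misses and one large one.

```python
import math

errors = [100.0, 100.0, 100.0, 900.0]  # made-up absolute errors

mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae)             # 300.0
print(round(rmse, 1))  # 458.3
```

The single large miss pulls the RMSE well above the MAE. The same pattern appears in the random forest's test results (RMSE 1128.204 versus MAE 782.768), which suggests that some predictions are off by considerably more than the typical error.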

## Overfitting vs. Underfitting

- Comparison of Training and Testing Performance

 Random forest training R²: 0.936

 Random forest testing R²: 0.546

The noticeable gap between the training and testing R² values suggests that the random forest model is overfitting the training data: it performs very well on data it has seen but considerably worse on new, unseen data. Even so, it performs far better overall than the linear regression model, which fails to generalize at all.
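One common way to narrow such a gap is to limit how closely each tree can fit the training data. The sketch below compares an unconstrained forest with a constrained one on synthetic data generated by `make_regression`. It does not use the sales data, and the specific settings (`min_samples_leaf=10`, `max_depth=8`) are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy data (not the sales data)
X, y = make_regression(n_samples=400, n_features=8, noise=40.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=42)

def r2_gap(model):
    """Fit the model and return (train R^2, test R^2, gap between them)."""
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    return train_r2, test_r2, train_r2 - test_r2

unconstrained = RandomForestRegressor(random_state=42)
constrained = RandomForestRegressor(min_samples_leaf=10, max_depth=8, random_state=42)

for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    train_r2, test_r2, gap = r2_gap(model)
    print(f"{name}: train={train_r2:.3f} test={test_r2:.3f} gap={gap:.3f}")
```

Constraining the trees lowers the training score by design. Whether the test score improves depends on the data, which is why such settings are best chosen with cross-validation.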

## Conclusion

In conclusion, the random forest model is recommended for implementation. While it shows signs of overfitting, its performance is far stronger than that of the linear regression model. With further tuning and feature adjustments, its predictions on unseen data can likely be improved.