<a href="https://colab.research.google.com/github/hansolothe3rd/Food-Sales-Predictions/blob/main/Project_1_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - Final
- Daniel Barella
- 11/15/22

# Assignment:

  This week, you will finalize your sales prediction project. The goal is to help the retailer understand which properties of products and outlets play crucial roles in predicting sales.

1. Your first task is to build a linear regression model to predict sales.

- Build a linear regression model.
- Evaluate the performance of your model based on r^2.
- Evaluate the performance of your model based on rmse.

2. Your second task is to build a regression tree model to predict sales.

- Build a simple regression tree model.
- Compare the performance of your model based on r^2.
- Compare the performance of your model based on rmse.

3. You now have tried 2 different models on your data set. You need to determine which model to implement.

- Overall, which model do you recommend?
- Justify your recommendation.

4. To finalize this project, complete a README in your GitHub repository including:

- An overview of the project
- 2 relevant insights from the data (supported with reporting quality visualizations)
- Summary of the model and its evaluation metrics
- Final recommendations 

Please note:

- Do not include detailed technical processes or code snippets in your README. If readers want to know more technical details they should be able to easily find your notebook to learn more.
- Make sure your GitHub repository is organized and professional. Remember, this should be used to showcase your data science skills and abilities.

In [144]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
set_config(display='diagram')

df = pd.read_csv('/content/sales_predictions (1).csv')
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [145]:
def eval_regression(true, pred):
  """Takes true and predicted values (arrays) and prints MAE, MSE, RMSE and R2"""
  mae = mean_absolute_error(true, pred)
  mse = mean_squared_error(true, pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(true, pred)

  print(f'MAE {mae},\n MSE {mse},\n RMSE: {rmse},\n R^2: {r2} ')

In [146]:
sp_df = df.copy()

In [147]:
sp_df = sp_df.drop(['Item_Identifier','Outlet_Identifier'],axis=1)

In [148]:
sp_df.shape

(8523, 10)

In [149]:
sp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                7060 non-null   float64
 1   Item_Fat_Content           8523 non-null   object 
 2   Item_Visibility            8523 non-null   float64
 3   Item_Type                  8523 non-null   object 
 4   Item_MRP                   8523 non-null   float64
 5   Outlet_Establishment_Year  8523 non-null   int64  
 6   Outlet_Size                6113 non-null   object 
 7   Outlet_Location_Type       8523 non-null   object 
 8   Outlet_Type                8523 non-null   object 
 9   Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(5)
memory usage: 666.0+ KB


In [150]:
sp_df.nunique()  # parentheses are required; without them only the bound method is displayed

In [151]:
sp_df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [152]:
sp_df = sp_df.replace({'Item_Fat_Content':{'LF': 'Low Fat','reg': 'Regular','low fat': 'Low Fat'}})
sp_df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

In [153]:
sp_df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [154]:
sp_df['Outlet_Size'] = sp_df['Outlet_Size'].replace('High', 'Large')
sp_df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
Large      932
Name: Outlet_Size, dtype: int64

In [155]:
sp_df.isna().sum()

Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

## 1. [X] Your first task is to build a linear regression model to predict sales.


In [156]:
y = sp_df['Item_Outlet_Sales']
X = sp_df.drop(columns = 'Item_Outlet_Sales')

In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [158]:
scaler = StandardScaler()
mean_imputer = SimpleImputer(strategy='mean')
missing_imputer = SimpleImputer(strategy='constant', fill_value="Missing")
# Note: in scikit-learn >= 1.2 this argument is named sparse_output (sparse was removed in 1.4)
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

num_pipe = make_pipeline(mean_imputer, scaler)
cat_pipe = make_pipeline(missing_imputer, ohe)

cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

cat_tuple = (cat_pipe, cat_selector)
num_tuple = (num_pipe, num_selector)

preprocessor = make_column_transformer(cat_tuple, num_tuple, remainder='passthrough')
preprocessor

- [X] Build a linear regression model.


In [159]:
linreg = LinearRegression()

linreg_pipe = make_pipeline(preprocessor, linreg)
linreg_pipe

In [160]:
linreg_pipe.fit(X_train, y_train)

training_predictions = linreg_pipe.predict(X_train)
test_predictions = linreg_pipe.predict(X_test)
training_predictions[:10]

array([3842., 2586., 2544., 1510., 1900., -136., 1518., 5570., 4160.,
       2074.])

In [161]:
print('Train')
eval_regression(y_train, training_predictions)
print('\nTest')
eval_regression(y_test, test_predictions)

Train
MAE 847.8499817897373,
 MSE 1300470.9310713324,
 RMSE: 1140.3819233359202,
 R^2: 0.5605709086700534 

Test
MAE 804.9409381511028,
 MSE 1194416.6540692842,
 RMSE: 1092.8937066656044,
 R^2: 0.5670799251055714 


- [X] Evaluate the performance of your model based on r^2.


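- The test R^2 of about 0.57 means the model explains only a little over half of the observed variation in sales. The train (0.56) and test (0.57) scores are nearly identical, so the model is underfitting rather than overfitting and could use a lot of tuning.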


- [X] Evaluate the performance of your model based on rmse.

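- The test RMSE of about 1,093 is high: it is of the same order as many of the sales values themselves (the sample rows above range from roughly 440 to 3,700), so individual predictions can be far off.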
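
To make the two metrics concrete, the sketch below recomputes RMSE and R^2 from their definitions on a handful of made-up values. The numbers and the helper name `rmse_and_r2` are illustrative assumptions, not taken from the dataset. R^2 compares the model's squared error with that of a baseline that always predicts the mean, so a value near 0.57 means the model removes a bit more than half of that baseline error.

```python
import math

def rmse_and_r2(true, pred):
    """Return (RMSE, R^2) computed directly from their definitions."""
    n = len(true)
    mean_true = sum(true) / n
    # Sum of squared residuals of the model
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    # Sum of squared residuals of a baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2

# Hypothetical sales values and predictions, for illustration only
true = [1000.0, 2000.0, 3000.0, 4000.0]
pred = [1500.0, 1500.0, 3500.0, 3500.0]

rmse, r2 = rmse_and_r2(true, pred)
print(f"RMSE: {rmse:.1f}, R^2: {r2:.2f}")
```

RMSE is expressed in the same units as the target, which is why it is read against the scale of the sales values, while R^2 is unitless.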

## 2. [X] Your second task is to build a regression tree model to predict sales.


In [164]:
dec_tree = DecisionTreeRegressor(random_state = 42)
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)


In [166]:
dec_tree_pipe.fit(X_train, y_train)

- [X] Build a simple regression tree model.


In [167]:
train_preds = dec_tree_pipe.predict(X_train)
test_preds = dec_tree_pipe.predict(X_test)

In [168]:
print('Train')
eval_regression(y_train, train_preds)
print('\nTest')
eval_regression(y_test, test_preds)

Train
MAE 1.6007220580327663e-16,
 MSE 3.0330171474830394e-29,
 RMSE: 5.50728349323243e-15,
 R^2: 1.0 

Test
MAE 1044.290584702018,
 MSE 2248752.903184245,
 RMSE: 1499.584243443577,
 R^2: 0.18493243379615865 


- [X] Compare the performance of your model based on r^2.


- The regression tree has overfit: its training R^2 is a perfect 1.0, but its test R^2 falls to about 0.18, well below the linear regression's 0.57.

- [X] Compare the performance of your model based on rmse.

- The test RMSE rose to about 1,500 with this model (up from about 1,093 for the linear regression), even though the training RMSE is effectively zero, so the tree makes larger errors on unseen data.
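
One common way to rein in this kind of overfitting is to cap the tree's depth. The sketch below is an illustration on synthetic data generated with `make_regression`, not on the sales data, and the depth of 5 is an arbitrary assumption rather than a tuned setting. In the notebook, the same idea would apply to the `DecisionTreeRegressor` inside `dec_tree_pipe`.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the sales dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unrestricted tree: grows until every training sample is fit exactly
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Depth-limited tree: max_depth=5 is an arbitrary illustrative choice
shallow_tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("max_depth=5", shallow_tree)]:
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    print(f"{name}: depth={model.get_depth()}, "
          f"train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

The gap between train and test R^2 for each tree indicates how much of the training score comes from memorization. A suitable depth for the sales data would have to be chosen by validation rather than assumed.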

## 3. [X] You now have tried 2 different models on your data set. You need to determine which model to implement.


- [X] Overall, which model do you recommend?


- Between the linear regression model and the regression tree model, I would recommend the linear regression model.

- [X] Justify your recommendation.

- Although the simple regression tree model learned our training data perfectly, it generalizes poorly to new data: on the test set its R^2 is about 0.18 and its RMSE about 1,500, compared with about 0.57 and 1,093 for the linear regression.
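
The comparison behind this recommendation can be laid out as simple arithmetic on the metrics reported above. The sketch below hard-codes rounded train and test scores from the two evaluation outputs; the dictionary layout and variable names are illustrative choices.

```python
# Rounded metrics copied from the evaluation outputs above
results = {
    "linear regression": {"train_r2": 0.5606, "test_r2": 0.5671, "test_rmse": 1092.89},
    "regression tree": {"train_r2": 1.0000, "test_r2": 0.1849, "test_rmse": 1499.58},
}

for name, m in results.items():
    # A large positive gap means the model fits training data far better than unseen data
    gap = m["train_r2"] - m["test_r2"]
    print(f"{name}: train-test R^2 gap = {gap:+.4f}, test RMSE = {m['test_rmse']:.2f}")

# Pick the model with the lowest error on unseen data
best = min(results, key=lambda name: results[name]["test_rmse"])
print("Lowest test RMSE:", best)
```

Selecting on test-set metrics rather than training metrics is what keeps the tree's perfect training score from being mistaken for the better model.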