<a href="https://colab.research.google.com/github/jonmessier/Sales-Predictions/blob/main/Project_1_Part_6_Standalone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Project 1 - Part 6 (Core)

This week, you will finalize your sales prediction project. The goal of this is to help the retailer understand the properties of products and outlets that play crucial roles in predicting sales.

- [ ] Your first task is to build a linear regression model to predict sales.
 - [ ] Build a linear regression model.
 - [ ] Evaluate the performance of your model based on r^2.
 - [ ] Evaluate the performance of your model based on rmse.
- [ ] Your second task is to build a regression tree model to predict sales.
 - [ ] Build a simple regression tree model.
 - [ ] Compare the performance of your model based on r^2.
 - [ ] Compare the performance of your model based on rmse.
- [ ] You now have tried 2 different models on your data set. You need to determine which model to implement.
 - [ ] Overall, which model do you recommend?
 - [ ] Justify your recommendation.
- [ ] To finalize this project, complete a README in your GitHub repository including:
 - [ ] An overview of the project
 - [ ] 2 relevant insights from the data (supported with reporting quality visualizations)
 - [ ] Summary of the model and its evaluation metrics
 - [ ] Final recommendations 

Here is a template you can use for your readme if you would like. You can look at the raw readme file to copy it if you want.

Please note:
- Do not include detailed technical processes or code snippets in your README. If readers want to know more technical details they should be able to easily find your notebook to learn more.
- Make sure your GitHub repository is organized and professional. Remember, this should be used to showcase your data science skills and abilities.

Commit all of your work to GitHub and turn in a link to your GitHub repo with your final project.

#Custom Functions

In [1]:
#Define an inspection function that reports duplicates and NaN values.
#Duplicates are removed in place; NaN counts are printed per column and in total.
def df_inspect(df):
  n_duplicates = df.duplicated().sum()
  if n_duplicates > 0:
    print(f'The total number of duplicates is: {n_duplicates}\n')
    df.drop_duplicates(inplace=True)
    print('All duplicate entries have been removed.\n')
  else:
    print('There are no duplicate entries.\n')
  #NaN values
  print(f'The total number of NaN-values is:{df.isna().sum().sum()}')
  print('The NaN-values are found in the following features:')
  print(df.isna().sum())
  #shape
  print(f'\nThere are {df.shape[0]} rows, and {df.shape[1]} columns.')
  print(f'The rows represent {df.shape[0]} observations, and the columns represent {df.shape[1]-1} features and 1 target variable.\n')
  #df.info() prints its own summary and returns None, so it is not wrapped in print()
  df.info()
  print(f'\nThe column names are:\n {df.columns}')

In [2]:
# Create a function that takes a fitted model plus the train and test splits,
# prints MAE, MSE, RMSE, and R2 for both, and returns them as a dict
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name=''):
  train_predictions = model.predict(X_train)
  test_predictions = model.predict(X_test)
  print(f'{model_name} evaluation: ')
  #MAE
  train_mae = mean_absolute_error(y_train, train_predictions)
  test_mae = mean_absolute_error(y_test, test_predictions)
  print(f'Train MAE = {train_mae}')
  print(f'Test MAE = {test_mae}')
  #MSE
  train_mse = mean_squared_error(y_train, train_predictions)
  test_mse = mean_squared_error(y_test, test_predictions)
  print(f'Train MSE = {train_mse}')
  print(f'Test MSE = {test_mse}')
  #RMSE (square root of the MSE computed above)
  train_rmse = np.sqrt(train_mse)
  test_rmse = np.sqrt(test_mse)
  print(f'Train RMSE = {train_rmse}')
  print(f'Test RMSE = {test_rmse}')
  #R2
  train_r2 = r2_score(y_train, train_predictions)
  test_r2 = r2_score(y_test, test_predictions)
  print(f'Train R2 = {train_r2}')
  print(f'Test R2 = {test_r2}')
  #Each key appears once, and every test key maps to its test metric
  report = {'Model':model_name,'Train_MAE': train_mae,'Train_MSE': train_mse, 'Train_RMSE':train_rmse, 'Train_R2':train_r2,
            'Test_MAE': test_mae, 'Test_MSE': test_mse, 'Test_RMSE':test_rmse, 'Test_R2':test_r2}

  return report

#Data/Class Import

In [3]:
# Imports
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

# Models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Regression Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Set global scikit-learn configuration 
from sklearn import set_config
# Display estimators as a diagram
set_config(display='diagram') # 'text' or 'diagram'

In [4]:
url = 'https://drive.google.com/uc?id=1syH81TVrbBsdymLT_jl2JIf6IjPXtSQw'
df = pd.read_csv(url)

##Data Overview

In [5]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


##Data Inspection/Cleanup

In [6]:
#use custom inspection function to review data
df_inspect(df)

There are no duplicate entries.

The total number of NaN-values is:3873
The NaN-values are found in the following features:
Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

There are 8523 rows, and 12 columns.
The rows represent 8523 observations, and the columns represent 11 features and 1 target variable.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-n

>- No duplicate entries.
- We have NaN values in the `Item_Weight` and `Outlet_Size` features. We will fill these with `SimpleImputer`.
- The target variable is `Item_Outlet_Sales`.
- Feature `Dtypes` look appropriate.
- Column names appear without inconsistencies or errors.

###Inspect Numerical

In [7]:
df.describe(include="number")

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


> No unusual values noted

###Inspect Categorical

In [8]:
categoricals = df.select_dtypes(include='object')

for col in categoricals.columns:
  print(col)
  print(categoricals[col].value_counts(), '\n')

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64 

Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64 

Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64 

Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    55

>- Inconsistency noted in `Item_Fat_Content`: `LF` and `low fat` duplicate `Low Fat`, and `reg` duplicates `Regular`.
- Replace the inconsistent `Item_Fat_Content` values.

In [9]:
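#Map the inconsistent labels to the two canonical values, in the Item_Fat_Content column only
fat_content_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(fat_content_map)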
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

>Cleaned up inconsistencies

###Missing Values
We review the missing data, but do not make changes at this time.  

In [10]:
# Display the count of missing values by column
print(df.isna().sum())

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64


> We have missing data in both numerical (`Item_Weight:float64`) and categorical (`Outlet_Size:object`) column types. These will be imputed with `SimpleImputer`.
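
As a minimal sketch of what imputation does, the toy example below fills a numeric gap with the column mean and a categorical gap with the most frequent label. The arrays `toy_weights` and `toy_sizes` are made-up illustrations, not rows from the sales data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing entry
toy_weights = np.array([[9.3], [np.nan], [17.5]])
toy_sizes = np.array([['Medium'], [np.nan], ['High'], ['Medium']], dtype=object)

# Numeric column: replace NaN with the mean of the observed values
toy_mean_imputer = SimpleImputer(strategy='mean')
weights_filled = toy_mean_imputer.fit_transform(toy_weights)

# Categorical column: replace NaN with the most frequent label
toy_freq_imputer = SimpleImputer(strategy='most_frequent')
sizes_filled = toy_freq_imputer.fit_transform(toy_sizes)

print(weights_filled.ravel())
print(sizes_filled.ravel())
```

The imputer learns its fill value in `fit`, so fitting on the training split only keeps test-set information out of the fill value.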

#Train Test Split

In [11]:
# Define features (X) and target (y)
target = 'Item_Outlet_Sales'
X = df.drop(columns = [target]).copy()
y = df[target].copy()

In [12]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [13]:
X_train.shape

(6392, 11)

In [14]:
X_test.shape

(2131, 11)

#Prepare Data

##Instantiate Imputers

In [15]:
#instantiate the StandardScaler to scale numerical data
scaler = StandardScaler()

#Use a 'mean-value' strategy for missing numeric data
mean_imputer = SimpleImputer(strategy='mean')

#Use a most_frequent strategy for missing ordinal data
freq_imputer = SimpleImputer(strategy='most_frequent')

#For missing values with nominal data, replace with 'UNK'
missing_imputer = SimpleImputer(strategy='constant', fill_value='UNK')

##Encoders
- We will use a One Hot Encoder for `Categorical:Nominal` and an Ordinal Encoder for `Categorical:Ordinal`. Numeric values will be scaled with StandardScaler.
- Our ordinal entries need to be encoded as numbers that capture their relative order.

- `Outlet_Size` - `Small:0`, `Medium:1`, `High:2`
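
The two encodings can be sketched on toy data as follows. The arrays and the `toy_` names are hypothetical; the point is only to show that the ordinal encoder preserves the stated order, while the one-hot encoder with `handle_unknown='ignore'` maps an unseen category to a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal: an explicit category order gives Small:0, Medium:1, High:2
toy_sizes = np.array([['Medium'], ['Small'], ['High'], ['Small']], dtype=object)
toy_ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'High']])
sizes_encoded = toy_ordinal.fit_transform(toy_sizes)
print(sizes_encoded.ravel())

# Nominal: one column per category seen during fit
toy_types_train = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
toy_types_new = np.array([['Meat'], ['Seafood']], dtype=object)  # 'Seafood' was not seen in fit
toy_ohe = OneHotEncoder(handle_unknown='ignore')
toy_ohe.fit(toy_types_train)
# The default output is a sparse matrix, so convert it for display
types_encoded = toy_ohe.transform(toy_types_new).toarray()
print(types_encoded)
```

A nominal column with many distinct values produces one output column per value, which is worth keeping in mind when reading the transformed shape later in the notebook.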

### One Hot Encoder

In [16]:
#OneHot Encoder for categorical - nominal
#Note: scikit-learn 1.2 and later name this argument `sparse_output` instead of `sparse`
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

### Ordinal Encoder

In [17]:
#Ordinal encoder for categorical - ordinal data. We are looking at the Outlet_Size
os_labels = ['Small', 'Medium', 'High']

#handle_unknown is 'error' by default.  That's a good place to start
#but it may cause problems in a production model.  
ordinal = OrdinalEncoder(categories = [os_labels])

##Pre-Processor Pipelines
The preprocessor pipelines pair the proper imputer and encoders.
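
A small sketch of how such a pipeline behaves, using made-up arrays rather than the project data: the imputer and scaler both learn their statistics from the training rows, and those same statistics are then applied to the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test splits
toy_train = np.array([[1.0], [np.nan], [3.0]])
toy_test = np.array([[np.nan], [4.0]])

# Impute first, then scale
toy_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
toy_pipeline.fit(toy_train)

# The missing test value is filled with the TRAIN mean (2.0), which scales to 0
toy_test_scaled = toy_pipeline.transform(toy_test)
print(toy_test_scaled.ravel())
```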


In [18]:
# Set up the pipelines. Each one pairs an imputer with a scaler or an encoder
#numerical pipeline - mean imputer, then StandardScaler
num_pipeline = make_pipeline(mean_imputer, scaler)

#ordinal values -most frequent/ordinal encoder
ord_pipeline = make_pipeline(freq_imputer, ordinal)

#nominal values - missing imputer/ohe
nom_pipeline = make_pipeline(missing_imputer, ohe)

##Create Tuples
Use a tuple to pair the correct pipeline with the data columns.

In [19]:
# Create column lists for objects and a number selector
ordinal_cols = ['Outlet_Size']
nominal_cols = ['Item_Identifier',
                'Item_Fat_Content',
                'Item_Type',
                'Outlet_Identifier',
                'Outlet_Location_Type',
                'Outlet_Type']
#numeric column selector
num_selector = make_column_selector(dtype_include='number')

In [20]:
# Setup the tuples to pair the processors with the column selectors
numeric_tuple = (num_pipeline, num_selector)
ordinal_tuple = (ord_pipeline, ordinal_cols)
nominal_tuple = (nom_pipeline, nominal_cols)

In [21]:
# Instantiate the column transformer.  Drop all columns not included in our selected lists
preprocessor = make_column_transformer(ordinal_tuple, 
                                       numeric_tuple, 
                                       nominal_tuple, 
                                       remainder='drop')

In [22]:
# Fit the column transformer on the X_train
preprocessor.fit(X_train)

##Check Pipeline and Data Transformation
Transform the data and look at the output shape and NaN count to make sure that our pipeline is working. This is only a check: the transformed arrays are not used in our models, because each model pipeline runs the preprocessor itself.

In [23]:
# Transform the X_train and the X_test
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
X_train_transformed.shape

(6392, 1590)

In [24]:
X_test_transformed.shape

(2131, 1590)

In [25]:
#Check that all values have been imputed
np.isnan(X_train_transformed).sum()

0

#Model Data

##Linear Regression Model

In [26]:
# Create an instance of the model
lin_reg = LinearRegression()

# Create a model pipeline
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)

In [27]:
# Fit the model
lin_reg_pipe.fit(X_train, y_train)

####Evaluate Model

In [28]:
lr_report = evaluate_model(lin_reg_pipe, X_train, X_test, y_train, y_test, model_name='Linear Regression')        

Linear Regression evaluation: 
Train MAE = 735.9569605757197
Test MAE = 1624888007884.2693
Train MSE = 971858.6346592548
Test MSE = 3.815264441214404e+26
Train RMSE = 985.8289073968438
Test RMSE = 19532701915542.57
Train R2 = 0.671608994460443
Test R2 = -1.3828546027094994e+20


##Baseline Model
Mean Regression Dummy model

In [29]:
# Create an instance of the model
dummy = DummyRegressor(strategy='mean')
# Create a model pipeline
dummy_pipe = make_pipeline(preprocessor, dummy)
# Fit the model
dummy_pipe.fit(X_train, y_train)

In [30]:
dr_report = evaluate_model(dummy_pipe, X_train, X_test, y_train, y_test, model_name='Dummy Regression')        

Dummy Regression evaluation: 
Train MAE = 1360.2184410159132
Test MAE = 1326.121044678208
Train MSE = 2959455.7045265585
Test MSE = 2772144.4627103633
Train RMSE = 1720.306863477141
Test RMSE = 1664.9758144520788
Train R2 = 0.0
Test R2 = -0.004772483978719766
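
The baseline numbers follow directly from the metric definitions. The sketch below uses hand-written `rmse` and `r2` helpers and made-up targets (all hypothetical names and values) to show why a mean predictor scores exactly 0 on the data its mean came from, and can score below 0 on other data.

```python
import math

def rmse(y_true, y_pred):
    # Root of the mean squared error
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares around the mean of y_true)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy targets
y_train_toy = [100.0, 200.0, 300.0]
y_test_toy = [150.0, 250.0, 350.0]

# A mean baseline always predicts the training mean
train_mean = sum(y_train_toy) / len(y_train_toy)
train_pred = [train_mean] * len(y_train_toy)
test_pred = [train_mean] * len(y_test_toy)

train_r2_toy = r2(y_train_toy, train_pred)
test_r2_toy = r2(y_test_toy, test_pred)
test_rmse_toy = rmse(y_test_toy, test_pred)
print(train_r2_toy, test_r2_toy, test_rmse_toy)
```

Any model worth recommending should beat this baseline on the test split, meaning a lower test RMSE and a test R2 above 0.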


##Decision Tree Model

In [31]:
#Create the model
dec_tree = DecisionTreeRegressor(random_state = 42)

# Create a model pipeline
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)

In [32]:
#Fit the model
dec_tree_pipe.fit(X_train, y_train)

####Evaluate Model

In [33]:
dt_report = evaluate_model(dec_tree_pipe, X_train, X_test, y_train, y_test, model_name='Decision Tree')

Decision Tree evaluation: 
Train MAE = 1.2005415435245748e-16
Test MAE = 1004.0579559831067
Train MSE = 2.4643264323299693e-29
Test MSE = 2173522.1707433704
Train RMSE = 4.96419825584149e-15
Test RMSE = 1474.2870041967305
Train R2 = 1.0
Test R2 = 0.2122000495077334


#Model Comparison

In [34]:
report=[lr_report,dr_report, dt_report]
reportdf = pd.DataFrame(report)
reportdf

Unnamed: 0,Model,Train_MAE,Train_MSE,Train_RMSE,Train_R2,Test_MAE,Test_MSE,Test_RMSE,Test_R2
0,Linear Regression,735.957,971858.6,985.8289,0.671609,1.624888e+12,3.815264e+26,1.953270e+13,-1.382855e+20
1,Dummy Regression,1360.218,2959456.0,1720.307,0.0,1326.121,2772144.0,1664.976,-0.004772484
2,Decision Tree,1.200542e-16,2.464326e-29,4.964198e-15,1.0,1004.058,2173522.0,1474.287,0.2122


#Recommendation