
# James Jones
### 09-22-2022

## We will begin pre-processing a dataset to develop a machine learning model that can help predict product sales for a grocery retailer

In [2]:
# Load in necessary libraries

import pandas as pd # To load and manipulate our dataframe
import numpy as np # For numeric operations and inspecting our processed arrays
from sklearn.model_selection import train_test_split # To split our data into training and testing sets
from sklearn.compose import make_column_transformer, make_column_selector # To create our transformers
from sklearn.preprocessing import StandardScaler, OneHotEncoder # To scale our numeric data and OneHotEncode our nominal data
from sklearn.pipeline import make_pipeline # To create our pipelines
from sklearn.impute import SimpleImputer # To impute missing values
from sklearn import set_config # To create simple diagrams showing our processing steps
set_config(display='diagram')

In [3]:
# Load in data
df = pd.read_csv('/content/drive/MyDrive/Data Sets (CD)/sales_predictions.csv')
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### We will give this dataframe a once over before moving to pre-processing, just to ensure that it is as clean as we can make it

In [4]:
# Make a copy to manipulate for machine learning, so we don't lose our original
DF = df.copy()

In [5]:
# Now, let's inspect our dataset. We want to perform a little cleaning prior to pre-processing

DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [6]:
# Item_Weight and Outlet_Size are missing values. This will be addressed later, during imputation
  # Let's check for duplicates
DF.duplicated().sum()

0

In [7]:
# No duplicates. We can move on to inconsistencies in categorical data
DF['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [8]:
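# Standardize this category to enable cleaner One Hot Encoding down the line
# Restrict the replacement to this one column so no other column is touched by accident
DF['Item_Fat_Content'] = DF['Item_Fat_Content'].replace({'LF':'Low Fat', 'low fat':'Low Fat',
                                                          'reg':'Regular'})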

In [9]:
# Check that the changes we made were appropriate
DF['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
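
The label clean-up above can be tried on its own with a small sketch. The toy Series and the name `label_map` are made up for illustration; only the mapping itself comes from the cell above.

```python
import pandas as pd

# Hypothetical toy column with the same kind of inconsistent labels
fat = pd.Series(['Low Fat', 'LF', 'reg', 'Regular', 'low fat'],
                name='Item_Fat_Content')

# Map each variant spelling to one canonical label
label_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(label_map)

# Only the two canonical labels should remain
print(fat_clean.value_counts())
```

Applying the mapping to a single column, rather than the whole dataframe, keeps the replacement from touching matching strings elsewhere.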

In [10]:
# Look into Outlet_Establishment_Year
DF['Outlet_Establishment_Year'].value_counts()

1985    1463
1987     932
1999     930
1997     930
2004     930
2002     929
2009     928
2007     926
1998     555
Name: Outlet_Establishment_Year, dtype: int64

In [11]:
# Since these are numeric, simply scaling them would treat the year as a continuous quantity and could throw any potential model off
# Let's convert this column to an object, so it will One Hot Encode instead
DF['Outlet_Establishment_Year'] = DF['Outlet_Establishment_Year'].astype('object')

In [12]:
# We successfully converted the data type to an object
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   object 
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), object(8)
memory usage: 799.2+ KB


In [13]:
# Now, we will remove unnecessary columns. Some of these will gum up our models, and therefore we will just remove them

DF = DF.drop(columns = ['Outlet_Identifier', 'Item_Identifier'])
DF.head()
  # The two columns dropped above have so many unique values that they can throw off our whole model

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,1998,,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,1987,High,Tier 3,Supermarket Type1,994.7052


# Now, we want to identify our Features (denoted by 'X') and our Target (denoted by 'y')
  - Then, we will train-test split our dataset

In [14]:
y = DF['Item_Outlet_Sales'] # What we're trying to predict
X = DF.drop(columns = 'Item_Outlet_Sales') # All data EXCEPT what we're trying to predict

In [15]:
# Train Test Split allows us to separate our data, at random, into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [16]:
X_train.info() # This is an unnecessary step, that is only to demonstrate a successful split

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6392 entries, 4776 to 7270
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                5285 non-null   float64
 1   Item_Fat_Content           6392 non-null   object 
 2   Item_Visibility            6392 non-null   float64
 3   Item_Type                  6392 non-null   object 
 4   Item_MRP                   6392 non-null   float64
 5   Outlet_Establishment_Year  6392 non-null   object 
 6   Outlet_Size                4580 non-null   object 
 7   Outlet_Location_Type       6392 non-null   object 
 8   Outlet_Type                6392 non-null   object 
dtypes: float64(3), object(6)
memory usage: 499.4+ KB
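
The row counts above follow from the default split. As a minimal sketch, assuming stand-in data (`X_demo` and `y_demo` are hypothetical placeholders with the same number of rows as our dataset), we can confirm that leaving out `test_size` gives a 75/25 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (8523)
X_demo = np.arange(8523).reshape(-1, 1)
y_demo = np.arange(8523)

# No test_size is given, so scikit-learn falls back to its default of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))  # 6392 2131
```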


# Create a Pre-processing object to prepare our data for machine learning
  - After dropping those extra columns, we still have two columns that are missing data: "Item_Weight" and "Outlet_Size"
    - The "Item_Weight" column is a float. For this reason we can use "mean" when building our simple imputer
    - The "Outlet_Size" column is an object, so we will use the simple imputer strategy "most_frequent"
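
To see what the two strategies do, here is a minimal sketch on a made-up three-row frame (`toy` and its values are hypothetical, not taken from our dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'Item_Weight': [10.0, np.nan, 20.0],
    'Outlet_Size': ['Medium', np.nan, 'Medium'],
})

# 'mean' fills the numeric gap with the column average
weight_filled = SimpleImputer(strategy='mean').fit_transform(toy[['Item_Weight']])

# 'most_frequent' fills the categorical gap with the most common label
size_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy[['Outlet_Size']])

print(weight_filled.ravel())
print(size_filled.ravel())
```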

In [17]:
# Instantiate imputer
mean_imputer = SimpleImputer(strategy = 'mean')
freq_imputer = SimpleImputer(strategy = 'most_frequent')

# Instantiate scaler
scaler = StandardScaler()

In [18]:
#Instantiate column selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [19]:
# Instantiate One Hot Encoder
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
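  # handle_unknown = 'ignore' encodes values that weren't encountered during the fit process (if any) as all zeros instead of throwing an error
  # sparse = False returns a regular NumPy array instead of a compressed sparse matrix, which makes it more legible
  # Note: scikit-learn 1.2 renamed this parameter to sparse_output, and 1.4 removed the old name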
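
A minimal sketch of what `handle_unknown='ignore'` does, using made-up category values. The encoder here is left on its default sparse output and converted with `.toarray()`, so the sketch does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories seen during fitting
train_sizes = np.array([['Small'], ['Medium'], ['High']], dtype=object)
# 'Extra Large' was never seen during fitting
test_sizes = np.array([['Medium'], ['Extra Large']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_sizes)

# Convert the sparse result to a dense array so it is easy to read
encoded = encoder.transform(test_sizes).toarray()

print(encoder.categories_[0])  # categories are stored in sorted order
print(encoded)                 # the unseen category becomes a row of zeros
```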

In [20]:
# Instantiate numeric pipeline
num_pipeline = make_pipeline(mean_imputer, scaler)
num_pipeline

In [21]:
# Instantiate categorical pipeline
cat_pipeline = make_pipeline(freq_imputer, ohe)
cat_pipeline

In [22]:
# Create tuples for transformers
num_tuple = (num_pipeline, num_selector)
cat_tuple = (cat_pipeline, cat_selector)

In [59]:
# Create column transformer (we'll call it 'preprocessor')
preprocessor = make_column_transformer(num_tuple, cat_tuple, remainder = 'drop') # remainder = 'drop' discards any columns not selected by either transformer (if any)
preprocessor

In [60]:
# Fit transformer on training data
preprocessor.fit(X_train)

## Now, we can actually transform our data

In [82]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
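
Fitting on the training data only, then transforming both sets, means the test set is scaled with statistics learned from the training set. A minimal sketch with made-up numbers (`demo_scaler`, `train_vals` and `test_vals` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[4.0]])

# Learn the mean and standard deviation from the training data only
demo_scaler = StandardScaler().fit(train_vals)

train_scaled = demo_scaler.transform(train_vals)
test_scaled = demo_scaler.transform(test_vals)

# Training data is centered on 0; test data is expressed relative to the training mean
print(train_scaled.mean())
print(test_scaled)
```

Because the scaler never sees the test rows during fitting, no information from the testing set leaks into the preprocessing step.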

In [62]:
# Now, let's inspect the results for any inconsistencies

# X_train first
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data') # Check for missing values
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype) # Check data type (should all be numeric)
print('\n')
print('Shape of data is', X_train_processed.shape) # Shows number of rows and columns (the extra columns are indicative of One Hot Encoding)
print('\n')
X_train_processed # Shows our processed data as a NumPy array

0 missing values in training data


All data in X_train_processed are float64


Shape of data is (6392, 40)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])

In [63]:
# For X_test
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('Shape of data is', X_test_processed.shape) # Note, the number of columns matches X_train_processed
print('\n')
X_test_processed

0 missing values in testing data


All data in X_test_processed are float64


Shape of data is (2131, 40)




array([[ 0.33100885, -0.77664625, -0.99881554, ...,  1.        ,
         0.        ,  0.        ],
       [-1.17989246,  0.1003166 , -1.58519423, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.37844688, -0.48299432, -1.59578435, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-1.13957013,  1.21832428,  1.09397975, ...,  1.        ,
         0.        ,  0.        ],
       [-1.49772727, -0.77809567, -0.36679966, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.52076098, -0.77976293,  0.11221189, ...,  1.        ,
         0.        ,  0.        ]])

# Let's build some models!
### We will build a Linear Regression Model, and a Decision Tree to make predictions on our data
  - We will start with Linear regression using our scaled 'DF'

In [64]:
# We have already instantiated some objects that we will use in our Linear Regression model
# Next, we need to import the LinearRegression class
from sklearn.linear_model import LinearRegression

In [65]:
# Now, instantiate LinearRegression
linear_reg = LinearRegression()

In [66]:
# Create our regression pipeline
linear_reg_pipe = make_pipeline(preprocessor, linear_reg)

In [67]:
# Visualization helps in comprehension
linear_reg_pipe

In [103]:
# Now, we can fit it to our training data
linear_reg_pipe.fit(X_train, y_train)

In [104]:
# Have our model make predictions
train_pred = linear_reg_pipe.predict(X_train)
test_pred = linear_reg_pipe.predict(X_test)

In [105]:
# This step is unnecessary, and only to see that our model is outputting data
train_pred

array([3812.5, 2655. , 2610.5, ..., 3737. , 1929.5, 1538. ])

## Now we will implement Regression Metrics to evaluate our model's performance

In [106]:
# Before we can run our regression metrics via Scikit-Learn, we need to import the metric functions
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [107]:
# R-Squared (R2) score is how we evaluate the 'fit' of our model on our training and testing sets

train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)

print(f'Model Training R2: {train_r2}')
print(f'Model Testing R2 {test_r2}')

# These fits aren't great, but they're very consistent between training and testing.

Model Training R2: 0.5615568470884735
Model Testing R2 0.5671012098305973


In [108]:
# Now, we will check our Root Mean Squared Error (RMSE) to get a sense of how far off our predictions are from the true values
# First: we need to calculate our Mean Squared Error (MSE). MSE is in squared units rather than the original units, so it is less interpretable

train_MSE = mean_squared_error(y_train, train_pred)
test_MSE = mean_squared_error(y_test, test_pred)

print(f'Model Training MSE: {train_MSE}')
print(f'Model Testing MSE: {test_MSE}')

Model Training MSE: 1297553.089994627
Model Testing MSE: 1194357.9299963608


In [109]:
# Root Mean Squared Error (RMSE)

train_RMSE = np.sqrt(train_MSE)
test_RMSE = np.sqrt(test_MSE)

print(f'Model Training RMSE: {train_RMSE}')
print(f'Model Testing RMSE: {test_RMSE}')

# This metric shows that our test predictions were typically off by about $1092. Because the errors are SQUARED before averaging, large errors count more heavily

Model Training RMSE: 1139.1018786722402
Model Testing RMSE: 1092.866840011335


### Linear Regression Model Performance
  - A Testing R2 score of 0.567 indicates that our model can explain about 56.7% of the variance in sales in the testing set
  - A Testing RMSE score of 1092.87 indicates that our predictions were typically about $1092.87 off of our actual values. Because RMSE is based on the SQUARE of our errors, it more highly punishes large errors
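
To make these two metrics concrete, here is a minimal sketch that computes them by hand on made-up values and checks them against Scikit-Learn. The arrays `y_true` and `y_hat` are hypothetical; Mean Absolute Error is included only as a contrast, to show that RMSE weighs large errors more heavily than a plain average of the errors.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 370.0])

residuals = y_true - y_hat

# MSE: average of the squared errors (squared units)
mse_manual = np.mean(residuals ** 2)
# RMSE: square root of MSE (back in the original units)
rmse_manual = np.sqrt(mse_manual)
# R2: 1 minus (unexplained variation / total variation around the mean)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# MAE: plain average of the absolute errors
mae_manual = np.mean(np.abs(residuals))

print(f'MSE: {mse_manual}, RMSE: {rmse_manual}, R2: {r2_manual}, MAE: {mae_manual}')
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```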

## This model performance is not very good. Therefore we will try something else

# Now, we will create a decision tree

In [93]:
# I usually like to display my DF.head() at the beginning of any model building/manipulation
DF.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,1998,,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,1987,High,Tier 3,Supermarket Type1,994.7052


### We are building a regression tree, so we can use our original preprocessor to have all numeric data
  - Scaling is NOT required for Decision Trees, but will not skew results by any noticeable margin

In [94]:
# Import our DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

In [95]:
# We will use the same X and y values, but they do not need to be scaled
# Instantiate our tree
dec_tree = DecisionTreeRegressor(random_state = 42) # Always keep random state consistent

In [96]:
# Fit dec_tree on our processed data sets
dec_tree.fit(X_train_processed, y_train)

In [97]:
# Set predictions
train_pred2 = dec_tree.predict(X_train_processed)
test_pred2 = dec_tree.predict(X_test_processed)

In [114]:
# Evaluate our model

train_score = dec_tree.score(X_train_processed, y_train)
test_score = dec_tree.score(X_test_processed, y_test)

In [98]:
# For a regression tree, .score() returns the R2 score
print(train_score)
print(test_score)
  # This shows a perfect fit on our training data, and a terrible fit on our testing data

1.0
0.168867192073617


In [112]:
# Let's tune the depth of our Decision Tree, to try and get a better fit on our testing data
dec_tree.get_depth()

44

In [115]:
# We can now create a for loop to loop through our tree depths, and find the best depth for our testing score
depths = list(range(1, 45)) # will try every depth from 1 to 44 (44 was our original tree depth); range() excludes its stop value

scores = pd.DataFrame(index=depths, columns=['Test Score','Train Score'])
for depth in depths:
    dec_tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dec_tree.fit(X_train_processed, y_train)
    train_score = dec_tree.score(X_train_processed, y_train)
    test_score = dec_tree.score(X_test_processed, y_test)
    scores.loc[depth, 'Train Score'] = train_score
    scores.loc[depth, 'Test Score'] = test_score

In [116]:
# The above won't automatically display anything, so we will display the scores in order of highest testing score to find our BEST depth

sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
5,0.59471,0.60394
4,0.584005,0.582625
6,0.582326,0.615161
7,0.579067,0.626841
8,0.564124,0.643842
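
The loop above picks the depth by looking at the testing score. An alternative, shown here only as a sketch and not as the method used in this notebook, is to choose the depth with cross-validation on the training data, so the testing set stays untouched until the final check. The synthetic data from `make_regression` and the names `X_syn`, `y_syn`, `search` and `cv_test_r2` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not our sales data)
X_syn, y_syn = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, random_state=42)

# Try depths 1 through 15, scoring each with 5-fold cross-validated R2 on the training set
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={'max_depth': list(range(1, 16))},
    cv=5,
    scoring='r2',
)
search.fit(X_syn_train, y_syn_train)

best_depth = search.best_params_['max_depth']
# The testing set is only used once, to score the refit best model
cv_test_r2 = search.score(X_syn_test, y_syn_test)

print(best_depth, cv_test_r2)
```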


In [119]:
# Our best depth for the decision tree is 5. Create a new decision tree with max depth of 5
dec_tree_5 = DecisionTreeRegressor(max_depth = 5, random_state = 42)
dec_tree_5.fit(X_train_processed, y_train)
train_5_score = dec_tree_5.score(X_train_processed, y_train)
test_5_score = dec_tree_5.score(X_test_processed, y_test)
print(train_5_score)
print(test_5_score)

0.6039397477322956
0.5947099753159972


In [120]:
# Assign new variables to prediction using dec_tree_5
train_pred5 = dec_tree_5.predict(X_train_processed)
test_pred5 = dec_tree_5.predict(X_test_processed)

In [121]:
# Calculate MSE before RMSE
train_MSE2 = mean_squared_error(y_train, train_pred5)
test_MSE2 = mean_squared_error(y_test, test_pred5)

print(f'Model Training MSE: {train_MSE2}')
print(f'Model Testing MSE: {test_MSE2}')

Model Training MSE: 1172122.7729098853
Model Testing MSE: 1118185.973077762


In [122]:
# Root Mean Squared Error (RMSE)

train_RMSE2 = np.sqrt(train_MSE2)
test_RMSE2 = np.sqrt(test_MSE2)

print(f'Model Training RMSE: {train_RMSE2}')
print(f'Model Testing RMSE: {test_RMSE2}')

Model Training RMSE: 1082.6461900869947
Model Testing RMSE: 1057.4431299496734


# Our original decision tree did not do a good job of predicting our test data
## Therefore, we adjusted our depth to the one with the optimum testing score
  - This resulted in an R2 of 0.595, indicating that our model can explain about 59.5% of the variance in sales in the testing set
  - This also resulted in an RMSE score of 1057.44, indicating that our predictions were typically about $1057.44 off of our actual values. Because RMSE is based on the SQUARE of our errors, it more highly punishes large errors

# Neither model did an outstanding job predicting our outcomes, but we can see that our Decision Tree was slightly better. Both the fit (R2) and the margin of error between actual and predicted values (RMSE) improved over our Linear Regression model. We should, therefore, use a decision tree in this instance.