### • Business Understanding

•  **Introduction:** The project involves working with an American retail chain operating across California (CA), Texas (TX), and Wisconsin (WI). This retailer offers a diverse range of products, including hobbies, foods, and household items, across ten stores. The objective is to develop and deploy two distinct machine learning models as APIs to address specific business challenges:

1. Predictive Model: A predictive model to forecast sales revenue for specific items in particular stores on given dates.
2. Forecasting Model: A time-series-based forecasting model to predict total sales revenue across all stores and items for the next seven days.

These models will help optimize inventory, pricing, and decision-making for our retail partner.

•  **Dataset:** The following datasets are provided to develop and evaluate the models:

        - Training Data
        - Evaluation Data
        - Calendar
        - Events
        - Items Price per Week

•  **Business Problem:** The primary business problems that the machine learning models aim to address are as follows:

1. Sales Prediction: The retailer needs to accurately predict the sales revenue for individual items in specific stores for any given date.

2. Sales Forecasting: The retailer seeks to forecast total sales revenue across all stores and items for the next seven days.

This information is crucial for inventory management, pricing strategies, and overall business planning. Our machine learning models aim to address these business challenges by providing precise predictions and forecasts.
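
The two targets differ mainly in granularity. As a rough illustration on made-up rows (the column names follow the processed training data used in this notebook; the values are invented), the predictive model works on one row per item, store, and date, while the forecasting model works on daily totals:

```python
import pandas as pd

# Toy rows, invented for illustration only
sales = pd.DataFrame({
    "item_id": ["FOODS_3_220", "HOBBIES_2_011", "FOODS_3_220", "HOBBIES_2_011"],
    "store_id": ["WI_2", "CA_1", "WI_2", "CA_1"],
    "date": pd.to_datetime(["2012-09-06", "2012-09-06", "2012-09-07", "2012-09-07"]),
    "revenue": [0.0, 1.94, 3.5, 0.0],
})

# Predictive model target: revenue for each (item, store, date) row
item_level_target = sales["revenue"]

# Forecasting model target: total revenue per day across all stores and items
daily_total = sales.groupby("date")["revenue"].sum()
print(daily_total)
```

The aggregation step here is only a sketch of how a daily series could be derived; the notebook that builds the forecasting model may prepare it differently.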

The following activities are performed for this regression learning task.

• Business Understanding

• Data Understanding

        1] Loading Data
        2] Exploring Data
        3] Combining the training and other datasets

• Data Preparation

        4] Feature Engineering
        5] Features Selection
        6] Splitting Data into Different Sets

• Modeling

        7] Assessing Baseline Performance
        8] Sales Prediction Model: XGBoost Algorithm
        9] Sales Forecasting Model: Prophet Algorithm

• Model Evaluation

        10] Analysing Model Performance

In [48]:
# Importing Python and the necessary libraries
import numpy as np
import pandas as pd

# Importing SKLearn libraries for building a Predictive ML Model
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
import xgboost
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

# Importing dump library from joblib
from joblib import dump

# Importing formatting and other required libraries
import sys
import warnings

# Including the project root directory
sys.path.append('/Users/monalipatil/Monali/MDSI-Semester1/Advanced Machine Learning Application/Assignment2/adv_mla_assignment2')

# Importing class and functions defined to build predictive machine learning model
from src.models.null import NullAccuracy
from src.data.sets import features_scaling, ordinal_transform, save_datasets
from src.models.performance import evaluating_mas_score, evaluating_rmse_score

In [49]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [50]:
# Ignoring warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

## Building a Predictive Model

• Loading the training and validation datasets for developing a machine learning model.

In [51]:
# Defining data files path 
# Note: Change this path to the relevant directory
file_url = '/Users/monalipatil/Monali/MDSI-Semester1/Advanced Machine Learning Application/Assignment2/adv_mla_assignment2'

# Loading the training and validation dataset into a separate pandas dataframe
df_train = pd.read_csv(file_url + '/data/processed/retail_training_dataset.csv')
df_validation = pd.read_csv(file_url + '/data/processed/retail_validation_dataset.csv')

• Generating duplicates of the training and validation datasets.

In [52]:
# Creating a copy of the original training and validation datasets
df_train_copied = df_train.copy()
df_validation_copied = df_validation.copy()

In [53]:
# Displaying a few random data points from the training dataset
df_train.sample(5)

Unnamed: 0,item_id,store_id,date,event_name,event_type,revenue,day_of_week,month,year,week_of_year
17924466,FOODS_3_220,WI_2,2012-09-06,,,0.0,3,9,2012,36
17539704,FOODS_2_029,CA_3,2012-08-25,,,0.0,5,8,2012,34
11007316,HOBBIES_2_011,CA_1,2012-01-24,,,1.94,1,1,2012,4
7380796,FOODS_2_390,CA_1,2011-09-27,,,0.0,1,9,2011,39
24429827,HOUSEHOLD_2_143,CA_3,2013-04-08,,,6.46,0,4,2013,15


In [54]:
# Verifying the features' data types
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43814130 entries, 0 to 43814129
Data columns (total 10 columns):
 #   Column        Dtype  
---  ------        -----  
 0   item_id       object 
 1   store_id      object 
 2   date          object 
 3   event_name    object 
 4   event_type    object 
 5   revenue       float64
 6   day_of_week   int64  
 7   month         int64  
 8   year          int64  
 9   week_of_year  int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 3.3+ GB


• Creating a list of numerical features named 'numerical_features'.

In [55]:
# Collecting the names of the features with a numerical (int64) datatype
numerical_features = df_train.select_dtypes(include=['int64']).columns

• Creating a list of categorical features named 'categorical_features'.

In [56]:
# Listing the names of the features with a categorical datatype
categorical_features = ['item_id', 'store_id', 'event_name', 'event_type']
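
As a small illustration of what the `int64` selection picks up, the sketch below uses a toy frame with invented values whose dtypes mirror the `df_train.info()` output above. Because `revenue` is `float64`, it is not selected, so the target does not end up among the numerical features.

```python
import pandas as pd

# Toy frame with invented values; dtypes mirror the df_train.info() output
toy = pd.DataFrame({
    "item_id": ["FOODS_3_220", "HOBBIES_2_011"],
    "store_id": ["WI_2", "CA_1"],
    "revenue": [0.0, 1.94],
    "day_of_week": [3, 1],
    "month": [9, 1],
    "year": [2012, 2012],
    "week_of_year": [36, 4],
}).astype({"day_of_week": "int64", "month": "int64",
           "year": "int64", "week_of_year": "int64"})

# select_dtypes returns a DataFrame; .columns gives a pandas Index of names
toy_numerical = toy.select_dtypes(include=["int64"]).columns
print(list(toy_numerical))
```

Note that the result is a pandas `Index` rather than a Python list; it can be converted with `list(...)` if a plain list is needed.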

• Performing feature scaling for the training and validation datasets.

In [57]:
# Invoking the function to scale the numerical feature values of the training and validation datasets
# Note: 'scaler' is reassigned by the second call, so the object saved in the next cell is the one returned for the validation dataset
X_train_numerical, scaler = features_scaling(df_train, numerical_features)
X_validate_numerical, scaler = features_scaling(df_validation, numerical_features)

In [58]:
# Storing the scaler object in the models directory and naming the file as 'scaler.joblib'
dump(scaler, '../../models/scaler.joblib')

['../../models/scaler.joblib']
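
The implementation of `features_scaling` is not shown in this notebook. For comparison, a common pattern with scikit-learn is to fit the scaler once on the training data and reuse it for the validation data, so both are scaled with the same statistics. The sketch below shows that pattern on invented values; the names `scaler_sketch`, `train`, and `validate` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values for illustration only
train = pd.DataFrame({"month": [1, 2, 3, 4], "year": [2011, 2011, 2012, 2012]})
validate = pd.DataFrame({"month": [5, 6], "year": [2013, 2013]})
cols = ["month", "year"]

# Fit on the training data only
scaler_sketch = StandardScaler().fit(train[cols])

# Apply the same fitted scaler to both datasets
train_scaled = pd.DataFrame(scaler_sketch.transform(train[cols]),
                            columns=cols, index=train.index)
validate_scaled = pd.DataFrame(scaler_sketch.transform(validate[cols]),
                               columns=cols, index=validate.index)
print(validate_scaled)
```

With this pattern, the saved scaler is the one fitted on the training data, which is also what a served model would need at prediction time.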

• Transforming the categorical features into numerical values.

In [59]:
# Invoking the function to encode the categorical 'item_id' feature as numerical values for the training and validation datasets
item_id_train, ordinal = ordinal_transform(df_train, categorical_feature='item_id')
item_id_validate, ordinal = ordinal_transform(df_validation, categorical_feature='item_id')

In [60]:
# Invoking the function to encode the categorical 'store_id' feature as numerical values for the training and validation datasets
store_id_train, ordinal = ordinal_transform(df_train, categorical_feature='store_id')
store_id_validate, ordinal = ordinal_transform(df_validation, categorical_feature='store_id')

In [61]:
# Invoking the function to encode the categorical 'event_name' feature as numerical values for the training and validation datasets
event_name_train, ordinal = ordinal_transform(df_train, categorical_feature='event_name')
event_name_validate, ordinal = ordinal_transform(df_validation, categorical_feature='event_name')

In [62]:
# Invoking the function to encode the categorical 'event_type' feature as numerical values for the training and validation datasets
event_type_train, ordinal = ordinal_transform(df_train, categorical_feature='event_type')
event_type_validate, ordinal = ordinal_transform(df_validation, categorical_feature='event_type')

In [63]:
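# Storing the ordinal object in the models directory and naming the file as 'ordinal.joblib'
# Note: 'ordinal' holds the encoder returned by the most recent call (event_type, validation dataset)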
dump(ordinal, '../../models/ordinal.joblib')

['../../models/ordinal.joblib']
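
The `ordinal_transform` helper is a project function whose body is not shown here. As a point of comparison, scikit-learn's `OrdinalEncoder` can encode several categorical columns with a single fitted object and can map categories unseen during fitting to a sentinel value. The sketch below uses invented rows and a subset of the categorical columns; the sentinel `-1` is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["item_id", "store_id"]

# Invented rows for illustration only
train = pd.DataFrame({"item_id": ["FOODS_3_220", "HOBBIES_2_011", "FOODS_2_029"],
                      "store_id": ["WI_2", "CA_1", "CA_3"]})
validate = pd.DataFrame({"item_id": ["FOODS_3_220", "HOUSEHOLD_2_143"],
                         "store_id": ["CA_1", "WI_2"]})

# One encoder for all listed columns; unseen categories are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train[cat_cols])
validate_encoded = encoder.transform(validate[cat_cols])
print(validate_encoded)
```

Fitting once on the training data keeps the category-to-integer mapping identical across datasets, and a single saved object then covers every categorical column.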

• Aggregating all the transformed predictors to form the features for both the training and validation datasets.

In [64]:
# Combining all the transformed features of the training dataset
X_train = X_train_numerical.copy()
X_train['item_id'] = item_id_train['item_id']
X_train['store_id'] = store_id_train['store_id']
X_train['event_name'] = event_name_train['event_name']
X_train['event_type'] = event_type_train['event_type']

In [65]:
# Combining all the transformed features of the validation dataset
X_validate = X_validate_numerical.copy()
X_validate['item_id'] = item_id_validate['item_id']
X_validate['store_id'] = store_id_validate['store_id']
X_validate['event_name'] = event_name_validate['event_name']
X_validate['event_type'] = event_type_validate['event_type']

• Extracting the response variable 'revenue' from the training and validation datasets.

In [66]:
# Extracting the values of the target variable 'revenue' from the training and validation datasets
y_train = df_train['revenue']
y_validate = df_validation['revenue']

• Storing the training and validation datasets in the data/processed directory.

In [67]:
# Invoking the function to store the prepared datasets in the data/processed directory
save_datasets(X_train, y_train, X_validate, y_validate, path='../../data/processed/')

### • Modeling

#### 7] Assessing Baseline Performance

In [68]:
# Creating an instance of the NullAccuracy class
baseline = NullAccuracy()

# Invoking a method to evaluate the baseline performance score
y_base = baseline.fit_predict(y_train)

In [69]:
# Calculating the baseline performance scores MAS and RMSE   
print('Performance Scores:') 
print('Baseline MAS score:', evaluating_mas_score(y_train, pd.Series(y_base.flatten())))
print('Baseline RMSE score:', evaluating_rmse_score(y_train, pd.Series(y_base.flatten())))

Performance Scores:
Baseline MAS score: 4.3372
Baseline RMSE score: 9.0484
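
`NullAccuracy` and the two scoring helpers are project code that is not shown in this notebook. A minimal sketch of a null baseline, assuming it predicts a single constant (here the mean of the target) and assuming "MAS" refers to the mean absolute error, could look like this on invented values:

```python
import numpy as np

# Invented target values for illustration only
y = np.array([0.0, 0.0, 1.94, 0.0, 6.46])

# Constant prediction: every observation gets the mean of the target
y_pred = np.full_like(y, y.mean())

# Mean absolute error and root mean squared error of the constant prediction
mae_score = np.mean(np.abs(y - y_pred))
rmse_score = np.sqrt(np.mean((y - y_pred) ** 2))
print(round(mae_score, 4), round(rmse_score, 4))
```

For a mean prediction the RMSE equals the population standard deviation of the target, which is why this kind of baseline is a useful reference point: a model only adds value if it scores below it.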


#### 8] Developing a Predictive Regression Model

#### * Experiment 1

• Instantiating the XGBoost Regression algorithm.

In [70]:
# Generating an 'xgb_regressor' instance using the XGBRegressor class with the default hyperparameters
xgb_regressor = XGBRegressor(random_state=9)

• Fitting the XGBoost Regression model with the training data.

In [71]:
# Training the XGBoost Regressor model using the selected features of the training dataset
xgb_regressor.fit(X_train, y_train)

• Assessing the model performance measures on the training and validation datasets.

In [72]:
# Calculating the predictive model's performance MAS and RMSE scores - training dataset
print('Performance Scores of the XGBoost Algorithm:') 
print('Training MAS score:', evaluating_mas_score(y_train, xgb_regressor.predict(X_train)))
print('Training RMSE score:', evaluating_rmse_score(y_train, xgb_regressor.predict(X_train)))

Performance Scores of the XGBoost Algorithm:
Training MAS score: 4.0605
Training RMSE score: 8.7152


In [73]:
# Calculating the predictive model's performance MAS and RMSE scores - validation dataset
print('Performance Scores of the XGBoost Algorithm:') 
print('Validation MAS score:', evaluating_mas_score(y_validate, xgb_regressor.predict(X_validate)))
print('Validation RMSE score:', evaluating_rmse_score(y_validate, xgb_regressor.predict(X_validate)))

Performance Scores of the XGBoost Algorithm:
Validation MAS score: 4.3389
Validation RMSE score: 10.4643


#### * Experiment 2

• Instantiating the Linear Regression algorithm.

In [74]:
# Generating a 'lnr_regression' instance using the LinearRegression class with the default hyperparameters
lnr_regression = LinearRegression()

• Fitting the Linear Regression model with the training data.

In [75]:
# Training the Linear Regression model using the selected features of the training dataset
lnr_regression.fit(X_train, y_train)

• Assessing the model performance measures on the training and validation datasets.

In [76]:
# Calculating the predictive model's performance MAS and RMSE scores - training dataset
print('Performance Scores of the Linear Regression Algorithm:') 
print('Training MAS score:', evaluating_mas_score(y_train, lnr_regression.predict(X_train)))
print('Training RMSE score:', evaluating_rmse_score(y_train, lnr_regression.predict(X_train)))

Performance Scores of the Linear Regression Algorithm:
Training MAS score: 4.2788
Training RMSE score: 9.0114


In [77]:
# Calculating the predictive model's performance MAS and RMSE scores - validation dataset
print('Performance Scores of the Linear Regression Algorithm:') 
print('Validation MAS score:', evaluating_mas_score(y_validate, lnr_regression.predict(X_validate)))
print('Validation RMSE score:', evaluating_rmse_score(y_validate, lnr_regression.predict(X_validate)))

Performance Scores of the Linear Regression Algorithm:
Validation MAS score: 4.4691
Validation RMSE score: 10.6893
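
To compare the two experiments with the baseline side by side, the scores printed above can be collected into one table. The sketch below only re-enters the reported numbers (the baseline was computed on the training target, so its validation entries are left empty) and adds the gap between validation and training RMSE.

```python
import pandas as pd

nan = float("nan")

# Scores copied from the outputs above; no new results are computed
scores = pd.DataFrame(
    {
        "train_mas": [4.3372, 4.0605, 4.2788],
        "train_rmse": [9.0484, 8.7152, 9.0114],
        "validation_mas": [nan, 4.3389, 4.4691],
        "validation_rmse": [nan, 10.4643, 10.6893],
    },
    index=["baseline", "xgboost", "linear_regression"],
)

# Difference between validation and training RMSE for each model
scores["rmse_gap"] = (scores["validation_rmse"] - scores["train_rmse"]).round(4)
print(scores)
```

On these numbers XGBoost has the lower validation RMSE of the two models, and both models score noticeably worse on the validation set than on the training set.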


#### • Developing a Pipeline for ML Model Serving

• Constructing the lists of categorical and numerical features.

In [78]:
# Creating the lists of features with categorical and numerical datatypes
categorical_features = ['item_id', 'store_id', 'event_name', 'event_type']
numerical_features = df_train.select_dtypes(include=['int64']).columns

• Establishing a pipeline called numerical_transformer for conducting scaling for numerical features.

In [79]:
# Constructing a Pipeline named 'numerical_transformer' to conduct Scaling
numerical_transformer = Pipeline(steps=[('scaling', StandardScaler())])

• Establishing a Pipeline named categorical_transformer utilizing OrdinalEncoder for categorical features transformation.

In [80]:
# Constructing a Pipeline named 'categorical_transformer' involving OrdinalEncoder to transform categorical features
categorical_transformer = Pipeline(steps=[('ordinal_encoder', OrdinalEncoder())])

• Building a ColumnTransformer to preprocess numerical and categorical features separately.

In [81]:
# Creating a ColumnTransformer named 'preprocessor' to apply different preprocessing steps on the numerical and categorical features
preprocessor = ColumnTransformer(transformers=[('numerical_features', numerical_transformer, numerical_features), 
                                               ('categorical_features', categorical_transformer, categorical_features)])

• Forming a pipeline with two stages: preprocessing and the instantiation of an XGBoost Regression algorithm.

In [82]:
# Building a pipeline comprising two stages: preprocessing and the instantiation of an XGBoost Regression algorithm instance
xgb_pipe = Pipeline(steps=[('preprocessor', preprocessor), 
                           ('xgb_regressor', XGBRegressor(random_state=9))])

• Fitting the XGBoost Regression Predictive model with the training data.

In [83]:
# Training the XGBoost Regressor model using the selected features of the training dataset
xgb_pipe.fit(df_train_copied, df_train_copied['revenue'])
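
The full dataframe, target column included, is passed as the feature input here. That is safe with this preprocessor because `ColumnTransformer` uses `remainder='drop'` by default, so only the listed columns reach the regressor. The toy sketch below illustrates this with invented rows and with `LinearRegression` standing in for `XGBRegressor` to keep the example light; the `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Invented rows for illustration only
toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_3", "WI_2", "CA_1"],
    "month": [1, 4, 9, 9],
    "revenue": [1.94, 6.46, 0.0, 0.0],
})

# Only 'month' and 'store_id' are listed; 'revenue' is dropped by default
toy_preprocessor = ColumnTransformer(transformers=[
    ("numerical_features", StandardScaler(), ["month"]),
    ("categorical_features", OrdinalEncoder(), ["store_id"]),
])

toy_pipe = Pipeline(steps=[("preprocessor", toy_preprocessor),
                           ("regressor", LinearRegression())])
toy_pipe.fit(toy, toy["revenue"])

# The regressor sees two columns, not three
transformed = toy_pipe.named_steps["preprocessor"].transform(toy)
print(transformed.shape)
```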

• Predicting revenue utilizing the trained pipeline for the training dataset.

In [84]:
# Utilizing the trained pipeline to predict revenue for the training dataset
xgb_pipe.predict(df_train_copied)

array([3.3664663, 3.3664663, 3.3664663, ..., 2.800957 , 2.800957 ,
       2.800957 ], dtype=float32)

• Utilizing the trained pipeline to make a revenue prediction for an individual data point.

In [85]:
# Utilizing the trained pipeline to predict revenue for a single data point
# Selecting with a list of positions keeps a one-row DataFrame and preserves the column dtypes
obs = df_train_copied.iloc[[0]]
xgb_pipe.predict(obs)

array([3.3664663], dtype=float32)
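
When a single observation is passed to a pipeline, the way the row is selected affects its dtypes. The sketch below, on an invented toy frame, contrasts two common spellings: converting a row Series back to a frame and transposing it, versus selecting with a list of positions.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({"store_id": ["CA_1", "WI_2"],
                    "month": [1, 9],
                    "revenue": [1.94, 0.0]})

# A single row as a Series has one dtype for all values, so mixed columns become object
row_transposed = pd.DataFrame(toy.iloc[0]).transpose()

# A list of positions returns a one-row DataFrame with the original dtypes
row_sliced = toy.iloc[[0]]

print(row_transposed.dtypes)
print(row_sliced.dtypes)
```

Both forms have the same shape and column names; the second avoids relying on downstream transformers to convert object columns back to numbers.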

### • Model Evaluation

#### 10] Analysing Model Performance

• Assessing the performance measures of the pipeline model employing the XGBoost Algorithm.

In [86]:
# Calculating pipeline's performance scores MAS and RMSE - training dataset
print('Performance Scores of the trained Pipeline employing XGBoost Algorithm:') 
print('Training MAS score:', evaluating_mas_score(df_train_copied['revenue'], xgb_pipe.predict(df_train_copied)))
print('Training RMSE score:', evaluating_rmse_score(df_train_copied['revenue'], xgb_pipe.predict(df_train_copied)))

Performance Scores of the trained Pipeline employing XGBoost Algorithm:
Training MAS score: 4.0605
Training RMSE score: 8.7152


In [87]:
# Calculating pipeline's performance scores MAS and RMSE - validation dataset
print('Performance Scores of the trained Pipeline employing XGBoost Algorithm:') 
print('Validation MAS score:', evaluating_mas_score(df_validation_copied['revenue'], xgb_pipe.predict(df_validation_copied)))
print('Validation RMSE score:', evaluating_rmse_score(df_validation_copied['revenue'], xgb_pipe.predict(df_validation_copied)))

Performance Scores of the trained Pipeline employing XGBoost Algorithm:
Validation MAS score: 4.4432
Validation RMSE score: 10.3767


• Saving the trained XGBoost pipeline in the models directory as 'xgbrevenue_predictiveregressor.joblib'.

In [88]:
# Storing the fitted XGBoost predictive pipeline in the models directory and naming the file as 'xgbrevenue_predictiveregressor.joblib'
dump(xgb_pipe, '../../models/predictive/xgbrevenue_predictiveregressor.joblib')

['../../models/predictive/xgbrevenue_predictiveregressor.joblib']
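
Since the saved pipeline is meant to be served behind an API, it helps to confirm that a reloaded object gives the same predictions as the in-memory one. The sketch below shows that round trip with `joblib` on a small stand-in pipeline and a temporary file; the toy data, the `LinearRegression` stand-in, and the file name are assumptions for illustration, not the project's artifacts.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented data and a small stand-in pipeline
toy_X = pd.DataFrame({"month": [1, 4, 9, 12], "day_of_week": [0, 3, 5, 1]})
toy_y = pd.Series([1.94, 6.46, 0.0, 2.5])
toy_pipe = Pipeline(steps=[("scaling", StandardScaler()),
                           ("regressor", LinearRegression())])
toy_pipe.fit(toy_X, toy_y)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipe, model_path)
    restored_pipe = load(model_path)

# The restored pipeline should reproduce the original predictions
predictions_match = np.allclose(restored_pipe.predict(toy_X), toy_pipe.predict(toy_X))
print(predictions_match)
```

A joblib file should generally be loaded with the same library versions that produced it, so recording the scikit-learn and xgboost versions alongside the saved model is a sensible precaution.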