Let's install and import the required libraries.

In [None]:
%pip install numpy pandas matplotlib plotly seaborn

In [None]:
%pip install opendatasets --upgrade

In [None]:
import os
import matplotlib
import opendatasets as od
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size']=14
matplotlib.rcParams['figure.figsize']=(10,6)
matplotlib.rcParams['figure.facecolor']='#00000000'

## Step 1 - Understand Business Requirements & Nature of Data

<img src="https://i.imgur.com/63XEArk.png" width="640">


Most machine learning models are trained to serve a real-world use case. It's important to understand the business requirements, modeling objectives and the nature of the data available before you start building a machine learning model.

### Understanding the Big Picture

The first step in any machine learning problem is to read the given documentation, talk to various stakeholders and identify the following:

1. What is the business problem you're trying to solve using machine learning?
2. Why are we interested in solving this problem? What impact will it have on the business?
3. How is this problem solved currently, without any machine learning tools?
4. Who will use the results of this model, and how does it fit into other business processes?
5. How much historical data do we have, and how was it collected?
6. What features does the historical data contain? Does it contain the historical values for what we're trying to predict?
7. What are some known issues with the data (data entry errors, missing data, differences in units etc.)?
8. Can we look at some sample rows from the dataset? How representative are they of the entire dataset?
9. Where is the data stored and how will you get access to it?
10. ...


Gather as much information about the problem as possible, so that you have a clear understanding of the objective and feasibility of the project.

### Working with Real World Data

Whenever possible, try to work with real world datasets. [Kaggle](https://kaggle.com/datasets) is a great source for real-world data.

## Step 2 - Classify the problem as supervised/unsupervised & regression/classification

<img src="https://i.imgur.com/rqt2A7F.png" width="640">

Here's the landscape of machine learning ([source](https://medium.datadriveninvestor.com/machine-learning-in-10-minutes-354d83e5922e)):

<img src="https://miro.medium.com/max/842/1*tlQwBmbL6RkuuFq8OPJofw.png" width="640">



Here are the topics in machine learning that we're studying in this course ([source](https://vas3k.com/blog/machine_learning/)): 

<img src="https://i.imgur.com/VbVFAsg.png" width="640">



### Loss Functions and Evaluation Metrics

Once you have identified the type of problem you're solving, you need to pick an appropriate evaluation metric. Depending on the kind of model you train, the model will also use a loss/cost function that it optimizes during the training process.

* **Evaluation metrics** - they're used by humans to evaluate the ML model

* **Loss functions** - they're used by computers to optimize the ML model

They are often the same (e.g. RMSE for regression problems), but they can be different (e.g. Cross entropy and Accuracy for classification problems).
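To make the distinction concrete, here is a minimal sketch that computes RMSE for a toy regression problem, and both cross entropy and accuracy for a toy binary classification problem. All the numbers are made up for illustration.

```python
import numpy as np

# Regression: RMSE can serve as both the loss and the evaluation metric.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Classification: cross entropy is optimized during training,
# while accuracy is what a human usually looks at afterwards.
labels = np.array([1, 0, 1, 1])
probs = np.array([0.9, 0.2, 0.6, 0.4])  # predicted probability of class 1
cross_entropy = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
accuracy = np.mean((probs >= 0.5) == labels)

print(f"RMSE: {rmse:.2f}")
print(f"Cross entropy: {cross_entropy:.3f}, accuracy: {accuracy:.2f}")
```

Cross entropy changes smoothly as the predicted probabilities change, which makes it suitable for optimization. Accuracy only changes when a prediction crosses the 0.5 threshold.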


## Step 3 - Download, clean & explore the data and create new features

<img src="https://i.imgur.com/0f7foe7.png" width="640">

### Downloading Data

There may be different sources to get the data:

* CSV files
* SQL databases
* Raw File URLs
* Kaggle datasets 
* Google Drive
* Dropbox
* etc.

Identify the right tool/library to get the data. 

For the Rossmann Store Sales prediction dataset, we'll use the `opendatasets` library. Make sure to [accept the competition rules](https://www.kaggle.com/c/rossmann-store-sales/rules) before executing the following cell.

In [None]:
od.download('https://www.kaggle.com/c/rossmann-store-sales')

In [None]:
os.listdir('rossmann-store-sales')

In [None]:
ross_df=pd.read_csv('./rossmann-store-sales/train.csv',low_memory=False)

In [None]:
ross_df

In [None]:
store_df=pd.read_csv('./rossmann-store-sales/store.csv')

In [None]:
store_df

We can merge the two data frames to get a richer set of features for each row of the training set. 

In [None]:
merged_df=ross_df.merge(store_df,how='left',on='Store')

In [None]:
merged_df

In [None]:
merged_df.shape
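A left merge should keep exactly one output row per training row. The sketch below, on made-up toy data, shows one way to guard against accidental row duplication and to spot rows with no matching store, using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the sales and store tables.
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 6064, 8314, 0]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# validate='many_to_one' raises an error if a store appears twice in `stores`.
# indicator=True adds a `_merge` column showing where each row came from.
merged = sales.merge(stores, how='left', on='Store',
                     validate='many_to_one', indicator=True)

# Rows from `sales` that found no matching store.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(sales), len(merged), len(unmatched))
```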

The dataset also contains a test set.

In [None]:
test_df=pd.read_csv('rossmann-store-sales/test.csv')

In [None]:
merged_test_df=test_df.merge(store_df,how='left',on='Store')

In [None]:
merged_test_df

### Cleaning Data

The first step is to check the column data types and identify if there are any null values.

In [None]:
merged_df.info()

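The columns that come from `train.csv` appear to have no null values. Some of the columns merged in from `store.csv` (e.g. `Promo2SinceWeek`) can be empty for a store, so compare the non-null counts above with the total number of rows before using them.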
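A per-column count of missing values is often easier to scan than the output of `info()`. Here is a small sketch on a made-up data frame; the same two calls (`isna().sum()` and `isna().mean()`) could be applied to `merged_df`.

```python
import numpy as np
import pandas as pd

# Toy data frame with a few missing values (values are made up).
toy = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'Sales': [5263, 6064, 8314, 13995],
    'CompetitionDistance': [1270.0, 570.0, np.nan, 620.0],
    'Promo2SinceWeek': [np.nan, 13.0, 14.0, np.nan],
})

# Count and share of nulls per column, keeping only columns that have any.
null_summary = pd.DataFrame({'nulls': toy.isna().sum(), 'share': toy.isna().mean()})
null_summary = null_summary[null_summary.nulls > 0].sort_values('nulls', ascending=False)
print(null_summary)
```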

In [None]:
round(merged_df.describe().T,2)

In [None]:
merged_df.duplicated().sum()

Let's also parse the date column.

In [None]:
merged_df['Date']=pd.to_datetime(merged_df.Date)

In [None]:
merged_test_df['Date']=pd.to_datetime(merged_test_df.Date)

In [None]:
merged_df.Date.min(),merged_df.Date.max()

In [None]:
merged_test_df.Date.min(),merged_test_df.Date.max()

### Exploratory Data Analysis and Visualization

Objectives of exploratory data analysis:

- Study the distributions of individual columns (uniform, normal, exponential)
- Detect anomalies or errors in the data (e.g. missing/incorrect values)
- Study the relationship of target column with other columns (linear, non-linear etc.)
- Gather insights about the problem and the dataset
- Come up with ideas for preprocessing and feature engineering



Let's study the distribution of the target "Sales" column.

In [None]:
sns.histplot(data=merged_df,x='Sales')
plt.show()

Can you explain why the sales are 0 on so many dates? 

Let's check if this is because the store was closed.

In [None]:
merged_df.Open.value_counts()

In [None]:
merged_df.Sales.value_counts()[0]

To keep our modeling simple, let's exclude the dates when the store was closed (we can handle it as a special case while making predictions).

In [None]:
merged_df=merged_df[merged_df.Open==1].copy()

In [None]:
sns.histplot(data=merged_df,x='Sales')
plt.show()

Let's explore some other columns.

In [None]:
plt.figure(figsize=(18,8))
temp_df=merged_df.sample(40000)
sns.scatterplot(x=temp_df.Sales,y=temp_df.Customers,hue=temp_df.Date.dt.year,alpha=0.8)
plt.title('Sales vs Customers')
plt.show()

In [None]:
plt.figure(figsize=(18,8))
temp_df=merged_df.sample(10000)
sns.scatterplot(x=temp_df.Store,y=temp_df.Sales,hue=temp_df.Date.dt.year,alpha=0.8)
plt.title('Stores Vs Sales')
plt.show()

In [None]:
sns.barplot(data=merged_df,x='DayOfWeek',y='Sales')
plt.show()

In [None]:
sns.barplot(data=merged_df,x='Promo',y='Sales')
plt.show()

In [None]:
selected_cols=['Store',
               'DayOfWeek',
               'Date',
               'Sales',
               'Customers',
               'Open',
               'Promo',
               'SchoolHoliday',
               'CompetitionDistance',
               'CompetitionOpenSinceMonth',
               'CompetitionOpenSinceYear',
               'Promo2',
               'Promo2SinceWeek',
               'Promo2SinceYear']
correlation_matrix=merged_df[selected_cols].corr()
print(correlation_matrix['Sales'].sort_values(ascending=False))

### Feature Engineering

Feature engineering is the process of creating new features (columns) by transforming/combining existing features or by incorporating data from external sources.


For example, here are some features that can be extracted from the "Date" column:

1. Day of week
2. Day of month
3. Month
4. Year
5. Weekend/Weekday
6. Month/Quarter End
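
The weekend and month/quarter-end features from the list above could be derived with the pandas `.dt` accessor, as in this sketch. The column names `IsWeekend`, `IsMonthEnd` and `IsQuarterEnd` are hypothetical, and the dates are chosen only for illustration.

```python
import pandas as pd

dates = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-31', '2015-08-01', '2015-09-30', '2015-10-05'])
})

# pandas numbers weekdays 0 (Monday) to 6 (Sunday); add 1 for a 1 to 7 scale.
dates['DayOfWeek'] = dates.Date.dt.dayofweek + 1
# Saturday and Sunday count as the weekend.
dates['IsWeekend'] = (dates.Date.dt.dayofweek >= 5).astype(int)
# Flags for the last day of a month or quarter.
dates['IsMonthEnd'] = dates.Date.dt.is_month_end.astype(int)
dates['IsQuarterEnd'] = dates.Date.dt.is_quarter_end.astype(int)
print(dates)
```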


In [None]:
merged_df['Day']=merged_df.Date.dt.day
merged_df['Month']=merged_df.Date.dt.month
merged_df['Year']=merged_df.Date.dt.year

In [None]:
merged_test_df['Day']=merged_test_df.Date.dt.day
merged_test_df['Month']=merged_test_df.Date.dt.month
merged_test_df['Year']=merged_test_df.Date.dt.year

In [None]:
sns.barplot(data=merged_df,x='Year',y='Sales')
plt.show()

In [None]:
sns.barplot(data=merged_df,x='Month',y='Sales')
plt.show()

## Step 4 - Create a training/test/validation split and prepare the data for training

<img src="https://i.imgur.com/XZ9aP10.png" width="640">

### Train/Test/Validation Split

The data already contains a test set, which contains over one month of data after the end of the training set. We can apply a similar strategy to create a validation set. We'll use the last 25% of rows for the validation set, after ordering by date.

In [None]:
len(merged_df)

In [None]:
train_size=int(.75*len(merged_df))
train_size

In [None]:
sorted_df=merged_df.sort_values('Date')
train_df,val_df=sorted_df[:train_size],sorted_df[train_size:]

In [None]:
len(train_df),len(val_df)

In [None]:
train_df

In [None]:
train_df.Date.min(),train_df.Date.max()

In [None]:
val_df.Date.min(),val_df.Date.max()
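Splitting by row count can leave a single date split across the training and validation sets, since many stores share each date. One possible alternative, sketched here on toy data, is to pick a cutoff date and split on it. The toy frame and the variable names are assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy data: 5 dates, 2 stores per date.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03',
                            '2015-01-04', '2015-01-05'] * 2),
    'Store': [1] * 5 + [2] * 5,
})

toy_sorted = toy.sort_values('Date')
# Date of the row sitting at the 75% position after sorting.
cutoff = toy_sorted.Date.iloc[int(0.75 * len(toy_sorted))]

# Everything strictly before the cutoff is training data, the rest is validation.
toy_train = toy_sorted[toy_sorted.Date < cutoff]
toy_val = toy_sorted[toy_sorted.Date >= cutoff]
print(cutoff.date(), len(toy_train), len(toy_val))
```

With this approach the split is no longer exactly 75/25, but no date appears in both sets.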

In [None]:
merged_test_df.Date.min(),merged_test_df.Date.max()

In [None]:
train_df

In [None]:
train_df.columns

### Input and Target columns

Let's also identify input and target columns. Note that we can't use the number of customers as an input, because this information isn't available beforehand. Also, we needn't use all the available columns; we can start out with just a small subset.

In [None]:
input_cols=['Store','DayOfWeek','Promo','StateHoliday','StoreType','Assortment','Day','Month','Year']

In [None]:
target_col='Sales'

Let's also separate out numeric and categorical columns.

In [None]:
merged_df[input_cols].nunique()

In [None]:
train_inputs=train_df[input_cols].copy()
train_targets=train_df[target_col].copy()

In [None]:
val_inputs=val_df[input_cols].copy()
val_targets=val_df[target_col].copy()

In [None]:
test_inputs=merged_test_df[input_cols].copy()
# The test data doesn't have targets

Note that some columns can be treated as both numeric and categorical, and it's up to you to decide how you want to deal with them.

In [None]:
numeric_cols=['Store','Day','Month','Year']
categorical_cols=['DayOfWeek','Promo','StateHoliday','StoreType','Assortment']

### Imputation, Scaling and Encoding

Let's impute missing data in the numeric columns.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer=SimpleImputer(strategy='mean').fit(train_inputs[numeric_cols])

In [None]:
train_inputs[numeric_cols]=imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols]=imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols]=imputer.transform(test_inputs[numeric_cols])

Note that this step wasn't necessary for the store sales dataset, as the selected numeric columns contain no null values. Also, we can apply a different imputation strategy to different columns depending on their distributions (e.g. mean for normally distributed and median for exponentially distributed columns).
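
As a sketch of applying different strategies to different columns, scikit-learn's `ColumnTransformer` can route each column to its own `SimpleImputer`. The `Temperature` column below is hypothetical and the values are made up; only `CompetitionDistance` is a column name from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    'Temperature': [20.0, 22.0, np.nan, 24.0],               # roughly symmetric
    'CompetitionDistance': [100.0, 200.0, np.nan, 9000.0],   # heavily skewed
})

# Mean for the symmetric column, median for the skewed one.
imputers = ColumnTransformer([
    ('mean', SimpleImputer(strategy='mean'), ['Temperature']),
    ('median', SimpleImputer(strategy='median'), ['CompetitionDistance']),
])

# Output columns follow the order of the transformers listed above.
filled = pd.DataFrame(imputers.fit_transform(toy), columns=toy.columns)
print(filled)
```

The median is less affected by the outlier (9000), which is why it is often preferred for skewed columns.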

Let's now scale the numeric values to the $[0, 1]$ range.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler=MinMaxScaler().fit(train_inputs[numeric_cols])

In [None]:
train_inputs[numeric_cols]=scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols]=scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols]=scaler.transform(test_inputs[numeric_cols])

Finally, let's encode categorical columns as one-hot vectors.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoder=OneHotEncoder(sparse_output=False,handle_unknown='ignore').fit(train_inputs[categorical_cols])
encoded_cols=list(encoder.get_feature_names_out(categorical_cols))

In [None]:
train_inputs[encoded_cols]=encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols]=encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols]=encoder.transform(test_inputs[categorical_cols])

Let's now extract out the numeric data.

In [None]:
X_train=train_inputs[numeric_cols+encoded_cols]
X_val=val_inputs[numeric_cols+encoded_cols]
X_test=test_inputs[numeric_cols+encoded_cols]

## Step 5 - Create quick & easy baseline models to benchmark future models

<img src="https://i.imgur.com/1DLgiEz.png" width="640">

A quick baseline model helps establish the minimum score any ML model you train should achieve.


### Fixed/Random Guess

Let's define a model that always returns the mean value of Sales as the prediction.

In [None]:
def return_mean(inputs):
    return np.full(len(inputs),merged_df.Sales.mean())

In [None]:
train_preds=return_mean(X_train)
train_preds

Let's evaluate this model using the RMSE score.

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
np.sqrt(mean_squared_error(train_preds,train_targets))

In [None]:
np.sqrt(mean_squared_error(return_mean(X_val),val_targets))

The RMSE is about $3000, so the model's predictions are typically off by roughly that amount.

Let's try another model, which makes a random guess between the lowest and highest sale.

In [None]:
def guess_random(inputs):
    lo,hi=merged_df.Sales.min(),merged_df.Sales.max()
    return np.random.random(len(inputs))*(hi-lo)+lo

In [None]:
train_preds=guess_random(X_train)
train_preds

In [None]:
np.sqrt(mean_squared_error(train_preds,train_targets))

In [None]:
np.sqrt(mean_squared_error(guess_random(X_val),val_targets))

Clearly, this model is much worse.
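
scikit-learn also ships a `DummyRegressor` that implements this kind of fixed-guess baseline with the usual `fit`/`predict` interface. Here is a small sketch on made-up targets; the inputs are ignored by the dummy model, so zeros are used as placeholders.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Placeholder inputs (ignored by the dummy model) and made-up targets.
toy_inputs = np.zeros((5, 1))
toy_targets = np.array([4000.0, 5000.0, 6000.0, 7000.0, 8000.0])

# strategy='mean' always predicts the mean of the targets seen during fit.
dummy = DummyRegressor(strategy='mean').fit(toy_inputs, toy_targets)
dummy_preds = dummy.predict(toy_inputs)
dummy_rmse = np.sqrt(mean_squared_error(toy_targets, dummy_preds))
print(dummy_preds, dummy_rmse)
```

Because the mean is learned in `fit`, calling it on the training targets only keeps validation data out of the baseline.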

### Baseline ML model

Let's train a simple `LinearRegression` model, with no customization.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linreg=LinearRegression()

In [None]:
linreg.fit(X_train,train_targets)

`model.fit` uses the following workflow for training the model ([source](https://www.deepnetts.com/blog/from-basic-machine-learning-to-deep-learning-in-5-minutes.html)):

1. We initialize a model with random parameters (weights & biases).
2. We pass some inputs into the model to obtain predictions.
3. We compare the model's predictions with the actual targets using the loss function.  
4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting the weights & biases of the model.
5. We repeat steps 2 to 4 till the predictions from the model are good enough.
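
The workflow above can be sketched as a gradient descent loop for a one-feature linear model. This is an illustration on synthetic data, not what scikit-learn's `LinearRegression` does internally (it solves a least squares problem directly rather than iterating). The learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 3x + 2 plus a little noise.
X = rng.uniform(0, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

# 1. Initialize the parameters randomly.
w, b = rng.normal(), rng.normal()
lr = 0.5  # learning rate

for epoch in range(500):
    # 2. Pass inputs through the model to obtain predictions.
    preds = w * X[:, 0] + b
    # 3. Compare predictions with targets using the loss (mean squared error).
    error = preds - y
    loss = np.mean(error ** 2)
    # 4. Adjust the weight and bias along the negative gradient of the loss.
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}, loss={loss:.4f}")
```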


<img src="https://www.deepnetts.com/blog/wp-content/uploads/2019/02/SupervisedLearning.png" width="480">

Now that we have fit the model, it can be used to make predictions. Note that the parameters of the model will not be updated during prediction.



In [None]:
train_preds=linreg.predict(X_train)
train_preds

In [None]:
np.sqrt(mean_squared_error(train_preds,train_targets))

In [None]:
val_preds=linreg.predict(X_val)
val_preds

In [None]:
np.sqrt(mean_squared_error(val_preds,val_targets))

Note that a simple linear regression model isn't much better than our fixed baseline model which always predicts the mean.

Based on the above baselines, we now know that any model we train should ideally have an RMSE score lower than $2800. This baseline can also be conveyed to other stakeholders to get a sense of whether the range of loss makes sense.

## Step 6 - Pick a strategy, train a model & tune hyperparameters

<img src="https://i.imgur.com/aRuE5mw.png" width="640">


### Systematically Exploring Modeling Strategies

Scikit-learn offers the following cheatsheet to decide which model to pick.

![](https://scikit-learn.org/stable/_static/ml_map.png)


Here's the general strategy to follow:

- Find out which models are applicable to the problem you're solving.
- Train a basic version for each type of model that's applicable.
- Identify the modeling approaches that work well and tune their hyperparameters.
- [Use a spreadsheet](Machine%20Learning%20Experiment%20Tracking.xlsx) to keep track of your experiments and results.

Let's define a function `try_model`, which takes a model, then performs training and evaluation.

In [None]:
def try_model(model):
    # Fit the model
    model.fit(X_train,train_targets)

    # Generate predictions
    train_preds=model.predict(X_train)
    val_preds=model.predict(X_val)

    # Compute RMSE
    train_rmse=np.sqrt(mean_squared_error(train_targets,train_preds))
    val_rmse=np.sqrt(mean_squared_error(val_targets,val_preds))
    return train_rmse,val_rmse

### Linear Models

Read about linear models here: https://scikit-learn.org/stable/modules/linear_model.html

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso,ElasticNet,SGDRegressor

In [None]:
try_model(LinearRegression())

In [None]:
try_model(Ridge())

In [None]:
try_model(Lasso())

In [None]:
try_model(ElasticNet())

In [None]:
try_model(SGDRegressor())
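
Instead of reading the scores off one cell at a time, the results can be collected into a data frame, which is one way to keep the experiment log mentioned earlier. This sketch uses synthetic data from `make_regression` so that it runs on its own; the `rmse` helper and `results_df` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the training and validation sets.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = X[:225], X[225:], y[:225], y[225:]

def rmse(targets, preds):
    return np.sqrt(mean_squared_error(targets, preds))

models = {'LinearRegression': LinearRegression(), 'Ridge': Ridge(), 'Lasso': Lasso()}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({'model': name,
                 'train_rmse': rmse(y_tr, model.predict(X_tr)),
                 'val_rmse': rmse(y_va, model.predict(X_va))})

# Best validation score first.
results_df = pd.DataFrame(rows).sort_values('val_rmse').reset_index(drop=True)
print(results_df)
```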

### Tree Based Models

* Decision trees: https://scikit-learn.org/stable/modules/tree.html
* Random forests and gradient boosting: https://scikit-learn.org/stable/modules/ensemble.html

In [None]:
from sklearn.tree import DecisionTreeRegressor,plot_tree

In [None]:
tree=DecisionTreeRegressor(random_state=42)
try_model(tree)

In [None]:
plt.figure(figsize=(40,20))
plot_tree(tree,max_depth=3,filled=True,feature_names=numeric_cols+encoded_cols)
plt.show()

Let's try a random forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
%%time
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
try_model(rf)

We've seen a significant reduction in the loss by using a random forest. 

## Step 7 - Experiment and combine results from multiple strategies

<img src="https://i.imgur.com/ZqM6R8w.png" width="640">

In general, the following strategies can be used to improve the performance of a model:

- Gather more data. A greater amount of data can let you learn more relationships and generalize the model better.
- Include more features. The more relevant the features for predicting the target, the better the model gets.
- Tune the hyperparameters of the model. Increase the capacity of the model while ensuring that it doesn't overfit.
- Look at the specific examples where the model makes incorrect or bad predictions and gather some insights.
- Try strategies like grid search for hyperparameter optimization and K-fold cross validation.
- Combine results from different types of models (ensembling), or train another model using their results.

### Hyperparameter Optimization & Grid Search

You can tune hyperparameters manually, or use an automated tuning strategy like random search or grid search. Follow this tutorial for hyperparameter tuning using grid search: https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

<img src="https://i.imgur.com/EJCrSZw.png" width="480">
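
Here is a minimal grid search sketch using scikit-learn's `GridSearchCV` on synthetic data. The parameter grid is deliberately tiny so that it runs quickly, and the values in it are arbitrary rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Every combination of these values is tried (2 x 2 = 4 candidates).
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```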

### K-Fold Cross Validation


Here's what K-fold cross validation looks like visually ([source](https://vitalflux.com/k-fold-cross-validation-python-example/)):

<img src="https://i.imgur.com/MxnzWwT.png" width="480">

Follow this tutorial to apply K-fold cross validation: https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
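
As a small sketch, `cross_val_score` can run K-fold cross validation in a couple of lines. The data here is synthetic. For time-ordered data such as daily sales, `TimeSeriesSplit` may be a better fit than shuffled folds, because each validation fold then comes after its training rows; it is shown alongside as an option.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Standard K-fold: 5 shuffled folds, each used once for validation.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = -cross_val_score(Ridge(), X, y, cv=kfold,
                          scoring='neg_root_mean_squared_error')

# Time-aware alternative: validation rows always follow the training rows.
ts_scores = -cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5),
                             scoring='neg_root_mean_squared_error')

print(scores.mean(), ts_scores.mean())
```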

### Ensembling and Stacking

Ensembling refers to combining the results of multiple models. Here's what ensembling looks like visually ([source](https://www.kdnuggets.com/2019/01/ensemble-learning-5-main-approaches.html)):

<img src="https://i.imgur.com/rrOKVEd.png" width="480">


Stacking is a more advanced version of ensembling, where we train another model using the results from multiple models. Here's what stacking looks like visually ([source](https://medium.com/ml-research-lab/stacking-ensemble-meta-algorithms-for-improve-predictions-f4b4cf3b9237)): 

<img src="https://i.imgur.com/VVzCWNB.png" width="400">

Here's a tutorial on stacking: https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/
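
The sketch below contrasts the two ideas on synthetic data: a simple ensemble that averages the predictions of two models, and a stacked model built with scikit-learn's `StackingRegressor`. The choice of base models and their settings is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = X[:225], X[225:], y[:225], y[225:]

# Ensembling: average the predictions of two independently trained models.
ridge = Ridge().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
avg_preds = (ridge.predict(X_va) + forest.predict(X_va)) / 2

# Stacking: a final model learns how to combine the base models' predictions.
stack = StackingRegressor(
    estimators=[('ridge', Ridge()),
                ('forest', RandomForestRegressor(n_estimators=50, random_state=42))],
    final_estimator=LinearRegression(),
    cv=3,
)
stack.fit(X_tr, y_tr)
stack_preds = stack.predict(X_va)

print(avg_preds[:3], stack_preds[:3])
```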

## Step 8 - Interpret models, study individual predictions & present your findings

<img src="https://i.imgur.com/9axhOrA.png" width="640">

### Feature Importance

You'll need to explain why your model returns a particular result. Most scikit-learn models offer some kind of "feature importance" score.

In [None]:
rf.feature_importances_

In [None]:
importance_df=pd.DataFrame({
    'feature':numeric_cols+encoded_cols,
    'importance':rf.feature_importances_
}).sort_values('importance',ascending=False)
importance_df.head(10)

In [None]:
sns.barplot(data=importance_df.head(10),x='importance',y='feature')
plt.show()

The above chart can be presented to non-technical stakeholders to explain how the model arrives at its result. For greater explainability, a single decision tree can be used.
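
Impurity-based importances such as `feature_importances_` are computed from the training data. As a complementary check, scikit-learn's `permutation_importance` measures how much a score drops when one feature's values are shuffled on held-out data. This is a sketch on synthetic data; the feature names are made up.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where only 2 of the 4 features carry signal.
X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       noise=5.0, random_state=42)
X_tr, X_va, y_tr, y_va = X[:225], X[225:], y[:225], y[225:]
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the validation set and record the score drop.
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=42)

perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
}).sort_values('importance', ascending=False)
print(perm_df)
```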

### Looking at individual predictions

In [None]:
def predict_input(model,single_input):
    if single_input['Open']==0:
        return 0
    input_df=pd.DataFrame([single_input])
    input_df['Date']=pd.to_datetime(input_df.Date)
    input_df['Day']=input_df.Date.dt.day
    input_df['Month']=input_df.Date.dt.month
    input_df['Year']=input_df.Date.dt.year
    input_df[numeric_cols]=imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols]=scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols]=encoder.transform(input_df[categorical_cols])
    X_input=input_df[numeric_cols+encoded_cols]
    pred=model.predict(X_input)[0]
    return pred

In [None]:
sample_input = {'Id': 1,
 'Store': 1,
 'DayOfWeek': 4,
 'Date': '2015-09-17 00:00:00',
 'Open': 1.0,
 'Promo': 1,
 'StateHoliday': '0',
 'SchoolHoliday': 0,
 'StoreType': 'c',
 'Assortment': 'a',
 'CompetitionDistance': 1270.0,
 'CompetitionOpenSinceMonth': 9.0,
 'CompetitionOpenSinceYear': 2008.0,
 'Promo2': 0,
 'Promo2SinceWeek': np.nan,
 'Promo2SinceYear': np.nan,
 'PromoInterval': np.nan}

sample_input

In [None]:
predict_input(rf,sample_input)

Look at various examples from the training, validation and test sets to decide if you're happy with the result of your model.

### Presenting your results

* Create a presentation for non-technical stakeholders
* Understand your audience - figure out what they care about most
* Avoid showing any code or technical jargon, include visualizations
* Focus on metrics that are relevant for the business
* Talk about feature importance and how to interpret results
* Explain the strengths and limitations of the model
* Explain how the model can be improved over time

### Making a submission on Kaggle

If you're participating in a Kaggle competition, you can generate a submission CSV file and make a submission to check your score on the test set.

In [None]:
test_preds=rf.predict(X_test)
test_preds

In [None]:
submission_df=pd.read_csv('./rossmann-store-sales/sample_submission.csv')

In [None]:
submission_df['Sales']=test_preds*test_df['Open'].astype('float')

In [None]:
submission_df.fillna(0,inplace=True)

In [None]:
submission_df.to_csv('submission.csv',index=False)

In [None]:
!head submission.csv

In [None]:
from IPython.display import FileLink

In [None]:
FileLink('submission.csv')

You can now make a submission on this page: https://www.kaggle.com/c/rossmann-store-sales/submit
