In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


# What is the "Recipe" for Machine Learning

We will define a methodical approach to solve problems using Machine Learning.

This is the *Recipe for Machine Learning*

<table>
    <tr>
        <th><center>Recipe for Machine Learning</center></th>
    </tr>
    <tr>
        <td><img src="images/ML_process.jpg" width=800></td>
    </tr>
</table>

There are no short-cuts !

Each step in the Recipe both prepares you for the next and, crucially, gives you *deeper insight*
that improves the result.

# Get the data

The first step is obtaining data for training and evaluation.

This is often the most challenging part !
- Interesting data is scattered: requires collection
- Supervised Learning requires labelled data; where do the labels come from ?

In this course, we will usually provide you with data so you will be mostly insulated from this challenge.
- Knowing how to obtain data is a valuable skill
    - Web scraping

Let's visit the notebook section [Get the Data](Recipe_for_ML.ipynb#Recipe-step-A:-Get-the-data)

## Look at the data

Always put your eyes on the data !
- You will learn about its "shape":
    - tabular ?  
    - What are the attribute names ?
    - What are the types of the attributes ? Numeric ? Text ?
- You will learn about potential data problems
    - missing data
    - strange values
    
Don't even try to do anything with your data until you have at least the most basic understanding by
performing an inspection.
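
A first inspection can be sketched in a few lines. This is only an illustration: the toy table, its column names, and the `-1` price are hypothetical stand-ins for whatever data you have actually loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for freshly loaded data
df = pd.DataFrame({
    "sqft": [850.0, 1200.0, np.nan, 1500.0],
    "neighborhood": ["north", "south", "south", None],
    "price": [200_000, 310_000, 275_000, -1],  # -1 is a "strange" value
})

print(df.shape)     # number of examples (rows) and attributes (columns)
print(df.dtypes)    # type of each attribute: numeric or text
print(df.head())    # the first few rows, to see actual values

# Count the missing values in each attribute
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summary statistics expose odd ranges, such as a negative minimum price
print(df.describe())
```

Even on this tiny table, the inspection surfaces one missing value in each of two attributes and a minimum price that cannot be real.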

Let's visit the notebook and [Look at the data](Recipe_for_ML.ipynb#Recipe-A.2:-Have-a-look-at-the-data)

## Define a Performance Measure

Our model "learns" from training data, so we might expect it to predict well on training examples
- the training examples are *in sample*: used by the model to learn $\Theta$

How well should I expect the model to predict on  examples not encountered during training ?
- "test" examples never seen during training, called *out of sample* examples

We define a *Performance Measure* to measure how well the model performs out of sample.

Let's visit the notebook and [Define a Performance Measure](Recipe_for_ML.ipynb#select_performance_measure)


### Performance Measure versus Loss Function

There may be some confusion between the Performance Measure and the Loss Function
- they are both evaluated over a set of examples
- they both measure performance of some sort


A Performance Measure can be thought of as the promise you make
- to a client/customer/boss
- on how well your model will perform on arbitrary, yet to be seen examples (non-training, out-of-sample)

In order for you to have confidence in your promise
- you evaluate the Performance Measure on *out of sample* examples
- using the out of sample examples *once* so that your model doesn't learn from them (i.e., become in-sample)

To illustrate, let
- $\X$ denote our set of training examples, $\x^\ip \in \X$
- $\Xt$ denote a set of test examples (out of sample: not used in training), $\xt^\ip \in \Xt$


<table>
    <tr>
        <th><center>Loss on  <i>Training example</i></center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_training.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Performance on  <i>Test example</i></center></th>
    </tr>
    <tr>
        <td><img src="images/Performance_measure.png"></td>
    </tr>
</table>

- A Performance Measure
    - is a property of the *problem* (not the model used to solve the problem)
    - you may have more than one Performance Measure
        - each expressing some desired quality of the prediction
    - is evaluated *out of sample*, that is, on non-training examples

- The Loss Function
    - is a property of a *model*: it guides a particular model's search for the best $\Theta$
        - different models may have different Loss Functions
            - but the *problem's* Performance Measure is the same
    - is evaluated *in sample*, that is, on training data
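
The distinction can be made concrete with a small sketch. Everything here is an assumption made for illustration: the synthetic data, the choice of Mean Squared Error as the Loss minimized by Linear Regression, and the choice of RMSE as the Performance Measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 3x + 2 + noise
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# The first 150 examples are used for training; the last 50 are held out
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def add_intercept(x):
    """Prepend a constant feature so the model can learn an intercept."""
    return np.column_stack([np.ones_like(x), x])

# Loss Function (property of the model): least squares minimizes the
# Mean Squared Error over the *training* examples to find Theta
theta, *_ = np.linalg.lstsq(add_intercept(x_train), y_train, rcond=None)
in_sample_loss = float(np.mean((y_train - add_intercept(x_train) @ theta) ** 2))

def rmse(y_true, y_pred):
    """Performance Measure (property of the problem): root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Evaluated on examples the model never saw during training
out_of_sample_rmse = rmse(y_test, add_intercept(x_test) @ theta)

print("in-sample loss (MSE):", in_sample_loss)
print("out-of-sample performance (RMSE):", out_of_sample_rmse)
```

A different model with a different Loss Function could be swapped in, while `rmse` on the held-out examples would remain the yardstick.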

## Create a test set

Let's visit the notebook and [Create a test set](Recipe_for_ML.ipynb#Recipe-A.4:-Create-a-test-set-and-put-it-aside-!)
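
Setting a test set aside can be sketched as a single shuffled split. The helper `split_train_test` and its parameters are hypothetical names introduced for this sketch; the notebook may use a library routine instead.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the examples once and hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)      # fixed seed: the split is repeatable
    shuffled = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 examples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), "training examples;", len(X_test), "test examples")
```

Fixing the seed matters: if the split changed on every run, test examples would eventually leak into training.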

# Exploratory Data Analysis

This is one of the key steps of a good Data Scientist.

Besides "seeing" the data, we need to hear it: what is it telling us that may aid prediction ?

- any problems with the data that would inhibit learning ?
- any apparent relationship between target and a single feature ?
- any apparent relationship between target and combinations of features ?
- any apparent relationship between features ?
- what are the relative magnitudes of features ?

Often,  understanding the data intimately can lead to
- transformations of the features that will aid prediction
- improved models
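
Some of the questions above can be given a first, rough answer with correlations and standard deviations. The sketch below uses hypothetical synthetic features (`informative`, `irrelevant`, `large_scale`); correlation only detects *linear* relationships, so it complements plots rather than replacing them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical features: one informative, one irrelevant,
# and one that echoes the informative feature on a much larger scale
informative = rng.normal(0, 1, n)
irrelevant = rng.normal(0, 1, n)
large_scale = 1000.0 * informative + rng.normal(0, 500, n)
target = 2.0 * informative + rng.normal(0, 0.5, n)

features = {
    "informative": informative,
    "irrelevant": irrelevant,
    "large_scale": large_scale,
}

# Relationship between the target and each single feature
corr_with_target = {
    name: float(np.corrcoef(col, target)[0, 1]) for name, col in features.items()
}

# Relationship between two features
corr_between = float(np.corrcoef(informative, large_scale)[0, 1])

# Relative magnitudes of the features
scales = {name: float(np.std(col)) for name, col in features.items()}

print(corr_with_target)
print(corr_between)
print(scales)
```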

Let's visit the notebook and perform [Exploratory Data Analysis](Recipe_for_ML.ipynb#Recipe-Step-B:-Exploratory-Data-Analysis-(EDA))

# Prepare the data

It is not always the case that the data in "raw" form is adequate for modelling
- Cleanliness
    - dealing with missing data or anomalous values
- Numericalization
    - Converting non-numeric/categorical data into appropriate numbers
- Scaling, normalization
    - putting features on compatible scales
- Creating new "synthetic" features from original features
    - Knowing when/how to do this is what separates a good Data Scientist from an average one


We will call the process of preparing the data *Transformations* or *Feature Engineering*.

Transformation takes an example in raw form and creates a "processed" example suitable for modelling.

It is important to emphasize that "example" means either training, validation, or test.

Always apply transformations consistently to all examples
- In particular: transformations applied to a training example should be applied to test examples at inference time
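
Consistency can be sketched with a scaling transformation: its parameters are learned from the training examples only and then reused, unchanged, on test examples. The helper names `fit_standardizer` and `standardize` are hypothetical, chosen for this illustration.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn the transformation's parameters from training examples only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0   # avoid dividing by zero for constant features
    return mean, std

def standardize(X, mean, std):
    """Apply the *same* transformation to any example: train, validation or test."""
    return (X - mean) / std

# Hypothetical data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = standardize(X_train, mean, std)
X_test_scaled = standardize(X_test, mean, std)   # reuses the training mean and std

print(X_train_scaled)
print(X_test_scaled)
```

Computing a separate mean and standard deviation from the test examples would be a different transformation, and it would use information that is not available at inference time.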

Let's visit the notebook [Prepare the data](Recipe_for_ML.ipynb#Recipe-Step-C:-Prepare-the-data)

# Train a model

The model is our "predictor": the machine that takes features and produces predictions.

All the prior steps of the recipe were "prep-work": preparing the ingredients (data) for cooking (modelling)

Unlike actual cooking, this step is *iterative*
- we try one model
- fit the model to the data
- examine the results critically
    - has the Loss improved ? Is it good enough ?
    - learn lessons from errors
- improve the model and repeat

The iterative nature is often overlooked in the rush to learn models.

But Error Analysis is key to guiding us on the weaknesses of the existing model, and to improving the model.

<table>
    <tr>
        <th><center>Iterative training</center></th>
    </tr>
    <tr>
        <td><img src="images/ML_process_iterate.jpg"></td>
    </tr>
</table>

## Select a model and train it
Let's move to the notebook and [Select and Train a model](Recipe_for_ML.ipynb#Recipe-Step-D:-Train-a-model)

## Validation and Cross Validation

We have just fit our first model and evaluated the Performance Measure on the test examples.

Can we continue trying to improve the model and re-evaluate the Performance Measure on the same test examples ?

No !  By seeing the "out of sample" examples, we have made them "in-sample"
- We can improve our model by causing it to perform well on the test examples
- Result won't be a realistic measure of performance on unseen examples
    - Like seeing the questions before the exam !

Fortunately, there is a way to both
- save your test examples for single use
- create pseudo-test examples that can be reused
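
One way to create reusable pseudo-test examples is $k$-fold Cross Validation, sketched below on training data only. The helper names, the synthetic data, and the use of RMSE as the Performance Measure are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares Linear Regression with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict_linear(theta, X):
    return np.column_stack([np.ones(len(X)), X]) @ theta

def cross_val_rmse(X, y, k=5, seed=0):
    """Each fold takes one turn as the validation set; the rest are used to fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = fit_linear(X[train_idx], y[train_idx])
        errors = y[val_idx] - predict_linear(theta, X[val_idx])
        scores.append(float(np.sqrt(np.mean(errors ** 2))))
    return scores

# Hypothetical *training* data; the real test set stays untouched
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

scores = cross_val_rmse(X, y, k=5)
mean_score = float(np.mean(scores))
print("RMSE per fold:", scores)
print("mean RMSE:", mean_score)
```

Every training example is used for validation exactly once, so the averaged score is less dependent on one lucky or unlucky split.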

Let's return to the notebook and explore [Validation and Cross Validation](Recipe_for_ML.ipynb#Recipe-D.3:--Validation-and-Cross-Validation)

## Error analysis

Now that we have fit a model, we have 
- an estimate of Performance Measure using Validation/Cross Validation

If this measure is not "good enough", we will want to improve predictions
- we might improve prediction by *changing* to a different model
- we might improve prediction by *adding features*

How do we know if the Performance Measure is "good enough" and how to improve our model ?

Unfortunately, many people (and courses!) don't explore this enough.

Let's move to the notebook to [Examine the errors](Recipe_for_ML.ipynb#Recipe-D.4:--Error-analysis)
to see why a deeper analysis may be warranted.



## Iterate: Linear Regression with higher order features

The Error Analysis we performed on our first model (single non-constant feature) suggested a need for improvement.

Two types of improvement come to mind
- Hypothesis iteration: try  a different model
    
- Feature iteration: change/add to the features of the current model
    - adding a previously discarded feature
    - creating a synthetic feature


We will take the approach of adding a feature.
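
As a minimal sketch of feature iteration, assume a hypothetical, noiseless target that is quadratic in a single feature $x$. The model stays a Linear Regression; only the features change, with $x^2$ added as a synthetic feature.

```python
import numpy as np

# Hypothetical noiseless data with curvature
x = np.linspace(-3, 3, 50)
y = 0.5 * x ** 2 + x + 1.0

def fit_and_mse(features, y):
    """Fit Linear Regression (with intercept) and return Theta and in-sample MSE."""
    A = np.column_stack([np.ones(len(y))] + features)
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, float(np.mean((y - A @ theta) ** 2))

# First model: a single non-constant feature
theta_1, mse_first_order = fit_and_mse([x], y)

# Feature iteration: add the synthetic second order feature x^2
theta_2, mse_second_order = fit_and_mse([x, x ** 2], y)

print("MSE with x only:   ", mse_first_order)
print("MSE with x and x^2:", mse_second_order)
```

The model is still linear in $\Theta$; the curvature is captured entirely by the added feature.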

Let's extend Linear Regression with [higher order features](Linear_Regression_HigherOrderFeatures.ipynb) (separate notebook).

## When to stop iterating

Adding second order features resulted in a perfect in-sample fit, so there is no point iterating further.

In general, this will not be the case.

How do we know that our model is "good enough" ?

We will postpone this question until we do a Deeper Dive in [Bias and Variance](Bias_and_Variance.ipynb)

# Fine tune

There are often "tweaks" that can be applied to a near-final model in order to squeeze out increased
performance.

For example: many models have *hyper parameters*.

These are values that are *chosen* at model construction, rather than *discovered* by fitting during training ($\Theta$)

- the degree $d$ of the polynomial when constructing higher order features $\x^d$
- whether to include/exclude the intercept $\Theta_0$ in a Linear Regression
- strength of the regularization penalty (coming attraction: to be discussed together with the Loss function)
- the $k$ in K Nearest Neighbors

Perhaps a different choice of a hyper-parameter would improve the model ?

We can try many choices before settling on the one giving the best Performance Measure.

Hyper parameter search is another reason for using Cross Validation
- we can't use the Test set more than once
- with a single Validation set: we might overfit to the validation set
    - that is, choose a value for the hyper parameter that is best for this *single* validation set
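
A hyper parameter search with Cross Validation can be sketched as follows, taking the polynomial degree $d$ as the hyper parameter. The data, the candidate degrees, and the helper names are hypothetical; the same folds are reused for every candidate so that the scores are comparable.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x^0, x^1, ..., x^degree (the x^0 column acts as the intercept)."""
    return np.vander(x, degree + 1, increasing=True)

def cv_rmse(x, y, degree, k=5, seed=0):
    """Mean validation RMSE over k folds for one choice of the degree."""
    rng = np.random.default_rng(seed)   # same seed, hence same folds, per degree
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(
            poly_features(x[train], degree), y[train], rcond=None
        )
        err = y[val] - poly_features(x[val], degree) @ theta
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores))

# Hypothetical training data with a quadratic relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x ** 2 + x + 1.0 + rng.normal(0, 0.3, size=200)

candidate_degrees = [1, 2, 3, 4, 5]
cv_scores = {d: cv_rmse(x, y, d) for d in candidate_degrees}
best_degree = min(cv_scores, key=cv_scores.get)

print(cv_scores)
print("chosen degree:", best_degree)
```

The Test set plays no role in the search; it is still reserved for the single, final evaluation.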

We will perform a Deeper Dive in Fine Tuning in a separate module.

# Recap

- We have briefly detailed the multi-step process for Machine Learning
- This should be a model for you (and your assignments !)
- We will perform Deeper Dives on some of the steps.

In [2]:
print("Done !")

Done !
