In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import training_models_helper as tmh
%aimport training_models_helper

tm = tmh.TrainingModelsHelper()

import svm_helper
%aimport svm_helper
svmh = svm_helper.SVM_Helper()

kn = tmh.KNN_Helper()

import transform_helper
%aimport transform_helper

th = transform_helper.Transformation_Helper()

iph = transform_helper.InfluentialPoints_Helper()

# Understanding the Loss function

In performing Error Analysis (post-training) out of sample, we *identified* examples where our model failed to generalize.

What can we do to make the model better ?  How can we influence the model's choice of $\Theta$ to lead
to a better fit ?



In the event that the Performance Metric (evaluated out of sample) and the Loss Function (evaluated in sample)
differ
- We must see how we can influence the Loss function
- In the hope that better in sample performance leads to better out of sample performance

Now is a good time to recall the distinction between Accuracy (Performance Metric) and Cross Entropy (Loss function) for classification

Recall the mapping of probability to prediction

$$
\hat{y}^\ip = 
\left\{
    {
    \begin{array}{lll}
     \text{Negative} & \textrm{if } \hat{p}^\ip   < 0.5  \\
     \text{Positive}& \textrm{if } \hat{p}^\ip \ge 0.5 
    \end{array}
    }
\right.
$$

where, for Logistic Regression,  probability $\hat{p}^\ip$ is a function of $\x^\ip$ and parameters $\Theta$.
$$
\hat{p}^\ip = \sigma( \Theta^T \cdot \x^\ip )
$$
- Accuracy (for example $i$) won't *necessarily* vary with $\Theta$ unless $\hat{p}^\ip$ crosses the threshold of $0.5$
- But $\hat{p}^\ip$ will vary with $\Theta$



Thus, a $\Theta'$ which pushes $\hat{p}^\ip$ closer to the correct probability (0 or 1)
may be preferred to a $\Theta$ that leaves $\hat{p}^\ip$ farther away.
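
To make the distinction concrete, here is a minimal sketch comparing Accuracy and Cross Entropy. The labels and the two sets of probabilities (standing in for the outputs of $\Theta$ and $\Theta'$) are made-up values for illustration, and the helper functions are not part of this notebook's helper modules.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Per-example cross entropy for labels y in {0, 1} and predicted probabilities p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return np.mean((p >= threshold).astype(int) == y)

# Hypothetical labels (1 = Positive, 0 = Negative)
y = np.array([1, 1, 0, 0])

# Hypothetical probabilities produced by two parameter choices
p_theta       = np.array([0.60, 0.55, 0.45, 0.40])   # barely on the correct side of 0.5
p_theta_prime = np.array([0.90, 0.85, 0.15, 0.10])   # much closer to the correct 0 or 1

acc_theta = accuracy(y, p_theta)
acc_theta_prime = accuracy(y, p_theta_prime)
loss_theta = binary_cross_entropy(y, p_theta).mean()
loss_theta_prime = binary_cross_entropy(y, p_theta_prime).mean()

print(f"Accuracy:      {acc_theta:.2f} vs {acc_theta_prime:.2f}")
print(f"Cross entropy: {loss_theta:.3f} vs {loss_theta_prime:.3f}")
```

Both parameter choices classify every example correctly, so Accuracy cannot tell them apart, while Cross Entropy is lower for the second one.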

In this section
- We will be using *training examples* (in-sample) rather than out of sample data.
- In an attempt to reduce the Loss function
- Under the assumption that better in sample performance will lead to better out of sample performance



Recall that 
- the model is a function of parameters $\Theta$
- $\Theta$ is found by minimizing Average Loss $\loss_\Theta$
- The Average Loss is the average of the per-examples losses $\loss_\Theta^\ip, i=1, \ldots, m$



<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_training_1.png"></td>
    </tr>
</table>


<table>
    <tr>
        <th><center>Training</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_training_2.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_training.png"></td>
    </tr>
</table>


<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_error_analysis.png"></td>
    </tr>
</table>


# Conditional loss

The key to improving the model is understanding how each per-example loss contributes
to the optimizer's choice of $\Theta$.

One way to try to improve the model is to look at the per-example losses in Training
- similar to the way we looked at Errors out of sample
- colors represent groups of similar examples


<table>
    <tr>
        <th><center>Loss analysis: conditional loss</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_error_analysis_1.png"></td>
    </tr>
</table>
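
One way to compute a conditional loss is sketched below: group the per-example training losses by some attribute and compare the groups. The group labels and loss values are hypothetical; in practice they would come from your own model and from whatever attribute you use to group similar examples.

```python
import pandas as pd

# Hypothetical per-example losses, each tagged with the group the example belongs to
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "loss":  [0.10, 0.20, 0.15, 1.20, 0.90, 0.05],
})

# Conditional loss: statistics of the loss within each group
summary = df.groupby("group")["loss"].agg(["count", "mean", "sum"])

# Fraction of the total loss that each group is responsible for
summary["share_of_total"] = summary["sum"] / df["loss"].sum()

print(summary.sort_values("mean", ascending=False))
```

A group with a high mean loss is poorly fit; a group with a high share of the total loss is one the optimizer already "feels" strongly.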


## What can we do to reduce loss ?

Understanding the per-example loss can help you "push" the optimizer toward finding a "better" $\Theta$.

We will outline some simple strategies via examples that identify a problem and propose a solution.



## Increase the number of "problem" training examples

In our MNIST digit classification error analysis, we identified a certain sub-class of the digit "8" that was mis-classified
- at least one of the "holes" in the 8 was very small
- the digit was slanted in the "opposite" direction

Recall
$$
\loss_\Theta  = { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta
$$

So problem example $i$'s contribution to Average Loss is ${ 1\over{m} }\loss^\ip_\Theta$.

If the number of problem training examples of a particular type is small, the sum  of the per example losses
due to the problem examples may not have enough of an impact on $\loss$ to affect the solution $\Theta$.



One strategy for pushing the model to better fit the problem examples is to increase their number !
- so total weight of this particular class of problems has greater impact on $\loss$

If you can find (or synthesize) similar types of problem examples, adding them to the training set
forces the optimizer to better accommodate these examples.
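
The arithmetic behind this strategy can be sketched as follows. The counts and loss values are invented, and the per-example losses are held fixed (as if evaluated at the current $\Theta$), so this shows only how the weight of the problem group in the Average Loss changes, not the result of re-training.

```python
import numpy as np

def problem_share(n_ordinary, n_problem, loss_ordinary=0.1, loss_problem=2.0):
    """Fraction of the total (equivalently, average) loss due to the problem examples."""
    losses = np.concatenate([
        np.full(n_ordinary, loss_ordinary),   # well-fit examples
        np.full(n_problem, loss_problem),     # poorly-fit "problem" examples
    ])
    is_problem = np.arange(losses.size) >= n_ordinary
    return losses[is_problem].sum() / losses.sum()

# Hypothetical training set: 990 ordinary examples and 10 problem examples
share_before = problem_share(n_ordinary=990, n_problem=10)

# Same set after adding 90 more problem examples (found or synthesized)
share_after = problem_share(n_ordinary=990, n_problem=100)

print(f"Share of loss from problem examples: {share_before:.1%} -> {share_after:.1%}")
```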

We will introduce *Data Augmentation* in a later module.

## Decrease the influence of a "problem" example

Sometimes the problem is not having too few "problem" examples, but is having a few "problems" that
are so off-scale that they unduly influence $\Theta$.

- That is: $\loss_\Theta^\ip$ is so large that it dominates $\loss$ and forces $\Theta$ to accommodate example $i$

##  Influential points

Some models may be quite sensitive to just a few observations.

This is particularly true for Linear Regression.

Our discussion is somewhat specialized to Linear Regression but you may come to see a similar
phenomenon in other models.

Loosely speaking, an observation is **influential** if 
- the parameter estimate $\Theta$ changes greatly depending on whether the observation is included/excluded

Feature values on the extreme ends of the range have greater potential
for being influential.

This is one argument for constraining the range of the feature (MinMax, Standardization).

The **leverage** of an observation is related to the value of a feature in relation to the mean (across observations) of the feature
- extreme values of the feature have higher leverage

High leverage does not always make a point influential, but it sometimes does.

[Influence from leverage and distance](http://onlinestatbook.com/2/regression/influential.html)
>An observation's influence is a function of two factors: (1) how much the observation's value on the predictor variable differs from the mean of the predictor variable and (2) the difference between the predicted score for the observation and its actual score. The former factor is called the observation's leverage. The latter factor is called the observation's distance.

Calculation of the leverage $h^\ip_j$ of example $i$ with respect to feature $j$

[formula](https://learnche.org/pid/least-squares-modelling/outliers-discrepancy-leverage-and-influence-of-the-observations#leverage)

$$ 
\begin{array}{lll}
h^\ip_j & = & { 1 \over m } + \frac{ (\x^\ip_j - \bar{\x_j})^2}{ \sum_{i'=1}^{m} { (\x^{(i')}_j - \bar{\x_j})^2} } \\
    & = & \frac{ 1 + \left( \frac{\x^\ip_j - \bar{\x_j}}{\sigma_{\x_j} } \right) ^2}{m}
\end{array}
$$

where $m$ is the number of examples and $\sigma_{\x_j}$ is the (population) standard deviation of feature $j$.

You can see that the leverage of $\x^\ip_j$ depends on the (standardized) distance of $\x^\ip_j$ from the mean (over all $i$) of $\x_j$.
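
As a sanity check, the two forms of the leverage can be computed with NumPy for a single feature. The feature values below are made up, and the variable names are illustrative only. The sketch also compares against the diagonal of the "hat" matrix of a Linear Regression with an intercept, which is the standard definition of leverage for this one-feature case.

```python
import numpy as np

# Hypothetical values of one feature; the last one is far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
n_obs = x.size   # number of examples

# First form: 1/(number of examples) plus the example's share of total squared deviation
dev2 = (x - x.mean()) ** 2
h_form1 = 1 / n_obs + dev2 / dev2.sum()

# Second form: written in terms of the standardized distance from the mean.
# np.std defaults to the population standard deviation (ddof=0), which is what makes
# the two forms agree.
z = (x - x.mean()) / x.std()
h_form2 = (1 + z ** 2) / n_obs

# Cross-check: diagonal of the hat matrix X (X^T X)^{-1} X^T, with an intercept column
X = np.column_stack([np.ones(n_obs), x])
h_hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print(np.round(h_form1, 3))
print("example with highest leverage:", h_form1.argmax())
```

The extreme feature value receives by far the highest leverage, while values near the mean receive leverage close to the minimum of one over the number of examples.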

Here's an interactive tool to get a feel for influential points.

It allows you to change the value of a single data point and see how the Linear Regression is affected.

Observe how the slope changes (displayed in the title)
- The `x_l` slider chooses the index of the data point to change
- The `y_l` slider chooses how much the data point changes
 - i.e., will change $\x^\ip$ when `x_l = i`
- 10 data points
  

In [4]:
# Generate some points
(x_ip,y_ip) = iph.gen_data(10)

# Fit a line to the points; get a function to update the fit and the plot
fit_update = iph.plot_init()  


In [5]:
iph.plot_interact(fit_update)


Play around with the example
- choose a point to move using the top slider
- choose how much to move the chosen point with the bottom slider
- see the effect of the change on the Slope (in the title)

Observe 
- changing a point in the middle has little effect on the slope
- changing a point closer to either extreme can have a big effect on the slope

This illustrates the effect of a single example on $\Theta$
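
If the interactive widget is not available, the same observation can be reproduced numerically. This is a minimal sketch, not the code behind `iph`: it assumes the quantity being moved is the target value of the chosen point, uses `np.polyfit` for the fit, and uses synthetic points that lie exactly on a line.

```python
import numpy as np

def slope_after_shift(x, y, index, delta):
    """Slope of the least-squares line after adding delta to the target y[index]."""
    y_shifted = y.copy()
    y_shifted[index] += delta
    slope, _intercept = np.polyfit(x, y_shifted, deg=1)   # highest-degree coefficient first
    return slope

# Synthetic data: 10 points exactly on a line with slope 2
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

base = slope_after_shift(x, y, index=0, delta=0.0)       # unchanged data
middle = slope_after_shift(x, y, index=4, delta=10.0)    # move a point near the middle
extreme = slope_after_shift(x, y, index=9, delta=10.0)   # move the right-most point

print(f"slope, no change:          {base:.3f}")
print(f"slope, middle point moved: {middle:.3f}")
print(f"slope, end point moved:    {extreme:.3f}")
```

The same shift applied to the end point changes the slope much more than when it is applied to a point near the middle.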

Knowing how influential a point is on $\Theta$ may cause you to reduce its influence
- drop the example
    - possible error, outlier
- clip the value (bound the range)
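
Both remedies can be sketched on synthetic data. The assumptions here are mine: the off-scale value is a target value, the clipping bounds are arbitrary, and the fit uses `np.polyfit` rather than this notebook's helpers.

```python
import numpy as np

def fitted_slope(x, y):
    """Slope of the least-squares line through (x, y)."""
    return np.polyfit(x, y, deg=1)[0]

# Synthetic data on a line with slope 2, except for one off-scale target at the right end
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] = 100.0

slope_raw = fitted_slope(x, y)

# Remedy 1: drop the example
keep = np.arange(x.size) != 9
slope_dropped = fitted_slope(x[keep], y[keep])

# Remedy 2: clip the value to a bounded range (the bounds are hypothetical)
y_clipped = np.clip(y, 0.0, 25.0)
slope_clipped = fitted_slope(x, y_clipped)

print(f"raw: {slope_raw:.3f}  dropped: {slope_dropped:.3f}  clipped: {slope_clipped:.3f}")
```

Dropping removes the point's influence entirely, while clipping keeps the example but limits how far it can pull the fit.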

In [8]:
print("Done")

Done
