In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import training_models_helper as tmh
%aimport training_models_helper

tm = tmh.TrainingModelsHelper()

import svm_helper
%aimport svm_helper
svmh = svm_helper.SVM_Helper()

kn = tmh.KNN_Helper()

import transform_helper
%aimport transform_helper

th = transform_helper.Transformation_Helper()

iph = transform_helper.InfluentialPoints_Helper()


# Improving prediction: Understanding the Loss function

In performing Error Analysis 
- we *identified* **test** examples where our model failed to generalize correctly
- but we didn't propose a **solution** to improve generalization

That is the topic of this module.

When we perform Error Analysis
- we focus on the Performance Metric (e.g., Accuracy)
- on a per-example basis
- for **out of sample** examples

But we can't *directly* influence the Performance on an out of sample example.

Instead, we will perform a *Loss Analysis*
- we focus on the Loss
- on a per-example basis
- for **in sample** examples

This is the analog of Error Analysis, but performed on Training rather than Test examples.

Reminder
- the Loss Function and Performance metrics are **not necessarily** the same
- the Loss Function is computed **in-sample**
- the Performance Metric is computed **out of sample**

The hope is
- that improving in-sample Loss
- will *indirectly* lead to a better out of sample Performance Metric

To illustrate, recall how Logistic Regression creates a prediction
- the model computes a score/logit $\Theta^T \cdot \x^\ip$ for an example and converts it to a probability with the sigmoid $\sigma$
$$
\hat{\mathbf{p}}^\ip = \sigma( \Theta^T \cdot \x^\ip )
$$
- the probability is converted into a prediction by comparison with a threshold
$$
\hat{\y}^\ip = 
\left\{
    {
    \begin{array}{lll}
     \text{Negative} & \textrm{if } \hat{\mathbf{p}}^\ip   < 0.5  \\
     \text{Positive}& \textrm{if } \hat{\mathbf{p}}^\ip \ge 0.5 
    \end{array}
    }
\right.
$$

The Loss is Binary Cross Entropy evaluated on the probabilities $\hat{\mathbf{p}}^\ip$

As we observed [before](Classification_Loss_Function.ipynb#Classification:-Loss-function)
- a small change in $\hat{\mathbf{p}}^\ip$
- does **not necessarily** change the prediction $\hat{\y}^\ip$
- unless $\hat{\mathbf{p}}^\ip$ crosses the threshold

Thus
- the Loss varies continuously with changes in parameters $\Theta$
- but the Performance **may not**
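
To make this concrete, here is a minimal sketch (the logit values are made up for illustration) that nudges the logit of a single Positive example in small steps. The Loss falls at every step, but the prediction only changes once $\hat{\mathbf{p}}$ crosses the threshold of 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Per-example Binary Cross Entropy; y is 0 (Negative) or 1 (Positive)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# One Positive example (y = 1); hypothetical logits, increasing in small steps
y = 1
logits = np.array([-0.4, -0.2, -0.05, 0.05, 0.2])

p_hat = sigmoid(logits)
losses = bce(p_hat, y)
preds = (p_hat >= 0.5).astype(int)   # compare the probability with the threshold

for z, p, l, pr in zip(logits, p_hat, losses, preds):
    print(f"logit={z:+.2f}  p_hat={p:.3f}  loss={l:.3f}  prediction={pr}")
```

Every row shows a different Loss, yet the prediction flips only between the third and fourth rows.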


Loss Analysis
- examines the per-example Loss using in-sample examples
- in order to find common attributes of problematic (mis-predicted) in-sample examples

Once we *diagnose* the problem with Loss
- we can explore remedies
    - feature engineering
    - pre-processing
- with the goal of causing the optimizer to change $\Theta$
- in order to push $\hat{\mathbf{p}}^\ip$ in the right direction

Recall the basics of minimizing Loss Functions
- Predictions $h(\x; \Theta)$  are a function of both inputs and parameters $\Theta$
- A given $\Theta$ induces a per-example loss $\loss_\Theta^\ip$
- Average Loss is the average of the per-example losses $\loss_\Theta^\ip, i=1, \ldots, m$
- We seek the optimal $\Theta^*$: $$
\Theta^* = \argmin{\Theta} { \loss_\Theta }
$$
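
The recap above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, `true_theta` is a hypothetical value used only to generate labels, and plain gradient descent with a fixed step size stands in for whatever optimizer is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
# Design matrix: a constant feature plus two synthetic features
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])   # hypothetical, for data generation only
y = (rng.random(m) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

def per_example_loss(theta, X, y):
    # Binary Cross Entropy for each example, given parameters theta
    p = 1 / (1 + np.exp(-X @ theta))
    p = np.clip(p, 1e-12, 1 - 1e-12)       # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def average_loss(theta, X, y):
    return per_example_loss(theta, X, y).mean()

theta = np.zeros(3)
initial_loss = average_loss(theta, X, y)

# Gradient descent on the Average Loss
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ theta))
    grad = X.T @ (p - y) / m
    theta -= 0.5 * grad

final_loss = average_loss(theta, X, y)
print(f"Average Loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The array returned by `per_example_loss` is the object that Loss Analysis inspects: one Loss value per training example.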

In pictures:


<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/W1_L4_s55_Intro_training.png"></td>
    </tr>
</table>


<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_error_analysis.png"></td>
    </tr>
</table>


# Conditional loss

In Error Analysis we partition test examples into groups with some common property, such as
- Commonality of result: TP, FN, TN, FP
- Commonality of features

in order to compute a *conditional* out of sample metric.

In Loss Analysis we partition training examples into groups
in order to compute a *conditional* in sample metric.

The following picture uses colors to identify which group a training example belongs to:



<table>
    <tr>
        <th><center>Loss analysis: conditional loss</center></th>
    </tr>
    <tr>
        <td><img src="images/Intro_error_analysis_1.png"></td>
    </tr>
</table>
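
As a rough sketch of what computing a conditional in sample Loss might look like, the snippet below groups training examples by result (TP, FN, TN, FP) and averages the per-example Loss within each group. The labels and probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical in-sample labels and predicted probabilities
y     = np.array([1,   1,   1,   0,   0,   0,   1,    0])
p_hat = np.array([0.9, 0.6, 0.2, 0.1, 0.4, 0.7, 0.45, 0.05])

# Per-example Binary Cross Entropy and thresholded prediction
loss = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
pred = (p_hat >= 0.5).astype(int)

# Assign each training example to a group by commonality of result
group = np.empty(y.shape, dtype="<U2")
group[(y == 1) & (pred == 1)] = "TP"
group[(y == 1) & (pred == 0)] = "FN"
group[(y == 0) & (pred == 0)] = "TN"
group[(y == 0) & (pred == 1)] = "FP"

# Conditional loss: average per-example loss within each group
conditional_loss = {g: loss[group == g].mean() for g in ["TP", "FN", "TN", "FP"]}
group_count = {g: int((group == g).sum()) for g in conditional_loss}

for g in conditional_loss:
    print(f"{g}: count={group_count[g]}  mean loss={conditional_loss[g]:.3f}")
```

The same pattern applies when the grouping is by a feature value rather than by result: only the definition of `group` changes.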


The real advantage of performing Conditional analysis in sample
- In sample examples (training/validation) can be re-used, unlike Test examples
- Features added based on in sample analysis are likely to affect the Loss
    - It is unknown whether they will affect the Performance Metric (when it is different from the Loss, e.g., Accuracy)

## What can we do to reduce loss ?

Understanding the per example loss can help you "push" the optimizer toward finding a "better" $\Theta$.

We will outline some simple strategies via examples that identify a problem and propose a solution.



## Increase the number of "problem" training examples

For MNIST digit classification
- We hypothesize a commonality that causes images of the digit 8 to be mis-classified
    - 8's that are slanted in the "opposite" direction of normal
    - We will refer to this as the *problematic* class

One reason our classifier may fail on this sub-class of 8's
- There are many fewer of them than the more prevalent images of 8's

Mathematically, the Average Loss is equally weighted
$$
\loss_\Theta  = { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta
$$

but the cumulative weight of the problematic class (mis-shaped 8's) is very small.

So even if all examples in the problematic class were mis-classified
- The impact on Average Loss may be so small
- That $\Theta$ doesn't get updated in the direction that will improve these examples
    - Especially if we end optimization before full convergence occurs, as is common


One strategy for pushing the model to better fit the problematic examples is
- Increase their cumulative weight in the Loss
- By increasing their number !
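
A small arithmetic sketch shows the effect. All counts and loss values below are hypothetical, and the per-example losses are held fixed at their values under the current $\Theta$ (in practice they would change once the model is re-fit on the enlarged training set).

```python
import numpy as np

# Hypothetical training set: many ordinary examples, few problematic ones
m_common, m_problem = 5000, 20
is_problem = np.r_[np.zeros(m_common, dtype=bool), np.ones(m_problem, dtype=bool)]

# Hypothetical per-example losses: small for ordinary, large for problematic
per_example_loss = np.where(is_problem, 1.0, 0.05)

def problem_share(loss, mask):
    # Fraction of the total Loss contributed by the problematic examples
    return loss[mask].sum() / loss.sum()

share_before = problem_share(per_example_loss, is_problem)

# Repeat each problematic example n_copies times
n_copies = 10
idx = np.r_[np.flatnonzero(~is_problem),
            np.repeat(np.flatnonzero(is_problem), n_copies)]
share_after = problem_share(per_example_loss[idx], is_problem[idx])

print(f"share of Loss before: {share_before:.1%}, after: {share_after:.1%}")
```

With these made-up numbers, the problematic examples go from a minor share of the total Loss to a large one, so the gradient of the Average Loss pays much more attention to them.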

The strategy known as *Data Augmentation* adds examples to the Training examples
- Here we try to find/synthesize more instances of the problematic type


We can augment examples by repeating them (as above).
- re-sampling the Training data
- covered in the module on Imbalanced Data

This is a simple method that works well for most data types.

For some types of data (e.g., Image), other means of augmentation are available.
- Create a new training example
- By perturbing the features of an existing training example
- In such a way as to preserve the label

For instance, given a training example we can
- Add a small quantity of noise to the feature vector
- Perform data-type specific transformations
    - Images: [shift, rotate, transpose](DataAugmentation/Data_augmentation.ipynb#Original-image)
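
As an illustration of label-preserving perturbations, here is a minimal NumPy sketch. The helper names `shift_image` and `add_noise` are hypothetical (they are not from the linked notebook), the 28x28 "image" is a stand-in rather than real MNIST data, and pixel values are assumed to lie in $[0, 1]$.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling vacated pixels with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def add_noise(img, scale, rng):
    """Add a small quantity of Gaussian noise, keeping pixels in [0, 1]."""
    return np.clip(img + rng.normal(0.0, scale, size=img.shape), 0.0, 1.0)

rng = np.random.default_rng(0)

# Stand-in for one training example: a bright block on a dark background
image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0
label = 8

# New training examples: perturbed features, same label
augmented = [(shift_image(image, dx, dy), label)
             for dx, dy in [(2, 0), (-2, 0), (0, 2), (0, -2)]]
augmented.append((add_noise(image, 0.05, rng), label))

print(len(augmented), "new examples, all labeled", label)
```

The perturbations must be small enough that the label is still correct; a large rotation of a 6, for example, would not preserve its label.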

## Influential points

We have described the case where the issue is mis-classification of an important but small sub-class.
- Which results in a small cumulative contribution to the Loss 

Sometimes the problem is a small sub-class with an *out-sized* contribution to the Loss
- Having a few problem examples
- Whose contribution to Loss is so large
- That it pushes $\Theta$ in the wrong direction for the more numerous non-problem examples

That is: $\loss_\Theta^\ip$ is so large (for some example $i$) that 
- $\Theta$ is changed to reduce $\loss_\Theta^\ip$ 
- Resulting in an increase in $\loss_\Theta^{(i')}$ for each non-problematic example $i'$



The phenomenon we just described is sometimes called *Influential Points*.

These have been particularly well-studied in the context of Linear Regression.

We will use Linear Regression as an illustration.


Loosely speaking, an example is **influential** if 
- the parameter estimate $\Theta$ changes greatly depending on whether the example is included/excluded
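
This loose definition can be turned into a simple leave-one-out check: refit the model with each example excluded and measure how much the slope moves. The sketch below uses synthetic data with one deliberately corrupted point; it is an illustration of the idea, not a standard influence statistic.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data along a line, with the right-most target corrupted
x = np.linspace(-3, 3, 15)
y = 2 + 1.5 * x + rng.normal(0, 0.3, size=x.size)
y[-1] += 8.0

def slope(x, y):
    # Slope of the least-squares line through (x, y)
    return np.polyfit(x, y, deg=1)[0]

full_slope = slope(x, y)

# Change in slope when example i is excluded from the fit
slope_change = np.array([
    abs(full_slope - slope(np.delete(x, i), np.delete(y, i)))
    for i in range(x.size)
])

most_influential = int(np.argmax(slope_change))
print(f"slope with all examples: {full_slope:.2f}")
print(f"most influential example: i={most_influential}, "
      f"slope change={slope_change[most_influential]:.2f}")
```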

Feature values on the extreme ends of the range have greater potential
for being influential.

This is one argument for constraining the range of the feature (MinMax, Standardization).

Here's an interactive tool to get a feel for influential points in Linear Regression.

It allows you to change the target value of three data points and see the effect on the fitted line.

Observe how the slope changes (displayed in the fitted equation on the plot)
- 15 labeled examples $\{ [\x^\ip, \y^\ip] \, | \, 0 \le i \lt 15  \}$
- Three sliders set a new value $\y^\ip$ for the left-most, middle, and right-most points
- The Reset button restores the original values

  

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from ipywidgets import FloatSlider, Button, HBox, VBox, Output

# Generate base data: many points along a line
np.random.seed(42)
n_points = 15
x = np.linspace(-3, 3, n_points)
y_orig = 2 + 1.5 * x + np.random.normal(0, 0.3, size=n_points)

# Indices for the three controllable points
idx_left = 0
idx_mid = n_points // 2
idx_right = -1

# Output widget for plot
out = Output()

# Sliders for the three points
slider_left = FloatSlider(
    value=y_orig[idx_left],
    min=y_orig[idx_left] - 8,
    max=y_orig[idx_left] + 8,
    step=0.1,
    description='Left y'
)
slider_mid = FloatSlider(
    value=y_orig[idx_mid],
    min=y_orig[idx_mid] - 8,
    max=y_orig[idx_mid] + 8,
    step=0.1,
    description='Mid y'
)
slider_right = FloatSlider(
    value=y_orig[idx_right],
    min=y_orig[idx_right] - 8,
    max=y_orig[idx_right] + 8,
    step=0.1,
    description='Right y'
)

# Reset button
reset_btn = Button(description="Reset", button_style='warning')

def plot_regression(left_y, mid_y, right_y):
    y_mod = y_orig.copy()
    y_mod[idx_left] = left_y
    y_mod[idx_mid] = mid_y
    y_mod[idx_right] = right_y
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y_mod)
    y_pred = model.predict(x.reshape(-1, 1))
    r2 = model.score(x.reshape(-1, 1), y_mod)
    slope = model.coef_[0]
    intercept = model.intercept_

    plt.figure(figsize=(8, 5))
    # Plot fixed points
    mask = np.ones(n_points, dtype=bool)
    mask[[idx_left, idx_mid, idx_right]] = False
    plt.scatter(x[mask], y_mod[mask], color='gray', label='Fixed points')
    # Plot controllable points
    plt.scatter(x[idx_left], y_mod[idx_left], color='red', s=120, label='Left')
    plt.scatter(x[idx_mid], y_mod[idx_mid], color='green', s=120, label='Mid')
    plt.scatter(x[idx_right], y_mod[idx_right], color='blue', s=120, label='Right')
    plt.plot(x, y_pred, color='black', lw=2, label='OLS fit')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Effect of Influential Points on Linear Regression')
    plt.legend()
    eqn = f"$y = {intercept:.2f} + {slope:.2f}x$\n$R^2 = {r2:.3f}$"
    plt.text(0.05, 0.95, eqn, transform=plt.gca().transAxes, fontsize=12, verticalalignment='top')
    plt.ylim(min(y_mod)-2, max(y_mod)+2)
    plt.show()

def update_plot(*args):
    with out:
        out.clear_output(wait=True)
        plot_regression(slider_left.value, slider_mid.value, slider_right.value)

def reset_sliders(b):
    slider_left.value = y_orig[idx_left]
    slider_mid.value = y_orig[idx_mid]
    slider_right.value = y_orig[idx_right]

# Attach callbacks
slider_left.observe(update_plot, names='value')
slider_mid.observe(update_plot, names='value')
slider_right.observe(update_plot, names='value')
reset_btn.on_click(reset_sliders)

In [5]:
# Initial plot
update_plot()

# Display controls and plot
ui = VBox([HBox([slider_left, slider_mid, slider_right, reset_btn]), out])
display(ui)

Play around with the tool
- Move a point close to either end of the range
- Move a point close to the middle of the range

Observe how the slope changes as you move the point.

You will see that points closer to either end have greater influence on the slope:
changing $\y^\ip$ for a point near either end of the range has a large effect on the fit.


The solution is to somehow reduce example $i$'s contribution to Average Loss
- Removing the example: possible data error or outlier
- Down-weighting
- Clipping the values of the features/target to some upper bound
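
The three remedies can be compared on a small synthetic example. This is a sketch under assumptions: the corrupted point, the down-weight of 0.1, and the 90th-percentile clipping bound are arbitrary choices for illustration, and `fitted_slope` is a hypothetical helper implementing weighted least squares directly in NumPy. (scikit-learn's `LinearRegression.fit` accepts a `sample_weight` argument that serves the same purpose.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with true slope 1.5; the right-most target is corrupted
x = np.linspace(-3, 3, 15)
y = 2 + 1.5 * x + rng.normal(0, 0.3, size=x.size)
bad = x.size - 1
y[bad] += 8.0

def fitted_slope(x, y, weights=None):
    # Weighted least squares: minimize sum_i w_i * (y_i - a - b * x_i)^2
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    A = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1]

slope_all = fitted_slope(x, y)

# Remedy 1: remove the example
keep = np.arange(x.size) != bad
slope_removed = fitted_slope(x[keep], y[keep])

# Remedy 2: down-weight the example's contribution to the Loss
weights = np.ones(x.size)
weights[bad] = 0.1
slope_downweighted = fitted_slope(x, y, weights)

# Remedy 3: clip the target to an upper bound
upper = np.quantile(y, 0.9)
slope_clipped = fitted_slope(x, np.minimum(y, upper))

print(f"all: {slope_all:.2f}  removed: {slope_removed:.2f}  "
      f"down-weighted: {slope_downweighted:.2f}  clipped: {slope_clipped:.2f}")
```

Each remedy pulls the slope back toward the value suggested by the other examples. Clipping also touches non-problem examples above the bound, so it trades one distortion for another.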


### Further background 

Consider feature $j$.

The **leverage** of example $i$ is related to
- How far $\x_j^\ip$ is from $\bar{\x}_j$, the average of $\x_j$ across all examples

It is not always the case, but high leverage sometimes makes the point influential.

Reference:
[Influence from leverage and distance](http://onlinestatbook.com/2/regression/influential.html)
>An observation's influence is a function of two factors: (1) how much the observation's value on the predictor variable differs from the mean of the predictor variable and (2) the difference between the predicted score for the observation and its actual score. The former factor is called the observation's leverage. The latter factor is called the observation's distance.

Calculation of Leverage (h) of example $i$, feature $j$

[formula](https://learnche.org/pid/least-squares-modelling/outliers-discrepancy-leverage-and-influence-of-the-observations#leverage)

$$ 
\begin{array}{lll}
h^\ip_j & = & { 1 \over m }+ \frac{ (\x^\ip_j - \bar{\x}_j)^2}{ \sum_{i'=1}^m { (\x^{(i')}_j - \bar{\x}_j)^2} } \\
    & = & \frac{ 1 + \left( \frac{\x^\ip_j - \bar{\x}_j}{\sigma_{\x_j} } \right) ^2}{m}
\end{array}
$$

where $m$ is the number of examples and $\sigma_{\x_j}$ is the (population) standard deviation of feature $j$.

You can see that the leverage of $\x^\ip_j$ depends on the (standardized) distance of $\x^\ip_j$ from the mean (over all $i$) of $\x_j$.
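
The leverage formula is easy to check numerically. The sketch below, on a synthetic single feature, computes leverage both ways shown above and compares the result with the diagonal of the "hat" matrix $A (A^T A)^{-1} A^T$, where $A$ is the design matrix with a constant column. For a single feature with an intercept, the two coincide.

```python
import numpy as np

# Synthetic single feature
x = np.linspace(-3, 3, 15)
m = x.size

# First form: 1/m plus the squared deviation over the total squared deviation
dev2 = (x - x.mean()) ** 2
h_formula = 1 / m + dev2 / dev2.sum()

# Second form: in terms of the standardized distance from the mean
sigma = x.std()                     # population standard deviation (ddof=0)
h_standardized = (1 + ((x - x.mean()) / sigma) ** 2) / m

# Cross-check: diagonal of the hat matrix for the design matrix [1, x]
A = np.column_stack([np.ones(m), x])
h_hat = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)

for i in (0, m // 2, m - 1):
    print(f"i={i:2d}  x={x[i]:+.2f}  leverage={h_formula[i]:.3f}")
```

The end points have the highest leverage and the middle point the lowest, which matches what the interactive tool shows.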

In [6]:
print("Done")

Done
