# Gradient Descent--Huber Loss
### Author: Jayanth Raman

**Date: 10 Dec 2019**

Homework 2 of Neural Network Basics, Tokyo DataScience

# Huber Loss
Point-wise Huber loss is defined piecewise by:
$$
L_\gamma (e_i) = 
\begin{cases}
 e_i^2                   & \text{for } |e_i| \le \gamma, \quad \gamma > 0\\
 \gamma (2 |e_i| - \gamma), & \text{otherwise.}
\end{cases}
$$
where $\gamma > 0$ is the threshold at which the Huber loss switches from a quadratic to a linear function, and $e_i$ is the value at the $i$-th data point.

The overall Huber loss is the average:
$$
L_\gamma (e) = \frac{1}{M} \sum_{i=1}^{M} L_\gamma (e_i)
$$
for $M$ points of $e_i$.
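
As a quick numeric check of the two definitions above, the following sketch evaluates the point-wise loss on a few values and then averages them. The helper name `pointwise_huber` and the sample values are illustrative assumptions, not part of the homework code.

```python
import numpy as np

def pointwise_huber(e, gamma):
    """Point-wise Huber loss as defined above (hypothetical helper name)."""
    abs_e = np.abs(np.asarray(e, dtype=float))
    # Quadratic inside the threshold, linear outside it.
    return np.where(abs_e <= gamma, abs_e ** 2, gamma * (2.0 * abs_e - gamma))

gamma = 1.0
errors = np.array([-3.0, -1.0, 0.5, 1.0, 2.0])   # illustrative values
losses = pointwise_huber(errors, gamma)
print(losses)          # per-point losses: 5, 1, 0.25, 1, 3
print(losses.mean())   # overall Huber loss over M = 5 points: 2.05
```

At $|e_i| = \gamma$ both branches give $\gamma^2$, so the loss is continuous at the threshold.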

When $e$ is the error, $e = \hat{y} - y$, then the point-wise loss is
$$
L_\gamma (y_i, \hat{y_i}) = 
\begin{cases}
 (\hat{y_i} - y_i)^2                   & \text{for } |\hat{y_i} - y_i| \le \gamma, \quad \gamma > 0\\
 \gamma (2 |\hat{y_i} - y_i| - \gamma), & \text{otherwise.}
\end{cases}
$$
where $\hat{y}$ is the prediction of $y$.

And the overall Huber loss is the average
$$
L_\gamma (y, \hat{y}) = \frac{1}{M} \sum_{i=1}^{M} L_\gamma (y_i, \hat{y_i})
$$

[From [Wikipedia](https://en.wikipedia.org/wiki/Huber_loss); the definition there is this one scaled by $\frac{1}{2}$, with the threshold written as $\delta$.]
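
Gradient descent needs the derivative of the point-wise loss with respect to $e_i$. Differentiating each branch of the definition above gives the following (a short derivation added for reference; it matches the expression used in `update_weights_huber` below):

$$
\frac{d L_\gamma (e_i)}{d e_i} =
\begin{cases}
 2 e_i                          & \text{for } |e_i| \le \gamma,\\
 2 \gamma \, \text{sign}(e_i),  & \text{otherwise}
\end{cases}
\;=\; 2 \, \text{sign}(e_i) \, \min(\gamma, |e_i|)
$$

Both branches equal $\pm 2\gamma$ at $|e_i| = \gamma$, so the derivative is continuous, and the contribution of any single point to the gradient is bounded by $2\gamma$ in magnitude.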

# Summary
We generated synthetic data with a few outliers and fit two linear models by gradient descent:
 * One with an MSE-loss function
 * Another with a Huber-loss function

The Huber-loss model was more robust to the outliers.  We evaluated both fitted models with the MSE loss on the points that were not turned into outliers.  The loss for the Huber-loss model was 0.1342 and the loss for the MSE-loss model was 0.2468.

We also trained both models on the same synthetic data without the outliers.  The two fits were nearly identical and both had an MSE loss of approximately 0.1170.


# Calculations and Observations

In [1]:
# Imports
import bokeh.plotting as bkp
import functools
import numpy as np
import pandas as pd   # for summary statistics
#
bkp.reset_output()
bkp.output_notebook()

In [2]:
# Function to generate `y` given `x` and a set of weights.
def linear_function(weights, x):
    '''
    weights = [a0, a1, a2, ...]
    output = a0 + a1 * x + a2 * x ** 2 + ...
    '''
    if isinstance(x, list):
        x = np.array(x)
    out = weights[0]
    xn = x
    for w in weights[1:]:
        out = out + w * xn
        xn = xn * x
    return out

def test_linear_function():
    x1 = 3.1415
    tmp1 = 4 + 3.2 * x1 + 2.1 * x1 * x1 + 8.53 * x1 * x1 * x1
    assert linear_function([6.24, 4.14], x1) == 6.24 + 4.14 * x1
    assert linear_function([4, 3.2, 2.1, 8.53], x1) == tmp1
    assert all(linear_function([4, 3.2, 2.1, 8.53], [x1, x1]) == np.array([tmp1, tmp1]))
test_linear_function()
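
Despite its name, `linear_function` evaluates a polynomial in `x` (it is linear in the weights). As a hedged cross-check, the sketch below compares a standalone copy of the same recurrence against NumPy's `numpy.polynomial.polynomial.polyval`, using `np.allclose` rather than exact `==` because floating-point results can differ in the last bits. The name `poly_eval` is a stand-in introduced here for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def poly_eval(weights, x):
    """Standalone copy of the recurrence: a0 + a1 * x + a2 * x ** 2 + ..."""
    x = np.asarray(x, dtype=float)
    out = weights[0]
    xn = x
    for w in weights[1:]:
        out = out + w * xn
        xn = xn * x
    return out

weights = [4, 3.2, 2.1, 8.53]
x = np.array([0.0, 0.5, 3.1415])
# polyval takes coefficients in increasing order of degree, matching `weights`.
assert np.allclose(poly_eval(weights, x), P.polyval(x, weights))
```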

In [3]:
def gen_training_data(wt_true, npoints=50, gauss_noise_sd=0.4, binomp={'p': 0.03, 'mag': 6.0}, seed=414243):
    '''Generate (simulate) training data.

    Returns x, y (Gaussian noise only) and the outlier noise, which the
    caller adds to y separately.
    '''
    np.random.seed(seed)
    xtrain = np.random.random(npoints)
    # e.g. 6 * binomial(1, 0.03, size=n) - 6 * binomial(1, 0.03, size=n)
    binomial = np.random.binomial
    ntrials = 1
    outlier_noise = binomial(ntrials, binomp['p'], size=npoints) - binomial(ntrials, binomp['p'], size=npoints)
    outlier_noise = outlier_noise * binomp['mag']
    ytrain = linear_function(wt_true, xtrain) + gauss_noise_sd * np.random.randn(npoints)
    return xtrain, ytrain, outlier_noise
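
The outlier noise above is the difference of two independent Bernoulli($p$) draws scaled by `mag`, so each point is shifted by `+mag` or `-mag` with probability $p(1-p)$ each and left alone otherwise. The sketch below works out the expected number of outliers for the default `p = 0.03` and the 70 points used later in this notebook; the variable names are illustrative.

```python
p = 0.03        # per-draw success probability (default in gen_training_data)
npoints = 70    # number of training points used later in the notebook

# P(shift = +mag) = P(first draw is 1, second is 0) = p * (1 - p); same for -mag.
p_outlier = 2 * p * (1 - p)
expected_outliers = npoints * p_outlier
print(f"P(outlier) = {p_outlier:.4f}, expected count = {expected_outliers:.2f}")
# P(outlier) = 0.0582, expected count = 4.07
```

An expected count of about 4 is in line with the 4 outliers reported for this data set.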

In [4]:
def update_weights_mse(weights, x, y, learning_rate):
    '''One gradient-descent step on the MSE loss; updates `weights` in place.'''
    y_predicted = linear_function(weights, x)
    # d(loss)/d(prediction) per point; d(prediction)/d(a_k) = x ** k
    derivative_of_loss = 2 * (y_predicted - y)
    weights[0] -= learning_rate * derivative_of_loss.mean()
    for ii in range(1, len(weights)):
        derivative_of_loss = x * derivative_of_loss
        weights[ii] -= learning_rate * derivative_of_loss.mean()
    return None

def update_weights_huber(weights, x, y, learning_rate, gamma):
    '''One gradient-descent step on the Huber loss; updates `weights` in place.'''
    y_predicted = linear_function(weights, x)
    err = y_predicted - y
    # 2 * err inside the threshold, clipped to +/- 2 * gamma outside it
    derivative_of_loss = 2.0 * np.sign(err) * np.minimum(gamma, np.abs(err))
    weights[0] -= learning_rate * derivative_of_loss.mean()
    for ii in range(1, len(weights)):
        derivative_of_loss = x * derivative_of_loss
        weights[ii] -= learning_rate * derivative_of_loss.mean()
    return None
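
One way to gain confidence in the Huber gradient used above is to compare it with a finite-difference approximation of the loss. The sketch below does this for a two-weight model on illustrative data; `huber`, `residuals` and `huber_grad` are hypothetical helpers written for this check, not functions from the notebook.

```python
import numpy as np

def huber(err, gamma):
    """Mean Huber loss of a residual vector (same form as defined above)."""
    a = np.abs(err)
    return np.mean(np.where(a <= gamma, a ** 2, gamma * (2.0 * a - gamma)))

def residuals(w, x, y):
    """Residuals of the model a0 + a1 * x."""
    return w[0] + w[1] * x - y

def huber_grad(w, x, y, gamma):
    """Analytic gradient of the mean Huber loss w.r.t. [a0, a1]."""
    err = residuals(w, x, y)
    d = 2.0 * np.sign(err) * np.minimum(gamma, np.abs(err))
    return np.array([d.mean(), (x * d).mean()])

rng = np.random.default_rng(0)           # illustrative data, not the notebook's
x = rng.random(20)
y = 3.0 + 6.0 * x + rng.normal(0.0, 2.0, size=20)
w = np.array([1.0, 1.0])
gamma, h = 1.5, 1e-6

# Central finite differences of the loss, one weight at a time.
numeric = np.zeros(2)
for k in range(2):
    step = np.zeros(2)
    step[k] = h
    numeric[k] = (huber(residuals(w + step, x, y), gamma)
                  - huber(residuals(w - step, x, y), gamma)) / (2 * h)

analytic = huber_grad(w, x, y, gamma)
print(numeric, analytic)
assert np.allclose(numeric, analytic, atol=1e-5)
```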

In [5]:
def mse_loss(y, yhat):
    return np.mean((y - yhat) ** 2)

def huber_loss(y, yhat, gamma):
    errabs = np.abs(y - yhat)
    loss = np.where(errabs <= gamma, (errabs ** 2), gamma * (2.0 * errabs - gamma))
    return np.mean(loss)

# test
def test():
    wt_true = [3.0, 6.0]
    xtrain, ytrain, outlier_noise = gen_training_data(wt_true, npoints=70)
    _ytrue = linear_function(wt_true, xtrain)
    ytrain = ytrain + outlier_noise
    print(mse_loss(_ytrue, ytrain), mse_loss(ytrain, _ytrue))
    print(huber_loss(_ytrue, ytrain, np.inf), huber_loss(ytrain, _ytrue, np.inf))
    assert huber_loss(_ytrue, ytrain, np.inf) == mse_loss(_ytrue, ytrain)
    print(huber_loss(_ytrue, ytrain, np.std(ytrain)), huber_loss(ytrain, _ytrue, np.std(ytrain)))
test()

2.1804075712825233 2.1804075712825233
2.1804075712825233 2.1804075712825233
1.252087574809613 1.252087574809613


In [6]:
# Train
def train(xtrain, ytrain, nweights, update_func, niter, learning_rate=0.02,
          loss_func=None, init_weights=None, loss_iter=20):
    # initialize weights
    if init_weights is None:
        weights = np.random.randn(nweights)
    else:
        weights = np.array(init_weights, dtype=float)  # copy so the caller's weights are not modified
    loss_values = []
    for ii in range(niter):
        update_func(weights, xtrain, ytrain, learning_rate)
        if loss_func and ii % loss_iter == (loss_iter - 1):
            loss_values.append(loss_func(linear_function(weights, xtrain), ytrain))
    return weights, loss_values

In [7]:
wt_true = [3.0, 6.0]
xtrain, ytrain, outlier_noise = gen_training_data(wt_true, npoints=70)
print('Number of outliers:', np.count_nonzero(outlier_noise))

Number of outliers: 4


In [8]:
print('ytrain summary stats -- no outliers:')
print(pd.Series(ytrain).describe())
print()
print('ytrain with outliers summary stats:')
print(pd.Series(ytrain + outlier_noise).describe())

ytrain summary stats -- no outliers:
count    70.000000
mean      5.993224
std       1.806728
min       2.545869
25%       4.440494
50%       6.017799
75%       7.658137
max       8.887732
dtype: float64

ytrain with outliers summary stats:
count    70.000000
mean      5.993224
std       1.997503
min       0.680286
25%       4.440494
50%       6.017799
75%       7.735788
max       9.859972
dtype: float64


In [9]:
p = bkp.figure(title='Training Data With Outliers', y_range=[-1, 11], width=600, height=400)
xin, xout = xtrain[outlier_noise == 0], xtrain[outlier_noise != 0]
yin, yout = ytrain[outlier_noise == 0], (ytrain + outlier_noise)[outlier_noise != 0]
p.circle(xin, yin, radius=0.01, legend_label='normal data')
p.circle(xout, yout, radius=0.01, color='darkorange', legend_label='outliers')
p.legend.location = 'bottom_left'
p.legend.border_line_alpha = 0.4
p.legend.border_line_color = 'black'
bkp.show(p)

## Predictions Without Outliers
First, we compare the models trained with the MSE loss and the Huber loss on data without outliers, i.e. we do not yet add the noise that moves some of the points and turns them into outliers.

In [10]:
# No outliers - Predict using MSE Loss
np.random.seed(424242)
weights_mse, loss_values_mse = train(xtrain, ytrain, nweights=2, update_func=update_weights_mse,
                                     loss_func=mse_loss, niter=250 * 20)

In [11]:
print('loss values (last few):', loss_values_mse[-10:])
print('True weights:', wt_true)
print('Predicted weights:', weights_mse)
yhat = linear_function(weights_mse, xtrain)
print('MSE loss:', mse_loss(yhat, ytrain))

loss values (last few): [0.1170278207989682, 0.11702782079886256, 0.11702782079876847, 0.11702782079868472, 0.11702782079861017, 0.11702782079854383, 0.11702782079848476, 0.11702782079843213, 0.11702782079838533, 0.11702782079834366]
True weights: [3.0, 6.0]
Predicted weights: [3.07659014 5.79207763]
MSE loss: 0.11702782079834366


In [12]:
# No outliers - Predict using Huber Loss
np.random.seed(424242)
gamma = np.std(ytrain + outlier_noise)
print(gamma)
update_func = functools.partial(update_weights_huber, gamma=gamma)
loss_func = functools.partial(huber_loss, gamma=gamma)
weights_huber, loss_values_huber = train(xtrain, ytrain, nweights=2, update_func=update_func,
                                         loss_func=loss_func, niter=250 * 20)

1.9831842254929397


In [13]:
print('gamma:', gamma)
print('loss values (last few):', loss_values_huber[-10:])
print('True weights:', wt_true)
print('Predicted weights:', weights_huber)
yhat = linear_function(weights_huber, xtrain)
print('MSE loss:', mse_loss(yhat, ytrain))

gamma: 1.9831842254929397
loss values (last few): [0.11702782079915823, 0.11702782079903169, 0.11702782079891905, 0.11702782079881884, 0.11702782079872952, 0.11702782079865007, 0.11702782079857933, 0.11702782079851635, 0.11702782079846025, 0.11702782079841041]
True weights: [3.0, 6.0]
Predicted weights: [3.07659024 5.79207745]
MSE loss: 0.11702782079841041


In [14]:
p = bkp.figure(title='Predictions -- Data Without Outliers', y_range=[-1, 11], width=600, height=400)
p.circle(xtrain, ytrain, radius=0.01)
xtmp = np.linspace(0, 1)
ytrue = linear_function(wt_true, xtmp)
ymse = linear_function(weights_mse, xtmp)
yhuber = linear_function(weights_huber, xtmp)
p.line(xtmp, ytrue, color='gray', line_width=3, legend_label='true weight')
p.line(xtmp, ymse, color='blue', line_width=3, legend_label='MSE')
p.line(xtmp, yhuber, color='darkorange', line_width=3, legend_label='Huber')
p.legend.location = 'bottom_right'
bkp.show(p)

From the above figure, the three predictions are practically indistinguishable.

Both loss functions yielded the same **MSE loss of 0.1170**.  The Huber training loss converges to the same value, because every residual lies within $\gamma$, where the Huber loss is exactly the squared error.

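Note that the Huber-loss parameter, $\gamma$, was set to one standard deviation of the training targets with the outlier noise added (`np.std(ytrain + outlier_noise)`, about 1.98), even for this run without outliers.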
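
One detail worth noting: `np.std` defaults to the population standard deviation (`ddof=0`), whereas pandas' `describe()` reports the sample standard deviation (`ddof=1`). That is why $\gamma \approx 1.983$ differs slightly from the `std` of 1.9975 in the summary statistics above. A minimal sketch of the relationship, on illustrative data rather than the notebook's:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(6.0, 2.0, size=70)   # illustrative data with 70 points

std_pop = np.std(y)                 # ddof=0, NumPy's default
std_sample = np.std(y, ddof=1)      # ddof=1, what pandas' describe() reports
# The two differ by the factor sqrt(n / (n - 1)).
assert np.isclose(std_sample, std_pop * np.sqrt(70 / 69))
print(std_pop, std_sample)
```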

The true weights and the weights from the predictions are below:

In [15]:
print(f'True weights      : {wt_true}')
print(f'MSE-loss weights  : {weights_mse}')
print(f'Huber-loss weights: {weights_huber}')

True weights      : [3.0, 6.0]
MSE-loss weights  : [3.07659014 5.79207763]
Huber-loss weights: [3.07659024 5.79207745]


### Training Loss
For completeness, we plot the training loss against the training iterations for both MSE and Huber losses.  The training loss decreases with the number of steps as expected.

In [16]:
p = bkp.figure(title='Training Loss vs Iterations (no outliers)', width=600, height=400)
p.line(np.arange(len(loss_values_mse)) * 20, loss_values_mse, line_width=2, legend_label='MSE')
p.line(np.arange(len(loss_values_huber)) * 20, loss_values_huber, line_width=2,
       color='darkorange', legend_label='Huber')
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Loss'
bkp.show(p)
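
Because the MSE problem is an ordinary least-squares fit, converged gradient-descent weights can be sanity-checked against a closed-form solution. The sketch below does this on illustrative data using `np.linalg.lstsq`; it is an added cross-check rather than part of the original homework, and `gd_fit` is a hypothetical helper.

```python
import numpy as np

def gd_fit(x, y, niter=5000, learning_rate=0.02):
    """Plain gradient descent on the MSE loss for y ~ a0 + a1 * x."""
    w = np.zeros(2)
    for _ in range(niter):
        d = 2.0 * (w[0] + w[1] * x - y)
        w -= learning_rate * np.array([d.mean(), (x * d).mean()])
    return w

rng = np.random.default_rng(2)           # illustrative data, not the notebook's
x = rng.random(70)
y = 3.0 + 6.0 * x + 0.4 * rng.standard_normal(70)

w_gd = gd_fit(x, y)
# Closed-form least squares on the design matrix [1, x].
A = np.column_stack([np.ones_like(x), x])
w_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w_gd, w_ls)
assert np.allclose(w_gd, w_ls, atol=1e-3)
```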

## Predictions With Outliers
We now add the outlier noise, thus creating the outliers, and then train the linear model using both MSE loss and Huber loss.

In [17]:
# With outliers - Predict using MSE Loss
np.random.seed(424242)
weights_mse_ol, loss_values_mse_ol = train(xtrain, ytrain + outlier_noise, nweights=2, update_func=update_weights_mse,
                             loss_func=mse_loss, niter=250 * 20)
print(loss_values_mse_ol[-10:])
print(wt_true, weights_mse_ol)

[1.99351065913857, 1.9935106591385188, 1.9935106591384735, 1.993510659138434, 1.993510659138397, 1.9935106591383653, 1.9935106591383374, 1.993510659138311, 1.993510659138289, 1.9935106591382687]
[3.0, 6.0] [3.68645222 4.58096621]


In [18]:
print('loss values (last few):', loss_values_mse_ol[-4:])
print('True weights:', wt_true)
print('Predicted weights:', weights_mse_ol)
yhat = linear_function(weights_mse_ol, xtrain)
print('MSE loss (all data):', mse_loss(yhat, ytrain + outlier_noise))
yhat = linear_function(weights_mse_ol, xtrain[outlier_noise == 0])
print('MSE loss (unaffected data):', mse_loss(yhat, ytrain[outlier_noise == 0]))

loss values (last few): [1.9935106591383374, 1.993510659138311, 1.993510659138289, 1.9935106591382687]
True weights: [3.0, 6.0]
Predicted weights: [3.68645222 4.58096621]
MSE loss (all data): 1.9935106591382687
MSE loss (unaffected data): 0.24677995511981501


In [19]:
# With outliers - Predict using Huber Loss
np.random.seed(424242)
gamma = np.std(ytrain + outlier_noise)
update_func = functools.partial(update_weights_huber, gamma=gamma)
loss_func = functools.partial(huber_loss, gamma=gamma)
weights_huber_ol, loss_values_huber_ol = train(xtrain, ytrain + outlier_noise, nweights=2, update_func=update_func,
                                               loss_func=loss_func, niter=250 * 20)
print('gamma:', gamma)
print(loss_values_huber_ol[-10:])
print(wt_true, weights_huber_ol)

gamma: 1.9831842254929397
[1.2135555810637617, 1.2135555810630088, 1.2135555810623326, 1.2135555810617251, 1.2135555810611796, 1.213555581060689, 1.213555581060249, 1.2135555810598533, 1.2135555810594978, 1.213555581059179]
[3.0, 6.0] [3.29274357 5.34311077]


In [20]:
print('gamma:', gamma)
print('loss values (last few):', loss_values_huber_ol[-10:])
print('True weights:', wt_true)
print('Predicted weights:', weights_huber_ol)
yhat = linear_function(weights_huber_ol, xtrain)
print('MSE loss (all data):', mse_loss(yhat, ytrain + outlier_noise))
yhat = linear_function(weights_huber_ol, xtrain[outlier_noise == 0])
print('MSE loss (unaffected data):', mse_loss(yhat, ytrain[outlier_noise == 0]))

gamma: 1.9831842254929397
loss values (last few): [1.2135555810637617, 1.2135555810630088, 1.2135555810623326, 1.2135555810617251, 1.2135555810611796, 1.213555581060689, 1.213555581060249, 1.2135555810598533, 1.2135555810594978, 1.213555581059179]
True weights: [3.0, 6.0]
Predicted weights: [3.29274357 5.34311077]
MSE loss (all data): 2.047293898186882
MSE loss (unaffected data): 0.13419719922409176


In [21]:
p = bkp.figure(title='Predictions -- Data With Outliers', y_range=[-1, 11], width=600, height=400)
p.circle(xtrain, ytrain + outlier_noise, radius=0.01)
xtmp = np.linspace(0, 1)
ytrue = linear_function(wt_true, xtmp)
ymse = linear_function(weights_mse_ol, xtmp)
yhuber = linear_function(weights_huber_ol, xtmp)
p.line(xtmp, ytrue, color='gray', line_width=3, legend_label='true weight')
p.line(xtmp, ymse, color='blue', line_width=3, legend_label='MSE')
p.line(xtmp, yhuber, color='darkorange', line_width=3, legend_label='Huber')
p.legend.location = 'bottom_right'
bkp.show(p)

### Training Loss
For completeness, we plot the training loss against the training iterations for both MSE and Huber losses.  The training loss decreases with the number of steps as expected.  The final training loss of the MSE-loss model (about 1.99) is higher than that of the Huber-loss model (about 1.21), although the two curves measure different loss functions and so are not directly comparable.

In [22]:
p = bkp.figure(title='Training Loss vs Iterations (data with outliers)', width=600, height=400)
p.line(np.arange(len(loss_values_mse_ol)) * 20, loss_values_mse_ol, line_width=2, legend_label='MSE')
p.line(np.arange(len(loss_values_huber_ol)) * 20, loss_values_huber_ol, line_width=2,
       color='darkorange', legend_label='Huber')
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Loss'
bkp.show(p)

## Comparison of MSE Loss
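The table below reports the MSE loss of each fitted model.  For the models trained on data with outliers, the loss is evaluated only on the points that were not turned into outliers.

| Model (training loss) | MSE, trained with outliers (unaffected points only) | MSE, trained without outliers |
|-----------------------|----------------------------------------------------:|------------------------------:|
| MSE Loss              |                                              0.2468 |                        0.1170 |
| Huber Loss            |                                              0.1342 |                        0.1170 |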

# Appendix

In [23]:
# Huber loss with a larger gamma (6.0, the outlier magnitude), trained on data with outliers
np.random.seed(424242)
gamma = 6.0
update_func = functools.partial(update_weights_huber, gamma=gamma)
loss_func = functools.partial(huber_loss, gamma=gamma)
weights, loss_values = train(xtrain, ytrain + outlier_noise, nweights=2, update_func=update_func,
                             loss_func=loss_func, niter=250 * 20)
print(loss_values[-10:])
print(wt_true)
print(weights)

[1.9935106591385687, 1.993510659138518, 1.993510659138473, 1.9935106591384324, 1.993510659138397, 1.993510659138365, 1.9935106591383362, 1.993510659138311, 1.9935106591382883, 1.9935106591382683]
[3.0, 6.0]
[3.68645222 4.58096621]
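
With $\gamma = 6.0$ the Huber fit reproduces the MSE weights and final loss from the run with outliers. This is consistent with every residual lying within $\gamma$ near the solution, where the clipped Huber gradient coincides with the MSE gradient. A minimal sketch of that identity on illustrative residuals (`huber_dloss` is a hypothetical helper):

```python
import numpy as np

err = np.array([-5.2, -0.3, 0.1, 0.8, 4.9])   # illustrative residuals

def huber_dloss(err, gamma):
    """Derivative of the point-wise Huber loss w.r.t. the residual."""
    return 2.0 * np.sign(err) * np.minimum(gamma, np.abs(err))

mse_dloss = 2.0 * err
# gamma >= max|err|: nothing is clipped, so the gradients are identical.
assert np.allclose(huber_dloss(err, 6.0), mse_dloss)
# Smaller gamma: large residuals are clipped to +/- 2 * gamma.
print(huber_dloss(err, 2.0))   # -4, -0.6, 0.2, 1.6, 4
```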
