1. Squared Error Loss


Squared Error loss for each training example, also known as L2 loss, is the square of the difference between the actual value y and the predicted value f(x):

$$L = (y - f(x))^2$$

The corresponding cost function is the Mean of these Squared Errors (MSE).

I encourage you to try and find the gradient for gradient descent yourself before referring to the code below.

def update_weights_MSE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
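
To see an update step like this one at work, here is a minimal sketch of a full training loop. The helper names `mse_cost` and `mse_step`, the toy data, the learning rate and the number of iterations are all assumptions made for this illustration.

```python
def mse_cost(m, b, X, Y):
    """Mean of the squared errors for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(X, Y)) / len(X)


def mse_step(m, b, X, Y, learning_rate):
    """One gradient descent step on the MSE cost (hypothetical helper)."""
    N = len(X)
    m_grad = sum(-2 * x * (y - (m * x + b)) for x, y in zip(X, Y)) / N
    b_grad = sum(-2 * (y - (m * x + b)) for x, y in zip(X, Y)) / N
    return m - learning_rate * m_grad, b - learning_rate * b_grad


# Toy data generated from y = 2x + 1 (made up for this example)
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0
initial_cost = mse_cost(m, b, X, Y)
for _ in range(2000):
    m, b = mse_step(m, b, X, Y, learning_rate=0.05)
final_cost = mse_cost(m, b, X, Y)

print(f"m = {m:.3f}, b = {b:.3f}, cost {initial_cost:.3f} -> {final_cost:.6f}")
```

On this noise-free data the loop should recover a slope close to 2 and an intercept close to 1. A learning rate that is too large makes the updates diverge instead, so treat 0.05 as a value that happens to suit this toy data.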

2. Absolute Error Loss

Absolute Error for each training example is the distance between the predicted and the actual values, irrespective of the sign. Absolute Error is also known as the L1 loss:

$$L = |y - f(x)|$$

As I mentioned before, the cost is the Mean of these Absolute Errors (MAE).

The MAE cost is more robust to outliers than MSE. However, the absolute value (or modulus) operator is not differentiable at zero, so it is harder to handle in mathematical equations. I’m sure a lot of you must agree with this! We can consider this a disadvantage of MAE.
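
One way to see the robustness claim is to look at the simplest possible model, a single constant prediction. The constant that minimizes MSE is the mean of the targets, while the constant that minimizes MAE is their median. The sketch below uses made-up values with one outlier; the helper names are hypothetical.

```python
import statistics

# Made-up targets: four typical values and one outlier
values = [10.0, 11.0, 9.0, 10.0, 100.0]


def mse_of_constant(c, ys):
    """MSE when every prediction is the constant c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mae_of_constant(c, ys):
    """MAE when every prediction is the constant c."""
    return sum(abs(y - c) for y in ys) / len(ys)


mean = statistics.mean(values)      # minimizes the MSE of a constant prediction
median = statistics.median(values)  # minimizes the MAE of a constant prediction

print("mean:", mean, "median:", median)
print("MSE at mean / median:", mse_of_constant(mean, values), mse_of_constant(median, values))
print("MAE at mean / median:", mae_of_constant(mean, values), mae_of_constant(median, values))
```

The single outlier drags the MSE-optimal prediction far away from the typical values, while the MAE-optimal prediction stays with the bulk of the data.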

Here is the code for the update_weight function with MAE cost:

def update_weights_MAE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        residual = Y[i] - (m*X[i] + b)
        if residual == 0:
            # |residual| is not differentiable at 0; use 0 as a subgradient
            continue

        # -x(y - (mx + b)) / |y - (mx + b)|
        m_deriv += -X[i] * residual / abs(residual)

        # -(y - (mx + b)) / |y - (mx + b)|
        b_deriv += -residual / abs(residual)

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

3. Huber Loss

The Huber loss combines the best properties of MSE and MAE. It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient). It is identified by its delta parameter:

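$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{if } |y - f(x)| \le \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$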

def update_weights_Huber(m, b, X, Y, delta, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # derivative of quadratic for small values and of linear for large values
        if abs(Y[i] - m*X[i] - b) <= delta:
            m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
            b_deriv += -(Y[i] - (m*X[i] + b))
        else:
            m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
            b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
    
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
    
    
    
Huber loss is more robust to outliers than MSE. It is used in Robust Regression, M-estimation and Additive Modelling. A variant of Huber Loss is also used in classification.
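
To make the "quadratic for small errors, linear otherwise" behaviour concrete, here is a small sketch that evaluates the Huber loss next to the two losses it blends. The function name `huber_loss` and the choice delta = 1.0 are assumptions for this illustration.

```python
def huber_loss(residual, delta):
    """Huber loss of a single residual y - f(x), assuming delta > 0."""
    if abs(residual) <= delta:
        # Small errors: half the squared error
        return 0.5 * residual ** 2
    # Large errors: grows linearly, shifted so both pieces meet at |residual| = delta
    return delta * abs(residual) - 0.5 * delta ** 2


delta = 1.0  # assumed threshold for this illustration
for r in [0.5, 1.0, 2.0, 10.0]:
    print("residual:", r,
          "huber:", huber_loss(r, delta),
          "half squared error:", 0.5 * r ** 2,
          "absolute error:", abs(r))
```

For a residual of 10 the half squared error is 50, while the Huber loss is only 9.5, which is why a single outlier has much less influence on the fit.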

Binary Classification Loss Functions



The name is pretty self-explanatory. Binary Classification refers to assigning an object to one of two classes. This classification is based on a rule applied to the input feature vector. For example, classifying an email as spam or not spam based on, say, its subject line, is binary classification.

I will illustrate these binary classification loss functions on the Breast Cancer dataset.

We want to classify a tumor as ‘Malignant’ or ‘Benign’ based on features like average radius, area, perimeter, etc. For simplicity, we will use only two input features for classification: ‘worst area’ (X_1) and ‘mean symmetry’ (X_2). The target value Y can be 0 (Malignant) or 1 (Benign).

Here is a scatter plot for our data:

[Figure: scatter plot of the data, using the features ‘worst area’ (X_1) and ‘mean symmetry’ (X_2)]

 

1. Binary Cross Entropy Loss
Let us start by understanding the term ‘entropy’. Generally, we use entropy to indicate disorder or uncertainty. It is measured for a random variable X with probability distribution p(X):

$$H(X) = -\sum_{x} p(x) \log p(x)$$

(for a continuous X, the sum becomes an integral).

Since p(x) is at most 1, log p(x) is never positive, so the negative sign makes the overall quantity non-negative.

A greater value of entropy for a probability distribution indicates a greater uncertainty in the distribution. Likewise, a smaller value indicates a more certain distribution.
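
A quick sketch makes this tangible for a coin flip. The function name `entropy` is my own, and I use the natural logarithm, so the values are in nats.

```python
import math


def entropy(probs):
    """Entropy (in nats) of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped, following the convention 0 * log(0) = 0
    return sum(-p * math.log(p) for p in probs if p > 0)


print("fair coin:    ", entropy([0.5, 0.5]))   # the most uncertain two-outcome case
print("biased coin:  ", entropy([0.9, 0.1]))   # less uncertain
print("certain event:", entropy([1.0, 0.0]))   # no uncertainty at all
```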

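Cross-entropy applies the same idea to two distributions: it is small when the predicted probabilities agree with the actual labels and large when they disagree. This makes binary cross-entropy suitable as a loss function, since you want to minimize its value. We use binary cross-entropy loss for classification models that output a probability p.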

Probability that the element belongs to class 1 (or positive class) = p
Probability that the element belongs to class 0 (or negative class) = 1 - p

The cross-entropy loss for output label y (which can take values 0 and 1) and predicted probability p is then defined as:

$$L = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$

This is also called Log-Loss. To calculate the probability p, we can use the sigmoid function. Here, z is a function of our input features:

$$S(z) = \frac{1}{1 + e^{-z}}$$

The range of the sigmoid function is (0, 1), which makes it suitable for calculating probability.
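
Here is a small sketch of the two pieces together: the sigmoid turns a score z into a probability, and the log-loss scores that probability against the label. The function names, the sign-based branching that avoids overflow, and the `eps` clip are my own choices for this illustration.

```python
import math


def sigmoid(z):
    """Map any real z to a probability between 0 and 1."""
    # Branch on the sign of z so that math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y, p, eps=1e-12):
    """Log-loss for one label y in {0, 1} and predicted probability p."""
    # eps is an assumed safety clip that keeps log() away from 0
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


for z in [-4.0, 0.0, 4.0]:
    p = sigmoid(z)
    print("z:", z,
          "p:", round(p, 4),
          "loss if y=1:", round(binary_cross_entropy(1, p), 4),
          "loss if y=0:", round(binary_cross_entropy(0, p), 4))
```

A confident and correct prediction gives a loss near zero, while a confident and wrong prediction is penalized heavily.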


Try to find the gradient yourself and then look at the code for the update_weight function below.

import math

def update_weights_BCE(m1, m2, b, X1, X2, Y, learning_rate):
    m1_deriv = 0
    m2_deriv = 0
    b_deriv = 0
    N = len(X1)
    for i in range(N):
        # Predicted probability of class 1: sigmoid of the linear score
        s = 1 / (1 + math.exp(-m1*X1[i] - m2*X2[i] - b))

        # Calculate partial derivatives
        # x1(s - y), x2(s - y) and (s - y)
        m1_deriv += X1[i] * (s - Y[i])
        m2_deriv += X2[i] * (s - Y[i])
        b_deriv += (s - Y[i])

    # We subtract because the derivatives point in direction of steepest ascent
    m1 -= (m1_deriv / float(N)) * learning_rate
    m2 -= (m2_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m1, m2, b
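
If you derived the gradient yourself, a finite-difference check is a handy way to confirm it. The sketch below compares the analytic partial derivative with respect to m1, which works out to the mean of x1 * (s - y), against a numerical estimate. The data, the weights and the helper names are made up for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_cost(m1, m2, b, X1, X2, Y):
    """Mean binary cross-entropy over the data set."""
    total = 0.0
    for x1, x2, y in zip(X1, X2, Y):
        p = sigmoid(m1 * x1 + m2 * x2 + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(Y)


def bce_grad_m1(m1, m2, b, X1, X2, Y):
    """Analytic partial derivative of the cost with respect to m1."""
    return sum(x1 * (sigmoid(m1 * x1 + m2 * x2 + b) - y)
               for x1, x2, y in zip(X1, X2, Y)) / len(Y)


# Made-up data and weights, small enough that exp() cannot overflow
X1 = [0.5, -1.2, 2.0, 0.3]
X2 = [1.0, 0.4, -0.7, -1.5]
Y = [1, 0, 1, 0]
m1, m2, b = 0.2, -0.1, 0.05

# Central difference: nudge m1 a little in both directions
h = 1e-6
numeric = (bce_cost(m1 + h, m2, b, X1, X2, Y) - bce_cost(m1 - h, m2, b, X1, X2, Y)) / (2 * h)
analytic = bce_grad_m1(m1, m2, b, X1, X2, Y)

print("numeric: ", numeric)
print("analytic:", analytic)
```

The two numbers should agree to several decimal places. The same check works for m2 and b by nudging those parameters instead.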

2. Hinge Loss
Hinge loss is primarily used with Support Vector Machine (SVM) Classifiers with class labels -1 and 1. So make sure you change the label of the ‘Malignant’ class in the dataset from 0 to -1.

Hinge Loss not only penalizes the wrong predictions but also the right predictions that are not confident.

Hinge loss for an input-output pair (x, y) is given as:

$$L = \max(0, 1 - y \cdot f(x))$$

def update_weights_Hinge(m1, m2, b, X1, X2, Y, learning_rate):
    m1_deriv = 0
    m2_deriv = 0
    b_deriv = 0
    N = len(X1)
    for i in range(N):
        # Calculate partial derivatives
        if Y[i]*(m1*X1[i] + m2*X2[i] + b) <= 1:
            m1_deriv += -X1[i] * Y[i]
            m2_deriv += -X2[i] * Y[i]
            b_deriv += -Y[i]
        # else derivatives are zero

    # We subtract because the derivatives point in direction of steepest ascent
    m1 -= (m1_deriv / float(N)) * learning_rate
    m2 -= (m2_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m1, m2, b
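
The point that hinge loss also penalizes right predictions that are not confident is easy to check numerically. Here is a minimal sketch; the function name `hinge_loss` and the scores are made up for this illustration.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw model score f(x)."""
    return max(0.0, 1.0 - y * score)


# Raw scores for a positive example (y = +1); the values are made up
for score in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    print("score:", score, "loss:", hinge_loss(1, score))
```

A score of 0.5 is on the correct side of the decision boundary for y = +1, yet it still incurs a loss of 0.5. The loss only drops to zero once the score reaches the margin of 1.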

Multi-Class Classification Loss Functions



Emails are not just classified as spam or not spam (this isn’t the 90s anymore!). They are classified into various other categories – Work, Home, Social, Promotions, etc. This is a Multi-Class Classification use case.

We’ll use the Iris Dataset for understanding the remaining two loss functions. We will use two features, X_1 (sepal length) and X_2 (petal width), to predict the class (Y) of the Iris flower: Setosa, Versicolor or Virginica.

Our task is to implement the classifier using a neural network model and the in-built Adam optimizer in Keras. This is because as the number of parameters increases, the math, as well as the code, will become difficult to comprehend.

If you are new to Neural Networks, I highly recommend reading this article first.

Here is the scatter plot for our data:

[Figure: scatter plot of the data, using the features sepal length (X_1) and petal width (X_2)]

 

1. Multi-Class Cross Entropy Loss
The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss. The loss for input vector X_i and the corresponding one-hot encoded target vector Y_i is:

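$$L(X_i, Y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij})$$

where c is the number of classes, y_ij is 1 if element i belongs to class j and 0 otherwise, and p_ij is the predicted probability that element i belongs to class j.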

We use the softmax function to find the probabilities p_ij:

$$p_{ij} = \frac{e^{z_{ij}}}{\sum_{k=1}^{c} e^{z_{ik}}}$$

Here z_ij is the raw output of the network for element i and class j.

“Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.” Google Developer’s Blog

Finally, our output is the class with the maximum probability for the given input.
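
Before handing things over to Keras, here is a minimal pure-Python sketch of these three steps for a single example: softmax, picking the most probable class, and the cross-entropy loss. The function names, the raw scores and the target are made up for this illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged and avoids overflow in exp()
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs):
    """Loss for one example: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


scores = [2.0, 1.0, 0.1]   # made-up raw outputs for three classes
target = [1, 0, 0]         # one-hot target: the true class is class 0

probs = softmax(scores)
predicted_class = probs.index(max(probs))
loss = categorical_cross_entropy(target, probs)

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class:", predicted_class)
print("loss:", round(loss, 3))
```

Because the target is one-hot, only the probability assigned to the true class contributes to the loss.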

We build a model with a 50-unit hidden layer that takes the two input features and a 3-unit softmax output layer, and compile it with different learning rates. Specify the loss parameter as 'categorical_crossentropy' in the model.compile() statement:



# importing requirements
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

# alpha = 0.001 as given in the learning_rate parameter of the Adam() optimizer
# (older Keras releases call this parameter lr)

# build the model
model_alpha1 = Sequential()
model_alpha1.add(Dense(50, input_dim=2, activation='relu'))
model_alpha1.add(Dense(3, activation='softmax'))

# compile the model
opt_alpha1 = Adam(learning_rate=0.001)
model_alpha1.compile(loss='categorical_crossentropy', optimizer=opt_alpha1, metrics=['accuracy'])

# fit the model
# dataX holds the two input features; dummy_Y is the one-hot encoded target
# history_alpha1 is used to store the validation and accuracy scores for plotting
# note: the training data is reused as validation data here
history_alpha1 = model_alpha1.fit(dataX, dummy_Y, validation_data=(dataX, dummy_Y), epochs=200, verbose=0)
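
The snippet above assumes that dummy_Y already holds one-hot encoded targets. As a rough sketch of what that encoding looks like, here is a pure-Python helper. The name `one_hot`, the example labels and the mapping of integers to species are assumptions for this illustration.

```python
def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows (hypothetical helper)."""
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]


# Assumed integer labels, e.g. 0 = Setosa, 1 = Versicolor, 2 = Virginica
labels = [0, 2, 1, 0]
dummy_Y_example = one_hot(labels, num_classes=3)

for label, row in zip(labels, dummy_Y_example):
    print(label, "->", row)
```

Keras works with array-like inputs, so in practice you would probably convert the nested list to a NumPy array before passing it to fit().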