In [4]:
import numpy as np

# LOSS FUNCTIONS

Loss functions quantify how close a given neural network is to the ideal toward which it is training. The idea is simple. We calculate a metric based on the error we observe in the network’s predictions. We then aggregate these errors over the entire dataset and average them and now we have a single number representative of how close the neural network is to its ideal. Looking for this ideal state is equivalent to finding the parameters (weights and biases) that will minimize the “loss” incurred from the errors. In this way, loss functions help reframe training neural networks as an optimization problem. In most cases, these parameters cannot be solved for analytically, but, more often than not, they can be approximated well with iterative optimization algorithms like gradient descent. The following section provides an overview on commonly seen loss functions, linking them back to their origins in machine learning, as necessary.
<img src="loss.png">

## Loss Functions for Regression

### Mean Squared Error Loss 

In [22]:
def meansquarederror(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    print("total no of training examples is"+str(trainingexamples))
    errorvector = ypred-ytrue
    print("shape of error vector"+str(errorvector.shape))
    error = (1/trainingexamples)*np.sum(errorvector**2)
    return error
    
    

In [23]:
genearatetrainingsets = np.random.randn(100,1)

In [24]:
genearatetrainingsetspred = np.random.randn(100,1)

In [25]:
meansquarederror(genearatetrainingsets,genearatetrainingsetspred)

total no of training examples is100
shape of error vector(100, 1)


2.1300515122978085

### Is MSE Really a Convex Loss Function? 

In a technical sense, the MSE is a convex loss function. However, when dealing with hidden layers in neural networks, the convex property no longer holds true, because we could have multiple parameter sets of values resulting in the same loss value.
### Optimizing MSE 
Optimizing the MSE is equivalent to optimizing for the mean.


### Mean absolute error loss

In [28]:
def meanabsoluteerror(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    print("total no of training examples is"+str(trainingexamples))
    errorvector = ypred-ytrue
    print("shape of error vector"+str(errorvector.shape))
    error = (1/trainingexamples)*np.sum(np.abs(errorvector))
    return error

In [29]:
meanabsoluteerror(genearatetrainingsets,genearatetrainingsetspred)

total no of training examples is100
shape of error vector(100, 1)


1.1399313206114992

### Mean squared log error los

In [32]:
def meansquarelogerror(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    print("total no of training examples is"+str(trainingexamples))
    errorvector = np.log(np.abs(ypred))-np.log(np.abs(ytrue))
    print("shape of error vector"+str(errorvector.shape))
    error = (1/trainingexamples)*np.sum(errorvector**2)
    return error

In [33]:
meansquarelogerror(genearatetrainingsets,genearatetrainingsetspred)

total no of training examples is100
shape of error vector(100, 1)


2.6846011395015252

### Mean absolute percentage error loss

In [45]:
def meanabsolutepercentageloss(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    print("total no of training examples is"+str(trainingexamples))
    errorvector = 100*(np.abs(ypred-ytrue)/np.abs(ytrue))
    print("shape of error vector"+str(errorvector.shape))
    error = (1/trainingexamples)*np.sum(errorvector)
    return error

In [46]:
meanabsolutepercentageloss(genearatetrainingsets,genearatetrainingsetspred)

total no of training examples is100
shape of error vector(100, 1)


1860.9473782172968

# Loss Functions for Classification

### Hinge loss 
Hinge loss is the most commonly used loss function when the network must be optimized for a hard classification. For example, 0 = no fraud and 1 = fraud, which by convention is called a 0-1 classifier. The 0,1 choice is somewhat arbitrary and –1, 1 is also seen in lieu of 0–1. Hinge loss is also seen in a class of models called maximummargin classification models (e.g., support vector machines, a somewhat distant cousin to neural networks). 
The hinge loss is mostly used for binary classifications. There are extensions for multiclass classification (e.g., “one versus all,” “one versus one”) for the hinge loss that are not covered here.
#### Hinge Loss Is a Convex Function 
Note that like the MSE, the hinge loss is known to be a convex function.


In [71]:
def hingeloss(ytrue,ypred):
    createarray =[]
    trainingexamples = ytrue.shape[0]
    createzeros = np.zeros(trainingexamples)
    print("total no of training examples is"+str(trainingexamples))
    errorvector=1-ypred*ytrue
    #print( errorvector)
    for i in errorvector:
        #print(i)
        if i >0:
            createarray.append(i)
        else:
            createarray.append(0)
    createarray = np.array(createarray)
    #print(createarray)
    #errorvector = np.maximum(createzeros,errorvector)
    print("shape of error vector"+str(createarray.shape))
    error = (1/trainingexamples)*np.sum(createarray)
    return error

In [72]:
hingeloss(genearatetrainingsets,genearatetrainingsetspred)

total no of training examples is100
shape of error vector(100,)


array([1.1800846])

### Logistic loss
Logistic loss functions are used when probabilities are of greater interest than hard classifications. Great examples of these would be flagging potential fraud, with a human-in-the-loop solution or predicting the “probability of someone clicking on an ad,” which can then be linked to a currency number.
Predicting valid probabilities means generating numbers between 0 and 1. Predicting valid probabilities also means making sure the probability of mutually exclusive outcomes should sum to one. For this reason, it is essential that the very last layer of a neural network used in classification is a softmax. Note that the sigmoid activation function also will give valid values between 0 and 1. However, you cannot use it for scenarios in which the outputs are mutually exclusive, because it does not model the dependencies between the output values. Now that we have made sure our neural network will produce valid probabilities for the classes we have, we can dive headlong into the loss function and into the idea of what we should be optimizing here. We want to optimize for what is formally called the “maximum likelihood.” In other words, we want to maximize the probability we predict for the correct class AND we want to do so for every single sample we have. 

### Negative log likelihood. 
For the sake of mathematical convenience, when dealing with the product of probabilities, it is customary to convert them to the log of the probabilities; hence, the product of the probabilities transforms to the sum of the log of the probabilities.

#### Negative Log Likelihood and Maximizing Probability
The logarithm is a monotonically increasing function. Thus, minimizing the negative log likelihood is equivalent to maximizing the probability

In [97]:
def sigmoid(arrays):
    return 1/(1+np.exp((-1)*arrays))

In [104]:
sigtrain=sigmoid(genearatetrainingsets)
sigpred=sigmoid(genearatetrainingsetspred)

In [105]:
def logisticsloss(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    errorvector  = (ytrue*np.log(ypred))+(ytrue*np.log(1-ypred))
    return (1/trainingexamples)*np.sum(errorvector)

In [106]:
logisticsloss(sigtrain,sigpred)

-0.8463647615390332

### Cross-entropy
Cross-entropy has its origin in information theory, whereas negative log likelihood for classification has its origin in statistical modeling. the cross-entropy between two probability distributions—in this case, what we predict and what we have observed under the same criteria

In [107]:
def crossentropy(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    errorvector = -1 * ytrue * np.log(ypred)
    return (1/trainingexamples)*np.sum(errorvector)
    

In [108]:
crossentropy(sigtrain,sigpred)

0.441498924599787

## Loss Functions for Reconstruction 
This set of loss functions relates to what is called reconstruction. The idea is simple. A neural network is trained to recreate its input as closely as possible. So, why is this any different from memorizing the entire dataset? The key here is to tweak the scenario so that the network is forced to learn commonalities and features across the dataset. In one approach, the number of parameters in the network is constrained such that the network is forced to compress the data and then re-create it. Another often-used approach is to corrupt the input with meaningless “noise” and train the network to ignore the noise and learn the data. Examples of these kinds of neural nets are restricted Boltzmann machines, autoencoders, and so on. These neural networks all use loss functions that are rooted in information theory.


In [None]:
def reconstruction(ytrue,ypred):
    trainingexamples = ytrue.shape[0]
    errorvector = -1 * ytrue * np.log(ypred)
    return (1/trainingexamples)*np.sum(errorvector)