# Using loss functions and optimizers in TensorFlow

TensorFlow ships with many built-in functions and classes that cover the common building blocks of a training loop.

In [tf.keras.losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses/) we find loss functions and in [tf.keras.optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) we find optimizers, including the standard gradient descent.

## Mean Squared Error

The mean squared error is exactly what its name says: the mean of the squared error between the target $y$ and the prediction $\hat{y}$, that is:

$$\mathcal{L}_{\text{MSE}}= \text{E}\left[(y - \hat{y})^2\right]$$

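It is typically used for regression tasks. Because the error is squared, large deviations from the target are penalized disproportionately more than small ones. Up to the averaging factor, it is the squared Euclidean distance between targets and predictions.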

In [7]:
import tensorflow as tf
import numpy as np

BATCH_SIZE = 8
N_PREDICTED_FEATURES = 5

targets = tf.random.uniform((BATCH_SIZE, N_PREDICTED_FEATURES))
predictions = tf.random.uniform((BATCH_SIZE, N_PREDICTED_FEATURES))

mse_loss = tf.keras.losses.MeanSquaredError()
mean_squared_error = mse_loss(targets, predictions)
print(f"MSE error for the batch:\n {mean_squared_error.numpy()} \n")

MSE error for the batch:
 0.2522013485431671 
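
To make the formula concrete, the following minimal sketch computes the MSE by hand and compares it with the Keras loss object. The small `targets` and `predictions` tensors are hypothetical values chosen for illustration, and the comparison assumes the loss object is used with its default reduction.

```python
import tensorflow as tf

# Hypothetical batch: 2 examples with 3 predicted features each.
targets = tf.constant([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])
predictions = tf.constant([[1.0, 2.0, 5.0],
                           [1.0, 0.0, 0.0]])

# Manual MSE: mean of the squared differences over all elements.
manual_mse = tf.reduce_mean(tf.square(targets - predictions))

# Keras loss object, called as loss(y_true, y_pred).
keras_mse = tf.keras.losses.MeanSquaredError()(targets, predictions)

print("manual:", manual_mse.numpy())
print("keras: ", keras_mse.numpy())
```

The squared differences sum to 5 over 6 elements, so both values should agree at 5/6 up to floating point precision.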



## Categorical CrossEntropy

The cross-entropy loss is a loss function used for classification tasks.

The entropy of a distribution $p$ is defined as $H(p) = -\sum_{i=1}^n p_i \log{(p_i)}$ and the cross-entropy is defined as

$$ H(p, y) = -\sum_{i=1}^n y_i \log{(p_i)}$$

where $y$ is the one-hot encoded label and $p$ is the predicted categorical probability distribution over the $n$ classes.

Another form, the binary cross-entropy, covers binary classification as well as multi-label classification, where several classes can be labeled as true at the same time. It is defined as:

$$H_{\text{binary}}(p, y) = -\sum_{i=1}^n \left[ y_i \log{(p_i)} + (1-y_i) \log{(1-p_i)} \right]$$

where each element of the predicted probabilities $p$ is the probability of that class being correct, independent of the other classes (so $p$ need not sum to 1).
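
As a worked example of these definitions, the sketch below evaluates the entropy, the cross-entropy, and the binary cross-entropy with plain NumPy. All label and probability vectors are hypothetical values picked for illustration. With a one-hot label, only the true class contributes to the cross-entropy, so it reduces to $-\log$ of the probability assigned to that class.

```python
import numpy as np

# Hypothetical one-hot label and predicted distribution over 3 classes.
y = np.array([0.0, 1.0, 0.0])
p = np.array([0.2, 0.7, 0.1])

# Entropy of the predicted distribution: -sum_i p_i log(p_i)
entropy = -np.sum(p * np.log(p))

# Cross-entropy: -sum_i y_i log(p_i); here this equals -log(0.7).
cross_entropy = -np.sum(y * np.log(p))

# Binary form: every class is an independent yes/no decision,
# so the probabilities q do not need to sum to 1.
y_multi = np.array([1.0, 1.0, 0.0])
q = np.array([0.9, 0.6, 0.2])
binary_cross_entropy = -np.sum(
    y_multi * np.log(q) + (1.0 - y_multi) * np.log(1.0 - q)
)

print("entropy:             ", entropy)
print("cross-entropy:       ", cross_entropy)
print("binary cross-entropy:", binary_cross_entropy)
```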



In [8]:
# Categorical CrossEntropy
labels = [[0,1,0],
         [0,0,1],
         [1,0,0],
         [1,0,0],
         [0,1,0]]

labels = tf.constant(labels, dtype=tf.float32)

logits = tf.random.normal(labels.shape)

# turn network output into categorical probability distribution over the labels
predictions = tf.nn.softmax(logits)


# calculate categorical crossentropy

CCE_loss = tf.keras.losses.CategoricalCrossentropy()
batch_loss = CCE_loss(labels, predictions)

print(f"CCE loss between predicted label probabilities and ground truth labels is \n {batch_loss.numpy()}")

CCE loss between predicted label probabilities and ground truth labels is 
 1.5787382125854492
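
The cell above applies `tf.nn.softmax` before computing the loss. As an alternative sketch, the Keras cross-entropy classes accept a `from_logits=True` argument, in which case the raw network outputs are passed directly and the softmax is handled inside the loss. The fixed `labels` and `logits` below are hypothetical values so that the comparison is reproducible.

```python
import tensorflow as tf

labels = tf.constant([[0, 1, 0],
                      [0, 0, 1]], dtype=tf.float32)

# Hypothetical raw network outputs (logits).
logits = tf.constant([[1.0, 2.0, 0.5],
                      [0.1, 0.2, 3.0]])

# Variant 1: apply softmax explicitly, then use the default loss.
loss_from_probs = tf.keras.losses.CategoricalCrossentropy()(
    labels, tf.nn.softmax(logits)
)

# Variant 2: pass logits and let the loss apply softmax internally.
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(
    labels, logits
)

print("from probabilities:", loss_from_probs.numpy())
print("from logits:       ", loss_from_logits.numpy())
```

Both variants should give nearly identical values. Passing logits is often preferred because the combined computation tends to be numerically more stable for very confident predictions.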


In [9]:
# Binary CrossEntropy
labels = [1,
          0,
          0,
          1,
          0,
          0,
          1,
          0]
labels = tf.constant(labels, dtype=tf.float32)

predictions = tf.random.uniform(labels.shape)

BCE_loss = tf.keras.losses.BinaryCrossentropy()
batch_loss = BCE_loss(labels, predictions)

print(f"BCE loss between predicted label probabilities and ground truth labels is \n{batch_loss.numpy()}")

BCE loss between predicted label probabilities and ground truth labels is 
0.9398701786994934
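
One detail worth checking: the formula for the binary cross-entropy is written as a sum, whereas the Keras loss object with its default reduction returns an average over the elements. The sketch below, using hypothetical labels and probabilities, computes the per-element terms by hand and compares their mean with `BinaryCrossentropy`.

```python
import tensorflow as tf

# Hypothetical binary labels and predicted probabilities.
labels = tf.constant([1.0, 0.0, 0.0, 1.0])
predictions = tf.constant([0.9, 0.2, 0.6, 0.3])

# Per-element binary cross-entropy terms from the formula.
per_element = -(
    labels * tf.math.log(predictions)
    + (1.0 - labels) * tf.math.log(1.0 - predictions)
)

# Averaging the terms matches the default behaviour of the Keras loss.
manual_bce = tf.reduce_mean(per_element)
keras_bce = tf.keras.losses.BinaryCrossentropy()(labels, predictions)

print("manual mean:", manual_bce.numpy())
print("keras:      ", keras_bce.numpy())
```

Small differences in the last digits are expected, since Keras clips probabilities slightly away from 0 and 1 before taking the logarithm.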


## Making use of optimizers in the training loop
- Optimizers can be found in `tf.keras.optimizers`.
- An optimizer takes the computed gradients and applies them to the parameters according to its update rule (for example plain gradient descent, optionally with momentum).
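
To illustrate what "applying gradients" means, the following sketch performs one plain gradient descent step by hand and one with `tf.keras.optimizers.SGD`. The toy loss `loss_fn`, the starting value, and the learning rate are hypothetical choices for illustration; the equivalence shown here only holds for SGD without momentum.

```python
import tensorflow as tf

LEARNING_RATE = 0.1

# Two copies of the same hypothetical parameter.
w_manual = tf.Variable(2.0)
w_optimizer = tf.Variable(2.0)

def loss_fn(w):
    # Toy loss with its minimum at w = 5; the gradient is 2 * (w - 5).
    return tf.square(w - 5.0)

# Manual update: w <- w - learning_rate * gradient
with tf.GradientTape() as tape:
    loss = loss_fn(w_manual)
grad = tape.gradient(loss, w_manual)
w_manual.assign_sub(LEARNING_RATE * grad)

# The same step, delegated to the optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)
with tf.GradientTape() as tape:
    loss = loss_fn(w_optimizer)
grads = tape.gradient(loss, [w_optimizer])
optimizer.apply_gradients(zip(grads, [w_optimizer]))

print("manual step:   ", w_manual.numpy())
print("optimizer step:", w_optimizer.numpy())
```

The gradient at the starting point is 2 * (2 - 5) = -6, so both variables should move from 2.0 to 2.6.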

In [13]:
# choose optimizer and loss
optimizer = tf.keras.optimizers.SGD(learning_rate=0.005,
                                   momentum=0)

loss_function = tf.keras.losses.MeanSquaredError()

# create data
X = tf.random.uniform((20, 1), minval=0, maxval=10)
Y = X * np.pi

# a simple linear univariate model function without bias
def model(x, parameter):
    return x * parameter

# initialize parameter variable to a value far away from pi
parameter_estimate = tf.Variable(7.5, trainable=True, dtype=tf.float32)

print("parameter value before training:", parameter_estimate.numpy())

# iterate over epochs
for epoch in range(2):

    # iterate over single training examples (no batching here; loss_function also accepts batched inputs)
    for x,y in zip(X,Y):
        
        # within GradientTape context manager, calculate loss between targets and prediction
        with tf.GradientTape() as tape:

            prediction = model(x, parameter_estimate)

            loss = loss_function(y, prediction)

        # outside of context manager, obtain gradients with respect to list of trainable variables
        gradients = tape.gradient(loss, [parameter_estimate])

        # apply gradients with optimizer
        optimizer.apply_gradients(zip(gradients, [parameter_estimate]))
        
print("parameter value after training: ", parameter_estimate.numpy())

parameter value before training: 7.5
parameter value after training:  3.1415927
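
The loop above updates the parameter once per training example. Since the loss function also accepts batched inputs, a mini-batch variant is a small change. The sketch below is one possible way to do this with `tf.data`; the evenly spaced inputs, the batch size of 5, and the number of epochs are assumptions made for a reproducible illustration, not part of the original example.

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.005)
loss_function = tf.keras.losses.MeanSquaredError()

# Deterministic data (assumption): 20 evenly spaced inputs, targets are pi * x.
X = tf.reshape(tf.linspace(0.5, 10.0, 20), (20, 1))
Y = X * np.pi

# Group the examples into mini-batches of 5.
dataset = tf.data.Dataset.from_tensor_slices((X, Y)).batch(5)

def model(x, parameter):
    # Same univariate linear model without bias.
    return x * parameter

parameter_estimate = tf.Variable(7.5, trainable=True, dtype=tf.float32)

for epoch in range(10):
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            prediction = model(x_batch, parameter_estimate)
            # The loss is averaged over the whole mini-batch.
            loss = loss_function(y_batch, prediction)
        gradients = tape.gradient(loss, [parameter_estimate])
        optimizer.apply_gradients(zip(gradients, [parameter_estimate]))

print("parameter value after training:", parameter_estimate.numpy())
```

With batching, each epoch performs fewer but less noisy updates, so more epochs are used here than in the per-example loop.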
