In [14]:
import tensorflow as tf
from tensorflow import keras


# 1 - Some theory

## 1.1 Multi class classification

There are more than two classes, and every observation belongs to exactly one class. For example, an e-commerce company wants to categorize smartphones by brand (Samsung, Huawei, Apple, Xiaomi, Sony or Other).

In multi-class classification, the neural network has the same number of output nodes as the number of classes. Each output node belongs to some class and outputs a score for that class.

Scores from the last layer are passed through a <b>softmax layer</b>, which converts the scores into probability values. The data is then assigned to the class with the highest probability.

As for the loss function, we generally use <b>categorical_crossentropy</b> for multi-class classification.

Finally, regarding the <b>target</b>, we have to feed a one-hot encoded vector (e.g., [0, 1, 0, 0, 0]) to the neural network. This vector is compared with the output of the softmax layer to compute the loss and the accuracy.
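As a minimal illustration of the target format, the sketch below one-hot encodes class indices with NumPy. The class list reuses the brand example above; its ordering and the `to_one_hot` helper name are assumptions made for illustration.

```python
import numpy as np

# Class names from the smartphone example; the ordering is an arbitrary assumption.
CLASSES = ["Samsung", "Huawei", "Apple", "Xiaomi", "Sony", "Other"]


def to_one_hot(class_indices, num_classes):
    """Return an (n_samples, num_classes) array with exactly one 1 per row."""
    class_indices = np.asarray(class_indices)
    one_hot = np.zeros((class_indices.size, num_classes), dtype=int)
    # Set the column matching each sample's class index to 1
    one_hot[np.arange(class_indices.size), class_indices] = 1
    return one_hot


labels = ["Huawei", "Sony", "Samsung"]
indices = [CLASSES.index(name) for name in labels]
targets = to_one_hot(indices, len(CLASSES))
print(targets)
```

Each row sums to 1, which is what distinguishes a multi-class target from the multi-label targets discussed later.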

### 1.1.1 Softmax layer

Source: https://towardsdatascience.com/softmax-activation-function-explained-a7e1bc3ad60

The output layer of a multi-class classification problem must first give us the probability of each class; by selecting the class with the maximum probability, we can then turn the output into a one-hot encoded vector and compare it with the actual value. We therefore need a function that takes arbitrary values and transforms them into a probability distribution. Softmax function to the rescue.

The function is great for classification problems, especially multi-class classification problems, as it reports back a “confidence score” for each class. Since we’re dealing with probabilities here, the scores returned by the softmax function add up to 1. The predicted class is, therefore, the item in the list with the highest confidence score.

Let's look at the mathematical expression for the softmax function:

<img src=eq1.png width="100">

It states that we need to apply a standard exponential function to each element of the output layer, and then normalize these values by dividing by the sum of all the exponentials. Doing so ensures that the normalized values add up to 1.

<img src=eq2.jpeg width="300">

In [2]:
# Code snippet in Python
import numpy as np

def softmax(scores):
    # Exponentiate each score, then normalize by the sum of the exponentials
    exp = np.exp(scores)
    return exp / np.sum(exp)

print(f'output: {softmax([5, 7, 4, 6])}')
print(f'sum: {sum(softmax([5, 7, 4, 6]))}')

output: [0.08714432 0.64391426 0.0320586  0.23688282]
sum: 1.0
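The direct implementation above can overflow for large scores, because `np.exp` grows very quickly. A common remedy, sketched here as an addition rather than as part of the original notebook, is to subtract the maximum score before exponentiating; this leaves the result mathematically unchanged. The name `stable_softmax` is hypothetical.

```python
import numpy as np


def stable_softmax(scores):
    """Softmax that shifts the scores by their maximum to avoid overflow."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # the largest shifted score is 0
    exp = np.exp(shifted)
    return exp / np.sum(exp)


probs = stable_softmax([5, 7, 4, 6])
# Same scores shifted by 995: the naive version would overflow here
large = stable_softmax([1000, 1002, 999, 1001])
print(probs)
print(large)
```

Both calls should produce the same distribution, since softmax only depends on the differences between scores.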


### 1.1.2 Loss function: Categorical crossentropy 

source: https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
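For a single sample, categorical cross-entropy is `L = -sum_i y_i * log(p_i)`, where `y` is the one-hot target and `p` is the softmax output. With a one-hot target, only the term of the true class survives. The following is a minimal NumPy sketch; the function name and the `eps` clipping (to avoid `log(0)`) are illustrative assumptions.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))


y_true = [0, 1, 0, 0]
y_pred = [0.08714432, 0.64391426, 0.0320586, 0.23688282]  # softmax output from above
loss = categorical_crossentropy(y_true, y_pred)

# If the true class were the third one, the model would be badly wrong
loss_wrong = categorical_crossentropy([0, 0, 1, 0], y_pred)
print(f'loss (correct class favored): {loss:.4f}')
print(f'loss (correct class unlikely): {loss_wrong:.4f}')
```

The loss is small when the true class receives a high probability and grows quickly as that probability approaches 0.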

## 1.2 Multi Label Classification

The only difference from multi-class classification is that here a data sample can belong to multiple classes. As a result, we have to handle a few things differently.

One example application is medical diagnosis, where we need to prescribe one or more treatments to a patient based on their signs and symptoms.
By analogy, we can design a multi-label classifier for car diagnosis. It takes as input all electronic measurements, errors, symptoms and mileage, and predicts the parts that need to be replaced after an incident with the car.

### 1.2.1 Activation Function

The final score for each class should be independent of the others. Thus we <b>cannot apply softmax activation</b>, because softmax converts each score into a probability by taking the other scores into consideration.
The reason for this independence is straightforward: if a movie's genre is action, that should not affect whether the movie is also a thriller.

We use the <b>sigmoid activation function</b> on the final layer. Sigmoid maps the score of each output node to a value between 0 and 1, independently of the other scores. If the score for some class is more than 0.5, the data is classified into that class. Several classes can have a score above 0.5 at the same time, so the data can be classified into multiple classes. Following is the code snippet for sigmoid activation.


In [3]:
# Code snippet in Python (note that the outputs do not need to sum to 1)
def sigmoid(scores):
    # Apply 1 / (1 + exp(-x)) to each score independently
    scores = np.negative(scores)
    exp = np.exp(scores)
    scores = 1 / (1 + exp)
    return scores

print(sigmoid([2, -1, .15, 3]))


[0.88079708 0.26894142 0.53742985 0.95257413]
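The 0.5 decision rule described above can be sketched as follows, starting from the sigmoid outputs just printed. The genre names are hypothetical placeholders for the four output nodes.

```python
import numpy as np

GENRES = ["Action", "Comedy", "Drama", "Thriller"]  # hypothetical label names
THRESHOLD = 0.5

# Sigmoid outputs from the cell above
probabilities = np.array([0.88079708, 0.26894142, 0.53742985, 0.95257413])

# Each label is decided on its own, so several can be active at once
predicted_mask = probabilities > THRESHOLD
predicted_labels = [genre for genre, keep in zip(GENRES, predicted_mask) if keep]

print(predicted_mask.astype(int))  # [1 0 1 1]
print(predicted_labels)            # ['Action', 'Drama', 'Thriller']
```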


### 1.2.2 Loss function: (sum of) Binary Crossentropy

source: https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
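Because every label is treated as its own yes/no problem, the loss for one sample is a binary cross-entropy computed per label and then aggregated. The sketch below uses NumPy with illustrative names and an `eps` clipping that are assumptions; note that Keras' `binary_crossentropy` averages over the labels rather than summing them, which only changes the loss by a constant factor.

```python
import numpy as np


def binary_crossentropy_per_label(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy of each label, computed independently."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))


y_true = [1, 0, 0, 1]  # multi-hot target: labels 0 and 3 are present
y_pred = [0.88079708, 0.26894142, 0.53742985, 0.95257413]  # sigmoid outputs

per_label = binary_crossentropy_per_label(y_true, y_pred)
total = per_label.sum()   # "sum of" binary cross-entropies
mean = per_label.mean()   # the averaged variant

print(per_label)
print(f'sum: {total:.4f}, mean: {mean:.4f}')
```

The third label contributes the most to the loss here: its target is 0, but the model predicts a probability above 0.5.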

# 2 - Use Case: Multi-Label Image Classification in TensorFlow 2.0


Multi-label classification is also very common in computer vision applications. We humans use our instinct and impressions to guess the content of a new movie when seeing its poster (action? drama? comedy? etc.). You have probably been in such a situation in a metro station, trying to guess the genre of a movie from a wall poster. If we assume that your inference relies on the colors, saturation, hues and texture of the poster, on the body language or facial expressions of the actors, and on any shape or design that makes a genre recognizable, then maybe there is a numerical way to extract those significant patterns from the poster and learn from them in a similar manner. How can we build a deep learning model that learns to predict movie genres?

### 2.1 The dataset (Movie Genre from its Poster)

This dataset is hosted on Kaggle and contains movie posters from the IMDB website. A CSV file, MovieGenre.csv, can be downloaded. It contains the following information for each movie: IMDB Id, IMDB Link, Title, IMDB Score, Genre and a link to download the movie poster. In this dataset, each movie poster belongs to at least one genre and can have at most 3 labels assigned to it. The total number of posters is around 40K.

You can decide to ignore all labels with fewer than 1000 observations (Short, Western, Musical, Sport, Film-Noir, News, Talk-Show, Reality-TV, Game-Show). This means that the model will not be trained to predict those labels, because there are too few observations of them.
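The filtering step and the resulting multi-hot targets can be sketched in plain Python on a toy sample. The `|` separator in the genre strings, the tiny `MIN_COUNT` and the helper name `to_multi_hot` are assumptions for illustration; for the real dataset the threshold suggested above would be 1000.

```python
from collections import Counter

MIN_COUNT = 2  # toy threshold; the text above suggests 1000 for the real data

# Toy stand-in for the Genre column (the "|" separator is an assumption)
genre_column = ["Action|Drama", "Comedy", "Drama|Western", "Action|Comedy|Drama"]

genre_lists = [entry.split("|") for entry in genre_column]

# Count how often each genre occurs and keep only the frequent ones
counts = Counter(genre for genres in genre_lists for genre in genres)
kept_labels = sorted(label for label, n in counts.items() if n >= MIN_COUNT)


def to_multi_hot(genres, labels):
    """Return a 0/1 list with one entry per kept label."""
    return [1 if label in genres else 0 for label in labels]


targets = [to_multi_hot(genres, kept_labels) for genres in genre_lists]
print(kept_labels)  # ['Action', 'Comedy', 'Drama']
print(targets)      # [[1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
```

"Western" occurs only once in the toy sample, so it is dropped and the third movie keeps a single label.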

<img src=eq3.png width="800">

### 2.2 Build a fast input pipeline

### 2.2.1 Loading and parsing images

We use the tf.data API to handle the images:

    It is faster
    It provides fine-grained control
    It is well integrated with the rest of TensorFlow

You first need to write some function to parse image files and generate a tensor representing the features and a tensor representing the labels.

    In the parsing function you can resize the image to adapt to the input expected by the model.
    You can also scale the pixel values to be between 0 and 1. This is a common practice that helps speed up the convergence of training. If you consider every pixel as a feature, you would like these features to have a similar range so that the gradients don’t go out of control and that you only need one global learning rate multiplier.

In [4]:
IMG_SIZE = 224 # Specify height and width of image to match the input format of the model
CHANNELS = 3 # Keep RGB color channels to match the input format of the model

In [5]:
def parse_function(filename, label):
    """Function that returns a tuple of normalized image array and labels array.
    Args:
        filename: string representing path to image
        label: 0/1 one-dimensional array of size N_LABELS
    """
    # Read an image from a file
    image_string = tf.io.read_file(filename)
    # Decode it into a dense vector
    image_decoded = tf.image.decode_jpeg(image_string, channels=CHANNELS)
    # Resize it to fixed shape
    image_resized = tf.image.resize(image_decoded, [IMG_SIZE, IMG_SIZE])
    # Normalize it from [0, 255] to [0.0, 1.0]
    image_normalized = image_resized / 255.0
    return image_normalized, label

### 2.2.2 Batching and shuffling

To train a model on our dataset you want the data to be:

    Well shuffled

    Batched

    Batches to be available as soon as possible.

These features can be easily added using the <b>tf.data.Dataset abstraction.</b>


In [6]:
BATCH_SIZE = 256 # Big enough to measure an F1-score
AUTOTUNE = tf.data.experimental.AUTOTUNE # Adapt preprocessing and prefetching dynamically to reduce GPU and CPU idle time
SHUFFLE_BUFFER_SIZE = 1024 # Shuffle the training data using a buffer of 1024 observations

In [7]:
# Creating a function that generates training and validation datasets for TensorFlow.

In [8]:

def create_dataset(filenames, labels, is_training=True):
    """Load and parse dataset.
    Args:
        filenames: list of image paths
        labels: numpy array of shape (number of samples, N_LABELS)
        is_training: boolean to indicate training mode
    """
    
    # Create a first dataset of file paths and labels
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    # Parse and preprocess observations in parallel
    dataset = dataset.map(parse_function, num_parallel_calls=AUTOTUNE)
    
    if is_training:
        # This is a small dataset, only load it once, and keep it in memory.
        dataset = dataset.cache()
        # Shuffle the data using a buffer of SHUFFLE_BUFFER_SIZE elements
        dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER_SIZE)
        
    # Batch the data for multiple steps
    dataset = dataset.batch(BATCH_SIZE)
    # Fetch batches in the background while the model is training.
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset
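To see what such a pipeline yields without needing the poster files, the following sketch applies the same shuffle, batch and prefetch steps to synthetic in-memory tensors. The number of samples, the small batch size and the 32x32 image size are assumptions chosen only to keep the example light.

```python
import numpy as np
import tensorflow as tf

N_SAMPLES = 10       # synthetic sample count (assumption)
DEMO_BATCH_SIZE = 4  # much smaller than BATCH_SIZE, for readability
DEMO_IMG_SIZE = 32   # stand-in for the 224x224 posters

# Random "images" already scaled to [0, 1] and random 0/1 labels for 3 genres
images = np.random.rand(N_SAMPLES, DEMO_IMG_SIZE, DEMO_IMG_SIZE, 3).astype("float32")
labels = np.random.randint(0, 2, size=(N_SAMPLES, 3)).astype("float32")

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.shuffle(buffer_size=N_SAMPLES)
dataset = dataset.batch(DEMO_BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

# Iterate once and record the shape of every batch
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dataset]
for image_shape, label_shape in batch_shapes:
    print(image_shape, label_shape)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others because `batch` keeps the remainder by default.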

### 2.2.3 Transfer learning with TF.Hub

Instead of building and training a new model from scratch, you can use a pre-trained model in a process called transfer learning. The majority of pre-trained models for vision applications were trained on ImageNet which is a large image database with more than 14 million images divided into more than 20 thousand categories. 

The idea behind transfer learning is that these models, because they were trained on large and general classification tasks, can be used to address a more specific task by extracting and transferring meaningful features that were previously learned.

All you need to do is acquire a pre-trained model and simply add a new classifier on top of it. The new classification head will be trained from scratch so that you repurpose the objective to your multi-label classification task.


<b>TensorFlow Hub</b> is a library that lets you publish and reuse pre-made ML components. Using TF.Hub, it becomes simple to retrain the top layer of a pre-trained model to recognize the classes in a new dataset. TensorFlow Hub also distributes models without the top classification layer. These can be used to easily perform transfer learning.


We will be using a <b>headless model</b>: a pre-trained instance of <b>MobileNet V2</b> with a <b>depth multiplier</b> of 1.0 and an input size of 224x224.

In [9]:
import tensorflow_hub as hub

feature_extractor_url = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
                                         input_shape=(IMG_SIZE,IMG_SIZE,CHANNELS))

The <b>feature extractor</b> we are using here accepts images of shape (224, 224, 3) and returns a 1280-length vector for each image.

You should <b>freeze</b> the variables in the feature extractor layer, so that training only modifies the new classification layers. This is usually good practice when working with datasets that are very small compared to the original dataset the feature extractor was trained on.

In [10]:
feature_extractor_layer.trainable = False

#### Attach a classification head

Now, you can wrap the feature extractor layer in a <b>tf.keras.Sequential</b> model and add new layers on top.

In [17]:
N_LABELS=3
model = tf.keras.Sequential([
    feature_extractor_layer,
    keras.layers.Dense(1024, activation='relu', name='hidden_layer'),
    keras.layers.Dense(N_LABELS, activation='sigmoid', name='output')
])

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 1280)              2257984   
_________________________________________________________________
hidden_layer (Dense)         (None, 1024)              1311744   
_________________________________________________________________
output (Dense)               (None, 3)                 3075      
Total params: 3,572,803
Trainable params: 1,314,819
Non-trainable params: 2,257,984
_________________________________________________________________
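The parameter counts in the summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below reproduces the numbers; `dense_params` is a hypothetical helper, and the frozen parameter count is copied from the summary rather than derived.

```python
FEATURE_DIM = 1280         # output size of the feature extractor
HIDDEN_UNITS = 1024        # size of hidden_layer
N_LABELS = 3               # size of the output layer
FROZEN_PARAMS = 2_257_984  # keras_layer parameters, taken from the summary


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden = dense_params(FEATURE_DIM, HIDDEN_UNITS)
output = dense_params(HIDDEN_UNITS, N_LABELS)
trainable = hidden + output
total = trainable + FROZEN_PARAMS

print(hidden, output)    # 1311744 3075
print(trainable, total)  # 1314819 3572803
```

Only the two Dense layers are trainable, which matches the effect of freezing the feature extractor.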
