## <center>Visual Concept Detection Task</center>

### Group:
 - Nooshin Shojaee
 - Francesco Ciraolo
     - Student#: 020167902F
 - Lucas Souza Romao
     - Student#: 0190727830

### Introduction

As part of the semester evaluation, a challenge was given based on ImageCLEF 2008: Visual Concept Detection Task [[1]](#1).
This challenge uses the IAPR TC-12 dataset, split into a train set with x images and a test set with x images. Each image can have more than one label associated with it. As this is a multi-label classification problem, we decided to use a convolutional neural network (CNN). For better accuracy we moved forward with EfficientNet as the main model, since it gave us better results than ResNet50.
The next sections give a walkthrough of the code.

### Environment Setup

As the main framework we use Keras, which runs on top of the TensorFlow [[2]](#2) platform. Matplotlib and seaborn are used to plot the images and charts.

In [None]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=2)
import os

import tensorflow as tf  # needed later for tf.device and tf.keras.models.load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau

from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.models import Sequential
import tensorflow.keras.layers as layers

from PIL import Image
import imghdr

import matplotlib.pyplot as plt
import matplotlib.style as style

import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, roc_curve, confusion_matrix, auc,
                             precision_recall_curve)
from numpy import interp  # scipy.interp is deprecated in favour of numpy.interp


### Understanding the data

In [None]:
train_pics_dir = './train/train/'
test_pics_dir = './test/test/'

In [None]:
test_df = pd.read_csv('test/test.anno.txt', sep=" ")
labels = test_df.columns
file_names = test_df['file_name']
#test_df['file_name'] = test_df['file_name'].apply(lambda x: os.path.join(test_pics_dir, x))
test_df['classes'] = test_df.loc[:, (test_df.columns != 'file_name')].values.tolist()
test_df.head()

In [None]:
train_df = pd.read_csv('train/train.anno.txt', sep=" ", names=labels)
#train_df['file_name'] = train_df['file_name'].apply(lambda x: os.path.join(train_pics_dir, x))
#train_df['classes'] = add_labels_dataframe(train_df)
train_df['classes'] = train_df.loc[:, (train_df.columns != 'file_name')].values.tolist()
train_df.head()

In [None]:
labels_freq = []

# Count how many training images carry each label (value 1 in the label column)
for column in train_df:
    if column not in ('file_name', 'classes'):
        labels_freq.append([column, train_df[column].value_counts()[1]])

# Build the frequency table once, after the loop
df_labels_freq = pd.DataFrame(labels_freq, columns=['Label', 'Count']).sort_values(by=['Count'], ascending=False)

style.use("fivethirtyeight")
plt.figure(figsize=(8,8))
sns.barplot(y=df_labels_freq.Label, x=df_labels_freq.Count)
plt.title("Label frequency - Training Set", fontsize=14)
plt.xlabel("")
plt.ylabel("")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

As the plot above shows, our train set is imbalanced. This is a problem for our model because of the difference between the majority classes, such as outdoor, day and sky, and the minority classes, such as night, beach and animal. To overcome this problem we tried to oversample the minority classes and undersample the majority ones. For this we used SMOTETomek, a combination of over- and undersampling using SMOTE and Tomek links [[3]](#3).
However, the result was a train set with more than 8 thousand observations, as can be seen below:

<img src="over.png">

Notice that the classes are better balanced but still not ideal. To better understand the effect of such a large increase in our training set, we can look at the correlation plot.

In [None]:
# Correlate only the numeric label columns
sns.heatmap(train_df.drop(columns=['file_name', 'classes']).corr(), linewidths=.5)

As we can see, the labels are quite strongly correlated, so under- and oversampling do not help in our problem: oversampling a minority class can always increase a majority class that co-occurs with it. To overcome this, class weights are calculated and passed as parameters to the model, which tells the model during training to pay more attention to some labels than to others.
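
The co-occurrence effect described above can be illustrated with a small sketch. The annotations below are hypothetical toy data, not taken from the dataset, and plain duplication stands in for SMOTE, which generates synthetic samples instead. The point is only that every copied "beach" image also carries its majority labels.

```python
from collections import Counter

# Hypothetical toy annotations: each image is a set of labels
samples = [
    {"outdoor", "day", "sky"},
    {"outdoor", "day"},
    {"outdoor", "day", "beach"},
    {"indoor"},
]

def label_counts(images):
    """Count how many images carry each label."""
    counts = Counter()
    for image_labels in images:
        counts.update(image_labels)
    return counts

def oversample(images, label, times):
    """Append `times` copies of every image that carries `label`."""
    extra = [image for image in images if label in image] * times
    return images + extra

before = label_counts(samples)
after = label_counts(oversample(samples, "beach", 3))

for name in ("beach", "outdoor", "day"):
    print(name, before[name], "->", after[name])
```

In this toy case the minority label grows, but so do the majority labels that appear in the same images, so the gap between them does not close.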

### Data Preprocessing

In this section the data is prepared for training the model. For this we decided to use the ImageDataGenerator provided by TensorFlow. Generators are an easy way to handle data, since they make it possible to add data augmentation and to generate the images for the model.
For the input size, since EfficientNetB2 was chosen, an image size of 260x260 is required. The channels are the usual three RGB channels.
For the batch size, the best results were obtained with 32; we also tried 16, 64 and 120.

In [None]:
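def add_labels_dataframe(dataset):
    """Return, for each row, the list of label names whose column value is 1."""
    df = dataset.copy()

    # Replace each 1 by the name of its label column; 0 entries stay as 0
    for column in df.loc[:, df.columns != 'file_name']:
        df[column] = df[column].replace(1, df[column].name)

    # Collect the row values into a list and drop the 0 entries
    df['classes'] = df.loc[:, (df.columns != 'file_name')].values.tolist()
    df['classes'] = df['classes'].apply(lambda x: list(filter(lambda b: b != 0, x)))

    return df['classes']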

In [None]:
CHANNELS = 3
IMG_HEIGHT = 260
IMG_WIDTH = 260
BATCH_SIZE = 2  # note: the text above reports the best results with a batch size of 32


In [None]:
class_weights = {}
weights =list()

In [None]:
index = 0
train_size = len(train_df)

# For every label column, store the fraction of training images that carry the label
for column in train_df:
    if column not in ('file_name', 'classes'):
        positive_fraction = len(train_df[train_df[column] == 1]) / train_size
        class_weights[index] = positive_fraction
        weights.append(positive_fraction)
        index += 1
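
The cell above stores, for each label, the fraction of training images that carry it, so frequent labels receive the larger values. A common alternative, which is not what the notebook uses, is to weight labels inversely to their frequency so that rare labels count more. The sketch below shows that idea with hypothetical counts; the function name and the normalisation (most frequent label gets weight 1) are illustrative assumptions.

```python
def inverse_frequency_weights(positive_counts, n_samples):
    """Weight each label by n_samples / positives, scaled so the smallest weight is 1."""
    raw = {
        label: n_samples / count
        for label, count in positive_counts.items()
        if count > 0  # labels without positives cannot be weighted this way
    }
    smallest = min(raw.values())
    return {label: value / smallest for label, value in raw.items()}

# Hypothetical label counts, not taken from the dataset
counts = {"outdoor": 1200, "day": 1000, "night": 50}
label_weights = inverse_frequency_weights(counts, n_samples=1800)

for label, weight in label_weights.items():
    print(f"{label}: {weight:.2f}")
```

With this scheme the rare "night" label gets a much larger weight than "outdoor", which matches the stated goal of paying more attention to under-represented labels.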

First we need to create the data generator for each set. Note that the only set that is augmented is the train set: we don't want image transformations other than rescaling on the validation and test sets, otherwise the accuracy and predictions can be misleading.
As the total number of images is not very large, the validation set will be 10% of the training data.

In [None]:
train = ImageDataGenerator(rescale=1./255,
                               rotation_range=45,
                               shear_range=0.2,
                               zoom_range=0.4,
                               horizontal_flip=True,
                               vertical_flip=True)

val_test = ImageDataGenerator(rescale=1./255)


In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.1)

In [None]:
train_data_gen = train.flow_from_dataframe(
    dataframe=train_df,
    batch_size=BATCH_SIZE,
    directory=train_pics_dir,
    x_col='file_name',
    y_col=train_df.columns[1:18].tolist(),
    shuffle=True, seed=42,
    class_mode='raw',
    target_size=(IMG_HEIGHT, IMG_WIDTH))

val_data_gen = val_test.flow_from_dataframe(
    dataframe=val_df,
    batch_size=BATCH_SIZE,
    directory=train_pics_dir,
    x_col='file_name',
    y_col=train_df.columns[1:18].tolist(),
    shuffle=False, seed=42,
    class_mode='raw',
    target_size=(IMG_HEIGHT, IMG_WIDTH))

test_generator = val_test.flow_from_dataframe(
    dataframe=test_df,
    directory=test_pics_dir,
    x_col='file_name',
    y_col=None,
    batch_size=1,
    shuffle=False,
    class_mode=None,
    target_size=(IMG_HEIGHT, IMG_WIDTH))

### Model EfficientNet

EfficientNet builds on an idea by Mingxing Tan and Quoc V. Le, published in a paper in May 2019. The idea is to overcome, for convolutional neural networks, the widely used single-dimension scaling, in which one has to choose between scaling network width, network depth or image resolution. In EfficientNet all three of these dimensions are scaled at the same time.


![scaling_comparison](img/scaling_comparison.png)


As the paper shows, compound scaling makes it possible to obtain better results with fewer parameters.


![results_comparison](img/results_comparison.png)
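
As a rough illustration of compound scaling, the sketch below computes the depth, width and resolution multipliers for a compound coefficient `phi`. The constants are the ones reported in the EfficientNet paper for the B0 baseline; the function names are our own, and the released B1 to B7 variants use their own rounded settings, so this is not how Keras builds EfficientNetB2.

```python
# Constants reported in the EfficientNet paper (found by grid search on B0)
ALPHA = 1.2   # depth base
BETA = 1.1    # width base
GAMMA = 1.15  # resolution base

def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    return depth, width, resolution

def flops_factor(phi):
    """Approximate FLOPS growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scaling(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{flops_factor(phi):.2f}")
```

The constants are chosen so that `ALPHA * BETA**2 * GAMMA**2` is close to 2, which means each increment of `phi` roughly doubles the computational cost.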


Our main reasons to implement this method are:
- The challenge format of this project: we decided to be more "adventurous" and try a newer, promising model over an older but more stable one
- A great interest in this promising approach
- The attempt to solve the challenge limiting the re

In [None]:
from tensorflow.keras.applications import EfficientNetB2

In [None]:
effic_b2 = EfficientNetB2(weights=None, 
                          include_top=False, 
                          drop_connect_rate=0.2, 
                          pooling='avg', 
                          input_shape=(IMG_HEIGHT, IMG_WIDTH, CHANNELS))  # Keras expects (height, width, channels)


In [None]:
model = Sequential()
model.add(effic_b2)
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(17, activation='sigmoid'))

model.summary()

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

To explain our model: the output layer uses the sigmoid activation function, since the problem has non-exclusive labels for each image; otherwise softmax would be used [[4]](#4). Adam was chosen as the optimizer and, since we are using sigmoid, the loss is binary cross-entropy. It is worth mentioning that a run was made using softmax and categorical cross-entropy, but the loss increased very fast and the results were not accurate at all.
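
The difference between the two activations can be seen in a small standalone sketch. The logits and label names below are hypothetical, and the functions are plain-Python re-implementations for illustration, not the Keras ones.

```python
import math

def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one image."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

# Hypothetical logits for the labels outdoor, sky and night
logits = [2.0, 1.5, -3.0]
target = [1, 1, 0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("softmax:", [round(p, 3) for p in softmax_probs])
print("BCE with sigmoid outputs:", round(binary_cross_entropy(target, sigmoid_probs), 4))
```

With sigmoid, both "outdoor" and "sky" can be above 0.5 at the same time. With softmax the probabilities share a total of 1, so at most one label can exceed 0.5, which does not fit images that carry several labels.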

In [None]:
reduce_lr = ReduceLROnPlateau(monitor='val_loss', 
                        factor=0.2, 
                        patience=3, 
                        verbose=1,
                        mode='auto', 
                        min_delta=0.0001)

The learning rate is not set explicitly; the default when using Adam without specifying it is <i>learning_rate=0.001</i>, as can be found in the documentation. We use ReduceLROnPlateau because we want to make sure that our model doesn't stop learning: after 3 epochs without improvement in the validation loss, a new learning rate is calculated using <i>new_lr = lr * factor</i>.
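
To make the schedule concrete, the sketch below simulates the reduction rule in plain Python with the same settings as the callback above. It is a simplified stand-in for ReduceLROnPlateau (no cooldown and no minimum learning rate), and the validation losses are made up.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.2, patience=3, min_delta=1e-4):
    """Return the learning rate used in each epoch for a list of validation losses."""
    best = float("inf")
    wait = 0
    lr_per_epoch = []
    for loss in val_losses:
        lr_per_epoch.append(lr)  # learning rate in effect during this epoch
        if loss < best - min_delta:
            # The validation loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: new_lr = lr * factor
                lr = lr * factor
                wait = 0
    return lr_per_epoch

# Hypothetical validation losses that stop improving after the second epoch
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]
schedule = simulate_reduce_lr(losses)
print(schedule)
```

Starting from the Adam default of 0.001 with a factor of 0.2, each reduction divides the learning rate by five.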

We suggest using HPC facilities [[5]](#5) to train the model faster. Together with this notebook we provide a Python file containing only the code necessary to read the sets and train the model.

In [None]:
with tf.device('/gpu:0'):
    model_trained = model.fit(
        train_data_gen,
        steps_per_epoch=train_data_gen.n//train_data_gen.batch_size,
        epochs = 2,
        callbacks=[reduce_lr],
        validation_data=val_data_gen,
        validation_steps=val_data_gen.n//val_data_gen.batch_size,
        class_weight=class_weights
    )

In [None]:
model_trained.model.save('model_b2.h5')

In [None]:
# Training and validation loss
loss = model_trained.history['loss']
val_loss = model_trained.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'y', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig('loss_b2.jpg')  # save before show(), otherwise the saved figure may be empty
plt.show()
plt.cla()

In [None]:
#Training and validation accuracy
plt.plot(epochs, model_trained.history["accuracy"], 'y', label='Training Accuracy')
plt.plot(epochs, model_trained.history["val_accuracy"], 'r', label='Validation Accuracy')
plt.title("Training and validation accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend()
plt.savefig('accurancy_b2.jpg')  # save before show(), otherwise the saved figure may be empty
plt.show()
plt.cla()

In [None]:
plt.plot(epochs, model_trained.history["lr"])
plt.title("Learning Rate")
plt.ylabel("LR")
plt.xlabel("epoch")
plt.savefig('learning_curve.jpg')
plt.show()

Load the model in case it was trained on the HPC.

In [None]:
model_trained = tf.keras.models.load_model('model_b2.h5')

### Prediction

In [None]:
test_generator.reset()
yhat = model_trained.predict(test_generator)

In [None]:
out = (yhat > 0.5).astype("int32")


In [None]:
test_classes = np.array(list(test_df.classes.values))
thresholds = np.arange(0, 1, 0.001)

def to_labels(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')

scores = [f1_score(test_classes, to_labels(yhat, t), average='micro') for t in thresholds]

ix = np.argmax(scores)  # index of the best global threshold
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

In [None]:
df_tt = pd.DataFrame(data=yhat, columns=labels[1:].to_list())

In [None]:
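thresholds_dict = {}

for i in labels[1:]:
    precision, recall, thresholds = precision_recall_curve(test_df[i].values, df_tt[i].values)
    # The last precision/recall pair has no matching threshold, so drop it;
    # where precision + recall is 0 the F-score is set to 0 instead of NaN
    denom = precision[:-1] + recall[:-1]
    fscore = np.divide(2 * precision[:-1] * recall[:-1], denom,
                       out=np.zeros_like(denom), where=denom > 0)
    ix = np.argmax(fscore)
    thresholds_dict[i] = thresholds[ix]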

In [None]:
thresholds_dict

To get a better F1 result, we first need to find the optimal threshold for each label, so each label has its own threshold. The optimal threshold is the one that maximizes the F1 score, i.e. the harmonic mean of precision and recall.
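
The threshold search for one label can be sketched without any library. The scores and ground truth below are hypothetical, and the fixed candidate grid is a simplification; `precision_recall_curve` derives its candidate thresholds from the predicted scores instead.

```python
def f1(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    """Return the first candidate threshold with the highest F1, and that F1."""
    best_t, best_score = candidates[0], -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        score = f1(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical ground truth and predicted probabilities for one label
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.12, 0.37, 0.42, 0.81, 0.47, 0.31]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, score = best_threshold(y_true, scores, candidates)
print(f"Threshold={threshold:.2f}, F-Score={score:.3f}")
```

In this toy case the best threshold is well below 0.5, which is why a single fixed cut-off of 0.5 can give a worse F1 than a per-label threshold.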

In [None]:
for i in thresholds_dict:
    # precision_recall_curve treats scores >= threshold as positive
    df_tt[i] = df_tt[i].map(lambda x: 1 if x >= thresholds_dict[i] else 0)

In [None]:
df_tt['file_name'] = test_df['file_name']

In [None]:
df_tt.to_csv('./test/test.eval_b21.txt', index=None, sep=' ')

In [None]:
os.system('./test/eval.py')

### References

<a id='1'>[1]</a>   ImageCLEF. <i>ImageCLEF 2008: Visual Concept Detection Task.</i> URL:https://www.imageclef.org/2008/vcdt.

<a id='2'>[2]</a>   Tensorflow. <i>Tensorflow.</i> URL:https://www.tensorflow.org/about

<a id='3'>[3]</a>   SMOTETomek. <i>imbalanced-learn.</i> URL:https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.combine.SMOTETomek.html

<a id='4'>[4]</a>   Jason Brownlee. <i>Multi-Label Classification with Deep Learning</i> URL:https://machinelearningmastery.com/multi-label-classification-with-deep-learning/

<a id='5'>[5]</a>   S. Varrette et al. “Management of an Academic HPC Cluster: The UL Experience”. In: <i>Proc. of the 2014 Intl. Conf. on High Performance Computing & Simulation (HPCS 2014)</i>. Bologna, Italy: IEEE, July 2014, pp. 959–967.