<p style="text-align:right; font-size:14px;"> University of Florence </p>
<p style="text-align:right; font-size:14px;"> Department of Information Engineering </p>
<p style="text-align:right; font-size:14px;"> Pistoia, April 1, 2021 </p>


<h1 align=center>Visual and Multimedia Recognition</h1>
<h2 align=center>
Unsupervised Learning with Deepcluster on VID Dataset
</h2>

<br>


In [1]:
__AUTHOR__ = {'lp': ("Lorenzo Pisaneschi",
                     "lorenzo.pisaneschi1@stud.unifi.it",
                     "https://github.com/pisalore/video_deepcluster")}

__TOPICS__ = ['Unsupervised Learning', 'Deep Learning', 'k-Means Clustering', 'AlexNet', 'VID Dataset']


__KEYWORDS__ = ['Python', 'PyTorch', 'AI', 'Machine Learning', 'clustering']

<h1>Introduction</h1>

AI and machine learning algorithms are increasingly pervasive in daily life, so supervised learning needs ever larger
datasets and ever more annotated data. Manual annotation is costly, which is why unsupervised learning is
becoming important: it extracts useful information from unlabeled input data, which can later feed a supervised learning
process.

<h3> Unsupervised Learning Challenges </h3>

Since the algorithm does not know what the output should look like, we have to evaluate whether it has learnt
something useful for our purpose. With clustering algorithms, for example, it is crucial to check whether the data are
grouped in a way that supports further inference.

<br>
<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 50%;" src="slides/images/img1.png" width="500">



<h3> Thrilling idea </h3>

Keep the following points in mind:

- Well-tested pretrained CNNs are available, such as those
trained on the famous ImageNet dataset, a fully supervised dataset
of over one million images distributed over 1000 categories.

- Unsupervised learning algorithms can be applied to data from any domain, and unsupervised methods
can be applied to deep learning models.

- We would like larger datasets, to tackle ever more complex tasks with ever more data.

So, why not alternate between clustering and updating the convnet weights using the predicted cluster
assignments, and then build a classification model that annotates data without
any manual preprocessing?

<h1> Deepcluster </h1>

Facebook AI Research has moved in this direction with an approach called Deepcluster.

As stated above, the main idea of deepcluster is to combine CNNs and clustering into an unsupervised algorithm
that learns useful, general-purpose visual features.

Again, the main goal is to develop a scalable, domain-independent model with end-to-end training (all weights are
optimized jointly, from the raw input to the final output).


<h3> How does deepcluster work? </h3>

<br>
<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 70%;" src="slides/images/img2.png" width="500">

The idea behind deepcluster is simple, as illustrated in the figure above.

1. First, features are computed using the chosen convolutional neural network.
2. The obtained features are clustered using a clustering algorithm.
3. The cluster assignments are then used as pseudo-labels to optimize the convnet using backpropagation.

<h3> ... a little more in detail </h3>

* Let $ f_\theta(x_n) $ be the features the convnet computes for an image $ x_n $
* Let k-means be the chosen clustering algorithm
* Let $ y_n $ be the cluster assignment of image $ x_n $ and $ C $ the centroid matrix of dimension $ d\times k $

We want to solve this problem:

$
\begin{align}
\label{eq:kmeans}
  \min_{C \in \mathbb{R}^{d\times k}}
  \frac{1}{N}
  \sum_{n=1}^N
  \min_{y_n \in \{0,1\}^{k}}
  \| f_\theta(x_n) -  C y_n \|_2^2
  \quad
  \text{such that}
  \quad
  y_n^\top 1_k = 1.
  \end{align}
$

The solution provides the cluster assignments $ y_n $, which are then used as pseudo-labels.

Then, the weights are updated by optimizing the following problem:
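The clustering objective can be made concrete with a minimal sketch of plain Lloyd's k-means. It assumes NumPy is available; the function name `kmeans`, its parameters and the toy data are illustrative and are not taken from the project code. Note that the centroids are stored here as a $ k \times d $ array, the transpose of the matrix $ C $ in the formula.

```python
import numpy as np


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on an (N, d) feature matrix.

    Returns (centroids, assignments, cost); centroids has shape (k, d).
    """
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct samples.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Inner minimisation: each y_n selects the nearest centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Outer minimisation: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = features[assignments == j]
            if len(members) > 0:  # empty clusters keep their previous centroid
                centroids[j] = members.mean(axis=0)
    # Final assignments and cost, evaluated against the final centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    cost = dists[np.arange(len(features)), assignments].mean()
    return centroids, assignments, cost


# Toy "features": two well-separated blobs in 2 dimensions.
rng = np.random.default_rng(0)
toy_features = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                          rng.normal(5.0, 0.1, size=(50, 2))])
centroids, pseudo_labels, cost = kmeans(toy_features, k=2)
```

Here `pseudo_labels` plays the role of the assignments $ y_n $ (stored as integer indices rather than one-hot vectors) and `cost` is the value of the objective. On real convnet features a dedicated, optimized k-means implementation would be a more realistic choice than this loop.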

$\begin{align}
\min_{\theta, W} \frac{1}{N} \sum_{n=1}^N \ell\left(g_W\left( f_\theta(x_n) \right), y_n\right)
\end{align} $

where $ \theta $ are the parameters of the feature mapping, $ W $ are the classifier parameters, $ g_W $ is the classifier
and $ \ell $ is the multinomial logistic loss function.
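As a rough illustration of this weight update, the sketch below runs a few gradient steps in PyTorch on a fixed toy batch. The modules `f_theta` and `g_W` are tiny hypothetical stand-ins (not AlexNet), the inputs and pseudo-labels are random, and only the learning rate of 0.05 comes from the training setup described later; everything else is an assumption made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: f_theta is a tiny feature extractor, g_W a linear classifier.
d, k = 16, 5                        # feature dimension and number of clusters
f_theta = nn.Sequential(nn.Linear(32, d), nn.ReLU())
g_W = nn.Linear(d, k)

# Toy batch: 64 random flattened "images" with pseudo-labels as k-means would give.
x = torch.randn(64, 32)
pseudo_labels = torch.randint(0, k, (64,))

criterion = nn.CrossEntropyLoss()   # multinomial logistic loss
optimizer = torch.optim.SGD(
    list(f_theta.parameters()) + list(g_W.parameters()), lr=0.05)

losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = criterion(g_W(f_theta(x)), pseudo_labels)
    loss.backward()                 # gradients flow through both g_W and f_theta
    optimizer.step()
    losses.append(loss.item())
```

Both $ \theta $ and $ W $ receive gradients from the same loss, which is what makes the training end-to-end. In the full procedure the pseudo-labels would be recomputed by clustering before each epoch rather than kept fixed as they are here.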

<h3>Here we are!</h3>

Once our deepcluster model is trained, we can use it as a "classic" classifier, for instance through fine tuning or
transfer learning.

In this way, by exploiting the deepcluster pretraining, we hope to generalize and automatically
annotate our image dataset.

<br>
<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 80%;" src="slides/images/img3.png">

<h1> VID Dataset </h1>

Now that the workflow of this project is clear, it is important to take a look at
the chosen dataset.

It is the VID dataset:
* It consists of videos, sampled as JPEG frames.
* The training set contains 3862 videos, for a total of 1,122,397 frames.
* The validation set contains 555 videos, for a total of 176,126 frames.
* Each video belongs to one of 30 categories.

Since the data are derived from videos, we want to use only the part of each frame that contains the
object of interest. For this reason, annotations are also provided in XML format,
giving the crop coordinates of the object in each frame.


<h3> Data preprocessing (I)</h3>

First, it is important to know our data and preprocess them properly.

* As we can imagine, not all images were correctly annotated, so the dataset has to be made consistent.

* Another aspect is performance: how much does an on-demand image crop cost when loading data for feature computation and
clustering with deepcluster? Does it make sense to repeat the same operation on the same images for many epochs?


<h3> Data preprocessing (II)</h3>

For the reasons outlined above, it is necessary to:

1. Discard all images without annotations
2. Crop the remaining, consistent images offline by parsing their XML annotations
3. Save the result, so that the processed dataset is always available for the rest of the pipeline.

Preprocessing is needed for both the training and the validation set. <br>
With Python these operations are straightforward.
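The first two steps can be sketched with the standard library alone. The example assumes a Pascal-VOC-style layout (`object`, `name`, `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`); the tag names, the sample XML strings and the helper `parse_boxes` are assumptions made for illustration and should be checked against the real annotation files.

```python
import xml.etree.ElementTree as ET


def parse_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) found in one annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # object without coordinates: nothing to crop
        # Assumes all four coordinate tags are present and hold integers.
        coords = tuple(int(bndbox.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return boxes


# Made-up annotations: one frame with an object, one frame without any.
ANNOTATED = """
<annotation>
  <object>
    <name>n02691156</name>
    <bndbox><xmax>110</xmax><xmin>10</xmin><ymax>220</ymax><ymin>20</ymin></bndbox>
  </object>
</annotation>
"""
EMPTY = "<annotation></annotation>"

frames = {"000000.JPEG": ANNOTATED, "000001.JPEG": EMPTY}
# Step 1: keep only the frames that have at least one annotated object.
kept = [name for name, xml_text in frames.items() if parse_boxes(xml_text)]
```

The tuple order (left, upper, right, lower) matches the box argument of Pillow's `Image.crop`, so each box could be passed to it directly before saving the cropped frame to disk.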

<br>
<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 90%;" src="slides/images/img6.png">

After preprocessing, the dataset was sub-sampled to speed up computation; the final dataset consists of
362,044 frames for training and 54,306 frames for validation.

<h1>The process: deepcluster training</h1>

This is our starting point. The framework used for this project is PyTorch.

* Deepcluster was trained for 100 epochs, with a batch size of 256 and a learning rate of 0.05.

* For clustering, k-means is used. The number of clusters $k$ is therefore an important parameter, and we need a way to
determine whether one choice is better than another.

* For this reason, deepcluster training was performed with three different $k$ values: 30, 150 and 300. The trainings
were then evaluated with the NMI (Normalized Mutual Information) metric, which measures the information shared between
two different assignments:

$\begin{align}
\mathrm{NMI}(A;B)=\frac{\mathrm{I}(A;B)}{\sqrt{\mathrm{H}(A) \mathrm{H}(B)}}
\end{align}
$

The closer the NMI is to 1, the more deterministically an assignment A can be predicted from an assignment B. An NMI of 0 means the two assignments are independent.
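To make the formula concrete, here is a small dependency-free sketch that computes the NMI of two label sequences exactly as written above, with the geometric mean of the entropies in the denominator. The function names are illustrative; returning 0 when one assignment is constant is a convention chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a sequence of labels (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information I(A;B) of two equally long label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


def nmi(a, b):
    """NMI(A;B) = I(A;B) / sqrt(H(A) * H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention used here when one assignment is constant
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two assignments that differ only in how the clusters are named, so each is fully predictable from the other; the second compares two assignments that share no information.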

<h3>Clustering stabilization</h3>

<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 60%;" src="slides/images/img7.png">

* NMI increases over time and saturates after about 20 epochs. This means that there are fewer and fewer reassignments
as training proceeds.
* As a consequence, the dependence between clusters and labels increases over time: the features are capturing information that will be
used later, in the second training phase.

* With k = 300 we obtain the best performance: apparently, over-segmentation is beneficial.

<h3>Clusters visualization (example for k = 300)</h3>

<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 70%;" src="slides/images/img8.png">


<h1>The process: deepcluster evaluation</h1>

After deepcluster training we have three models, one for each $k$ used for k-means clustering.

The idea now is to check whether unsupervised training with deepcluster was
useful for building an image classifier.



<h3>Fine tuning vs Training from scratch</h3>


<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 80%;" src="slides/images/img9.png">


To carry out this evaluation, we can do the following:

* Fine tune our models after adding a 30-dimensional top layer (the classifier):
30 is the number of categories in the VID dataset
* Train an AlexNet from scratch on exactly the same data previously given to deepcluster (but using the annotated labels)
* Compare the results
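The first option amounts to swapping the cluster head for a 30-way classifier while keeping the pretrained feature weights. The sketch below shows this in PyTorch on a deliberately tiny, hypothetical network: `TinyConvNet`, its `features` and `top_layer` attributes and the learning rate are placeholders chosen for the example, not the project's AlexNet or its settings.

```python
import torch
import torch.nn as nn


class TinyConvNet(nn.Module):
    """Hypothetical stand-in for the pretrained network (not AlexNet itself)."""

    def __init__(self, num_outputs):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.top_layer = nn.Linear(8, num_outputs)

    def forward(self, x):
        return self.top_layer(self.features(x))


# Pretend this model comes out of deepcluster training with k = 300 clusters.
model = TinyConvNet(num_outputs=300)

# Replace the cluster head with a fresh 30-way classifier, one output per VID category.
model.top_layer = nn.Linear(model.top_layer.in_features, 30)

# Fine tuning: every parameter stays trainable, so the optimizer receives all of them.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder learning rate

logits = model(torch.randn(4, 3, 32, 32))  # a batch of 4 random RGB images
```

The from-scratch baseline would use the same architecture with randomly initialized weights and the same 30-way head, so that the only difference between the two runs is the deepcluster pretraining.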

<h3>Validation Loss and Accuracy</h3>

<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 100%;" src="slides/images/img10.png">

* As we can see, training from the deepcluster models ends up overfitting, despite data augmentation and fine tuning. This could be
because the images are video frames: the features carry information about very similar data.

* However, overfitting sets in at different moments for each model. The one pretrained with 300-means performs
better than the others; in general, it seems that the better the clustering phase, the better the supervised training.

* The previous consideration is supported by the validation accuracy plot.


<h1>Further steps and Conclusions</h1>

Unsupervised learning is becoming more and more important, since we need more and more automatically annotated data.
Indeed, unsupervised learning is currently a first step in more complex machine learning pipelines, in which the outputs
of an unsupervised phase are used as inputs for supervised learning.

The work presented here is an example of how, starting from clustering, it is possible to obtain a pretrained model to be used
as the starting point for fine-tuning an AlexNet classifier.

Some future developments could be:
* Use feature extraction instead of fine tuning (using the last two convolutional layers if using AlexNet)
* Use other architectures (for example, VGG16)
* Use another dataset and vary the number of images used during feature computation / clustering
* Exploit video information (for the VID dataset)

<h1 style="text-align:center">THANKS FOR YOUR ATTENTION !!!</h1>