# Agile semi-automatic image clustering (filtering) using pre-trained Convolutional Neural Networks and clustering algorithms
### Authors: Michał Woźniak (id: 385190), Michał Wrzesiński (id: 385197)
### Date: 14.11.2019

In [1]:
#setting width of jupyter notebook document to 80%
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# 1. Introduction
## 1.1 Background
Deep neural networks are now used in many areas, one of which is computer vision. More and more companies are adopting these technologies to solve their daily problems and tasks, and they usually outsource such projects to a data science consulting company. The consulting company is given limited time (usually around 3-6 months) to prepare code that will serve as a Proof of Concept or Minimum Viable Product. The race against time begins.
<br><br>
Know-how alone is not enough. The key is correct and reliable preparation of data for analysis and learning. Very often, customers specify their requirements and set business goals, but provide data of very poor quality. In the case of photos, this usually means a large number of images that are not useful for the analysis and should be discarded at the very beginning of the study. Clients usually take a snapshot of their databases and pay little attention to what they hand over to machine learning consultants. Filtering photos manually is far from optimal. Usually, such work is outsourced to a low-skilled worker to reduce labour costs. However, data scientists often end up doing this work themselves during the project, because of the scale of the problem and the time required to hire a low-skilled worker. So far, no good methodology for dealing with this problem has been developed in practice. This article proposes one possible solution.
## 1.2 Purpose of the study
In this paper the authors propose a new approach to removing unneeded images from databases that will later be used to train machine learning algorithms. Their goal was to develop a semi-automatic solution, based on pre-trained convolutional neural networks and clustering algorithms, that is fast, precise and robust in image clustering. Fast means that no training process has to be carried out, which lets scientists prototype quickly during a commercial project. Precision and robustness refer to how well the algorithm copes with various types and classes of images. The simplest solution would seem to be a pre-trained convolutional neural network with its top soft-max layer. However, such a network is trained only for the 1,000 classes of the ImageNet dataset, so it is useful only for that fixed set of classes. The researchers therefore propose removing the top layer of the network and treating the CNN as a feature extractor for images. As a result, the approach presented in this article is not limited to any particular set of classes. The only condition is that clearly distinguishable classes are defined before starting the analysis (e.g. given a dataset from the Vienna Zoo, we may want to divide it into animals, their food, and their runs, which are clearly distinguishable classes).
## 1.3 Scientific hypothesis
The main hypothesis verified in this paper is that it is possible to construct a fast, precise and robust semi-automatic algorithm for image clustering using pre-trained convolutional neural networks and unsupervised clustering algorithms.
## 1.4  Business hypothesis
In addition to the scientific hypothesis, a business hypothesis should also be stated, because it takes into account the economic aspect of this study. The issue addressed in this article can be considered a binary classification problem. Suppose company X hired company Y to conduct a PoC for a Computer Vision project. Company Y is considering whether to use the model proposed in this article or to hire a low-qualified person to sort the photos received from company X. <br><br> Company Y is experienced and knows that this decision will influence the rest of the project, because it affects the quality of the data. In this specific case, assume that class 1 represents needed images and class 0 represents unneeded images. An increase in False Negatives then shrinks the dataset, while an increase in False Positives adds noise to the data. Both situations are undesirable. Company Y therefore concluded that it must optimize two metrics: <br><br>
\begin{equation*}
\text{Precision} = \frac{TP}{TP+FP}
\quad\text{and}\quad
\text{Recall} = \frac{TP}{TP+FN}
\end{equation*}<br><br>
Finally, they decided to use a simple weighted sum as the Quality Metric of their work:<br><br>
\begin{equation*}
\text{Quality Metric} = 0.3 \cdot \text{Precision} + 0.7 \cdot \text{Recall}.
\end{equation*}<br>
This weighting reflects their experience that Recall is much more important than Precision for this particular business problem. Company Y wants to maximize the Quality Metric. <br><br>
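As a minimal sketch, the two metrics and their weighted combination can be computed directly from confusion-matrix counts. The function name `quality_metric` and the example counts are illustrative assumptions, not results from this study.

```python
def quality_metric(tp, fp, fn, w_precision=0.3, w_recall=0.7):
    """Weighted sum of precision and recall, using the weights from section 1.4."""
    # Guard against empty denominators (e.g. nothing was predicted as class 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return w_precision * precision + w_recall * recall


# Hypothetical counts: 45 needed images kept, 5 unneeded images kept,
# 5 needed images lost. Precision and recall are both 45 / 50 here.
example_score = quality_metric(tp=45, fp=5, fn=5)
print(round(example_score, 3))
```

Because Recall carries the larger weight, losing needed images (False Negatives) lowers the score more than keeping the same number of unneeded ones (False Positives).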
Company Y knows the cost of hiring:
* a low-qualified person: 20 PLN per hour,
* a highly qualified Data Scientist: 90 PLN per hour.

Moreover, they have already obtained performance statistics for image clustering performed by:
* a low-qualified person (fully manual approach) ~ 100 images per hour,
* a highly qualified Data Scientist (semi-automatic approach using the model from this paper) ~ 100 images per 10 minutes.

They know that a low-qualified person can obtain a Quality Metric of 0.95. This statistic is unknown for the new approach. Company Y assumes that if its Quality Metric falls in the range [0.9, 1.0], the new approach will be worth using, because the cost of the low-qualified person grows linearly with the number of images, while the cost of the highly qualified person is assumed to be constant (always 90 PLN / 6 = 15 PLN). Otherwise, they will not take the risk. Company Y calculates its savings as follows:
* 100 images => savings =  5 PLN
* 1000 images => savings = 185 PLN
* 10000 images => savings = 1985 PLN
* 100000 images => savings = 19985 PLN
* etc.

The impact is clearly visible. To sum up, the business hypothesis is that the new approach can obtain a Quality Metric in the range [0.9, 1.0].
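The savings figures listed above can be reproduced with a few lines of arithmetic. This is only a sketch under Company Y's own assumptions (manual cost grows linearly, semi-automatic cost stays at a constant 15 PLN); the constant and function names are illustrative.

```python
MANUAL_RATE_PLN = 20          # low-qualified person, PLN per hour
MANUAL_SPEED = 100            # images per hour in the fully manual approach
SEMI_AUTO_COST_PLN = 90 / 6   # assumed constant: 10 minutes of a Data Scientist


def savings(n_images):
    """Savings in PLN: linear manual cost minus the assumed constant cost."""
    manual_cost = n_images / MANUAL_SPEED * MANUAL_RATE_PLN
    return manual_cost - SEMI_AUTO_COST_PLN


for n in (100, 1000, 10000, 100000):
    print(n, savings(n))
```

If the semi-automatic cost also grew with the number of images, the savings would be smaller; the constant-cost assumption is taken here exactly as stated in the text.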

# 2 Methodology of the research
## 2.1 Unsupervised Machine Learning Algorithms
Based on their knowledge of the pros and cons of various unsupervised machine learning algorithms, the scientists decided to inspect the performance of K-Means, K-Medoids, DBSCAN and OPTICS on this problem. A short description of each considered algorithm is given below.
### 2.1.1 K-Means
### 2.1.2 K-Medoids
### 2.1.3 DBSCAN
### 2.1.4 OPTICS

## 2.2 Pre-trained Convolutional Neural Networks 
The scientists had to decide which CNN architecture to use. The Keras library provides the following architectures pre-trained on the ImageNet dataset:
* Xception
* VGG16
* VGG19
* ResNet
* ResNetV2
* InceptionV3
* InceptionResNetV2
* MobileNet
* MobileNetV2
* DenseNet
* NASNet. 

Based on their experience, accumulated expert knowledge and published research, for example [Simone Bianco et al. from 2018](https://arxiv.org/pdf/1810.00736.pdf), they decided to use [Inception ResNet V2](https://arxiv.org/pdf/1602.07261.pdf), which is both efficient and precise. At the time of the study, this model could be considered state of the art alongside ResNet 152. To be more specific, Inception ResNet V2 is a variation of the Inception V3 model that borrows some ideas from Microsoft's ResNet papers. Schematic diagram of Inception-ResNet-v2 ([source](https://ai.googleblog.com/2016/08/improving-inception-and-image.html)): <br><br>
![Schematic diagram of Inception-ResNet-v2](https://1.bp.blogspot.com/-O7AznVGY9js/V8cV_wKKsMI/AAAAAAAABKQ/maO7n2w3dT4Pkcmk7wgGqiSX5FUW2sfZgCLcB/s640/image00.png)<br><br>
Importantly, the Keras implementation allows researchers to remove the top (soft-max) layer of the pre-trained Inception ResNet V2 model. From a technical point of view, one has to keep in mind that the default input size for this model is 299x299 pixels, so some preprocessing is required.

## 2.3 Data
In this study, the scientists simulated a project in the medical field. For the purposes of the paper, they created a dataset consisting of 200 images (all converted to .png format). The data was collected using Google Images; no existing dataset was used. <br><br>This collection contains 4 clearly distinguishable classes:
* medical documentation scans/images (e.g. prescriptions, hospital discharge summaries, diagnoses, test results, etc.) - 50 images
* X-ray images - 50 images
* damaged limbs/organs images - 50 images
* other images (e.g. crashed cars, hospitals, safety suits, etc.) - 50 images

Below, the researchers present some examples from each class.

In [2]:
# four sample images per class, shown as two rows of two images each
samples = {
    "Medical documentation scans/images": ("doc", [10, 12, 14, 17]),
    "X-rays images": ("roentgen", [10, 12, 14, 17]),
    "Damaged limbs/organs images": ("break", [10, 12, 8, 7]),
    "Other images": ("others", [2, 12, 14, 17]),
}
for title, (prefix, ids) in samples.items():
    print(title)
    for pair in (ids[:2], ids[2:]):
        # the width attribute is now quoted correctly (the original had width=500')
        cells = "".join(f"<td><img src='../images/dataset/{prefix} ({n}).png' height='500' width='500'></td>" for n in pair)
        display(HTML(f"<table><tr>{cells}</tr></table>"))

Medical documentation scans/images


X-rays images


Damaged limbs/organs images


Other images


#### Warning: Not all data has been anonymized, so be careful not to publish this article prior to anonymization!!!
If you want to inspect all images, they are located in the "images/dataset/" folder.

## 2.4 Modeling approach - goal and  pipeline
Based on the collected dataset, the main goal is to gather the medical documentation scans/images from the database. The remaining images are irrelevant for further analysis by the data scientist and are therefore redundant, so the task reduces to binary classification.
<br><br>
In this particular Computer Vision problem the pipeline is quite simple:
* First, the prepared dataset needs to be loaded and preprocessed (resizing, expanding dimensions, scaling pixel values to the range expected by the model, etc.).
* The second step is applying Inception ResNet V2 (with the top layer removed) to every image in the dataset. As a result, the researchers obtain a vector of extracted features for each image.
* The third and most crucial step involves the unsupervised machine learning algorithms. The scientists will run and test the performance of K-Means, K-Medoids, DBSCAN and OPTICS on the features extracted from the images and will choose the best clustering model for this specific case. Running and testing consists of standard pre-diagnostics (such as the Silhouette score, the Elbow method, Hopkins' statistic, etc.) and post-diagnostics (the Rand Index). However, the most important criterion will be the metric defined in subsection "1.4 Business hypothesis", i.e. the Quality Metric. This metric will ultimately determine the best model.
* Finally, all results need to be summarized and the cluster of interest saved.
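The third step of the pipeline above can be sketched as follows. This is not the authors' implementation: it uses synthetic feature vectors in place of CNN features, only K-Means from scikit-learn, and the adjusted Rand index (`adjusted_rand_score`) as a stand-in for the Rand Index. All variable names, the 32-dimensional features and the choice of class 0 as the class of interest are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for extracted features: 4 classes x 50 samples, 32 dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 32))
true_labels = np.repeat(np.arange(4), 50)
features = centers[true_labels] + rng.normal(scale=1.0, size=(200, 32))

# Cluster the feature vectors into 4 groups
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(features)

# Diagnostics: silhouette needs no ground truth, the adjusted Rand index does
sil = silhouette_score(features, cluster_labels)
ari = adjusted_rand_score(true_labels, cluster_labels)

# Binary view: class 0 plays the role of the "needed" images.
# Pick the cluster that overlaps most with it and score that choice.
target = true_labels == 0
best_cluster = max(range(4), key=lambda c: np.sum((cluster_labels == c) & target))
predicted = cluster_labels == best_cluster

tp = np.sum(predicted & target)
fp = np.sum(predicted & ~target)
fn = np.sum(~predicted & target)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
quality = 0.3 * precision + 0.7 * recall

print(f"silhouette={sil:.3f} ARI={ari:.3f} quality={quality:.3f}")
```

In a real project the ground-truth labels would only be available for a small hand-labelled sample, so the Quality Metric would be estimated on that sample rather than on the full dataset.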

# 3 Modeling
The scientists commented almost every step they performed.

Dependencies loading

In [3]:
from keras.preprocessing import image
from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.applications.inception_resnet_v2 import preprocess_input
import numpy as np
import glob

Using TensorFlow backend.


Obtaining names of all images from specified directory

In [4]:
files_list = glob.glob("../images/dataset/*.png")
len(files_list)

200

Image preprocessing in loop

In [5]:
preprocessed_images = dict() # dictionary for preprocessed images
for i in files_list:
    try:
        img_path = i
        img = image.load_img(img_path, target_size=(299, 299)) # loading image as a PIL object and resizing it to 299x299
        img_data = image.img_to_array(img) # transforming PIL image to numpy array of shape height x width x channels
        img_data = np.expand_dims(img_data, axis=0) # adding batch dimension - new shape: number_of_images x height x width x channels
        img_data = preprocess_input(img_data) # scaling pixel values to the [-1, 1] range expected by InceptionResNetV2
        preprocessed_images.update({i:img_data}) # adding img_data to dictionary for preprocessed images
    except Exception as e: # catching only regular exceptions, so that KeyboardInterrupt still stops the loop
        print(f"Could not preprocess image {i}: {e}")
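For reference, the `preprocess_input` function of the Inception family in Keras rescales pixel values from [0, 255] to [-1, 1] rather than subtracting channel means. The NumPy sketch below reproduces that arithmetic on a synthetic image, so it runs without Keras; the helper name `scale_to_unit_range` and the random image are illustrative assumptions.

```python
import numpy as np


def scale_to_unit_range(batch):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return batch.astype("float32") / 127.5 - 1.0


rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(299, 299, 3))  # height x width x channels
batch = np.expand_dims(fake_image, axis=0)             # 1 x 299 x 299 x 3
scaled = scale_to_unit_range(batch)

print(batch.shape, scaled.min(), scaled.max())
```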

InceptionResNetV2 model loading

In [6]:
model_cnn = InceptionResNetV2(weights='imagenet', include_top=False, classes=1000) # loading pre-trained model from Keras library without the top (classification) layer
# model_cnn.summary() # uncomment to print summary of the model

Application of CNN model on previously preprocessed images

In [7]:
extracted_features = dict() # dictionary for extracted features for each image
for i,j in preprocessed_images.items():
    preds_features = model_cnn.predict(j) # extracting the feature map with InceptionResNetV2 (predict already returns a numpy array)
    extracted_features.update({i:preds_features.flatten()}) # collapsing the feature map into a one-dimensional vector

Printing shape of extracted features for sample image

In [15]:
list(extracted_features.values())[0].shape

(98304,)
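The length 98304 is consistent with a final feature map of 8 x 8 spatial positions and 1536 channels for a single 299x299 image (8 * 8 * 1536 = 98304). The sketch below uses a zero-filled array of that assumed shape to show the flattening, and also an optional alternative that the notebook does not use: averaging over the spatial grid, which Keras can do inside the model via the `pooling='avg'` argument, yields a much shorter 1536-dimensional vector.

```python
import numpy as np

# Assumed feature-map shape for one image: batch x height x width x channels
feature_map = np.zeros((1, 8, 8, 1536), dtype="float32")

flat = feature_map.flatten()                    # what the notebook stores per image
pooled = feature_map.mean(axis=(1, 2)).ravel()  # global average pooling over the 8x8 grid

print(flat.shape, pooled.shape)
```

Shorter vectors tend to make distance-based clustering cheaper, but whether pooling helps or hurts the Quality Metric in this task would have to be checked empirically.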