# Agile semi-automatic image clustering (filtering) using pre-trained Convolutional Neural Networks and clustering algorithms
### Authors: Michał Woźniak (id: 385190), Michał Wrzesiński (id: 385197)
### Date: 14.11.2019

# 1. Introduction
## 1.1 Background
Deep neural networks are now used in many areas, one of which is Computer Vision. More and more commercial companies are deciding to use these technologies to solve their daily problems and tasks, and they usually outsource such a project to a data science consulting company. The consulting company is given limited time (typically around 3-6 months) to prepare code that will serve as a Proof of Concept or Minimum Viable Product. The race against time therefore begins.
<br><br>
Know-how alone is not enough. The key is correct and reliable preparation of data for analysis and learning. Very often, customers specify their requirements and set business goals, but at the same time provide data of very poor quality. In the case of photos, this usually means a large number of images that are not useful in the analysis and should be discarded at the very beginning of the study. Clients usually take a snapshot of the data in their databases without checking what they are providing to machine learning consultants, and the consultants have to cope with it. Filtering photos manually is far from optimal. Usually, such work is outsourced to a low-skilled worker to reduce labor costs. However, data scientists often end up doing this work themselves during the project, because of the scale of the problem and the time required to hire a low-skilled worker. So far, no good methodology for dealing with this problem has been developed in practice. This article proposes one possible solution.
## 1.2 Purpose of the study
In this paper the authors propose a new approach to removing unneeded images from databases that will later be used to train machine learning algorithms. Their goal was to develop a semi-automatic solution, based on pre-trained convolutional neural networks and clustering algorithms, that would be fast, precise and robust in image clustering. Here, fast means that no training process has to be carried out, which enables scientists to prototype quickly during a commercial project. Precision and robustness refer to the algorithm's resistance to various types and classes of images. The simplest solution would seem to be a pre-trained convolutional neural network with its top soft-max layer. However, such a network is trained only on the 1,000 classes of ImageNet, so it is useful only for that strictly defined set of cases. The approach presented in this article is not limited to any number of classes. The only condition is that clearly different classes are selected before starting the analysis (e.g. given a data set from the Vienna Zoo, we may want to divide it into animals, their food, and their runs; these classes are clearly distinguishable).
## 1.3 Scientific hypothesis
The main hypothesis verified in this paper is that it is possible to construct a fast, precise and robust semi-automatic algorithm for image clustering using pre-trained convolutional neural networks and unsupervised machine learning clustering algorithms.
## 1.4 Business hypothesis
In addition to the scientific hypothesis, a business hypothesis should also be stated, because it takes into account the economic aspect of this study. The issue addressed in this article can be considered a binary classification problem. Suppose company X hired company Y to conduct a PoC for a Computer Vision project. Company Y is considering whether to use the model proposed in this article or to hire a low-skilled worker to cluster the photos received from company X. <br><br> Company Y is experienced and knows that this decision will influence the rest of the project, because it affects the quality of the data. In this specific case, assume that class 1 represents needed images and class 0 represents unneeded images. An increase in the number of False Negatives then shrinks the dataset, while an increase in the number of False Positives adds noise to the data. Both situations are undesirable. Company Y therefore concluded that it must optimize two metrics: <br><br>
\begin{equation*}
\text{Precision} = \frac{TP}{TP+FP}
\quad \text{and} \quad
\text{Recall} = \frac{TP}{TP+FN}
\end{equation*}<br><br>
Finally, they decided to use a simple linear combination as the Quality Metric of their work:<br><br>
\begin{equation*}
\text{Quality Metric} = 0.3 \cdot \text{Precision} + 0.7 \cdot \text{Recall}.
\end{equation*}<br>
This metric lets them handle this particular business problem (in their experience, Recall is much more important than Precision). Company Y naturally wants to maximize the Quality Metric. <br><br>
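
As a minimal sketch of how the metrics above could be computed from confusion-matrix counts (the function names and the example counts are illustrative assumptions, not values from this study):

```python
def precision(tp: int, fp: int) -> float:
    """Share of images kept by the method that were actually needed."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of needed images that the method managed to keep."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def quality_metric(tp: int, fp: int, fn: int,
                   w_precision: float = 0.3, w_recall: float = 0.7) -> float:
    """Weighted sum of Precision and Recall with the weights from the text."""
    return w_precision * precision(tp, fp) + w_recall * recall(tp, fn)


# Hypothetical counts: 90 needed images kept, 10 unneeded images kept,
# 10 needed images wrongly discarded.
score = quality_metric(tp=90, fp=10, fn=10)
print(f"Quality Metric = {score:.3f}")
print("Within the target range [0.9, 1.0]:", 0.9 - 1e-9 <= score <= 1.0)
```

Because Recall carries the larger weight, wrongly discarding needed images lowers the score more than keeping the same share of unneeded ones.
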
Company Y knows that the cost of hiring:
* a low-skilled worker is 20 PLN per hour,
* a highly qualified Data Scientist is 90 PLN per hour.

Moreover, they have already obtained performance statistics for image clustering performed by:
* a low-skilled worker (fully manual approach): ~100 images per hour,
* a highly qualified Data Scientist (semi-automatic approach using the model from this paper): ~100 images per 10 minutes.

They found that a low-skilled worker can obtain a Quality Metric of 0.95. This statistic is unknown for the new approach. Company Y assumed that if the Quality Metric is in the range [0.9, 1.0], the new approach will be worth using, because the cost of the low-skilled worker increases linearly with the number of images, while the cost of the highly qualified person is assumed to be constant (always 10 minutes of work, i.e. 90 PLN / 6 = 15 PLN). Otherwise they will not take that risk. Company Y calculated its savings. In the case of:
* 100 images => savings =  5 PLN
* 1000 images => savings = 185 PLN
* 10000 images => savings = 1985 PLN
* 100000 images => savings = 19985 PLN
* etc.

The impact is therefore clearly visible. To sum up, the business hypothesis is that the new approach can obtain a Quality Metric in the range [0.9, 1.0].
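
The savings figures listed above can be reproduced with a short calculation. The sketch below uses the rates quoted in the text and Company Y's assumption that the semi-automatic approach always costs a fixed 10 minutes of Data Scientist time; the constant and function names are illustrative.

```python
MANUAL_RATE_PLN_PER_HOUR = 20      # low-skilled worker
MANUAL_IMAGES_PER_HOUR = 100       # fully manual throughput
DS_RATE_PLN_PER_HOUR = 90          # highly qualified Data Scientist
DS_FIXED_MINUTES = 10              # assumed constant effort, as in the text


def manual_cost(n_images: int) -> float:
    """Cost of manual filtering, which grows linearly with the image count."""
    return n_images * MANUAL_RATE_PLN_PER_HOUR / MANUAL_IMAGES_PER_HOUR


def semi_automatic_cost() -> float:
    """Cost of the semi-automatic approach under the constant-effort assumption."""
    return DS_RATE_PLN_PER_HOUR * DS_FIXED_MINUTES / 60


def savings(n_images: int) -> float:
    """Savings of the semi-automatic approach over manual filtering, in PLN."""
    return manual_cost(n_images) - semi_automatic_cost()


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} images => savings = {savings(n):.0f} PLN")
```

If the Data Scientist's effort also grew with the number of images, the semi-automatic cost would no longer be constant and the savings would have to be recomputed.
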

# 2. Methodology of the research
## 2.1 Unsupervised Machine Learning Algorithms
Based on their knowledge of the pros and cons of various unsupervised machine learning algorithms, the authors decided to inspect the performance of K-Means, K-Med, DBSCAN and OPTICS on this problem. A short description of each considered algorithm is given below.
### 2.1.1 K-Means
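
As a rough illustration only, and not the authors' implementation, the sketch below runs a minimal K-Means (Lloyd's algorithm) in pure Python. The two-dimensional points are hypothetical stand-ins for the feature vectors a pre-trained network would produce, and the farthest-point initialization is a simplifying assumption chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal Lloyd's algorithm. Returns (labels, centroids)."""
    # Deterministic farthest-point initialization (an assumption of this sketch):
    # start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical 2-D "feature vectors" forming two well-separated groups.
features = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
]
labels, centroids = kmeans(features, k=2)
print(labels)
```

In practice the inputs would be high-dimensional feature vectors, and a library implementation would normally be preferred over a hand-written loop.
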
### 2.1.2 K-Med
### 2.1.3 DBSCAN
### 2.1.4 OPTICS
## 2.2 Pre-trained Convolutional Neural Networks 
## 2.3 Data
## 2.4 Scientific approach