## Introduction

Graphene is a two-dimensional material used in fields ranging from electronics and energy storage to medicine. Typically, graphene is obtained through the exfoliation of graphite. The process produces graphite flakes of different shapes and thicknesses. Only thin flakes, up to several graphene layers thick, are useful for obtaining the material. Hence, these thin flakes have to be segmented.

Accurate segmentation can be done manually, but the process takes a couple of minutes and requires a human operator. Automated methods for segmenting graphene flakes in microscope images are therefore in high demand. Both classical machine learning approaches and deep learning approaches based on computer vision have been proposed. We focus on the latter.

We compare methods based on two architectures: UNet and OCRNet. Both significantly reduce processing time, but at the expense of prediction accuracy.

## Methods
### Data
We assign each pixel to one of four classes: background (BG), monolayer (ML), bilayer (BL) and three-layer (3L).
The ML, BL and 3L classes contain pixels that belong to flakes of 1, 2 and 3 graphene layers, respectively. All other pixels belong to the background class.

The dataset contains images obtained from the microscope. The images are of three different scales: 100x, 50x, 20x. Accurate segmentation masks for these images are obtained manually.

The dataset also contains a further category of images (apart from 100x, 50x and 20x), scraped from the outputs of a Jupyter notebook of the original implementation of the UNet approach. These images are of unknown scale, very noisy and of poor quality. They were scraped at the beginning of the project to enlarge a small dataset, and we later stopped using them.

We randomly picked 10% of the images of each scale for the validation set. The distribution of images in the dataset is as follows:

|       | 100x | 50x | 20x | 20x_n | scraped | TOTAL |
|:-----:|:----:|:---:|:---:|:-----:|:-------:|:-----:|
| train |  85  | 13  | 76  |  58   |   50    |  282  |
|  val  |  10  |  1  |  9  |   7   |    7    |  34   |
| TOTAL |  95  | 14  | 85  |  65   |   57    |  316  |
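
A per-scale random split of this kind can be sketched as follows. This is only an illustration, not the project's code: the function name `split_by_scale`, the file names, the seed and the rounding rule are assumptions, so the resulting counts need not match the table exactly.

```python
import random

def split_by_scale(files_by_scale, val_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the images of every scale for validation."""
    rng = random.Random(seed)
    train, val = {}, {}
    for scale, files in files_by_scale.items():
        shuffled = sorted(files)
        rng.shuffle(shuffled)
        # Assumed rule: round to the nearest integer, but keep at least one image.
        n_val = max(1, round(val_fraction * len(shuffled)))
        val[scale] = shuffled[:n_val]
        train[scale] = shuffled[n_val:]
    return train, val

# Hypothetical file names, only to show the call.
files = {
    "100x": [f"100x_{i:03d}.png" for i in range(95)],
    "50x": [f"50x_{i:03d}.png" for i in range(14)],
}
train, val = split_by_scale(files)
print({scale: (len(train[scale]), len(val[scale])) for scale in files})
```

Splitting each scale separately keeps every scale represented in the validation set, even the small 50x category.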

The 20x_n category consists of 20x images that contain only background pixels.
Since the images are of different scales, some categories are omitted during training and evaluation.

BG, ML, BL and 3L classes are highly imbalanced:

|     | 100x & 50x | 20x     | 20x_n   | scraped | Altogether |
|-----|------------|---------|---------|---------|------------|
| BG  | 0.87195    | 0.99519 | 1.00000 | 0.97615 | 0.99143    |
| ML  | 0.04823    | 0.00214 | 0.00000 | 0.00606 | 0.00339    |
| BL  | 0.07237    | 0.00146 | 0.00000 | 0.01779 | 0.00418    |
| 3L  | 0.00745    | 0.00121 | 0.00000 | 0.00000 | 0.00100    |

Each column of the table is normalized to sum to 1. The scraped images do not contain the 3L class, and the 20x_n category contains only the BG class.
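
Class shares like the ones in this table can be computed from the label masks. The sketch below assumes that masks are integer arrays with the encoding BG = 0, ML = 1, BL = 2, 3L = 3; that encoding and the helper name `class_fractions` are illustrative assumptions.

```python
import numpy as np

CLASS_NAMES = ["BG", "ML", "BL", "3L"]  # assumed encoding: 0, 1, 2, 3

def class_fractions(masks):
    """Return the share of pixels of each class over a list of label masks."""
    counts = np.zeros(len(CLASS_NAMES), dtype=np.int64)
    for mask in masks:
        counts += np.bincount(mask.ravel(), minlength=len(CLASS_NAMES))
    return counts / counts.sum()

# Toy 4x4 mask instead of a real annotation.
toy_mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 3],
])
fractions = class_fractions([toy_mask])
for name, value in zip(CLASS_NAMES, fractions):
    print(f"{name}: {value:.5f}")
```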

Some samples from the dataset: `TODO`

### UNet approach
The original paper of this approach can be found [here](https://arxiv.org/abs/2103.13495).

The approach applies two models in sequence: the main UNet model and the color classification model. The UNet component selects "interesting regions", inside which the color classification component independently assigns each pixel to one of the 4 classes.

The interesting region of an image is the set of pixels that covers the ML, BL and 3L classes and only them.

#### UNet component

The UNet component is a classical [UNet](https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/unet.ipynb) architecture that takes an image of shape (height, width, channels) as input and returns a matrix of shape (height, width, 1) holding, for each pixel, the probability of belonging to the interesting region.

The ground truth for the interesting region of an image is a mask that covers the ML, BL and 3L classes and only them. The UNet model aims to cover this region as accurately as possible. The objective is to remove most of the background, which may contain pixels of arbitrary colors. This restricts the color domain of the color model from the whole image to the pixels of the ML, BL and 3L classes and their surroundings.
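
Deriving the ground truth interesting region from a label mask is a one-line operation. The sketch assumes the integer encoding BG = 0, ML = 1, BL = 2, 3L = 3, which is an illustrative assumption, as is the function name.

```python
import numpy as np

def interesting_region(label_mask):
    """Binary mask: 1 where the pixel is ML, BL or 3L, 0 where it is background."""
    return (label_mask > 0).astype(np.float32)

# Toy 2x2 label mask: one ML pixel and one 3L pixel.
labels = np.array([[0, 1],
                   [3, 0]])
region = interesting_region(labels)
print(region)
```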

As the loss function we used a weighted binary cross entropy between the predicted probabilities and the ground truth masks. For one pixel the loss is $L(y) = - w \cdot \bar{y} \log(y) - (1 - \bar{y})\log(1 - y)$, where $y$ is the prediction, $\bar{y}$ is the ground truth and $w$ is the weight applied to misclassifying a pixel of the interesting region.

The total loss is the mean of the per-pixel losses.

Since the classes remain highly imbalanced even when ML, BL and 3L are combined against BG, we need to weight the mistakes so that the model does not predict background for the whole image. The weight was chosen empirically to be $50$.
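
The loss above can be written down directly in NumPy. This is a sketch for illustration rather than the training code; the clipping constant `eps` is an addition for numerical stability and is not part of the formula.

```python
import numpy as np

def weighted_bce(y_pred, y_true, w=50.0, eps=1e-7):
    """Mean weighted binary cross entropy over all pixels."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0); not part of the formula
    per_pixel = -w * y_true * np.log(y_pred) - (1.0 - y_true) * np.log(1.0 - y_pred)
    return per_pixel.mean()

# Toy example: one interesting pixel out of four, model predicts "background" everywhere.
y_true = np.array([[0.0, 0.0], [0.0, 1.0]])
all_background = np.full_like(y_true, 0.01)
loss_unweighted = weighted_bce(all_background, y_true, w=1.0)
loss_weighted = weighted_bce(all_background, y_true, w=50.0)
print(loss_unweighted, loss_weighted)
```

With the weight applied, the all-background prediction is penalized far more heavily, which is the effect the weighting is meant to achieve.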

Prediction samples (not the best ones):

![unet_pred](./resources/UNet_pred00.png)
![unet_pred](./resources/UNet_pred02.png)
![unet_pred](./resources/UNet_pred04.png)

#### Color classification component

In the original paper this component is implemented as an SVM, but we replaced it with a simple fully connected perceptron (FNN) for two reasons:
* The SVM implementation in sklearn does not use the GPU, which made training and prediction slow.
* The predictions of the FNN were as good as those of the SVM.

The following table shows the f1-score measured inside the interesting regions. The dataset consisted of the scraped images and some of the 100x & 50x images, because these were the only images available at the time. The 3L class was also absent.

| class | f1-score (FNN) |  f1-score (SVM)  |
|:-----:|:--------------:|:----------------:|
|  BG   |     0.786      |      0.804       |
|  ML   |     0.489      |      0.308       |
|  BL   |     0.842      |      0.848       |

The component takes an RGB color as input and predicts a probability distribution over the BG, ML, BL and 3L classes for that color. Multi-class cross entropy is used as the loss function. Note that the class imbalance is less severe here, because the model only receives pixels from inside the interesting regions.

Prediction samples (not the best ones). Here ML is red, BL is green and 3L is blue:
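
To make the input and output of this component concrete, here is a small NumPy sketch of a perceptron with one hidden layer, a softmax output and the multi-class cross entropy. The hidden size, the random untrained weights and the toy colors are all assumptions; the text does not specify the actual network layout.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fnn_forward(colors, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a softmax over the 4 classes."""
    hidden = np.maximum(colors @ w1 + b1, 0.0)
    return softmax(hidden @ w2 + b2)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean multi-class cross entropy for integer class labels."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + eps))

rng = np.random.default_rng(0)
hidden_size = 16  # assumed; the layer sizes are not given in the text
w1, b1 = rng.normal(0.0, 0.5, (3, hidden_size)), np.zeros(hidden_size)
w2, b2 = rng.normal(0.0, 0.5, (hidden_size, 4)), np.zeros(4)

colors = np.array([[0.60, 0.60, 0.60], [0.55, 0.50, 0.62]])  # toy RGB values
labels = np.array([0, 1])                                    # assumed encoding: BG, ML
probs = fnn_forward(colors, w1, b1, w2, b2)                  # untrained, so arbitrary
print(probs.shape, cross_entropy(probs, labels))
```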

![ColNet_pred](./resources/ColNet_pred00.png)
![ColNet_pred](./resources/ColNet_pred02.png)
![ColNet_pred](./resources/ColNet_pred04.png)

#### Median correction

Since the model output for a pixel depends directly on the pixel's color, and the images in the dataset have different color distributions (e.g. some are shifted towards red), some kind of normalization is required.

Because graphene is very thin, its color in the picture strongly depends on the color of the surface, so it is logical to unify the surface color. Given that the channels are initially in $[0, 1]$, we linearly scale each color channel so that the median color becomes $(0.6, 0.6, 0.6)$ and then clip the values that are larger than 1 to 1.

The rationale is that the surface occupies more than half of all pixels in the picture and its color is roughly the same throughout the whole picture. The median color is therefore expected to be the color of the surface.
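
A minimal sketch of the median correction, assuming the image is a float array of shape (height, width, 3) with values in [0, 1] and that the median is taken per channel. The function name and the guard against a zero median are additions for illustration.

```python
import numpy as np

def median_correction(image, target=0.6):
    """Scale each channel so its median becomes `target`, then clip to [0, 1]."""
    image = image.astype(np.float64)
    medians = np.median(image, axis=(0, 1))     # one median per color channel
    scale = target / np.maximum(medians, 1e-8)  # guard against a zero median
    return np.clip(image * scale, 0.0, 1.0)

# Toy image with a reddish surface color and a little noise.
rng = np.random.default_rng(0)
image = np.clip(rng.normal([0.8, 0.5, 0.5], 0.02, size=(64, 64, 3)), 0.0, 1.0)
corrected = median_correction(image)
print(np.median(corrected, axis=(0, 1)))
```

After the correction the per-channel medians of the toy image are all close to 0.6, so the red shift of the surface is removed.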

#### Data and general method

The data used were the 100x & 50x images and their masks. The images have shape (2880, 2048, 3); we resize them to (256, 256, 3) and apply the median correction.

We train the UNet component on ground truth interesting regions computed from the ground truth masks. The UNet component outputs, for each pixel, the probability of belonging to the interesting region. We compare these probabilities with an empirically chosen threshold of $0.884$ to obtain the predicted interesting region mask: pixels with probability $> 0.884$ are set to $1$. The pixels marked $1$ form a dataset of colors on which we train the color FNN, using the true classes of the pixels as the ground truth. Finally, we evaluate the metrics of the full model, treating every pixel outside the predicted interesting regions as BG.
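
At prediction time the two components can be combined roughly as follows. This is a sketch under assumptions: `classify_colors` is a hypothetical stand-in for the trained color FNN, `toy_classifier` is a made-up rule used only to run the example, and BG is assumed to be encoded as 0.

```python
import numpy as np

THRESHOLD = 0.884  # empirically chosen value quoted in the text

def combine_predictions(region_prob, image, classify_colors, threshold=THRESHOLD):
    """Classify pixels inside the predicted region; every other pixel is BG (0)."""
    if region_prob.ndim == 3:                 # accept the (H, W, 1) UNet output
        region_prob = region_prob[..., 0]
    region = region_prob > threshold          # boolean mask of shape (H, W)
    labels = np.zeros(region.shape, dtype=np.int64)
    if region.any():
        # image[region] has shape (N, 3); the classifier returns N class ids.
        labels[region] = classify_colors(image[region])
    return labels

def toy_classifier(colors):
    # Stand-in for the trained FNN: bright pixels -> ML (1), dark pixels -> BL (2).
    return np.where(colors.mean(axis=1) > 0.5, 1, 2)

region_prob = np.array([[0.100, 0.90],
                        [0.884, 0.95]])
image = np.zeros((2, 2, 3))
image[0, 1] = 0.6
image[1, 1] = 0.3
labels = combine_predictions(region_prob, image, toy_classifier)
print(labels)
```

Note that a pixel whose probability equals the threshold exactly stays outside the region, because the comparison is strict.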

#### Metrics

The metrics used are precision, recall and f1-score.

Precision is calculated as $P = \frac{TP}{TP + FP}$.

Recall is calculated as $R = \frac{TP}{TP + FN}$.

The F1-score is the harmonic mean of $P$ and $R$: $F = \frac{2PR}{P + R}$.

Here $TP$, $FP$ and $FN$ are the numbers of true positive, false positive and false negative pixels for a given class.
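
The formulas above translate into the following per-class computation on flattened label arrays. The function name, the integer class encoding and the convention of returning 0 when a denominator is zero are assumptions made for this sketch.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=4):
    """Precision, recall, F1 and support for every class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    results = {}
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        # Assumed convention: a metric with a zero denominator is reported as 0.
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total > 0 else 0.0
        results[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": tp + fn}
    return results

# Toy labels instead of real predictions.
metrics = per_class_metrics([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
for c, m in metrics.items():
    print(c, m)
```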

Here are the results:

| class | precision | recall | f1-score | support |
|:-----:|:---------:|:------:|:--------:|:-------:|
|  BG   |   0.984   | 0.962  |  0.973   | 658402  |
|  ML   |   0.511   | 0.729  |  0.601   |  26584  |
|  BL   |   0.689   | 0.646  |  0.667   |  32224  |
|  3L   |   0.062   | 0.161  |  0.090   |  3686   |

### OCRNet approach

The original paper of this approach can be found [here](https://www.sciencedirect.com/science/article/pii/S0952197622007333).
The original OCRNet paper can be found [here](https://arxiv.org/abs/1909.11065).
