In [1]:
# load packages
import numpy as np
import pandas as pd

##  Image Colour Analysis

The purpose of this notebook is to explore the feasibility of classifying different rock types using pictures.

Gathering rock-type information at a mine can be a complex task. Even experienced geologists and mining engineers vary widely in how accurately they identify rocks. The task is challenging because rocks are heterogeneous: their images vary in shape, texture and colour. Computer Vision has proved to be a good tool for analyzing such complex rock images for rock-type classification.

In Computer Vision Technology, data is presented as images. Various information such as colour or texture could be extracted from images. 

The methodology that needs to be followed is: image acquisition, feature extraction, and classification. 

Ideally, all images should be captured in a similar scenario. 

For feature extraction, we are going to obtain some colour distribution plots for the three main channels: Blue, Green and Red. We will also obtain the Kurtosis and Skewness of the distributions and analyze how different rocks are from the tables and images we obtain. 

For each colour feature (skewness, kurtosis, and mean pixel value), we will compare the values across rock types and perform a statistical test (i.e. ANOVA). Then we can see whether certain rock types are easier to distinguish from one another.

The notebook is divided into the following 3 parts:
- description of image features
- how to extract these features
- anova test on features

### Description of image features

In this project, we are going to extract 3 colour-based features (skewness, kurtosis, average pixel). From QIO's expertise, we know that each rock has a different colour pattern. 

The images we are working with are in RGB (Red, Green, Blue) format; this tells us that each image is represented on these 3 colour scales/channels. Computing the 3 features on each of the 3 channels gives 9 features. 

Each feature is described below.

__Skewness__: It measures the asymmetry of a frequency distribution. In this case, it describes how the pixel values are spread around the mean pixel value:
- If the skewness is zero, the distribution is symmetric about the mean (as a normal distribution is).
- If the skewness is positive, the pixel values spread further to the right of the mean (longer right tail).
- If the skewness is negative, the pixel values spread further to the left of the mean (longer left tail).

__Kurtosis__: It measures how the tails of a distribution differ from those of the normal distribution. The values used here are excess kurtosis, so a normal distribution scores zero:
- If the kurtosis is zero, the tails are as heavy as those of a normal distribution.
- If the kurtosis is positive, the distribution has heavier tails than a normal distribution (peaked distribution).
- If the kurtosis is negative, the distribution has lighter tails than a normal distribution (flat distribution).

__Average pixel__: the mean pixel value inside each region of interest (i.e. the region of each rock type in every image).
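
For reference, one common way to define these statistics for the $n$ pixel values $x_1, \dots, x_n$ of one channel inside a region is sketched below. This is an assumption: the notebook does not state which estimator IRONN uses. These are the population (biased) forms, which correspond to the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k
$$

$$
\text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad \text{kurtosis} = \frac{m_4}{m_2^{2}} - 3
$$

Subtracting 3 is what makes the kurtosis of a normal distribution equal to zero, which is the convention used in the bullet points above.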

### How are we extracting these features?

The objective of IRONN is to identify the boundary of different rock types and classify them in every image. 

The first thing to do is to label the data properly. We are using a labelling package called [Labelme](https://github.com/wkentaro/labelme). 

The figure below shows an example image that has been labelled.

<img src="../../docs/00_images/03_eda_reports/00_img_poly_example.jpg" width="500px">



How do we extract the colour properties for each rock type? 
Let's take HEM as an example. First, we need an image that contains only HEM, so we extract just the HEM region and compute the information for that region only. We calculate the properties of HEM on the blue, green and red channels. We can also plot the colour distribution on each channel to visualize each feature. The following table shows the HEM portion and its colour distribution. 
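
The extraction step described above can be sketched as follows. This is a minimal illustration, not IRONN's actual code: the helper names `channel_stats` and `colour_features`, the boolean-mask representation of the labelled polygon, and the Blue, Green, Red channel order are all assumptions, and the image is synthetic.

```python
import numpy as np

def channel_stats(values):
    """Return (skewness, excess kurtosis, mean) of a 1-D array of pixel values."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    centred = values - mean
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    m4 = np.mean(centred ** 4)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis: 0 for a normal distribution
    return skewness, kurtosis, mean

def colour_features(image, mask, channel_names=("Blue", "Green", "Red")):
    """Compute the 9 colour features for the pixels selected by `mask`.

    image: (H, W, 3) array; channel order is assumed to match `channel_names`.
    mask:  (H, W) boolean array, True inside the labelled region.
    """
    features = {}
    for idx, name in enumerate(channel_names):
        pixels = image[..., idx][mask]  # 1-D array of the region's pixels
        skewness, kurtosis, mean = channel_stats(pixels)
        features[f"Skewness{name}"] = skewness
        features[f"Kurtosis{name}"] = kurtosis
        features[f"MeanPixel{name}"] = mean
    return features

# Synthetic example: a random image with a rectangular stand-in for a HEM region.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)
mask = np.zeros((40, 60), dtype=bool)
mask[10:30, 15:45] = True

hem_features = colour_features(image, mask)
```

Note that a region with a single constant pixel value has zero variance, so the skewness and kurtosis are undefined there; a real pipeline would need to handle that case.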



HEM Portion | Colour Distribution of HEM
- | -
<img src="../../docs/00_images/03_eda_reports/01_sample_img_mask.jpg" alt="Drawing" style="width: 400px;"/>|<img src="../../docs/00_images/03_eda_reports/02_colour_dist.jpg" alt="Drawing" style="width: 400px;"/>



Based on the visualization above, we can see that on the blue channel, the distribution of HEM has a longer tail on the right side. This indicates that it has a positive skewness value. 

The actual skewness value, as calculated by IRONN, is about 0.8. We can then generate the table shown below, which contains the features on the 3 channels for each rock type in every image. Each row contains the image name, the rock type, and the corresponding 9 features. A final column, `CombinedType`, adds a coarser grouping into Ore, Dilution Waste, and Contamination Waste.

Image names are duplicated because each image can contain more than one rock type.

In [12]:
# Read roughness table
roughness_df = pd.read_csv("../../ironn/modules/output/img_tbl_w_roughness.csv", index_col=0)
roughness_df = roughness_df.dropna(axis=0)

# Glance through first 3 rows
roughness_df.head(3)

Unnamed: 0,file_name,Type,SkewnessBlue,KurtosisBlue,MeanPixelBlue,SkewnessGreen,KurtosisGreen,MeanPixelGreen,SkewnessRed,KurtosisRed,MeanPixelRed,CombinedType
0,20190516_bw-718-095_3_JH.JPG,HEM,0.656949,0.821174,82.554793,0.872976,1.250464,77.444193,0.952001,1.284822,75.444946,ORE
1,20190516_bw-718-095_3_JH.JPG,QR,0.11049,-0.149865,111.069092,0.272767,-0.175188,109.944361,0.167813,-0.375757,115.239605,DW
2,20190503_bw-718-095-7_AS.JPG,HEM,8.107456,129.676875,39.687769,9.118331,151.565428,29.072989,8.625957,141.175701,28.335578,ORE


### ANOVA Test on Features as a First Approach

The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. In our case, the numeric variables are the roughness features and the categorical variable is the rock type.

- __$H_0$: the average value of each roughness feature is the same for all rock types__
- __$H_A$: the average is not the same for all groups__

The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid.

- The samples are independent.
- Each sample is from a normally distributed population.
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
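
To make the test statistic concrete, here is a minimal sketch of the one-way ANOVA F statistic computed by hand. The function name `one_way_anova_f` and the toy values are hypothetical; only the column names `CombinedType` and `MeanPixelRed` come from the feature table. In practice `scipy.stats.f_oneway` returns both the statistic and the p-value, and the source does not say which implementation produced the results reported here.

```python
import numpy as np
import pandas as pd

def one_way_anova_f(df, feature, group_col):
    """Return the one-way ANOVA F statistic of `feature` across `group_col`."""
    groups = [g[feature].to_numpy(dtype=float) for _, g in df.groupby(group_col)]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of samples
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy data (made-up values, three samples per group).
toy = pd.DataFrame({
    "CombinedType": ["ORE"] * 3 + ["DW"] * 3 + ["CW"] * 3,
    "MeanPixelRed": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
f_stat = one_way_anova_f(toy, "MeanPixelRed", "CombinedType")
```

A large F means the group means are spread out relative to the variation inside each group, which is what leads to a small p-value.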

The following table shows the results of the ANOVA test.

In [5]:
pd.read_csv("../../ironn/modules/output/stat_test_result/anova_test_all_combinedtype.csv",index_col=0).round(decimals=4)

Unnamed: 0,Test statistic,p-value
KurtosisBlue,6.3132,0.0019
MeanPixelBlue,21.2605,0.0
SkewnessGreen,4.2772,0.014
KurtosisGreen,0.7748,0.4609
MeanPixelGreen,27.9081,0.0
SkewnessRed,17.6222,0.0
KurtosisRed,6.982,0.001
MeanPixelRed,51.097,0.0



It would seem that, besides `KurtosisGreen` (p = 0.4609), all p-values are < 0.05, which would indicate that for each of those roughness features the mean differs significantly for at least one group.

## Are the rocks the right colour?

First of all, it is extremely important to check that the rock colour IRONN gives us matches what QIO expects:

<img src="../../docs/00_images/03_eda_reports/03_rock_prop_image.png" width="900px">

In [7]:
pd.read_csv("../../ironn/modules/output/stat_test_result/mean_pixel_tbl_type.csv",index_col=0).round(decimals=3)

Unnamed: 0,AvgMeanPixelBlue,SdMeanPixelBlue,AvgMeanPixelGreen,SdMeanPixelGreen,AvgMeanPixelRed,SdMeanPixelRed,MaxValueColour
AMP,81.496,26.657,79.834,27.06,78.552,27.451,Blue
BS,94.71,31.625,99.611,30.422,104.169,28.9,Red
GN,91.373,22.188,91.589,22.218,91.919,22.41,Red
HEM,87.373,38.28,86.361,38.324,89.491,37.617,Red
IFG,70.317,24.702,81.074,26.52,94.063,29.83,Red
LIM1,65.54,21.4,78.421,26.131,94.847,32.082,Red
LIM1-2,87.712,0.0,127.656,0.0,171.859,0.0,Red
LIMO,63.268,22.631,74.605,25.283,89.733,29.16,Red
MAG,94.004,36.946,92.322,38.232,91.935,40.035,Blue
MS,143.913,3.558,149.43,4.879,145.732,7.221,Green


We can see that most rocks do pretty well in terms of what colour they are supposed to be! This is amazing! Since we only have 3 channels to work with, we expect Gray to show up as `Blue` and Brown to lean towards `Red`. 

The only 2 rocks that we are worried about are GN, which for IRONN is `Red` but for QIO is `Pale Gray`, and BS, which should be leaning more towards the `Green` channel (currently `Red`). SIF and WSIF should also be leaning more towards the `Green` channel instead of the `Blue` channel. However, if we look at the `Avg Mean Pixel` numbers, the differences between channels are not very large. 

We need to figure out which images might be causing the disruptions with the colour channel or see if we can gather more data that is completely accurate.

A couple of images seem to have been taken from behind a window, although we might be wrong. This would definitely alter the `Blue` channel.
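
A summary table with this layout could be derived from the per-image feature table roughly as follows. This is a sketch under assumptions: the input values are made up, and using `ddof=0` for the standard deviation is a guess, since the source does not say how `Sd...` was computed.

```python
import pandas as pd

# Synthetic rows shaped like the feature table (values are made up).
features = pd.DataFrame({
    "Type": ["HEM", "HEM", "QR", "QR"],
    "MeanPixelBlue": [80.0, 90.0, 110.0, 112.0],
    "MeanPixelGreen": [78.0, 86.0, 109.0, 111.0],
    "MeanPixelRed": [82.0, 94.0, 105.0, 107.0],
})

cols = [f"MeanPixel{c}" for c in ("Blue", "Green", "Red")]
grouped = features.groupby("Type")[cols]

avg = grouped.mean().add_prefix("Avg")        # AvgMeanPixelBlue, ...
sd = grouped.std(ddof=0).add_prefix("Sd")     # SdMeanPixelBlue, ... (ddof assumed)

summary = pd.concat([avg, sd], axis=1)
# Name of the channel with the largest average mean pixel for each rock type.
summary["MaxValueColour"] = (
    avg.idxmax(axis=1).str.replace("AvgMeanPixel", "", regex=False)
)
```

Because `MaxValueColour` only records which channel is largest, it hides how close the channels are; comparing the `Avg...` columns directly is more informative when the three values are nearly equal.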

## Pairwise comparison

We saw earlier that the ANOVA test suggested there might be interesting differences in colour properties across the images. This, and the fact that most rocks fall in their "right colour bin", makes us wonder whether the rocks' image properties differ when tested pairwise. 

We are going to use Tukey's multiple-comparison test at P < 0.05. This corrects for the fact that multiple comparisons are being made, which would otherwise increase the probability of a significant difference being identified by chance. 

A result of `reject = True` means that a significant difference has been observed. The following table shows the pairwise comparison results.

In [8]:
pd.read_csv("../../ironn/modules/output/stat_test_result/pairwise_test_result_type.csv", index_col=0).head(20)

Unnamed: 0,Group1,Group2,SkewnessBlue,MeanPixelBlue,SkewnessGreen,KurtosisGreen,MeanPixelGreen,SkewnessRed,KurtosisRed,MeanPixelRed,CountRejection
0,IFG,QR,False,True,False,False,True,True,False,True,4
1,HEM,QR,False,True,False,False,True,True,False,True,4
2,AMP,QR,False,True,False,False,True,True,False,True,4
3,AMP,WSIF,False,True,False,False,True,False,False,True,3
4,IFG,WSIF,False,True,False,False,True,False,False,False,2
5,LIM1,QR,False,True,False,False,True,False,False,False,2
6,AMP,IFG,False,True,False,False,False,False,False,True,2
7,LIM1,WSIF,False,True,False,False,True,False,False,False,2
8,HEM,WSIF,False,True,False,False,True,False,False,False,2
9,BS,LIM1,False,True,False,False,False,False,False,False,1


OK... now, this is not as happy as we would want it to be... 

We can see that only a few pairs are easily identifiable from each other, image-wise. In the rows shown, IFG, HEM and AMP each differ from QR on 4 features, and AMP differs from WSIF on 3. The other pairs can be differentiated by only 1 or 2 features, or by none at all. 
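
The `CountRejection` column can be read as a row-wise count of `True` flags. A minimal sketch, using three rows copied from the table above: how the flags themselves were produced is not shown in this notebook (a function such as `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` would be one option, but that is an assumption).

```python
import pandas as pd

# Three rows copied from the pairwise comparison table.
pairwise = pd.DataFrame(
    [
        ["IFG", "QR", False, True, False, False, True, True, False, True],
        ["IFG", "WSIF", False, True, False, False, True, False, False, False],
        ["BS", "LIM1", False, True, False, False, False, False, False, False],
    ],
    columns=[
        "Group1", "Group2",
        "SkewnessBlue", "MeanPixelBlue",
        "SkewnessGreen", "KurtosisGreen", "MeanPixelGreen",
        "SkewnessRed", "KurtosisRed", "MeanPixelRed",
    ],
)

# Every column except the two group labels is a reject flag.
feature_cols = pairwise.columns.drop(["Group1", "Group2"])

# Count how many features reject the null hypothesis for each pair.
pairwise["CountRejection"] = pairwise[feature_cols].sum(axis=1)

# Most separable pairs first.
ranked = pairwise.sort_values("CountRejection", ascending=False)
```

Sorting by this count puts the most separable pairs first, which matches the ordering of the table above.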

## Reasons why this might not be going the expected way?

Some images seem to have a wrong label/mask that points up at the sky. We are currently reviewing why. This would definitely change the MeanPixelBlue.

Some images also seem to have been taken from behind a glass/window (?) which would also impact the `Blue` Channel.

Unbalanced data. Some rock types have more samples than others. In addition, differences in label size or pixel resolution might be creating some noise. We need to figure out a way to normalize this noise.

## OK, if 10+ rock classification is hard, what about 3 classes?

We are also looking into the possibility of classifying material between Ore, Dilution Waste and Contamination Waste.

Intuitively, `Ore` should tend to be more `Red`. `Dilution Waste` should also have tinges of red as it contains `Ore`. For Contamination Waste, for now, we are just going to say that any colour other than red would satisfy us. This will be addressed later by the experts.

In [9]:
pd.read_csv("../../ironn/modules/output/stat_test_result/mean_pixel_tbl_combinedtype.csv",index_col=0).round(decimals=3)

Unnamed: 0,AvgMeanPixelBlue,SdMeanPixelBlue,AvgMeanPixelGreen,SdMeanPixelGreen,AvgMeanPixelRed,SdMeanPixelRed,MaxValueColour
CW,87.566,27.425,86.801,27.785,86.212,28.11,Blue
DW,95.637,40.009,100.72,39.736,109.947,39.191,Red
ORE,81.562,34.801,84.63,34.533,91.054,35.122,Red


We can see that our intuition did not fail us. We would now like to see if IRONN can at least differentiate between these rock classes given the previous 9 rock image features.

In [10]:
pd.read_csv("../../ironn/modules/output/stat_test_result/pairwise_test_result_combinedtype.csv", index_col=0)

Unnamed: 0,Group1,Group2,SkewnessBlue,MeanPixelBlue,SkewnessGreen,KurtosisGreen,MeanPixelGreen,SkewnessRed,KurtosisRed,MeanPixelRed,CountRejection
0,DW,ORE,False,True,True,False,True,True,True,True,6
1,CW,DW,False,True,True,False,True,True,False,True,5
2,CW,ORE,True,True,False,False,False,False,True,True,4


Now, this looks way better.

Ore and Dilution Waste would be the easiest to differentiate for IRONN. 

The hardest one would be Ore from Contamination Waste. However, that pair still has 4 significantly different features, and we know that Contamination Waste tends to be bluer. 