# Unsupervised Learning 2 - Other techniques (on Tileset7) - Aug 2017
Created:  16 Aug 2018 <br>
Last update: 24 Aug 2018 (small changes, but same results)


### Use some more unsupervised techniques learned from DataCamp

This continues the work from 'realxtals1-unsupervised1.ipynb'. Some functions from that notebook have been moved into imgutils, and the imgutils visualization has been extended (mostly the 'large heatmap' capability and extra annotations).

About the data: the data used here was prepared in my prior notebooks. It consists of images (from a larger 'tile set') that were sliced into sub-images, with image statistics computed for each sub-image.

<hr>
## 1. Imports

In [None]:
# suppress warning messages
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

# import
from sklearn import cluster
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import silhouette_score

import imgutils

In [None]:
# Re-run this cell if you altered imgutils
import importlib
importlib.reload(imgutils)

<hr>
## 2. Import Crystal Image Data & Statistics
The data was labeled and exported to csv in the notebook realxtals1_dataeng1.ipynb

#### About the data:
The CSV contains the image files, slice information (sub-images) and associated statistics, which are the features for which a classifier needs to be found. 

The goal is to find the clusters in feature space and use them to categorize the images. For this particular dataset, a single statistic could be used to label the sub-images into three classes:<br>

A = subimage contains no crystal, <br>
B = part of subimage contains crystal, <br>
C = (most of) subimage contains crystal

The labels have been added here for analysis only; eventually the data will be unlabelled.
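As an illustration of labelling by a single statistic, here is a minimal sketch. The standard-deviation values and the two thresholds are hypothetical and are not taken from the dataset; only the class names A, B and C follow the description above.

```python
import numpy as np

# Hypothetical per-tile standard deviations (made up for illustration)
img_std = np.array([2.0, 3.5, 11.0, 25.0, 40.0])

# Hypothetical thresholds: below 10 -> A, 10 to 30 -> B, 30 and above -> C
thresholds = [10.0, 30.0]

# np.digitize returns the index of the bin each value falls into (0, 1 or 2)
class_names = np.array(['A', 'B', 'C'])
classes = class_names[np.digitize(img_std, thresholds)]
print(classes)
```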

 



Import data:

In [None]:
df = pd.read_csv('../data/Crystals_Apr_12/Tileset7-2.csv', sep=';')
df.head(3)

imgnames = df['filename'].unique()

<hr>
## 3. Re-do some of the clustering from previous notebook

(so we have some comparison material)

### First vectorize the data:

In [None]:
# convert into X Y vectors:
feature_cols = ['|img_std|', '|img_std2|', '|img_mean|','|img_skewness|', '|img_kurtosis|','|img_mode|']
X = df.loc[:,feature_cols]

### k-means:

In [None]:
number_of_clusters = 3

In [None]:
k_means = cluster.KMeans(algorithm='auto', n_clusters=number_of_clusters, n_init=10, init='k-means++')
k_means_pred = k_means.fit_predict(X)
print("score (silhouette): ", silhouette_score(X, k_means_pred))

### Hierarchical clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Affinity = {“euclidean”, “l1”, “l2”, “manhattan”, “cosine”}
# Linkage = {“ward”, “complete”, “average”}

Hclustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
hierarch_pred = Hclustering.fit_predict(X)
print("score (silhouette): ", silhouette_score(X, hierarch_pred))


### Spectral and DBScan:

In [None]:
spectral = cluster.SpectralClustering(n_clusters=number_of_clusters,eigen_solver='arpack',affinity="nearest_neighbors")
spectral_pred = spectral.fit_predict(X)
print("score (silhouette): ", silhouette_score(X, spectral_pred))

# NOTE: DBSCAN marks noise points with label -1; silhouette_score treats them as one more cluster
dbscan = cluster.DBSCAN(eps=0.5, metric='euclidean', min_samples=10)
dbscan_pred = dbscan.fit_predict(X)
print("score (silhouette): ", silhouette_score(X, dbscan_pred))

### Also get the PCA-transformed data, and run k-means and hierarchical clustering on it

In [None]:
from sklearn import decomposition

fieldnames = ['pca_1','pca_2','pca_3', 'pca_4', 'pca_5']

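n_comp = 5

# NOTE: TruncatedSVD is used as the 'PCA' here; unlike decomposition.PCA it does not center the data
pca = decomposition.TruncatedSVD(n_components=n_comp)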
X_fit = pca.fit_transform(X)

# convert into X Y vectors:
df_pca = pd.DataFrame(X_fit[:,0:n_comp], columns=fieldnames[:n_comp])
X_pca = df_pca.loc[:,fieldnames[:n_comp]]

In [None]:
df_pca.head(3)

In [None]:
k_means_pca = cluster.KMeans(algorithm='auto', n_clusters=3, n_init=10, init='k-means++')
k_means_pca_pred = k_means_pca.fit_predict(X_pca)
print("score (silhouette): ", silhouette_score(X_pca, k_means_pca_pred))

In [None]:
Hclustering_pca = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
hierarch_pca_pred = Hclustering_pca.fit_predict(X_pca)
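# score in PCA space (the space these labels were fitted in), consistent with the k-means PCA cell
print("score (silhouette): ", silhouette_score(X_pca, hierarch_pca_pred))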

In [None]:
df['dummy'] = 0
imgutils.show_large_heatmap(df, 'dummy', imgnames[0:6], n_rows=2, n_cols=3, fig_size=(8,6))

<hr>
## 4. Visualize k-means and hierarchical clustering and assess scores

(The large heat map is now part of imgutils)


In [None]:
# Add the unsupervised clustering results to the dataframe
# (note: df3 is an alias of df, not a copy, so the new columns also appear in df)
df3 = df
df3['k_means'] = k_means_pred
df3['hierarch'] = hierarch_pred

figsize=(8,6)

In [None]:
# show heatmaps:
imgutils.show_large_heatmap(df3, 'k_means', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

In [None]:
imgutils.show_large_heatmap(df3, 'hierarch', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

### The baseline score of these two with manual counting

(See the previous notebook, unsupervised1. The idea is to count true positives and false positives for the important categories.)


In [None]:
def count_imgs_per_class(df_imgstats, classcolumn):
    return df_imgstats[classcolumn].value_counts()

def print_scores(methodname, class_count_tuples):
    """
    Print the true positive rate and false discovery rate per class.
    Each class tuple has the form (classname, n_true_pos, n_false_pos, n_real_pos).
    """
    print("")
    print("{:<20}|{:^12}|{:^12}|".format(methodname.upper(), "True Pos", "False Pos"))
    print("-"*(20+12+12+3))
    
    def print_score_line(class_name, TPR, FDR):
        print("{:<20}|{:^12.2%}|{:^12.2%}| ".format(class_name, TPR, FDR ))  
    
    for (class_name, n_true_pos, n_false_pos, n_real_pos) in class_count_tuples:
        TPR = n_true_pos/n_real_pos
        FDR = n_false_pos/(n_true_pos + n_false_pos)
        print_score_line(class_name, TPR, FDR)
    
    print("-"*(20+12+12+3))

In [None]:
print_scores('Manual (using STD)', [('Full Crystal', 11, 2, 11),  ('Partial Crystal', 6, 4, 8) ])
print_scores('Hierarchical', [('Full Crystal', 11, 1, 11), ('Partial Crystal', 7, 12, 8) ])
print_scores('K-means', [('Full Crystal', 10, 1, 11), ('Partial Crystal', 7, 13, 8) ])

**REMARKS**: 
- Some of the false positives in category 'Partial' are actually 'Full' ones, and some false positives in 'Full' are partial ones. So this score is a bit too strict, but accounting for this would require more complex scoring (or a full confusion matrix, where you would still need to note that some confusions are less critical than others).
- Running the algorithms gives some variation between runs, so these scores may deviate a bit (they were counted manually).
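The confusion matrix mentioned in the remarks could be built along these lines. This is only a sketch: the label arrays are hypothetical, and it assumes the cluster numbers have already been mapped to the class names A, B and C, which the notebook does not do.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (already mapped to class names)
y_true = np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'])
y_pred = np.array(['A', 'A', 'B', 'B', 'C', 'C', 'C', 'B'])
labels = ['A', 'B', 'C']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class rates, using the same definitions as print_scores above
true_pos = np.diag(cm)
tpr = true_pos / cm.sum(axis=1)                       # true positives / real positives
fdr = (cm.sum(axis=0) - true_pos) / cm.sum(axis=0)    # false positives / predicted positives
for name, t, f in zip(labels, tpr, fdr):
    print("{:<4} TPR={:.2%}  FDR={:.2%}".format(name, t, f))
```

The off-diagonal cells show which classes get confused with each other, so a 'Full' tile predicted as 'Partial' can be judged separately from one predicted as 'no crystal'.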

<hr>
## 5. 'Unsupervised scoring' based on cluster 'shape' (instead of using ground truth labels)

The silhouette score assesses 'cluster consistency' in a single number and is part of the sklearn package.
See https://en.wikipedia.org/wiki/Silhouette_(clustering)
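To make the number less of a black box, here is a small sketch that computes the silhouette by hand and compares it with sklearn. For each sample, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The toy data and the helper `manual_silhouette` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def manual_silhouette(X, labels):
    """Per-sample silhouette values, computed directly from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the sample itself
        if not same.any():
            continue                         # singleton cluster: score stays 0
        a = dist[i, same].mean()             # cohesion: own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)      # separation versus cohesion
    return scores

# Toy data: two well separated 1-D clusters
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

manual = manual_silhouette(X_toy, labels_toy)
print(manual)
print(manual.mean(), silhouette_score(X_toy, labels_toy))
```

Values close to 1 mean a sample sits well inside its cluster; values near 0 or below mean it lies between clusters or in the wrong one.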

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
def print_score(name, data, labels):
    print("%s: %f" % (name,silhouette_score(data, labels)))

### Scores for k-means, spectral, dbscan and hierarchical

In [None]:
print_score('k-means', X, k_means_pred)
print_score('spectral', X, spectral_pred)
print_score('dbscan', X, dbscan_pred)
print_score('hierarchical', X, hierarch_pred)

### And the pca variants

In [None]:
print_score('k-means PCA', X, k_means_pca_pred)
print_score('k-means PCA2', X_pca, k_means_pca_pred)
print_score('hierarchical PCA', X, hierarch_pca_pred)
print_score('hierarchical PCA2', X_pca, hierarch_pca_pred)


### The intrinsic scoring slightly prefers k-means

<hr>
## 6. Check some other interesting properties (from DataCamp)

### Examine PCA components importance, with the 'variance explained' of the PCA

In [None]:
var_ex = pca.explained_variance_ratio_
print(var_ex)

In [None]:
plt.plot(var_ex)

It is hard to point out an 'elbow', but judging from the values, the first two or three components are indeed the significant ones.
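When the elbow is unclear, the cumulative explained variance is sometimes easier to read. A minimal sketch, using hypothetical ratios (not the values printed above) and an arbitrary 95% target:

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_
var_ratio = np.array([0.70, 0.20, 0.06, 0.03, 0.01])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(var_ratio)
print(cumulative)

# Smallest number of components that reaches the (arbitrary) 95% target
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)
print("components needed:", n_needed)
```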


### And look at the correlation between statistics

In [None]:
from scipy.stats import pearsonr

print(feature_cols)

In [None]:
print("std - std2:", pearsonr(df['|img_std|'], df['|img_std2|']))
print("std - mean:", pearsonr(df['|img_std|'], df['|img_mean|']))
print("mean - kurtosis:", pearsonr(df['|img_mean|'], df['|img_kurtosis|']))
print("mean - skewness:", pearsonr(df['|img_mean|'], df['|img_skewness|']))
print("kurtosis - skewness:", pearsonr(df['|img_kurtosis|'], df['|img_skewness|']))
print("mean - mode:", pearsonr(df['|img_mean|'], df['|img_mode|']))


(The first value is the correlation coefficient, the second a p-value; see https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.pearsonr.html)
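Instead of calling `pearsonr` pair by pair, all pairwise coefficients can be obtained at once with `DataFrame.corr()` (Pearson by default, without p-values). A sketch on synthetic columns; the frame `demo` and its column names are made up:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
a = rng.normal(size=200)

# Synthetic data: 'b' is nearly a multiple of 'a', 'c' is independent noise
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

# Full matrix of Pearson correlation coefficients
corr = demo.corr()
print(corr.round(3))

# The same coefficient for one pair, plus its p-value
r, p = pearsonr(demo['a'], demo['b'])
print(r, p)
```

For the notebook's data this would correspond to something like `df[feature_cols].corr()`.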

As expected, std and std2 are highly correlated. It is interesting (and puzzling) that they can nevertheless be combined to identify clusters.

Correlation between PCA components?

In [None]:
print("pca_1 - pca_2:", pearsonr(df_pca['pca_1'], df_pca['pca_2']))
print("pca_1 - pca_3:", pearsonr(df_pca['pca_1'], df_pca['pca_3']))
print("pca_2 - pca_3:", pearsonr(df_pca['pca_2'], df_pca['pca_3']))

Indeed, as expected, pca components are uncorrelated.
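One caveat worth checking: `decomposition.PCA` centers the data, which is what makes its output components uncorrelated, whereas `TruncatedSVD` (used for `pca` in this notebook) does not center. The sketch below compares the two on synthetic data with non-zero means; the data and the helper `max_offdiag_corr` are made up, and what TruncatedSVD gives on the real features is not implied here.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)

# Synthetic, partly correlated features with non-zero means (not the notebook's data)
base = rng.normal(size=(300, 1))
X_demo = np.hstack([
    base + 5.0,
    2 * base + rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)) + 1.0,
    rng.normal(size=(300, 1)),
])

def max_offdiag_corr(scores):
    """Largest absolute Pearson correlation between two different columns."""
    corr = np.corrcoef(scores, rowvar=False)
    return np.abs(corr - np.diag(np.diag(corr))).max()

pca_scores = PCA(n_components=3).fit_transform(X_demo)
svd_scores = TruncatedSVD(n_components=3, random_state=0).fit_transform(X_demo)

print("PCA:         ", max_offdiag_corr(pca_scores))   # zero up to rounding
print("TruncatedSVD:", max_offdiag_corr(svd_scores))   # may be non-zero without centering
```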

<hr>

## 7. NMF Similarity (Non-Negative Matrix Factorization)

The idea is as follows:
* scale the data in such a way that it has no negative values
* apply NMF
* normalize each sample (so that its feature vector has length 1)
* pick one sample as the reference
* take the dot product of every sample with the reference sample
* if two feature vectors are very similar, their dot product will be close to one

If we pick a sub-image with clear crystals as the reference, we can see how well this dot product works for a heat map.

More info on NMF: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html (and on DataCamp)
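The steps listed above can be sketched end to end on synthetic data. The array `X_demo`, the number of components and the choice of reference row are assumptions for illustration; the cells that follow apply the same idea to the real features.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import minmax_scale, normalize

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 4))          # synthetic features, may be negative

# 1. scale every feature to [0, 1] so that nothing is negative
X_demo_scaled = minmax_scale(X_demo)

# 2. apply NMF; W holds the non-negative per-sample weights
nmf_demo = NMF(n_components=3, init='nndsvda', random_state=0, max_iter=1000)
W = nmf_demo.fit_transform(X_demo_scaled)

# 3. normalize each sample to unit length
W_norm = normalize(W)

# 4. pick a reference sample (here simply the row with the largest norm;
#    any non-zero row would do)
ref_idx = int(np.argmax(np.linalg.norm(W, axis=1)))

# 5. dot product of every sample with the reference
sims = W_norm @ W_norm[ref_idx]
print(sims[ref_idx], sims.min(), sims.max())
```

Because the NMF weights are non-negative, these similarities fall between 0 and 1, unlike the similarities on the raw statistics later on, which can be negative.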

In [None]:
import sklearn.preprocessing as skpreproc
from sklearn.decomposition import NMF

# ensure that all data is non-negative and in a sane range
X_scaled = skpreproc.minmax_scale(X)

nmf = NMF()
X_nmf = nmf.fit_transform(X_scaled)

# for feature comparison, each feature vector should be normalized (i.e. per sample)
X_nmf_norm = skpreproc.normalize(X_nmf)

df_nmf = pd.DataFrame(X_nmf_norm)
df_nmf.head(5)

Now let's select a reference tile from the first image

In [None]:
# Re-run this cell if you altered imgutils
import importlib
importlib.reload(imgutils)

img1, dummy = imgutils.getimgslices_fromdf(df, imgnames[0])
imgutils.showimgs(img1, tile_labels=True, fig_size=(6,6))

Let's use tile (2,3) as the reference image 'with crystal'; we need to determine its row number...

In [None]:
print(df[(df['filename']==imgnames[0]) & (df['s_y']==2) &  (df['s_x']==3)].iloc[:,2:10])

Ah, that is why I introduced the alias, so I can just use:

In [None]:
df[df['alias']=="img0_2-3"]

Anyway, it's row 11
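Rather than hardcoding the row number, the position can be derived from the alias. A minimal sketch on a toy frame; the alias format follows the cell above, while the frame `toy` and its values are made up:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; only the 'alias' column matters here
toy = pd.DataFrame({
    'alias': ['img0_0-0', 'img0_0-1', 'img0_2-3'],
    'val': [0.1, 0.2, 0.3],
})

# Positional index of the first row whose alias matches
mask = (toy['alias'] == 'img0_2-3').to_numpy()
ref_pos = int(np.flatnonzero(mask)[0])
print(ref_pos)
```

A positional index is what is needed to index a NumPy array such as `X_nmf_norm`, and it stays correct even if the dataframe index is not a plain 0..n-1 range.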

In [None]:
ref_vect = X_nmf_norm[11]
similarities = df_nmf.dot(ref_vect)

# check if indeed similarity of row 11 is 1 
print(similarities[10:13])

In [None]:
# assign this to the dataframe and then use this as heats
df3['similarity_img0-2-3'] = similarities
imgutils.show_large_heatmap(df3, 'similarity_img0-2-3', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

Hmmm, hard to assess, let's map to integer
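The integer mapping used in the following cells truncates a scaled similarity. A nearly equivalent and more explicit way to bin is `pd.cut` with equally spaced edges (the two only differ for values within 0.0001/n of a bin edge). The helper `bin_similarity` is hypothetical and not part of imgutils:

```python
import numpy as np
import pandas as pd

def bin_similarity(values, n_bins, low=0.0, high=1.0):
    """Map similarities in [low, high] to integer bins 0 .. n_bins-1."""
    edges = np.linspace(low, high, n_bins + 1)
    return pd.cut(np.asarray(values, dtype=float), bins=edges,
                  labels=False, include_lowest=True)

sims = [0.0, 0.2, 0.5, 0.9, 1.0]

binned = bin_similarity(sims, n_bins=3)
truncated = [int(x * 3 - 0.0001) for x in sims]   # the mapping used in this notebook
print(list(binned), truncated)

# The same helper also covers similarities in the range -1 .. +1
signed = bin_similarity([-1.0, 0.0, 1.0], n_bins=3, low=-1.0, high=1.0)
print(list(signed))
```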

In [None]:
df3['|sim_img0-2-3|'] = df3['similarity_img0-2-3'].map(lambda x: int(x * 3 - 0.0001))
df3['|sim_img0-2-3|'].value_counts()

In [None]:
imgutils.show_large_heatmap(df3, '|sim_img0-2-3|', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

Hmmm, color coding can be misleading. What if I just define two clusters?


In [None]:
df3['|sim_img0-2-3b|'] = df3['similarity_img0-2-3'].map(lambda x: int(x * 2 - 0.0001))
df3['|sim_img0-2-3b|'].value_counts()
imgutils.show_large_heatmap(df3, '|sim_img0-2-3b|', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

Ok. Let's also try without NMF decomposition (just similarity via dot product of original feature vectors). But first assess scores.

### Clustering score

In [None]:
# we can assess the clustering in either 'original space' or in NMF transformed space:
print_score('NMF (original)', X, df3['|sim_img0-2-3|'] )
print_score('NMF Transformed', X_nmf, df3['|sim_img0-2-3|'] )


Not very high...

<hr>
## 8. Similarity without any transformation

In [None]:
X_norm = skpreproc.normalize(X) # the original features, but normalized per sample
ref_vect2 = X_norm[11]  
df3['sim2_img0-2-3'] = X_norm.dot(ref_vect2)

# do a quick check; element 11 should have value 1
df3['sim2_img0-2-3'].iloc[9:13]

In [None]:
imgutils.show_large_heatmap(df3, 'sim2_img0-2-3', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

This actually looks pretty good. For a heatmap the gradual scale is nice, but for classification we need to reduce this to e.g. 3 clusters so that its intrinsic score can be compared with the other approaches.
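The similarity used in this section (normalize each sample, then take the dot product with the reference) is the cosine similarity, which ranges from -1 to +1. A short sketch on synthetic data showing that it matches sklearn's `cosine_similarity`; `X_demo` is made up and the reference index 11 is reused purely for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 6))     # synthetic feature vectors
ref_idx = 11

# normalize per sample, then dot with the reference (as in the cell above)
X_demo_norm = normalize(X_demo)
dot_sims = X_demo_norm @ X_demo_norm[ref_idx]

# the same quantity computed directly
cos_sims = cosine_similarity(X_demo, X_demo[[ref_idx]]).ravel()

print(np.allclose(dot_sims, cos_sims), dot_sims[ref_idx])
```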



In [None]:
# as we observed negative values, the range is now -1 to +1 (instead of 0 to 1);
# shift and rescale it before truncating to 3 categories again
df3['|sim2_img0-2-3|'] = df3['sim2_img0-2-3'].map(lambda x: int( (x+1) * 3 / 2 - 0.0001))
df3['|sim2_img0-2-3|'].value_counts()

In [None]:
imgutils.show_large_heatmap(df3, '|sim2_img0-2-3|', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

### Scoring with silhouette

In [None]:
# assess the clustering in 'original space' and 'rescaled space':
print_score('Similarity (original)', X, df3['|sim2_img0-2-3|'] )
print_score('Similarity (scaled)', X_norm, df3['|sim2_img0-2-3|'] )


A better way to score this would be using the assessment from the heatmap, which involves counting :-(

In [None]:
print_scores('Similarity', [('Full Crystal', 11, 1, 11),  ('Partial Crystal', 7, 13, 8) ])

Again, the false positive for the full-crystal class is debatable.

In general, I should maybe even aim for a more binary classification (with or without crystal).

<hr>
## 9. Similarity based on PCA components

In [None]:
# normalize the PCA features per sample so we can compare the feature vectors
X_pca_norm = skpreproc.normalize(X_pca) 

# compare via dot product with the reference image
ref_vect3 = X_pca_norm[11]  
df3['sim3_img0-2-3'] = X_pca_norm.dot(ref_vect3)

# do a quick check; element 11 should have value 1
df3['sim3_img0-2-3'].iloc[9:13]

In [None]:
imgutils.show_large_heatmap(df3, 'sim3_img0-2-3', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

Looks good; let's also try when truncated to 3 categories

In [None]:
nclust = 3
df3['|sim3_img0-2-3|'] = df3['sim3_img0-2-3'].map(lambda x: int( ((x+1)/2) * nclust  - 0.0001))
df3['|sim3_img0-2-3|'].value_counts()

imgutils.show_large_heatmap(df3, '|sim3_img0-2-3|', imgnames[0:6], n_rows=2, n_cols=3, fig_size=figsize)

### Score (silhouette)

In [None]:
# assess the clustering score in 'original space' and 'pca space':
print_score('PCA Similarity (original)', X, df3['|sim3_img0-2-3|'] )
print_score('PCA Similarity (pca)', X_pca, df3['|sim3_img0-2-3|'] )
print_score('PCA Similarity (normalized pca)', X_pca_norm, df3['|sim3_img0-2-3|'] )

So the similarity approach, clustered into 3 groups, gives a clustering score of ~0.42 in original or PCA space.

<hr>
## 10. Conclusions

* Via extra study and the DataCamp courses, I learned a few new techniques which I utilized here.
* One of them is scoring the clusters themselves via **silhouette scoring**, which does not require labelling
* An alternative unsupervised learning technique is **NMF (Non-negative Matrix Factorization)** plus feature vector similarity (this is often used for texts, or for images, but then based on the pixel values)
* The **NMF approach did not work** well for this data set
* However, using the same **similarity vector** approach **on the original statistics works well** (on this dataset)
* Applying the similarity approach on the PCA transformed statistics gave similar results

In other words: **(simple) vector similarity can be a good alternative for the heat maps**


## 11. Next Steps:
* try whole pipeline on the harder set
* consider chaining steps together via sklearn.pipeline
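As a starting point for the pipeline idea, here is a minimal sketch that chains scaling, PCA and k-means with `sklearn.pipeline.make_pipeline`. The synthetic blobs, the choice of `MinMaxScaler` and the parameter values are assumptions for illustration, not the notebook's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six image statistics
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=6, random_state=0)

# scale -> reduce dimensions -> cluster, as one estimator
pipe = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)

# fit_predict runs all steps and returns the cluster labels of the last one
labels = pipe.fit_predict(X_demo)
print("score (silhouette):", silhouette_score(X_demo, labels))
```

Keeping the steps in one object makes it easier to rerun the same preprocessing and clustering on another tile set.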

Michael Janus, 16 August 2018