# Mean Shift Clustering
## Author: [Jeremiah Croshaw](https://linktr.ee/jeremiahcroshaw)
#### Last Edited: Sept 23 2020

Because I wrote this code while employed by [Quantum Silicon Inc.](https://www.quantumsilicon.com/), I have been advised to share it under the GNU GPL.
***
Copyright (C) 2020  Jeremiah Croshaw

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; version 2
of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
***

### This code demonstrates unsupervised learning of defects for further segmentation of the 'unknown' defect class. Further details and results will be presented in Croshaw's PhD thesis.


### This code was developed as follow-up work to our [published work](https://iopscience.iop.org/article/10.1088/2632-2153/ab6d5e) on defect segmentation of scanning probe images of the H-Si(100) surface.


Author correspondence: croshaw@ualberta.ca

In [9]:
import h5py
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os

from sklearn.cluster import MeanShift


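def normalize(image):
    """Rescale the absolute value of an image to the range [0, 1]."""
    image = np.abs(image)
    image = image - image.min()
    peak = image.max()
    # Guard against division by zero when the image is constant.
    return image / peak if peak > 0 else image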

f = h5py.File("unknown_pure.h5", "r")  # open the dataset read-only

print(list(f.keys()))
# Load the whole "unknown" dataset into memory as a NumPy array.
images = f["unknown"][:]
data = images

print(data.shape)
# Flatten each 40x40 image into a 1600-element feature vector (3D array -> 2D array).
data_star = data.reshape(data.shape[0], 1600)
# This is a good place to add filtering or other image processing.


# Cluster the flattened images. The key parameter is bandwidth, the radius of the
# kernel MeanShift uses around each point; it sets the distance scale at which
# nearby samples (blobs in the data) merge into one cluster.
# The best value depends on the features and the data type.
clustering = MeanShift(bandwidth=1470).fit(data_star)

labels = clustering.labels_
#print(labels)
labels_unique, count_unique = np.unique(labels, return_counts = True)
print("number of unique labels",labels_unique.shape)
print("number of each element",count_unique) # number of samples in each cluster; three clusters hold most of the data.
print(count_unique.shape) # number of unique clusters found

##############################
# Save each defect image into a folder named after its cluster label.
#for x in range(data_star.shape[0]):
#    label = clustering.predict(data_star[x].reshape(1,-1))[0]
#    out_dir = os.path.join('1470', str(label))
#    os.makedirs(out_dir, exist_ok=True)
#    cv2.imwrite(os.path.join(out_dir, str(label) + '_' + str(x) + '.png'), data_star[x].reshape(40,40))
##############################



['unknown']
(703, 40, 40)
bandwidth 7020
number of unique labels (28,)
number of each element [259 190 180   2   2   5   2   2   2   2   2   4   2   2   2   2   2   2
   3   2   2   4   2   2  21   1   1   1]
(28,)
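
The cell above fixes the bandwidth by hand and notes that the best value depends on the features and the data. As a rough illustration of that dependence, the sketch below uses scikit-learn's `estimate_bandwidth` helper to suggest a starting value and then compares cluster counts for a few bandwidths. The synthetic `make_blobs` data, the `quantile=0.3` setting, the helper name `count_clusters`, and the value `100.0` are assumptions made for this example; they are not taken from the defect dataset or from the analysis above.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the flattened image array (hypothetical data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Suggest a bandwidth from nearest-neighbour distances in the data.
# quantile is an assumed setting; larger values give larger bandwidths.
bw = estimate_bandwidth(X, quantile=0.3, random_state=0)


def count_clusters(features, bandwidth):
    """Fit MeanShift with the given bandwidth and return the number of clusters."""
    model = MeanShift(bandwidth=bandwidth).fit(features)
    return len(np.unique(model.labels_))


n_est = count_clusters(X, bw)       # clusters at the estimated bandwidth
n_large = count_clusters(X, 100.0)  # a bandwidth wider than the whole data set

print("estimated bandwidth:", bw)
print("clusters at estimated bandwidth:", n_est)
print("clusters at bandwidth 100:", n_large)
```

A bandwidth wider than the data merges everything into a single cluster, while a very small one tends to fragment the data into many tiny clusters, so the estimate is best treated as a starting point for a manual scan.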

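
The counts printed above show a few large clusters and many very small ones. One way to separate them, sketched here with NumPy only, is to collect the sample indices belonging to each label and flag the clusters below a size threshold. The `labels` array is a small hypothetical stand-in for `clustering.labels_`, and `min_size = 3` is an assumed threshold, not a value used in the original analysis.

```python
import numpy as np

# Hypothetical labels standing in for clustering.labels_.
labels = np.array([0, 0, 1, 2, 0, 1, 3, 0, 1, 1])

labels_unique, count_unique = np.unique(labels, return_counts=True)

# Map each cluster label to the indices of the samples assigned to it.
members = {int(label): np.flatnonzero(labels == label) for label in labels_unique}

min_size = 3  # assumed threshold separating "large" from "small" clusters
large = [int(label) for label, count in zip(labels_unique, count_unique) if count >= min_size]

# Fraction of all samples that fall in clusters smaller than min_size.
small_fraction = count_unique[count_unique < min_size].sum() / labels.size

print("large clusters:", large)
print("indices in cluster 0:", members[0])
print("fraction of samples in small clusters:", small_fraction)
```

The index arrays in `members` can then be used to pull the matching rows out of the image array for inspection or saving.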