Clustering

Clustering is a branch of unsupervised machine learning in which we partition a
dataset into groups of similar items. ‘Unsupervised’ means that the dataset is unlabeled
and no labeled examples are used for training. Common clustering techniques include
K-means, Mean Shift, DBSCAN, and hierarchical clustering.

Image clustering


Image clustering is an essential data analysis tool in machine learning and computer
vision. Many applications, such as content-based image annotation and image retrieval,
can be viewed as instances of image clustering. Technically, image clustering is the
process of grouping images into clusters such that images within the same cluster are
similar to each other, while those in different clusters are dissimilar.
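
To make "similar within a cluster, dissimilar across clusters" concrete, here is a minimal sketch with made-up 2-D feature vectors. The points and the cluster assignment are illustrative assumptions, not taken from the dataset used below. It compares the average Euclidean distance between points that share a cluster with the average distance between points in different clusters.

```python
import numpy as np

# Hypothetical 2-D "feature vectors" for six images and an assumed cluster assignment.
features = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
assignment = np.array([0, 0, 0, 1, 1, 1])

# Pairwise Euclidean distances between all vectors, shape (6, 6).
diffs = features[:, None, :] - features[None, :, :]
distances = np.linalg.norm(diffs, axis=-1)

# Boolean masks: pairs in the same cluster, and pairs of two different points.
same_cluster = assignment[:, None] == assignment[None, :]
off_diagonal = ~np.eye(len(features), dtype=bool)

mean_within = distances[same_cluster & off_diagonal].mean()
mean_between = distances[~same_cluster].mean()

print(f"mean distance within clusters:  {mean_within:.3f}")
print(f"mean distance between clusters: {mean_between:.3f}")
```

A good clustering of real image features should show the same pattern: a small within-cluster distance relative to the between-cluster distance.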

In [1]:
import os
from sklearn.cluster import KMeans   # import Kmeans from sklearn
from sklearn import metrics
import matplotlib.pyplot as plt
import shutil

In [2]:
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16        # import VGG from keras application as VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np

model = VGG16(weights='imagenet', include_top=False)    

img_path = '/home/chinesh/Desktop/tapchief/bits/dataset/_83930440_lion-think-976.jpg'
img = image.load_img(img_path, target_size=(224, 224)) 
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = preprocess_input(img_data)

vgg16_feature = model.predict(img_data)  

print(vgg16_feature.shape)  # print the shape of the feature

Using TensorFlow backend.

(1, 7, 7, 512)


VGG is a convolutional neural network for image recognition proposed by the Visual Geometry Group at the University of Oxford. VGG16 is the variant with 16 weight layers, and VGG19 is the variant with 19 weight layers. In VGG16, the input layer takes an image of size 224 x 224 x 3, and the output layer is a softmax prediction over 1000 classes. Everything from the input layer to the last max-pooling layer (output size 7 x 7 x 512) is regarded as the feature-extraction part of the model. Because the model above is created with include_top=False, the fully connected classifier layers are dropped and predict returns that last feature map, which is why the printed shape is (1, 7, 7, 512).
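
The 7 x 7 x 512 size can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout: five 2 x 2 max-pooling layers, each halving height and width, and 512 channels in the last convolutional block. The variable names are made up for the example.

```python
# Spatial size after VGG16's five 2x2 max-pooling layers (each halves height and width).
input_size = 224
num_pooling_layers = 5
last_block_channels = 512

spatial_size = input_size // (2 ** num_pooling_layers)
feature_map_shape = (spatial_size, spatial_size, last_block_channels)

# Length of the vector obtained by flattening one feature map.
flattened_length = spatial_size * spatial_size * last_block_channels

print(feature_map_shape)   # (7, 7, 512)
print(flattened_length)    # 25088
```

The flattened length, 25088, is the number of features each image contributes once its feature map is flattened into a single row.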

In [64]:
def extract_feature(directory):
    """Return an array of shape (number of images, 25088) of flattened VGG16 features."""
    vgg16_feature_list = []

    # Files are processed in os.listdir() order; rows of the result follow that order.
    for filename in os.listdir(directory):
        # Load and preprocess the image exactly as in the single-image example above.
        img = image.load_img(os.path.join(directory, filename), target_size=(224, 224))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)

        # predict returns a NumPy array of shape (1, 7, 7, 512); flatten it to one row.
        vgg16_feature = model.predict(img_data)
        vgg16_feature_list.append(vgg16_feature.flatten())

    return np.array(vgg16_feature_list)

The dataset has three classes: Lion, Fish and Zebra. However, we do not give the model any
supervision, i.e. we do not specify which image belongs to which class / cluster. Instead,
we use unsupervised image clustering to create the clusters.

In [59]:
train_feature_vector = extract_feature('/home/chinesh/Desktop/tapchief/bits/final_dataset')  # pass the path of the folder where you have the training dataset
number_of_clusters = 3

kmeans_model = KMeans(n_clusters=number_of_clusters) # create the kmeans object and initialize it with the number_of_clusters
kmeans_model.fit(train_feature_vector) # call fit function on the train_feature_vector 
   


0
100
200
300


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
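
For intuition about what fit is doing, the sketch below implements the basic k-means iteration (often called Lloyd's algorithm) in NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is a simplified illustration, not scikit-learn's implementation. The function name `simple_kmeans`, the explicitly passed initial centroids (scikit-learn uses k-means++ initialisation by default, as the output above shows), and the toy data are all assumptions made for the example.

```python
import numpy as np

def simple_kmeans(points, initial_centroids, n_iter=100):
    """Simplified k-means (hypothetical helper); returns (centroids, labels)."""
    centroids = np.asarray(initial_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs of 2-D points (20 points each).
rng = np.random.default_rng(42)
toy_points = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                        rng.normal(10.0, 0.5, size=(20, 2))])

# Start with one point from each blob so the example is deterministic.
toy_centroids, toy_labels = simple_kmeans(toy_points, toy_points[[0, 20]])
print(toy_centroids)
print(toy_labels)
```

In the notebook the same idea is applied to 25088-dimensional VGG16 feature vectors instead of 2-D points, with three centroids instead of two.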

In [60]:
# Create the test matrix using the extract_feature function. It returns an array of shape
# (number of images, size of the feature vector).

test_vector  = extract_feature('/home/chinesh/Desktop/tapchief/bits/test_dataset/')

0


In [65]:
test_vector.shape

(33, 25088)

In [66]:
# use the kmeans model to predict the labels for the test vector

In [61]:
labels = kmeans_model.predict(test_vector)
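
The cluster indices returned by predict are arbitrary: cluster 0 does not correspond to any particular animal. A quick sanity check is to count how many images landed in each cluster. The sketch below uses a small made-up array in place of labels, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical cluster indices for six images, standing in for `labels` above.
example_labels = np.array([1, 0, 2, 0, 1, 0])

# np.bincount counts how often each non-negative integer occurs;
# minlength=3 keeps a slot for every cluster even if one is empty.
cluster_sizes = np.bincount(example_labels, minlength=3)

for cluster_index, size in enumerate(cluster_sizes):
    print(f"cluster {cluster_index}: {size} images")
```

A cluster that is empty or far larger than the others is a hint that the chosen number of clusters or the features may need another look.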


In [67]:
# Using the labels and the images, save the images in different folders, one folder per
# cluster.

In [62]:
directory = '/home/chinesh/Desktop/tapchief/bits/test_dataset'

In [63]:
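path = {0: 'label_0', 1: 'label_1', 2: 'label_2'}
save_dir = '/home/chinesh/Desktop/tapchief/bits/result'
# labels[i] belongs to the i-th file returned by os.listdir(directory), the same order
# extract_feature used when building test_vector.
for filename, label in zip(os.listdir(directory), labels):
    target_dir = os.path.join(save_dir, path[label])
    os.makedirs(target_dir, exist_ok=True)  # create the cluster folder if it is missing
    print(target_dir)
    shutil.copy(os.path.join(directory, filename), target_dir)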

/home/chinesh/Desktop/tapchief/bits/result/label_1
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_2
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_1
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_1
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_1
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_2
/home/chinesh/Desktop/tapchief/bits/result/label_0
/home/chinesh/Desktop/tapchief/bits/result/label_2
/home/chinesh/Desktop/tapchief/