###### Clustering

Clustering is a branch of unsupervised machine learning in which we partition a
dataset into groups of similar items. "Unsupervised" means that there is no prior
training on labeled examples: the dataset is unlabeled. Common clustering techniques
include K-means, Mean Shift, DBSCAN, and hierarchical clustering.
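To make the idea concrete, here is a minimal sketch of the K-means loop (Lloyd's algorithm) in plain Python. It is only an illustration, not scikit-learn's implementation: the function name `kmeans`, the toy points, and the fixed iteration count are all assumptions made for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of 2-D points (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other. Which group receives index 0 depends on the random initialization, since cluster indices carry no meaning of their own.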

###### Image clustering


Image clustering is an essential data analysis tool in machine
learning and computer vision. Many applications
such as content-based image annotation and
image retrieval can be viewed as different instances
of image clustering. Technically, image clustering
is the process of grouping images into clusters such that
images within the same cluster are similar to each other,
while images in different clusters are dissimilar.

In [2]:
import os
# Code: import Kmeans library from sklearn ( 1 point)
from sklearn.cluster import KMeans


###### VGG 

VGG is a convolutional neural network model for image recognition proposed by the Visual Geometry Group at the University of Oxford. VGG16 refers to a VGG model with 16 weight layers, and VGG19 to a VGG model with 19 weight layers. In VGG16, the input layer takes an image of size 224 x 224 x 3, and the output layer is a softmax prediction over 1000 classes. The layers from the input through the last max pooling layer (whose output is 7 x 7 x 512) are regarded as the feature extraction part of the model.
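The 7 x 7 x 512 shape can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout: five 2 x 2 max pooling layers with stride 2, convolutions that preserve spatial size, and 512 channels in the last convolutional block. The variable names are made up for this example.

```python
input_size = 224          # height and width of the input image
num_pool_layers = 5       # assumed: five 2x2 max pooling layers, each halving height and width
final_channels = 512      # assumed: channels in the last convolutional block

spatial = input_size // (2 ** num_pool_layers)        # 224 / 32 = 7
feature_shape = (spatial, spatial, final_channels)    # (7, 7, 512)
flattened_length = spatial * spatial * final_channels

print(feature_shape)       # (7, 7, 512)
print(flattened_length)    # 25088
```

Flattening one such feature map gives a vector of length 25088, which is the per-image feature length that the clustering step works with.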

In [14]:
from keras.preprocessing import image
# Code: import VGG feature extraction from keras application as VGG16 (1 point)
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np

model = VGG16(weights='imagenet', include_top=False)    
# Code: Specify path of the random image from the training dataset. (1 point)
img_path = "dataset/train_dataset/_83930440_lion-think-976.jpg"
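img = image.load_img(img_path, target_size=(224, 224)) 
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)  # add a batch dimension: (1, 224, 224, 3)
img_data = preprocess_input(img_data)        # same VGG16 preprocessing as in extract_feature

vgg16_feature = model.predict(img_data)  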

# Code: print the shape of the vgg16_feature  (1 point)
print('Shape of the vgg16 features is :',vgg16_feature.shape)

Shape of the vgg16 features is : (1, 7, 7, 512)


In [15]:
# The given function will extract the features from the images.
def extract_feature(directory):
    vgg16_feature_list = []

    for filename in os.listdir(directory):

        img = image.load_img(os.path.join(directory,filename), target_size=(224, 224))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)

        vgg16_feature = model.predict(img_data)
        vgg16_feature_np = np.array(vgg16_feature)
        vgg16_feature_list.append(vgg16_feature_np.flatten())

    vgg16_feature_list_np = np.array(vgg16_feature_list)
    
    return vgg16_feature_list_np

The given dataset has three classes: Lion, Fish, and Zebra. However, we do not provide any
supervision to the model, i.e. we do not specify which image belongs to which
class / cluster. Instead, we use unsupervised image clustering to create the clusters.

In [18]:
train_feature_vector = extract_feature("dataset/train_dataset")  # pass the path of the folder where you have the training dataset
# Code: create the kmeans object and initialize it with the number_of_clusters = 3   (2 point)
kmeans_model = KMeans(n_clusters=3)
kmeans_model.fit(train_feature_vector) 
   


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [20]:
# Create a test feature matrix using the extract_feature function. It returns an array of shape
# (number of images, length of the flattened feature vector).

test_vector  = extract_feature('dataset/test_dataset')  # (1 point)

In [21]:
# Code: print the shape of the test vector   # (1 point)
print('Shape of the test vector is :',test_vector.shape)

Shape of the test vector is : (32, 25088)


In [31]:
# Code: use the kmeans model to predict the labels for the test vector (1 point)
labels = kmeans_model.predict(test_vector)
print('Length of labels: ',len(labels))

Length of labels:  32


In [33]:
# Code: Using the labels and the images, save the test images in the different folders in respective 
# clusters.   (2 point)
# Assuming label 0 is for fish, label 1 is for zebra and label 2 is for lion. K-means assigns
# cluster indices arbitrarily, so check this mapping against the actual clusters after each fit.
from shutil import copyfile
test_folder = "dataset/test_dataset/"
output_folders = {0: "output/fish/", 1: "output/zebra/", 2: "output/lion/"}
for folder in output_folders.values():
    os.makedirs(folder, exist_ok=True)  # copyfile fails if the destination folder does not exist
# os.listdir is also what extract_feature iterates over, so labels[i] corresponds to test_file[i].
test_file = [f for f in os.listdir(test_folder) if not f.startswith('.DS_Store')]
for filename, label in zip(test_file, labels):
    copyfile(os.path.join(test_folder, filename),
             os.path.join(output_folders[int(label)], filename))
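Because the label-to-animal mapping above is only an assumption, it helps to look at which files landed in each cluster before trusting the folder names. The following is a minimal sketch of such a check. The helper `group_by_cluster`, the file names, and the label values are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def group_by_cluster(filenames, cluster_labels):
    """Return a dict mapping each cluster index to the file names assigned to it."""
    clusters = defaultdict(list)
    for name, label in zip(filenames, cluster_labels):
        clusters[int(label)].append(name)  # int() also accepts NumPy integer labels
    return dict(clusters)

# Hypothetical file names and labels, standing in for test_file and labels.
example_files = ["lion_01.jpg", "fish_01.jpg", "zebra_01.jpg", "lion_02.jpg"]
example_labels = [2, 0, 1, 2]

groups = group_by_cluster(example_files, example_labels)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

Applied to the real `test_file` and `labels`, a listing like this shows which animal dominates each cluster index, so the output folders can be matched to the clusters rather than guessed.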