# DEC Supervised Learning

This notebook trains the supervised learning model for DEC to obtain a performance benchmark and to check whether the model can separate the two classes into distinct clusters in the N-dimensional feature space.

### Importing required Libraries

Import the libraries and modules used throughout the notebook.

In [1]:
import tensorflow as tf
import keras.backend as K
import numpy as np
import pandas as pd
# loading the requirements for the Xception model
from keras.applications.xception import Xception
from keras.applications.xception import absolute_import, decode_predictions, preprocess_input
from keras.layers import Flatten, Dense, GlobalAveragePooling2D
from keras.preprocessing.image import ImageDataGenerator
# Loading the DEC module cloned from github
from DEC.model import *
from DEC.metrics import *
from xception_dec_datagenerator import XceptionDataGenerator
# Importing the utilities
from utils.file_utils import *
from PIL import Image
# Using scikit-image  resize function for resizing the image from original size to 224 X 224
# from skimage.transform import resize
# Train Test split from sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from shutil import copy2
# For visualization of images and for plotting
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
sns.set()


In [2]:
import warnings
warnings.filterwarnings('ignore')

### Xception Model

We first load the Xception model with Keras. Because we only use it to extract features, we drop the classification head (include_top = False) but keep the ImageNet weights. We also enable global average pooling (pooling='avg') so that the model returns a 1-D feature vector for each image.

In [3]:
input_tensor_shape = (150, 150, 3)
base_xception_model = Xception(weights = 'imagenet', input_shape = input_tensor_shape, include_top = False, pooling='avg')
base_xception_model.summary()


The input to the base Xception model has shape 150 X 150 X 3, that is, a 3-channel square image with a side of 150 pixels.
The output of the base Xception model is a 1-D vector of length 2048 holding the features the model extracts for each image.
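To make the role of the pooling layer concrete, here is a minimal NumPy sketch of global average pooling. The random array is a hypothetical stand-in for the last convolutional feature map, and its 5 X 5 spatial size is only illustrative; the point is that averaging over the spatial axes leaves one value per channel.

```python
import numpy as np

# Hypothetical stand-in for the final convolutional feature map of a batch:
# (batch, height, width, channels). The 5 x 5 spatial size is illustrative.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 5, 5, 2048))

# Global average pooling: average over the two spatial axes, keep the channels.
pooled = feature_map.mean(axis=(1, 2))

print(pooled.shape)  # (4, 2048)
```

Each row of the pooled array plays the role of one 2048-dimensional feature vector.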

### Loading the Galaxy Zoo data

We now load the Galaxy Zoo data into memory. We first load the label file and then the corresponding images, so that each image can be assigned its label.
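The label handling can be illustrated on a toy table. The sketch below assumes that the two class columns hold vote fractions between 0 and 1, as the thresholds in this notebook suggest; the IDs and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the label file; IDs and vote fractions are made up.
labels = pd.DataFrame({
    'GalaxyID': [100008, 100023, 100053, 100078],
    'Class1.1': [0.38, 0.95, 0.10, 1.00],
    'Class1.2': [0.62, 0.05, 0.88, 0.00],
})

# IDs whose vote fraction for a class reaches the threshold.
threshold = 0.5
elliptical_ids = labels.loc[labels['Class1.1'] >= threshold, 'GalaxyID'].astype(str)
spiral_ids = labels.loc[labels['Class1.2'] >= threshold, 'GalaxyID'].astype(str)

# The n rows with the highest vote fraction for a class.
top_elliptical = labels.nlargest(2, 'Class1.1')['GalaxyID']

print(list(elliptical_ids))
print(list(spiral_ids))
print(list(top_elliptical))
```

Thresholding keeps every galaxy above a fixed confidence, while taking the n largest values keeps a fixed number of the most confident galaxies.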

In [4]:
all_labels = pd.read_csv(f'../data/galaxy_zoo/training_solutions_rev1.csv')
# all_labels = pd.read_csv(f'../data/galaxy_zoo/galaxy-zoo-the-galaxy-challenge/training_solutions_rev1/training_solutions_rev1.csv')
all_labels.head()


In [5]:
# First, drop the columns that we will not use

# Assuming the following column names:
elliptical_galaxy_col_name = 'Class1.1'
spiral_galaxy_col_name = 'Class1.2'

all_labels = all_labels[['GalaxyID', elliptical_galaxy_col_name, spiral_galaxy_col_name]]
all_labels.head()


In [6]:
# Getting the id's for elliptical and spiral galaxies
elliptical_galaxy_ids = pd.Series(all_labels[all_labels[elliptical_galaxy_col_name] >= 0.5]['GalaxyID'], dtype=str)
spiral_galaxy_ids = pd.Series(all_labels[all_labels[spiral_galaxy_col_name] > 0.5]['GalaxyID'], dtype=str)


In [7]:
# Finding the number of images for each type of galaxy
print(f'Number Elliptical Galaxies: {elliptical_galaxy_ids.shape[0]}')
print(f'Number Spiral Galaxies: {spiral_galaxy_ids.shape[0]}')


__Finding High Confidence Galaxies__

In [8]:
# First find the number of high confidence galaxies for both classes
conf_threshold = 1.0
print(f'Confidence Threshold: {conf_threshold}')
num_high_conf_elliptical_galaxies = all_labels[all_labels[elliptical_galaxy_col_name] >= conf_threshold].shape[0]
num_high_conf_spiral_galaxies = all_labels[all_labels[spiral_galaxy_col_name] >= conf_threshold].shape[0]
print('Number of High Confidence Elliptical Galaxies:', num_high_conf_elliptical_galaxies)
print('Number of High Confidence Spiral Galaxies:', num_high_conf_spiral_galaxies)

Confidence Threshold: 1.0



In [9]:
# Selecting the num_samples galaxies with the highest confidence for each class
num_samples = 500
high_conf_ellip_galaxies = all_labels.sort_values(by=elliptical_galaxy_col_name, ascending=False).iloc[:num_samples]['GalaxyID']
high_conf_spiral_galaxies = all_labels.sort_values(by=spiral_galaxy_col_name, ascending=False).iloc[:num_samples]['GalaxyID']


In [10]:
high_conf_ellip_galaxies.head()


In [11]:
high_conf_spiral_galaxies.head()


### Data Generators

We define Keras data generators that iterate over all the images, so that features can be extracted from them batch by batch.
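A directory-based generator expects one subdirectory per class and assigns label indices to the subdirectory names in alphanumeric order. The sketch below imitates that rule with the standard library only; it does not call Keras, and the file names are hypothetical.

```python
import os
import tempfile

# Build a throwaway tree with one subdirectory per class.
# The file names are hypothetical placeholders, not real images.
root = tempfile.mkdtemp()
layout = {'spiral': ['1.jpg', '2.jpg'], 'elliptical': ['3.jpg']}
for class_name, file_names in layout.items():
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir)
    for name in file_names:
        open(os.path.join(class_dir, name), 'wb').close()

# Imitate the alphanumeric ordering of class subdirectories.
class_names = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
class_indices = {name: index for index, name in enumerate(class_names)}

print(class_indices)  # {'elliptical': 0, 'spiral': 1}
```

With the elliptical and spiral folders used in this notebook, label 0 therefore corresponds to elliptical galaxies and label 1 to spiral galaxies.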

In [12]:
image_extension = '.jpg'
training_directory_path = f'../data/xception_clustering/training/'
testing_directory_path = f'../data/xception_clustering/testing/'

In [13]:
# Getting the files already in the training and testing folders respectively
spiral_training_directory_path = construct_path(training_directory_path, 'spiral')
elliptical_training_directory_path = construct_path(training_directory_path, 'elliptical')
spiral_testing_directory_path = construct_path(testing_directory_path, 'spiral')
elliptical_testing_directory_path = construct_path(testing_directory_path, 'elliptical')
elliptical_training_files = get_file_nms(elliptical_training_directory_path, image_extension)
spiral_training_files = get_file_nms(spiral_training_directory_path, image_extension)
elliptical_testing_files = get_file_nms(elliptical_testing_directory_path, image_extension)
spiral_testing_files = get_file_nms(spiral_testing_directory_path, image_extension)
# Counting the images already present for each galaxy type in the training and testing folders
print(f'Number of already present Training Elliptical Galaxies: {len(elliptical_training_files)}')
print(f'Number of already present Training Spiral Galaxies: {len(spiral_training_files)}')
print(f'Number of already present Testing Elliptical Galaxies: {len(elliptical_testing_files)}')
print(f'Number of already present Testing Spiral Galaxies: {len(spiral_testing_files)}')


#### Normalization and Cropping functions

In [14]:
def get_difference(orig_size, target_size):
    # Per-dimension difference between the original and target sizes
    return [o - t for o, t in zip(orig_size, target_size)]

def crop_image(image, orig_size, target_size):
    # Center crop: trim half of the excess height and width from each side
    crop_sizes = get_difference(orig_size, target_size)
    height_dif, width_dif = crop_sizes[0] // 2, crop_sizes[1] // 2
    return image[height_dif:(height_dif + target_size[0]), width_dif:(width_dif + target_size[1]), :]

def range_scaling(image, out_feature_range=(-1, 1)):
    # Linearly map 8-bit pixel values in [0, 255] to out_feature_range
    old_min, old_max = 0., 255.
    new_min, new_max = out_feature_range
    return ((image - old_min)/(old_max - old_min))*(new_max - new_min) + new_min

def image_preprocessing_function(image, crop=True, range_scale=True):
    """
    image is a 3-D image tensor (numpy array).
    """
    target_image_size = input_tensor_shape
    if crop:
        cropped_image = crop_image(image, image.shape, target_image_size)
    else:
        cropped_image = image
        
    if range_scale:
        final_image = range_scaling(cropped_image)
    else:
        final_image = cropped_image
    return final_image
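As a worked example of the two preprocessing steps, the following self-contained sketch center-crops a random array and rescales its pixel values. The 424 X 424 input size and the helper names are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np

def center_crop(image, target_hw):
    """Cut a target_hw window from the middle of an (H, W, C) array."""
    top = (image.shape[0] - target_hw[0]) // 2
    left = (image.shape[1] - target_hw[1]) // 2
    return image[top:top + target_hw[0], left:left + target_hw[1], :]

def scale_to_range(image, new_min=-1.0, new_max=1.0):
    """Linearly map pixel values in [0, 255] to [new_min, new_max]."""
    return (image / 255.0) * (new_max - new_min) + new_min

# Hypothetical 424 x 424 RGB image filled with random 8-bit values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(424, 424, 3)).astype(np.float32)

cropped = center_crop(image, (150, 150))
scaled = scale_to_range(cropped)

print(cropped.shape)  # (150, 150, 3)
print(scale_to_range(np.array([0.0, 127.5, 255.0])))  # [-1.  0.  1.]
```

Note that a Keras image generator applies its preprocessing function after the image has been resized to the target size, so when the target size already equals the crop size the cropping step leaves the image unchanged.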

#### Generator Definitions

In [15]:
%%time
generator_batch_size = 64
# The preprocessing function scales pixel values to the range -1 to 1
image_generator = ImageDataGenerator(preprocessing_function=image_preprocessing_function)
training_generator = image_generator.flow_from_directory(training_directory_path, target_size = input_tensor_shape[:2], 
                                                         class_mode='binary', batch_size=generator_batch_size)
testing_generator = image_generator.flow_from_directory(testing_directory_path, target_size = input_tensor_shape[:2], 
                                                         class_mode='binary', batch_size=generator_batch_size)


In [16]:
%%time
n_train_examples = (len(training_generator.filenames)//generator_batch_size) * generator_batch_size
# n_train_examples = 128
train_features = np.zeros((n_train_examples, 2048))
train_labels = np.zeros(n_train_examples, dtype=int)
i = 0
for inputs_batch, labels_batch in training_generator:
    features_batch = base_xception_model.predict(inputs_batch)
    train_features[i * generator_batch_size : (i + 1) * generator_batch_size] = features_batch
    train_labels[i * generator_batch_size : (i + 1) * generator_batch_size] = labels_batch
    i += 1
    if i % 100 == 0 and i:
        print('Number of Images processed:', i * generator_batch_size)
    if i * generator_batch_size >= n_train_examples:
        break

print('Shape of the training features', train_features.shape)

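The extraction loop above keeps only full batches, so the feature array is slightly shorter than the dataset whenever the number of images is not a multiple of the batch size. The sketch below reproduces that bookkeeping with stand-ins: `fake_batches` and `fake_extractor` are hypothetical replacements for the Keras generator and for the Xception model.

```python
import numpy as np

def fake_batches(n_items, batch_size, n_dims, seed=0):
    """Endlessly yield (inputs, labels) batches, like a directory iterator.
    The last batch of each pass may be smaller than batch_size."""
    rng = np.random.default_rng(seed)
    while True:
        for start in range(0, n_items, batch_size):
            size = min(batch_size, n_items - start)
            yield rng.normal(size=(size, n_dims)), rng.integers(0, 2, size=size)

def fake_extractor(inputs):
    """Hypothetical stand-in for the feature extractor's predict call."""
    return inputs * 2.0

n_items, batch_size, n_dims = 150, 64, 8
n_full_batches = n_items // batch_size
n_examples = n_full_batches * batch_size

# Preallocate the outputs and fill them one full batch at a time.
features = np.zeros((n_examples, n_dims))
labels = np.zeros(n_examples, dtype=int)
for i, (x, y) in enumerate(fake_batches(n_items, batch_size, n_dims)):
    if i >= n_full_batches:
        break  # stop before the partial batch
    rows = slice(i * batch_size, (i + 1) * batch_size)
    features[rows] = fake_extractor(x)
    labels[rows] = y

print(features.shape)  # (128, 8)
```

With 150 items and a batch size of 64, two full batches are stored and the remaining 22 items are skipped.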

In [17]:
training_generator.class_indices


#### DEC Xception Training Regime

This part of the notebook defines the DEC model and its training regime over the features extracted with the Xception architecture.
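To get a feel for the size of the network, the sketch below walks through the layer widths [2048, 500, 500, 2000, 10] used in this notebook and counts the parameters of a stack of fully connected layers. It assumes a decoder that mirrors the encoder, which is common for DEC autoencoders but is not confirmed here for DEC_Supervised, and it ignores any extra clustering or classification layer.

```python
# Layer widths passed to the DEC model in this notebook.
dims = [2048, 500, 500, 2000, 10]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Encoder: consecutive (input width, output width) pairs.
encoder_layers = list(zip(dims[:-1], dims[1:]))
# Decoder: assumed to mirror the encoder in reverse order.
decoder_layers = [(n_out, n_in) for n_in, n_out in reversed(encoder_layers)]

encoder_params = sum(dense_params(a, b) for a, b in encoder_layers)
decoder_params = sum(dense_params(a, b) for a, b in decoder_layers)

print(encoder_layers)
print(encoder_params, decoder_params)
```

Under these assumptions the encoder maps each 2048-dimensional Xception feature vector to a 10-dimensional embedding, and that embedding is what the later plots visualize.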

In [18]:
# Defining our DEC model
dec_model = DEC_Supervised([2048, 500, 500, 2000, 10], n_clusters=2)
dec_model.model.summary()


In [19]:
import os

results_save_dir = 'results/supervised_learning'
if not exist_directory(results_save_dir):
    os.makedirs(results_save_dir)


#### Pretraining

In [20]:
%%time
dec_model.pretrain(train_features, None, epochs=100, save_dir=results_save_dir)


### Supervised Learning

We run supervised learning on the features extracted from the whole training set of galaxy images.

In [21]:
dec_model.supervised_learning(train_features, train_labels, batch_size=64, epochs=200, save_dir=results_save_dir)


### Visualization of the features extracted by the DEC model

#### Encoder Output Features
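The PCA plot in this section reports the explained variance of its two components. As a rough illustration of what that number measures, the NumPy sketch below projects hypothetical 10-dimensional embeddings onto their first two principal directions through an SVD; the random data is an assumption, not the encoder output.

```python
import numpy as np

# Hypothetical 10-dimensional embeddings with two dominant directions.
rng = np.random.default_rng(0)
scales = np.array([5, 3, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)
embeddings = rng.normal(size=(200, 10)) * scales

# PCA through the SVD of the mean-centered data.
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance carried by each principal component.
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Coordinates of every point along the first two components.
projected = centered @ components[:2].T

print(projected.shape)  # (200, 2)
print(explained_variance_ratio[:2].sum())
```

A ratio close to 1 means the 2-D scatterplot preserves most of the spread of the embeddings, while a low ratio means much of the structure is hidden in the discarded dimensions.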

In [22]:
%%time
dec_encoder_model_pred = dec_model.encoder.predict(train_features)
pca_mod = PCA(2)
pca_mod.fit(dec_encoder_model_pred)
reduced_features = pca_mod.transform(dec_encoder_model_pred)

fig, ax = plt.subplots(1, 1)
# Pass x and y by keyword; recent seaborn versions do not accept them positionally
sns.scatterplot(x=reduced_features[:, 0], y=reduced_features[:, 1], 
                hue=np.where(train_labels==0, 'elliptical', 'spiral'), ax=ax)
ax.text(0.02, 0.92, f'Explained Var: {np.round(np.sum(pca_mod.explained_variance_ratio_), decimals=4)}', 
        transform=ax.transAxes)
ax.set_xlabel('PCA Dim 1')
ax.set_ylabel('PCA Dim 2')
ax.set_title('PCA Scatterplot for Encoder Features')
ax.legend(loc=4)
plt.savefig(f'{results_save_dir}/PCAencoding_pca_features.png')


In [23]:
%%time
tsne_mod = TSNE(2)
reduced_features = tsne_mod.fit_transform(dec_encoder_model_pred)

fig, ax = plt.subplots(1, 1)
# Pass x and y by keyword; recent seaborn versions do not accept them positionally
sns.scatterplot(x=reduced_features[:, 0], y=reduced_features[:, 1], 
                hue=np.where(train_labels==0, 'elliptical', 'spiral'), ax=ax)
ax.text(0.02, 0.92, f'K-L Divergence: {np.round(tsne_mod.kl_divergence_, decimals=4)}', 
        transform=ax.transAxes)
ax.set_xlabel('TSNE Dim 1')
ax.set_ylabel('TSNE Dim 2')
ax.set_title('TSNE Scatterplot for Encoder Features')
ax.legend(loc=4)
plt.savefig(f'{results_save_dir}/TSNE_encoding_features.png')


In [24]:
%%time
tsne_mod = TSNE(3)
reduced_features = tsne_mod.fit_transform(dec_encoder_model_pred)

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(reduced_features[train_labels==0, 0], reduced_features[train_labels==0, 1], 
           reduced_features[train_labels==0, 2], 
           c='red', marker='x', label='Elliptical')

ax.scatter(reduced_features[train_labels==1, 0], reduced_features[train_labels==1, 1], 
           reduced_features[train_labels==1, 2], 
           c='blue', marker='o', label='Spiral')

# text2D places the annotation in axes coordinates on a 3-D plot
ax.text2D(0.02, 0.92, f'K-L Divergence: {np.round(tsne_mod.kl_divergence_, decimals=4)}', 
          transform=ax.transAxes)
ax.set_xlabel('TSNE Dim 1')
ax.set_ylabel('TSNE Dim 2')
ax.set_zlabel('TSNE Dim 3')
ax.set_title('TSNE Scatterplot for Encoder Features')
ax.legend(loc=4)
plt.savefig(f'{results_save_dir}/TSNE_encoding_features_3Dim.png')


#### Training Curves

In [25]:
supervised_log = pd.read_csv(f'{results_save_dir}/supervised_learning_log.csv')
ptrain_log = pd.read_csv(f'{results_save_dir}/pretrain_log.csv')


In [26]:
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
ptrain_log.plot(x='epoch', y='loss', ax=ax[0], title='Reconstruction Loss')
supervised_log.plot(x='epoch', y='loss', ax=ax[1], title='Classification Loss')
plt.savefig(f'{results_save_dir}/training_curves.png')
