## Introduction

### UMAP

* `n_neighbors`: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the cost of detailed local structure. In general this parameter should be in the **range 5 to 50**, with a choice of 10 to 15 being a sensible default.

* `min_dist`: This controls how tightly the embedding is allowed to compress points together. **Larger values** ensure embedded points are **more evenly distributed**, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.

* `metric`: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already coded, and a user-defined function can be passed as long as it has been JIT-compiled with numba.

`['euclidean', 'manhattan', 'chebyshev', 'minkowski', 'canberra', 'braycurtis', 'mahalanobis', 'wminkowski', 'cosine', 'correlation', 'haversine', 'hamming', 'jaccard', 'dice', 'russelrao', 'kulsinski', 'll_dirichlet', 'hellinger']`

euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
~~mahalanobis~~
wminkowski
~~seuclidean~~
cosine
correlation
haversine
hamming
jaccard
dice
~~russelrao~~
kulsinski
ll_dirichlet
hellinger
rogerstanimoto
sokalmichener
sokalsneath
yule

[UMAP API Guide](https://umap.scikit-tda.org/api.html)

* `n_components`: 2-3
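
To make the `metric` option above more concrete, here is a minimal sketch that computes a few of the listed distances by hand for two small vectors. The function names and the vectors `u` and `v` are illustrative assumptions; this is plain Python, not UMAP's own numba-compiled implementation.

```python
import math

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # sum of |a - b| / (|a| + |b|), skipping coordinates where both are zero
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def cosine_distance(x, y):
    # one minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# hypothetical example vectors
u = [1.0, 0.0, 2.0]
v = [0.0, 0.0, 4.0]

for name, fn in [('manhattan', manhattan), ('chebyshev', chebyshev),
                 ('canberra', canberra), ('cosine', cosine_distance)]:
    print(name, round(fn(u, v), 4))
```

The four metrics rank the same pair of points quite differently, which is one reason the metric is treated as a search dimension later in this notebook.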



### AgglomerativeClustering


`class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)`

`n_clusters = 4`

`affinity` str, one of `["euclidean", "l1", "l2", "manhattan", "cosine", "precomputed"]`

`memory` str or object with the `joblib.Memory` interface, default=None

`connectivity` array-like or callable, default=None

`compute_full_tree`: 'auto' or bool, default='auto'. **It must be True if distance_threshold is not None.**

`linkage`: `{'ward', 'complete', 'average', 'single'}`, default='ward'

> Which linkage **criterion** to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

* ‘ward’ minimizes the variance of the clusters being merged.

* ‘average’ uses the average of the distances of each observation of the two sets.

* ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.

* ‘single’ uses the minimum of the distances between all observations of the two sets.
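
The difference between the distance-based criteria above can be sketched on two tiny clusters. This is an illustrative pure-Python calculation with made-up points, not scikit-learn's implementation, and it leaves out `ward`, which is defined through variance rather than pairwise distances.

```python
import math
from itertools import product

def pairwise_distances(a, b):
    # Euclidean distance between every point of a and every point of b
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    # minimum of the pairwise distances
    return min(pairwise_distances(a, b))

def complete_linkage(a, b):
    # maximum of the pairwise distances
    return max(pairwise_distances(a, b))

def average_linkage(a, b):
    # mean of the pairwise distances
    d = pairwise_distances(a, b)
    return sum(d) / len(d)

# two hypothetical clusters in 2-D
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print('single  :', single_linkage(cluster_a, cluster_b))
print('complete:', complete_linkage(cluster_a, cluster_b))
print('average :', average_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single-linkage distance is at most the average, which is at most the complete-linkage distance, so the three criteria can merge clusters in a different order.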


⚠️ l2 was provided as affinity. Ward can only work with euclidean distances.

`distance_threshold` float, default=None
> The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

`compute_distances` bool, default=False
> Computes distances between clusters even if distance_threshold is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead.
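
The parameter constraints quoted above can be checked before a search spends time on an invalid combination. The helper below is a hypothetical sketch (`check_agglomerative_args` is not part of scikit-learn). It encodes only the rules stated in this section and assumes that an explicit `compute_full_tree=False` is the case to reject, leaving the handling of `'auto'` to scikit-learn.

```python
def check_agglomerative_args(n_clusters, affinity, linkage,
                             distance_threshold=None, compute_full_tree='auto'):
    """Return a list of problems; an empty list means no stated rule is violated."""
    problems = []
    # rule from the warning above: ward only works with euclidean distances
    if linkage == 'ward' and affinity != 'euclidean':
        problems.append('ward linkage only works with euclidean affinity')
    # rules from the distance_threshold description
    if distance_threshold is not None:
        if n_clusters is not None:
            problems.append('n_clusters must be None when distance_threshold is set')
        if compute_full_tree is False:
            problems.append('compute_full_tree must be True when distance_threshold is set')
    return problems

ok = check_agglomerative_args(4, 'euclidean', 'ward')
bad_ward = check_agglomerative_args(4, 'l2', 'ward')
bad_threshold = check_agglomerative_args(4, 'euclidean', 'average',
                                         distance_threshold=1.5,
                                         compute_full_tree=False)
print(ok, bad_ward, bad_threshold, sep='\n')
```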


In [1]:
# for everything else
import os
import random
# from random import randint
from functools import partial
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# for loading/processing the images  
from tensorflow import keras
from keras.models import load_model
from keras.preprocessing.image import load_img 
from keras.preprocessing.image import img_to_array 
from keras.applications.vgg16 import preprocess_input 

# models 
from keras.applications.vgg16 import VGG16 
from keras.models import Model
import pickle

import umap

# clustering and dimension reduction
from sklearn.cluster import AgglomerativeClustering


from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
# from sklearn.preprocessing import LabelEncoder
# from keras.utils.np_utils import to_categorical
from tqdm import trange
from hyperopt import fmin, tpe, hp, STATUS_OK, space_eval, Trials

import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

from warnings import filterwarnings
filterwarnings('ignore')

#### Fixing random seeds for reproducibility


In [2]:
random.seed(10)
np.random.seed(10)
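
As a small illustration of what fixing a seed buys, the sketch below draws numbers from two generators created with the same seed and gets identical sequences. It uses only the standard library, and the `draw` helper is a made-up name. Libraries that keep their own generator need their own seed, which is why UMAP receives `random_state` from `config` later in this notebook.

```python
import random

def draw(seed, n=3):
    # a private generator, so the global random state is left untouched
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(10)
second = draw(10)
print(first == second)  # same seed, same sequence
```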

## Hyperparameters
* agglomerative: `n_clusters`, `affinity`
* k-means: `n_init`, `max_iter`

In [3]:
affinity = ['euclidean', 'l1', 'l2', 'manhattan', 'cosine', 'precomputed']
linkage = ['complete', 'average', 'single']
aggparams = [[a,l] for a in affinity for l in linkage]
aggparams.append(['euclidean', 'ward'])
aggparams.remove(['precomputed', 'single']) #Distance matrix should be square, Got matrix of shape {X.shape}
aggparams.remove(['precomputed', 'complete'])
aggparams.remove(['precomputed', 'average'])
aggparams, len(aggparams)

([['euclidean', 'complete'],
  ['euclidean', 'average'],
  ['euclidean', 'single'],
  ['l1', 'complete'],
  ['l1', 'average'],
  ['l1', 'single'],
  ['l2', 'complete'],
  ['l2', 'average'],
  ['l2', 'single'],
  ['manhattan', 'complete'],
  ['manhattan', 'average'],
  ['manhattan', 'single'],
  ['cosine', 'complete'],
  ['cosine', 'average'],
  ['cosine', 'single'],
  ['euclidean', 'ward']],
 16)
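
An equivalent way to build the same 16 combinations is to leave `'precomputed'` out from the start, since it expects a square distance matrix rather than raw features, and let `itertools.product` generate the pairs. This is an alternative sketch, not the cell used for the results below. The names `affinities`, `linkages` and `agg_pairs` are illustrative.

```python
from itertools import product

# 'precomputed' is excluded up front instead of being removed afterwards
affinities = ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
linkages = ['complete', 'average', 'single']

agg_pairs = [[a, l] for a, l in product(affinities, linkages)]
# ward is only valid with euclidean distances, so it is added as a single pair
agg_pairs.append(['euclidean', 'ward'])

print(len(agg_pairs))
```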

In [4]:
config = {
    'method': 'AgglomerativeClustering',
    'n_neighbors': range(20,300,20),
    'min_dist': np.arange(0.0, 1.0, 0.05),
    'n_components': 2,
    "umap_metric": ['manhattan','canberra','hellinger','correlation','jaccard','cosine','dice','ll_dirichlet'],
    'n_clusters': 4,
    'n_init': 10,
    'aggparams': aggparams,
    'max_evals': 5000,
    'random_state': 42
}

In [5]:
'space:', len(config['n_neighbors']) * len(config['min_dist']) * len(config['umap_metric']) * len(config['aggparams'])

('space:', 35840)
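
The size of the grid can be broken down per dimension, which also shows what share of it `max_evals` can visit at most. The sketch below rebuilds the option counts from `config` without NumPy (the list comprehension stands in for `np.arange(0.0, 1.0, 0.05)`). The variable names are illustrative.

```python
import math

sizes = {
    'n_neighbors': len(range(20, 300, 20)),            # 14 values
    'min_dist': len([i * 0.05 for i in range(20)]),    # 20 values
    'umap_metric': 8,                                  # metrics listed in config
    'aggparams': 16,                                   # affinity/linkage pairs
}

space_size = math.prod(sizes.values())
max_evals = 5000
coverage = max_evals / space_size  # upper bound, since trials may repeat a point

print(sizes)
print(space_size, round(coverage, 3))
```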

## Load Data

In [6]:
base_dir = '../input/nbiinfframes/FRAMES'
Fold1 = '../input/nbiinfframes/FRAMES/Fold1'
Fold2 = '../input/nbiinfframes/FRAMES/Fold2'
Fold3 = '../input/nbiinfframes/FRAMES/Fold3'

model_path = r'vgg16_weights_tf_dim_ordering_tf_kernels.h5'
pkl_file = r"../input/feature-embedding/files_feature_embeddings_vgg16.pkl"

class_ = ['B', 'I', 'S', 'U']
Folds = [Fold1, Fold2, Fold3]
all_data = []

# Define path to the data directory
for fold in Folds:
    for case in class_:
        path = os.path.join(fold, case)
        if os.path.isdir(path):
            for img in os.listdir(path):
                all_data.append([os.path.join(path, img), case, os.path.basename(fold)])
        
all_df = pd.DataFrame(all_data, columns=['image_path', 'class', 'folder'],index=None)
all_df.dropna(inplace = True)
# all_df.sample(10)

#mapping the image name with the image path
all_df['name'] = all_df['image_path'].map(lambda x: os.path.basename(x))
img_name = all_df['name'].to_list()
files = all_df['image_path'].to_list()
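
The directory walk above assumes a `Fold/class/image` layout. A minimal sketch of the same traversal with `pathlib`, run against a temporary directory so that it does not depend on the dataset, could look like this. `collect_image_rows` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def collect_image_rows(fold_dirs, classes):
    """Return one dict per image found under <fold>/<class>/."""
    rows = []
    for fold in map(Path, fold_dirs):
        for label in classes:
            class_dir = fold / label
            if not class_dir.is_dir():
                continue  # a fold may lack some classes
            for img in sorted(class_dir.iterdir()):
                rows.append({'image_path': str(img), 'class': label,
                             'folder': fold.name, 'name': img.name})
    return rows

with tempfile.TemporaryDirectory() as root:
    # build a tiny fake dataset: two folds, two classes, three empty files
    for fold, label, fname in [('Fold1', 'B', 'a.png'),
                               ('Fold1', 'I', 'b.png'),
                               ('Fold2', 'B', 'c.png')]:
        class_dir = Path(root) / fold / label
        class_dir.mkdir(parents=True, exist_ok=True)
        (class_dir / fname).touch()

    rows = collect_image_rows([Path(root) / 'Fold1', Path(root) / 'Fold2'],
                              ['B', 'I', 'S', 'U'])

print([(r['folder'], r['class'], r['name']) for r in rows])
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`.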

## Methods: Building Model

In [7]:
%%time 

model = VGG16()
model = Model(inputs = model.inputs, outputs = model.layers[-2].output)
    
out_features = model.output.shape[1] # to obtain the number of the output feature
# model.summary()
print(out_features)

2022-04-23 15:14:17.640768: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
4096
CPU times: user 3.85 s, sys: 3.56 s, total: 7.41 s
Wall time: 4.98 s


## Feature Extraction by CNN

In [8]:
def extract_features(files, model, pkl="", out_feature=out_features):
    
    data = {}
    if os.path.exists(pkl):
        with open(pkl, 'rb') as f:
            data = pickle.load(f)
    else:
        print("Feature pkl file doesn't exist, extracting features with the model")
        for file in files:
            # load the image as a 224x224 array
            resized_img = load_img(file, target_size=(224,224)) ##reshaped
            # convert from 'PIL.Image.Image' to numpy array
            img_arr = np.array(resized_img) 
            # reshape the data for the model reshape(num_of_samples, dim 1, dim 2, channels)
            reshaped_img = img_arr.reshape(1,224,224,3)  ##reshape
            # prepare image for model
            imgx = preprocess_input(reshaped_img)
            ## define the method to extract the feature vector
            embeddings = model.predict(imgx, use_multiprocessing=True)
            data[os.path.basename(file)] = embeddings
            
        # write the cache with a context manager so the file handle is closed
        with open(pkl, "wb") as f:
            pickle.dump(data, f)
 
    # get a list of the filenames
    filenames = np.array(list(data.keys()))
    # get a list of just the features
    feat = np.array(list(data.values()))
    # reshape to (n_samples, out_feature); here 720 samples of 4096-dimensional vectors
    feat = feat.reshape(-1, out_feature)
        
    return filenames, feat
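
The caching idea in `extract_features` (load the pickle if it exists, otherwise compute and save) can be isolated into a small helper. The sketch below uses a stand-in `fake_extract` function instead of the VGG16 model. Both names and the sample values are made up for illustration.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached object if present, else compute it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_extract():
    # stand-in for the expensive model.predict loop
    calls.append(1)
    return {'img_001.png': [0.1, 0.2], 'img_002.png': [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'features.pkl')
    first = load_or_compute(cache, fake_extract)   # computes and writes the cache
    second = load_or_compute(cache, fake_extract)  # reads the cache

print(first == second, len(calls))
```

The second call never touches the extractor, which matches the very short wall time reported in the next cell when the pickle already exists.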

In [9]:
%%time

filenames, feat = extract_features(files, model, pkl_file, out_features)    
print('extracted feature shape: {}'.format(feat.shape))

extracted feature shape: (720, 4096)
CPU times: user 10.1 ms, sys: 12 ms, total: 22.1 ms
Wall time: 135 ms


## Different Clusters

In [10]:
def generate_clusters(feature_embeddings,
                       n_neighbors,
                       min_dist,
                       metric,
                       affinity,
                       linkage,
                       random_state = config['random_state'],
                       n_components = config['n_components'],
                       n_clusters = config['n_clusters'],
                       n_init = config['n_init']):
    
    print('the setting affinity: {}, linkage: {}'.format(affinity, linkage))
    
    umap_embeddings = umap.UMAP(n_neighbors = n_neighbors,
                                min_dist = min_dist,
                                n_components = n_components,
                                metric = metric,
                                random_state=random_state).fit_transform(feature_embeddings)
    
    clusters = AgglomerativeClustering(n_clusters=n_clusters,
                                       affinity=affinity,
                                       linkage=linkage,
                                       memory=None,
                                       connectivity=None,
                                       compute_full_tree='auto',
                                       distance_threshold=None, 
                                       compute_distances=False).fit(umap_embeddings) 
    
    return clusters

### Score

In [11]:
def metrics_score_(clusters, filenames, image_label_df):
    
    """
    Maps each ground-truth class to the cluster it overlaps most, then scores the clustering
    
    Arguments:
        clusters: fitted clustering object exposing `labels_`
        filenames: image name list, in the same order as the clustered features
        image_label_df: dataframe object, the image name and the groundtruth label
        
    Returns:
        avg_precision, avg_recall, avg_f1: unweighted averages of the per-class scores
    """
    mapped_label = {}
    ### prepare the labels from clusters
    clustered_labels = clusters.labels_ #clustered labels
    clustered_groups = {}
    for file, cluster in zip(filenames, clustered_labels):
        # group file names by the cluster they were assigned to
        clustered_groups.setdefault(cluster, []).append(file)
    
    
    ### prepare the ground truth labels, using group by 
    image_groups = image_label_df.groupby('class')['name'] ## only get the image name for matching the predicted one
    ground_labels = [*image_groups.groups.keys()] 
    
    ## mapping each group in the dataframe to the clustered group with the max intersection
    for label in ground_labels:
        group_label = {} # allocate to the max intersection number group
        lst = image_groups.get_group(label).tolist()
        
        for x in range(len(clustered_groups)):
            num = len(set(clustered_groups[x]) & set(lst))
#             print('clustered group: {} intersection with label {}, instersection number {}'.format(x, label, num))
            group_label[x]=num
            
        max_key = max(group_label, key=group_label.get)
#         print('the {} label is the group {}'.format(label, max_key))
        mapped_label[max_key] = label #reverse the label
        
    ### connect the mapped labels to the image name (done once, after every label is mapped)
    imgs = []
    clustered_img_labels = []
    for cluster_id, members in clustered_groups.items():
        imgs.extend(members)
        clustered_img_labels.extend([cluster_id] * len(members))

    ## merge the mapped df with the dataframe
    clustered_df = pd.DataFrame({'name': imgs, 'predict_target': clustered_img_labels})
    clustered_df['predict_class'] = clustered_df['predict_target'].map(mapped_label)
    eval_df = pd.merge(image_label_df, clustered_df, on='name', how='left')

    y_test = eval_df['class'].tolist()
    y_pred = eval_df['predict_class'].tolist()

    # per-class precision, recall and F1 from a single call
    precision_score_, recall_score_, f1_score_, _ = precision_recall_fscore_support(y_test, y_pred)

    avg_precision = np.average(precision_score_)
    avg_recall = np.average(recall_score_)
    avg_f1 = np.average(f1_score_)
        
    return avg_precision, avg_recall, avg_f1
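
The core of `metrics_score_` is the mapping from ground-truth classes to clusters by largest overlap. A stripped-down sketch of that idea on toy data, without pandas or scikit-learn, is shown below. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, true_labels):
    """For each true label, pick the cluster that holds most of its members."""
    mapping = {}
    for label in sorted(set(true_labels)):
        counts = Counter(c for c, t in zip(cluster_ids, true_labels) if t == label)
        best_cluster = counts.most_common(1)[0][0]
        mapping[best_cluster] = label
    return mapping

def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# toy data: 8 images, 3 clusters, 3 classes
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 0]
true_labels = ['B', 'B', 'I', 'I', 'S', 'S', 'S', 'I']

mapping = map_clusters_to_labels(cluster_ids, true_labels)
y_pred = [mapping.get(c) for c in cluster_ids]

print(mapping)
print(round(macro_recall(true_labels, y_pred), 4))
```

One caveat of this scheme is that two classes can pick the same cluster, in which case the later class overwrites the earlier one and some clusters stay unmapped (`mapping.get` then returns `None`).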

In [12]:
def objective(params, embeddings, filenames, df):
    """
    Objective function for hyperopt to minimize: 1 - average recall, plus a small
    penalty when average precision and average recall differ by more than 0.025
    """
    clusters = generate_clusters(embeddings,
                                 n_neighbors = params['n_neighbors'], 
                                 min_dist = params['min_dist'], 
                                 metric = params['umap_metric'],
                                 affinity = params['aggparams'][0],
                                 linkage = params['aggparams'][1])
    
    avg_precision, avg_recall, avg_f1 = metrics_score_(clusters, filenames, df)
    cost = 1 - avg_recall 

    if (np.abs(avg_precision - avg_recall) > 0.025):
        penalty = 0.1 * np.abs(avg_precision - avg_recall)
    else:
        penalty = 0
    loss = cost + penalty # penalty
    return {'loss': loss, 'avg_precision': avg_precision,'avg_recall': avg_recall, 'avg_f1':avg_f1, 'status': STATUS_OK}
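
The loss used by `objective` can be examined on its own. The sketch below restates it as a pure function (`penalised_loss` is an illustrative name) and evaluates it on the precision and recall reported for the best trial in this notebook, plus one made-up unbalanced case.

```python
def penalised_loss(avg_precision, avg_recall, tolerance=0.025, weight=0.1):
    # base cost: one minus the average recall
    cost = 1 - avg_recall
    # penalise only when precision and recall drift apart by more than the tolerance
    gap = abs(avg_precision - avg_recall)
    penalty = weight * gap if gap > tolerance else 0.0
    return cost + penalty

# values reported for the best trial: the gap is about 0.003, so no penalty applies
best_loss = penalised_loss(0.9503256317992905, 0.9472222222222222)

# hypothetical unbalanced result: the gap is 0.1, so the penalty is 0.1 * 0.1
skewed_loss = penalised_loss(0.80, 0.90)

print(best_loss, skewed_loss)
```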

In [13]:
def bayesian_search(features, filenames, df, space, max_evals=100):

    """
    Perform Bayesian search on hyperopt hyperparameter space to minimize objective function
    """
    
    trials = Trials()
    
    fmin_objective = partial(objective, embeddings=features, filenames=filenames, df=df)
    
    best = fmin(fmin_objective, 
                space = space, 
                algo=tpe.suggest,
                max_evals = max_evals, 
                trials = trials)

    best_params = space_eval(space, best)
    print('best:')
    print(best_params)
    print(f"the average recall: {trials.best_trial['result']['avg_recall']}")

    best_clusters = generate_clusters(features, 
                                      n_neighbors = best_params['n_neighbors'], 
                                      min_dist = best_params['min_dist'], 
                                      metric = best_params['umap_metric'],
                                      affinity = best_params['aggparams'][0],
                                      linkage = best_params['aggparams'][1])
    

    return best_params, best_clusters, trials

In [14]:
hspace = {
    "n_neighbors": hp.choice('n_neighbors', config['n_neighbors']),
    "min_dist": hp.choice('min_dist', config['min_dist']),
    "umap_metric": hp.choice('umap_metric', config['umap_metric']),
    "aggparams": hp.choice('aggparams',config['aggparams']),
}
max_evals = config['max_evals']

In [15]:
print('test the time-cost for the method {}'.format(config['method']))
%time best_params_use, best_clusters_use, trials_use = bayesian_search(feat, filenames, all_df, hspace, max_evals)

test the time-cost for the method AgglomerativeClustering
the setting affinity: l2, linkage: average
the setting affinity: l2, linkage: complete
the setting affinity: l1, linkage: average
the setting affinity: euclidean, linkage: single
the setting affinity: l1, linkage: single
the setting affinity: l2, linkage: single
the setting affinity: manhattan, linkage: complete
the setting affinity: l2, linkage: average
the setting affinity: manhattan, linkage: complete
the setting affinity: l1, linkage: average
the setting affinity: l1, linkage: complete
the setting affinity: l1, linkage: complete
the setting affinity: manhattan, linkage: complete
the setting affinity: euclidean, linkage: single
the setting affinity: cosine, linkage: single
the setting affinity: cosine, linkage: complete
the setting affinity: l1, linkage: single
the setting affinity: cosine, linkage: average
the setting affinity: l2, linkage: single
the setting affinity: cosine, linkage: single
the setting affinity: euclidean,

In [16]:
best_params_use['avg_recall'] = trials_use.best_trial['result']['avg_recall']
print(best_params_use)
with open('best_params_{}.txt'.format(config['method']), 'w') as sourceFile:
    print(best_params_use, file=sourceFile)

{'aggparams': ('euclidean', 'ward'), 'min_dist': 0.15000000000000002, 'n_neighbors': 60, 'umap_metric': 'hellinger', 'avg_recall': 0.9472222222222222}


In [17]:
trials_use.best_trial

{'state': 2,
 'tid': 516,
 'spec': None,
 'result': {'loss': 0.05277777777777781,
  'avg_precision': 0.9503256317992905,
  'avg_recall': 0.9472222222222222,
  'avg_f1': 0.9473242547393499,
  'status': 'ok'},
 'misc': {'tid': 516,
  'cmd': ('domain_attachment', 'FMinIter_Domain'),
  'workdir': None,
  'idxs': {'aggparams': [516],
   'min_dist': [516],
   'n_neighbors': [516],
   'umap_metric': [516]},
  'vals': {'aggparams': [15],
   'min_dist': [3],
   'n_neighbors': [2],
   'umap_metric': [2]}},
 'exp_key': None,
 'owner': None,
 'version': 0,
 'book_time': datetime.datetime(2022, 4, 23, 16, 22, 45, 475000),
 'refresh_time': datetime.datetime(2022, 4, 23, 16, 22, 51, 23000)}
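
The `vals` entries in the trial above are indices, because `hp.choice` records the position of the chosen option rather than the option itself. `space_eval` performs the translation in the notebook. The sketch below does the same lookup by hand, rebuilding the option lists from `config` without NumPy (the list comprehension stands in for `np.arange(0.0, 1.0, 0.05)`).

```python
# option lists, in the same order as in config
choices = {
    'n_neighbors': list(range(20, 300, 20)),
    'min_dist': [i * 0.05 for i in range(20)],
    'umap_metric': ['manhattan', 'canberra', 'hellinger', 'correlation',
                    'jaccard', 'cosine', 'dice', 'll_dirichlet'],
    'aggparams': [[a, l]
                  for a in ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
                  for l in ['complete', 'average', 'single']]
                 + [['euclidean', 'ward']],
}

# indices copied from the 'vals' entry of the best trial
vals = {'aggparams': [15], 'min_dist': [3], 'n_neighbors': [2], 'umap_metric': [2]}

decoded = {name: choices[name][idx[0]] for name, idx in vals.items()}
print(decoded)
```

The decoded values agree with the `best_params_use` dictionary printed in the previous cell.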

In [18]:
umap.__version__

'0.5.2'