## Introduction

### UMAP + Spectral Clustering

* `n_neighbors`: This determines the number of neighboring points used in local approximations of manifold structure. Larger values preserve more global structure at the cost of detailed local structure. This parameter should generally be in the **range 5 to 50**, with a choice of 10 to 15 being a sensible default.

* `min_dist`: This controls how tightly the embedding is allowed to compress points together. **Larger values** ensure embedded points are **more evenly distributed**, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.

* `metric`: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.

`['euclidean', 'manhattan', 'chebyshev', 'minkowski', 'canberra', 'braycurtis', 'mahalanobis', 'wminkowski', 'cosine', 'correlation', 'haversine', 'hamming', 'jaccard', 'dice', 'russelrao', 'kulsinski', 'll_dirichlet', 'hellinger']`

* euclidean
* manhattan
* chebyshev
* minkowski
* canberra
* braycurtis
* ~~mahalanobis~~
* wminkowski
* ~~seuclidean~~
* cosine
* correlation
* haversine
* hamming
* jaccard
* dice
* ~~russelrao~~
* kulsinski
* ll_dirichlet
* hellinger
* rogerstanimoto
* sokalmichener
* sokalsneath
* yule

[UMAP API Guide](https://umap.scikit-tda.org/api.html)

* `n_components`: 2-3
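
As a rough illustration of what `n_neighbors` and `metric` control, the sketch below finds the k nearest neighbours of each point under a chosen distance, which is the kind of neighbourhood information UMAP starts from. It is plain Python on made-up points; the helpers `manhattan`, `euclidean`, and `k_nearest` are hypothetical names for this example and are not part of the umap API.

```python
import math

def manhattan(a, b):
    # sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # straight-line distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(points, k, metric):
    """Return, for each point, the indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted((metric(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in others[:k]])
    return neighbours

# toy 2-D points, invented for this example
points = [(0, 0), (3, 0), (2.2, 2), (10, 10)]

knn_manhattan = k_nearest(points, k=1, metric=manhattan)
knn_euclidean = k_nearest(points, k=1, metric=euclidean)

# the nearest neighbour of the first point depends on the metric
print(knn_manhattan[0], knn_euclidean[0])
```

Raising `k` pulls more distant points into each neighbourhood, which is the intuition behind larger `n_neighbors` favouring global structure.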


In [1]:
! pip install pyamg

Collecting pyamg
  Downloading pyamg-4.2.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 920 kB/s            
Installing collected packages: pyamg
Successfully installed pyamg-4.2.3



### Spectral Clustering

`eigen_solver`: `['arpack', 'lobpcg', 'amg']`, default=None
> The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities. If None, then `'arpack'` is used. See [4] for more details regarding `'lobpcg'`.

`affinity`: `['nearest_neighbors', 'rbf', 'precomputed', 'precomputed_nearest_neighbors']`

* `[*, 'precomputed', *]` (any `eigen_solver`, any `assign_labels`): fails on the UMAP embedding with `array must be 2-dimensional and square. shape = (720, 2)`.
* `[*, 'precomputed_nearest_neighbors', *]`: fails with `Precomputed matrix must be square. Input is a 720x2 matrix.`

`assign_labels`: `['discretize', 'kmeans']`

`gamma`: float, default=1.0

> Ignored for `affinity='nearest_neighbors'`

`random_state` 

 > Used to initialize the lobpcg eigenvector decomposition when `eigen_solver == 'amg'`, and for the K-Means initialization.
 
 When using `eigen_solver == 'amg'`, it is necessary to also fix the global numpy seed with `np.random.seed(int)` to get deterministic results.
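
The two `precomputed` errors listed above arise because those affinities expect an n × n matrix, whereas a 2-dimensional embedding of 720 images has shape (720, 2). A minimal sketch of turning an (n, 2) embedding into a square RBF affinity matrix, exp(-gamma * ||x - y||^2), follows. The toy `embedding` and the helper `rbf_affinity` are made up for illustration; in practice it is usually simpler to pass `affinity='rbf'` and let the estimator build the matrix.

```python
import math

def rbf_affinity(embedding, gamma=1.0):
    """Build a square (n x n) affinity matrix from an (n x d) embedding."""
    n = len(embedding)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
            matrix[i][j] = math.exp(-gamma * sq_dist)
    return matrix

# toy stand-in for an (n, 2) embedding; values are invented
embedding = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]

affinity_matrix = rbf_affinity(embedding, gamma=1.0)

# 3 points in, 3 x 3 matrix out: this is the "square" shape that is required
print(len(affinity_matrix), "x", len(affinity_matrix[0]))
```

Larger `gamma` makes the affinity fall off faster with distance, so only very close points count as similar.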

In [2]:
# for everything else
import os
import random
# from random import randint
from functools import partial
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# for loading/processing the images  
from tensorflow import keras
from keras.models import load_model
from keras.preprocessing.image import load_img 
from keras.preprocessing.image import img_to_array 
from keras.applications.vgg16 import preprocess_input 

# models 
from keras.applications.vgg16 import VGG16 
from keras.models import Model
import pickle

import umap

# clustering and dimension reduction
from sklearn.cluster import SpectralClustering
from pyamg import smoothed_aggregation_solver  # without pyamg, eigen_solver='amg' fails: "The eigen_solver was set to 'amg', but pyamg is not available."

from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
# from sklearn.preprocessing import LabelEncoder
# from keras.utils.np_utils import to_categorical
from tqdm import trange
from hyperopt import fmin, tpe, hp, STATUS_OK, space_eval, Trials

import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

from warnings import filterwarnings
filterwarnings('ignore')

#### Fixing random seeds for reproducibility

📝 When using `eigen_solver == 'amg'`, it is necessary to also fix the global numpy seed with `np.random.seed(int)` to get deterministic results.

In [3]:
random.seed(10)
np.random.seed(10)
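
To see what fixing a seed buys, the sketch below re-seeds Python's `random` module and checks that the same draws come back; `np.random.seed` plays the same role for NumPy's global generator. The helper `draw` is a hypothetical name used only here.

```python
import random

def draw(seed, n=3):
    """Seed the global generator, then return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(10)
second = draw(10)   # same seed, so the same sequence
other = draw(11)    # different seed, so a different sequence

print(first == second)
print(first == other)
```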

## Hyperparameters
* agglomerative: `n_clusters` `affinity`
* k-means: `n_init`, `max_iter`

In [4]:
eigen_solver=['arpack', 'lobpcg', 'amg']
affinity=['nearest_neighbors', 'rbf']
assign_labels=['discretize', 'kmeans']

specparams = [[e,a,l] for e in eigen_solver for a in affinity for l in assign_labels]

In [5]:
config = {
    'method': 'SpectralClustering',
    'output_features': 512,
    'n_neighbors': range(20,300,20),
    'min_dist': np.arange(0.0, 1.0, 0.05),
    'n_components': 2,
    "umap_metric": ['manhattan','canberra','hellinger','correlation','jaccard','cosine','dice','ll_dirichlet'],
    'n_clusters': 4,
    'n_init': 10,
    'specparams': specparams,
    'max_evals': 3000,
    'random_state': 42
}

In [6]:
umap_space = len(config['n_neighbors']) * len(config['min_dist']) * len(config['umap_metric'])
spectral_space = len(config['specparams'])
umap_space * spectral_space

26880
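
The count above is the product 14 × 20 × 8 × 12 = 26880. A quick sketch of the same arithmetic without NumPy follows; it also shows how much of the grid `max_evals = 3000` can visit at most, assuming no configuration is sampled twice. The `min_dist` list mirrors `np.arange(0.0, 1.0, 0.05)` with a plain list comprehension.

```python
from itertools import product

n_neighbors = list(range(20, 300, 20))               # 14 values
min_dist = [round(0.05 * i, 2) for i in range(20)]   # 20 values: 0.0, 0.05, ..., 0.95
umap_metric = ['manhattan', 'canberra', 'hellinger', 'correlation',
               'jaccard', 'cosine', 'dice', 'll_dirichlet']   # 8 values

eigen_solver = ['arpack', 'lobpcg', 'amg']
affinity = ['nearest_neighbors', 'rbf']
assign_labels = ['discretize', 'kmeans']
specparams = list(product(eigen_solver, affinity, assign_labels))  # 3 * 2 * 2 = 12

grid_size = len(n_neighbors) * len(min_dist) * len(umap_metric) * len(specparams)
max_evals = 3000
coverage = max_evals / grid_size

print(grid_size)            # 26880
print(f"{coverage:.1%}")    # 11.2%
```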

## Load Data

In [7]:
base_dir = '../input/nbiinfframes/FRAMES'
Fold1 = '../input/nbiinfframes/FRAMES/Fold1'
Fold2 = '../input/nbiinfframes/FRAMES/Fold2'
Fold3 = '../input/nbiinfframes/FRAMES/Fold3'

model_path = r'vgg16_weights_tf_dim_ordering_tf_kernels.h5'
pkl_file = r"../input/feature-embedding/files_feature_embeddings_vgg16.pkl"

class_ = ['B', 'I', 'S', 'U']
Folds = [Fold1, Fold2, Fold3]
all_data = []

for fold in Folds:
    for case in class_:
        path = os.path.join(fold, case)
        if os.path.isdir(path):
            for img in os.listdir(path):
                all_data.append([os.path.join(path, img), case, os.path.basename(fold)])
        
all_df = pd.DataFrame(all_data, columns=['image_path', 'class', 'folder'],index=None)
all_df.dropna(inplace = True)

all_df['name'] = all_df['image_path'].map(lambda x: os.path.basename(x))
img_name = all_df['name'] .to_list()
files = all_df['image_path'].to_list()

## Methods: Building Model

In [8]:
%%time 
model = VGG16()
model = Model(inputs = model.inputs, outputs = model.layers[-2].output)
    
out_features = model.output.shape[1]  # number of features produced by the penultimate layer
print(out_features)

2022-04-24 13:29:37.307521: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
4096
CPU times: user 3.92 s, sys: 2.96 s, total: 6.89 s
Wall time: 5.57 s


## Feature Extraction by CNN

In [9]:
def extract_features(files, model, pkl="", out_features=out_features):
    
    data = {}
    if os.path.exists(pkl):
        # reuse the cached embeddings
        with open(pkl, 'rb') as f:
            data = pickle.load(f)
    else:
        print("Feature pkl file doesn't exist, extracting with the model")
        for file in files:
            resized_img = load_img(file, target_size=(224,224))
            img_arr = np.array(resized_img) 
            reshaped_img = img_arr.reshape(1,224,224,3)  # add the batch dimension
            imgx = preprocess_input(reshaped_img)
            embeddings = model.predict(imgx, use_multiprocessing=True)
            data[os.path.basename(file)] = embeddings
            
        # cache the embeddings for later runs
        with open(pkl, "wb") as f:
            pickle.dump(data, f)
 
    filenames = np.array(list(data.keys()))
    feat = np.array(list(data.values()))
    feat = feat.reshape(-1, out_features)
        
    return filenames, feat
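
The load-or-compute pattern used in `extract_features` can be sketched in isolation. The names `load_or_compute` and `fake_embedding` below are hypothetical stand-ins (the latter replaces the VGG16 forward pass), and the cache is written to a temporary directory rather than the notebook's input folder.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, keys, compute):
    """Return {key: compute(key)}, reading from cache_path when it exists."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = {key: compute(key) for key in keys}
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data

calls = []  # records every time the expensive step actually runs

def fake_embedding(name):
    # placeholder for an expensive model call
    calls.append(name)
    return [float(len(name))]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'features.pkl')
    first = load_or_compute(cache, ['a.png', 'bb.png'], fake_embedding)
    second = load_or_compute(cache, ['a.png', 'bb.png'], fake_embedding)

# the second call is served from the pickle, so nothing is recomputed
print(first == second, len(calls))
```

One consequence of this design is that a stale cache is returned as-is, even if the list of files has changed since it was written.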

In [10]:
%%time

filenames, feat = extract_features(files, model, pkl_file, out_features)    
print('extracted feature shape: {}'.format(feat.shape))

extracted feature shape: (720, 4096)
CPU times: user 9.49 ms, sys: 11.7 ms, total: 21.2 ms
Wall time: 109 ms


## Different Clusters

In [11]:
def generate_clusters(feature_embeddings,
                       n_neighbors,
                       min_dist,
                       metric,
                       eigen_solver,
                       affinity,
                       assign_labels,
                       random_state = config['random_state'],
                       n_components = config['n_components'],
                       n_clusters = config['n_clusters'],
                       n_init = config['n_init']):
    
    print('the setting eigen_solver:{} affinity: {}, assign_labels: {}'.format(eigen_solver, affinity, assign_labels))
    
    umap_embeddings = umap.UMAP(n_neighbors = n_neighbors,
                                min_dist = min_dist,
                                n_components = n_components,
                                metric = metric,
                                random_state=random_state).fit_transform(feature_embeddings)
    
    clusters = SpectralClustering(n_clusters = n_clusters,
                                  n_components = n_components,
                                  n_init = n_init,
                                  eigen_solver = eigen_solver,
                                  affinity = affinity,
                                  assign_labels = assign_labels,
                                  gamma=1.0,
                                  n_jobs=None, 
                                  n_neighbors=10,
                                  random_state=random_state).fit(umap_embeddings)

    return clusters

In [12]:
generate_clusters(feat, 200, 0, 'manhattan', 'arpack', 'rbf', 'kmeans')

the setting eigen_solver:arpack affinity: rbf, assign_labels: kmeans


SpectralClustering(eigen_solver='arpack', n_clusters=4, n_components=2,
                   random_state=42)

### Score

In [13]:
def metrics_score_(clusters, filenames, image_label_df):
    clustered_labels = clusters.labels_

    # group the file names by predicted cluster id
    clustered_groups = {}
    for file, cluster in zip(filenames, clustered_labels):
        clustered_groups.setdefault(cluster, []).append(file)
    
    image_groups = image_label_df.groupby('class')['name']
    ground_labels = [*image_groups.groups.keys()] 
    
    # map each ground-truth class to the cluster sharing the most images with it
    mapped_label = {}
    for label in ground_labels:
        lst = set(image_groups.get_group(label).tolist())
        group_label = {x: len(set(clustered_groups[x]) & lst) for x in sorted(clustered_groups)}
        max_key = max(group_label, key=group_label.get)
        mapped_label[max_key] = label
        
    # build the prediction table once, after the mapping is complete
    clustered_df = pd.DataFrame({'name': filenames, 'predict_target': clustered_labels})
    clustered_df['predict_class'] = clustered_df['predict_target'].map(mapped_label)
    eval_df = pd.merge(image_label_df, clustered_df, on='name', how='left')
    
    y_test = eval_df['class'].tolist()
    y_pred = eval_df['predict_class'].tolist()
    
    # per-class precision, recall and F1, computed in a single call
    precision_score_, recall_score_, f1_score_, _ = precision_recall_fscore_support(y_test, y_pred)

    avg_precision = np.average(precision_score_)
    avg_recall = np.average(recall_score_)
    avg_f1 = np.average(f1_score_)
        
    return avg_precision, avg_recall, avg_f1
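
The core of `metrics_score_` is the step that gives each anonymous cluster id a class name by looking at which cluster overlaps most with each ground-truth class. A minimal sketch of that step on toy data follows; `map_clusters_to_classes` and the six file names are invented for the example.

```python
def map_clusters_to_classes(cluster_of, class_of):
    """For each class, pick the cluster that shares the most items with it."""
    clusters = {}
    for name, cluster in cluster_of.items():
        clusters.setdefault(cluster, set()).add(name)

    classes = {}
    for name, label in class_of.items():
        classes.setdefault(label, set()).add(name)

    mapping = {}
    for label, members in classes.items():
        best = max(sorted(clusters), key=lambda c: len(clusters[c] & members))
        mapping[best] = label
    return mapping

# toy data: predicted cluster id and true class per file
cluster_of = {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 1, 'f': 0}
class_of = {'a': 'B', 'b': 'B', 'c': 'I', 'd': 'I', 'e': 'I', 'f': 'I'}

mapping = map_clusters_to_classes(cluster_of, class_of)
predicted = {name: mapping.get(cluster) for name, cluster in cluster_of.items()}
accuracy = sum(predicted[n] == class_of[n] for n in class_of) / len(class_of)

print(mapping)    # {0: 'B', 1: 'I'}
print(accuracy)   # 5 of 6 files end up with the right class
```

Note that if two classes choose the same cluster, the later one overwrites the earlier and some cluster is left without a class name, so its files get no predicted class.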

In [14]:
def objective(params, embeddings, filenames, df):
      
    clusters = generate_clusters(embeddings, n_neighbors = params['n_neighbors'], min_dist = params['min_dist'], metric = params['umap_metric'],
                                 eigen_solver = params['specparams'][0],
                                 affinity = params['specparams'][1],
                                 assign_labels = params['specparams'][2])
    
    avg_precision, avg_recall, avg_f1 = metrics_score_(clusters, filenames, df)
    cost = 1 - avg_recall 

    if (np.abs(avg_precision - avg_recall) > 0.025):
        penalty = 0.1 * np.abs(avg_precision - avg_recall)
    else:
        penalty = 0
    loss = cost + penalty
    return {'loss': loss, 'avg_precision': avg_precision,'avg_recall': avg_recall, 'avg_f1':avg_f1, 'status': STATUS_OK}
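
The loss in `objective` is `1 - avg_recall`, plus a penalty of one tenth of the precision/recall gap whenever that gap exceeds 0.025. The sketch below isolates that arithmetic; `penalized_loss` is a hypothetical helper, the first pair of inputs is the precision and recall reported for the best trial in this notebook's output, and the second pair is made up.

```python
import math

def penalized_loss(avg_precision, avg_recall, tolerance=0.025, weight=0.1):
    """1 - recall, plus weight * |precision - recall| when the gap exceeds tolerance."""
    cost = 1 - avg_recall
    gap = abs(avg_precision - avg_recall)
    penalty = weight * gap if gap > tolerance else 0
    return cost + penalty

# gap of about 0.002: below the tolerance, so no penalty is added
balanced = penalized_loss(0.9271032724247373, 0.925)

# gap of 0.1: above the tolerance, so 0.1 * 0.1 = 0.01 is added to 1 - 0.8
unbalanced = penalized_loss(0.90, 0.80)

print(math.isclose(balanced, 0.075), math.isclose(unbalanced, 0.21))
```

Because the penalty weight is small, recall dominates the search; the penalty mainly breaks ties in favour of configurations where precision and recall agree.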

In [15]:
def bayesian_search(features, filenames, df, space, max_evals):
    trials = Trials()
    
    fmin_objective = partial(objective, embeddings=features, filenames=filenames, df=df)
    
    best = fmin(fmin_objective, 
                space = space, 
                algo=tpe.suggest,
                max_evals = max_evals, 
                trials = trials)

    best_params = space_eval(space, best)
    print('best:')
    print(best_params)
    print(f"the average recall: {trials.best_trial['result']['avg_recall']}")

    best_clusters = generate_clusters(features, 
                                      n_neighbors = best_params['n_neighbors'], 
                                      min_dist = best_params['min_dist'], 
                                      metric = best_params['umap_metric'],
                                      eigen_solver = best_params['specparams'][0],
                                      affinity = best_params['specparams'][1],
                                      assign_labels = best_params['specparams'][2])
    

    return best_params, best_clusters, trials

In [16]:
hspace = {
    "n_neighbors": hp.choice('n_neighbors', config['n_neighbors']),
    "min_dist": hp.choice('min_dist', config['min_dist']),
    "umap_metric": hp.choice('umap_metric', config['umap_metric']),
    "specparams": hp.choice('specparams',config['specparams'])
}
max_evals = config['max_evals']

In [17]:
print('test the time-cost for the method {}'.format(config['method']))
%time best_params_use, best_clusters_use, trials_use = bayesian_search(feat, filenames, all_df, hspace, max_evals)

test the time-cost for the method SpectralClustering
the setting eigen_solver:arpack affinity: nearest_neighbors, assign_labels: discretize
the setting eigen_solver:arpack affinity: rbf, assign_labels: kmeans
the setting eigen_solver:amg affinity: rbf, assign_labels: discretize
the setting eigen_solver:amg affinity: rbf, assign_labels: kmeans
the setting eigen_solver:lobpcg affinity: nearest_neighbors, assign_labels: kmeans
the setting eigen_solver:amg affinity: rbf, assign_labels: discretize
the setting eigen_solver:lobpcg affinity: rbf, assign_labels: kmeans
the setting eigen_solver:amg affinity: rbf, assign_labels: discretize
the setting eigen_solver:amg affinity: rbf, assign_labels: discretize
the setting eigen_solver:arpack affinity: nearest_neighbors, assign_labels: kmeans
the setting eigen_solver:amg affinity: nearest_neighbors, assign_labels: kmeans
the setting eigen_solver:lobpcg affinity: rbf, assign_labels: discretize
the setting eigen_solver:arpack affinity: nearest_neighbo

In [18]:
best_params_use['avg_recall'] = trials_use.best_trial['result']['avg_recall']
print(best_params_use)
with open('best_params_{}.txt'.format(config['method']), 'w') as sourceFile:
    print(best_params_use, file = sourceFile)

{'min_dist': 0.0, 'n_neighbors': 260, 'specparams': ('arpack', 'rbf', 'kmeans'), 'umap_metric': 'manhattan', 'avg_recall': 0.925}


In [19]:
trials_use.best_trial

{'state': 2,
 'tid': 396,
 'spec': None,
 'result': {'loss': 0.07499999999999996,
  'avg_precision': 0.9271032724247373,
  'avg_recall': 0.925,
  'avg_f1': 0.9248171785622348,
  'status': 'ok'},
 'misc': {'tid': 396,
  'cmd': ('domain_attachment', 'FMinIter_Domain'),
  'workdir': None,
  'idxs': {'min_dist': [396],
   'n_neighbors': [396],
   'specparams': [396],
   'umap_metric': [396]},
  'vals': {'min_dist': [0],
   'n_neighbors': [12],
   'specparams': [3],
   'umap_metric': [0]}},
 'exp_key': None,
 'owner': None,
 'version': 0,
 'book_time': datetime.datetime(2022, 4, 24, 14, 32, 15, 598000),
 'refresh_time': datetime.datetime(2022, 4, 24, 14, 32, 22, 322000)}
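
The `vals` entries in the trial record are indices into the `hp.choice` option lists rather than the parameter values themselves, which is why `space_eval` is used in `bayesian_search`. A sketch of the same decoding by hand follows, with the option lists copied from `config` (the `min_dist` list is a plain-Python equivalent of the `np.arange` call).

```python
n_neighbors = list(range(20, 300, 20))
min_dist = [round(0.05 * i, 2) for i in range(20)]
umap_metric = ['manhattan', 'canberra', 'hellinger', 'correlation',
               'jaccard', 'cosine', 'dice', 'll_dirichlet']
specparams = [(e, a, l)
              for e in ['arpack', 'lobpcg', 'amg']
              for a in ['nearest_neighbors', 'rbf']
              for l in ['discretize', 'kmeans']]

choices = {
    'n_neighbors': n_neighbors,
    'min_dist': min_dist,
    'umap_metric': umap_metric,
    'specparams': specparams,
}

# indices as they appear under 'misc' -> 'vals' in the best trial
vals = {'min_dist': [0], 'n_neighbors': [12], 'specparams': [3], 'umap_metric': [0]}

decoded = {name: choices[name][idx[0]] for name, idx in vals.items()}
print(decoded)
```

The decoded values match the best parameters printed earlier: `n_neighbors` 260, `min_dist` 0.0, metric `manhattan`, and `('arpack', 'rbf', 'kmeans')`.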

In [20]:
umap.__version__

'0.5.2'