# K-medoids

**Update 30/Oct/2020**  
Visualise the sequences in round 0 and round 1 to see whether the round 1 sequences are similar to the round 0 sequences.
The similarity should be measured in exactly the same way as when we generated the round 1 recommendations.
Two ways to show that:  
- Cluster the round 0 and round 1 sequences (k-medoids), then show the clusters using t-SNE (2D).
- Sort the sequences by similarity (hierarchical clustering) and show two plots: one is a heatmap with dendrograms (for feature similarity), see some examples [here](https://datavizpyr.com/hierarchically-clustered-heatmap-with-seaborn-in-python/) and [here](https://stackoverflow.com/questions/2455761/reordering-matrix-elements-to-reflect-column-and-row-clustering-in-naiive-python);
the other is a 3D heatmap for labels, see [here]().  
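
The sorting step in the second option could look roughly like the sketch below. It assumes a symmetric precomputed distance matrix; the toy sequences and the Hamming distance are made up for illustration and stand in for the kernel distance used in this notebook. Average linkage is also an assumption, not a choice made in the text.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# Toy sequences (made up for illustration, not taken from the data set).
seqs = ["AGGAGA", "AGGAGT", "TTTTTT", "TTTTTA", "AGGACA", "TTTCTT"]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

n = len(seqs)
dist = np.array([[hamming(a, b) for b in seqs] for a in seqs], dtype=float)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
Z = linkage(squareform(dist), method="average")
order = leaves_list(Z)  # leaf order of the dendrogram

# Reorder rows and columns so that similar sequences sit next to each other.
dist_sorted = dist[np.ix_(order, order)]
seqs_sorted = [seqs[i] for i in order]
```

`dist_sorted` can then be passed to any heatmap routine; seaborn's `clustermap`, used in the first linked example, draws the heatmap and the dendrograms in one call.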

**End Update**

For exploration, we should spread the choices as widely as possible, since we do not want to waste effort searching the same region again when it may give bad results. The UCB algorithm balances exploration and exploitation by weighting the predicted mean against the predicted variance. To check whether the UCB algorithm gives reasonable coverage of both the unexplored and the explored parts of the space, we run a clustering algorithm on the whole exploration space. We expect the recommendations to cover several sequences in each cluster during the first few rounds, and then to focus on sequences in the clusters that give relatively high performance.
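
As a reminder of how the trade-off enters the score, here is a minimal sketch of a UCB rule. The weight `beta`, the use of the predicted standard deviation as the uncertainty term, and the toy numbers are all assumptions for illustration; the weighting actually used for the recommendations is defined elsewhere in the project.

```python
import numpy as np

def ucb_score(pred_mean, pred_std, beta=1.0):
    """Upper confidence bound: predicted mean plus beta times predicted std."""
    return np.asarray(pred_mean) + beta * np.asarray(pred_std)

# Toy predictions for four candidate sequences (made-up numbers).
pred_mean = np.array([1.0, 0.8, 0.2, 0.5])
pred_std = np.array([0.1, 0.5, 1.0, 0.2])

# beta = 0 is pure exploitation; a larger beta favours uncertain candidates.
best_exploit = int(np.argmax(ucb_score(pred_mean, pred_std, beta=0.0)))
best_explore = int(np.argmax(ucb_score(pred_mean, pred_std, beta=2.0)))
```

With a small `beta` the rule keeps picking candidates near known good sequences, and with a large `beta` it moves to regions with high predictive uncertainty, which is the coverage behaviour the clustering is meant to check.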


## Pipeline

- Embedding
    - kmers ([3], [2,3,4,5,6])
    - onehot
- PCA
    - number of components (selected by `singular_values_`)
- Clustering (Kmeans++)
    - number of clusters (elbow method to select a good number of clusters)
- TSNE (to 2 dims) for visualisation
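
The steps above can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the sequences are random toy 6-mers, the embedding is a plain one-hot encoding, the number of PCA components is chosen by an explained-variance threshold of 90% rather than by inspecting `singular_values_`, and `n_clusters = 4` is a placeholder for a value read off the elbow curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
alphabet = np.array(list("ACGT"))

# Toy data: 60 random 6-mers (made up; the notebook uses real RBS sequences).
seqs = ["".join(rng.choice(alphabet, size=6)) for _ in range(60)]

def one_hot(seq):
    """Flattened one-hot encoding of a DNA string (4 values per base)."""
    return np.array([[base == a for a in "ACGT"] for base in seq], dtype=float).ravel()

X = np.stack([one_hot(s) for s in seqs])  # shape (60, 24)

# PCA: keep enough components to explain 90% of the variance (assumed threshold).
X_pca = PCA(n_components=0.9, svd_solver="full").fit_transform(X)

# Elbow method: record the within-cluster sum of squares for a range of k.
inertias = {
    k: KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X_pca).inertia_
    for k in range(2, 9)
}

n_clusters = 4  # placeholder: in practice picked by eye from `inertias`
labels = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=0).fit_predict(X_pca)

# t-SNE to 2 dimensions, used for visualisation only.
coords = TSNE(n_components=2, perplexity=15, init="random",
              random_state=0).fit_transform(X_pca)
```

Plotting `inertias` against `k` and looking for the point where the curve flattens gives the elbow; `coords` coloured by `labels` gives the 2D picture.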
   
## Evaluation

- We show the mean and variance of the label for the sequences in each cluster.
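
A minimal sketch of this summary with pandas, using a made-up table. The column names `cluster` and `label` are hypothetical; in this notebook the corresponding quantities would be the cluster assignments from the clustering step and the `AVERAGE` column.

```python
import numpy as np
import pandas as pd

# Toy table: cluster assignment and label per sequence (made-up numbers).
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "label": [1.0, 3.0, 0.5, 1.5, 1.0, 2.0],
})

# Mean, sample variance and size of every cluster.
summary = toy.groupby("cluster")["label"].agg(["mean", "var", "count"])
```

Note that pandas reports the sample variance (divisor n - 1), so a cluster with a single sequence has an undefined variance (NaN).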

Ideas:

One possible idea is to combine clustering with successive rejects, where rejection happens at the level of clusters rather than single data points. This addresses the problem that the search space (the number of arms) is too large.
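
To make the idea concrete, here is a rough sketch in which each arm is a cluster and pulling an arm means evaluating one randomly chosen member of that cluster. It uses the standard Successive Rejects phase schedule; the function name, the random choice of a member, and the toy reward are all assumptions for illustration, not a worked-out design.

```python
import math
import numpy as np

def successive_rejects_clusters(clusters, reward_fn, budget, rng):
    """Successive Rejects where each arm is a cluster of candidates.

    clusters : list of lists of candidates
    reward_fn: maps a candidate to a (possibly noisy) reward
    budget   : total number of reward evaluations allowed (must exceed len(clusters))
    Returns the index of the surviving cluster.
    """
    K = len(clusters)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    sums = np.zeros(K)
    counts = np.zeros(K, dtype=int)
    n_prev = 0
    for phase in range(1, K):
        # Cumulative number of pulls per active cluster at the end of this phase.
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - phase)))
        for c in active:
            for _ in range(n_k - n_prev):
                # Pulling a cluster = evaluating one random member of it.
                candidate = clusters[c][rng.integers(len(clusters[c]))]
                sums[c] += reward_fn(candidate)
                counts[c] += 1
        n_prev = n_k
        # Reject the active cluster with the lowest empirical mean.
        worst = min(active, key=lambda c: sums[c] / counts[c])
        active.remove(worst)
    return active[0]

# Toy example: each candidate is its own mean reward, observed with small noise.
rng = np.random.default_rng(0)
clusters = [[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]
best = successive_rejects_clusters(
    clusters, lambda x: x + rng.normal(0.0, 0.05), budget=60, rng=rng
)
```

The number of arms is now the number of clusters rather than the number of sequences, which is what keeps the budget per arm meaningful.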

In [1]:
# direct to proper path
import os
import sys
module_path = os.path.abspath(os.path.join('../../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from sklearn.cluster import KMeans
from nltk.metrics import distance 
# import Pycluster as PC
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
import math

from codes.kernels_for_GPK import *
from codes.embedding import Embedding

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn_extra.cluster import KMedoids

No default features for kernel instance. Please specify features.


## Known seq

In [2]:
import pandas as pd 

# Path = '../../../data/Results_Microplate_partialFalse_normFalse_formatSeq_logTrue.csv'
Path = '../../../data/Results_Microplate_partialFalse_normTrue_roundRep_formatSeq_logTrue.csv'

df = pd.read_csv(Path)
# df['Group Code'] = df.Group.astype('category').cat.codes
known_data = np.asarray(df[['RBS', 'RBS6', 'Group', 'Pred Mean', 'AVERAGE']])
known_seq = np.asarray(df['RBS'])

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Name | Group | Plate | Round | RBS | RBS6 | Rep1 | Rep2 | Rep3 | Rep4 | Rep5 | Rep6 | AVERAGE | STD | Pred Mean | Pred Std | Pred UCB | Group Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | RBS_1by1_0 | reference | First_Plate | 0 | TTTAAGAAGGAGATATACAT | AGGAGA | 2.433056 | 2.502155 | 2.315237 | 3.012905 | 2.917124 | 2.275329 | 2.575968 | 0.31354 | | | | 5 |
| 1 | 1 | RBS_1by1_1 | bps_noncore | First_Plate | 0 | CTTAAGAAGGAGATATACAT | AGGAGA | 1.556251 | 1.654243 | 1.762146 | 1.790123 | 2.31279 | 1.959275 | 1.839138 | 0.26882 | | | | 3 |
| 2 | 2 | RBS_1by1_2 | bps_noncore | First_Plate | 0 | GTTAAGAAGGAGATATACAT | AGGAGA | 0.603551 | 0.748674 | 0.921939 | 0.391285 | 0.503846 | 0.711231 | 0.646754 | 0.188587 | | | | 3 |
| 3 | 3 | RBS_1by1_3 | bps_noncore | First_Plate | 0 | ATTAAGAAGGAGATATACAT | AGGAGA | 1.658359 | 1.874275 | 1.534988 | 1.54611 | 1.747116 | 1.232548 | 1.598899 | 0.220191 | | | | 3 |
| 4 | 4 | RBS_1by1_4 | bps_noncore | First_Plate | 0 | TCTAAGAAGGAGATATACAT | AGGAGA | 1.545942 | 2.072095 | 1.3863 | 1.949759 | 1.774833 | 2.146898 | 1.812638 | 0.300722 | | | | 3 |


In [4]:
known_seq.shape

(266,)

## Setting

In [5]:
random_state = 0
n_dim = 2 # dimension reduction 
scores = {}

wd_shift_distance = WD_Shift_Kernel(features = known_seq, l = 6, s=1).distance_all
distance_matrix = [wd_shift_distance]
distance_matrix_name = ['wd_shift_distance']

n_clusters = 6 # to be changed

# spec_distance = Spectrum_Kernel(l_list = [1,2,3,4,5,6]).distance(known_seq)
# wd_distance = WeightedDegree_Kernel(l_list = [6]).distance(known_seq)
# mismatch_distance = Mismatch_Kernel(l_list= [6]).distance(known_seq)

# distance_matrix = [spec_distance, wd_distance, wd_shift_distance, mismatch_distance]
# distance_matrix_name = ['spec_distance', 'wd_distance', 'wd_shift_distance', 'mismatch_distance']

init kernel


In [6]:
def show_tsne_with_clustering(tsne_X, n_clusters, y_km):
    plt.figure(figsize = (15,15))
    for i in range(n_clusters):
        plt.scatter(
            tsne_X[y_km == i, 0], tsne_X[y_km == i, 1],
            s=50, 
            label=str(i)
            )
    plt.legend()
    plt.show()

## K-medoids and TSNE

In [7]:

for i in range(len(distance_matrix)):
    distance = distance_matrix[i]

    # clustering
    kmedoids = KMedoids(n_clusters=n_clusters, metric = 'precomputed', init='k-medoids++', random_state=random_state).fit(distance)
    y_km_spec = kmedoids.labels_
    
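    # dim reduction
    # init='random' is required with a precomputed metric in recent scikit-learn;
    # random_state makes the embedding reproducible
    tsne = TSNE(n_components = n_dim, metric = 'precomputed', init = 'random', random_state = random_state)
    tsne_X_spec = tsne.fit_transform(distance)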
    # show_tsne_with_clustering(tsne_X_spec, n_clusters, y_km_spec)
    
    # save file (create the output directory if it does not exist yet)
    os.makedirs('results', exist_ok=True)
    file_name = 'results/round01_tsne_kmedoid_' + distance_matrix_name[i] +  '_' + str(n_clusters) + '_clusters.npz'
    print(file_name)
    np.savez(file_name, coord = tsne_X_spec, text = known_data, ykm = y_km_spec, known_seq = [], ucb_rec = [])
    print('result saved')

results/round01_tsne_kmedoid_wd_shift_distance_6_clusters.npz
result saved


In [9]:
# read saved data and plot

data = np.load(file_name, allow_pickle=True)

embed = data['coord']
# y = data['label']
text = data['text']
y_km = data['ykm']
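# both were saved as empty placeholders above;
# note that this overwrites the `known_seq` array loaded from the CSV
known_seq = data['known_seq']
ucb_rec = data['ucb_rec']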

# the last column of each row is AVERAGE (the label); the full row is used as hover text
tir_labels = [float(row[-1]) for row in text]
text_labels = [str(row) for row in text]

In [10]:
def sigmoid(z_list):
    """Map each value to the range (0, 10) with a scaled logistic function."""
    func_z_list = []
    for z in z_list:
        func_z_list.append(10.0/(1 + np.exp(-z)))
    return func_z_list

def normalise_minmax(z_list):
    """Rescale values linearly to the range [0, 5] (assumes max > min)."""
    min_value = np.min(z_list)
    max_value = np.max(z_list)
    func_z_list = []
    for z in z_list:
        func_z_list.append(5 * (z-min_value)/(max_value - min_value))
    return func_z_list

# marker_size = sigmoid(tir_labels) 
marker_size = normalise_minmax(tir_labels)

In [11]:
import plotly.express as px

fig = px.scatter(
    x = embed[:,0], y = embed[:,1],
    color = df['Group'], symbol = y_km[:], size = marker_size,
    hover_name = text_labels[:],
    title = file_name.split('/')[-1].split('.')[0].replace('_', ' ')
)
fig.write_html(file_name[:-4] + "plot.html")