For exploration, we want to spread our choices across the search space as widely as possible, so that we do not waste effort repeatedly searching a region that may give bad results. The UCB algorithm balances exploration and exploitation by weighting the predicted mean against the predicted variance. To check whether UCB gives reasonable coverage of both the unknown and the known parts of the space, we run a clustering algorithm on the whole exploration space. We expect the recommendations to cover several sequences in each cluster during the first few rounds, and then to focus on sequences in the clusters that give relatively high performance.
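
As a minimal illustration of that trade-off, the sketch below scores each candidate by its predicted mean plus a weighted predicted standard deviation. The function name `ucb_scores`, the weight `beta`, and the toy numbers are assumptions made for illustration; they are not taken from the project code.

```python
import math

def ucb_scores(means, variances, beta=2.0):
    """Upper confidence bound: predicted mean + beta * predicted std."""
    return [m + beta * math.sqrt(v) for m, v in zip(means, variances)]

# Toy predictions for three candidate sequences (made-up numbers).
pred_mean = [1.0, 0.8, 0.2]
pred_var = [0.01, 0.25, 1.0]

scores_exploit = ucb_scores(pred_mean, pred_var, beta=0.0)  # mean only
scores_explore = ucb_scores(pred_mean, pred_var, beta=2.0)  # rewards uncertainty

# Index of the candidate each setting would recommend.
best_exploit = max(range(len(pred_mean)), key=scores_exploit.__getitem__)
best_explore = max(range(len(pred_mean)), key=scores_explore.__getitem__)
```

With `beta = 0` the rule picks the candidate with the highest predicted mean; a larger `beta` shifts the choice towards candidates whose prediction is uncertain.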


## Pipeline

- Embedding
    - kmers ([3], [2,3,4,5,6])
    - onehot
- PCA
    - number of components (selected using `singular_values_`)
- Clustering (k-means++)
    - number of clusters (chosen with the elbow method)
- t-SNE (to 2 dimensions) for visualisation
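
The two embeddings at the start of the pipeline can be sketched in plain Python as follows. This is an illustrative stand-in, not the project's `Embedding` class, whose interface is not shown here; the function names and the `ALPHABET` constant are assumptions.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_embedding(seq, k_list=(3,)):
    """Count vector over all possible k-mers, for each k in k_list."""
    features = []
    for k in k_list:
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Fixed feature order: every k-mer over the alphabet, in product order.
        features.extend(counts["".join(p)] for p in product(ALPHABET, repeat=k))
    return features

def onehot_embedding(seq):
    """Concatenate one 4-dimensional indicator vector per position."""
    return [int(base == letter) for base in seq for letter in ALPHABET]

seq = "TTTAAGAAGGAGATATACAT"  # a 20-base RBS sequence
x_kmer = kmer_embedding(seq, k_list=(3,))  # 4**3 = 64 features
x_onehot = onehot_embedding(seq)           # 20 * 4 = 80 features
```

Passing `k_list=(2, 3, 4, 5, 6)` concatenates the count vectors for each k, which corresponds to the second k-mer setting listed above. The resulting vectors are what PCA and clustering would then operate on.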
   
## Evaluation

- We show the mean and variance of the labels of the sequences in each cluster.
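
A minimal sketch of that summary, assuming the cluster assignments and labels are available as two parallel lists. The helper name `cluster_label_stats` and the toy values are hypothetical, and the population variance is used here; the sample variance would be an equally reasonable choice.

```python
from collections import defaultdict
from statistics import mean, pvariance

def cluster_label_stats(cluster_ids, labels):
    """Return {cluster id: (mean label, population variance of labels)}."""
    grouped = defaultdict(list)
    for c, y in zip(cluster_ids, labels):
        grouped[c].append(y)
    return {c: (mean(ys), pvariance(ys)) for c, ys in sorted(grouped.items())}

# Toy example: five sequences assigned to two clusters (made-up labels).
cluster_ids = [0, 0, 1, 1, 1]
labels = [1.0, 3.0, 10.0, 10.0, 10.0]
stats = cluster_label_stats(cluster_ids, labels)
```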

Ideas:

A possible idea is to combine clustering with successive rejects, where rejection happens at the level of clusters rather than single data points. This addresses the problem that the search space (the number of arms) is too large.
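
One way this idea could look is sketched below: each cluster is treated as an arm, and "pulling" an arm means drawing the label of a random sequence from that cluster. The phase lengths follow the standard Successive Rejects schedule; the function name, the toy clusters, and the sampling model are assumptions for illustration, not an implementation from this project.

```python
import math
import random

def successive_rejects_over_clusters(clusters, budget, rng):
    """clusters: {cluster id: list of labels}. Returns the surviving cluster id.

    Requires budget > number of clusters so that every phase pulls each arm.
    """
    K = len(clusters)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = set(clusters)
    totals = {c: 0.0 for c in clusters}
    pulls = {c: 0 for c in clusters}
    n_prev = 0
    for k in range(1, K):
        # Cumulative number of pulls per active arm at the end of phase k.
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for c in active:
            for _ in range(n_k - n_prev):
                totals[c] += rng.choice(clusters[c])
                pulls[c] += 1
        n_prev = n_k
        # Reject the cluster with the lowest empirical mean label.
        worst = min(active, key=lambda c: totals[c] / pulls[c])
        active.remove(worst)
    return active.pop()

# Toy clusters with well-separated labels (made-up numbers).
toy_clusters = {"a": [1.0, 1.2], "b": [5.0, 5.5], "c": [3.0, 2.8]}
winner = successive_rejects_over_clusters(toy_clusters, budget=30, rng=random.Random(0))
```

Because the number of arms is the number of clusters rather than the number of sequences, the budget per arm stays reasonable even when the sequence space is very large.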

In [1]:
# Add the repository root to sys.path so the local `codes` package can be imported
import os
import sys
module_path = os.path.abspath(os.path.join('../../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from sklearn.cluster import KMeans
from nltk.metrics import distance 
# import Pycluster as PC
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
import math

from codes.kernels_for_GPK import *
from codes.embedding import Embedding

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

## Known seq

In [2]:
import pandas as pd 

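# First-round measurements; each row holds one RBS sequence and its replicate readings
data_path = '../../../data/firstRound_4h_normFalse_formatSeq.csv'

df = pd.read_csv(data_path)
# (sequence, mean label) pairs, and the sequences on their own
known_data = np.asarray(df[['RBS', 'AVERAGE']])
known_seq = np.asarray(df['RBS'])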

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Name | Group | RBS | RBS6 | Rep1 | Rep2 | Rep3 | Rep4 | Rep5 | AVERAGE | STD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | RBS_1by1_0 | reference | TTTAAGAAGGAGATATACAT | AGGAGA | | 52.402431 | | 61.622165 | 54.151485 | 56.058694 | 3.998246 |
| 1 | 1 | RBS_1by1_1 | bps_noncore | CTTAAGAAGGAGATATACAT | AGGAGA | | 40.072951 | | 42.042854 | 45.432032 | 42.515946 | 2.213263 |
| 2 | 2 | RBS_1by1_2 | bps_noncore | GTTAAGAAGGAGATATACAT | AGGAGA | | 28.831559 | | 24.48787 | 24.133637 | 25.817689 | 2.136029 |
| 3 | 3 | RBS_1by1_3 | bps_noncore | ATTAAGAAGGAGATATACAT | AGGAGA | | 43.093359 | | 38.641958 | 38.049577 | 39.928298 | 2.251065 |
| 4 | 4 | RBS_1by1_4 | bps_noncore | TCTAAGAAGGAGATATACAT | AGGAGA | | 45.913214 | | 44.352931 | 38.394865 | 42.887003 | 3.23966 |


In [4]:
known_seq.shape

(150,)

## Setting

In [5]:
random_state = 0
n_dim = 2 # target dimension of the t-SNE projection
scores = {}

In [6]:
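def show_tsne_with_clustering(tsne_X, n_clusters, y_km):
    """Scatter-plot the 2-D t-SNE coordinates, one colour per cluster label."""
    plt.figure(figsize=(15, 15))
    for i in range(n_clusters):
        mask = y_km == i  # points assigned to cluster i
        plt.scatter(tsne_X[mask, 0], tsne_X[mask, 1], s=50, label=str(i))
    plt.legend()
    plt.show()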

In [7]:
# Pairwise distance matrices between the known sequences, one per string kernel
spec_distance = Spectrum_Kernel(l_list = [1,2,3,4,5,6]).distance(known_seq)
wd_distance = WeightedDegree_Kernel(l_list = [6]).distance(known_seq)
wd_shift_distance = WD_Shift_Kernel(l_list = [6]).distance(known_seq)
mismatch_distance = Mismatch_Kernel(l_list= [6]).distance(known_seq)
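
For intuition about how a kernel can induce a distance, the sketch below uses the usual feature-space identity d(x, y)² = k(x, x) − 2·k(x, y) + k(y, y) with a plain k-mer count (spectrum) kernel. It is an assumption that the `.distance` methods above work along these lines; the project's kernels may normalise or weight terms differently, and the function names here are hypothetical.

```python
import math
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of x and y."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[s] * cy[s] for s in cx)

def kernel_distance(x, y, kernel):
    """Distance in the kernel's feature space."""
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off

a = "TTTAAGAAGGAGATATACAT"
b = "CTTAAGAAGGAGATATACAT"  # differs from a in the first base only
d_ab = kernel_distance(a, b, spectrum_kernel)
d_aa = kernel_distance(a, a, spectrum_kernel)
```

The two sequences differ in a single 3-mer (`TTT` versus `CTT`), so their count vectors differ in two coordinates by one each, giving a distance of √2.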

## K-medoids and TSNE

In [10]:
from sklearn_extra.cluster import KMedoids
n_clusters = 6

distance_matrix = [spec_distance, wd_distance, wd_shift_distance, mismatch_distance]
distance_matrix_name = ['spec_distance', 'wd_distance', 'wd_shift_distance', 'mismatch_distance']

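# `dist` rather than `distance`, so the nltk `distance` module imported earlier is not shadowed
for name, dist in zip(distance_matrix_name, distance_matrix):

    # 2-D t-SNE of the precomputed distances, used for visualisation only.
    # init='random' is needed because recent scikit-learn versions reject
    # PCA initialisation with metric='precomputed'; random_state makes the layout reproducible.
    tsne = TSNE(n_components = n_dim, metric = 'precomputed', init = 'random', random_state = random_state)
    tsne_X_spec = tsne.fit_transform(dist)

    file_name = 'results/tsne_kmedoid_' + name +  '_6_' + str(n_clusters) + '_clusters'
    print(file_name)

    # choose X_spec or tsne_X_spec
    #kmeans_spec = KMeans(n_clusters=n_clusters, init = 'k-means++', random_state= random_state)
    #kmeans_spec.fit(X_spec_pca)
    #y_km_spec = kmeans_spec.predict(X_spec_pca)

    # Cluster on the distance matrix itself, not on the t-SNE coordinates
    kmedoids = KMedoids(n_clusters=n_clusters, metric = 'precomputed', init='k-medoids++', random_state=random_state).fit(dist)
    y_km_spec = kmedoids.labels_

    # show_tsne_with_clustering(tsne_X_spec, n_clusters, y_km_spec)
    # np.savez appends '.npz' to file_name; the 'results' directory must already exist
    np.savez(file_name, coord = tsne_X_spec, text = known_data, ykm = y_km_spec, known_seq = [], ucb_rec = [])
    print('result saved')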

results/tsne_kmedoid_spec_distance_6_6_clusters
result saved
results/tsne_kmedoid_wd_distance_6_6_clusters
result saved
results/tsne_kmedoid_wd_shift_distance_6_6_clusters
result saved
results/tsne_kmedoid_mismatch_distance_6_6_clusters
result saved
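
The saved archives can be read back with `np.load`. The sketch below is a self-contained round trip on toy stand-in arrays written to a temporary directory. Because `known_data` mixes strings and floats, it is most likely an object array, and such arrays can only be read back with `allow_pickle=True`. The file name `demo_clusters` and all values are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for the arrays saved above (made-up values).
coord = np.array([[0.0, 1.0], [2.0, 3.0]])
text = np.array([["TTTAAGAAGGAGATATACAT", 56.06],
                 ["CTTAAGAAGGAGATATACAT", 42.52]], dtype=object)
ykm = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.join(tmp, "demo_clusters")
    np.savez(file_name, coord=coord, text=text, ykm=ykm)  # writes demo_clusters.npz
    # allow_pickle=True is required to read the object-dtype `text` array.
    with np.load(file_name + ".npz", allow_pickle=True) as saved:
        loaded_coord = saved["coord"]
        loaded_text = saved["text"]
        loaded_ykm = saved["ykm"]
```

Only enable `allow_pickle` for files you created yourself, since unpickling untrusted data can execute arbitrary code.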
