This notebook shows how to reproduce the KNN-based discovery results of [Unsupervised discovery of sign terms by K-nearest neighbours approach (ECCV '20)](https://link.springer.com/chapter/10.1007/978-3-030-66096-3_22 "Link to paper").  
The Bayesian parameter search step is not performed in this notebook; instead, the saved optimum parameters for each cross-validation fold are used to re-run the experiments and obtain the results. To also run the Bayesian parameter search with `scikit-optimize`, refer to the [related notebook](./notebooks/run_crossval_exp.ipynb).


In [None]:
%load_ext autoreload
%autoreload 2


In [None]:
import numpy as np
# import matplotlib.pyplot as plt
from os.path import join
import pandas as pd
import os
from copy import deepcopy

from utils.feature_loaders import load_feats
from utils.helper_fncs import load_json
from utils.db_utils import get_seq_names
from run.pipeline import run_exp
from run.optim import run_for_set

**sets**  
There are 9 signers in the Phoenix dataset, and the amount of data per signer is highly imbalanced. Therefore, the
signers are grouped into 3 sets (named `A`, `B`, `C`) so that each set has a similar amount of data. The signer assignments and the total
number of sequences for each set are given in the cell below.

**cross validation scheme**  
For each fold, the parameters are tuned on the dev set, and results are reported on the remaining test sets using those optimum parameters.
Since the test sets contain different signers, these results can be viewed as tests on unseen signers. At the end, a weighted average
of the results is computed separately for the *dev* and *test* sets, weighted by the total number of files in each set. Because the
amount of data is small, the test results are reported as the cross-validation average rather than on a single test set.

| dev set |test sets|
|  :----: | :----: |
| A | B,C |  
| B | A,C |  
| C | A,B |
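
The fold layout in the table above can be sketched in a few lines. This is only an illustration: `cv_sets` and `folds` are names introduced here and are not part of the repository code.

```python
# Illustrative sketch: enumerate the cross-validation folds from the table above.
cv_sets = ['A', 'B', 'C']

# For each dev set, the remaining sets serve as test sets (unseen signers).
folds = {dev: [s for s in cv_sets if s != dev] for dev in cv_sets}

for dev, tests in folds.items():
    print('dev: {} | test: {}'.format(dev, ','.join(tests)))
```

Each set is used as the dev set exactly once, so every set also appears as a test set in the two other folds.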


In [3]:
cvsets = set(['A','B','C'])
signers_per_set = {'A': [4,8,9], 'B':[2,5,7], 'C':[1,3,6]}
nfolder_perset = {'A':1705,'B':1991,'C':1975}

alg_type = 'knn'
params = load_json(full_path='./config/{}.json'.format(alg_type))
params

{'CVroot': '/home/korhan/Desktop/knn_utd/data/CVfolds',
 'CVset': 'A',
 'clustering': {'cost_thr': 0.01, 'method': 'pairwise', 'olapthr_m': 0.5},
 'disc_method': 'knn',
 'disc': {'a': 3,
  'dim_fix': 6,
  'emb_type': 'gauss_kernel',
  'k': 150,
  'lmax': 15,
  'lmin': 2,
  'metric': 'L2',
  'norm': False,
  'olapthr_m': 0.2,
  'pca': '',
  'r': 0.2,
  's': 0.6,
  'seg_type': 'uniform',
  'top_delta': 0.05,
  'use_gpu': True},
 'eval': {'tderunfile': '/home/korhan/Desktop/knn_utd/run_tde.sh',
  'TDEROOT': '/home/korhan/Desktop/tez/tdev2/tdev2',
  'TDESOURCE': '/home/korhan/miniconda3/etc/profile.d/conda.sh',
  'njobs': 2,
  'dataset': 'phoenix',
  'config_file': '/home/korhan/Desktop/knn_utd/config/config_phoenix.json'},
 'exp_root': '/home/korhan/Desktop/knn_utd/data/results/',
 'feats_root': '/home/korhan/Desktop/data/features/',
 'featype': 'c3right',
 'coverage': {'patience': 30, 'covth': 10, 'covmargin': 1},
 'tune': {'keys': ['r', 's'],
  'chkpoint_root': '/home/korhan/Desktop/knn

In [6]:
# uncomment the matching line to choose between OpenPose and DeepHand features
# params['featype'] = 'op100'
params['featype'] = 'c3_right_PCA40W'

params_default = deepcopy(params)


In [7]:

if params['featype']=='c3_right_PCA40W':
    params['exp_var'] = 40
    saved_results = pd.read_csv('./data/paper_results/knn_c3_right_PCA40_l6-45_dim_fix_r_s.csv', keep_default_na=False)
if params['featype']=='op100':
    saved_results = pd.read_csv('./data/paper_results/knn_op100_l6_lmax_dim_fix_r_s.csv', keep_default_na=False)

saved_results = saved_results.iloc[::4] # eliminate single signer results
saved_results #[report_cols[:-1]]

Unnamed: 0.1,Unnamed: 0,devset,set,ned,coverage,coverageNS,coverageNS_f,grouping_F,grouping_P,grouping_R,...,lmin,metric,norm,olapthr_m,pca,seg_type,top_delta,use_gpu,r,s
0,0,A,A,37.59,8.52,9.32,11.62,54.36,53.19,55.68,...,2,L2,False,0.2,,uniform,0.05,True,0.21,0.6
4,4,A,B,43.99,9.74,10.45,14.28,53.77,52.14,55.62,...,2,L2,False,0.2,,uniform,0.05,True,0.21,0.6
8,8,A,C,41.08,8.65,9.83,12.15,49.73,48.19,51.47,...,2,L2,False,0.2,,uniform,0.05,True,0.21,0.6
12,12,B,A,44.73,9.05,9.92,13.05,50.52,50.34,50.81,...,2,L2,False,0.2,,uniform,0.06,True,0.2176,0.6
16,16,B,B,42.95,8.71,9.34,12.74,55.1,54.19,56.15,...,2,L2,False,0.2,,uniform,0.05,True,0.2176,0.6
20,20,B,C,43.11,8.67,9.88,11.99,46.13,45.38,47.01,...,2,L2,False,0.2,,uniform,0.05,True,0.2176,0.6
24,24,C,A,44.73,9.05,9.92,13.05,50.52,50.34,50.81,...,2,L2,False,0.2,,uniform,0.06,True,0.2176,0.6
28,28,C,B,42.95,8.71,9.34,12.74,55.1,54.19,56.15,...,2,L2,False,0.2,,uniform,0.05,True,0.2176,0.6
32,32,C,C,43.11,8.67,9.88,11.99,46.13,45.38,47.01,...,2,L2,False,0.2,,uniform,0.05,True,0.2176,0.6


In [8]:
# load the saved optimum parameters for each fold and re-run the experiments

all_stats = dict()

for devset in sorted(cvsets):    
    
    # load related parameters
    expinfo = saved_results[saved_results['devset']==devset].iloc[0]
    # set params from experiment info
    params['CVset'] = devset
    for key,value in params['disc'].items():
        if key in expinfo.keys():
            params['disc'][key] = expinfo[key]

    # re-run to get full stats
    stats = [run_for_set(setx, load_feats, params) for setx in sorted(cvsets)]
    
    all_stats[devset] = stats

    # reset params (deep copy, so the nested 'disc' dict of the defaults is not mutated in the next fold)
    params = deepcopy(params_default)

    # break

knn_A_c3_right_PCA40W_a3_dim_fix4_emb_typegauss_kernel_k150_lmax15_lmin2_metricL2_normFalse_olapthr_m0.2_pca_r0.21_s0.6_seg_typeuniform_top_delta0.05_use_gpuTrue
0.01
Computing Embeddings
Building index of size 872529x160
Searching index
Selecting good pairs
*** found 3507 matches ***
*** pairwise clustering ***
*** post disc completed, found 68 segments from 34 clusters ***
/home/korhan/miniconda3/etc/profile.d/conda.sh
/home/korhan/Desktop/knn_utd/data/results/knn_A_c3_right_PCA40W_a3_dim_fix4_emb_typegauss_kernel_k150_lmax15_lmin2_metricL2_normFalse_olapthr_m0.2_pca_r0.21_s0.6_seg_typeuniform_top_delta0.05_use_gpuTrue/postpairwise_cost0.01_olap0.5 phoenix sdtw /home/korhan/Desktop/knn_utd/data/results/knn_A_c3_right_PCA40W_a3_dim_fix4_emb_typegauss_kernel_k150_lmax15_lmin2_metricL2_normFalse_olapthr_m0.2_pca_r0.21_s0.6_seg_typeuniform_top_delta0.05_use_gpuTrue/postpairwise_cost0.01_olap0.5/scores.json
Reading gold
*** Config file read, ovth 50.0 ***
/home/korhan/Desktop/knn_utd/conf

KeyboardInterrupt: 

In [17]:
# score columns to report; 'nfile' is included because the later cells use it as the averaging weight
report_cols = ['ned','length_avg', 'n_clus', 'n_node',
              'grouping_P', 'grouping_R', 'grouping_F', 
              'coverageNS', 'nfile']

# concatenate the per-fold results in the 'all_stats' dict into one dataframe
dflist = []
for devset, stats in all_stats.items():
    dflist.append(pd.DataFrame([{'devset': devset, 'set': key, **item}
                                for sublist in stats for key, item in sublist.items()]))
    
df = pd.concat(dflist, ignore_index=True)
# number of files per set, used later as the averaging weight
df['nfile'] = df['set'].apply(lambda x: nfolder_perset[x])

df[report_cols] #.to_csv('../results/cv/{}_{}.csv'.format(params['disc_method'], params['featype']))

Unnamed: 0,ned,length_avg,n_clus,n_node,grouping_P,grouping_R,grouping_F,coverageNS,nfile
0,37.59,12.525163,1145,2290,53.19,55.68,54.36,9.32,1705
1,43.99,14.188693,1651,3302,52.14,55.62,53.77,10.45,1991
2,41.08,13.31205,1224,2448,48.19,51.47,49.73,9.83,1975
3,44.73,11.944089,1292,2584,50.34,50.81,50.52,9.92,1705
4,42.95,13.250463,1427,2854,54.19,56.15,55.1,9.34,1991
5,43.11,12.58912,1248,2496,45.38,47.01,46.13,9.88,1975
6,44.73,11.944089,1292,2584,50.34,50.81,50.52,9.92,1705
7,42.95,13.250463,1427,2854,54.19,56.15,55.1,9.34,1991
8,43.11,12.58912,1248,2496,45.38,47.01,46.13,9.88,1975


Separate the results into two dataframes, one for the **devset** and one for the **testset** results,
to obtain the development and test scores.


| devset | testset | df |
| :----: | :----:  | :----: |
| A | A | dev_df |
| B | B | dev_df |
| C | C | dev_df |
| A | B,C | test_df |
| B | A,C | test_df |
| C | A,B | test_df |
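
The split in the table above amounts to comparing the `devset` and `set` columns row by row. A minimal sketch on a toy dataframe follows; the `toy` frame and its `ned` values are placeholders invented for illustration and are not results from the experiments.

```python
import pandas as pd

# Toy results table with one row per (devset, set) pair; 'ned' values are placeholders.
toy = pd.DataFrame({
    'devset': ['A', 'A', 'A', 'B', 'B', 'B'],
    'set':    ['A', 'B', 'C', 'A', 'B', 'C'],
    'ned':    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# A row is a dev result when it was evaluated on the same set it was tuned on.
is_dev = toy['devset'] == toy['set']
toy_dev_df = toy[is_dev]
toy_test_df = toy[~is_dev]
```

The loop in the next cell does the same per fold with `isin`, which keeps the dev and test rows of each fold in separate list entries.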


In [8]:
devs_ms = []
tests_ms = []

for devset in ['A','B','C']:

    # rows of the current fold, split into dev (set == devset) and test (set != devset)
    fold_df = df.loc[df.devset.isin([devset])]
    dev_df = fold_df.loc[fold_df.set.isin([devset])][report_cols+['set']]
    test_df = fold_df.loc[~fold_df.set.isin([devset])][report_cols+['set']]

    # DEV: keep only rows whose set name is a single character ('A', 'B' or 'C')
    fold = dev_df
    fold_ms = fold.loc[[len(x)==1 for x in fold.set]]    
    devs_ms.append(fold_ms)
    
    # TEST: same filter on the test rows
    fold = test_df
    fold_ms = fold.loc[[len(x)==1 for x in fold.set]]
    tests_ms.append(fold_ms)


In [9]:
devs_ms[0]

Unnamed: 0,ned,length_avg,n_clus,n_node,grouping_P,grouping_R,grouping_F,coverageNS,nfile,set
0,37.59,12.525163,1145,2290,53.19,55.68,54.36,9.32,1705,A


In [19]:
def average(df_list, report_cols):
    ''' computes average scores weighted by number of files '''

    fold_df = pd.concat(df_list)
    # multiply each row by its file count, sum over rows, and divide by the total file count
    tmp = pd.DataFrame(
        ((fold_df[report_cols].values * fold_df.nfile.values[:,None]).sum(0) / fold_df.nfile.sum())[None,:],
        columns=report_cols)
    return tmp

keys = ['Dev MS'  ,'Test MS']

final = []

for k,df_list in enumerate([devs_ms , tests_ms]):
    avg_df = average(df_list, report_cols)
    avg_df['exp'] = keys[k]
    
    final.append(avg_df)
    
    # display(avg_df)

finaldf = pd.concat(final)[['exp'] + report_cols]

finaldf

Unnamed: 0,exp,ned,length_avg,n_clus,n_node,grouping_P,grouping_R,grouping_F,coverageNS,nfile
0,Dev MS,48.555274,8.955494,1090.555458,2181.110915,43.677896,39.975927,41.519275,9.612768,1899.441192
0,Test MS,50.721803,9.254117,1207.087639,2414.175278,43.052922,39.704803,41.028422,10.0517,1899.441192
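
The file-weighted average used above is the standard weighted mean, sum(score * nfile) / sum(nfile). As a hedged cross-check, the sketch below applies it to the three dev-fold `ned` scores (A/A, B/B, C/C) from the per-set results table printed earlier and compares the manual formula with `np.average`. The variable names are introduced here for illustration, and the numbers only demonstrate the formula; they are not meant to reproduce the final table.

```python
import numpy as np

# Dev-fold 'ned' scores (rows A/A, B/B, C/C) copied from the per-set results table,
# and the number of files per set from `nfolder_perset`.
ned = np.array([37.59, 42.95, 43.11])
nfile = np.array([1705, 1991, 1975])

# Manual weighted mean: sum(score * weight) / sum(weight).
manual = (ned * nfile).sum() / nfile.sum()

# Equivalent NumPy call.
weighted = np.average(ned, weights=nfile)

print(round(float(weighted), 2))
```

Because the three sets have similar sizes, the weighted mean stays close to the plain mean of the three scores.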
