# Motifs clustering - Part 2: Motifs DTW distance and clustering

Companion notebook to 2_Motifs_analysis. Once the motifs have been exported to csv.gz files, use this notebook to compute the DTW distance between them. The distance matrix can be exported as a .csv file or as an R object file (.rds). An optional clustering procedure is provided at the end of the notebook.

IMPORTANT:
The distance matrix computation and the clustering are both performed by R scripts that are called directly from the notebook. BEFORE RUNNING THIS NOTEBOOK:
1. Make sure that you have a working R installation on your system with the following packages installed: argparse, data.table, proxy, dtw, parallelDist, reshape2, ggplot2, stringr, dendextend
2. You will need to manually change the first line in both R scripts: dtw_multivar_distmat.R and pattern_clustering.R. On this line, in both files, set the variable "user_lib" to the path of the directory where your personal R packages are installed. For example, on Windows this library should be under: 'C:/Users/myUserName/Documents/R/win-library/X.X' where X.X is the version of R; on Linux, it should be under: '/home/myUserName/R/x86_64-pc-linux-gnu-library/X.X'.
3. Should errors arise while running these scripts, the error message will unfortunately not be returned in the notebook, but it can be seen in the Jupyter console. You can also run these scripts manually from the command line by copy-pasting the script call created in the notebook (printed below the notebook cell).
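
If reading the Jupyter console is inconvenient, one possible workaround is to capture the error stream of the script and print it in the notebook. The following is a minimal sketch, not part of the notebook: `run_and_report` is a hypothetical helper, and a small Python command stands in for the Rscript call so that the example runs without R.

```python
import subprocess
import sys


def run_and_report(cmd, shell=False):
    """Run a command and return (exit code, stdout, stderr) as text.

    Hypothetical helper: `cmd` is a list of arguments, or a single string
    when shell=True (as with the call strings built in this notebook).
    """
    result = subprocess.run(cmd, shell=shell, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the error in the notebook instead of the Jupyter console
        print('Command failed with exit code {}:'.format(result.returncode))
        print(result.stderr)
    return result.returncode, result.stdout, result.stderr


# Stand-in command that fails on purpose; replace it with the Rscript call
code, out, err = run_and_report([sys.executable, '-c', 'import sys; sys.exit("boom")'])
```

A non-zero exit code signals that the script failed, and `err` then holds the message that would otherwise only appear in the console.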

## Import libraries

In [1]:
# Standard libraries
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
import numpy as np
import pandas as pd
from skimage.filters import threshold_li, threshold_mean
import os
from itertools import chain
from tqdm import tqdm
import subprocess
import sys

# Custom functions/classes
path_to_module = '../source'  # Path where all the .py files are, relative to the notebook folder
sys.path.append(path_to_module)
from load_data import DataProcesser
from results_model import top_confidence_perclass, least_correlated_set
from pattern_utils import extend_segments, create_cam, longest_segments, extract_pattern
from class_dataset import myDataset, ToTensor, RandomCrop

# For reproducibility
myseed = 7
torch.manual_seed(myseed)
torch.cuda.manual_seed(myseed)
np.random.seed(myseed)

## Parameters for distance matrix with DTW

Parameters for DTW:
- center_patt: bool, whether to zero-center the motifs. If the input is multivariate, each channel is independently zero-centered. This matters for DTW calculation. Set to True if you don't want the baseline of the signal to be taken into account when computing the DTW distance between the motifs. Note that this will also erase the shift between channels in a multivariate motif.
- normalize_dtw: bool, whether to normalize DTW distance to the length of the trajectories. This is important to compare motifs of varying lengths. Set to False only if you have a good reason to do so.
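
To illustrate what these two parameters do, here is a minimal univariate sketch in plain Python. It is not the implementation used by the R scripts: the function names are hypothetical, the local cost is assumed to be the absolute difference, and the normalization is assumed to divide by the summed lengths of both series, which is one common convention.

```python
def zero_center(series):
    """Subtract the mean so that the baseline does not contribute to the distance."""
    mean = sum(series) / len(series)
    return [x - mean for x in series]


def dtw_distance(a, b, normalize=True):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    inf = float('inf')
    n, m = len(a), len(b)
    # cost[i][j]: minimal cumulated cost to align a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    total = cost[n][m]
    # Assumed normalization: divide by the summed lengths of both series
    return total / (n + m) if normalize else total


# Two motifs with the same shape but a different baseline
low = [0, 1, 2, 1, 0]
high = [5, 6, 7, 6, 5]
raw = dtw_distance(low, high)
centered = dtw_distance(zero_center(low), zero_center(high))
```

Here `raw` is strictly positive because of the baseline offset, whereas `centered` is zero up to floating-point error, since both motifs have the same shape once centered.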

In [2]:
# Must match the values used in notebook 2, otherwise an error will occur. This should be handled automatically in a future update.
min_len_patt = 0
max_len_patt = 200

# Parameters DTW, see above
center_patt = False
normalize_dtw = True

# Whether to compute a DTW matrix for the motifs of each class and/or one for the motifs of all classes pooled together
export_perClass = False
export_allPooled = True

# Whether to save the distance matrix in csv format and/or in R object format
save_csv = True
save_rds = True
out_dir = 'auto'  # If 'auto', a directory is created automatically to save the DTW matrices

In [3]:
# Only used to retrieve measurement variables and classes without manual input
data_file = '../sample_data/Synthetic_Univariate.zip'
# data_file = '../sample_data/GrowthFactor_ErkAkt_Bivariate.zip'

meas_var = None  # Set to None for auto detection
data = DataProcesser(data_file, datatable=False)
classes = tuple(data.classes[data.col_classname])
meas_var = data.detect_groups_times()['groups'] if meas_var is None else meas_var

# Path where to export tables with motifs
if out_dir == 'auto':
    out_dir = 'output/' + '_'.join(meas_var) + '/local_motifs/'
    os.makedirs(out_dir, exist_ok=True)  # No error if the directory already exists

In [4]:
# Path to the Rscript executable of your R installation
path_to_R = '/usr/bin/Rscript'
# Path to the directory containing the R scripts for DTW and clustering (in this example, '../source' relative to the notebook folder)
path_to_Rscripts = os.path.abspath('../source') + os.path.sep
path_to_Rscripts = path_to_Rscripts.replace('\\', '/')  # for Windows paths

## Build distance matrix with DTW

The distance matrix is computed in R with the DTW implementation of the *parallelDist* package, which computes distances in parallel and supports multivariate series.

The distance matrices can be written as a compressed csv (square matrix, with the lower triangle and diagonal set to Inf) and/or as an .rds file containing an R distance object. The latter is handy for resuming the clustering directly in R: just load the distance object with readRDS().
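
Many tools expect a full symmetric distance matrix rather than the triangular layout described above. The sketch below shows one way to rebuild it once the csv has been loaded as a list of rows. It assumes, as the description implies, that the distances sit in the upper triangle. `symmetrize_upper` is a hypothetical helper, and loading the file itself is left out because the exact csv layout (header, index column) is not documented here.

```python
import math


def symmetrize_upper(mat):
    """Mirror the upper triangle of a square matrix and put zeros on the diagonal."""
    n = len(mat)
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Copy each distance to both (i, j) and (j, i)
            full[i][j] = full[j][i] = mat[i][j]
    return full


# Toy matrix in the exported layout: lower triangle and diagonal set to Inf
inf = math.inf
exported = [[inf, 1.5, 2.0],
            [inf, inf, 0.5],
            [inf, inf, inf]]
full = symmetrize_upper(exported)
```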

In [5]:
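# Convert Python booleans to the "T"/"F" flags expected by the R scripts.
# Comparing against True/"T" keeps the cell safe to re-run: a bare truthiness
# test would turn an already converted "F" into "T".
def to_r_flag(value):
    return "T" if value in (True, "T") else "F"

center_patt = to_r_flag(center_patt)
normalize_dtw = to_r_flag(normalize_dtw)
save_csv = to_r_flag(save_csv)
save_rds = to_r_flag(save_rds)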

if export_perClass:
    for classe in classes:
        print('Building distance matrix for class: {}, with call:'.format(classe))
        fin_patt = out_dir + 'motif_{}.csv.gz'.format(classe)
        fout_dist = out_dir + 'DTWdist_{}'.format(classe)
        call_str = r'"{}" --vanilla {}dtw_multivar_distmat.R -i "{}" -o "{}" -l {} -n {} --norm {} --center {} --colid {} --csv {} --rds {}'.format(
                path_to_R,
                path_to_Rscripts,
                fin_patt,
                fout_dist,
                max_len_patt,
                len(meas_var),
                normalize_dtw,
                center_patt,
                "NULL",
                save_csv,
                save_rds)
        print(call_str + '\n')
        subprocess.call(call_str, shell=True)
            
if export_allPooled:
    print('Building distance matrix for pooled data with call:')
    fin_patt = out_dir + 'motif_allPooled.csv.gz'
    fout_dist = out_dir + 'DTWdist_allPooled'
    call_str = r'"{}" --vanilla {}dtw_multivar_distmat.R -i "{}" -o "{}" -l {} -n {} --norm {} --center {} --colid {} --csv {} --rds {}'.format(
        path_to_R,
        path_to_Rscripts,
        fin_patt,
        fout_dist,
        max_len_patt,
        len(meas_var),
        normalize_dtw,
        center_patt,
        "pattID",
        save_csv,
        save_rds)
    print(call_str)
    subprocess.call(call_str, shell=True)

Building distance matrix for pooled data with call:
"/usr/bin/Rscript" --vanilla /home/marc/Dropbox/CNN_paper_MarcAntoine/CODEX/source/dtw_multivar_distmat.R -i "output/FRST/local_motifs/motif_allPooled.csv.gz" -o "output/FRST/local_motifs/DTWdist_allPooled" -l 200 -n 1 --norm T --center F --colid pattID --csv T --rds T


## Cluster, generate report with results

This will use the distance matrix generated in the previous section to perform hierarchical clustering. Medoids from each cluster are reported along with a random sample of each cluster.

- nclust: int, number of clusters.
- nmedoid: int, number of medoids to plot per cluster.
- nseries: int, number of series to plot from each cluster (chosen randomly).
- linkage: str, one of ["ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid"]. Linkage for hierarchical clustering. Ward and average seem to be advisable defaults. More details on the help page for hierarchical clustering with R: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html
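
For readers unfamiliar with medoids: the medoid of a cluster is the member with the smallest total distance to the other members of that cluster. The sketch below ranks the members of each cluster by that criterion and keeps the first few. It only illustrates the concept, under the assumption that the `nmedoid` series reported per cluster are the best-ranked ones. `cluster_medoids` is a hypothetical helper, not the code of pattern_clustering.R.

```python
def cluster_medoids(dist, labels, n_medoids=1):
    """Return {label: indices of the n members closest to the rest of their cluster}."""
    # Group the series indices by cluster label
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    medoids = {}
    for lab, idxs in members.items():
        # Rank members by their summed distance to all members of the same cluster
        ranked = sorted(idxs, key=lambda i: sum(dist[i][j] for j in idxs))
        medoids[lab] = ranked[:n_medoids]
    return medoids


# Toy symmetric distance matrix: series 0-2 form one cluster, series 3 another
dist = [[0, 1, 4, 9],
        [1, 0, 2, 9],
        [4, 2, 0, 9],
        [9, 9, 9, 0]]
labels = [0, 0, 0, 1]
medoids = cluster_medoids(dist, labels, n_medoids=2)
```

In this toy example, series 1 is the medoid of cluster 0 because it has the smallest summed distance to series 0 and 2.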

In [6]:
nclust = 3
nmedoid = 8
nseries = 16
linkage = "ward.D"

assert linkage in ["ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid"]

if export_perClass:
    for classe in classes:
        print('Cluster patterns for class: {} with call:'.format(classe))
        fin_patt = out_dir + 'motif_{}.csv.gz'.format(classe)
        fin_dist = out_dir + 'DTWdist_{}.csv.gz'.format(classe)
        fout_plot = out_dir + 'motifPlot_{}.pdf'.format(classe)
        call_str = r'"{}" --vanilla {}pattern_clustering.R -d {} -p {} -o {} -l {} -n {} -c {} -m {} -t {} --colid {} --linkage {}'.format(
           path_to_R,
           path_to_Rscripts,
           fin_dist,
           fin_patt,
           fout_plot,
           max_len_patt,
           len(meas_var),
           nclust,
           nmedoid,
           nseries,
           "NULL",
           linkage)
        print(call_str + '\n')
        subprocess.call(call_str, shell=True)

if export_allPooled:
    print('Cluster patterns for pooled data with call:')
    fin_patt = out_dir + 'motif_allPooled.csv.gz'
    fin_dist = out_dir + 'DTWdist_allPooled.csv.gz'
    fout_plot = out_dir + 'motifPlot_allPooled.pdf'
    call_str = r'"{}" --vanilla {}pattern_clustering.R -d {} -p {} -o {} -l {} -n {} -c {} -m {} -t {} --colid {} --linkage {}'.format(
           path_to_R,
           path_to_Rscripts,
           fin_dist,
           fin_patt,
           fout_plot,
           max_len_patt,
           len(meas_var),
           nclust,
           nmedoid,
           nseries,
           "pattID",
           linkage)
    print(call_str)
    subprocess.call(call_str, shell=True)
print('Clustering done.')

Cluster patterns for pooled data with call:
"/usr/bin/Rscript" --vanilla /home/marc/Dropbox/CNN_paper_MarcAntoine/CODEX/source/pattern_clustering.R -d output/FRST/local_motifs/DTWdist_allPooled.csv.gz -p output/FRST/local_motifs/motif_allPooled.csv.gz -o output/FRST/local_motifs/motifPlot_allPooled.pdf -l 200 -n 1 -c 3 -m 8 -t 16 --colid pattID --linkage ward.D
Clustering done.
