# Motifs analysis

Notebook to perform the discriminative motifs analysis. It requires a trained model, but it is independent of both the feature space analysis and the prototype analysis.

Motifs are extracted on the basis of Class-Activation Maps (CAMs), which display the saliency of a class in a given input according to a model. CAMs towards any class can be computed regardless of the actual class of the input, so one could look for discriminative motifs of class B in an input of class A. For motif extraction, however, we do not use this feature of CAMs: we only produce CAMs towards the actual class of each input.

The motif extraction procedure is as follows:
1. Select trajectories from which to extract motifs.
2. Compute CAM for each trajectory (saliency towards its own class).
3. Binarize each time point as 'relevant' or 'non-relevant' for recognizing the input class.
4. Optional but recommended: extend the 'relevant' regions to capture the full motifs, and filter motifs by length.
5. Extract the longest 'relevant' stretches of time points. These are the final motifs.
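
Steps 3 and 5 can be illustrated on a toy CAM. The sketch below is only an illustration: `relevant_runs` and `longest_runs` are hypothetical helpers written for this example (they are not the `longest_segments` function imported in this notebook), the CAM values are made up, and the mean is used as a stand-in for an automatic threshold.

```python
import numpy as np

def relevant_runs(binary):
    """Return (start, end) index pairs (end exclusive) for each run of 1s."""
    padded = np.concatenate(([0], np.asarray(binary, dtype=int), [0]))
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(change == -1)    # 1 -> 0 transitions
    return list(zip(starts.tolist(), ends.tolist()))

def longest_runs(binary, k=1):
    """Keep the k longest runs of 1s, longest first."""
    runs = relevant_runs(binary)
    return sorted(runs, key=lambda r: r[1] - r[0], reverse=True)[:k]

# Toy CAM for one trajectory of 12 time points (made-up values).
cam = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0, 0.6, 0.9, 0.1, 0.1, 0.2])
threshold = cam.mean()                     # stand-in for an automatic threshold
binary = (cam >= threshold).astype(int)    # step 3: binarize
motifs = longest_runs(binary, k=1)         # step 5: longest relevant stretch
print(motifs)
```

Here the CAM contains two relevant stretches, and only the longer one (time points 2 to 4) is kept because `k=1`.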

To visualize these motifs, we propose to cluster them afterwards as follows:
1. Build a distance matrix between the motifs with dynamic time warping (DTW).
2. Cluster with hierarchical clustering.
3. Visualize dynamics captured by each cluster.
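
As a rough illustration of steps 1 and 2, the sketch below builds a DTW distance matrix for four made-up univariate motifs and cuts a hierarchical tree into two clusters. It is a minimal sketch under assumptions: `dtw_distance` is a plain textbook implementation written for this example, the linkage method (`average`) and the number of clusters are arbitrary choices, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Made-up motifs: two peaks and two dips, each in a short and a stretched version.
motifs = [
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0, 1.0, 0.0]),
    np.array([1.0, 0.0, 1.0]),
    np.array([1.0, 1.0, 0.0, 0.0, 1.0]),
]

# Step 1: symmetric pairwise distance matrix.
n = len(motifs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(motifs[i], motifs[j])

# Step 2: hierarchical clustering on the condensed distance matrix.
tree = linkage(squareform(dist), method='average')
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)
```

Because DTW tolerates stretching along time, the short and stretched versions of the same shape end up in the same cluster even though their lengths differ.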

This clustering can be run in two modes: either patterns from every class are pooled together, or a separate clustering is run independently for each class. In the first case, the clustering reflects the diversity of patterns at the dataset level and can reveal dynamics shared between classes. In the second case, the emphasis is on the diversity of dynamics induced by each class.


## Import libraries

In [1]:
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
import numpy as np
import pandas as pd
from load_data import DataProcesser
from results_model import top_confidence_perclass, least_correlated_set
from pattern_utils import extend_segments, create_cam, longest_segments, extract_pattern
from class_dataset import myDataset, ToTensor, RandomCrop
from dtaidistance import dtw, clustering
from models import ConvNetCam
from skimage.filters import threshold_li, threshold_mean
import os
from itertools import chain
from tqdm import tqdm

## Parameters

Parameters for the motifs extraction:
- selected_set: str, one of ['both', 'validation', 'training']. The set of trajectories from which motifs are extracted. For this purpose, extracting from training data also makes sense.
- n_series_perclass: int, maximum number of series, per class, on which motif extraction is attempted.
- n_pattern_perseries: int, maximum number of motifs to extract out of a single trajectory.
- mode_series_selection: str, one of ['top_confidence', 'least_correlated']. Mode to select the trajectories from which to extract the motifs (see Prototype analysis). With 'top_confidence', the motifs might be heavily biased towards a representative subpopulation of the class, so the output might not reflect the whole diversity of motifs induced by the class.
- extend_patt: int, by how many points to extend motifs. After binarization into 'relevant' and 'non-relevant' time points, the motifs are usually fragmented because a few points in their middle are improperly classified as 'non-relevant'. This parameter extends each fragment by a number of time points (in both time directions) before extracting the actual patterns.
- min_len_patt/max_len_patt: int, set minimum/maximum size of a motif. **/!\ The size is given in number of time-points. This means that if the input has more than one channel, the actual length of the motifs will be divided across them.**
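
The effect of `extend_patt` can be sketched on a toy binarized CAM. The helper `extend_runs` below is hypothetical and written only for this illustration; the actual `extend_segments` function used in this notebook may be implemented differently.

```python
import numpy as np

def extend_runs(binary, n_ext):
    """Grow every run of 1s by n_ext points on both sides (clipped at the edges)."""
    binary = np.asarray(binary, dtype=int)
    if n_ext <= 0:
        return binary.copy()
    kernel = np.ones(2 * n_ext + 1, dtype=int)
    # A point becomes 1 if any 1 lies within n_ext points of it.
    full = np.convolve(binary, kernel, mode='full')
    return (full[n_ext:n_ext + len(binary)] > 0).astype(int)

# Two fragments (points 1-2 and 4-6) separated by a single 'non-relevant' point.
fragmented = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0])
extended = extend_runs(fragmented, n_ext=1)
print(extended)
```

With an extension of one point, the gap at position 3 is filled and the two fragments merge into a single stretch, which is then a candidate for the longest motif.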

Parameters for the motifs clustering:
- center_patt: bool, whether to zero-center patterns prior to clustering. If the input is multivariate, each channel is zero-centered independently. This matters for DTW calculation.
- normalize_dtw: bool, whether to normalize the DTW distance by the length of the trajectories. This is important to compare motifs of varying lengths.
- export_perClass: bool, whether to run the motif clustering class per class.
- export_allPooled: bool, whether to pool all motifs across classes for clustering.
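
A minimal sketch of what `center_patt` and `normalize_dtw` could mean in practice is given below. Both helpers are hypothetical. In particular, dividing by the summed length of the two motifs is only one possible normalization convention, chosen here as an assumption; the notebook does not state which one is applied.

```python
import numpy as np

def center_pattern(pattern):
    """Subtract each channel's own mean; rows are channels, columns are time points."""
    pattern = np.atleast_2d(np.asarray(pattern, dtype=float))
    # nanmean ignores NaN padding, if any is present.
    return pattern - np.nanmean(pattern, axis=1, keepdims=True)

def normalize_by_length(distance, len_a, len_b):
    """Assumed convention: divide a raw DTW distance by the summed motif lengths."""
    return distance / (len_a + len_b)

patt = np.array([[10.0, 12.0, 11.0],   # channel 0, high baseline
                 [0.5, 0.1, 0.3]])     # channel 1, low baseline
centered = center_pattern(patt)
print(centered)
```

Centering removes the baseline of each channel, so the distance reflects the shape of the motif rather than its vertical offset. Normalizing by length prevents long motifs from accumulating a large distance simply because they contain more points.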

In [2]:
selected_set = 'both'
n_series_perclass = 125
n_pattern_perseries = 1
mode_series_selection = 'least_correlated'
thresh_confidence = 0.9  # used in least_correlated mode to choose set of series with minimal classification confidence
extend_patt = 0
min_len_patt = 5
max_len_patt = 400  # length to divide by nchannel

export_perClass = True
export_allPooled = True

assert mode_series_selection in ['top_confidence', 'least_correlated']

## Load model and data

- Pay attention to the order of 'meas_var': it must be the same as when training the model!
- Pay attention to the preprocessing of the trajectories.
- Set batch_size as high as memory allows to speed up computation.

In [3]:
# data_file = '/home/marc/Dropbox/Work/TSclass_GF/data/ErkAkt_6GF_len240_repl2_trim100.zip'
# data_file = '/home/marc/Dropbox/CNN_paper_MarcAntoine/forPaper/data_analysis/synthetic_len750_univariate_classAB.zip'
data_file = '/home/marc/Dropbox/CNN_paper_MarcAntoine/forPaper/data_analysis/synthetic_len750.zip'
# model_file = '/home/marc/Dropbox/Work/TSclass_GF/forPaper/models/ERK_AKT/2019-07-04-11:21:58_ErkAkt_6GF_len240_repl2_trim100.pytorch'
# model_file = '/home/marc/Dropbox/CNN_paper_MarcAntoine/forPaper/models/FRST_classAB/2020-02-11-15:52:07_synthetic_len750_univariate_classAB.pytorch'
model_file = '/home/marc/Dropbox/CNN_paper_MarcAntoine/forPaper/models/FRST_SCND/2020-02-11-14:44:57_synthetic_len750.pytorch'

meas_var = None  # Set to None for auto detection
start_time = None  # Set to None for auto detection
end_time = None  # Set to None for auto detection

batch_size = 200
is_cuda = torch.cuda.is_available()
device = torch.device('cuda' if is_cuda else 'cpu')
model = torch.load(model_file, map_location=device)  # map_location lets a GPU-trained model load on CPU
model.eval()
model.double()
model.batch_size = batch_size
model = model.to(device)

Note that **data.process() already centers the data**, so do not center a second time when loading the data in the dataloader. The **random crop** should be done before passing the trajectories into the model, to ensure that the same crop is used as model input and for extracting the patterns.

In [4]:
# Transformations to perform when loading data into the model
ls_transforms = transforms.Compose([RandomCrop(output_size=model.length, ignore_na_tails=True),
                                    ToTensor()])
# Loading and PREPROCESSING
data = DataProcesser(data_file)
meas_var = data.detect_groups_times()['groups'] if meas_var is None else meas_var
start_time = data.detect_groups_times()['times'][0] if start_time is None else start_time
end_time = data.detect_groups_times()['times'][1] if end_time is None else end_time

data.subset(sel_groups=meas_var, start_time=start_time, end_time=end_time)
data.get_stats()
data.process(method='center_train', independent_groups=True)  # do here and not in loader so can use in df
data.crop_random(model.length, ignore_na_tails=True)
data.split_sets(which='dataset')
classes = tuple(data.classes.iloc[:, 1])
classes_dict = data.classes['class']

# Random crop before to keep the same in df as the ones passed in the model
if selected_set == 'validation':
    selected_data = myDataset(dataset=data.validation_set, transform=ls_transforms)
    df = data.validation_set
elif selected_set == 'training':
    selected_data = myDataset(dataset=data.train_set, transform=ls_transforms)
    df = data.train_set
elif selected_set == 'both':
    try:
        selected_data = myDataset(dataset=data.dataset_cropped, transform=ls_transforms)
        df = data.dataset_cropped
    except:
        selected_data = myDataset(dataset=data.dataset, transform=ls_transforms)
        df = data.dataset

data_loader = DataLoader(dataset=selected_data,
                         batch_size=batch_size,
                         shuffle=True,
                         num_workers=4)
# Dataframe used for retrieving trajectories. wide_to_long instead of melt because can do melting per group of columns
df = pd.wide_to_long(df, stubnames=meas_var, i=[data.col_id, data.col_class], j='Time', sep='_', suffix=r'\d+')
df = df.reset_index()  # wide_to_long creates a multi-level Index, reset index to retrieve indexes in columns
df.rename(columns={data.col_id: 'ID', data.col_class: 'Class'}, inplace=True)
df['ID'] = df['ID'].astype('U32')
del data  # free memory

## Select trajectories from which to extract patterns

In [5]:
if mode_series_selection == 'least_correlated':
    set_trajectories = least_correlated_set(model, data_loader, threshold_confidence=thresh_confidence, device=device,
                                            n=n_series_perclass, labels_classes=classes_dict)
elif mode_series_selection == 'top_confidence':
    set_trajectories = top_confidence_perclass(model, data_loader, device=device, n=n_series_perclass,
                                               labels_classes=classes_dict)

# free some memory by keeping only relevant series
selected_trajectories = set_trajectories['ID']
df = df[df['ID'].isin(selected_trajectories)].copy()  # copy so that the column assignment below does not act on a view
# Make sure that class is an integer (especially when 0 or 1, could be read as boolean)
df['Class'] = df['Class'].astype('int32')

## Extract patterns

### Extract, extend and filter patterns

Outputs a report of how many patterns were filtered out by size.
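
The length filter and its report can be isolated as a small function. This is a sketch under assumptions: `filter_by_length` is a hypothetical helper, and the `{position: length}` dictionary with `(start, end)` keys is an assumed format inferred from how the patterns are filtered in this notebook.

```python
def filter_by_length(patterns, min_len, max_len):
    """Split a {position: length} dict into kept patterns and a small report."""
    report = {
        'total': len(patterns),
        'too_long': sum(1 for length in patterns.values() if length > max_len),
        'too_short': sum(1 for length in patterns.values() if length < min_len),
    }
    kept = {pos: length for pos, length in patterns.items()
            if min_len <= length <= max_len}
    return kept, report

# Made-up patterns: one too short, one acceptable, one too long.
patterns = {(0, 3): 3, (10, 60): 50, (100, 700): 600}
kept, report = filter_by_length(patterns, min_len=5, max_len=400)
print(kept, report)
```

Both bounds are inclusive, so a pattern whose length equals `min_len` or `max_len` is kept.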

In [6]:
store_patts = {i:[] for i in classes}
model.batch_size = 1
report_filter = {'Total number of patterns': 0,
                 'Number of patterns above maximum length': 0,
                 'Number of patterns below minimum length': 0}
pbar = tqdm(total=len(selected_trajectories))
for id_trajectory in selected_trajectories:
    series_numpy = np.array(df.loc[df['ID'] == id_trajectory][meas_var]).astype('float').squeeze()
    # Row: measurement; Col: time
    if len(meas_var) >= 2:
        series_numpy = series_numpy.transpose()
    series_tensor = torch.tensor(series_numpy)
    class_trajectory = df.loc[df['ID']==id_trajectory]['Class'].iloc[0]  # repeated value through all series
    class_label = classes[class_trajectory]
    cam = create_cam(model, array_series=series_tensor, feature_layer='features',
                         device=device, clip=0, target_class=class_trajectory)
    thresh = threshold_li(cam)
    bincam = np.where(cam >= thresh, 1, 0)
    bincam_ext = extend_segments(array=bincam, max_ext=extend_patt)
    patterns = longest_segments(array=bincam_ext, k=n_pattern_perseries)
    # Filter short/long patterns
    report_filter['Total number of patterns'] += len(patterns)
    report_filter['Number of patterns above maximum length'] += len([k for k in patterns.keys() if patterns[k] > max_len_patt])
    report_filter['Number of patterns below minimum length'] += len([k for k in patterns.keys() if patterns[k] < min_len_patt])
    patterns = {k: patterns[k] for k in patterns.keys() if (patterns[k] >= min_len_patt and
                                                            patterns[k] <= max_len_patt)}
    if len(patterns) > 0:
        for pattern_position in list(patterns.keys()):
            store_patts[class_label].append(extract_pattern(series_numpy, pattern_position, NA_fill=False))
    pbar.update(1)
pbar.close()  # release the progress bar once all trajectories are processed

print(report_filter)


{'Total number of patterns': 500, 'Number of patterns above maximum length': 22, 'Number of patterns below minimum length': 0}


### Dump patterns into csv
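
The export in this section stores one pattern per row, padded with NaN up to `max_len_patt` columns per channel, with the channels written side by side. The sketch below reproduces this layout on made-up data; `pad_patterns` is a hypothetical helper written for illustration and does not write any file.

```python
import numpy as np

def pad_patterns(patterns, n_channels, max_len):
    """Stack variable-length patterns into one NaN-padded 2D array.

    Each pattern has shape (n_channels, length). Channel j of a pattern is
    written starting at column j * max_len; the remaining cells stay NaN.
    """
    out = np.full((len(patterns), n_channels * max_len), np.nan)
    for i, patt in enumerate(patterns):
        patt = np.atleast_2d(patt)
        length = patt.shape[1]
        for j in range(n_channels):
            offset = j * max_len
            out[i, offset:offset + length] = patt[j]
    return out

# Two made-up bivariate patterns of length 2 and 3.
patterns = [np.array([[1.0, 2.0], [3.0, 4.0]]),
            np.array([[5.0, 6.0, 7.0], [8.0, 9.0, 0.0]])]
table = pad_patterns(patterns, n_channels=2, max_len=4)
print(table)
```

With this layout, the first `max_len` columns of a row always belong to the first channel, so a reader of the CSV can split each row back into channels and drop the trailing NaN values.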

In [7]:
if export_allPooled:
    concat_patts_allPooled = np.full((sum(map(len, store_patts.values())), len(meas_var) * max_len_patt), np.nan)
    irow = 0
for classe in classes:
    concat_patts = np.full((len(store_patts[classe]), len(meas_var) * max_len_patt), np.nan)
    for i, patt in enumerate(store_patts[classe]):
        if len(meas_var) == 1:
            len_patt = len(patt)
            concat_patts[i, 0:len_patt] = patt
        if len(meas_var) >= 2:
            len_patt = patt.shape[1]
            for j in range(len(meas_var)):
                offset = j*max_len_patt
                concat_patts[i, (0+offset):(len_patt+offset)] = patt[j, :]
    if len(meas_var) == 1:
        headers = ','.join([meas_var[0] + '_' + str(k) for k in range(max_len_patt)])
        fout_patt = '/home/marc/Dropbox/Work/TSclass_GF/Notebooks/output/' + meas_var[0] +'/local_patterns/patt_uncorr_{}.csv.gz'.format(classe)
        if export_perClass:
            np.savetxt(fout_patt, concat_patts,
                       delimiter=',', header=headers, comments='')
    elif len(meas_var) >= 2:
        headers = ','.join([meas + '_' + str(k) for meas in meas_var for k in range(max_len_patt)])
        fout_patt = '/home/marc/Dropbox/Work/TSclass_GF/Notebooks/output/' + '_'.join(meas_var) +'/local_patterns/patt_uncorr_{}.csv.gz'.format(classe)
        if export_perClass:
            np.savetxt(fout_patt, concat_patts,
                       delimiter=',', header=headers, comments='')
    if export_allPooled:
        concat_patts_allPooled[irow:(irow+concat_patts.shape[0]), :] = concat_patts
        irow += concat_patts.shape[0]

if export_allPooled:
    concat_patts_allPooled = pd.DataFrame(concat_patts_allPooled)
    concat_patts_allPooled.columns = headers.split(',')
    pattID_col = [[classe] * len(store_patts[classe]) for classe in classes]
    concat_patts_allPooled['pattID'] = [j+'_'+str(i) for i,j in enumerate(list(chain.from_iterable(pattID_col)))]
    concat_patts_allPooled.set_index('pattID', inplace = True)
    fout_patt = '/home/marc/Dropbox/Work/TSclass_GF/Notebooks/output/' + '_'.join(meas_var) + '/local_patterns/patt_uncorr_allPooled.csv.gz'
    concat_patts_allPooled.to_csv(fout_patt, header=True, index=True, compression='gzip')

### Build distance matrix between patterns with DTW

This is done in R with the DTW implementation of the *parallelDist* package. It is very efficient and supports multivariate cases.

See the next notebook.