# Annotation tool for scatter plots and image data

By: Stefania Russo, Kris Villez
Copyright: 2018, distributed with BSD3 license 

## The challenge

In the context of the ADASen project, we want to address research questions regarding the utility of supervised and unsupervised machine learning models in anomaly detection for environmental systems. We have therefore selected a range of anomaly detection methods for benchmarking on data sets produced by six infrastructures at Eawag.

Critical to the benchmarking is the availability of fully labelled training and test data sets covering normal and abnormal behavior in environmental data. 
An annotation tool has therefore been developed to perform the labelling procedure.

This notebook applies the labelling procedure to flow cytometry data. Here, each sample consists of a series of coordinate measurements taken at a regular time interval. Each coordinate measurement represents a particle sample.

## Current method

The steps for data access, data preparation, visualization and labelling are described below.

- It is recommended to organise the data into folders.
- All available data files are loaded from the working directory. The data are stored as .fcs files, spread over several folders. During loading, the tool keeps track of the folder containing the data and the file name associated with each flow cytometry data set.
- Each data file contains several samples. Of the available features, only 'FL1-A' and 'FL3-A' are selected, and a logarithmic transformation is applied to both.
- 'FL3-A' vs 'FL1-A' and 'FL3-A' vs Time are visualised.
- The labelling procedure starts and the first plots are displayed. Each anomaly type is associated with a number. The labelling expert selects an anomaly type interactively by pressing the corresponding number key, and moves through the data files by clicking the 'Next' or 'Previous' button in the plot.
- Each time 'Next' is clicked, the selected anomaly type is saved together with the folder name and the file name of the data set that was displayed. At the end of the procedure, all anomaly labels can be read from a single text file.
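
The label file written by the tool holds one line per click on 'Next', in the form `folder/file_name Anomaly_type: key`. The following is a minimal sketch of reading such a file back into a pandas DataFrame, assuming exactly that line format. The function name `read_labels` and the column names are illustrative and not part of the tool.

```python
import pandas as pd


def read_labels(path):
    """Parse the label file into a DataFrame (hypothetical helper).

    Assumes each line looks like 'folder/file_name Anomaly_type: key'.
    """
    rows = []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip('\n')
            if not line:
                continue                      # skip empty lines
            # Split off the label first, then separate folder and file name
            location, _, key = line.rpartition(' Anomaly_type: ')
            folder, _, name = location.rpartition('/')
            rows.append({'Folder': folder, 'File_name': name, 'Anomaly_type': key})
    return pd.DataFrame(rows, columns=['Folder', 'File_name', 'Anomaly_type'])
```

Because labels are appended on every click, a data file can appear more than once. Calling `drop_duplicates(subset=['Folder', 'File_name'], keep='last')` on the result keeps only the most recent label per file.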






# Usage

 - Install Python and the packages imported below, then open this Jupyter notebook
 - Set path_all to the path of your working directory (include the trailing '/')
 - Choose between randomised and temporal (sorted) ordering of the plots by setting Randomised = 'Yes' or Randomised = 'No'


### Import statements

In [1]:
import sys
import os
print (sys.version)

import numpy as np
import pandas as pd
from random import shuffle

import matplotlib
from matplotlib import pyplot
import matplotlib.pyplot as plt
from matplotlib.widgets import Button
from matplotlib.widgets import SpanSelector
import matplotlib.pylab as pl
import matplotlib.gridspec as gridspec

import itertools
from sklearn import preprocessing
import seaborn as sns
from numpy import inf

import fcsparser




3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


## Options 

In [2]:
path_all = ('/Users/arbeit/Desktop/ADASenDataAnnotation/003_Grey_Water_A_Data/')

#path_all = ("/Volumes/DOFCM_RTFCM_archive$/mbesmer/dofcm/tap_water/")

Randomised = 'No'   #Randomised = 'Yes'

## Functions

A log-transformation is typically applied to analyze these data. Since the data commonly contain zeros, we apply the transformation $$y \leftarrow \log(|x|+1) \cdot \operatorname{sign}(x)$$ which maps zero to zero and preserves the sign of each value.

In [3]:
def softsign(x):
    out = np.log(np.abs(x)+1)*np.sign(x)
    return out
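
As a small worked example of the transformation above, the sketch below applies it to a few values, including zero and negative numbers. The function `inverse_softsign` is a hypothetical addition for illustration and is not part of the notebook.

```python
import numpy as np


def softsign(x):
    # Same transformation as above: log(|x| + 1) * sign(x)
    return np.log(np.abs(x) + 1) * np.sign(x)


def inverse_softsign(y):
    # Hypothetical inverse, not part of the notebook: sign(y) * (exp(|y|) - 1)
    return np.sign(y) * np.expm1(np.abs(y))


x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
y = softsign(x)
print(y)                        # zero stays zero, the sign is preserved
print(inverse_softsign(y))      # recovers x up to floating-point error
```

The transformation is antisymmetric, so positive and negative values of the same magnitude end up at the same distance from zero.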

# Loading data

### Variable names

* path_all contains the path to the working directory that holds the data folders. The labels file is also saved in this directory. The file is never overwritten: new labels are always appended
* path_dir_all is a list that contains the path to each data folder
* fold_paths is a list that contains the name of each data folder
* fold is a list that contains, for each data file, the path to the folder it was loaded from (its length equals the number of data files)
* fold_names is a list that contains, for each data file, the name of the folder it was loaded from (its length equals the number of data files)
* data_fcs is a list that contains the data of all files, one DataFrame per file
* file_name is a list of all data file names


In [4]:
# Get all the folders in path

save_path = path_all     # Destination folder for saving the text file
name_of_file = "Labels"
completeName = os.path.join(save_path, name_of_file+".txt") 


path_dir_all = []
fold_paths = []
for name in os.listdir(path_all):
    tmp = os.path.join(path_all, name)   # works with or without a trailing '/'
    if os.path.isdir(tmp):
        path_dir_all.append(tmp)
        fold_paths.append(name)

#print(fold_paths)

### Data loading: non-randomised or randomised case


In [5]:
fold = []
data_fcs = []
file_name = []


if Randomised == 'No':
    print("Non randomised case")
else:
    print("Randomised case")

for path in path_dir_all:                        # loop over the data folders
    # Collect the .fcs files in this folder
    file_name_tmp = [f for f in os.listdir(path) if f.endswith(".fcs")]
    if Randomised == 'No':
        file_name_tmp = sorted(file_name_tmp)    # sorted by file name
    else:
        shuffle(file_name_tmp)                   # random order within the folder

    for fn in file_name_tmp:
        new_path = os.path.join(path, fn)
        meta, data = fcsparser.parse(new_path, reformat_meta=True)
        data.columns = data.columns.astype(str)
        data_fcs.append(data)
        file_name.append(fn)
    fold.extend([path] * len(file_name_tmp))     # one folder entry per data file

###########################################     
    


Non randomised case
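
The file discovery step can be tried without any .fcs parsing. The following is a minimal sketch, assuming the same layout as above (data folders one level below the working directory). The helper `collect_fcs_files` and its `seed` parameter are hypothetical. Unlike the cell above, it also sorts the folders so that the order is reproducible.

```python
import os
import random


def collect_fcs_files(root, randomised=False, seed=None):
    """Return (folder_path, file_name) pairs for every .fcs file one level below root.

    Hypothetical helper: it only lists files and does not parse them.
    """
    rng = random.Random(seed)
    pairs = []
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue                    # ignore loose files in the root
        files = sorted(f for f in os.listdir(folder) if f.endswith('.fcs'))
        if randomised:
            rng.shuffle(files)          # random order within each folder
        pairs.extend((folder, f) for f in files)
    return pairs
```

Passing a fixed `seed` makes the randomised order repeatable, which can help when a labelling session has to be resumed.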


In [6]:
print('Variables:', pd.DataFrame(data_fcs[0]).keys())
sr0 = pd.DataFrame(data_fcs[0]).keys()[2]
sr1 = pd.DataFrame(data_fcs[0]).keys()[4]
print('Selected variables:',sr0,',', sr1)

Variables: Index(['FSC-A', 'SSC-A', 'FITC-A', 'PE-A', 'PerCP-A', 'APC-A', 'FSC-H',
       'SSC-H', 'FITC-H', 'PE-H', 'PerCP-H', 'APC-H', 'Width', 'Time'],
      dtype='object')
Selected variables: FITC-A , PerCP-A
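
The cell above selects the two channels by column position (indices 2 and 4), which in this data set yields 'FITC-A' and 'PerCP-A'. A minimal sketch of a name-based selection is shown below. The helper `pick_channel` is hypothetical, and treating 'FL1-A'/'FITC-A' and 'FL3-A'/'PerCP-A' as alternative names for the same channels is an assumption based on the positions used in this notebook.

```python
def pick_channel(columns, candidates):
    """Return the first name in `candidates` that occurs in `columns` (hypothetical helper)."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError('None of %s found in %s' % (list(candidates), list(columns)))


# Column names copied from the output above (shortened)
columns = ['FSC-A', 'SSC-A', 'FITC-A', 'PE-A', 'PerCP-A', 'APC-A', 'Time']

# Assumption: these name pairs refer to the same channels on different instruments
sr0 = pick_channel(columns, ['FL1-A', 'FITC-A'])
sr1 = pick_channel(columns, ['FL3-A', 'PerCP-A'])
print('Selected variables:', sr0, ',', sr1)
```

Selecting by name fails loudly when a file lacks the expected channels, whereas positional selection would silently pick a different variable.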


In [7]:
# Strip the file extension from each file name (dots inside the name are kept)
file_name = [os.path.splitext(fn)[0] for fn in file_name]

# Get folder names (last component of each folder path)
fold_names = [os.path.basename(f) for f in fold]
    

#data_all = data_fcs
for j in range(len(data_fcs)):
    data_fcs[j][sr0] = softsign(data_fcs[j][sr0])
    data_fcs[j][sr1] = softsign(data_fcs[j][sr1])
                        
data_strct = [fold, file_name]
#print(data_strct)
if len(fold) != len(data_fcs):
    print('Error: the number of folder entries does not match the number of data files')
    
# This is not used at the moment
data_strct_df = pd.DataFrame(
    {'Folder': fold_names,
     'File_name': file_name
    })  

#print(file_name)

# Label plots sequentially


In [8]:
%matplotlib tk

data_all = data_fcs 

itera = list(range(len(data_all)))   # indices of the data files

x = data_all[0][sr0]
y = data_all[0][sr1]
time = data_all[0]['Time']

fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(15,15),gridspec_kw = {'width_ratios':[1, 1]})

fig.suptitle(fold_names[0], fontsize=12)

plt.subplots_adjust(bottom=0.2)
plt.subplots_adjust(left=0.04, bottom=0.2, right=0.98, top=0.87, wspace=0.2 , hspace=0.17 )


# Change box color here https://htmlcolorcodes.com/

l, = ax[0].plot(x, y, alpha=0.1, color='black', linestyle='None', marker='.')           # first plot
ax[0].set_title(file_name[0], fontdict=None, loc='center', pad=None)
ax[0].text(11, 0.5, 'Click a number on your keyboard ' '\n' 'to select between the following ' '\n' 'anomaly types, then click Next ' '\n' '\n' '0: Normal behaviour' '\n' '1: Anomaly: Oxidation' '\n' '2: Anomaly: Poor Staining' '\n' '3: Anomaly: Poor Cleaning' '\n' '4: Anomaly: Air' '\n' '5: Anomaly: Fluidic Problem' '\n' '6: Anomaly: PI Contamination' '\n''7: Anomaly: Unsure' '\n''8: Other anomaly type' '\n', bbox=dict(facecolor='#F1DFE2', alpha=1))
#ax[0].set_aspect('equal', adjustable='box', share=True)
ax[0].set_xlim([5, 14])
ax[0].set_ylim([0, 13])

l2, = ax[1].plot(time, y, alpha=0.1, color='black', linestyle='None', marker='.')           # first plot
ax[1].set_title(file_name[0], fontdict=None, loc='center', pad=None)

############### Presskey widget  ####################

pr_key = 0

def presskey(event):
    print('Pressed key = ', event.key)
    #sys.stdout.flush()
    
    global pr_key
    pr_key = event.key
    pr_key_str = str( event.key)
    pr_key_str_upd = 'Anomaly type = ' + pr_key_str
    ax[0].text(5.5, 12, pr_key_str_upd , bbox=dict(facecolor='#F1DFE2', alpha=1))
    plt.draw()

 
fig.canvas.mpl_connect('key_press_event',presskey)

############### Buttons widget  ####################

class Index(object):        
    ind = 0

    def next(self, event):
        self.ind += 1
        i = self.ind % len(itera)    # modulo wraps around to the first file after the last one
        ydata1 = data_all[i][sr1]          
        xdata1 = data_all[i][sr0] 
        
        timedata1 = data_all[i]['Time']  
        
        
        l.set_ydata(ydata1)
        l.set_xdata(xdata1)
        l2.set_xdata(timedata1)
        l2.set_ydata(ydata1)
        
        ax[0].set_title(file_name[i], fontdict=None, loc='center', pad=None)
        ax[1].set_title(file_name[i], fontdict=None, loc='center', pad=None)
        
        if (i == (0)):
            fig.suptitle('End of data files - restarting with data file ' + file_name[i] + '\n' + 'in folder ' + fold_names[i], fontsize=12)
        else: 
            fig.suptitle(fold_names[i], fontsize=12)
        plt.draw()
        
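        # Append the label of the file that was displayed before 'Next' was clicked (index i-1)
        with open(completeName, "a") as myfile:
            myfile.write(fold_names[i-1] + '/' +  file_name[i-1] + ' ' + 'Anomaly_type:' + ' ' + str(pr_key) +'\n')   
            # the label file is saved in path_all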

            
    def prev(self, event):
        self.ind -= 1
        i = self.ind % len(itera)
        
        ydata1 = data_all[i][sr1]          
        xdata1 = data_all[i][sr0]  
        timedata1 = data_all[i]['Time']
        folder_name = fold[i]
        
        l.set_ydata(ydata1)
        l.set_xdata(xdata1)
        l2.set_xdata(timedata1)
        l2.set_ydata(ydata1)
        
        
        #ax[0].set_title(fold[i] + ' ' +  file_name[i], fontdict=None, loc='center', pad=None)
        ax[0].set_title(file_name[i], fontdict=None, loc='center', pad=None)
        ax[1].set_title(file_name[i], fontdict=None, loc='center', pad=None)
        fig.suptitle(fold_names[i], fontsize=12)
        plt.draw()

callback = Index()

############### Connect events  ####################

axprev = plt.axes([0.7, 0.05, 0.1, 0.075])
axnext = plt.axes([0.81, 0.05, 0.1, 0.075])
bnext = Button(axnext, 'Next')
bnext.on_clicked(callback.next)
bprev = Button(axprev, 'Previous')
bprev.on_clicked(callback.prev)


0

Pressed key =  9
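
The recorded output above shows a key press of 9, which is not among the codes 0 to 8 listed in the plot legend. The tool as written stores whatever key was pressed last. A minimal sketch of how a key could be checked against the known anomaly types before it is stored is given below. The names `ANOMALY_TYPES` and `key_to_label` are hypothetical, and the labels are copied from the legend text.

```python
# Anomaly codes as listed in the legend of the first plot
ANOMALY_TYPES = {
    '0': 'Normal behaviour',
    '1': 'Anomaly: Oxidation',
    '2': 'Anomaly: Poor Staining',
    '3': 'Anomaly: Poor Cleaning',
    '4': 'Anomaly: Air',
    '5': 'Anomaly: Fluidic Problem',
    '6': 'Anomaly: PI Contamination',
    '7': 'Anomaly: Unsure',
    '8': 'Other anomaly type',
}


def key_to_label(key):
    """Return the anomaly type for a pressed key, or None if the key is not a valid code."""
    return ANOMALY_TYPES.get(str(key))


for key in ['1', '9', 'a']:
    label = key_to_label(key)
    if label is None:
        print('Ignored key:', key)
    else:
        print('Anomaly type =', key, '(' + label + ')')
```

Inside a key press handler, the same check could be used to leave the stored anomaly type unchanged when the key is not a valid code.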
