# Annotation tool for time series data

By: Stefania Russo, Kris Villez
Copyright: 2018, distributed with BSD3 license 

# Usage
- Create a folder called "labels" in each case folder
- Set the working directory path path_all
- Select the Case (1 or 2)
- Select the data file name_of_file
- Run the cells
- After performing the annotation, close the plot and run the last cell, 'Save labelled data'
- Note: to change a selection, move forward to the next plot by clicking 'Next', select the anomalous data there, and then go back and restart the selection.

# Usage as .py script

- Create a folder called "labels" in each case folder
- Set the working directory path path_all
- Open a terminal and go to the folder containing the .py file
- In the terminal, run: python name_of_annotation_tool.py
- Select the Case and press Enter
- Enter the data file name name_of_file and press Enter
- After performing the annotation, close the plot.
- Restart
- Note: to change a selection, move forward to the next plot by clicking 'Next', select the anomalous data there, and then go back and restart the selection.


# Usage as .ipynb notebook
- Set notebook_type = 'ipynb'

## The challenge

In the context of the ADASen project, we want to address research questions regarding the utility of supervised and unsupervised machine learning models in anomaly detection for environmental systems. We have therefore selected a range of anomaly detection methods for benchmarking on data sets produced by six infrastructures at Eawag.

Critical to the benchmarking is the availability of fully labelled training and test data sets of normal and abnormal behavior in environmental data. 
An annotation tool has therefore been developed to perform the labelling procedure.

This notebook shows an application of the labelling procedure to time series data. Here, each time series is a univariate 24h signal from ......

Each series is visualised as a 24h time series.

## Current method

The steps for data access, data preparation, visualization and labelling are described below.

- The data is in the form of .csv data files. Each data file consists of many 24h sets across 2 sensors.

    - Check whether missing values are already replaced with NaNs
    - If not, replace missing values with NaNs
    - Decide whether to remove dates with missing values
    - Perform the annotation

- The labelling procedure starts and the first plots are displayed. The 2 plots at the top show the individual univariate sensor signals, while the bottom plot shows both signals together.

- The annotation tool allows the labelling expert to interactively select multiple portions of the time series by dragging the mouse cursor over the data.

- Each time the button 'Next' is clicked, all the selected areas (time index and sensor value) are saved together with the corresponding date stamp. At the end of the procedure, the user can easily access the anomaly labels.

- When the process is over, the plots need to be closed and then the cell 'Save labelled data' has to be run.

- Note: to change a selection, move forward to the next plot by clicking 'Next', select the anomalous data there, and then go back and restart the selection.
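
The span selection described above boils down to mapping the horizontal extent of the dragged region onto sample indices. The following is a minimal sketch, assuming a 1-based integer time axis with 8640 samples per day as used later in this notebook. The helper name `span_to_indices` is hypothetical and not part of the tool.

```python
import numpy as np

def span_to_indices(x, xmin, xmax):
    """Return slice bounds (start, stop) of the samples in sorted x lying between xmin and xmax."""
    # searchsorted gives the insertion positions of xmin and xmax in the sorted axis
    start, stop = np.searchsorted(x, (xmin, xmax))
    return int(start), int(stop)

# Integer time axis of one day: samples 1..8640 (one sample every 10 s)
x = np.arange(1, 8641)

# The selector reports float coordinates for the dragged region
start, stop = span_to_indices(x, 100.4, 105.7)
selected = x[start:stop]
print(selected)
```

The selected slice can then be paired with the sensor values and the date of the displayed day, which is what the selection callbacks further down do.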


 
# NOTE: 
## This is an alpha version of the labelling tool! Please provide any feedback




# Initialization

In [1]:
# Import Statements

import os
import numpy as np
import pandas as pd
import csv
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.pylab as pl
import matplotlib.gridspec as gridspec
from matplotlib.widgets import Button
from matplotlib.widgets import SpanSelector
import itertools
import datetime as datetime
from datetime import timedelta


In [4]:
notebook_type = 'ipynb'    #ipynb   #py

In [16]:
case_from_terminal = input('Please select Case (1: GAK (p3,p4)  2: Pressure T1 T2 (p1,p2)):  ')
text_from_terminal = input("Please enter the file name: ")  # Python 3

Please select Case (1: GAK (p3,p4)  2: Pressure T1 T2 (p1,p2)):  1
Please enter the file name: 06_June 2018


### Options

In [17]:
path_all = ('/Users/russoste/Desktop/my_git_repos/00_Data/04_Nest/data_raw/')
save_path = path_all     # Destination folder for labelled data

In [18]:
# Label month by month

# Select Case
# Case = 2     # case 1: GAK (p3,p4),       case 2: Pressure T1 T2 (p1,p2)

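if case_from_terminal == '1':
    folder = 'data_pressure_sensor_GAK/'
elif case_from_terminal == '2':
    folder = 'data_pressure_sensors_T1_T2/'
else:
    # Fail early: otherwise folder would be undefined further down
    raise ValueError("Case must be '1' or '2', got: " + repr(case_from_terminal))

Case = int(case_from_terminal)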
    
name_of_file = text_from_terminal

In [19]:
name_of_file1 = name_of_file + ".csv"
name_of_file_l1 =  folder + 'labels/' + name_of_file + '_labels_'
name_of_file_l1_time = folder + 'labels/' + name_of_file + '_labels_time'

completePath = os.path.join(path_all, folder, name_of_file1) 

In [20]:
print ('Now working with : ', folder, ' file: ', name_of_file)

Now working with :  data_pressure_sensor_GAK/  file:  06_June 2018


# Load data and basic sanity checks

In [21]:
# Load data
df = pd.read_csv(completePath)
df.head()

In [23]:
df2 = df.copy(deep=True)

sr0 = df2.keys()[2]
sr1 = df2.keys()[3]
print('Sensor names:',sr0,',', sr1)

# if (df2.shape[0]%8640 !=0):
    # print('Missing values')

Sensor names: p3 , p4


In [24]:
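# Combine the day and hour columns into a single timestamp column
df2['Datetime'] = df2['day'] + ' ' + df2['hour']
df2['Datetime_'] = pd.to_datetime(df2['Datetime'], format='%d.%m.%Y %H:%M:%S')

# Average onto a regular 10-second grid; empty bins become NaN.
# Note: the base argument is deprecated in recent pandas releases (see offset/origin).
df3 = df2.resample('10S', on='Datetime_', base=10).mean()

# 8640 = number of 10-second samples in 24 h
if (df3.shape[0]%8640 !=0):
    print('Missing values')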
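
The modulo check above only reports that the number of rows is not a multiple of a full day; it does not say which days are affected. The following is a minimal sketch of a per-day check, assuming a regular 10-second grid and toy data invented for illustration. `incomplete_days` is a hypothetical helper, not part of the tool.

```python
import numpy as np
import pandas as pd

SAMPLES_PER_DAY = 8640  # 24 h at one sample every 10 s

def incomplete_days(series):
    """Return the count of non-NaN samples for every day that is not fully populated."""
    counts = series.groupby(series.index.date).count()  # count() ignores NaN
    return counts[counts != SAMPLES_PER_DAY]

# Toy data: two full days on a 10-second grid, with 3 samples of the second day missing
index = pd.date_range("2018-06-01", periods=2 * SAMPLES_PER_DAY,
                      freq=pd.Timedelta(seconds=10))
values = pd.Series(np.ones(len(index)), index=index)
values.iloc[SAMPLES_PER_DAY + 10:SAMPLES_PER_DAY + 13] = np.nan

bad = incomplete_days(values)
print(bad)
```

Days listed in the result could then be inspected or excluded before annotation, in line with the decision step mentioned in the method description.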

In [25]:
df3 = df3.reset_index()

In [26]:
df3['day'] = [x.date() for x in df3['Datetime_']] 
df3['time'] = [x.time() for x in df3['Datetime_']] 

In [27]:
df4 = df3.copy(deep=True)
df4.set_index(['day','time'], inplace=True)

df_bf_00 = df4[sr0]
df_bf_01 = df4[sr1]
df_bf_02 = df4[sr0] - df4[sr1]

df4.drop(columns ='Datetime_', inplace=True)
df2 = df4.copy(deep=True)

## Basic sanity checks

In [28]:
# Accessing dates
i_date = df2.index.get_level_values(0)                                      # get all dates
idx_date = np.unique(df2.index.get_level_values(0), return_index=True)[1]      # get index of unique dates
date_list = i_date[idx_date]   # get list of all dates
# print('Unique dates:',date_list)

# for pl_i in range(len(date_list)):
#     if len(df_bf_00[date_list[pl_i].date()].values) != 8640:
#         print('Corrupted date:', date_list[pl_i].date())
#         print('Corrupted date index:',pl_i)
#         print('Corrupted date shape:', df2.loc[date_list[pl_i].date()].shape)  
# print ('Number of days:', df2.shape[0]/8640)

In [29]:
# Dates and times
data_df2 = df2.copy()

data_time = []
for pl_i in idx_date:                             # create data_time indices to have access later
    time = data_df2.loc[i_date[pl_i]].index
    data_time.append(time)                        # associated to every date segment
    
time_int = [np.linspace(1, 8640, num = 8640, dtype=int) for _ in range(len(date_list))]

# Plotting

In [30]:
#get_ipython().run_line_magic('matplotlib', 'tk')

if notebook_type == 'ipynb':
    %matplotlib tk

if notebook_type == 'py':
    import matplotlib as mpl
    mpl.use('Qt5Agg')

data1 = []
data2 = []
data123 = []

itera = date_list



# ########################################
# # added

# import matplotlib as mpl
# mpl.use('Qt5Agg')
# #######################################

gs = gridspec.GridSpec(2, 2)

fig = plt.figure()
#plt.axis([0, 24, -3, 100])



ax1 = fig.add_subplot(gs[0, 0]) # row 0, col 0
ax2 = fig.add_subplot(gs[0, 1]) # row 0, col 1
ax4 = fig.add_subplot(gs[1, :]) # row 1, span all columns

ax1.set_title(sr0, fontdict=None, pad=None)
ax2.set_title(sr1, fontdict=None, pad=None)
full = sr0 + ' '+ sr1
ax4.set_title(full, fontdict=None, pad=None)

fig.suptitle(str(date_list[0].date()), fontsize=12)


if Case == 1:
    ax1.set_ylim([-3,100])
    ax2.set_ylim([-3,100])
    ax4.set_ylim([-3,100])
    
if Case == 2:
    ax1.set_ylim([40,150])
    ax2.set_ylim([40,150])
    ax4.set_ylim([40,150])

for pl_i in range(len(date_list)): 
    ax1.plot(time_int[pl_i], df_bf_00[date_list[pl_i].date()].values, '#C0C0C0', lw=2)
    ax2.plot(time_int[pl_i], df_bf_01[date_list[pl_i].date()].values, '#C0C0C0', lw=2)
    
l, = ax1.plot(time_int[0], df_bf_00[date_list[0].date()].values, '#1E90FF', lw=2)     #the first one is the one in blue
l2, = ax2.plot(time_int[0], df_bf_01[date_list[0].date()].values, '#8B008B')


###########################
if Case == 1:
    ll1, = ax4.plot(time_int[0], df_bf_00[date_list[0].date()].values, '#C0C0C0')
    ll2, = ax4.plot(time_int[0], df_bf_01[date_list[0].date()].values, '#C0C0C0')
    ll3, = ax4.plot(time_int[0], df_bf_02[date_list[0].date()].values, '#31a354')   # add difference plot

if Case == 2:
    ll1, = ax4.plot(time_int[0], df_bf_00[date_list[0].date()].values, '#1E90FF')
    ll2, = ax4.plot(time_int[0], df_bf_01[date_list[0].date()].values, '#8B008B')

############### Buttons widget  ####################

class Index(object):
    """Keeps track of the displayed day and redraws the plots on 'Next'/'Previous'."""
    ind = 0

    def _redraw(self):
        # Wrap around so that navigation cycles through all days
        i = self.ind % len(itera)

        ydata1 = df_bf_00[date_list[i].date()].values
        ydata2 = df_bf_01[date_list[i].date()].values
        xdata = time_int[i]

        # Highlighted curves in the two top plots
        l.set_ydata(ydata1)
        l.set_xdata(xdata)
        l2.set_ydata(ydata2)
        l2.set_xdata(xdata)

        # Bottom plot: both sensors, plus their difference in Case 1
        ll1.set_ydata(ydata1)
        ll2.set_ydata(ydata2)
        ll1.set_xdata(xdata)
        ll2.set_xdata(xdata)
        if Case == 1:
            ydata3 = df_bf_02[date_list[i].date()].values
            ll3.set_ydata(ydata3)
            ll3.set_xdata(xdata)

        if i == 0:
            fig.suptitle('End of data files - restarting with data file ' + str(date_list[i].date()), fontsize=12)
        else:
            fig.suptitle(str(date_list[i].date()), fontsize=12)

        plt.draw()

    def next(self, event):
        self.ind += 1
        self._redraw()

    def prev(self, event):
        self.ind -= 1
        self._redraw()

callback = Index()

axprev = plt.axes([0.7, 0.05, 0.1, 0.075])
axnext = plt.axes([0.81, 0.05, 0.1, 0.075])
bnext = Button(axnext, 'Next')
bnext.on_clicked(callback.next)

bprev = Button(axprev, 'Previous')
bprev.on_clicked(callback.prev)

"""
valore = '11'
def presskey(event):
    print('Pressed key = ', event.key)
    #sys.stdout.flush()    
    global valore 
    valore = event.key       
    return valore
"""

def onselect1(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y = df_bf_00[date_list[callback.ind % len(itera)].date()].values 
    today = date_list[callback.ind % len(itera)]
   
    indmin1, indmax1 = np.searchsorted(x, (xmin, xmax))
    indmax1 = min(len(x) - 1, indmax1)
    thisx = x[indmin1:indmax1]
    thisy = y[indmin1:indmax1]    
    nplist = np.array([today.date() for i in range(len(thisx))])
        
    a1 = np.c_[nplist, thisx, thisy]
    global data1
    data1.extend(a1)
    #np.savetxt(completeName_label_1, data1)

        

def onselect2(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y = df_bf_01[date_list[callback.ind % len(itera)].date()].values 
    today = date_list[callback.ind % len(itera)]
    
    indmin, indmax = np.searchsorted(x, (xmin, xmax))
    indmax = min(len(x) - 1, indmax)
    thisx = x[indmin:indmax]
    thisy = y[indmin:indmax]
    nplist = np.array([today.date() for i in range(len(thisx))])
    
    a2 = np.c_[nplist, thisx, thisy]
    global data2
    data2.extend(a2)
    

def onselect4(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y1 = df_bf_00[date_list[callback.ind % len(itera)].date()].values 
    y2 = df_bf_01[date_list[callback.ind % len(itera)].date()].values
    today = date_list[callback.ind % len(itera)]
    
    indmin, indmax = np.searchsorted(x, (xmin, xmax))
    indmax = min(len(x) - 1, indmax)
    
    thisx = x[indmin:indmax]
    thisy1 = y1[indmin:indmax]
    thisy2 = y2[indmin:indmax]
    nplist = np.array([today.date() for i in range(len(thisx))])
        
    # save
    a123 = np.c_[nplist, thisx, thisy1, thisy2]
    global data123
    data123.extend(a123)

    
"""
# Connect key event to figure
fig.canvas.mpl_connect('key_press_event',presskey)
"""

#class1 = Onselect_1()

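# Keep references to the selectors so that they are not garbage-collected.
# Note: in Matplotlib 3.5 and later, rectprops and span_stays were renamed
# to props and interactive, respectively.
spans1 = SpanSelector(ax1, onselect1, 'horizontal', useblit=False,
                      rectprops=dict(alpha=0.5, facecolor='red'), span_stays=True)
span2 = SpanSelector(ax2, onselect2, 'horizontal', useblit=True,
                    rectprops=dict(alpha=0.5, facecolor='red'), span_stays=True )
span4 = SpanSelector(ax4, onselect4, 'horizontal', useblit=True,
                    rectprops=dict(alpha=0.5, facecolor='red') , span_stays=True)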


########################################
if notebook_type == 'py':
    # added
    plt.show()
########################################


## Save labels 

In [102]:
data1 = pd.DataFrame(data1, columns=['day','time_m', sr0])
data2 = pd.DataFrame(data2, columns=['day','time_m', sr1])
data123 = pd.DataFrame(data123, columns=['day','time_m', sr0, sr1])

data1.to_csv(os.path.join(save_path,name_of_file_l1+sr0 + ".csv") )
data2.to_csv(os.path.join(save_path,name_of_file_l1+sr1 + ".csv") )
data123.to_csv(os.path.join(save_path,name_of_file_l1+sr0+sr1+ ".csv") )

### Back to time labels


In [103]:
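def rem_dup(datal):
    """
    For each day, keep only the last contiguous block of selected rows,
    so that a later re-selection of a day overrides an earlier one.
    """
    if (len(datal))==0:
        return datal.copy(deep=True)
    list_org = [str(i) for i in datal['day'].values]
    states = [list_org[-1]]     # days already encountered (scanning backwards)
    index_keep =[True]          # keep/drop decisions, built from last row to first

    for i in range(len(list_org)-2,-1,-1):
        # A day not seen yet: this is its last block, keep it
        if list_org[i]!=list_org[i+1] and list_org[i] not in states:
            states.extend([list_org[i]])
            index_keep.append(True) 

        # Row of an earlier block of a day that was re-selected later: drop it
        elif list_org[i]!=list_org[i+1] and list_org[i] in states:
            index_keep.append(False)

        # Same block as the following row, which was dropped: drop as well
        elif list_org[i]==list_org[i+1] and index_keep[len(list_org)-2-i]==False:   # check
            index_keep.append(False)

        # Same block as the following row, which was kept: keep as well
        elif list_org[i]==list_org[i+1] and list_org[i] in states:
            index_keep.append(True)
            
    index_keep.reverse()
    #index_keep = rem_dup(datal)
    datall = datal[index_keep]
    return datall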
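
As a reading aid, the rule that rem_dup appears to implement (for each day, keep only the most recent contiguous block of selected rows) can be restated on plain lists. This is an illustrative sketch with the hypothetical name `keep_last_run`, not the tool's own code, and the day labels are invented.

```python
from itertools import groupby

def keep_last_run(days):
    """Return a boolean mask that keeps, for each day, only its last contiguous run."""
    # Split the sequence into contiguous runs: [(day, run_length), ...]
    runs = [(day, len(list(group))) for day, group in groupby(days)]

    seen = set()
    mask_reversed = []
    # Walk the runs from last to first: the first time a day is met is its latest run
    for day, length in reversed(runs):
        keep = day not in seen
        seen.add(day)
        mask_reversed.extend([keep] * length)
    return mask_reversed[::-1]

# Day "d1" was selected, then "d2", then "d1" was selected again
days = ["d1", "d1", "d2", "d1", "d1", "d1"]
print(keep_last_run(days))
```

In this toy sequence the first two rows belong to the earlier selection of "d1" and are dropped, while the later block of "d1" and the single row of "d2" are kept.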



In [104]:
pd.set_option('mode.chained_assignment', None)

In [105]:
data1_l = rem_dup(data1)
data2_l = rem_dup(data2)
data123_l = rem_dup(data123)

In [106]:
data_df = data_df2.reset_index()

data_df_s1 = data_df.drop([sr1], axis=1)
data_df_s2 = data_df.drop([sr0], axis=1)
data_df_s3 = data_df.copy()

data_df_s1['time_m'] = np.reshape(time_int, 8640*len(time_int))
data_df_s2['time_m'] = np.reshape(time_int, 8640*len(time_int))
data_df_s3['time_m'] = np.reshape(time_int, 8640*len(time_int))

data_df_s1['day'] = [x.date() for x in data_df_s1['day']] 
data_df_s2['day'] = [x.date() for x in data_df_s2['day']] 
data_df_s3['day'] = [x.date() for x in data_df_s3['day']] 

data1_l.drop(columns = sr0, inplace=True)
data2_l.drop(columns = sr1, inplace=True)
data123_l.drop(columns =[sr0, sr1], inplace=True)

labels_df_1 = pd.merge(data_df_s1, data1_l, on = ['day', 'time_m'], how='left', indicator=True)
labels_df_2 = pd.merge(data_df_s2, data2_l, on = ['day', 'time_m'], how='left', indicator=True)
labels_df_123 = pd.merge(data_df_s3, data123_l, on = ['day', 'time_m'], how='left', indicator=True)



In [107]:
# ADD zeros and ones with dictionary mapping

mapper_dict = {'left_only': 0, 'both': 1}

def mp(entry):
    """
    maps new values
    """
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)


In [108]:
labels_df_1 ['_merge'] = mp(labels_df_1['_merge'])
labels_df_2 ['_merge'] = mp(labels_df_2['_merge'])
labels_df_123 ['_merge'] = mp(labels_df_123['_merge'])

labels_df_1 = labels_df_1.rename(index=str, columns={"_merge": "Anomaly"})
labels_df_2 = labels_df_2.rename(index=str, columns={"_merge": "Anomaly"})
labels_df_123 = labels_df_123.rename(index=str, columns={"_merge": "Anomaly"})
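
The merge-and-map steps above can be condensed on toy data. The following is a minimal sketch, assuming the same day and time_m key columns; the frame contents are invented for illustration, and the boolean comparison is a swapped-in alternative to the dictionary mapping used in the notebook.

```python
import pandas as pd

# Full data for one (shortened) day and the rows selected as anomalous
full = pd.DataFrame({
    "day": ["2018-06-01"] * 4,
    "time_m": [1, 2, 3, 4],
    "p3": [13.2, 13.1, 16.2, 13.0],
})
selected = pd.DataFrame({"day": ["2018-06-01"] * 2, "time_m": [2, 3]})

# A left merge with indicator=True marks each row as 'both' or 'left_only'
merged = pd.merge(full, selected, on=["day", "time_m"], how="left", indicator=True)

# 'both' means the row was selected, so it becomes Anomaly = 1
merged["Anomaly"] = (merged["_merge"] == "both").astype(int)
labels = merged.drop(columns=["_merge", "time_m"])
print(labels)
```

The result has one row per sample of the original data, with a 0/1 Anomaly column, which matches the layout of the saved label files.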


In [109]:
labels_df_1.drop(['time_m'], axis=1, inplace=True)
labels_df_2.drop(['time_m'], axis=1, inplace=True)
labels_df_123.drop(['time_m'], axis=1, inplace=True)

In [110]:
labels_df_1.to_csv(os.path.join(save_path, name_of_file_l1_time+sr0 + ".csv") )
labels_df_2.to_csv(os.path.join(save_path, name_of_file_l1_time+sr1 + ".csv") )
labels_df_123.to_csv(os.path.join(save_path, name_of_file_l1_time+sr0+sr1 + ".csv") )  

In [98]:
labels_df_2

Unnamed: 0,day,time,p4,Anomaly
0,2018-09-01,00:00:00,13.2,0
1,2018-09-01,00:00:10,13.2,0
2,2018-09-01,00:00:20,13.2,0
3,2018-09-01,00:00:30,13.1,0
4,2018-09-01,00:00:40,13.1,0
5,2018-09-01,00:00:50,13.1,0
6,2018-09-01,00:01:00,13.1,0
7,2018-09-01,00:01:10,13.0,0
8,2018-09-01,00:01:20,13.7,0
9,2018-09-01,00:01:30,16.2,0
