# Annotation tool for time series data

By: Stefania Russo, Kris Villez
Copyright: 2018, distributed with BSD3 license 

## The challenge

In the context of the ADASen project, we want to address research questions regarding the utility of supervised and unsupervised machine learning models in anomaly detection for environmental systems. We have therefore selected a range of anomaly detection methods for benchmarking on data sets produced by six infrastructures at Eawag.

Critical to the benchmarking is the availability of fully labelled training and test data sets of normal and abnormal behavior in environmental data. 
An annotation tool has therefore been developed to perform the labelling procedure.

This notebook shows an application of the labelling procedure to time series data. Here, each time series is a univariate 24h signal from a spatially-distributed, low-power sensor network.

Each series is visualised as a 3am-3am time series.

## Current method

The steps for data access, data preparation, visualization and labelling are described below.

- The data are stored as .csv files. Each file contains many 24h series from 3 sensors.
- Corruption checks are performed, and dates containing corrupted time series are removed.
- The labelling procedure starts and the first plots are displayed. The 3 plots at the top show the individual univariate sensor signals, while the bottom plot shows all 3 signals together.

- The annotation tool lets the labelling expert interactively select multiple portions of the time series by dragging the mouse cursor over the data.

- Each time the 'Next' button is clicked, all the selected areas (time index and sensor value) are saved together with the corresponding date stamp. At the end of the procedure, the user can easily access the anomaly labels.

- When the process is over, the plots need to be closed and then the cell 'Save labelled data' has to be run.

- Note: to change a selection, the user needs to move forward to the next plot by clicking 'Next', select the anomalous data, and then go back and restart.
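
The corruption check in the list above can be sketched as a count of samples per date. This is a minimal illustration and not the notebook's own implementation: the helper `split_dates_by_length` and the toy data are hypothetical, and the sketch assumes that a complete day holds 288 samples (one every 5 minutes) in a frame indexed by `date` and `time`.

```python
import numpy as np
import pandas as pd

SAMPLES_PER_DAY = 288  # assumed: 24 h at one sample every 5 minutes


def split_dates_by_length(frame, samples_per_day=SAMPLES_PER_DAY):
    """Return (complete, corrupted) date lists based on the sample count per date."""
    counts = frame.groupby(level="date").size()
    complete = counts[counts == samples_per_day].index.tolist()
    corrupted = counts[counts != samples_per_day].index.tolist()
    return complete, corrupted


# Toy data (hypothetical): two complete days followed by one truncated day.
stamps = pd.date_range("2018-03-06", periods=2 * SAMPLES_PER_DAY + 100, freq="5min")
index = pd.MultiIndex.from_arrays([stamps.date, stamps.time], names=["date", "time"])
toy = pd.DataFrame({"sensor": np.arange(len(stamps), dtype=float)}, index=index)

complete, corrupted = split_dates_by_length(toy)

# Keep only the rows that belong to complete dates.
clean = toy.loc[toy.index.get_level_values("date").isin(complete)]
print("Complete dates:", len(complete), "Corrupted dates:", corrupted)
```

Filtering by date, rather than by a fixed number of rows, keeps the check valid when a corrupted day is not the last one in the file.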


# Usage

 - Install python and open this Jupyter notebook 
 - Paste your working directory into path_all
 
# NOTE: 
## This is a beta version of the labelling tool! Please provide any feedback




# Initialization

In [1]:
# Import Statements

import os
import numpy as np
import pandas as pd
import csv
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.pylab as pl
import matplotlib.gridspec as gridspec
from matplotlib.widgets import Button
from matplotlib.widgets import SpanSelector
import itertools
from sklearn import preprocessing
import seaborn as sns
import datetime

# WINDOWS server: KV can find the data here
# path_all = ('//eaw-dc02/ea-daten/Abteilungsprojekte/eng/SWWData/2018_datValX/4_workspace/R/')

# macOS server: SR can find the data here
# path_all = ("/Volumes/UWO/")

# On SR's laptop
path_all = ('/Users/russoste/Desktop/Z_UWO/Data/')

name_of_file1 = '181128_trialData_UWOforADASen_case1.csv'
name_of_file2 = '181128_trialData_UWOforADASen_case2.csv'

save_path = path_all     # Destination folder for saving the label files
name_of_file_l1 = "Labels_Case1_"
name_of_file_l1_time = "Labels_Case1_time_"


# Load data and basic sanity checks

In [2]:
# Load data

completePath = os.path.join(path_all, name_of_file1) 
df = pd.read_csv(completePath)

df2 = df.copy(deep=True)
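# Split the 'YYYY-MM-DD HH:MM:SS' strings into a date part and a time part
parts = df2['time'].str.split(' ', n=1, expand=True)
df2['date'], df2['time'] = parts[0], parts[1]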

sr0 = df2.keys()[1]
sr1 = df2.keys()[2]
sr2 = df2.keys()[3]
print('Sensor names:',sr0,',', sr1,',', sr2)

# Replace extreme values (-9999) with zeros and create date and time columns

df2[sr0] = df2[sr0].replace([-9999.000], 0)
df2[sr1] = df2[sr1].replace([-9999.000], 0)
df2[sr2] = df2[sr2].replace([-9999.000], 0)
df2['date'] = [x.date() for x in (pd.to_datetime([i for i in df2['date']], format='%Y-%m-%d'))] 
df2['time'] = [x.time() for x in (pd.to_datetime([i for i in df2['time']], format='%H:%M:%S'))]   # remove primes from the time
df1 = df2.copy(deep=True)
df2.set_index(['date','time'], inplace=True)


Sensor names: bf_03 , bf_04 , bl_ce193
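
For readers without access to the original files, the loading step can be reproduced on a small in-memory CSV. This is a sketch under assumptions: the two-row table and its values are invented for illustration, and it parses the timestamp with `pd.to_datetime` and the `.dt` accessor instead of splitting strings, which is a swapped-in technique rather than the cell's own code.

```python
import io

import pandas as pd

# Hypothetical two-row file with the same layout as the case files:
# a 'time' column holding 'YYYY-MM-DD HH:MM:SS' strings, then sensor columns.
raw = io.StringIO(
    "time,bf_03,bf_04\n"
    "2018-03-06 00:00:00,1.5,-9999.0\n"
    "2018-03-06 00:05:00,-9999.0,2.0\n"
)
demo = pd.read_csv(raw)

# Parse the full timestamp once, then derive separate date and time columns.
stamps = pd.to_datetime(demo["time"], format="%Y-%m-%d %H:%M:%S")
demo["date"] = stamps.dt.date
demo["time"] = stamps.dt.time

# Replace the -9999 values with zeros, as the notebook does.
sensors = [c for c in demo.columns if c not in ("date", "time")]
demo[sensors] = demo[sensors].replace(-9999.0, 0)

demo = demo.set_index(["date", "time"])
print(demo)
```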


## Basic sanity checks

In [3]:
# Accessing dates
i_date = df2.index.get_level_values(0)          # get all dates
idx_date = np.unique(df2.index.get_level_values(0), return_index=True)[1]      # get index of unique dates
date_list = i_date[idx_date]   # get list of all dates
print('Unique dates:',date_list)

df_bf_00 = df2[sr0]
df_bf_01 = df2[sr1]
df_bf_02 = df2[sr2]

for pl_i in range(len(date_list)):
    if len(df_bf_00[date_list[pl_i].date()].values) != 288:
        print('Corrupted date:', date_list[pl_i].date())
        print('Corrupted date index:',idx_date[pl_i])
        print('Corrupted date shape:', df2.loc[date_list[pl_i].date()].shape)  

Unique dates: DatetimeIndex(['2018-03-06', '2018-03-07', '2018-03-08', '2018-03-09',
               '2018-03-10', '2018-03-11', '2018-03-12', '2018-03-13',
               '2018-03-14', '2018-03-15', '2018-03-16', '2018-03-17',
               '2018-03-18', '2018-03-19', '2018-03-20', '2018-03-21',
               '2018-03-22', '2018-03-23', '2018-03-24', '2018-03-25',
               '2018-03-26', '2018-03-27', '2018-03-28', '2018-03-29',
               '2018-03-30', '2018-03-31', '2018-04-01', '2018-04-02',
               '2018-04-03', '2018-04-04', '2018-04-05', '2018-04-06',
               '2018-04-07', '2018-04-08', '2018-04-09', '2018-04-10',
               '2018-04-11', '2018-04-12', '2018-04-13', '2018-04-14',
               '2018-04-15', '2018-04-16', '2018-04-17', '2018-04-18',
               '2018-04-19', '2018-04-20', '2018-04-21', '2018-04-22',
               '2018-04-23', '2018-04-24', '2018-04-25'],
              dtype='datetime64[ns]', name='date', freq=None)
Corrupted date

In [4]:
# Remove the corrupted date by dropping the last 145 rows, then compute the new date list
data_df = df2[:-145]

# Accessing dates
i_date = data_df.index.get_level_values(0)          # get all dates
idx_date = np.unique(data_df.index.get_level_values(0), return_index=True)[1]      # get index of unique dates
date_list = i_date[idx_date]   # get list of all dates
#print('Unique dates',date_list)

# Dates and times
data_time = []
for pl_i in idx_date:                             # create data_time indices for later access
    time = data_df.loc[i_date[pl_i]].index
    data_time.append(time)

## Create 3am-3am data sets

In [5]:
# Data points from 00:00:00 till 02:55:00
tmp0 = data_df[sr0]
tmp1 = tmp0[date_list[0].date()].loc[datetime.time(0,0,0):datetime.time(2,55,0)]
tmp2 = tmp0[date_list[0].date()].loc[datetime.time(3,0,0):datetime.time(23,55,0)]
#print(len(tmp1), len(tmp2))

X = data_df.values
X = X[len(tmp1) : -len(tmp2)]     # this is my new data set

tmp3 = time[len(tmp1):]
tmp4 = time[:len(tmp1)]
time_s = np.concatenate((tmp3,tmp4))

tmp3 = data_time[0][len(tmp1):]
tmp4 = data_time[0][:len(tmp1)]
time_s1 = np.concatenate((tmp3,tmp4))


### Create individual lists of data

nc = X.shape[0]/288
print('Number of examples:', nc)

data_sr0 = X[:,0]
data_sr1 = X[:,1]
data_sr2 = X[:,2]

data_sr0 = np.array(np.split(data_sr0,nc))      # Divide data in shape for plotting
data_sr1 = np.array(np.split(data_sr1,nc))
data_sr2 = np.array(np.split(data_sr2,nc))

date_list_n = len(data_sr0)  # I have lost part of the data (last day)
dates = date_list[0:len(data_sr0)]

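time_int = []
for ind in range(len(data_sr0)):                 # one time axis per 3am-3am example
    # Pseudo-time axis with 288 evenly spaced values; it is mapped back to clock times when saving
    time_int.append(np.linspace(00.0, 23.55, num=len(data_sr0[0])))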

Number of examples: 49.0
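
The 3am-3am windowing above amounts to dropping the first 3 hours of the record, dropping the last 21 hours, and reshaping what remains. The sketch below shows this on a toy array; the helper name `to_3am_windows` is hypothetical, and it assumes 5-minute sampling (288 samples per day, 36 samples between 00:00 and 03:00) and a record that starts at midnight.

```python
import numpy as np

SAMPLES_PER_DAY = 288   # assumed: 5-minute sampling
OFFSET = 3 * 12         # samples between 00:00 and 03:00


def to_3am_windows(values, offset=OFFSET, samples_per_day=SAMPLES_PER_DAY):
    """Reshape a midnight-aligned record into rows that each run from 3am to 3am."""
    values = np.asarray(values)
    n_days = len(values) // samples_per_day
    if n_days < 2:
        raise ValueError("need at least two full days to build one 3am-3am window")
    # Shifting the start by 3 h costs one day: n_days calendar days give n_days - 1 windows.
    stop = offset + (n_days - 1) * samples_per_day
    return values[offset:stop].reshape(n_days - 1, samples_per_day)


# Toy record (hypothetical): three calendar days, values equal to the sample index.
record = np.arange(3 * SAMPLES_PER_DAY)
windows = to_3am_windows(record)
print(windows.shape)
```

With 50 calendar days this gives 49 windows, which matches the number of examples printed above.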


# Plotting

In [13]:
%matplotlib tk


data1 = []
data2 = []
data3 = []
data123 = []

itera = dates


gs = gridspec.GridSpec(2, 3)

fig = plt.figure()
#plt.axis([0, 24, -3, 100])

ax1 = fig.add_subplot(gs[0, 0]) # row 0, col 0
ax2 = fig.add_subplot(gs[0, 1]) # row 0, col 1
ax3 = fig.add_subplot(gs[0, 2]) # row 0, col 2
ax4 = fig.add_subplot(gs[1, :]) # row 1, span all columns

ax1.set_title('bf_03', fontdict=None, pad=None)
ax2.set_title('bf_04', fontdict=None, pad=None)
ax3.set_title('bl_ce193', fontdict=None, pad=None)
ax4.set_title('bf_03 + bf_04 + bl_ce193', fontdict=None, pad=None)

fig.suptitle(str(dates[0].date()), fontsize=12)

ax1.set_ylim([-3,100])
ax2.set_ylim([-3,100])
ax3.set_ylim([-3,100])
ax4.set_ylim([-3,100])

for pl_i in range(len(dates)): 
    ax1.plot(time_int[pl_i], data_sr0[pl_i], '#C0C0C0', lw=2)
    ax2.plot(time_int[pl_i], data_sr1[pl_i], '#C0C0C0', lw=2)
    ax3.plot(time_int[pl_i], data_sr2[pl_i],  '#C0C0C0',lw=2) 
    
l, = ax1.plot(time_int[0], data_sr0[0], '#1E90FF', lw=2)     #the first one is the one in blue
l2, = ax2.plot(time_int[0], data_sr1[0], '#8B008B')
l3, = ax3.plot(time_int[0], data_sr2[0],'#FFDAB9')


ll1, = ax4.plot(time_int[0], data_sr0[0], '#1E90FF')
ll2, = ax4.plot(time_int[0], data_sr1[0], '#8B008B')
ll3, = ax4.plot(time_int[0],  data_sr2[0], '#FFDAB9')


############### Buttons widget  ####################

class Index(object):
    ind = 0

    def next(self, event):
        self.ind += 1
        i = self.ind % len(itera)

        # y-data of the newly selected day, one array per sensor
        ydata1 = data_sr0[i]   
        ydata2 = data_sr1[i] 
        ydata3 = data_sr2[i]
        xdata = time_int[i]          
        
        l.set_ydata(ydata1)
        l.set_xdata(xdata)
        l2.set_ydata(ydata2)
        l2.set_xdata(xdata)
        l3.set_ydata(ydata3)
        l3.set_xdata(xdata)
        
        ll1.set_ydata(ydata1)
        ll2.set_ydata(ydata2)
        ll3.set_ydata(ydata3) 
        
        ll1.set_xdata(xdata) 
        ll2.set_xdata(xdata)
        ll3.set_xdata(xdata)
        
        if (i == (0)):
            fig.suptitle('End of data files - restarting with data file ' + str(dates[i].date()), fontsize=12)
        else: 
            fig.suptitle(str(dates[i].date()), fontsize=12)
            
        plt.draw()

    def prev(self, event):
        self.ind -= 1
        i = self.ind % len(itera)
        
        # y-data of the newly selected day, one array per sensor
        ydata1 = data_sr0[i]   
        ydata2 = data_sr1[i] 
        ydata3 = data_sr2[i]
        xdata = time_int[i]          
        
        l.set_ydata(ydata1)
        l.set_xdata(xdata)
        
        l2.set_ydata(ydata2)
        l2.set_xdata(xdata)
        
        l3.set_ydata(ydata3)
        l3.set_xdata(xdata)

        ll1.set_ydata(ydata1)
        ll2.set_ydata(ydata2)
        ll3.set_ydata(ydata3) 

        ll1.set_xdata(xdata) 
        ll2.set_xdata(xdata)
        ll3.set_xdata(xdata)
        
        if (i == (0)):
            fig.suptitle('End of data files - restarting with data file ' + str(dates[i].date()), fontsize=12)
        else: 
            fig.suptitle(str(dates[i].date()), fontsize=12)
        
        plt.draw()

callback = Index()

axprev = plt.axes([0.7, 0.05, 0.1, 0.075])
axnext = plt.axes([0.81, 0.05, 0.1, 0.075])
bnext = Button(axnext, 'Next')
bnext.on_clicked(callback.next)

bprev = Button(axprev, 'Previous')
bprev.on_clicked(callback.prev)

"""
valore = '11'
def presskey(event):
    print('Pressed key = ', event.key)
    #sys.stdout.flush()    
    global valore 
    valore = event.key       
    return valore
"""

def onselect1(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y = data_sr0[callback.ind % len(itera)]
    today = dates[callback.ind % len(itera)]
   
    indmin1, indmax1 = np.searchsorted(x, (xmin, xmax))
    indmax1 = min(len(x) - 1, indmax1)
    thisx = x[indmin1:indmax1]
    thisy = y[indmin1:indmax1]    
    nplist = np.array([today.date() for i in range(len(thisx))])
        
    a1 = np.c_[nplist, thisx, thisy]
    global data1
    data1.extend(a1)
    #np.savetxt(completeName_label_1, data1)

        

def onselect2(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y = data_sr1[callback.ind % len(itera)]
    today = dates[callback.ind % len(itera)]
    
    indmin, indmax = np.searchsorted(x, (xmin, xmax))
    indmax = min(len(x) - 1, indmax)
    thisx = x[indmin:indmax]
    thisy = y[indmin:indmax]
    nplist = np.array([today.date() for i in range(len(thisx))])
    
    a2 = np.c_[nplist, thisx, thisy]
    global data2
    data2.extend(a2)
    

def onselect3(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y = data_sr2[callback.ind % len(itera)]
    today = dates[callback.ind % len(itera)]
    
    indmin, indmax = np.searchsorted(x, (xmin, xmax))
    indmax = min(len(x) - 1, indmax)
    thisx = x[indmin:indmax]
    thisy = y[indmin:indmax]
    nplist = np.array([today.date() for i in range(len(thisx))])
    
    a3 = np.c_[nplist, thisx, thisy]
    global data3
    data3.extend(a3)

def onselect4(xmin, xmax):
    x = time_int[callback.ind % len(itera)]
    y1 = data_sr0[callback.ind % len(itera)]
    y2 = data_sr1[callback.ind % len(itera)]
    y3 = data_sr2[callback.ind % len(itera)]
    today = dates[callback.ind % len(itera)]
    
    indmin, indmax = np.searchsorted(x, (xmin, xmax))
    indmax = min(len(x) - 1, indmax)
    
    thisx = x[indmin:indmax]
    thisy1 = y1[indmin:indmax]
    thisy2 = y2[indmin:indmax]
    thisy3 = y3[indmin:indmax] 
    nplist = np.array([today.date() for i in range(len(thisx))])
        
    # save
    a123 = np.c_[nplist, thisx, thisy1, thisy2, thisy3]
    global data123
    data123.extend(a123)

    
"""
# Connect key event to figure
fig.canvas.mpl_connect('key_press_event',presskey)
"""

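# One SpanSelector per axis; each callback stores the selected samples.
# Note: Matplotlib 3.5 renamed `rectprops` to `props` and `span_stays` to `interactive`;
# adapt the keyword names below when running on a newer Matplotlib version.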

spans1 = SpanSelector(ax1, onselect1, 'horizontal', useblit=False,
                      rectprops=dict(alpha=0.5, facecolor='red'), span_stays=True)
span2 = SpanSelector(ax2, onselect2, 'horizontal', useblit=True,
                    rectprops=dict(alpha=0.5, facecolor='red'), span_stays=True )
span3 = SpanSelector(ax3, onselect3, 'horizontal', useblit=True,
                    rectprops=dict(alpha=0.5, facecolor='red'), span_stays=True)
span4 = SpanSelector(ax4, onselect4, 'horizontal', useblit=True,
                    rectprops=dict(alpha=0.5, facecolor='red') , span_stays=True)
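
The core of each `onselect` callback is the conversion of a selected x-range into sample rows. The GUI-free sketch below isolates that step so it can be tried without opening a figure; the function name `span_to_rows` and the toy arrays are hypothetical, while the index logic follows the callbacks above.

```python
import numpy as np


def span_to_rows(x, y, day, xmin, xmax):
    """Return (day, x, y) tuples for the samples inside the selected span."""
    x = np.asarray(x)
    y = np.asarray(y)
    # x must be sorted; searchsorted gives the insertion points of the span limits.
    lo, hi = np.searchsorted(x, (xmin, xmax))
    hi = min(len(x) - 1, hi)  # same clamp as in the callbacks above
    return [(day, xi, yi) for xi, yi in zip(x[lo:hi], y[lo:hi])]


# Toy signal (hypothetical): ten samples with y = x squared.
x_demo = np.arange(10.0)
y_demo = x_demo ** 2
rows = span_to_rows(x_demo, y_demo, "2018-03-06", 2.5, 5.5)
print(rows)
```

Because of the clamp to `len(x) - 1` combined with the exclusive slice end, the last sample of a day is never returned, even when the span extends past the right edge of the plot.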

## Save labelled data

In [12]:
data1 = pd.DataFrame(data1, columns=['date','time', 'value'])
data2 = pd.DataFrame(data2, columns=['date','time', 'value'])
data3 = pd.DataFrame(data3, columns=['date','time', 'value'])
data123 = pd.DataFrame(data123, columns=['date','time', 'bf_03','bf_04','bl_ce193'])

data1.to_csv(os.path.join(save_path, name_of_file_l1+sr0 + ".csv") )
data2.to_csv(os.path.join(save_path, name_of_file_l1+sr1 + ".csv") )
data3.to_csv(os.path.join(save_path, name_of_file_l1+sr2 + ".csv") )
data123.to_csv(os.path.join(save_path, name_of_file_l1+sr0+sr1+sr2 + ".csv") )

# Index of date and time for correspondence
arrsy = [time_int[1], time_s1]
dfy = np.array(arrsy)
dfy = pd.DataFrame(data=dfy)  # row 0: plot time axis, row 1: corresponding clock times

# Go back to the real timestamp

data1_l = data1.copy(deep=True)
for i in range(len(data1_l)):
    for j in range(dfy.shape[1]):
        if data1_l['time'].iloc[i] == dfy[j].iloc[0]:
            #data1['time'].iloc[i] = dfy[j].iloc[0]
            data1_l.replace(to_replace=data1_l['time'].iloc[i], value = dfy[j].iloc[1], inplace=True)
            
data2_l = data2.copy(deep=True)
for i in range(len(data2_l)):
    for j in range(dfy.shape[1]):
        if data2_l['time'].iloc[i] == dfy[j].iloc[0]:
            #data1['time'].iloc[i] = dfy[j].iloc[0]
            data2_l.replace(to_replace=data2_l['time'].iloc[i], value = dfy[j].iloc[1], inplace=True)

data3_l = data3.copy(deep=True)
for i in range(len(data3_l)):
    for j in range(dfy.shape[1]):
        if data3_l['time'].iloc[i] == dfy[j].iloc[0]:
            #data1['time'].iloc[i] = dfy[j].iloc[0]
            data3_l.replace(to_replace=data3_l['time'].iloc[i], value = dfy[j].iloc[1], inplace=True)
            
data123_l = data123.copy(deep=True)
for i in range(len(data123_l)):
    for j in range(dfy.shape[1]):
        if data123_l['time'].iloc[i] == dfy[j].iloc[0]:
            #data1['time'].iloc[i] = dfy[j].iloc[0]
            data123_l.replace(to_replace=data123_l['time'].iloc[i], value = dfy[j].iloc[1], inplace=True)
            
data1_l.to_csv(os.path.join(save_path, name_of_file_l1_time+sr0 + ".csv") )
data2_l.to_csv(os.path.join(save_path, name_of_file_l1_time+sr1 + ".csv") )
data3_l.to_csv(os.path.join(save_path, name_of_file_l1_time+sr2 + ".csv") )
data123_l.to_csv(os.path.join(save_path, name_of_file_l1_time+sr0+sr1+sr2 + ".csv") )            
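
The nested loops above translate each plot-axis value back to its clock time. A possible alternative, shown here only as a sketch, is a dictionary lookup applied to the `time` column alone, which avoids touching the sensor values. The variable names and the small `labels` frame are hypothetical; the axis is rebuilt with the same `np.linspace` call as `time_int`, and the clock times assume 5-minute steps starting at 03:00.

```python
import datetime

import numpy as np
import pandas as pd

SAMPLES_PER_DAY = 288

# Same pseudo-time axis as time_int[0] in the notebook.
axis = np.linspace(0.0, 23.55, num=SAMPLES_PER_DAY)

# Assumed clock times of a 3am-3am window at 5-minute steps: 03:00, 03:05, ..., 02:55.
start = datetime.datetime(2000, 1, 1, 3, 0)
clock = [(start + datetime.timedelta(minutes=5 * k)).time() for k in range(SAMPLES_PER_DAY)]

# Lookup table from axis value to clock time.
lookup = dict(zip(axis, clock))

# Hypothetical labelled samples: first, second and last point of one day.
labels = pd.DataFrame({
    "date": [datetime.date(2018, 3, 6)] * 3,
    "time": axis[[0, 1, 287]],
    "value": [10.0, 11.0, 12.0],
})
labels["time"] = labels["time"].map(lookup)
print(labels)
```

The lookup relies on exact floating-point equality, so the keys must come from the same `np.linspace` call that produced the plotted axis.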


In [11]:
time_int

[array([ 0.        ,  0.08205575,  0.1641115 ,  0.24616725,  0.328223  ,
         0.41027875,  0.49233449,  0.57439024,  0.65644599,  0.73850174,
         0.82055749,  0.90261324,  0.98466899,  1.06672474,  1.14878049,
         1.23083624,  1.31289199,  1.39494774,  1.47700348,  1.55905923,
         1.64111498,  1.72317073,  1.80522648,  1.88728223,  1.96933798,
         2.05139373,  2.13344948,  2.21550523,  2.29756098,  2.37961672,
         2.46167247,  2.54372822,  2.62578397,  2.70783972,  2.78989547,
         2.87195122,  2.95400697,  3.03606272,  3.11811847,  3.20017422,
         3.28222997,  3.36428571,  3.44634146,  3.52839721,  3.61045296,
         3.69250871,  3.77456446,  3.85662021,  3.93867596,  4.02073171,
         4.10278746,  4.18484321,  4.26689895,  4.3489547 ,  4.43101045,
         4.5130662 ,  4.59512195,  4.6771777 ,  4.75923345,  4.8412892 ,
         4.92334495,  5.0054007 ,  5.08745645,  5.1695122 ,  5.25156794,
         5.33362369,  5.41567944,  5.49773519,  5.5