## Description

Often we want to check whether two files of seizure annotations contain the same annotations. For example, if you look through a week of recordings and annotate the seizures, comparing a classifier's predictions with your annotations lets you count the false positives and (more importantly) the false negatives.

In [1]:
import os

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


As an example, we will look at the difference between a raw predictions output and the same file after it has been manually checked and the false positives removed using the GUI.

In [2]:
checked_preds = pd.read_csv('./example_data/checked_predictions.csv', index_col=0)
raw_preds     = pd.read_csv('./example_data/raw_predictions.csv', index_col=0)

In [3]:
raw_preds.head()

Unnamed: 0,old_index,filename,start,end,duration,transmitter,real_start,real_end
0,3,M1513966209_2017-12-22-18-10-09_tids_[99].h5,610.0,625.0,15.0,[99],2017-12-22 18:20:19,2017-12-22 18:20:34
1,15,M1514016609_2017-12-23-08-10-09_tids_[99].h5,950.0,985.0,35.0,[99],2017-12-23 08:25:59,2017-12-23 08:26:34
2,55,M1514164210_2017-12-25-01-10-10_tids_[99].h5,1305.0,1345.0,40.0,[99],2017-12-25 01:31:55,2017-12-25 01:32:35
3,58,M1514189410_2017-12-25-08-10-10_tids_[99].h5,1400.0,1445.0,45.0,[99],2017-12-25 08:33:30,2017-12-25 08:34:15
4,62,M1514211010_2017-12-25-14-10-10_tids_[99].h5,2650.0,2675.0,25.0,[99],2017-12-25 14:54:20,2017-12-25 14:54:45


In [4]:
raw_preds.shape

(1287, 8)

In [5]:
checked_preds.shape

(729, 8)

In [6]:
print('So we expect there to be', raw_preds.shape[0]-checked_preds.shape[0], 'false positives')

So we expect there to be 558 false positives


# Code for comparing the dataframes

In [7]:
def add_mcode_tid_col(df):
    '''Note this expects file start to be of format: M1513966209'''
    df['mcode_tid'] = df.filename.str.slice(0,11)+'_'+df.transmitter.astype(str)
    return df

def check_overlap(series1,series2):
    ''' pandas series should both have start and end columns
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    start_a, end_a = float(series1.start), float(series1.end)
    start_b, end_b = float(series2.start), float(series2.end)
    overlap_bool = (start_a <= end_b) and (end_a>=start_b)
    return overlap_bool

def calculate_overlap(series1,series2):
    ''' pandas series should both have start and end attrs
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    a, b = float(series1.start), float(series1.end)
    c, d = float(series2.start), float(series2.end)
    overlap = min([b-a,b-c,d-c,d-a])
    return overlap

def compare_dfs(prediction_df, annotation_df):
    ''' 
    Function to check how much of prediction_df is found in annotation_df
    
    Returns two dataframes:
        - preds_in_annotations_df: the predictions found within the annotations and the amount of overlap
          (probably corresponding to true positives). Has an overlap column (seconds).
        - preds_not_in_annotations_df: the predictions not found in the annotations
          (probably corresponding to false positives)
          
    Note:
        To check for false negatives, or missed seizures, pass in the actual annotations
        as the 'prediction_df' and the actual predictions as the 'annotation_df'. 
        Here we would hope for no 'false positives', so the second dataframe will contain the 
        missed seizures...
    '''
    
    # first add a 'mcode_tid' col: allows us to check for same hour and transmitter
    prediction_df = add_mcode_tid_col(prediction_df)
    annotation_df = add_mcode_tid_col(annotation_df)
    
    # Collect matching and non-matching rows in lists and build the dataframes at the end
    # (DataFrame.append was removed in pandas 2.0)
    preds_in_annotations     = []
    preds_not_in_annotations = []

    # loop over the predictions
    for _, prediction_row_series in prediction_df.iterrows():
        overlap_bool = False # True if the predicted seizure overlaps with any annotation
        t_overlap = 0        # total overlap time (seconds) between the prediction and the annotations
        # first check if the hour&transmitter is in the annotations
        if prediction_row_series.mcode_tid in annotation_df.mcode_tid.unique():

            # next find all annotations with same hour and tid as the prediction row
            # this will often just be one row, but if >1 seizures in a single hour will be more
            relevant_annotations_df = annotation_df[annotation_df.mcode_tid == prediction_row_series.mcode_tid]

            # finally check if the start and end columns overlap
            for _, annotation_row_series in relevant_annotations_df.iterrows():
                if check_overlap(prediction_row_series, annotation_row_series):
                    overlap_bool = True
                    # if the prediction overlaps several annotations, the overlaps are summed
                    t_overlap += calculate_overlap(prediction_row_series,
                                                   annotation_row_series)

        if overlap_bool:
            prediction_row_series['overlap'] = t_overlap
            preds_in_annotations.append(prediction_row_series)
        else:
            preds_not_in_annotations.append(prediction_row_series)

    # build the output dataframes, keeping the prediction columns even when a list is empty
    preds_in_annotations_df     = pd.DataFrame(preds_in_annotations,
                                               columns = list(prediction_df.columns) + ['overlap'])
    preds_not_in_annotations_df = pd.DataFrame(preds_not_in_annotations,
                                               columns = prediction_df.columns)

    return preds_in_annotations_df, preds_not_in_annotations_df

true_positives, false_positives = compare_dfs(raw_preds,checked_preds)     
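
To make the overlap arithmetic used by `check_overlap` and `calculate_overlap` concrete, here is a minimal sketch on plain numbers rather than pandas rows. The function names `intervals_overlap` and `overlap_length` and the times are made up for illustration; the comparisons and the `min` expression mirror the ones above.

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if [start_a, end_a] and [start_b, end_b] share at least one point."""
    return (start_a <= end_b) and (end_a >= start_b)

def overlap_length(start_a, end_a, start_b, end_b):
    """Length of the shared region; only meaningful when the intervals overlap."""
    return min(end_a - start_a, end_a - start_b, end_b - start_b, end_b - start_a)

# made-up times in seconds within one recording
pred     = (610.0, 625.0)
partial  = (600.0, 620.0)   # shares 610-620 with pred
touching = (625.0, 640.0)   # starts exactly where pred ends
separate = (700.0, 720.0)   # no shared region

print(intervals_overlap(*pred, *partial),  overlap_length(*pred, *partial))    # True 10.0
print(intervals_overlap(*pred, *touching), overlap_length(*pred, *touching))   # True 0.0
print(intervals_overlap(*pred, *separate))                                     # False
```

Because the comparisons are inclusive (`<=` and `>=`), two intervals that only touch at an endpoint count as overlapping, with an overlap of 0 seconds.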

In [8]:
true_positives.shape, false_positives.shape

((729, 10), (558, 9))

In [9]:
true_positives.head()

Unnamed: 0,old_index,filename,start,end,duration,transmitter,real_start,real_end,mcode_tid,overlap
0,3,M1513966209_2017-12-22-18-10-09_tids_[99].h5,610.0,625.0,15.0,[99],2017-12-22 18:20:19,2017-12-22 18:20:34,M1513966209_[99],15.0
1,15,M1514016609_2017-12-23-08-10-09_tids_[99].h5,950.0,985.0,35.0,[99],2017-12-23 08:25:59,2017-12-23 08:26:34,M1514016609_[99],35.0
2,55,M1514164210_2017-12-25-01-10-10_tids_[99].h5,1305.0,1345.0,40.0,[99],2017-12-25 01:31:55,2017-12-25 01:32:35,M1514164210_[99],40.0
3,58,M1514189410_2017-12-25-08-10-10_tids_[99].h5,1400.0,1445.0,45.0,[99],2017-12-25 08:33:30,2017-12-25 08:34:15,M1514189410_[99],45.0
4,62,M1514211010_2017-12-25-14-10-10_tids_[99].h5,2650.0,2675.0,25.0,[99],2017-12-25 14:54:20,2017-12-25 14:54:45,M1514211010_[99],25.0


In [10]:
false_positives.head()

Unnamed: 0,old_index,filename,start,end,duration,transmitter,real_start,real_end,mcode_tid
115,56,M1514178610_2017-12-25-05-10-10_tids_[98].h5,0.0,5.0,5.0,[98],2017-12-25 05:10:10,2017-12-25 05:10:15,M1514178610_[98]
126,5,M1513969809_2017-12-22-19-10-09_tids_[97].h5,3565.0,3580.0,15.0,[97],2017-12-22 20:09:34,2017-12-22 20:09:49,M1513969809_[97]
128,8,M1513977009_2017-12-22-21-10-09_tids_[97].h5,3540.0,3560.0,20.0,[97],2017-12-22 22:09:09,2017-12-22 22:09:29,M1513977009_[97]
143,25,M1514038209_2017-12-23-14-10-09_tids_[97].h5,20.0,30.0,10.0,[97],2017-12-23 14:10:29,2017-12-23 14:10:39,M1514038209_[97]
194,127,M1514445010_2017-12-28-07-10-10_tids_[97].h5,2615.0,2645.0,30.0,[97],2017-12-28 07:53:45,2017-12-28 07:54:15,M1514445010_[97]


# Save the false positives to check through using the GUI

- they might not all be false positives!

In [11]:
savename = 'predictions_not_in_annotations.csv'
false_positives.to_csv(savename,header=False, index=False)

# Here flip the order of dataframes:

```
Note:
        To check for false negatives, or missed seizures, pass in the actual annotations
        as the 'prediction_df' and the actual predictions as the 'annotation_df'. 
        Here we would hope for no 'false positives', so the second dataframe will contain the 
        missed seizures...
```

Pass in `checked_preds` as the predictions. This is similar to the case where you are checking for missed seizures.

In [12]:
true_positives, false_positives = compare_dfs(checked_preds,raw_preds)     

In [13]:
true_positives.shape

(729, 10)

In [14]:
false_positives.shape

(0, 9)

As expected, there are no false positives. If there were any, they would be annotations that the predictions had missed (provided the predictions cover the same time period as the annotations).
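
The effect of swapping the argument order can be illustrated on toy data. The sketch below is not the `compare_dfs` function above but a compact, merge-based stand-in: the helper name `unmatched_rows` and all values are hypothetical, and it assumes both dataframes already have `mcode_tid`, `start` and `end` columns.

```python
import pandas as pd

def unmatched_rows(df_a, df_b):
    """Rows of df_a that overlap no row of df_b with the same 'mcode_tid'."""
    a = df_a.reset_index(drop=True)
    # pair every row of a with every row of df_b that shares its mcode_tid
    pairs = a.reset_index().merge(df_b, on='mcode_tid', suffixes=('', '_b'))
    # keep the pairs whose [start, end] intervals overlap
    hits = pairs[(pairs['start'] <= pairs['end_b']) & (pairs['end'] >= pairs['start_b'])]
    return a[~a.index.isin(hits['index'])]

# made-up predictions and annotations
preds  = pd.DataFrame({'mcode_tid': ['A', 'A', 'B'],
                       'start':     [10.0, 100.0, 5.0],
                       'end':       [20.0, 120.0, 15.0]})
annots = pd.DataFrame({'mcode_tid': ['A', 'C'],
                       'start':     [15.0, 0.0],
                       'end':       [30.0, 10.0]})

false_pos = unmatched_rows(preds, annots)   # predictions with no matching annotation
missed    = unmatched_rows(annots, preds)   # annotations with no matching prediction
print(false_pos)
print(missed)
```

With predictions first, the unmatched rows are candidate false positives (the second `A` prediction and the `B` prediction); with annotations first, they are the missed seizures (the `C` annotation).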