## Loading and preparing JSON data

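We load each JSON file and store one row per joint confidence for every classified 
window. The result is one large table with subject name, frame, confidence and joint id. 
Later we add two more columns: subject class and frame class. Subject class is the 
overall assessment (normal or abnormal), and frame class marks whether the frame was 
used for the study as a General Movement (GM).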

In [10]:
import json
import os
import pandas as pd


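data = []
data_folder = '/home/harrisonford/Documents/babybrain/confidences/'
data_files = sorted(os.listdir(data_folder))
for a_file in data_files:
    # skip anything in the folder that is not a JSON file
    if not a_file.endswith('.json'):
        continue
    with open(os.path.join(data_folder, a_file)) as json_file:
        single_data = json.load(json_file)
    # subject name is the file name without the '.json' extension
    subject = os.path.splitext(a_file)[0]
    my_frames = single_data['processed_frames']
    my_confidences = single_data['confidences']
    # one row per (frame, joint): confidence_value holds one value per joint
    for frame_value, confidence_value in zip(my_frames, my_confidences):
        for index, a_confidence in enumerate(confidence_value):
            data.append([subject, frame_value, a_confidence, index])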

# create a DataFrame
columns = ['subject', 'frame', 'confidence', 'joint_id'] 
df = pd.DataFrame(data, columns=columns)

# save it as csv (without the index, so reloading does not add an extra column)
df.to_csv('confidences.csv', index=False)
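
The loop above assumes each JSON file holds two parallel lists, processed_frames and confidences, where each entry of confidences is itself a list with one value per joint. A minimal sketch of that flattening on an in-memory payload (the values and the subject name subject_01 are made up for illustration):

```python
import pandas as pd

# hypothetical payload mimicking one confidences file; values are illustrative
single_data = {
    'processed_frames': [0, 5],
    'confidences': [[0.9, 0.8, 0.1], [0.7, 0.6, 0.2]],
}

rows = []
for frame_value, confidence_value in zip(single_data['processed_frames'],
                                         single_data['confidences']):
    # one row per (frame, joint) pair
    for joint_id, a_confidence in enumerate(confidence_value):
        rows.append(['subject_01', frame_value, a_confidence, joint_id])

example_df = pd.DataFrame(rows, columns=['subject', 'frame', 'confidence', 'joint_id'])
print(example_df)
```

Two frames with three joints each give six rows, which is the long format the later cells rely on.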

## Loading label data

We load the Excel file with the specialists' classification and add two columns to the 
prior data: subject class (1 = normal, 0 = abnormal) and frame class (1 if the frame 
falls inside a window tagged as General Movement and used to classify the subject, 
0 otherwise).

In [9]:
import pandas as pd


# helper to compare frame times (given in seconds) with the label times (time stamps):
# it converts a time stamp to seconds
# NOTE: assumes stamps look like 'HH:MM:SS'; adjust if the label file uses another format
def dates_to_seconds(some_date):
    h, m, s = (int(float(part)) for part in str(some_date).split(':'))
    return 3600 * h + 60 * m + s

# a small function to check if a frame time is inside a tag stamp
def in_window(frame_sec, time_start, time_end):
    return dates_to_seconds(time_start) <= frame_sec <= dates_to_seconds(time_end)


# first we import from label files
label_path = 'D:/babybrain/SubjectTags.xlsx'
labels = pd.read_excel(label_path)

names = labels['Sujeto']
tag = labels['Normal']
start = labels['T0']
end = labels['T1']

# now we import the processed data to add the columns
input_path = 'D:/babybrain/confidences.csv'
inputs = pd.read_csv(input_path)

# for each row in inputs we find the subject and check whether the frame falls inside a labelled window
input_names = inputs['subject']
frames = inputs['frame']
is_gm = []
is_normal = []
for index, an_input in enumerate(input_names):
    # find the rows of the label data that belong to this subject (one row per interval)
    interval_indexes = [i for i, label in enumerate(names) if an_input in label]
    # the frame is GM if its time falls inside any of the subject's label windows
    if any(in_window(frames[index], start[subindex], end[subindex]) for subindex in interval_indexes):
        is_gm.append(1)
        # the normal/abnormal tag is the same for all intervals of a subject, so we take the first
        is_normal.append(int(tag[interval_indexes[0]] == 'NORMAL'))
    else:
        is_gm.append(0)
        is_normal.append(None)
    
# save new columns into Data Frame
inputs['is_gm'] = is_gm
inputs['is_normal'] = is_normal
inputs.to_csv(input_path, index=False)
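
To make the window check concrete, here is a small self-contained sketch of the same idea. It assumes the T0/T1 cells are 'HH:MM:SS' strings and that frame is already expressed in seconds; both are assumptions about the data. The helper name to_seconds and the sample values are hypothetical.

```python
import pandas as pd


def to_seconds(stamp):
    # 'HH:MM:SS' string -> total seconds
    hours, minutes, seconds = (int(part) for part in str(stamp).split(':'))
    return 3600 * hours + 60 * minutes + seconds


# hypothetical label rows: one subject with two tagged windows
labels = pd.DataFrame({
    'Sujeto': ['subject_01', 'subject_01'],
    'T0': ['00:00:10', '00:01:00'],
    'T1': ['00:00:20', '00:01:30'],
})
frames_sec = [5, 15, 65, 120]

# a frame is GM (1) if it falls inside any of the tagged windows
is_gm = [
    int(any(to_seconds(t0) <= f <= to_seconds(t1)
            for t0, t1 in zip(labels['T0'], labels['T1'])))
    for f in frames_sec
]
print(is_gm)  # [0, 1, 1, 0]
```

Only the frames at 15 s and 65 s fall inside a window, so only those two are tagged as GM.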

## Using new labels and data exploration
We plot the average joint confidence per class to check whether a difference is 
visible by eye. Even if it is not, a random forest classifier may still be able to 
separate the classes.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# get the data frame
input_path = 'D:/babybrain/confidences.csv'
data = pd.read_csv(input_path)

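# create a simple graph with seaborn: mean confidence per joint, split by frame class
# (the column is saved as 'is_gm' in the labelling step)
graph = sns.catplot(x='joint_id', y='confidence', hue='is_gm', kind='bar', data=data)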
plt.show()
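
The per-class averages can also be read as numbers rather than from the plot. A minimal sketch with a groupby, using a small made-up table in the same long format (column names follow the earlier cells; the values are illustrative only):

```python
import pandas as pd

# hypothetical long-format data: two joints, two frame classes
data = pd.DataFrame({
    'joint_id':   [0, 0, 0, 0, 1, 1, 1, 1],
    'is_gm':      [0, 0, 1, 1, 0, 0, 1, 1],
    'confidence': [0.2, 0.4, 0.8, 0.6, 0.5, 0.5, 0.9, 0.7],
})

# mean confidence per joint (rows) and per frame class (columns)
mean_conf = data.groupby(['joint_id', 'is_gm'])['confidence'].mean().unstack('is_gm')
print(mean_conf)
```

Each row is a joint and each column a class, so a large gap between the two columns of a row suggests that joint carries signal for the classifier.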

## Training Random Forest Classifier
Now we use the data to train a random forest classifier that predicts either the frame 
class (GM) or the subject class (normal).


In [None]:
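import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# get the data frame
input_path = 'D:/babybrain/confidences.csv'
data = pd.read_csv(input_path)

# column to predict: 'is_gm' (frame class) or 'is_normal' (subject class)
classifying = 'is_gm'

# prepare the data to fit in the model
# assumption: one sample per (subject, frame), with one confidence feature per joint;
# rows with a missing label or a missing joint are dropped
wide = data.pivot_table(index=['subject', 'frame', classifying],
                        columns='joint_id', values='confidence').dropna()
X = wide.to_numpy()
y = wide.index.get_level_values(classifying).astype(int)
model = RandomForestClassifier()
model.fit(X, y)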

## Checking the Classifier accuracy

We can use sklearn's metrics module to compute a ROC curve and check the classifier's performance.

In [None]:
# get the data frame
input_path = 'D:/babybrain/confidences.csv'
data = pd.read_csv(input_path)
print(data)
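
The cell above only loads the data, so here is a hedged sketch of what the ROC check could look like. It runs on synthetic features rather than the real confidences, and the split sizes, number of joints and model settings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# synthetic stand-in for the per-frame joint confidences (not real data)
rng = np.random.default_rng(0)
n_frames, n_joints = 400, 5
y = rng.integers(0, 2, size=n_frames)
# class 1 gets slightly higher confidences so there is something to learn
X = rng.normal(loc=0.5 + 0.2 * y[:, None], scale=0.2, size=(n_frames, n_joints))

# hold out a test set so the ROC curve is computed on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# ROC needs scores, not hard labels: use the predicted probability of class 1
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f'AUC: {auc:.3f}')
```

With the real data, frames from the same subject are strongly correlated, so splitting by subject (for example with sklearn's GroupShuffleSplit) is likely a safer choice than the random split used in this sketch.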
