## Preparation of the different data sets

This notebook prepares the data sets used to train the model. The data sets are:
- full data set (no filter)
- filtered data set (full data set without the unexpected labels a, j, s, 7, 8)
- artifact-free data set (filtered data set without the artifact labels 1, 2, 3)
- df_simplify (artifact-free data set with simplified labels: 4 and 9 -> w, 5 -> n, 6 -> r)
- df_simplify_day3 (df_simplify restricted to the third day)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

import sys
sys.path.append('../Library')

import breedManip
import dataProcessing
import breeds
import splitData

import os
import importlib

importlib.reload(splitData)

2023-07-23 17:01:54.765262: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-23 17:01:54.838930: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


<module 'splitData' from '/mnt/remote/workspaces/magali.egger/TBproject/Travail_Bachelor/Preparation/../Library/splitData.py'>

In [2]:
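def addTemporalityInfo(df):
    # add coarse time columns derived from the row index (one row per epoch)
    df1 = df.copy()
    df1['hour'] = df1.index // 900   # 900 epochs per hour, assuming 4-second epochs
    df1['minute'] = df1.index // 15  # 15 epochs per minute (minutes elapsed since index 0)
    return df1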

In [3]:
input_folder = '/home/magali.egger/shared-projects/mice_UNIL/BXD envoie/Prep_files/'
files = os.listdir(input_folder)

frames = []

for file in files:
    file_name = file.split('.')[0]

    # read one recording and tag every row with the mouse name and its breed
    frame = pd.read_csv(os.path.join(input_folder, file))
    frame = frame.assign(mouse=file_name, breed=breedManip.getBreedOfMouse(file_name))
    frames.append(frame)

# concatenate once at the end instead of growing the frame inside the loop
df = pd.concat(frames)

The first data set to store is the full data set (no filter). It is serialized with the pickle library into the "Data" folder.

In [6]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_full_noFilter.pkl', 'wb') as f:
    pickle.dump(df, f)
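
As a side note, pandas ships its own pickle helpers, `DataFrame.to_pickle` and `pd.read_pickle`, which serve the same purpose as the `open` / `pickle.dump` pair used in this notebook. The following is a minimal sketch on a toy frame; the file name `df_toy.pkl` and the temporary directory are placeholders, not paths from the project.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the full data set (column names follow the notebook).
toy = pd.DataFrame({"rawState": ["w", "n", "r"], "mouse": ["09003"] * 3})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_toy.pkl")  # hypothetical file name
    toy.to_pickle(path)                # write the frame to disk
    restored = pd.read_pickle(path)    # read it back

# The restored frame should hold exactly the same values.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

As with any pickle file, only files from a trusted source should be loaded this way.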

In [4]:
len(df['mouse'].unique())

252

In [5]:
len(df.index)
# should be 21'314'412

21314412

In [7]:
print(df['rawState'].value_counts())
print(df['state'].value_counts())

w    10800193
n     8812707
r     1181364
1      448785
2       46969
3       15120
5        3819
4        3401
8         784
6         729
s         329
9         208
a           3
j           1
Name: rawState, dtype: int64
w    11252379
n     8863495
r     1197213
8         784
s         329
9         208
a           3
j           1
Name: state, dtype: int64


The second data set removes the unexpected labels: every row whose rawState is a, j, s, 7 or 8 is dropped.

In [8]:
df_filter = df.copy()
labels_to_keep = ['1', '2', '3', '4', '5', '6', '9', 'n', 'r', 'w']
df_filter = df_filter[df_filter['rawState'].isin(labels_to_keep)]

In [9]:
print(df_filter['rawState'].value_counts())

w    10800193
n     8812707
r     1181364
1      448785
2       46969
3       15120
5        3819
4        3401
6         729
9         208
Name: rawState, dtype: int64


In [10]:
print(len(df_filter.index))
# should be 21'313'295

21313295


In [11]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_filter.pkl', 'wb') as f:
    pickle.dump(df_filter, f)

The third data set removes the artifacts, i.e. the rows labeled 1, 2 or 3.

In [12]:
df_artifacts_free = df_filter.copy()
labels_artifacts_free = ['4', '5', '6', '9', 'n', 'r', 'w']
df_artifacts_free = df_artifacts_free[df_artifacts_free['rawState'].isin(labels_artifacts_free)]

In [13]:
len(df_artifacts_free.index)
# should be 20'802'421

20802421
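
The expected row counts written in the `# should be ...` comments can be re-derived from the `rawState` counts printed earlier in the notebook. The sketch below only does the arithmetic; the counts are copied from that output and the variable names are illustrative.

```python
# rawState counts copied from the value_counts() output earlier in the notebook
raw_counts = {
    "w": 10800193, "n": 8812707, "r": 1181364,
    "1": 448785, "2": 46969, "3": 15120,
    "5": 3819, "4": 3401, "8": 784, "6": 729,
    "s": 329, "9": 208, "a": 3, "j": 1,
}
unexpected = {"a", "j", "s", "7", "8"}  # label 7 never occurs in the counts
artifacts = {"1", "2", "3"}

total = sum(raw_counts.values())
after_filter = total - sum(n for label, n in raw_counts.items() if label in unexpected)
after_artifacts = after_filter - sum(n for label, n in raw_counts.items() if label in artifacts)

print(total, after_filter, after_artifacts)
```

The three values match the lengths of `df`, `df_filter` and `df_artifacts_free` reported by the notebook.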

In [14]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_artifacts_free.pkl', 'wb') as f:
    pickle.dump(df_artifacts_free, f)

The fourth data set simplifies the labels: 4 and 9 are replaced by w, 5 by n and 6 by r.

In [15]:
df_simplify = df_artifacts_free.copy()
df_simplify['rawState'] = df_simplify['rawState'].replace(['4','9'], 'w')
df_simplify['rawState'] = df_simplify['rawState'].replace(['5'], 'n')
df_simplify['rawState'] = df_simplify['rawState'].replace(['6'], 'r')

In [16]:
len(df_simplify.index) - len(df_artifacts_free.index)
# should be 0

0
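
The three `replace` calls can also be written as a single dictionary mapping. The sketch below shows this on a toy series; `label_map` is an illustrative name, and the mapping itself is the one described in the text.

```python
import pandas as pd

# Mapping described in the text: 4 and 9 -> w, 5 -> n, 6 -> r
label_map = {"4": "w", "9": "w", "5": "n", "6": "r"}

toy = pd.Series(["w", "4", "9", "5", "6", "n", "r"], name="rawState")
simplified = toy.replace(label_map)

# Relabeling never drops rows, so the length is unchanged.
print(simplified.tolist(), len(simplified) == len(toy))
```

Keeping the mapping in one dictionary makes it easy to see that only w, n and r remain afterwards.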

In [17]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify.pkl', 'wb') as f:
    pickle.dump(df_simplify, f)

The last data set is df_simplify restricted to the third day (day == 2 in the code, so the day column appears to be zero-based).

In [18]:
df_simplify_day3 = df_simplify.copy()
df_simplify_day3 = df_simplify_day3[df_simplify_day3['day'] == 2]
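
To illustrate why the third day corresponds to `day == 2`, the sketch below derives a zero-based day index from the epoch number. It assumes 4-second epochs, i.e. 21,600 epochs per day, which is consistent with the 900 epochs per hour used in `addTemporalityInfo` and with the sample rows where epoch 43200 has day 2. How the `day` column was actually computed upstream is not shown in this notebook.

```python
import pandas as pd

EPOCHS_PER_DAY = 21_600  # assumption: 24 h * 900 epochs per hour (4-second epochs)

toy = pd.DataFrame({"epoch": [0, 21_599, 21_600, 43_199, 43_200, 43_204]})
toy["day"] = toy["epoch"] // EPOCHS_PER_DAY  # zero-based day index

# Same selection as in the notebook: the third day is day == 2.
third_day = toy[toy["day"] == 2]
print(third_day["epoch"].tolist())
```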

In [19]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify_day3.pkl', 'wb') as f:
    pickle.dump(df_simplify_day3, f)

#### Loading the data frames

In [20]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify_day3.pkl', 'rb') as f:
    df_simplify_day3 = pickle.load(f)

In [21]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_artifacts_free.pkl', 'rb') as f:
    df_artifacts_free = pickle.load(f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify.pkl', 'rb') as f:
    df_simplify = pickle.load(f)

In [40]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify_day3.pkl', 'rb') as f:
    df_simplify_day3 = pickle.load(f)

4919666


In [None]:
# get the different df
"""
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_full_noFilter.pkl', 'rb') as f:
    df = pickle.load(f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_filter.pkl', 'rb') as f:
    df_filter = pickle.load(f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_artifacts_free.pkl', 'rb') as f:
    df_artifacts_free = pickle.load(f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify.pkl', 'rb') as f:
    df_simplify = pickle.load(f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify_day3.pkl', 'rb') as f:
    df_simplify_day3 = pickle.load(f)
"""

### Preparation for the train and the test set

The train and test sets are prepared in the following steps:
- The train and test sets are built from the third day only
- Mice from breeds with fewer than 4 mice are dropped
- The test set contains one mouse from each breed
- The train set contains the remaining mice
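
The steps above can be sketched as follows. This is a hypothetical stand-in, not the code of `splitData.split_data_breeds` or `breedManip.selectAllBreedsOfSizeNOrMore`, whose implementations are not shown here. The function name, its parameters and the use of `numpy.random.default_rng` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_one_mouse_per_breed(df, min_breed_size=4, seed=24):
    """Hypothetical sketch: drop small breeds, then hold out one mouse per breed."""
    mice = df[["mouse", "breed"]].drop_duplicates()
    breed_sizes = mice["breed"].value_counts()  # number of mice per breed
    kept_breeds = breed_sizes[breed_sizes >= min_breed_size].index
    mice = mice[mice["breed"].isin(kept_breeds)]

    # Draw one test mouse per remaining breed, reproducibly.
    rng = np.random.default_rng(seed)
    test_mice = [
        rng.choice(group["mouse"].to_numpy())
        for _, group in mice.groupby("breed", sort=True)
    ]

    kept = df[df["breed"].isin(kept_breeds)]
    is_test = kept["mouse"].isin(test_mice)
    return kept[~is_test], kept[is_test]


# Toy data: breeds A (4 mice) and B (5 mice) are kept, breed C (1 mouse) is dropped.
toy = pd.DataFrame({
    "mouse": ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4", "b5", "c1"],
    "breed": ["A"] * 4 + ["B"] * 5 + ["C"],
})
toy = toy.loc[toy.index.repeat(3)].reset_index(drop=True)  # three epochs per mouse

train, test = split_one_mouse_per_breed(toy)
print(train["mouse"].nunique(), test["mouse"].nunique())
```

Splitting by mouse rather than by row keeps all epochs of a given animal on the same side of the split.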

In [22]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_simplify_day3.pkl', 'rb') as f:
    df1 = pickle.load(f)

In [23]:
selected_breeds = breedManip.selectAllBreedsOfSizeNOrMore(4)
#id_selected_breeds = [breedManip.getBreedIndex(breed) for breed in selected_breeds]
df1 = df1[df1['breed'].isin(selected_breeds)]

In [24]:
len(selected_breeds)

39

In [25]:
df1.head()

| Unnamed: 0.1 | Unnamed: 0 | rawState | state | EEGv | EMGv | epoch | day | spectral_flatness | spectral_centroid | spectral_entropy | ... | EMGv_max100 | EEGv_log | EMGv_log | bias | EEGv^2 | EEGv^3 | EMGv^2 | EMGv^3 | mouse | breed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42018 | 43200 | w | w | 2.561793e-09 | 2.476193e-10 | 43200 | 2 | 0.156645 | 15.654199 | -6.416485 | ... | 6.79873e-10 | -19.782559 | -22.119129 | 1 | 6.562781e-18 | 1.681248e-26 | 6.13153e-20 | 1.518285e-29 | 9003 | bxd_090 |
| 42019 | 43201 | w | w | 2.890715e-09 | 1.768662e-10 | 43201 | 2 | 0.146992 | 15.432471 | -5.64792 | ... | 6.79873e-10 | -19.661762 | -22.455627 | 1 | 8.356233e-18 | 2.415549e-26 | 3.128167e-20 | 5.5326709999999995e-30 | 9003 | bxd_090 |
| 42020 | 43202 | w | w | 2.657057e-09 | 1.719997e-10 | 43202 | 2 | 0.09666 | 11.4556 | -5.618493 | ... | 6.79873e-10 | -19.746047 | -22.483528 | 1 | 7.059955e-18 | 1.875871e-26 | 2.958391e-20 | 5.0884259999999996e-30 | 9003 | bxd_090 |
| 42021 | 43203 | w | w | 3.102465e-09 | 1.463278e-10 | 43203 | 2 | 0.085641 | 10.517584 | -5.388146 | ... | 6.79873e-10 | -19.591069 | -22.645172 | 1 | 9.625289e-18 | 2.9862119999999997e-26 | 2.141181e-20 | 3.1331429999999998e-30 | 9003 | bxd_090 |
| 42022 | 43204 | w | w | 3.150835e-09 | 1.709445e-10 | 43204 | 2 | 0.098589 | 10.449326 | -5.47833 | ... | 6.79873e-10 | -19.575598 | -22.489682 | 1 | 9.927761e-18 | 3.128074e-26 | 2.922202e-20 | 4.995344e-30 | 9003 | bxd_090 |


In [26]:
seed = 24
df_train, df_test = splitData.split_data_breeds(df1,seed)

In [27]:
print(len(df_train['mouse'].unique()))
print(len(df_test['mouse'].unique()))

206
39
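
The 39 test mice match the 39 selected breeds, and 206 + 39 = 245 of the 252 mice remain after the breed-size filter. A small check along the following lines could confirm that no mouse appears on both sides and that each breed contributes exactly one test mouse. `check_split` is a hypothetical helper, shown here on toy frames rather than on the real data.

```python
import pandas as pd


def check_split(df_train, df_test):
    """Hypothetical helper: summarize a mouse-level train/test split."""
    train_mice = set(df_train["mouse"])
    test_mice = set(df_test["mouse"])
    test_mice_per_breed = df_test.groupby("breed")["mouse"].nunique()
    return {
        "shared_mice": train_mice & test_mice,  # should be empty
        "one_test_mouse_per_breed": bool((test_mice_per_breed == 1).all()),
        "n_train_mice": len(train_mice),
        "n_test_mice": len(test_mice),
    }


# Toy frames with the same column names as df_train and df_test.
toy_train = pd.DataFrame({"mouse": ["a1", "a1", "a2", "b1"], "breed": ["A", "A", "A", "B"]})
toy_test = pd.DataFrame({"mouse": ["a3", "b2", "b2"], "breed": ["A", "B", "B"]})

report = check_split(toy_train, toy_test)
print(report)
```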


In [28]:
with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_train.pkl', 'wb') as f:
    pickle.dump(df_train, f)

with open('/home/magali.egger/workspace/TBproject/Travail_Bachelor/Data/df_test.pkl', 'wb') as f:
    pickle.dump(df_test, f)

In [29]:
print(df_test['mouse'].unique())

['09003' '09806' '08405' '00503' '29T01' '09504' '07303' '06408' '10010'
 'BDF02' '09701' '06605' '07501' '1D203' '06504' '06703' '08102' '06308'
 '05505' 'BL614' '05107' '05001' '04805' '07005' '06111' '08701' '04903'
 '04407' '03207' '05603' '08306' '071S2' '04505' '10302' 'DBF02' '09607'
 '08910' '04304' '02902']


In [30]:
print(df_train['mouse'].unique())

['043S5' '06403' '06110' '10301' '06404' '02901' '03205' '09509' '04309'
 '29T06' 'BDF06' '02910' '04508' '08404' '08108' '00506' '09803' '08106'
 '07007' 'DBA13' '04901' '07006' '2D203' '1D204' '06603' '51G10' '09708'
 '10009' 'DBF01' 'BL6V3' '04806' 'BL611' '08314' '05502' '06405' '07004'
 'BL6V2' '10304' '05602' '09006' '10004' '06505' 'BL606' '10306' '09808'
 '08911' '10002' '05501' '05604' '02905' '07106' '03206' '07305' 'BDF04'
 '08112' '06105' 'DBA12' '04405' '07502' '29T10' '06306' '04403' 'DBF06'
 '08706' '07105' '08707' '06707' '05002' '06109' '09807' 'BL601' '09702'
 '1D206' '04501' '04906' 'BL6V1' '09506' '04308' '05606' 'DBF04' '00505'
 '06705' '08311' '09604' '02909' '04306' '02903' '06303' 'BL610' '03208'
 '08903' '09602' '051G9' '07505' '09505' '04402' '10003' '00504' '09005'
 '07302' 'DBF07' '00501' '05004' '06601' '05101' 'BL609' '08904' '08401'
 'DBA14' '09501' '09709' '06702' '09601' '07503' '04504' '04902' '02907'
 '06307' '09508' '08705' '05108' 'DBA11' 'DBF05' '0

In [31]:
print(len(df_train['mouse'].unique()))

206


In [32]:
print(len(df_test['mouse'].unique()))

39
