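Goal
====

> Predict, for each night, when the subject was asleep
>> sleep window: the longest period (window) during that night in which the subject is predicted to be asleep

For each subject (series_id) and each measurement time point within that subject's observation period (step),
predict whether the subject is falling asleep or waking up (event), together with a confidence score for that prediction.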
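
As a rough illustration of the prediction target, the sketch below builds a small table with one row per predicted event. The column layout (`row_id`, `series_id`, `step`, `event`, `score`) is an assumption about the expected submission format, and the scores are made-up placeholder values; the steps are taken from the first night in `train_events.csv`.

```python
import pandas as pd

# Hypothetical predictions: one row per detected event.
# The scores are placeholders, not model output.
predictions = [
    {"series_id": "038441c925bb", "step": 4992, "event": "onset", "score": 0.9},
    {"series_id": "038441c925bb", "step": 10932, "event": "wakeup", "score": 0.8},
]

submission = pd.DataFrame(predictions)
# Assumption: row_id is a simple running index over all predicted events.
submission.insert(0, "row_id", range(len(submission)))
print(submission)
```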


# Problems
## Setting reference points
- How do we predict that the subject is asleep?

- How should we define when a given night starts and ends?
>> Because only one sleep window can exist per night
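
One possible way to pin down a night, sketched below, is to treat it as running from local noon to the next local noon. This is an assumption, not the competition's definition, and `assign_night` is a hypothetical helper. Subtracting 12 hours from the local wall-clock time gives every moment of one night the same calendar date.

```python
import pandas as pd

def assign_night(timestamps: pd.Series, series_start: str) -> pd.Series:
    """Map ISO timestamps (local time plus UTC offset) to 1-based night numbers.

    Assumption: a night runs from local noon to the next local noon.
    """
    # Keep only the local wall-clock part ("YYYY-MM-DDTHH:MM:SS"), dropping the offset.
    local = pd.to_datetime(timestamps.str[:19])
    start = pd.to_datetime(series_start[:19])
    # After shifting by 12 hours, one night maps to a single calendar date.
    night_date = (local - pd.Timedelta(hours=12)).dt.normalize()
    first_night = (start - pd.Timedelta(hours=12)).normalize()
    return (night_date - first_night).dt.days + 1

# Event timestamps taken from the first rows of train_events.csv.
ts = pd.Series([
    "2018-08-14T22:26:00-0400",
    "2018-08-15T06:41:00-0400",
    "2018-08-15T19:37:00-0400",
    "2018-08-16T05:41:00-0400",
    "2018-08-16T23:03:00-0400",
])
nights = assign_night(ts, series_start="2018-08-14T15:30:00-0400")
print(nights.tolist())
```

For these five rows the rule gives the same night numbers as the `night` column in `train_events.csv`. A recording that starts before noon would need separate handling.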

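## Measurement errors
- Detect anomalies where the device appears to have been taken off (no movement at all)
>> Predicting events during these periods counts as false positives (hurting the precision score)

- Cases where onset and wakeup do not appear as a pair (two onsets in a row, two wakeups in a row, etc.)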
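
The pairing check in the last bullet can be sketched as follows: within each series, sort the labelled events by step and flag any event of the same type as the one before it. `find_repeated_events` is a hypothetical helper and the data is a toy example, not taken from the training set.

```python
import numpy as np
import pandas as pd

def find_repeated_events(events: pd.DataFrame) -> pd.DataFrame:
    """Return labelled events that repeat the previous event type in their series."""
    # Rows without a step carry no usable label, so drop them first.
    labelled = events.dropna(subset=["step"]).sort_values(["series_id", "step"])
    previous = labelled.groupby("series_id")["event"].shift()
    return labelled[labelled["event"] == previous]

# Toy data: the second of two consecutive onsets (step 400) should be flagged.
toy_events = pd.DataFrame({
    "series_id": ["a"] * 6,
    "event": ["onset", "wakeup", "onset", "onset", "wakeup", "onset"],
    "step": [100.0, 200.0, 300.0, 400.0, 500.0, np.nan],
})
repeated = find_repeated_events(toy_events)
print(repeated)
```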

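## Choosing variables
- External data is allowed, so which data should we bring in? (Note: no internet access is available.)

- Are there new variables we can derive from the existing data? (e.g., day of week, holiday periods)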
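
As a minimal sketch of derived variables, the local part of `timestamp` can be parsed to get the hour and the day of week. `add_calendar_features` and the new column names are hypothetical. Holiday flags are left out because they would need an external calendar.

```python
import pandas as pd

def add_calendar_features(series: pd.DataFrame) -> pd.DataFrame:
    """Add hour, day_of_week and is_weekend columns derived from the timestamp."""
    out = series.copy()
    # Use the local wall-clock time; the trailing UTC offset is dropped.
    local = pd.to_datetime(out["timestamp"].str[:19])
    out["hour"] = local.dt.hour
    out["day_of_week"] = local.dt.dayofweek  # Monday = 0, Sunday = 6
    out["is_weekend"] = out["day_of_week"] >= 5
    return out

toy_series = pd.DataFrame({
    "timestamp": ["2018-08-14T15:30:00-0400", "2018-08-18T09:00:00-0400"],
})
features = add_calendar_features(toy_series)
print(features)
```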

## Parts not yet fully understood
- What exactly does the **AP (average precision) score** mean (in this problem)?

- Detections are matched to ground-truth events within error tolerances, with ambiguities resolved in order of decreasing confidence. 
>> For both event classes, we use **error tolerance thresholds** of 1, 3, 5, 7.5, 10, 12.5, 15, 20, 25, 30 in minutes,  
or 12, 36, 60, 90, 120, 150, 180, 240, 300, 360 in steps.
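
To make the quoted rule more concrete, here is a simplified sketch. It is not the official metric implementation. The first part converts the tolerances from minutes to steps, using the 5-second spacing between steps (12 steps per minute). The second part, a hypothetical `match_detections`, goes through detections in order of decreasing confidence and matches each to the nearest unmatched ground-truth event within the tolerance. Whether the boundary is strict (`<`) or inclusive is an assumption here.

```python
STEPS_PER_MINUTE = 12  # one step every 5 seconds
tolerances_min = [1, 3, 5, 7.5, 10, 12.5, 15, 20, 25, 30]
tolerances_steps = [int(m * STEPS_PER_MINUTE) for m in tolerances_min]

def match_detections(truth_steps, detections, tolerance):
    """Greedy matching for one event class and one tolerance.

    detections: list of (step, score) tuples.
    Returns (score, is_true_positive) tuples in decreasing-confidence order.
    """
    unmatched = set(truth_steps)
    results = []
    for step, score in sorted(detections, key=lambda d: -d[1]):
        candidates = [t for t in unmatched if abs(t - step) < tolerance]
        if candidates:
            # Each ground-truth event can be claimed by only one detection.
            best = min(candidates, key=lambda t: abs(t - step))
            unmatched.remove(best)
            results.append((score, True))
        else:
            results.append((score, False))
    return results

# Hypothetical detections around the first onset (step 4992).
matches = match_detections(
    truth_steps=[4992, 10932],
    detections=[(5000, 0.9), (4990, 0.8), (20000, 0.5)],
    tolerance=tolerances_steps[0],
)
print(tolerances_steps)
print(matches)
```

In this toy case the detection at step 4990 is closer to the true onset, but the detection at 5000 has the higher score and claims it first, so 4990 counts as a false positive. Precision and recall computed from such lists would then feed into an average precision per tolerance.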



In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt 
import os
from tqdm import tqdm

# Work from the local competition data directory (machine-specific path)
os.chdir('C:/Users/최민석/Desktop/대학생활/대외활동/Kaggle/child-mind-institute-detect-sleep-states')

In [2]:
train_event=pd.read_csv("train_events.csv")
train_event.head()


Unnamed: 0,series_id,night,event,step,timestamp
0,038441c925bb,1,onset,4992.0,2018-08-14T22:26:00-0400
1,038441c925bb,1,wakeup,10932.0,2018-08-15T06:41:00-0400
2,038441c925bb,2,onset,20244.0,2018-08-15T19:37:00-0400
3,038441c925bb,2,wakeup,27492.0,2018-08-16T05:41:00-0400
4,038441c925bb,3,onset,39996.0,2018-08-16T23:03:00-0400


In [3]:
train_event.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14510 entries, 0 to 14509
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  14510 non-null  object 
 1   night      14510 non-null  int64  
 2   event      14510 non-null  object 
 3   step       9587 non-null   float64
 4   timestamp  9587 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 566.9+ KB


In [18]:
train_event.loc[train_event['step'].isnull(),:]

Unnamed: 0,series_id,night,event,step,timestamp
8,038441c925bb,5,onset,,
9,038441c925bb,5,wakeup,,
16,038441c925bb,9,onset,,
17,038441c925bb,9,wakeup,,
26,038441c925bb,14,onset,,
...,...,...,...,...,...
14439,fcca183903b7,36,wakeup,,
14440,fe90110788d2,1,onset,,
14441,fe90110788d2,1,wakeup,,
14508,fe90110788d2,35,onset,,


In [16]:
train_event['step'].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
14505    False
14506    False
14507    False
14508     True
14509     True
Name: step, Length: 14510, dtype: bool
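
The `info()` output above reports 9587 non-null `step` values out of 14510 rows, so 4923 event rows have no label. To see how those gaps are spread across subjects, a per-series summary along the following lines may help. It uses toy data and hypothetical column names (`n_missing`, `frac_missing`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_event: series "a" has one unlabelled night, "b" has only unlabelled rows.
toy_events = pd.DataFrame({
    "series_id": ["a", "a", "a", "a", "b", "b"],
    "night": [1, 1, 2, 2, 1, 1],
    "event": ["onset", "wakeup"] * 3,
    "step": [100.0, 200.0, np.nan, np.nan, np.nan, np.nan],
})

missing_per_series = (
    toy_events.assign(missing=toy_events["step"].isna())
    .groupby("series_id")["missing"]
    .agg(["sum", "mean"])
    .rename(columns={"sum": "n_missing", "mean": "frac_missing"})
)
print(missing_per_series)
```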

In [4]:
train_series=pd.read_parquet("train_series.parquet")
train_series.head()

Unnamed: 0,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.6367,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.6368,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.6368,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.6368,0.0215


In [5]:
train_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127946340 entries, 0 to 127946339
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   series_id  object 
 1   step       uint32 
 2   timestamp  object 
 3   anglez     float32
 4   enmo       float32
dtypes: float32(2), object(2), uint32(1)
memory usage: 3.3+ GB
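
The `+` in `3.3+ GB` means the object columns (`series_id`, `timestamp`) are only shallowly measured, so the real footprint is larger. One common way to reduce it, sketched below on a toy frame rather than the real data, is to store the repeated `series_id` strings as a categorical column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_series: a few distinct ids repeated many times.
toy = pd.DataFrame({
    "series_id": np.repeat(["038441c925bb", "03d92c9f6f8a"], 50_000),
    "step": np.tile(np.arange(50_000, dtype="uint32"), 2),
})

before = toy.memory_usage(deep=True).sum()
toy["series_id"] = toy["series_id"].astype("category")
after = toy.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```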


In [13]:
train_series.iloc[390000:600000]

Unnamed: 0,series_id,step,timestamp,anglez,enmo
390000,03d92c9f6f8a,120,2018-05-31T12:10:00-0400,-88.216599,0.0000
390001,03d92c9f6f8a,121,2018-05-31T12:10:05-0400,-88.216599,0.0000
390002,03d92c9f6f8a,122,2018-05-31T12:10:10-0400,-88.216599,0.0000
390003,03d92c9f6f8a,123,2018-05-31T12:10:15-0400,-88.216599,0.0000
390004,03d92c9f6f8a,124,2018-05-31T12:10:20-0400,-88.216599,0.0000
...,...,...,...,...,...
599995,03d92c9f6f8a,210115,2018-06-12T15:49:35-0400,-30.546600,0.0358
599996,03d92c9f6f8a,210116,2018-06-12T15:49:40-0400,-52.482101,0.1858
599997,03d92c9f6f8a,210117,2018-06-12T15:49:45-0400,-27.647499,0.1033
599998,03d92c9f6f8a,210118,2018-06-12T15:49:50-0400,-11.763500,0.0206
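
The first rows of this slice show `anglez` stuck at -88.216599 with `enmo` at 0.0000, which is the kind of motionless stretch that might mean the device was taken off. A minimal sketch for flagging such stretches is to mark runs of identical consecutive `anglez` values that last at least `min_len` samples. `flag_flat_runs` is a hypothetical helper and a suitable `min_len` would have to be chosen by experiment.

```python
import pandas as pd

def flag_flat_runs(anglez: pd.Series, min_len: int) -> pd.Series:
    """Mark samples belonging to a run of identical values of length >= min_len."""
    # A new run starts wherever the value differs from the previous sample.
    changed = anglez.diff().ne(0)
    run_id = changed.cumsum()
    run_len = run_id.groupby(run_id).transform("size")
    return run_len >= min_len

# Values modelled on the rows above: a flat stretch followed by real movement.
anglez = pd.Series([-88.216599] * 5 + [-30.5466, -52.482101, -27.647499])
flat = flag_flat_runs(anglez, min_len=4)
print(flat.tolist())
```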
