<div style="float: block; text-align: center; line-height: 1.7em">
    <span style="font-size: 2em; font-weight: bold"> Fatigue-Sleepiness in Irregular Workloads for Pilots </span><br>
    <span style="font-size: 1.5em"> Feature Engineering </span><br>
</div>

---

# 1. Loading Required Packages

In [1]:
import os
import pandas as pd
import numpy as np

---

# 2. Reading Data

## 2.1. Karolinska and Samn-Perelli Scales, Punch Clock and Actimetry Data

This file combines three kinds of measurements: fatigue-sleepiness scales, punch clock records and sleep measurements.

All of these measurements were taken longitudinally over roughly one and a half semesters, from 06/2021 to 03/2022. Except for the punch clock and actimetry data, the participants reported everything (the Karolinska and Samn-Perelli scales and other information) through a Google Forms questionnaire.

* **Punch Clock Measurements**: the moments when the participant starts and finishes duty, as registered on a punch clock.

* **Fatigue-Sleepiness Scale**: a questionnaire in which the participant rates their fatigue and sleepiness at three moments: the beginning, the middle and the end of the workload (duty period).

* **Sleep Measurement**: the time at which the participant goes to sleep and the time at which they wake up. These data are validated against a sleep diary kept by the participant, actimetric measurements (such as arm movement and ambient light) and a sleep quality scale, which the participant also filled in through a Google Forms questionnaire.

In [23]:
file = os.path.join('data','data_raw_questionnaire.csv')
df1 = pd.read_csv(file)

df1.head(2)

Unnamed: 0,ID,Record_Category,Record_time_stamp,Sam_Perelli,Karolinska,Workload_start,Workload_end,Workload_length,Workload_category,Sleep_begin,...,Sleep_quality,Time_awake_before_workload,Nap_1_begin,Nap_1_end,Nap_1_quality,Nap_1_length,Nap_2_start,Nap_2_end,Nap_2_length,Nap_2_quality
0,P01,INÍCIO DE JORNADA,7/12/21 12:21,"4. Um pouco cansado, não totalmente disposto","5. Nem alerta, nem sonolento",15:55:00,00:48:00,533,AFT,23:04,...,5,08:28,,,,,,,,
1,P01,MEIO DE JORNADA,7/12/21 19:57,"5. Moderadamente cansado, enfraquecido","5. Nem alerta, nem sonolento",15:55:00,00:48:00,534,AFT,23:04,...,5,08:28,,,,,,,,
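The first row shown above has `Workload_start` 15:55:00, `Workload_end` 00:48:00 and `Workload_length` 533, which suggests that the length is expressed in minutes and that a duty can cross midnight. Below is a minimal sketch of that arithmetic. It assumes both columns hold `HH:MM:SS` strings and that an end time earlier than the start time means the duty finished on the following day. The helper name `workload_minutes` and the second sample row are hypothetical and not part of the notebook.

```python
import pandas as pd

def workload_minutes(start, end):
    """Duty length in whole minutes from two Series of 'HH:MM:SS' strings."""
    start_td = pd.to_timedelta(start)
    end_td = pd.to_timedelta(end)
    length = end_td - start_td
    # Assumption: a negative difference means the duty ended after midnight.
    crossed_midnight = length < pd.Timedelta(0)
    length = length.where(~crossed_midnight, length + pd.Timedelta(days=1))
    return (length.dt.total_seconds() // 60).astype(int)

# First row: values displayed above. Second row: made up for illustration.
sample = pd.DataFrame({"Workload_start": ["15:55:00", "06:10:00"],
                       "Workload_end": ["00:48:00", "14:40:00"]})
sample["length_min"] = workload_minutes(sample["Workload_start"],
                                        sample["Workload_end"])
```

A check of this kind can flag records whose stored `Workload_length` disagrees with the punch clock times.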


## 2.2. Socio-Economic Data

These data were collected only once, through a questionnaire that the participants filled in at the beginning of the study.

In [24]:
file = os.path.join('data','data_raw_sociodemo.csv')
df2 = pd.read_csv(file)

df2.head(4)

Unnamed: 0,Id,Sex,Position,Num_Sons,Age,rate_income,Number_sons,Younger_son_age,Son1,Son2,Son3,Flight_hours,Time_company,Time_aviation,Education,Marital_status,Number_residents,Time_displacement
0,P01,MALE,CMTE,2,49,70.0,2,11.0,18.0,11.0,,17000.0,19.5,25.0,PHD,MARRIED,3,60
1,P02,MALE,CMTE,2,57,80.0,2,0.0,,,,17000.0,9.0,30.0,ESPEC,MARRIED,1,20
2,P03,MALE,CMTE,1,42,80.0,1,8.0,8.0,,,8000.0,12.0,12.0,ESPEC,NON_STABLE,2,420
3,P04,MALE,CMTE,0,34,60.0,0,0.0,,,,5200.0,13.0,13.0,ESPEC,MARRIED,1,20


## 2.3. Chronotype data

These data were collected only once, at the beginning of the study, through a test that classifies each participant's chronotype as morning (`MAT`), intermediate (`INT`) or evening (`VES`).

In [25]:
file = os.path.join('data','data_raw_chronotype.csv')
df3 = pd.read_csv(file)

df3.head(5)

Unnamed: 0,Id,S1,S2,S3,S4,S5,S6,S7,Score_sum,Result,Classification
0,P01,3,2,4,2,2,4,4,21,3.0,INT
1,P02,3,2,3,3,2,3,3,19,2.71,INT
2,P03,3,3,3,4,4,4,4,25,3.57,MAT
3,P04,2,1,2,3,1,4,2,15,2.14,INT
4,P05,1,1,2,1,1,2,2,10,1.43,VES
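In the five rows displayed above, `Score_sum` equals the sum of items `S1` to `S7`, and `Result` equals `Score_sum` divided by 7 and rounded to two decimals. The sketch below checks both relations on those rows, typed in by hand. Whether the relations hold for the whole file is an assumption. The cut-offs behind `Classification` are not given here, so they are not reproduced.

```python
import numpy as np
import pandas as pd

# The five rows displayed above, typed in by hand for illustration.
chrono = pd.DataFrame({
    "Id": ["P01", "P02", "P03", "P04", "P05"],
    "S1": [3, 3, 3, 2, 1], "S2": [2, 2, 3, 1, 1], "S3": [4, 3, 3, 2, 2],
    "S4": [2, 3, 4, 3, 1], "S5": [2, 2, 4, 1, 1], "S6": [4, 3, 4, 4, 2],
    "S7": [4, 3, 4, 2, 2],
    "Score_sum": [21, 19, 25, 15, 10],
    "Result": [3.0, 2.71, 3.57, 2.14, 1.43],
})

items = [f"S{i}" for i in range(1, 8)]

# Score_sum should be the row-wise sum of the seven items.
sum_ok = chrono[items].sum(axis=1).eq(chrono["Score_sum"])

# Result should be the mean item score, rounded to two decimals.
mean_ok = np.isclose((chrono["Score_sum"] / len(items)).round(2),
                     chrono["Result"])
```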


## 2.4. Karolinska 6 months sleep

This file holds the Karolinska and Samn-Perelli sleep quality indices for the last 6 months. They come from a questionnaire with more than 10 questions. Mathematical operations based on previous studies (factor analysis) reduce the answers to the three factors presented in this file.

In [26]:
file = os.path.join('data','data_raw_KSQ.csv')
df4 = pd.read_csv(file)

df4.head(5)

Unnamed: 0,Id,Disturbed sleep index,Awakening index,Sleepiness/fatigue
0,P01,13,9,14
1,P02,6,3,25
2,P03,11,7,10
3,P04,11,9,12
4,P05,18,11,18
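The four tables share a participant identifier, but it is spelled `ID` in the questionnaire file and `Id` in the other three. The sketch below shows one way to attach a participant-level table to the longitudinal records. It uses toy frames with invented values, not the real data, and the names `records`, `chronotype` and `merged` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: several records per participant, one chronotype row each.
records = pd.DataFrame({"ID": ["P01", "P01", "P02"],
                        "Karolinska": [5, 5, 3]})
chronotype = pd.DataFrame({"Id": ["P01", "P02"],
                           "Classification": ["INT", "INT"]})

merged = records.merge(
    chronotype.rename(columns={"Id": "ID"}),  # align the key spelling
    on="ID",
    how="left",               # keep every record, even without a match
    validate="many_to_one",   # fail if a participant appears twice on the right
)
```

With `validate="many_to_one"`, pandas raises an error instead of silently duplicating records when the participant-level table contains repeated identifiers.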


---

# 3. Feature Engineering of Workload Types and Scales

## 3.1. Function to create features based on substrings

In [51]:
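def func_wType(x):
    # Map a record label to a category by substring matching.
    # Relies on the global `category_dict`, which must be assigned before the call.
    # Keys are tested in insertion order and the first match wins, so a key that
    # contains another key (e.g. 'MONOFOLGA' contains 'FOLGA') must be listed first.
    # Not to be used on big data: too slow.
    # Call as: df[new_col] = df[col].apply(func_wType)

    group = "others"
    for key in category_dict:
        if key in x:
            group = category_dict[key]
            break
    return group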
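`func_wType` reads the module-level `category_dict` and returns the first key found in the label, so its result depends on key order whenever one key is contained in another (`'FOLGA'` is a substring of `'MONOFOLGA'`). The variant below is a sketch, not the function used in this notebook: it takes the mapping as an argument and tries longer keys first. The name `map_by_substring` is hypothetical.

```python
def map_by_substring(label, mapping, default="others"):
    """Return the category of the longest key of `mapping` found in `label`."""
    # Longer keys are tried first, so 'MONOFOLGA' is tested before 'FOLGA'.
    for key in sorted(mapping, key=len, reverse=True):
        if key in label:
            return mapping[key]
    return default

moments = {'FOLGA': 'day_off', 'MONOFOLGA': 'single_day_off'}

result_single = map_by_substring('MONOFOLGA', moments)
result_off = map_by_substring('FOLGA', moments)
result_other = map_by_substring('RESERVA', moments)
```

It could be applied to a column with `df[col].apply(map_by_substring, args=(moments,))`.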

## 3.2. Variables Derived from "Record_Category" Column

### 3.2.1. Workload Moment

The workload moment variable records the point in the workload at which the Karolinska and Samn-Perelli scales were filled in.

In [57]:
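# 'MONOFOLGA' must come before 'FOLGA': func_wType returns the first key found
# as a substring, and 'FOLGA' is a substring of 'MONOFOLGA'.
category_dict = {'INÍCIO DE JORNADA':'begin', 'MEIO DE JORNADA':'middle', 'FIM DE JORNADA':'end',
                 'MONOFOLGA':'single_day_off', 'FOLGA':'day_off', 'SOBREAVISO':'warning',
                 'RESERVA':'reserve'}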

df1['workload_moment'] = df1['Record_Category'].apply(func_wType)

### 3.2.2. Workload Type

The workload type variable identifies the aviation-relevant type of workload, based on the hour at which the pilot begins duty.

In [58]:
category_dict = {'EARLY-START':'early-start', 'MADRUGADA':'night'}

df1['workload_type'] = df1['Record_Category'].apply(func_wType)

### 3.2.3. Type of Previous Workload (Early Start, Night, Single Day-Off)

This variable identifies the type of the previous workloads. Only three types of previous workload were selected: Early-Start, Night and Single Day-Off.

#### Previous Early-Start Workloads

In [65]:
category_dict = {'APÓS EARLY-START':'1', 'TRÊS EARLY-START':'3', 'APÓS DOIS EARLY-START': '2',
                 'APÓS DUAS JORNADAS EARLY-START': '2'}

df1['workload_type_prev_es'] = df1['Record_Category'].apply(func_wType).replace('others','0')

#### Previous Night Workloads

In [66]:
category_dict = {'APÓS MADRUGADA':'1', 'APÓS DUAS MADRUGADAS':'2', 'APÓS JORNADA NA MADRUGADA':'1',
                 'APÓS DUAS JORNADAS NA MADRUGADA' :'2', 'APÓS DUAS JORNADASNA MADRUGADA' :'2'}

df1['workload_type_prev_nt'] = df1['Record_Category'].apply(func_wType).replace('others','0')

In [69]:
df1[['workload_type','workload_type_prev_nt','workload_type_prev_es']].drop_duplicates().head(100)

Unnamed: 0,workload_type,workload_type_prev_nt,workload_type_prev_es
0,others,0,0
3,early-start,0,0
5,early-start,0,1
37,night,0,0
38,night,1,0
41,night,2,0
46,early-start,1,0
127,early-start,0,2
210,early-start,0,3
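Because `func_wType` returns strings, the two `workload_type_prev_*` columns hold the counts as `'0'` to `'3'` rather than as numbers. If they are later used as numeric features, a conversion such as the sketch below may be convenient. The toy Series stands in for the real column, and the names `prev_es`, `prev_es_int` and `total_prev` are hypothetical.

```python
import pandas as pd

# Toy column mimicking workload_type_prev_es (counts stored as strings).
prev_es = pd.Series(['0', '1', '2', '3', '0'], name='workload_type_prev_es')

# Cast to integers so the counts can be summed, compared or averaged.
prev_es_int = prev_es.astype(int)
total_prev = int(prev_es_int.sum())
```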
