<div style="float: block; text-align: center; line-height: 1.7em">
    <span style="font-size: 2em; font-weight: bold"> Fatigue-Sleepiness in Irregular Workloads for Pilots </span><br>
    <span style="font-size: 1.5em"> Features Engineering </span><br>
</div>

---

# 1. Loading Required Packages

In [1]:
import os as os
import pandas as pd
import numpy as np

---

# 2. Reading Data

## 2.1. Karolinska, Sam-Perelli scales, Punch Clock and Actimetry data

This file consists of three kinds of measurements: fatigue-sleepiness scales, punch clock records and sleep measurements.

All of these measurements were collected longitudinally over roughly one and a half semesters, from 06/2021 to 03/2022. Apart from the punch clock and actimetry data, the participants reported the Karolinska and Sam-Perelli scales and other information through a Google Forms questionnaire.

* **Punch Clock Measurements**: the participant registers on a punch clock the moments when duties start and finish.

* **Fatigue-Sleepiness scale**: a questionnaire in which the participant rates fatigue and sleepiness at three moments: the beginning, the middle and the end of the workload (duty period).

* **Sleep Measurement**: the time at which the participant goes to sleep and the time of awakening. These data are validated against a sleep diary filled in by the participant, actimetric measurements (such as arm movement and ambient luminosity) and a sleep quality scale, also filled in by the participant in a Google Forms questionnaire.

In [2]:
file = os.path.join('data','data_raw_questionnaire.csv')
df1 = pd.read_csv(file)

df1.head(2)

Unnamed: 0,Id,Record_Category,Record_time_stamp,Sam_Perelli,Karolinska,Workload_start,Workload_end,Workload_length,Workload_category,Sleep_begin,...,Sleep_quality,Time_awake_before_workload,Nap_1_begin,Nap_1_end,Nap_1_quality,Nap_1_length,Nap_2_start,Nap_2_end,Nap_2_length,Nap_2_quality
0,P01,INÍCIO DE JORNADA,07/12/2021 12:21,"4. Um pouco cansado, não totalmente disposto","5. Nem alerta, nem sonolento",15:55:00,00:48:00,533,AFT,23:04,...,5,08:28,,,,,,,,
1,P01,MEIO DE JORNADA,07/12/2021 19:57,"5. Moderadamente cansado, enfraquecido","5. Nem alerta, nem sonolento",15:55:00,00:48:00,534,AFT,23:04,...,5,08:28,,,,,,,,


## 2.2. Socio-Economic Data

These data were collected just once, through a questionnaire filled in by the participants at the beginning of the study.

In [3]:
file = os.path.join('data','data_raw_sociodemo.csv')
df2 = pd.read_csv(file)

df2.head(4)

Unnamed: 0,Id,Sex,Position,Num_Sons,Age,rate_income,Number_sons,Younger_son_age,Son1,Son2,Son3,Flight_hours,Time_company,Time_aviation,Education,Marital_status,Number_residents,Time_displacement
0,P01,MALE,CMTE,2,49,70.0,2,11.0,18.0,11.0,,17000.0,19.5,25.0,PHD,MARRIED,3,60
1,P02,MALE,CMTE,2,57,80.0,2,0.0,,,,17000.0,9.0,30.0,ESPEC,MARRIED,1,20
2,P03,MALE,CMTE,1,42,80.0,1,8.0,8.0,,,8000.0,12.0,12.0,ESPEC,NON_STABLE,2,420
3,P04,MALE,CMTE,0,34,60.0,0,0.0,,,,5200.0,13.0,13.0,ESPEC,MARRIED,1,20


## 2.3. Chronotype data

These data were collected just once, at the beginning of the study, through a test that classifies the participant's chronotype as morning (MAT), intermediate (INT) or evening (VES).

In [4]:
file = os.path.join('data','data_raw_chronotype.csv')
df3 = pd.read_csv(file)

df3.head(5)

Unnamed: 0,Id,S1,S2,S3,S4,S5,S6,S7,Score_sum,Result,Classification
0,P01,3,2,4,2,2,4,4,21,3.0,INT
1,P02,3,2,3,3,2,3,3,19,2.71,INT
2,P03,3,3,3,4,4,4,4,25,3.57,MAT
3,P04,2,1,2,3,1,4,2,15,2.14,INT
4,P05,1,1,2,1,1,2,2,10,1.43,VES


## 2.4. Karolinska 6 months sleep

These are the Karolinska and Sam-Perelli sleep quality indices for the last 6 months. They come from a questionnaire with more than 10 questions; mathematical operations based on previous studies (factor analysis) reduce the answers to a series of three factors, which are presented in this file.

In [5]:
file = os.path.join('data','data_raw_KSQ.csv')
df4 = pd.read_csv(file)

df4.head(5)

Unnamed: 0,Id,Disturbed sleep index,Awakening index,Sleepiness/fatigue
0,P01,13,9,14
1,P02,6,3,25
2,P03,11,7,10
3,P04,11,9,12
4,P05,18,11,18


---

# 3. Features Engineering of Workloads Types, Scales and Sleep

## 3.1. Function to create features based on substrings

In [6]:
def func_wType(x):
    # Returns the label of the first key of the global `category_dict` found in x.
    # Not to be used on big data, too slow.
    # Call as: df[new_col] = df[col].apply(func_wType)
    
    group = "others"
    for key in category_dict:
        if key in x:
            group = category_dict[key]
            break
    return group
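
The helper above reads `category_dict` from the enclosing scope, so each cell has to redefine that dictionary before calling it. The sketch below is a possible variant that takes the mapping as an argument. The name `label_by_substring` and the sample strings are illustrative assumptions, not part of the original notebook. Because the first matching key wins, a longer key has to come before any shorter key it contains.

```python
def label_by_substring(text, mapping, default="others"):
    """Return the label of the first key in `mapping` found inside `text`."""
    for key, label in mapping.items():
        if key in text:
            return label
    return default


# Dictionaries keep insertion order, so the first matching key wins.
moments = {"MONOFOLGA": "single_day_off", "FOLGA": "day_off"}

print(label_by_substring("MONOFOLGA", moments))    # single_day_off
print(label_by_substring("FOLGA", moments))        # day_off
print(label_by_substring("TREINAMENTO", moments))  # others (hypothetical category)
```

With pandas, the mapping could be passed as a keyword argument, for example `df1['Record_Category'].apply(label_by_substring, mapping=moments)`.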

## 3.2. Variables Derived from "Record_Category" Column

### 3.2.1. Workload Moment

The variable `workload_moment` categorizes the moment of the workload at which the Karolinska and Sam-Perelli scales were filled in.

In [7]:
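# 'MONOFOLGA' must come before 'FOLGA': the first matching substring wins,
# and 'FOLGA' is contained in 'MONOFOLGA'.
category_dict = {'INÍCIO DE JORNADA':'start', 'MEIO DE JORNADA':'middle', 'FIM DE JORNADA':'end',
                 'MONOFOLGA':'single_day_off', 'FOLGA':'day_off', 'SOBREAVISO':'warning', 
                 'RESERVA':'reserve'}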

df1['workload_moment'] = df1['Record_Category'].apply(func_wType)

### 3.2.2. Workload Type

The variable `workload_type` identifies the aviation-relevant type of workload, based on the hour at which the pilot begins duty.

In [8]:
category_dict ={'EARLY-START':'early-start', 'MADRUGADA':'night'}

df1['workload_type'] = df1['Record_Category'].apply(func_wType)

### 3.2.3. Previous Workload Type (Early Start, Night or Single Day-Off)

This variable identifies the types of the previous workloads. Only three types of previous workload were selected: Early-Start, Night and Single Day-Off.

#### Previous Early-Start Workloads

In [9]:
category_dict = {'APÓS EARLY-START':'1', 'TRÊS EARLY-START':'3', 'APÓS DOIS EARLY-START': '2',
                 'APÓS DUAS JORNADAS EARLY-START': '2'}

df1['workload_type_prev_es'] = df1['Record_Category'].apply(func_wType).replace('others','0')

#### Previous Night Workloads

In [10]:
category_dict = {'APÓS MADRUGADA':'1', 'APÓS DUAS MADRUGADAS':'2', 'APÓS JORNADA NA MADRUGADA':'1',
                 'APÓS DUAS JORNADAS NA MADRUGADA' :'2', 'APÓS DUAS JORNADASNA MADRUGADA' :'2'}

df1['workload_type_prev_nt'] = df1['Record_Category'].apply(func_wType).replace('others','0')

### 3.2.4. Adjusting Workload Type

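Because the workload type query matches any record containing "EARLY-START", it also matches records such as "APÓS EARLY-START" (after early-start), which describe a previous workload rather than the current one. These occurrences need to be reset to "others", and the same applies to the night category.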

In [11]:
# The previous-workload counters are strings ('0', '1', ...), so compare against '0', not 0
df1['workload_type'] = np.where( (df1['workload_type_prev_es']!='0') & (df1['workload_type']=='early-start'), 'others', df1['workload_type'])
df1['workload_type'] = np.where( (df1['workload_type_prev_nt']!='0') & (df1['workload_type']=='night'), 'others', df1['workload_type'])
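
The adjustment can be checked on a few toy rows. This is a minimal sketch with hypothetical values. It assumes the previous-workload counters hold strings ('0', '1', ...), as produced by the `category_dict` mappings above, so the comparison is made against the string '0'.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) mimicking the three columns built above.
toy = pd.DataFrame({
    "workload_type":         ["early-start", "early-start", "night", "others"],
    "workload_type_prev_es": ["0", "1", "0", "0"],
    "workload_type_prev_nt": ["0", "0", "2", "0"],
})

# Rows tagged "early-start"/"night" only because they follow such a workload.
after_es = (toy["workload_type_prev_es"] != "0") & (toy["workload_type"] == "early-start")
after_nt = (toy["workload_type_prev_nt"] != "0") & (toy["workload_type"] == "night")

toy["workload_type"] = np.where(after_es | after_nt, "others", toy["workload_type"])
print(toy["workload_type"].tolist())
# ['early-start', 'others', 'others', 'others']
```

Only the first row keeps its label, since it is the only early-start record without a previous early-start count.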

### 3.2.5. Summary

In summary, the following variables were extracted from the column "Record_Category":

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| workload_moment | Moment of workload | Moment of the workload at which the participant fills in the Karolinska or Sam-Perelli scales (start, middle, end, day-off) |
| workload_type | Type of workload | Identifies whether the current workload (duty period) covers early-start or night periods |
| workload_type_prev_es | Type of the previous workload early-start | Number of previous workloads that cover an early-start period |
| workload_type_prev_nt | Type of the previous workload night | Number of previous workloads that cover a night period |

## 3.3. Variables relative to Durations and Times

### 3.3.1. Workload Length

In [12]:
# Workload length in hours (the raw column is in minutes)
df1['workload_length'] = pd.to_numeric( df1['Workload_length'], errors='coerce')/60

### 3.3.2. Sleep Duration Before Workload

In [13]:
# Sleep duration in hours
df1['sleep_duration'] = pd.to_numeric( df1['Sleep_total_duration_before_workload_minutes'], errors='coerce')/60

### 3.3.3. Time Awake Before Workload

In [14]:
def hr_mn_func(ts):
    # Return the time of day of a timestamp as decimal hours
    return ts.hour+ts.minute/60

df1['time_awake'] = pd.to_datetime(df1['Time_awake_before_workload'], errors ='coerce').apply(hr_mn_func)
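
As an illustration of the 'HH:MM' to decimal-hours conversion, the sketch below does the same arithmetic column-wise through the `.dt` accessor. The sample values and the explicit `format="%H:%M"` are assumptions made for this example, not taken from the data set.

```python
import pandas as pd

# Hypothetical 'HH:MM' strings; the missing entry stands for an unfilled answer.
raw = pd.Series(["08:28", "00:45", None])

# An explicit format avoids ambiguous parsing; unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%H:%M", errors="coerce")

# Decimal hours, computed for the whole column at once.
hours = parsed.dt.hour + parsed.dt.minute / 60
print(hours.round(2).tolist())  # [8.47, 0.75, nan]
```

Missing entries stay missing (`NaN`) instead of raising an error, which matches the behaviour of `errors='coerce'` in the cell above.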

### 3.3.4. Summary

The following variables related to durations were transformed:


| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| workload_length | Workload Duration | Total workload duration based on punch clock record |
| sleep_duration | Sleep Duration before the workload | Total sleep duration before the workload based on sleep diary and actimetric measurement |
| time_awake | Time Awake before the workload | Time awake before the workload based on sleep diary, actimetric measurements and punch clock record |


## 3.4. Variables Relative to Naps

### 3.4.1. Number of Naps During Workload

In [15]:
df1.loc[pd.to_datetime(df1['Nap_1_length'], errors='coerce').apply(hr_mn_func) > 0, 'nap_number'] = 1
df1.loc[pd.to_datetime(df1['Nap_2_length'], errors='coerce').apply(hr_mn_func) > 0, 'nap_number'] = 2
df1['nap_number'] = df1['nap_number'].fillna(0).astype(int)

### 3.4.2. Total Nap Duration

In [16]:
# Fill missing naps with 0 before adding, otherwise a single nap (value + NaN) is lost
nap_1 = pd.to_datetime(df1['Nap_1_length'], errors='coerce').apply(hr_mn_func).fillna(0)
nap_2 = pd.to_datetime(df1['Nap_2_length'], errors='coerce').apply(hr_mn_func).fillna(0)
df1['nap_duration'] = nap_1 + nap_2
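
Missing nap lengths propagate through addition, so the order of `fillna` and `+` matters when a participant took only one nap. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical nap lengths already converted to hours; NaN means "no nap".
nap_1 = pd.Series([0.5, 0.25, np.nan])
nap_2 = pd.Series([np.nan, 0.5, np.nan])

# Adding first and filling afterwards loses the single-nap row (0.5 + NaN is NaN).
fill_after = (nap_1 + nap_2).fillna(0)

# Filling each column first keeps every recorded nap in the total.
fill_before = nap_1.fillna(0) + nap_2.fillna(0)

print(fill_after.tolist())   # [0.0, 0.75, 0.0]
print(fill_before.tolist())  # [0.5, 0.75, 0.0]
```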

### 3.4.3. Summary

The following variables were produced:

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| nap_number | Number of naps | Number of naps during the workload period |
| nap_duration | Total nap duration | Sum of duration of all naps during the workload period in hours |

---

# 4. Retrieving Karolinska and Sam-Perelli scales

## 4.1. Karolinska and Sam-Perelli scales from text data

In [17]:
def kss_sps(x):
    # The reported score is the first character of the answer text
    return x[0]

df1['kss'] = df1['Karolinska'].fillna('0').apply(kss_sps).astype(int).replace(0,np.nan)
df1['sps'] = df1['Sam_Perelli'].fillna('0').apply(kss_sps).astype(int).replace(0,np.nan)
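
An alternative way to pull the leading score out of the answer text is a regular expression. This is only a sketch: the sample answers are hypothetical and simply follow the "number. label" layout seen in the data preview.

```python
import pandas as pd

# Hypothetical answers in the questionnaire's "<number>. <label>" layout.
answers = pd.Series(["5. Nem alerta, nem sonolento", None, "9. Muito sonolento"])

# Capture the leading digits; rows without a match stay missing.
digits = answers.str.extract(r"^(\d+)", expand=False)
scores = pd.to_numeric(digits, errors="coerce")
print(scores.tolist())  # [5.0, nan, 9.0]
```

Unlike taking the first character, the pattern would also cope with a two-digit score, and unanswered rows become `NaN` without a placeholder value.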

## 4.2. Hour wherein the Participant filled Karolinska and Sam-Perelli scales

To categorize the hour at which the participant filled in the Karolinska and Sam-Perelli scales, we follow an international aviation convention, shown in the table below:

| Category | Long Name | Convention |
| --- | --- | --- |
| EM | Early Morning (or Early Start) | Between 06:00 and 07:59 |
| MOR | Morning | Between 08:00 and 11:59 |
| AFT | Afternoon | Between 12:00 and 17:59 |
| EVE | Evening | Between 18:00 and 23:59 |
| NI | Night | Between 00:00 and 05:59 |

Here the first period is called EM (Early Morning) to avoid ambiguity with the variable ES (Early-Start).

In [18]:
def categorization_hour(x):
    # Map a decimal hour of the day to its period; missing hours fall back to 'others'
    hr = 'others'
    if 0 <= x < 6:
        hr = 'NI'
    elif 6 <= x < 8:
        hr = 'EM'
    elif 8 <= x < 12:
        hr = 'MOR'
    elif 12 <= x < 18:
        hr = 'AFT'
    elif 18 <= x < 24:
        hr = 'EVE'
    return hr

df1['quest_fill_hour'] = pd.to_datetime(df1['Record_time_stamp'], errors ='coerce').apply(hr_mn_func).astype(float)\
                                                                                    .apply(categorization_hour)
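
The same binning can also be sketched with `pd.cut`, which works on the whole column at once. The hours below are hypothetical; the bin edges and labels are taken from the convention table above.

```python
import pandas as pd

# Hypothetical fill times expressed as decimal hours.
hours = pd.Series([0.5, 6.0, 11.99, 12.35, 19.95])

# right=False makes each bin closed on the left: [0, 6), [6, 8), ...
periods = pd.cut(hours,
                 bins=[0, 6, 8, 12, 18, 24],
                 labels=["NI", "EM", "MOR", "AFT", "EVE"],
                 right=False)
print(periods.astype(str).tolist())  # ['NI', 'EM', 'MOR', 'AFT', 'EVE']
```

One difference to keep in mind: with `pd.cut`, a missing hour yields a missing category rather than the string 'others'.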

## 4.3. Summary

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| kss | Karolinska Scale | Karolinska sleepiness perception scale |
| sps | Sam-Perelli Scale | Sam-Perelli fatigue perception scale |
| quest_fill_hour | Moment wherein the kss and sps were filled | Hour of the day wherein the participant filled the kss and sps scales in categories (EM, MOR, AFT, EVE, NI) |

---

# 5. Joining the Databases

The following variables will be used:

| Number | Variable | Variable Long Name | Type | Short Description |
| --- | --- | --- | --- | --- |
| 1 | **Id** | Identification | Categorical | Participant's identification |
| 2 | **kss** | Karolinska Scale | Categorical Ordinal | Karolinska sleepiness perception scale |
| 3 | **sps** | Sam-Perelli Scale | Categorical Ordinal | Sam-Perelli fatigue perception scale |
| 4 | **quest_fill_hour** | Moment wherein the kss and sps were filled | Categorical | Hour of the day wherein the participant filled the kss and sps scales in categories (EM, MOR, AFT, EVE, NI) |
| 5 | **workload_moment** | Moment of workload | Categorical | Moment of the workload at which the participant fills in the Karolinska or Sam-Perelli scales (start, middle, end) |
| 6 | **workload_type** | Type of workload | Categorical | Identifies whether the current workload (duty period) covers early-start or night periods |
| 7 | **workload_type_prev_es** | Type of previous workload early-start | Categorical | Number of previous workloads that cover an early-start period |
| 8 | **workload_type_prev_nt** | Type of previous workload night | Categorical | Number of previous workloads that cover a night period |
| 9 | **workload_length** | Workload Duration | Float | Total workload duration based on punch clock record |
| 10 | **sleep_duration** | Sleep Duration before the workload | Float | Total sleep duration before the workload based on sleep diary and actimetric measurement |
| 11 | **time_awake** | Time Awake before the workload | Float | Time awake before the workload based on sleep diary, actimetric measurements and punch clock record |
| 12 | **nap_number** | Number of naps | Categorical Ordinal | Number of naps during the workload period |
| 13 | **nap_duration** | Total nap duration | Float | Sum of duration of all naps during the workload period in hours |
| 14 | **Sex** | Gender | Categorical | Gender of participant |
| 15 | **Position** | Work position | Categorical | Work position if commander or co-pilot |
| 16 | **Num_Sons** | Number of Sons | Categorical Ordinal | Number of dependent sons |
| 17 | **Flight_hours** | Number of flight hours | Integer | Number of flight hours |
| 18 | **Education** | Education level | Categorical Ordinal | Education level |
| 19 | **Marital_status** | Marital status | Categorical | Marital status |
| 20 | **Time_displacement** | Home to work time | Float | Time spent from home to work |
| 21 | **Classification** | Chronotype | Categorical | Chronotype based on test |
| 22 | **Disturbed_sleep** | Disturbed sleep index | Integer | Perceived disturbed sleep index in past 6 months |
| 23 | **Awakening** | Awakening index | Integer | Perceived mean awakening index in past 6 months |
| 24 | **Sleep_Fatig** | Sleepiness Fatigue index | Integer | Perceived sleepiness/fatigue index in past 6 months |


In [47]:
# Merging (Joining) the data bases

cols1 = ['Id', 'kss', 'sps', 'quest_fill_hour', 'workload_moment', 'workload_type', 'workload_type_prev_es',
         'workload_type_prev_nt', 'workload_length', 'sleep_duration', 'time_awake', 'nap_number', 'nap_duration']
cols2 = ['Id', 'Sex', 'Position', 'Num_Sons', 'Flight_hours', 'Education', 'Marital_status', 'Time_displacement']
cols3 = ['Id', 'Classification']
cols4 = ['Id', 'Disturbed sleep index', 'Awakening index', 'Sleepiness/fatigue']

dict_rename = {'Disturbed sleep index':'Disturbed_sleep',
               'Awakening index':'Awakening',
               'Sleepiness/fatigue':'Sleep_Fatig'}

df = df1[cols1].merge(df2[cols2], how='inner', on = 'Id')\
               .merge(df3[cols3], how='inner', on = 'Id')\
               .merge(df4[cols4].rename(columns = dict_rename), how='inner', on = 'Id')
df.head(5)

Unnamed: 0,Id,kss,sps,quest_fill_hour,workload_moment,workload_type,workload_type_prev_es,workload_type_prev_nt,workload_length,sleep_duration,...,Position,Num_Sons,Flight_hours,Education,Marital_status,Time_displacement,Classification,Disturbed_sleep,Awakening,Sleep_Fatig
0,P01,5,4.0,AFT,start,others,0,0,8.883333,8.383333,...,CMTE,2,17000.0,PHD,MARRIED,60,INT,13,9,14
1,P01,5,5.0,EVE,middle,others,0,0,8.9,8.383333,...,CMTE,2,17000.0,PHD,MARRIED,60,INT,13,9,14
2,P01,7,6.0,NI,end,others,0,0,8.916667,8.383333,...,CMTE,2,17000.0,PHD,MARRIED,60,INT,13,9,14
3,P01,7,5.0,NI,start,others,0,0,5.45,7.2,...,CMTE,2,17000.0,PHD,MARRIED,60,INT,13,9,14
4,P01,4,3.0,MOR,middle,others,0,0,5.466667,7.2,...,CMTE,2,17000.0,PHD,MARRIED,60,INT,13,9,14
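
Because every join above is an inner join on `Id`, records of a participant missing from any of the tables are dropped without warning. The sketch below, on toy tables with hypothetical values, shows one way to make such joins more defensive using the `validate` argument of `DataFrame.merge` and a row-count comparison.

```python
import pandas as pd

# Toy tables (hypothetical values): repeated records and one row per participant.
records = pd.DataFrame({"Id": ["P01", "P01", "P02", "P03"], "kss": [5, 7, 4, 6]})
people = pd.DataFrame({"Id": ["P01", "P02"], "Sex": ["MALE", "MALE"]})

# validate raises MergeError if `people` has duplicated Ids;
# the inner join silently drops records without a match (P03 here).
merged = records.merge(people, how="inner", on="Id", validate="many_to_one")

print(len(records), len(merged))                       # 4 3
print(sorted(set(records["Id"]) - set(merged["Id"])))  # ['P03']
```

Comparing the row counts before and after the merge is a cheap check that no participant was lost along the way.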


In [48]:
ofile = os.path.join('data', 'processed_features.csv')
try:
    df.to_csv(ofile, index=False)
    print('---------------------------------------------')
    print(f'Success! Data recorded in file: {ofile}')
    print('---------------------------------------------')
except OSError as err:
    # Catch only I/O failures (missing folder, permissions) and report the cause
    print('---------------------------------------------')
    print(f'Data not recorded in file: {ofile}, verify! ({err})')
    print('---------------------------------------------')

---------------------------------------------
Success! Data recorded in file: data/processed_features.csv
---------------------------------------------
