<div style="float: block; text-align: center; line-height: 1.7em">
    <span style="font-size: 2em; font-weight: bold"> Fatigue-Sleepiness in Irregular Workloads for Pilots </span><br>
    <span style="font-size: 1.5em"> Features Engineering </span><br>
</div>

---

# 1. Loading Required Packages

In [1]:
import os as os
import pandas as pd
import numpy as np

---

# 2. Reading Data

## 2.1. Karolinska, Sam-Perelli scales, Punch Clock and Actimetry data

This file consists of three measurements, fatigue-sleepiness scale, pucnh clock and actimetric or sleep diary measurements.

    All of these measurements were taken longitudinally along one and half semester, specifically during 06/2021 to 03/2022, questionnaires based informations were taken in google questionnaire form. 

* **Punch Clock Measurements**, consists of measurements where the participant starts and finnish its labour duties in a punch clock.

* **Fatigue-Sleepiness scale**, consists of a questionnary where the participant scale their status relative to fatigue and sleepiness in three moments along the duty period, at the start, at the middle and at the end. 

* **Sleep Measurements**, the time the participant goes to sleep and wake up is taken through actimetric measurements such as ammount of light in surroundings, arms movement and sleep diary.

In [3]:
file = os.path.join('data','data_raw_questionnaire.csv')
df1 = pd.read_csv(file)

df1.head(2)

Unnamed: 0,Id,Record_Category,Record_time_stamp,Sam_Perelli,Karolinska,Duty_start,Duty_end,Duty_length,Duty_category,Bedtime_start,...,Sleep_quality,Time_awake_before_workload,Nap_1_begin,Nap_1_end,Nap_1_quality,Nap_1_length,Nap_2_start,Nap_2_end,Nap_2_length,Nap_2_quality
0,P01,INÍCIO DE JORNADA,07/12/2021 12:21,"4. Um pouco cansado, não totalmente disposto","5. Nem alerta, nem sonolento",15:55:00,00:48:00,533,AFT,23:04,...,5,08:28,,,,,,,,
1,P01,MEIO DE JORNADA,07/12/2021 19:57,"5. Moderadamente cansado, enfraquecido","5. Nem alerta, nem sonolento",15:55:00,00:48:00,534,AFT,23:04,...,5,08:28,,,,,,,,


## 2.2. Socio-Economic Data

These data were taken just once, through a questionnary filled by the participants at the beginning of the study.

In [5]:
file = os.path.join('data','data_raw_sociodemo.csv')
df2 = pd.read_csv(file)

df2.head(4)

Unnamed: 0,Id,Sex,Position,Num_Sons,Age,rate_income,Number_sons,Younger_son_age,Son1,Son2,Son3,Flight_hours,Time_company,Time_aviation,Education,Marital_status,Number_residents,Time_displacement
0,P01,MALE,CMTE,2,49,70.0,2,11.0,18.0,11.0,,17000.0,19.5,25.0,PHD,MARRIED,3,60
1,P02,MALE,CMTE,2,57,80.0,2,0.0,,,,17000.0,9.0,30.0,ESPEC,MARRIED,1,20
2,P03,MALE,CMTE,1,42,80.0,1,8.0,8.0,,,8000.0,12.0,12.0,ESPEC,NON_STABLE,2,420
3,P04,MALE,CMTE,0,34,60.0,0,0.0,,,,5200.0,13.0,13.0,ESPEC,MARRIED,1,20


## 2.3. Chronotype data

These data were taken just once at the beginning of the study through a test to classify the chronotype of participant, if it is Matutine, Intermediary or Vespertine.

In [7]:
file = os.path.join('data','data_raw_chronotype.csv')
df3 = pd.read_csv(file)

df3.head(5)

Unnamed: 0,Id,S1,S2,S3,S4,S5,S6,S7,Score_sum,Result,Classification
0,P01,3,2,4,2,2,4,4,21,3.0,INT
1,P02,3,2,3,3,2,3,3,19,2.71,INT
2,P03,3,3,3,4,4,4,4,25,3.57,MAT
3,P04,2,1,2,3,1,4,2,15,2.14,INT
4,P05,1,1,2,1,1,2,2,10,1.43,VES


## 2.4. Karolinska 6 months sleep

Here is the Karolinska and Sam Perelli index for sleep quality for the last 6 months. It consists of questionnary with more than 10 questions after some mathematical operations based on previous studies (factor analysis), provide us a serie of three factors, presented in this file.

In [9]:
file = os.path.join('data','data_raw_KSQ.csv')
df4 = pd.read_csv(file)

df4.head(5)

Unnamed: 0,Id,Disturbed sleep index,Awakening index,Sleepiness/fatigue
0,P01,13,9,14
1,P02,6,3,25
2,P03,11,7,10
3,P04,11,9,12
4,P05,18,11,18


---

# 3. Features Engineering of Workloads Types, Scales and Sleep

## 3.1. Function to create features based on substrings

In [11]:
def func_wType(x):
    # Not to be used in big data, too slow.
    # Call as: df[new_col] = df[col].apply(func_wType)
    
    group = "others"
    for key in category_dict:
        if key in x:
            group = category_dict[key]
            break
    return group

## 3.2. Variables Derived from "Record_Category" Column

### 3.2.1. Duty Moment

The variable workload moment will categorize the moment in workload wherein the Karolinska and Sam Perelly scales were filled.

In [13]:
category_dict = {'INÍCIO DE JORNADA':'start', 'MEIO DE JORNADA':'middle', 'FIM DE JORNADA':'end',
                 'FOLGA':'day_off', 'MONOFOLGA':'single_day_off', 'SOBREAVISO':'warning', 
                 'RESERVA':'reserve'}

df1['duty_moment'] = df1['Record_Category'].apply(func_wType)

### 3.2.2. Duty Period Type

The variables workload type identifies the aviation relevant type of workload based on the hour the pilot starts its duties.

In [15]:
category_dict ={'EARLY-START':'early-start', 'MADRUGADA':'night'}

df1['duty_type'] = df1['Record_Category'].apply(func_wType)

### 3.2.3. Previous Duty Type (Early Start, Night or Single Day-Off)

This variable identifies the previous duty types, It was only selected three types of of previous duty periods, Early-Start, Night and Single Day Off.

#### Previous Early-Start Workloads

In [17]:
category_dict = {'APÓS EARLY-START':'1', 'TRÊS EARLY-START':'3', 'APÓS DOIS EARLY-START': '2',
                 'APÓS DUAS JORNADAS EARLY-START': '2'}

df1['duty_type_prev_es'] = df1['Record_Category'].apply(func_wType).replace('others','0')

#### Previous Night Workloads

In [19]:
category_dict = {'APÓS MADRUGADA':'1', 'APÓS DUAS MADRUGADAS':'2', 'APÓS JORNADA NA MADRUGADA':'1',
                 'APÓS DUAS JORNADAS NA MADRUGADA' :'2', 'APÓS DUAS JORNADASNA MADRUGADA' :'2'}

df1['duty_type_prev_nt'] = df1['Record_Category'].apply(func_wType).replace('others','0')

### 3.2.4. Adjusting Duty Type

Once, the duty query just identify words containing "EARLY-START", we need to remove occurreences of "APÓS EARLY-START" and etc, for night category as well.

In [21]:
df1.loc[(df1['duty_type_prev_es'] == 1) & (df1['duty_type'] == 'early_start'), 'duty_type' ] = 'others'
df1.loc[(df1['duty_type_prev_nt'] == 1) & (df1['duty_type'] == 'night'), 'duty_type' ] = 'others'

### 3.2.5. Summary

In summary, the following variables were extracted from the column "Record-Category"

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| duty_moment | Moment at the duty | Moment during the duty period the participant fills Karolinska or Samm-Perelli scales (start, middle, end, day-off) |
| duty_type | Type of duty | Indentify if the current duty period comprehends early-start or night periods |
| duty_type_prev_es | Type early-start of the previous duty period | Identify the number of previous duty periods comprehending early-start period |
| duty_type_prev_nt | Type early-start of the previous duty period | Identify the number of previous duty period comprehending night period |

## 3.3. Variables relative to Durations and Times

### 3.3.1. Duty Length

In [25]:
# Workload length in hours not in minutes
df1['duty_length'] = pd.to_numeric( df1['Duty_length'], errors='coerce')/60

### 3.3.2. Sleep Duration Before Workload

In [27]:
# Sleep duration in hours
df1['sleep_duration'] = pd.to_numeric( df1['Sleep_duration_minutes'], errors='coerce')/60

### 3.3.3. Time Awake Before Duty Period

In [32]:
def hr_mn_func(ts):
    # function to return the duration in hpurs
    # return ts.hour+ts.minute/60
    try:
        res = float(ts[0:2])+float(ts[3:5])/60
    except:
        res = np.nan
    return res

In [34]:
df1['time_awake'] = df1.Time_awake_before_workload.apply(hr_mn_func)

### 3.3.4. Summary

The following variables realative to durations were transformed:


| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| duty_length | Duty Period length | Total workload duration based on punch clock record |
| sleep_duration | Sleep Duration before the duty period | Total sleep duration before the duty period based on sleep diary and wearable gadget measurement |
| time_awake | Time Awake before the duty period | Time awake before the duty period based on sleep diary, actimetric measurements and punch clock record |


## 3.4. Variables Relatives to Naps

### 3.4.1. Number of Naps During Workload

In [36]:
df1.loc[df1['Nap_1_length'].apply(hr_mn_func) > 0, 'nap_number'] = 1
df1.loc[df1['Nap_2_length'].apply(hr_mn_func) > 0, 'nap_number'] = 2
df1['nap_number'] = df1['nap_number'].fillna(0).astype(int)

### 3.4.2. Total Nap Duration

In [38]:
df1['nap_duration'] = df1['Nap_1_length'].apply(hr_mn_func)\
                      + df1['Nap_2_length'].apply(hr_mn_func)
df1['nap_duration'] = df1['nap_duration'].fillna(0)

### 3.4.3. Summary

The following variables had been produced:

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| nap_number | Number of naps | Number of naps during the duty period |
| nap_duration | Total nap duration | Sum of duration of all naps during the duty period in hours |

---

# 4. Retrieving Karolinska and Samm-Perelli scales from file

## 4.1. Karolinska and Sam-Perelli scales from text data

In [40]:
def kss_sps(x):
    # The filled scale correponds to the first character in row
    return x[0]

df1['kss'] = df1['Karolinska'].fillna('0').apply(kss_sps).astype(int).replace(0,np.nan)
df1['sps'] = df1['Sam_Perelli'].fillna('0').apply(kss_sps).astype(int).replace(0,np.nan)

## 4.2. Time the Participant filled Karolinska and Sam-Perelli scales

To categorize the time the participant filled both the karolinka and Sam-Perelli scales, we will follow an international aviation convention for some categories such as Early-Start or Early Mornings as shown in table bellow:

| Category | Long Name | Rule |
| --- | --- | --- |
| **EM** | Early Morning (or Early Start) | Between 05:00 and 07:59 |
| **MOR** | Morning | Between 08:00 and 11:59 |
| **AFT** | Afternoon | Between 12:00 and 17:59 |
| **EVE** | Evening | Between 18:00 and 23:59 |
| **NI** | Night | Between 00:00 and 04:59 |


In [64]:
def get_hour(x):
    return x.hour+x.minute/60

def categorization_hour(x):
    hr = 'not_apply'
    if 0 <= x < 5:
        hr = 'NI'
    if 5 <= x < 8:
        hr = 'EM'
    elif 8 <= x < 12:
        hr = 'MOR'
    elif 12 <= x < 18:
        hr = 'AFT'
    elif 18 <= x < 24:
        hr = 'EVE'
    return hr

In [66]:

df1['time_fill_kss_sps'] = pd.to_datetime(df1['Record_time_stamp'], dayfirst=True, errors ='raise')\
                             .apply(get_hour).astype(float)\
                             .apply(categorization_hour)

## 4.3. Time of start and end of duty period

In [72]:
df1['duty_start_time'] = df1['Duty_start'].apply(hr_mn_func).astype(float)
df1['duty_end_time']   = df1['Duty_end'].apply(hr_mn_func).astype(float)

df1['duty_start_cat'] = df1['duty_start_time'].apply(categorization_hour)
df1['duty_end_cat']   = df1['duty_end_time'].apply(categorization_hour)

In [80]:
# df1[['time_fill_kss_sps','duty_moment','duty_start_time','duty_start_cat','duty_end_time','duty_end_cat']].head(50)

## 4.2. Summary

| Variable | Variable Long Name | Short Description |
| --- | --- | --- |
| **kss** | Karolinska Scale | Karolinska sleepiness perception scale |
| **sps** | Sam-Perelli Scale | Sam-Perelli fatigue perception scale |
| **time_fill_kss_sps** | Time the participant filled kss and sps scales | Categorized Time the participant filled kss and sps scales |

---

# 5. Joning the Data Bases

The following variables will be used:

| Number | Variable | Variable Long Name | Type | Short Description |
| --- | --- | --- | --- | --- |
| 1 | **Id** | Identification | Categorical | Participant's identification |
| 2 | **kss** | Karolinska Scale | Categorical Ordinal | Karolinska sleepiness perception scale |
| 3 | **sps** | Sam-Perelli Scale | Categorical Ordinal | Sam-Pereli fatigue perception scale |
| 4 | **time_fill_kss_sps** | Time the participants filled kss and sps | Categorical | Time the participants filled kss and sps scales categorized in (EM, MOR, AFT, EVE, NI) |
| 5 | **duty_moment** | Moment at duty period | Categorical | Moment of workload the participants fill kss and sps scales (start, middle, end) |
| 6 | **duty_type** | Type of duty period | Categorical | Indentify if the current workload or duty period comprehends early-start or night periods |
| 7 | **duty_type_prev_es** | Type of previous workload early-start | Categorical | Identify the number of previous workloads which copreehends early-start period |
| 8 | **duty_type_prev_nt** | Type of previous workload night | Categorical | indicators of workloads comprehending early-morning, night or other periods |
| 9 | **duty_length** | Workload Duration | Float | Total workload duration based on punch clock record |
| 10 | **sleep_duration** | Sleep Duration before the workload | Float | Total sleep duration before the workload based on sleep diary and actimetric measurement |
| 11 | **time_awake** | Time Awake before the workload | Float | Time awake before the workload based on sleep diary, actimetric measurements and punch clock record |
| 12 | **nap_number** | Number of naps | Categorical Ordinal | Number of naps during the workload period |
| 13 | **nap_duration** | Total nap duration | Float |Sum of duration of all naps during the workload period in hours |
| 14 | **Sex** | Gender | Categorical | Gender of participant |
| 15 | **Position** | Work position | Categorical | Work position if commander or co-pilot |
| 16 | **Num_Sons** | Number of Sons | Categorical Ordinal | Number of dependent sons |
| 17 | **Flight_hours** | Number of flight hours | Integer | Number of flight hours |
| 18 | **Education** | Education level | Categorical Ordinal | Education level |
| 19 | **Marital_status** | Marital status | Categorical | Marital status |
| 20 | **Time_displacement** | Home to work time | float | time spent from Home to work |
| 21 | **Classification** | Chronotype | Categorical | Chronotype based on test |
| 22 | **Disturbed_sleep** | Disturbed sleep index | Integer | Perceived diturbed sleep index in past 6 months |
| 23 | **Awakening** | Awakening index | Integer | Perceived mean awakening index in past 6 months |
| 24 | **Sleep_Fatig** | Sleepiness Fatigue index | Integer | Perceived sleepiness/fatigue index in past 6 months |
| 25 | **Age** | Participant's Age | Categorical Ordinal | Participant's Age |
| 26 | **Bedtime_start** | Start of participants' bedtime | timestamp |Start of participants' bedtime prior the duty period in hours |
| 27 | **Bedtime_end** | End of participants' bedtime |timestamp|End of participants' bedtime prior the duty period in hours |
| 28 | **Sleep_quality**| Sleep quality index | timestamp |Perception of quality of the sleep prior the current duty period |

In [82]:
# Merging (Joining) the data bases

cols1 = ['Id', 'kss', 'sps', 'time_fill_kss_sps', 'duty_moment', 'duty_moment',
         'duty_start_time','duty_start_cat','duty_end_time','duty_end_cat',
         'duty_type', 'duty_type_prev_es',
         'duty_type_prev_nt', 'duty_length', 'sleep_duration', 'time_awake', 'nap_number', 'nap_duration',
         'Record_time_stamp', 'Bedtime_start','Bedtime_end','Sleep_quality']
cols2 = ['Id', 'Sex', 'Position', 'Num_Sons', 'Flight_hours', 'Education', 'Marital_status', 'Time_displacement','Age']
cols3 = ['Id', 'Classification']
cols4 = ['Id', 'Disturbed sleep index', 'Awakening index', 'Sleepiness/fatigue']

dict_rename = {'Disturbed sleep index':'Disturbed_sleep',
               'Awakening index':'Awakening',
               'Sleepiness/fatigue':'Sleep_Fatig'}

df = df1[cols1].merge(df2[cols2], how='inner', on = 'Id')\
               .merge(df3[cols3], how='inner', on = 'Id')\
               .merge(df4[cols4].rename(columns = dict_rename), how='inner', on = 'Id')
df.head(5)

Unnamed: 0,Id,kss,sps,time_fill_kss_sps,duty_moment,duty_moment.1,duty_start_time,duty_start_cat,duty_end_time,duty_end_cat,...,Num_Sons,Flight_hours,Education,Marital_status,Time_displacement,Age,Classification,Disturbed_sleep,Awakening,Sleep_Fatig
0,P01,5,4.0,AFT,start,start,15.916667,AFT,0.8,NI,...,2,17000.0,PHD,MARRIED,60,49,INT,13,9,14
1,P01,5,5.0,EVE,middle,middle,15.916667,AFT,0.8,NI,...,2,17000.0,PHD,MARRIED,60,49,INT,13,9,14
2,P01,7,6.0,NI,end,end,15.916667,AFT,0.8,NI,...,2,17000.0,PHD,MARRIED,60,49,INT,13,9,14
3,P01,7,5.0,EM,start,start,5.666667,EM,11.116667,MOR,...,2,17000.0,PHD,MARRIED,60,49,INT,13,9,14
4,P01,4,3.0,MOR,middle,middle,5.666667,EM,11.116667,MOR,...,2,17000.0,PHD,MARRIED,60,49,INT,13,9,14


In [84]:
ofile = os.path.join('data/processed_features.csv')
try:
    df.to_csv(ofile, index=False)
    print('---------------------------------------------')
    print(f'Success! Data recorded in file: {ofile}')
    print('---------------------------------------------')
except:
    print('---------------------------------------------')
    print(f'Data not recorded in file: {ofile}, verify!')
    print('---------------------------------------------')

---------------------------------------------
Success! Data recorded in file: data/processed_features.csv
---------------------------------------------
