# Extraction of features
In this Jupyter notebook we extract our features using the functions defined in *feature_extraction.py*. The process is fully __automated__: we only have to choose between the __test or the control group__, and the features are extracted.

In [1]:
#Libraries
import pandas as pd
import numpy as np
import math
import os

# Import the functions developed for this project
from feature_extraction import preprocess, flatten, label, select_group

# Ignore warnings of appending dataframes
import warnings
warnings.simplefilter('ignore')  

### Preparation of paths and variables

After loading the required libraries, we assign the __path__ where our __questionnaire files__ are located. In my case it is the following:

In [2]:
# Load questionnaire data
pathQ = 'C:\\Users\\mverd\\Desktop\\IMD\\ESSEX\\TERM2\\Modules\\Data Science and Decision Making\\Assignment2_Final\\questionnaries\\'

In the following lines we choose between the groups. We call our function *select_group* to assign the correct variables for the extraction. In this case, we are getting the __test__ group features.

In [3]:
# SELECT GROUP
pathParticipants, groupSelect = select_group('test')
# pathParticipants, groupSelect = select_group('control')
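
The body of *select_group* lives in *feature_extraction.py* and is not shown in this notebook. As a rough sketch only, a selector of this kind could map a group name to a data folder and a group identifier. The folder names and the name `select_group_sketch` are assumptions made for illustration, not the actual implementation.

```python
# Hypothetical sketch: NOT the real select_group from feature_extraction.py.
# The folder names below are placeholders chosen for illustration.
GROUPS = {
    'test': 'participants_test/',
    'control': 'participants_control/',
}

def select_group_sketch(group):
    """Return (path to the participant files, group name) for a valid group."""
    if group not in GROUPS:
        raise ValueError(f"group must be one of {sorted(GROUPS)}, got {group!r}")
    return GROUPS[group], group

path_participants, group_select = select_group_sketch('test')
print(path_participants, group_select)
```

Raising an error on an unknown group name catches a typo before any files are read.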

### Extraction of features per participant
Finally, we call our function *preprocess* to perform the extraction. We left some print statements in place to keep track of the for loop. In the loop we extract the features of __each participant__ and concatenate them into a __final dataframe__. In the same loop we also append the label that we want to predict later.

In [4]:
# Initialize variables
# -------------------------
df_final = pd.DataFrame()   # DataFrame to store final features
cont = 0                    # Counter to know which participant is in the loop
num_trials = 0              # Number of trials of the current participant
empathy_score = []          # List to store the empathy values of each recording
#--------------------------

# Use a new name so the imported function `label` is not overwritten
labels = label(pathQ,groupSelect)

for file in os.listdir(pathParticipants):
    # Preprocess for each participant
    df = preprocess(pathParticipants,file)
    
    # Check the shape of each participant's dataframe
    print(df.shape)
    num_trials = df.shape[0]
        
    # Save all recordings in df_final
    df_final = pd.concat([df_final,df],axis=0,ignore_index=True)
    
    # Repeat the participant's empathy score once per trial, so that
    # every recording of the participant gets the same score.
    # Note: this assumes os.listdir returns the files in the same order as the labels
    empathy_score.append(num_trials*[labels[cont]])
    cont += 1
    
# Since we appended lists to a list, we flatten the result into a single list
# so that it can be inserted into the df
empathyScoresFlat = flatten(empathy_score)
df_final['Empathy Score'] = empathyScoresFlat

Participant # 1
(8, 19)
Participant # 3
(16, 19)
Participant # 5
(24, 19)
Participant # 7
(32, 19)
Participant # 9
(40, 19)
Participant # 11
(48, 19)
Participant # 13
(8, 19)
Participant # 15
(8, 19)
Participant # 17
(8, 19)
Participant # 19
(8, 19)
Participant # 21
(8, 19)
Participant # 23
(8, 19)
Participant # 25
(8, 19)
Participant # 27
(8, 19)
Participant # 29
(8, 19)
Participant # 31
(8, 19)
Participant # 33
(9, 19)
Participant # 35
(8, 19)
Participant # 37
(8, 19)
Participant # 39
(8, 19)
Participant # 41
(8, 19)
Participant # 43
(8, 19)
Participant # 45
(8, 19)
Participant # 47
(8, 19)
Participant # 49
(8, 19)
Participant # 51
(8, 19)
Participant # 53
(8, 19)
Participant # 55
(8, 19)
Participant # 57
(8, 19)
Participant # 59
(8, 19)
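
The loop above builds one list of repeated scores per participant and then flattens those lists into a single column. The same expansion can be sketched with `np.repeat`, which repeats each score by that participant's number of recordings. The values and the name `df_toy` below are made up for illustration and are not taken from the data.

```python
import numpy as np
import pandas as pd

# Toy data: three participants with different numbers of recordings.
trials_per_participant = [2, 3, 1]
# One empathy score per participant (made-up values).
scores_per_participant = [122.0, 98.5, 119.25]

# np.repeat repeats each score as many times as that participant has recordings.
scores_per_recording = np.repeat(scores_per_participant, trials_per_participant)

df_toy = pd.DataFrame({'Participant name': np.repeat([1, 3, 5], trials_per_participant)})
df_toy['Empathy Score'] = scores_per_recording
print(df_toy)
```

This avoids the intermediate list of lists, but it relies on the scores and the trial counts being in the same participant order.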


### Verification of outputs
The printed output above and the counts below allow us to __verify the dimensions__ of each participant's data and the number of trials done. This is the last step before saving our dataframe as a CSV file and using it for classification models.

In [5]:
# Print the number of recordings per participant
# to verify the dataframe
df_final['Participant name'].value_counts().sort_index()

1.0      8
3.0     16
5.0     24
7.0     32
9.0     40
11.0    48
13.0     8
15.0     8
17.0     8
19.0     8
21.0     8
23.0     8
25.0     8
27.0     8
29.0     8
31.0     8
33.0     9
35.0     8
37.0     8
39.0     8
41.0     8
43.0     8
45.0     8
47.0     8
49.0     8
51.0     8
53.0     8
55.0     8
57.0     8
59.0     8
Name: Participant name, dtype: int64
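
In the counts above most participants have 8 recordings, while a few have more. A minimal sketch for flagging those participants automatically is shown here. It assumes that the most common count is the expected one, which the notebook does not state, and it uses a small toy series rather than the real `df_final`.

```python
import pandas as pd

# Toy counts in the same shape as the value_counts().sort_index() output.
counts = pd.Series({1.0: 8, 3.0: 16, 5.0: 24, 13.0: 8, 15.0: 8, 33.0: 9})

# Most common number of recordings (assumed here to be the expected one).
expected = counts.mode().iloc[0]

# Participants whose number of recordings differs from the expected value.
unexpected = counts[counts != expected]
print(unexpected)
```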

In [6]:
df_final

Unnamed: 0,Participant name,Mean Pupil diameter left,Std Pupil diameter left,Mean Pupil diameter right,Std Pupil diameter right,Num. of Fixations,Num. of Saccades,Num. of Unclassified,Recording duration (s),Mean Gaze event duration (s),Mean Fixation point X,Std Fixation point X,Mean Fixation point Y,Std Fixation point Y,Mean Gaze point X,Std Gaze point X,Mean Gaze point Y,Std Gaze point Y,Empathy Score
0,1.0,3.116179,0.170624,3.086736,0.187692,5583.0,2821.0,2145.0,83.579,0.170216,0.551323,0.176729,0.360338,0.193124,0.554027,0.181455,0.336391,0.191762,122.00
1,1.0,3.069993,0.214786,3.067127,0.206133,5493.0,2332.0,1381.0,73.900,0.163398,0.508583,0.204791,0.392129,0.196699,0.506665,0.206216,0.386806,0.183729,122.00
2,1.0,3.087159,0.123202,3.107459,0.138585,4238.0,2250.0,1381.0,64.593,0.129727,0.499206,0.173633,0.450421,0.210797,0.495547,0.175859,0.436538,0.212200,122.00
3,1.0,3.051393,0.142042,3.072163,0.152912,7273.0,3294.0,2098.0,103.283,0.140284,0.494731,0.138572,0.442459,0.214930,0.496109,0.145535,0.438177,0.205424,122.00
4,1.0,3.040784,0.168366,3.073641,0.167071,5022.0,2344.0,1528.0,72.090,0.165431,0.513632,0.176040,0.455382,0.205291,0.516138,0.182646,0.446003,0.205497,122.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356,59.0,3.022702,0.213759,2.876765,0.336419,2366.0,865.0,3970.0,58.938,0.210440,0.418216,0.171146,0.595376,0.151689,0.421936,0.156015,0.531597,0.179603,119.25
357,59.0,2.946206,0.189593,2.786759,0.192084,6304.0,2542.0,8959.0,147.069,0.227903,0.416745,0.176847,0.630439,0.167650,0.411828,0.167449,0.566383,0.202803,119.25
358,59.0,3.078361,0.243782,2.920099,0.214158,7852.0,3197.0,12683.0,195.934,0.233097,0.382620,0.191286,0.648154,0.154107,0.378999,0.179963,0.591099,0.189290,119.25
359,59.0,2.965768,0.211124,2.825485,0.223740,5804.0,2480.0,8604.0,138.960,0.219877,0.390574,0.190507,0.581496,0.171900,0.382599,0.178919,0.541035,0.186727,119.25


### Storage of results in .csv

In [7]:
# df_final.to_csv('Final_Features_test.csv', index=False)
# df_final.to_csv('Final_Features_control.csv', index=False)
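
Instead of commenting one of the two lines in or out, the file name could be built from the selected group. The sketch below assumes that a variable such as `group_select` holds the group name as a string ('test' or 'control'); what *select_group* actually returns is not shown in this notebook. It writes a toy dataframe to a temporary folder.

```python
import os
import tempfile
import pandas as pd

group_select = 'test'   # assumed to hold the group name, e.g. 'test' or 'control'
df_toy = pd.DataFrame({'Participant name': [1.0, 1.0], 'Empathy Score': [122.0, 122.0]})

# Build the file name from the group so the two groups never overwrite each other.
out_dir = tempfile.mkdtemp()  # temporary folder used only for this example
out_path = os.path.join(out_dir, f'Final_Features_{group_select}.csv')
df_toy.to_csv(out_path, index=False)

# Read the file back to confirm that it was written as expected.
df_back = pd.read_csv(out_path)
print(df_back.shape)
```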

In [8]:
trials = df_final['Participant name'].value_counts().sort_index()
trials

1.0      8
3.0     16
5.0     24
7.0     32
9.0     40
11.0    48
13.0     8
15.0     8
17.0     8
19.0     8
21.0     8
23.0     8
25.0     8
27.0     8
29.0     8
31.0     8
33.0     9
35.0     8
37.0     8
39.0     8
41.0     8
43.0     8
45.0     8
47.0     8
49.0     8
51.0     8
53.0     8
55.0     8
57.0     8
59.0     8
Name: Participant name, dtype: int64

In [9]:
trials.sum()

361

In [10]:
trials.std()

10.226683600819195
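
When reading the standard deviation above, keep in mind that pandas `Series.std()` computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`). The short sketch below shows the difference on toy counts, which are not the real `trials` series.

```python
import numpy as np
import pandas as pd

trials_toy = pd.Series([8, 16, 24, 8, 8])

sample_std = trials_toy.std()               # pandas default: ddof=1 (sample std)
population_std = trials_toy.std(ddof=0)     # population std
numpy_std = np.std(trials_toy.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```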