# Feature Extraction

In this notebook, I use the tsfresh **feature extraction** module. I gave each 30-second interval a unique session ID, and features were extracted to represent each session for each subject.

**Feature selection** was also used to reduce the feature space (filtering the huge dataframe of extracted features).

Then, I grouped the features formed from each session by PSG status (sleep stage), e.g. awake -> list of session IDs. This produced a set of features representing each PSG status (sleep stage).
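
The grouping step described above can be sketched as inverting a session-to-status map. This is only an illustration with made-up values: the names `session_to_psg` and `psg_to_sessions` are hypothetical, and the integer codes stand in for whatever sleep stages the dataset uses.

```python
from collections import defaultdict

# Hypothetical map of session_id -> psg_status for one subject (made-up values).
session_to_psg = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}

# Invert it: psg_status -> list of session_ids that carry that status.
psg_to_sessions = defaultdict(list)
for session_id, psg_status in session_to_psg.items():
    psg_to_sessions[psg_status].append(session_id)

print(dict(psg_to_sessions))
```

The rows of a feature dataframe indexed by session ID could then be looked up per status with the resulting lists.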

---

# <font color='orange'> 1. Set up</font>

Set up the environment: import tsfresh and mount the Google Drive workspace.


In [None]:
from google.colab import drive
import os
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from collections import Counter
     

In [None]:
!pip install tsfresh

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tsfresh
  Downloading tsfresh-0.20.0-py2.py3-none-any.whl (98 kB)
Collecting stumpy>=1.7.2
  Downloading stumpy-1.11.1-py3-none-any.whl (136 kB)
Installing collected packages: stumpy, tsfresh
Successfully installed stumpy-1.11.1 tsfresh-0.20.0


In [None]:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh import extract_features

ERROR:numba.cuda.cudadrv.driver:Call to cuInit results in CUDA_ERROR_NO_DEVICE


In [None]:
import json

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if hasattr(obj, 'to_json'):
            return obj.to_json(orient='records')
        return json.JSONEncoder.default(self, obj)
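
A minimal sketch of how this encoder behaves, using a toy dataframe with made-up values (the `frames` dictionary and the `subject_a` key are hypothetical). Each dataframe is written as a JSON *string* in `records` orientation, so it has to be parsed a second time with `pd.read_json` after loading. Note that `orient='records'` does not store the dataframe index.

```python
import json
from io import StringIO

import pandas as pd


class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        # anything with a to_json method (e.g. a DataFrame) becomes a JSON string
        if hasattr(obj, "to_json"):
            return obj.to_json(orient="records")
        return super().default(obj)


# Hypothetical data: one subject with two rows of sensor readings.
frames = {"subject_a": pd.DataFrame({"second": [0, 1], "heart_rate": [90.0, 91.0]})}

payload = json.dumps(frames, cls=JSONEncoder)

# After json.loads, each value is still a string that pandas must parse.
loaded = json.loads(payload)
restored = {
    subject: pd.read_json(StringIO(frame_json), orient="records")
    for subject, frame_json in loaded.items()
}
print(restored["subject_a"])
```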

In [None]:
pd.set_option('display.max_rows', 25)

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir('/content/drive/My Drive/FYDP_data') # Ainley's directory   

# <font color='orange'> 2. Read data </font>

Read in the dictionary data from the JSON file

In [None]:
# use a context manager so the file is closed after reading
with open('map_of_subject_id_to_its_cleaned_sensor_data.json') as fp:
    json_df_data = json.load(fp)

In [None]:
from io import StringIO

id_to_df_map = {}
for subject_id, sensor_json in tqdm(json_df_data.items()):
    # each value is a JSON string holding one subject's sensor dataframe
    id_to_df_map[subject_id] = pd.read_json(StringIO(sensor_json))

  0%|          | 0/31 [00:00<?, ?it/s]

Manually inspect the loaded data


In [None]:
for i, (student_id, df) in enumerate(id_to_df_map.items()):
    # only output 3 subjects and their sensor data
    if i == 3:
        break

    print("--------------------", student_id, "--------------------")
    print(df)

-------------------- 46343 --------------------
       second    x_move    y_move    z_move  heart_rate  psg_status
0         390 -0.540527  0.680496 -0.271774        90.0           0
1         391 -0.424678  0.922138 -0.094113        90.0           0
2         392 -0.448256  0.816442 -0.228871        90.0           0
3         393 -0.464479  0.765035 -0.432477        90.0           0
4         394 -0.494360  0.798880 -0.076882        90.0           0
...       ...       ...       ...       ...         ...         ...
16556   16946 -0.441763 -0.525671  0.723509        73.0           0
16557   16947 -0.441267 -0.525272  0.724310        73.0           0
16558   16948 -0.441876 -0.525352  0.724083        73.0           0
16559   16949 -0.442227 -0.525543  0.723603        73.0           0
16560   16950 -0.441933 -0.525514  0.723843        73.0           0

[16561 rows x 6 columns]
-------------------- 759667 --------------------
       second    x_move    y_move    z_move  heart_rate  psg_



# <font color='orange'> 3. Get the data into the correct format for extracting the features</font>


Add a 'session_id' column to each row specifying which 30-second bin the row falls in

In [None]:
bin_size = 30

In [None]:

map_subject_to_df_with_id = {}
for subject_id, fixed_sensor_df in tqdm(id_to_df_map.items()):

    print("---------------", subject_id, "-----------------")

    print(fixed_sensor_df.shape)
    # drop the rows that contain NaNs
    no_nans_fixed_sensor_df = fixed_sensor_df.dropna()
    # print(no_nans_fixed_sensor_df.shape)

    # get the value of the maximum second in this dataframe
    max_second_in_df = int(round(max(no_nans_fixed_sensor_df.second) + 0.5))

    # create an empty dataframe that we will populate
    # (the session rows concatenated below supply the columns, including "session_id")
    new_df = pd.DataFrame()

    session_number = 0
    # iterate through each bin_size-second interval in this dataframe
    for i in np.arange(0, max_second_in_df + bin_size, bin_size):

        # get the rows with i <= second < i + bin_size
        rows_in_session_df = pd.DataFrame(no_nans_fixed_sensor_df.loc[(no_nans_fixed_sensor_df.second >= (i)) & (no_nans_fixed_sensor_df.second < i + bin_size)])
        
        if not rows_in_session_df.empty:
            # assign the session_id label to these rows
            rows_in_session_df['session_id'] = session_number

            # join these rows to the rest of the rows
            new_df = pd.concat([new_df, rows_in_session_df], axis=0)

            session_number += 1

    map_subject_to_df_with_id[subject_id] = new_df

  0%|          | 0/31 [00:00<?, ?it/s]

--------------- 46343 -----------------
(16561, 6)
--------------- 759667 -----------------
(14184, 6)
--------------- 781756 -----------------
(29369, 6)
--------------- 844359 -----------------
(26881, 6)
--------------- 1066528 -----------------
(28389, 6)
--------------- 1360686 -----------------
(27695, 6)
--------------- 1449548 -----------------
(28561, 6)
--------------- 1455390 -----------------
(28621, 6)
--------------- 1818471 -----------------
(28711, 6)
--------------- 2598705 -----------------
(28591, 6)
--------------- 2638030 -----------------
(28411, 6)
--------------- 3509524 -----------------
(12448, 6)
--------------- 3997827 -----------------
(28711, 6)
--------------- 4018081 -----------------
(14940, 6)
--------------- 4314139 -----------------
(28801, 6)
--------------- 4426783 -----------------
(29337, 6)
--------------- 5132496 -----------------
(13884, 6)
--------------- 5383425 -----------------
(29279, 6)
--------------- 5498603 -----------------
(22291, 6
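
The binning loop above can also be sketched without an explicit loop. This is an alternative under stated assumptions, not the code that produced the output above: it assumes `second` is non-negative, and the helper name `add_session_id` and the toy data are hypothetical. Like the loop, it numbers only the non-empty bins, consecutively from 0.

```python
import pandas as pd


def add_session_id(df, bin_size=30):
    """Return a copy of df (NaN rows dropped) with a consecutive session_id per non-empty bin."""
    out = df.dropna().copy()

    # index of the bin each row falls in: 0 for [0, 30), 1 for [30, 60), ...
    bin_index = (out["second"] // bin_size).astype(int)

    # factorize(sort=True) relabels the distinct bins as 0, 1, 2, ... in increasing order,
    # so empty bins do not leave gaps in the numbering
    codes, _ = pd.factorize(bin_index, sort=True)
    out["session_id"] = codes
    return out


# Made-up example: no rows fall in the bin [60, 90), so second 95 lands in session 2.
demo = pd.DataFrame({
    "second": [0, 1, 29, 30, 59, 95],
    "x_move": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
demo_with_id = add_session_id(demo)
print(demo_with_id)
```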

# <font color='orange'> 4. Create a map of 'session_id' to 'psg_status' for each student_id</font>

For each subject, create a map from each of their session_ids to that session's psg_status (the most common psg_status among the rows of the session)


In [None]:
map_subject_id_to_a_map_of_the_session_id_to_psg_status = {}

for subject_id, sensor_df in tqdm(map_subject_to_df_with_id.items()):

    # for this subject, create a dictionary to map their sessions to their psg statuses
    subjects_session_to_psg_map = {}

    # sorted() keeps the dictionary in ascending session_id order
    for session_id in sorted(set(sensor_df.session_id)):
    
        # get the psg_status of every row with this session_id
        all_psg_status = sensor_df[sensor_df['session_id'] == session_id]['psg_status']

        # get the most common psg_status across all rows with this session_id
        most_common_psg_status = Counter(all_psg_status).most_common(1)[0][0]

        # create an entry in the subject dictionary of a map between the session_id and the most common psg_status
        subjects_session_to_psg_map[session_id] = most_common_psg_status

    # add this subject's dictionary to the map of each subject_id to their dictionary
    map_subject_id_to_a_map_of_the_session_id_to_psg_status[subject_id] = subjects_session_to_psg_map
     

  0%|          | 0/31 [00:00<?, ?it/s]

In [None]:
for sub_id, id_to_psg_dict in map_subject_id_to_a_map_of_the_session_id_to_psg_status.items():

    # print the subject_id
    print("----------------", sub_id, "---------------------")
    
    # iterate through the dictionary and print the values
    for session_id, psg_status in id_to_psg_dict.items():
        print(session_id, "->", psg_status)

    # stop after one subject_id
    break

---------------- 46343 ---------------------
0 -> 0
1 -> 0
2 -> 0
3 -> 0
4 -> 0
5 -> 0
6 -> 0
7 -> 0
8 -> 0
9 -> 0
10 -> 0
11 -> 0
12 -> 0
13 -> 0
14 -> 0
15 -> 0
16 -> 0
17 -> 0
18 -> 0
19 -> 0
20 -> 0
21 -> 0
22 -> 0
23 -> 0
24 -> 0
25 -> 0
26 -> 0
27 -> 0
28 -> 0
29 -> 0
30 -> 0
31 -> 0
32 -> 0
33 -> 0
34 -> 1
35 -> 1
36 -> 1
37 -> 1
38 -> 2
39 -> 2
40 -> 2
41 -> 2
42 -> 2
43 -> 2
44 -> 2
45 -> 2
46 -> 2
47 -> 2
48 -> 2
49 -> 2
50 -> 2
51 -> 2
52 -> 2
53 -> 2
54 -> 2
55 -> 2
56 -> 3
57 -> 3
58 -> 3
59 -> 3
60 -> 3
61 -> 3
62 -> 3
63 -> 3
64 -> 3
65 -> 3
66 -> 3
67 -> 3
68 -> 3
69 -> 3
70 -> 3
71 -> 3
72 -> 3
73 -> 3
74 -> 3
75 -> 3
76 -> 3
77 -> 3
78 -> 3
79 -> 3
80 -> 3
81 -> 3
82 -> 3
83 -> 3
84 -> 3
85 -> 3
86 -> 3
87 -> 3
88 -> 3
89 -> 3
90 -> 3
91 -> 3
92 -> 3
93 -> 3
94 -> 3
95 -> 3
96 -> 3
97 -> 3
98 -> 3
99 -> 3
100 -> 3
101 -> 3
102 -> 3
103 -> 3
104 -> 3
105 -> 3
106 -> 3
107 -> 3
108 -> 3
109 -> 3
110 -> 0
111 -> 0
112 -> 0
113 -> 0
114 -> 1
115 -> 2
116 -> 2
117 -> 2
118
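
A small sketch of the majority vote used above, on made-up labels (the helper name `majority_label` is hypothetical). It also shows what happens on a tie: `Counter.most_common` orders elements with equal counts by first occurrence, so the label seen first in the session wins.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


# A session that is mostly stage 2 with a few rows of stage 3.
clear_majority = majority_label([2, 2, 3, 2, 2, 3])

# A session split evenly between stages 3 and 2: the first label seen (3) is returned.
tie_case = majority_label([3, 3, 2, 2])

print(clear_majority, tie_case)
```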

# <font color='orange'> 5. Extract the features</font>

Extract the features for each sensor column, one subject at a time

**This code took 4 hours to run.**

In [None]:

map_id_to_extracted_features = {}
for subject_id, cleaned_sensor_df in tqdm(map_subject_to_df_with_id.items()):

    print("---------------------------------------------------")
    print("======================", subject_id, "======================")
    print("---------------------------------------------------")

    no_psg_status_cleaned_df = cleaned_sensor_df.drop(columns=["psg_status"]).dropna()

    no_psg_status_cleaned_df["session_id"] = no_psg_status_cleaned_df["session_id"].astype(str)

    extracted_features = extract_features(no_psg_status_cleaned_df, column_value=None, column_sort="second", column_id="session_id")

    #print(extracted_features.shape)
    extracted_features = extracted_features.dropna(axis='columns')
    
    #print(extracted_features.shape)

    map_id_to_extracted_features[subject_id] = extracted_features
     


In [None]:
# for the no step data
with open('no_step_map_subject_id_to_its_unfiltered_extracted_features_df.json', 'w') as fp:
    json.dump(map_id_to_extracted_features, fp, cls=JSONEncoder)
     

# <font color='orange'> 6. Select the most relevant features from all of these extracted features</font>

This code uses tsfresh's built-in function "**select_features**" to keep only the extracted features that are relevant to the psg_status of each session

In [None]:
map_id_to_filtered_extracted_features = {}
for subject_id, extracted_features_df in tqdm(map_id_to_extracted_features.items()):

    print("---------------------------------------------------")
    print("======================", subject_id, "======================")
    print("---------------------------------------------------")
    
    map_of_session_id_to_psg_status = map_subject_id_to_a_map_of_the_session_id_to_psg_status[subject_id]

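    # build the target as a Series indexed like the feature matrix, so each row is matched
    # to its own session's psg_status by session_id rather than by position
    target_series = pd.Series(
        [map_of_session_id_to_psg_status[int(session_id)] for session_id in extracted_features_df.index],
        index=extracted_features_df.index,
    )

    print(extracted_features_df.shape)
    features_filtered = select_features(extracted_features_df, target_series)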
    print(features_filtered.shape)

    map_id_to_filtered_extracted_features[subject_id] = features_filtered

  0%|          | 0/31 [00:00<?, ?it/s]

---------------------------------------------------
---------------------------------------------------
(553, 872)
(553, 363)
---------------------------------------------------
---------------------------------------------------
(474, 872)
(474, 396)
---------------------------------------------------
---------------------------------------------------
(980, 872)
(980, 284)
---------------------------------------------------
---------------------------------------------------
(897, 872)
(897, 272)
---------------------------------------------------
---------------------------------------------------
(947, 1433)
(947, 699)
---------------------------------------------------
---------------------------------------------------
(925, 872)
(925, 267)
---------------------------------------------------
---------------------------------------------------
(953, 872)
(953, 416)
---------------------------------------------------
---------------------------------------------------
(955, 872)
(9
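
One thing to watch in this step is that the labels line up with the feature rows. The feature index holds session IDs as strings, so if its row order differs from the order of the integer-keyed dictionary (for example "0", "1", "10", "2"), a positional label array would be shifted. A minimal sketch of aligning by key instead, with made-up values and hypothetical names:

```python
import pandas as pd

# Hypothetical map of session_id -> psg_status (made-up labels).
session_to_psg = {0: 0, 1: 0, 2: 1, 10: 3}

# Hypothetical feature matrix whose string index is in lexicographic order.
features = pd.DataFrame(
    {"some_feature": [0.5, 0.7, 0.1, 0.9]},
    index=["0", "1", "10", "2"],
)

# Key the labels by the same string IDs, then reorder them to match the feature rows.
target = pd.Series({str(k): v for k, v in session_to_psg.items()}).reindex(features.index)

# Taking the dictionary values by position would give a different order.
positional = list(session_to_psg.values())

print(target.tolist())  # [0, 0, 3, 1]
print(positional)       # [0, 0, 1, 3]
```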

# <font color='orange'>7. Store this data in a json file</font>




In [None]:
# for the no step data
with open('no_step_map_subject_id_to_its_filtered_extracted_features_df.json', 'w') as fp:
    json.dump(map_id_to_filtered_extracted_features, fp, cls=JSONEncoder)

In [None]:
# put the map of each subject_id to its session_id -> psg_status dictionary in the json file
with open('map_subject_id_to_a_map_of_the_session_id_to_psg_status.json', 'w') as fp:
    json.dump(map_subject_id_to_a_map_of_the_session_id_to_psg_status, fp, cls=JSONEncoder)
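
When this file is read back, keep in mind that JSON object keys are always strings, so the integer session IDs come back as `"0"`, `"1"`, and so on. A minimal sketch of the round trip with made-up values (the names `nested`, `reloaded` and `restored` are hypothetical):

```python
import json

# Hypothetical subject_id -> {session_id -> psg_status} map with integer session IDs.
nested = {"46343": {0: 0, 1: 0, 34: 1}}

# json.dumps converts the integer keys to strings.
reloaded = json.loads(json.dumps(nested))
print(list(reloaded["46343"].keys()))

# Convert the session IDs back to integers after loading.
restored = {
    subject_id: {int(session_id): psg for session_id, psg in session_map.items()}
    for subject_id, session_map in reloaded.items()
}
print(restored == nested)
```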