# Constructing Data Sets for Neural Networks

Author: Jake Dumbauld <br>
Contact: jacobmilodumbauld@gmail.com<br>
Date: 3.15.22

## Introduction:

The purpose of this notebook is to make the most of the patient demographic information available in my data set by converting variables into a machine-readable format. In addition, there are many subjective decisions made throughout on what to keep and what to throw away. I will explain my thought process as those decisions are being made.

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import librosa
import librosa.display
import IPython
from IPython.display import display, clear_output

import os
from random import uniform
import time
import gc

#options
pd.set_option('display.max_columns', None) #making sure I can see all my columns

I'm picking up from the df created in `2 - Importing Signal Data`, because the array created in notebook 3 dropped all columns but the signal and target.

In [2]:
root_path = '/Users/jmd/Documents/BOOTCAMP/Capstone/'

In [2]:
df = pd.DataFrame(data = np.load(root_path+'arrays/patient_signals_4k.npy', allow_pickle=True),
                       columns=(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
                                 'Pregnancy status', 'Murmur', 'Murmur locations',
                                 'Most audible location', 'Systolic murmur timing',
                                 'Systolic murmur shape', 'Systolic murmur grading',
                                 'Systolic murmur pitch', 'Systolic murmur quality',
                                 'Diastolic murmur timing', 'Diastolic murmur shape',
                                 'Diastolic murmur grading', 'Diastolic murmur pitch',
                                 'Diastolic murmur quality', 'Campaign', 'Additional ID',
                                 'location_count', 'signal_patient_id', 'location', 'signal']))

## Dropping duplicated or unnecessary columns

df

In [3]:
df.columns

Index(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
       'Pregnancy status', 'Murmur', 'Murmur locations',
       'Most audible location', 'Systolic murmur timing',
       'Systolic murmur shape', 'Systolic murmur grading',
       'Systolic murmur pitch', 'Systolic murmur quality',
       'Diastolic murmur timing', 'Diastolic murmur shape',
       'Diastolic murmur grading', 'Diastolic murmur pitch',
       'Diastolic murmur quality', 'Campaign', 'Additional ID',
       'location_count', 'signal_patient_id', 'location', 'signal'],
      dtype='object')

### Reasoning:

In general, my goal is to observe the impact of patient information on model performance. Additionally, I'm interested in studying information that would be easy to obtain in the field if a machine learning system for phonocardiogram classification was employed that also took in patient information. With this in mind, I'll proceed through each of the columns in the dataframe and state my reasoning for keeping them.

- `Patient ID`: Kept for now, to keep things straight in the process ahead. Later dropped in this notebook as it is not necessary for the model to know.
- `Locations`: Drop. This is the column that the `location` column is based on and contains duplicate information.
- `Age`, `Sex`, `Height`, `Weight`: Keep. These are the most basic bits of information that can help shape a patient profile. All are relatively easy to obtain.
- `Pregnancy status`: Keep. Pregnancy has an effect on heart sounds ([source](https://www.ahajournals.org/doi/10.1161/circulationaha.114.009029)), and thus it is a very important variable to track in any potential modelling.
- `Murmur`: Keep. This is our target variable.
- `Systolic murmur timing` - `Diastolic murmur quality`: Drop - I'll admit, dropping these hurts as I'm sure it was no small feat to put together this info. However, this information isn't useful in training a binary model on the presence or absence of a murmur. On a model with functionality of grading a murmur, it would be great! That's not the problem I'm trying to solve though. In the future, the qualifying and grading info could be useful in evaluating the type of murmurs most misclassified.
- `Campaign` & `Additional ID` - Drop. There were two campaigns that gathered all of these data points, denoted in the `Campaign` column. If a patient participated in both campaigns they have a value in the `Additional ID` column. I expect the variability in the patient's heart rate & sounds from day to day to provide different _enough_ data for the purpose of training a neural network, so I will not be dropping observations with an Additional ID.
- `location_count` - Drop. Artifact from data processing used to expand the dataframe. Has no use in modelling.
- `signal_patient_id` - Drop. Same as above.
- `location` - This one is interesting. I've explained in previous notebooks that the location of the heart recording has an impact on the heart sounds heard. Thus, it makes sense to me to keep this information in. However, I feel there's a case to be made that this could confuse the model if it learns patterns of murmurs only being audible at one spot if there is some underlying skew in the data that I'm not aware of. Despite this, the variability in heart sounds from location to location compelled me to keep this in. 
- `signal` - Keep. This is my data. 

In [4]:
columns_to_drop = ['Locations','signal_patient_id', 'Murmur locations',
                   'Most audible location', 'Systolic murmur timing',
                   'Systolic murmur shape', 'Systolic murmur grading',
                   'Systolic murmur pitch', 'Systolic murmur quality',
                   'Diastolic murmur timing', 'Diastolic murmur shape',
                   'Diastolic murmur grading', 'Diastolic murmur pitch',
                   'Diastolic murmur quality', 'Campaign', 'Additional ID', 'location_count']

In [5]:
df.drop(columns_to_drop, axis=1, inplace=True)

## Binarizing Columns

With the hard decisions out of the way, I was now down to taking these columns and translating them into a machine-readable format. The first step was to binarize my target in the same way as was done in notebook 3.

### Binarizing Murmurs

In [6]:
#checking to see if we have any NaN's
df['Murmur'].isna().sum()

0

In [7]:
df.drop(df[df['Murmur'] == 'Unknown'].index, inplace=True)

In [8]:
df['Murmur'].value_counts()

Absent     2391
Present     616
Name: Murmur, dtype: int64

In [9]:
df['Murmur'] = df['Murmur'].map({"Absent": 0,
                                 "Present": 1})

In [10]:
df.reset_index(inplace=True, drop=True)

### Binarizing Pregnancy Status

Next, pregnancy status was given as a boolean, so it was as simple as casting it to an integer. 

In [11]:
#checking to see if we have any NaN's
df['Pregnancy status'].isna().sum()

0

In [12]:
df['Pregnancy status'] = df['Pregnancy status'].astype(int)

### Binarizing Sex

Patient sex was recorded as a string, so I arbitrarily mapped it to 0 for male and 1 for female.

In [13]:
#checking to see if we have any NaN's
df['Sex'].isna().sum()

0

In [14]:
df['Sex'].value_counts()

Female    1523
Male      1484
Name: Sex, dtype: int64

In [15]:
df['Sex'] = df['Sex'].map({"Male": 0,
                           "Female": 1})

### Mapping Age to Ints

In [17]:
df['Age'].value_counts()

Child          2125
Infant          383
Adolescent      230
Young Adult      24
Neonate           8
Name: Age, dtype: int64

In [16]:
#checking to see if we have any NaN's
df['Age'].isna().sum()

237

This was the first real choice in this process. Age was given as categorical strings, and there were some unknowns present in the data. For the knowns, I decided to map them to ordinal values, increasing with patient age. For the `nan`s, rather than imputing something potentially incorrect, I opted to map them all to zero. I'm admittedly uncertain whether I handled this correctly, and it could be explored in future studies.

In [18]:
df['Age'] = df['Age'].map({np.nan: 0,
                           'Neonate': 1,
                           'Infant': 2,
                           'Child': 3,
                           'Adolescent': 4,
                           'Young Adult': 5})

In [19]:
# sanity check
df['Age'].value_counts()

3    2125
2     383
0     237
4     230
5      24
1       8
Name: Age, dtype: int64
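As a minimal sketch of the same ordinal mapping on toy data (not the notebook's actual column): here the unknowns are filled with `fillna(0)` after the mapping, instead of relying on a `np.nan` dictionary key. The names `toy_age` and `age_order` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['Age']; the real column has the same five categories plus NaNs.
toy_age = pd.Series(['Child', np.nan, 'Neonate', 'Young Adult', 'Infant', 'Adolescent'])

# Ordinal codes increase with patient age; 0 is reserved for "unknown".
age_order = {'Neonate': 1, 'Infant': 2, 'Child': 3, 'Adolescent': 4, 'Young Adult': 5}

# Categories missing from the dict (including NaN) come back as NaN, then get filled with 0.
encoded_age = toy_age.map(age_order).fillna(0).astype(int)
print(encoded_age.tolist())
```

Keeping the fill as a separate step makes it easy to swap zero for another placeholder (or a separate "age unknown" indicator column) if that turns out to work better.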

### Dummy variables for location

Lastly, there were 5 locations from which recordings were taken. Since this is a categorical variable with only a few possibilities, I opted to one-hot encode them using pandas' built-in `get_dummies()` function.

In [20]:
df = pd.get_dummies(df, columns=['location'])

In [21]:
len(df.columns)

13
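A small toy illustration of what `get_dummies()` does to a `location` column. The frame below is made up for the example. Passing `dtype=int` is my assumption about what is wanted downstream: it keeps the indicator columns as 0/1 integers, whereas newer pandas versions default to booleans.

```python
import pandas as pd

# Toy frame; only the column name 'location' mirrors the notebook.
toy = pd.DataFrame({'Patient ID': [1, 1, 2], 'location': ['AV', 'MV', 'Phc']})

# Each distinct location becomes its own 0/1 column, prefixed with 'location_'.
encoded = pd.get_dummies(toy, columns=['location'], dtype=int)

print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```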

I also wanted to have my signal data be the last column in the dataframe, for readability purposes during this process. There were a _lot_ of `df` calls on the last line of these cells to take a look at what I was doing and ensure I wasn't missing anything. These have been cleaned up for the reader :). 

In [22]:
last_col = df.pop('signal')

In [23]:
last_position = len(df.columns)

In [24]:
df.insert(last_position, 'signal', last_col)

### Dealing with NaNs in Height/Weight

The second major choice in this process: I had a few hundred `nan` `height` and `weight` values, what to do with them? <br><br>
I opted to impute these values with the means of same sex/age groups. For most, this worked. However, for some patients there was no age information so I was unable to impute these values. Instead, I chose to impute the mean of the whole sample rather than setting them to zero. This is obviously not ideal, but with limited information it was the best I could do. 

In [26]:
#dealing with NaN heights
new_heights = []

#iterating through heights
for i, height in enumerate(df['Height']):
    
    #if the height is null 
    if (pd.isnull(height) == True):        
        
        #store the age and sex groups of the current patient in memory
        #(looked up by column name rather than position, so column order doesn't matter)
        age_group = df['Age'].iloc[i]
        sex_group = df['Sex'].iloc[i]
        
        #creating a condition that checks if age and sex are equal to the age and sex group
        condition = (df['Age'] == age_group) & (df['Sex'] == sex_group)
        
        #computing the mean height of the patient group with same age and sex category
        groups_height_mean = df[condition]['Height'].mean()
        
        #if that mean is null append the mean of the entire sample
        if (pd.isnull(groups_height_mean) == True):
            
            new_heights.append(df['Height'].mean().round(1))
        
        #else append the group mean
        else:
            
            new_heights.append(groups_height_mean.round(1))
    
    else:
        new_heights.append(df['Height'][i])

# reassigning heights with imputation.
df['Height'] = new_heights

In [27]:
#dealing with NaN weights
new_weights = []

#iterating through weights
for i, weight in enumerate(df['Weight']):
    
    #if the weight is null 
    if (pd.isnull(weight) == True):        
        
        #store the age and sex groups of the current patient in memory
        #(looked up by column name rather than position, so column order doesn't matter)
        age_group = df['Age'].iloc[i]
        sex_group = df['Sex'].iloc[i]
        
        #creating a condition that checks if age and sex are equal to the age and sex group
        condition = (df['Age'] == age_group) & (df['Sex'] == sex_group)
        
        #computing the mean weight of the patient group with same age and sex category
        groups_weight_mean = df[condition]['Weight'].mean()
        
        #if that mean is null append the mean of the entire sample
        if (pd.isnull(groups_weight_mean) == True):
            
            new_weights.append(df['Weight'].mean().round(1))
        
        #else append the group mean
        else:
            
            new_weights.append(groups_weight_mean.round(1))
    
    else:
        new_weights.append(df['Weight'][i])
        
# reassigning weights with imputation.
df['Weight'] = new_weights
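For comparison, the same group-mean imputation can be sketched without an explicit loop, using `groupby(...).transform('mean')`. This is a toy example under my own assumptions (numeric `Height`, integer-coded `Age` and `Sex`), not the code that produced the arrays in this project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age':    [3, 3, 3, 2, 2, 0],
    'Sex':    [1, 1, 1, 0, 0, 1],
    'Height': [100.0, 110.0, np.nan, 60.0, np.nan, np.nan],
})

# Mean height within each (Age, Sex) group, broadcast back to every row.
group_mean = toy.groupby(['Age', 'Sex'])['Height'].transform('mean')

# Fill from the group mean first, then fall back to the overall mean
# for groups that have no observed heights at all.
overall_mean = toy['Height'].mean()
toy['Height'] = toy['Height'].fillna(group_mean).fillna(overall_mean).round(1)

print(toy['Height'].tolist())
```

The last row belongs to a group with no observed heights, so it falls through to the overall mean, which mirrors the fallback described above.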

Two more quick sanity checks here. Since these were the last two, I kept them in so the end result is visible.

In [28]:
# sanity check
df.isna().sum()

Patient ID          0
Age                 0
Sex                 0
Height              0
Weight              0
Pregnancy status    0
Murmur              0
location_AV         0
location_MV         0
location_PV         0
location_Phc        0
location_TV         0
signal              0
dtype: int64

In [29]:
df

Unnamed: 0,Patient ID,Age,Sex,Height,Weight,Pregnancy status,Murmur,location_AV,location_MV,location_PV,location_Phc,location_TV,signal
0,2530,3,1,98.0,15.9,0,0,0,0,1,0,0,"[0.07682987, 0.06061038, 0.039170958, 0.048250..."
1,2530,3,1,98.0,15.9,0,0,1,0,0,0,0,"[-0.01187718, 0.029969877, 0.01927742, -0.0206..."
2,2530,3,1,98.0,15.9,0,0,0,1,0,0,0,"[0.37442628, 0.32439327, 0.095518045, -0.06558..."
3,2530,3,1,98.0,15.9,0,0,0,0,0,0,1,"[0.06770988, 0.073658854, 0.072224066, 0.08253..."
4,9979,3,1,103.0,13.1,0,1,0,1,0,0,0,"[0.15039496, 0.18560724, 0.17212218, 0.1603406..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3002,85345,3,1,132.0,38.1,0,0,1,0,0,0,0,"[0.03814805, 0.06008572, 0.03808801, -0.008758..."
3003,85345,3,1,132.0,38.1,0,0,0,0,1,0,0,"[-0.0061015976, 0.029588033, 0.020953469, 0.00..."
3004,85349,0,1,115.9,25.1,1,0,1,0,0,0,0,"[0.00026825635, 0.0034792388, 0.014762115, -0...."
3005,85349,0,1,115.9,25.1,1,0,0,0,1,0,0,"[0.1387334, 0.08302983, 0.13867667, -0.0078439..."


From here, I saved just my target variable to a numpy array for easy loading in later notebooks, as you'll see. 

In [30]:
murmur_array = df['Murmur'].to_numpy()

In [31]:
np.save(root_path+'arrays/target_array', murmur_array)

### Trimming and Padding clips to 12 seconds

Since my initial loading brought in the raw signal data, untrimmed and unprocessed, I repeated the trimming/padding code from notebook 3. Since this is identical, I'll be taking a brief break from markdown.

In [32]:
sr = 4096

In [33]:
signals = df['signal']

In [34]:
lengths = []
for signal in signals:
    lengths.append(len(signal))

In [35]:
lengths = pd.Series(lengths)

In [36]:
lengths.describe() / sr

count     0.734131
mean     22.894569
std       7.297946
min       5.152100
25%      19.056152
50%      21.488037
75%      29.392090
max      64.512207
dtype: float64
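Note that dividing the whole `describe()` output by `sr` also divides the `count` row, which is why it reads 0.734131 (3007 / 4096) instead of 3007. A small sketch on toy lengths of converting to seconds first, so that only the duration statistics are rescaled:

```python
import pandas as pd

sr = 4096

# Toy clip lengths in samples: 1, 2 and 3 seconds at 4096 Hz.
toy_lengths = pd.Series([4096, 8192, 12288])

# Convert to seconds before summarising, so 'count' stays a plain count.
durations = toy_lengths / sr
summary = durations.describe()
print(summary)
```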

In [37]:
for i in range(4, 30):
    print(i, len(lengths[lengths > (i * sr)]) / len(lengths))

4 1.0
5 1.0
6 0.9986697705354174
7 0.9960093116062521
8 0.9890256069171932
9 0.9787163285666778
10 0.9630861323578317
11 0.9441303624875291
12 0.9218490189557699
13 0.8985700033255737
14 0.8782840039906884
15 0.8540073162620552
16 0.8240771533089458
17 0.79847023611573
18 0.7751912204855338
19 0.7509145327569006
20 0.6508147655470569
21 0.5340871300299301
22 0.4672430994346525
23 0.4286664449617559
24 0.40239441303624873
25 0.3831060857998005
26 0.36647821749251747
27 0.3518456933821084
28 0.32790156301962087
29 0.28633189225141337


In [38]:
target_len = 12 * sr
target_len

49152

In [39]:
new_signals = []
for signal in signals:
    if len(signal) == target_len:
        new_signals.append(signal)
    elif len(signal) > target_len:
        new_signals.append(signal[0:target_len])
    else:
        #signal is shorter than target_len, so zero-pad the end
        padwidth = target_len-len(signal)
        new_signals.append(np.pad(signal, (0, padwidth), mode='constant'))

new_signals = np.asarray(new_signals)

new_signals.shape

(3007, 49152)
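The trim/pad logic can also be wrapped in a small helper so it can be reused for other clip lengths. This is a sketch on toy arrays; `pad_or_trim` is a name I made up for illustration and is not defined elsewhere in these notebooks.

```python
import numpy as np

def pad_or_trim(signal, target_len):
    """Return `signal` cut, or zero-padded at the end, to exactly `target_len` samples."""
    signal = np.asarray(signal)
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)), mode='constant')

# Toy signals that are shorter than, equal to, and longer than the target length.
toy_signals = [np.ones(3), np.ones(5), np.ones(8)]
fixed = np.stack([pad_or_trim(s, 5) for s in toy_signals])
print(fixed.shape)
```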

In [40]:
lengths = []
for signal in new_signals:
    lengths.append(len(signal))

In [41]:
lengths = pd.Series(lengths)

In [42]:
lengths.describe() / sr

count     0.734131
mean     12.000000
std       0.000000
min      12.000000
25%      12.000000
50%      12.000000
75%      12.000000
max      12.000000
dtype: float64

## Important Interlude: Taking Stock & Plotting the Course

Taking quick stock of relevant variables:
- `df` is our dataframe with all of our patient demographic information + patient ID, and currently has the *unprocessed* (variable length) signal data in it. 
- `new_signals` is our array of properly trimmed signals. 

Taking stock of our goals:
- The goal of the notebook was to create machine-readable data
- The goal of this project is to evaluate different machine learning models and the effect of patient information on their output.

At this point, I needed to decide what formats I wanted my data in. Throughout this process, I heavily referenced this [paper](https://www.mdpi.com/1099-4300/23/6/667/htm), and if you are exploring this space I encourage you to do so as well. What I landed on was creating 5 variants of my data to then feed into models. Already completed was the 4k sampling rate (sr) raw signal data without patient information that I fed into the simple statistical models. The rest were as follows: 
- 1k sr signal data _without_ patient information
- 1k sr signal data _with_ patient information
- MFCC data derived from my 4k sr signal data _without_ patient information
- MFCC data derived from my 4k sr signal data _with_ patient information.

The reason for creating the 1k signal data is explained in the model search notebooks on RNN. <br><br>

From this point, I needed to create MFCCs from the processed signals and find a way to merge in the relevant patient information. The shape of my MFCC output data from each observation (signal) in this dataset was `(20,97)`, and I had a vector of length 10 representing my patient demographic information. <br><br>
After consulting with my instructors, I landed on taking my vector of patient demographic information and creating 'static signals' from it. I reshaped it to an array with shape `(10,1)`. I then repeated the values in this array across the length of my MFCC array to generate 'static signals' with a shape of `(10,97)` that represented my patient information. Then, I stacked the two arrays together to form a final array with shape `(30,97)`. The first 20 rows were my MFCC data signals, and the last 10 rows were my static patient demographic 'signals.' I did this iteratively for each observation in my dataset to generate a 3D array with shape `(3007, 30, 97)`. This contained MFCC & Patient information for all of the audio recordings in my dataset.<br><br>

In addition, I needed to create an MFCC array without the patient information. This was simpler, and just required repeating the code from above without the inclusion of the demographic information. The final shape of this array would be `(3007, 20, 97)`, 20 rows and 97 observations for the MFCC data, for 3007 audio files. <br><br>

The code that follows accomplishes the above goals.

## MFCC Data w/wo Patient Information

#### Variable Selection & Reasoning:

In [43]:
n_fft = 256
n_mfcc = 20

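Recall from the Basic Librosa notebook what these variables mean - `n_fft` is our window, and `n_mfcc` is the number of coefficients ('bins') we're creating. Implicit in this is our `hop_length`. I didn't make any adjustments to this, so it stays at the default, which for `librosa.feature.mfcc` is 512 samples rather than 1/4 of `n_fft` (the 1/4 rule is the default of `librosa.stft` when called directly). This is consistent with the 97 frames per 12 second clip: 1 + 49152 // 512 = 97. <br><br>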

The reasoning behind shortening our `n_fft` window lies in our low sr. Typical approaches using MFCCs (speech recognition, music genre classification) have sr's of the standard 22 kHz or 44 kHz because they have frequencies in their data that require that high of a sampling rate, and thus the window can be much larger as there are many more data points per second available. The loss of granularity from the fft window sliding across the sound data is not pronounced. <br><br>

However, we have a 4k sr. A window of 1k would mean that each power spectrum covered 1/4 of a second of audio, rather than roughly 1/20 or 1/40 of a second. To compensate for this, I reduced the window by a factor of four from the example in the librosa notebook, to 256 samples (1/16 of a second).
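The arithmetic behind these numbers, as a quick sketch. It assumes the `hop_length` of 512 samples that `librosa.feature.mfcc` uses when none is passed, together with its default centered framing. Both are library defaults as I understand them, not values set anywhere in this notebook.

```python
sr = 4096
clip_seconds = 12
n_samples = clip_seconds * sr  # 49152 samples per trimmed clip

# How much audio one FFT window spans at this sampling rate.
window_seconds = {n_fft: n_fft / sr for n_fft in (1024, 256)}
print(window_seconds)

# Assumed default hop for librosa.feature.mfcc when hop_length is not passed.
hop_length = 512

# With centered framing there is one frame per hop, plus one.
n_frames = 1 + n_samples // hop_length
print(n_frames)
```

Since the hop (512) is larger than the window (256), consecutive frames do not overlap with these settings. Passing a smaller `hop_length` explicitly would give more frames per clip, at the cost of a larger array.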

### Transforming Signal Array into MFCC Array With Patient Information

To keep my code clean, I wrote a quick helper function that takes the demographic information from an observation at position `i` in `df` and returns it as an array in the shape specified above `(10,97)`. 

In [52]:
def patient_info_to_signal(df, i, repeats):
    '''
    Helper function to take the patient demo info and reshape it into static signals with length equal to the 
    MFCC array it's being concatenated with.
    df: input dataframe from where the demographic info is coming from
    i: row number of target patient demo info
    repeats: length of signal representing patient info
    '''
    
    demo_info = df.drop(columns=['Patient ID', 'Murmur', 'signal']).iloc[i,:].to_numpy()

    demo_info = demo_info.reshape(demo_info.shape[0],1)

    demo_info = np.repeat(demo_info, repeats = repeats, axis=1)

    return demo_info
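To make the shapes concrete, here is the same reshape, repeat and concatenate sequence on stand-in arrays. The zeros and `arange` values are placeholders I chose for illustration; only the shapes mirror the notebook.

```python
import numpy as np

mfcc_like = np.zeros((20, 97))            # stand-in for one MFCC array
demo_vector = np.arange(10, dtype=float)  # stand-in for one patient's 10 features

# (10,) -> (10, 1) -> (10, 97): every feature becomes a constant 'static signal'.
static_signals = np.repeat(demo_vector.reshape(-1, 1), repeats=mfcc_like.shape[1], axis=1)

# Stack the static signals underneath the MFCC rows.
combined = np.concatenate((mfcc_like, static_signals), axis=0)
print(static_signals.shape, combined.shape)
```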

I then looped through every observation in my new_signals list (which is the same length as `df`) and converted these to MFCCs. This needed to be broken up into two blocks as I couldn't find a clean way to build the final array (shape: `(3007, 30, 97)`) that I was trying to create without reshaping the first two outputs and stacking them, then concatenating them on the new axis I created in the reshaping process. The below block of code is the result. 

In [45]:
for i in range(0,len(new_signals)):
    if i == 0:
        
        #defines first MFCC from row 1 and concatenates with patient info signal
        MFCCs = librosa.feature.mfcc(y = new_signals[i], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc)
        demo_info = patient_info_to_signal(df, i, MFCCs.shape[1])
        MFCCs_and_patient = np.concatenate((MFCCs, demo_info), axis=0)
        
        #defines second MFCC from row
        MFCCs2 = librosa.feature.mfcc(y = new_signals[i+1], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc)
        demo_info = patient_info_to_signal(df, i+1, MFCCs.shape[1])
        MFCCs2_and_patient = np.concatenate((MFCCs2, demo_info), axis=0)
        
        #and this is why we have this whole block. choosing to use .stack to start building the final array
        final_patient_MFCC = np.stack((MFCCs_and_patient, MFCCs2_and_patient))
        
    if i == 1:
        continue
        
    elif i > 1:
        #building another MFCC & patient signal 
        MFCCs = librosa.feature.mfcc(y = new_signals[i], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc) 
        demo_info = patient_info_to_signal(df, i, MFCCs.shape[1])
        MFCCs_and_patient = np.concatenate((MFCCs, demo_info), axis=0)
        MFCCs_and_patient = MFCCs_and_patient.reshape(1,MFCCs_and_patient.shape[0],MFCCs_and_patient.shape[1])
        
        #assembling the final array
        final_patient_MFCC = np.concatenate((final_patient_MFCC, MFCCs_and_patient))
        
    clear_output(wait=True)
    display(final_patient_MFCC.shape)
    
np.save(root_path+'arrays/MFCCs_withPatient', final_patient_MFCC)

(3007, 30, 97)
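An alternative that avoids special-casing the first two rows is to collect each per-clip array in a list and call `np.stack` once at the end. This also avoids copying the growing array on every `np.concatenate`. The sketch below runs on toy data. `fake_features` is a hypothetical stand-in for `librosa.feature.mfcc` so the example needs only numpy, and `toy_demo` stands in for the demographic rows.

```python
import numpy as np

def fake_features(signal):
    """Hypothetical stand-in for librosa.feature.mfcc; returns a (20, 97) array."""
    return np.full((20, 97), signal.mean())

# Toy data: 4 clips of 12 s at 4096 Hz, and 10 demographic features per clip.
toy_signals = np.random.default_rng(0).normal(size=(4, 49152))
toy_demo = np.arange(40, dtype=float).reshape(4, 10)

per_clip = []
for signal, demo in zip(toy_signals, toy_demo):
    feats = fake_features(signal)
    # Repeat the demographic vector across the time axis to match the feature length.
    static = np.repeat(demo.reshape(-1, 1), feats.shape[1], axis=1)
    per_clip.append(np.concatenate((feats, static), axis=0))

# One stack at the end adds the leading 'observation' axis.
stacked = np.stack(per_clip)
print(stacked.shape)
```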

The block of code below can be executed if you run into memory issues with this notebook.

In [46]:
# del final_patient_MFCC
# gc.collect(generation=2)

### Without Patient Information

I repeated the above block of code but without the inclusion of the helper function returning the demographic information to build out my MFCC data without the patient information.

In [48]:
for i in range(0,len(new_signals)):
    if i == 0:
        
        #defines first MFCC from row 1
        MFCCs = librosa.feature.mfcc(y = new_signals[i], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc)
        
        #defines second MFCC from row
        MFCCs2 = librosa.feature.mfcc(y = new_signals[i+1], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc)
        
        #and this is why we have this whole block. choosing to use .stack to start building the final array
        final_MFCC = np.stack((MFCCs, MFCCs2))
        
    if i == 1:
        continue
        
    elif i > 1:
        #building another MFCC
        MFCCs = librosa.feature.mfcc(y = new_signals[i], sr = sr, n_fft = n_fft, n_mfcc = n_mfcc) 
        MFCCs = MFCCs.reshape(1,MFCCs.shape[0],MFCCs.shape[1])
        
        #assembling the final array
        final_MFCC = np.concatenate((final_MFCC, MFCCs))
        
    clear_output(wait=True)
    display(final_MFCC.shape)
    
np.save(root_path+'arrays/MFCCs_noPatient', final_MFCC)

(3007, 20, 97)

The block of code below can be executed if you run into memory issues with this notebook.

In [None]:
# del final_MFCC
# gc.collect(generation=2)

## Unprocessed signal data w/wo Patient Information

Now, I needed to generate the 1k signal data with and without patient information.

First, I dropped the `signal` column from `df`, which contained the old, unprocessed, 4k signal data.

In [43]:
df.drop(columns='signal', inplace=True)

Then, I imported the 1k signal data df from notebook 2, and created a list with just the signal data from it. 

In [44]:
one_k_signals = pd.DataFrame(data = np.load(root_path+'arrays/patient_signals_1k.npy', allow_pickle=True),
                                   columns=(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
                                             'Pregnancy status', 'Murmur', 'Murmur locations',
                                             'Most audible location', 'Systolic murmur timing',
                                             'Systolic murmur shape', 'Systolic murmur grading',
                                             'Systolic murmur pitch', 'Systolic murmur quality',
                                             'Diastolic murmur timing', 'Diastolic murmur shape',
                                             'Diastolic murmur grading', 'Diastolic murmur pitch',
                                             'Diastolic murmur quality', 'Campaign', 'Additional ID',
                                             'location_count', 'signal_patient_id', 'location', 'signal']))
one_k_signals = one_k_signals.iloc[:,-1]
one_k_signals

0       [0.037114453, 0.061636038, 0.055389933, 0.0675...
1       [0.006357202, -0.010564456, -0.008372305, 0.00...
2       [0.16865823, 0.035755683, -0.01783402, 0.00613...
3       [0.042926352, 0.08795212, 0.081163116, 0.07926...
4       [0.11233253, 0.16617733, 0.22739325, 0.3153002...
                              ...                        
3158    [0.02705935, 0.0051991576, -0.028789269, -0.00...
3159    [0.011014578, 0.009928619, 0.014445103, 0.0160...
3160    [0.0010312841, -0.0074770185, 0.0053564664, 0....
3161    [0.070928715, 0.053460453, -0.012949261, -0.09...
3162    [0.07846564, 0.006256001, -0.02002155, 0.00305...
Name: signal, Length: 3163, dtype: object

I then inserted this new signal data back into my df so that I could reuse the code from notebook 3.

In [45]:
df.insert(loc=len(df.columns), column='signal', value=one_k_signals)
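One thing worth verifying at this step: `one_k_signals` has 3163 rows, while `df` has 3007 rows with a reset index after the 'Unknown' murmur rows were dropped. `DataFrame.insert` aligns a Series on its index, so unless the 'Unknown' rows all sat at the end of the original frame, row `k` of `df` may receive a signal from a different recording. The toy example below (made-up data, not the notebook's) shows the effect and one possible way to keep the rows paired, by filtering the signal series the same way before inserting.

```python
import pandas as pd

raw = pd.DataFrame({
    'Murmur': ['Absent', 'Unknown', 'Present', 'Absent'],
    'signal': ['s0', 's1', 's2', 's3'],
})

# Filtered frame with a fresh 0..n-1 index, mimicking what was done to `df` earlier.
kept = raw[raw['Murmur'] != 'Unknown'].drop(columns='signal').reset_index(drop=True)

# Index-based insert: row k of `kept` receives row k of the *unfiltered* series.
misaligned = kept.copy()
misaligned.insert(len(misaligned.columns), 'signal', raw['signal'])

# Filtering the signal series the same way before inserting keeps rows paired.
aligned_signals = raw.loc[raw['Murmur'] != 'Unknown', 'signal'].reset_index(drop=True)
aligned = kept.copy()
aligned.insert(len(aligned.columns), 'signal', aligned_signals)

print(misaligned['signal'].tolist())
print(aligned['signal'].tolist())
```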

In [46]:
sr = 1024

signals = df['signal']

target_len = 6 * sr

Another important decision was made here - for these signals only 6 second clips were used. This is due to the size of the 1k signal w/ patient info array, which ended up being a REAL memory hog. This will become important when evaluating the models. 
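A rough back-of-the-envelope estimate of why clip length matters here, using the `(3007, 11, 6144)` shape of the array built later in this notebook. The dtypes are my assumption. The real array may be larger still if numpy stores it with an `object` dtype. `array_bytes` is an illustrative helper, not part of the project.

```python
def array_bytes(n_clips, n_channels, n_steps, bytes_per_value):
    """Size in bytes of a dense (n_clips, n_channels, n_steps) array."""
    return n_clips * n_channels * n_steps * bytes_per_value

sr = 1024
for seconds in (6, 12):
    n_steps = seconds * sr
    as_float32 = array_bytes(3007, 11, n_steps, 4) / 1e9
    as_float64 = array_bytes(3007, 11, n_steps, 8) / 1e9
    print(f"{seconds} s clips: {as_float32:.2f} GB as float32, {as_float64:.2f} GB as float64")
```

Ten of the eleven channels are constants repeated across every time step, which is where most of that memory goes.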

In [47]:
#building out array of padded and trimmed signals
new_signals = []
for signal in signals:
    if len(signal) == target_len:
        new_signals.append(signal)
    elif len(signal) > target_len:
        new_signals.append(signal[0:target_len])
    elif len(signal) < target_len:
        padwidth = target_len-len(signal)
        new_signals.append(np.pad(signal, (0, padwidth), mode='constant'))

new_signals = np.asarray(new_signals)

In [48]:
new_signals.shape

(3007, 6144)

`new_signals` now contains the padded and trimmed to 6 seconds _1k sampling rate_ signals. Note the second dim, down from 49k to 6k. This was one of the goal shapes for my data, so I saved it here.

In [49]:
np.save(root_path+'arrays/signal_noPatient', new_signals)

I then reused the code from above, as well as the helper function `patient_info_to_signal`, to create another array. This array has shape `(3007, 11, 6144)`: 3007 observations, 11 channels (1 signal, 10 static patient demo info), and 6144 time steps.

In [71]:
for i in range(0,len(new_signals)):
    if i == 0:
        
        #defines first signal from row 1 and concatenates with patient info signal
        demo_info = patient_info_to_signal(df, i, new_signals[i].shape[0])
        signal_and_patient = np.concatenate((new_signals[i].reshape(1,len(new_signals[i])), demo_info), axis=0)
        
        #defines second signal from row 2, using that row's own patient info
        demo_info2 = patient_info_to_signal(df, i+1, new_signals[i+1].shape[0])
        signal_and_patient2 = np.concatenate((new_signals[i+1].reshape(1,len(new_signals[i+1])), demo_info2), axis=0)
        
        #and this is why we have this whole block. choosing to use .stack to start building the final array
        final_patient_signal = np.stack((signal_and_patient, signal_and_patient2))
        
    if i == 1:
        continue
        
    elif i > 1:
        #building another signal & patient info array
        demo_info = patient_info_to_signal(df, i, new_signals[i].shape[0])
        signal_and_patient = np.concatenate((new_signals[i].reshape(1,len(new_signals[i])), demo_info), axis=0)
        signal_and_patient = signal_and_patient.reshape(1,signal_and_patient.shape[0],signal_and_patient.shape[1])
        
        #assembling the final array
        final_patient_signal = np.concatenate((final_patient_signal, signal_and_patient))
        
    clear_output(wait=True)
    display(final_patient_signal.shape)
    
np.save(root_path+'arrays/signal_withPatient', final_patient_signal)

(3007, 11, 6144)

## Conclusion:

And that's it! All the arrays to feed into the models are generated. Importantly, this process was iterative. I returned to this notebook _frequently_ to modify and create new datasets, as well as to make sure I wasn't going insane with the naming of my files (something I did _not_ do well in this project). Quick recap of our data sets:
- 4k sr raw signal data _without_ patient information
- 1k sr signal data _without_ patient information
- 1k sr signal data _with_ patient information
- MFCC data derived from my 4k sr signal data _without_ patient information
- MFCC data derived from my 4k sr signal data _with_ patient information.

The next block of notebooks builds up the models that I evaluate in the last notebook. **PLEASE READ THE INTRO IN NOTEBOOK 0**. Markdown in the Model Search notebooks is sparse, as the plan is laid out in the first and the others are quite repetitive.