# Creating Signal Data

Author: Jake Dumbauld <br>
Contact: jacobmilodumbauld@gmail.com<br>
Date: 3.15.22

## Introduction

The purpose of this notebook is to convert all of the .wav files into signal data at different sampling rates (sr) for use in different types of modelling. I initially created only 4k sr data, which was used to train simple statistical models (logistic regression, SVM, KNN). I later returned to this notebook to create 1k sr data through the same process; more details on that come later.

## Imports

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import librosa
import librosa.display
import IPython

import os
from random import uniform
import time
from IPython.display import display, clear_output

In [2]:
root_path = '/Users/jmd/Documents/BOOTCAMP/Capstone/'

The first step in investigating my data was to import the training data CSV. This file contains information on each of the patients in the sample, as well as qualitative and quantitative information about any murmurs heard. We'll dissect this further in notebook 5.

In [3]:
# importing data annotations as a df
patient_info = pd.read_csv(root_path+'the-circor-digiscope-phonocardiogram-dataset-1.0.1/training_data.csv')

In [4]:
patient_info.columns

Index(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
       'Pregnancy status', 'Murmur', 'Murmur locations',
       'Most audible location', 'Systolic murmur timing',
       'Systolic murmur shape', 'Systolic murmur grading',
       'Systolic murmur pitch', 'Systolic murmur quality',
       'Diastolic murmur timing', 'Diastolic murmur shape',
       'Diastolic murmur grading', 'Diastolic murmur pitch',
       'Diastolic murmur quality', 'Campaign', 'Additional ID'],
      dtype='object')

In [5]:
training_folder_path = root_path + 'the-circor-digiscope-phonocardiogram-dataset-1.0.1/training_data/'

Within this folder are all of the .wav files I will transform into amplitude data below.

## Generating 4K Signals

Taking into consideration the information gathered in notebook 1, I chose a sampling rate of 4096 Hz. The reason for this is two-fold:
- The fundamental heart sounds that I'm interested in lie between 20 and 500 Hz ([source](https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/heart-sounds)). However, in reviewing the data I found frequencies ranging up to 2 kHz, so I raised the sampling rate to capture these sounds fully. To satisfy the [Nyquist-Shannon Sampling Theorem](https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem), the sr must be at least 2x the maximum frequency I'm trying to capture, and 4096 Hz is just above 2 x 2 kHz.
- A sampling rate of 4096 Hz, as opposed to librosa's default of 22,050 Hz, reduces the amount of data we have to process to roughly 1/5 of the original. Given the limited computing power I have available, this is a *huge* upside.
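The two points above come down to simple arithmetic, which can be checked directly. This is only an illustrative sketch: the 2 kHz ceiling and the 22,050 Hz comparison rate are taken from the text, and the variable names are my own.

```python
# Illustrative arithmetic only; variable names are hypothetical.
target_sr = 4096      # sampling rate chosen in this notebook (Hz)
max_freq_hz = 2000    # highest frequency observed in the data, per the text
default_sr = 22050    # librosa's default sampling rate (Hz)

# The Nyquist frequency is the highest frequency a given sr can represent
nyquist_hz = target_sr / 2
meets_nyquist = target_sr >= 2 * max_freq_hz

# Fraction of samples kept relative to loading at the default rate
data_fraction = target_sr / default_sr

print(f"Nyquist frequency at {target_sr} Hz: {nyquist_hz} Hz")
print(f"Covers {max_freq_hz} Hz: {meets_nyquist}")
print(f"Samples kept relative to {default_sr} Hz: {data_fraction:.3f}")
```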

In [6]:
sr = 4096

The code below loops through each of the .wav files in `training_folder_path`, converts each one into signal data, and appends it to a list.

In [7]:
filenames = []
signals = []

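i = 0  # number of .wav files loaded so far

for file in os.listdir(training_folder_path):
    filename = os.fsdecode(file)
    if filename.endswith(".wav"):
        # load the recording, resampled to the target sampling rate
        signal, sr = librosa.load(training_folder_path + file, sr = sr)
        # drop the extension before storing the name
        filenames.append(filename.split('.')[0])
        signals.append(signal)
        i += 1
        # report progress every 50 files (only when a .wav was just loaded)
        if i % 50 == 0:
            clear_output(wait=True)
            display(f"{i} files loaded")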

'3150 files loaded'
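One detail worth keeping in mind is that `os.listdir` returns entries in arbitrary order, so the order of `signals` can differ between machines. If a reproducible order matters, the file list can be filtered and sorted before loading. The sketch below is not what this notebook ran; `list_wav_files` is a hypothetical helper and the filenames are made up for the demonstration.

```python
import os
import tempfile


def list_wav_files(folder):
    """Return the .wav filenames in `folder`, sorted by name (hypothetical helper)."""
    return sorted(f for f in os.listdir(folder) if f.endswith(".wav"))


# Demonstration on a throwaway folder with made-up filenames
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2530_PV.wav", "2530_AV.wav", "2530_AV.hea", "9979_MV.wav"]:
        open(os.path.join(tmp, name), "w").close()
    wav_files = list_wav_files(tmp)

print(wav_files)
```

Each name in the sorted list could then be passed to `librosa.load` exactly as in the loop above.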

I also captured the patient ID and `location` of each signal from its filename. These will be used in the final df/array I'll feed into the models.<br>
Quick note: `location` refers to the position on the patient's body where the sound was recorded. Think about a time when a doctor was listening to your heart: they never listen at just one spot, do they? That's the idea here. Some heart sounds are more audible at some locations than others, so each patient in this sample had multiple locations recorded. For a full data dictionary on `patient_info`, visit this [link](https://moody-challenge.physionet.org/2022/).

In [8]:
signal_locations = []
for file in filenames:
    signal_locations.append(file.split('_')[1])

signal_patient_ids = []
for file in filenames:
    signal_patient_ids.append(file.split('_')[0])
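The same parsing can be done in a single pass. This is a minimal sketch, assuming names of the form `<patient id>_<location>` with any further underscore-separated parts ignored, just as the `split('_')` calls above ignore them. `parse_recording_name` and the example names are hypothetical.

```python
def parse_recording_name(name):
    """Split '<patient id>_<location>[_...]' into (patient_id, location).

    Hypothetical helper; assumes the first part is numeric.
    """
    parts = name.split('_')
    return int(parts[0]), parts[1]


example_names = ["2530_PV", "2530_AV", "9979_MV_1"]  # made-up examples
parsed = [parse_recording_name(n) for n in example_names]

# Unzip the (id, location) pairs into two parallel tuples
example_ids, example_locations = zip(*parsed)
print(example_ids)
print(example_locations)
```

Converting the ID to `int` here would make the later `astype('int')` step unnecessary.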

Code to create a df with the signals, locations, and patient IDs.

In [9]:
signal_df = pd.DataFrame({'signal_patient_id': signal_patient_ids,
                         'location': signal_locations,
                         'signal': signals},
                         columns=['signal_patient_id','location','signal'])

signal_df['signal_patient_id'] = signal_df['signal_patient_id'].astype('int')

signal_df.sort_values(by='signal_patient_id').reset_index(drop=True)

| | signal_patient_id | location | signal |
|---|---|---|---|
| 0 | 2530 | PV | [0.07682987, 0.06061038, 0.039170958, 0.048250... |
| 1 | 2530 | AV | [-0.01187718, 0.029969877, 0.01927742, -0.0206... |
| 2 | 2530 | MV | [0.37442628, 0.32439327, 0.095518045, -0.06558... |
| 3 | 2530 | TV | [0.06770988, 0.073658854, 0.072224066, 0.08253... |
| 4 | 9979 | MV | [0.15039496, 0.18560724, 0.17212218, 0.1603406... |
| ... | ... | ... | ... |
| 3158 | 85345 | AV | [0.03814805, 0.06008572, 0.03808801, -0.008758... |
| 3159 | 85345 | PV | [-0.0061015976, 0.029588033, 0.020953469, 0.00... |
| 3160 | 85349 | AV | [0.00026825635, 0.0034792388, 0.014762115, -0.... |
| 3161 | 85349 | PV | [0.1387334, 0.08302983, 0.13867667, -0.0078439... |


To fit this signal info to the patient information dataframe I initially imported, I have to expand that dataframe so that it has one row per patient and location rather than one row per patient, since this is how the signal data is organized.

This first block of code creates a new column that tracks how many locations (and thus how many audio files) were recorded for each patient.

In [10]:
patient_info['location_count'] = 0
for i in range(0,len(patient_info['location_count'])):
    patient_info.at[i,'location_count'] = len(patient_info['Locations'][i].split('+'))

Then I repeated each of those rows for the number of locations, and reset the index.

In [11]:
patient_info = patient_info.loc[patient_info.index.repeat(patient_info.location_count)]
patient_info.reset_index(drop=True,inplace=True)
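The counting and repeating steps can also be written without an explicit loop. The sketch below shows one possible alternative on a made-up two-patient table; only the column names `Patient ID` and `Locations` come from the dataset, and the `location` column it produces is my own addition. It is not the code used to generate the saved arrays.

```python
import pandas as pd

# Made-up miniature of the patient table
toy_patients = pd.DataFrame({
    'Patient ID': [2530, 9979],
    'Locations': ['AV+PV+TV+MV', 'AV+PV'],
})

# Count the '+'-separated locations per patient in one vectorized step
toy_patients['location_count'] = toy_patients['Locations'].str.split('+').str.len()

# One row per (patient, location): split into lists, then explode the lists into rows
expanded = (
    toy_patients
    .assign(location=toy_patients['Locations'].str.split('+'))
    .explode('location')
    .reset_index(drop=True)
)
print(expanded)
```

Unlike `index.repeat`, `explode` also labels each repeated row with the specific location it stands for, which gives a key to join on later.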

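Finally, I concatenated this df with the signal information generated above to create a final data frame with all of the information from the patient df as well as the signal data. Each signal has an accompanying column that indicates the location it was recorded from. Note that `pd.concat(..., axis=1)` pairs rows purely by position, so this step relies on both frames being ordered by patient ID and holding the same number of rows for each patient.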

In [14]:
final_patient_info = pd.concat([patient_info.reset_index(drop=True),
                                signal_df.sort_values(by='signal_patient_id').reset_index(drop=True)], axis=1)
final_patient_info

| | Patient ID | Locations | Age | Sex | Height | Weight | Pregnancy status | Murmur | Murmur locations | Most audible location | ... | Diastolic murmur shape | Diastolic murmur grading | Diastolic murmur pitch | Diastolic murmur quality | Campaign | Additional ID | location_count | signal_patient_id | location | signal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2530 | AV+PV+TV+MV | Child | Female | 98.0 | 15.9 | False | Absent | | | ... | | | | | CC2015 | | 4 | 2530 | PV | [0.07682987, 0.06061038, 0.039170958, 0.048250... |
| 1 | 2530 | AV+PV+TV+MV | Child | Female | 98.0 | 15.9 | False | Absent | | | ... | | | | | CC2015 | | 4 | 2530 | AV | [-0.01187718, 0.029969877, 0.01927742, -0.0206... |
| 2 | 2530 | AV+PV+TV+MV | Child | Female | 98.0 | 15.9 | False | Absent | | | ... | | | | | CC2015 | | 4 | 2530 | MV | [0.37442628, 0.32439327, 0.095518045, -0.06558... |
| 3 | 2530 | AV+PV+TV+MV | Child | Female | 98.0 | 15.9 | False | Absent | | | ... | | | | | CC2015 | | 4 | 2530 | TV | [0.06770988, 0.073658854, 0.072224066, 0.08253... |
| 4 | 9979 | AV+PV+TV+MV | Child | Female | 103.0 | 13.1 | False | Present | AV+MV+PV+TV | TV | ... | | | | | CC2015 | | 4 | 9979 | MV | [0.15039496, 0.18560724, 0.17212218, 0.1603406... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3158 | 85345 | AV+PV | Child | Female | 132.0 | 38.1 | False | Absent | | | ... | | | | | CC2015 | | 2 | 85345 | AV | [0.03814805, 0.06008572, 0.03808801, -0.008758... |
| 3159 | 85345 | AV+PV | Child | Female | 132.0 | 38.1 | False | Absent | | | ... | | | | | CC2015 | | 2 | 85345 | PV | [-0.0061015976, 0.029588033, 0.020953469, 0.00... |
| 3160 | 85349 | AV+PV+TV | | Female | | | True | Absent | | | ... | | | | | CC2015 | | 3 | 85349 | AV | [0.00026825635, 0.0034792388, 0.014762115, -0.... |
| 3161 | 85349 | AV+PV+TV | | Female | | | True | Absent | | | ... | | | | | CC2015 | | 3 | 85349 | PV | [0.1387334, 0.08302983, 0.13867667, -0.0078439... |
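Since the concatenation pairs rows by position, a key-based join is one way to guard against rows drifting out of alignment. The sketch below is an alternative, not what this notebook ran. It uses made-up miniature frames and assumes each (patient, location) pair occurs exactly once in both, which `validate='one_to_one'` checks.

```python
import pandas as pd

# Made-up miniature frames; column names mirror the ones used above
toy_expanded = pd.DataFrame({
    'Patient ID': [2530, 2530, 9979],
    'location': ['AV', 'PV', 'MV'],
    'Murmur': ['Absent', 'Absent', 'Present'],
})
toy_signals = pd.DataFrame({
    'signal_patient_id': [9979, 2530, 2530],   # deliberately in a different order
    'location': ['MV', 'PV', 'AV'],
    'signal': [[0.15, 0.18], [0.07, 0.06], [-0.01, 0.03]],
})

# Join on the keys rather than on row position
merged = toy_expanded.merge(
    toy_signals,
    left_on=['Patient ID', 'location'],
    right_on=['signal_patient_id', 'location'],
    how='left',
    validate='one_to_one',   # raises if a key pair is duplicated on either side
)
print(merged[['Patient ID', 'location', 'Murmur', 'signal']])
```

With a join like this, each signal ends up next to the patient row for its own location regardless of the order in which the files were loaded.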


In [13]:
final_patient_info.columns

Index(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
       'Pregnancy status', 'Murmur', 'Murmur locations',
       'Most audible location', 'Systolic murmur timing',
       'Systolic murmur shape', 'Systolic murmur grading',
       'Systolic murmur pitch', 'Systolic murmur quality',
       'Diastolic murmur timing', 'Diastolic murmur shape',
       'Diastolic murmur grading', 'Diastolic murmur pitch',
       'Diastolic murmur quality', 'Campaign', 'Additional ID',
       'location_count', 'signal_patient_id', 'location', 'signal'],
      dtype='object')

There are some unneeded columns here, but these will be cleaned up at a later stage when neural networks are employed.

In [16]:
np.save(root_path + 'arrays/patient_signals_4k.npy', final_patient_info.to_numpy())

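Finally, I saved the df as a NumPy array because the signal data was truncated when exported as a CSV. As far as I could tell, this comes from a limit on how much of each array is written to a cell; a likely cause is that each array is written as its printed text, which NumPy abbreviates with `...` once an array is long.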
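Because the frame mixes scalars with variable-length arrays, `to_numpy()` yields an object array, which `np.save` stores via pickle. Reading it back therefore needs `allow_pickle=True`, and the column names have to be supplied again because they are not stored in the file. A minimal round-trip sketch on made-up data, with a hypothetical filename:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up miniature of the final frame: one scalar column plus variable-length signals
toy_df = pd.DataFrame({
    'Patient ID': [2530, 9979],
    'signal': [np.array([0.1, 0.2, 0.3], dtype=np.float32),
               np.array([0.4, 0.5], dtype=np.float32)],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_signals.npy')   # hypothetical filename
    np.save(path, toy_df.to_numpy())              # object array, stored via pickle
    restored = np.load(path, allow_pickle=True)   # required for object arrays

# Rebuild a DataFrame; the column names must be passed in again
restored_df = pd.DataFrame(restored, columns=toy_df.columns)
print(restored_df['signal'][0])
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.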

## Generating 1K SR Signals

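The code below repeats the steps above with a 1k (1024 Hz) sampling rate. The reasoning for this is explained later in the model search notebooks. These cells assume `patient_info` has been freshly loaded from the CSV; re-running the row expansion on an already expanded frame would repeat the rows a second time.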

In [5]:
sr = 1024

filenames = []
signals = []

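i = 0  # number of .wav files loaded so far

for file in os.listdir(training_folder_path):
    filename = os.fsdecode(file)
    if filename.endswith(".wav"):
        # load the recording, resampled to the target sampling rate
        signal, sr = librosa.load(training_folder_path + file, sr = sr)
        # drop the extension before storing the name
        filenames.append(filename.split('.')[0])
        signals.append(signal)
        i += 1
        # report progress every 50 files (only when a .wav was just loaded)
        if i % 50 == 0:
            clear_output(wait=True)
            display(f"{i} files loaded")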

'3150 files loaded'

In [6]:
signal_locations = []
for file in filenames:
    signal_locations.append(file.split('_')[1])

signal_patient_ids = []
for file in filenames:
    signal_patient_ids.append(file.split('_')[0])

In [7]:
signal_df = pd.DataFrame({'signal_patient_id': signal_patient_ids,
                         'location': signal_locations,
                         'signal': signals},
                         columns=['signal_patient_id','location','signal'])

signal_df['signal_patient_id'] = signal_df['signal_patient_id'].astype('int')

signal_df.sort_values(by='signal_patient_id').reset_index(drop=True)

| | signal_patient_id | location | signal |
|---|---|---|---|
| 0 | 2530 | PV | [0.037114453, 0.061636038, 0.055389933, 0.0675... |
| 1 | 2530 | AV | [0.006357202, -0.010564456, -0.008372305, 0.00... |
| 2 | 2530 | MV | [0.16865823, 0.035755683, -0.01783402, 0.00613... |
| 3 | 2530 | TV | [0.042926352, 0.08795212, 0.081163116, 0.07926... |
| 4 | 9979 | MV | [0.11233253, 0.16617733, 0.22739325, 0.3153002... |
| ... | ... | ... | ... |
| 3158 | 85345 | AV | [0.02705935, 0.0051991576, -0.028789269, -0.00... |
| 3159 | 85345 | PV | [0.011014578, 0.009928619, 0.014445103, 0.0160... |
| 3160 | 85349 | AV | [0.0010312841, -0.0074770185, 0.0053564664, 0.... |
| 3161 | 85349 | PV | [0.070928715, 0.053460453, -0.012949261, -0.09... |


In [8]:
patient_info['location_count'] = 0
for i in range(0,len(patient_info['location_count'])):
    patient_info.at[i,'location_count'] = len(patient_info['Locations'][i].split('+'))

In [9]:
patient_info = patient_info.loc[patient_info.index.repeat(patient_info.location_count)]

In [10]:
patient_info.reset_index(drop=True,inplace=True)

In [11]:
final_patient_info = pd.concat([patient_info.reset_index(drop=True),
                                signal_df.sort_values(by='signal_patient_id').reset_index(drop=True)], axis=1)

In [12]:
final_patient_info.columns

Index(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
       'Pregnancy status', 'Murmur', 'Murmur locations',
       'Most audible location', 'Systolic murmur timing',
       'Systolic murmur shape', 'Systolic murmur grading',
       'Systolic murmur pitch', 'Systolic murmur quality',
       'Diastolic murmur timing', 'Diastolic murmur shape',
       'Diastolic murmur grading', 'Diastolic murmur pitch',
       'Diastolic murmur quality', 'Campaign', 'Additional ID',
       'location_count', 'signal_patient_id', 'location', 'signal'],
      dtype='object')

In [13]:
np.save(root_path + 'arrays/patient_signals_1k.npy', final_patient_info.to_numpy())