# Isolating Signals & Target Variable for Simple Modelling

Author: Jake Dumbauld <br>
Contact: jacobmilodumbauld@gmail.com<br>
Date: 3.15.22

## Introduction:

The purpose of this notebook is to generate a dataset that can be fed into a handful of simple statistical models. Note that when this notebook was created, I had not yet run into the sampling rate problem that caused me to return to notebook 2, so only the 4k sampling rate data is processed here.

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import librosa
import librosa.display
import IPython

import os
from random import uniform
import time
from IPython.display import display, clear_output

In [2]:
root_path = '/Users/jmd/Documents/BOOTCAMP/Capstone/'

Loading in the dataframe created in the previous notebook

In [3]:
df = pd.DataFrame(data = np.load(root_path + 'arrays/patient_signals_4k.npy', allow_pickle=True),
                       columns=(['Patient ID', 'Locations', 'Age', 'Sex', 'Height', 'Weight',
                                 'Pregnancy status', 'Murmur', 'Murmur locations',
                                 'Most audible location', 'Systolic murmur timing',
                                 'Systolic murmur shape', 'Systolic murmur grading',
                                 'Systolic murmur pitch', 'Systolic murmur quality',
                                 'Diastolic murmur timing', 'Diastolic murmur shape',
                                 'Diastolic murmur grading', 'Diastolic murmur pitch',
                                 'Diastolic murmur quality', 'Campaign', 'Additional ID',
                                 'location_count', 'signal_patient_id', 'location', 'signal']))

Defining my sampling rate to be used throughout this notebook

In [4]:
sr = 4096

## Binarizing Target

To feed this information into a simple statistical model, I need to convert the strings in the dataframe's `Murmur` column into something machine readable.

#### Dropping Unknown Murmurs

In this dataset, there were 156 audio files for which the listener was unsure if there was a murmur, about 5% of the dataset. Since I'm setting up for a supervised learning process in which the purpose is to develop models that can predict a binary target, I decided to drop these samples.

In [5]:
df.shape

(3163, 26)

In [6]:
df['Murmur'].value_counts()

Absent     2391
Present     616
Unknown     156
Name: Murmur, dtype: int64

In [7]:
df.drop(df[df['Murmur'] == 'Unknown'].index, inplace=True)

In [8]:
df.reset_index(inplace=True, drop=True)
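The drop-and-reindex step above can also be written as a single boolean-mask filter. This is a minimal sketch on a made-up toy dataframe (the `toy` and `labelled` names and the values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration.
toy = pd.DataFrame({
    "Patient ID": [1, 2, 3, 4, 5],
    "Murmur": ["Absent", "Unknown", "Present", "Absent", "Unknown"],
})

# Keep only rows with a definite label, then rebuild a clean 0..n-1 index
# so later positional joins line up.
labelled = toy[toy["Murmur"] != "Unknown"].reset_index(drop=True)

print(labelled)
```

Filtering into a new variable leaves the original frame untouched, which makes it easier to re-run cells out of order without losing rows.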

#### Binarizing Murmurs

Simple binarizing of the target variable below. It is worth noting that the data is imbalanced: only ~20% of my samples are in the positive class (murmur present).

In [9]:
df['Murmur'].value_counts()

Absent     2391
Present     616
Name: Murmur, dtype: int64
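The ~20% figure follows directly from the counts above, and the encoding applied in the next cell can be checked on toy labels. A minimal sketch, assuming a made-up `labels` series (only the two counts come from this notebook):

```python
import pandas as pd

# Class balance from the counts printed above.
present, absent = 616, 2391
positive_share_real = present / (present + absent)
print(round(positive_share_real, 3))  # 0.205

# Toy labels to illustrate the encoding; not the real data.
labels = pd.Series(["Absent", "Present", "Absent", "Absent", "Present"])
mapping = {"Absent": 0, "Present": 1}
encoded = labels.map(mapping)

# Series.map yields NaN for any value missing from the mapping,
# so a quick check guards against unexpected label strings.
assert encoded.notna().all()
encoded = encoded.astype("int64")
```

The `notna()` check is the useful part: if an `Unknown` row had slipped through, it would surface here as a NaN rather than silently entering the model.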

In [10]:
df['Murmur'] = df['Murmur'].map({"Absent": 0,
                                 "Present": 1})

## Extracting a Signal DF

With murmurs binarized, I now needed to deal with the problem of variable signal lengths. The goal was to create an array of signals of equal length, plus a column with my binary target. First, I want to compute some simple descriptive statistics on the lengths of my signals.

In [11]:
signal_df = df[['Murmur','signal']]

signals = signal_df['signal']

lengths = []
for signal in signals:
    lengths.append(len(signal))

lengths = pd.Series(lengths)

(lengths / sr).describe() # dividing by the sampling rate first, so the lengths are in seconds and the count stays a count

count    3007.000000
mean       22.894569
std         7.297946
min         5.152100
25%        19.056152
50%        21.488037
75%        29.392090
max        64.512207
dtype: float64

With some simple descriptive statistics of my signals, I can see that the average signal length is about 23 seconds, with a min and max of about 5 and 64.5 seconds respectively.

To achieve the above goal, I settled on a clip length of 12 seconds. I set an admittedly arbitrary threshold: I wanted 90% of the signals to be longer than the clip length I chose, so that I wasn't trimming out too much actual data or padding a bunch of zeroes into my dataset.

In [12]:
for i in range(4, 30):
    print(i, len(lengths[lengths > (i * sr)]) / len(lengths))

4 1.0
5 1.0
6 0.9986697705354174
7 0.9960093116062521
8 0.9890256069171932
9 0.9787163285666778
10 0.9630861323578317
11 0.9441303624875291
12 0.9218490189557699
13 0.8985700033255737
14 0.8782840039906884
15 0.8540073162620552
16 0.8240771533089458
17 0.79847023611573
18 0.7751912204855338
19 0.7509145327569006
20 0.6508147655470569
21 0.5340871300299301
22 0.4672430994346525
23 0.4286664449617559
24 0.40239441303624873
25 0.3831060857998005
26 0.36647821749251747
27 0.3518456933821084
28 0.32790156301962087
29 0.28633189225141337


The above cell shows that this breakpoint occurs somewhere between 12 and 13 seconds (about 92% of signals are longer than 12 seconds, versus just under 90% for 13), so I set my `target_len` to 12 seconds. Multiplied by the sampling rate, this gives the number of samples to trim each array of amplitude data to.
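The same threshold search can be done without a loop. This is a sketch on randomly generated lengths (the `rng`, `candidates`, `coverage` and `best` names and the toy data are assumptions, not the real signals):

```python
import numpy as np

sr = 4096

# Toy signal lengths in samples, spread between 5 and 65 seconds.
rng = np.random.default_rng(0)
lengths = rng.integers(5 * sr, 65 * sr, size=1000)

# Candidate clip lengths in seconds, matching the range used above.
candidates = np.arange(4, 30)

# Broadcasting compares every signal with every candidate; the column mean
# is the fraction of signals strictly longer than that candidate.
coverage = (lengths[:, None] > candidates[None, :] * sr).mean(axis=0)

# Longest candidate that still keeps at least 90% of signals untrimmed by padding.
best = candidates[coverage >= 0.9].max()
print(best, coverage[candidates == best][0])
```

Picking the largest candidate that clears the 90% bar keeps as much real audio per clip as the threshold allows.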

In [13]:
target_len = 12 * sr
target_len

49152

In [14]:
signal_df['signal'].isna().sum()

0

From here I looped through each signal, trimming it if it was longer than `target_len` or padding it with zeroes if it was shorter.

In [15]:
new_signals = []
for signal in signals:
    if len(signal) == target_len:
        new_signals.append(signal)
    elif len(signal) > target_len:
        new_signals.append(signal[0:target_len])
    elif len(signal) < target_len:
        padwidth = target_len-len(signal)
        new_signals.append(np.pad(signal, (0, padwidth), mode='constant'))

In [16]:
new_signals = np.asarray(new_signals)

In [17]:
new_signals.shape

(3007, 49152)
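The trim-or-pad logic above can be folded into one small helper. A minimal sketch, assuming a hypothetical `fix_length` function and toy inputs (librosa also ships `librosa.util.fix_length`, which appears to cover the same trimming and zero-padding, but the version here uses only numpy):

```python
import numpy as np

def fix_length(signal, target_len):
    """Hypothetical helper: trim to target_len or right-pad with zeros."""
    signal = np.asarray(signal)
    if len(signal) >= target_len:
        # Keep only the first target_len samples.
        return signal[:target_len]
    # Append zeros at the end until the signal reaches target_len.
    return np.pad(signal, (0, target_len - len(signal)), mode="constant")

# Toy signals: one shorter and one longer than the target.
short = np.array([1.0, 2.0, 3.0], dtype=np.float32)
long_ = np.arange(10, dtype=np.float32)

batch = np.stack([fix_length(s, 5) for s in (short, long_)])
print(batch.shape)  # (2, 5)
```

Because every output has the same length, `np.stack` can build the 2-D array directly.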

I then created a df from this to join with my original signal dataframe after dropping the unprocessed signal data. I also did a quick sanity `nan` check. There were also a TON of `x.shape` sanity check cells throughout this notebook that I deleted to make the output cleaner; just a few are left in to show what's going on.

In [19]:
temp_df = pd.DataFrame(new_signals)

# drop returns a new frame here, which avoids pandas' SettingWithCopyWarning on the slice of df
signal_df = signal_df.drop(columns='signal')

final_df = signal_df.join(temp_df)

final_df.isna().sum().sum()

0

In [20]:
final_df.shape

(3007, 49153)

In [21]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3007 entries, 0 to 3006
Columns: 49153 entries, Murmur to 49151
dtypes: float32(49152), int64(1)
memory usage: 563.8 MB
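The reported memory usage can be reproduced by hand from the shape and dtypes above. A short sketch of the arithmetic, assuming pandas reports "MB" in units of 2**20 bytes and ignoring the small index overhead:

```python
# Shape and dtypes taken from the info() output above.
n_rows, n_samples = 3007, 49152

signal_bytes = n_rows * n_samples * 4  # float32 columns: 4 bytes per value
target_bytes = n_rows * 8              # int64 Murmur column: 8 bytes per value

total_mib = (signal_bytes + target_bytes) / 2**20
print(round(total_mib, 1))  # 563.8
```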


I checked the memory usage of the dataframe to ensure it wasn't too large, then saved it as a numpy array to pull into the next notebook.

In [22]:
np.save(root_path + 'arrays/signal_murmur_presimple_4k.npy', final_df.to_numpy())
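One thing to watch when reloading this file: `to_numpy()` on a frame that mixes an `int64` column with `float32` columns returns a single common dtype, which is `float64`, so the saved array takes roughly twice the memory of the dataframe. Below is a sketch of the round trip on a tiny toy frame (the temp path and the `toy`, `X`, `y` names are assumptions; how the next notebook actually splits the array is not shown here):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same layout: int64 target first, float32 samples after.
toy = pd.DataFrame(np.zeros((3, 4), dtype=np.float32))
toy.insert(0, "Murmur", np.array([0, 1, 0], dtype=np.int64))

arr = toy.to_numpy()
print(arr.dtype)  # float64: the common type of int64 and float32

# Round trip through a temporary .npy file.
path = os.path.join(tempfile.mkdtemp(), "toy.npy")
np.save(path, arr)
loaded = np.load(path)

# Split target from features and restore compact dtypes.
y = loaded[:, 0].astype(np.int64)
X = loaded[:, 1:].astype(np.float32)
print(X.shape, y.shape)
```

Casting `X` back to `float32` after loading recovers the smaller footprint without losing information, since the samples started out as `float32`.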