In [2]:
import pandas as pd
import numpy as np
import pathlib
from pathlib import Path
import librosa
from typing import List, Tuple
import toolz as tz
from dask.diagnostics import ProgressBar
import dask.bag as db
import dask
import dask.array as da
import zarr
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import robust_scale
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Approach

I'm using a Temporal Convolutional Network (TCN) on breathing audio with minimal pre-processing (just downsampling and length-normalizing).  TCNs are the general case of a class of models whose most famous example is WaveNet.  There's a good explanation here: https://jeddy92.github.io/JEddy92.github.io/ts_seq2seq_conv/

I went with this route for a few reasons. 

1. I have experience with using TCNs for biomedical time series (perhaps not the best reason, but included in the spirit of full disclosure :P )
2. As a ConvNet, it can extract very "low level" features. 
3. The data format has bigger memory requirements when training, but requires less pre-processing.  This is relevant since the end goal is "Edge Prediction" served via a Smartphone app.  It's okay if it needs a bigger machine to train if the end result is that it can serve predictions on lower-end phones.
4. I went with the TCN over other sequence-processing Deep Learning architectures (such as various flavors of RNN) due to research indicating that it outperforms generic recurrent architectures across a range of sequence-modeling tasks: https://arxiv.org/pdf/1803.01271.pdf  It can also take better advantage of GPUs, resulting in decreased training time, which in turn leads to easier iteration & experimentation.
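
To make "Temporal Convnet" a bit more concrete, here is a minimal NumPy sketch of the core building block, a causal dilated convolution. It is illustrative only and is not the model trained in this notebook: the function names, the kernel size of 2, the fixed weights, and the doubling dilation schedule are all assumptions, and a real TCN adds learned weights, residual blocks, and nonlinearities.

```python
import numpy as np


def causal_dilated_conv(x: np.ndarray, weights: np.ndarray, dilation: int) -> np.ndarray:
    """1-D causal convolution: out[t] only depends on x[t], x[t - d], x[t - 2d], ..."""
    k = len(weights)
    pad = (k - 1) * dilation
    # Left-pad with zeros so no output sample can see the future.
    padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for i, w in enumerate(weights):
        # weights[-1] multiplies the current sample, weights[0] the oldest one.
        out += w * padded[i * dilation : i * dilation + len(x)]
    return out


def receptive_field(kernel_size: int, dilations: list) -> int:
    """Input samples seen by one output sample, assuming one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


signal = np.sin(np.linspace(0, 4 * np.pi, 64))
hidden = signal
for d in [1, 2, 4, 8]:  # hypothetical WaveNet-style doubling schedule
    hidden = causal_dilated_conv(hidden, np.array([0.5, 0.5]), d)

print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

Doubling the dilation at each layer is what lets a shallow stack cover a long stretch of a slow signal like breathing: the receptive field grows exponentially with depth while the parameter count grows linearly.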

# Processing

In [3]:
def pad_to_length(arr: np.ndarray, max_len: int) -> np.ndarray:
    # Extend a 1-D signal to max_len by repeating it from the start
    arr_len = arr.shape[0]
    diff = max_len - arr_len
    return np.pad(arr, (0, diff), mode="wrap")

def make_breath_array(
    audio_txt_folder: pathlib.Path, file_df: pd.DataFrame
) -> dask.array.Array:
    # NOTE: the rows built here are assumed to come out in the same
    # order as the rows of file_df, since the targets are joined by position
    files_to_use = list(audio_txt_folder.rglob("*.wav"))
    # I downsampled it to the lowest rate I could get without
    # running into DivideByZero errors.  Breathing is
    # low-frequency
    wav_bag = (
        db.from_sequence(files_to_use, npartitions=8)
        .map(lambda x: librosa.load(x, sr=87)[0])
        .compute()
    )

    max_len = max(x.shape[0] for x in wav_bag)

    breath_array = (
        db.from_sequence(wav_bag, npartitions=8)
        .map(lambda x: pad_to_length(x, max_len))
        .to_dataframe()
        .to_dask_array(lengths=True)
    )

    new_cols = da.stack(
        [
            da.from_array(file_df["n_breaths"].values),
            da.from_array((file_df["Diagnosis"] == "Healthy").astype(np.int8).values),
        ],
        axis=1,
    )

    return da.concatenate([breath_array, new_cols], axis=1).astype(np.float32)

A literature review led me to believe that for breath data we generally want longer recordings, and can balance that out with a lower sampling rate.  

> In comparison with other physiological signatures (e.g., heart rate, EEG), breathing patterns usually have a narrow and low frequency bandwidth (e.g., between 0.1Hz and 0.85Hz). In other words, it requires the collection of longer data sets to allow for a deep learning process to take place

from here:  https://arxiv.org/ftp/arxiv/papers/1708/1708.06026.pdf

## Sampling Rate
I chose 87 Hz because it was the lowest rate I could get Librosa to load at without giving me `DivideByZero` errors (that I don't fully understand), while still being well above what I'm aware of as the range of frequencies of interest for human breath.  

Low sampling rates are also useful because we want to balance long recordings with memory concerns (both for storage, and so as not to run out of memory when training).
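
As a rough sanity check on both points, the sketch below computes the Nyquist limit (half the sampling rate) and the float32 memory footprint of one mono recording. The 20-second duration and the 44.1 kHz comparison rate are assumptions chosen purely for illustration, not properties of this dataset.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented at a given sampling rate."""
    return sample_rate_hz / 2


def float32_bytes(duration_s: float, sample_rate_hz: int) -> int:
    """Bytes needed to hold one mono recording as float32 samples."""
    return int(duration_s * sample_rate_hz) * 4


DURATION_S = 20  # hypothetical recording length, for illustration only

for sr in (87, 44_100):
    print(sr, nyquist_hz(sr), float32_bytes(DURATION_S, sr))
```

At 87 Hz the Nyquist limit is 43.5 Hz, roughly 50 times the 0.85 Hz upper bound quoted earlier, and under these assumptions each recording takes about 7 KB instead of about 3.5 MB.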

## Dask Bags
https://tutorial.dask.org/02_bag.html
As this is a pretty big dataset, I decided to process it with Dask.  The recordings are of different lengths, so I couldn't just stack them all into a single array.  Instead, I used Dask Bags to read them at the specified sampling rate, and then padded them to the length of the longest sample.  I did this because we need a standard size for model optimization and easy storage.  
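
The two passes described here (find the longest recording, then pad everything up to it) can be sketched in plain NumPy on synthetic signals. This is a simplified stand-in for the Dask Bag pipeline, and the random "recordings" and their lengths are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for loaded recordings of different lengths.
recordings = [rng.standard_normal(n).astype(np.float32) for n in (50, 80, 65)]

# Pass 1: find the longest recording.
max_len = max(r.shape[0] for r in recordings)

# Pass 2: pad every recording up to that length, then stack into one 2-D array.
padded = [np.pad(r, (0, max_len - r.shape[0]), mode="wrap") for r in recordings]
batch = np.stack(padded)

print(batch.shape)  # (3, 80)
```

The first pass has to finish before the second can start, which is why the notebook code calls `.compute()` on the loaded bag before padding.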

I padded them with the `wrap` option in NumPy's `pad` function, which
>Pads with the wrap of the vector along the axis. The first values are used to pad the end and the end values are used to pad the beginning.

I decided to do this instead of just padding with 0s because breathing, unlike speech, is cyclic.  The `reflect` or `symmetric` modes could probably also have been used.
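
To see how the candidate modes differ, here is a small sketch that pads the end of a toy four-sample "cycle" with each of them. The toy array is an assumption for illustration; only the end is padded, matching `pad_to_length`.

```python
import numpy as np

cycle = np.array([1, 2, 3, 4])
modes = ("constant", "wrap", "reflect", "symmetric")
padded = {mode: np.pad(cycle, (0, 3), mode=mode) for mode in modes}

for mode, arr in padded.items():
    print(f"{mode:>9}: {arr}")
#  constant: [1 2 3 4 0 0 0]
#      wrap: [1 2 3 4 1 2 3]
#   reflect: [1 2 3 4 3 2 1]
# symmetric: [1 2 3 4 4 3 2]
```

`wrap` restarts the signal from its beginning, which continues a periodic pattern, whereas `reflect` and `symmetric` play the end of the signal backwards.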

## Target Variables

In [None]:
def make_diagnosis_df(data_folder: pathlib.Path) -> pd.DataFrame:
    return pd.read_csv(
        data_folder.joinpath("patient_diagnosis.csv"),
        header=None,
        names=["Patient", "Diagnosis"],
    ).set_index("Patient")


def make_file_df(
    diagnosis_df: pd.DataFrame, audio_txt_folder: pathlib.Path
) -> pd.DataFrame:
    file_stats = [
        get_record_stats(audio_txt_folder, x.name.split(".")[0])
        for x in audio_txt_folder.glob("*.wav")
    ]

    file_df = pd.DataFrame(
        file_stats,
        columns=[
            "Patient",
            "Section",
            "Location",
            "n_channels",
            "device",
            "filesize",
            "n_breaths",
        ],
    ).assign(
        Diagnosis=lambda x: x["Patient"]
        .astype(int)
        .map(lambda y: diagnosis_df.loc[y, "Diagnosis"])
    )
    return file_df

def get_record_stats(
    folder: pathlib.Path, file: str
) -> Tuple[str, str, str, str, str, int, int]:
    # File names have five "_"-separated fields:
    # Patient, Section, Location, n_channels, device
    name_elems = file.split("_")
    wav_size = Path(folder).joinpath(f"{file}.wav").stat().st_size
    # One annotation line per breath cycle; skip blank lines
    # (e.g. a trailing empty line) so they aren't counted
    with open(Path(folder).joinpath(f"{file}.txt"), "r") as f:
        n_breath_cycles = sum(1 for line in f if line.strip())
    return tuple(name_elems + [wav_size, n_breath_cycles])


I recast the different diagnoses as simply "Healthy/Unhealthy", as binary classifiers are simpler and we'll need to retrain the last layer anyway.
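
A minimal sketch of that recoding, using the same comparison as `make_breath_array`. The diagnosis values in this toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical diagnosis labels, for illustration only.
file_df = pd.DataFrame({"Diagnosis": ["Healthy", "COPD", "URTI", "Healthy"]})

# 1 = Healthy, 0 = any other diagnosis.
is_healthy = (file_df["Diagnosis"] == "Healthy").astype(np.int8)
print(is_healthy.tolist())  # [1, 0, 0, 1]
```

Note that with this encoding the positive class is "Healthy", which matters when reading metrics such as precision and recall later on.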

I included all the samples, including the stethoscope ones, because imperfect data can still help a model learn low-level features when using Transfer Learning.  We still need "COVID patients breathing into a microphone" for the final training, but noisy, different data is still good to throw in. In addition, we don't have any mouth recordings anyway - the closest to our ideal would be the recordings taken with a microphone from the patient's larynx.
Deep Learning is weird: it doesn't totally follow the "Garbage In/Garbage Out" principle.  Blurry pictures of cars will still help a Computer Vision algo learn to distinguish Cats from Dogs.

Nothing much to say about the "counting number of breaths" target feature!
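
For reference, the count is just the number of annotation lines in each recording's `.txt` file. Below is a hedged sketch that counts non-blank lines, so the result does not depend on whether a file ends with an empty line. The function name and the tab-separated file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def count_breath_cycles(annotation_file: Path) -> int:
    """Count non-blank lines, so a trailing empty line is not counted as a breath."""
    with open(annotation_file, "r") as f:
        return sum(1 for line in f if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical annotation content: one tab-separated line per breath cycle.
    path = Path(tmp) / "example.txt"
    path.write_text("0.0\t1.9\n1.9\t4.1\n4.1\t6.3\n\n")
    n_cycles = count_breath_cycles(path)

print(n_cycles)  # 3
```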

## Zarr
I saved the data in the Zarr format, a very handy chunked binary format for arrays.    
https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/

In [4]:
def save_to_zarr(arr: dask.array.Array, folder: pathlib.Path, filename: str) -> None:
    destination = str(Path(folder, filename))
    # to_zarr needs regular chunk sizes, so rechunk (automatic sizes) first
    da.to_zarr(arr.rechunk(), destination)