# Surfboard LibriSpeech reference values

This notebook calculates reference means, standard deviations, medians and median absolute deviations (MADs) of the Surfboard features on a 40-hour subset of LibriSpeech.

## Setup

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import scipy.stats

The CSV file of features used below can be downloaded from: https://novoic-surfboard-interspeech2020.s3-us-west-2.amazonaws.com/surfboard_librispeech40h_features.csv

In [None]:
features_csv = 'surfboard_librispeech40h_features.csv'

## Load/prepare the data

In [None]:
surfboard_values_librispeech = pd.read_csv(features_csv)

Build the list of feature names, skipping the first column, `'fnames'`, which is not a feature.

In [None]:
feature_list = surfboard_values_librispeech.columns.tolist()[1:]
print(f'{len(feature_list):,} features')

Following their reference implementations, some features can be undefined. To compute statistics that ignore these values, we replace `inf`s with `nan`s so that `np.nanmean` and the other nan-aware functions used later skip them.

In [None]:
surfboard_values_librispeech = surfboard_values_librispeech.replace([np.inf, -np.inf], np.nan)
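
The effect of this replacement can be seen on a small made-up column. This is a minimal sketch: the `toy` frame and its values are illustrative assumptions, not data from the Surfboard CSV.

```python
import numpy as np
import pandas as pd

# Toy column with one infinite entry (illustrative values only).
toy = pd.DataFrame({'feature': [1.0, 2.0, np.inf, 3.0]})

# A plain mean is dominated by the infinite entry.
plain_mean = np.mean(toy['feature'].to_numpy())

# After mapping +/-inf to nan, np.nanmean skips the undefined entry.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
nan_aware_mean = np.nanmean(cleaned['feature'])

print(plain_mean, nan_aware_mean)
```

Only the three finite values contribute to the nan-aware mean, so a single undefined value no longer makes the whole statistic unusable.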

## Calculate the reference values

In [None]:
reference_values_dict = {}
for feature in tqdm(feature_list, desc='Calculating reference values'):
    ref_mean = np.nanmean(surfboard_values_librispeech[feature])
    ref_std = np.nanstd(surfboard_values_librispeech[feature])
    ref_median = np.nanmedian(surfboard_values_librispeech[feature])
    
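    # median_absolute_deviation was removed in SciPy 1.9; median_abs_deviation
    # with scale='normal' keeps its default normal-consistent scaling (~1.4826).
    ref_mad = scipy.stats.median_abs_deviation(
        surfboard_values_librispeech[feature],
        nan_policy='omit',
        scale='normal'
    )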
    
    reference_values_dict[feature] = [ref_mean, ref_std, ref_median, ref_mad]
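
For reference, the scaled median absolute deviation computed in the loop can be sketched directly in NumPy. The helper name `nan_mad` is hypothetical, and the default factor of 1.4826 is assumed to match SciPy's normal-consistent scaling, which makes the MAD comparable to a standard deviation for normally distributed data.

```python
import numpy as np

def nan_mad(values, scale=1.4826):
    """Scaled median absolute deviation, ignoring nan entries (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # same effect as nan_policy='omit'
    centre = np.median(values)
    return scale * np.median(np.abs(values - centre))

# Toy data with one outlier and one undefined value (illustrative only).
toy_values = [1.0, 2.0, 3.0, 4.0, 100.0, np.nan]
toy_mad = nan_mad(toy_values)
toy_std = np.nanstd(toy_values)
print(toy_mad, toy_std)
```

On this toy input the outlier inflates the standard deviation while the MAD stays close to the spread of the bulk of the values, which is the usual reason for reporting both.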

In [None]:
reference_values_df = pd.DataFrame.from_dict(reference_values_dict)
reference_values_df.index = ['mean', 'std', 'median', 'mad']
reference_values_df.head()
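
The layout of the resulting table can be sketched with a toy dictionary: each feature becomes a column and each statistic a row. The feature names and numbers below are made up for illustration and do not come from the LibriSpeech results.

```python
import pandas as pd

# Hypothetical features, each mapped to [mean, std, median, mad].
toy_reference = {
    'feature_a': [0.5, 0.1, 0.4, 0.05],
    'feature_b': [10.0, 2.0, 9.0, 1.5],
}

toy_df = pd.DataFrame.from_dict(toy_reference)
toy_df.index = ['mean', 'std', 'median', 'mad']

# Look up a single reference value by statistic and feature name.
median_a = toy_df.loc['median', 'feature_a']

# Transpose for one row per feature, which is easier to scan for many features.
per_feature = toy_df.T
print(per_feature)
```

With many features, the transposed view may be more convenient to inspect or save than the wide one shown by `head()`.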