# BlaBla AMI reference values

Calculation of BlaBla reference means and standard deviations from single-speaker transcripts in a subset of the AMI corpus.

## Setup

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm

The AMI corpus is available [here](http://groups.inf.ed.ac.uk/ami/corpus/). Once single-speaker transcripts have been extracted from it, the BlaBla CLI can compute all linguistic features for them:
```
blabla compute-features -F example_configs/features.yaml -S stanza_config/stanza_config.yaml -i ./ami_transcripts -o blabla_ami_features.csv -format string
```

In [None]:
features_csv = 'blabla_ami_features.csv'

## Load/prepare the data

In [None]:
blabla_values_ami = pd.read_csv(features_csv)

In [None]:
feature_list = blabla_values_ami.columns.tolist()
print(f'{len(feature_list):,} features')

Some features can be undefined for a given transcript, following the reference implementations. To compute statistics that ignore these cases, we replace `inf` values with `nan` so that `np.nanmean` and `np.nanstd` skip them later.

In [None]:
blabla_values_ami = blabla_values_ami.replace([np.inf, -np.inf], np.nan)
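
To illustrate why the replacement matters, here is a minimal sketch on a toy table. The column names `feature_a` and `feature_b` and all values are made up for illustration and are not BlaBla features.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table (hypothetical columns and values).
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, np.inf, 3.0],
    'feature_b': [0.5, -np.inf, 1.5, np.nan],
})

# Same replacement as above: infinite values become NaN.
toy_clean = toy.replace([np.inf, -np.inf], np.nan)

# NaN-aware means only use the finite entries.
mean_a = np.nanmean(toy_clean['feature_a'])  # mean of 1, 2, 3
mean_b = np.nanmean(toy_clean['feature_b'])  # mean of 0.5, 1.5

# Without the replacement, a single inf dominates the mean.
naive_mean_a = np.mean(toy['feature_a'])

print(mean_a, mean_b, naive_mean_a)
```

Without the replacement, one infinite value is enough to make the mean of a whole column infinite.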

## Calculate the reference values

In [None]:
reference_values_dict = {}
for feature in tqdm(feature_list, desc='Calculating reference values'):
    # NaN-aware statistics; np.nanstd defaults to the population standard deviation (ddof=0)
    ref_mean = np.nanmean(blabla_values_ami[feature])
    ref_std = np.nanstd(blabla_values_ami[feature])
    reference_values_dict[feature] = [ref_mean, ref_std]
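
The per-feature loop can also be written with pandas column-wise reductions, which skip `nan` by default. The sketch below uses a toy table with hypothetical column names and assumes all columns are numeric. `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed explicitly to match `np.nanstd`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned feature table (hypothetical columns and values).
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, np.nan, 3.0],
    'feature_b': [0.5, 1.5, 2.5, np.nan],
})

# Loop-based version, mirroring the cell above.
loop_stats = {
    col: [np.nanmean(toy[col]), np.nanstd(toy[col])]
    for col in toy.columns
}
loop_df = pd.DataFrame.from_dict(loop_stats)
loop_df.index = ['mean', 'std']

# Vectorized version: pandas reductions skip NaN by default.
vectorized_df = pd.DataFrame({
    'mean': toy.mean(),
    'std': toy.std(ddof=0),  # ddof=0 matches np.nanstd
}).T

# The pandas default (ddof=1) gives a different, larger value.
sample_std_a = toy['feature_a'].std()

print(vectorized_df)
```

Either form yields the same table. The choice matters mainly for readability, and for making the `ddof` convention explicit.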

In [None]:
reference_values_df = pd.DataFrame.from_dict(reference_values_dict)
reference_values_df.index = ['mean', 'std']
reference_values_df.head()
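
One plausible use of such a table, not shown in this notebook, is to express the features of a new transcript as z-scores relative to the reference population. The sketch below assumes a table shaped like `reference_values_df` (rows `mean` and `std`, one column per feature). The feature names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical reference table with the same layout as reference_values_df.
reference_values_example = pd.DataFrame(
    {'feature_a': [2.0, 0.5], 'feature_b': [10.0, 2.0]},
    index=['mean', 'std'],
)

# Hypothetical feature values for one new transcript.
new_sample = pd.Series({'feature_a': 3.0, 'feature_b': 6.0})

# z-score: distance from the reference mean in units of the reference std.
z_scores = (
    (new_sample - reference_values_example.loc['mean'])
    / reference_values_example.loc['std']
)

print(z_scores)
```

Features with a reference standard deviation of zero would produce infinite or undefined z-scores and would need separate handling.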