# Building a phonetic dataset from textgrids and measurements

In [None]:
from parselmouth import Sound
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from phonlab import get_f0
from phonlab.utils.phonlablib import tg_to_df, add_context, merge_tiers, explode_intervals, interpolate_measures, loadsig

## Examples

In [None]:
au, rate = loadsig('mono.wav')
formant = Sound(au, rate).to_formant_burg()

In [None]:
phdf, wddf, phptdf = tg_to_df('mono.TextGrid', tiersel=['phone', 'word', 'pointph'])
phdf = add_context(phdf, 'phone', nprev=2, nnext=2, ctxcol='phonectx')
wddf = add_context(wddf, 'word', nprev=1, nnext=0, ctxcol='wdctx')
wdpts = merge_tiers(inner_df=phptdf, outer_df=wddf, suffixes=['_pt', '_wd'], inner_ts=['t1'], drop_repeated_cols='inner')
phwddf = merge_tiers(inner_df=phdf, outer_df=wdpts, suffixes=['_ph', ''], outer_ts=['t1_wd', 't2_wd'], drop_repeated_cols='inner_df')
vowels = ['ai', 'e', 'a', 'au']
vdf = phwddf.query(f'phone in {vowels}')
vdf

In [None]:
vdf = explode_intervals(2, ts=['t1_ph', 't2_ph'], df=vdf)
vdf

## Generic df interface

Establish a pattern for easily getting values from **any** time-based dataframe of measurements.

First, create a dataframe of this type as a proof of concept. The `tcol` column holds time values, and the `f1` and `f2` columns hold F1 and F2 values. The `foo` column is a nonce column of a non-numeric type.

In [None]:
gendf = pd.read_csv('generic_formant.csv')
gendf

In [None]:
f0df = get_f0(au, fs=rate)
f0df

The `vdf` dataframe has metadata for vowels and was already used to get `f1` and `f2` values by using a praat-specific function (`praatformant_to_df`) on a parselmouth Formant object.

The `interpolate_measures` function takes a measurement dataframe, `meas_df`, whose time column is named by the `meas_ts` parameter. Notice that we pass only a subset of `gendf` as `meas_df`: non-numeric columns such as `foo` cannot be interpolated and would raise an error.

The `interp_df` parameter specifies the dataframe that holds the times at which measures from `meas_df` are interpolated. The `interp_ts` parameter names the column that contains those times.
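As a rough illustration of the idea only (this is not how `interpolate_measures` is implemented, which is not shown here), linear interpolation of a measurement track at a set of observation times can be sketched with `np.interp`. The toy dataframes and their values are made up for the example; only the column names `tcol`, `f1`, and `obs_t` echo the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical measurement track: one F1 value every 10 ms
toy_meas = pd.DataFrame({
    'tcol': [0.00, 0.01, 0.02, 0.03],
    'f1': [500.0, 520.0, 560.0, 540.0],
})

# Hypothetical observation times at which F1 values are wanted
toy_obs = pd.DataFrame({'obs_t': [0.005, 0.02, 0.025]})

# Linearly interpolate f1 at each observation time
toy_obs['f1'] = np.interp(toy_obs['obs_t'], toy_meas['tcol'], toy_meas['f1'])
```

An observation time that falls halfway between two measurements receives the mean of their values, and one that coincides with a measurement time receives that measurement unchanged.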

In [None]:
vdf = interpolate_measures(
    meas_df=gendf[['tcol', 'f1', 'f2']],  # tcol + cols to interpolate only
    meas_ts='tcol',
    interp_df=vdf,
    interp_ts='obs_t'
)
vdf

In [None]:
f0df = get_f0(au, fs=rate)
vdf = interpolate_measures(
    meas_df=f0df.drop(columns='voiced'),  # time column + cols to interpolate only
    meas_ts='sec',
    interp_df=vdf,
    interp_ts='obs_t', tol=0.1
)
vdf

## Applying custom functions to individual tokens

Now that we have a dataset of acoustic measures for our vowel tokens, we use the [split-apply-combine pattern](https://pandas.pydata.org/docs/user_guide/groupby.html) to apply custom functions to (1) characterize pitch dynamics; (2) plot the tokens.

First, create a token grouper that identifies the individual tokens. In our simple dataset the `t1_ph` column is sufficient. A more complicated dataset would `groupby` multiple columns. The grouper maps `t1_ph` times to the dataframe index, where multiple rows share the same index (in fact, we could also have done `groupby(level=0)` to make an equivalent grouping using the dataframe index instead of the `t1_ph` column). We also restrict the grouper to columns that will be of interest.
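The equivalence between grouping on the start-time column and grouping on the index can be checked on a small made-up dataframe. This is a sketch under the assumption that, as in `vdf`, all rows of a token share one index label; the values and index labels are invented.

```python
import pandas as pd

# Hypothetical exploded dataframe: two tokens, three observations each.
# Rows belonging to the same token share an index label.
toy = pd.DataFrame(
    {
        't1_ph': [0.10, 0.10, 0.10, 0.47, 0.47, 0.47],
        'obs_id': [0, 1, 2, 0, 1, 2],
        'f0': [110.0, 118.0, 115.0, 140.0, 133.0, 129.0],
    },
    index=[2, 2, 2, 5, 5, 5],
)

by_time = toy.groupby('t1_ph')     # group on the start-time column
by_index = toy.groupby(level=0)    # group on the (repeated) index labels

# Both groupers split the rows into the same two tokens of three rows
sizes_by_time = by_time.size().tolist()
sizes_by_index = by_index.size().tolist()
```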

In [None]:
tokens = vdf.groupby(['t1_ph'])[['f0', 'f1', 'f2', 't1_ph', 't2_ph', 'obs_t', 'obs_id']]
tokens.groups # maps `t1_ph` times to indexes

### Developing a function

We extract one of these groupings in order to conveniently develop a custom function that operates on a single group at a time. Here we take the group at index `5`, which has `t1_ph` time 0.470892. Our goal is to characterize whether F0 rises or falls over the first and second halves of the token.

A subset of the columns from the token is shown.

In [None]:
x = vdf.loc[5]
x[['t1_ph', 't2_ph', 'f0', 'obs_t', 'obs_id']]

Now that we have the group `x` to experiment with, we can work out the steps to categorize tokens by changes in F0 values over time.

We start by calling `diff` on the `f0` column, which calculates the change of each F0 value from the preceding value. Since there is no value preceding the first one, its `diff` is `NaN`.
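To make the `NaN` concrete, here is a small sketch on made-up F0 values (the numbers are illustrative and are not taken from `mono.wav`):

```python
import pandas as pd

# Hypothetical F0 values (Hz) at 0%, 50%, and 100% of one token
toy_f0 = pd.Series([110.0, 118.0, 115.0])

# Each element is the change from the preceding value;
# the first element has no predecessor, so it is NaN
toy_changes = toy_f0.diff()
```

Three observations yield only two meaningful changes, one for each half of the token.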

In [None]:
x['f0'].diff()

Create a slice `[1:]` starting from the second value to skip the `NaN`.

In [None]:
x['f0'].diff()[1:]

The `np.where` function is used to label positive changes (rises) as `r` and all other changes as `f`.

In [None]:
np.where(x['f0'].diff()[1:] > 0, 'r', 'f')

Finally, we concatenate these labels with the `join` function to create one of four possible F0 contours: `rr`, `rf`, `ff`, `fr`.
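The labeling and joining steps can be tried on made-up values first. In this sketch the helper name `contour` and all of the numbers are hypothetical. It also shows one consequence of testing for `> 0`: an interval with no change at all is labeled `f`.

```python
import numpy as np
import pandas as pd

def contour(values):
    """Hypothetical helper: label each step between successive values
    as 'r' (rise) or 'f' (anything else) and join the labels."""
    steps = pd.Series(values).diff()[1:]
    return ''.join(np.where(steps > 0, 'r', 'f'))

rise_fall = contour([110.0, 118.0, 115.0])   # up, then down
fall_fall = contour([140.0, 133.0, 129.0])   # down, then down
flat_rise = contour([120.0, 120.0, 125.0])   # no change, then up
```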

In [None]:
''.join(np.where(x['f0'].diff()[1:] > 0, 'r', 'f'))

### `apply` an unnamed lambda function

Since this calculation is simple, it's a good candidate for an unnamed lambda function that we can `apply` to each group. `apply` calls the lambda function on each group in succession, passing the group in as the argument `x`. The result is a `Series` of contour types, indexed by `t1_ph`. `rename` is called on the result to give the `Series` the name `f0type`.
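The shape of the result is easier to see on a toy example. The sketch below uses an invented two-token dataframe (not `vdf`) and selects only the `f0` column for the grouper:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two tokens with three F0 observations each
toy = pd.DataFrame({
    't1_ph': [0.10, 0.10, 0.10, 0.47, 0.47, 0.47],
    'f0': [110.0, 118.0, 115.0, 140.0, 133.0, 129.0],
})
toy_tokens = toy.groupby('t1_ph')[['f0']]

# One string per group; the group keys become the index of the result
toy_f0type = toy_tokens.apply(
    lambda x: ''.join(np.where(x['f0'].diff()[1:] > 0, 'r', 'f'))
).rename('f0type')
```

Because the lambda returns a single string for each group, the combined result is a `Series` with one row per token rather than one row per observation.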

In [None]:
tokens.apply(
    lambda x: ''.join(np.where(x['f0'].diff()[1:] > 0, 'r', 'f'))
).rename('f0type')

We can `merge` the result back into `vdf` based on the `t1_ph` values. The use of `rename` is important so that the new column has a name. The `merge` would fail without it.
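A minimal sketch of both cases, using invented data: pandas raises a `ValueError` when asked to merge a `Series` that has no name, whereas the renamed `Series` contributes a column of that name, broadcast to every row that shares the key.

```python
import pandas as pd

# Hypothetical per-observation dataframe and per-token labels
toy = pd.DataFrame({
    't1_ph': [0.10, 0.10, 0.47, 0.47],
    'obs_id': [0, 1, 0, 1],
})
toy_labels = pd.Series(
    ['rf', 'ff'], index=pd.Index([0.10, 0.47], name='t1_ph')
)

# An unnamed Series cannot be merged
try:
    toy.merge(toy_labels, on='t1_ph')
    unnamed_merge_ok = True
except ValueError:
    unnamed_merge_ok = False

# After renaming, the Series name becomes the new column name
toy_merged = toy.merge(toy_labels.rename('f0type'), on='t1_ph')
```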

In [None]:
vdf = vdf.merge(
    tokens.apply(
        lambda x: ''.join(np.where(x['f0'].diff()[1:] > 0, 'r', 'f'))
    ).rename('f0type'),
    on='t1_ph'
)
vdf

### `apply` a named function

We'll illustrate the use of `apply` with named functions to do the same contour calculation on F1 and F2. We could easily use lambda functions for these as well, but it's good to know how to create named functions. Named functions are especially useful when you want to apply more complicated operations to your dataframe groups than can easily be included in a single line of code. **Always** include a docstring in your function that describes its purpose.

The first parameter of the new `risefalltype` function is a dataframe `df`, which will be supplied automatically by `apply`. The second parameter names the column in `df` to be summarized.

In [None]:
def risefalltype(df, col):
    '''
    Return a joined string of 'r' and 'f' for each interval in a dataframe
    (pairs of successive rows) where the values of `col` are rising
    ('r') or falling ('f'). The number of 'r' and 'f' characters in the
    result is one less than the number of rows.
    '''
    return ''.join(np.where(df[col].diff()[1:] > 0, 'r', 'f'))

Summarize F1 and F2 using the named function and `merge` with `vdf`. The first argument to `apply` is the function to use (note that it is the function's symbol, like a variable name, not a string!). `apply` passes each group's dataframe to the function as its first argument, and keyword arguments such as `col` are forwarded to the function as well.
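The forwarding of keyword arguments can be checked against an explicit lambda wrapper. In this sketch `risefall_sketch` is a hypothetical stand-in for a per-group summary function, and the formant values are made up.

```python
import numpy as np
import pandas as pd

def risefall_sketch(df, col):
    """Hypothetical stand-in for a per-group summary function:
    label each step in `col` as 'r' (rise) or 'f' (anything else)."""
    return ''.join(np.where(df[col].diff()[1:] > 0, 'r', 'f'))

# Made-up formant values for two tokens of three observations each
toy = pd.DataFrame({
    't1_ph': [0.10, 0.10, 0.10, 0.47, 0.47, 0.47],
    'f1': [600.0, 650.0, 700.0, 500.0, 480.0, 490.0],
    'f2': [1500.0, 1450.0, 1400.0, 2000.0, 2100.0, 2050.0],
})
toy_tokens = toy.groupby('t1_ph')[['f1', 'f2']]

# Keyword arguments given to `apply` are forwarded to the function
via_kwarg = toy_tokens.apply(risefall_sketch, col='f2')
# This is equivalent to wrapping the call in a lambda
via_lambda = toy_tokens.apply(lambda g: risefall_sketch(g, col='f2'))
```

The keyword form keeps the call short when the same function is reused for several columns.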

In [None]:
vdf = vdf.merge(
    tokens.apply(risefalltype, col='f1').rename('f1type'),
    on='t1_ph'
)
vdf = vdf.merge(
    tokens.apply(risefalltype, col='f2').rename('f2type'),
    on='t1_ph'
)
vdf

## Plot by groups

The plots below show F0, F1, and F2 for each token: all measured values within the token, together with the interpolated values at 0%, 50%, and 100% of its duration.

In [None]:
fig, axes = plt.subplots(5, 1, figsize=[6, 8])
for i, (_, g) in enumerate(tokens):
    # Token start and end times are constant within a group
    t1 = g['t1_ph'].iloc[0]
    t2 = g['t2_ph'].iloc[0]
    for col in ('f0', 'f1', 'f2'):
        if col == 'f0':
            mdf = f0df.query(f'(sec >= {t1}) and (sec <= {t2})')
            tcol = 'sec'
        else:
            mdf = gendf.query(f'(tcol >= {t1}) and (tcol <= {t2})')
            tcol = 'tcol'
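        axes[i].plot(mdf[tcol], mdf[col])      # measured track within the token
        axes[i].scatter(mdf[tcol], mdf[col])   # individual measurements
        axes[i].scatter(g['obs_t'], g[col])    # interpolated values at obs_t
    axes[i].set_xlim([t1-0.010, t2+0.010])     # 10 ms margin on each side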