# Analyzing the DIRNDL data set
---

This notebook runs some basic analyses on the portion of the DIRNDL corpus that is used to create PromDetect. The results also appear in section [X] of the thesis. Because the data is not publicly available, the analyses cannot be fully reproduced, but the notebook should still show how the numbers in that section were obtained.

In [None]:
from promdetect.prep import prepare_data
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from glob import glob
import re


In [None]:
# Get the paths of all recording files and extract the IDs from their filenames
corpusDir = "/home/lukas/Dokumente/Uni/ma_thesis/quelldaten/DIRNDL-prosody"
recordingPaths = glob(pathname = corpusDir + "/*.wav")
recordingIds = [re.split(r".*/dlf-nachrichten-(.*)\.wav", rPath)[1] for rPath in recordingPaths]
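
As a small illustration of the ID extraction above, the following sketch applies the same file naming scheme to two made-up paths. The directory `/data/DIRNDL-prosody`, the file `notes.wav` and the helper name `extract_recording_id` are hypothetical and only serve as an example. The sketch uses `re.fullmatch` on the file name instead of `re.split` on the whole path, so a file that does not follow the scheme yields `None` rather than an `IndexError`.

```python
import re
from pathlib import PurePosixPath

# Same naming scheme as in the cell above: dlf-nachrichten-<ID>.wav
ID_PATTERN = re.compile(r"dlf-nachrichten-(.*)\.wav")


def extract_recording_id(path):
    """Return the recording ID encoded in a file name, or None if it does not match."""
    match = ID_PATTERN.fullmatch(PurePosixPath(path).name)
    return match.group(1) if match else None


# Hypothetical example paths, not taken from the actual corpus directory
example_paths = [
    "/data/DIRNDL-prosody/dlf-nachrichten-200703251200.wav",
    "/data/DIRNDL-prosody/notes.wav",
]
example_ids = [extract_recording_id(p) for p in example_paths]
print(example_ids)
```

Filtering out the `None` entries before the preparation loop would be one way to skip unrelated files in the corpus directory.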

In [None]:
# Run the primary data preparation function on the annotations for all recordings
# and collect the relevant parts in separate DataFrames
accentFrames, toneFrames, transcriptFrames = [], [], []

for recording in recordingIds:
    currentData = prepare_data.DataPreparation(corpusDir, recording)
    currentData.transform_annotations()

    currentAccents = pd.DataFrame(currentData.accents[["time", "label"]])
    currentAccents["recording"] = recording
    accentFrames.append(currentAccents)

    currentTones = pd.DataFrame(currentData.tones[["time", "label"]])
    currentTones["recording"] = recording
    toneFrames.append(currentTones)

    currentTranscript = pd.DataFrame(currentData.transcript[["start", "end", "label"]])
    currentTranscript["recording"] = recording
    transcriptFrames.append(currentTranscript)

# DataFrame.append no longer exists in pandas 2.0+, so concatenate the per-recording frames once
accents = pd.concat(accentFrames)
tones = pd.concat(toneFrames)
transcript = pd.concat(transcriptFrames)

---
### Analyze accents

In [112]:
accents = accents.loc[~accents["label"].isna()] # drop NA-labelled rows
print("Number of accent labels: {}".format(len(accents)))

Number of accent labels: 19631


In [92]:
# Show frequency counts for each label
accents["label"].value_counts()

L*H      7819
H*L      6120
!H*L     2126
H*       2055
..L       634
L*        461
!H*       263
L*HL       53
..H        25
H*L?       18
*?         17
L*H?       13
H*?         6
HH*L        6
!H*L?       3
L*?         2
LH*L        2
.L          2
!H          1
H!          1
L*!H        1
H*l         1
L*HL?       1
H*M?        1
Name: label, dtype: int64
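
To put the counts into perspective, the shares of the most frequent accent labels can be computed from the numbers printed above. Since the corpus itself is not available, the sketch below re-enters the four largest counts and the total of 19631 by hand; the variable names are chosen for this example only. With the real data, `accents["label"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts of the four most frequent accent labels, copied from the output above
top_counts = pd.Series({"L*H": 7819, "H*L": 6120, "!H*L": 2126, "H*": 2055})
total_labels = 19631  # total number of accent labels reported above

# Relative frequency of each label, in percent
top_shares = (top_counts / total_labels * 100).round(1)
print(top_shares)

# Combined share of the four labels, in percent
combined_share = round(float(top_counts.sum()) / total_labels * 100, 1)
print(combined_share)
```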

---
### Analyze tones

In [111]:
tones = tones.loc[~tones["label"].isna()] # drop NA-labelled rows
print("Number of boundary tone labels: {}".format(len(tones)))

Number of boundary tone labels: 9216


In [94]:
# Show frequency counts for each label
tones["label"].value_counts()

-      4823
%      4027
H%      173
L%      145
%H       40
-?        7
H?%       1
Name: label, dtype: int64
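
Some labels in both lists contain a question mark. Assuming that `?` marks an uncertain annotation (this is an assumption, the notebook does not say so), such labels can be counted with a literal string match. The sketch re-enters the tone counts from the output above by hand; with the real data, `tones["label"].str.contains("?", regex=False).sum()` would be the direct equivalent.

```python
import pandas as pd

# Tone label counts, copied from the output above
tone_counts = pd.Series(
    {"-": 4823, "%": 4027, "H%": 173, "L%": 145, "%H": 40, "-?": 7, "H?%": 1}
)

# regex=False treats "?" as a literal character instead of a regex quantifier
has_question_mark = tone_counts.index.str.contains("?", regex=False)

flagged_total = int(tone_counts[has_question_mark].sum())
all_total = int(tone_counts.sum())
print("{} of {} tone labels contain a question mark".format(flagged_total, all_total))
```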

---
### Analyze transcripts

In [None]:
# Drop breathing sounds, paragraph markers and empty labels
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
transcript = transcript.loc[~transcript["label"].isin(["[@]", "[t]", "[n]", "[f]", "[h]", "<P>", np.nan])].copy()

In [100]:
# Get the duration of each word by using its end and start timestamps
transcript["dur"] = transcript["end"] - transcript["start"]
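
A minimal sketch of how the per-word durations could be aggregated per recording, using a tiny made-up transcript table. The words, timestamps and recording names are invented; only the column names follow the cells above.

```python
import pandas as pd

# Made-up word annotations; only the column names mirror the real transcript table
toy_transcript = pd.DataFrame({
    "start": [0.00, 0.40, 1.10, 0.00, 0.55],
    "end": [0.40, 1.10, 1.50, 0.55, 1.25],
    "label": ["guten", "Morgen", "<P>", "die", "Nachrichten"],
    "recording": ["rec_a", "rec_a", "rec_a", "rec_b", "rec_b"],
})

# Drop the non-word label, then compute durations in the unit of the timestamps
toy_words = toy_transcript.loc[toy_transcript["label"] != "<P>"].copy()
toy_words["dur"] = toy_words["end"] - toy_words["start"]

# Total and mean word duration per recording
dur_per_recording = toy_words.groupby("recording")["dur"].agg(["sum", "mean"])
print(dur_per_recording)
```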

In [None]:
# Get the number of word annotations
len(transcript)

In [109]:
# Get miscellaneous statistics about transcript lengths (word annotations per recording)
wordsPerRecording = transcript["recording"].value_counts()
print("Longest transcripts:\n{}\n".format(wordsPerRecording.head(3)))
print("Shortest transcripts:\n{}\n".format(wordsPerRecording.tail(3)))
print("Median length: {}".format(np.median(wordsPerRecording)))
print("Mean length: {}".format(np.mean(wordsPerRecording)))

Longest transcripts:
200703251200    1071
200703262000     995
200703270600     987
Name: recording, dtype: int64

Shortest transcripts:
200703251100    472
200703271100    461
200703271500    457
Name: recording, dtype: int64

Median length: 536.0
Mean length: 642.6727272727272
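
The mean reported above is clearly larger than the median, which is what one would expect when a few recordings are much longer than the rest. The following sketch illustrates that effect with invented transcript lengths; the numbers are not taken from the corpus and only show how a handful of long recordings pulls the mean upwards.

```python
import pandas as pd

# Invented transcript lengths (words per recording): most are short, two are long
toy_lengths = pd.Series([460, 480, 500, 520, 540, 990, 1070])

median_length = toy_lengths.median()
mean_length = toy_lengths.mean()
print("Median length: {}".format(median_length))
print("Mean length: {}".format(mean_length))

# describe() adds quartiles, minimum and maximum in one call
print(toy_lengths.describe())
```

With the real data, calling `describe()` on `transcript["recording"].value_counts()` should give the same kind of summary for the actual transcript lengths.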
