# Analyzing the DIRNDL data set
---

This notebook runs a few basic analyses on the portion of the DIRNDL corpus that is used to create PromDetect. The results also appear in section [X] of the thesis. Because the data is not public, the analyses cannot be fully reproduced, but the notebook still documents how the numbers in that section were derived.

In [1]:
from promdetect.prep import prepare_data
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from glob import glob
import re


In [5]:
rootDir = "/home/lukas/Dokumente/Uni/ma_thesis/"

In [24]:
# Get the paths of all recording files and extract the IDs from their filenames
corpusDir = rootDir + "quelldaten/DIRNDL-prosody"
recordingPaths = glob(pathname = corpusDir + "/*.wav")
recordingIds = [re.split(r".*/dlf-nachrichten-(.*)\.wav", rPath)[1] for rPath in recordingPaths]
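
The ID extraction above works because `re.split` returns the captured group as the second element of its result. A slightly more explicit variant is sketched below. The helper name `recording_id` and the example paths are made up for illustration and are not part of PromDetect. It matches only the file name and returns `None` for files that do not follow the `dlf-nachrichten-<ID>.wav` scheme.

```python
import re
from pathlib import Path

# Hypothetical helper for illustration; not part of promdetect
ID_PATTERN = re.compile(r"dlf-nachrichten-(.+)\.wav")

def recording_id(path):
    """Return the recording ID encoded in a file name, or None if it does not match."""
    match = ID_PATTERN.fullmatch(Path(path).name)
    return match.group(1) if match else None

# Made-up example paths
paths = [
    "/corpus/DIRNDL-prosody/dlf-nachrichten-200703251200.wav",
    "/corpus/DIRNDL-prosody/readme.wav",
]
ids = [recording_id(p) for p in paths]
print(ids)
```

Filtering out the `None` entries afterwards would keep unrelated `.wav` files from raising an `IndexError`, which is what the `re.split(...)[1]` version does when the pattern does not match.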

In [25]:
# Run the primary data preparation function on the annotations for all recordings and collect the relevant parts in one DataFrame per annotation type
accentFrames, toneFrames, transcriptFrames = [], [], []

for recording in recordingIds:
    currentData = prepare_data.DataPreparation(corpusDir, recording)
    currentData.transform_annotations()

    # Keep only the relevant columns and tag every row with its recording ID
    accentFrames.append(pd.DataFrame(currentData.accents[["time", "label"]]).assign(recording = recording))
    toneFrames.append(pd.DataFrame(currentData.tones[["time", "label"]]).assign(recording = recording))
    transcriptFrames.append(pd.DataFrame(currentData.transcript[["start", "end", "label"]]).assign(recording = recording))

# DataFrame.append was removed in pandas 2.0, so concatenate once after the loop
accents = pd.concat(accentFrames)
tones = pd.concat(toneFrames)
transcript = pd.concat(transcriptFrames)

---
### Analyze accents

In [None]:
accents = accents.dropna(subset = ["label"]) # drop NA-labelled rows
print("Number of accent labels: {}".format(len(accents)))

In [None]:
# Show frequency counts for each label
accents["label"].value_counts()

---
### Analyze tones

In [None]:
tones = tones.dropna(subset = ["label"]) # drop NA-labelled rows
print("Number of boundary tone labels: {}".format(len(tones)))

In [None]:
# Show frequency counts for each label
tones["label"].value_counts()

---
### Analyze transcripts

In [26]:
# Drop breathing sounds, paragraph markers and empty labels
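# .copy() gives an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
transcript = transcript.loc[~transcript["label"].isin(["[@]", "[t]", "[n]", "[f]", "[h]", "<P>", np.nan])].copy()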
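
The filter above relies on pandas' `isin` treating `np.nan` as a match for missing values. In plain Python, NaN never compares equal to itself, so missing labels need an explicit check. The following is a minimal sketch of the same filtering logic without pandas. The helper `is_word` and the example tokens are made up for illustration; only the set of excluded labels is taken from the cell above.

```python
import math

# Labels excluded in the cell above
NON_WORD_LABELS = {"[@]", "[t]", "[n]", "[f]", "[h]", "<P>"}

def is_word(label):
    """Return True for real word labels, False for non-word markers and NaN."""
    if isinstance(label, float) and math.isnan(label):
        return False
    return label not in NON_WORD_LABELS

# Made-up label sequence for illustration
labels = ["Guten", "[h]", "Abend", float("nan"), "<P>"]
words = [label for label in labels if is_word(label)]
print(words)
```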

In [27]:
# Get the duration of each word by using its end and start timestamps
transcript["dur"] = transcript["end"] - transcript["start"]

In [28]:
# Get amount of word annotations
print("Amount of word annotation labels: {}".format(len(transcript)))

Amount of word annotation labels: 35347


In [33]:
# Get miscellaneous statistics about transcripts
print("Longest transcripts:\n{}\n".format(transcript["recording"].value_counts().head(3)))
print("Shortest transcripts:\n{}\n".format(transcript["recording"].value_counts().tail(3)))
print("Median length: {}".format(np.median(transcript["recording"].value_counts())))
print("Mean length: {}".format(np.mean(transcript["recording"].value_counts())))

Longest transcripts:
200703251200    1071
200703262000     995
200703270600     987
Name: recording, dtype: int64

Shortest transcripts:
200703251100    472
200703271100    461
200703271500    457
Name: recording, dtype: int64

Median length: 536.0
Mean length: 642.6727272727272


---
### Analyze nuclei

In [18]:
nucleiFiles = glob(rootDir + "promdetect/data/dirndl/nuclei/*")
nucleiData = pd.DataFrame(columns = ["start", "end", "timestamp_auto", "phone_label", "word_label", "origin_file"])

In [19]:
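nucleiFrames = []
for nucleiFile in nucleiFiles:
    currentData = pd.read_csv(nucleiFile, delimiter = ",")
    # Rebuild the file name: first run of word characters that is followed by a dot, plus ".nuclei"
    currentData["origin_file"] = re.search(r"[ \w-]+?(?=\.)", nucleiFile)[0] + ".nuclei"
    nucleiFrames.append(currentData)

# DataFrame.append was removed in pandas 2.0, so concatenate once after the loop (replaces the empty placeholder frame)
nucleiData = pd.concat(nucleiFrames)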
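
The regular expression in the loop is applied to the full path, so it returns the first run of word characters that is followed by a dot. That is the file name only as long as no directory in the path contains a dot. The sketch below compares it with `pathlib.Path.name` on two made-up paths. The function names are hypothetical, and it assumes that the files carry the `.nuclei` extension, as the output further down suggests.

```python
import re
from pathlib import Path

def origin_by_regex(path):
    # Same pattern as in the loop above
    return re.search(r"[ \w-]+?(?=\.)", path)[0] + ".nuclei"

def origin_by_name(path):
    # Takes the last path component, regardless of dots in directory names
    return Path(path).name

# Made-up paths for illustration
plain = "/data/nuclei/dlf-nachrichten-200703251200.nuclei"
dotted = "/data/v1.2/nuclei/dlf-nachrichten-200703251200.nuclei"

for path in (plain, dotted):
    print(origin_by_regex(path), origin_by_name(path))
```

For the first path both functions agree. For the second, the regular expression stops at the dotted directory name, while `Path.name` still returns the file name.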

In [23]:
# Get miscellaneous statistics about nuclei
print("Most nuclei per recording:\n{}\n".format(nucleiData["origin_file"].value_counts().head(3)))
print("Least nuclei per recording:\n{}\n".format(nucleiData["origin_file"].value_counts().tail(3)))
print("Median nuclei per transcript: {}".format(np.median(nucleiData["origin_file"].value_counts())))
print("Mean nuclei per transcript: {}".format(np.mean(nucleiData["origin_file"].value_counts())))

Most nuclei per recording:
dlf-nachrichten-200703251200.nuclei    2084
dlf-nachrichten-200703270600.nuclei    2024
dlf-nachrichten-200703260600.nuclei    1957
Name: origin_file, dtype: int64

Least nuclei per recording:
dlf-nachrichten-200703260900.nuclei    898
dlf-nachrichten-200703271500.nuclei    884
dlf-nachrichten-200703261900.nuclei    879
Name: origin_file, dtype: int64

Median nuclei per transcript: 1032.0
Mean nuclei per transcript: 1238.8363636363636
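
The printed figures can be cross-checked with a little arithmetic. The sketch below uses only numbers shown in the outputs above. It assumes that the nuclei files cover the same recordings as the transcripts and that every recording contributes at least one row. Under those assumptions, dividing the word total by the mean transcript length gives the number of recordings, and multiplying the mean nuclei count by that number gives the total number of nuclei.

```python
# Figures copied from the outputs above
total_words = 35347
mean_words_per_recording = 642.6727272727272
mean_nuclei_per_recording = 1238.8363636363636

# Number of recordings implied by total and mean
n_recordings = round(total_words / mean_words_per_recording)

# Total nuclei, assuming the nuclei files cover the same recordings
total_nuclei = round(mean_nuclei_per_recording * n_recordings)

# Average number of nuclei per word annotation
nuclei_per_word = total_nuclei / total_words

print("Recordings: {}".format(n_recordings))
print("Total nuclei: {}".format(total_nuclei))
print("Nuclei per word: {:.2f}".format(nuclei_per_word))
```

Both means differ from an integer multiple of 1/55 only by floating-point rounding, which is consistent with 55 recordings in both tables.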
