# Analyzing the DIRNDL data set
---

This notebook contains some basic operations for analyzing the portion of the DIRNDL corpus that is used to create PromDetect. The results of these operations can also be found in section [X] of the thesis. Because the data is not publicly available, the operations cannot be fully reproduced, but this document should still give some insight into how the numbers in that section were obtained.

In [2]:
from promdetect.prep import prepare_data
import numpy as np
import pandas as pd
from glob import glob
import re

In [3]:
# Get the paths of all recording files and extract the IDs from their filenames
corpusDir = "/home/lukas/Dokumente/Uni/ma_thesis/quelldaten/DIRNDL-prosody"
recordingPaths = glob(corpusDir + "/*.wav")
# The ID is whatever sits between the "dlf-nachrichten-" prefix and the ".wav" extension
recordingIds = [re.search(r"dlf-nachrichten-(.*)\.wav$", rPath).group(1) for rPath in recordingPaths]
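
Since the corpus itself is not public, the ID extraction above can be illustrated on made-up paths. This is a minimal sketch: the example paths, the helper `extract_recording_id` and the constant `ID_PATTERN` are hypothetical and not part of PromDetect. Only the `dlf-nachrichten-<id>.wav` naming scheme is taken from the cell above.

```python
import re
from pathlib import PurePosixPath

# Hypothetical example paths; the IDs are invented for illustration only.
example_paths = [
    "/data/DIRNDL-prosody/dlf-nachrichten-200703250000.wav",
    "/data/DIRNDL-prosody/dlf-nachrichten-200703251100.wav",
]

# Naming scheme assumed from the cell above: dlf-nachrichten-<id>.wav
ID_PATTERN = re.compile(r"dlf-nachrichten-(.+)\.wav")


def extract_recording_id(path):
    """Return the ID part of a recording filename, or None if it does not match."""
    match = ID_PATTERN.fullmatch(PurePosixPath(path).name)
    return match.group(1) if match else None


recording_ids = [extract_recording_id(p) for p in example_paths]
print(recording_ids)
```

Matching against the bare filename instead of the full path means a file that does not follow the naming scheme yields `None` rather than an exception.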

In [4]:
# Run the primary data preparation function on the annotations for all recordings
# and collect the relevant parts in separate DataFrames
accentFrames = []
toneFrames = []
transcriptFrames = []

for recording in recordingIds:
    currentData = prepare_data.DataPreparation(corpusDir, recording)
    currentData.transform_annotations()

    # Keep only the relevant columns and tag every row with its recording ID
    currentAccents = pd.DataFrame(currentData.accents[["time", "label"]])
    currentAccents["recording"] = recording
    accentFrames.append(currentAccents)

    currentTones = pd.DataFrame(currentData.tones[["time", "label"]])
    currentTones["recording"] = recording
    toneFrames.append(currentTones)

    currentTranscript = pd.DataFrame(currentData.transcript[["start", "end", "label"]])
    currentTranscript["recording"] = recording
    transcriptFrames.append(currentTranscript)

# DataFrame.append was removed in pandas 2.0, so concatenate once after the loop
accents = pd.concat(accentFrames)
tones = pd.concat(toneFrames)
transcript = pd.concat(transcriptFrames)
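
The shape of the combined tables can be checked without access to the corpus. The sketch below uses invented toy values in place of the real annotations: the recording names, times and labels are made up, and only the column layout (`time`, `label`, `recording`) follows the cell above.

```python
import pandas as pd

# Toy stand-ins for the per-recording accent tables; all values are invented.
per_recording = {
    "rec_a": pd.DataFrame({"time": [0.5, 1.2], "label": ["L*H", "H*L"]}),
    "rec_b": pd.DataFrame({"time": [0.8], "label": ["H*"]}),
}

frames = []
for recording, frame in per_recording.items():
    frame = frame.copy()              # leave the toy inputs untouched
    frame["recording"] = recording    # tag every row with its recording ID
    frames.append(frame)

# One table with one row per annotation across all recordings
combined = pd.concat(frames, ignore_index=True)

# Number of annotations contributed by each recording
rows_per_recording = combined.groupby("recording").size()
print(rows_per_recording)
```

A per-recording row count like this is a quick way to spot recordings whose annotations failed to load or came back empty.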