# About 

This notebook extracts several audio features from audio files 
(including the GeMAPSv01b feature set). It is intended as a proof of concept 
for how these features can be extracted. It also examines the 
results generated by the refactored data pipeline.


NOTE: From the original paper, we need the following features for our data vectors for the LSTM model:
1. Voice Activity
2. Pitch 
3. Spectral Stability 
4. Parts of Speech 

Additional feature investigation uses the following features:
1. Acoustic features:
    - eGeMAPS
2. Linguistic Features:
    - Parts of Speech 
    - Word tags. 
3. Phonetic Features 
    - Senone bottleneck features. 
4. Voice Activity 

**UPDATE 5/17/22**

This notebook was used to develop the initial feature-extraction methods for the Skantze 2017 paper. For the notebook containing the final pipeline, see `skantze_pipeline.ipynb`.

## Setup 

In [1]:
import os
import sys
import time
import pandas as pd
from sklearn import preprocessing 
import numpy as np
import scipy.io as io
from glob import glob 


%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


In [2]:
# ---- Paths
# Dir paths.
PROJECT_ROOT_PATH = "."
DATA_PATH = os.path.join(PROJECT_ROOT_PATH,"data")
MAPTASK_PATH = os.path.join(DATA_PATH,"maptaskv2-1")
RESULTS_PATH = os.path.join(PROJECT_ROOT_PATH,"results")
STEREO_AUDIO_PATH = os.path.join(MAPTASK_PATH,"Data/signals/dialogues")
MONO_AUDIO_PATH = os.path.join(MAPTASK_PATH,"Data/signals/mono_dialogues")
# NOTE: The timed units are also used for Voice Activity annotations. 
TIMED_UNIT_PATHS = os.path.join(MAPTASK_PATH,"Data/timed-units") 
POS_PATH = os.path.join(MAPTASK_PATH,"Data/pos")
GEMAPS_DIR = os.path.join(DATA_PATH,"gemaps")

We use very specific data files since the goal for this notebook is to investigate 
how different features of the pipeline are working and what values they produce. 

In [3]:
DIALOGUE_NAMES_SPLIT = ["q1ec1"]
PARTICIPANT_LABELS_MAPTASK = ["f","g"] # NOTE: f = follower ; g = giver. 

These are some utility functions and their tests

In [4]:
def get_timed_unit(dialogue_name,participant):
    timed_unit_paths = [os.path.join(TIMED_UNIT_PATHS,p) for p in os.listdir(TIMED_UNIT_PATHS)]
    for path in timed_unit_paths:
        basename = os.path.basename(path)
        timed_participant = basename[basename.find(".")+1:basename.find(".")+2]
        if basename[:basename.find(".")] == dialogue_name and\
                timed_participant == participant:
            return path 

In [5]:
def get_mono_audio(dialogue_name, participant):
    audio_paths = [os.path.join(MONO_AUDIO_PATH,p) for p in os.listdir(MONO_AUDIO_PATH)]
    for path in audio_paths:
        basename = os.path.basename(path)
        audio_participant = basename[basename.find(".")+1:basename.find(".")+2]
        if basename[:basename.find(".")] == dialogue_name and\
                audio_participant == participant:
            return path 

In [6]:
def get_stereo_audio(dialogue_name):
    audio_paths = [os.path.join(STEREO_AUDIO_PATH,p) for p in os.listdir(STEREO_AUDIO_PATH)]
    for path in audio_paths:
        basename = os.path.basename(path)
        if basename[:basename.find(".")] == dialogue_name:
            return path 


In [7]:
def read_data(dir_path,dialogue_name, participant,ext):
    """
    Assumes that the part of the basename before the first "." is the dialogue
    name and that the character after that "." is the participant label. 
    """
    results = []
    data_paths = [p for p in os.listdir(dir_path)]
    data_paths = [os.path.join(dir_path,p) for p in data_paths if os.path.splitext(p)[1][1:] == ext]
    for path in data_paths:
        basename = os.path.basename(path)
        audio_participant = basename[basename.find(".")+1:basename.find(".")+2]
        if basename[:basename.find(".")] == dialogue_name and\
                audio_participant == participant:
            results.append(path)
    return results 
            


In [8]:
get_timed_unit(DIALOGUE_NAMES_SPLIT[0],"g")

'./data/maptaskv2-1/Data/timed-units/q1ec1.g.timed-units.xml'

In [9]:
get_mono_audio(DIALOGUE_NAMES_SPLIT[0],"f")

'./data/maptaskv2-1/Data/signals/mono_dialogues/q1ec1.f.wav'

In [10]:
get_stereo_audio(DIALOGUE_NAMES_SPLIT[0])

'./data/maptaskv2-1/Data/signals/dialogues/q1ec1.mix.wav'

In [11]:
read_data(GEMAPS_DIR,DIALOGUE_NAMES_SPLIT[0], "f","csv")

['./data/gemaps/q1ec1.f.csv']
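
The helpers above all rely on the same file-name convention, `<dialogue>.<participant>.<rest>`. As a minimal sketch of that convention, assuming every file name has at least two dots, the fields can also be recovered with `str.split`. The function name `split_maptask_basename` is hypothetical and is not used elsewhere in this notebook. Unlike the helpers above, it returns the whole second field rather than only its first character.

```python
import os

def split_maptask_basename(path):
    """Return (dialogue_name, participant_field) from '<dialogue>.<participant>.<rest>'.

    Hypothetical helper for illustration; raises ValueError when the
    basename contains fewer than two dots.
    """
    basename = os.path.basename(path)
    # Split on the first two dots only, so the remainder keeps any later dots.
    dialogue_name, participant_field, _rest = basename.split(".", 2)
    return dialogue_name, participant_field

print(split_maptask_basename(
    "./data/maptaskv2-1/Data/timed-units/q1ec1.g.timed-units.xml"))
print(split_maptask_basename("./data/gemaps/q1ec1.f.csv"))
```

For the stereo files, the second field would be `mix` rather than a participant label, which is why `get_stereo_audio` matches on the dialogue name only.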

## Feature Extraction - Skantze 2017 

The goal in this section is to extract all the features that are required for the original LSTM model. 

These features include:
1. Voice Activity --> From dataset annotations 
2. Pitch --> From opensmile
3. Spectral Stability --> Not sure 
4. Parts of Speech --> Annotations supplied with the data 

In [None]:
FRAME_STEP_MS = 50 # In the original paper, features were extracted every 50 ms. 
FRAME_SIZE_MS = 50 # In the original paper, each frame was 50 ms long.


### Voice Activity

The voice activity annotations can be directly extracted from the MapTask corpus timed-units, which have tu tags for when there was an utterance. 
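
As a rough, self-contained illustration of this idea, the sketch below parses a toy XML string (not the real corpus file, whose full schema is not reproduced here), keeps utterances of at least 25 ms, and marks the nearest 50 ms frames as voiced. The names `intervals_to_frame_labels` and `toy_xml` are made up for this example.

```python
import xml.etree.ElementTree as ET
import numpy as np

FRAME_STEP_MS = 50
MIN_UTTERANCE_MS = 25

# Toy stand-in for a timed-units file: two utterances, the second only 10 ms long.
toy_xml = (
    '<units>'
    '<tu id="u1" start="0.10" end="0.22">right</tu>'
    '<tu id="u2" start="0.30" end="0.31">uh</tu>'
    '</units>'
)

def intervals_to_frame_labels(intervals_s, audio_end_ms, frame_step_ms=FRAME_STEP_MS):
    """Label each frame 1 if it lies between the frames nearest to an interval's ends."""
    frame_times_s = np.arange(0, audio_end_ms, frame_step_ms) / 1000
    labels = np.zeros(frame_times_s.shape[0], dtype=int)
    for start_s, end_s in intervals_s:
        start_idx = np.abs(frame_times_s - start_s).argmin()
        end_idx = np.abs(frame_times_s - end_s).argmin()
        labels[start_idx:end_idx + 1] = 1
    return frame_times_s, labels

root = ET.fromstring(toy_xml)
intervals = []
for tu in root.findall("tu"):
    start_s, end_s = float(tu.get("start")), float(tu.get("end"))
    # Drop utterances shorter than the minimum duration.
    if end_s - start_s >= MIN_UTTERANCE_MS / 1000:
        intervals.append((start_s, end_s))

frame_times_s, labels = intervals_to_frame_labels(intervals, audio_end_ms=400)
print(labels.tolist())  # [0, 0, 1, 1, 1, 0, 0, 0]
```

Because the start and end times are snapped to the nearest frame, an utterance can be extended or shortened by up to half a frame step at either end.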

In [None]:
import xml.etree.ElementTree

In [None]:
# Minimum utterance duration for it to be considered voice activity. 
MINIMUM_VA_CLASSIFICATION_TIME_MS = 25 
VOICE_ACTIVITY_LABEL = 1 # This means that voice activity was detected. 


In [None]:

# NOTE: This voice activity is for 50ms intervals. 
def get_voice_activity_annotations(dialogue_name, participant):
    timed_unit_path = get_timed_unit(dialogue_name,participant)
    # Read the xml file 
    tree = xml.etree.ElementTree.parse(timed_unit_path).getroot()
    # Extracting the audio end time from the timed-units file. 
    audio_end_time_ms = float(list(tree.iter())[-1].get('end')) *1000
    tu_tags = tree.findall('tu')
    # Getting all the times in which there are voice activity annotations in the corpus. 
    va_times = []
    for tu_tag in tu_tags:
        start_time_s = float(tu_tag.get('start'))
        end_time_s = float(tu_tag.get('end'))
        if end_time_s - start_time_s >= MINIMUM_VA_CLASSIFICATION_TIME_MS/1000:
            va_times.append((start_time_s,end_time_s))
    # Get the frame times based on the final timed-unit end time. 
    # NOTE: This is being generated based on the step size for now. 
    frame_times_s = np.arange(0,audio_end_time_ms,FRAME_STEP_MS) / 1000
    # Array to store voice  activity - initially all zeros means no voice activity. 
    voice_activity = np.zeros((frame_times_s.shape[0]))
    # For each activity detected, get the start and end index of the nearest frame being 
    # considered from the input audio. 
    for start_time_s, end_time_s in va_times:
        # Obtaining index relative to the frameTimes being considered for 
        # which there is voice activity. 
        start_idx = np.abs(frame_times_s-start_time_s).argmin()
        end_idx = np.abs(frame_times_s-end_time_s).argmin()
        voice_activity[start_idx:end_idx+1] = VOICE_ACTIVITY_LABEL
    return pd.DataFrame({
        "frameTime" : frame_times_s,
        "voiceActivity" :  voice_activity
    })

In [None]:
# Getting VA annotations 
voice_activity = get_voice_activity_annotations(
    DIALOGUE_NAMES_SPLIT[0],PARTICIPANT_LABELS_MAPTASK[0])


In [None]:
voice_activity.shape

In [None]:
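# Count unvoiced and voiced frames using only the voiceActivity column, so that 
# frameTime values equal to 0 or 1 are not counted by mistake. 
(voice_activity["voiceActivity"] == 0).sum(), (voice_activity["voiceActivity"] == VOICE_ACTIVITY_LABEL).sum()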

In [None]:
# NOTE: Percentage of frames with voice activity (voiced frames / all frames). 
(voice_activity["voiceActivity"] == VOICE_ACTIVITY_LABEL).mean() * 100 

In [None]:
# Save the voice activity for both f and g participants. 
voice_activity_f_df = get_voice_activity_annotations(
    DIALOGUE_NAMES_SPLIT[0],"f")
voice_activity_g_df = get_voice_activity_annotations(
    DIALOGUE_NAMES_SPLIT[0],"g")
voice_activity_f_df.to_csv("{}/{}.f.voice_activity.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))
voice_activity_g_df.to_csv("{}/{}.g.voice_activity.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))

### Pitch 

We use the [eGeMAPS](https://sail.usc.edu/publications/files/eyben-preprinttaffc-2015.pdf) feature set, whose frequency-related features include pitch. 

The eGeMAPS features, which can be extracted using [openSMILE](https://audeering.github.io/opensmile/get-started.html#default-feature-sets), are defined [here](https://link.springer.com/content/pdf/bbm%3A978-3-319-27299-3%2F1.pdf). They are grouped into the following categories:


In [None]:
# NOTE: These categories are defined in the original paper. 
# NOTE: OpenSmile extracts other features (e.g. MFCCs) that were not defined 
# in the original paper. They are not used here but might be useful later. 
# The names have been adapted for use with the results produced by opensmile. 


GEMAPS_FREQUENCY_FEATURES = [
    'F0semitoneFrom27.5Hz_sma3nz', # Pitch: logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0)
    "jitterLocal_sma3nz", # Jitter, deviations in individual consecutive F0 period lengths.
     # Formant 1, 2, and 3 frequency, centre frequency of first, second, and third formant
    "F1frequency_sma3nz",
    "F2frequency_sma3nz", 
    "F3frequency_sma3nz", 
    "F1bandwidth_sma3nz"
] 
    
GEMAPS_ENERGY_FEATURES = [
    "shimmerLocaldB_sma3nz", # Shimmer, difference of the peak amplitudes of consecutive F0 periods.
    "Loudness_sma3", # Loudness, estimate of perceived signal intensity from an auditory spectrum.
    "HNRdBACF_sma3nz" # Harmonics-to-Noise Ratio (HNR), relation of energy in harmonic components to energy in noiselike components.
]

GEMAPS_SPECTRAL_FEATURES = [
    "alphaRatio_sma3", #  Alpha Ratio, ratio of the summed energy from 50–1000 Hz and 1–5 kHz
    "hammarbergIndex_sma3",  # Hammarberg Index, ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region
    # Spectral Slope 0–500 Hz and 500–1500 Hz, linear regression slope of the logarithmic power spectrum within the two given bands
    "slope0-500_sma3", 
    "slope500-1500_sma3", 
    # Formant 1, 2, and 3 relative energy, as well as the ratio of the energy of the spectral harmonic
    # peak at the first, second, third formant’s centre frequency to the energy of the spectral peak at F0.
    "F1amplitudeLogRelF0_sma3nz", 
    "F2amplitudeLogRelF0_sma3nz", 
    "F3amplitudeLogRelF0_sma3nz", 
    "logRelF0-H1-H2_sma3nz", # Harmonic difference H1–H2, ratio of energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2)
    "logRelF0-H1-A3_sma3nz" # Harmonic difference H1–A3, ratio of energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3).
]

# These are all the GeMAPS features we are interested in. 
RELEVANT_GEMAP_FEATURES = GEMAPS_FREQUENCY_FEATURES + GEMAPS_ENERGY_FEATURES + \
    GEMAPS_SPECTRAL_FEATURES

In [None]:
GEMAPS_CSV_DELIMITER = ";"

In [None]:
# Read the frameTimes 
# NOTE: Not sure why the original code is doing this. 
gemaps_path = read_data(GEMAPS_DIR,DIALOGUE_NAMES_SPLIT[0], "g","csv")[0]


**IMPORTANT**: In the code below, we assume that the extracted opensmile features 
were extracted with the same frame step and frame size as defined for this notebook. 

If the timescale we are interested in is smaller than the timescale that was 
used to extract the features, then we should shift the features **back** by some 
amount. 
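
The assumption above can be checked before the features are used. The sketch below is one hedged way to do so: it compares the spacing of a `frameTime` column against the expected frame step. The helper name `frame_step_matches` and the toy DataFrame are illustrative only; the real check would run on the CSV produced by openSMILE.

```python
import numpy as np
import pandas as pd

FRAME_STEP_MS = 50

def frame_step_matches(frame_times_s, expected_step_ms, tol_ms=1e-6):
    """Return True if consecutive frame times are spaced by expected_step_ms."""
    steps_ms = np.diff(np.asarray(frame_times_s, dtype=float)) * 1000
    return bool(np.all(np.abs(steps_ms - expected_step_ms) <= tol_ms))

# Toy stand-ins for the frameTime column of an extracted feature file.
toy_50ms_df = pd.DataFrame({"frameTime": [0.0, 0.05, 0.10, 0.15]})
toy_10ms_df = pd.DataFrame({"frameTime": [0.0, 0.01, 0.02, 0.03]})

print(frame_step_matches(toy_50ms_df["frameTime"], FRAME_STEP_MS))  # True
print(frame_step_matches(toy_10ms_df["frameTime"], FRAME_STEP_MS))  # False
```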


In [None]:
PITCH_FEATURE_LABELS = "F0semitoneFrom27.5Hz_sma3nz"

In [None]:
# NOTE: No shift is applied to the opensmile features here because we are using a 
# 50 ms timestep and the extracted features are also on a 50 ms timescale. 

# Read the gemaps feature file. 
gemaps_df = pd.read_csv(gemaps_path,delimiter=GEMAPS_CSV_DELIMITER)
# Extract the relevant raw gemap features into a separate DataFrame. 
relevant_gemaps_df = gemaps_df[RELEVANT_GEMAP_FEATURES]
# Obtain the z normalized values for each column individually. 
# preprocessing.scale standardizes along axis 0, i.e. each feature column over all frames. 
z_normalized_feat_df = pd.DataFrame(
    preprocessing.scale(relevant_gemaps_df),
    columns=relevant_gemaps_df.columns,index=relevant_gemaps_df.index)
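
As a small illustration of per-column z-normalization, the sketch below standardizes each feature column over all frames (rows) using only pandas, with the population standard deviation (`ddof=0`), which is also what `preprocessing.scale` uses. The toy values and the helper name `z_normalize_columns` are made up, and constant columns (zero variance) are not handled here.

```python
import numpy as np
import pandas as pd

# Toy feature table: rows are frames, columns are features.
toy_features = pd.DataFrame({
    "pitch": [10.0, 20.0, 30.0, 40.0],
    "loudness": [0.5, 1.0, 1.5, 2.0],
})

def z_normalize_columns(df):
    """Subtract each column's mean and divide by its population std."""
    return (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)

z = z_normalize_columns(toy_features)
# After normalization every column has (approximately) zero mean and unit variance.
print(np.allclose(z.mean(axis=0), 0.0), np.allclose(z.std(axis=0, ddof=0), 1.0))
```

Normalizing along the other axis would instead standardize each frame across its features, which mixes quantities with different units.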

In [None]:
# These are all the available GeMAPS features extracted from opensmile. 

# Read the gemaps feature file. 
gemaps_df = pd.read_csv(gemaps_path,delimiter=GEMAPS_CSV_DELIMITER)
gemaps_df.columns

In [None]:
# The original paper uses both the absolute and relative (z-normalized) pitch values and a 
# binary label indicating whether the frame was voiced. 
absolute_pitch = relevant_gemaps_df[PITCH_FEATURE_LABELS]
z_normalized_pitch = z_normalized_feat_df[PITCH_FEATURE_LABELS]
# Frame times reported by opensmile. 
frame_times_s = gemaps_df["frameTime"] 
data = {
    "frameTime" : frame_times_s,
    "{}_Absolute".format(PITCH_FEATURE_LABELS) : absolute_pitch, 
    "{}_Znormalized".format(PITCH_FEATURE_LABELS) : z_normalized_pitch, 
    # NOTE: gemaps_path points to the "g" features, so the "g" voice activity 
    # annotations are used here. 
    "voiceActivity" : voice_activity_g_df["voiceActivity"]
}
pitch_df = pd.DataFrame(data)

In [None]:
pitch_df.to_csv("{}/{}.g.pitch.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))

In [None]:
# Creating a method for extraction. 

def extract_pitch(dialogue_name, participant, va_annotation_df):
    # NOTE: No shift is applied to the opensmile features here because we are using a 
    # 50 ms timestep and the extracted features are also on a 50 ms timescale. 

    # Read the gemaps feature file. 
    gemaps_path = read_data(GEMAPS_DIR,dialogue_name, participant,"csv")[0]
    gemaps_df = pd.read_csv(gemaps_path,delimiter=GEMAPS_CSV_DELIMITER)
    # Extract the relevant raw gemap features into a separate DataFrame. 
    relevant_gemaps_df = gemaps_df[RELEVANT_GEMAP_FEATURES]
    # Obtain the z normalized values for each column individually. 
    # preprocessing.scale standardizes along axis 0, i.e. each feature column over all frames. 
    z_normalized_feat_df = pd.DataFrame(
        preprocessing.scale(relevant_gemaps_df),
        columns=relevant_gemaps_df.columns,index=relevant_gemaps_df.index)
    # The original paper uses both the absolute and relative (z-normalized) pitch values and a 
    # binary label indicating whether the frame was voiced. 
    absolute_pitch = relevant_gemaps_df[PITCH_FEATURE_LABELS]
    z_normalized_pitch = z_normalized_feat_df[PITCH_FEATURE_LABELS]
    # Frame times reported by opensmile. 
    frame_times_s = gemaps_df["frameTime"] 
    data = {
        "frameTime" : frame_times_s,
        "{}_Absolute".format(PITCH_FEATURE_LABELS) : absolute_pitch, 
        "{}_Znormalized".format(PITCH_FEATURE_LABELS) : z_normalized_pitch, 
        "voiceActivity" : va_annotation_df["voiceActivity"]
    }
    return pd.DataFrame(data)

In [None]:
# Extract the pitch features for both speakers. 
pitch_f_df = extract_pitch(DIALOGUE_NAMES_SPLIT[0], "f", voice_activity_f_df)
pitch_g_df = extract_pitch(DIALOGUE_NAMES_SPLIT[0], "g", voice_activity_g_df)


In [None]:
pitch_f_df.to_csv("{}/{}.f.pitch.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))
pitch_g_df.to_csv("{}/{}.g.pitch.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))

### Power / Intensity 

For now, we consider Loudness from the GeMAPS feature set as a measure of 
intensity. However, there are other energy related features that may be used 
later. 

In [None]:
POWER_FEATURE_LABELS = "Loudness_sma3"

In [None]:
def extract_power(dialogue_name, participant):
    # NOTE: No shift is applied to the opensmile features here because we are using a 
    # 50 ms timestep and the extracted features are also on a 50 ms timescale. 
    # Read the gemaps feature file. 
    gemaps_path = read_data(GEMAPS_DIR,dialogue_name, participant,"csv")[0]
    gemaps_df = pd.read_csv(gemaps_path,delimiter=GEMAPS_CSV_DELIMITER)
    # Extract the relevant raw gemap features into a separate DataFrame. 
    relevant_gemaps_df = gemaps_df[RELEVANT_GEMAP_FEATURES]
    # Obtain the z normalized values for each column individually. 
    # preprocessing.scale standardizes along axis 0, i.e. each feature column over all frames. 
    z_normalized_feat_df = pd.DataFrame(
        preprocessing.scale(relevant_gemaps_df),
        columns=relevant_gemaps_df.columns,index=relevant_gemaps_df.index)
    # The original paper uses power / intensity in dB - 
    # TODO: Check what the units of loudness are in the GeMAPS set. 
    absolute_power = relevant_gemaps_df[POWER_FEATURE_LABELS]
    z_normalized_power = z_normalized_feat_df[POWER_FEATURE_LABELS]
    # Frame times reported by opensmile. 
    frame_times_s = gemaps_df["frameTime"] 
    data = {
        "frameTime" : frame_times_s,
        "{}_Absolute".format(POWER_FEATURE_LABELS) :  absolute_power,
        "{}_Znormalized".format(POWER_FEATURE_LABELS) : z_normalized_power
    }
    return pd.DataFrame(data)

In [None]:
power_f_df = extract_power(DIALOGUE_NAMES_SPLIT[0], "f")
power_g_df = extract_power(DIALOGUE_NAMES_SPLIT[0], "g")

In [None]:
# Save 
power_f_df.to_csv("{}/{}.f.power.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))
power_g_df.to_csv("{}/{}.g.power.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))

### Spectral Stability 

The spectral stability in the original paper is a derived measure. First, 
Snack FFT analysis was used to get the power spectrum divided into N bands 
(up to 4 kHz) at each time step. The equation in the Skantze paper was then used 
to calculate the stability at time t. The code below extracts the 
`spectralFlux_sma3` column from the openSMILE output instead. 
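
The code in this section reads a spectral flux column from the openSMILE output rather than implementing the paper's stability equation. To give an intuition for what a flux-style measure captures, here is a minimal NumPy sketch of one common textbook form of spectral flux: the sum of squared differences between the magnitude spectra of consecutive frames. This is an assumption-laden illustration, not openSMILE's exact computation (its windowing and normalization may differ) and not the stability equation from the paper. The function name and toy signals are hypothetical.

```python
import numpy as np

def spectral_flux(signal, frame_size, frame_step):
    """Sum of squared magnitude-spectrum differences between consecutive frames.

    Assumes len(signal) >= frame_size.
    """
    n_frames = 1 + (len(signal) - frame_size) // frame_step
    window = np.hanning(frame_size)
    frames = np.stack([
        signal[i * frame_step: i * frame_step + frame_size] * window
        for i in range(n_frames)
    ])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    # One flux value per pair of neighbouring frames.
    return np.sum(np.diff(magnitudes, axis=0) ** 2, axis=1)

sample_rate = 8000
frame_len = 400  # 50 ms at 8 kHz
n = np.arange(8 * frame_len)
low_tone = np.sin(2 * np.pi * 200 * n / sample_rate)
high_tone = np.sin(2 * np.pi * 800 * n / sample_rate)

# A steady tone: every frame has the same spectrum, so the flux stays near zero.
flux_steady = spectral_flux(low_tone, frame_len, frame_len)
# A tone that jumps from 200 Hz to 800 Hz after four frames: the flux peaks at the jump.
changing_tone = np.where(n < 4 * frame_len, low_tone, high_tone)
flux_changing = spectral_flux(changing_tone, frame_len, frame_len)

print(flux_steady.max() < 1e-6, int(flux_changing.argmax()))
```

High flux therefore indicates a rapidly changing spectrum, which is roughly the opposite of stability.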


In [None]:
SPECTRAL_FEATURE_LABELS = 'spectralFlux_sma3'

In [None]:
def extract_spectral_flux(dialogue_name, participant):
    # NOTE: No shift is applied to the opensmile features here because we are using a 
    # 50 ms timestep and the extracted features are also on a 50 ms timescale. 
    # Read the gemaps feature file. 
    gemaps_path = read_data(GEMAPS_DIR,dialogue_name, participant,"csv")[0]
    gemaps_df = pd.read_csv(gemaps_path,delimiter=GEMAPS_CSV_DELIMITER)
    # Extract the raw spectral flux column. 
    spectral_flux_df = gemaps_df[SPECTRAL_FEATURE_LABELS]
    # Obtain the z normalized values for this column. 
    z_normalized_spectral_flux_df = preprocessing.scale(spectral_flux_df)
    # Frame times reported by opensmile. 
    frame_times_s = gemaps_df["frameTime"] 
    data = {
        "frameTime" : frame_times_s,
        "{}_Znormalized".format(SPECTRAL_FEATURE_LABELS) : z_normalized_spectral_flux_df 
    }
    return pd.DataFrame(data)


In [None]:
spectral_flux_f_df = extract_spectral_flux(DIALOGUE_NAMES_SPLIT[0], "f")
spectral_flux_g_df = extract_spectral_flux(DIALOGUE_NAMES_SPLIT[0], "g")

In [None]:
# Save 
spectral_flux_f_df.to_csv("{}/{}.f.spectral_flux.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))
spectral_flux_g_df.to_csv("{}/{}.g.spectral_flux.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))

### Parts of Speech Annotations

These are directly obtained from the annotations received with the MapTask corpus. 

In the original paper, there are 59 different POS tags, and each tag is 
represented as a one-hot feature vector. Additionally, to simulate the delay 
in extracting POS tags, the feature vector was all zeros by default, and the 
element corresponding to a word's tag was set to 1 (since it is a one-hot encoded vector)
100 ms after the word had ended. 
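
The delay can be sketched as follows: a word's tag is assigned to the frame nearest to its end time plus 100 ms. This is a minimal illustration with made-up values. The helper name `delayed_frame_index` is hypothetical, and it assumes frames start at 0 and are spaced 50 ms apart.

```python
import numpy as np

FRAME_STEP_MS = 50
POS_DELAY_TIME_MS = 100

def delayed_frame_index(frame_times_s, word_end_s, delay_ms=POS_DELAY_TIME_MS):
    """Index of the frame nearest to the word end time plus the delay."""
    target_s = word_end_s + delay_ms / 1000
    return int(np.abs(frame_times_s - target_s).argmin())

# One second of audio on a 50 ms grid: 0.00, 0.05, ..., 0.95.
frame_times_s = np.arange(0, 1000, FRAME_STEP_MS) / 1000
idx = delayed_frame_index(frame_times_s, word_end_s=0.32)
print(idx, frame_times_s[idx])  # 8 0.4
```

A word ending at 0.32 s is therefore labelled at the 0.40 s frame, the frame closest to 0.42 s. A word ending near the end of the recording is clamped to the last frame.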

In [None]:
# Need to read the timed-unit file for the corresponding start and end times 
# for the tags since we need to assign it
timed_unit_path = read_data(
    TIMED_UNIT_PATHS,DIALOGUE_NAMES_SPLIT[0],PARTICIPANT_LABELS_MAPTASK[1],"xml")[0]
tree_timed_unit = xml.etree.ElementTree.parse(timed_unit_path).getroot()
timed_unit_tags = list(tree_timed_unit.iter())
tu_tags = tree_timed_unit.findall("tu")



In [None]:
# Read the appropriate pos file
pos_path = read_data(
    POS_PATH,DIALOGUE_NAMES_SPLIT[0],PARTICIPANT_LABELS_MAPTASK[1],"xml")[0]
tree_pos = xml.etree.ElementTree.parse(pos_path).getroot()
# The pos is the tag attribute in all the tw tags. 
tw_tags = tree_pos.findall("tw")

In [None]:

# Getting all the times in which there are voice activity annotations in the corpus. 
va_times = []
for tu_tag in tu_tags:
    start_time_s = float(tu_tag.get('start'))
    end_time_s = float(tu_tag.get('end'))
    if end_time_s - start_time_s >= MINIMUM_VA_CLASSIFICATION_TIME_MS/1000:
        va_times.append((start_time_s,end_time_s))
        

In [None]:
POS_DELAY_TIME_MS = 100 # Assume that each POS calculation is delayed by 100ms. 

In [None]:
# Extracting the audio end time from the timed-units file. 
audio_end_time_ms = float(list(tree_timed_unit.iter())[-1].get('end')) *1000
# Get the frame times based on the final timed-unit end time. 
# NOTE: This is being generated based on the step size for now. 
frame_times_s = np.arange(0,audio_end_time_ms,FRAME_STEP_MS) / 1000
frame_times_s

In [None]:
# We need to have a POS annotation per frame - not simply per detected word time. 
pos_annotations = [0] * frame_times_s.shape[0] 
len(pos_annotations)

In [None]:
# Collecting the end time of the word and the corresponding POS tag. 
word_annotations = []
for tu_tag in tu_tags:
    tu_tag_id = tu_tag.get("id")[7:]
    end_time_s = float(tu_tag.get('end'))
    for tw_tag in tw_tags:
        # NOTE: Not sure if this is the correct way to extract the corresponding 
        # timed-unit id. 
        href = list(tw_tag.iter())[1].get("href")
        href_filename, href_ids = href.split("#")
        # Look at the appropriate file tags based on the filename. 
        href_ids = href_ids.split("..")
        for href_id in href_ids:
            href_id = href_id[href_id.find("(")+8:href_id.rfind(")")]
            if href_id == tu_tag_id:
                word_annotations.append((end_time_s,tw_tag.get("tag")))

Now creating the one hot encodings from an array representing the 
POS annotation for each frame (delayed X ms after the end of the word). 


Note that 0 represents that no annotation was available for that time. 
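
The encoding described here can be illustrated with a NumPy-only stand-in for the scikit-learn encoder used in the notebook. Labels 1..N map to one-hot rows and label 0 ("no annotation") maps to an all-zero row. The function name `one_hot_with_empty` and the toy labels are made up for this sketch.

```python
import numpy as np

def one_hot_with_empty(labels, n_classes):
    """Map integer labels to one-hot rows; label 0 becomes an all-zero row."""
    labels = np.asarray(labels, dtype=int)
    # Row 0 is all zeros; row k (k >= 1) has a single 1 in column k - 1.
    lookup = np.vstack([np.zeros(n_classes), np.eye(n_classes)])
    return lookup[labels]

# Four frames: no tag, tag 2, no tag, tag 1 (with three possible tags).
encoded = one_hot_with_empty([0, 2, 0, 1], n_classes=3)
print(encoded.tolist())
# [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
```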

In [None]:
# Create a vocabulary from all the POS annotation tags
# Documentation to the MapTask POS tags:  https://groups.inf.ed.ac.uk/maptask/interface/expl.html
POS_TAGS = [
    "vb", 
    "vbd", 
    "vbg",
    "vbn", 
    "vbz",
    "nn",
    "nns",
    "np",
    "jj",
    "jjr",
    "jjt",
    "ql",
    "qldt",
    "qlp",
    "rb",
    "rbr",
    "wql",
    "wrb",
    "not",
    "to",
    "be",
    "bem",
    "ber",
    "bez",
    "do",
    "doz",
    "hv",
    "hvz",
    "md",
    "dpr",
    "at",
    "dt",
    "ppg",
    "wdt",
    "ap",
    "cd",
    "od",
    "gen",
    "ex",
    "pd",
    "wps",
    "wpo",
    "pps",
    "ppss",
    "ppo",
    "ppl",
    "ppg2\"",
    "pr",
    "pn",
    "in",
    "rp",
    "cc",
    "cs",
    "aff",
    "fp",
    "noi",
    "pau",
    "frag",
    "sent"
]
len(POS_TAGS)

In [None]:
pos_tags_to_idx = {}
idx_to_pos_tag = {}
# NOTE: Indices start from 1 here because 0 already represents unknown categories. 
for i,tag in enumerate(POS_TAGS):
    pos_tags_to_idx[tag] = i +1
    idx_to_pos_tag[i+1] = tag 


In [None]:
# For all the collected word end times and POS tags, we need to introduce 
# a delay and add the POS annotation to the delayed frame. 
pos_annotations = np.zeros((frame_times_s.shape[0]))
for end_time_s, pos_tag in word_annotations:
    frame_idx = np.abs(frame_times_s-(end_time_s +POS_DELAY_TIME_MS/1000)).argmin()
    # Tags that are not in POS_TAGS fall back to 0 (no annotation). 
    pos_annotations[frame_idx] = pos_tags_to_idx.get(pos_tag,0) 

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# This encoder will ignore any unknown tags by replacing them with all zeros. 
# NOTE: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`. 
onehot_encoder = OneHotEncoder(sparse=False,handle_unknown="ignore")
onehot_encoder.fit(np.asarray(list(pos_tags_to_idx.values())).reshape(-1,1))
onehot_encoder.categories_

In [None]:
onehot_encoder.categories_[0].shape

In [None]:
encoded_pos = onehot_encoder.transform(pos_annotations.reshape(-1,1))
encoded_pos.shape

In [None]:
# Create a dataframe from the extracted features. 
pos_annotations_df = pd.DataFrame(encoded_pos,columns=POS_TAGS)
pos_annotations_df


In [None]:
# Combining the above individual cells into a single method that extracts POS 
# features 

def extract_pos_annotations_with_delay(dialogue_name, participant):
    # Need to read the timed-unit file for the corresponding start and end times 
    # for the tags since we need to assign it
    timed_unit_path = read_data(
        TIMED_UNIT_PATHS,dialogue_name,participant,"xml")[0]
    tree_timed_unit = xml.etree.ElementTree.parse(timed_unit_path).getroot()
    tu_tags = tree_timed_unit.findall("tu")
    # Read the appropriate pos file
    pos_path = read_data(POS_PATH,dialogue_name,participant,"xml")[0]
    tree_pos = xml.etree.ElementTree.parse(pos_path).getroot()
    # The pos is the tag attribute in all the tw tags. 
    tw_tags = tree_pos.findall("tw")
    # Getting all the times in which there are voice activity annotations in the corpus. 
    va_times = []
    for tu_tag in tu_tags:
        start_time_s = float(tu_tag.get('start'))
        end_time_s = float(tu_tag.get('end'))
        if end_time_s - start_time_s >= MINIMUM_VA_CLASSIFICATION_TIME_MS/1000:
            va_times.append((start_time_s,end_time_s))
    # Extracting the audio end time from the timed-units file. 
    audio_end_time_ms = float(list(tree_timed_unit.iter())[-1].get('end')) *1000
    # Get the frame times based on the final timed-unit end time. 
    # NOTE: This is being generated based on the step size for now. 
    frame_times_s = np.arange(0,audio_end_time_ms,FRAME_STEP_MS) / 1000
    # Collecting the end time of the word and the corresponding POS tag. 
    word_annotations = []
    for tu_tag in tu_tags:
        tu_tag_id = tu_tag.get("id")[7:]
        end_time_s = float(tu_tag.get('end'))
        for tw_tag in tw_tags:
            # NOTE: Not sure if this is the correct way to extract the corresponding 
            # timed-unit id. 
            href = list(tw_tag.iter())[1].get("href")
            href_filename, href_ids = href.split("#")
            # Look at the appropriate file tags based on the filename. 
            href_ids = href_ids.split("..")
            for href_id in href_ids:
                href_id = href_id[href_id.find("(")+8:href_id.rfind(")")]
                if href_id == tu_tag_id:
                    if tw_tag.get("tag") in POS_TAGS:
                        word_annotations.append((end_time_s,tw_tag.get("tag")))
    # For all the collected word end times and POS tags, we need to introduce 
    # a delay and add the POS annotation to the delayed frame. 
    pos_annotations = np.zeros((frame_times_s.shape[0]))
    
    for end_time_s, pos_tag in word_annotations:
        frame_idx = np.abs(frame_times_s-(end_time_s +POS_DELAY_TIME_MS/1000)).argmin()
        pos_annotations[frame_idx] = pos_tags_to_idx[pos_tag] 
    # This encoder will ignore any unknown tags by replacing them with all zeros. 
    # NOTE: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`. 
    onehot_encoder = OneHotEncoder(sparse=False,handle_unknown="ignore")
    onehot_encoder.fit(np.asarray(list(pos_tags_to_idx.values())).reshape(-1,1))
    encoded_pos = onehot_encoder.transform(pos_annotations.reshape(-1,1))
    pos_annotations_df = pd.DataFrame(encoded_pos,columns=POS_TAGS)
    # Add frametimes to the df 
    pos_annotations_df.insert(0,"frameTime",frame_times_s)
    return pos_annotations_df
        

In [None]:
pos_annotations_f_df = extract_pos_annotations_with_delay(
    DIALOGUE_NAMES_SPLIT[0], "f")
pos_annotations_g_df = extract_pos_annotations_with_delay(
    DIALOGUE_NAMES_SPLIT[0], "g")
pos_annotations_f_df.to_csv("{}/{}.f.pos_onehot.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))
pos_annotations_g_df.to_csv("{}/{}.g.pos_onehot.csv".format(RESULTS_PATH,DIALOGUE_NAMES_SPLIT[0]))