# Dataset Ensembling 
<br></br>
This notebook covers the final steps of the dataset ensembling process.

These steps are: segmenting each audio file by sentence using timestamps, and cleaning the segment labels so that the exported .wav filenames are valid. The sentence timestamps come from the forced-alignment synchronization maps produced with the "aeneas" CLI. Each audio segment (sentence) was given a sentiment label by a transformer-based NLP (natural language processing) model, finBERT.
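
To make the timestamp step concrete, here is a minimal sketch of how sentence boundaries could be read from a sync map. It assumes the JSON sync map layout that aeneas can write (a `fragments` list whose entries carry `begin`, `end`, and `lines`); the notebook does not say which output format was actually used, and `fragments_to_rows` and the two sample fragments are hypothetical.

```python
import json


def fragments_to_rows(sync_map_text):
    """Turn an aeneas-style JSON sync map into (begin_s, end_s, text) rows."""
    sync_map = json.loads(sync_map_text)
    rows = []
    for fragment in sync_map["fragments"]:
        begin = float(fragment["begin"])  # times are stored as strings, in seconds
        end = float(fragment["end"])
        text = " ".join(fragment["lines"])  # a fragment may span several text lines
        rows.append((begin, end, text))
    return rows


# Hypothetical two-fragment sync map, for illustration only
example = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "16.320", "lines": ["First sentence."]},
    {"id": "f000002", "begin": "16.320", "end": "20.520", "lines": ["Second sentence."]},
]})
rows = fragments_to_rows(example)
```

Each row then holds the start and end of one sentence in seconds, which is the shape the segment .csv files in this notebook use alongside the sentiment label.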


<br></br>
<img src="dataset_ensembling_workflow2.png" alt="workflow" style="width:60%">
<br></br>
<br></br>

### Data totals:
- Scraped videos: 11,525
    - 23 unique search queries (financial market related)
- Successful transcriptions: 11,235
    - .JSON format
    - 1 file per audio -> "display" form
        - Punctuation and capitalization
- Successful audios processed by "finBERT": 10,548
    - Sentences labeled by sentiment
        - The negative, neutral, and positive labels correspond to bearish, neutral, and bullish, respectively

<br></br>
#### **Final dataset (total audio segments):** 176,432

In [1]:
from pydub import AudioSegment

import pandas as pd
import numpy as np

import glob
import os

In [12]:
def read_data(df_files):
    """Read each segment .csv and return a list of cleaned 4-column DataFrames."""
    final_segment_list = []
    for csv_path in df_files:
        # Rows can have differing numbers of comma-separated fields, so size the
        # header to the widest row before handing the file to pandas
        with open(csv_path, 'r') as temp_f:
            col_count = [len(line.split(",")) for line in temp_f]
        column_names = list(range(max(col_count)))
        df = pd.read_csv(csv_path, header=None, delimiter=",", names=column_names)

        # Keep the first 4 columns (label, start, end, text); copy so the
        # assignment below does not write into a slice of df
        df = df.iloc[:, 0:4].copy()

        # Strip digits from the labels for a clean .wav filename on export;
        # regex=True is needed in pandas >= 2.0, where the pattern is literal by default
        df[0] = df[0].astype(str).str.replace(r'\d+', '', regex=True)
        final_segment_list.append(df)

    return final_segment_list


# Extract the segments of one audio file using its matching segment df
def extract_segments(wavobj, seg_df):

    # Export one .wav per row: column 0 = label, 1 = start (s), 2 = end (s)
    for label, seg_start, seg_end in zip(seg_df[0], seg_df[1], seg_df[2]):
        # "AudioSegment" slices are indexed in milliseconds, so multiply by 1000
        audio_seg = wavobj[seg_start * 1000:seg_end * 1000]
        # len() of an AudioSegment is its duration in milliseconds, not samples
        print('Extracting audio segment:', len(audio_seg), 'ms')

        new_index = len(
            os.listdir('F://audioSegments_labeled2//')
        )  # Number of files already exported becomes the next file's index
        file_name = str(label) + "_" + str(new_index) + ".wav"

        # Export each .wav (written to the current working directory)
        audio_seg.export(file_name, format='wav')
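
`read_data` above counts columns by splitting each line on commas, which also splits a quoted sentence that contains a comma. As a hedged alternative, the sketch below uses the standard-library `csv` module, which respects quoting, then keeps the first four fields and strips digits from the label. The function name `read_segment_rows` and the sample rows (including the `NEUT1`-style labels) are made up for illustration; this is not the method the notebook used.

```python
import csv
import io
import re


def read_segment_rows(handle, keep=4):
    """Read ragged CSV rows, keep the first `keep` fields, strip digits from the label."""
    rows = []
    for fields in csv.reader(handle):
        if not fields:
            continue  # skip blank lines
        # Trim long rows and pad short ones so every row has `keep` fields
        fields = fields[:keep] + [""] * (keep - len(fields))
        label = re.sub(r"\d+", "", fields[0])
        rows.append([label, float(fields[1]), float(fields[2]), fields[3]])
    return rows


# Hypothetical sample: one row with a quoted comma and an extra field, one plain row
sample = io.StringIO(
    'NEUT1,0.0,16.32,"Hello, world.",extra\n'
    "NEUT2,16.32,20.52,Short row.\n"
)
rows = read_segment_rows(sample)
```

The quoted sentence stays in one field, so the fourth column holds the full text rather than its first comma-delimited piece.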

In [3]:
# Import segment files (sorted so the file order is reproducible between runs)

df_files = sorted(glob.glob(
    './capstone3_data/labeledSegments_audios/labeledSegments_timestamps/*.csv'))
#print(df_files)

In [4]:
segment_dfs = read_data(df_files)  # Read data from each .csv
len(segment_dfs)

10548

In [5]:
# Import .wav audio files (sorted so the file order is reproducible between runs)

wavs = sorted(glob.glob('./audio_wavFiles/*.wav'))
#print(wavs)

In [6]:
# Load each .wav file as a pydub AudioSegment and collect the objects in a list
# (every audio file is held in memory at once)

wavobjs = [AudioSegment.from_wav(path) for path in wavs]

In [15]:
segment_dfs[0].head()  # Check if first segment df of segment df list matches original

      0       1       2                                                  3
0  NEUT    0.00   16.32  "Diabulus I would like to take you through wit...
1  NEUT   16.32   20.52            That we have adopted in the new normal.
2  NEUT   20.52   23.76               A few key points to be kept in mind.
3  NEUT   23.76   26.44                  While drafting the communication.
4  NEUT   26.44   29.36             As brands return to business as usual.


In [8]:
# Create list of tuples for each audio and its matching segments

obj_seg_pair = list(zip(wavobjs, segment_dfs))
print(obj_seg_pair[0])

(<pydub.audio_segment.AudioSegment object at 0x0000021946CC63A0>,        0       1       2                                                  3
0   NEUT    0.00   16.32  "Diabulus I would like to take you through wit...
1   NEUT   16.32   20.52            That we have adopted in the new normal.
2   NEUT   20.52   23.76               A few key points to be kept in mind.
3   NEUT   23.76   26.44                  While drafting the communication.
4   NEUT   26.44   29.36             As brands return to business as usual.
5   NEUT   29.36   30.92                                        What I saw.
6   NEUT   30.92   32.76                                                 1.
7   NEUT   32.76   38.44  With changing times consumer behavior and need...
8   NEUT   38.44   45.76  Branch should anticipate the change in pattern...
9   NEUT   45.76   50.00  And feel the messaging around what consumers c...
10  NEUT   50.00   52.44                                                 2.
11  NEUT   52.44   60.
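
`zip` pairs the two lists by position, so each audio only meets its own segments if both lists have the same length and the same order. A more defensive option is to match files by name. The sketch below assumes that a .wav file and its segment .csv share the same file stem, which this notebook does not confirm; `pair_by_stem` and the file names are hypothetical.

```python
from pathlib import Path


def pair_by_stem(wav_paths, csv_paths):
    """Match each .wav to the .csv with the same file stem; report leftovers."""
    csv_by_stem = {Path(p).stem: p for p in csv_paths}
    pairs, unmatched = [], []
    for wav in sorted(wav_paths):
        match = csv_by_stem.get(Path(wav).stem)
        if match is None:
            unmatched.append(wav)  # audio without a labeled segment file
        else:
            pairs.append((wav, match))
    return pairs, unmatched


# Hypothetical file names, for illustration only
pairs, unmatched = pair_by_stem(
    ["audio/b.wav", "audio/a.wav", "audio/c.wav"],
    ["segments/a.csv", "segments/b.csv"],
)
```

Pairing paths rather than loaded objects also allows each audio file to be loaded just before it is segmented, instead of keeping every file in memory.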

In [10]:
os.chdir('F://audioSegments_labeled2//')
os.getcwd()

'F:\\audioSegments_labeled2'

In [14]:
# Loop through wav objects/segment pairs and extract audio segments

for wavobj, seg_df in obj_seg_pair:
    extract_segments(wavobj, seg_df)
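
`extract_segments` lists the export folder once per segment to find the next file index, and each listing gets slower as the folder fills up. A running counter yields the same numbering with a single listing at the start. The sketch below only plans the exports (millisecond bounds and file names) so it runs without audio; `plan_exports` is a hypothetical helper, and it assumes nothing else writes to the export folder during the run.

```python
def plan_exports(rows, start_index=0):
    """Map (label, start_s, end_s) rows to (start_ms, end_ms, file_name) tuples."""
    plan = []
    for offset, (label, start_s, end_s) in enumerate(rows):
        # Slice bounds are in milliseconds; round to avoid float drift
        start_ms = int(round(start_s * 1000))
        end_ms = int(round(end_s * 1000))
        file_name = f"{label}_{start_index + offset}.wav"
        plan.append((start_ms, end_ms, file_name))
    return plan


# Values taken from the first two rows shown earlier in this notebook
plan = plan_exports([("NEUT", 0.0, 16.32), ("NEUT", 16.32, 20.52)], start_index=0)
```

In a real run, `start_index` would be the number of files already in the export folder, read once before the loop, and each tuple would drive one slice-and-export call.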