## **Prepare Full Dataset: Segment audio and transcriptions based on main speaker's speech**

<br/>
<br/>
<br/>

## Overview of code flow

### Step 0:

<u>Run the processing code to clean the textgrid files</u>
- Renames a few files to follow the naming convention and deletes outdated files
- Removes instances of ```text = "...item [something]..."``` and ```text = "...intervals [something]..."``` so that the TextGrid library can parse the files
- Removes files where instantaneous timings are paired with a proper transcription, and files with overlapping timings

```
local drive
- clean_textgrid
    - org_transcripts
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
```

### Step 1:
<u>After manually creating the directory in the input drive and running the code to initialise the directory in the output drive</u>
```
input drive
- org_wavs: Manually add in .wav files to be segmented
    - 3000-1.wav
    - 3000-2.wav
    - ...
- org_transcripts: Manually add in cleaned .TextGrid files to be segmented
    - 3000-1.TextGrid
    - 3000-2.TextGrid
    - ...
- invalid_wavs: Manually move the following invalid wav files from org_wavs
    - 3035-2.wav: Instantaneous timing and transcription don't match
    - 3075-2.wav: Instantaneous timing and transcription don't match
    - 3143-2.wav: Overlap in transcription timing
    - 3201-1.wav: Instantaneous timing and transcription don't match
    - 3250-2.wav: Overlap in transcription timing

output drive
- dataset
    - data: Used to store compressed files
    - train
        - waves: Empty
        - transcripts: Empty
        - textgrids: Empty
    - test
        - waves: Empty
        - transcripts: Empty
```

<br/>

### Step 2:
<u>After running the processing code</u>
```
input drive
- org_wavs: Manually add in .wav files to be segmented
    - 3000-1.wav
    - 3000-2.wav
    - ...
- org_transcripts: Manually add in .TextGrid files to be segmented
    - 3000-1.TextGrid
    - 3000-2.TextGrid
    - ...
- invalid_wavs: Manually move the following invalid wav files from org_wavs
    - 3035-2.wav: Instantaneous timing and transcription don't match
    - 3075-2.wav: Instantaneous timing and transcription don't match
    - 3143-2.wav: Overlap in transcription timing
    - 3201-1.wav: Instantaneous timing and transcription don't match
    - 3250-2.wav: Overlap in transcription timing

output drive
- dataset
    - data: Used to store compressed files
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
        - transcripts
            - 3000-1_1.txt
            - 3000-1_2.txt
            - 3000-1_3.txt
            - ...
            - 3000-2_1.txt
            - 3000-2_2.txt
            - 3000-2_3.txt
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
        - transcripts
            - 3000-3_1.txt
            - 3000-3_2.txt
            - 3000-3_3.txt
            - ...
            - 3000-4_1.txt
            - 3000-4_2.txt
            - 3000-4_3.txt
```

<br/>

### Step 3:
<u>After running the compression code</u>

```
output drive
- data: Used to store compressed files
    - imda_nsc_p3.tar.gz
        - train
            - prompts.txt: Contains transcriptions for all the .wav files in train
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
                - ...
                - 3000-2_1.wav
                - 3000-2_2.wav
                - 3000-2_3.wav
        - test
            - prompts.txt: Contains transcriptions for all the .wav files in test
            - waves
                - 3000-3_1.wav
                - 3000-3_2.wav
                - 3000-3_3.wav
                - ...
                - 3000-4_1.wav
                - 3000-4_2.wav
                - 3000-4_3.wav
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the train .wav files -> taken from train/prompts.txt
    - prompts-test.txt.gz
        - prompts-test.txt: Contains transcriptions for all the test .wav files -> taken from test/prompts.txt
```

<br/>
<br/>
<br/>

## Step 0: Code to clean TextGrid files

**Imports**

In [1]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment



**<u>USER ACTION REQUIRED</u>**

Change the relative paths and naming conventions if needed

In [2]:
org_transcripts_path = ['clean_textgrid', 'org_transcripts']

**Initialise Paths and Create the directories**

**<u>USER ACTION REQUIRED</u>**: 

- Add in the <u>original</u> ```.TextGrid``` files provided by IMDA NSC to ```clean_textgrid/org_transcripts``` <u>after</u> running the code block directly below

In [3]:
org_transcripts_folder = os.path.join(os.getcwd(), *org_transcripts_path)
create_dir = [org_transcripts_folder]

for directory in create_dir:  # avoid shadowing the built-in dir()
    os.makedirs(directory, exist_ok=True)

The code below does the following.

Renames the following files:
- 3108-1_edited.TextGrid: Rename to 3108-1.TextGrid
- 3115-1 9 (Update 2.05).TextGrid: Rename to 3115-1.TextGrid
- 3115-2 (Update 2.05).TextGrid: Rename to 3115-2.TextGrid
- 3209-1_edited.TextGrid: Rename to 3209-1.TextGrid

Deletes the following files: 
- 3115-1 (Update 2.04).TextGrid: Delete because outdated
- 3115-2 (Update 2.04).TextGrid: Delete because outdated
- 3035-2.TextGrid: Instantaneous timing and transcription don't match
- 3075-2.TextGrid: Instantaneous timing and transcription don't match
- 3143-2.TextGrid: Overlap in transcription timing
- 3201-1.TextGrid: Instantaneous timing and transcription don't match
- 3250-2.TextGrid: Overlap in transcription timing

In [4]:
files_to_delete = ['3115-1 (Update 2.04).TextGrid', '3115-2 (Update 2.04).TextGrid', '3035-2.TextGrid', 
                   '3075-2.TextGrid', '3143-2.TextGrid', '3201-1.TextGrid', '3250-2.TextGrid']

files_to_rename = {
    "3108-1_edited.TextGrid": "3108-1.TextGrid",
    "3115-1 9 (Update 2.05).TextGrid": "3115-1.TextGrid",
    "3115-2 (Update 2.05).TextGrid": "3115-2.TextGrid",
    "3209-1_edited.TextGrid": "3209-1.TextGrid"
}

for filename in files_to_delete:
    file_path = os.path.join(org_transcripts_folder, filename)
    os.remove(file_path)
    print(f"Deleted {filename}")

for old_name, new_name in files_to_rename.items():
    old_path = os.path.join(org_transcripts_folder, old_name)
    new_path = os.path.join(org_transcripts_folder, new_name)
    os.rename(old_path, new_path)
    print(f"Renamed {old_name} to {new_name}")

Deleted 3115-1 (Update 2.04).TextGrid
Deleted 3115-2 (Update 2.04).TextGrid
Deleted 3035-2.TextGrid
Deleted 3075-2.TextGrid
Deleted 3143-2.TextGrid
Deleted 3201-1.TextGrid
Deleted 3250-2.TextGrid
Renamed 3108-1_edited.TextGrid to 3108-1.TextGrid
Renamed 3115-1 9 (Update 2.05).TextGrid to 3115-1.TextGrid
Renamed 3115-2 (Update 2.05).TextGrid to 3115-2.TextGrid
Renamed 3209-1_edited.TextGrid to 3209-1.TextGrid


**Helper function to remove instances of ```text = "...item [something]..."``` and ```text = "...intervals [something]..."``` from a single TextGrid file**

- Done so that these strings do not interfere with the praatio library's splitting logic

In [5]:
def remove_text_restriction(textgrid_path):
    try:
        with open(textgrid_path, "r", encoding="utf-16") as file:
            textgrid = file.read()
        encoding = "utf-16"
    except UnicodeError:
        with open(textgrid_path, "r", encoding="utf-8") as file:
            textgrid = file.read()
        encoding = "utf-8"

    text_restriction_1 = r'text = "(.*?item \[.*?\].*?)"'
    text_restriction_2 = r'text = "(.*?intervals \[.*?\].*?)"'

    def replace_brackets(match):
        text_content = match.group(1)
        text_content = text_content.replace("[", "").replace("]", "")
        return f'text = "{text_content}"'

    # re.sub receives the regex pattern, a replacement function and the input string.
    # The function is called with a match object for each occurrence of the pattern,
    # and its return value is used as the replacement text.
    textgrid_fixed = re.sub(text_restriction_1, replace_brackets, textgrid)
    textgrid_fixed_final = re.sub(text_restriction_2, replace_brackets, textgrid_fixed)

    with open(textgrid_path, "w", encoding=encoding) as file:
        file.write(textgrid_fixed_final)
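
To see what the helper above changes, here is a minimal sketch that applies the same two regular expressions to an in-memory string instead of a file. The name `strip_brackets` and the sample TextGrid fragment are made up for illustration; only the patterns and the replacement logic are taken from the helper.

```python
import re

# Same two patterns as in remove_text_restriction
PATTERNS = [
    r'text = "(.*?item \[.*?\].*?)"',
    r'text = "(.*?intervals \[.*?\].*?)"',
]

def strip_brackets(textgrid_text):
    """Drop '[' and ']' inside text fields that mention 'item [' or 'intervals ['."""
    def replace_brackets(match):
        content = match.group(1).replace("[", "").replace("]", "")
        return f'text = "{content}"'

    for pattern in PATTERNS:
        textgrid_text = re.sub(pattern, replace_brackets, textgrid_text)
    return textgrid_text

# Hypothetical fragment: structural lines plus two transcription lines
sample = "\n".join([
    '        intervals [1]:',
    '            xmin = 0',
    '            xmax = 1.5',
    '            text = "the item [two] on the list"',
    '        intervals [2]:',
    '            text = "okay [lah]"',
])

cleaned = strip_brackets(sample)
print(cleaned)
```

Only the transcription that contains `item [` is rewritten. The structural `intervals [1]:` lines and bracketed discourse particles such as `[lah]` are left alone, because `.` does not match a newline and the patterns must start at `text = "`.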

**Remove text restrictions to let praatio library run properly**

In [6]:
cleaned_successfully = []
cleaned_unsuccessfully = []
for filename in os.listdir(org_transcripts_folder):
    textgrid_path = os.path.join(org_transcripts_folder, filename)
    try:
        tg = textgrid.openTextgrid(textgrid_path, False)
    except Exception:
        # Parsing failed: strip the bracketed text and retry once
        remove_text_restriction(textgrid_path)
        try:
            tg = textgrid.openTextgrid(textgrid_path, False)
            cleaned_successfully.append(filename)
        except Exception:
            cleaned_unsuccessfully.append(filename)

In [7]:
cleaned_successfully

['3018-1.TextGrid',
 '3025-1.TextGrid',
 '3030-1.TextGrid',
 '3045-2.TextGrid',
 '3048-2.TextGrid',
 '3055-1.TextGrid',
 '3061-2.TextGrid',
 '3069-2.TextGrid',
 '3083-1.TextGrid',
 '3093-2.TextGrid',
 '3095-1.TextGrid',
 '3122-1.TextGrid',
 '3127-2.TextGrid',
 '3136-2.TextGrid',
 '3137-1.TextGrid',
 '3141-1.TextGrid',
 '3141-2.TextGrid',
 '3169-2.TextGrid',
 '3174-1.TextGrid',
 '3178-1.TextGrid',
 '3178-2.TextGrid',
 '3202-1.TextGrid',
 '3214-1.TextGrid',
 '3232-2.TextGrid',
 '3244-1.TextGrid',
 '3244-2.TextGrid',
 '3250-1.TextGrid',
 '3263-1.TextGrid']

In [8]:
cleaned_unsuccessfully

[]

<br/>
<br/>
<br/>

## Step 1: Code to initialise the directory

**<u>USER INPUT REQUIRED</u>**

- Change the relative paths and naming conventions if needed
- Set the segment duration (must be <= 30s because Whisper processes audio in 30-second windows) and the silent buffer inserted between entries (in ms)

In [9]:
input_audio_path = ['org_wavs']
input_textgrid_path = ['org_transcripts']
output_train_path = ['dataset', 'train']
output_test_path = ['dataset', 'test']
output_compressed_path = ['dataset','data']
compressed_filename = 'imda_nsc_p3.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'
segment_duration_s = 30
buffer_ms = 1000

**Initialise Paths and Create the directories**

**<u>USER ACTION REQUIRED</u>**

- Specify the input drive path
- Create the ```org_wavs``` and ```invalid_wavs``` folders in the input drive
- Add the ```.wav``` files from IMDA NSC to ```org_wavs``` in the input drive
- Move the following files from ```org_wavs``` to ```invalid_wavs```: 
    - 3035-2.wav: Instantaneous timing and transcription don't match
    - 3075-2.wav: Instantaneous timing and transcription don't match
    - 3143-2.wav: Overlap in transcription timing
    - 3201-1.wav: Instantaneous timing and transcription don't match
    - 3250-2.wav: Overlap in transcription timing
- Copy ```clean_textgrid/org_transcripts``` as ```org_transcripts``` to the input drive

Optionally, the input folders can be created in code instead of by hand: add them to ```create_dir``` (see the commented-out list in the code below).

In [10]:
input_drive_path = 'D:\\' # os.getcwd()
output_drive_path = os.getcwd()
input_wav_folder = os.path.join(input_drive_path, *input_audio_path)
input_textgrid_folder = os.path.join(input_drive_path, *input_textgrid_path)
output_train_folder_waves = os.path.join(output_drive_path, *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(output_drive_path, *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(output_drive_path, *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(output_drive_path, *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(output_drive_path, *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(output_drive_path, *output_compressed_path)
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

# create input wav and textgrid folder
#create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              #output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

for directory in create_dir:  # avoid shadowing the built-in dir()
    os.makedirs(directory, exist_ok=True)

<br/>
<br/>
<br/>

## Step 2: Code to process and segment the original ```.wav``` and ```.TextGrid``` files into output files

**Helper function to clean the transcription**

1. Lower-case the text

2. Remove and replace annotations

- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-english utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```
- Remove all instances of ```<whatever is inside>```

3. Collapse the extra spaces left behind by the removed annotations (e.g. ```<s>```)

Refer to the Transcription Guidelines by IMDA

In [11]:
def clean_transcription(transcript):
    transcript = transcript.strip()
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript
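
As a quick check of the cleaning rules, the sketch below repeats the function in compact form so that it runs on its own, then applies it to a made-up transcription. The sample sentence is an assumption for illustration and is not taken from the corpus.

```python
import re

def clean_transcription(transcript):
    """Compact copy of the helper above, repeated here so the example is self-contained."""
    transcript = transcript.strip().lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!',
              r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
              r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>']
    for pattern in remove:
        transcript = re.sub(pattern, '', transcript)
    transcript = transcript.replace('-', ' ')           # multi-word nouns
    return re.sub(r'\s+', ' ', transcript).strip()      # collapse leftover spaces

# Hypothetical annotated transcription
raw = " [Oh] I took the M_R_T to the shopping-centre (ppb) <c/> (um) !wah! <unk> "
cleaned = clean_transcription(raw)
print(cleaned)

# Entries made up only of annotations clean down to an empty string
only_annotations = clean_transcription("(ppl) <s> <z>")
print(repr(only_annotations))
```

The second call matters for the segmentation step: an entry whose cleaned text is empty carries no ground-truth transcription.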

**Main function**

- Matches a single ```.wav``` file to its respective ```.TextGrid``` file

- Breaks the ```.wav``` and ```.TextGrid``` files into segments such that each segment, audio plus its transcription, is at most ```segment_duration_s``` long (<= 30s)


In [12]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_dir_wav, output_dir_transcript, segment_duration_s, buffer):
    # Initialise the wav and TextGrid paths of the current file
    audio_path = os.path.join(input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(input_textgrid_path, f'{audio_filename}.TextGrid')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    # Specify the current segment index
    segment_index = 1

    # Initialise the current segment duration
    curr_segment_duration = 0
    # Initialise a list to hold the transcriptions for the current segment
    curr_transcriptions = []
    # Initialise a list to hold the audios for the current segment
    curr_wavs = []
    # Get the buffer in seconds -> To separate potentially unrelated speech
    buffer_s = buffer/1000 
    # Initialise audio buffer
    buffer_audio = AudioSegment.silent(duration=buffer)

    for tier_name in tg.tierNames: 
        tier = tg.getTier(tier_name) 
        for start,end,label in tier.entries:  
            # Get the duration of this new entry
            entry_duration = end-start

            # If the new entry is shorter than the specified segment duration and
            # adding a buffer plus the new entry to the current segment stays within that duration,
            # keep accumulating into the current segment
            if entry_duration < segment_duration_s and curr_segment_duration + buffer_s + entry_duration <= segment_duration_s:
                # Clean the transcription/label of this entry
                curr_transcription_clean = clean_transcription(label)
                # If this entry has text after cleaning i.e. contains proper ground truth transcription,
                # it is a valid sample
                if len(curr_transcription_clean) > 0:
                    # Update the current_segment_duration
                    curr_segment_duration = curr_segment_duration + buffer_s + entry_duration
                    # Add the current cleaned transcription of this entry
                    curr_transcriptions.append(curr_transcription_clean)
                    # Add the audio of this entry: Segment the audio using the start and end time from the current TextGrid entry
                    curr_wavs.append(audio[start*1000:(end*1000)+1]) # Add 1 ms s.t the end timing is inclusive

            # If adding a buffer and new entry exceeds our specified duration of each segment,
            # that means the current segment is completed and
            # we save the current transcription and the segmented audio as well as perform resetting
            elif curr_segment_duration > 0:
                    # Join the current transcription for the segment
                    transcript_segment = ' '.join(curr_transcriptions)

                    # Initialise the transcription segment path
                    transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
                    # Write the transcription to the transcription segment file
                    with open(transcript_segment_path, 'w') as f:
                        f.write(f'{audio_filename}_{segment_index} {transcript_segment}')

                    # Join the audio segments together with an audio buffer between them
                    audio_segment = curr_wavs[0]
                    for wav in curr_wavs[1:]:
                        audio_segment = audio_segment + buffer_audio + wav

                    # Initialise the audio segment path
                    audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
                    # Save the audio segment
                    audio_segment.export(audio_segment_path, format="wav")

                    # Increment the segment index
                    segment_index+=1

                    # Resetting
                    curr_transcription_clean = clean_transcription(label)
                    # If the entry in the current iteration is <= than our specified duration of each segment and has text after cleaning i.e. contains proper ground truth transcription
                    if entry_duration <= segment_duration_s and len(curr_transcription_clean) > 0:
                        # Reset the current segment duration
                        curr_segment_duration = entry_duration
                        # Reset the list to hold the transcriptions for the new segment
                        curr_transcriptions = [curr_transcription_clean]
                        # Reset the list to hold the audios for the new segment
                        curr_wavs = [audio[start*1000:(end*1000)+1]] # Add 1 ms s.t the end timing is inclusive
                    # Skip the entry as a sample if it is > than our specified duration of each segment
                    else:
                        # Reset the new segment duration
                        curr_segment_duration = 0
                        # Reset the list to hold the transcriptions for the new segment
                        curr_transcriptions = []
                        # Reset the list to hold the audios for the new segment
                        curr_wavs = []
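
The accumulation idea is easier to follow without the audio handling. The sketch below is a simplified variant, not the function above: it works on plain `(start, end, text)` tuples whose text is assumed to be already cleaned, applies the buffer only between entries, and skips over-long entries without closing the current segment. Unlike the function above as written, it also emits whatever is still pending once the loop ends. The name `pack_entries` and the sample entries are hypothetical.

```python
def pack_entries(entries, max_duration_s=30.0, buffer_s=1.0):
    """Greedily group (start, end, text) entries into segments of at most max_duration_s."""
    segments = []
    current, current_duration = [], 0.0

    for start, end, text in entries:
        duration = end - start
        if not text or duration > max_duration_s:
            continue  # nothing to transcribe, or too long to ever fit

        # A buffer is only needed when something is already in the segment
        extra = duration + (buffer_s if current else 0.0)
        if current and current_duration + extra > max_duration_s:
            segments.append(current)          # close the full segment
            current, current_duration = [], 0.0
            extra = duration                  # the entry starts a fresh segment

        current.append(text)
        current_duration += extra

    if current:
        segments.append(current)              # emit the last pending segment
    return segments

# Hypothetical entries (times in seconds)
entries = [(0, 10, "a"), (12, 22, "b"), (25, 35, "c"), (40, 80, "too long"), (90, 95, "d")]
segments = pack_entries(entries)
print(segments)
```

With a 30 s limit and a 1 s buffer, "a" and "b" fit together (21 s), "c" would push the total to 32 s and so opens a new segment, and the 40 s entry is dropped.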

**Run the main function to create segments for each ```.wav``` and ```.TextGrid``` file**

The output is the segmented ```.wav``` audio files and the corresponding ```.txt``` transcription files, which are stored in ```train/waves``` and ```train/transcripts``` respectively

Note: All segments are first written to the train folder; a portion of them is moved to test afterwards

In [13]:
for filename in os.listdir(input_wav_folder):
    try:
        filename = filename.split('.')[0]
        process_audio_transcript(filename, input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts, segment_duration_s, buffer_ms)
    except Exception as e:
        print(f"Filename {filename}")
        print(f"Exception {e}")
        # break

**Move a split of the ```.wav``` and ```.txt``` files to test**

In [14]:
test_split = 0.2

sample_filenames = []
for filename in os.listdir(output_train_folder_waves):
    sample_filenames.append(filename.split('.')[0])

samples = len(sample_filenames)

num_train_samples = math.floor((1-test_split)*samples)
num_test_samples = samples-num_train_samples

print(f"The total number of samples is {samples}")
print(f"The total number of training samples will be {num_train_samples}")
print(f"The total number of test samples will be {num_test_samples}")

The total number of samples is 100345
The total number of training samples will be 80276
The total number of test samples will be 20069


In [15]:
random.shuffle(sample_filenames)
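
The shuffle above is unseeded, so rerunning the notebook produces a different train/test split each time. If a reproducible split is wanted, one option is a dedicated seeded generator. This is a minimal sketch: the function name `split_names` and the seed value 42 are arbitrary assumptions, and sorting first is there only to remove any dependence on the order returned by `os.listdir`.

```python
import math
import random

def split_names(names, test_split=0.2, seed=42):
    """Return (train, test) lists; the same names and seed always give the same split."""
    shuffled = sorted(names)                  # fixed starting order
    random.Random(seed).shuffle(shuffled)     # local generator, global state untouched
    num_train = math.floor((1 - test_split) * len(shuffled))
    num_test = len(shuffled) - num_train
    # As in the notebook, the first num_test shuffled names go to test
    return shuffled[num_test:], shuffled[:num_test]

# Hypothetical segment names
names = [f"3000-1_{i}" for i in range(1, 11)]
train, test = split_names(names)
print(len(train), len(test))
```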

In [16]:
for i in range(num_test_samples):
    filename = sample_filenames[i]

    source_wav = os.path.join(output_train_folder_waves, filename + '.wav')
    destination_wav = os.path.join(output_test_folder_waves)
    shutil.move(source_wav, destination_wav)

    source_transcript = os.path.join(output_train_folder_transcripts, filename + '.txt')
    destination_transcript = os.path.join(output_test_folder_transcripts)
    shutil.move(source_transcript, destination_transcript)

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [17]:
train_prompts_path = os.path.join(output_drive_path, *output_train_path, 'prompts.txt')
with open(train_prompts_path, 'w') as outfile:  # 'w' so that rerunning the cell does not duplicate lines
    for filename in os.listdir(output_train_folder_transcripts):
        file_path = os.path.join(output_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [18]:
test_prompts_path = os.path.join(output_drive_path, *output_test_path, 'prompts.txt')
with open(test_prompts_path, 'w') as outfile:  # 'w' so that rerunning the cell does not duplicate lines
    for filename in os.listdir(output_test_folder_transcripts):
        file_path = os.path.join(output_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

<br/>
<br/>
<br/>

## Step 3: Compress the files

**Compress the folders into ```.tar.gz```**

In [19]:
paths_to_compress = [train_prompts_path, output_train_folder_waves, test_prompts_path, output_test_folder_waves]

with tarfile.open(output_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        # Store names relative to the dataset folder so the archive holds train/... and test/...
        rel_path = os.path.relpath(path, os.path.dirname(output_compressed_folder))
        tar_gz.add(path, arcname=rel_path) 
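
To confirm that an archive has the layout shown in the overview (`train/...` and `test/...` at the top level), the member names can be listed after writing. The sketch below does this on a throwaway directory; the helper names, the file `demo.tar.gz` and the placeholder files are assumptions for illustration.

```python
import os
import tarfile
import tempfile

def build_archive(dataset_dir, archive_path, members):
    """Add each member (a path relative to dataset_dir) under that same relative name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for rel in members:
            tar.add(os.path.join(dataset_dir, rel), arcname=rel)

def list_archive(archive_path):
    """Return the names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()

with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = os.path.join(tmp, "dataset")
    os.makedirs(os.path.join(dataset_dir, "train", "waves"))
    os.makedirs(os.path.join(dataset_dir, "data"))
    with open(os.path.join(dataset_dir, "train", "prompts.txt"), "w") as f:
        f.write("3000-1_1 hello\n")
    with open(os.path.join(dataset_dir, "train", "waves", "3000-1_1.wav"), "wb") as f:
        f.write(b"")  # empty placeholder, not real audio

    archive_path = os.path.join(dataset_dir, "data", "demo.tar.gz")
    build_archive(dataset_dir, archive_path, ["train/prompts.txt", "train/waves"])
    names = list_archive(archive_path)

print(names)
```

Adding a directory with `tar.add` includes its contents recursively, so every file under `train/waves` appears beneath that prefix.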

In [20]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(output_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [21]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(output_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
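
A compressed prompt file can be checked by reading it back with `gzip.open` in text mode. The following is a self-contained round-trip sketch on a temporary file; the helper names and the two sample lines are made up, and UTF-8 is an assumed encoding.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(src_path, dst_path):
    """Compress src_path into dst_path, streaming the bytes across."""
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

def read_gzip_lines(path):
    """Read a gzip-compressed text file back as a list of lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "prompts.txt")
    dst = os.path.join(tmp, "prompts-demo.txt.gz")
    with open(src, "w", encoding="utf-8") as f:
        f.write("3000-1_1 hello there\n3000-1_2 okay lah\n")

    gzip_file(src, dst)
    restored = read_gzip_lines(dst)

print(restored)
```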

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

**Sanity Check**

In [22]:
with open(train_prompts_path, "r") as f:
    lines = f.readlines()
    train_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [23]:
train_prompts_filenames[:10]

['3000-1_11',
 '3000-1_12',
 '3000-1_13',
 '3000-1_16',
 '3000-1_17',
 '3000-1_19',
 '3000-1_2',
 '3000-1_20',
 '3000-1_21',
 '3000-1_22']

In [24]:
train_wavs_filenames = []
for filename in os.listdir(output_train_folder_waves):
    filename = filename.split('.')[0]
    train_wavs_filenames.append(filename)
train_waves_filename = sorted(train_wavs_filenames)

In [25]:
train_waves_filename[:10]

['3000-1_11',
 '3000-1_12',
 '3000-1_13',
 '3000-1_16',
 '3000-1_17',
 '3000-1_19',
 '3000-1_2',
 '3000-1_20',
 '3000-1_21',
 '3000-1_22']

In [26]:
train_prompts_filenames==train_waves_filename

True

In [27]:
with open(test_prompts_path, "r") as f:
    lines = f.readlines()
    test_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [28]:
test_prompts_filenames[:10]

['3000-1_1',
 '3000-1_10',
 '3000-1_14',
 '3000-1_15',
 '3000-1_18',
 '3000-1_29',
 '3000-1_44',
 '3000-1_45',
 '3000-1_49',
 '3000-1_50']

In [29]:
test_wavs_filenames = []
for filename in os.listdir(output_test_folder_waves):
    filename = filename.split('.')[0]
    test_wavs_filenames.append(filename)
test_waves_filename = sorted(test_wavs_filenames)

In [30]:
test_waves_filename[:10]

['3000-1_1',
 '3000-1_10',
 '3000-1_14',
 '3000-1_15',
 '3000-1_18',
 '3000-1_29',
 '3000-1_44',
 '3000-1_45',
 '3000-1_49',
 '3000-1_50']

In [31]:
test_prompts_filenames==test_waves_filename

True

In [32]:
len(train_prompts_filenames)

80276

In [33]:
len(test_prompts_filenames)

20069
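
The equality checks above only report `True` or `False`. If a check ever fails, a set difference shows which names are missing on each side. This is a small sketch; the function name `compare_names` and the sample names are hypothetical.

```python
def compare_names(prompt_names, wav_names):
    """Return (names with a prompt but no wav, names with a wav but no prompt)."""
    prompt_set, wav_set = set(prompt_names), set(wav_names)
    return sorted(prompt_set - wav_set), sorted(wav_set - prompt_set)

# Hypothetical mismatch: one name is missing on each side
missing_wavs, missing_prompts = compare_names(
    ["3000-1_1", "3000-1_2", "3000-1_3"],
    ["3000-1_2", "3000-1_3", "3000-1_4"],
)
print(missing_wavs, missing_prompts)
```

Two empty lists correspond to the `True` result of the comparisons above, provided neither list contains duplicates.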