### Step 1:
<u>Initialising the directory</u>
```
dataset
- testing
    - data: Used to store the compressed files
    - org_waves: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - waves: Empty
        - transcripts: Empty
    - test
        - waves: Empty
        - transcripts: Empty
```

<br/>

### Step 2:
<u>After running the processing code</u>
```
dataset
- testing
    - data: Used to store the compressed files
    - org_waves: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
        - transcripts
            - 3000-1_1.txt
            - 3000-1_2.txt
            - 3000-1_3.txt
            - ...
            - 3000-2_1.txt
            - 3000-2_2.txt
            - 3000-2_3.txt
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
        - transcripts
            - 3000-3_1.txt
            - 3000-3_2.txt
            - 3000-3_3.txt
            - ...
            - 3000-4_1.txt
            - 3000-4_2.txt
            - 3000-4_3.txt
```

<br/>

### Step 3:
<u>After running the compression code</u>

```
data
    - input_name.tar.gz
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
                - ...
                - 3000-2_1.wav
                - 3000-2_2.wav
                - 3000-2_3.wav
        - test
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-3_1.wav
                - 3000-3_2.wav
                - 3000-3_3.wav
                - ...
                - 3000-4_1.wav
                - 3000-4_2.wav
                - 3000-4_3.wav
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the train .wav files -> take this from train/prompts.txt
    - prompts-test.txt.gz
        - prompts-test.txt: Contains transcriptions for all the test .wav files -> take this from test/prompts.txt
```

**Imports**

In [1]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment



**<u>USER INPUT REQUIRED</u> Input Relative Paths**

In [2]:
input_audio_path = ['dataset', 'testing', 'org_waves']
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
output_train_path = ['dataset', 'testing', 'train']
output_test_path = ['dataset', 'testing', 'test']
output_compressed_path = ['dataset', 'testing']
compressed_filename = 'imda_nsc_p3_testing.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

**Initialise Paths and Create the directories**

**IMPORTANT <u>USER INPUT REQUIRED</u>**: Add the ```.wav``` files to ```org_waves``` and the ```.TextGrid``` files to ```org_transcripts``` before running the processing cells

In [3]:
input_wav_folder = os.path.join(os.getcwd(), *input_audio_path)
input_textgrid_folder = os.path.join(os.getcwd(), *input_textgrid_path)
output_train_folder_waves = os.path.join(os.getcwd(), *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(os.getcwd(), *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(os.getcwd(), *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(os.getcwd(), *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(os.getcwd(), *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(os.getcwd(), *output_compressed_path, 'data')
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

for directory in create_dir:
    os.makedirs(directory, exist_ok=True)

**Helper function to clean the transcription**

1. Lower-case the text

2. Remove and replace annotations

- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-english utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```
- Any other tag: Remove every remaining ```<...>``` tag, whatever it contains

3. Collapse the extra spaces left behind by removed tags such as ```<s>```

Refer to the Transcription Guidelines by IMDA

In [4]:
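def clean_transcription(transcript):
    # `transcript` is a list of interval labels; join them into one string
    transcript = ' '.join(line.strip() for line in transcript)
    transcript = transcript.lower()
    # Patterns to delete. Order matters: the (pp*) tags must be removed before
    # the bare parentheses. The catch-all <...> pattern is applied last.
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    # Patterns to replace with a space (multi-word nouns joined by '-')
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    # Collapse the whitespace left behind by the removals
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript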
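
The order of the patterns in `remove` matters: a paralinguistic tag such as `(ppb)` has to be stripped before the bare parentheses, otherwise its letters are left behind as a word. The following is a minimal sketch of that effect. The `strip_patterns` helper and the sample label are made up for illustration and are not part of the notebook.

```python
import re

def strip_patterns(text, patterns):
    """Apply each regex in order, deleting matches, then collapse whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical label, made up for this example (not taken from the corpus)
sample = "(ppb) so (um) okay"

# Tag first, then bare parentheses: the tag disappears entirely
tags_first = strip_patterns(sample, [r'\(ppb\)', r'\(|\)'])

# Parentheses first: '(ppb)' has already become 'ppb', so the tag pattern no longer matches
parens_first = strip_patterns(sample, [r'\(|\)', r'\(ppb\)'])

print(tags_first)    # so um okay
print(parens_first)  # ppb so um okay
```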

**Main function**

Matches a single ```.wav``` file to its respective ```.TextGrid``` file

- Break the ```.wav``` file and ```.TextGrid``` file into 30s segments
- Clean the ```.TextGrid``` file
- Only keep segments whose cleaned transcription is non-empty

In [5]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_path, sanity_check=False):
    audio_path = os.path.join(os.getcwd(), *input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(os.getcwd(), *input_textgrid_path, f'{audio_filename}.TextGrid')

    output_dir_wav = os.path.join(os.getcwd(), *output_path, 'waves')
    output_dir_transcript = os.path.join(os.getcwd(), *output_path, 'transcripts')

    output_dir_textgrid = os.path.join(os.getcwd(), *output_path, 'textgrids')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    segment_duration_ms = 30 * 1000  

    audio_duration = len(audio)

    start_time = 0
    segment_index = 1

    while start_time < audio_duration:
        end_time = min(start_time + segment_duration_ms, audio_duration)

        audio_segment = audio[start_time:end_time]
        tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

        transcriptions = []
        for tier_name in tg_segment.tierNames: 
            tier = tg_segment.getTier(tier_name) 
            for entry in tier.entries:  
                if entry.label.strip():  
                    transcriptions.append(entry.label)

        transcriptions_clean = clean_transcription(transcriptions)

        if len(transcriptions_clean) > 0:
            transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
            with open(transcript_segment_path, 'w') as f:
                f.write(f'{audio_filename}_{segment_index} {transcriptions_clean}')

            if sanity_check:
                tg_segment_path = os.path.join(output_dir_textgrid, f'{audio_filename}_{segment_index}.TextGrid')
                tg_segment.save(tg_segment_path, "long_textgrid", True)
            
            audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
            audio_segment.export(audio_segment_path, format="wav")

            segment_index += 1

        # Advance to the next window whether or not this segment was kept
        start_time += segment_duration_ms
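
The windowing logic of the loop can be looked at in isolation. The sketch below reproduces only the start/end arithmetic in milliseconds, without touching audio or TextGrid data. `segment_bounds` is a hypothetical helper introduced here for illustration, and the 65 s duration is an invented example.

```python
def segment_bounds(duration_ms, segment_ms=30_000):
    """Return (start, end) pairs in milliseconds covering the whole duration."""
    bounds = []
    start = 0
    while start < duration_ms:
        # The last window is clipped to the end of the recording
        end = min(start + segment_ms, duration_ms)
        bounds.append((start, end))
        start += segment_ms
    return bounds

# A hypothetical 65 s recording gives two full windows and one 5 s remainder
print(segment_bounds(65_000))
# [(0, 30000), (30000, 60000), (60000, 65000)]
```

In the main function the same bounds are divided by 1000 before being passed to `tg.crop`, since the TextGrid is indexed in seconds while the audio is sliced in milliseconds.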

**Run the main function to split each ```.wav``` and ```.TextGrid``` file into 30s chunks**

Output is the segmented ```.wav``` files and transcriptions for each ```.wav``` file stored in ```train/waves``` and ```train/transcripts``` respectively

Note: All segments are written to the train folder first; a portion of them is moved to the test folder in a later step

Setting the ```sanity_check``` argument to True also saves the segmented ```.TextGrid``` files in ```./train/textgrids/```

In [6]:
audio_path = os.path.join(os.getcwd(), *input_audio_path)
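for filename in os.listdir(audio_path):
    # Skip anything that is not a .wav file (e.g. hidden system files)
    if not filename.endswith('.wav'):
        continue
    filename = os.path.splitext(filename)[0]
    process_audio_transcript(filename, input_audio_path, input_textgrid_path, output_train_path, True)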
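
The loop above assumes that every ```.wav``` file has a ```.TextGrid``` with the same name; a missing one would make `process_audio_transcript` fail when it opens the transcript. A possible pre-check is sketched below. `find_unpaired` is a hypothetical helper, not part of the notebook, and the file names in the example are invented.

```python
import os

def find_unpaired(wav_files, textgrid_files):
    """Return the stems of .wav files that have no .TextGrid with the same stem."""
    textgrid_stems = {
        os.path.splitext(name)[0]
        for name in textgrid_files
        if name.endswith('.TextGrid')
    }
    wav_stems = [os.path.splitext(name)[0] for name in wav_files if name.endswith('.wav')]
    return sorted(stem for stem in wav_stems if stem not in textgrid_stems)

# Hypothetical directory listings: 3000-2.wav has no matching TextGrid
print(find_unpaired(['3000-1.wav', '3000-2.wav', 'notes.txt'], ['3000-1.TextGrid']))
# ['3000-2']
```

It could be called with `os.listdir(input_wav_folder)` and `os.listdir(input_textgrid_folder)` before the processing loop is run.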

**Move a random split of the ```.wav``` and ```.txt``` files to test**

In [7]:
test_split = 0.2

sample_filenames = []
for filename in os.listdir(output_train_folder_waves):
    sample_filenames.append(filename.split('.')[0])

samples = len(sample_filenames)

num_train_samples = math.floor((1-test_split)*samples)
num_test_samples = samples-num_train_samples

print(f"The total number of samples is {samples}")
print(f"The total number of training samples will be {num_train_samples}")
print(f"The total number of test samples will be {num_test_samples}")

The total number of samples is 238
The total number of training samples will be 190
The total number of test samples will be 48


In [8]:
random.shuffle(sample_filenames)

In [9]:
for i in range(num_test_samples):
    filename = sample_filenames[i]

    source_wav = os.path.join(output_train_folder_waves, filename + '.wav')
    destination_wav = os.path.join(output_test_folder_waves)
    shutil.move(source_wav, destination_wav)

    source_transcript = os.path.join(output_train_folder_transcripts, filename + '.txt')
    destination_transcript = os.path.join(output_test_folder_transcripts)
    shutil.move(source_transcript, destination_transcript)
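
The split above relies on an unseeded `random.shuffle`, so each run moves a different set of files to test. If a repeatable split is wanted, one option is a seeded local generator, sketched below. `split_names`, its `seed` parameter and the generated segment names are assumptions made for this example, not part of the notebook.

```python
import math
import random

def split_names(names, test_split=0.2, seed=0):
    """Return (train, test) lists using a seeded shuffle of the sorted names."""
    shuffled = sorted(names)               # fixed starting order, independent of os.listdir
    random.Random(seed).shuffle(shuffled)  # local generator; global random state untouched
    num_train = math.floor((1 - test_split) * len(shuffled))
    num_test = len(shuffled) - num_train
    # Mirror the notebook: the first num_test shuffled names go to test
    return shuffled[num_test:], shuffled[:num_test]

# Hypothetical segment names standing in for the contents of train/waves
names = [f"3000-1_{i}" for i in range(1, 239)]
train, test = split_names(names)
print(len(train), len(test))  # 190 48
```

The counts match the output printed earlier: floor(0.8 × 238) = 190 training samples, leaving 48 for test.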

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [10]:
train_prompts_path = os.path.join(os.getcwd(), *output_train_path, 'prompts.txt')
with open(train_prompts_path, 'w') as outfile:  # 'w' so that re-running the cell does not duplicate lines
    for filename in os.listdir(output_train_folder_transcripts):
        file_path = os.path.join(output_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [11]:
test_prompts_path = os.path.join(os.getcwd(), *output_test_path, 'prompts.txt')
with open(test_prompts_path, 'w') as outfile:  # 'w' so that re-running the cell does not duplicate lines
    for filename in os.listdir(output_test_folder_transcripts):
        file_path = os.path.join(output_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

**Compress the folders into a ```.tar.gz``` archive and gzip the prompt files**

In [12]:
paths_to_compress = [train_prompts_path, output_train_folder_waves, test_prompts_path, output_test_folder_waves]

with tarfile.open(output_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, os.path.join(os.getcwd(), *output_compressed_path))
        tar_gz.add(path, arcname=rel_path) 

In [13]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(output_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [14]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(output_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
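
After compression it can be useful to confirm that the archive has the layout shown in Step 3. The sketch below lists the regular files stored in a ```.tar.gz``` archive. `list_archive_files` is a hypothetical helper added for illustration, not part of the notebook.

```python
import tarfile

def list_archive_files(archive_path):
    """Return the sorted names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Directories are skipped so that only prompts and .wav entries remain
        return sorted(member.name for member in tar.getmembers() if member.isfile())
```

Calling `list_archive_files(output_compressed_file)` should return paths relative to the archive root, such as `train/prompts.txt` and entries under `train/waves` and `test/waves`, because each path was added with an `arcname` relative to `output_compressed_path`.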

<br/>

**Sanity Check**

In [15]:
with open(train_prompts_path, "r") as f:
    lines = f.readlines()
    train_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [16]:
train_prompts_filenames[:10]

['3000-1_1',
 '3000-1_100',
 '3000-1_101',
 '3000-1_102',
 '3000-1_103',
 '3000-1_104',
 '3000-1_105',
 '3000-1_106',
 '3000-1_107',
 '3000-1_108']

In [17]:
train_wavs_filenames = []
for filename in os.listdir(output_train_folder_waves):
    filename = filename.split('.')[0]
    train_wavs_filenames.append(filename)
train_waves_filename = sorted(train_wavs_filenames)

In [18]:
train_waves_filename[:10]

['3000-1_1',
 '3000-1_100',
 '3000-1_101',
 '3000-1_102',
 '3000-1_103',
 '3000-1_104',
 '3000-1_105',
 '3000-1_106',
 '3000-1_107',
 '3000-1_108']

In [19]:
train_prompts_filenames==train_waves_filename

True

In [20]:
with open(test_prompts_path, "r") as f:
    lines = f.readlines()
    test_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [21]:
test_prompts_filenames[:10]

['3000-1_10',
 '3000-1_118',
 '3000-1_119',
 '3000-1_13',
 '3000-1_15',
 '3000-1_27',
 '3000-1_28',
 '3000-1_32',
 '3000-1_33',
 '3000-1_34']

In [22]:
test_wavs_filenames = []
for filename in os.listdir(output_test_folder_waves):
    filename = filename.split('.')[0]
    test_wavs_filenames.append(filename)
test_waves_filename = sorted(test_wavs_filenames)

In [23]:
test_waves_filename[:10]

['3000-1_10',
 '3000-1_118',
 '3000-1_119',
 '3000-1_13',
 '3000-1_15',
 '3000-1_27',
 '3000-1_28',
 '3000-1_32',
 '3000-1_33',
 '3000-1_34']

In [24]:
test_prompts_filenames==test_waves_filename

True
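
The equality checks above only say whether the two lists match. When they do not, it helps to see which names are missing on each side. A minimal sketch using set differences follows. `compare_names` is a hypothetical helper and the names in the example are invented.

```python
def compare_names(prompt_names, wav_names):
    """Return (prompts without a wav, wavs without a prompt), each sorted."""
    prompts, wavs = set(prompt_names), set(wav_names)
    return sorted(prompts - wavs), sorted(wavs - prompts)

# Hypothetical names: one wav file has no line in prompts.txt
missing_wav, missing_prompt = compare_names(
    ["3000-1_1", "3000-1_2"],
    ["3000-1_1", "3000-1_2", "3000-1_3"],
)
print(missing_wav)     # []
print(missing_prompt)  # ['3000-1_3']
```

Because sets discard duplicates, this complements the list comparison rather than replacing it: a prompt line written twice would still be caught only by the list equality check.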