# Preparing Data

## Part 1: Get Data 

National Speech Corpus
- Part 3: 1000 hours of conversational speech data (Used by Home team)
- Part 2: 1000 hours of prompted recordings of random sentences containing local words and entities (Used by some developer)
- Part 4: Conversational code-switched data (from Singaporean English to various native languages)

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english


<br/>
<br/>
<br/>
<br/>
<br/>

## Part 2: Prepare Data

- Match each transcript sentence to its corresponding audio file
- Decide which recording environment and microphone to use
    - Home Team's choice (from their blog post):
        - The NSC Part 3 recordings are split into two environments, each with two different microphones used for recording. In the first environment, where speakers were in the same room, we selected the recordings using the close-talk mic as this isolated the main speaker’s voice (without picking up background noise or the secondary speaker). For the second environment with speakers in different rooms, we chose to use the standing microphone recordings, as opposed to recordings via telephone.
    - Same Room environment: close-talk mic, which isolates the main speaker's voice
    - Separate Room environment: standing mic, as opposed to the telephone
- Clean the transcripts by removing annotations
- Normalise the transcript text
    - Remove punctuation
    - Lowercase text
- Create 30s audio segments with corresponding transcripts
    - Using time segments from ```TextGrid files```, splice out corresponding segments from WAV files
    - Combine shorter consecutive segments (?)
    - Why 30s: Whisper's feature extractor pads or truncates every input to 30s (this is intrinsic to the model's design)
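
To make the 30s constraint concrete, here is a minimal sketch of what "pad or truncate to 30s" means for 16kHz audio. The helper `pad_or_trim` and the constants are illustrative names written for this note; this is not the feature extractor's actual implementation, and it assumes silence is represented by zeros.

```python
SAMPLE_RATE = 16_000                      # NSC Part 3 audio is sampled at 16kHz
CHUNK_SECONDS = 30                        # fixed input length used by Whisper
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk


def pad_or_trim(samples, length=N_SAMPLES, pad_value=0.0):
    """Return a copy of `samples` that is exactly `length` items long."""
    if len(samples) >= length:
        # Longer clips are cut: everything after 30s is lost
        return list(samples[:length])
    # Shorter clips are padded with silence at the end
    return list(samples) + [pad_value] * (length - len(samples))


short_clip = [0.1] * (SAMPLE_RATE * 12)   # 12s of dummy audio
long_clip = [0.1] * (SAMPLE_RATE * 45)    # 45s of dummy audio
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))
```

Because anything beyond 30s is dropped, a segment longer than 30s would end up with more transcript than audio, which is why the segments are capped at 30s here.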

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english

<br/>
<br/>

More on dataset part 3 (see ```ABOUT.txt```):

Part 3 consists of about 1000 hours of conversational data recorded from about 1000 local English speakers, split into pairs. The data includes conversations covering daily life and of speakers playing games provided. 

Part 3's recordings were split into 2 environments. In the Same Room environment, where speakers were in the same room, the recordings were done using 2 microphones: a close-talk mic and a boundary mic. In the Separate Room environment, speakers were separated into individual rooms. The recordings were done using 2 microphones in each room: a standing mic and a telephone. 

Part 3 is further organised into six subdirectories, 3 for each recording environment (Same Room or Separate Room). Among each group of 3 subdirectories, 1 contains transcriptions, while the remaining 2 contain audio data from each of the two microphones used for the environment. There is also a manifest document at the root of the Part 3 folder that lists the files released.


Summary of Part 3 data organization:
- Same Room environment, files organized by speaker number:
    - /Scripts Same: Orthographic transcripts saved in TextGrid format
    - /Audio Same BoundaryMic: Audio files in WAV format recorded using the boundary mic, sampled at 16kHz
    - /Audio Same CloseMic: Audio files in WAV format recorded using the close-talk mic, sampled at 16kHz


- Separate Room environment, files organized by speaker number and session number:
    - /Scripts Separate: Orthographic transcripts saved in TextGrid format 
    - /Audio Separate IVR: Audio files in WAV format recorded using the telephone, sampled at 16kHz
    - /Audio Separate StandingMic: Audio files in WAV format recorded using the standing mic, sampled at 16kHz


<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 1: Simple Example/Debugging**

**1. Match 3000-1.wav and 3000-1.TextGrid**

- Use Dataset Part 3 (used by Home Team)
- Specific datasets (used by Home Team)
    - Audio Same CloseMic
    - Audio Separate StandingMic 
- In this simple example, start with the Audio Same CloseMic dataset only
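
Once more than one recording is involved, the matching step can be sketched as below. It assumes that a recording and its transcript share the same file stem (as `3000-1.wav` and `3000-1.TextGrid` do); `pair_by_stem` is a hypothetical helper written for this note, not part of any library.

```python
import os


def pair_by_stem(wav_names, textgrid_names):
    """Pair file names that share a stem, e.g. '3000-1.wav' and '3000-1.TextGrid'."""
    wavs = {os.path.splitext(n)[0]: n for n in wav_names
            if n.lower().endswith('.wav')}
    grids = {os.path.splitext(n)[0]: n for n in textgrid_names
             if n.lower().endswith('.textgrid')}
    # Stems present on both sides form a pair
    pairs = {stem: (wavs[stem], grids[stem])
             for stem in sorted(wavs.keys() & grids.keys())}
    # Stems present on only one side need manual attention
    unmatched = sorted(wavs.keys() ^ grids.keys())
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    ['3000-1.wav', '3000-2.wav'],
    ['3000-1.TextGrid', '3000-3.TextGrid'],
)
print(pairs)
print(unmatched)
```

Reporting the unmatched stems separately makes it easier to spot a recording that has no transcript (or the reverse) before any segmentation is attempted.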


**2. Create 30s segments from 3000-1.wav and 3000-1.TextGrid**

In [None]:
# https://github.com/jiaaro/pydub#installation
# https://github.com/timmahrt/praatIO/tree/main

import os
from praatio import textgrid 
from pydub import AudioSegment

# Initialise input and output paths
audio_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.wav')
textgrid_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.TextGrid')
output_dir = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1-splits')
os.makedirs(output_dir, exist_ok=True)  # the exports below fail if this folder does not exist

# https://github.com/jiaaro/pydub
# https://github.com/timmahrt/praatIO
# https://timmahrt.github.io/praatIO/praatio.html
audio = AudioSegment.from_wav(audio_path)
tg = textgrid.openTextgrid(textgrid_path, False) # do not include intervals and points with empty labels

# pydub does things in milliseconds
segment_duration_ms = 30 * 1000  

# Get total duration of the audio in milliseconds
audio_duration = len(audio)

# Initialize start time and segment index
start_time = 0
segment_index = 1

# The loop is disabled here so that only the first 30s segment is produced while debugging
# while start_time < audio_duration:
end_time = min(start_time + segment_duration_ms, audio_duration)

# Extract audio segment given the current start and end timing
audio_segment = audio[start_time:end_time]

# Save the audio segment
audio_segment_path = os.path.join(output_dir, f'segment_{segment_index}.wav')
audio_segment.export(audio_segment_path, format="wav")

# Extract the corresponding TextGrid segment
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

# Check tg_segment 
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment_path = os.path.join(output_dir, 'tg_segment.TextGrid')
tg_segment.save(tg_segment_path, "long_textgrid", True)

# Collect transcriptions from the TextGrid segment
transcriptions = []
for tier_name in tg_segment.tierNames: # For each tier (in order) in the TextGrid segment
    tier = tg_segment.getTier(tier_name) # Get the tier
    for entry in tier.entries: # For each of its entries, extract the labels 
        if entry.label.strip():  # Only include non-empty transcriptions -> but should be handled above already
            transcriptions.append(entry.label)

# Save the transcriptions to a text file
transcription_path = os.path.join(output_dir, f'segment_{segment_index}_transcription.txt')
with open(transcription_path, 'w') as f:
    f.write("\n".join(transcriptions))

In [None]:
output_dir_audio = os.path.join(output_dir, 'segment_1.wav')

from IPython.display import Audio
display(Audio(output_dir_audio))

**Transcription**
```
<S>
(um) you can go first
<S>
you guys are going to stand here [ah]
<S>
they are like !wow! this is a weird topic (um)
<S>
Singapore and Malaysia are like
<S>
you know brothers but not really brothers brothers on a on a tricky relationship
<S>
you know what let's skip this topic
<S>
next do I go do I go next
```

**TextGrid**
```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 30 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "3000-1" 
        xmin = 0 
        xmax = 30 
        intervals: size = 14 
        intervals [1]:
            xmin = 0 
            xmax = 1.556 
            text = "<S>" 
        intervals [2]:
            xmin = 1.556 
            xmax = 2.661 
            text = "(um) you can go first" 
        intervals [3]:
            xmin = 2.661 
            xmax = 3.848 
            text = "<S>" 
        intervals [4]:
            xmin = 3.848 
            xmax = 4.998 
            text = "you guys are going to stand here [ah]" 
        intervals [5]:
            xmin = 4.998 
            xmax = 10.473 
            text = "<S>" 
        intervals [6]:
            xmin = 10.473 
            xmax = 13.531 
            text = "they are like !wow! this is a weird topic (um)" 
        intervals [7]:
            xmin = 13.531 
            xmax = 16.156 
            text = "<S>" 
        intervals [8]:
            xmin = 16.156 
            xmax = 17.868 
            text = "Singapore and Malaysia are like" 
        intervals [9]:
            xmin = 17.868 
            xmax = 19.781 
            text = "<S>" 
        intervals [10]:
            xmin = 19.781 
            xmax = 24.718 
            text = "you know brothers but not really brothers brothers on a on a tricky relationship" 
        intervals [11]:
            xmin = 24.718 
            xmax = 26.281 
            text = "<S>" 
        intervals [12]:
            xmin = 26.281 
            xmax = 27.318 
            text = "you know what let's skip this topic" 
        intervals [13]:
            xmin = 27.318 
            xmax = 28.156 
            text = "<S>" 
        intervals [14]:
            xmin = 28.156 
            xmax = 30 
            text = "next do I go do I go next" 

```

<br/>
<br/>
<br/>

**3. Clean and format the transcripts**

In [None]:
output_dir_transcript = os.path.join(output_dir, 'segment_1_transcription.txt')

with open(output_dir_transcript, 'r') as f:
    transcript = ' '.join(line.strip() for line in f)

In [None]:
transcript

<u>Cleaning</u>

1. Lower-case the text

2. Remove and replace annotations

- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-English utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```

In [None]:
import re

transcript = transcript.lower()

remove = [r'_', r'\[|\]', r'\(|\)', r'!', r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', 
          r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
          r'\*', r'<non/>', r'<s/>', r'<c/>']

replace = ['-']


for e in remove:
    transcript = re.sub(e, '', transcript)

In [None]:
for e in replace:
    transcript = re.sub(e, ' ', transcript)

In [None]:
transcript

In [None]:
# Collapse the extra spaces left behind after removing <s> and the other annotations
transcript = re.sub(r'\s+', ' ', transcript).strip()

In [None]:
transcript

**Need to change the order** 

The (ppl), (ppb), etc. patterns should be applied first: if the parentheses are removed beforehand, these markers can no longer be matched

Also need to remove all tags of the form ```<example_word>```, example: ```<malay>malay word</malay>```
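
The ordering problem can be reproduced on a made-up line (the sample text below is invented for illustration and is not taken from the corpus):

```python
import re

MARKERS = r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)'   # paralinguistic markers
BRACKETS = r'\(|\)'                            # brackets around fillers such as (um)


def collapse(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


text = 'okay (ppl) (um) lah'

# Brackets first: '(ppl)' becomes 'ppl', so the marker pattern never matches
brackets_first = collapse(re.sub(MARKERS, '', re.sub(BRACKETS, '', text)))

# Markers first: '(ppl)' is removed whole, then the filler brackets are stripped
markers_first = collapse(re.sub(BRACKETS, '', re.sub(MARKERS, '', text)))

print(brackets_first)
print(markers_first)
```

With the brackets removed first, a stray `ppl` is left in the transcript; with the markers removed first, only the spoken words remain.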

In [None]:
testing = ['(ppl)','(test)','sfs','(rdg)', 'tg_s']
testing_2 = ' '.join(test.strip() for test in testing)
remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>']
for e in remove:
    testing_2 = re.sub(e, '', testing_2)
testing_2 = re.sub(r'\s+', ' ', testing_2).strip()

In [None]:
# https://github.com/jiaaro/pydub#installation
# https://github.com/timmahrt/praatIO/tree/main

import os
from praatio import textgrid 
from pydub import AudioSegment

# Initialise input and output paths
audio_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.wav')
textgrid_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.TextGrid')
output_dir = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1-splits')
os.makedirs(output_dir, exist_ok=True)  # the exports below fail if this folder does not exist

# https://github.com/jiaaro/pydub
# https://github.com/timmahrt/praatIO
# https://timmahrt.github.io/praatIO/praatio.html
audio = AudioSegment.from_wav(audio_path)
tg = textgrid.openTextgrid(textgrid_path, False) # do not include intervals and points with empty labels

# pydub does things in milliseconds
segment_duration_ms = 30 * 1000  

# Get total duration of the audio in milliseconds
audio_duration = len(audio)

# Initialize start time and segment index
start_time = 0
segment_index = 1

# The loop is disabled here so that only the first 30s segment is produced while debugging
# while start_time < audio_duration:
end_time = min(start_time + segment_duration_ms, audio_duration)

# Extract audio segment given the current start and end timing
audio_segment = audio[start_time:end_time]

# Save the audio segment
audio_segment_path = os.path.join(output_dir, f'segment_{segment_index}.wav')
audio_segment.export(audio_segment_path, format="wav")

# Extract the corresponding TextGrid segment
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

# Check tg_segment 
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment_path = os.path.join(output_dir, 'tg_segment.TextGrid')
tg_segment.save(tg_segment_path, "long_textgrid", True)

# Collect transcriptions from the TextGrid segment
transcriptions = []
for tier_name in tg_segment.tierNames: # For each tier (in order) in the TextGrid segment
    tier = tg_segment.getTier(tier_name) # Get the tier
    for entry in tier.entries: # For each of its entries, extract the labels 
        if entry.label.strip():  # Only include non-empty transcriptions -> but should be handled above already
            transcriptions.append(entry.label)

print(transcriptions)

In [None]:
' '.join(line.strip() for line in transcriptions)

In [None]:
def clean_transcription(transcript):
    transcript = ' '.join(line.strip() for line in transcript)

    transcript = transcript.lower()

    # Paralinguistic markers come first so that they are matched before the brackets are stripped
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>']

    replace = ['-']


    for e in remove:
        transcript = re.sub(e, '', transcript)

    for e in replace:
        transcript = re.sub(e, ' ', transcript)

    transcript = re.sub(r'\s+', ' ', transcript).strip()

    return transcript

In [None]:
clean_transcription(transcriptions)

**Things to check**
- check out 3000-1_33: <malay>malay word</malay>
- check out 3000-1_36: no transcription
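
For the language tags, there are two plausible behaviours, sketched below on an invented sample line. Whether the tagged words should be kept or dropped is a decision these notes do not settle, so both variants are shown; `keep_words` and `drop_words` are illustrative names.

```python
import re


def collapse(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


# Invented sample, lower-cased as it would be at this point in the cleaning
sample = 'i want to eat <malay>nasi lemak</malay> <unk> later'

# Variant 1: strip only the tags themselves and keep the words between them
keep_words = collapse(re.sub(r'<[^>]+>', '', sample))

# Variant 2: drop paired tags together with their content, then strip leftover tags
without_pairs = re.sub(r'<(\w+)>.*?</\1>', '', sample)
drop_words = collapse(re.sub(r'<[^>]+>', '', without_pairs))

print(keep_words)
print(drop_words)
```

Variant 1 keeps the transcript aligned with what is actually spoken in the audio, while variant 2 leaves audio in the segment that has no matching text.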

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 2: Check if the transcriptions are still ok**

```
- imda_nsc_p3.tar.gz
    - imda_nsc_p3.tar
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files
```

In [None]:
import re 
import os
from praatio import textgrid 
from pydub import AudioSegment

In [None]:
def clean_transcription(transcript):
    transcript = ' '.join(line.strip() for line in transcript)
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] # Addition: remove all instances of <whatever's inside>
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript

In [None]:
# Input paths
audio_filename = '3000-2'

audio_path = os.path.join(os.getcwd(), 'dataset', 'dev', 'org_wavs', f'{audio_filename}.wav')
textgrid_path = os.path.join(os.getcwd(), 'dataset', 'dev', 'org_transcripts', f'{audio_filename}.TextGrid')

# Output paths
# output_dir_train_wav = os.path.join(os.getcwd(), 'dataset', 'imda_nsc_prototype', 'train', 'waves', f'{audio_filename}')
output_dir_train_wav = os.path.join(os.getcwd(), 'dataset', 'dev', 'train', 'waves')
os.makedirs(output_dir_train_wav, exist_ok=True)
output_dir_train_text = os.path.join(os.getcwd(), 'dataset', 'dev', 'train', 'prompts.txt')
output_dir_train_tg = os.path.join(os.getcwd(), 'dataset', 'dev', 'train', 'textgrids')
os.makedirs(output_dir_train_tg, exist_ok=True)  # needed before the TextGrid segments are saved

# https://github.com/jiaaro/pydub
# https://github.com/timmahrt/praatIO
# https://timmahrt.github.io/praatIO/praatio.html
# Extract the audio and text grid
audio = AudioSegment.from_wav(audio_path)
tg = textgrid.openTextgrid(textgrid_path, False) # do not include intervals and points with empty labels

# pydub does things in milliseconds
segment_duration_ms = 30 * 1000  

# Get total duration of the audio in milliseconds
audio_duration = len(audio)

# Initialize start time and segment index
start_time = 0
segment_index = 1

while start_time < audio_duration:
    # Initialise end time of the segment
    end_time = min(start_time + segment_duration_ms, audio_duration)

    # Extract audio segment given the current start and end timing
    audio_segment = audio[start_time:end_time]

    # Extract the corresponding TextGrid segment
    # https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
    tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

    tg_segment_path = os.path.join(output_dir_train_tg, f'{audio_filename}_{segment_index}.TextGrid')
    tg_segment.save(tg_segment_path, "long_textgrid", True)

    # Collect transcriptions from the TextGrid segment
    transcriptions = []
    for tier_name in tg_segment.tierNames: # For each tier (in order) in the TextGrid segment
        tier = tg_segment.getTier(tier_name) # Get the tier
        for entry in tier.entries: # For each of its entries, extract the labels 
            if entry.label.strip():  # Only include non-empty transcriptions -> but should be handled above already
                transcriptions.append(entry.label)

    print(f"Dirty transcription: {transcriptions}")
    # Clean the transcriptions
    transcriptions_clean = clean_transcription(transcriptions)
    print(f"Clean transcription: {transcriptions_clean}")
    #print("")

    if len(transcriptions_clean) > 0:
        # Save the transcriptions to a text file, append mode
        with open(output_dir_train_text, 'a') as f:
            f.write(f'{audio_filename}_{segment_index} {transcriptions_clean}\n')

        # Save the audio segment
        audio_segment_path = os.path.join(output_dir_train_wav, f'{audio_filename}_{segment_index}.wav')
        audio_segment.export(audio_segment_path, format="wav")

        segment_index += 1

    # Advance to the next 30s window whether or not this segment was kept
    start_time += segment_duration_ms

**```tar.gz``` file resources**

- https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python
- https://www.tutorialspoint.com/how-to-create-a-tar-file-using-python
- https://www.geeksforgeeks.org/python-os-path-relpath-method/

**```txt.gz``` file resources**
- https://stackoverflow.com/questions/8156707/gzip-a-file-in-python

**Folder structure resources**
- https://huggingface.co/docs/datasets/en/audio_dataset#loading-script
- https://huggingface.co/datasets/AILAB-VNUHCM/vivos/tree/main/data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 3: Testing**

<u>Before running processing code</u>
```
dataset
- testing
    - data: Empty
    - org_waves: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - waves: Empty
        - transcripts: Empty
    - test
        - waves: Empty
        - transcripts: Empty
```

<br/>

<u>After running processing code</u>
```
dataset
- testing
    - data: Empty
    - org_waves: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
        - transcripts
            - 3000-1_1.txt
            - 3000-1_2.txt
            - 3000-1_3.txt
            - ...
            - 3000-2_1.txt
            - 3000-2_2.txt
            - 3000-2_3.txt
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
        - transcripts
            - 3000-3_1.txt
            - 3000-3_2.txt
            - 3000-3_3.txt
            - ...
            - 3000-4_1.txt
            - 3000-4_2.txt
            - 3000-4_3.txt
```

<br/>

<u>After running compression code</u>

```
data
    - input_name.tar.gz
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
                - ...
                - 3000-2_1.wav
                - 3000-2_2.wav
                - 3000-2_3.wav
        - test
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-3_1.wav
                - 3000-3_2.wav
                - 3000-3_3.wav
                - ...
                - 3000-4_1.wav
                - 3000-4_2.wav
                - 3000-4_3.wav
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the train .wav files -> take this from train/prompts.txt
    - prompts-test.txt.gz
        - prompts-test.txt: Contains transcriptions for all the test .wav files -> take this from test/prompts.txt
```

**Imports**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

**Input Relative Paths**

In [None]:
input_audio_path = ['dataset', 'testing', 'org_waves']  # matches the org_waves folder in the directory layout above
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
output_train_path = ['dataset', 'testing', 'train']
output_test_path = ['dataset', 'testing', 'test']
output_compressed_path = ['dataset', 'testing']
compressed_filename = 'imda_nsc_p3_testing.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

**Initialise Paths and Create the directories**

**IMPT**: Remember to add in the ```.wav``` and ```.TextGrid``` files to org_waves and org_transcripts

In [None]:
input_wav_folder = os.path.join(os.getcwd(), *input_audio_path)
input_textgrid_folder = os.path.join(os.getcwd(), *input_textgrid_path)
output_train_folder_waves = os.path.join(os.getcwd(), *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(os.getcwd(), *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(os.getcwd(), *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(os.getcwd(), *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(os.getcwd(), *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(os.getcwd(), *output_compressed_path, 'data')
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

for directory in create_dir:  # avoid shadowing the built-in dir()
    os.makedirs(directory, exist_ok=True)

**Helper function to clean the transcription**

1. Lower-case the text

2. Remove and replace annotations

- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-English utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```
- Remove all instances of ```<whatever is inside>```

3. Collapse the extra spaces left behind after removing ```<s>``` and the other annotations

Refer to the Transcription Guidelines by IMDA

In [None]:
def clean_transcription(transcript):
    transcript = ' '.join(line.strip() for line in transcript)
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript

**Main function**

Matches a single ```.wav``` file to its respective ```.TextGrid``` file

- Break the ```.wav``` file and ```.TextGrid``` file into 30s segments
- Clean the transcription extracted from each ```.TextGrid``` segment
- Only keep segments whose cleaned transcription is non-empty

In [None]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_path, sanity_check=False):
    audio_path = os.path.join(os.getcwd(), *input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(os.getcwd(), *input_textgrid_path, f'{audio_filename}.TextGrid')

    output_dir_wav = os.path.join(os.getcwd(), *output_path, 'waves')
    output_dir_transcript = os.path.join(os.getcwd(), *output_path, 'transcripts')

    output_dir_textgrid = os.path.join(os.getcwd(), *output_path, 'textgrids')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    segment_duration_ms = 30 * 1000  

    audio_duration = len(audio)

    start_time = 0
    segment_index = 1

    while start_time < audio_duration:
        end_time = min(start_time + segment_duration_ms, audio_duration)

        audio_segment = audio[start_time:end_time]
        tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

        transcriptions = []
        for tier_name in tg_segment.tierNames: 
            tier = tg_segment.getTier(tier_name) 
            for entry in tier.entries:  
                if entry.label.strip():  
                    transcriptions.append(entry.label)

        transcriptions_clean = clean_transcription(transcriptions)

        if len(transcriptions_clean) > 0:
            transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
            with open(transcript_segment_path, 'w') as f:
                f.write(f'{audio_filename}_{segment_index} {transcriptions_clean}')

            if sanity_check:
                tg_segment_path = os.path.join(output_dir_textgrid, f'{audio_filename}_{segment_index}.TextGrid')
                tg_segment.save(tg_segment_path, "long_textgrid", True)
            
            audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
            audio_segment.export(audio_segment_path, format="wav")

            segment_index += 1

        # Advance to the next 30s window whether or not this segment was kept
        start_time += segment_duration_ms

**Run the main function to segment 30s chunks for each ```.wav``` and ```.TextGrid``` file**

Output is the segmented ```.wav``` files and transcriptions for each ```.wav``` file stored in ```train/waves``` and ```train/transcripts``` respectively

Note: All segments are first written to the train folder; a share of them is moved to the test folder in a later step

A sanity check can be set to True to view the segmented ```.TextGrid``` files in ```./train/textgrids/```

In [None]:
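audio_path = os.path.join(os.getcwd(), *input_audio_path)
for filename in sorted(os.listdir(audio_path)):
    # Skip anything that is not a .wav file (e.g. hidden files created by the OS)
    if not filename.endswith('.wav'):
        continue
    audio_filename = os.path.splitext(filename)[0]
    process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_train_path, sanity_check=True)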

**Move a split of the ```.wav``` files and ```.txt``` file to test**

In [None]:
test_split = 0.2

sample_filenames = []
for filename in os.listdir(output_train_folder_waves):
    sample_filenames.append(filename.split('.')[0])

samples = len(sample_filenames)

num_train_samples = math.floor((1-test_split)*samples)
num_test_samples = samples-num_train_samples

print(f"The total number of samples is {samples}")
print(f"The total number of training samples will be {num_train_samples}")
print(f"The total number of test samples will be {num_test_samples}")

In [None]:
random.shuffle(sample_filenames)

In [None]:
for i in range(num_test_samples):
    filename = sample_filenames[i]

    source_wav = os.path.join(output_train_folder_waves, filename + '.wav')
    destination_wav = os.path.join(output_test_folder_waves)
    shutil.move(source_wav, destination_wav)

    source_transcript = os.path.join(output_train_folder_transcripts, filename + '.txt')
    destination_transcript = os.path.join(output_test_folder_transcripts)
    shutil.move(source_transcript, destination_transcript)
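
The split above depends on the order returned by `os.listdir` and on an unseeded shuffle, so it changes from run to run. If a reproducible split is wanted, one option is sketched below. `split_ids` and the default `seed` value are assumptions made for this note; the selection rule (the first `num_test` shuffled names go to test) mirrors the cells above.

```python
import math
import random


def split_ids(sample_ids, test_split=0.2, seed=42):
    """Return (train_ids, test_ids) using a seeded shuffle of a sorted copy."""
    ids = sorted(sample_ids)            # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ids)    # local generator; the global random state is untouched
    num_train = math.floor((1 - test_split) * len(ids))
    num_test = len(ids) - num_train
    # The first num_test shuffled ids go to test, the rest stay in train
    return ids[num_test:], ids[:num_test]


example_ids = [f'3000-1_{i}' for i in range(1, 11)]
train_ids, test_ids = split_ids(example_ids)
print(len(train_ids), len(test_ids))
```

Note that this still splits at segment level, so segments from the same recording can land in both train and test, exactly as in the cells above.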

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [None]:
train_prompts_path = os.path.join(os.getcwd(), *output_train_path, 'prompts.txt')
# Write mode ('w') so that re-running this cell does not duplicate lines
with open(train_prompts_path, 'w') as outfile:
    for filename in sorted(os.listdir(output_train_folder_transcripts)):
        file_path = os.path.join(output_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [None]:
test_prompts_path = os.path.join(os.getcwd(), *output_test_path, 'prompts.txt')
# Write mode ('w') so that re-running this cell does not duplicate lines
with open(test_prompts_path, 'w') as outfile:
    for filename in sorted(os.listdir(output_test_folder_transcripts)):
        file_path = os.path.join(output_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

**Compress the folders into ```.tar.gz```**

In [None]:
paths_to_compress = [train_prompts_path, output_train_folder_waves, test_prompts_path, output_test_folder_waves]

with tarfile.open(output_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, os.path.join(os.getcwd(), *output_compressed_path))
        tar_gz.add(path, arcname=rel_path) 

In [None]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(output_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [None]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(output_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

**Sanity Check**

In [None]:
with open(train_prompts_path, "r") as f:
    lines = f.readlines()
    train_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
train_prompts_filenames[:10]

In [None]:
train_wavs_filenames = []
for filename in os.listdir(output_train_folder_waves):
    filename = filename.split('.')[0]
    train_wavs_filenames.append(filename)
train_waves_filename = sorted(train_wavs_filenames)

In [None]:
train_waves_filename[:10]

In [None]:
train_prompts_filenames==train_waves_filename

In [None]:
with open(test_prompts_path, "r") as f:
    lines = f.readlines()
    test_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
test_prompts_filenames[:10]

In [None]:
test_wavs_filenames = []
for filename in os.listdir(output_test_folder_waves):
    filename = filename.split('.')[0]
    test_wavs_filenames.append(filename)
test_waves_filename = sorted(test_wavs_filenames)

In [None]:
test_waves_filename[:10]

In [None]:
test_prompts_filenames==test_waves_filename

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 4: How to fix overlap between some audio files and transcriptions**

Example: 

3000-1_12 and 3000-1_13

Audio at the end of 3000-1_12 includes 3/4-ish of text in ```intervals[22]```

Audio at the start of 3000-1_13 includes 1/4-ish of text in ```intervals[1]```

Solution: Segment based on TextGrid files instead of Audio files?
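
The idea of cutting on TextGrid boundaries can be sketched without any audio library. The snippet below is only an illustration under assumptions: `group_intervals` and `group_bounds_ms` are hypothetical names, the entries are made-up `(start, end, label)` tuples, and the final flush of the trailing group is an addition of this sketch (the notebook's own loop stops without writing the last partial group).

```python
def group_intervals(entries, max_duration_s=30.0):
    """Greedily pack consecutive (start, end, label) entries into groups
    whose summed entry durations do not exceed max_duration_s."""
    groups = []
    current = []
    current_duration = 0.0
    for start, end, label in entries:
        duration = end - start
        if duration > max_duration_s:
            # A single over-long entry cannot fit: close the open group and skip it
            if current:
                groups.append(current)
            current, current_duration = [], 0.0
            continue
        if current and current_duration + duration > max_duration_s:
            # Adding this entry would overflow, so the open group is complete
            groups.append(current)
            current, current_duration = [], 0.0
        current.append((start, end, label))
        current_duration += duration
    if current:
        # Flush the trailing group once the entries run out
        groups.append(current)
    return groups


def group_bounds_ms(group):
    """Cut points of one group in milliseconds, taken from entry boundaries."""
    return round(group[0][0] * 1000), round(group[-1][1] * 1000)


# Made-up entries, not taken from the corpus
entries = [
    (0.0, 12.0, "a"),
    (12.0, 25.0, "b"),
    (25.0, 40.0, "c"),
    (40.0, 44.0, "d"),
]
groups = group_intervals(entries)
print([group_bounds_ms(g) for g in groups])
```

Because every cut point is an entry boundary, no interval's text is split between two consecutive audio segments.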

### Step 1:
<u>Initialising the directory</u>
```
dataset
- testing
    - data: Used to store compression files
    - org_wavs: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - waves: Empty
        - transcripts: Empty
    - test
        - waves: Empty
        - transcripts: Empty
```

<br/>

### Step 2:
<u>After running the processing code</u>
```
dataset
- testing
    - data: Used to store compression files
    - org_wavs: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
        - transcripts
            - 3000-1_1.txt
            - 3000-1_2.txt
            - 3000-1_3.txt
            - ...
            - 3000-2_1.txt
            - 3000-2_2.txt
            - 3000-2_3.txt
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
        - transcripts
            - 3000-3_1.txt
            - 3000-3_2.txt
            - 3000-3_3.txt
            - ...
            - 3000-4_1.txt
            - 3000-4_2.txt
            - 3000-4_3.txt
```

<br/>

### Step 3:
<u>After running the compression code</u>

```
data
    - input_name.tar.gz
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
                - ...
                - 3000-2_1.wav
                - 3000-2_2.wav
                - 3000-2_3.wav
        - test
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-3_1.wav
                - 3000-3_2.wav
                - 3000-3_3.wav
                - ...
                - 3000-4_1.wav
                - 3000-4_2.wav
                - 3000-4_3.wav
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the train .wav files -> take this from train/prompts.txt
    - prompts-test.txt.gz
        - prompts-test.txt: Contains transcriptions for all the test .wav files -> take this from test/prompts.txt
```

**Imports**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

**<u>USER INPUT REQUIRED</u> Input Relative Paths**

In [None]:
input_audio_path = ['dataset', 'testing', 'org_wavs']
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
output_train_path = ['dataset', 'testing', 'train']
output_test_path = ['dataset', 'testing', 'test']
output_compressed_path = ['dataset', 'testing']
compressed_filename = 'imda_nsc_p3_testing.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

**Initialise Paths and Create the directories**

**IMPT <u>USER INPUT REQUIRED</u>**: Remember to add the ```.wav``` and ```.TextGrid``` files to ```org_wavs``` and ```org_transcripts``` respectively

In [None]:
input_wav_folder = os.path.join(os.getcwd(), *input_audio_path)
input_textgrid_folder = os.path.join(os.getcwd(), *input_textgrid_path)
output_train_folder_waves = os.path.join(os.getcwd(), *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(os.getcwd(), *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(os.getcwd(), *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(os.getcwd(), *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(os.getcwd(), *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(os.getcwd(), *output_compressed_path, 'data')
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

# Use a name that does not shadow the built-in dir()
for directory in create_dir:
    os.makedirs(directory, exist_ok=True)

**Helper function to clean the transcription**

1. Lower-case the text

2. Remove and replace annotations

- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-English utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```
- Remove all instances of ```<whatever is inside>```

3. Collapse the extra whitespace left behind by removed annotations such as ```<s>```

Refer to the Transcription Guidelines by IMDA

In [None]:
def clean_transcription(transcript):
    transcript = ' '.join(line.strip() for line in transcript)
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript

**Main function**

Matches a single ```.wav``` file to its respective ```.TextGrid``` file

- Break the ```.wav``` file and ```.TextGrid``` file into segments of at most 30s by grouping consecutive TextGrid entries
- Clean the transcription of each segment
- Only keep segments whose transcription is non-empty after cleaning

**How to fix overlap between some audio files and transcriptions**

Example: 

3000-1_12 and 3000-1_13

Audio at the end of 3000-1_12 includes 3/4-ish of text in ```intervals[22]```

Audio at the start of 3000-1_13 includes 1/4-ish of text in ```intervals[1]```

```
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
test_textgrid_path = os.path.join(os.getcwd(), *input_textgrid_path, '3000-1.TextGrid')
tg_test = textgrid.openTextgrid(test_textgrid_path, False) 

for tier_name in tg_test.tierNames: 
    print(tier_name)

>>> 3000-1

for tier_name in tg_test.tierNames: 
    tier = tg_test.getTier(tier_name) 
    for entry in tier.entries:  
        print(entry)

>>> Interval(start=0.0, end=1.556, label='<S>')
>>> Interval(start=1.556, end=2.661, label='(um) you can go first')
>>>...
```

In [None]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_path, sanity_check=False):
    audio_path = os.path.join(os.getcwd(), *input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(os.getcwd(), *input_textgrid_path, f'{audio_filename}.TextGrid')

    output_dir_wav = os.path.join(os.getcwd(), *output_path, 'waves')
    output_dir_transcript = os.path.join(os.getcwd(), *output_path, 'transcripts')

    output_dir_textgrid = os.path.join(os.getcwd(), *output_path, 'textgrids')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    # Specify the duration of each segment
    segment_duration_s = 30 
    # Specify the current segment duration
    curr_segment_duration = 0
    # Specify the current segment index
    segment_index = 1
    # Specify the timestamps traversed for the current segment
    curr_timestamps = []
    # Specify the transcriptions for the current segment
    curr_transcriptions = []

    for tier_name in tg.tierNames: 
        tier = tg.getTier(tier_name) 
        for start,end,label in tier.entries:  
            # Get the duration of this new entry
            entry_duration = end-start
            # If the addition of this new entry to the current segment duration does not exceed
            # our specified duration of each segment, we can accumulate the current segment
            if curr_segment_duration + entry_duration <= segment_duration_s:
                # Update the current_segment_duration
                curr_segment_duration+=entry_duration
                # Update the timestamps and transcriptions
                curr_timestamps.extend([start,end])
                curr_transcriptions.append(label)

            # If the addition of a new entry exceeds our specified duration of each segment
            # that means the current segment is completed and
            # we save the transcription and the segmented audio as well as
            # perform resetting
            else:
                # Clean the transcription
                curr_transcriptions_clean = clean_transcription(curr_transcriptions)
                # If there are words after cleaning
                if len(curr_transcriptions_clean) > 0:
                    # Initialise the transcription segment path
                    transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
                    # Write the transcription to the transcription segment file
                    with open(transcript_segment_path, 'w') as f:
                        f.write(f'{audio_filename}_{segment_index} {curr_transcriptions_clean}')
                    # Calculate the boundaries for the audio segment in ms
                    segment_start = min(curr_timestamps)*1000
                    segment_end = max(curr_timestamps)*1000

                    # Sanity check on TextGrid Segments
                    if sanity_check:
                        tg_segment = tg.crop(segment_start / 1000, segment_end / 1000, mode="strict", rebaseToZero=False)
                        tg_segment_path = os.path.join(output_dir_textgrid, f'{audio_filename}_{segment_index}.TextGrid')
                        tg_segment.save(tg_segment_path, "long_textgrid", True)

                    # Segment the audio using the start and end times from the TextGrid
                    audio_segment = audio[segment_start:segment_end]

                    # Save the audio segment
                    audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
                    audio_segment.export(audio_segment_path, format="wav")

                # Resetting
                # If this single entry is at most 30s long, it starts the next segment
                if entry_duration <= segment_duration_s:
                    # Reset the current segment duration
                    curr_segment_duration = entry_duration
                    # Reset the current timestamps to include the start and end of this iteration
                    curr_timestamps = [start,end]
                    # Reset the current transcriptions to include the label of this iteration
                    curr_transcriptions = [label]
                # Skip the entry as a sample if it is longer than 30s
                else:
                    # Reset the current segment duration
                    curr_segment_duration = 0
                    # Reset the current timestamps to empty
                    curr_timestamps = []
                    # Reset the current transcriptions to empty
                    curr_transcriptions = []

                # Increment the segment index only if there were transcriptions (and thus audio) to be saved
                if len(curr_transcriptions_clean) > 0:
                    # Increment the segment index
                    segment_index+=1

**Run the main function to segment 30s chunks for each ```.wav``` and ```.TextGrid``` file**

Output is the segmented ```.wav``` files and transcriptions for each ```.wav``` file stored in ```train/waves``` and ```train/transcripts``` respectively

Note: All segments are first written to the train folder; a portion of them is moved to the test folder in a later step

A sanity check can be set to True to view the segmented ```.TextGrid``` files in ```./train/textgrids/```

In [None]:
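audio_path = os.path.join(os.getcwd(), *input_audio_path)
for filename in os.listdir(audio_path):
    # Skip anything that is not a .wav recording (e.g. hidden system files)
    if not filename.lower().endswith('.wav'):
        continue
    # Drop only the extension so that names containing other dots stay intact
    filename = os.path.splitext(filename)[0]
    process_audio_transcript(filename, input_audio_path, input_textgrid_path, output_train_path, True)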

**Move a split of the ```.wav``` files and ```.txt``` files to test**

In [None]:
test_split = 0.2

sample_filenames = []
for filename in os.listdir(output_train_folder_waves):
    sample_filenames.append(filename.split('.')[0])

samples = len(sample_filenames)

num_train_samples = math.floor((1-test_split)*samples)
num_test_samples = samples-num_train_samples

print(f"The total number of samples is {samples}")
print(f"The total number of training samples will be {num_train_samples}")
print(f"The total number of test samples will be {num_test_samples}")

In [None]:
random.shuffle(sample_filenames)

In [None]:
for i in range(num_test_samples):
    filename = sample_filenames[i]

    source_wav = os.path.join(output_train_folder_waves, filename + '.wav')
    destination_wav = os.path.join(output_test_folder_waves)
    shutil.move(source_wav, destination_wav)

    source_transcript = os.path.join(output_train_folder_transcripts, filename + '.txt')
    destination_transcript = os.path.join(output_test_folder_transcripts)
    shutil.move(source_transcript, destination_transcript)

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [None]:
train_prompts_path = os.path.join(os.getcwd(), *output_train_path, 'prompts.txt')
# Open in 'w' mode so that re-running this cell does not append duplicate lines
with open(train_prompts_path, 'w') as outfile:
    # Sort the listing so the prompt order is stable across runs
    for filename in sorted(os.listdir(output_train_folder_transcripts)):
        file_path = os.path.join(output_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [None]:
test_prompts_path = os.path.join(os.getcwd(), *output_test_path, 'prompts.txt')
# Open in 'w' mode so that re-running this cell does not append duplicate lines
with open(test_prompts_path, 'w') as outfile:
    # Sort the listing so the prompt order is stable across runs
    for filename in sorted(os.listdir(output_test_folder_transcripts)):
        file_path = os.path.join(output_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

**Compress the folders into ```.tar.gz```**

In [None]:
paths_to_compress = [train_prompts_path, output_train_folder_waves, test_prompts_path, output_test_folder_waves]

with tarfile.open(output_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, os.path.join(os.getcwd(), *output_compressed_path))
        tar_gz.add(path, arcname=rel_path) 

In [None]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(output_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [None]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(output_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
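
To confirm that an archive matches the layout described in Step 3, its member names can be listed after writing. The following is a minimal sketch on a throwaway directory: `archive_members` is a hypothetical helper, and the file names and placeholder bytes are invented for the example rather than taken from the dataset.

```python
import os
import tarfile
import tempfile


def archive_members(archive_path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as root:
    # Stand-in layout; the names are made up for this example
    waves_dir = os.path.join(root, "train", "waves")
    os.makedirs(waves_dir)
    prompts = os.path.join(root, "train", "prompts.txt")
    with open(prompts, "w") as f:
        f.write("3000-1_1 hello\n")
    with open(os.path.join(waves_dir, "3000-1_1.wav"), "wb") as f:
        f.write(b"")  # placeholder bytes, not real audio

    archive = os.path.join(root, "demo.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in (prompts, waves_dir):
            # arcname keeps paths relative to the root, as in the compression cell
            tar.add(path, arcname=os.path.relpath(path, root))

    members = archive_members(archive)

print(members)
```

In the notebook, the same helper could be pointed at ```output_compressed_file``` to check that only ```train/...``` and ```test/...``` paths appear.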

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

**Sanity Check**

In [None]:
with open(train_prompts_path, "r") as f:
    lines = f.readlines()
    train_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
train_prompts_filenames[:10]

In [None]:
train_wavs_filenames = []
for filename in os.listdir(output_train_folder_waves):
    filename = filename.split('.')[0]
    train_wavs_filenames.append(filename)
train_waves_filename = sorted(train_wavs_filenames)

In [None]:
train_waves_filename[:10]

In [None]:
train_prompts_filenames==train_waves_filename

In [None]:
with open(test_prompts_path, "r") as f:
    lines = f.readlines()
    test_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
test_prompts_filenames[:10]

In [None]:
test_wavs_filenames = []
for filename in os.listdir(output_test_folder_waves):
    filename = filename.split('.')[0]
    test_wavs_filenames.append(filename)
test_waves_filename = sorted(test_wavs_filenames)

In [None]:
test_waves_filename[:10]

In [None]:
test_prompts_filenames==test_waves_filename

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 5: Segment audio to only include the main speaker's speech**

Example: 

3000-1_27 previously from Iteration 4

In this segment, other people (not the main speaker) were talking. The ground truth from Iteration 4 only includes the main speaker's speech, but the ASR model may also transcribe the non-main speaker's speech. Scoring it against a main-speaker-only transcript is unfair to the model and distorts both training and evaluation

Solution: Segment per TextGrid entry, and keep an entry only if it still has a ground-truth transcription after cleaning

### Step 1:
<u>Initialising the directory</u>
```
dataset
- testing
    - data: Used to store compression files
    - org_wavs: Manually add in .wav files to be segmented
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files to be segmented
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - waves: Empty
        - transcripts: Empty
    - test
        - waves: Empty
        - transcripts: Empty
```

<br/>

### Step 2:
<u>After running the processing code</u>
```
dataset
- testing
    - data: Used to store compression files
    - org_wavs: Manually add in .wav files
        - 3000-1.wav
        - 3000-2.wav
        - ...
    - org_transcripts: Manually add in .TextGrid files
        - 3000-1.TextGrid
        - 3000-2.TextGrid
        - ...
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
        - transcripts
            - 3000-1_1.txt
            - 3000-1_2.txt
            - 3000-1_3.txt
            - ...
            - 3000-2_1.txt
            - 3000-2_2.txt
            - 3000-2_3.txt
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
        - transcripts
            - 3000-3_1.txt
            - 3000-3_2.txt
            - 3000-3_3.txt
            - ...
            - 3000-4_1.txt
            - 3000-4_2.txt
            - 3000-4_3.txt
```

<br/>

### Step 3:
<u>After running the compression code</u>

```
data
    - input_name.tar.gz
        - train
            - prompts.txt: Contains transcriptions for all the .wav files in train
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
                - ...
                - 3000-2_1.wav
                - 3000-2_2.wav
                - 3000-2_3.wav
        - test
            - prompts.txt: Contains transcriptions for all the .wav files in test
            - waves
                - 3000-3_1.wav
                - 3000-3_2.wav
                - 3000-3_3.wav
                - ...
                - 3000-4_1.wav
                - 3000-4_2.wav
                - 3000-4_3.wav
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the train .wav files -> taken from train/prompts.txt
    - prompts-test.txt.gz
        - prompts-test.txt: Contains transcriptions for all the test .wav files -> taken from test/prompts.txt
```

**Imports**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

**<u>USER INPUT REQUIRED</u> Input Relative Paths**

In [None]:
input_audio_path = ['dataset', 'testing', 'org_wavs']
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
output_train_path = ['dataset', 'testing', 'train']
output_test_path = ['dataset', 'testing', 'test']
output_compressed_path = ['dataset', 'testing']
compressed_filename = 'imda_nsc_p3_testing.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

**Initialise Paths and Create the directories**

**IMPT <u>USER INPUT REQUIRED</u>**: Remember to add the ```.wav``` and ```.TextGrid``` files to ```org_wavs``` and ```org_transcripts``` respectively

In [None]:
input_wav_folder = os.path.join(os.getcwd(), *input_audio_path)
input_textgrid_folder = os.path.join(os.getcwd(), *input_textgrid_path)
output_train_folder_waves = os.path.join(os.getcwd(), *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(os.getcwd(), *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(os.getcwd(), *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(os.getcwd(), *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(os.getcwd(), *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(os.getcwd(), *output_compressed_path, 'data')
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

# Use a name that does not shadow the built-in dir()
for directory in create_dir:
    os.makedirs(directory, exist_ok=True)

**Helper function to clean the transcription**

1. Lower-case the text

2. Remove and replace annotations

- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-English utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```
- Remove all instances of ```<whatever is inside>```

3. Collapse the extra whitespace left behind by removed annotations such as ```<s>```

Refer to the Transcription Guidelines by IMDA

In [None]:
def clean_transcription(transcript):
    transcript = transcript.strip()
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript
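
A few worked inputs make the cleaning rules easier to check by eye. The sketch below uses `clean_label`, a hypothetical condensed restatement of the rules listed above (not the notebook's function itself). The first two labels come from the TextGrid printout in Iteration 4; the other two are invented for illustration.

```python
import re

# Condensed patterns (assumption: same effect as the longer list above)
REMOVE_PATTERNS = [
    r'\(pp[bclo]\)',      # paralinguistic markers such as (ppl)
    r'[_\[\]()!#~*]',     # single-character annotation symbols
    r'<[^>]+>',           # any remaining <tag> or <tag/>
]


def clean_label(label):
    """Lower-case a label, strip annotations and collapse whitespace."""
    text = label.strip().lower()
    for pattern in REMOVE_PATTERNS:
        text = re.sub(pattern, '', text)
    text = text.replace('-', ' ')  # multi-word nouns: hyphen becomes a space
    return re.sub(r'\s+', ' ', text).strip()


samples = [
    '<S>',
    '(um) you can go first',
    'I study at N_T_U [lah] <c/> near the car-park',
    '(ppl) <UNK> !wah! #makan#',
]
cleaned = [clean_label(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), '->', repr(out))
```

An entry such as ```<S>``` cleans to an empty string, which is the case the main function uses to decide that an entry carries no ground-truth transcription.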

**Main function**

- Matches a single ```.wav``` file to its respective ```.TextGrid``` file

- Break the ```.wav``` file and ```.TextGrid``` file into one segment per TextGrid entry, keeping only entries that are <= 30s long and still have text after cleaning


In [None]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_path, sanity_check=False):
    audio_path = os.path.join(os.getcwd(), *input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(os.getcwd(), *input_textgrid_path, f'{audio_filename}.TextGrid')

    output_dir_wav = os.path.join(os.getcwd(), *output_path, 'waves')
    output_dir_transcript = os.path.join(os.getcwd(), *output_path, 'transcripts')

    output_dir_textgrid = os.path.join(os.getcwd(), *output_path, 'textgrids')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    # Specify the duration of each segment
    segment_duration_s = 30 
    # Specify the current segment index
    segment_index = 1

    for tier_name in tg.tierNames: 
        tier = tg.getTier(tier_name) 
        for start,end,label in tier.entries:  
            # Get the duration of this new entry
            entry_duration = end-start
            # If the entry's duration does not exceed our specified duration of each segment
            if entry_duration <= segment_duration_s:
                # Clean the transcription/label of this entry
                curr_transcriptions_clean = clean_transcription(label)
                # If this entry has text after cleaning i.e. contains proper ground truth transcription
                if len(curr_transcriptions_clean) > 0:
                    # Initialise the transcription segment path
                    transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
                    # Write the transcription to the transcription segment file
                    with open(transcript_segment_path, 'w') as f:
                        f.write(f'{audio_filename}_{segment_index} {curr_transcriptions_clean}')

                    # Calculate the boundaries for the audio segment in ms
                    segment_start = start*1000
                    segment_end = end*1000

                    # Sanity check on TextGrid Segments
                    if sanity_check:
                        tg_segment = tg.crop(segment_start / 1000, segment_end / 1000, mode="strict", rebaseToZero=False)
                        tg_segment_path = os.path.join(output_dir_textgrid, f'{audio_filename}_{segment_index}.TextGrid')
                        tg_segment.save(tg_segment_path, "long_textgrid", True)

                    # Segment the audio using the start and end times from the current TextGrid entry
                    audio_segment = audio[segment_start:segment_end+1] # Add 1 ms s.t the end timing is inclusive

                    # Save the audio segment
                    audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
                    audio_segment.export(audio_segment_path, format="wav")

                    # Increment the segment index
                    segment_index+=1

**Run the main function to create segments for each ```.wav``` and ```.TextGrid``` file**

Output is the segmented ```.wav``` audio files and corresponding ```.txt``` transcription files, which are stored in ```train/waves``` and ```train/transcripts``` respectively

Note: All segments are first written to the train folder; a portion of them is moved to the test folder in a later step

A sanity check can be set to ```True``` to view the corresponding segmented ```.TextGrid``` files in ```./train/textgrids/```

In [None]:
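audio_path = os.path.join(os.getcwd(), *input_audio_path)
for filename in os.listdir(audio_path):
    # Skip anything that is not a .wav recording (e.g. hidden system files)
    if not filename.lower().endswith('.wav'):
        continue
    # Drop only the extension so that names containing other dots stay intact
    filename = os.path.splitext(filename)[0]
    process_audio_transcript(filename, input_audio_path, input_textgrid_path, output_train_path, True)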

**Move a split of the ```.wav``` files and ```.txt``` files to test**

In [None]:
test_split = 0.2

sample_filenames = []
for filename in os.listdir(output_train_folder_waves):
    sample_filenames.append(filename.split('.')[0])

samples = len(sample_filenames)

num_train_samples = math.floor((1-test_split)*samples)
num_test_samples = samples-num_train_samples

print(f"The total number of samples is {samples}")
print(f"The total number of training samples will be {num_train_samples}")
print(f"The total number of test samples will be {num_test_samples}")

In [None]:
random.shuffle(sample_filenames)

In [None]:
for i in range(num_test_samples):
    filename = sample_filenames[i]

    source_wav = os.path.join(output_train_folder_waves, filename + '.wav')
    destination_wav = os.path.join(output_test_folder_waves)
    shutil.move(source_wav, destination_wav)

    source_transcript = os.path.join(output_train_folder_transcripts, filename + '.txt')
    destination_transcript = os.path.join(output_test_folder_transcripts)
    shutil.move(source_transcript, destination_transcript)
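
The shuffle above is unseeded, so each run moves a different set of files to test. If a repeatable split is wanted, one option is to sort the ids and shuffle with a local seeded generator. This is a sketch under assumptions: `split_ids` and its `seed` parameter are hypothetical, and the ids are generated for the example.

```python
import math
import random


def split_ids(sample_ids, test_split=0.2, seed=0):
    """Return (train_ids, test_ids) from a seeded shuffle of the sorted ids."""
    ids = sorted(sample_ids)            # os.listdir gives no ordering guarantee
    random.Random(seed).shuffle(ids)    # local generator; global random state is untouched
    num_train = math.floor((1 - test_split) * len(ids))
    num_test = len(ids) - num_train
    # The first num_test shuffled ids go to test, mirroring the move loop above
    return ids[num_test:], ids[:num_test]


# Made-up segment ids in the notebook's naming style
ids = [f"3000-1_{i}" for i in range(1, 11)]
train_ids, test_ids = split_ids(ids)
print(len(train_ids), len(test_ids))
```

The returned `test_ids` could then drive the same `shutil.move` calls, giving the same test set on every run for a fixed seed.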

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [None]:
train_prompts_path = os.path.join(os.getcwd(), *output_train_path, 'prompts.txt')
# Open in 'w' mode so that re-running this cell does not append duplicate lines
with open(train_prompts_path, 'w') as outfile:
    # Sort the listing so the prompt order is stable across runs
    for filename in sorted(os.listdir(output_train_folder_transcripts)):
        file_path = os.path.join(output_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [None]:
test_prompts_path = os.path.join(os.getcwd(), *output_test_path, 'prompts.txt')
# Open in 'w' mode so that re-running this cell does not append duplicate lines
with open(test_prompts_path, 'w') as outfile:
    # Sort the listing so the prompt order is stable across runs
    for filename in sorted(os.listdir(output_test_folder_transcripts)):
        file_path = os.path.join(output_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

**Compress the folders into ```.tar.gz```**

In [None]:
paths_to_compress = [train_prompts_path, output_train_folder_waves, test_prompts_path, output_test_folder_waves]

with tarfile.open(output_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, os.path.join(os.getcwd(), *output_compressed_path))
        tar_gz.add(path, arcname=rel_path) 

In [None]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(output_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [None]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(output_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

**Sanity Check**

In [None]:
with open(train_prompts_path, "r") as f:
    lines = f.readlines()
    train_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
train_prompts_filenames[:10]

In [None]:
train_wavs_filenames = []
for filename in os.listdir(output_train_folder_waves):
    filename = filename.split('.')[0]
    train_wavs_filenames.append(filename)
train_waves_filename = sorted(train_wavs_filenames)

In [None]:
train_waves_filename[:10]

In [None]:
train_prompts_filenames==train_waves_filename

In [None]:
with open(test_prompts_path, "r") as f:
    lines = f.readlines()
    test_prompts_filenames = sorted([l.split(' ')[0] for l in lines])

In [None]:
test_prompts_filenames[:10]

In [None]:
test_wavs_filenames = []
for filename in os.listdir(output_test_folder_waves):
    filename = filename.split('.')[0]
    test_wavs_filenames.append(filename)
test_waves_filename = sorted(test_wavs_filenames)

In [None]:
test_waves_filename[:10]

In [None]:
test_prompts_filenames==test_waves_filename
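
A plain `==` only says whether the two lists differ, not where. The following is a minimal sketch that reports the differing IDs, assuming the same conventions as the cells above (the ID is the first space-separated token of each prompt line, and the part of a wav filename before the first `.`). `compare_ids` is a hypothetical helper, not part of the pipeline.

```python
def compare_ids(prompt_lines, wav_filenames):
    """Return (ids only in prompts, ids only in waves), each sorted."""
    prompt_ids = {line.split(' ')[0].strip() for line in prompt_lines if line.strip()}
    wav_ids = {name.split('.')[0] for name in wav_filenames}
    return sorted(prompt_ids - wav_ids), sorted(wav_ids - prompt_ids)


# Toy data standing in for prompts.txt lines and os.listdir(<waves folder>)
missing_wavs, missing_prompts = compare_ids(
    ["3000-1_1 hello there\n", "3000-1_2 okay lah\n"],
    ["3000-1_1.wav", "3000-1_3.wav"],
)
print(missing_wavs)     # IDs with a transcript but no audio
print(missing_prompts)  # IDs with audio but no transcript
```

Because sets are used, duplicated prompt lines (possible when `prompts.txt` is opened in append mode and the cell is rerun) are not detected by this check.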

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 6: Fix error ```ParsingError: Expected field in Textgrid missing```**

TODOs
- Change input paths of non-hard drive file
- Check why there is an error with the file
- Combine the files into 30s?

## Overview of code flow

### Step 1:
<u>After running the code to initialise the directory</u>
```
D:\
- org_wavs: Manually add in .wav files to be segmented
    - 3000-1.wav
    - 3000-2.wav
    - ...
- org_transcripts: Manually add in .TextGrid files to be segmented
    - 3000-1.TextGrid
    - 3000-2.TextGrid
    - ...

dataset
- data: Used to store compression files
- train
    - waves: Empty
    - transcripts: Empty
- test
    - waves: Empty
    - transcripts: Empty
```

<br/>

### Step 2:
<u>After running the processing code</u>
```
D:\
- org_wavs: Manually add in .wav files to be segmented
    - 3000-1.wav
    - 3000-2.wav
    - ...
- org_transcripts: Manually add in .TextGrid files to be segmented
    - 3000-1.TextGrid
    - 3000-2.TextGrid
    - ...

dataset
- data: Used to store compression files
- train
    - prompts.txt: Contains transcriptions for all the .wav files in train
    - waves
        - 3000-1_1.wav
        - 3000-1_2.wav
        - 3000-1_3.wav
        - ...
        - 3000-2_1.wav
        - 3000-2_2.wav
        - 3000-2_3.wav
    - transcripts
        - 3000-1_1.txt
        - 3000-1_2.txt
        - 3000-1_3.txt
        - ...
        - 3000-2_1.txt
        - 3000-2_2.txt
        - 3000-2_3.txt
- test
    - prompts.txt: Contains transcriptions for all the .wav files in test
    - waves
        - 3000-3_1.wav
        - 3000-3_2.wav
        - 3000-3_3.wav
        - ...
        - 3000-4_1.wav
        - 3000-4_2.wav
        - 3000-4_3.wav
    - transcripts
        - 3000-3_1.txt
        - 3000-3_2.txt
        - 3000-3_3.txt
        - ...
        - 3000-4_1.txt
        - 3000-4_2.txt
        - 3000-4_3.txt
```

<br/>

### Step 3:
<u>After running the compression code</u>

```
data
- imda_nsc_p3.tar.gz
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files -> taken from train/prompts.txt
- prompts-test.txt.gz
    - prompts-test.txt: Contains transcriptions for all the test .wav files -> taken from test/prompts.txt
```
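
To confirm that an archive really has the layout sketched above, its member names can be listed with the standard `tarfile` module. This is a minimal sketch; `list_archive_members` is a hypothetical helper name.

```python
import tarfile

def list_archive_members(archive_path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling it on the archive written by the compression cell (for example `list_archive_members(output_compressed_file)`) should show names beginning with `train/` and `test/`. Names beginning with `..` would suggest that `arcname` was computed relative to the wrong folder.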

<br/>
<br/>
<br/>

**Checking file 3025-1**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

In [None]:
input_textgrid_path = ['dataset', 'testing', 'org_transcripts']
input_textgrid_folder = os.path.join(os.getcwd(), *input_textgrid_path)
create_dir = [input_textgrid_folder]

for dir in create_dir:
    os.makedirs(dir, exist_ok=True)

In [None]:
textgrid_path = os.path.join(input_textgrid_folder, '3025-1-test-2.TextGrid')
tg = textgrid.openTextgrid(textgrid_path, False) 

Parsing fails once the test file includes everything from this interval onwards:

```
intervals [1115]:
            xmin = 3925.5698506061176 
            xmax = 3928.472 
            text = "to the item [lah] that that the that the owner has"

onwards
```

Conclusion: praatio cannot parse a label whose text contains ```item [something]```

Example:

```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 8245.897 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "3025-1" 
        xmin = 0 
        xmax = 8245.897 
        intervals: size = 2603 
        intervals [1]:
            xmin = 0 
            xmax = 2.706 
            text = "<S>" 
        intervals [2]:
            xmin = 2.706 
            xmax = 4.018 
            text = "okay item [testing] hi Joshua" 
        intervals [3]:
            xmin = 4.018 
            xmax = 6.268 
            text = "<S>"
```

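This is probably because of the way the function ```praatIO/praatio/utilities/textgrid_io.py/_parseNormalTextgrid(data: str)``` segments the file: it appears to split on ```item``` and ```[ ]```, so the same pattern inside a label is mistaken for file structure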
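
Before modifying any file, it can help to list the labels that contain the suspect pattern. The sketch below scans the raw TextGrid text with a regular expression. It does not reproduce praatio's parsing logic, `find_suspect_labels` is a hypothetical helper, and labels containing escaped double quotes are not handled.

```python
import re

def find_suspect_labels(textgrid_text, keywords=("item",)):
    """Return labels whose text contains '<keyword> [' for any given keyword."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # [^"\n]* keeps the match inside one quoted label on one line
    pattern = re.compile(rf'text = "([^"\n]*(?:{alternation}) \[[^"\n]*)"')
    return pattern.findall(textgrid_text)


sample = (
    'intervals [2]:\n'
    '    text = "okay item [testing] hi Joshua"\n'
    'intervals [3]:\n'
    '    text = "<S>"\n'
)
suspects = find_suspect_labels(sample)
print(suspects)
```

Structural lines such as `intervals [2]:` are ignored because only text inside `text = "..."` is inspected.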


**Solution 1: Remove all instances of the ```[,]``` in ```text = "... item [something]..."```**

In [None]:
text_restriction = r'text = "(.*?item \[.*?\].*?)"'
test_text = 'text = "to the item [lah] that that the that the owner has"'

def replace_brackets(match):
    print("Match")
    print(match.group(0))
    text_content = match.group(1)
    text_content = text_content.replace("[", "").replace("]", "")
    return f'text = "{text_content}"'

# re.sub receives the regex pattern, a replacement function, and the input string to search.
# The function is called once for each match found in the input string and receives a match object,
# which represents one specific occurrence of the pattern.
# The value the function returns is used as the replacement text for that match.
test_text_fixed = re.sub(text_restriction, replace_brackets, test_text)

In [None]:
input_path = os.path.join(input_textgrid_folder, '3025-1-test-2.TextGrid')
output_path = os.path.join(input_textgrid_folder, '3025-1-test-2-fixed.TextGrid')

# PraatIO seems to try utf-8 and utf-16
try:
    with open(input_path, "r", encoding="utf-16") as file:
        content = file.read()
    encoding = "utf-16"
except UnicodeError:
    with open(input_path, "r", encoding="utf-8") as file:
        content = file.read()
    encoding = "utf-8"

print("The content is: ")
print(content)
print("")

text_restriction = r'text = "(.*?item \[.*?\].*?)"'

def replace_brackets(match):
    print("Match:")
    print(match)
    print("Match group 0")
    print(match.group(0))
    print("Match group 1")
    print(match.group(1))
    text_content = match.group(1)
    text_content = text_content.replace("[", "").replace("]", "")
    return f'text = "{text_content}"'

content_fixed = re.sub(text_restriction, replace_brackets, content)

with open(output_path, "w", encoding=encoding) as file:
    file.write(content_fixed)


**Test on 3025-1**

In [None]:
def remove_text_restriction(textgrid_path):
    try:
        with open(textgrid_path, "r", encoding="utf-16") as file:
            textgrid = file.read()
        encoding = "utf-16"
    except UnicodeError:
        with open(textgrid_path, "r", encoding="utf-8") as file:
            textgrid = file.read()
        encoding = "utf-8"

    text_restriction = r'text = "(.*?item \[.*?\].*?)"'

    def replace_brackets(match):
        print("Match:")
        print(match.group(0))
        print("")
        text_content = match.group(1)
        text_content = text_content.replace("[", "").replace("]", "")
        return f'text = "{text_content}"'

    textgrid_fixed = re.sub(text_restriction, replace_brackets, textgrid)

    with open(textgrid_path, "w", encoding=encoding) as file:
        file.write(textgrid_fixed)

In [None]:
textgrid_path = os.path.join(input_textgrid_folder, '3025-1.TextGrid')
remove_text_restriction(textgrid_path)
tg = textgrid.openTextgrid(textgrid_path, False) 

<br/>
<br/>
<br/>

### **Iteration 7: Fix more TextGrid errors**

**Imports**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

**<u>USER INPUT REQUIRED</u> Change Relative Paths and Naming Conventions if you want**

In [None]:
org_transcripts_path = ['clean_textgrid', 'org_transcripts']
testing_transcripts_path = ['clean_textgrid', 'testing']

**Initialise Paths and Create the directories**

**<u>USER INPUT REQUIRED</u>**: Remember to add in the <u>original</u> ```.TextGrid``` files provided by IMDA NSC to ```org_transcripts``` in the directory below after running the code block below

In [None]:
org_transcripts_folder = os.path.join(os.getcwd(), *org_transcripts_path)
testing_transcripts_folder = os.path.join(os.getcwd(), *testing_transcripts_path)
create_dir = [org_transcripts_folder,testing_transcripts_folder]

for dir in create_dir:
    os.makedirs(dir, exist_ok=True)

**Helper function to remove instances of ```text = "...item [something]..."``` from a single TextGrid file**

In [None]:
def remove_text_restriction(textgrid_path):
    try:
        with open(textgrid_path, "r", encoding="utf-16") as file:
            textgrid = file.read()
        encoding = "utf-16"
    except UnicodeError:
        with open(textgrid_path, "r", encoding="utf-8") as file:
            textgrid = file.read()
        encoding = "utf-8"

    text_restriction = r'text = "(.*?item \[.*?\].*?)"'

    def replace_brackets(match):
        print(textgrid_path)
        print("Match:")
        print(match.group(0))
        print("")
        text_content = match.group(1)
        text_content = text_content.replace("[", "").replace("]", "")
        return f'text = "{text_content}"'

    textgrid_fixed = re.sub(text_restriction, replace_brackets, textgrid)

    with open(textgrid_path, "w", encoding=encoding) as file:
        file.write(textgrid_fixed)

In [None]:
cleaned_successfully = []
cleaned_unsuccessfully = []
for filename in os.listdir(org_transcripts_folder):
    textgrid_path = os.path.join(org_transcripts_folder, filename)
    try:
        tg = textgrid.openTextgrid(textgrid_path, False)
    except Exception:
        # Parsing failed: strip the problematic brackets in place, then retry once
        remove_text_restriction(textgrid_path)
        try:
            tg = textgrid.openTextgrid(textgrid_path, False)
            cleaned_successfully.append(filename)
        except Exception:
            cleaned_unsuccessfully.append(filename)

In [None]:
cleaned_successfully

In [None]:
cleaned_unsuccessfully

**File 3035-2**

Conclusion: Skip it because an interval has zero duration (```xmin``` equals ```xmax```) yet still carries a transcription

In [None]:
filename = '3035-2.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

```
intervals [1216]:
    xmin = 3059.354 
    xmax = 3059.354 
    text = "that time got p_s_l_e or not"
```

**File 3075-2**

Conclusion: Skip it because an interval has zero duration (```xmin``` equals ```xmax```) yet still carries a transcription

In [None]:
filename = '3075-2.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

```
intervals [385]:
    xmin = 894.703
    xmax = 894.703
    text = "what is your [eh] what is a deal maker in your search for a partner"
```

**File 3083-1**

Conclusion: Account for ```text = "...intervals [something]..."```

In [None]:
filename = '3083-1.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

Found a ```text = "<UNK ya the the bush is green"``` (an unclosed ```<UNK``` tag). Should the cleaning be updated to account for ```<UNK```? Never mind: it looks like a rare occurrence, and removing ```<``` and ```>``` by themselves would break the handling of the single-character annotations.

In [None]:
filename = '3083-1-test.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

The failure is caused by ```intervals [something]``` inside a label, the same problem as with ```item [something]```

```
intervals [2002]:
    xmin = 5548.948 
    xmax = 5550.81 
    text = "what's it called intervals [ah] they call it intervals"
```

In [None]:
text_restriction = r'text = "(.*?intervals \[.*?\].*?)"'
test_text = 'text = "whats it called intervals [ah] they call it intervals"'

def replace_brackets(match):
    print("Match")
    print(match.group(0))
    text_content = match.group(1)
    text_content = text_content.replace("[", "").replace("]", "")
    return f'text = "{text_content}"'

# re.sub receives the regex pattern, a replacement function, and the input string to search.
# The function is called once for each match found in the input string and receives a match object,
# which represents one specific occurrence of the pattern.
# The value the function returns is used as the replacement text for that match.
test_text_fixed = re.sub(text_restriction, replace_brackets, test_text)

In [None]:
test_text_fixed

**File 3143-2**

Conclusion: Skip it because there is an overlap in transcription timing

In [None]:
filename = '3143-2.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

```
intervals [2116]:
    xmin = 6414.084
    xmax = 6418.1359
    text = "I dare not talk one [ah] you know I I been through my life (uh)"
intervals [2117]:
    xmin = 6418.135
    xmax = 6419.022
    text = "first"
```

**File 3201-1**

Conclusion: Skip it because an interval has zero duration (```xmin``` equals ```xmax```) yet still carries a transcription

In [None]:
filename = '3201-1.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

```
intervals [251]:
    xmin = 561.919
    xmax = 561.919
    text = "(uh) so for mine"
```

**File 3250-2**

Conclusion: Skip it because there is an overlap in transcription timing

In [None]:
filename = '3250-2.TextGrid'
textgrid_path = os.path.join(testing_transcripts_folder, filename)
tg = textgrid.openTextgrid(textgrid_path, False)

```
intervals [2004]:
    xmin = 5813.161
    xmax = 55816.75
    text = "I will do it at night (ppo)"
intervals [2005]:
    xmin = 5814.861
    xmax = 5815.399
    text = "<S>"
```

<br/>
<br/>

**<u>Files to Ignore</u>**

- 3035-2: Zero-duration interval (```xmin``` equals ```xmax```) that still carries a transcription
- 3075-2: Zero-duration interval (```xmin``` equals ```xmax```) that still carries a transcription
- 3143-2: Overlap in transcription timing
- 3201-1: Zero-duration interval (```xmin``` equals ```xmax```) that still carries a transcription
- 3250-2: Overlap in transcription timing
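
The two kinds of timing problem listed above can be checked mechanically once intervals are available as `(start, end, label)` tuples. This is a minimal sketch on plain tuples, so it does not depend on how the intervals were read. `find_timing_problems` is a hypothetical helper, and the sample values are taken from the excerpts in this section.

```python
def find_timing_problems(entries):
    """Flag zero-duration intervals with text and intervals that overlap the previous one.

    entries: list of (start, end, label) tuples in file order.
    Returns a list of (1-based index, description) tuples.
    """
    problems = []
    previous_end = None
    for index, (start, end, label) in enumerate(entries, start=1):
        if end <= start and label.strip():
            problems.append((index, "zero-duration interval with text"))
        if previous_end is not None and start < previous_end:
            problems.append((index, "starts before the previous interval ends"))
        previous_end = end
    return problems


entries = [
    (6414.084, 6418.1359, "I dare not talk one"),
    (6418.135, 6419.022, "first"),             # starts 0.0009 s before the previous end
    (6419.022, 6419.022, "(uh) so for mine"),  # xmin == xmax but has text
]
problems = find_timing_problems(entries)
print(problems)
```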

**Files to Rename and Delete to match the ```.wav``` files**

In [None]:
for filename in os.listdir(org_transcripts_folder):
    if len(filename.split(".")[0])>6:
        print(filename)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 8: Solve the TextGrid errors**

**Imports**

In [None]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment

**<u>USER INPUT REQUIRED</u>**

Change Relative Paths and Naming Conventions if you want

In [None]:
org_transcripts_path = ['clean_textgrid', 'org_transcripts']

**Initialise Paths and Create the directories**

**<u>USER INPUT REQUIRED</u>**: 

Remember to add in the <u>original</u> ```.TextGrid``` files provided by IMDA NSC to ```clean_textgrid/org_transcripts``` in the directory below after running the code block below

In [None]:
org_transcripts_folder = os.path.join(os.getcwd(), *org_transcripts_path)
create_dir = [org_transcripts_folder]

for dir in create_dir:
    os.makedirs(dir, exist_ok=True)

Rename the following files:
- 3108-1_edited.TextGrid: Rename to 3108-1.TextGrid
- 3115-1 9 (Update 2.05).TextGrid: Rename to 3115-1.TextGrid
- 3115-2 (Update 2.05).TextGrid: Rename to 3115-2.TextGrid
- 3209-1_edited.TextGrid: Rename to 3209-1.TextGrid

Delete the following files: 
- 3115-1 (Update 2.04).TextGrid: Delete because outdated
- 3115-2 (Update 2.04).TextGrid: Delete because outdated
- 3035-2.TextGrid: Zero-duration interval that still carries a transcription
- 3075-2.TextGrid: Zero-duration interval that still carries a transcription
- 3143-2.TextGrid: Overlap in transcription timing
- 3201-1.TextGrid: Zero-duration interval that still carries a transcription
- 3250-2.TextGrid: Overlap in transcription timing

In [None]:
files_to_delete = ['3115-1 (Update 2.04).TextGrid', '3115-2 (Update 2.04).TextGrid', '3035-2.TextGrid', 
                   '3075-2.TextGrid', '3143-2.TextGrid', '3201-1.TextGrid', '3250-2.TextGrid']

files_to_rename = {
    "3108-1_edited.TextGrid": "3108-1.TextGrid",
    "3115-1 9 (Update 2.05).TextGrid": "3115-1.TextGrid",
    "3115-2 (Update 2.05).TextGrid": "3115-2.TextGrid",
    "3209-1_edited.TextGrid": "3209-1.TextGrid"
}

for filename in files_to_delete:
    file_path = os.path.join(org_transcripts_folder, filename)
    os.remove(file_path)
    print(f"Deleted {filename}")

for old_name, new_name in files_to_rename.items():
    old_path = os.path.join(org_transcripts_folder, old_name)
    new_path = os.path.join(org_transcripts_folder, new_name)
    os.rename(old_path, new_path)
    print(f"Renamed {old_name} to {new_name}")
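
The rename targets above all keep the recording ID at the start of the filename. As a sketch, the target name could be derived rather than typed by hand, assuming every ID has the form of four digits, a dash, and one digit, as in the names seen so far. `canonical_textgrid_name` is a hypothetical helper.

```python
import re

# Assumed ID shape: four digits, a dash, one digit (e.g. 3115-1)
RECORDING_ID = re.compile(r'^(\d{4}-\d)')

def canonical_textgrid_name(filename):
    """Map e.g. '3108-1_edited.TextGrid' to '3108-1.TextGrid'; return None if no ID prefix."""
    match = RECORDING_ID.match(filename)
    return f"{match.group(1)}.TextGrid" if match else None


print(canonical_textgrid_name("3115-1 9 (Update 2.05).TextGrid"))
print(canonical_textgrid_name("3108-1_edited.TextGrid"))
```

Several versions of one recording map to the same target name, so outdated versions still have to be deleted first, as the cell above does.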

**Helper function to remove instances of ```text = "...item [something]..."``` and ```text = "...intervals [something]..."``` from a single TextGrid file**

- To not interfere with praatio library's splitting logic

In [None]:
def remove_text_restriction(textgrid_path):
    try:
        with open(textgrid_path, "r", encoding="utf-16") as file:
            textgrid = file.read()
        encoding = "utf-16"
    except UnicodeError:
        with open(textgrid_path, "r", encoding="utf-8") as file:
            textgrid = file.read()
        encoding = "utf-8"

    text_restriction_1 = r'text = "(.*?item \[.*?\].*?)"'
    text_restriction_2 = r'text = "(.*?intervals \[.*?\].*?)"'

    def replace_brackets(match):
        text_content = match.group(1)
        text_content = text_content.replace("[", "").replace("]", "")
        return f'text = "{text_content}"'

    textgrid_fixed = re.sub(text_restriction_1, replace_brackets, textgrid)
    textgrid_fixed_final = re.sub(text_restriction_2, replace_brackets, textgrid_fixed)

    with open(textgrid_path, "w", encoding=encoding) as file:
        file.write(textgrid_fixed_final)
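
To try the substitution on a string before overwriting any file, the two patterns can be folded into one alternation. This is a sketch of the same idea and not a replacement for the helper above. `strip_brackets_in_labels` is a hypothetical name.

```python
import re

# One pattern covering both keywords that clash with the TextGrid structure
KEYWORD_LABEL = re.compile(r'text = "(.*?(?:item|intervals) \[.*?\].*?)"')

def strip_brackets_in_labels(textgrid_text):
    """Remove [ and ] from labels that contain 'item [..]' or 'intervals [..]'."""
    def _strip(match):
        label = match.group(1).replace("[", "").replace("]", "")
        return f'text = "{label}"'
    return KEYWORD_LABEL.sub(_strip, textgrid_text)


before = 'intervals [2002]:\n    text = "what\'s it called intervals [ah] they call it intervals"'
after = strip_brackets_in_labels(before)
print(after)
```

The structural line `intervals [2002]:` is left untouched because the pattern only matches inside `text = "..."`.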

**Remove text restrictions to let praatio library run properly**

In [None]:
cleaned_successfully = []
cleaned_unsuccessfully = []
for filename in os.listdir(org_transcripts_folder):
    textgrid_path = os.path.join(org_transcripts_folder, filename)
    try:
        tg = textgrid.openTextgrid(textgrid_path, False)
    except Exception:
        # Parsing failed: strip the problematic brackets in place, then retry once
        remove_text_restriction(textgrid_path)
        try:
            tg = textgrid.openTextgrid(textgrid_path, False)
            cleaned_successfully.append(filename)
        except Exception:
            cleaned_unsuccessfully.append(filename)

In [None]:
cleaned_successfully

In [None]:
cleaned_unsuccessfully

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 9**

In [1]:
import os
import re
from praatio import textgrid
from pydub import AudioSegment

def clean_transcription(transcript):
    transcript = transcript.strip()
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript
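
To see what the cleaning rules do to labels that appear earlier in this notebook, the sketch below restates them in condensed form so that it runs on its own. The explicit tag patterns are folded into the catch-all `<[^>]+>`, which should behave the same on ordinary labels but is not guaranteed to be identical in every edge case. `clean_label` is a hypothetical name.

```python
import re

# Condensed restatement of the removal rules (assumption: the catch-all covers the explicit tags)
REMOVE = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)',
          r'!', r'#', r'~', r'\*', r'<[^>]+>']

def clean_label(label):
    """Lowercase a label, drop annotations, turn hyphens into spaces, collapse whitespace."""
    text = label.strip().lower()
    for pattern in REMOVE:
        text = re.sub(pattern, '', text)
    text = text.replace('-', ' ')
    return re.sub(r'\s+', ' ', text).strip()


for label in ['that time got p_s_l_e or not',
              'I will do it at night (ppo)',
              '<S>',
              '<UNK ya the the bush is green']:
    print(repr(clean_label(label)))
```

The last label shows the unclosed `<UNK` case: with no closing `>`, the catch-all does not match and the tag stays in the output.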

In [2]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_dir_wav, output_dir_transcript, output_dir_textgrid, sanity_check=False):
    audio_path = os.path.join(input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(input_textgrid_path, f'{audio_filename}.TextGrid')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    # Specify the duration of each segment
    segment_duration_s = 30 
    # Specify the current segment index
    segment_index = 1

    for tier_name in tg.tierNames: 
        tier = tg.getTier(tier_name) 
        for start,end,label in tier.entries:  
            # Get the duration of this new entry
            entry_duration = end-start
            # If the entry's duration is less than our specified duration of each segment
            if entry_duration <= segment_duration_s:
                # Clean the transcription/label of this entry
                curr_transcriptions_clean = clean_transcription(label)
                # If this entry has text after cleaning i.e. contains proper ground truth transcription
                if len(curr_transcriptions_clean) > 0:
                    # Initialise the transcription segment path
                    transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
                    # Write the transcription to the transcription segment file
                    with open(transcript_segment_path, 'w') as f:
                        f.write(f'{audio_filename}_{segment_index} {curr_transcriptions_clean}')

                    # Calculate the boundaries for the audio segment in ms
                    segment_start = start*1000
                    segment_end = end*1000

                    # Sanity check on TextGrid Segments
                    if sanity_check:
                        tg_segment = tg.crop(segment_start / 1000, segment_end / 1000, mode="strict", rebaseToZero=False)
                        tg_segment_path = os.path.join(output_dir_textgrid, f'{audio_filename}_{segment_index}.TextGrid')
                        tg_segment.save(tg_segment_path, "long_textgrid", True)

                    # Segment the audio using the start and end time from the current TextGrid entry
                    audio_segment = audio[segment_start:segment_end+1] # Add 1 ms s.t the end timing is inclusive

                    # Save the audio segment
                    audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
                    audio_segment.export(audio_segment_path, format="wav")

                    # Increment the segment index
                    segment_index+=1

In [3]:
input_audio_path = ['org_wavs']
input_textgrid_path = ['org_transcripts']
output_train_path = ['dataset', 'train']
output_test_path = ['dataset', 'test']
output_compressed_path = ['dataset','data']
compressed_filename = 'imda_nsc_p3.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

In [None]:
hard_drive_path = 'D:\\'
input_wav_folder = os.path.join(hard_drive_path, *input_audio_path)
input_textgrid_folder = os.path.join(hard_drive_path, *input_textgrid_path)
output_train_folder_waves = os.path.join(hard_drive_path, *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(hard_drive_path, *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(hard_drive_path, *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(hard_drive_path, *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(hard_drive_path, *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(hard_drive_path, *output_compressed_path)
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

for dir in create_dir:
    os.makedirs(dir, exist_ok=True)

In [None]:
process_audio_transcript('3009-1', input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts, output_textgrids_folder, False)

It seems the hard drive disconnected itself after these operations

<br/>
<br/>
<br/>
<br/>
<br/>

### **Iteration 10: Make into 30s long**

In [1]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment



In [2]:
input_audio_path = ['org_wavs']
input_textgrid_path = ['org_transcripts']
output_train_path = ['dataset', 'train']
output_test_path = ['dataset', 'test']
output_compressed_path = ['dataset','data']
compressed_filename = 'imda_nsc_p3.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

In [3]:
input_drive_path = os.getcwd() #'D:\\'
output_drive_path = os.getcwd()
input_wav_folder = os.path.join(input_drive_path, *input_audio_path)
input_textgrid_folder = os.path.join(input_drive_path, *input_textgrid_path)
output_train_folder_waves = os.path.join(output_drive_path, *output_train_path, 'waves')
output_train_folder_transcripts = os.path.join(output_drive_path, *output_train_path, 'transcripts')
output_test_folder_waves  = os.path.join(output_drive_path, *output_test_path, 'waves')
output_test_folder_transcripts = os.path.join(output_drive_path, *output_test_path, 'transcripts')
output_textgrids_folder = os.path.join(output_drive_path, *output_train_path, 'textgrids')
output_compressed_folder = os.path.join(output_drive_path, *output_compressed_path)
output_compressed_file = os.path.join(output_compressed_folder, compressed_filename)
output_compressed_train_prompt_file = os.path.join(output_compressed_folder, compressed_train_prompt_filename)
output_compressed_test_prompt_file = os.path.join(output_compressed_folder, compressed_test_prompt_filename)

create_dir = [input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts,
              output_test_folder_waves, output_test_folder_transcripts, output_textgrids_folder, output_compressed_folder]

for dir in create_dir:
    os.makedirs(dir, exist_ok=True)

In [4]:
def clean_transcription(transcript):
    transcript = transcript.strip()
    transcript = transcript.lower()
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>', r'<[^>]+>'] 
    replace = ['-']
    for e in remove:
        transcript = re.sub(e, '', transcript)
    for e in replace:
        transcript = re.sub(e, ' ', transcript)
    transcript = re.sub(r'\s+', ' ', transcript).strip()
    return transcript

In [None]:
def process_audio_transcript(audio_filename, input_audio_path, input_textgrid_path, output_dir_wav, output_dir_transcript, segment_duration_s, buffer):
    # Initialise the wav and TextGrid paths of the current file
    audio_path = os.path.join(input_audio_path, f'{audio_filename}.wav')
    textgrid_path = os.path.join(input_textgrid_path, f'{audio_filename}.TextGrid')

    audio = AudioSegment.from_wav(audio_path)
    tg = textgrid.openTextgrid(textgrid_path, False) 

    # Specify the current segment index
    segment_index = 1

    # Initialise the current segment duration
    curr_segment_duration = 0
    # Initialise a list to hold the transcriptions for the current segment
    curr_transcriptions = []
    # Initialise a list to hold the audios for the current segment
    curr_wavs = []
    # Get the buffer in seconds -> To separate potentially unrelated speech
    buffer_s = buffer/1000 
    # Initialise audio buffer
    buffer_audio = AudioSegment.silent(duration=buffer)

    for tier_name in tg.tierNames: 
        tier = tg.getTier(tier_name) 
        for start,end,label in tier.entries:  
            # Get the duration of this new entry
            entry_duration = end-start

            # If the new entry on its own is shorter than our specified duration of each segment and
            # adding a buffer plus the new entry to the current segment does not exceed that duration,
            # we can keep accumulating the current segment
            if entry_duration < segment_duration_s and curr_segment_duration + buffer_s + entry_duration <= segment_duration_s:
                # Clean the transcription/label of this entry
                curr_transcription_clean = clean_transcription(label)
                # If this entry has text after cleaning i.e. contains proper ground truth transcription,
                # it is a valid sample
                if len(curr_transcription_clean) > 0:
                    # Update the current_segment_duration
                    curr_segment_duration = curr_segment_duration + buffer_s + entry_duration
                    # Add the current cleaned transcription of this entry
                    curr_transcriptions.append(curr_transcription_clean)
                    # Add the audio of this entry: Segment the audio using the start and end time from the current TextGrid entry
                    curr_wavs.append(audio[start*1000:(end*1000)+1]) # Add 1 ms s.t the end timing is inclusive

            # If adding a buffer and new entry exceeds our specified duration of each segment,
            # that means the current segment is completed and
            # we save the current transcription and the segmented audio as well as perform resetting
            elif curr_segment_duration > 0:
                    # Join the current transcription for the segment
                    transcript_segment = ' '.join(curr_transcriptions)

                    # Initialise the transcription segment path
                    transcript_segment_path = os.path.join(output_dir_transcript, f'{audio_filename}_{segment_index}.txt')
                    # Write the transcription to the transcription segment file
                    with open(transcript_segment_path, 'w') as f:
                        f.write(f'{audio_filename}_{segment_index} {transcript_segment}')

                    # Join the audio segments together with an audio buffer between them
                    audio_segment = curr_wavs[0]
                    for wav in curr_wavs[1:]:
                        audio_segment = audio_segment + buffer_audio + wav

                    # Initialise the audio segment path
                    audio_segment_path = os.path.join(output_dir_wav, f'{audio_filename}_{segment_index}.wav')
                    # Save the audio segment
                    audio_segment.export(audio_segment_path, format="wav")

                    # Increment the segment index
                    segment_index+=1

                    # Resetting
                    curr_transcription_clean = clean_transcription(label)
                    # If the entry in the current iteration is <= than our specified duration of each segment and has text after cleaning i.e. contains proper ground truth transcription
                    if entry_duration <= segment_duration_s and len(curr_transcription_clean) > 0:
                        # Reset the current segment duration
                        curr_segment_duration = entry_duration
                        # Reset the list to hold the transcriptions for the new segment
                        curr_transcriptions = [curr_transcription_clean]
                        # Reset the list to hold the audios for the new segment
                        curr_wavs = [audio[start*1000:(end*1000)+1]] # Add 1 ms s.t the end timing is inclusive
                    # Skip the entry as a sample if it is > than our specified duration of each segment
                    else:
                        # Reset the new segment duration
                        curr_segment_duration = 0
                        # Reset the list to hold the transcriptions for the new segment
                        curr_transcriptions = []
                        # Reset the list to hold the audios for the new segment
                        curr_wavs = []

The error happened because the current segment was still empty while `entry_duration` exceeded the segment duration in this iteration, so the save branch ran with nothing to write.
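A minimal sketch of the grouping logic with a guard for that case. This is a simplified stand-in, not the notebook's `process_audio_transcript`: the function name `group_entries`, the `(start_s, end_s, text)` tuple layout and the default values are assumptions made for illustration. It works on timings only and leaves the audio slicing out.

```python
def group_entries(entries, max_segment_s=30.0, buffer_s=1.0):
    """Greedily pack (start_s, end_s, text) entries into segments of at most max_segment_s.

    buffer_s is the silence inserted between joined entries (hypothetical default).
    """
    segments = []
    current = []
    current_duration = 0.0
    for start_s, end_s, text in entries:
        text = text.strip()
        if not text:
            continue  # nothing to transcribe for this entry
        duration = end_s - start_s
        # A buffer is only needed when the entry is joined to earlier ones
        needed = duration + (buffer_s if current else 0.0)
        # Flush only when there is something to flush: this is the guard
        # that avoids saving an empty segment
        if current and current_duration + needed > max_segment_s:
            segments.append(current)
            current, current_duration = [], 0.0
            needed = duration
        if duration > max_segment_s:
            continue  # too long to be a sample on its own
        current.append((start_s, end_s, text))
        current_duration += needed
    if current:
        segments.append(current)
    return segments


entries = [(0, 45, "too long"), (45, 55, "a"), (55, 70, "b"), (70, 85, "c")]
segments = group_entries(entries)
texts = [" ".join(t for _, _, t in seg) for seg in segments]
```

The first entry is over 30 s and arrives while the current segment is empty, which is the situation described above; with the guard it is skipped instead of triggering a save.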

In [10]:
for filename in os.listdir(input_wav_folder):
    try:
        # Drop the extension so only the recording ID is passed on
        filename = filename.split('.')[0]
        process_audio_transcript(filename, input_wav_folder, input_textgrid_folder, output_train_folder_waves, output_train_folder_transcripts, 30, 1000)
    except Exception as e:
        # Report the failing recording and carry on with the rest
        print(f"Filename {filename}")
        print(f"Exception {e}")
        # break

In [8]:
# Estimated WAV storage for the same-room ("Same") recordings: 274*2 files, each around 104 MB
# 56992 MB, i.e. about 57 GB
274*2*104

56992

**Observation: Text can get a little cut off sometimes**

```
intervals [598]:
    xmin = 3013.472 
    xmax = 3019.4779488360505 
    text = "on Sunday right I did set up there was only two people okay including me so me and this uncle" 
```

In [None]:
audio = AudioSegment.from_wav(os.path.join(input_wav_folder,'3003-1.wav'))
test_audio = audio[3013.472*1000:3019.4779488360505*1000 + 1]
test_audio.export('testing.wav', format="wav")

<_io.BufferedRandom name='testing.wav'>
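One possible mitigation is to pad each interval by a small margin before slicing. The sketch below only computes the millisecond bounds; `padded_bounds_ms` and the 200 ms default are hypothetical and not part of the notebook. Padding can also pick up the start of the next utterance, so the margin would need checking by ear.

```python
def padded_bounds_ms(xmin_s, xmax_s, total_ms, pad_ms=200):
    """Return (start_ms, end_ms) for a TextGrid interval, widened by pad_ms on each side.

    The result is clamped to [0, total_ms] so the slice stays inside the recording.
    """
    start_ms = max(0, int(round(xmin_s * 1000)) - pad_ms)
    end_ms = min(total_ms, int(round(xmax_s * 1000)) + pad_ms)
    return start_ms, end_ms


# Bounds for the interval shown above, assuming a one-hour recording
start_ms, end_ms = padded_bounds_ms(3013.472, 3019.4779488360505, total_ms=3_600_000)
```

With pydub the result could then be used as `audio[start_ms:end_ms]`, with `total_ms=len(audio)`.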

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Part 3: Upload to HF

<u>Upload to HuggingFace</u>

Prepare our own audio dataset and upload it to HF

Stream data during the training process

Each file is around 112770 KB, which is about 0.11 GB

Part 3 consists of about 1000 hours of audio, which is maybe 110 GB in total

But maybe half of it is not recorded in the environment we want
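A quick sanity check on these numbers, as a sketch. The audio format used here (16 kHz, 16-bit, mono PCM with a 44-byte header) is an assumption, not something confirmed in these notes, and `wav_size_bytes` is a hypothetical helper.

```python
def wav_size_bytes(duration_s, sample_rate=16_000, bit_depth=16, channels=1, header_bytes=44):
    """Approximate size of an uncompressed PCM WAV file of the given duration."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return int(duration_s * bytes_per_second) + header_bytes


one_hour_gb = wav_size_bytes(3600) / 1e9            # one recording of about an hour
part3_gb = wav_size_bytes(1000 * 3600) / 1e9        # all 1000 hours
```

Under those assumptions a one-hour recording is about 0.115 GB and 1000 hours is about 115 GB, which is in line with the rough figures above.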

<br/>
<br/>

Folder structure

Configure your dataset repository with audio files

- https://huggingface.co/docs/datasets/audio_dataset#audiofolder
- https://huggingface.co/docs/datasets/en/repository_structure#split-pattern-hierarchy
- https://huggingface.co/docs/hub/datasets-audio

```
test_dataset
    - metadata.csv: file_name (full relative path to audio file), transcription
    - data
        - train
            - first_train_audio_file.wav
            - second_train_audio_file.wav
            - ...
```
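A sketch of how `metadata.csv` could be generated from the per-segment transcript files. It assumes each transcript file holds one line of the form `<segment_id> <transcription>` (as written by the segmenting code) and has a `.txt` extension; the extension, the function name `build_metadata` and the `data/train` default are assumptions.

```python
import csv
from pathlib import Path


def build_metadata(transcript_dir, dataset_root, audio_subdir="data/train"):
    """Write dataset_root/metadata.csv with file_name and transcription columns."""
    dataset_root = Path(dataset_root)
    rows = []
    for path in sorted(Path(transcript_dir).glob("*.txt")):
        line = path.read_text(encoding="utf-8").strip()
        if not line:
            continue
        # First token is the segment ID, the rest is the transcription
        segment_id, _, transcription = line.partition(" ")
        audio_rel = f"{audio_subdir}/{segment_id}.wav"
        # Only list segments whose audio file actually exists
        if not (dataset_root / audio_rel).is_file():
            continue
        rows.append({"file_name": audio_rel, "transcription": transcription})
    metadata_path = dataset_root / "metadata.csv"
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path
```

`file_name` is written relative to the dataset root, matching the structure above.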

<br/>
<br/>
<br/>
<br/>
<br/>


### <u>Approach 1</u>

**<u>Part 1: Folder-based builders: Build dataset locally</u>**

https://huggingface.co/docs/datasets/create_dataset

https://huggingface.co/docs/datasets/audio_dataset#audiofolder

https://huggingface.co/docs/datasets/en/repository_structure#split-pattern-hierarchy

AudioFolder is a dataset builder for loading an audio dataset with several thousand audio files. Additional information such as the transcription is also loaded by AudioFolder if it is included in the metadata file

AudioFolder creates splits based on split pattern hierarchy 

```
# After structuring the data
from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/data")
```

**<u>Part 2: Push local dataset to Hub</u>**

https://huggingface.co/docs/datasets/upload_dataset

```
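# Shell commands: install the Hub client and log in first
# pip install huggingface_hub
# huggingface-cli login

from datasets import load_dataset

# Load a dataset, then push it to a repo on the Hub
dataset = load_dataset("stevhliu/demo")

dataset.push_to_hub("stevhliu/processed_demo")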
```

<br/>
<br/>
<br/>

### <u>Approach 2</u>

https://huggingface.co/docs/datasets/audio_dataset#audiofolder

https://huggingface.co/docs/hub/datasets-adding

**<u>Part 1: Upload local dataset directory to Hub</u>**

**<u>Uploading Datasets in general</u>**

https://huggingface.co/docs/hub/datasets-adding

- Dataset repos are Git repos, so we can use Git to push data files to the Hub
- Starter: https://huggingface.co/docs/hub/repositories-getting-started
- Parquet is the recommended format due to its efficient compression etc.
    - For more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets.
- For large scale image and audio datasets streaming, WebDataset should be preferred over raw image and audio files to avoid the overhead of accessing individual files
- Hugging Face Hub supports large scale datasets, usually uploaded in Parquet via push_to_hub() or WebDataset format

**<u>Creating audio datasets</u>**

- https://huggingface.co/docs/hub/datasets-audio
- https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607

**<u>Uploading large folders</u>**

https://huggingface.co/docs/huggingface_hub/guides/upload#upload-a-folder-by-chunks

- Upload folder normally: ```upload_folder()```
    - Upload a local folder to an existing repo
    - Specify the path of the local folder to upload, where you want to upload the folder to in the repository, and the name of the repository you want to add the folder to. Depending on your repository type, you can optionally set the repository type as a dataset, model, or space

    ```
    from huggingface_hub import HfApi
    api = HfApi()

    api.upload_folder(
        folder_path="/path/to/local/space",
        repo_id="username/my-cool-space",
        repo_type="space",
    )
    ```

    - By default, the .gitignore file is taken into account to decide which files are committed. The check looks for a .gitignore file in the commit first and, if there is none, for one on the Hub. Only a .gitignore file at the root of the directory is used; .gitignore files in subdirectories are not checked.

    - Makes a single commit, fails explicitly when something wrong happens

- Upload a large folder: ```upload_large_folder()```
    - Resumable
        - Upload process is split into many small tasks
        - Each time a task is completed, result is cached locally in ```./cache/huggingface``` inside the folder you're trying to upload
    - Multi-threaded
    - Resilient to errors: High-level retry-mechanism
        - Downside: If transient errors happen, the process will continue and retry. If permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause.
    - Limitations
        - ...


    ```
    from huggingface_hub import HfApi
    api = HfApi()

    api.upload_large_folder(
        repo_id="HuggingFaceM4/Docmatix",
        repo_type="dataset",
        folder_path="/path/to/local/docmatix",
    )
    ```

- Recommendations
    - Start small

- Upload a folder by chunks: ```upload_folder()```
    - Upload a folder in several commits so we don't have to restart the process from the beginning: pass ```multi_commits=True``` as an argument
    - Recommended to pass ```multi_commits_verbose=True```
    - Upload will resume from where it stopped
        - If the process is interrupted before completing, you can rerun your script to resume the upload. The created PR will be automatically detected and the upload will resume from where it stopped
    - ```multi_commits``` is still an experimental feature

**<u>Repo Limits and recommendations</u>**

https://huggingface.co/docs/hub/repositories-recommendations

- Repo size: Generally support repos up to 300GB
- Number of files: Keep total number of files under 100k
    - Large datasets can be exported as Parquet files or in WebDataset format
    - Cannot exceed 10k files per folder. Solution is to create a repo structure that uses subdirectories 
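A sketch of one way to stay under the per-folder limit by spreading files over numbered subdirectories. `plan_shards` and the `part_00000` naming are hypothetical; the function only plans the layout and does not move any files.

```python
def plan_shards(filenames, max_per_dir=10_000):
    """Map each filename to a relative path inside a numbered subdirectory.

    Files are sorted first so the layout is reproducible between runs.
    """
    return {
        name: f"part_{index // max_per_dir:05d}/{name}"
        for index, name in enumerate(sorted(filenames))
    }


names = [f"3000-1_{i}.wav" for i in range(25)]
layout = plan_shards(names, max_per_dir=10)
```

If the dataset uses a `metadata.csv`, its `file_name` column would need to point at the new relative paths.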


**<u>Part 2: Load dataset from the hub using audiofolder</u>**

```
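from datasets import load_dataset
# data_dir is a local path; for a Hub repo, pass its repo ID (e.g. the placeholder "username/my_dataset") instead
dataset = load_dataset("audiofolder", data_dir="/path/to/data")
# Pass streaming=True to stream instead of downloading: https://huggingface.co/docs/datasets/en/stream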
```

### <u>Approach 3</u>

https://huggingface.co/docs/hub/repositories-getting-started

https://huggingface.co/docs/datasets/en/audio_dataset#loading-script ((Legacy) Loading script)

https://huggingface.co/docs/hub/datasets-audio

https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607

https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809

https://huggingface.co/docs/hub/datasets-webdataset

Custom loading script

Reasons
- For large scale image and audio datasets streaming, WebDataset should be preferred over raw image and audio files to avoid the overhead of accessing individual files. 
- Audio datasets are commonly stored in tar.gz archives, which require a particular approach to support streaming mode.



<br/>
<br/>
<br/>

### Creating a dataset loading script for audio datasets

Audio datasets are commonly stored in tar.gz archives, which require a particular approach to support streaming mode

See ```new_dataset_script tutorial.py```

Step 1: Put the dataset into WebDataset format

vivos format:

```
- vivos.tar.gz
    - vivos.tar
        - train
            - genders.txt: Contains the gender type for each waves folder
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - VIVOSSPK01 -> Speaker ID
                    - VIVOSSPK01_R001.wav
                    - VIVOSSPK01_R002.wav
                    - VIVOSSPK01_R003.wav

        - test
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the .wav files
    - prompts-test.txt.gz
```

Usual size per archive is generally around 1GB?

```
- imda_nsc_p3.tar.gz
    - imda_nsc_p3.tar
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1.tar
                    - 3000-1_1.wav
                    - 3000-1_2.wav
                    - 3000-1_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files
```

Try this flatter layout first:

```
- imda_nsc_p3.tar.gz
    - imda_nsc_p3.tar
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1_1.wav
                - 3000-1_2.wav
                - 3000-1_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files
```
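A sketch of packing the segment WAV files into tar.gz archives following the flat `train/waves/` layout above, starting a new archive once roughly 1 GB of input has been added. The function name `pack_shards`, the archive naming and the size cap are assumptions; the cap is measured on the uncompressed WAV sizes, so the archives on disk will be somewhat smaller.

```python
import tarfile
from pathlib import Path


def pack_shards(wav_dir, out_dir, prefix="train", max_shard_bytes=1_000_000_000):
    """Pack *.wav files from wav_dir into numbered tar.gz archives in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    shard_bytes = 0
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        size = wav.stat().st_size
        # Open the first archive, or roll over when the current one is full
        if tar is None or (shard_bytes > 0 and shard_bytes + size > max_shard_bytes):
            if tar is not None:
                tar.close()
            shard_path = out_dir / f"{prefix}_{len(shard_paths):04d}.tar.gz"
            tar = tarfile.open(shard_path, "w:gz")
            shard_paths.append(shard_path)
            shard_bytes = 0
        # Store each file under <prefix>/waves/ inside the archive
        tar.add(wav, arcname=f"{prefix}/waves/{wav.name}")
        shard_bytes += size
    if tar is not None:
        tar.close()
    return shard_paths
```

The transcriptions file (`prompts-train.txt` above) would be written separately; this sketch only handles the audio.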