# Preparing Data

## Part 1: Get Data 

National Speech Corpus
- Part 3: 1000 hours of conversational speech data (Used by Home team)
- Part 2: 1000 hours of prompted recordings of random sentences containing local words and entities (Used by some developer)
- Part 4: Conversational code-switched data (from Singaporean English to various native languages)

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english


<br/>
<br/>
<br/>
<br/>
<br/>

## Part 2: Prepare Data

- Match each transcript sentence to its corresponding audio file
- Check the environment in which the audio was recorded and decide which recordings to use
    - Home Team
        - The NSC Part 3 recordings are split into two environments, each with two different microphones used for recording. In the first environment, where speakers were in the same room, we selected the recordings using the close-talk mic as this isolated the main speaker’s voice (without picking up background noise or the secondary speaker). For the second environment with speakers in different rooms, we chose to use the standing microphone recordings, as opposed to recordings via telephone.
    - Same room environment: Close-talk mic that isolates main speaker's voice 
    - Different room environment: Standing microphone as opposed to telephone
- Clean the transcripts by removing annotations
- Normalise the transcript text
    - Remove punctuation
    - Lowercase text
- Create 30s audio segments with corresponding transcripts
    - Using time segments from ```TextGrid files```, splice out corresponding segments from WAV files
    - Combine shorter consecutive segments (?)
    - 30s: Whisper's feature extractor pads or truncates every input to 30s (an intrinsic part of its design)
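
The "combine shorter consecutive segments" step could be done as a greedy merge. This is only a minimal sketch: it assumes the TextGrid intervals have already been read into `(start_s, end_s, text)` tuples sorted by time, and `merge_intervals` and `max_len_s` are hypothetical names, not taken from the referenced articles.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start_s, end_s, text)


def merge_intervals(intervals: List[Interval], max_len_s: float = 30.0) -> List[Interval]:
    """Greedily merge consecutive intervals while the merged span stays <= max_len_s."""
    merged: List[Interval] = []
    for start, end, text in intervals:
        # Extend the current chunk if the new interval still fits in the window
        if merged and end - merged[-1][0] <= max_len_s:
            prev_start, _, prev_text = merged[-1]
            merged[-1] = (prev_start, end, f"{prev_text} {text}".strip())
        else:
            # Otherwise start a new chunk at this interval
            merged.append((start, end, text))
    return merged


example = [
    (0.0, 12.0, "you can go first"),
    (12.0, 25.0, "this is a weird topic"),
    (25.0, 41.0, "let's skip this topic"),
]
chunks = merge_intervals(example)
print(chunks)
```

The first two intervals are merged into one 25s chunk, and the third starts a new chunk because adding it would exceed 30s. A single interval longer than `max_len_s` is kept as-is here and would need separate handling.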

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english

<br/>
<br/>

More on dataset part 3 (see ```ABOUT.txt```):

Part 3 consists of about 1000 hours of conversational data recorded from about 1000 local English speakers, split into pairs. The data includes conversations covering daily life and of speakers playing games provided. 

Part 3's recordings were split into 2 environments. In the Same Room environment, where speakers were in the same room, the recordings were done using 2 microphones: a close-talk mic and a boundary mic. In the Separate Room environment, speakers were separated into individual rooms. The recordings were done using 2 microphones in each room: a standing mic and a telephone. 

Part 3 is further organised into six subdirectories, 3 for each recording environment (Same Room or Separate Room). Among each group of 3 subdirectories, 1 contains transcriptions, while the remaining 2 contain audio data from each of the two microphones used for the environment. There is also a manifest document at the root of the Part 3 folder that lists the files released.


Summary of Part 3 data organization:
- Same Room environment, files organized by speaker number:
    - /Scripts Same: Orthographic transcripts saved in TextGrid format
    - /Audio Same BoundaryMic: Audio files in WAV format recorded using the boundary mic, sampled at 16kHz
    - /Audio Same CloseMic: Audio files in WAV format recorded using the close-talk mic, sampled at 16kHz


- Separate Room environment, files organized by speaker number and session number:
    - /Scripts Separate: Orthographic transcripts saved in TextGrid format 
    - /Audio Separate IVR: Audio files in WAV format recorded using the telephone, sampled at 16kHz
    - /Audio Separate StandingMic: Audio files in WAV format recorded using the standing mic, sampled at 16kHz


<br/>
<br/>
<br/>
<br/>
<br/>

### Simple Example

**1. Match 3000-1.wav and 3000-1.TextGrid**

- Use Dataset Part 3 (used by Home Team)
- Specific datasets (used by Home Team)
    - Audio Same CloseMic
    - Audio Separate StandingMic 
- In this simple example, first settle the Audio Same CloseMic dataset
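
The matching step can be sketched as pairing files by name. This is a minimal sketch that assumes each audio file and its transcript share the same file stem, as `3000-1.wav` and `3000-1.TextGrid` do in this example; other parts of the corpus may use a different naming scheme. `pair_audio_with_transcripts` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, Tuple


def pair_audio_with_transcripts(audio_dir: Path, script_dir: Path) -> Dict[str, Tuple[Path, Path]]:
    """Pair each .wav file with the .TextGrid file that has the same stem."""
    # Index transcripts by stem; compare the suffix case-insensitively
    transcripts = {
        p.stem: p for p in script_dir.iterdir() if p.suffix.lower() == ".textgrid"
    }
    pairs: Dict[str, Tuple[Path, Path]] = {}
    for wav in sorted(audio_dir.glob("*.wav")):
        if wav.stem in transcripts:
            pairs[wav.stem] = (wav, transcripts[wav.stem])
    return pairs
```

Audio files without a transcript are skipped silently in this sketch; logging them would help spot gaps in the dataset.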


**2. Create 30s segments from 3000-1.wav and 3000-1.TextGrid**

In [1]:
# https://github.com/jiaaro/pydub#installation
# https://github.com/timmahrt/praatIO/tree/main

import os
from praatio import textgrid 
from pydub import AudioSegment

# Initialise input and output paths
audio_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.wav')
textgrid_path = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1.TextGrid')
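output_dir = os.path.join(os.getcwd(), 'dataset', 'part3', 'simple_example', '3000-1-splits')

# Make sure the output directory exists before exporting segments into it
os.makedirs(output_dir, exist_ok=True)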

# https://github.com/jiaaro/pydub
# https://github.com/timmahrt/praatIO
# https://timmahrt.github.io/praatIO/praatio.html
audio = AudioSegment.from_wav(audio_path)
tg = textgrid.openTextgrid(textgrid_path, False) # do not include intervals and points with empty labels

# pydub does things in milliseconds
segment_duration_ms = 30 * 1000  

# Get total duration of the audio in milliseconds
audio_duration = len(audio)

# Initialize start time and segment index
start_time = 0
segment_index = 1

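# Simplified example: only the first 30 s segment is processed here.
# For a full run, wrap the steps below in `while start_time < audio_duration:`
# and advance start_time and segment_index at the end of each iteration.
end_time = min(start_time + segment_duration_ms, audio_duration)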

# Extract audio segment given the current start and end timing
audio_segment = audio[start_time:end_time]

# Save the audio segment
audio_segment_path = os.path.join(output_dir, f'segment_{segment_index}.wav')
audio_segment.export(audio_segment_path, format="wav")

# Extract the corresponding TextGrid segment
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment = tg.crop(start_time / 1000, end_time / 1000, mode="truncated", rebaseToZero=False)

# Check tg_segment 
# https://timmahrt.github.io/praatIO/praatio/data_classes/textgrid.html
tg_segment_path = os.path.join(output_dir, 'tg_segment.TextGrid')
tg_segment.save(tg_segment_path, "long_textgrid", True)

# Collect transcriptions from the TextGrid segment
transcriptions = []
for tier_name in tg_segment.tierNames: # For each tier (in order) in the TextGrid segment
    tier = tg_segment.getTier(tier_name) # Get the tier
    for entry in tier.entries: # For each of its entries, extract the labels 
        if entry.label.strip():  # Only include non-empty transcriptions -> but should be handled above already
            transcriptions.append(entry.label)

# Save the transcriptions to a text file
transcription_path = os.path.join(output_dir, f'segment_{segment_index}_transcription.txt')
with open(transcription_path, 'w') as f:
    f.write("\n".join(transcriptions))
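
The cell above handles only the first 30s window. To cover the whole recording, the window boundaries can be generated up front. This is a minimal sketch: `iter_windows` is a hypothetical helper, and it uses plain integers in milliseconds to match pydub's slicing.

```python
from typing import Iterator, Tuple


def iter_windows(total_ms: int, window_ms: int = 30_000) -> Iterator[Tuple[int, int, int]]:
    """Yield (segment_index, start_ms, end_ms) for consecutive fixed-length windows."""
    start_ms, index = 0, 1
    while start_ms < total_ms:
        # The last window is shorter when the audio does not divide evenly
        end_ms = min(start_ms + window_ms, total_ms)
        yield index, start_ms, end_ms
        start_ms, index = end_ms, index + 1


windows = list(iter_windows(total_ms=75_000))
print(windows)  # [(1, 0, 30000), (2, 30000, 60000), (3, 60000, 75000)]
```

Each tuple could replace `segment_index`, `start_time` and `end_time` in the cell above. Note that fixed boundaries can fall in the middle of an utterance, so the transcript of a window may not line up exactly with its audio.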



In [5]:
output_dir_audio = os.path.join(output_dir, 'segment_1.wav')

from IPython.display import Audio
display(Audio(output_dir_audio))

**Transcription**
```
<S>
(um) you can go first
<S>
you guys are going to stand here [ah]
<S>
they are like !wow! this is a weird topic (um)
<S>
Singapore and Malaysia are like
<S>
you know brothers but not really brothers brothers on a on a tricky relationship
<S>
you know what let's skip this topic
<S>
next do I go do I go next
```

**TextGrid**
```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 30 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "3000-1" 
        xmin = 0 
        xmax = 30 
        intervals: size = 14 
        intervals [1]:
            xmin = 0 
            xmax = 1.556 
            text = "<S>" 
        intervals [2]:
            xmin = 1.556 
            xmax = 2.661 
            text = "(um) you can go first" 
        intervals [3]:
            xmin = 2.661 
            xmax = 3.848 
            text = "<S>" 
        intervals [4]:
            xmin = 3.848 
            xmax = 4.998 
            text = "you guys are going to stand here [ah]" 
        intervals [5]:
            xmin = 4.998 
            xmax = 10.473 
            text = "<S>" 
        intervals [6]:
            xmin = 10.473 
            xmax = 13.531 
            text = "they are like !wow! this is a weird topic (um)" 
        intervals [7]:
            xmin = 13.531 
            xmax = 16.156 
            text = "<S>" 
        intervals [8]:
            xmin = 16.156 
            xmax = 17.868 
            text = "Singapore and Malaysia are like" 
        intervals [9]:
            xmin = 17.868 
            xmax = 19.781 
            text = "<S>" 
        intervals [10]:
            xmin = 19.781 
            xmax = 24.718 
            text = "you know brothers but not really brothers brothers on a on a tricky relationship" 
        intervals [11]:
            xmin = 24.718 
            xmax = 26.281 
            text = "<S>" 
        intervals [12]:
            xmin = 26.281 
            xmax = 27.318 
            text = "you know what let's skip this topic" 
        intervals [13]:
            xmin = 27.318 
            xmax = 28.156 
            text = "<S>" 
        intervals [14]:
            xmin = 28.156 
            xmax = 30 
            text = "next do I go do I go next" 

```

<br/>
<br/>
<br/>

**3. Clean and format the transcripts**

In [29]:
output_dir_transcript = os.path.join(output_dir, 'segment_1_transcription.txt')

with open(output_dir_transcript, 'r') as f:
    transcript = ' '.join(line.strip() for line in f)

In [30]:
transcript

"<S> (um) you can go first <S> you guys are going to stand here [ah] <S> they are like !wow! this is a weird topic (um) <S> Singapore and Malaysia are like <S> you know brothers but not really brothers brothers on a on a tricky relationship <S> you know what let's skip this topic <S> next do I go do I go next"

<u>Cleaning</u>

1. Lower-case the text

2. Remove and replace annotations

- Acronyms: Remove '_'
- Multi-word nouns: Replace '-' with ' '
- Discourse particles: Remove '[' and ']'
- Fillers: Remove '(' and ')'
- Interjections: Remove '!'
- Paralinguistic Phenomena: Remove '(ppb)', '(ppc)', '(ppl)', '(ppo)'
- Other languages: Remove '#'
- Unclear words: Remove ```'<unk>'```
- Incomplete words: Remove '~'
- Short pauses: Remove ```'<s>'```
- Invalid: Remove ```'<z>'```
- Long-running non-english utterances: Remove ```'<nen>'```
- Fillers: Remove ```'<fil/>'```
- Speaker Noise: Remove ```'<spk/>'```
- Unknown: Remove '**'
- Non-primary speaker sound: Remove ```'<non/>'```
- End of sentence: Remove ```'<s/>'```
- Comma: Remove ```'<c/>'```

In [31]:
import re

transcript = transcript.lower()

remove = [r'_', r'\[|\]', r'\(|\)', r'!', r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', 
          r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
          r'\*', r'<non/>', r'<s/>', r'<c/>']

replace = ['-']


for e in remove:
    transcript = re.sub(e, '', transcript)

In [32]:
for e in replace:
    transcript = re.sub(e, ' ', transcript)

In [33]:
transcript

" um you can go first  you guys are going to stand here ah  they are like wow this is a weird topic um  singapore and malaysia are like  you know brothers but not really brothers brothers on a on a tricky relationship  you know what let's skip this topic  next do i go do i go next"

In [None]:
# Remove extra spaces created by <s> and stuff
transcript = re.sub(r'\s+', ' ', transcript).strip()

In [35]:
transcript

"um you can go first you guys are going to stand here ah they are like wow this is a weird topic um singapore and malaysia are like you know brothers but not really brothers brothers on a on a tricky relationship you know what let's skip this topic next do i go do i go next"

**Need to change the order**

The `(ppl)`, `(ppb)`, etc. patterns should be placed at the front of the list: if the generic parentheses pattern runs first, the parentheses are removed and these markers can no longer be matched.

In [None]:
testing = ['(ppl)','(test)','sfs','(rdg)', 'tg_s']
testing_2 = ' '.join(test.strip() for test in testing)
remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>']
for e in remove:
    testing_2 = re.sub(e, '', testing_2)
testing_2 = re.sub(r'\s+', ' ', testing_2).strip()

In [36]:
print(transcriptions)

['<S>', '(um) you can go first', '<S>', 'you guys are going to stand here [ah]', '<S>', 'they are like !wow! this is a weird topic (um)', '<S>', 'Singapore and Malaysia are like', '<S>', 'you know brothers but not really brothers brothers on a on a tricky relationship', '<S>', "you know what let's skip this topic", '<S>', 'next do I go do I go next']


In [37]:
' '.join(line.strip() for line in transcriptions)

"<S> (um) you can go first <S> you guys are going to stand here [ah] <S> they are like !wow! this is a weird topic (um) <S> Singapore and Malaysia are like <S> you know brothers but not really brothers brothers on a on a tricky relationship <S> you know what let's skip this topic <S> next do I go do I go next"

In [40]:
def clean_transcription(transcript):
    transcript = ' '.join(line.strip() for line in transcript)

    transcript = transcript.lower()

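    # Match the paralinguistic markers, e.g. (ppb), before the generic
    # parentheses pattern; otherwise their parentheses are stripped first
    # and the markers are left behind as plain text
    remove = [r'\(ppb\)|\(ppc\)|\(ppl\)|\(ppo\)', r'_', r'\[|\]', r'\(|\)', r'!', 
            r'#', r'<unk>', r'~', r'<s>', r'<z>', r'<nen>', r'<fil/>', r'<spk/>',
            r'\*', r'<non/>', r'<s/>', r'<c/>']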

    replace = ['-']


    for e in remove:
        transcript = re.sub(e, '', transcript)

    for e in replace:
        transcript = re.sub(e, ' ', transcript)

    transcript = re.sub(r'\s+', ' ', transcript).strip()

    return transcript

In [41]:
clean_transcription(transcriptions)

"um you can go first you guys are going to stand here ah they are like wow this is a weird topic um singapore and malaysia are like you know brothers but not really brothers brothers on a on a tricky relationship you know what let's skip this topic next do i go do i go next"

<br/>
<br/>
<br/>
<br/>
<br/>

## Part 3: Upload to HF

<u>Upload to HuggingFace</u>

Prepare our own audio dataset and upload it to HF

Stream data during the training process

Each audio file is around 112,770 KB, which is about 0.11 GB

Part 3 consists of about 1000 hours of audio, which comes to roughly 110 GB

But perhaps half of it is not from the environment we want
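
The size estimate can be sanity-checked with a little arithmetic. The 16kHz sampling rate comes from `ABOUT.txt`; 16-bit mono PCM is an assumption here, not something stated in these notes.

```python
SAMPLE_RATE_HZ = 16_000   # stated in ABOUT.txt
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono

# Uncompressed WAV payload for one hour of audio
bytes_per_hour = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * 3600

# Observed file size, converted from KB to bytes
file_bytes = 112_770 * 1024
hours_per_file = file_bytes / bytes_per_hour

# Total size of 1000 hours from a single microphone, in GB (10^9 bytes)
gb_per_1000_hours = 1000 * bytes_per_hour / 1e9

print(f"{hours_per_file:.2f} h per file, {gb_per_1000_hours:.0f} GB per 1000 h")
```

Under these assumptions a 112,770 KB file holds about one hour of audio, and 1000 hours from one microphone comes to roughly 115 GB, which is in line with the estimate above.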

<br/>
<br/>

Folder structure

Configure your dataset repository with audio files

- https://huggingface.co/docs/datasets/audio_dataset#audiofolder
- https://huggingface.co/docs/datasets/en/repository_structure#split-pattern-hierarchy
- https://huggingface.co/docs/hub/datasets-audio

```
test_dataset
    - metadata.csv: file_name (full relative path to audio file), transcription
    - data
        - train
            - first_train_audio_file.wav
            - second_train_audio_file.wav
            - ...
```
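
The `metadata.csv` file in this layout can be written with the standard `csv` module. This is a minimal sketch: `write_metadata` is a hypothetical helper, and the transcripts are assumed to be already cleaned and keyed by the audio file's path relative to the dataset root.

```python
import csv
from pathlib import Path
from typing import Dict


def write_metadata(dataset_root: Path, transcripts: Dict[str, str]) -> Path:
    """Write metadata.csv with one row per audio file: file_name, transcription."""
    metadata_path = dataset_root / "metadata.csv"
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        # Sort for a stable, reproducible row order
        for relative_path, text in sorted(transcripts.items()):
            writer.writerow([relative_path, text])
    return metadata_path
```

For example, `write_metadata(Path("test_dataset"), {"data/train/segment_1.wav": "um you can go first"})` would produce a header row followed by one row for that segment. The `csv` module takes care of quoting if a transcription contains a comma.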

<br/>
<br/>
<br/>
<br/>
<br/>


### <u>Approach 1</u>

**<u>Part 1: Folder-based builders: Build dataset locally</u>**

https://huggingface.co/docs/datasets/create_dataset

https://huggingface.co/docs/datasets/audio_dataset#audiofolder

https://huggingface.co/docs/datasets/en/repository_structure#split-pattern-hierarchy

AudioFolder is a dataset builder to load an audio dataset with several thousand audio files. Additional information such as transcription is loaded by AudioFolder if it's included in the metadata file.

AudioFolder creates splits based on split pattern hierarchy 

```
# After structuring the data
from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/data")
```

**<u>Part 2: Push local dataset to Hub</u>**

https://huggingface.co/docs/datasets/upload_dataset

```
# Run these two commands in a shell first:
#   pip install huggingface_hub
#   huggingface-cli login

from datasets import load_dataset

dataset = load_dataset("stevhliu/demo")

dataset.push_to_hub("stevhliu/processed_demo")
```

<br/>
<br/>
<br/>

### <u>Approach 2</u>

https://huggingface.co/docs/datasets/audio_dataset#audiofolder

https://huggingface.co/docs/hub/datasets-adding

**<u>Part 1: Upload local dataset directory to Hub</u>**

**<u>Uploading Datasets in general</u>**

https://huggingface.co/docs/hub/datasets-adding

- Dataset repos are Git repos, so we can use Git to push data files to the Hub
- Starter: https://huggingface.co/docs/hub/repositories-getting-started
- Parquet is the recommended format due to its efficient compression etc.
    - For more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets.
- For large scale image and audio datasets streaming, WebDataset should be preferred over raw image and audio files to avoid the overhead of accessing individual files
- Hugging Face Hub supports large scale datasets, usually uploaded in Parquet via push_to_hub() or WebDataset format

**<u>Creating audio datasets</u>**

- https://huggingface.co/docs/hub/datasets-audio
- https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607

**<u>Uploading large folders</u>**

https://huggingface.co/docs/huggingface_hub/guides/upload#upload-a-folder-by-chunks

- Upload folder normally: ```upload_folder()```
    - Upload a local folder to an existing repo
    - Specify the path of the local folder to upload, where you want to upload the folder to in the repository, and the name of the repository you want to add the folder to. Depending on your repository type, you can optionally set the repository type as a dataset, model, or space

    ```
    from huggingface_hub import HfApi
    api = HfApi()

    api.upload_folder(
        folder_path="/path/to/local/space",
        repo_id="username/my-cool-space",
        repo_type="space",
    )
    ```

    - By default, the .gitignore file will be taken into account to know which files should be committed or not. By default we check if a .gitignore file is present in a commit, and if not, we check if it exists on the Hub. Please be aware that only a .gitignore file present at the root of the directory will be used. We do not check for .gitignore files in subdirectories.

    - Makes a single commit, fails explicitly when something wrong happens

- Upload a large folder: ```upload_large_folder()```
    - Resumable
        - Upload process is split into many small tasks
        - Each time a task is completed, result is cached locally in ```./cache/huggingface``` inside the folder you're trying to upload
    - Multi-threaded
    - Resilient to errors: High-level retry-mechanism
        - Downside: If transient errors happen, the process will continue and retry. If permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause.
    - Limitations
        - ...


    ```
    api.upload_large_folder(
        repo_id="HuggingFaceM4/Docmatix",
        repo_type="dataset",
        folder_path="/path/to/local/docmatix",
    )
    ```

- Recommendations
    - Start small

- Upload a folder by chunks: ```upload_folder()```
    - Upload a folder in several commits so we don't have to restart the process from the beginning: Pass ```multi_commits=True``` as an argument
    - Recommended to pass ```multi_commits_verbose=True```
    - Upload will resume from where it stopped
        - If the process is interrupted before completing, you can rerun your script to resume the upload. The created PR will be automatically detected and the upload will resume from where it stopped
    - ```multi_commits``` is still an experimental feature, so check that your installed ```huggingface_hub``` version supports it before relying on it

**<u>Repo Limits and recommendations</u>**

https://huggingface.co/docs/hub/repositories-recommendations

- Repo size: Generally support repos up to 300GB
- Number of files: Keep total number of files under 100k
    - Large datasets can be exported as Parquet files or in WebDataset format
    - Cannot exceed 10k files per folder. Solution is to create a repo structure that uses subdirectories 
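
One way to stay under the per-folder limit is to assign files to numbered subdirectories before uploading. This is a minimal sketch: `assign_subdirectories` and the `shard_XXXX` folder naming are hypothetical choices, not a Hub requirement.

```python
from typing import Dict, List


def assign_subdirectories(file_names: List[str], max_per_dir: int = 10_000) -> Dict[str, str]:
    """Map each file name to 'shard_XXXX/<name>' so no folder exceeds max_per_dir files."""
    layout: Dict[str, str] = {}
    for i, name in enumerate(sorted(file_names)):
        # Integer division puts every block of max_per_dir files in its own folder
        shard = i // max_per_dir
        layout[name] = f"shard_{shard:04d}/{name}"
    return layout


layout = assign_subdirectories([f"segment_{i}.wav" for i in range(5)], max_per_dir=2)
print(layout)
```

With five files and a limit of two per folder, the files are spread over `shard_0000`, `shard_0001` and `shard_0002`. The `file_name` column in `metadata.csv` would need to use the same relative paths.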


**<u>Part 2: Load dataset from the hub using audiofolder</u>**

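```
from datasets import load_dataset

# "username/my_dataset" is a placeholder for the Hub repo uploaded in Part 1.
# streaming=True iterates over the data without downloading everything first: https://huggingface.co/docs/datasets/en/stream
dataset = load_dataset("username/my_dataset", streaming=True)
```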

### <u>Approach 3</u>

https://huggingface.co/docs/hub/repositories-getting-started

https://huggingface.co/docs/datasets/en/audio_dataset#loading-script ((Legacy) Loading script)

https://huggingface.co/docs/hub/datasets-audio

https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607

https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809

https://huggingface.co/docs/hub/datasets-webdataset

Custom loading script

Reasons
- For large scale image and audio datasets streaming, WebDataset should be preferred over raw image and audio files to avoid the overhead of accessing individual files. 
- Audio datasets are commonly stored in tar.gz archives which requires a particular approach to support streaming mode. 



<br/>
<br/>
<br/>

### Creating a dataset loading script for audio datasets

Audio datasets are commonly stored in tar.gz archives which requires a particular approach to support streaming mode

see ```new_dataset_script tutorial.py```

Step 1: Put the dataset into WebDataset format

vivos format:

```
- vivos.tar.gz
    - vivos.tar
        - train
            - genders.txt: Contains the gender type for each waves folder
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - VIVOSSPK01 -> Speaker ID
                    - VIVOSSPK01_R001.wav
                    - VIVOSSPK01_R002.wav
                    - VIVOSSPK01_R003.wav

        - test
    - prompts-train.txt.gz
        - prompts-train.txt: Contains transcriptions for all the .wav files
    - prompts-test.txt.gz
```

Usual size per archive is generally around 1GB?

```
- imda_nsc_p3.tar.gz
    - imda_nsc_p3.tar
        - train
            - prompts.txt: Contains transcriptions for all the .wav files
            - waves
                - 3000-1.tar
                    - 3000-1_1.wav
                    - 3000-1_2.wav
                    - 3000-1_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files
```
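
To see why archives suit streaming, the sketch below packs a few files into a `.tar.gz` and then reads them back strictly in order, pairing each `.wav` with its transcription. It is only an illustration using the standard `tarfile` module: `build_archive` and `iter_examples` are hypothetical helpers, the audio bytes are placeholders, and the nested per-speaker `.tar` from the layout above is flattened for simplicity.

```python
import io
import tarfile
from typing import Dict, Iterator, Tuple


def build_archive(files: Dict[str, bytes]) -> bytes:
    """Pack {path: content} into an in-memory .tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def iter_examples(archive: bytes, prompts: Dict[str, str]) -> Iterator[Tuple[str, bytes, str]]:
    """Yield (path, audio_bytes, transcription) by reading the archive sequentially."""
    # "r|gz" opens the archive as a forward-only stream (no random access)
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|gz") as tar:
        for member in tar:
            if not member.name.endswith(".wav"):
                continue
            # Utterance ID = file name without directory and extension
            utt_id = member.name.rsplit("/", 1)[-1][: -len(".wav")]
            audio = tar.extractfile(member).read()
            yield member.name, audio, prompts.get(utt_id, "")


archive = build_archive({
    "train/waves/3000-1_1.wav": b"fake-wav-bytes-1",  # placeholder, not real audio
    "train/waves/3000-1_2.wav": b"fake-wav-bytes-2",
})
prompts = {"3000-1_1": "um you can go first", "3000-1_2": "you guys are going to stand here ah"}
examples = list(iter_examples(archive, prompts))
```

Because members are consumed one after another, nothing has to be extracted to disk first, which is the property a streaming loader relies on. The transcriptions are passed in as a dictionary here; in the layout above they would be parsed from `prompts-train.txt`.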