# Whisper Finetune

## Part 1: Get Data 

National Speech Corpus (NSC)
- Part 3: 1000 hours of conversational speech data (used by the Home Team)
- Part 2: 1000 hours of prompted recordings of random sentences containing local words and entities (used by an independent developer)
- Part 4: Conversational code-switched data (from Singaporean English to various native languages)

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english


<br/>
<br/>
<br/>
<br/>
<br/>

## Part 2: Prepare Data

- Match each transcript sentence to its corresponding audio file
- Check the environment in which the audio was recorded and decide which recording setup to use
    - Close-talk microphone that isolates the main speaker's voice
    - Standing microphone
- Clean the transcripts by removing annotations for
    - <em>Paralinguistic phenomena</em>: (e.g., breathing, coughing, laughing) — this is represented in the text by annotations such as (ppo), (ppb), (ppl) etc.
    - <em>Fillers or unknown words</em>: ```<FIL/>```, unclear words ```<UNK>```, short pauses ```<S>``` etc. according to the NSC transcription guidelines
    - <em>Unique Singlish particles</em>: we removed the annotations and kept the particles as part of the text e.g., ‘ok ```[lah]``` we go there’ → ‘ok lah we go there’
- Normalise the transcript text
    - Remove punctuation
    - Lowercase text
- Create 30s audio segments with corresponding transcripts
    - Using the time segments from the ```TextGrid``` files, splice out the corresponding segments from the WAV files
    - Combine shorter consecutive segments (?)
    - 30s: Whisper's feature extractor pads or truncates every input to 30s, so this length is intrinsic to the model's design
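
The segmentation step above can be sketched as follows. This is a minimal sketch, not the method used in the linked articles: it assumes the intervals are already available as (start, end, text) records, merges consecutive ones greedily while the merged span stays within 30s, and cuts the audio with the standard-library `wave` module. The names `Segment`, `merge_segments` and `cut_wav` are hypothetical.

```python
import wave
from dataclasses import dataclass

MAX_SECONDS = 30.0  # length of Whisper's input window


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    text: str


def merge_segments(segments, max_seconds=MAX_SECONDS):
    """Greedily merge consecutive segments while the merged span fits in max_seconds."""
    merged = []
    for seg in segments:
        if merged and seg.end - merged[-1].start <= max_seconds:
            last = merged[-1]
            merged[-1] = Segment(last.start, seg.end, f"{last.text} {seg.text}".strip())
        else:
            # Start a new chunk; a single segment longer than max_seconds is kept as-is.
            merged.append(Segment(seg.start, seg.end, seg.text))
    return merged


def cut_wav(src_path, dst_path, start, end):
    """Copy the audio between start and end (seconds) into a new WAV file."""
    with wave.open(str(src_path), "rb") as src:
        rate = src.getframerate()
        first = int(start * rate)
        last = min(int(end * rate), src.getnframes())
        src.setpos(first)
        frames = src.readframes(last - first)
        params = src.getparams()
    with wave.open(str(dst_path), "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
```

Silence between merged segments is kept in the audio, since the cut runs from the first segment's start to the last segment's end.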

<br/>
<br/>

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english


### Simple Example

**1. Match 3000-1.wav and 3000-1.TextGrid**
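
For more than one file, the matching can be done by file stem. The sketch below assumes that each recording and its transcript share the same stem (as `3000-1` does here) and that the extension may be written in any case. The function name `pair_audio_with_transcripts` is hypothetical.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Return (wav_path, textgrid_path) pairs that share a file stem, e.g. 3000-1."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    # Index the transcripts by stem so each WAV lookup is a dictionary access.
    grids = {p.stem: p for p in files if p.suffix.lower() == ".textgrid"}
    pairs = []
    for wav in sorted(p for p in files if p.suffix.lower() == ".wav"):
        grid = grids.get(wav.stem)
        if grid is not None:  # WAV files without a transcript are skipped
            pairs.append((wav, grid))
    return pairs
```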

**2. Clean transcripts by removing annotations**

Reference Transcription Guidelines

- Apostrophes (own observation)
    - Example: couldn't, he's
- Acronyms
    - Acronyms have '_' by convention
    - Example: n_a_f_a, n_t_u, l_m_s
- Multi-word nouns
    - Multi-word nouns have '-' by convention
    - Example: hong-kong, ang-mo-kio, s_t-engineering
- Discourse particles (Unique Singlish particles)
    - Discourse particles have [...] by convention
    - Example: [oh], [ah], [wah], [one], [lah]
- Fillers
    - Fillers have (...) by convention
    - Example: (uh), (um), (er), (erm)
- Interjections
    - Interjections have !...! by convention
    - Example: !walao!, !wow!, !aiyo!
- Paralinguistic Phenomena (e.g., breathing, coughing, laughing)
    - Paralinguistic phenomena include
        - (ppb) breath
        - (ppc) cough
        - (ppl) laugh
        - (ppo) others
    - Example: (ppc), (ppo), (ppb), (ppl)
- Other languages 
    - Other languages have #...# by convention
    - Example: #pasar malam#, #roti-john#, #tak sedap#, #muah-chee#, #shiok#, #pek chek#
- Unclear words
    - Unclear words are denoted by ```<UNK>``` or ```<unk>```
- Incomplete words
    - Incomplete words have '~' at the end by convention
    - Example: abbre~, abbrev~
- Short pauses: ```<S>```
- Invalid: ```<Z>```
    - Invalid speech, Noise, Non-primary speaker's voice, Sounds from the monitor/speaker, Continuous noise 
- <u>Long-running</u> Non-English utterances: ```<NEN>```
- Fillers: ```<FIL/>```
    - This is equivalent to [xxx]
- Speaker Noise: ```<SPK/>```
    - This is equivalent to (ppb), (ppc), (ppl), (ppo)
- Unknown: ```**```
    - This is equivalent to ```<UNK>```
- Non-primary speaker sound: ```<NON/>```
    - This includes background sounds, including sounds made by other speaker, background noise etc.
- End of sentence: ```<s/>```
- Comma: ```<c/>```
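
The conventions above can be turned into a small cleaning function. This is a sketch under several assumptions of my own, not the exact rules from the linked articles: everything in `<...>` and `(...)` is dropped (so fillers such as (uh) are removed along with the paralinguistic tags), the markers around discourse particles, interjections and other-language words are stripped while the words are kept, acronym underscores are removed, hyphens become spaces, and apostrophes are kept. The name `clean_transcript` is hypothetical.

```python
import re

# <UNK>, <S>, <Z>, <NEN>, <FIL/>, <SPK/>, <NON/>, <s/>, <c/> and the ** marker
TAGS = re.compile(r"<[^<>]+>|\*\*")
# (ppb), (ppc), (ppl), (ppo) and fillers such as (uh), (um)
PARENS = re.compile(r"\([^()]*\)")
# Anything that is not a lowercase letter, digit, apostrophe or space
NOT_WORD = re.compile(r"[^a-z0-9' ]+")


def clean_transcript(text):
    """Remove NSC annotations, then lowercase and strip punctuation."""
    text = TAGS.sub(" ", text)
    text = PARENS.sub(" ", text)
    text = text.lower()
    text = text.replace("_", "")   # n_t_u -> ntu (assumption)
    text = text.replace("-", " ")  # ang-mo-kio -> ang mo kio (assumption)
    text = NOT_WORD.sub("", text)  # strips [ ] ! # ~ and other punctuation
    return " ".join(text.split())  # collapse repeated whitespace


print(clean_transcript("ok [lah] we go there"))
```

Note that the last pattern also removes non-ASCII letters, which would need a wider character class if the transcripts contain them.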

    

```python
# Read the TextGrid as raw bytes, one list element per line
with open('./dataset/part3/simple_example/3000-1.TextGrid', 'rb') as f:
    contents = f.readlines()
```
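
Once the file is read, the intervals can be pulled out with a regular expression. The sketch below assumes the long (verbose) Praat TextGrid layout, where each interval lists `xmin`, `xmax` and `text` in that order, and assumes the bytes have already been decoded to a string (for example `b"".join(contents).decode("utf-8-sig")`, if the file is UTF-8). The name `parse_intervals` is hypothetical.

```python
import re

# One interval: xmin, xmax and a quoted text field, in that order.
# Praat escapes a double quote inside the text by doubling it.
INTERVAL = re.compile(
    r'xmin\s*=\s*([\d.]+)\s+xmax\s*=\s*([\d.]+)\s+text\s*=\s*"((?:[^"]|"")*)"'
)


def parse_intervals(textgrid_text):
    """Return (start, end, text) tuples for every non-empty interval."""
    rows = []
    for xmin, xmax, text in INTERVAL.findall(textgrid_text):
        text = text.replace('""', '"').strip()
        if text:  # skip intervals with no transcript
            rows.append((float(xmin), float(xmax), text))
    return rows
```

The file-level and tier-level `xmin`/`xmax` pairs are not followed by a `text` field, so the pattern skips them.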

<br/>
<br/>
<br/>
<br/>
<br/>

## Part 3: Load Data

<u>Upload to HuggingFace</u>

Prepare our own audio dataset and upload it to the Hugging Face (HF) Hub

Stream the data during training instead of downloading the full dataset first
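
One way to do both steps is the `audiofolder` loader from the `datasets` library, which reads a folder of audio files plus a `metadata.csv` with a `file_name` column. The sketch below is one possible route, not necessarily the one used in the linked articles. The helper names, the `transcription` column name and the repository id are assumptions, and pushing requires the third-party `datasets` package and a logged-in Hugging Face account.

```python
import csv
from pathlib import Path


def write_metadata(data_dir, rows):
    """Write metadata.csv next to the clips; rows are (file_name, transcription)."""
    path = Path(data_dir) / "metadata.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        writer.writerows(rows)
    return path


def push_audio_folder(data_dir, repo_id):
    """Load the folder as an audio dataset and push it to the Hub."""
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset("audiofolder", data_dir=str(data_dir))
    dataset.push_to_hub(repo_id)  # repo_id is a placeholder such as "user/name"


def stream_split(repo_id, split="train"):
    """Iterate over the uploaded dataset without downloading it in full."""
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

The `datasets` import sits inside the functions so that the metadata step can run on a machine without the library installed.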

- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://twodatadetectives.medium.com/push-your-custom-dataset-to-huggingface-two-ways-47482e8a0f34
- https://huggingface.co/docs/datasets/en/create_dataset
- https://huggingface.co/docs/datasets/en/audio_dataset#loading-script
- https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py