## Prepare Dataset 3: Reduce the training dataset from 530 hours to 10 hours

- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english

### New Folder Structure

```
output drive
- dataset_3
    - data: Stores the compressed files
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
        - transcripts
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
        - transcripts
...

data
- imda_nsc_p3_extra_small.tar.gz
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
- prompts-train-extra-small.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files -> taken from train/prompts.txt
- prompts-test-extra-small.txt.gz
    - prompts-test.txt: Contains transcriptions for all the test .wav files -> taken from test/prompts.txt
```

**Imports**

In [1]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment



**Initialise Paths and Create the directories**

**<u>USER ACTION REQUIRED</u>**

- Optionally change the relative paths and naming conventions

In [2]:
org_train_path = ['dataset', 'train']
org_test_path = ['dataset', 'test']
small_train_path = ['dataset_3', 'train']
small_test_path = ['dataset_3', 'test']
small_compressed_path = ['dataset_3','data']
small_compressed_filename = 'imda_nsc_p3_extra_small.tar.gz'
small_compressed_train_prompt_filename = 'prompts-train-extra-small.txt.gz'
small_compressed_test_prompt_filename = 'prompts-test-extra-small.txt.gz'

**<u>USER ACTION REQUIRED</u>**

- Specify the output drive path
- Optionally change the relative paths and naming conventions

In [3]:
output_drive_path = os.getcwd()
small_train_folder_waves = os.path.join(output_drive_path, *small_train_path, 'waves')
small_train_folder_transcripts = os.path.join(output_drive_path, *small_train_path, 'transcripts')
small_test_folder_waves = os.path.join(output_drive_path, *small_test_path, 'waves')
small_test_folder_transcripts = os.path.join(output_drive_path, *small_test_path, 'transcripts')
small_compressed_folder = os.path.join(output_drive_path, *small_compressed_path)
small_compressed_file = os.path.join(small_compressed_folder, small_compressed_filename)
small_compressed_train_prompt_file = os.path.join(small_compressed_folder, small_compressed_train_prompt_filename)
small_compressed_test_prompt_file = os.path.join(small_compressed_folder, small_compressed_test_prompt_filename)

org_train_folder_waves = os.path.join(output_drive_path, *org_train_path, 'waves')
org_train_folder_transcripts = os.path.join(output_drive_path, *org_train_path, 'transcripts')
org_test_folder_waves = os.path.join(output_drive_path, *org_test_path, 'waves')
org_test_folder_transcripts = os.path.join(output_drive_path, *org_test_path, 'transcripts')

create_dir = [small_train_folder_waves, small_train_folder_transcripts, small_test_folder_waves, small_test_folder_transcripts, 
              small_compressed_folder]

for dir_path in create_dir:  # renamed from `dir` to avoid shadowing the built-in
    os.makedirs(dir_path, exist_ok=True)

**<u>USER ACTION REQUIRED</u>**

- Decide the total number of dataset hours that will be uploaded to Hugging Face
- Optionally change the train/test split ratio

In [5]:
total_dataset_hours = 12
train_split = 0.9
train_data_hours = math.floor(train_split*total_dataset_hours)
test_data_hours = total_dataset_hours - train_data_hours
print(f'Train data hours will be {train_data_hours}')
print(f'Test data hours will be {test_data_hours}')

Train data hours will be 10
Test data hours will be 2
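
The split arithmetic above can be wrapped in a small helper so that the fraction is defined in one place. This is only a sketch: `split_hours` is a hypothetical name that does not appear in the original notebook.

```python
import math

def split_hours(total_hours, train_fraction):
    """Return (train_hours, test_hours), flooring the training share to whole hours."""
    train_hours = math.floor(train_fraction * total_hours)
    return train_hours, total_hours - train_hours

# 0.9 * 12 = 10.8, which floors to 10 training hours and leaves 2 for test
train_hours, test_hours = split_hours(12, 0.9)
print(train_hours, test_hours)  # 10 2
```

Because the training share is floored, the test split absorbs the remainder, so the effective ratio here is 10:2 rather than exactly 90:10.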


**Define functions to copy the .wav and .txt files**

In [6]:
def copyfiles_wav(src_dir,dest_dir,filenames):
    for filename in filenames:
        src_fp = os.path.join(src_dir,filename + '.wav')
        shutil.copy2(src_fp, dest_dir) 

In [7]:
def copyfiles_txt(src_dir,dest_dir,filenames):
    for filename in filenames:
        src_fp = os.path.join(src_dir,filename + '.txt')
        shutil.copy2(src_fp, dest_dir) 
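
The two helpers above differ only in the file extension. A minimal sketch of a single helper follows, assuming it is acceptable to skip and report missing source files instead of raising an error. `copy_files` and its `extension` parameter are hypothetical names.

```python
import os
import shutil

def copy_files(src_dir, dest_dir, stems, extension):
    """Copy <stem><extension> for every stem and return the stems with no source file."""
    missing = []
    for stem in stems:
        src_fp = os.path.join(src_dir, stem + extension)
        if os.path.isfile(src_fp):
            shutil.copy2(src_fp, dest_dir)  # copy2 also preserves file metadata
        else:
            missing.append(stem)
    return missing
```

In the notebook this could be called as `copy_files(org_train_folder_waves, small_train_folder_waves, train_filenames, '.wav')`, and a non-empty return value would point to transcripts or recordings that are missing from the source folder.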

<br/>
<br/>
<br/>
<br/>
<br/>

**Accumulate the required hours of training data**

In [8]:
total_duration = 0 
train_filenames = []
for filename in os.listdir(org_train_folder_waves):
    fp = os.path.join(org_train_folder_waves, filename)
    audio = AudioSegment.from_file(fp)
    total_duration += len(audio)/1000 # Add the length of audio segments in seconds
    train_filenames.append(os.path.splitext(filename)[0]) # Strip only the extension, even if the name contains other dots
    if total_duration/3600 >= train_data_hours: # Check if the total duration has exceeded our requirements in hours
        print(f'Accumulated {total_duration/3600} hours of training data')
        break

Accumulated 10.001350555555547 hours of training data
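
The loop above decodes each file with `AudioSegment.from_file` only to measure its length. As a possible alternative, assuming the files are uncompressed PCM WAV, the duration can be read from the file header with the standard-library `wave` module. The helper names below are hypothetical, and the `sorted` call is an added assumption to make the selection reproducible, since `os.listdir` returns entries in arbitrary order.

```python
import os
import wave

def wav_duration_seconds(path):
    """Duration from the WAV header: frame count divided by sample rate."""
    with wave.open(path, 'rb') as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def select_until_hours(wav_dir, target_hours):
    """Collect file stems in sorted order until the target number of hours is reached."""
    selected, total_seconds = [], 0.0
    for filename in sorted(os.listdir(wav_dir)):
        if not filename.lower().endswith('.wav'):
            continue  # ignore anything that is not a .wav file
        total_seconds += wav_duration_seconds(os.path.join(wav_dir, filename))
        selected.append(os.path.splitext(filename)[0])
        if total_seconds / 3600 >= target_hours:
            break
    return selected, total_seconds / 3600
```

As in the original loop, the last file is included even if it pushes the total slightly past the target, so the accumulated duration ends up marginally above the requested hours.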


In [9]:
train_filenames[:10]

['3000-1_11',
 '3000-1_12',
 '3000-1_13',
 '3000-1_16',
 '3000-1_17',
 '3000-1_19',
 '3000-1_2',
 '3000-1_20',
 '3000-1_21',
 '3000-1_22']

**Copy training wav files from ```dataset/train/waves``` to ```dataset_3/train/waves```**

In [10]:
copyfiles_wav(org_train_folder_waves,small_train_folder_waves,train_filenames)

**Copy training transcript files from ```dataset/train/transcripts``` to ```dataset_3/train/transcripts```**

In [11]:
copyfiles_txt(org_train_folder_transcripts,small_train_folder_transcripts,train_filenames)

<br/>
<br/>
<br/>
<br/>
<br/>

**Accumulate the required hours of test data**

In [12]:
total_duration = 0 
test_filenames = []
for filename in os.listdir(org_test_folder_waves):
    fp = os.path.join(org_test_folder_waves, filename)
    audio = AudioSegment.from_file(fp)
    total_duration += len(audio)/1000 # Add the length of audio segments in seconds
    test_filenames.append(os.path.splitext(filename)[0]) # Strip only the extension, even if the name contains other dots
    if total_duration/3600 >= test_data_hours: # Check if the total duration has exceeded our requirements in hours
        print(f'Accumulated {total_duration/3600} hours of test data')
        break

Accumulated 2.001969722222222 hours of test data


In [13]:
test_filenames[:10]

['3000-1_1',
 '3000-1_10',
 '3000-1_14',
 '3000-1_15',
 '3000-1_18',
 '3000-1_29',
 '3000-1_44',
 '3000-1_45',
 '3000-1_49',
 '3000-1_50']

**Copy test wav files from ```dataset/test/waves``` to ```dataset_3/test/waves```**

In [14]:
copyfiles_wav(org_test_folder_waves,small_test_folder_waves,test_filenames)

**Copy test transcript files from ```dataset/test/transcripts``` to ```dataset_3/test/transcripts```**

In [15]:
copyfiles_txt(org_test_folder_transcripts,small_test_folder_transcripts,test_filenames)

<br/>
<br/>
<br/>
<br/>
<br/>

**Write the ```/train/prompts.txt``` and ```/test/prompts.txt``` files**

In [16]:
train_prompts_path = os.path.join(output_drive_path, *small_train_path, 'prompts.txt')
with open(train_prompts_path, 'w') as outfile: # 'w' so that re-running the cell does not append duplicate transcripts
    for filename in os.listdir(small_train_folder_transcripts):
        file_path = os.path.join(small_train_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')

In [17]:
test_prompts_path = os.path.join(output_drive_path, *small_test_path, 'prompts.txt')
with open(test_prompts_path, 'w') as outfile: # 'w' so that re-running the cell does not append duplicate transcripts
    for filename in os.listdir(small_test_folder_transcripts):
        file_path = os.path.join(small_test_folder_transcripts, filename)
        with open(file_path, "r") as infile:
            outfile.write(infile.read() + '\n')
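
Both cells follow the same pattern, so they could share one helper. The sketch below makes a few assumptions that the original does not state: transcripts are UTF-8 text, each transcript should end with exactly one newline, and files are processed in sorted order so the output is reproducible. `write_prompts` is a hypothetical name.

```python
import os

def write_prompts(transcripts_dir, prompts_path):
    """Concatenate every .txt transcript into prompts_path and return the file count."""
    filenames = sorted(f for f in os.listdir(transcripts_dir) if f.endswith('.txt'))
    # Open in 'w' mode so repeated runs overwrite the file instead of growing it
    with open(prompts_path, 'w', encoding='utf-8') as outfile:
        for filename in filenames:
            file_path = os.path.join(transcripts_dir, filename)
            with open(file_path, 'r', encoding='utf-8') as infile:
                # Normalise trailing newlines so no blank lines are written
                outfile.write(infile.read().rstrip('\n') + '\n')
    return len(filenames)
```

In the notebook this might be called as `write_prompts(small_train_folder_transcripts, train_prompts_path)` and again with the test paths.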

<br/>
<br/>
<br/>
<br/>
<br/>

**Compress the folders into ```.tar.gz```**

In [18]:
paths_to_compress = [train_prompts_path, small_train_folder_waves, test_prompts_path, small_test_folder_waves]

with tarfile.open(small_compressed_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        # Store paths relative to the dataset_3 folder so the archive contains train/... and test/...
        rel_path = os.path.relpath(path, os.path.dirname(small_compressed_folder))
        tar_gz.add(path, arcname=rel_path) 
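
Before uploading, it may be worth confirming that the archive members match the layout shown in the folder structure (`train/...` and `test/...`). The sketch below lists the stored names and flags any that are absolute or contain `..`, which would extract outside the target folder. Both function names are hypothetical.

```python
import tarfile

def archive_member_names(archive_path):
    """Return the names of all members stored in a .tar.gz archive."""
    with tarfile.open(archive_path, 'r:gz') as tar_gz:
        return tar_gz.getnames()

def suspicious_member_names(names):
    """Return names that are absolute or contain a '..' path component."""
    return [n for n in names if n.startswith('/') or '..' in n.split('/')]
```

In the notebook, `archive_member_names(small_compressed_file)[:10]` would show the first few stored paths, and `suspicious_member_names(...)` should return an empty list for a well-formed archive.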

In [19]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(small_compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)

In [20]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(small_compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
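
As a final sanity check, the compressed prompt files can be decompressed and compared with their sources. This is a minimal sketch with hypothetical helper names. It uses `shutil.copyfileobj` for the compression step, which is a swapped-in alternative to the `writelines` call above that produces the same decompressed content.

```python
import gzip
import shutil

def gzip_file(src_path, dest_path):
    """Compress src_path into dest_path by streaming it in chunks."""
    with open(src_path, 'rb') as f_in, gzip.open(dest_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

def gzip_matches_source(src_path, gz_path):
    """Return True if the decompressed archive is byte-identical to the source file."""
    with open(src_path, 'rb') as f_src, gzip.open(gz_path, 'rb') as f_gz:
        return f_src.read() == f_gz.read()
```

For example, `gzip_matches_source(train_prompts_path, small_compressed_train_prompt_file)` should return `True` after the cells above have run.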