## **Prepare Full Dataset Split: Compress the files to separate train and test data**

### New Folder Structure

```
dataset_split
- data_train
- data_test

data_train
- imda_nsc_p3_train.tar.gz
    - train
        - prompts.txt: Contains transcriptions for all the .wav files in train
        - waves
            - 3000-1_1.wav
            - 3000-1_2.wav
            - 3000-1_3.wav
            - ...
            - 3000-2_1.wav
            - 3000-2_2.wav
            - 3000-2_3.wav
- prompts-train.txt.gz
    - prompts-train.txt: Contains transcriptions for all the train .wav files -> copied from train/prompts.txt

...

data_test
- imda_nsc_p3_test.tar.gz
    - test
        - prompts.txt: Contains transcriptions for all the .wav files in test
        - waves
            - 3000-3_1.wav
            - 3000-3_2.wav
            - 3000-3_3.wav
            - ...
            - 3000-4_1.wav
            - 3000-4_2.wav
            - 3000-4_3.wav
- prompts-test.txt.gz
    - prompts-test.txt: Contains transcriptions for all the test .wav files -> copied from test/prompts.txt
```

**Imports**

In [1]:
import re 
import os
import shutil
import tarfile
import gzip
import math
import random
from praatio import textgrid 
from pydub import AudioSegment



**Initialise Paths and Create the directories**

**<u>USER ACTION REQUIRED</u>**

- Change the relative paths and file names below if your layout or naming convention differs

**Paths for splits**

In [2]:
compressed_train_path = ['dataset_split','data_train']
compressed_train_filename = 'imda_nsc_p3_train.tar.gz'
compressed_train_prompt_filename = 'prompts-train.txt.gz'

compressed_test_path = ['dataset_split','data_test']
compressed_test_filename = 'imda_nsc_p3_test.tar.gz'
compressed_test_prompt_filename = 'prompts-test.txt.gz'

In [3]:
output_drive_path = os.getcwd()

compressed_train_folder = os.path.join(output_drive_path, *compressed_train_path)
compressed_train_file = os.path.join(compressed_train_folder, compressed_train_filename)
compressed_train_prompt_file = os.path.join(compressed_train_folder, compressed_train_prompt_filename)

compressed_test_folder = os.path.join(output_drive_path, *compressed_test_path)
compressed_test_file = os.path.join(compressed_test_folder, compressed_test_filename)
compressed_test_prompt_file = os.path.join(compressed_test_folder, compressed_test_prompt_filename)

create_dir = [compressed_train_folder, compressed_test_folder]

# Avoid shadowing the built-in dir() with the loop variable
for directory in create_dir:
    os.makedirs(directory, exist_ok=True)

**Paths to original data**

In [4]:
output_train_path = ['dataset', 'train']
train_prompts_path = os.path.join(output_drive_path, *output_train_path, 'prompts.txt')
output_train_folder_waves = os.path.join(output_drive_path, *output_train_path, 'waves')

output_test_path = ['dataset', 'test']
test_prompts_path = os.path.join(output_drive_path, *output_test_path, 'prompts.txt')
output_test_folder_waves  = os.path.join(output_drive_path, *output_test_path, 'waves')
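
Before compressing, it can help to confirm that the source files and folders exist, since `tarfile` and `open` would otherwise fail partway through. The following is a minimal sketch; `missing_paths` is a hypothetical helper that is not part of the original notebook, and the demonstration uses a throwaway directory rather than the real dataset.

```python
import os
import tempfile


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Demonstration on a throwaway directory (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "prompts.txt")
    with open(existing, "w") as f:
        f.write("demo\n")
    absent = os.path.join(tmp, "waves")  # never created

    demo_missing = missing_paths([existing, absent])
    demo_expected = [absent]
```

In the notebook, the same helper could be called with `train_prompts_path`, `output_train_folder_waves`, `test_prompts_path` and `output_test_folder_waves`, stopping early if the returned list is not empty.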

**Compress the training data into ```.tar.gz```**

In [5]:
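paths_to_compress = [train_prompts_path, output_train_folder_waves]

# Store members relative to the parent of the train folder ('dataset'), so the
# archive holds train/prompts.txt and train/waves/... with no '..' components.
archive_root = os.path.join(output_drive_path, *output_train_path[:-1])

with tarfile.open(compressed_train_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, archive_root)
        tar_gz.add(path, arcname=rel_path)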

In [6]:
with open(train_prompts_path, 'rb') as f_in, gzip.open(compressed_train_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
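
To check that an archive has the layout described under New Folder Structure, its member names can be listed after writing. The sketch below is illustrative only: `archive_member_names` and `unsafe_member_names` are hypothetical helpers, and the demonstration builds a tiny archive from made-up files in a temporary directory instead of the real dataset.

```python
import os
import tarfile
import tempfile


def archive_member_names(archive_path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar_gz:
        return sorted(tar_gz.getnames())


def unsafe_member_names(names):
    """Return names that are absolute or contain a '..' path component."""
    return [n for n in names if n.startswith("/") or ".." in n.split("/")]


# Demonstration on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    waves = os.path.join(tmp, "dataset", "train", "waves")
    os.makedirs(waves)
    prompts = os.path.join(tmp, "dataset", "train", "prompts.txt")
    with open(prompts, "w") as f:
        f.write("3000-1_1 hello\n")
    with open(os.path.join(waves, "3000-1_1.wav"), "wb") as f:
        f.write(b"\x00")

    demo_archive = os.path.join(tmp, "demo_train.tar.gz")
    dataset_root = os.path.join(tmp, "dataset")
    with tarfile.open(demo_archive, "w:gz") as tar_gz:
        for path in (prompts, waves):
            # Member names are taken relative to the 'dataset' folder.
            tar_gz.add(path, arcname=os.path.relpath(path, dataset_root))

    demo_names = archive_member_names(demo_archive)
    demo_unsafe = unsafe_member_names(demo_names)
```

Applied to `compressed_train_file`, the listing should start with `train/` entries, and `unsafe_member_names` should return an empty list. Names containing `..` would place files outside the target folder on extraction.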

**Compress the test data into ```.tar.gz```**

In [7]:
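paths_to_compress = [test_prompts_path, output_test_folder_waves]

# Store members relative to the parent of the test folder ('dataset'), so the
# archive holds test/prompts.txt and test/waves/... with no '..' components.
archive_root = os.path.join(output_drive_path, *output_test_path[:-1])

with tarfile.open(compressed_test_file, "w:gz") as tar_gz:
    for path in paths_to_compress:
        rel_path = os.path.relpath(path, archive_root)
        tar_gz.add(path, arcname=rel_path)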

In [8]:
with open(test_prompts_path, 'rb') as f_in, gzip.open(compressed_test_prompt_file, 'wb') as f_out:
    f_out.writelines(f_in)
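
A quick round-trip check can confirm that a gzipped prompts file decompresses to exactly the bytes of its source. This is a sketch under stated assumptions: `sha256_of_stream` and `gzip_matches_source` are hypothetical helpers, and the demonstration uses made-up content in a temporary directory.

```python
import gzip
import hashlib
import os
import tempfile


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def gzip_matches_source(source_path, gz_path):
    """Return True if gz_path decompresses to the same bytes as source_path."""
    with open(source_path, "rb") as f_src, gzip.open(gz_path, "rb") as f_gz:
        return sha256_of_stream(f_src) == sha256_of_stream(f_gz)


# Demonstration with made-up prompt lines in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "prompts.txt")
    with open(source, "wb") as f:
        f.write(b"3000-3_1 hello\n3000-3_2 world\n")

    compressed = os.path.join(tmp, "prompts.txt.gz")
    with open(source, "rb") as f_in, gzip.open(compressed, "wb") as f_out:
        f_out.writelines(f_in)

    other = os.path.join(tmp, "other.txt")
    with open(other, "wb") as f:
        f.write(b"different content\n")

    demo_match = gzip_matches_source(source, compressed)
    demo_mismatch = gzip_matches_source(other, compressed)
```

In the notebook, `gzip_matches_source(train_prompts_path, compressed_train_prompt_file)` and the equivalent call for the test paths would be expected to return `True`.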