# Prepare Sebut Perkataan Dataset

The dataset is very simple: the transcript is derived from the file path.

For example, if the file name is `sebut-perkataan/ayam.wav`, the text is `sebut perkataan ayam`.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/prepare-sst-data](https://github.com/huseinzol05/malaya-speech/tree/master/example/prepare-sst-data).
    
</div>

### Download dataset

Uncomment the code below to download the dataset.

In [8]:
# https://github.com/huseinzol05/Malay-Dataset/tree/master/speech/sebut-perkataan
# !wget https://f000.backblazeb2.com/file/malay-dataset/speech-bahasa.zip
# !mkdir sebut-perkataan
# !unzip speech-bahasa.zip -d sebut-perkataan

In [2]:
import malaya_speech.train as train
import malaya_speech

### Character encoding

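Strings are encoded into integers based on the ASCII table: just pass a string to `malaya_speech.char.encode`. Judging by the output below, each character maps to its ASCII code plus 2, and `1` is appended as the `<EOS>` token.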

In [3]:
encoded = malaya_speech.char.encode('hello ketiak saya masham')
encoded

[106,
 103,
 110,
 110,
 113,
 34,
 109,
 103,
 118,
 107,
 99,
 109,
 34,
 117,
 99,
 123,
 99,
 34,
 111,
 99,
 117,
 106,
 99,
 111,
 1]

In [4]:
malaya_speech.char.decode(encoded)

'hello ketiak saya masham<EOS>'
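
The mapping can be reproduced in a few lines of plain Python. This is a minimal sketch inferred only from the output above (each character becomes its ASCII code plus 2, and `1` marks `<EOS>`). The names `encode_sketch`, `decode_sketch`, `OFFSET` and `EOS_ID` are hypothetical, and the real `malaya_speech.char` implementation may differ.

```python
OFFSET = 2  # assumption: ids below 2 are reserved for special tokens
EOS_ID = 1  # rendered as <EOS> in the decoded output above


def encode_sketch(text):
    """Map each character to ord(char) + OFFSET and append the EOS id."""
    return [ord(ch) + OFFSET for ch in text] + [EOS_ID]


def decode_sketch(ids):
    """Invert encode_sketch, rendering the EOS id as '<EOS>'."""
    return ''.join('<EOS>' if i == EOS_ID else chr(i - OFFSET) for i in ids)


ids = encode_sketch('hello')
print(ids)                 # [106, 103, 110, 110, 113, 1]
print(decode_sketch(ids))  # hello<EOS>
```

`decode_sketch` only handles ids produced by `encode_sketch`; it makes no attempt to deal with other reserved ids such as padding.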

### Building the dataset

In [5]:
from glob import glob
import os

# The pattern matches exactly one directory level; without '**', recursive = True has no effect.
files = glob('sebut-perkataan/*/*.wav')
len(files)

1463

In [6]:
def get_text(file):
    file = file.replace('.wav', '')
    splitted = file.split('/')[1:]
    splitted[0] = splitted[0].replace('-woman', '').replace('-man', '').replace('-', ' ')
    return ' '.join(splitted).lower().strip()

get_text(files[0])

'sebut perkataan amko'
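
To see what `get_text` does with the speaker suffix, it can be exercised on a few made-up paths. The sketch below repeats the function so that it runs on its own. The example paths are hypothetical and only assume the `sebut-perkataan/<folder>/<word>.wav` layout implied by the glob pattern above.

```python
def get_text(file):
    file = file.replace('.wav', '')
    # Drop the top-level directory, keep '<folder>/<word>'
    splitted = file.split('/')[1:]
    # Remove the speaker suffix and turn the remaining hyphens into spaces
    splitted[0] = splitted[0].replace('-woman', '').replace('-man', '').replace('-', ' ')
    return ' '.join(splitted).lower().strip()


# Hypothetical paths, not taken from the real dataset listing
examples = [
    'sebut-perkataan/sebut-perkataan-man/ayam.wav',
    'sebut-perkataan/sebut-perkataan-woman/ayam.wav',
]

for path in examples:
    print(path, '->', get_text(path))
```

Both paths yield `sebut perkataan ayam`, so recordings of the same word by different speakers share one transcript. The function splits on `/`, so it assumes POSIX-style paths.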

In [7]:
audios, texts = [], []

for file in files:
    text = get_text(file)
    audios.append(file)
    texts.append(text)
    
len(audios), len(texts)

(1463, 1463)

### Change into TFRecord

This step is not necessary: we recommend training the model directly from a yield iterator (a Python generator). However, we can also save the data as TFRecord files to speed up the data pipeline. Either way, we first need to create a yield iterator.

In [34]:
from tqdm import tqdm

def generator():
    for i in tqdm(range(len(audios))):
        wav_data, sr = malaya_speech.load(audios[i])

        yield {
            'waveforms': wav_data.tolist(),
            'waveform_lens': [len(wav_data)],
            'targets': malaya_speech.char.encode(texts[i]),
            'raw_transcript': [texts[i]],
        }
        
generator = generator()

In [35]:
import os
import tensorflow as tf

os.system('rm tolong-sebut/data/*')
DATA_DIR = os.path.expanduser('tolong-sebut/data')
tf.gfile.MakeDirs(DATA_DIR)

#### Define shards

Shards are defined as a list of splits, each with a number of shard files:

```python
shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]
```

With this definition, if we have 100 samples, 99% of them will be used for train and 1% for dev.

In [36]:
shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]
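
To get a feel for the resulting split sizes, the sketch below assumes that examples are dealt round-robin across all 100 shard files, 99 of them labelled `train` and 1 labelled `dev`. That distribution rule is an assumption made for illustration, not something confirmed for `prepare_dataset`, and `split_sizes` is a hypothetical helper.

```python
from collections import Counter


def split_sizes(n_examples, shards):
    """Count examples per split, assuming round-robin assignment to shard files."""
    # Expand the shard spec into one split label per shard file
    labels = [s['split'] for s in shards for _ in range(s['shards'])]
    counts = Counter()
    for i in range(n_examples):
        counts[labels[i % len(labels)]] += 1
    return dict(counts)


shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]
print(split_sizes(100, shards))   # {'train': 99, 'dev': 1}
print(split_sizes(1463, shards))  # {'train': 1449, 'dev': 14}
```

Under this assumption, the 1463 files in this dataset would give roughly 1449 train and 14 dev examples. Because the data is shuffled, the exact examples in each split would vary.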

#### Save to TFRecord

Just pass the yield iterator to `malaya_speech.train.prepare_dataset`:

```python
def prepare_dataset(
    generator,
    data_dir: str,
    shards: List[Dict],
    prefix: str = 'dataset',
    shuffle: bool = True,
    already_shuffled: bool = False,
):
```

In [37]:
train.prepare_dataset(generator, DATA_DIR, shards, prefix = 'tolong-sebut')

  0%|          | 0/1463 [00:00<?, ?it/s]
INFO:tensorflow:Generating case 0.
100%|██████████| 1463/1463 [00:24<00:00, 60.74it/s]
INFO:tensorflow:Generated 1463 Examples
INFO:tensorflow:Shuffling data...
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:Data shuffled.
