# Demonstration of data processing

This demo shows how to convert both audio and text data to a common format (phone sequences) for language modeling. We use Xhosa for a basic demonstration because:
* HuggingFace datasets exist for the language in both text and audio form.
* Epitran supports it out of the box.
* Allosaurus works on any language.

In [None]:
! pip install datasets transformers

## Text data to phones

For convenience, we use the cc100 dataset (https://huggingface.co/datasets/cc100).

In [None]:
from datasets import load_dataset

xhosa_text_dataset = load_dataset("cc100", lang="xh")

Using custom data configuration xh-lang=xh
Reusing dataset cc100 (/root/.cache/huggingface/datasets/cc100/xh-lang=xh/0.0.0/b583dd47b0dd43a3c3773075abd993be12d0eee93dbd2cfe15a0e4e94d481e80)



In [None]:
print(xhosa_text_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 776651
    })
})


In [None]:
# print the first sample
print(xhosa_text_dataset["train"][0])

{'id': '0', 'text': 'Isigaba 409a ongakhetha kuzo - Ongakhetha isigaba\n'}


In [None]:
! pip install epitran
import epitran
epi = epitran.Epitran('xho-Latn')  # Xhosa in Latin script

In [None]:
def transliterate_example(example):
    # Replace the orthographic text with its IPA transliteration
    example["text"] = epi.transliterate(example["text"])
    return example

transliterated = transliterate_example(xhosa_text_dataset["train"][0])
print(transliterated)

{'id': '0', 'text': 'isiɡ̤aɓa 409a ɔnɡ̤akʰɛtʰa kuz̤ɔ - ɔnɡ̤akʰɛtʰa isiɡ̤aɓa\n'}


In [None]:
xhosa_text_dataset_phonemized = xhosa_text_dataset.map(transliterate_example)

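Mapping one example at a time over 776,651 rows can be slow. As a minimal sketch, assuming the `batched=True` option of the `datasets` `map` method, the same transliteration can be written to handle a batch of texts per call. The helper name `make_batch_transliterator` is hypothetical, and whether batching gives a real speed-up here depends on Epitran itself, which this demo does not measure.

```python
def make_batch_transliterator(transliterate):
    """Wrap a str -> str function so it works on batches from `map(batched=True)`.

    Hypothetical helper, not part of the original demo.
    """
    def transliterate_batch(batch):
        # With batched=True, batch["text"] is a list of strings.
        return {"text": [transliterate(text) for text in batch["text"]]}
    return transliterate_batch


# Stand-in transliterator so the sketch runs without Epitran installed.
demo_fn = make_batch_transliterator(str.lower)
print(demo_fn({"id": ["0", "1"], "text": ["Isigaba", "Ongakhetha"]}))
```

In the notebook this would be called as `xhosa_text_dataset.map(make_batch_transliterator(epi.transliterate), batched=True)`.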

In [None]:
print(xhosa_text_dataset["train"][0])
print(xhosa_text_dataset_phonemized["train"][0])

{'id': '0', 'text': 'Isigaba 409a ongakhetha kuzo - Ongakhetha isigaba\n'}
{'id': '0', 'text': 'isiɡ̤aɓa 409a ɔnɡ̤akʰɛtʰa kuz̤ɔ - ɔnɡ̤akʰɛtʰa isiɡ̤aɓa\n'}
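Epitran returns one continuous IPA string per text, whereas Allosaurus (used in the audio section) returns phones separated by spaces. The following is a minimal sketch of bringing the text side into that space-separated form. It assumes a naive rule: every letter starts a new phone, combining marks and modifier letters (such as `ʰ`) attach to the preceding phone, and digits and punctuation are dropped. The function name `to_phone_string` is hypothetical, and the rule will not cover every IPA convention; Epitran's own `trans_list` method may be a better fit.

```python
import unicodedata


def to_phone_string(ipa_text):
    """Turn a continuous IPA string into space-separated phones (naive rule)."""
    phones = []
    for ch in unicodedata.normalize("NFD", ipa_text):
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Lm":
            # Combining mark or modifier letter: attach to the previous phone.
            if phones:
                phones[-1] += ch
        elif category.startswith("L"):
            # Any other letter starts a new phone.
            phones.append(ch)
        # Everything else (digits, punctuation, whitespace) is dropped.
    return " ".join(phones)


print(to_phone_string("ɔnɡ̤akʰɛtʰa isiɡ̤aɓa"))
```

Because word boundaries are dropped as well, the result has the same flat shape as the Allosaurus output.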


## Audio to phones

In [None]:
from datasets import load_dataset

xhosa_audio_dataset = load_dataset("openslr", "SLR32")


Downloading and preparing dataset open_slr/SLR32 (download: 3.08 GiB, generated: 2.07 MiB, post-processed: Unknown size, total: 3.09 GiB) to /root/.cache/huggingface/datasets/open_slr/SLR32/0.0.0/6bfe61a587ab51a421bcdfb71e6d51a3e050e529766de63dfc5fcf6c1bd8dff8...



Dataset open_slr downloaded and prepared to /root/.cache/huggingface/datasets/open_slr/SLR32/0.0.0/6bfe61a587ab51a421bcdfb71e6d51a3e050e529766de63dfc5fcf6c1bd8dff8. Subsequent calls will reuse this data.



In [None]:
print(xhosa_audio_dataset)

DatasetDict({
    train: Dataset({
        features: ['path', 'sentence'],
        num_rows: 9821
    })
})


In [None]:
# Browsing https://huggingface.co/datasets/viewer/?dataset=openslr
# suggests that the example at index 8047 is Xhosa
xhosa_datapoint = xhosa_audio_dataset["train"][8047]
print(xhosa_datapoint)

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/b6f48a77f831fb5edfde334d727e993beabf1663f3ed6fb36d4157765b02ca76/xh_za/za/xho/wavs/xho_1547_1717451646.wav', 'sentence': 'Kwakukho namadoda anxibe iminqwazi emthubi eqinileyo.'}
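Picking a single index by hand does not scale to the whole dataset. A minimal sketch of selecting the Xhosa recordings instead, assuming (from the one path printed above) that Xhosa file names start with `xho_`. Both that naming convention and the helper name `is_xhosa` are assumptions that should be checked against the actual dataset.

```python
import os


def is_xhosa(example, prefix="xho_"):
    """Return True if the file name of the audio path starts with `prefix`.

    Assumption: Xhosa recordings are named like 'xho_1547_1717451646.wav'.
    """
    return os.path.basename(example["path"]).startswith(prefix)


# Toy paths for illustration only; they are not taken from the dataset.
examples = [
    {"path": "/data/xh_za/za/xho/wavs/xho_1547_1717451646.wav"},
    {"path": "/data/other/wavs/abc_0001_0000000001.wav"},
]
print([is_xhosa(e) for e in examples])
```

On the real data this could be applied with `xhosa_audio_dataset.filter(is_xhosa)`.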


In [None]:
from IPython.display import Audio
audio_file = xhosa_datapoint["path"]
Audio(audio_file)

In [None]:
! pip install allosaurus

from allosaurus.app import read_recognizer

lang = "xho"  # ISO 639-3 code for Xhosa

# load the model by <model name>; 'latest' is used if left empty
model = read_recognizer()

# run inference on <audio_file> with <lang>; lang defaults to 'ipa' if left empty
model.recognize(audio_file, lang)


downloading model  latest
from:  https://github.com/xinjli/allosaurus/releases/download/v1.0/latest.tar.gz
to:    /usr/local/lib/python3.7/dist-packages/allosaurus/pretrained
please wait...


'w a k u x o l a m a t o ŋ t e a n i m e i m e n a s e ɛ n tʰ o m i k e l k i m i l e'
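
To bring the whole audio dataset into the same phone format, the recognizer can be applied to every example, mirroring the `map` call used for the text data. A minimal sketch, assuming a recognizer callable that takes the same `(audio_file, lang)` arguments as `model.recognize` above. The helper name `make_phone_recognizer` and the output column `phones` are hypothetical.

```python
def make_phone_recognizer(recognize, lang="xho"):
    """Build a function for `Dataset.map` that adds a `phones` column."""
    def add_phones(example):
        # `recognize` is expected to return a space-separated phone string.
        example["phones"] = recognize(example["path"], lang)
        return example
    return add_phones


# Stand-in recognizer so the sketch runs without Allosaurus or audio files.
def fake_recognize(path, lang):
    return "w a k u"


add_phones = make_phone_recognizer(fake_recognize)
print(add_phones({"path": "example.wav", "sentence": "Kwakukho"}))
```

With the real model this would be `xhosa_audio_dataset.map(make_phone_recognizer(model.recognize))`; running inference over every recording may take a while.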