# Syllable tokenizer

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/tokenizer-syllable](https://github.com/huseinzol05/Malaya/tree/master/example/tokenizer-syllable).
    
</div>

<div class="alert alert-warning">

This module is only suitable for standard language structure, so it is not safe to use on local language structure.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

In [2]:
%%time
import malaya

CPU times: user 3.2 s, sys: 2.88 s, total: 6.08 s
Wall time: 2.56 s




### Load rules based syllable tokenizer

```python
def rules(**kwargs):
    """
    Load rules based syllable tokenizer.
    originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py
    - improved `cuaca` double vocal `ua` based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification
    - improved `rans` double consonant `ns` based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1
    - improved `au` and `ai` double vocal.

    Returns
    -------
    result: malaya.syllable.Tokenizer class
    """
```

In [3]:
tokenizer = malaya.syllable.rules()

#### Tokenize

```python
def tokenize(self, string):
    """
    Tokenize string into multiple strings using syllable patterns.
    Example from https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1/figure/0,
    'cuaca' -> ['cua', 'ca']
    'insurans' -> ['in', 'su', 'rans']
    'praktikal' -> ['prak', 'ti', 'kal']
    'strategi' -> ['stra', 'te', 'gi']
    'ayam' -> ['a', 'yam']
    'anda' -> ['an', 'da']
    'hantu' -> ['han', 'tu']

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[str]
    """
```

In [4]:
tokenizer.tokenize('angan-angan')


['a', 'ngan', '-', 'a', 'ngan']

In [5]:
tokenizer.tokenize('cuaca')

['cua', 'ca']

In [6]:
tokenizer.tokenize('hidup')

['hi', 'dup']

In [7]:
tokenizer.tokenize('insuran')

['in', 'su', 'ran']

In [8]:
tokenizer.tokenize('insurans')

['in', 'su', 'rans']

In [9]:
tokenizer.tokenize('ayam')

['a', 'yam']

In [10]:
tokenizer.tokenize('strategi')

['stra', 'te', 'gi']

In [11]:
tokenizer.tokenize('hantu')

['han', 'tu']

In [12]:
tokenizer.tokenize('hello')

['hel', 'lo']
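To give a feel for how rule-based syllabification can reproduce splits like the ones above, here is a minimal sketch. It is not Malaya's implementation: the names `DIGRAPHS`, `to_units` and `naive_syllables` are hypothetical, and the rules are an assumed simplification (digraphs such as `ng` count as one consonant, a single consonant between vowels opens the next syllable, and with two or more consonants the first one closes the previous syllable). Unlike `malaya.syllable.rules()`, it always separates adjacent vowels, so it would not keep `cua` together in `cuaca`, and it expects a single word without hyphens.

```python
# Assumed simplification, not the rules used by malaya.syllable.rules().
DIGRAPHS = ('ng', 'ny', 'sy', 'kh', 'gh')
VOWELS = set('aeiou')


def to_units(word):
    """Split a word into letters, keeping known digraphs as one unit."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def naive_syllables(word):
    """Syllabify one word with a small set of consonant/vowel rules."""
    units = to_units(word.lower())
    nuclei = [i for i, u in enumerate(units) if u in VOWELS]
    if not nuclei:
        return [word.lower()] if word else []

    starts = [0]  # index of the first unit of each syllable
    for prev, nxt in zip(nuclei, nuclei[1:]):
        gap = nxt - prev - 1  # consonant units between two vowels
        if gap <= 1:
            # No consonant: split between the vowels.
            # One consonant: it becomes the onset of the next syllable.
            starts.append(nxt - gap)
        else:
            # Two or more: the first consonant closes the previous syllable.
            starts.append(prev + 2)

    ends = starts[1:] + [len(units)]
    return [''.join(units[s:e]) for s, e in zip(starts, ends)]


for w in ['hantu', 'ayam', 'strategi', 'insurans', 'angan']:
    print(w, naive_syllables(w))
```

Even this small rule set handles consonant clusters and the `ng` digraph, which is why a rule-based tokenizer is a reasonable baseline for standard spelling.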

#### Better performance

For longer text, split the string into words first and tokenize each word separately.

In [13]:
string = 'sememang-memangnya kau sakai siot'

In [14]:
results = []
for w in string.split():
    results.extend(tokenizer.tokenize(w))
results

['se', 'me', 'mang', '-', 'me', 'mang', 'nya', 'kau', 'sa', 'kai', 'siot']
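The same loop can be wrapped in a small helper. The sketch below is one possible shape, not part of Malaya: `tokenize_sentence` is a hypothetical name, it accepts any callable that maps a word to a list of syllables (for example `tokenizer.tokenize`), and it adds a per-call cache on the assumption that repeated words are common enough to be worth skipping.

```python
def tokenize_sentence(tokenize, sentence):
    """Tokenize a sentence word by word, reusing results for repeated words.

    `tokenize` is any callable taking one word and returning a list of
    syllables; this helper is hypothetical and not part of Malaya.
    """
    cache = {}
    syllables = []
    for word in sentence.split():
        if word not in cache:
            cache[word] = list(tokenize(word))
        syllables.extend(cache[word])
    return syllables


# Stand-in tokenizer for demonstration only: cuts each word after 2 letters.
def demo_tokenize(word):
    return [word[:2], word[2:]] if len(word) > 2 else [word]


print(tokenize_sentence(demo_tokenize, 'kau sakai kau'))
```

With the real tokenizer, the call would be `tokenize_sentence(tokenizer.tokenize, string)`.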

### List available HuggingFace models

We also provide a syllable tokenizer based on deep learning, trained on a DBP (Dewan Bahasa dan Pustaka) dataset.

In [15]:
malaya.syllable.available_huggingface

{'mesolitica/syllable-lstm': {'Size (MB)': 35.2,
  'hidden size': 512,
  'CER': 0.011996584781229728,
  'WER': 0.06915983606557377}}

### Load deep learning model

In [16]:
model = malaya.syllable.huggingface()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Tokenize

```python
def tokenize(self, string, beam_search: bool = False):
    """
    Tokenize string into multiple strings using deep learning.

    Parameters
    ----------
    string : str
    beam_search : bool, (optional=False)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: List[str]
    """
```

In [17]:
model.tokenize('angan-angan')

['a', 'ngan', 'a', 'ngan']

In [18]:
model.tokenize('insuran')

['in', 'su', 'ran']

In [19]:
model.tokenize('insurans')

['in', 'sur', 'ans']

### Harder example

Test set from DBP at https://huggingface.co/datasets/mesolitica/syllable/raw/main/test-syllable.json

In [20]:
import requests
import json

r = requests.get('https://huggingface.co/datasets/mesolitica/syllable/raw/main/test-syllable.json')
test_set = r.json()

In [21]:
len(test_set)

1952

In [22]:
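def calculate_wer(actual, hyp):
    """
    Calculate WER using `python-Levenshtein`.

    Both strings are split on whitespace, and the edit distance between the
    two token sequences is divided by the number of tokens in `actual`.
    """
    import Levenshtein as Lev

    # Map every distinct token to a unique single character, so that a
    # character-level edit distance equals a token-level edit distance.
    b = set(actual.split() + hyp.split())
    word2char = dict(zip(b, range(len(b))))

    w1 = [chr(word2char[w]) for w in actual.split()]
    w2 = [chr(word2char[w]) for w in hyp.split()]

    return Lev.distance(''.join(w1), ''.join(w2)) / len(actual.split())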
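Note that `calculate_wer` splits on whitespace. In the evaluation cells that follow, both arguments are dot-joined syllable strings with no spaces, so each one appears to be treated as a single token and the score per word is either 0 or 1. If a finer-grained, per-syllable error rate is wanted, one option is sketched below. This is an assumed alternative, not the metric behind the numbers in this notebook; `edit_distance` and `syllable_error_rate` are hypothetical names, and the edit distance is written in pure Python to avoid extra dependencies.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def syllable_error_rate(actual, hyp, sep='.'):
    """Edit distance over syllables, normalised by the reference length."""
    ref = actual.split(sep)
    return edit_distance(ref, hyp.split(sep)) / len(ref)


print(syllable_error_rate('in.su.rans', 'in.su.rans'))
print(syllable_error_rate('in.su.rans', 'in.sur.ans'))
```

The second call scores two wrong syllables out of three, whereas a whole-word comparison would only record that the word is wrong.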

In [23]:
wers = []
for test in test_set:
    t = tokenizer.tokenize(test[0])
    t = [t_ for t_ in t if t_ not in ['-']]
    wer = calculate_wer(test[1], '.'.join(t))
    wers.append(wer)
    
sum(wers) / len(wers)

0.09016393442622951

In [24]:
for test in test_set[:50]:
    print('original:', test[0])
    print('actual:', test[1].split('.'))
    t = tokenizer.tokenize(test[0])
    print('predicted:', t)
    print()

original: mengilukan
actual: ['me', 'ngi', 'lu', 'kan']
predicted: ['me', 'ngi', 'lu', 'kan']

original: menjongkok
actual: ['men', 'jong', 'kok']
predicted: ['men', 'jong', 'kok']

original: tergabas
actual: ['ter', 'ga', 'bas']
predicted: ['ter', 'ga', 'bas']

original: perunding
actual: ['pe', 'run', 'ding']
predicted: ['pe', 'run', 'ding']

original: kemahalan
actual: ['ke', 'ma', 'ha', 'lan']
predicted: ['ke', 'ma', 'ha', 'lan']

original: renggang
actual: ['reng', 'gang']
predicted: ['reng', 'gang']

original: bersuci
actual: ['ber', 'su', 'ci']
predicted: ['ber', 'su', 'ci']

original: jelebat
actual: ['je', 'le', 'bat']
predicted: ['je', 'le', 'bat']

original: rekod
actual: ['re', 'kod']
predicted: ['re', 'kod']

original: amang
actual: ['a', 'mang']
predicted: ['a', 'mang']

original: aromaterapi
actual: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
predicted: ['a', 'ro', 'ma', 'te', 'ra', 'pi']

original: pengkompaunan
actual: ['peng', 'kom', 'pau', 'nan']
predicted: ['peng', 'kom', '

In [26]:
wers = []
for test in test_set:
    t = model.tokenize(test[0])
    t = [t_ for t_ in t if t_ not in ['-']]
    wer = calculate_wer(test[1], '.'.join(t))
    wers.append(wer)
    
sum(wers) / len(wers)

0.0630122950819672
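The two evaluation loops have the same shape, so they could be folded into one helper that takes the tokenizer as an argument. The sketch below is an assumption-laden rewrite rather than code from Malaya: `word_error_rate` is a hypothetical name, each test item is assumed to be a `[word, dotted_syllables]` pair as in `test_set`, and the comparison is a plain exact match, which should give the same result as `calculate_wer` only as long as neither string contains whitespace.

```python
def word_error_rate(tokenize, test_set, skip=('-',)):
    """Fraction of words whose predicted syllables differ from the reference.

    `tokenize` is any callable returning a list of syllables for one word;
    `test_set` is assumed to hold (word, 'syl.la.bles') pairs.
    """
    errors = 0
    for word, expected in test_set:
        predicted = [s for s in tokenize(word) if s not in skip]
        errors += '.'.join(predicted) != expected
    return errors / len(test_set)


# Tiny hand-made sample and a stand-in tokenizer, for demonstration only.
sample = [['hantu', 'han.tu'], ['ayam', 'a.yam']]
lookup = {'hantu': ['han', 'tu'], 'ayam': ['ay', 'am']}

print(word_error_rate(lookup.get, sample))
```

With the objects defined in this notebook, the calls would be `word_error_rate(tokenizer.tokenize, test_set)` and `word_error_rate(model.tokenize, test_set)`.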

In [25]:
for test in test_set[:50]:
    print('original:', test[0])
    print('actual:', test[1].split('.'))
    t = model.tokenize(test[0])
    print('predicted:', t)
    print()

original: mengilukan
actual: ['me', 'ngi', 'lu', 'kan']
predicted: ['me', 'ngi', 'lu', 'kan']

original: menjongkok
actual: ['men', 'jong', 'kok']
predicted: ['men', 'jong', 'kok']

original: tergabas
actual: ['ter', 'ga', 'bas']
predicted: ['ter', 'ga', 'bas']

original: perunding
actual: ['pe', 'run', 'ding']
predicted: ['pe', 'run', 'ding']

original: kemahalan
actual: ['ke', 'ma', 'ha', 'lan']
predicted: ['ke', 'ma', 'ha', 'lan']

original: renggang
actual: ['reng', 'gang']
predicted: ['reng', 'gang']

original: bersuci
actual: ['ber', 'su', 'ci']
predicted: ['ber', 'su', 'ci']

original: jelebat
actual: ['je', 'le', 'bat']
predicted: ['je', 'le', 'bat']

original: rekod
actual: ['re', 'kod']
predicted: ['re', 'kod']

original: amang
actual: ['a', 'mang']
predicted: ['a', 'mang']

original: aromaterapi
actual: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
predicted: ['a', 'ro', 'ma', 'te', 'ra', 'pi']

original: pengkompaunan
actual: ['peng', 'kom', 'pau', 'nan']
predicted: ['peng', 'kom', '