# Syllable tokenizer

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/tokenizer-syllable](https://github.com/huseinzol05/Malaya/tree/master/example/tokenizer-syllable).
    
</div>

<div class="alert alert-warning">

This module is only suitable for standard language structure, so it is not safe to use it for local language structure.
    
</div>

In [1]:
%%time
import malaya

CPU times: user 6.01 s, sys: 1.14 s, total: 7.15 s
Wall time: 7.88 s


### Load Syllable tokenizer

```python
class Tokenizer:
    def __init__(self):
        """
        originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py
        - improved `cuaca` double vocal `ua` based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification
        - improved `rans` double consonant `ns` based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1
        - improved `au` and `ai` double vocal.
        """
```

In [2]:
tokenizer = malaya.syllable.Tokenizer()

### Tokenize

```python
def tokenize(self, string):
    """
    Tokenize string into multiple strings using syllable patterns.
    Example from https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1/figure/0,
    'cuaca' -> ['cua', 'ca']
    'insurans' -> ['in', 'su', 'rans']
    'praktikal' -> ['prak', 'ti', 'kal']
    'strategi' -> ['stra', 'te', 'gi']
    'ayam' -> ['a', 'yam']
    'anda' -> ['an', 'da']
    'hantu' -> ['han', 'tu']

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[str]
    """
```
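The docstring examples above can double as a quick sanity check. The following is a minimal sketch, not part of Malaya: `DOCSTRING_EXAMPLES` and `find_mismatches` are hypothetical names introduced here, and `tokenize` is assumed to be any callable that maps a word to a list of syllables, such as the `tokenize` method of the tokenizer loaded earlier.

```python
from typing import Callable, Dict, Iterable, List

# Word -> expected syllables, copied from the `tokenize` docstring above.
DOCSTRING_EXAMPLES: Dict[str, List[str]] = {
    'cuaca': ['cua', 'ca'],
    'insurans': ['in', 'su', 'rans'],
    'praktikal': ['prak', 'ti', 'kal'],
    'strategi': ['stra', 'te', 'gi'],
    'ayam': ['a', 'yam'],
    'anda': ['an', 'da'],
    'hantu': ['han', 'tu'],
}


def find_mismatches(
    tokenize: Callable[[str], Iterable[str]],
    examples: Dict[str, List[str]] = DOCSTRING_EXAMPLES,
) -> Dict[str, List[str]]:
    """Return {word: actual_output} for every word whose output differs
    from the expected syllables. An empty dict means all examples agree.

    `tokenize` is an assumption: any callable taking one word and
    returning its syllables.
    """
    mismatches: Dict[str, List[str]] = {}
    for word, expected in examples.items():
        actual = list(tokenize(word))
        if actual != expected:
            mismatches[word] = actual
    return mismatches
```

With the tokenizer from this tutorial, the call would look like `find_mismatches(tokenizer.tokenize)`. Any word that appears in the result is one where the output differs from the docstring.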

In [3]:
tokenizer.tokenize('angan-angan')

['a', 'ngan', '-', 'a', 'ngan']

In [4]:
tokenizer.tokenize('cuaca')

['cua', 'ca']

In [5]:
tokenizer.tokenize('hidup')

['hi', 'dup']

In [6]:
tokenizer.tokenize('insuran')

['in', 'su', 'ran']

In [7]:
tokenizer.tokenize('insurans')

['in', 'su', 'rans']

In [8]:
tokenizer.tokenize('ayam')

['a', 'yam']

In [9]:
tokenizer.tokenize('strategi')

['stra', 'te', 'gi']

In [10]:
tokenizer.tokenize('hantu')

['han', 'tu']

In [11]:
tokenizer.tokenize('hello')

['hel', 'lo']

#### Better performance

Split the string into words first, then tokenize each word separately.

In [12]:
string = 'sememang-memangnya kau sakai siot'

In [13]:
results = []
for w in string.split():
    results.extend(tokenizer.tokenize(w))
results

['se', 'me', 'mang', '-', 'me', 'mang', 'nya', 'kau', 'sa', 'kai', 'si', 'ot']
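The loop above can be wrapped in a small reusable helper. This is only a sketch under stated assumptions: `tokenize_words` is a hypothetical name, not a Malaya function, and `tokenize` is assumed to be any callable that turns a single word into its syllables.

```python
from typing import Callable, Iterable, List


def tokenize_words(
    text: str,
    tokenize: Callable[[str], Iterable[str]],
) -> List[str]:
    """Split `text` on whitespace, syllabify each word, and return one
    flat list of syllables in the original word order.

    `tokenize` is an assumption: any per-word syllable tokenizer.
    """
    syllables: List[str] = []
    for word in text.split():  # str.split() with no argument splits on any whitespace
        syllables.extend(tokenize(word))
    return syllables
```

With the tokenizer loaded earlier, `tokenize_words(string, tokenizer.tokenize)` performs the same steps as the loop in cell 13. Taking the tokenizer as an argument keeps the helper independent of any particular library.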