# Stemmer and Lemmatization

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/stemmer](https://github.com/huseinzol05/Malaya/tree/master/example/stemmer).
    
</div>

<div class="alert alert-warning">

This module is trained only on standard language structure, so it is not safe to use it for local language structure.
    
</div>

In [1]:
%%time
import malaya

CPU times: user 4.65 s, sys: 659 ms, total: 5.31 s
Wall time: 4.38 s


## Use deep learning model

Load the LSTM + Bahdanau Attention stemming model, which also performs lemmatization.

```python
def deep_model(quantized: bool = False, **kwargs):
    """
    Load the LSTM + Bahdanau Attention stemming model, which also performs lemmatization.
    Original size 41.6MB, quantized size 10.6MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        The quantized model is not necessarily faster; this depends entirely on the machine.

    Returns
    -------
    result: malaya.stem.DEEP_STEMMER class
    """
```

In [9]:
model = malaya.stem.deep_model()

### Load Quantized model

To load the 8-bit quantized model, simply pass `quantized = True`; the default is `False`.

We can expect a slight drop in accuracy from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
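To see where the size reduction and the small accuracy loss come from, here is a minimal sketch of generic affine 8-bit quantization in plain Python. It is not how Malaya quantizes its weights (the source does not say); `quantize_uint8` and `dequantize_uint8` are hypothetical helpers written for illustration. Storing 8 bits instead of 32 per weight is roughly a 4x reduction, which is in line with the 41.6MB and 10.6MB sizes quoted above.

```python
from typing import List, Tuple


def quantize_uint8(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map floats onto the integer codes 0..255 with a scale and an offset."""
    lo, hi = min(weights), max(weights)
    # Guard against constant weights, where hi == lo would give a zero scale.
    scale = (hi - lo) / 255 or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_uint8(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate floats from the 8-bit codes."""
    return [c * scale + lo for c in codes]


# Made-up example weights, not taken from any real model.
weights = [-1.20, -0.33, 0.0, 0.47, 0.91, 2.05]
codes, scale, lo = quantize_uint8(weights)
restored = dequantize_uint8(codes, scale, lo)

# Rounding to the nearest code keeps the error within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max error {max_error:.5f}, half step {scale / 2:.5f}")
```

The restored weights are close to the originals but not identical, which is why a quantized model can give slightly different predictions.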

In [8]:
quantized_model = malaya.stem.deep_model(quantized = True)



#### Stem and lemmatization

```python
def stem(self, string: str, beam_search: bool = True):
    """
    Stem a string; this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, optional (default=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """
```

To speed up inference, set `beam_search = False`, which switches to the greedy decoder.
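The difference between the two decoders can be sketched on a toy example. The code below is not Malaya's decoder; `TOY_MODEL`, `greedy_decode` and `beam_decode` are hypothetical names, and the probabilities are invented so that the two strategies disagree. Greedy keeps only the single best token at each step, while beam search keeps several partial outputs alive, which costs more computation but can find a more probable sequence.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy "model": next-character probabilities given the prefix so far.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def greedy_decode(steps: int) -> Tuple[str, float]:
    """Pick the single most probable token at every step."""
    prefix, log_prob = "", 0.0
    for _ in range(steps):
        token, p = max(TOY_MODEL[prefix].items(), key=lambda kv: kv[1])
        prefix += token
        log_prob += math.log(p)
    return prefix, math.exp(log_prob)


def beam_decode(steps: int, beam_width: int = 2) -> Tuple[str, float]:
    """Keep the beam_width most probable prefixes at every step."""
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for _ in range(steps):
        # Extend every surviving prefix with every possible next token.
        candidates = [
            (prefix + token, log_prob + math.log(p))
            for prefix, log_prob in beams
            for token, p in TOY_MODEL[prefix].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    best, log_prob = beams[0]
    return best, math.exp(log_prob)


greedy_text, greedy_prob = greedy_decode(2)
beam_text, beam_prob = beam_decode(2)
print(greedy_text, round(greedy_prob, 2))
print(beam_text, round(beam_prob, 2))
```

Here greedy commits to `a` early and ends with a less probable sequence than beam search. On many real inputs both decoders agree, in which case greedy is simply the cheaper option.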

In [3]:
string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com'

In [4]:
%%time

model.stem(string)

CPU times: user 1.22 s, sys: 305 ms, total: 1.52 s
Wall time: 540 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [5]:
%%time

model.stem(string, beam_search = False)

CPU times: user 285 ms, sys: 102 ms, total: 387 ms
Wall time: 289 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [6]:
%%time

quantized_model.stem(string)

CPU times: user 1.29 s, sys: 230 ms, total: 1.52 s
Wall time: 573 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [7]:
%%time

quantized_model.stem(string, beam_search = False)

CPU times: user 331 ms, sys: 105 ms, total: 436 ms
Wall time: 329 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
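The timings above come from the `%%time` notebook magic. Outside a notebook, a rough equivalent can be sketched with `time.perf_counter` from the standard library. `time_call` is a hypothetical helper, and `fake_stem` is only a stand-in so the sketch runs without Malaya installed; in practice one would pass `model.stem` or `quantized_model.stem` instead.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_stem(text: str, beam_search: bool = True) -> str:
    """Hypothetical stand-in for model.stem; it only lowercases the input."""
    return text.lower()


result, seconds = time_call(fake_stem, "Saya MENYERUKANLAH", beam_search=False)
print(result, f"{seconds * 1000:.3f} ms")
```

A single run is noisy, so repeating the call several times and comparing the typical value gives a fairer picture of whether the quantized model or the greedy decoder is faster on a given machine.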

In [8]:
model.stem(another_string)

'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

In [9]:
quantized_model.stem(another_string)

'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

In [11]:
model.stem('saya menyerukanlah')

'saya seru'

In [10]:
quantized_model.stem('saya menyerukanlah')

'saya seru'

## Use Sastrawi stemmer

Malaya also includes an interface for the [Sastrawi stemmer](https://pypi.org/project/PySastrawi/), which is also used internally. To use it, simply call:

```python
malaya.stem.sastrawi(string)
```

However, it cannot preserve tokens such as URLs, hashtags, money, datetimes and user mentions.
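One possible workaround is to keep such tokens away from the stemmer altogether. The sketch below is an assumption, not part of Malaya: `stem_preserving` and the `PROTECTED` pattern are hypothetical, the pattern covers only URLs, hashtags and user mentions (not money or datetimes), and `toy_stem` is a stand-in so the example runs without any stemmer installed.

```python
import re
from typing import Callable

# Assumed patterns for tokens that should pass through untouched.
PROTECTED = re.compile(r"^(https?://\S+|www\.\S+|[#@]\w+)$")


def stem_preserving(text: str, stem_fn: Callable[[str], str]) -> str:
    """Stem token by token, leaving protected tokens exactly as written."""
    out = []
    for token in text.split():
        if PROTECTED.match(token):
            out.append(token)
        else:
            stemmed = stem_fn(token)
            # Skip tokens that the stemmer reduces to nothing, e.g. bare punctuation.
            if stemmed:
                out.append(stemmed)
    return " ".join(out)


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: lowercase and drop punctuation."""
    return re.sub(r"[^\w]", "", word.lower())


print(stem_preserving("memang tak guna, http://twitter.com #lgbt @user", toy_stem))
```

Since `malaya.stem.sastrawi` takes a string and returns a string, it could presumably be passed as `stem_fn`, although calling a stemmer once per token is slower than one call on the whole sentence.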

In [11]:
malaya.stem.sastrawi(another_string)

'melayu bodoh dah la gay sokong lgbt lagi memang tak guna http twitter com'

In [12]:
malaya.stem.sastrawi('saya menyerukanlah')

'saya seru'

In [13]:
malaya.stem.sastrawi('menarik')

'tarik'