# Stemmer and Lemmatization

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/stemmer](https://github.com/huseinzol05/Malaya/tree/master/example/stemmer).
    
</div>

<div class="alert alert-warning">

This module was trained only on standard language structure, so it is not safe to use on local language structure.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
%%time
import malaya

CPU times: user 2.88 s, sys: 2.56 s, total: 5.44 s
Wall time: 2.26 s




In [3]:
string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com @kesedihan rm15'

### Use deep learning model

Load the LSTM + Bahdanau Attention stemming model, which also performs lemmatization.

If you are using TensorFlow 2, make sure TensorFlow Addons is installed:

```bash
pip install tensorflow-addons -U
```

```python
def deep_model(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load LSTM + Bahdanau Attention stemming model,
    256 filter size, 2 layers, BPE level (YouTokenToMe 20k vocab size).
    This model also includes lemmatization.
    Original size 41.6MB, quantized size 10.6MB.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'base'`` - trained on default dataset.
        * ``'noisy'`` - trained on default and augmentation dataset.

    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: malaya.stem.DeepStemmer class
    """
```

In [4]:
model = malaya.stem.deep_model(model = 'base')


### Load Quantized model

To load the 8-bit quantized model, pass `quantized = True`; the default is `False`.

Expect a slight drop in accuracy from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

In [6]:
quantized_model = malaya.stem.deep_model(model = 'base', quantized = True)
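
Because the speed difference depends on the machine, it can be worth measuring it on your own hardware. The snippet below is a minimal sketch of such a measurement. `time_stemmer` and `placeholder_stem` are hypothetical helpers written for this illustration, not part of Malaya; the placeholder only exists so the snippet runs without any model loaded.

```python
import time
from statistics import mean
from typing import Callable


def time_stemmer(stem_fn: Callable[[str], str], text: str, repeats: int = 5) -> float:
    """Return the mean wall-clock seconds per call over `repeats` timed runs."""
    stem_fn(text)  # warm-up call, excluded from the measurement
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        stem_fn(text)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def placeholder_stem(text: str) -> str:
    """Stand-in for a real stemmer (assumption: any callable mapping str to str)."""
    return text.lower()


print(time_stemmer(placeholder_stem, 'saya menyerukanlah'))
```

With the models loaded as above, you would pass `model.stem` and `quantized_model.stem` in place of `placeholder_stem` and compare the two results.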

#### Stem and lemmatization

```python
def stem(self, string: str, beam_search: bool = True):
    """
    Stem a string; this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, optional (default=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """
```

To speed up inference, set `beam_search = False` to use the greedy decoder instead.
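
To make the trade-off concrete, here is a generic toy illustration of greedy versus beam search decoding. It is not Malaya's decoder: `TOY_MODEL` is a made-up next-token probability table, and `greedy_decode` and `beam_decode` are hypothetical helpers. Greedy keeps only the single best token at each step, which is cheaper, while beam search keeps several partial sequences and can find a higher-scoring result.

```python
import math
from typing import Dict, Tuple

Prefix = Tuple[str, ...]

# Made-up next-token probabilities, keyed by the prefix decoded so far.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'x': 0.55, 'y': 0.45},
    ('b',): {'x': 0.9, 'y': 0.1},
}


def greedy_decode(steps: int) -> Tuple[Prefix, float]:
    """Pick the most probable token at every step."""
    prefix: Prefix = ()
    score = 0.0
    for _ in range(steps):
        token, prob = max(TOY_MODEL[prefix].items(), key=lambda kv: kv[1])
        prefix = prefix + (token,)
        score += math.log(prob)
    return prefix, score


def beam_decode(steps: int, beam_width: int) -> Tuple[Prefix, float]:
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (token,), score + math.log(prob))
            for prefix, score in beams
            for token, prob in TOY_MODEL[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


print(greedy_decode(2)[0])   # ('a', 'x'), probability 0.6 * 0.55 = 0.33
print(beam_decode(2, 2)[0])  # ('b', 'x'), probability 0.4 * 0.9 = 0.36
```

Greedy commits to `'a'` at the first step and never recovers, whereas a beam of width 2 also explores `'b'` and finds the more probable sequence, at the cost of scoring more candidates per step.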

In [7]:
%%time

model.stem(string)

CPU times: user 376 ms, sys: 41.2 ms, total: 417 ms
Wall time: 339 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [8]:
%%time

model.stem(string, beam_search = False)

CPU times: user 196 ms, sys: 20.7 ms, total: 217 ms
Wall time: 138 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [9]:
%%time

quantized_model.stem(string)

CPU times: user 291 ms, sys: 44.5 ms, total: 336 ms
Wall time: 252 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [10]:
%%time

quantized_model.stem(string, beam_search = False)

CPU times: user 467 ms, sys: 47.4 ms, total: 514 ms
Wall time: 316 ms


'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [11]:
model.stem(another_string)

'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'

In [12]:
quantized_model.stem(another_string)

'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'

In [13]:
model.stem('saya menyerukanlah')

'saya seru'

In [14]:
quantized_model.stem('saya menyerukanlah')

'saya seru'

### Use Sastrawi stemmer

Malaya also includes an interface for the [Sastrawi stemmer](https://pypi.org/project/PySastrawi/), which we use internally. To use it:

```python
def sastrawi():
    """
    Load stemming model using Sastrawi; this also includes lemmatization.

    Returns
    -------
    result: malaya.stem.SASTRAWI class
    """
```

In [15]:
sastrawi = malaya.stem.sastrawi()

In [16]:
sastrawi.stem('saya menyerukanlah')

'saya seru'

In [17]:
sastrawi.stem('menarik')

'tarik'

In [24]:
sastrawi.stem(string)

'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [18]:
sastrawi.stem(another_string)

'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'

### Use Naive stemmer

This stemmer simply applies regex patterns to strip affixes. It cannot lemmatize.

```python
def naive():
    """
    Load stemming model using startswith and endswith naively using regex patterns.

    Returns
    -------
    result : malaya.stem.NAIVE class
    """
```

In [19]:
naive = malaya.stem.naive()

In [20]:
naive.stem('saya menyerukanlah')

'saya yerukan'

In [21]:
naive.stem('menarik')

'arik'
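
Outputs such as `'arik'` follow from blind affix stripping: the pattern removes a prefix without knowing that the root `tarik` lost its initial `t` when the prefix was attached. The sketch below illustrates the idea with hypothetical affix lists; `PREFIXES`, `SUFFIXES`, `naive_stem_word` and `naive_stem` are assumptions made for this example and are not Malaya's actual patterns, so results will differ from `malaya.stem.naive()`.

```python
import re

# Hypothetical affix lists for illustration only; not Malaya's actual patterns.
PREFIXES = re.compile(r'^(?:meng|meny|mem|men|me|ber|ter|di|ke|se)')
SUFFIXES = re.compile(r'(?:kan|lah|kah|nya|an|i)$')


def naive_stem_word(word: str, min_length: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least `min_length` characters."""
    stem = word.lower()
    without_prefix = PREFIXES.sub('', stem, count=1)
    if len(without_prefix) >= min_length:
        stem = without_prefix
    without_suffix = SUFFIXES.sub('', stem, count=1)
    if len(without_suffix) >= min_length:
        stem = without_suffix
    return stem


def naive_stem(text: str) -> str:
    """Apply the word-level rule to every whitespace-separated token."""
    return ' '.join(naive_stem_word(token) for token in text.split())


print(naive_stem('saya menarik'))  # 'saya arik'
```

The `min_length` guard is one simple way to avoid over-stripping short words such as `salah`, which would otherwise lose its `lah` ending. Restoring the dropped `t` in `tarik` requires morphological knowledge that plain regex stripping does not have.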

In [23]:
naive.stem(string)

'Benda yg SALAH ni , jgn  debat . Yg SALAH x jadi betul . Ingat tu . Mcm mana sat kalipun org sampai sej , d ang benda tu sa , am je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

In [22]:
naive.stem(another_string)

'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com @kesedihan rm15'