# Segmentation HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/segmentation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/segmentation-huggingface).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
import logging

logging.basicConfig(level=logging.INFO)

In [3]:
%%time

import malaya

CPU times: user 3.07 s, sys: 3.53 s, total: 6.6 s
Wall time: 2.28 s




A common problem in social media text is missing spaces. Text segmentation can restore them, for example:

1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.
2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.
3. ceritatunnajibrazak -> cerita tun najib razak.
4. TunM sukakan -> Tun M sukakan.

Segmentation only:

1. Fixes spacing errors.
2. Does not correct grammar.
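
Because segmentation is meant to touch only spacing, one way to sanity-check a result is to compare the input and output with all whitespace removed. The helper below is a minimal sketch of that idea; `only_spacing_changed` is a hypothetical name introduced here for illustration and is not part of Malaya.

```python
def only_spacing_changed(original: str, segmented: str) -> bool:
    """Return True if both strings are identical once all whitespace is removed."""
    def strip_whitespace(text: str) -> str:
        return ''.join(text.split())

    return strip_whitespace(original) == strip_whitespace(segmented)


# Pairs taken from the examples listed above.
print(only_spacing_changed('ceritatunnajibrazak', 'cerita tun najib razak'))
print(only_spacing_changed('TunM sukakan', 'Tun M sukakan'))
```

A `False` result would indicate that the output differs from the input in more than spacing, which is worth checking for when a generative model produces the segmentation.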

In [4]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'

### List available HuggingFace models

In [5]:
malaya.segmentation.available_huggingface()

INFO:malaya.segmentation:tested on random generated dataset at https://f000.backblazeb2.com/file/malay-dataset/segmentation/test-set-segmentation.json


| Model | Size (MB) | WER | Suggested length |
|---|---|---|---|
| mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased | 51.0 | 0.13456 | 256.0 |
| mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased | 139.0 | 0.13456 | 256.0 |
| mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased | 242.0 | 0.13456 | 256.0 |
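
If memory or download size is a constraint, the table can drive model selection. The sketch below hard-codes the rows listed above and picks the model with the lowest reported WER that fits a size budget, breaking ties by smaller size. `AVAILABLE` and `pick_model` are hypothetical names used only for this illustration.

```python
# Rows copied from the table above: (model name, size in MB, WER).
AVAILABLE = [
    ('mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased', 51.0, 0.13456),
    ('mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', 139.0, 0.13456),
    ('mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased', 242.0, 0.13456),
]


def pick_model(budget_mb: float, models=AVAILABLE):
    """Return the name of the lowest-WER model within the size budget, or None."""
    fitting = [row for row in models if row[1] <= budget_mb]
    if not fitting:
        return None
    # Sort key: lower WER first, then smaller size as a tie-breaker.
    name, _size, _wer = min(fitting, key=lambda row: (row[2], row[1]))
    return name


print(pick_model(150))
```

The returned name can then be passed as the `model` argument when loading a model. Since the listed WER values are identical, the tie-breaker decides here; on other data the ranking may differ.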


### Load HuggingFace model

```python
def huggingface(model: str = 'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', **kwargs):
    """
    Load a HuggingFace model for segmentation.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')
        Check available models at `malaya.segmentation.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
```

In [6]:
model = malaya.segmentation.huggingface()

#### Predict

```python
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
```
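
For long lists of strings, it may help to call `generate` on smaller slices rather than all at once. The helper below is a minimal sketch under that assumption: `generate_in_batches` and its `batch_size` parameter are hypothetical and not part of Malaya. It relies only on the `generate(strings, **kwargs)` interface documented above.

```python
from typing import List


def generate_in_batches(model, strings: List[str], batch_size: int = 4, **kwargs) -> List[str]:
    """Call model.generate on consecutive slices of `strings` and concatenate the results."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    results: List[str] = []
    for start in range(0, len(strings), batch_size):
        batch = strings[start:start + batch_size]
        # Extra keyword arguments (for example max_length) are forwarded unchanged.
        results.extend(model.generate(batch, **kwargs))
    return results


# Intended usage with a loaded model (not executed here):
# generate_in_batches(model, [string1, string2, string3, string4], batch_size=2, max_length=256)
```

The output order matches the input order because slices are processed sequentially.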

In [7]:
%%time

model.generate([string1, string2, string3, string4], max_length = 256)

CPU times: user 1.43 s, sys: 39.4 ms, total: 1.47 s
Wall time: 126 ms


['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
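
The first output above differs from the expected segmentation listed earlier in one word (`risikokan` instead of `risaukan`). A word error rate (WER) puts a number on such differences. The sketch below uses the standard definition, word-level edit distance divided by the number of reference words. It is an illustration only and not necessarily how the WER reported by `available_huggingface()` was computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError('reference must contain at least one word')
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # delete a reference word
                current[j - 1] + 1,      # insert a hypothesis word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1] / len(ref)


expected = 'husein suka makan ayam, dia sgt risaukan'
predicted = 'husein suka makan ayam, dia sgt risikokan'
print(round(word_error_rate(expected, predicted), 4))  # 0.1429 (1 substitution over 7 words)
```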

In [8]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia], max_length = 256)

CPU times: user 6.13 s, sys: 0 ns, total: 6.13 s
Wall time: 550 ms


['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu']

### Able to infer mixed Malay (MS) and English (EN)

In [10]:
string5 = 'ihate chicken, but ilike fish'
string6 = 'Higuys! I noticedsemalam & harini dahramai yangdapat cookiesni kan. So hariniinak sharesome post mortemof our first batch:'

In [11]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia,
               string5, string6], max_length = 256)

CPU times: user 6.61 s, sys: 0 ns, total: 6.61 s
Wall time: 617 ms


['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu',
 'i hate chicken, but i like fish',
 'Hi guys! I noticed semalam & hari ni dah ramai yang dapat cookies ni kan. So hari ni inak share some post mortem of our first batch:']