# True Case HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/true-case-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/true-case-huggingface).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
import logging

logging.basicConfig(level=logging.INFO)

In [3]:
%%time

import malaya

CPU times: user 3.99 s, sys: 1.9 s, total: 5.89 s
Wall time: 4.13 s


### Explanation

Common third-party NLP services such as Google Speech to Text or PDF-to-text converters often return text with no case information and with missing or incorrect punctuation. True Case can restore both.

1. jom makan di us makanan di sana sedap -> jom makan di US, makanan di sana sedap.
2. kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan -> KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan.

The true case model:

1. Fixes missing or incorrect punctuation.
2. Fixes missing or incorrect casing.
3. Does not correct grammar.
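
Since true casing should leave the words themselves untouched, one way to sanity-check an output is to compare it with the input after lowercasing and dropping punctuation. The following is a minimal sketch; `normalize` and `only_case_and_punctuation_changed` are hypothetical helper names, not part of Malaya.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, turn every punctuation character into a space, split into words."""
    return re.sub(r'[^\w\s]', ' ', text.lower()).split()


def only_case_and_punctuation_changed(before: str, after: str) -> bool:
    """Return True when both strings contain the same words in the same order."""
    return normalize(before) == normalize(after)


before = 'jom makan di us makanan di sana sedap'
after = 'jom makan di US, makanan di sana sedap.'
print(only_case_and_punctuation_changed(before, after))  # True
```

Because punctuation is replaced by a space rather than deleted, `ketiga tiga` and `ketiga-tiga` normalize to the same word list.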

### List available HuggingFace models

In [4]:
malaya.true_case.available_huggingface()

INFO:malaya.true_case:tested on generated dataset at https://f000.backblazeb2.com/file/malay-dataset/true-case/test-set-true-case.json


| Model | Size (MB) | WER | CER | Suggested length |
|---|---|---|---|---|
| mesolitica/finetune-true-case-t5-super-tiny-standard-bahasa-cased | 51.0 | 0.13456 | 0.079891 | 256.0 |
| mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased | 139.0 | 0.13456 | 0.079891 | 256.0 |
| mesolitica/finetune-true-case-t5-small-standard-bahasa-cased | 242.0 | 0.081105 | 0.016384 | 256.0 |
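
The table is a trade-off between download size and error rate. As a rough sketch of how one might pick a model under a size budget, the snippet below hard-codes the figures from the table; `MODELS` and `pick_model` are hypothetical names introduced for illustration, not Malaya APIs.

```python
# Figures copied from the table above: name -> (size in MB, WER, CER).
MODELS = {
    'mesolitica/finetune-true-case-t5-super-tiny-standard-bahasa-cased': (51.0, 0.13456, 0.079891),
    'mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased': (139.0, 0.13456, 0.079891),
    'mesolitica/finetune-true-case-t5-small-standard-bahasa-cased': (242.0, 0.081105, 0.016384),
}


def pick_model(max_size_mb: float) -> str:
    """Return the model with the lowest WER that fits the size budget.

    Ties on WER are broken in favour of the smaller model.
    """
    candidates = {name: stats for name, stats in MODELS.items() if stats[0] <= max_size_mb}
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    return min(candidates, key=lambda name: (candidates[name][1], candidates[name][0]))


print(pick_model(150))  # mesolitica/finetune-true-case-t5-super-tiny-standard-bahasa-cased
print(pick_model(300))  # mesolitica/finetune-true-case-t5-small-standard-bahasa-cased
```

The returned name can then be passed as the `model` argument of `malaya.true_case.huggingface`.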


### Load Transformer model

```python
def huggingface(model: str = 'mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to true case.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')
        Check available models at `malaya.true_case.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
```

In [5]:
model = malaya.true_case.huggingface()

In [6]:
string1 = 'jom makan di us makanan di sana sedap'
string2 = 'kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan'

#### Predict

```python
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
```

In [7]:
model.generate([string1, string2], max_length = 256)

['Jom makan di US makanan di sana sedap',
 'KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan.']
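
The models list a suggested length of 256, and the call above uses `max_length = 256`. For inputs longer than that, one option is to split the text into smaller pieces and generate each piece separately. The sketch below is an assumption-laden illustration: `chunk_words` is a hypothetical helper, and the default of 100 words is a guess, chosen well below 256 because the suggested length is presumably counted in subword tokens rather than words.

```python
from typing import List


def chunk_words(text: str, max_words: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


print(chunk_words('jom makan di us makanan di sana sedap', max_words=3))
# ['jom makan di', 'us makanan di', 'sana sedap']

# Assumed usage with the model loaded earlier (not executed here):
# outputs = model.generate(chunk_words(long_text), max_length = 256)
# restored = ' '.join(outputs)
```

Splitting on a fixed word count can cut a sentence in half, so casing and punctuation near chunk boundaries may be less reliable than in the middle of a chunk.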

In [8]:
import random

def random_uppercase(string):
    string = [c.upper() if random.randint(0,1) else c for c in string]
    return ''.join(string)

In [9]:
r = random_uppercase(string2)
r

'KUAlA LuMpuR mEnterI di jABAtan pERDanA MenTErI dAtUk SErI dR mujaHId YusOF raWa hAri iNi MENGaKhIRI lAwaTAn KERJa lApAn HArI kE joRdAn turKi daN BOSnIA heRZeGoViNA LAWATAN Yang bERTUJUAN mENGeRaTkan LaGi hubUngAn DUa hAla DengaN ketiGa TIGa nEgaRa BERKEnAAn'
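
The output above differs on every run because `random_uppercase` draws from the global random state. If a reproducible noisy input is wanted, a seeded variant can be used instead. This is a small sketch; `random_uppercase_seeded` is a hypothetical name and not part of the original notebook.

```python
import random


def random_uppercase_seeded(string: str, seed: int = 0) -> str:
    """Uppercase each character with probability 1/2, using a private seeded generator."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return ''.join(c.upper() if rng.randint(0, 1) else c for c in string)


noisy = random_uppercase_seeded('kuala lumpur menteri di jabatan perdana menteri', seed=42)
# The same seed always yields the same string.
print(noisy == random_uppercase_seeded('kuala lumpur menteri di jabatan perdana menteri', seed=42))  # True
```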

In [10]:
model.generate([r], max_length = 256)

['Kuala Lumpur Menteri di Jabatan Perdana Menteri Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan Turki dan Bosnia, Herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan.']
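
The model table reports WER (word error rate), and the same idea can be used to put a number on how far an output is from the expected text. The snippet below is a minimal sketch of a word-level edit-distance WER on whitespace-split tokens; it is not necessarily how the figures in the table were computed, and `edit_distance` and `wer` are hypothetical helper names.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError('reference must contain at least one word')
    return edit_distance(ref, hyp) / len(ref)


expected = 'jom makan di US, makanan di sana sedap.'
predicted = 'Jom makan di US makanan di sana sedap'
print(wer(expected, predicted))  # 0.375
```

Here three of the eight reference tokens differ (`jom`, `US,` and `sedap.`), because casing and attached punctuation both count as part of a token.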

### Able to infer mixed Malay (MS) and English (EN)

In [23]:
string3 = 'i hate chicken but i like fish'
string4 = 'Tun Dr Mahathir Mohamad and Perikatan Nasional (PN) Information chief Datuk Seri Azmin Ali may have differences, but both men are on the same page one thing – the belief that Pakatan Harapan (PH) is bad news for the economy.'
string4 = random_uppercase(string4)
string4

'TUn DR MaHAtHir MOhAmad AnD PerikAtAN NASIonal (PN) InfoRmAtIon CHIEF DAtuK SEri AzMiN ALI mAY hAvE diFFerencES, bUt bOth MEn ARe On ThE saMe PAgE one ThiNG – THe BElIEF tHAt PAKAtAn HaRaPan (PH) is baD NewS FoR THE EcOnOmy.'

In [24]:
model.generate([string3, string4], max_length = 256)

['I hate chicken but I like fish.',
 'Tun Dr Mahathir Mohamad and Perikatan Nasional (PN) information chief Datuk Seri Azmin Ali may have differences, but both men are on the same page one thing – the belief that Pakatan Harapan (PH) is bad news for the economy.']