# Noisy HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/noisy-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/noisy-translation-huggingface).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

Cannot import beam_search_ops from Tensorflow 1, ['malaya.jawi_rumi.deep_model', 'malaya.phoneme.deep_model', 'malaya.rumi_jawi.deep_model', 'malaya.stem.deep_model'] for stemmer will not available to use, make sure Tensorflow 1 version >= 1.15


CPU times: user 5.04 s, sys: 3.52 s, total: 8.56 s
Wall time: 5.28 s


### List available HuggingFace models

In [3]:
malaya.translation.available_huggingface()

INFO:malaya.translation:tested on FLORES200 pair `dev` set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/flores200-eval
INFO:malaya.translation:tested on noisy test set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/noisy-eval
INFO:malaya.translation:check out NLLB 200 metrics from `malaya.translation.nllb_metrics`.
INFO:malaya.translation:check out Google Translate metrics from `malaya.translation.google_translate_metrics`.


| Model | Size (MB) | Suggested length | en-ms chrF2++ | ms-en chrF2++ | pasar ms-ms chrF2++ | pasar ms-en chrF2++ | from lang | to lang | old model | ind-ms chrF2++ | jav-ms chrF2++ | manglish-ms chrF2++ | manglish-en chrF2++ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3 | 139 | 1024 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, pasar ms] | [ms, en] | True | | | | |
| mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3 | 242 | 1024 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, pasar ms] | [ms, en] | True | | | | |
| mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v3 | 892 | 1024 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, pasar ms] | [ms, en] | True | | | | |
| mesolitica/translation-t5-tiny-standard-bahasa-cased | 139 | 1536 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, ind, jav, bjn, manglish, pasar ms] | [en, ms] | False | 43.937298 | 43.937298 | 43.937298 | 43.937298 |
| mesolitica/translation-t5-small-standard-bahasa-cased | 242 | 1536 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, ind, jav, bjn, manglish, pasar ms] | [en, ms] | False | 43.937298 | 43.937298 | 43.937298 | 43.937298 |
| mesolitica/translation-t5-base-standard-bahasa-cased | 892 | 1536 | 43.937298 | 43.937298 | 43.937298 | 43.937298 | [en, ms, ind, jav, bjn, manglish, pasar ms] | [en, ms] | False | 43.937298 | 43.937298 | 43.937298 | 43.937298 |
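
The listing above can also guide model choice. The sketch below is only an illustration: it copies the model names, sizes, source languages, and `old model` flags from the listing into a plain Python list, and `pick_model` is a hypothetical helper, not part of Malaya. Preferring the largest model that fits a size budget is an assumption made here for the example, not a recommendation from the library.

```python
from typing import Dict, List, Optional

# Names, sizes, source languages and old-model flags copied from the listing above.
NEW_LANGS = ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms']
OLD_LANGS = ['en', 'ms', 'pasar ms']

MODELS: List[Dict] = [
    {'name': 'mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3',
     'size_mb': 139, 'from_lang': OLD_LANGS, 'old_model': True},
    {'name': 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3',
     'size_mb': 242, 'from_lang': OLD_LANGS, 'old_model': True},
    {'name': 'mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v3',
     'size_mb': 892, 'from_lang': OLD_LANGS, 'old_model': True},
    {'name': 'mesolitica/translation-t5-tiny-standard-bahasa-cased',
     'size_mb': 139, 'from_lang': NEW_LANGS, 'old_model': False},
    {'name': 'mesolitica/translation-t5-small-standard-bahasa-cased',
     'size_mb': 242, 'from_lang': NEW_LANGS, 'old_model': False},
    {'name': 'mesolitica/translation-t5-base-standard-bahasa-cased',
     'size_mb': 892, 'from_lang': NEW_LANGS, 'old_model': False},
]


def pick_model(source_lang: str, max_size_mb: float,
               allow_old: bool = False) -> Optional[str]:
    """Hypothetical helper: largest listed model that accepts `source_lang`
    and fits within `max_size_mb`. Returns None when nothing qualifies."""
    candidates = [
        m for m in MODELS
        if source_lang in m['from_lang']
        and m['size_mb'] <= max_size_mb
        and (allow_old or not m['old_model'])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m['size_mb'])['name']


print(pick_model('manglish', max_size_mb=300))
```

The returned name can then be passed as the `model` argument when loading a model.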


### Improvements of new model

1. Able to translate from `[en, ms, ind, jav, bjn, manglish, pasar ms]`, while the old model can only translate from `[en, ms, pasar ms]`.
2. No longer requires `from_lang` as part of the prefix.
3. Able to retain the text structure as it is.

### Load Transformer models

```python
def huggingface(
    model: str = 'mesolitica/translation-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    from_lang: List[str] = None,
    to_lang: List[str] = None,
    old_model: bool = False,
    **kwargs,
):
    """
    Load HuggingFace model to translate.

    Parameters
    ----------
    model: str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')
        Check available models at `malaya.translation.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Translation
    """
```

In [4]:
model = malaya.translation.huggingface()

In [5]:
old_model = malaya.translation.huggingface(model = 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3')

### Translate

```python
def generate(self, strings: List[str], from_lang: str = None, to_lang: str = 'ms', **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    from_lang: str, optional (default=None)
        the old model requires the `from_lang` parameter to work properly,
        while the new model does not.
    to_lang: str, optional (default='ms')
        target language to translate.
    **kwargs: keyword arguments passed to the huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
```
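
For long lists of strings, it may help to call `generate` in smaller batches to keep memory use bounded. The following is a minimal sketch under stated assumptions: `translate_in_batches` and `chunks` are hypothetical helpers, not part of Malaya, and `EchoTranslator` is a stand-in that only mimics the `generate` signature shown above so the example runs without downloading a model. In practice, the object returned by `malaya.translation.huggingface()` would be passed instead.

```python
from typing import Any, Iterator, List, Optional


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(model: Any, strings: List[str], to_lang: str = 'ms',
                         from_lang: Optional[str] = None, batch_size: int = 4,
                         **kwargs) -> List[str]:
    """Hypothetical helper: call `model.generate` batch by batch.

    `from_lang` is forwarded unchanged; per the docstring above it only
    matters for the old models. Extra kwargs (e.g. max_length) pass through.
    """
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    results: List[str] = []
    for batch in chunks(strings, batch_size):
        results.extend(
            model.generate(batch, from_lang=from_lang, to_lang=to_lang, **kwargs)
        )
    return results


class EchoTranslator:
    """Stand-in for illustration only: tags each string with the target
    language instead of translating it."""

    def generate(self, strings: List[str], from_lang: Optional[str] = None,
                 to_lang: str = 'ms', **kwargs) -> List[str]:
        return [f'[{to_lang}] {s}' for s in strings]


out = translate_in_batches(EchoTranslator(), ['a', 'b', 'c'],
                           to_lang='en', batch_size=2, max_length=1000)
print(out)
```

A suitable `batch_size` depends on the model size and available memory, so the value here is arbitrary.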

In [6]:
from pprint import pprint

### Noisy Malay

In [8]:
strings = [
    'ak tak paham la',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
    "Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
    'mesolitica boleh buat asr tak',
]

In [9]:
%%time

pprint(model.generate(strings,  to_lang = 'ms', max_length = 1000))

['Saya tidak faham',
 'Hai kawan-kawan! Saya perasan semalam dan hari ini ramai yang menerima '
 'biskut kan? Jadi hari ini saya ingin berkongsi beberapa post mortem batch '
 'pertama kami:',
 'Memang betul. Ini tidak perlu menjadi pakar, saya juga tahu. Ini adalah '
 'isyarat, bodoh.',
 '"Pada pukul 8 di pasar KK memang ramai orang, pandai pilih tempat."',
 'Jadi haram jadah',
 'Di mana kamu pergi?',
 'Saya rasa nak ambil separuh hari',
 'Bayangkan PH dan menang dalam PRU-14. Kemudian terdapat pelbagai pintu '
 'belakang. Akhirnya, Ismail Sabri naik. Itulah sebabnya saya tidak lagi '
 'peduli tentang politik. Saya bersumpah sudah cukup.',
 'Bolehkah Mesolitica membuat Asr?']
CPU times: user 58.8 s, sys: 230 ms, total: 59.1 s
Wall time: 5.03 s


In [10]:
%%time

pprint(model.generate(strings,  to_lang = 'en', max_length = 1000))

["I don't understand",
 'Hi guys! I noticed yesterday and today many people have received cookies, '
 'right? So today I want to share some post mortem of our first batch:',
 "Indeed. This doesn't need an expert, I know too. It's a gesture, stupid.",
 "At 8 o'clock at the KK market, there are many people, he knows how to choose "
 'places.',
 "So it's forbidden jadah",
 'Where do you want to go?',
 "It's like taking half a day",
 'Imagine PH and won the 14th general election. Then there were various '
 "backdoors. In the end, Ismail Sabri rose. That's why I don't give a fk about "
 "politics anymore. I swear it's already fk up.",
 'Can Mesolitica do it?']
CPU times: user 1min 45s, sys: 772 ms, total: 1min 46s
Wall time: 9.16 s


### Manglish

In [11]:
strings = [
    'i know plenty of people who snack on sambal ikan bilis.',
    'I often visualize my own programs algorithm before implemment it.',
    'Am I the only one who used their given name ever since I was a kid?',
    'Gotta be wary of pimples. Oh they bleed bad when cut',
    'Smh the dude literally has a rubbish bin infront of his house',
    "I think I won't be able to catch it within 1 min lol"
]

In [12]:
%%time

pprint(model.generate(strings,  to_lang = 'ms', max_length = 1000))

['Saya tahu ramai orang yang makan sambal ikan bilis.',
 'Saya sering mengvisualisasikan algoritma program saya sendiri sebelum '
 'menerapkannya.',
 'Adakah saya satu-satunya yang menggunakan nama mereka sejak kecil?',
 'Kena berhati-hati dengan jerawat. Oh mereka berdarah teruk apabila dipotong',
 'Smh lelaki itu sebenarnya mempunyai tong sampah di depan rumahnya.',
 'Saya rasa saya tidak akan dapat menangkapnya dalam masa 1 minit haha']
CPU times: user 10.8 s, sys: 0 ns, total: 10.8 s
Wall time: 906 ms


In [13]:
%%time

pprint(model.generate(strings,  to_lang = 'en', max_length = 1000))

['I know many people who snack on sambal ikan bilis.',
 'I often visualize my own programs algorithm before implemmenting it.',
 'Am I the only one who has used their name since I was a child?',
 'You must be careful of pimples. Oh, they bleed badly when cut.',
 'Oh my goodness, the man actually has a rubbish bin in front of his house.',
 "I don't think I will be able to catch it within a minute, haha."]
CPU times: user 14.8 s, sys: 3.98 ms, total: 14.8 s
Wall time: 1.24 s
