# EN to MS HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/en-ms-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/en-ms-translation-huggingface).
    
</div>

<div class="alert alert-warning">

This module was trained on standard language and augmented local language structures; proceed with caution.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 3.97 s, sys: 3.52 s, total: 7.49 s
Wall time: 4.09 s


### List available HuggingFace models

In [3]:
malaya.translation.en_ms.available_huggingface()

INFO:malaya.translation.en_ms:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.en_ms:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set


| Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|
| mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2 | 139 | 41.625536 | 73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ... | 65.7 | 256 |
| mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2 | 242 | 43.937298 | 74.9/52.2/37.9/27.7 (BP = 0.976 ratio = 0.977 ... | 67.43 | 256 |
| mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2 | 892 | 44.173559 | 74.7/52.3/38.0/28.0 (BP = 0.979 ratio = 0.979 ... | 67.6 | 256 |
| mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3 | 139 | 60.000967 | 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... | | 256 |
| mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3 | 242 | 64.062582 | 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... | | 256 |
| mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2 | 892 | 64.583819 | 80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... | | 256 |
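The listing above trades model size against BLEU. As a rough sketch of how one might pick a model under a size budget, the snippet below copies the sizes and BLEU scores from that listing into a plain Python list. The `MODELS` list, the `pick_model` helper, and the `"standard"` / `"noisy"` family labels are illustrative assumptions, not part of Malaya. According to the log messages above, the two families were scored on different test sets, so the helper only compares BLEU within one family.

```python
# (model name, family, size in MB, BLEU) copied from the listing above.
# The "standard" / "noisy" family labels are a hypothetical grouping based on
# the model names; they are not a Malaya concept.
MODELS = [
    ("mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2", "standard", 139, 41.625536),
    ("mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2", "standard", 242, 43.937298),
    ("mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2", "standard", 892, 44.173559),
    ("mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3", "noisy", 139, 60.000967),
    ("mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3", "noisy", 242, 64.062582),
    ("mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2", "noisy", 892, 64.583819),
]


def pick_model(family: str, max_size_mb: int) -> str:
    """Return the highest-BLEU model of one family that fits the size budget."""
    candidates = [m for m in MODELS if m[1] == family and m[2] <= max_size_mb]
    if not candidates:
        raise ValueError(f"no {family!r} model fits within {max_size_mb} MB")
    # Compare BLEU only within a family, since the test sets differ.
    return max(candidates, key=lambda m: m[3])[0]
```

The returned name can then be passed as the `model` argument of `malaya.translation.en_ms.huggingface`, for example `pick_model("standard", 300)` for a budget of 300 MB.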


### Load Transformer models

```python
def huggingface(
    model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate EN-to-MS.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')
        Check available models at `malaya.translation.en_ms.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
```

In [16]:
transformer = malaya.translation.en_ms.transformer()

In [18]:
transformer_huggingface = malaya.translation.en_ms.huggingface()

### Translate

```python
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
```

**For better results, always split the input at sentence boundaries before translating.**
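One possible way to follow this advice is sketched below. The helpers `split_sentences` and `translate_by_sentence` are hypothetical and not part of Malaya. The regex split is deliberately naive: it breaks after `.`, `!` or `?` followed by whitespace, so it will also split after abbreviations such as "Dr." and a proper sentence tokenizer may be preferable for real text. The only assumption about the model is that it exposes `generate(strings, **kwargs)` as documented above.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def translate_by_sentence(model, text: str, **generate_kwargs) -> str:
    """Translate one document sentence by sentence and rejoin the results.

    `model` is any object exposing `generate(strings, **kwargs) -> List[str]`,
    such as the HuggingFace translator loaded in this tutorial.
    """
    sentences = split_sentences(text)
    if not sentences:
        return ''
    # Each sentence becomes its own item in the batch sent to the model.
    translated = model.generate(sentences, **generate_kwargs)
    return ' '.join(translated)
```

With the objects defined in this tutorial, a call might look like `translate_by_sentence(transformer_huggingface, string_news1, max_length=256)`, where 256 is the suggested length reported by `available_huggingface()`.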

In [6]:
from pprint import pprint

In [7]:
# https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420

string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not "popular" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'
pprint(string_news1)

('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '
 'prime minister candidate as he is allegedly not "popular" among the Malays, '
 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '
 'the PKR president needs someone like himself in order to acquire support '
 'from the Malays and win the election.')


In [8]:
# https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html

string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. "I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill," James tweeted.'
pprint(string_news2)

('(CNN)New York Attorney General Letitia James on Monday ordered the Black '
 'Lives Matter Foundation -- which she said is not affiliated with the larger '
 'Black Lives Matter movement -- to stop collecting donations in New York. "I '
 'ordered the Black Lives Matter Foundation to stop illegally accepting '
 'donations that were intended for the #BlackLivesMatter movement. This '
 'foundation is not affiliated with the movement, yet it accepted countless '
 'donations and deceived goodwill," James tweeted.')


In [9]:
# https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports

string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'
pprint(string_news3)

('Amongst the wide-ranging initiatives proposed are a sustainable food '
 'labelling framework, a reformulation of processed foods, and a '
 'sustainability chapter in all EU bilateral trade agreements. The EU also '
 'plans to publish a proposal for a legislative framework for sustainable food '
 'systems by 2023 to ensure all foods on the EU market become increasingly '
 'sustainable.')


In [10]:
# https://jamesclear.com/articles

string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'
pprint(string_article1)

('This page shares my best articles to read on topics like health, happiness, '
 'creativity, productivity and more. The central question that drives my work '
 'is, “How can we live better?” To answer that question, I like to write about '
 'science-based ways to solve practical problems.')


In [11]:
%%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3, string_article1]))

['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, Tun Dr Mahathir Mohamad mendakwa, bekas Perdana Menteri itu '
 'dilaporkan berkata Presiden PKR itu memerlukan seseorang seperti dirinya '
 'bagi mendapatkan sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makan

In [12]:
%%time

pprint(transformer_huggingface.generate([string_news1, string_news2, string_news3, string_article1],
                                       max_length=1000))

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai sebagai calon '
 'perdana menteri kerana didakwa tidak "popular" dalam kalangan orang Melayu, '
 'dakwa Tun Dr Mahathir Mohamad. Bekas perdana menteri itu dilaporkan berkata '
 'presiden PKR memerlukan seseorang seperti dirinya untuk memperoleh sokongan '
 'daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin mengarahkan '
 'Yayasan Black Lives Matter - yang katanya tidak bergabung dengan pergerakan '
 'Black Lives Matter yang lebih besar - untuk berhenti mengumpul sumbangan di '
 'New York. "Saya mengarahkan Yayasan Black Lives Matter berhenti menerima '
 'sumbangan secara haram yang bertujuan untuk pergerakan #BlackLivesMatter. '
 'Yayasan ini tidak bergabung dengan pergerakan itu, namun ia menerima '
 'sumbangan yang tidak terkira banyaknya dan menipu muhibah," tulis James.',
 'Antara inisiatif meluas yang dicadangkan ialah rangka pelabelan makanan '
 'mampan, p

### Compare with Google Translate using googletrans

Install it with:

```bash
pip3 install googletrans==4.0.0rc1
```

In [13]:
from googletrans import Translator

translator = Translator()

In [14]:
strings = [string_news1, string_news2, string_news3, string_article1]

In [15]:
for t in strings:
    r = translator.translate(t, src='en', dest='ms')
    print(r.text)

KUALA LUMPUR, 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai sebagai calon Perdana Menteri kerana dia tidak "popular" di kalangan orang Melayu, Tun Dr Mahathir Mohamad mendakwa.Bekas Perdana Menteri dilaporkan berkata presiden PKR memerlukan seseorang seperti dirinya untuk mendapatkan sokongan daripada orang Melayu dan memenangi pilihan raya.
(CNN) Peguam Negara New York, Letitia James pada hari Isnin mengarahkan Yayasan Black Lives Matter - yang dikatakannya tidak bergabung dengan pergerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpul sumbangan di New York."Saya memerintahkan Yayasan Black Lives Matter untuk berhenti menerima sumbangan secara haram yang dimaksudkan untuk gerakan #BlackLivesMatter.
Di antara inisiatif yang luas yang dicadangkan adalah rangka kerja pelabelan makanan yang mampan, pembaharuan makanan yang diproses, dan bab kemampanan dalam semua perjanjian perdagangan dua hala EU.EU juga merancang untuk menerbitkan cadangan untuk rangka kerja perundang