# Transformer HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/transformer-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/transformer-huggingface).
    
</div>

Below is the list of datasets the models were pretrained on:

Standard Bahasa dataset, 

1. [Malay-dataset/dumping](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping).
2. [Malay-dataset/pure-text](https://github.com/huseinzol05/Malay-Dataset/tree/master/pure-text).

Bahasa social media,

1. [Malay-dataset/dumping/instagram](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/instagram).
2. [Malay-dataset/dumping/twitter](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/twitter).

Singlish / Manglish,

1. [Malay-dataset/dumping/singlish](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/singlish-text).
2. [Malay-dataset/dumping/singapore-news](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/singapore-news).

**This interface does not support custom training.**

If you want to download a pretrained Transformer-Bahasa model and use it for custom transfer learning, you can get it from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/, along with some notebooks to help you get started.

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
%%time
import malaya

CPU times: user 3.1 s, sys: 3.46 s, total: 6.56 s
Wall time: 2.28 s


### List available HuggingFace Transformer models

In [3]:
malaya.transformer.available_huggingface()

| Model | Size (MB) |
|---|---|
| mesolitica/roberta-base-bahasa-cased | 443.0 |
| mesolitica/roberta-tiny-bahasa-cased | 66.1 |
| mesolitica/bert-base-standard-bahasa-cased | 443.0 |
| mesolitica/bert-tiny-standard-bahasa-cased | 66.1 |
| mesolitica/roberta-base-standard-bahasa-cased | 443.0 |
| mesolitica/roberta-tiny-standard-bahasa-cased | 66.1 |
| mesolitica/electra-base-generator-bahasa-cased | 140.0 |
| mesolitica/electra-small-generator-bahasa-cased | 19.3 |
| mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased | 50.7 |
| mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased | 139.0 |


In [4]:
strings = ['Kerajaan galakkan rakyat naik public transport tapi parking kat lrt ada 15. Reserved utk staff rapid je dah berpuluh. Park kereta tepi jalan kang kene saman dgn majlis perbandaran. Kereta pulak senang kene curi. Cctv pun tak ada. Naik grab dah 5-10 ringgit tiap hari. Gampang juga',
           'Alaa Tun lek ahhh npe muka masam cmni kn agong kata usaha kerajaan terdahulu sejak selepas merdeka',
           "Orang ramai cakap nurse kerajaan garang. So i tell u this. Most of our local ppl will treat us as hamba abdi and they don't respect us as a nurse"]

### Load HuggingFace model

```python
def huggingface(
    model: str = 'mesolitica/electra-base-generator-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load transformer model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/electra-base-generator-bahasa-cased')
        Check available models at `malaya.transformer.available_huggingface()`.
    force_check: bool, optional (default=True)
        Check that the model is one of the Malaya models.
        Set to False if you want to load your own HuggingFace model.
    """
```

In [5]:
model = malaya.transformer.huggingface(model = 'mesolitica/electra-base-generator-bahasa-cased')

The example `strings` defined earlier are random sentences copied from Twitter, found by searching for the keyword `kerajaan`.

#### Vectorization

Convert a string or a batch of strings into vector representations in the model's latent space.

```python
def vectorize(
    self,
    strings: List[str],
    method: str = 'last',
    method_token: str = 'first',
    **kwargs,
):
    """
    Vectorize string inputs.

    Parameters
    ----------
    strings: List[str]
    method: str, optional (default='last')
        Hidden layer to use. Allowed values:

        * ``'last'`` - last layer.
        * ``'first'`` - first layer.
        * ``'mean'`` - average all layers.
    method_token: str, optional (default='first')
        Token position to use. Allowed values:

        * ``'last'`` - last token.
        * ``'first'`` - first token.
        * ``'mean'`` - average all tokens.

        Pretrained models are usually trained to use the `first` token for classification tasks.

    Returns
    -------
    result: np.array
    """
```
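To make the `method` and `method_token` options concrete, here is a minimal NumPy sketch of the two pooling steps. It is only an illustration under assumptions: the function name `pool_hidden_states`, the input shape `(n_layers, batch, n_tokens, dim)`, and the random stand-in data are all hypothetical, and it ignores padding masks. It is not Malaya's actual implementation.

```python
import numpy as np


def pool_hidden_states(hidden_states, method='last', method_token='first'):
    """
    Hypothetical helper. `hidden_states` is assumed to have shape
    (n_layers, batch, n_tokens, dim). Padding is ignored in this sketch.
    """
    hidden_states = np.asarray(hidden_states)

    # Step 1: choose which hidden layer(s) to use.
    if method == 'last':
        layer = hidden_states[-1]
    elif method == 'first':
        layer = hidden_states[0]
    elif method == 'mean':
        layer = hidden_states.mean(axis=0)
    else:
        raise ValueError("method must be 'last', 'first' or 'mean'")

    # Step 2: reduce the token axis, (batch, n_tokens, dim) -> (batch, dim).
    if method_token == 'first':
        return layer[:, 0]
    if method_token == 'last':
        return layer[:, -1]
    if method_token == 'mean':
        return layer.mean(axis=1)
    raise ValueError("method_token must be 'last', 'first' or 'mean'")


# Random stand-in data: 4 layers, 3 sentences, 10 tokens, 256 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3, 10, 256))
vectors = pool_hidden_states(hidden, method='last', method_token='first')
print(vectors.shape)
```

With the defaults, each sentence is represented by the first token of the last layer, which is the position classification heads usually read from.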

In [6]:
v = model.vectorize(strings)
v.shape

(3, 256)
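One common thing to do with these vectors is to compare sentences with cosine similarity. The sketch below uses plain NumPy; `cosine_similarity_matrix` is a hypothetical helper, and `v_demo` is random stand-in data with the same shape as `v`, so that the snippet runs without downloading a model. In a real session you would pass `v` instead.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms to avoid dividing by zero for all-zero rows.
    normalized = vectors / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Stand-in for `v`; replace with the real output of model.vectorize(strings).
rng = np.random.default_rng(0)
v_demo = rng.normal(size=(3, 256))
sim = cosine_similarity_matrix(v_demo)
print(sim.shape)
```

Entry `sim[i, j]` is the similarity between sentence `i` and sentence `j`; the diagonal is always 1.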

#### Attention

```python
def attention(
    self,
    strings: List[str],
    method: str = 'last',
    method_head: str = 'mean',
):
    """
    Get attention scores for string inputs.

    Parameters
    ----------
    strings: List[str]
    method: str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.
    method_head: str, optional (default='mean')
        Attention head to use. Allowed values:

        * ``'last'`` - attention from last head.
        * ``'first'`` - attention from first head.
        * ``'mean'`` - average attentions from all heads.

    Returns
    -------
    result : List[List[Tuple[str, float]]]
    """
```
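As a rough picture of how `method` and `method_head` could turn raw attention weights into one score per token, here is a NumPy sketch. It rests on several assumptions: the helper name `pool_attention`, the input shape `(n_layers, n_heads, n_tokens, n_tokens)`, and the choice to score each token by the average attention it receives are all hypothetical. Malaya's real implementation may differ, for example in how it merges subword pieces back into words.

```python
import numpy as np


def pool_attention(attentions, method='last', method_head='mean'):
    """
    Hypothetical helper. `attentions` is assumed to have shape
    (n_layers, n_heads, n_tokens, n_tokens), each row summing to 1.
    """
    attentions = np.asarray(attentions, dtype=np.float64)
    pick = {
        'last': lambda a: a[-1],
        'first': lambda a: a[0],
        'mean': lambda a: a.mean(axis=0),
    }
    if method not in pick or method_head not in pick:
        raise ValueError("allowed values are 'last', 'first' and 'mean'")

    per_layer = pick[method](attentions)     # (n_heads, n_tokens, n_tokens)
    per_head = pick[method_head](per_layer)  # (n_tokens, n_tokens)

    # Assumption: a token's score is the attention it receives,
    # averaged over all query positions.
    scores = per_head.mean(axis=0)
    return scores / scores.sum()


# Random stand-in attention: 2 layers, 4 heads, 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5, 5))
weights = np.exp(logits)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys

scores = pool_attention(weights, method='last', method_head='mean')
print(scores.round(3), scores.sum())
```

Because every attention row sums to 1, the pooled scores also sum to 1 across the tokens of a sentence.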

You can pass one or more strings in the list to get the attention; in this documentation, I pass just a single string.

In [7]:
model.attention([strings[1]], method = 'last')

[[('Alaa', 0.058868613),
  ('Tun', 0.061252564),
  ('lek', 0.068989426),
  ('ahhh', 0.06439798),
  ('npe', 0.05082519),
  ('muka', 0.07244482),
  ('masam', 0.053202268),
  ('cmni', 0.048232816),
  ('kn', 0.058161996),
  ('agong', 0.06559847),
  ('kata', 0.05514033),
  ('usaha', 0.057437424),
  ('kerajaan', 0.041059926),
  ('terdahulu', 0.044371374),
  ('sejak', 0.06925424),
  ('selepas', 0.069484584),
  ('merdeka', 0.061277922)]]

In [8]:
model.attention([strings[1]], method = 'first')

[[('Alaa', 0.06183809),
  ('Tun', 0.053071994),
  ('lek', 0.04778199),
  ('ahhh', 0.04694453),
  ('npe', 0.052150376),
  ('muka', 0.053927913),
  ('masam', 0.05807442),
  ('cmni', 0.08068735),
  ('kn', 0.05034357),
  ('agong', 0.054398913),
  ('kata', 0.057019003),
  ('usaha', 0.058209926),
  ('kerajaan', 0.069378614),
  ('terdahulu', 0.080670245),
  ('sejak', 0.057985097),
  ('selepas', 0.06437356),
  ('merdeka', 0.053144373)]]

In [9]:
model.attention([strings[1]], method = 'mean')

[[('Alaa', 0.048754193),
  ('Tun', 0.054038025),
  ('lek', 0.05312951),
  ('ahhh', 0.05706034),
  ('npe', 0.04947073),
  ('muka', 0.060973253),
  ('masam', 0.05763235),
  ('cmni', 0.0723617),
  ('kn', 0.05290027),
  ('agong', 0.053802904),
  ('kata', 0.07015141),
  ('usaha', 0.06137535),
  ('kerajaan', 0.06380818),
  ('terdahulu', 0.06389959),
  ('sejak', 0.056653723),
  ('selepas', 0.0524459),
  ('merdeka', 0.07154253)]]

In [10]:
roberta = malaya.transformer.huggingface(model = 'mesolitica/roberta-base-standard-bahasa-cased')

In [11]:
roberta.attention([strings[1]], method = 'last')

[[('Alaa', 0.052424457),
  ('Tun', 0.08523697),
  ('lek', 0.06813959),
  ('ahhh', 0.06153968),
  ('npe', 0.06513652),
  ('muka', 0.059199467),
  ('masam', 0.061626367),
  ('cmni', 0.06737202),
  ('kn', 0.06622731),
  ('agong', 0.052743737),
  ('kata', 0.067238666),
  ('usaha', 0.044102527),
  ('kerajaan', 0.060376044),
  ('terdahulu', 0.041831747),
  ('sejak', 0.04189242),
  ('selepas', 0.039302666),
  ('merdeka', 0.065609865)]]
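
The scores are easier to read once they are sorted. The small sketch below ranks the tokens of one sentence by attention score; `top_tokens` is a hypothetical helper, and `roberta_last` is simply the RoBERTa output above pasted in so that the snippet runs on its own.

```python
def top_tokens(token_scores, k=3):
    """Return the k (token, score) pairs with the highest scores."""
    return sorted(token_scores, key=lambda pair: pair[1], reverse=True)[:k]


# Scores copied from the roberta.attention(..., method='last') output above.
roberta_last = [
    ('Alaa', 0.052424457), ('Tun', 0.08523697), ('lek', 0.06813959),
    ('ahhh', 0.06153968), ('npe', 0.06513652), ('muka', 0.059199467),
    ('masam', 0.061626367), ('cmni', 0.06737202), ('kn', 0.06622731),
    ('agong', 0.052743737), ('kata', 0.067238666), ('usaha', 0.044102527),
    ('kerajaan', 0.060376044), ('terdahulu', 0.041831747),
    ('sejak', 0.04189242), ('selepas', 0.039302666), ('merdeka', 0.065609865),
]

top = top_tokens(roberta_last, k=3)
print([token for token, _ in top])  # ['Tun', 'lek', 'cmni']
```

In a real session you would pass `roberta.attention([strings[1]], method = 'last')[0]` directly instead of the pasted list.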