# Transformer HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/transformer-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/transformer-huggingface).
    
</div>

The models were pretrained on the following datasets.

Standard Bahasa dataset, 

1. [Malay-dataset/dumping](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping).
2. [Malay-dataset/pure-text](https://github.com/huseinzol05/Malay-Dataset/tree/master/pure-text).

Bahasa social media,

1. [Malay-dataset/dumping/instagram](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/instagram).
2. [Malay-dataset/dumping/twitter](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/twitter).

Singlish / Manglish,

1. [Malay-dataset/dumping/singlish-text](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/singlish-text).
2. [Malay-dataset/dumping/singapore-news](https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping/singapore-news).

The T5 transformer models were fine-tuned on MNLI in both English and Malay.

**This interface does not support custom training**.

If you want to download a pretrained Transformer-Bahasa model and use it for custom transfer learning, you can get it from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/, which also has some notebooks to help you get started.

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
%%time
import malaya

CPU times: user 3.1 s, sys: 3.31 s, total: 6.41 s
Wall time: 2.33 s


### List available HuggingFace Transformer models

In [3]:
malaya.transformer.available_huggingface()

| Model | Size (MB) |
|---|---|
| mesolitica/roberta-base-bahasa-cased | 443.0 |
| mesolitica/roberta-tiny-bahasa-cased | 66.1 |
| mesolitica/bert-base-standard-bahasa-cased | 443.0 |
| mesolitica/bert-tiny-standard-bahasa-cased | 66.1 |
| mesolitica/roberta-base-standard-bahasa-cased | 443.0 |
| mesolitica/roberta-tiny-standard-bahasa-cased | 66.1 |
| mesolitica/electra-base-generator-bahasa-cased | 140.0 |
| mesolitica/electra-small-generator-bahasa-cased | 19.3 |
| mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased | 50.7 |
| mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased | 139.0 |


In [4]:
strings = ['Kerajaan galakkan rakyat naik public transport tapi parking kat lrt ada 15. Reserved utk staff rapid je dah berpuluh. Park kereta tepi jalan kang kene saman dgn majlis perbandaran. Kereta pulak senang kene curi. Cctv pun tak ada. Naik grab dah 5-10 ringgit tiap hari. Gampang juga',
           'Alaa Tun lek ahhh npe muka masam cmni kn agong kata usaha kerajaan terdahulu sejak selepas merdeka',
           "Orang ramai cakap nurse kerajaan garang. So i tell u this. Most of our local ppl will treat us as hamba abdi and they don't respect us as a nurse",
           'Pemuda mogok lapar desak kerajaan prihatin isu iklim',
           'kerajaan perlu kisah isu iklim, pemuda mogok lapar',
           'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar']

### Load HuggingFace model

```python
def huggingface(
    model: str = 'mesolitica/electra-base-generator-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load transformer model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/electra-base-generator-bahasa-cased')
        Check available models at `malaya.transformer.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own huggingface model.
    """
```

In [5]:
model = malaya.transformer.huggingface(model = 'mesolitica/electra-base-generator-bahasa-cased')

In [6]:
t5 = malaya.transformer.huggingface(model = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')

The sentences in `strings` are random tweets copied from Twitter, found by searching for the keyword `kerajaan`.

#### Vectorization

Convert a string or a batch of strings into a latent-space vector representation.

```python
def vectorize(
    self,
    strings: List[str],
    method: str = 'last',
    method_token: str = 'first',
    t5_head_logits: bool = True,
    **kwargs,
):
    """
    Vectorize string inputs.

    Parameters
    ----------
    strings: List[str]
    method: str, optional (default='last')
        hidden layers supported. Allowed values:

        * ``'last'`` - last layer.
        * ``'first'`` - first layer.
        * ``'mean'`` - average all layers.

        This is only applicable to non-T5 models.
    method_token: str, optional (default='first')
        token layers supported. Allowed values:

        * ``'last'`` - last token.
        * ``'first'`` - first token.
        * ``'mean'`` - average all tokens.

        Pretrained models are usually trained to use the `first` token for classification tasks.
        This is only applicable to non-T5 models.
    t5_head_logits: bool, optional (default=True)
        If True, use the head logits; otherwise, use the last token.
        This is only applicable to T5 models.

    Returns
    -------
    result: np.array
    """
```
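To make the `method` and `method_token` options concrete, here is a minimal NumPy sketch of the two-step pooling they describe: first reduce over layers, then reduce over tokens. The function `pool_hidden_states` and the toy array are hypothetical illustrations, not Malaya's implementation, and the sketch assumes a single sentence with no padding.

```python
import numpy as np


def pool_hidden_states(hidden_states, method="last", method_token="first"):
    """Reduce hidden states of shape (layers, tokens, dim) to one (dim,) vector.

    Hypothetical helper for illustration only; it is not part of Malaya.
    """
    # Step 1: reduce over layers, as the `method` argument describes.
    if method == "last":
        layer = hidden_states[-1]
    elif method == "first":
        layer = hidden_states[0]
    elif method == "mean":
        layer = hidden_states.mean(axis=0)
    else:
        raise ValueError(f"unsupported method: {method}")

    # Step 2: reduce over tokens, as the `method_token` argument describes.
    if method_token == "last":
        return layer[-1]
    if method_token == "first":
        return layer[0]
    if method_token == "mean":
        return layer.mean(axis=0)
    raise ValueError(f"unsupported method_token: {method_token}")


# Toy input: 2 layers, 3 tokens, 4 dimensions.
toy = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

default_vector = pool_hidden_states(toy)  # last layer, first token
mean_vector = pool_hidden_states(toy, method="mean", method_token="mean")
print(default_vector)
print(mean_vector)
```

With the defaults, the sentence vector is simply the first token of the last layer, which matches the docstring's note that the `first` token is the usual choice for classification.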

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

In [8]:
v = model.vectorize(strings)
v.shape

(6, 256)

In [9]:
cosine_similarity(v)

array([[1.0000002 , 0.7221392 , 0.7054833 , 0.66821235, 0.64426583,
        0.680184  ],
       [0.7221392 , 0.9999999 , 0.6226625 , 0.7186684 , 0.69928503,
        0.710604  ],
       [0.7054833 , 0.6226625 , 1.        , 0.6309349 , 0.6351998 ,
        0.6296929 ],
       [0.66821235, 0.7186684 , 0.6309349 , 0.9999999 , 0.95470273,
        0.8564712 ],
       [0.64426583, 0.69928503, 0.6351998 , 0.95470273, 1.        ,
        0.82342035],
       [0.680184  , 0.710604  , 0.6296929 , 0.8564712 , 0.82342035,
        1.        ]], dtype=float32)
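One way to read a similarity matrix like this is to find, for each sentence, its most similar other sentence. The sketch below copies the matrix printed above and masks the diagonal so a sentence cannot match itself; the variable names are illustrative.

```python
import numpy as np

# Cosine similarities copied from the ELECTRA output above.
sim = np.array([
    [1.0000002, 0.7221392, 0.7054833, 0.66821235, 0.64426583, 0.680184],
    [0.7221392, 0.9999999, 0.6226625, 0.7186684, 0.69928503, 0.710604],
    [0.7054833, 0.6226625, 1.0, 0.6309349, 0.6351998, 0.6296929],
    [0.66821235, 0.7186684, 0.6309349, 0.9999999, 0.95470273, 0.8564712],
    [0.64426583, 0.69928503, 0.6351998, 0.95470273, 1.0, 0.82342035],
    [0.680184, 0.710604, 0.6296929, 0.8564712, 0.82342035, 1.0],
])

# Ignore self-similarity by masking the diagonal.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)

# Index of the nearest other sentence for each row.
nearest = masked.argmax(axis=1)
print(nearest.tolist())  # [1, 0, 0, 4, 3, 3]
```

Sentences 3 and 4, the two "mogok lapar" tweets, pick each other with a similarity of about 0.95, which is what you would hope for from near-paraphrases.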

In [10]:
v = t5.vectorize(strings)
v.shape

(6, 512)

In [11]:
cosine_similarity(v)

array([[ 0.9999999 ,  0.811328  ,  0.31431305,  0.6870531 ,  0.8720901 ,
         0.8247642 ],
       [ 0.811328  ,  1.        ,  0.20530072,  0.7460517 ,  0.68824977,
         0.8908862 ],
       [ 0.31431305,  0.20530072,  0.9999999 , -0.32410604,  0.18822116,
         0.14731692],
       [ 0.6870531 ,  0.7460517 , -0.32410604,  1.        ,  0.64864993,
         0.8376935 ],
       [ 0.8720901 ,  0.68824977,  0.18822116,  0.64864993,  0.99999976,
         0.7094655 ],
       [ 0.8247642 ,  0.8908862 ,  0.14731692,  0.8376935 ,  0.7094655 ,
         1.        ]], dtype=float32)

In [12]:
v = t5.vectorize(strings, t5_head_logits = False)
cosine_similarity(v)

array([[ 0.99999994,  0.54898286,  0.36759147,  0.4454881 ,  0.36219484,
         0.50276357],
       [ 0.54898286,  1.0000002 ,  0.5760975 ,  0.64363074,  0.07843906,
         0.7960009 ],
       [ 0.36759147,  0.5760975 ,  1.0000001 ,  0.29530543, -0.3313179 ,
         0.63890153],
       [ 0.4454881 ,  0.64363074,  0.29530543,  0.99999976,  0.22043958,
         0.7768799 ],
       [ 0.36219484,  0.07843906, -0.3313179 ,  0.22043958,  1.        ,
         0.00962304],
       [ 0.50276357,  0.7960009 ,  0.63890153,  0.7768799 ,  0.00962304,
         0.99999976]], dtype=float32)

#### Attention

```python
def attention(
    self,
    strings: List[str],
    method: str = 'last',
    method_head: str = 'mean',
    t5_attention: str = 'cross_attentions',
    **kwargs,
):
    """
    Get attention for string inputs.

    Parameters
    ----------
    strings: List[str]
    method: str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.
    method_head: str, optional (default='mean')
        attention head supported. Allowed values:

        * ``'last'`` - attention from last head.
        * ``'first'`` - attention from first head.
        * ``'mean'`` - average attentions from all heads.
    t5_attention: str, optional (default='cross_attentions')
        attention type for T5 models. Allowed values:

        * ``'cross_attentions'`` - cross attention.
        * ``'encoder_attentions'`` - encoder attention.
        * ``'decoder_attentions'`` - decoder attention.

        This is only applicable to T5 models.

    Returns
    -------
    result : List[List[Tuple[str, float]]]
    """
```
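The `method` and `method_head` options can be pictured as two successive reductions over a raw attention tensor. The following NumPy sketch is an assumption about how such a reduction could work, not Malaya's actual code: `reduce_attention` is a hypothetical helper, the input is a random toy tensor of shape `(layers, heads, query, key)`, and the final step (averaging over query positions and renormalizing) is one plausible way to get a single score per token.

```python
import numpy as np


def reduce_attention(attn, method="last", method_head="mean"):
    """Reduce (layers, heads, query, key) attention to one score per key token.

    Hypothetical helper for illustration only; it is not part of Malaya.
    """
    def pick(x, how):
        # Select or average along the leading axis.
        if how == "last":
            return x[-1]
        if how == "first":
            return x[0]
        if how == "mean":
            return x.mean(axis=0)
        raise ValueError(f"unsupported option: {how}")

    per_layer = pick(attn, method)           # (heads, query, key)
    per_head = pick(per_layer, method_head)  # (query, key)
    scores = per_head.mean(axis=0)           # average over query positions
    return scores / scores.sum()             # normalize so scores sum to 1


# Toy attention: 2 layers, 3 heads, 4 tokens, each row normalized like a softmax.
rng = np.random.default_rng(0)
raw = rng.random((2, 3, 4, 4))
toy_attention = raw / raw.sum(axis=-1, keepdims=True)

weights = reduce_attention(toy_attention, method="last", method_head="mean")
print(weights.shape, weights.sum())
```

Whatever the library does internally, the returned value has the same shape of information: one non-negative weight per input token.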

You can pass a list containing one or more strings to get the attention; the examples in this tutorial use a single string.

In [13]:
model.attention([strings[1]], method = 'last')

[[('Alaa', 0.058868613),
  ('Tun', 0.061252564),
  ('lek', 0.068989426),
  ('ahhh', 0.06439798),
  ('npe', 0.05082519),
  ('muka', 0.07244482),
  ('masam', 0.053202268),
  ('cmni', 0.048232816),
  ('kn', 0.058161996),
  ('agong', 0.06559847),
  ('kata', 0.05514033),
  ('usaha', 0.057437424),
  ('kerajaan', 0.041059926),
  ('terdahulu', 0.044371374),
  ('sejak', 0.06925424),
  ('selepas', 0.069484584),
  ('merdeka', 0.061277922)]]

In [14]:
model.attention([strings[1]], method = 'first')

[[('Alaa', 0.06183809),
  ('Tun', 0.053071994),
  ('lek', 0.04778199),
  ('ahhh', 0.04694453),
  ('npe', 0.052150376),
  ('muka', 0.053927913),
  ('masam', 0.05807442),
  ('cmni', 0.08068735),
  ('kn', 0.05034357),
  ('agong', 0.054398913),
  ('kata', 0.057019003),
  ('usaha', 0.058209926),
  ('kerajaan', 0.069378614),
  ('terdahulu', 0.080670245),
  ('sejak', 0.057985097),
  ('selepas', 0.06437356),
  ('merdeka', 0.053144373)]]

In [15]:
model.attention([strings[1]], method = 'mean')

[[('Alaa', 0.048754193),
  ('Tun', 0.054038025),
  ('lek', 0.05312951),
  ('ahhh', 0.05706034),
  ('npe', 0.04947073),
  ('muka', 0.060973253),
  ('masam', 0.05763235),
  ('cmni', 0.0723617),
  ('kn', 0.05290027),
  ('agong', 0.053802904),
  ('kata', 0.07015141),
  ('usaha', 0.06137535),
  ('kerajaan', 0.06380818),
  ('terdahulu', 0.06389959),
  ('sejak', 0.056653723),
  ('selepas', 0.0524459),
  ('merdeka', 0.07154253)]]

In [16]:
t5.attention([strings[1]], method = 'last', t5_attention = 'cross_attentions')

[[('Alaa', 0.020544203),
  ('Tun', 0.014763401),
  ('lek', 0.016012296),
  ('ahhh', 0.010291392),
  ('npe', 0.015250254),
  ('muka', 0.025421701),
  ('masam', 0.009508574),
  ('cmni', 0.02144086),
  ('kn', 0.016106293),
  ('agong', 0.011552193),
  ('kata', 0.054768853),
  ('usaha', 0.118117474),
  ('kerajaan', 0.09078922),
  ('terdahulu', 0.10955463),
  ('sejak', 0.08275472),
  ('selepas', 0.09991094),
  ('merdeka', 0.28321296)]]
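Because the result is a list of `(token, score)` tuples per input string, plain Python is enough to rank tokens. This small sketch copies the cross-attention output above and picks the three highest-scoring tokens; the variable names are illustrative.

```python
# Cross-attention output copied from the cell above (one input string).
cross_attention = [[
    ('Alaa', 0.020544203), ('Tun', 0.014763401), ('lek', 0.016012296),
    ('ahhh', 0.010291392), ('npe', 0.015250254), ('muka', 0.025421701),
    ('masam', 0.009508574), ('cmni', 0.02144086), ('kn', 0.016106293),
    ('agong', 0.011552193), ('kata', 0.054768853), ('usaha', 0.118117474),
    ('kerajaan', 0.09078922), ('terdahulu', 0.10955463), ('sejak', 0.08275472),
    ('selepas', 0.09991094), ('merdeka', 0.28321296),
]]

# Sort the tokens of the first (and only) string by score, highest first.
top3 = sorted(cross_attention[0], key=lambda pair: pair[1], reverse=True)[:3]
print(top3)
# [('merdeka', 0.28321296), ('usaha', 0.118117474), ('terdahulu', 0.10955463)]
```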

In [17]:
t5.attention([strings[1]], method = 'last', t5_attention = 'encoder_attentions')

[[('Alaa', 0.06314257),
  ('Tun', 0.111133136),
  ('lek', 0.02435303),
  ('ahhh', 0.038024332),
  ('npe', 0.02238715),
  ('muka', 0.040521525),
  ('masam', 0.017215427),
  ('cmni', 0.03169653),
  ('kn', 0.031364124),
  ('agong', 0.055961456),
  ('kata', 0.09224097),
  ('usaha', 0.024536965),
  ('kerajaan', 0.07011165),
  ('terdahulu', 0.14476436),
  ('sejak', 0.086011514),
  ('selepas', 0.060092986),
  ('merdeka', 0.08644233)]]

In [18]:
t5.attention([strings[1]], method = 'last', t5_attention = 'decoder_attentions')

[[('Alaa', 0.20212816),
  ('Tun', 0.12662782),
  ('lek', 0.043903284),
  ('ahhh', 0.06454229),
  ('npe', 0.051137008),
  ('muka', 0.05798657),
  ('masam', 0.023386382),
  ('cmni', 0.062053815),
  ('kn', 0.052029867),
  ('agong', 0.06720375),
  ('kata', 0.064140044),
  ('usaha', 0.05823552),
  ('kerajaan', 0.036468714),
  ('terdahulu', 0.025139563),
  ('sejak', 0.016029576),
  ('selepas', 0.021070626),
  ('merdeka', 0.027917115)]]

In [19]:
roberta = malaya.transformer.huggingface(model = 'mesolitica/roberta-base-standard-bahasa-cased')

In [20]:
roberta.attention([strings[1]], method = 'last')

[[('Alaa', 0.052424457),
  ('Tun', 0.08523697),
  ('lek', 0.06813959),
  ('ahhh', 0.06153968),
  ('npe', 0.06513652),
  ('muka', 0.059199467),
  ('masam', 0.061626367),
  ('cmni', 0.06737202),
  ('kn', 0.06622731),
  ('agong', 0.052743737),
  ('kata', 0.067238666),
  ('usaha', 0.044102527),
  ('kerajaan', 0.060376044),
  ('terdahulu', 0.041831747),
  ('sejak', 0.04189242),
  ('selepas', 0.039302666),
  ('merdeka', 0.065609865)]]