# MS-EN alignment using HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/alignment-ms-en-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/alignment-ms-en-huggingface).
    
</div>

<div class="alert alert-warning">

Requires TensorFlow >= 2.0 for the HuggingFace interface.
    
</div>

In [1]:
%%time
import malaya

CPU times: user 6.21 s, sys: 1.19 s, total: 7.4 s
Wall time: 8.73 s


## Install Transformers

Make sure `transformers` is installed:

```bash
pip3 install transformers
```
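To confirm the environment before loading a model, a minimal sketch like the one below can help. It assumes the packages were installed under the distribution names `tensorflow` and `transformers` (some platforms ship TensorFlow under a different name, in which case the lookup reports it as missing). The helper names `installed_version`, `numeric_prefix` and `meets_minimum` are hypothetical and not part of Malaya.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numeric_prefix(version: str, parts: int = 2) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '2.11.0rc1' -> (2, 11)."""
    numbers = []
    for piece in version.split('.')[:parts]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits) if digits else 0)
    # Pad short versions such as '2' so tuple comparison behaves as expected.
    while len(numbers) < parts:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(version: Optional[str], minimum: str) -> bool:
    """True when `version` is present and not older than `minimum`."""
    return version is not None and numeric_prefix(version) >= numeric_prefix(minimum)


# TensorFlow >= 2.0 is the requirement stated in this tutorial.
tf_version = installed_version('tensorflow')
print('tensorflow:', tf_version, 'ok' if meets_minimum(tf_version, '2.0') else 'missing or too old')

# No minimum version for transformers is stated here, so only presence is checked.
print('transformers:', installed_version('transformers'))
```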

## List available HuggingFace models

In [2]:
malaya.alignment.ms_en.available_huggingface()

| Model | Size (MB) |
|---|---|
| mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms | 599 |
| bert-base-multilingual-cased | 714 |


## Load HuggingFace model

```python
def huggingface(model: str = 'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', **kwargs):
    """
    Load a HuggingFace BERT word-alignment model for MS-EN. Requires TensorFlow >= 2.0.

    Parameters
    ----------
    model : str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')
        Model architecture supported. Allowed values:

        * ``'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms'`` - multilingual BERT fine-tuned on noisy EN-MS data.
        * ``'bert-base-multilingual-cased'`` - pretrained multilingual BERT.

    Returns
    -------
    result: malaya.model.alignment.HuggingFace
    """
```

In [3]:
model = malaya.alignment.ms_en.huggingface()

Some layers from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms and are newly initialized: ['bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Align

```python
def align(
    self,
    source: List[str],
    target: List[str],
    align_layer: int = 8,
    threshold: float = 1e-3,
):
    """
    align text using softmax output layers.

    Parameters
    ----------
    source: List[str]
    target: List[str]
    align_layer: int, optional (default=8)
        transformer layer-k to choose for embedding output.
    threshold: float, optional (default=1e-3)
        minimum probability to assume as alignment.

    Returns
    -------
    result: List[List[Tuple]]
    """
```
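To give an intuition for how `threshold` interacts with softmax probabilities, here is a minimal NumPy sketch of bidirectional softmax alignment. It is an assumption about the general technique, not Malaya's actual implementation: it works on one hypothetical vector per word, whereas the real model takes its embeddings from transformer layer `align_layer`. The names `softmax` and `extract_alignments` are illustrative only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def extract_alignments(source_vectors, target_vectors, threshold=1e-3):
    """Return (source index, target index) pairs whose probability exceeds
    `threshold` in both directions (source-to-target and target-to-source)."""
    # Dot-product similarity, shape (n_source, n_target).
    similarity = source_vectors @ target_vectors.T
    source_to_target = softmax(similarity, axis=1)
    target_to_source = softmax(similarity, axis=0)
    keep = (source_to_target > threshold) & (target_to_source > threshold)
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(keep))]


# Toy vectors (made up for illustration): target words 1 and 2 are swapped.
source_vectors = 5.0 * np.eye(3)
target_vectors = 5.0 * np.eye(3)[[0, 2, 1]]
print(extract_alignments(source_vectors, target_vectors))
# → [(0, 0), (1, 2), (2, 1)]
```

In this sketch, raising `threshold` keeps fewer pairs and lowering it keeps more. Requiring the probability to pass in both directions is what filters out one-sided matches.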

In [4]:
left = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']
right = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']

In [5]:
results = model.align(left, right, align_layer = 7)
results

[[(0, 0),
  (1, 1),
  (2, 2),
  (3, 4),
  (4, 5),
  (5, 6),
  (6, 7),
  (6, 8),
  (8, 8),
  (9, 9),
  (10, 10),
  (11, 11),
  (12, 12),
  (13, 13),
  (14, 14),
  (15, 15),
  (16, 16),
  (17, 17),
  (18, 18),
  (19, 19)]]
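One source word can align to several target words, as with index 6 above. Assuming each tuple is a `(source index, target index)` pair over whitespace-split tokens, a small helper (hypothetical, not part of Malaya) can group the pairs by source index:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_source(pairs: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each source index to the list of target indices aligned to it."""
    grouped = defaultdict(list)
    for source_index, target_index in pairs:
        grouped[source_index].append(target_index)
    return dict(grouped)


# A slice of the alignment shown above.
pairs = [(5, 6), (6, 7), (6, 8), (8, 8)]
print(group_by_source(pairs))
# → {5: [6], 6: [7, 8], 8: [8]}
```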

In [6]:
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    for k in results[i]:
        print(i, left_splitted[k[0]], right_splitted[k[1]])

0 Terminal Terminal
0 1 1
0 KKIA KKIA
0 dilengkapi equipped
0 kemudahan with
0 64 64
0 kaunter check-in
0 kaunter counters,
0 masuk, counters,
0 12 12
0 aero aero
0 bridge bridges
0 selain and
0 mampu can
0 menampung accommodate
0 3,200 3,200
0 penumpang passengers
0 dalam at
0 satu a
0 masa. time.
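
The alignment does not have to cover every token. Assuming each tuple is a `(source index, target index)` pair over whitespace-split tokens, the sketch below lists the words that received no alignment. The helper `unaligned_tokens` is hypothetical, and the pairs are copied from the `results` output shown earlier.

```python
from typing import List, Tuple


def unaligned_tokens(
    source: str, target: str, pairs: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Return the source and target tokens that appear in no alignment pair."""
    source_tokens = source.split()
    target_tokens = target.split()
    aligned_source = {s for s, _ in pairs}
    aligned_target = {t for _, t in pairs}
    missing_source = [tok for i, tok in enumerate(source_tokens) if i not in aligned_source]
    missing_target = [tok for j, tok in enumerate(target_tokens) if j not in aligned_target]
    return missing_source, missing_target


left = 'Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.'
right = 'Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.'

# Pairs copied from the `results` output of the tutorial.
pairs = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (6, 8), (8, 8)]
pairs += [(i, i) for i in range(9, 20)]

print(unaligned_tokens(left, right, pairs))
# → (['daftar'], ['is'])
```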
