<a href="https://colab.research.google.com/github/graehl/awesome-align/blob/master/awesome_align_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

[``awesome-align``](https://github.com/neulab/awesome-align) is a tool that extracts word alignments from multilingual BERT (mBERT) and lets you fine-tune mBERT on parallel corpora for better alignment quality (see [our paper](https://arxiv.org/abs/2101.08231) for more details).

This is a simple demo of how `awesome-align` extracts word alignments from mBERT.

First, install and import the following packages. (Note that the original `awesome-align` tool does not require the `transformers` package.)

In [1]:
!pip install transformers
import torch
import transformers
import itertools



Load the multilingual BERT model and its tokenizer.

In [2]:
model = transformers.BertModel.from_pretrained('bert-base-multilingual-cased')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-multilingual-cased')

Input *tokenized* source and target sentences.

In [12]:
src = 'I bought a new car because I was going through a midlife crisis .'
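# Alternative Russian target; uncomment it (and comment out the Spanish line) to align English-Russian instead.
# tgt = 'Я купил новую тачку , потому что я переживал кризис среднего возраста .'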
tgt = 'Compré un auto nuevo porque estaba pasando por una crisis de la mediana edad .'
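
Because mBERT works on subwords while the sentences above are split into words, the next cell has to remember which word each subword came from. The following is a minimal sketch of that bookkeeping. The subword splits in `toy_subwords` are hypothetical stand-ins for `tokenizer.tokenize(word)` output, and `build_sub2word_map` is an illustrative helper name, not part of `awesome-align`.

```python
# Hypothetical subword splits, one inner list per whitespace-separated word.
# The real splits come from tokenizer.tokenize(word) and may differ.
toy_subwords = [["I"], ["bought"], ["a"], ["mid", "##life"], ["crisis"]]


def build_sub2word_map(subwords_per_word):
    """Return a list whose k-th entry is the word index of the k-th subword."""
    mapping = []
    for word_index, pieces in enumerate(subwords_per_word):
        # Every subword of this word points back to the same word index.
        mapping.extend([word_index] * len(pieces))
    return mapping


sub2word = build_sub2word_map(toy_subwords)
print(sub2word)  # [0, 1, 2, 3, 3, 4]
```

With such a map, an alignment found between subword positions can be reported as an alignment between whole words.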

Run the model and print the resulting alignments.
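
The cell below scores every source subword against every target subword with a dot product, applies a softmax in each direction, and keeps only the pairs whose probability exceeds the threshold in both directions. As a rough illustration of that intersection step, here is a dependency-free sketch in plain Python. The similarity values in `toy_sim` are made up, and `softmax` and `intersect_alignments` are illustrative names rather than functions from `awesome-align` or PyTorch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def intersect_alignments(sim, threshold):
    """Return (i, j) pairs whose probability exceeds threshold in both directions."""
    n_src, n_tgt = len(sim), len(sim[0])
    # Source-to-target: normalize each row over the target positions.
    src_to_tgt = [softmax(row) for row in sim]
    # Target-to-source: normalize each column over the source positions.
    tgt_to_src = [softmax([sim[i][j] for i in range(n_src)]) for j in range(n_tgt)]
    pairs = []
    for i in range(n_src):
        for j in range(n_tgt):
            if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold:
                pairs.append((i, j))
    return pairs


# Made-up similarity scores: 2 source subwords x 3 target subwords.
toy_sim = [[10.0, 0.0, 0.0],
           [0.0, 0.0, 10.0]]

print(intersect_alignments(toy_sim, threshold=1e-3))  # [(0, 0), (1, 2)]
```

Lowering the threshold admits weaker pairs: with `threshold=1e-5` every one of the six pairs in this toy matrix passes, which is why the cell below sweeps several thresholds per layer.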

In [30]:
# pre-processing: split into words, then into subwords, then into vocabulary ids
sent_src, sent_tgt = src.strip().split(), tgt.strip().split()
token_src, token_tgt = [tokenizer.tokenize(word) for word in sent_src], [tokenizer.tokenize(word) for word in sent_tgt]
wid_src, wid_tgt = [tokenizer.convert_tokens_to_ids(x) for x in token_src], [tokenizer.convert_tokens_to_ids(x) for x in token_tgt]
# flatten the per-word ids and add the special [CLS]/[SEP] tokens
ids_src = tokenizer.prepare_for_model(list(itertools.chain(*wid_src)), return_tensors='pt', model_max_length=tokenizer.model_max_length, truncation=True)['input_ids']
ids_tgt = tokenizer.prepare_for_model(list(itertools.chain(*wid_tgt)), return_tensors='pt', model_max_length=tokenizer.model_max_length, truncation=True)['input_ids']
# map each subword position back to the index of the word it came from
sub2word_map_src = []
for i, word_list in enumerate(token_src):
  sub2word_map_src += [i] * len(word_list)
sub2word_map_tgt = []
for i, word_list in enumerate(token_tgt):
  sub2word_map_tgt += [i] * len(word_list)


# printing
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


# alignment

def sent_without_startend(batch, sent=0):
  # drop the [CLS] and [SEP] positions of sentence `sent`
  return batch[sent, 1:-1]

def alignvec(hidden_states, align_layer=8, sent=0):
  # subword vectors of one sentence at the chosen layer
  return sent_without_startend(hidden_states[align_layer], sent=sent)

def hidden(model, ids):
  # tuple holding the embedding output plus one tensor per encoder layer
  return model(ids.unsqueeze(0), output_hidden_states=True)[2]

model.eval()
# the hidden states do not depend on the layer or threshold, so compute them once
with torch.no_grad():
  hidden_src = hidden(model, ids_src)
  hidden_tgt = hidden(model, ids_tgt)

for align_layer in range(7, 13):
  threshold = 1e-3
  for it in range(4):
    threshold = threshold * 1e-2  # sweeps 1e-5, 1e-7, 1e-9, 1e-11
    out_src = alignvec(hidden_src, align_layer)
    out_tgt = alignvec(hidden_tgt, align_layer)

    # similarity of every source subword with every target subword
    dot_prod = torch.matmul(out_src, out_tgt.transpose(-1, -2))

    softmax_srctgt = torch.nn.Softmax(dim=-1)(dot_prod)
    softmax_tgtsrc = torch.nn.Softmax(dim=-2)(dot_prod)

    # keep pairs that pass the threshold in both directions
    softmax_inter = (softmax_srctgt > threshold)*(softmax_tgtsrc > threshold)

    align_subwords = torch.nonzero(softmax_inter, as_tuple=False)
    align_words = set()
    for i, j in align_subwords:
      align_words.add( (sub2word_map_src[i], sub2word_map_tgt[j]) )
    print(f" (layer {align_layer} > {threshold:.3g}) For '{src}' to '{tgt}'")
    for i, j in sorted(align_words):
      print(f'{color.BOLD}{color.BLUE}{sent_src[i]}{color.END}==={color.BOLD}{color.RED}{sent_tgt[j]}{color.END}')

 (layer 7 > 1e-05) For 'I bought a new car because I was going through a midlife crisis .' to 'Compré un auto nuevo porque estaba pasando por una crisis de la mediana edad .'
bought===Compré
a===un
new===nuevo
car===auto
because===porque
was===estaba
going===pasando
through===por
a===una
crisis===crisis
.===.
 (layer 7 > 1e-07) For 'I bought a new car because I was going through a midlife crisis .' to 'Compré un auto nuevo porque estaba pasando por una crisis de la mediana edad .'
bought===Compré
a===un
new===nuevo
car===auto
because===porque
was===estaba