---
title: "Word alignment with multilingual BERT: from subwords to lexicons"
subtitle: "Multilingual NLP -- Lab 2"
author: "Philippos Triantafyllou"
date-modified: last-modified
date-format: long
lang: en
format: html
theme: cosmo
toc: true
number-sections: true
number-depth: 2
code-line-numbers: true
echo: true
output: true
cap-location: top
embed-resources: true
---

:::{.callout-note}
## Instructions

In this lab, we will explore how to extract a bilingual lexicon (a list of word translation pairs linking a source language to a target language) directly from the internal representations of a "large" multilingual language model. Specifically, we will rely on `mBERT`, a variant of `BERT` pre-trained on Wikipedia in over 100 languages (Devlin et al. 2019).

As we discussed at (too) great length in class, one of the surprising findings about `mBERT` is its ability to perform cross-lingual transfer without any explicit alignment objective or parallel data (Pires, Schlinger and Garrette 2019; Wu and Dredze 2019). Although trained solely with a masked language modelling objective on monolingual corpora, `mBERT` appears to learn a shared semantic space across languages. Words that are translations of each other tend to occupy neighbouring regions in the representation space, even when the languages do not share scripts or subwords.

Because of this emergent alignment, it is possible to recover word-level translation pairs by comparing the contextualised embeddings produced by `mBERT` for parallel sentences. In this lab, we will experiment with three different alignment strategies:

- Direct argmax alignment, where each source word is linked to the target word with the most similar embedding.
- Competitive linking (via the Hungarian algorithm), which enforces one-to-one alignments.
- Canonical Correlation Analysis (CCA), which learns linear projections to better align the representation spaces of two languages (Cao, Kitaev and Klein 2020).

To carry out this lab, we will rely on a range of multilingual resources. First, we require parallel corpora (collections of sentence pairs that are translations of each other) so that we can compare word representations across languages. Well-known examples include the No Language Left Behind (`NLLB`) dataset (NLLB Team et al. 2022), which provides large-scale, high-quality translations across more than 200 languages. In addition, we will use bilingual lexicons, that is, precompiled lists of word translation pairs such as those available in `MUSE` (Lample et al. 2018) or `PanLex` (Kamholz, Pool and Colowick 2014). These lexicons serve both as supervision signals (for instance when applying Canonical Correlation Analysis) and as gold standards to evaluate the accuracy of the alignments we obtain.
:::
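To make the difference between the first two strategies concrete, here is a minimal sketch on a made-up 3×3 similarity matrix (the values are illustrative assumptions, not `mBERT` outputs). The one-to-one step uses a brute-force search over permutations as a stand-in for the Hungarian algorithm: it reaches the same optimum on tiny inputs, whereas a real implementation would use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

# Hypothetical similarities: rows are source words, columns are target words.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
    [0.10, 0.20, 0.70],
]
n = len(sim)

# Direct argmax: each source word independently picks its best target.
argmax_links = [max(range(n), key=lambda j: sim[i][j]) for i in range(n)]
print(argmax_links)  # [0, 0, 2] -> sources 0 and 1 both claim target 0

# One-to-one linking: choose the permutation with the highest total similarity.
# Brute force is only viable for very short sentences.
one_to_one = max(
    permutations(range(n)),
    key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
)
print(one_to_one)  # (1, 0, 2) -> source 0 is pushed to its second-best target
```

Argmax can link several source words to the same target, while the one-to-one constraint trades a slightly worse local match for a globally consistent alignment.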

## Selecting languages

We select five languages that are well represented in the training data of `mBERT` and five others that are either less represented or absent (the only language that is completely absent from `mBERT` is Kurdish). All ten languages also have a bilingual lexicon from `MUSE`, which we will use for CCA.

In [1]:
from datasets import disable_progress_bars

disable_progress_bars()

In [2]:
# (language, code, link_to_MUSE)
langs = [
    ("albanian",    "sq",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-sq.0-5000.txt"),
    ("arabic",      "ar",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ar.0-5000.txt"),
    ("french",      "fr",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-fr.0-5000.txt"),
    ("german",      "de",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.0-5000.txt"),
    ("greek",       "el",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-el.0-5000.txt"),
    ("hindi",       "hi",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.0-5000.txt"),
    ("japanese",    "ja",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ja.0-5000.txt"),
    ("macedonian",  "mk",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-mk.0-5000.txt"),
    ("persian",     "fa",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-fa.0-5000.txt"),
    ("thai",        "th",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-th.0-5000.txt")
]

In [3]:
import requests
from datasets import load_dataset

# Zero-width space, right-to-left mark and zero-width non-joiner
INVISIBLE_CHARS = str.maketrans("", "", "\u200b\u200f\u200c")

def install_datasets(langs):
    corpora = {}

    for lang, code, url in langs:
        print(f"Processing {lang}...")
        corpora[lang] = {}

        print("\tLoading pairs...")
        data = load_dataset(
            "sentence-transformers/parallel-sentences-jw300",
            f"en-{code}",
            split="train"
        )
        print(f"\tLoaded JW300 for {lang}")
        # Keep the first 5,000 sentence pairs and strip invisible characters
        corpora[lang]['pairs'] = data.take(5000).map( # type: ignore
            lambda x: {
                'english': x['english'].translate(INVISIBLE_CHARS),
                'non_english': x['non_english'].translate(INVISIBLE_CHARS)
            }
        )

        print("\tLoading lexicons...")
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail early on a bad download
        text = response.content.decode("utf-8", errors="ignore")
        # MUSE files are tab- or space-separated; keep well-formed pairs only
        sep = '\t' if '\t' in text else None
        entries = (tuple(line.split(sep, 1)) for line in text.splitlines() if line.strip())
        corpora[lang]["lexicon"] = [entry for entry in entries if len(entry) == 2]

    return corpora
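
The lexicon parsing step can be checked offline before downloading anything. The sketch below is an assumption-laden illustration: `parse_lexicon` is a hypothetical helper (not part of the lab code) that mirrors the separator logic above, and the two sample strings are hand-written.

```python
def parse_lexicon(text):
    """Split each non-empty line into a (source, target) pair.

    Uses tabs when the file contains any, otherwise runs of whitespace.
    Lines that do not yield exactly two fields are dropped.
    """
    sep = "\t" if "\t" in text else None
    entries = (tuple(line.split(sep, 1)) for line in text.splitlines() if line.strip())
    return [entry for entry in entries if len(entry) == 2]

# Hand-written samples in the two layouts the download code handles.
tab_sample = "and\tdhe\nwas\tishte\n\nmalformed\n"
space_sample = "and dhe\nfor  për\n"

print(parse_lexicon(tab_sample))    # [('and', 'dhe'), ('was', 'ishte')]
print(parse_lexicon(space_sample))  # [('and', 'dhe'), ('for', 'për')]
```

Dropping malformed lines at parse time keeps later code that unpacks `(src, tgt)` pairs from failing on a stray single-token line.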

In [4]:
import os
from datasets import Dataset

def save_datasets(corpora, data_folder="data"):
    os.makedirs(data_folder, exist_ok=True)
    
    for lang, content in corpora.items():
        lang_folder = os.path.join(data_folder, lang)
        os.makedirs(lang_folder, exist_ok=True)
        
        # Save pairs as jsonl
        pairs_path = os.path.join(lang_folder, "pairs.jsonl")
        content['pairs'].to_json(pairs_path)
        
        # Save lexicon as text file
        lexicon_path = os.path.join(lang_folder, "lexicon.txt")
        with open(lexicon_path, 'w', encoding='utf-8') as f:
            for src, tgt in content['lexicon']:
                f.write(f"{src}\t{tgt}\n")

In [5]:
def load_datasets(langs, data_folder="data"):
    corpora = {}
    
    for lang, _, _ in langs:
        lang_folder = os.path.join(data_folder, lang)
        corpora[lang] = {}
        
        # Load pairs
        pairs_path = os.path.join(lang_folder, "pairs.jsonl")
        corpora[lang]['pairs'] = Dataset.from_json(pairs_path)
        
        # Load lexicon
        lexicon_path = os.path.join(lang_folder, "lexicon.txt")
        with open(lexicon_path, 'r', encoding='utf-8') as f:
            lexicon = [tuple(line.strip().split('\t', 1)) for line in f if line.strip()]
        corpora[lang]['lexicon'] = lexicon
    
    return corpora
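
Since a lexicon may list several translations for the same source word, it is convenient to group it into a dictionary before using it as a gold standard. The following is a minimal sketch under stated assumptions: `build_gold` and `precision` are hypothetical helpers, lowercasing is a choice made here for illustration, and the predicted pairs are invented.

```python
from collections import defaultdict

def build_gold(lexicon):
    """Map each source word to the set of all its listed translations."""
    gold = defaultdict(set)
    for src, tgt in lexicon:
        gold[src.lower()].add(tgt.lower())
    return dict(gold)

def precision(predicted, gold):
    """Share of predicted pairs found in the lexicon.

    Source words missing from the lexicon are skipped, since they
    cannot be judged either way.
    """
    scored = [(s.lower(), t.lower()) for s, t in predicted if s.lower() in gold]
    if not scored:
        return 0.0
    return sum(t in gold[s] for s, t in scored) / len(scored)

lexicon = [("and", "dhe"), ("was", "ishte"), ("for", "për"), ("for", "per"), ("that", "që")]
gold = build_gold(lexicon)

# Invented predictions: one correct, one wrong, one out of vocabulary.
predicted = [("for", "per"), ("and", "ose"), ("house", "shtëpi")]
print(precision(predicted, gold))  # 0.5
```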

In [6]:
data = install_datasets(langs)
save_datasets(data)

Processing albanian...
	Loading pairs...
	Loaded JW300 for albanian
	Loading lexicons...
Processing arabic...
	Loading pairs...
	Loaded JW300 for arabic
	Loading lexicons...
Processing french...
	Loading pairs...
	Loaded JW300 for french
	Loading lexicons...
Processing german...
	Loading pairs...
	Loaded JW300 for german
	Loading lexicons...
Processing greek...
	Loading pairs...
	Loaded JW300 for greek
	Loading lexicons...
Processing hindi...
	Loading pairs...
	Loaded JW300 for hindi
	Loading lexicons...
Processing japanese...
	Loading pairs...
	Loaded JW300 for japanese
	Loading lexicons...
Processing macedonian...
	Loading pairs...
	Loaded JW300 for macedonian
	Loading lexicons...
Processing persian...
	Loading pairs...
	Loaded JW300 for persian
	Loading lexicons...
Processing thai...
	Loading pairs...
	Loaded JW300 for thai
	Loading lexicons...


In [7]:
data = load_datasets(langs)

In [8]:
for lang, corpora in data.items():
    print(f"For {lang}:")
    for keys, vals in corpora.items():
        print(f"{keys} ({type(vals)})")
        print(f"{vals[:5]}")
    print()

For albanian:
pairs (<class 'datasets.arrow_dataset.Dataset'>)
{'english': ['Page Two', '1945 - 1995 What Have We Learned?', '3 - 14', 'Fifty years have passed since the end of World War II.', 'In what ways has humankind progressed?'], 'non_english': ['Faqja dy', '1945 - 1995 Çfarë kemi mësuar?', '3 - 14', 'Kanë kaluar pesëdhjetë vjet që nga mbarimi i Luftës II Botërore.', 'Në cilat fusha ka bërë progres njeriu?']}
lexicon (<class 'list'>)
[('and', 'dhe'), ('was', 'ishte'), ('for', 'për'), ('for', 'per'), ('that', 'që')]

For arabic:
pairs (<class 'datasets.arrow_dataset.Dataset'>)
{'english': ['How Can I Control My TV Viewing Habits?', 'Perhaps you have asked yourself this question, as well as others: “How Can I Say No to Premarital Sex? ”', '“ How Do I Know If It’s Real Love? ”', '“ Why Do I Get So Depressed? ”', 'These are chapter titles in the new book Questions Young People Ask  — Answers That Work.'], 'non_english': ['', 'كيف يمكنني ان اضبط عادات مشاهدتي التلفزيون ؟', '', 'ربما ط

In [9]:
import logging
logging.basicConfig(level=logging.INFO)

In [10]:
# Alias the import so it does not shadow the standard-library `logging` module
from transformers import logging as hf_logging

hf_logging.set_verbosity_error()

In [11]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-multilingual-cased")

In [12]:
import torch

text = ["Replace me by any text you'd like.", "Another text to encode."]
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
# Inference only: skip gradient tracking to save memory
with torch.no_grad():
    outputs = model(
        **encoded_input,
        output_hidden_states=True,
        return_dict=True
    )

In [13]:
# `ids` avoids shadowing the built-in `input`
for ids in encoded_input['input_ids']:
    print(len(ids))
    print(ids)

13
tensor([  101, 72337, 72654, 10911, 10155, 11178, 15541, 13028,   112,   172,
        11850,   119,   102])
13
tensor([  101, 17101, 15541, 10114, 10110, 54261,   119,   102,     0,     0,
            0,     0,     0])


In [14]:
word_ids = encoded_input.word_ids(batch_index=0)
print(word_ids)

[None, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, None]


In [15]:
token_embeddings = outputs.hidden_states[-1]
print(token_embeddings.shape)

torch.Size([2, 13, 768])
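
The word indices and the token embeddings can be combined to obtain one vector per word. The sketch below averages the vectors of all subwords that share a word index. Mean pooling is an assumption made here (taking the first subword is another common choice), `pool_subwords` is a hypothetical helper, and plain Python lists with toy 2-dimensional vectors stand in for the 768-dimensional tensors.

```python
from collections import defaultdict

def pool_subwords(word_ids, token_vectors):
    """Average the token vectors that share a word index.

    Positions whose word index is None (special tokens) are skipped.
    Returns one vector per word, ordered by word index.
    """
    groups = defaultdict(list)
    for idx, vec in zip(word_ids, token_vectors):
        if idx is not None:
            groups[idx].append(vec)
    return [
        [sum(column) / len(vectors) for column in zip(*vectors)]
        for _, vectors in sorted(groups.items())
    ]

# Toy input: word 0 is split into two subwords, word 1 is a single token.
word_ids = [None, 0, 0, 1, None]
token_vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [4.0, 4.0], [9.0, 9.0]]

word_vectors = pool_subwords(word_ids, token_vectors)
print(word_vectors)  # [[2.0, 4.0], [4.0, 4.0]]
```

With the objects from the cells above, the same function could be applied to `token_embeddings[0].tolist()` together with `encoded_input.word_ids(batch_index=0)`, although a tensor-based version would be preferable for full corpora.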
