---
title: "Word alignment with multilingual BERT: from subwords to lexicons"
subtitle: "Multilingual NLP -- Lab 2"
author: "Philippos Triantafyllou"
date-modified: last-modified
date-format: long
lang: en
format: html
theme: cosmo
toc: true
number-sections: true
number-depth: 2
code-line-numbers: true
echo: true
output: true
cap-location: top
embed-resources: true
---

:::{.callout-note}
## Instructions

In this lab, we will explore how to extract a bilingual lexicon (a list of word translation pairs linking a source language to a target language) directly from the internal representations of a "large" multilingual language model. Specifically, we will rely on `mBERT`, a variant of `BERT` pre-trained on Wikipedia in over 100 languages (Devlin et al. 2019).

As we discussed at (too) great length in class, one of the surprising findings about `mBERT` is its ability to perform cross-lingual transfer without any explicit alignment objective or parallel data (Pires, Schlinger and Garrette 2019; Wu and Dredze 2019). Although trained solely with a masked language modelling objective on monolingual corpora, `mBERT` appears to learn a shared semantic space across languages. Words that are translations of each other tend to occupy neighbouring regions in the representation space, even when the languages do not share scripts or subwords.

Because of this emergent alignment, it is possible to recover word-level translation pairs by comparing the contextualised embeddings produced by `mBERT` for parallel sentences. In this lab, we will experiment with three different alignment strategies:

- Direct argmax alignment, where each source word is linked to the target word with the most similar embedding.
- Competitive linking (via the Hungarian algorithm), which enforces one-to-one alignments.
- Canonical Correlation Analysis (CCA), which learns linear projections to better align the representation spaces of two languages (Cao, Kitaev and Klein 2020).
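
To make the first two strategies concrete, here is a minimal sketch on a hand-made similarity matrix. The matrix values and the helper names (`cosine_similarity_matrix`, `argmax_alignment`, `hungarian_alignment`) are illustrative assumptions, not part of the lab's code. In practice the matrix would hold the cosine similarities between the `mBERT` word embeddings of a sentence pair. The one-to-one step relies on SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_similarity_matrix(src, tgt):
    """Cosine similarity between every source row and every target row."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def argmax_alignment(sim):
    """Link each source word to its most similar target word (many-to-one allowed)."""
    return [(i, int(j)) for i, j in enumerate(sim.argmax(axis=1))]


def hungarian_alignment(sim):
    """One-to-one alignment that maximises the total similarity."""
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


# Toy similarity matrix (rows: source words, columns: target words).
sim = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
    [0.10, 0.20, 0.70],
])

print(argmax_alignment(sim))     # [(0, 0), (1, 0), (2, 2)]
print(hungarian_alignment(sim))  # [(0, 1), (1, 0), (2, 2)]
```

Argmax links source words 0 and 1 to the same target word, whereas the one-to-one constraint reassigns source word 0 to its second-best candidate because that raises the total similarity.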

To carry out this lab, we will rely on a range of multilingual resources. First, we require parallel corpora (collections of sentence pairs that are translations of each other) so that we can compare word representations across languages. A well-known example is the No Language Left Behind (`NLLB`) dataset (NLLB Team et al. 2022), which provides large-scale, high-quality translations across more than 200 languages. In addition, we will use bilingual lexicons, that is, precompiled lists of word translation pairs such as those available in `MUSE` (Lample et al. 2018) or `PanLex` (Kamholz, Pool and Colowick 2014). These lexicons serve both as supervision signals (for instance when applying Canonical Correlation Analysis) and as gold standards to evaluate the accuracy of the alignments we obtain.
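
The gold-standard role of a lexicon can be sketched as follows. This is one possible scoring convention, stated as an assumption: only predictions whose source word appears in the lexicon are scored, and a prediction counts as correct if the lexicon lists that target among the source word's translations. The function name `lexicon_precision` and the toy English-French data are hypothetical.

```python
def lexicon_precision(predicted_pairs, gold_pairs):
    """Fraction of scored (source, target) predictions found in the gold lexicon.

    Predictions whose source word is missing from the lexicon are skipped,
    since the lexicon can neither confirm nor refute them.
    """
    gold = {}
    for src, tgt in gold_pairs:
        gold.setdefault(src.lower(), set()).add(tgt.lower())

    scored = [(s.lower(), t.lower()) for s, t in predicted_pairs if s.lower() in gold]
    if not scored:
        return 0.0
    correct = sum(1 for s, t in scored if t in gold[s])
    return correct / len(scored)


# Hypothetical toy data: "house" has two accepted translations.
gold_pairs = [("cat", "chat"), ("dog", "chien"), ("house", "maison"), ("house", "domicile")]
predicted_pairs = [("cat", "chat"), ("dog", "chat"), ("house", "domicile"), ("tree", "arbre")]

# "tree" is not in the lexicon, so 2 of the 3 scored pairs are correct.
print(lexicon_precision(predicted_pairs, gold_pairs))
```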
:::

## Selecting languages

We select 5 languages that are well represented in the training of `mBERT` and 5 others that are either less represented or absent (the only language that is completely absent from `mBERT` is Kurdish). All the languages also have a bilingual lexicon from `MUSE` that we will use for the CCA.

```python
# (language, code, link_to_MUSE)
langs = [
    ("albanian",    "sq",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-sq.0-5000.txt"),
    ("arabic",      "ar",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ar.0-5000.txt"),
    ("french",      "fr",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-fr.0-5000.txt"),
    ("german",      "de",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.0-5000.txt"),
    ("greek",       "el",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-el.0-5000.txt"),
    ("hindi",       "hi",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.0-5000.txt"),
    ("japanese",    "ja",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ja.0-5000.txt"),
    ("macedonian",  "mk",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-mk.0-5000.txt"),
    ("persian",     "fa",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-fa.0-5000.txt"),
    ("thai",        "th",   "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-th.0-5000.txt")
]
```
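
All ten links follow the same pattern, with only the language code changing. As an optional sketch, the triples could be generated from `(language, code)` pairs; `MUSE_URL` and `build_langs` are hypothetical names introduced here for illustration.

```python
# URL pattern shared by the MUSE links listed above.
MUSE_URL = "https://dl.fbaipublicfiles.com/arrival/dictionaries/en-{code}.0-5000.txt"


def build_langs(names_and_codes):
    """Return (language, code, link_to_MUSE) triples from (language, code) pairs."""
    return [(name, code, MUSE_URL.format(code=code)) for name, code in names_and_codes]


langs_sketch = build_langs([("albanian", "sq"), ("french", "fr")])
print(langs_sketch[1][2])
# https://dl.fbaipublicfiles.com/arrival/dictionaries/en-fr.0-5000.txt
```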

```python
import requests
from datasets import load_dataset


def install_datasets(langs):
    """Load 5,000 JW300 sentence pairs and the MUSE lexicon for each language."""
    corpora = {}

    for lang, code, link in langs:
        print(f"Processing {lang}...")
        corpora[lang] = {}

        print("\tLoading pairs...")
        data = load_dataset(
            "sentence-transformers/parallel-sentences-jw300",
            f"en-{code}",
            split="train"
        )
        print(f"\tLoaded JW300 for {lang}")
        # Keep only the first 5,000 sentence pairs.
        corpora[lang]["pairs"] = data.take(5000) # type: ignore

        print("\tLoading lexicons...")
        response = requests.get(link, timeout=60)
        response.raise_for_status()  # fail loudly instead of parsing an error page
        text = response.content.decode("utf-8", errors="ignore")
        # Each non-empty line is split on its first whitespace run: (English word, translation).
        lexicon = [tuple(line.split(None, 1)) for line in text.splitlines() if line.strip()]
        corpora[lang]["lexicon"] = lexicon

    return corpora
```
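
The lexicon parsing step can be checked offline on a small in-memory string. The sketch below assumes, as the code above does, that each line holds a source word and its translation separated by whitespace; `parse_lexicon` and the sample text are hypothetical. Unlike the list comprehension above, it also drops lines that contain a single token.

```python
def parse_lexicon(text):
    """Split each line into a (source, target) pair on the first whitespace run."""
    pairs = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:  # skip blank lines and lines with a single token
            pairs.append((parts[0], parts[1].strip()))
    return pairs


# Space- and tab-separated entries, one blank line, one malformed line.
sample = "the le\nthe\tla\n\ncat chat\nmalformed\n"
print(parse_lexicon(sample))
# [('the', 'le'), ('the', 'la'), ('cat', 'chat')]
```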