<a href="https://colab.research.google.com/github/dasmiq/cs6120-assignment5/blob/main/cross_language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Language Retrieval

In this notebook, you will evaluate models on the task of cross-language retrieval. We will use a sample of the first paragraphs of Wikipedia articles. Sometimes, a Wikipedia article in one language will be a translation of the article in another; in other cases, articles cover the same topic but are not translations. In either case, we use the links between Wikipedia articles in different languages as ground truth for our evaluation.

Since we often want to enrich the context information available to a language model with retrieval results, we will evaluate not only whether the exact matching document ranks highest, but also whether the matching document ranks in the top $k$.

Work through the notebook and complete code and text cells marked **TODO**.

We start by installing the `sentence-transformers` library.

In [None]:
%pip install -U sentence-transformers

We then download a sample of the first paragraphs of Wikipedia articles in six languages.

In [None]:
!wget https://raw.githubusercontent.com/dasmiq/cs6120-assignment5/refs/heads/main/sample-6lang.jsonl

In [1]:
import json

# Read one JSON record per line; the context manager closes the file afterwards.
with open('sample-6lang.jsonl', mode='r', encoding='utf-8') as f:
    articles = [json.loads(line) for line in f]

len(articles)

11838

We include articles from the three most prevalent Wikipedia languages (English, German, and French) and from three other languages in non-Latin scripts (Chinese, Arabic, and Greek). Each record has the `text` of the paragraph, the (lower-cased) `title` of the article, its `url`, and the `lang` language code. Finally, each record contains the Wikidata `id` used to link related articles in different languages. For convenience, the records have been sorted by `id` and `lang`.

If you read a few of these languages (or translate them), you can look at a set of paragraphs and see that most pairs are not translations of each other.

In [2]:
articles[6:12]

[{'id': 'Q1005289',
  'lang': 'ar',
  'title': 'قانون الجنسية الكندي',
  'url': 'https://ar.wikipedia.org/wiki/%D9%82%D8%A7%D9%86%D9%88%D9%86%20%D8%A7%D9%84%D8%AC%D9%86%D8%B3%D9%8A%D8%A9%20%D8%A7%D9%84%D9%83%D9%86%D8%AF%D9%8A',
  'text': 'قانون الجنسية الكندي، يشار إليها أيضًا بالجنسية الكندية، هو وضع قانوني يمنح الشخص الطبيعي حقوقًا ومسؤوليات محددة في كندا. نشأ في عام ، وصار معلمًا هامًا في عملية استقلال كندا عن المملكة المتحدة مع دخول قانون الجنسية الكندية الأول حيز التنفيذ. تخضع الجنسية الكندية الآن لقانون الجنسية لعام 1977، الذي خضع لعدة تعديلات مهمة منذ دخوله حيز التنفيذ. كما ساهمت المحاكم الفيدرالية، من خلال قانونها القضائي، في توضيح التعريف القانوني للجنسية الكندية.'},
 {'id': 'Q1005289',
  'lang': 'de',
  'title': 'kanadische staatsangehörigkeit',
  'url': 'https://de.wikipedia.org/wiki/Kanadische%20Staatsangeh%C3%B6rigkeit',
  'text': 'Die kanadische Staatsbürgerschaft ( bzw. Canadian Citizenship) ist die Staatsbürgerschaft Kanadas, die im engeren Sinne seit 1947 existiert.'},

We load a sentence embedding model, `LaBSE`, that was trained on several languages, including the six we work with here.

In [3]:
from sentence_transformers import SentenceTransformer
import numpy as np
labse = SentenceTransformer('sentence-transformers/LaBSE')




To demonstrate finding similar paragraphs, we encode the text of the first twelve records, which gives us a 768-dimensional embedding vector for each one.

In [4]:
encoded = labse.encode([r['text'] for r in articles[0:12]])
encoded.shape

(12, 768)

If we multiply this $12 \times 768$ matrix by its transpose, we get a $12 \times 12$ (symmetric) matrix of dot products between all pairs of paragraphs. Because the embeddings have unit length, these dot products are cosine similarities, and the diagonal entries are approximately 1. Since the records are sorted by `id`, the first six paragraphs describe one article and the latter six describe another. In the first six rows, we can see that the first six columns are higher than the latter six. In the latter six rows, we can see that the latter six columns are higher than the first six.

In [5]:
encoded @ encoded.T

array([[1.        , 0.76256895, 0.7722024 , 0.7502577 , 0.58582175,
        0.69546473, 0.3728796 , 0.32353523, 0.3308503 , 0.27172476,
        0.32900906, 0.2040158 ],
       [0.76256895, 1.0000001 , 0.7690512 , 0.9233679 , 0.63490856,
        0.6571194 , 0.4801092 , 0.48846245, 0.40316343, 0.36106884,
        0.42383692, 0.26584464],
       [0.7722024 , 0.7690512 , 1.0000002 , 0.7562181 , 0.5307791 ,
        0.6586367 , 0.3545523 , 0.362997  , 0.35417837, 0.264056  ,
        0.32125577, 0.21048513],
       [0.7502577 , 0.9233679 , 0.7562181 , 0.99999994, 0.6517094 ,
        0.6516983 , 0.4132587 , 0.40341082, 0.35402966, 0.32176867,
        0.3496587 , 0.21814038],
       [0.58582175, 0.63490856, 0.5307791 , 0.6517094 , 0.99999994,
        0.5415669 , 0.38234922, 0.3980207 , 0.3689012 , 0.3174028 ,
        0.3376816 , 0.28169256],
       [0.69546473, 0.6571194 , 0.6586367 , 0.6516983 , 0.5415669 ,
        0.9999997 , 0.2784195 , 0.29993677, 0.30385196, 0.21876112,
        0.26020145,
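The identification of the matrix product with cosine similarity depends on the rows having unit length, which the near-1 diagonal above suggests is the case here. The following is a minimal sketch of that relationship using random stand-in vectors (the toy sizes and variable names are assumptions, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for sentence embeddings: 4 vectors of dimension 8 (toy sizes).
raw = rng.normal(size=(4, 8))

# Scale each row to unit length, as a normalizing encoder would.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Dot products of the unit vectors ...
dot_sim = unit @ unit.T

# ... agree with cosine similarity computed directly from the raw vectors.
norms = np.linalg.norm(raw, axis=1)
cos_sim = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_sim, cos_sim))
print(np.allclose(np.diag(dot_sim), 1.0))
```

If an encoder does not return unit-length vectors, the plain matrix product mixes direction and magnitude, so normalizing first keeps scores comparable across models.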

## Evaluating Retrieval

To introduce the problem, we take some example Chinese paragraphs to use as queries and English paragraphs to use as candidate results to search through.

In [6]:
query_articles = [r['text'] for r in articles if r['lang'] == 'zh']
result_articles = [r['text'] for r in articles if r['lang'] == 'en']

To make the example clearer, we will use different numbers of queries and results, so that the two axes of the similarity matrix are easy to tell apart.

In [7]:
qembed = labse.encode(query_articles[0:200])
rembed = labse.encode(result_articles[0:500])

Multiplying the query embeddings by the result embeddings, we get a $200 \times 500$ queries-by-results matrix.

In [8]:
sim = qembed @ rembed.T
sim.shape

(200, 500)

We use numpy's `argmax` function along the second dimension (`axis=1`) to get the index of the top result for each query.

In [9]:
argmax = np.argmax(sim, axis=1)
argmax

array([454,   1,   2, 282, 410, 120,  13,   7, 162,   9,  10,  11,  12,
        13,  14,  15, 153, 212, 247,  19,  20,  21,  22,  23, 372, 372,
        26,  27,  28,  29,  82,  31,  32,  33,  34,  35,  36,  66,  93,
        39, 397,  41, 383,  83,  44,  45,  46,  47,  48,  49,  50, 245,
       296, 139,  54,  55,  56,  57,  58,  59, 266,  61,  62, 171,  64,
        31,  29,  67,  68,  69,  70,  71, 242, 372,  74,  75,  76,  77,
       253,  79,  80,  81,  82,  83, 160,  85,  86, 492,  88,  17,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99,  32, 101, 102, 103,
       104, 307, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120,  82, 122, 123, 124,  14, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       273, 144, 424, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 120, 165, 166, 167, 171,
       169, 170, 171, 147, 173, 174, 175, 176, 177, 178, 179, 18

Since the query and result documents are in the same order, matching Chinese and English documents have the same index. This allows us to compute the accuracy, or "recall at 1", of Chinese-to-English retrieval.

In [10]:
# Fraction of queries whose top-ranked result has the same index as the query.
(argmax == np.arange(len(argmax))).mean()

np.float64(0.785)
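To make the recall at 1 computation concrete, here is a small worked example on a made-up similarity matrix (the values in `toy_sim` are hypothetical, chosen so that one of four queries misses):

```python
import numpy as np

# Hypothetical 4-query by 5-candidate similarity matrix.
# Query i is assumed to match candidate i.
toy_sim = np.array([
    [0.9, 0.1, 0.2, 0.3, 0.0],  # top candidate is 0: hit
    [0.2, 0.4, 0.8, 0.1, 0.3],  # top candidate is 2: miss
    [0.1, 0.2, 0.7, 0.3, 0.6],  # top candidate is 2: hit
    [0.3, 0.2, 0.1, 0.5, 0.4],  # top candidate is 3: hit
])

top1 = np.argmax(toy_sim, axis=1)
recall_at_1 = (top1 == np.arange(len(top1))).mean()
print(top1, recall_at_1)  # [0 2 2 3] 0.75
```

Note that the extra fifth candidate never matches any query; it only acts as a distractor, just as the 300 unmatched English paragraphs do in the $200 \times 500$ example.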

Your first task is to compute the recall at 1 for Arabic, Chinese, French, German, and Greek query documents matching English documents. Use the first 1000 English documents as the candidates you will search through.

In [11]:
candidates = labse.encode(result_articles[0:1000])

For each of the other five languages, construct embeddings for the first 1000 documents and measure how often the most similar English document is the matching one.

In [12]:
langs = ['ar', 'de', 'el', 'fr', 'zh']
lang_texts = {lang: [] for lang in langs}
remaining = set(langs)
for rec in articles:
    bucket = lang_texts.get(rec['lang'])
    if bucket is not None and len(bucket) < 1000:
        bucket.append(rec['text'])
        if len(bucket) == 1000:
            remaining.discard(rec['lang'])
            if not remaining:
                break
for lang in langs:
    qembed = labse.encode(lang_texts[lang])
    sim = qembed @ candidates.T
    recall = (np.argmax(sim, axis=1) == np.arange(len(lang_texts[lang]))).mean()
    print(f"R@1 {lang}-en: {recall:.3f}")


R@1 ar-en: 0.868
R@1 de-en: 0.885
R@1 el-en: 0.884
R@1 fr-en: 0.856
R@1 zh-en: 0.717


We often use retrieved documents to provide extra context to a language model. In that case, we might retrieve more than one document per query to increase the likelihood that useful documents are in the top $k$. For each of the five non-English languages, write code to evaluate the **recall at k** (R@k), i.e., the proportion of queries for which the correct document was anywhere in the top k results.

In [13]:

def recall_at_k(sim_matrix, k):
    # Compute recall@k given a similarity matrix.
    if k <= 0:
        raise ValueError('k must be positive')
    if sim_matrix.ndim != 2:
        raise ValueError('sim_matrix must be 2-D')
    k = min(k, sim_matrix.shape[1])
    top_k = np.argpartition(sim_matrix, -k, axis=1)[:, -k:]
    hits = (top_k == np.arange(sim_matrix.shape[0])[:, None]).any(axis=1)
    return hits.mean()

recall_at_k(sim, 5)


np.float64(0.831)
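As a sanity check on the `argpartition`-based selection used in `recall_at_k`, the sketch below compares it with a slower full-sort version on random scores. Both function names here are hypothetical helpers written for this comparison, and the random matrix is a stand-in for real similarities:

```python
import numpy as np

def recall_at_k_sorted(sim_matrix, k):
    # Reference version: fully sort each row, keep the k highest-scoring columns.
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]
    targets = np.arange(sim_matrix.shape[0])[:, None]
    return (top_k == targets).any(axis=1).mean()

def recall_at_k_partition(sim_matrix, k):
    # Same selection as in recall_at_k: argpartition places the k largest
    # entries of each row in the last k positions, in no particular order.
    top_k = np.argpartition(sim_matrix, -k, axis=1)[:, -k:]
    targets = np.arange(sim_matrix.shape[0])[:, None]
    return (top_k == targets).any(axis=1).mean()

rng = np.random.default_rng(0)
toy_sim = rng.random((50, 80))  # random scores, so exact ties are not expected

for k in (1, 5, 10):
    assert recall_at_k_sorted(toy_sim, k) == recall_at_k_partition(toy_sim, k)
```

Since recall at k only asks whether the match is somewhere in the top k, the order within the top k does not matter, which is why a partial partition is enough.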

In [14]:

for lang in langs:
    qembed = labse.encode(lang_texts[lang])
    sim = qembed @ candidates.T
    r_at_5 = recall_at_k(sim, 5)
    r_at_10 = recall_at_k(sim, 10)
    print(f"R@5 {lang}-en: {r_at_5:.3f}")
    print(f"R@10 {lang}-en: {r_at_10:.3f}")


R@5 ar-en: 0.945
R@10 ar-en: 0.959
R@5 de-en: 0.949
R@10 de-en: 0.962
R@5 el-en: 0.948
R@10 el-en: 0.957
R@5 fr-en: 0.929
R@10 fr-en: 0.944
R@5 zh-en: 0.831
R@10 zh-en: 0.876


## Different Retrieval Strategies

**TODO**: Not all languages perform equally well using the LaBSE model. Your task is to find an alternative retrieval method that _improves performance for at least one language_ while _not degrading performance for other languages_.

You are free to use any open encoder or generative models available on Hugging Face. Here are three ideas to get you started. You only need to implement one improvement, although you may keep other dead ends in the notebook.

1. Find other embedding models on Hugging Face that work better for, e.g., Chinese, while maintaining performance on the other languages.
1. LaBSE was trained on translation pairs, but Wikipedia articles are not necessarily translations of each other. Use the remaining articles in the dataset to fine-tune LaBSE (or another model). [This Hugging Face guide to fine-tuning sentence embeddings](https://huggingface.co/blog/train-sentence-transformers) may be helpful.
1. Instead of using embeddings, you could use a generative model to try to directly output the title of the English article given the foreign-language title and article. This approach is known as [generative retrieval](https://arxiv.org/abs/2404.14851).
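For the fine-tuning idea, one possible first step is to turn the records that are not used for evaluation into (foreign paragraph, English paragraph) pairs. The sketch below is only an illustration under stated assumptions: `build_training_pairs` and `held_out_ids` are hypothetical names, and the toy records merely mimic the shape of the dataset.

```python
from collections import defaultdict

def build_training_pairs(records, held_out_ids, pivot_lang="en"):
    """Pair each non-pivot paragraph with the pivot-language paragraph sharing its id."""
    by_id = defaultdict(dict)
    for rec in records:
        by_id[rec["id"]][rec["lang"]] = rec["text"]

    pairs = []
    for doc_id, texts in by_id.items():
        # Skip ids reserved for evaluation and ids with no pivot-language paragraph.
        if doc_id in held_out_ids or pivot_lang not in texts:
            continue
        for lang, text in texts.items():
            if lang != pivot_lang:
                pairs.append((text, texts[pivot_lang]))
    return pairs

# Toy records in the same shape as the dataset (contents are made up).
toy = [
    {"id": "Q1", "lang": "de", "text": "de-1"},
    {"id": "Q1", "lang": "en", "text": "en-1"},
    {"id": "Q2", "lang": "en", "text": "en-2"},
    {"id": "Q2", "lang": "zh", "text": "zh-2"},
    {"id": "Q3", "lang": "fr", "text": "fr-3"},
]
pairs = build_training_pairs(toy, held_out_ids={"Q1"})
print(pairs)  # [('zh-2', 'en-2')]
```

Keeping the evaluation ids out of the training pairs matters here: fine-tuning on the same articles that are later used as queries would inflate the measured recall.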

What you try is up to you. Describe your approach and use the recall at k function above to evaluate your results.
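When comparing several encoders, it may help to wrap the evaluation in one function that accepts any encoding callable. The following is a rough sketch, not a prescribed solution: `evaluate_encoder` and `toy_encode` are hypothetical names, and the toy encoder exists only so the example runs without downloading a model.

```python
import numpy as np

def evaluate_encoder(encode_fn, queries, candidates, ks=(1, 5, 10)):
    """Return {k: recall@k}, assuming queries[i] matches candidates[i]."""
    q = np.asarray(encode_fn(queries), dtype=float)
    c = np.asarray(encode_fn(candidates), dtype=float)
    # Normalize rows so the dot product is cosine similarity for any encoder.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sim = q @ c.T
    # Rank of the matching candidate = number of candidates scoring strictly
    # higher (ties are therefore resolved in favor of the match).
    idx = np.arange(len(queries))
    target = sim[idx, idx]
    rank = (sim > target[:, None]).sum(axis=1)
    return {k: float((rank < k).mean()) for k in ks}

def toy_encode(texts):
    # Hypothetical stand-in encoder: counts of the letters a, b, c in each text.
    return np.array([[t.count(ch) for ch in "abc"] for t in texts], dtype=float)

queries = ["aaa", "bbb", "ccc"]
candidates = ["aab", "abc", "bbc"]
result = evaluate_encoder(toy_encode, queries, candidates, ks=(1, 2))
print(result)
```

In the notebook, passing `labse.encode` as `encode_fn` together with `lang_texts[lang]` and `result_articles[0:1000]` should give a baseline to compare other models against, provided the alternative model is evaluated on exactly the same documents.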

In [None]:
# Lets do 