# Bridging Language Gaps in Multilingual Embeddings via Contrastive Learning by Jina AI (https://jina.ai/news/bridging-language-gaps-in-multilingual-embeddings-via-contrastive-learning/)

## Key Highlights

### The Language Gap in Multilingual Models
- Multilingual embedding models often exhibit poor alignment between semantically similar phrases in different languages, resulting in a "language gap."
- This gap limits the effectiveness of cross-lingual applications like multilingual semantic search and translation.

### Role of Training Approaches
- **Masked Language Modeling (MLM):**
  - Pretraining with masked tokens enables models to learn language patterns. However, embeddings often cluster by language rather than by shared semantics, contributing to language gap.
- **Contrastive Learning:**
  - This technique aligns embeddings of semantically similar text pairs across languages, pulling them closer in the shared embedding space.

### Impact of Parallel Multilingual Data
- Surprisingly, experiments with multilingual models (e.g., `jina-embeddings-v3`) showed that **explicit cross-lingual training data provided little to no improvement** for most language pairs.
- Cross-language data appears to have more value for **low-resource languages** that are underrepresented in pretraining corpora, though further investigation is needed.

## Key Takeaways

1. **Multilingual Embedding Alignment:**
   - Contrastive learning improves cross-lingual embedding alignment but does not always require explicit parallel data for most language pairs.

2. **Parallel Data Value:**
   - Explicit cross-lingual training data may be more beneficial for **low-resource languages** than for widely represented ones.

3. **Future Exploration:**
   - More research is needed to evaluate the role of cross-lingual data in fully trained models and larger-scale datasets.


## Mini project

In [18]:
# Importing Required Libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW
from datasets import load_dataset
from scipy.stats import spearmanr

In [19]:
# Define class for Data Preparation
class ParallelDataset(Dataset):
    def __init__(self, texts_a, texts_b, tokenizer, max_length=128):
        self.texts_a = texts_a
        self.texts_b = texts_b
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts_a)

    def __getitem__(self, idx):
        tokenized_a = self.tokenizer(self.texts_a[idx], truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")
        tokenized_b = self.tokenizer(self.texts_b[idx], truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")
        return {
            'input_ids_a': tokenized_a['input_ids'].squeeze(0),
            'attention_mask_a': tokenized_a['attention_mask'].squeeze(0),
            'input_ids_b': tokenized_b['input_ids'].squeeze(0),
            'attention_mask_b': tokenized_b['attention_mask'].squeeze(0)
        }

In [20]:
# Define Contrastive Loss Function
def contrastive_loss(embeddings_a, embeddings_b, temperature=0.07):
    logits = torch.matmul(embeddings_a, embeddings_b.T) / temperature
    labels = torch.arange(logits.size(0)).to(logits.device)
    loss = F.cross_entropy(logits, labels)
    return loss

# Define Evaluation Metrics
def cosine_similarity(embeddings_a, embeddings_b):
    return F.cosine_similarity(embeddings_a, embeddings_b).mean().item()

def spearman_correlation(embeddings_a, embeddings_b):
    a = embeddings_a.cpu().detach().numpy()
    b = embeddings_b.cpu().detach().numpy()
    correlations = [spearmanr(a[i], b[i]).correlation for i in range(len(a))]
    return sum(correlations) / len(correlations)

In [21]:
# Load a parallel dataset (example: OPUS Books dataset)
dataset = load_dataset("opus_books", "el-en", split="train[:5000]")
texts_a = [pair['en'] for pair in dataset['translation'][:5000]]
texts_b = [pair['el'] for pair in dataset['translation'][:5000]]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
parallel_dataset = ParallelDataset(texts_a, texts_b, tokenizer)
dataloader = DataLoader(parallel_dataset, batch_size=64, shuffle=True)

In [22]:
# Model Architecture
model = AutoModel.from_pretrained("xlm-roberta-base")
projection_head = nn.Sequential(
    nn.Linear(model.config.hidden_size, 256),  # Reduced dimensions for faster computation
    nn.ReLU(),
    nn.Linear(256, 128)
)

In [23]:
# Pre-training Evaluation
sample_texts_a = ["Two young girls are playing outside in a non-urban environment."]
sample_texts_b = ["Δύο νεαρά κορίτσια παίζουν έξω σε ενα μη αστικό περιβάλλον."]

inputs_a = tokenizer(sample_texts_a, return_tensors="pt", padding=True, truncation=True)
inputs_b = tokenizer(sample_texts_b, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    embed_a_pre = model(**inputs_a).last_hidden_state.mean(dim=1)
    embed_b_pre = model(**inputs_b).last_hidden_state.mean(dim=1)

cosine_sim_pre = cosine_similarity(embed_a_pre, embed_b_pre)
spearman_corr_pre = spearman_correlation(embed_a_pre, embed_b_pre)
print(f"Pre-training Cosine Similarity: {cosine_sim_pre}")
print(f"Pre-training Spearman Correlation: {spearman_corr_pre}")

Pre-training Cosine Similarity: 0.9976966977119446
Pre-training Spearman Correlation: 0.6391859295076658


In [24]:
# Training Loop
optimizer = AdamW(list(model.parameters()) + list(projection_head.parameters()), lr=5e-5)
num_epochs = 2

for epoch in range(num_epochs):
    model.train()
    projection_head.train()
    total_loss = 0

    for batch in tqdm(dataloader):
        optimizer.zero_grad()

        # Process input batch
        input_ids_a = batch['input_ids_a'].to(model.device)
        attention_mask_a = batch['attention_mask_a'].to(model.device)
        input_ids_b = batch['input_ids_b'].to(model.device)
        attention_mask_b = batch['attention_mask_b'].to(model.device)

        # Forward pass through model
        outputs_a = model(input_ids_a, attention_mask=attention_mask_a).last_hidden_state.mean(dim=1)
        outputs_b = model(input_ids_b, attention_mask=attention_mask_b).last_hidden_state.mean(dim=1)

        # Projection head
        embeddings_a = projection_head(outputs_a)
        embeddings_b = projection_head(outputs_b)

        # Compute contrastive loss
        loss = contrastive_loss(embeddings_a, embeddings_b)
        total_loss += loss.item()

        # Backpropagation
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}")

100%|██████████| 21/21 [08:21<00:00, 23.88s/it]


Epoch 1, Loss: 3.8583420515060425


100%|██████████| 21/21 [08:20<00:00, 23.83s/it]

Epoch 2, Loss: 3.597190573101952





In [25]:
# Post-training Evaluation
with torch.no_grad():
    embed_a_post = projection_head(model(**inputs_a).last_hidden_state.mean(dim=1))
    embed_b_post = projection_head(model(**inputs_b).last_hidden_state.mean(dim=1))

cosine_sim_post = cosine_similarity(embed_a_post, embed_b_post)
spearman_corr_post = spearman_correlation(embed_a_post, embed_b_post)
print(f"Post-training Cosine Similarity: {cosine_sim_post}")
print(f"Post-training Spearman Correlation: {spearman_corr_post}")

Post-training Cosine Similarity: 0.9605379104614258
Post-training Spearman Correlation: 0.9579403955319539


### Findings

- Despite the small sample size and model, Spearman Correlation post-training improved significantly, increasing from 63% to 96%. This highlights the efficacy of the contrastive learning method in enhancing semantic alignment across languages.