Backtranslation Evaluation Notebook
===================================

This notebook backtranslates multilingual questions into English with GPT-4o
and evaluates translation quality by comparing each backtranslation with
the original English question.

Workflow:
---------
1. Load a multilingual dataset containing questions in various languages.
2. For each non-English question, use GPT to backtranslate it into English.
3. Match each backtranslated question with its corresponding original English version.
4. Evaluate the quality of the backtranslation using:
   - BERTScore F1 (semantic similarity metric)
   - Cosine similarity between sentence embeddings (using Sentence-Transformers)
5. Summarize the results by calculating the average scores across languages.

Evaluation Metrics:
-------------------
- **BERTScore F1**: Measures token-level semantic similarity.
- **Semantic Cosine Similarity**: Measures sentence-level semantic similarity.
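
The idea behind the F1 score above can be sketched with greedy token matching: each candidate token is paired with its most similar reference token (precision), each reference token with its most similar candidate token (recall), and F1 is their harmonic mean. The sketch below is an illustration only, not the `bert_score` implementation. The 2-D "token embeddings" and the helper names `cosine` and `greedy_f1` are made up for this example; the real library derives contextual embeddings from a pretrained model and supports options not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(candidate_tokens, reference_tokens):
    """Greedy-matching precision, recall, and F1 over token vectors (hypothetical helper)."""
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_tokens) for c in candidate_tokens
    ) / len(candidate_tokens)
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_tokens) for r in reference_tokens
    ) / len(reference_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy vectors, invented for illustration; they are not real embeddings.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_f1(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy case the extra candidate token lowers precision while recall stays at 1.0, so F1 lands between the two.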

Notes:
------
- A delay (`time.sleep(1.5)`) is included after each GPT call to avoid API rate limits.
- The notebook uses the `paraphrase-MiniLM-L6-v2` model from `sentence-transformers` for embeddings.
- Quality ranges are defined as:
    - Good (≥ 0.85)
    - Fair (≥ 0.70 and < 0.85)
    - Poor (< 0.70)

In [1]:
# Import necessary libraries
from tqdm import tqdm
import openai
import pandas as pd
import time
from openai import OpenAI
from bert_score import score
from sentence_transformers import SentenceTransformer, util

In [2]:
# Load your OpenAI API key from a file and create the API client
# (the backtranslation function below calls `client`)
with open("api.txt", "r") as f:
    client = OpenAI(api_key=f.read().strip())

client

<openai.OpenAI at 0x148b4a4b95e0>

In [3]:
# Language code to language name mapping
lang_map = {
    'id': 'Indonesian',
    'jv': 'Javanese',
    'su': 'Sundanese',
    'bn': 'Bengali',
    'es': 'Spanish',
    'ja': 'Japanese',
    'te': 'Telugu',
    'th': 'Thai',
    'en': 'English' 
}

In [4]:
# Set the folder and filename for the multilingual dataset
folder = 'mgsm_new/results_deepseekr1'  # change this to the folder that holds your generated results
filename = f'{folder}/final_result.csv' 

print(filename)

# Load the multilingual questions dataset
df = pd.read_csv(filename)
display(df[:3])

mgsm_new/results_deepseekr1/final_result.csv


Unnamed: 0,q_no,lang,question,answer_number,gen_text,gen_answer,is_correct,self_revision,wait_counts,revised_answer,is_correct_rev
0,1,id,Bebek milik Bu Halimah bertelur 16 butir per h...,18,Bu Halimah's ducks lay 16 eggs daily. She cons...,18.0,True,,,,True
1,2,id,Sebuah jubah membutuhkan 2 gulungan serat biru...,3,The problem states that a robe requires 2 roll...,3.0,True,,,,True
2,3,id,Najwa memutuskan untuk mencoba menjual rumah. ...,70000,Najwa membeli rumah seharga $80.000 dan mengel...,65000.0,False,\n---\nWait... Let me think again.\n---\n\n---...,2.0,70000.0,True


In [5]:
def backtranslate_with_gpt(row):
    """
    Backtranslates a non-English question into English using GPT-4o.
    
    Args:
        row (pd.Series): A single row from the dataframe containing 'lang' and 'question'.
        
    Returns:
        str or None: The backtranslated English question, or None if there is an error.
    """
    lang_code = row['lang']
    lang_full = lang_map.get(lang_code, 'Unknown')
    
    if lang_code == 'en':
        # If already English, return the original question (no backtranslation needed)
        return row['question']

    question = row['question']
    prompt = f"Translate this question from {lang_full} to English:\n\n{question}"

    try:
        # Call GPT model to perform translation
        response = client.chat.completions.create(
            model="gpt-4o",  # Model used for translation
            messages=[
                {"role": "system", "content": "You are a helpful translation assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        # Handle API errors gracefully
        print(f"Error on row {row.get('q_no', 'N/A')}: {e}")
        return None

In [None]:
# List to store backtranslated questions
backtranslated = []

# Iterate over each row and perform backtranslation
for _, row in tqdm(df.iterrows(), total=len(df), desc="Backtranslating"):
    backtranslated.append(backtranslate_with_gpt(row))
    time.sleep(1.5)  # Delay to avoid API rate limits

# Add the backtranslated questions to the dataframe
df['backtranslated_question'] = backtranslated
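
A fixed delay may not be enough when the API is under load. One possible alternative is to retry a failed call with exponentially growing waits. The sketch below assumes a hypothetical helper named `call_with_retry`, which is not part of the notebook. Because `backtranslate_with_gpt` catches exceptions itself and returns `None`, a wrapper like this would need to go around the raw API call rather than around that function.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.5, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper for illustration. Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            # Only wait if another attempt is still coming.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Usage with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the wait schedule easy to inspect.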

In [9]:
# Load the original English questions from another dataset
original_df = pd.read_csv("mgsm/results_deepseekr1/final_result.csv")
original_df_en = original_df[original_df['lang'] == 'en']

# Create a mapping from question number to original English question
original_map = original_df_en.set_index('q_no')['question'].to_dict()

# Map the original English questions to the dataframe
df['original_question'] = df['q_no'].map(original_map)

In [11]:
# Filter to only the rows where both original and backtranslated questions are available.
# .copy() avoids pandas' SettingWithCopyWarning when score columns are added later.
comparison = df.dropna(subset=['original_question', 'backtranslated_question'])[
    ['q_no', 'lang', 'original_question', 'question', 'backtranslated_question']
].copy()

comparison

Unnamed: 0,q_no,lang,original_question,question,backtranslated_question
0,1,id,Janet’s ducks lay 16 eggs per day. She eats th...,Bebek milik Bu Halimah bertelur 16 butir per h...,Mrs. Halimah's duck lays 16 eggs per day. She ...
1,2,id,A robe takes 2 bolts of blue fiber and half th...,Sebuah jubah membutuhkan 2 gulungan serat biru...,A robe requires 2 rolls of blue fiber and half...
2,3,id,Josh decides to try flipping a house. He buys...,Najwa memutuskan untuk mencoba menjual rumah. ...,Najwa decided to try selling a house. She boug...
3,4,id,James decides to run 3 sprints 3 times a week....,Hasna memutuskan untuk berlari 3 kali seminggu...,Hasna decided to run 3 times a week. She runs ...
4,5,id,"Every day, Wendi feeds each of her chickens th...","Setiap hari, Wendi memberi makan ayam-ayamnya ...","Every day, Wendi feeds her chickens three cups..."
...,...,...,...,...,...
145,46,su,Meredith is a freelance blogger who writes abo...,Hanif mangrupikeun blogger lepas anu nyerat ng...,Hanif is a freelance blogger who writes about ...
146,47,su,Candice put 80 post-it notes in her purse befo...,Salsa nempatkeun 80 catetan pos-éta dina dompe...,Salsa placed 80 Post-it notes in her bag befor...
147,48,su,John buys twice as many red ties as blue ties....,Bagas meuli dua kali leuwih loba dasi beureum ...,Bagas bought twice as many red ties as blue ti...
148,49,su,Tracy used a piece of wire 4 feet long to supp...,Laras ngagunakeun sapotong kawat panjangna 4 k...,Laras uses a piece of wire that is 4 feet long...


In [13]:
# Evaluate the backtranslated questions using BERTScore (token-level semantic similarity)
P, R, F1 = score(
    comparison['backtranslated_question'].tolist(),
    comparison['original_question'].tolist(),
    lang="en"
)

# Add BERTScore F1 scores to the comparison dataframe (convert the tensor to plain floats)
comparison['bertscore_f1'] = F1.tolist()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Load the sentence transformer model for sentence embeddings
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Encode original and backtranslated questions into embeddings
original_embs = model.encode(comparison['original_question'].tolist(), convert_to_tensor=True)
backtrans_embs = model.encode(comparison['backtranslated_question'].tolist(), convert_to_tensor=True)

# Compute cosine similarity between the embeddings
similarities = util.pytorch_cos_sim(original_embs, backtrans_embs).diagonal()

# Add semantic similarity scores to the dataframe
comparison['semantic_similarity'] = similarities.cpu().numpy()
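
The cell above builds the full N x N similarity matrix and keeps only its diagonal, which holds the similarity of each original question with its own backtranslation. The pure-Python sketch below illustrates that equivalence on toy vectors. The vectors and the helper names `cosine`, `cosine_matrix`, and `pairwise_cosine` are made up for this example. Recent versions of `sentence-transformers` may also offer a row-wise helper (`util.pairwise_cos_sim`); treat that as an assumption and check your installed version.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(a_rows, b_rows):
    """All-pairs similarity: entry [i][j] compares a_rows[i] with b_rows[j]."""
    return [[cosine(a, b) for b in b_rows] for a in a_rows]


def pairwise_cosine(a_rows, b_rows):
    """Row-wise similarity: entry [i] compares a_rows[i] with b_rows[i]."""
    return [cosine(a, b) for a, b in zip(a_rows, b_rows)]


# Toy vectors, invented for illustration; they are not real sentence embeddings.
originals = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
backtranslations = [[2.0, 0.0], [1.0, 1.0], [1.0, 1.0]]

matrix = cosine_matrix(originals, backtranslations)
diagonal = [matrix[i][i] for i in range(len(matrix))]
pairwise = pairwise_cosine(originals, backtranslations)
print(diagonal == pairwise)
```

The row-wise version does N comparisons instead of N squared, which matters only for much larger datasets than the one used here.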

In [15]:
# Print average BERTScore F1 and semantic similarity across all examples
print(f"Mean BERTScore F1: {comparison['bertscore_f1'].mean():.4f}")
print(f"Mean semantic similarity: {comparison['semantic_similarity'].mean():.4f}")

Unnamed: 0,q_no,lang,original_question,question,backtranslated_question,bertscore_f1,semantic_similarity
0,1,id,Janet’s ducks lay 16 eggs per day. She eats th...,Bebek milik Bu Halimah bertelur 16 butir per h...,Mrs. Halimah's duck lays 16 eggs per day. She ...,0.969530,0.907748
1,2,id,A robe takes 2 bolts of blue fiber and half th...,Sebuah jubah membutuhkan 2 gulungan serat biru...,A robe requires 2 rolls of blue fiber and half...,0.954528,0.701479
2,3,id,Josh decides to try flipping a house. He buys...,Najwa memutuskan untuk mencoba menjual rumah. ...,Najwa decided to try selling a house. She boug...,0.959591,0.722308
3,4,id,James decides to run 3 sprints 3 times a week....,Hasna memutuskan untuk berlari 3 kali seminggu...,Hasna decided to run 3 times a week. She runs ...,0.946048,0.716118
4,5,id,"Every day, Wendi feeds each of her chickens th...","Setiap hari, Wendi memberi makan ayam-ayamnya ...","Every day, Wendi feeds her chickens three cups...",0.970420,0.981483
...,...,...,...,...,...,...,...
145,46,su,Meredith is a freelance blogger who writes abo...,Hanif mangrupikeun blogger lepas anu nyerat ng...,Hanif is a freelance blogger who writes about ...,0.977788,0.623388
146,47,su,Candice put 80 post-it notes in her purse befo...,Salsa nempatkeun 80 catetan pos-éta dina dompe...,Salsa placed 80 Post-it notes in her bag befor...,0.966333,0.863011
147,48,su,John buys twice as many red ties as blue ties....,Bagas meuli dua kali leuwih loba dasi beureum ...,Bagas bought twice as many red ties as blue ti...,0.981589,0.923684
148,49,su,Tracy used a piece of wire 4 feet long to supp...,Laras ngagunakeun sapotong kawat panjangna 4 k...,Laras uses a piece of wire that is 4 feet long...,0.962693,0.801034


In [17]:
# F1 quality interpretation:
# - Good: ≥ 0.85
# - Fair: ≥ 0.70 and < 0.85
# - Poor: < 0.70
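
The interpretation above can be turned into a small labelling function. This is a minimal sketch, assuming a score of exactly 0.85 counts as Good and a score of exactly 0.70 counts as Fair. The name `quality_label` is hypothetical and does not appear in the notebook.

```python
def quality_label(score):
    """Map a similarity score to the notebook's quality ranges (hypothetical helper)."""
    if score >= 0.85:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"


# Sample scores taken from the comparison table, plus the 0.85 boundary.
labels = [quality_label(s) for s in (0.969530, 0.85, 0.701479, 0.623388)]
print(labels)  # → ['Good', 'Good', 'Fair', 'Poor']
```

With the dataframe from the earlier cells, the same function could be applied per row, for example `comparison['bertscore_f1'].apply(quality_label)`.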

In [18]:
# Group by language and calculate the mean evaluation scores
comparison.groupby('lang')[['bertscore_f1', 'semantic_similarity']].mean().reset_index()

Unnamed: 0,lang,bertscore_f1,semantic_similarity
0,id,0.968273,0.826166
1,jv,0.970283,0.838202
2,su,0.973691,0.838601
