# Objective
Translate the English keywords and clusters to Arabic. For production I think it would be preferable to use commercial translation APIs to ensure the best quality.

For the purpose of this assignment I will use open-source tooling to keep it reproducible. After a quick online search I opted for the Helsinki-NLP OPUS translation model fine-tuned for English-to-Arabic translation: Helsinki-NLP/opus-mt-tc-big-en-ar

NOTE: I should be aware that the translations will not necessarily match the wording used in the texts and could be off. Working with some kind of semantic similarity range, or fuzzy matching, would therefore help to catch more than exact matches.
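
To make the fuzzy matching idea in the note concrete, here is a minimal sketch using only the standard library. It is an assumption about how the matching could be done, not something this notebook implements: `normalize_arabic` and `best_fuzzy_match` are hypothetical helpers, the normalization rules are a common but lossy simplification, and `difflib.SequenceMatcher` measures character-level similarity rather than semantic similarity.

```python
import re
from difflib import SequenceMatcher

# Arabic diacritics (tashkeel), the superscript alef, and the tatweel elongation mark
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def normalize_arabic(text):
    """Hypothetical helper: reduce spelling variation before comparing strings."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)  # unify alef variants
    text = text.replace("ى", "ي")      # alef maqsura -> ya
    text = text.replace("ة", "ه")      # ta marbuta -> ha
    return re.sub(r"\s+", " ", text).strip()


def best_fuzzy_match(keyword, text, threshold=0.8):
    """Slide a window of the keyword's word count over the text and return
    (best_window, score), or (None, score) if the best score is below threshold.
    The threshold of 0.8 is an arbitrary starting point, not a tuned value."""
    kw = normalize_arabic(keyword)
    tokens = normalize_arabic(text).split()
    n = len(kw.split())
    best_score, best_span = 0.0, None
    for i in range(max(len(tokens) - n + 1, 1)):
        window = " ".join(tokens[i:i + n])
        score = SequenceMatcher(None, kw, window).ratio()
        if score > best_score:
            best_score, best_span = score, window
    if best_score >= threshold:
        return best_span, best_score
    return None, best_score
```

Short, similar-looking words can score above the threshold even when they mean different things, so the cutoff would need tuning against a handful of manually checked examples.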

In [14]:
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
import warnings
warnings.filterwarnings('ignore')

# Load the translation model
model_name = "Helsinki-NLP/opus-mt-tc-big-en-ar"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

print("Model and tokenizer input successful!")

Model and tokenizer input successful!


In [None]:
def translate_text(texts, model, tokenizer):
    """
    Translate a list of English texts to Arabic.
    
    Args:
        texts: List of English text strings to translate
        model: Pre-loaded model
        tokenizer: Pre-loaded tokenizer
    
    Returns:
        List of translated Arabic texts
    """
    translations = []
    
    total = len(texts)
    
    for i, text in enumerate(texts):
        if i % 10 == 0 or i == total - 1:
            print(f"Translating {i+1}/{total}: '{text}'")
        
        # Setting max_length and truncation is not strictly necessary because we only translate a few words, but it is kept here as good practice
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        translated_tokens = model.generate(**inputs, max_length=128)
        # skip special tokens for clean translation output to avoid extra cleaning step...
        translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
        translations.append(translation)
            
    return translations


In [7]:
risk_categories_df = pd.read_excel('data/risk-factors-categories.xlsx')

In [8]:
# Translate risk factors and clusters
print("Translating risk factors...")
risk_factors_arabic = translate_text(risk_categories_df['risk_factor'].tolist(), model, tokenizer)

print("Translating clusters...")
clusters_arabic = translate_text(risk_categories_df['cluster'].tolist(), model, tokenizer)

print("Translation completed.")


Translating risk factors...
Translating 1/167: 'land seizures'
Translating 11/167: 'international alarm'
Translating 21/167: 'economic impoverishment'
Translating 31/167: 'foreign troops'
Translating 41/167: 'military dictatorship'
Translating 51/167: 'international terrorists'
Translating 61/167: 'continued strife'
Translating 71/167: 'land invasions'
Translating 81/167: 'prolonged dry spell'
Translating 91/167: 'gangs of bandits'
Translating 101/167: 'coup'
Translating 111/167: 'land degradation'
Translating 121/167: 'environmental degradation'
Translating 131/167: 'corruption'
Translating 141/167: 'displaced'
Translating 151/167: 'cholera outbreak'
Translating 161/167: 'd'etat'
Translating 167/167: 'mismanagement'
Translating clusters...
Translating 1/167: 'land-related issues'
Translating 11/167: 'humanitarian aid'
Translating 21/167: 'economic issues'
Translating 31/167: 'conflicts and violence'
Translating 41/167: 'political instability'
Translating 51/167: 'conflicts and violenc
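
The cluster column repeats the same labels across its 167 rows, so each distinct label gets translated many times. A minimal sketch of translating every distinct string only once and mapping the results back follows. `translate_unique` is a hypothetical wrapper, not part of the notebook, and it assumes the inputs are plain strings and that `translate_fn` returns one translation per input, in order.

```python
def translate_unique(texts, translate_fn):
    """Hypothetical wrapper: translate each distinct string once, then map back.

    translate_fn is any callable that takes a list of strings and returns a
    list of translations of the same length and order.
    """
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    translated = translate_fn(unique_texts)
    lookup = dict(zip(unique_texts, translated))
    return [lookup[t] for t in texts]


# Possible use with the function defined earlier (left commented out because it
# needs the loaded model and tokenizer):
# clusters_arabic = translate_unique(
#     risk_categories_df['cluster'].tolist(),
#     lambda batch: translate_text(batch, model, tokenizer),
# )
```

The output list has the same length and order as the input, so it can be dropped into the DataFrame construction unchanged.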

In [15]:
# Create new data frame that contains original text and translated columns:
translated_df = pd.DataFrame({
    'risk_factor_english': risk_categories_df['risk_factor'],
    'cluster_english': risk_categories_df['cluster'],
    'risk_factor_arabic': risk_factors_arabic,
    'cluster_arabic': clusters_arabic
})

translated_df.head(5)


|   | risk_factor_english | cluster_english | risk_factor_arabic | cluster_arabic |
|---|---|---|---|---|
| 0 | land seizures | land-related issues | الاستيلاء على الأراضي | القضايا المتعلقة بالأراضي |
| 1 | slashed export | economic issues | تصدير مخفض | القضايا الاقتصادية |
| 2 | price rise | economic issues | ارتفاع الأسعار | القضايا الاقتصادية |
| 3 | mass hunger | food crisis | جوع جماعي | أزمة الغذاء |
| 4 | cyclone | weather shocks | الإعصار | صدمات الطقس |


In [None]:
# Save to Excel file
output_path = 'new_data/risk-factors-translated.xlsx'
translated_df.to_excel(output_path, index=False)
print(f"Saved translated data to {output_path}")

# Verify the file was saved correctly
verification_df = pd.read_excel(output_path)
verification_df.head(5)


Saved translated data to data/risk-factors-translated.xlsx


|   | risk_factor_english | cluster_english | risk_factor_arabic | cluster_arabic |
|---|---|---|---|---|
| 0 | land seizures | land-related issues | الاستيلاء على الأراضي | القضايا المتعلقة بالأراضي |
| 1 | slashed export | economic issues | تصدير مخفض | القضايا الاقتصادية |
| 2 | price rise | economic issues | ارتفاع الأسعار | القضايا الاقتصادية |
| 3 | mass hunger | food crisis | جوع جماعي | أزمة الغذاء |
| 4 | cyclone | weather shocks | الإعصار | صدمات الطقس |
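
Reading the file back confirms that it was written, but not that every row was actually translated. Fragment entries such as 'd'etat' in the log above are the kind of input that might come back empty or untranslated. A minimal sketch of a sanity check follows. `find_suspect_translations` is a hypothetical helper, and the test for "contains at least one Arabic letter" is a crude proxy that says nothing about translation quality.

```python
import re

# Basic Arabic letters (hamza through ya); digits and punctuation are excluded
_ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def find_suspect_translations(sources, translations):
    """Hypothetical check: return (index, source, translation) for every row
    whose translation is empty or contains no Arabic letter."""
    suspects = []
    for i, (src, tgt) in enumerate(zip(sources, translations)):
        tgt = "" if tgt is None else str(tgt)
        if not tgt.strip() or not _ARABIC_LETTER.search(tgt):
            suspects.append((i, src, tgt))
    return suspects
```

It could be called on the column pairs, for example `find_suspect_translations(translated_df['risk_factor_english'], translated_df['risk_factor_arabic'])`, and any flagged rows reviewed by hand.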
