# Opus-MT Translation

Opus-MT is a set of translation models developed by Helsinki-NLP (Language Technology at the University of Helsinki). These models are easy to use through Hugging Face, so we need to install transformers and TensorFlow.

In [1]:
!pip install transformers



In [2]:
!pip install tensorflow



Now we will create a simple translator for strings. The model we're going to work with is "Helsinki-NLP/opus-mt-en-es" (English to Spanish). We keep the resulting pipeline in a global variable for convenience.

In [3]:
from transformers import pipeline

In [4]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
translator("Default to expanded threads")

[{'translation_text': 'Predeterminado a los hilos de discusión expandidos'}]
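
The call above translates one string at a time. As a minimal sketch, assuming the pipeline also accepts a list of strings and returns one dict per input (as Hugging Face pipelines generally do), several strings could be translated in one call. `translate_batch` and `stub_translator` are hypothetical names introduced here; the stub only mimics the return shape so the example runs without downloading a model.

```python
def translate_batch(texts, translator):
    """Translate several strings with one call and return plain strings."""
    results = translator(list(texts))
    return [result["translation_text"] for result in results]


def stub_translator(texts):
    # Hypothetical stand-in for the real pipeline: same return shape,
    # but it only upper-cases the input instead of translating it.
    return [{"translation_text": text.upper()} for text in texts]


translated = translate_batch(["hello", "threads"], stub_translator)
print(translated)
```

With the real pipeline, `translate_batch(texts, translator)` would be called with the `translator` object created above.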

The following function finds the English entry in a question's list of translations and uses the translator to append a Spanish entry to that list.

In [5]:
def translate_question_to_spanish(question, question_id):
    """Append a Spanish translation to a question's list of language entries."""
    try:
        # Take the first English entry; next() raises StopIteration if there is none.
        en_question = next(filter(lambda x: x.get("language") == "en", question))
        translation = translator(en_question.get("string"))[0].get("translation_text")
        # Note: the Spanish text is stored under "question", not "string" like the other entries.
        question.append({"language": "es", "question": translation})
    except StopIteration:
        print("No English version was found for question", question_id)
    except Exception as error:
        print("Unexpected error:", error)
    return question


def generate_new_question(question):
    question["question"] = translate_question_to_spanish(question.get("question"), question.get("id"))
    return question


def translate_questions(questions):
    questions["questions"] = list(map(generate_new_question, questions.get("questions")))
    return questions
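
To see the per-question logic without loading the model, here is a minimal sketch under a few assumptions: `add_spanish` is a hypothetical, non-mutating variant of the function above that takes the translator as a parameter, and `fake_translator` is a stand-in that only mimics the pipeline's return shape. The sample entries are taken from the QALD data used in this notebook.

```python
def add_spanish(entries, translate):
    """Return a copy of `entries` with a Spanish entry appended, if an English one exists."""
    english = next((e for e in entries if e.get("language") == "en"), None)
    if english is None:
        # No English source text, so there is nothing to translate.
        return list(entries)
    text = translate(english["string"])[0]["translation_text"]
    # The Spanish entry uses the "question" key, as in the notebook's function.
    return list(entries) + [{"language": "es", "question": text}]


def fake_translator(text):
    # Hypothetical stand-in: returns the same structure as the real pipeline.
    return [{"translation_text": "<es> " + text}]


sample = [
    {"language": "en", "string": "What is the time zone of Salt Lake City?"},
    {"language": "de", "string": "In welcher Zeitzone liegt Salt Lake City?"},
]
result = add_spanish(sample, fake_translator)
print(result[-1])
```

Unlike the notebook's version, this sketch leaves the input list untouched, which makes it easier to test in isolation.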

Now let's load the files so we can generate a Spanish version of every question.

In [7]:
import json

In [8]:
def read_json(filename):
    with open(filename, 'r', encoding="utf8") as f:
        return json.load(f)

In [9]:
test_dbpedia = read_json("../data/qald_9_plus_test_dbpedia.json")
train_dbpedia = read_json("../data/qald_9_plus_train_dbpedia.json")
test_wikidata = read_json("../data/qald_9_plus_test_wikidata.json")
train_wikidata = read_json("../data/qald_9_plus_train_wikidata.json")

Let's check that everything is OK using the DBpedia test data.

In [10]:
test_dbpedia = translate_questions(test_dbpedia)

In [11]:
test_dbpedia

{'questions': [{'id': '99',
   'question': [{'language': 'en',
     'string': 'What is the time zone of Salt Lake City?'},
    {'language': 'de', 'string': 'In welcher Zeitzone liegt Salt Lake City?'},
    {'language': 'de', 'string': 'Was ist die Zeitzone von Salt Lake City?'},
    {'language': 'ru', 'string': 'Какой часовой пояс в Солт-Лейк-Сити'},
    {'language': 'ru',
     'string': 'В каком часовом поясе расположен Солт-Лейк-Сити?'},
    {'language': 'uk', 'string': 'Який часовий пояс у Солт-Лейк Сіті?'},
    {'language': 'lt', 'string': 'Kokia Solt Leik Sičio laiko zona?'},
    {'language': 'be', 'string': 'Які гадзінны пояс у Солт-Лэйк-Сіці'},
    {'language': 'lt', 'string': 'Kokia laiko juosta yra Solt Leik Sityjes'},
    {'language': 'ba', 'string': 'Ниндей вакыт поясы Солт-Лейк-Ситила'},
    {'language': 'es',
     'question': '¿Cuál es la zona horaria de Salt Lake City?'}],
   'query': {'sparql': 'PREFIX res: <http://dbpedia.org/resource/> PREFIX dbp: <http://dbpedia.org/p

Everything seems to be correct; let's check the stats.

In [12]:
from collections import Counter

def show_stats(data):
    """Print how many question entries exist per language."""
    lang_counts = Counter()
    for q in data['questions']:
        for entry in q['question']:
            lang_counts[entry['language']] += 1

    # Counter keeps insertion order, so languages print in order of first appearance.
    for lang, count in lang_counts.items():
        print(lang, count)

In [13]:
show_stats(test_dbpedia)

en 150
de 176
ru 348
uk 176
lt 186
be 155
ba 117
es 150
hy 20
fr 26
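
The matching counts for `en` and `es` suggest that every English question received a translation. A more direct check, sketched here with a hypothetical helper `missing_spanish` and a toy dataset in the same shape as the QALD files, lists the ids of questions that have an English entry but no Spanish one.

```python
def missing_spanish(data):
    """Return ids of questions that have an English entry but no Spanish one."""
    missing = []
    for q in data["questions"]:
        langs = {entry["language"] for entry in q["question"]}
        if "en" in langs and "es" not in langs:
            missing.append(q["id"])
    return missing


# Toy data for illustration only; not taken from the real files.
toy = {"questions": [
    {"id": "1", "question": [{"language": "en", "string": "a"},
                             {"language": "es", "question": "b"}]},
    {"id": "2", "question": [{"language": "en", "string": "c"}]},
]}
print(missing_spanish(toy))
```

An empty list for a translated dataset would mean that no English question was left without a Spanish entry.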


Now we can translate all the data

In [14]:
train_dbpedia = translate_questions(train_dbpedia)

In [15]:
test_wikidata = translate_questions(test_wikidata)

In [16]:
train_wikidata = translate_questions(train_wikidata)

Let's check all the stats before saving

In [17]:
show_stats(train_dbpedia)

en 408
de 543
ru 1203
lt 468
uk 447
fr 260
es 408
be 441
ba 284
hy 80


In [18]:
show_stats(test_wikidata)

en 136
de 159
ru 318
uk 160
lt 166
be 141
ba 107
es 136
hy 19
fr 25


In [19]:
show_stats(train_wikidata)

en 371
de 497
ru 1095
lt 426
uk 407
fr 251
es 371
be 403
ba 260
hy 71


Now we are ready to save the new version. Note that the files are written back to the same paths, so the originals are overwritten.

In [20]:
def save_json(filename, data):
    """save json"""
    with open(filename, 'w', encoding="utf8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
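
The `ensure_ascii=False` argument matters here because the questions contain non-ASCII characters (Spanish, Cyrillic, and so on). The following sketch redefines the two helpers so it is self-contained and writes to a temporary directory (an assumption for illustration, so the real data files are not touched) to show that the text is stored literally and survives a round trip.

```python
import json
import os
import tempfile


def save_json(filename, data):
    with open(filename, "w", encoding="utf8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


def read_json(filename):
    with open(filename, "r", encoding="utf8") as f:
        return json.load(f)


sample = {"questions": [{"id": "99", "question": [
    {"language": "es", "question": "¿Cuál es la zona horaria de Salt Lake City?"}]}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    save_json(path, sample)
    with open(path, encoding="utf8") as f:
        raw = f.read()          # file contents as written to disk
    restored = read_json(path)  # parsed back into Python objects

print("¿Cuál" in raw)      # characters are written literally, not as \u escapes
print(restored == sample)  # the round trip preserves the data
```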

In [21]:
save_json("../data/qald_9_plus_test_dbpedia.json", test_dbpedia)

In [22]:
save_json("../data/qald_9_plus_train_dbpedia.json", train_dbpedia)

In [23]:
save_json("../data/qald_9_plus_test_wikidata.json", test_wikidata)

In [24]:
save_json("../data/qald_9_plus_train_wikidata.json", train_wikidata)