# MarianMT

This notebook is mostly for educational purposes. The Transformers library recently added [MarianMT](https://huggingface.co/transformers/model_doc/marian.html), a family of translation models that seems very capable of translating text data.

In this notebook I use only a small subset of the validation data, because translating with this method takes a lot of time and probably doesn't give an edge over datasets translated using traditional methods.

In [1]:
!pip install -U transformers

Collecting transformers
  Downloading transformers-2.9.1-py3-none-any.whl (641 kB)
     |████████████████████████████████| 641 kB 2.8 MB/s
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 2.9.0
    Uninstalling transformers-2.9.0:
      Successfully uninstalled transformers-2.9.0
Successfully installed transformers-2.9.1


# Load data

In [2]:
import pandas as pd
from tqdm.notebook import tqdm

In [3]:
df = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
df = df.sample(100, random_state=12)
df.head(3)

Unnamed: 0,id,comment_text,lang,toxic
1091,1091,"Selam, tıpkı yeni mesaj aldığımızda çıkan sarı...",tr,0
5754,5754,Con tu permiso te he revertido el mensaje de e...,es,0
1564,1564,"He añadido a Haaretz como fuente en el texto, ...",es,0


# Tokenize & translate

In [4]:
df['lang'].unique()

array(['tr', 'es', 'it'], dtype=object)
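
Each of these three languages needs a checkpoint that translates into English. As a rough sketch of the mapping used in the translation cell of this notebook, the hypothetical helper `checkpoint_for` below (my own name, not part of Transformers) returns the shared `opus-mt-ROMANCE-en` checkpoint for Spanish and Italian and assumes a dedicated `opus-mt-{lang}-en` checkpoint for anything else. That second assumption holds for Turkish here, but it may not hold for every language code.

```python
# Language codes routed to the shared ROMANCE -> English checkpoint.
# Only 'es' and 'it' appear in this notebook's sample.
ROMANCE_LANGS = {"es", "it"}


def checkpoint_for(lang: str) -> str:
    """Return the Helsinki-NLP checkpoint name used in this notebook for `lang`."""
    if lang in ROMANCE_LANGS:
        return "Helsinki-NLP/opus-mt-ROMANCE-en"
    # Assumption: a dedicated '<lang>-en' checkpoint exists; check the model hub first.
    return "Helsinki-NLP/opus-mt-{}-en".format(lang)


for lang in ["tr", "es", "it"]:
    print(lang, "->", checkpoint_for(lang))
```

Keeping the routing in one small function makes it easier to add a language later without touching the translation loop.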

In [5]:
from transformers import MarianMTModel, MarianTokenizer



In [6]:
df['content_english'] = ''

In [7]:
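# Translate each language separately, since each one needs its own checkpoint.
for lang in tqdm(['es', 'it', 'tr']):
    mask = df['lang'] == lang
    if lang in ['es', 'it']:
        # Spanish and Italian share the multilingual ROMANCE -> English checkpoint.
        # Note: the '>>lang<<' prefix is documented for checkpoints with several
        # target languages; the target here is always English, so it may be unnecessary.
        model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
        df_lang = df.loc[mask, 'comment_text'].apply(lambda x: '>>{}<< '.format(lang) + x)
    else:
        # Other languages (here Turkish) use a dedicated <lang> -> English checkpoint.
        model_name = 'Helsinki-NLP/opus-mt-{}-en'.format(lang)
        df_lang = df.loc[mask, 'comment_text']

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name, output_loading_info=False)

    # Tokenize all comments for this language, padded to max_length=192.
    batch = tokenizer.prepare_translation_batch(df_lang.tolist(),
                                                max_length=192,
                                                pad_to_max_length=True)
    translated = model.generate(**batch)

    # Decode the generated token ids and store the text next to the source rows.
    df.loc[mask, 'content_english'] = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]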
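
The loop above passes every comment of a language to `model.generate` in a single batch. That is fine for 100 rows, but a larger sample may not fit in memory. Here is a minimal sketch of one way around it, assuming a hypothetical helper `chunked` (not part of Transformers) and an arbitrary chunk size of 16 in the commented usage:

```python
from typing import Iterator, List, Sequence


def chunked(texts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


# Hypothetical usage inside the loop above (not run here, it needs the downloaded model):
# outputs = []
# for chunk in chunked(df_lang.tolist(), 16):
#     batch = tokenizer.prepare_translation_batch(chunk, max_length=192, pad_to_max_length=True)
#     outputs.extend(tokenizer.decode(t, skip_special_tokens=True) for t in model.generate(**batch))

# Tiny self-check with placeholder strings.
print(list(chunked(["a", "b", "c", "d", "e"], 2)))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```

Because the chunks keep the original order, the collected outputs can be assigned back to the same rows exactly as in the loop above.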

In [8]:
df.head(3)

Unnamed: 0,id,comment_text,lang,toxic,content_english
1091,1091,"Selam, tıpkı yeni mesaj aldığımızda çıkan sarı...",tr,0,"Hi, there's a yellow tape, just like the yello..."
5754,5754,Con tu permiso te he revertido el mensaje de e...,es,0,With your permission I have reversed the messa...
1564,1564,"He añadido a Haaretz como fuente en el texto, ...",es,0,"I have added Haaretz as a source in the text, ..."


In [9]:
df.to_csv("df_translated.csv")

MarianMT can do much more than what is shown here. Please take a look at [the official documentation](https://huggingface.co/transformers/model_doc/marian.html).