# Hugging Face - MarianMT (Language Translator with Summarization)

### Codes for the Representation of Names of Languages (Use ISO 639-1 Code in src and trg): https://www.loc.gov/standards/iso639-2/php/code_list.php
### Documentation for Language Support: https://huggingface.co/Helsinki-NLP 
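
As a minimal sketch of how the two language codes map to a checkpoint name, the helper below builds the `Helsinki-NLP/opus-mt-{src}-{trg}` string used later in this notebook. `opus_mt_name` is a hypothetical helper, not part of any library. It checks only that each code looks like a two-letter ISO 639-1 code; it does not check that the checkpoint exists, and it assumes the plain two-letter naming scheme, which may not cover every model published by Helsinki-NLP.

```python
import re


def opus_mt_name(src: str, trg: str) -> str:
    """Build an OPUS-MT checkpoint name from two ISO 639-1 codes (hypothetical helper)."""
    for code in (src, trg):
        # Accept exactly two lowercase ASCII letters, e.g. 'th' or 'en'
        if not re.fullmatch(r"[a-z]{2}", code):
            raise ValueError(f"expected a two-letter ISO 639-1 code, got {code!r}")
    return f"Helsinki-NLP/opus-mt-{src}-{trg}"


print(opus_mt_name("th", "en"))  # Helsinki-NLP/opus-mt-th-en
```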

### Note: we use the Hugging Face implementation of MarianMT, which is based on the original C++ Marian code
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Main features:

    Efficient pure C++ implementation
    Fast multi-GPU training and GPU/CPU translation
    State-of-the-art NMT architectures: deep RNN and Transformer
    Permissive open source license (MIT)



Sources:

First Version Trial Code:
https://github.com/chuachinhon/practical_nlp/blob/master/notebooks/4.2_chinese_to_english_translate.ipynb

Article about the implementation of MarianMT:
https://towardsdatascience.com/lost-in-machine-translation-3b05615d68e7

We combined the trial code above with the code from the following GitHub issue to make it work on an Nvidia GPU (the issue covers CUDA out-of-memory errors when translating many times with the MarianMT model): https://fantashit.com/marianmt-cuda-out-of-memory-when-translating-many-times-with-the-marianmt-model/ https://github.com/huggingface/transformers/issues/6796

MarianMT Website: https://marian-nmt.github.io/

White Paper: https://www.researchgate.net/publication/334116039_Marian_Fast_Neural_Machine_Translation_in_C



In [1]:
import pandas as pd
import re
from transformers import MarianMTModel, MarianTokenizer

In [2]:
df = pd.read_excel('2 - Thailand_Thai_Takeda.xlsx')
df.head()

Unnamed: 0,Datetime,Date,Time,Title,Url,Domain,Tags,Sentiment,Page Type,tweet_id,...,Twitter Following,Twitter Reply Count,Twitter Reply to,Twitter Retweet of,Twitter Retweets,Twitter Likes,Twitter Tweets,Twitter Verified,Page Type Name,Reddit Score
0,2021-04-13 06:44:07,2021-04-13,06:44:07,http://twitter.com/1/statuses/1381860734108196865,http://twitter.com/1/statuses/1381860734108196865,twitter.com,,negative,twitter,1.381861e+18,...,109,,,1.381594e+18,10751,0.0,34853,False,Twitter,0
1,2021-04-13 06:44:48,2021-04-13,06:44:48,http://twitter.com/1/statuses/1381860905206521856,http://twitter.com/1/statuses/1381860905206521856,twitter.com,,positive,twitter,1.381861e+18,...,548,,,1.381805e+18,23152,0.0,26426,False,Twitter,0
2,2021-04-13 06:44:54,2021-04-13,06:44:54,http://twitter.com/1/statuses/1381860932247121923,http://twitter.com/1/statuses/1381860932247121923,twitter.com,,positive,twitter,1.381861e+18,...,329,,,1.381805e+18,23152,0.0,90651,False,Twitter,0
3,2021-04-13 06:45:07,2021-04-13,06:45:07,http://twitter.com/1/statuses/1381860988224380931,http://twitter.com/1/statuses/1381860988224380931,twitter.com,,positive,twitter,1.381861e+18,...,92,,,1.381805e+18,23152,0.0,54502,False,Twitter,0
4,2021-04-13 06:45:13,2021-04-13,06:45:13,http://twitter.com/1/statuses/1381861011670495234,http://twitter.com/1/statuses/1381861011670495234,twitter.com,,negative,twitter,1.381861e+18,...,133,,,1.381594e+18,10751,0.0,45665,False,Twitter,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59999 entries, 0 to 59998
Data columns (total 77 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Datetime                      59999 non-null  datetime64[ns]
 1   Date                          59999 non-null  datetime64[ns]
 2   Time                          59999 non-null  object        
 3   Title                         59999 non-null  object        
 4   Url                           59999 non-null  object        
 5   Domain                        59999 non-null  object        
 6   Tags                          8173 non-null   object        
 7   Sentiment                     59999 non-null  object        
 8   Page Type                     59999 non-null  object        
 9   tweet_id                      46418 non-null  float64       
 10  Author                        46753 non-null  object        
 11  Full Name                   

In [4]:
df.isnull().sum()

Datetime                0
Date                    0
Time                    0
Title                   0
Url                     0
                    ...  
Twitter Likes       13581
Twitter Tweets          0
Twitter Verified        0
Page Type Name          0
Reddit Score            0
Length: 77, dtype: int64

In [5]:
df = df.dropna(subset=['Title', 'Content'])

In [6]:
df.isnull().sum()

Datetime                0
Date                    0
Time                    0
Title                   0
Url                     0
                    ...  
Twitter Likes       13581
Twitter Tweets          0
Twitter Verified        0
Page Type Name          0
Reddit Score            0
Length: 77, dtype: int64

In [7]:
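src = 'th'  # source language (ISO 639-1 code)
trg = 'en'  # target language (ISO 639-1 code)
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
tok = MarianTokenizer.from_pretrained(mname)
# Load the model once and move it to the GPU
model = MarianMTModel.from_pretrained(mname).to('cuda')

def translate(data):
    """Translate a string (or a list of strings) and return a list of translations."""
    # Tokenize the input and move the tensors to the same device as the model
    batch = tok(data, return_tensors='pt', padding=True, truncation=True).to('cuda')
    # generate() returns its output on the model's device, so no extra .to('cuda') is needed
    gen = model.generate(**batch)
    # Decode the token ids back to text, dropping special tokens such as padding
    return tok.batch_decode(gen, skip_special_tokens=True)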
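
Calling `translate` through `apply` sends one row at a time to the model. A minimal sketch of grouping rows into batches is shown below. `chunks`, `translate_in_batches` and `batch_size` are hypothetical names introduced for illustration, and `translate_batch` stands in for any function that, like `translate`, takes a list of strings and returns a list of translations of the same length. The stand-in used here only upper-cases text so that the sketch runs without a GPU. A suitable batch size depends on the available GPU memory and is not something this notebook measures.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # assumed default, tune for the GPU in use
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    out: List[str] = []
    for batch in chunks(texts, batch_size):
        result = translate_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("translate_batch must return one output per input")
        out.extend(result)
    return out


# Stand-in for a real translation function, so the sketch runs anywhere
def fake_translate(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


print(translate_in_batches(["a", "b", "c"], fake_translate, batch_size=2))  # ['A', 'B', 'C']
```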

In [8]:
%%time
# Translating Content First
df['Translated Content'] = df['Content'].apply(translate)

CPU times: user 8h 27min 36s, sys: 3.32 s, total: 8h 27min 40s
Wall time: 8h 27min 35s
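
Several of the rows shown by `df.head()` share the same `Twitter Retweet of` value, which suggests that the data contains repeated texts. A minimal sketch of translating each distinct text only once is shown below. `translate_unique` is a hypothetical helper, and `translate_one` stands in for any function that maps one string to its translation. How much time this saves depends on how many duplicates the full data set contains, which is not shown here.

```python
from typing import Callable, Dict, Iterable, List


def translate_unique(
    texts: Iterable[str],
    translate_one: Callable[[str], str],
) -> List[str]:
    """Translate each distinct text once and reuse the result for its repeats."""
    cache: Dict[str, str] = {}
    out: List[str] = []
    for text in texts:
        if text not in cache:
            # First time we see this text: pay the translation cost once
            cache[text] = translate_one(text)
        out.append(cache[text])
    return out


# Stand-in translator that records how often it is called
calls: List[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


print(translate_unique(["a", "b", "a", "a"], fake_translate))  # ['A', 'B', 'A', 'A']
print(len(calls))  # 2
```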


In [9]:
# Look at Translated Content Here
df

Unnamed: 0,Datetime,Date,Time,Title,Url,Domain,Tags,Sentiment,Page Type,tweet_id,...,Twitter Reply Count,Twitter Reply to,Twitter Retweet of,Twitter Retweets,Twitter Likes,Twitter Tweets,Twitter Verified,Page Type Name,Reddit Score,Translated Content
0,2021-04-13 06:44:07,2021-04-13,06:44:07,http://twitter.com/1/statuses/1381860734108196865,http://twitter.com/1/statuses/1381860734108196865,twitter.com,,negative,twitter,1.381861e+18,...,,,1.381594e+18,10751,0.0,34853,False,Twitter,0,[RT@kksnow: Great news today: high-level Chine...
1,2021-04-13 06:44:48,2021-04-13,06:44:48,http://twitter.com/1/statuses/1381860905206521856,http://twitter.com/1/statuses/1381860905206521856,twitter.com,,positive,twitter,1.381861e+18,...,,,1.381805e+18,23152,0.0,26426,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...
2,2021-04-13 06:44:54,2021-04-13,06:44:54,http://twitter.com/1/statuses/1381860932247121923,http://twitter.com/1/statuses/1381860932247121923,twitter.com,,positive,twitter,1.381861e+18,...,,,1.381805e+18,23152,0.0,90651,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...
3,2021-04-13 06:45:07,2021-04-13,06:45:07,http://twitter.com/1/statuses/1381860988224380931,http://twitter.com/1/statuses/1381860988224380931,twitter.com,,positive,twitter,1.381861e+18,...,,,1.381805e+18,23152,0.0,54502,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...
4,2021-04-13 06:45:13,2021-04-13,06:45:13,http://twitter.com/1/statuses/1381861011670495234,http://twitter.com/1/statuses/1381861011670495234,twitter.com,,negative,twitter,1.381861e+18,...,,,1.381594e+18,10751,0.0,45665,False,Twitter,0,[RT@kksnow: Great news today: high-level Chine...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59994,2021-05-11 12:30:32,2021-05-11,12:30:32,http://twitter.com/1/statuses/1392094773050109962,http://twitter.com/1/statuses/1392094773050109962,twitter.com,,negative,twitter,1.392095e+18,...,,,1.392053e+18,4406,0.0,58833,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu..."
59995,2021-05-11 12:30:40,2021-05-11,12:30:40,http://twitter.com/1/statuses/1392094805572808709,http://twitter.com/1/statuses/1392094805572808709,twitter.com,,negative,twitter,1.392095e+18,...,,,1.392053e+18,4406,0.0,135307,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu..."
59996,2021-05-11 12:30:44,2021-05-11,12:30:44,http://twitter.com/1/statuses/1392094823809617923,http://twitter.com/1/statuses/1392094823809617923,twitter.com,,negative,twitter,1.392095e+18,...,,,1.392053e+18,4406,0.0,5035,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu..."
59997,2021-05-11 12:30:49,2021-05-11,12:30:49,http://twitter.com/1/statuses/1392094842885349381,http://twitter.com/1/statuses/1392094842885349381,twitter.com,,negative,twitter,1.392095e+18,...,,,1.392053e+18,4406,0.0,332207,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu..."
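
The translated cells are shown in square brackets because `translate` returns a list, so each cell holds a one-element list rather than a string. A minimal sketch of unwrapping such values is shown below. `unwrap_single` is a hypothetical helper; it could be applied with `df['Translated Content'].apply(unwrap_single)`.

```python
def unwrap_single(value):
    """Return the only element of a one-item list; leave anything else unchanged."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value


print(unwrap_single(["Great news today"]))  # Great news today
```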


In [10]:
# Export the translated content first so that it is saved even if a later step fails
df.to_excel('TRANSLATED CONTENT Only.xlsx', index=False)
# Also export a CSV file, because the Excel writer ignores URLs beyond its per-worksheet limit
df.to_csv('TRANSLATED CONTENT Only.csv', index=False)

eds Excel's limit of 65,530 URLS per worksheet.
  warn("Ignoring URL '%s' since it exceeds Excel's limit of "
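
The warning above quotes a limit of 65,530 URLs per worksheet. A rough sketch of checking a table against that limit before exporting is shown below. `count_url_cells` and `would_exceed_limit` are hypothetical helpers, and treating every string that starts with `http://` or `https://` as a URL is an assumption about what the writer converts. With roughly 60,000 rows and two URL-valued columns (`Title` and `Url` in the preview), the estimate is about 120,000 URLs, which is well over the limit.

```python
EXCEL_URL_LIMIT = 65_530  # per worksheet, as quoted in the warning


def count_url_cells(rows) -> int:
    """Count string cells that look like web URLs (assumed prefix check)."""
    return sum(
        1
        for row in rows
        for cell in row
        if isinstance(cell, str) and cell.startswith(("http://", "https://"))
    )


def would_exceed_limit(url_count: int, limit: int = EXCEL_URL_LIMIT) -> bool:
    """Return True if `url_count` is above the per-worksheet limit."""
    return url_count > limit


# Three illustrative rows: two URL cells and one plain-text cell each
rows = [("http://twitter.com/1/statuses/1", "http://twitter.com/1/statuses/1", "twitter.com")] * 3
print(count_url_cells(rows))  # 6
print(would_exceed_limit(60_000 * 2))  # True
```

XlsxWriter has a `strings_to_urls` workbook option that turns this conversion off; how to pass it through pandas depends on the pandas version.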

In [11]:
%%time
# Translating Title
df['Translated Title'] = df['Title'].apply(translate)

CPU times: user 3h 4min 25s, sys: 17.5 s, total: 3h 4min 43s
Wall time: 3h 5min 25s


In [12]:
# Look at both Translated Title and Translated Content Here
df

Unnamed: 0,Datetime,Date,Time,Title,Url,Domain,Tags,Sentiment,Page Type,tweet_id,...,Twitter Reply to,Twitter Retweet of,Twitter Retweets,Twitter Likes,Twitter Tweets,Twitter Verified,Page Type Name,Reddit Score,Translated Content,Translated Title
0,2021-04-13 06:44:07,2021-04-13,06:44:07,http://twitter.com/1/statuses/1381860734108196865,http://twitter.com/1/statuses/1381860734108196865,twitter.com,,negative,twitter,1.381861e+18,...,,1.381594e+18,10751,0.0,34853,False,Twitter,0,[RT@kksnow: Great news today: high-level Chine...,[http: /twitter.com/1/status/138187341086865]
1,2021-04-13 06:44:48,2021-04-13,06:44:48,http://twitter.com/1/statuses/1381860905206521856,http://twitter.com/1/statuses/1381860905206521856,twitter.com,,positive,twitter,1.381861e+18,...,,1.381805e+18,23152,0.0,26426,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...,[http: /twitter.com/1/statuses/138186090521856]
2,2021-04-13 06:44:54,2021-04-13,06:44:54,http://twitter.com/1/statuses/1381860932247121923,http://twitter.com/1/statuses/1381860932247121923,twitter.com,,positive,twitter,1.381861e+18,...,,1.381805e+18,23152,0.0,90651,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...,[http: /twitter.com/1/status/1381893 22471223]
3,2021-04-13 06:45:07,2021-04-13,06:45:07,http://twitter.com/1/statuses/1381860988224380931,http://twitter.com/1/statuses/1381860988224380931,twitter.com,,positive,twitter,1.381861e+18,...,,1.381805e+18,23152,0.0,54502,False,Twitter,0,[RT@Pran2844: China's sinovac vaccine is a str...,[http: /twitter.com/1/status/1381888224380931]
4,2021-04-13 06:45:13,2021-04-13,06:45:13,http://twitter.com/1/statuses/1381861011670495234,http://twitter.com/1/statuses/1381861011670495234,twitter.com,,negative,twitter,1.381861e+18,...,,1.381594e+18,10751,0.0,45665,False,Twitter,0,[RT@kksnow: Great news today: high-level Chine...,[http: /twitter.com/1/status/138181070495234]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59994,2021-05-11 12:30:32,2021-05-11,12:30:32,http://twitter.com/1/statuses/1392094773050109962,http://twitter.com/1/statuses/1392094773050109962,twitter.com,,negative,twitter,1.392095e+18,...,,1.392053e+18,4406,0.0,58833,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu...",[http: /twitter.com/1/statuses/ 13920947050109...
59995,2021-05-11 12:30:40,2021-05-11,12:30:40,http://twitter.com/1/statuses/1392094805572808709,http://twitter.com/1/statuses/1392094805572808709,twitter.com,,negative,twitter,1.392095e+18,...,,1.392053e+18,4406,0.0,135307,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu...",[http: /twitter.com/1/statuses/ 13920948055280...
59996,2021-05-11 12:30:44,2021-05-11,12:30:44,http://twitter.com/1/statuses/1392094823809617923,http://twitter.com/1/statuses/1392094823809617923,twitter.com,,negative,twitter,1.392095e+18,...,,1.392053e+18,4406,0.0,5035,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu...",[http: /twitter.com/1/status/ 1392098809617923]
59997,2021-05-11 12:30:49,2021-05-11,12:30:49,http://twitter.com/1/statuses/1392094842885349381,http://twitter.com/1/statuses/1392094842885349381,twitter.com,,negative,twitter,1.392095e+18,...,,1.392053e+18,4406,0.0,332207,False,Twitter,0,"[Now, most people are not just anti-vaxers, bu...",[http: /twitter.com/1/statuses/ 13920948428853...
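
In the rows shown above, the `Title` values are URLs, and the translated versions no longer match the originals (digits are dropped and spaces are inserted). A minimal sketch of leaving URL-only values untouched is shown below. `is_url_only` and `translate_unless_url` are hypothetical helpers, the regular expression is a deliberately simple assumption about what counts as a URL, and `translate_one` stands in for any function that maps one string to its translation.

```python
import re
from typing import Callable

# A value counts as URL-only if it is a single http(s) URL without whitespace
_URL_ONLY = re.compile(r"https?://\S+", re.IGNORECASE)


def is_url_only(text: str) -> bool:
    """Return True if `text` consists of a single web URL and nothing else."""
    return bool(_URL_ONLY.fullmatch(text.strip()))


def translate_unless_url(text: str, translate_one: Callable[[str], str]) -> str:
    """Translate `text` unless it is just a URL, in which case return it unchanged."""
    if is_url_only(text):
        return text
    return translate_one(text)


url = "http://twitter.com/1/statuses/1381860734108196865"
print(translate_unless_url(url, str.upper) == url)  # True
print(translate_unless_url("hello", str.upper))  # HELLO
```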


In [13]:
# Export both the translated title and the translated content. Use this file rather than the first one, as it is the complete version.
df.to_excel('TRANSLATED.xlsx', index=False)
# Also export a CSV file, because the Excel writer ignores URLs beyond its per-worksheet limit
df.to_csv('TRANSLATED.csv', index=False)

eds Excel's limit of 65,530 URLS per worksheet.
  warn("Ignoring URL '%s' since it exceeds Excel's limit of "