<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# Back translation

The basic idea of this method is to translate the input text into another language and then translate it back into the original language. This yields a new text that uses different words while preserving the meaning of the original. To do so, we can use translation APIs such as Google Translate, Bing, or Yandex.

<center>
<figure>
<img src="https://lh6.googleusercontent.com/x3ZAhTDLT1QVSD8gCdaBVMquM2dcYA15A-orfzXyTzhTP8m0ZKLXz_2NrJdWlTgWKRS7BimExM8RO9Ce_uVVVdRR29vGeP0VZdncDZY0GTwkctocQyYg7HK9VL5ay3QC4JhbSXBK">

<figcaption>Amit Chaudhary “Back Translation for Text Augmentation with Google Sheets”</figcaption>
</figure>
</center>

In the image above, the original and the generated sentences have the same meaning.
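To make the round trip concrete, here is a toy sketch. It does not call any translation service: the two word tables `TO_PIVOT` and `FROM_PIVOT` are hypothetical stand-ins for a real translator, and `round_trip` is an illustrative name. It only shows why going through a pivot language can change the wording: several source words may map to the same pivot word, which then maps back to a single word.

```python
# Hypothetical word tables standing in for a real translation service.
TO_PIVOT = {"mad": "enfadado", "angry": "enfadado", "town": "ciudad", "city": "ciudad"}
FROM_PIVOT = {"enfadado": "angry", "ciudad": "city"}


def translate_words(text, table):
    """Replace every word found in the table; unknown words are kept as they are."""
    return " ".join(table.get(word, word) for word in text.split())


def round_trip(text):
    """Translate to the pivot language and back again."""
    pivot_text = translate_words(text, TO_PIVOT)
    return translate_words(pivot_text, FROM_PIVOT)


print(round_trip("i am so mad at her"))
```

Real translation systems work on whole sentences rather than isolated words, so the paraphrases they produce are usually richer than this word-level illustration.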





Let's see how to implement this method. We will use the **translators** library, which aims to provide free, multi-service, and easy-to-use translation in Python.

It is built on the translation interfaces of Google, Yandex, Microsoft (Bing), Baidu, Alibaba, Tencent, NetEase (Youdao), Sogou, Kingsoft (Iciba), Iflytek, Niutrans, Lingvanex, Naver (Papago), Deepl, Reverso, Itranslate, Caiyun, TranslateCom, Mglip, Utibet, Argos, etc.

In [1]:
!pip install translators


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting translators
  Downloading translators-5.6.3-py3-none-any.whl (36 kB)
Collecting pathos>=0.2.9
  Downloading pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting requests>=2.28.1
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
Collecting PyExecJS>=1.5.1
  Downloading PyExecJS-1.5.1.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting dill>=0.3.6
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting multiprocess>=0.70.14
  Downloading multiprocess-0.70.14-py39-none-any.whl (132

In the following cell, we will implement some functions to generate new sentences by using back translation:

In [21]:
import pandas as pd
# note: the current version of translators prints log messages, which can be noisy
import translators as ts
from multiprocessing import Pool
from tqdm import tqdm

LANG = 'en'
API = 'bing'


def translator_constructor(api):
    """Return the translation function of the translators module that
    corresponds to the given API name."""
    if api == 'google':
        return ts.google
    elif api == 'bing':
        return ts.bing
    elif api == 'baidu':
        return ts.baidu
    elif api == 'sogou':
        return ts.sogou
    elif api == 'youdao':
        return ts.youdao
    elif api == 'tencent':
        return ts.tencent
    elif api == 'alibaba':
        return ts.alibaba
    else:
        raise NotImplementedError(f'{api} translator is not supported!')


def translate(x):
    """Back-translate one instance. x[0] is the text and x[1] is its language code.
    The text is translated from its language to LANG and then back again.
    Returns [original text, generated text], or [original text, None] on failure."""
    text = x[0]
    lang = x[1]
    print(' lang: ' + lang + ' LANG:' + LANG)

    try:
        translator = translator_constructor(API)
        try:
            # translate the text from lang to LANG
            new_text = translator(text, lang, LANG)
        except Exception:
            # if the forward translation fails, fall back to the original text
            new_text = text

        # translate the text back from LANG to lang
        new_text = translator(new_text, LANG, lang)
        return [text, new_text]

    except Exception:
        # any other failure (unsupported API, network error, ...) is reported as None
        return [text, None]


def imap_unordered_bar(func, args, n_processes: int = 48):
    """This function parallelizes the execution of a function, func, over the
    items of args and shows a progress bar. Results are collected in completion
    order, which may differ from the input order."""

    p = Pool(n_processes, maxtasksperchild=100)
    res_list = []
    # a single progress bar over the result iterator is enough
    for res in tqdm(p.imap_unordered(func, args), total=len(args)):
        res_list.append(res)
    p.close()
    p.join()
    return res_list
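The functions above call per-service entry points such as `ts.bing`. As an alternative, here is a minimal sketch that assumes the installed release of translators exposes a generic `translate_text(query_text, translator=..., from_language=..., to_language=...)` function; treat that call as an assumption and check it against the installed version. `make_translate_fn` and `back_translate_verbose` are hypothetical helper names. Unlike `translate`, the sketch returns the error message instead of silently returning `None`, which makes failed calls easier to diagnose.

```python
def make_translate_fn(api="bing"):
    """Return f(text, src, dst) backed by translators.translate_text (assumed API)."""
    import translators as ts  # imported lazily, so the helper below also works offline

    def translate_fn(text, src, dst):
        # Assumption: translate_text accepts these keyword arguments.
        return ts.translate_text(text, translator=api,
                                 from_language=src, to_language=dst)

    return translate_fn


def back_translate_verbose(text, lang, translate_fn, pivot="en"):
    """Return (original text, back-translated text, error message or None)."""
    try:
        pivot_text = translate_fn(text, lang, pivot)       # lang -> pivot
        new_text = translate_fn(pivot_text, pivot, lang)   # pivot -> lang
        return text, new_text, None
    except Exception as exc:
        # keep the reason for the failure instead of discarding it
        return text, None, repr(exc)
```

For example, `back_translate_verbose("Hola mundo", "es", make_translate_fn("bing"))` would perform the round trip through English, provided the service is reachable. Passing the translation function as an argument also lets us test the logic with a fake translator.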

We will apply it to a dataset that contains sentences in different languages (https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data). (We only load a small sample.)

In [19]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
path = "/content/drive/My Drive/Colab Notebooks/data/toxic/sample_jigsaw_toxic.csv"

df = pd.read_csv(path) 

print('dataset was loaded ', df.shape)
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
dataset was loaded  (51, 4)


Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


In [22]:
# tqdm.pandas('Translation progress')
outputs = imap_unordered_bar(translate, df[['comment_text', 'lang']].values)

  0%|          | 0/51 [00:00<?, ?it/s]
 lang: it LANG:en
 lang: tr LANG:en lang: tr LANG:en lang: tr LANG:en lang: es LANG:en lang: es LANG:en
 lang: es LANG:en
 lang: tr LANG:en
 lang: es LANG:en lang: es LANG:en
 lang: tr LANG:en lang: it LANG:en lang: es LANG:en lang: it LANG:en lang: tr LANG:en lang: es LANG:en lang: es LANG:en lang: es LANG:en lang: tr LANG:en lang: tr LANG:en lang: it LANG:en
 lang: tr LANG:en lang: tr LANG:en lang: es LANG:en lang: it LANG:en lang: es LANG:en lang: es LANG:en lang: it LANG:en lang: tr LANG:en lang: es LANG:en lang: it LANG:en lang: tr LANG:en lang: it LANG:en lang: tr LANG:en lang: it LANG:en lang: it LANG:en lang: it LANG:en lang: tr LANG:en lang: it LANG:en lang: it LANG:en lang: es LANG:en
 lang: es LANG:en lang: es LANG:en lang: tr LANG:en lang: es LANG:en lang: it LANG:en lang: tr LANG:en
 lang: es LANG:en
 lang: it LANG:en
100%|██████████| 51/51 [00:00<00:00, 152.09it/s]
 lang: es LANG:en
 lang: tr LANG:en

In [16]:
for i in range(len(outputs)):
    print('Original text:', outputs[i][0])
    print('Generated text:', outputs[i][1])
    print()

Original text: Este usuario ni siquiera llega al rango de    hereje   . Por lo tanto debería ser quemado en la barbacoa para purificar su alma y nuestro aparato digestivo mediante su ingestión.    Skipe linkin 22px   Honor, valor, leltad.      17:48 13 mar 2008 (UTC)
Generated text: None

Original text: Il testo di questa voce pare esser scopiazzato direttamente da qui. Immagino possano esserci problemi di copyright, nel fare cio .
Generated text: None

Original text: Vale. Sólo expongo mi pasado. Todo tiempo pasado fue mejor, ni mucho menos, yo no quisiera retroceder 31 años a nivel particular. Las volveria a pasar putas.Fernando 
Generated text: None

Original text: Bu maddenin alt başlığı olarak  uluslararası ilişkiler  ile konuyu sürdürmek ile ilgili tereddütlerim var.Önerim siyaset bilimi ana başlığından sonra siyasal yaşam ve toplum, siyasal güç, siyasal çatışma, siyasal gruplar, çağdaş ideolojiler, din, siyasal değişme, kamuoyu, propaganda ve siyasal katılma temelinde çoğulcu si
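Several generated texts above are `None`, which means the translation call failed for those instances. Before using the results for augmentation, it helps to drop such pairs, as well as pairs where the round trip returned the input unchanged. The following is a minimal sketch; `usable_pairs` is a hypothetical helper name, and it assumes `outputs` is a list of `[original, generated]` pairs as returned by `imap_unordered_bar`.

```python
def usable_pairs(outputs):
    """Keep only pairs whose generated text exists and differs from the original."""
    kept = []
    for original, generated in outputs:
        if generated is None:
            continue  # the translation failed
        if generated.strip().lower() == original.strip().lower():
            continue  # the round trip did not change the text
        kept.append((original, generated))
    return kept


# Small hand-made example (not taken from the dataset).
sample = [
    ["I am so mad at her.", "I am so angry with her."],
    ["I do love my pets", None],
    ["Hello", "hello "],
]
print(usable_pairs(sample))
```

If `usable_pairs(outputs)` comes back empty, the translation step itself needs attention (for example, the chosen API or the network connection) before any augmentation can happen.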

## Libraries that already implement back translation:

Fortunately, some Python libraries already implement back translation.
One of them is textaugment. Let's try it.





In [None]:
! pip install textaugment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textaugment
  Downloading textaugment-1.3.4-py3-none-any.whl (16 kB)
Collecting googletrans
  Downloading googletrans-3.0.0.tar.gz (17 kB)
Collecting httpx==0.13.3
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting rfc3986<2,>=1.3
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl (31 kB)
Collecting sniffio
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting httpcore==0.9.*
  Downloading httpcore-0.9.1-py3-none-any.whl (42 kB)
Collecting hstspreload
  Downloading hstspreload-2022.11.1-py3-none-any.whl (1.4 MB)
Collecting h2==3.*
  Downloading h2-3.2.0-py2.py3-none-any.whl (65 kB)
Collecting h11<0.10,>=0.8

In [None]:
from textaugment import Translate
SRC = "en" # source language of the sentence
TO = "es" # target language


texts = ["John is going to town",   
         "I want to be a computer engineer.",
         "It usually rains everyday here",
         "I am so mad at her.",
         'I do love my pets',
         'I have no money at the moment'
    ]
t = Translate(src=SRC, to=TO)
for text in texts:
    print('original text:', text)
    print('augmented text:', t.augment(text))
    print()


original text: John is going to town
augmented text: john goes to the city

original text: I want to be a computer engineer.
augmented text: i want to be computer engineer.

original text: It usually rains everyday here
augmented text: usually, it rains every day here

original text: I am so mad at her.
augmented text: i am so angry with her.

original text: I do love my pets
augmented text: i love my pets

original text: I have no money at the moment
augmented text: i don't have money right now
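The augmented sentences can then be added to a labeled training set, each one inheriting the label of the sentence it was generated from. Here is a minimal sketch of that step. `augment_dataset` is a hypothetical helper name, and the labels in the example are invented for illustration; the augmentation function is passed in as an argument, so `t.augment` from the cell above could be used in its place.

```python
def augment_dataset(texts, labels, augment_fn):
    """Return (text, label) pairs: every original plus its augmented variant."""
    augmented = []
    for text, label in zip(texts, labels):
        augmented.append((text, label))
        new_text = augment_fn(text)
        # skip failed or unchanged augmentations (comparison ignores case)
        if new_text and new_text.lower() != text.lower():
            augmented.append((new_text, label))
    return augmented


# Stand-in for t.augment, so the example runs without network access.
def fake_augment(text):
    return {"I am so mad at her.": "i am so angry with her."}.get(text, text)


pairs = augment_dataset(["I am so mad at her.", "I do love my pets"], [1, 0], fake_augment)
print(pairs)
```

With the real library, the call would be `augment_dataset(texts, labels, t.augment)`, where `labels` is a list with one label per sentence in `texts`.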

