# Translating from English to LOTE (Language Other Than English)

This notebook translates a list of English words into another language.

I need to sit a translation exam, but there is little material available online for the Indonesian-language translation exam. A friend who is preparing for the English-to-Tamil exam gave me a vocabulary list (~1.5k words). I need to translate these words into Indonesian, but doing it by hand has several problems:
- The file contains text in both English and Tamil; I only need the English words.
- Google Translate accepts only about 150 lines (150 words) per paste.
- Pasting the translation of those 150 words back into Excel puts everything in one cell.
- Copying the translation result gives only the Indonesian words. I need `English word - Indonesian word` for each English word, one pair per row.

Doing all of this by hand is too much trouble, especially since the Excel workbook would need manual cleanup around 10 times. So I wrote a bit of code to do the translation.

I also convert the translation results to audio so that I can listen to the words on my phone anywhere, anytime, whether I am doing chores or just resting. I use the `deep_translator` library for translation and the `gtts` library for text-to-speech conversion.

This code has also helped some of my friends who needed translations into Russian and Hindi, so I hope it is useful for other people who are looking for a translation workflow in Python.

For this demonstration I use `sample_tamil_vocab.txt`, which contains only 20 words.

In [1]:
#Importing libraries

import pandas as pd
import re
import os
from tqdm import tqdm
from gtts import gTTS
from deep_translator import GoogleTranslator

In [2]:
#### SETTING LANGUAGE ####

LANGUAGE = 'indonesian'
LANGUAGE_CODE = 'id'

print(f'Selected language: {LANGUAGE} with language code of {LANGUAGE_CODE}')

Selected language: indonesian with language code of id
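
`LANGUAGE` and `LANGUAGE_CODE` have to agree, and both are set again in the audio section, so one option is to derive the code from the name. This is a minimal sketch that assumes a hand-written lookup table; `LANGUAGE_CODES` and `language_code` are hypothetical names, and the table only covers the languages mentioned in this notebook.

```python
# Hypothetical lookup table: language name -> ISO 639-1 code.
LANGUAGE_CODES = {
    'indonesian': 'id',
    'russian': 'ru',
    'hindi': 'hi',
    'tamil': 'ta',
}


def language_code(language):
    """Return the code for a language name, ignoring case and surrounding spaces."""
    key = language.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f'No language code registered for {language!r}')
    return LANGUAGE_CODES[key]


LANGUAGE = 'indonesian'
LANGUAGE_CODE = language_code(LANGUAGE)
print(f'Selected language: {LANGUAGE} with language code of {LANGUAGE_CODE}')
```

With this in place only `LANGUAGE` needs to be edited when switching languages, and a typo fails loudly instead of silently sending the wrong code to the translator.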


In [3]:
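#Read the file and extract the English word from each line with a regex

df = pd.read_csv('./input_file/sample_tamil_vocab.txt', on_bad_lines='skip')

#Capture the text between the "N." numbering and the first -, – or = separator
df['word'] = df['english'].str.extract(r'((?<=\d\.).*?(?=-|–|=))')

df['word'] = df['word'].str.lower().str.strip()

#The regex stops at the first hyphen, so the entry for x-ray is cut to "x"; restore it
df.loc[df['word']=='x', 'word'] = 'x-ray'

df.head()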

Unnamed: 0,english,word
0,1. Magistrate - குற்றவியல் நடுவர்,magistrate
1,2. Office - அலுவலகம்,office
2,3. Lawyer - வழக்கறிஞர்,lawyer
3,4. Application - விண்ணப்பம்,application
4,5. Appointment - முன்னேற்பாடு / நியமேம் / உத்த...,appointment
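
To see what the pattern in the cell above captures, the sketch below runs it on plain strings with the standard `re` module. It also tries an alternative pattern that requires whitespace around the separator, which is an assumption about how the input file is laid out and not something the original code does. The function names are hypothetical, and the second sample line is a made-up placeholder.

```python
import re

# Same pattern as the cell above: text after "N." up to the first -, – or =.
ORIGINAL_PATTERN = re.compile(r'(?<=\d\.).*?(?=-|–|=)')

# Assumed alternative: the separator must have whitespace on both sides,
# so a hyphen inside a word is kept.
SPACED_PATTERN = re.compile(r'\d+\.\s*(.*?)\s+[-–=]\s')


def extract_original(line):
    """Apply the notebook's pattern, then lowercase and strip the result."""
    match = ORIGINAL_PATTERN.search(line)
    return match.group(0).strip().lower() if match else None


def extract_spaced(line):
    """Apply the whitespace-aware pattern, then lowercase and strip the result."""
    match = SPACED_PATTERN.search(line)
    return match.group(1).strip().lower() if match else None


samples = [
    '1. Magistrate - குற்றவியல் நடுவர்',
    '6. X-ray - placeholder',  # made-up line to show the hyphen case
]
for line in samples:
    print(extract_original(line), '|', extract_spaced(line))
```

Both patterns agree on the first line. On the hyphenated line the original pattern stops at the first hyphen and returns `x`, which is why the cell above patches that value by hand, while the whitespace-aware pattern returns `x-ray`.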


In [4]:
#Build a list of the unique cleaned (lowercased, stripped) words
word_list = list(df['word'].unique())
word_list[:5]

['magistrate', 'office', 'lawyer', 'application', 'appointment']

In [5]:
#Sample of Google Translator
translated = GoogleTranslator(source='auto', target=LANGUAGE_CODE).translate("keep it up, you are awesome")

translated

'tetap semangat, kalian luar biasa'

In [6]:
#Loop over each word and translate it
word_dictionary = {}
translator = GoogleTranslator(source='en', target=LANGUAGE_CODE)

for word in tqdm(word_list):
    try:
        word_dictionary[word] = translator.translate(word)
    except Exception:
        #Mark words that could not be translated so they can be checked later
        word_dictionary[word] = 'UNKNOWN'

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.34it/s]
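
Since every word triggers a network request, a temporary failure marks a word as `'UNKNOWN'` even though a second attempt might succeed. The sketch below adds a simple retry around the same idea. `translate_words`, `retries`, and `wait_seconds` are hypothetical names, and `fake_translate` is only an offline stand-in for `GoogleTranslator(...).translate` so that the example runs without internet access.

```python
import time


def translate_words(words, translate_fn, retries=2, wait_seconds=1.0):
    """Translate each word, retrying failures; returns (translations, failed)."""
    translations = {}
    failed = []
    for word in words:
        for attempt in range(retries + 1):
            try:
                translations[word] = translate_fn(word)
                break
            except Exception:
                # Pause before the next attempt, but not after the last one.
                if attempt < retries:
                    time.sleep(wait_seconds)
        else:
            # Every attempt failed: use the same placeholder as the loop above.
            translations[word] = 'UNKNOWN'
            failed.append(word)
    return translations, failed


def fake_translate(word):
    """Offline stand-in for a real translator; raises KeyError for unknown words."""
    lookup = {'office': 'kantor', 'lawyer': 'pengacara'}
    return lookup[word]


translations, failed = translate_words(
    ['office', 'lawyer', 'zzz'], fake_translate, wait_seconds=0
)
print(translations)
print(failed)
```

Returning the failed words separately makes it easy to re-run only those later instead of searching the output for `'UNKNOWN'`.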


In [7]:
#Save it to a dataframe
word_df = pd.DataFrame(list(word_dictionary.items()), columns=['english', 'translated'])

In [8]:
word_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   english     20 non-null     object
 1   translated  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


In [9]:
#Save the same comma-separated data under both a .txt and a .csv extension
word_df.to_csv(f'./output_file/{LANGUAGE}_word_translation.txt', index=False)
word_df.to_csv(f'./output_file/{LANGUAGE}_word_translation.csv', index=False)
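
The goal stated at the top was one `English word - Indonesian word` pair per row. The files above store the two values as comma-separated columns, so the sketch below shows one way to build that display format from a dictionary of translations. `format_pairs` and its `separator` parameter are hypothetical names, and the sample entries are taken from the translation results in this notebook.

```python
def format_pairs(word_dictionary, separator=' - '):
    """Return one 'english - translated' line per entry, in insertion order."""
    return [
        f'{english}{separator}{translated}'
        for english, translated in word_dictionary.items()
    ]


sample = {'magistrate': 'hakim', 'office': 'kantor', 'appointment': 'janji temu'}
lines = format_pairs(sample)
print('\n'.join(lines))
```

The joined string can be written to a text file or pasted into a notes app as a study sheet.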

# Converting to Audio


I could have converted to audio inside the previous loop, but the idea only came up after I had finished translating the words, so the audio conversion lives in its own section.

The `gtts` library has a limit of 1,000 words daily.
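
With a full vocabulary list, one way to stay under such a limit is to split the words into batches and write one audio file per batch, which is what the `audio_number` suffix in the file name below allows. This is a sketch only: `chunked` is a hypothetical helper, the batch size of 3 is just for illustration, and the word pairs are made-up placeholders.

```python
from itertools import islice


def chunked(pairs, size):
    """Yield successive lists of at most `size` items from an iterable."""
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up placeholder data standing in for the real word dictionary.
word_dict = {f'word{i}': f'kata{i}' for i in range(7)}

for audio_number, chunk in enumerate(chunked(word_dict.items(), 3), start=1):
    # Each chunk would be passed to the audio loop and saved under its own number.
    print(f'sample_indonesian_translation_{audio_number}.mp3', len(chunk))
```

Each batch can then be converted on a different day if needed, and the shorter files are also easier to navigate on a phone.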

In [10]:
#Setting up the variables again
LANGUAGE = 'indonesian'
LANGUAGE_CODE = 'id'


word_df = pd.read_csv(f'./output_file/{LANGUAGE}_word_translation.txt')
word_df.head()

Unnamed: 0,english,translated
0,magistrate,hakim
1,office,kantor
2,lawyer,pengacara
3,application,aplikasi
4,appointment,janji temu


In [11]:
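#Set a file name and the audio number
file_name = f'sample_{LANGUAGE}_translation'
audio_number = '1'

#Drop rows with missing values, then map each English word to its translation
frame = word_df.dropna()

word_dict = frame.set_index('english')['translated'].to_dict()

#Each pair is written as: English word, the spoken cue "translation", translated word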

with open(f'{file_name}_{audio_number}.mp3', 'wb') as ff:
    for eng, lang in tqdm(word_dict.items()):
        tts1 = gTTS(text=eng, lang='en', slow=True)
        tts1.write_to_fp(ff)
        tts_stop = gTTS(text="translation", lang='en', slow=True)
        tts_stop.write_to_fp(ff)
        tts2 = gTTS(text=lang, lang=LANGUAGE_CODE, slow=True)
        tts2.write_to_fp(ff)

        
#Play the audio (running the file path as a command opens the default player on Windows only)
os.system(f'{file_name}_{audio_number}.mp3')

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:25<00:00,  1.29s/it]


0
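
Running the file path through `os.system` only opens a player on Windows. The sketch below picks an opener command per platform instead. `open_command` and `play` are hypothetical names, and the Linux branch assumes a desktop with `xdg-open` available.

```python
import subprocess
import sys


def open_command(path, platform=sys.platform):
    """Return the command list that opens `path` with the default application."""
    if platform.startswith('win'):
        # `start` is a cmd built-in; the empty string is its window-title argument.
        return ['cmd', '/c', 'start', '', path]
    if platform == 'darwin':
        return ['open', path]
    # Assumed default for Linux desktops with xdg-utils installed.
    return ['xdg-open', path]


def play(path):
    """Open the audio file with the default player and return the exit code."""
    return subprocess.run(open_command(path), check=False).returncode


# Only print the command here; call play(...) to actually open the file.
print(open_command('sample_indonesian_translation_1.mp3', platform='linux'))
```

Keeping the command construction separate from `play` makes it possible to check the chosen command without launching anything.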