## **Translating XML documents using Google Cloud Translation API**

### **Introduction**

When working with XML documents, a common challenge is adapting the content for audiences who speak different languages. One common approach is to use the Google Cloud Translation API. However, XML documents contain many tags, and those tags count toward the number of characters sent to the API.

As the product documentation explains, Translation API usage is billed by the number of characters sent to the API, so every tag included in a request adds to the cost. More details here: https://cloud.google.com/translate/pricing
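
To get a feel for how much the markup inflates a request, the sketch below compares the length of a small XML fragment with the length of its text alone. The fragment and the `text_only` helper are made up for illustration and are not part of the notebook; the regex gives only a rough count and is not a real XML parser.

```python
import re

# Hypothetical fragment, loosely modelled on the structure of the sample document.
xml_fragment = (
    '<materia><id>1</id>'
    '<integra><font face="arial" size="3">Bom dia</font></integra>'
    '</materia>'
)


def text_only(xml):
    """Drop anything that looks like a tag and collapse the remaining whitespace."""
    return ' '.join(re.sub(r'<[^>]+>', ' ', xml).split())


raw_chars = len(xml_fragment)
text_chars = len(text_only(xml_fragment))
print(f'with tags: {raw_chars} characters, text only: {text_chars} characters')
```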

In this tutorial we will go step by step through the following tasks using the Translation API:
- read an XML file
- parse its relevant tags and extract their contents
- translate those contents into a different language
- create a new XML file, in the same format, with the contents translated

**It is important to emphasize:** all of these results can be produced quickly and with high confidence, without manually training, optimizing, and testing machine learning models or writing machine learning code, which drastically reduces the time to market of the resulting solutions.

It is also very important to note that this is a prototype and a demonstration of how it could be done. For production use, further development, validation, and testing are required.

### **Preparation Steps**

Before starting, it is important to:

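**a) First, install the required Python modules:**
- bs4 # BeautifulSoup4
- lxml # parser used by BeautifulSoup in this notebook
- python-dateutil # used to detect date strings
- google-cloud-translate

To install those packages, simply run:

`$ sudo pip3 install bs4 lxml python-dateutil google-cloud-translate`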

**b) Make sure you have a valid Application Credential to use in this exercise**

If you don't have one, or if you are not sure, please check: https://cloud.google.com/docs/authentication/getting-started
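
As a quick sanity check before running the cells, something like the sketch below can confirm that the `GOOGLE_APPLICATION_CREDENTIALS` variable used in this notebook points to an existing file. The `credentials_path` helper is hypothetical, and it covers only the key-file approach used here; other authentication methods are outside its scope.

```python
import os


def credentials_path(env=os.environ):
    """Return the key file path if the variable is set and the file exists, else None."""
    path = env.get('GOOGLE_APPLICATION_CREDENTIALS')
    if path and os.path.isfile(path):
        return path
    return None


if credentials_path() is None:
    print('GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file')
```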

Then follow and execute the code cells below to perform the activity.

In [1]:
# import initial required modules
import os
from math import ceil
from dateutil.parser import parse

In [2]:
## configure environment variables for authentication
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<add your key information here>'
sourceFile = 'data/serv_aenews_google_mat_33951343.xml'
target_language = 'en'
payload = ''
glossary = {}
dict_idx = 0

In [3]:
## basic functions to validate tags contents
# useful to skip numeric and datetime information (not useful for translation)


def checkFloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False


def checkDate(string):
    try:
        parse(string, fuzzy=False)
        return True
    # dateutil raises OverflowError for dates too large for the platform's integers
    except (ValueError, OverflowError):
        return False
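
One thing worth knowing about the `float()` test is that it accepts more than plain decimals. The sketch below repeats the same check under a different name, `looks_numeric` (a stand-in so the example runs on its own), and tries it on a few sample strings chosen for illustration.

```python
def looks_numeric(string):
    """Same test as checkFloat: does float() accept the string?"""
    try:
        float(string)
        return True
    except ValueError:
        return False


# float() also accepts surrounding whitespace, exponents, 'nan' and 'Infinity',
# but not a decimal comma such as '5,20'.
for sample in ['12.5', '1e3', ' 42 ', 'nan', 'Infinity', '5,20', 'R$ 5,20']:
    print(repr(sample), looks_numeric(sample))
```

In practice this means a tag whose whole text is a word such as "nan" or "Infinity" would be skipped rather than translated, while a price written with a decimal comma is not recognized as a number by this check.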

In [4]:
## replace tag contents with glossary keys
# specific parser for <font> tag and its subtags


def parseFont(tag):
    # direct child tags only; findChild() would return just the first one
    children = tag.find_all(recursive=False)
    if children:
        for child in children:
            parseTag(child)
    else:
        text = tag.text.strip()
        # skip empty, numeric and date-like contents (not useful for translation)
        if text and not text.isdigit() and \
                not checkDate(text) and not checkFloat(text):
            if text in glossary.values():
                # reuse the key of a string that is already in the glossary
                key = list(glossary.keys())[list(glossary.values()).index(text)]
            else:
                key = max(glossary) + 1 if glossary else 0
                glossary[key] = text
            # the tag now holds the numeric key instead of the text
            tag.string = str(key)

In [5]:
## replace tag contents with glossary keys
# parse all document tags (except the <font> tag)

from bs4.element import Tag


def parseTag(tag):
    if not isinstance(tag, Tag):
        return  # plain strings and comments have no tag to rewrite
    if tag.name == 'font':
        parseFont(tag)
        return
    # direct child tags only; text sitting next to child tags is left untouched
    children = tag.find_all(recursive=False)
    if children:
        for child in children:
            parseTag(child)
    else:
        text = tag.text.strip()
        # skip empty, numeric and date-like contents (not useful for translation)
        if text and not text.isdigit() and \
                not checkDate(text) and not checkFloat(text):
            if text in glossary.values():
                # reuse the key of a string that is already in the glossary
                key = list(glossary.keys())[list(glossary.values()).index(text)]
            else:
                key = max(glossary) + 1 if glossary else 0
                glossary[key] = text
            # the tag now holds the numeric key instead of the text
            tag.string = str(key)
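
The glossary idea used in the two cells above (store each distinct string once and leave a numeric key in its place) can be shown in isolation. The sketch below is a simplified variant, not the notebook's code: the `register` helper and the second `reverse` dictionary are assumptions added here so that lookups by text do not need to scan the list of values.

```python
def register(text, glossary, reverse):
    """Return the numeric key for text, adding it to both dicts if it is new."""
    text = text.strip()
    if text in reverse:
        return reverse[text]
    key = len(glossary)  # keys are assigned as 0, 1, 2, ... in order of arrival
    glossary[key] = text
    reverse[text] = key
    return key


demo_glossary, demo_reverse = {}, {}
keys = [register(s, demo_glossary, demo_reverse)
        for s in ['Bom dia', 'Mercado', ' Bom dia ']]
print(keys)           # [0, 1, 0]
print(demo_glossary)  # {0: 'Bom dia', 1: 'Mercado'}
```

A repeated string maps to the key it received the first time, so it is sent for translation only once.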

In [6]:
## basic functions to validate tags contents
# read the XML and import into a BeautifulSoup object
# it also saves the string contents into a python dictionary (free of tags)

from bs4 import BeautifulSoup


def soupFile(file, dict_idx, parse=True):
    with open(file, encoding='iso-8859-1') as xmlFile:    
        contents = xmlFile.read()
    soup = BeautifulSoup(contents, 'lxml')

    if parse:
        for tag in soup.find_all():
            if tag.name not in ['html', 'head', 'body', 'materias', 'materia', 'integra', 
                                'ul', 'li', 'table', 'tr', 'td', 'a']:
                parseTag(tag)
        return(soup, glossary)
    else:
        return(soup)

In [7]:
## translate the collected strings
# get the contents from the dictionary of strings and translate them in a single API request

from google.cloud import translate_v2 as translate

## Translate API client definition
client_translate = translate.Client()
target_language = 'en'


def translateData(data):
    # translate the string passed in; the source language is Portuguese
    translation = client_translate.translate(data,
                                             source_language='pt',
                                             target_language=target_language)
    return translation['translatedText']
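
For larger documents, a single payload may grow beyond what is practical to send in one request. One possible approach, sketched below, is to group the glossary strings into several payloads of bounded size and translate each one separately. The `batch_strings` helper and its `max_chars` parameter are hypothetical, and the value used in the demo is arbitrary rather than a documented API limit.

```python
def batch_strings(strings, max_chars, separator='<p>'):
    """Group strings into payloads whose length stays within max_chars.

    A single string longer than max_chars still gets a payload of its own.
    """
    batches, current, size = [], [], 0
    for text in strings:
        cost = len(text) + len(separator)
        if current and size + cost > max_chars:
            # the current payload is full: close it and start a new one
            batches.append(separator.join(current) + separator)
            current, size = [], 0
        current.append(text)
        size += cost
    if current:
        batches.append(separator.join(current) + separator)
    return batches


# max_chars=20 is an arbitrary demo value, not a documented API limit.
print(batch_strings(['Bom dia', 'Mercado', 'Bolsa'], max_chars=20))
```

Each payload keeps the same `<p>` separator convention as the notebook, so the translated pieces can be split and matched back to their keys in order.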

In [8]:
# parse the sample file present at data/ directory
# also print the glossary of strings got from the document
soup, glossary = soupFile(sourceFile, dict_idx)
glossary

{0: 'XML',
 1: 'Incertezas geradas pelo novo atrito do presidente Jair Bolsonaro com o presidente de Câmara, Rodrigo Maia, principalmente no que diz respeito à aprovação de medidas, apoiam a busca por proteção no mercado de câmbio, com reflexos nos juros futuros e bolsa. Após começo de dia de otimista, quando abriu em queda e marcou a mínima cotação do dia na casa de R$ 5,20, o dólar escalou mais de sete centavos, intensificando o ajuste positivo mais ao fim da manhã, paralelamente ao discurso do presidente Jair Bolsonaro na posse de Nelson Teich como Ministro da Saúde. Segundo profissionais, há temor sobre a votação do Orçamento de Guerra, em segundo turno, no Senado, como desdobramento do conflito Bolsonaro-Maia, que piorou o clima no Congresso. O Senado já avalia rejeitar ou deixar a medida provisória do contrato Verde Amarelo caducar. O texto estava pautado para esta sexta-feira. A relatoria do texto na Casa é do PT, contrário à medida. Senadores citam o prazo curto para analisar o

In [9]:
# print the XML contents as read from the source data
soup.prettify()

'<?xml version="1.0" encoding="ISO-8859-1"?>\n<html>\n <body>\n  <materias>\n   <materia>\n    <id>\n     33951343\n    </id>\n    <urgenc>\n    </urgenc>\n    <copyright>\n    </copyright>\n    <fornecedor>\n    </fornecedor>\n    <servico>\n    </servico>\n    <editoria>\n    </editoria>\n    <data>\n     17/04/2020\n    </data>\n    <hora>\n     12:53:00\n    </hora>\n    <formato>\n     0\n    </formato>\n    <destaque>\n    </destaque>\n    <titulo>\n    </titulo>\n    <retranca>\n    </retranca>\n    <ketwords>\n    </ketwords>\n    <local>\n    </local>\n    <autor>\n    </autor>\n    <sinopse>\n    </sinopse>\n    <editcodigo>\n    </editcodigo>\n    <url>\n    </url>\n    <integra>\n     <font face="arial" size="3">\n      1\n     </font>\n     <br/>\n     <ul>\n      <li>\n       <font face="arial" size="2">\n        2\n       </font>\n      </li>\n      <li>\n       <font face="arial" size="2">\n        3\n       </font>\n      </li>\n      <li>\n       <font face="arial" si

In [10]:
# generate a consolidated payload with all identified tag contents,
# in key order so that each translated piece can be matched back to its key
keys = sorted(glossary.keys())
payload = ''.join(str(glossary[key]).strip() + '<p>' for key in keys)

# translate the whole payload with a single API request
results = translateData(payload)

print(results)
# once the content is translated, create a new dictionary with the
# translated strings; every piece is kept, even repeated ones, so that
# positions stay aligned with the glossary keys
results_list = [item.strip() for item in results.split('<p>')]
if results_list and results_list[-1] == '':
    results_list.pop()  # the payload ends with '<p>', which leaves an empty last piece

glossary_en = glossary.copy()

for key, data in zip(keys, results_list):
    glossary_en[key] = data

# print the dictionary of translated strings
glossary_en

XML <p> Uncertainties generated by the new friction of President Jair Bolsonaro with the Mayor, Rodrigo Maia, mainly with regard to the approval of measures, support the search for protection in the foreign exchange market, with reflexes on future interest and stock market. After an optimistic start to the day, when it opened in a fall and marked the day&#39;s minimum price of R $ 5.20, the dollar climbed more than seven cents, intensifying the positive adjustment later in the morning, in parallel with the president&#39;s speech Jair Bolsonaro in the possession of Nelson Teich as Minister of Health. According to professionals, there is fear about the vote on the War Budget, in the second round, in the Senate, as an outcome of the Bolsonaro-Maia conflict, which worsened the climate in Congress. The Senate is already considering rejecting or letting the provisional measure of the Green Yellow contract expire. The text was scheduled for this Friday. The report of the text in the House is 

{0: 'XML',
 1: 'Uncertainties generated by the new friction of President Jair Bolsonaro with the Mayor, Rodrigo Maia, mainly with regard to the approval of measures, support the search for protection in the foreign exchange market, with reflexes on future interest and stock market. After an optimistic start to the day, when it opened in a fall and marked the day&#39;s minimum price of R $ 5.20, the dollar climbed more than seven cents, intensifying the positive adjustment later in the morning, in parallel with the president&#39;s speech Jair Bolsonaro in the possession of Nelson Teich as Minister of Health. According to professionals, there is fear about the vote on the War Budget, in the second round, in the Senate, as an outcome of the Bolsonaro-Maia conflict, which worsened the climate in Congress. The Senate is already considering rejecting or letting the provisional measure of the Green Yellow contract expire. The text was scheduled for this Friday. The report of the text in the H
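
The translated strings above contain HTML entities such as `&#39;` in place of the apostrophe. If plain text is wanted, the standard library's `html.unescape` can decode them while the payload is split. The sketch below uses a shortened piece of the output above as sample input; the `split_translations` helper is hypothetical.

```python
import html


def split_translations(translated, separator='<p>'):
    """Split the translated payload and decode HTML entities in each piece."""
    pieces = [html.unescape(piece).strip() for piece in translated.split(separator)]
    # The payload ends with a separator, so the last piece is empty; drop it.
    if pieces and pieces[-1] == '':
        pieces.pop()
    return pieces


sample = 'XML <p> the day&#39;s minimum price <p>'
print(split_translations(sample))
```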

In [11]:
# replace the numeric keys in the soup object with the translated strings
for tag in soup.find_all():
    if tag.name not in ['html', 'head', 'body', 'materias', 'materia', 'integra', 
                        'ul', 'li', 'table', 'tr', 'td']:
        if tag.text is not None and tag.text != '':
            if tag.text.isdigit():
                content = int(tag.text)
                if content in glossary.keys():
                    tag.string = glossary_en[content]

# print the current version of the XML file
soup.prettify()

'<?xml version="1.0" encoding="ISO-8859-1"?>\n<html>\n <body>\n  <materias>\n   <materia>\n    <id>\n     33951343\n    </id>\n    <urgenc>\n    </urgenc>\n    <copyright>\n    </copyright>\n    <fornecedor>\n    </fornecedor>\n    <servico>\n    </servico>\n    <editoria>\n    </editoria>\n    <data>\n     17/04/2020\n    </data>\n    <hora>\n     12:53:00\n    </hora>\n    <formato>\n     XML\n    </formato>\n    <destaque>\n    </destaque>\n    <titulo>\n    </titulo>\n    <retranca>\n    </retranca>\n    <ketwords>\n    </ketwords>\n    <local>\n    </local>\n    <autor>\n    </autor>\n    <sinopse>\n    </sinopse>\n    <editcodigo>\n    </editcodigo>\n    <url>\n    </url>\n    <integra>\n     <font face="arial" size="3">\n      Uncertainties generated by the new friction of President Jair Bolsonaro with the Mayor, Rodrigo Maia, mainly with regard to the approval of measures, support the search for protection in the foreign exchange market, with reflexes on future interest and sto

In [12]:
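# save the translated version to a new XML file
targetFile = 'data/serv_aenews_google_mat_33951343_en.xml'

# write in the encoding named in the XML declaration; characters outside
# ISO-8859-1 are written as numeric character references
with open(targetFile, 'w', encoding='iso-8859-1', errors='xmlcharrefreplace') as outFile:
    outFile.write(soup.prettify())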
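
The XML declaration in the output still says ISO-8859-1, so the bytes written to disk should match it. The sketch below shows one way to check this on a small, made-up document: it writes the text with an explicit encoding, lets characters outside ISO-8859-1 fall back to numeric character references, and then parses the file again with the standard library. The document content and the temporary path are assumptions for the demo.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical translated document; the curly apostrophe is outside ISO-8859-1.
document = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            '<materia><titulo>The day\u2019s low: R$ 5,20</titulo></materia>')

path = os.path.join(tempfile.mkdtemp(), 'demo_en.xml')
with open(path, 'w', encoding='iso-8859-1', errors='xmlcharrefreplace') as out:
    out.write(document)

# Parsing the file back confirms it is well-formed and the text survived.
root = ET.parse(path).getroot()
print(root.find('titulo').text)
```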