# **Testing Google Translate Accuracy**

![Poorly translated shirts](https://justsomething.co/wp-content/uploads/2015/07/bad-asian-translations-on-shirts-14.jpg)

# Introduction
Language is a major part of human communication and sets us apart from every other species on the planet. According to Ethnologue, there are over 7,000 languages spoken in the world today (“How Many Languages Are There in the World?”). 

Throughout history, linguists and translators have put thousands of hours of work into translating from one language to another in order to connect people across the globe and improve communication between countries and cultures. Fortunately, in 2021, machines can translate text or speech from one language to another in milliseconds. 

However, these translations are not always perfect. The image above shows translations gone wrong when t-shirt designers attempted to translate from their native language to English. Though these examples are comical, they raise questions about the accuracy of machine learning translators.

Using the Google Translate machine learning API (insert link), we will assess the accuracy of the translator by translating 3 sentences of various difficulties from English to six languages, then back to English again. 

**Hypothesis: We hypothesize that languages with closer roots to English will have a higher translation accuracy than languages farther removed from English due to similarities of grammar and syntax.** 


# Methods
We will present the ML API translator with three sentences of increasing levels of difficulty:
1. Easy - "This is an introduction to computer science class."

2. Medium - (From the Emory University Motto): “The wise heart seeks knowledge.”

3. Difficult - (From the Emory Alma Mater): "In the heart of dear old Emory where the sun doth shine, that is where our hearts are turning 'Round old Emory's shrine."

We will take each of these sentences in English, a European (Germanic) language, and translate it into each of the following languages:

1. German (European, Germanic)
2. Spanish (European, Ibero-Romance)
3. Russian (European, Slavic)
4. Hindi (Indo-Iranian, Indic)
5. Persian (Indo-Iranian, Iranian)
6. Hungarian (Uralic, Ugric)

Following this, we will translate each sentence back into English. Then, we will assess the extent to which the ML API translator was able to recapture the original English sentence.
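
The ratings in this report were assigned by hand. As a complementary illustration only, the sketch below shows one way to put a number on how much of the original sentence survives a round trip, using word-level matching from Python's standard `difflib` module. The function name `word_similarity` is hypothetical and this measure was not part of our analysis; the example pair is the German round trip of the Class sentence reported later in this notebook.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, round_trip: str) -> float:
    """Return a 0-1 similarity ratio, comparing the sentences word by word."""
    original_words = original.lower().split()
    round_trip_words = round_trip.lower().split()
    return SequenceMatcher(None, original_words, round_trip_words).ratio()


original = "This is an introduction to computer science class"
german_round_trip = "This is an introduction to computer science classes"

# Identical sentences score 1.0; any changed word lowers the ratio.
print(word_similarity(original, original))
print(word_similarity(original, german_round_trip))
```

A surface measure like this only counts matching words, so it cannot tell whether the meaning was kept, which is why the report relies on human ratings.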


# Machine Learning API

First, we have to set up the API.

In [None]:
import getpass

APIKEY = getpass.getpass()



Next, we invoke the Translate API and set up the environment to pull the sentences from the '.txt' files stored in Google Cloud. The GOOGLE_APPLICATION_CREDENTIALS environment variable points to a service account JSON key uploaded to the notebook.

In [None]:
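# Build a client for the Translate v2 API, authenticated with the API key
from googleapiclient.discovery import build
service = build('translate', 'v2', developerKey=APIKEY)

# Import os (for environment variables) and the Natural Language client library
import os
from google.cloud import language_v1
# Note: the enums module is only available in google-cloud-language versions before 2.0
from google.cloud.language_v1 import enums

# Point the client library at the service account key uploaded to the notebook
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/premium-apex-303114-2a44ede94528.json"

client = language_v1.LanguageServiceClient()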


Then, we can store information and variables in preparation to make the request from Google Cloud.

In [None]:
# Cloud Storage locations of the three sentence files
gcs_content_uri1 = 'gs://web-page-meganj/Sentence1.txt'
gcs_content_uri2 = 'gs://web-page-meganj/Sentence2.txt'
gcs_content_uri3 = 'gs://web-page-meganj/Sentence3.txt'

# document type and language shared by all three files
type_ = enums.Document.Type.PLAIN_TEXT
language = 'en'

#store the 3 text files
document1 = {"gcs_content_uri": gcs_content_uri1, "type": type_, "language": language}
document2 = {"gcs_content_uri": gcs_content_uri2, "type": type_, "language": language}
document3 = {"gcs_content_uri": gcs_content_uri3, "type": type_, "language": language}
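
The three document dictionaries differ only in their URI, so they could also be built in a loop. The sketch below is one possible way to do that and is not the code we ran. The helper `make_documents` is hypothetical, and the string `"PLAIN_TEXT"` is only a placeholder so the example runs without the client library; in the notebook the value would be `enums.Document.Type.PLAIN_TEXT`.

```python
from typing import Any, Dict, List


def make_documents(uris: List[str], doc_type: Any, language: str = "en") -> List[Dict[str, Any]]:
    """Build one request dictionary per Cloud Storage URI."""
    return [
        {"gcs_content_uri": uri, "type": doc_type, "language": language}
        for uri in uris
    ]


# Same three files as in the cell above.
uris = [f"gs://web-page-meganj/Sentence{i}.txt" for i in range(1, 4)]

# "PLAIN_TEXT" is a stand-in for enums.Document.Type.PLAIN_TEXT (assumption).
documents = make_documents(uris, doc_type="PLAIN_TEXT")

for document in documents:
    print(document)
```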

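Then, we make the request from Google Cloud for each of the 3 sentences we will analyze. We call analyze_sentiment only as a way to read the text of each file; the sentiment scores themselves are not used.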

In [None]:
response1 = client.analyze_sentiment(document1, encoding_type=enums.EncodingType.UTF8)
response2 = client.analyze_sentiment(document2, encoding_type=enums.EncodingType.UTF8)
response3 = client.analyze_sentiment(document3, encoding_type=enums.EncodingType.UTF8)

Next, we store the sentences in an input array to run through the API.

In [None]:
inputs = []
inputs.append(response1.sentences[0].text.content)
inputs.append(response2.sentences[0].text.content)
inputs.append(response3.sentences[0].text.content)
inputs
#inputs = ['In the heart of dear old Emory Where the sun doth shine, That is where our hearts are turning Round old Emory shrine.', 'The wise heart seeks knowledge', 'This is an introduction to computer science class']

['In the heart of dear old Emory Where the sun doth shine, That is where our hearts are turning Round old Emory shrine.',
 'The wise heart seeks knowledge',
 'This is an introduction to computer science class']

Then, we create arrays with the languages to use in the API.

In [None]:
languagesab = ['de', 'es', 'ru', 'hi', 'fa', 'hu']
languages = ['German', 'Spanish', 'Russian', 'Hindi', 'Persian', 'Hungarian']

Next, we use a 'for' loop to move through each language, using the Translate API to translate every input sentence. The translations are saved in a new list (outputlist), which is then translated back to English. A second for loop prints the original sentence, its translation, and the round-trip result for each input.

We translated 3 sentences into each language. The first one, alma mater, is from Emory's alma mater: 'In the heart of dear old Emory Where the sun doth shine, That is where our hearts are turning Round old Emory shrine.' The second sentence, motto, is Emory's school motto: 'The wise heart seeks knowledge'. The final sentence, class, is a simple sentence about QTM 250: 'This is an introduction to computer science class'.

In [None]:
# use the service for each language
for langab, lang in zip(languagesab, languages):
  print(lang)
  # translate every input sentence from English into the target language
  outputs = service.translations().list(source='en', target=langab, q=inputs).execute()
  # save the translated text into a new list
  outputlist = []
  for translation in outputs['translations']:
    outputlist.append(translation['translatedText'])
  # translate the results back into English
  finaloutputs = service.translations().list(source=langab, target='en', q=outputlist).execute()
  # print original -> translation -> round-trip result
  # (the loop variable is named 'original' to avoid shadowing the built-in input())
  for original, output, foutput in zip(inputs, outputlist, finaloutputs['translations']):
    print(u"{0} -> {1} -> {2}".format(original, output, foutput['translatedText']))

German
In the heart of dear old Emory Where the sun doth shine, That is where our hearts are turning Round old Emory shrine. -> Im Herzen des lieben alten Emory Wo die Sonne scheint, drehen sich unsere Herzen um den alten Emory-Schrein. -> In the heart of dear old Emory Where the sun shines our hearts revolve around the old Emory shrine.
The wise heart seeks knowledge -> Das weise Herz sucht Wissen -> The wise heart seeks knowledge
This is an introduction to computer science class -> Dies ist eine Einführung in den Informatikunterricht -> This is an introduction to computer science classes
Spanish
In the heart of dear old Emory Where the sun doth shine, That is where our hearts are turning Round old Emory shrine. -> En el corazón de la querida Emory, donde brilla el sol, ahí es donde nuestros corazones giran alrededor del antiguo santuario de Emory. -> In the heart of beloved Emory, where the sun shines, that&#39;s where our hearts revolve around the ancient Emory shrine.
The wise hear
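
The round trip performed by the loop can be sketched as a small reusable function. This is a minimal illustration, not the code we ran: `round_trip`, the `Translator` type, and `fake_translate` are hypothetical names, and `fake_translate` is a toy lookup table standing in for the Translate API so the example runs offline. The sketch also decodes HTML entities such as the `&#39;` visible in the Spanish output, using the standard library's `html.unescape`.

```python
import html
from typing import Callable, List, Tuple

# A translator takes (sentences, source_language, target_language)
# and returns one translated string per input sentence.
Translator = Callable[[List[str], str, str], List[str]]


def round_trip(sentences: List[str], target: str, translate: Translator,
               source: str = "en") -> List[Tuple[str, str, str]]:
    """Translate source -> target -> source and return (original, forward, back) tuples."""
    forward = translate(sentences, source, target)
    # Decode HTML entities (for example &#39;) in the text translated back.
    back = [html.unescape(text) for text in translate(forward, target, source)]
    return list(zip(sentences, forward, back))


def fake_translate(sentences: List[str], source: str, target: str) -> List[str]:
    """Hypothetical stand-in for the Translate API, backed by a tiny lookup table."""
    table = {
        ("en", "es"): {"that's where": "ahí es donde"},
        ("es", "en"): {"ahí es donde": "that&#39;s where"},
    }
    return [table[(source, target)][sentence] for sentence in sentences]


results = round_trip(["that's where"], "es", fake_translate)
for original, forward, back in results:
    print(f"{original} -> {forward} -> {back}")
```

Passing the translator in as an argument keeps the round-trip logic separate from the API call, so the same function could be exercised with a stub like this one or with a wrapper around the real service.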

After translating the sentences, we determined whether the final English translation retained its original meaning (TRUE or FALSE). We then rated the translation's usefulness on a scale of 1 to 4 (1 = meaningless translation, 2 = close translation with meaning lost, 3 = close translation with meaning retained, 4 = exact translation).

[Data Table](https://docs.google.com/spreadsheets/d/1GBEVlFydUrXVf3-2Hy0bYtxQMushKhFSIKk8fVwNqgA/edit?usp=sharing)
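
The rating scale can be written down as a small enumeration. This is only a sketch: the names `Rating` and `meaning_retained` are hypothetical, and the rule that a rating of 3 or 4 corresponds to "meaning retained" is our reading of the scale labels, since the TRUE/FALSE values in the data table were assigned by hand.

```python
from enum import IntEnum


class Rating(IntEnum):
    """The 1-4 usefulness scale described above."""
    MEANINGLESS = 1
    CLOSE_MEANING_LOST = 2
    CLOSE_MEANING_RETAINED = 3
    EXACT = 4


def meaning_retained(rating: Rating) -> bool:
    """Assumption: ratings of 3 and 4 are the ones that keep the original meaning."""
    return rating >= Rating.CLOSE_MEANING_RETAINED


for rating in Rating:
    print(rating.value, rating.name, meaning_retained(rating))
```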

# Data Analysis
![chart](https://docs.google.com/spreadsheets/d/e/2PACX-1vT_4822CqlNWsWiGS217CZEuAawpCOQbf6MZV491jUN_FzaPHfns3YqQY53VHXhcdtJi3m3OBX0bEFi/pubchart?oid=884285617&format=image)


In the graph above, Spanish had the highest composite score of the six languages, with ratings of (4, 4, 3) for the Class, Motto, and Alma Mater sentences, respectively. German, Russian, and Persian tied for second with 10 points each, scoring (3, 4, 3), (4, 3, 3), and (4, 4, 2), respectively. Persian scored 4 on both the Class and Motto sentences but fell short with an Alma Mater score of 2. Next is Hungarian with a total of 9 points, scoring (2, 4, 3). In last place with the least accurate translations is Hindi, with scores of (4, 2, 1).
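
The composite scores can be reproduced from the ratings quoted in this section. The sketch below simply sums each language's three ratings and sorts the totals; the variable names are ours, and the ratings are copied from the text in (Class, Motto, Alma Mater) order.

```python
# Ratings per language in (Class, Motto, Alma Mater) order, as reported in the text.
scores = {
    "Spanish": (4, 4, 3),
    "German": (3, 4, 3),
    "Russian": (4, 3, 3),
    "Persian": (4, 4, 2),
    "Hungarian": (2, 4, 3),
    "Hindi": (4, 2, 1),
}

# Composite score = sum of the three ratings.
totals = {language: sum(ratings) for language, ratings in scores.items()}

# Highest composite score first; ties keep the order in which they were listed.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for language, total in ranking:
    print(f"{language}: {total}")
```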

Also, four of the six languages (Spanish, Russian, Hindi, and Persian) translated the Class sentence exactly. German and Hungarian did not, yet both translated the Motto sentence exactly. This is interesting because the Class sentence is supposed to be easier than the Motto sentence. 
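
The same ratings can be viewed per sentence rather than per language. As a rough sketch (the variable names are ours, and the ratings are copied from the text above), the code below lists which languages earned an exact translation (a rating of 4) for each sentence and computes the mean rating.

```python
# Ratings per language in (Class, Motto, Alma Mater) order, as reported in the text.
scores = {
    "Spanish": (4, 4, 3),
    "German": (3, 4, 3),
    "Russian": (4, 3, 3),
    "Persian": (4, 4, 2),
    "Hungarian": (2, 4, 3),
    "Hindi": (4, 2, 1),
}
sentences = ("Class", "Motto", "Alma Mater")

summary = {}
for index, sentence in enumerate(sentences):
    ratings = [language_scores[index] for language_scores in scores.values()]
    # Languages whose round trip reproduced the sentence exactly (rating of 4).
    exact = [language for language, s in scores.items() if s[index] == 4]
    summary[sentence] = {"mean": sum(ratings) / len(ratings), "exact": exact}

for sentence, stats in summary.items():
    print(sentence, stats["mean"], stats["exact"])
```

Seen this way, the Class and Motto sentences receive the same mean rating, while the Alma Mater sentence is the only one that no language translated exactly.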

# Conclusion
The data supported our hypothesis that languages with closer roots to English would have a higher translation accuracy than languages with farther roots. Spanish retained the original meaning of the sentences best, with German and Russian following close behind. Persian shared the same composite score as German and Russian; however, its Alma Mater translation was close but lost meaning.

The languages that retained the original meaning of every sentence have European roots, and the languages that lost the meaning of at least one sentence have Indo-Iranian or Uralic roots. Although no language earned a perfect composite score, the languages with European roots always kept the meaning of the original sentences intact, whereas some meaning was lost in some sentences translated through languages with Indo-Iranian or Uralic roots. However, all languages, regardless of their roots, were able to translate at least one sentence exactly.
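
The claims in this conclusion can be checked against the ratings from the Data Analysis section. The sketch below is illustrative only: the variable names are ours, the family labels come from the list in the Methods section, and a rating of 3 or higher is taken to mean that the original meaning was retained, following the rating scale.

```python
# Ratings per language in (Class, Motto, Alma Mater) order, as reported in the text.
scores = {
    "Spanish": (4, 4, 3),
    "German": (3, 4, 3),
    "Russian": (4, 3, 3),
    "Persian": (4, 4, 2),
    "Hungarian": (2, 4, 3),
    "Hindi": (4, 2, 1),
}

# Family labels as given in the Methods section.
families = {
    "German": "European",
    "Spanish": "European",
    "Russian": "European",
    "Hindi": "Indo-Iranian",
    "Persian": "Indo-Iranian",
    "Hungarian": "Uralic",
}

# Lowest rating seen in each family; 3 or higher means meaning was always retained.
lowest_by_family = {}
for language, ratings in scores.items():
    family = families[language]
    lowest_by_family[family] = min(lowest_by_family.get(family, 4), min(ratings))

# Did every language translate at least one sentence exactly (rating of 4)?
all_have_exact = all(max(ratings) == 4 for ratings in scores.values())

print(lowest_by_family)
print(all_have_exact)
```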

# [GitHub Repo](https://github.com/jperelm/QTM250Hw4)


# [Architecture diagram](https://github.com/jperelm/QTM250Hw4/blob/main/ArchDiagram.PNG)

# References
1. “How Many Languages Are There in the World?” Ethnologue, 3 May 2016, https://www.ethnologue.com/guides/how-many-languages.
