# An Exploration into the Variations between Human-Generated and Machine-Generated Translations Using Sentiment Analysis

### The purpose of this notebook is to explore the variations that arise when translations between English and French are produced by a human and by a machine. In particular, it explores how AI translation software interprets emotion and whether it can recreate that emotion as a human translator would. The human-generated data comes from news articles and press releases, fictional literature, and film subtitles that were available in both French and English; the machine-generated data was produced by processing the English versions through DeepL.


In [24]:
import nltk
nltk.download('punkt')

The Natural Language Toolkit, or NLTK, is an open-source Python library that enables us to apply natural language processing techniques to human language text data.
In this step we import the toolkit into the notebook so that we can pull the features we need from it, such as 'punkt'. Punkt is a tokenizer, which means it breaks text data down into smaller components, such as sentences or words, for analysis.
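To make the idea of sentence tokenization concrete, here is a minimal sketch that uses a naive regular expression instead of punkt. The function name `naive_split_sentences` and both sample strings are hypothetical and not part of the notebook; the sketch only illustrates what "breaking a text into sentences" means and why a trained tokenizer is preferable.

```python
import re


def naive_split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    This is a naive stand-in for illustration only, not the punkt algorithm.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


# Hypothetical sample text, not taken from the notebook's data files.
sample = "Bonjour tout le monde. Comment allez-vous ? Très bien !"
print(naive_split_sentences(sample))

# The abbreviation "M." is wrongly treated as the end of a sentence here.
tricky = "M. Dupont est arrivé. Il est content."
print(naive_split_sentences(tricky))
```

The second example shows the weakness of the naive rule: every full stop followed by a space ends a sentence, including the one in an abbreviation. A trained tokenizer such as punkt is designed to reduce this kind of error, which is why the notebook relies on it.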

In [25]:
pip install vaderSentiment-fr

Note: you may need to restart the kernel to use updated packages.


VADER (Valence Aware Dictionary and sEntiment Reasoner) is used in text sentiment analysis to determine how negative or positive a text is. VADER is a rule-based model, so when we import it we are importing a dictionary of words that have each been assigned a score based on the emotion associated with them. In this step we are specifically installing the French version.
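The sketch below illustrates the dictionary-based idea with a toy scorer. It is not the VADER implementation: the four-word lexicon and its valences are invented for illustration, and the real model also applies rules for negation, intensifiers and punctuation that are ignored here. The `normalize` step follows the formula used by the original English VADER (with `alpha=15`); that the French port uses the same formula is an assumption.

```python
import math

# Hypothetical mini-lexicon for illustration only; the real lexicon is far
# larger and its valences differ.
TOY_LEXICON = {"excellent": 3.0, "bon": 1.9, "triste": -2.1, "horrible": -2.5}


def normalize(total, alpha=15):
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return total / math.sqrt(total * total + alpha)


def toy_compound(text):
    """Sum the valences of known words and normalize the result."""
    words = text.lower().split()
    total = sum(TOY_LEXICON.get(word.strip(".,!?;:"), 0.0) for word in words)
    return round(normalize(total), 4)


print(toy_compound("Ce film est excellent"))       # positive score
print(toy_compound("Un jour triste et horrible"))  # negative score
print(toy_compound("La table est grande"))         # no known words, so 0.0
```

Because the normalization approaches 1 or -1 as the valence sum grows, a long text with many scored words tends to receive a compound score very close to the extremes.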

In [27]:
pip install python-Levenshtein

Note: you may need to restart the kernel to use updated packages.


In [26]:
from vaderSentiment_fr.vaderSentiment import SentimentIntensityAnalyzer   

In [None]:
Alongside measuring the polarity of a text, VADER can also measure its intensity. From the VADER package we import SentimentIntensityAnalyzer, the class that provides this feature.

In [28]:
SIA = SentimentIntensityAnalyzer()

Next, we create an instance of SentimentIntensityAnalyzer and store it in the variable SIA so that it is quicker to call later.

In [43]:
# Read the machine-translated text; the with-block closes the file afterwards.
with open('machine-text.txt', encoding="utf-8") as dataMT:
    message_textMT = dataMT.read()

In [44]:
# Read the human-translated text; the with-block closes the file afterwards.
with open('Human-Text.txt', encoding="utf-8") as dataHT:
    message_textHT = dataHT.read()

Now we open our two text files for analysis, machine-text.txt and Human-Text.txt. Their contents are read into their respective message_text variables, ready to be analysed when we call the various functions.

In [45]:
scoresMT = SIA.polarity_scores(message_textMT)

In [46]:
scoresHT = SIA.polarity_scores(message_textHT)

In [None]:
Here, we pass each text to polarity_scores and store the resulting scores in a variable of its own.

In [47]:
scoresMT

{'neg': 0.058, 'neu': 0.853, 'pos': 0.089, 'compound': 0.9996}

In [48]:
scoresHT

{'neg': 0.058, 'neu': 0.833, 'pos': 0.11, 'compound': 0.9999}

When we print each scores variable, we get that text's polarity scores (neg, neu, pos) and its overall compound score. We can see that the human translations score slightly more positively: pos is 0.11 against 0.089 and compound is 0.9999 against 0.9996, while neg is identical at 0.058.
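The size of the gap between the two translations can be made explicit by subtracting one set of scores from the other. The sketch below copies the values from the two output cells above; the variable names `printed_scores_mt` and `printed_scores_ht` and the helper `score_differences` are hypothetical, chosen so that they do not overwrite the notebook's own variables.

```python
# Scores copied from the two output cells above.
printed_scores_mt = {'neg': 0.058, 'neu': 0.853, 'pos': 0.089, 'compound': 0.9996}
printed_scores_ht = {'neg': 0.058, 'neu': 0.833, 'pos': 0.11, 'compound': 0.9999}


def score_differences(machine, human):
    """Return human minus machine for each key, rounded to 4 decimal places."""
    return {key: round(human[key] - machine[key], 4) for key in sorted(machine)}


differences = score_differences(printed_scores_mt, printed_scores_ht)
for key, diff in differences.items():
    # A positive value means the human translation scored higher on that key.
    print(f"{key}: {diff:+.4f}")
```

Read this way, the human text gains about two percentage points of positive content at the expense of neutral content, while the compound scores are almost indistinguishable.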

In [49]:
for key in sorted(scoresMT):
        print('{0}: {1}, '.format(key, scoresMT[key]), end='')

compound: 0.9996, neg: 0.058, neu: 0.853, pos: 0.089, 

In [50]:
for key in sorted(scoresHT):
        print('{0}: {1}, '.format(key, scoresHT[key]), end='')

compound: 0.9999, neg: 0.058, neu: 0.833, pos: 0.11, 

## Sentence by Sentence Analysis

In [51]:
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')

As previously explained, a tokenizer breaks a text down into smaller components. In this stage we set our 'tokenizer' variable to the French model within the punkt tokenizer.

In [52]:
sentencesMT = tokenizer.tokenize(message_textMT)

In [53]:
sentencesHT = tokenizer.tokenize(message_textHT)

In this step we tokenize each text and store the resulting list of sentences in its own 'sentences' variable.

In [54]:
for sentence in sentencesMT:
        scores = SIA.polarity_scores(sentence)
        for key in sorted(scores):
                print('{0}: {1}, '.format(key, scores[key]), end='')
        print()

compound: 0.296, neg: 0.0, neu: 0.856, pos: 0.144, 
compound: -0.6688, neg: 0.157, neu: 0.843, pos: 0.0, 
compound: 0.1171, neg: 0.149, neu: 0.571, pos: 0.28, 
compound: 0.0, neg: 0.109, neu: 0.781, pos: 0.109, 
compound: 0.5719, neg: 0.0, neu: 0.793, pos: 0.207, 
compound: 0.0428, neg: 0.096, neu: 0.726, pos: 0.178, 
compound: -0.0516, neg: 0.054, neu: 0.946, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.1027, neg: 0.131, neu: 0.724, pos: 0.145, 
compound: -0.5411, neg: 0.164, neu: 0.786, pos: 0.05, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3802, neg: 0.0, neu: 0.909, pos: 0.091, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3818, neg: 0.0, neu: 0.843, pos: 0.157, 
compound: 0.4767, neg: 0.088, neu: 0.737, pos: 0.176, 
compound: -0.2023, neg: 0.101, neu: 0.899, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.7717, neg: 0.0, neu: 0.438, pos: 0.562, 
compound: 0.4019, 

In [55]:
for sentence in sentencesHT:
        scores = SIA.polarity_scores(sentence)
        for key in sorted(scores):
                print('{0}: {1}, '.format(key, scores[key]), end='')
        print()

compound: 0.296, neg: 0.0, neu: 0.862, pos: 0.138, 
compound: -0.6103, neg: 0.162, neu: 0.711, pos: 0.127, 
compound: -0.2831, neg: 0.234, neu: 0.537, pos: 0.229, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.5106, neg: 0.0, neu: 0.875, pos: 0.125, 
compound: 0.0428, neg: 0.089, neu: 0.746, pos: 0.166, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.1027, neg: 0.141, neu: 0.704, pos: 0.156, 
compound: -0.5983, neg: 0.15, neu: 0.85, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3818, neg: 0.0, neu: 0.852, pos: 0.148, 
compound: 0.2732, neg: 0.144, neu: 0.68, pos: 0.176, 
compound: -0.2023, neg: 0.087, neu: 0.913, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.4926, neg: 0.0, neu: 0.892, pos: 0.108, 
compound: 0.6597, neg: 0.0, neu: 0.565, pos: 0.435, 
compound: 0.5574, neg: 0.0, neu:

In these for loops, the notebook goes through each text sentence by sentence and calculates the polarity and intensity scores of every sentence. For each sentence, it prints each key (compound, neg, neu, pos) and its score.
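The printed scores can also be compared pair by pair. The sketch below copies the compound scores visible in the two outputs above, both of which are cut off at the 20th sentence, so it covers only those 20 sentences. It assumes that the n-th machine sentence corresponds to the n-th human sentence, which the tokenizer does not guarantee. The ±0.05 cut-offs are the conventional thresholds suggested for the original English VADER; applying them to the French port is an assumption, and all names in the block are hypothetical.

```python
# Compound scores copied from the visible part of the two outputs above.
compound_mt = [0.296, -0.6688, 0.1171, 0.0, 0.5719, 0.0428, -0.0516, 0.0,
               0.1027, -0.5411, 0.0, 0.3802, 0.0, 0.3818, 0.4767, -0.2023,
               0.0, 0.0, 0.7717, 0.4019]
compound_ht = [0.296, -0.6103, -0.2831, 0.0, 0.5106, 0.0428, 0.0, 0.0,
               0.1027, -0.5983, 0.0, 0.0, 0.0, 0.3818, 0.2732, -0.2023,
               0.0, 0.4926, 0.6597, 0.5574]


def label(compound, threshold=0.05):
    """Map a compound score to a coarse label (assumed conventional cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Assumption: sentence i in one text corresponds to sentence i in the other.
pairs = list(zip(compound_mt, compound_ht))
identical = sum(1 for mt, ht in pairs if mt == ht)
mean_abs_diff = sum(abs(mt - ht) for mt, ht in pairs) / len(pairs)
label_changes = [
    (position, label(mt), label(ht))
    for position, (mt, ht) in enumerate(pairs, start=1)
    if label(mt) != label(ht)
]

print(f"identical compound scores: {identical} of {len(pairs)}")
print(f"mean absolute difference: {mean_abs_diff:.4f}")
for position, mt_label, ht_label in label_changes:
    print(f"sentence {position}: machine={mt_label}, human={ht_label}")
```

Sentences whose label changes between the two versions are the natural candidates for a closer reading, since they are where the machine and the human translator appear to convey a different emotion.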

This notebook was produced in spring/summer 2022 as part of the Digital Humanities Internship funded by the University of Southampton. 