# An Exploration into the Variations between Human-Generated and Machine-Generated Translations using Sentiment Analysis

### Purpose

The purpose of this notebook is to explore the variations that arise when translations between English and French are produced by a human and by a machine. In particular, it explores how AI translation software interprets emotion and whether it can recreate that emotion as a human translator does. The human-generated data comes from news articles and press releases, fictional literature, and film subtitles that were available in both French and English; for the machine-generated data, the English versions were processed through DeepL.

### Setup

In [24]:
import nltk
nltk.download('punkt')

The [Natural Language Toolkit](https://www.nltk.org/), or NLTK, is an open-source Python library that enables us to apply natural language processing techniques to human language text data.

In the step above we import NLTK into the notebook so that we can use the features we need, and we download `punkt`. `punkt` is a tokenizer, which means it breaks text data down into smaller components, such as sentences or words, for analysis.

In [25]:
pip install vaderSentiment-fr

Note: you may need to restart the kernel to use updated packages.


[VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) (Valence Aware Dictionary and sEntiment Reasoner) is used in text sentiment analysis to determine how negative or positive a text is. VADER is a rule-based model, so when we use it we are relying on a dictionary of words that have each been given a score based on the emotion associated with them. In this step we install the French version, `vaderSentiment-fr`.
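To make the idea of a dictionary-driven, rule-based model more concrete, here is a minimal sketch. The tiny lexicon, its scores and the function name `toy_valence` are invented for illustration and are not taken from `vaderSentiment-fr`; the real model also applies rules (for negation, intensifiers, punctuation and so on) that this sketch leaves out.

```python
# Hypothetical mini-lexicon: word -> valence score (values invented for illustration).
TOY_LEXICON = {"heureux": 2.0, "magnifique": 3.0, "triste": -2.0, "horrible": -3.0}

def toy_valence(text):
    """Sum the lexicon scores of the words in `text`; unknown words count as 0."""
    words = text.lower().split()
    # Strip common punctuation so that "triste." still matches "triste".
    return sum(TOY_LEXICON.get(word.strip(".,!?;:"), 0.0) for word in words)

print(toy_valence("Un jour magnifique mais un peu triste."))
```

A positive total suggests the positive words outweigh the negative ones, which is the basic intuition behind the scores used later in this notebook.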

In [27]:
pip install python-Levenshtein

Note: you may need to restart the kernel to use updated packages.


In [26]:
from vaderSentiment_fr.vaderSentiment import SentimentIntensityAnalyzer   

Alongside measuring the polarity of a text, VADER can also analyse its intensity. In the step above we import the `SentimentIntensityAnalyzer` class, which provides this feature, from the `vaderSentiment_fr` package.

In [28]:
SIA = SentimentIntensityAnalyzer()

Next, we create a `SentimentIntensityAnalyzer()` instance and store it in the variable `SIA` so that it is quicker to call later.

### Getting the data

In [43]:
# Read the machine-translated text; `with` closes the file once it has been read
with open('machine-text.txt', encoding="utf-8") as dataMT:
    message_textMT = dataMT.read()

In [44]:
# Read the human-translated text; `with` closes the file once it has been read
with open('Human-Text.txt', encoding="utf-8") as dataHT:
    message_textHT = dataHT.read()

These two steps open our text files `machine-text.txt` and `Human-Text.txt` and read them into their respective `message_text` variables for use in the analysis functions below.

*Note: if this step doesn't work, make sure this `.ipynb` file is in the same directory as your `.txt` files. If your text files are in a different directory there are other solutions, but this is the easiest.*
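If the text files do live elsewhere, one option is to build the paths with Python's standard `pathlib` module. This is a minimal sketch only: the `data` folder and the helper name `read_text_file` are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path

# Hypothetical location: a "data" folder next to the notebook. Adjust to your setup.
DATA_DIR = Path("data")

def read_text_file(filename, folder=DATA_DIR):
    """Return the contents of `folder/filename`, decoded as UTF-8."""
    return (Path(folder) / filename).read_text(encoding="utf-8")

# Example usage (uncomment once the files are in place):
# message_textMT = read_text_file("machine-text.txt")
# message_textHT = read_text_file("Human-Text.txt")
```

Keeping the folder in a single variable means only one line needs to change if the files move.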

### Initial Data Analysis

In [45]:
scoresMT = SIA.polarity_scores(message_textMT)

In [46]:
scoresHT = SIA.polarity_scores(message_textHT)

In the steps above, we have created variables that contain sentiment analysis scores for each text.

In [47]:
scoresMT

{'neg': 0.058, 'neu': 0.853, 'pos': 0.089, 'compound': 0.9996}

In [48]:
scoresHT

{'neg': 0.058, 'neu': 0.833, 'pos': 0.11, 'compound': 0.9999}

When we print the respective scores for each text (above), we get the texts' polarity and intensity scores. We can see that VADER scores the human translations as slightly more positive than the equivalent machine translations (`pos` of 0.11 against 0.089), while the `neg` scores are identical at 0.058.
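To read the gap between the two texts at a glance, we can subtract one set of scores from the other. The sketch below copies the two dictionaries from the output above into new variables (`scores_mt` and `scores_ht`, named differently so that they do not overwrite the notebook's own variables); the helper `score_differences` is our own illustrative function.

```python
# Whole-text scores copied from the notebook output above.
scores_mt = {'neg': 0.058, 'neu': 0.853, 'pos': 0.089, 'compound': 0.9996}
scores_ht = {'neg': 0.058, 'neu': 0.833, 'pos': 0.11, 'compound': 0.9999}

def score_differences(first, second):
    """Return first[key] - second[key] for every key, rounded to 4 decimals."""
    return {key: round(first[key] - second[key], 4) for key in first}

# Positive values mean the human translation scored higher on that measure.
differences = score_differences(scores_ht, scores_mt)
for key in sorted(differences):
    print(f"{key}: {differences[key]:+.4f}")
```

With only two texts this is a description rather than a statistical test, so the differences should be read as indicative only.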

In [49]:
for key in sorted(scoresMT):
    print('{0}: {1}, '.format(key, scoresMT[key]), end='')

compound: 0.9996, neg: 0.058, neu: 0.853, pos: 0.089, 

In [50]:
for key in sorted(scoresHT):
    print('{0}: {1}, '.format(key, scoresHT[key]), end='')

compound: 0.9999, neg: 0.058, neu: 0.833, pos: 0.11, 

The steps above tidy up the output by printing each set of scores on a single line, sorted alphabetically by key.

### Sentence level analysis

In [51]:
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')

As previously explained, a tokenizer breaks a text down into smaller components. In this step we create a `tokenizer` variable using the French model within the `punkt` tokenizer.

In [52]:
sentencesMT = tokenizer.tokenize(message_textMT)

In [53]:
sentencesHT = tokenizer.tokenize(message_textHT)

In the two steps above, we tokenize the two texts and create a `sentences` variable for each. Each variable contains its text broken down into a list of sentences.
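To give a feel for what such a list of sentences looks like, here is a deliberately naive stand-in written with a regular expression. It is not how `punkt` works: `punkt` uses a trained model and copes with abbreviations such as "M." that would trip up this sketch. The function name `naive_sentence_split` and the sample text are ours.

```python
import re

def naive_sentence_split(text):
    """Split after ., ! or ? when followed by whitespace. Illustrative only."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sample = "Il pleut. Quelle belle ville ! Tu viens ?"
for sentence in naive_sentence_split(sample):
    print(sentence)
```

Each item in the resulting list can then be scored on its own, which is what the next two steps do with the real `sentencesMT` and `sentencesHT` lists.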

In [54]:
for sentence in sentencesMT:
    scores = SIA.polarity_scores(sentence)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print()

compound: 0.296, neg: 0.0, neu: 0.856, pos: 0.144, 
compound: -0.6688, neg: 0.157, neu: 0.843, pos: 0.0, 
compound: 0.1171, neg: 0.149, neu: 0.571, pos: 0.28, 
compound: 0.0, neg: 0.109, neu: 0.781, pos: 0.109, 
compound: 0.5719, neg: 0.0, neu: 0.793, pos: 0.207, 
compound: 0.0428, neg: 0.096, neu: 0.726, pos: 0.178, 
compound: -0.0516, neg: 0.054, neu: 0.946, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.1027, neg: 0.131, neu: 0.724, pos: 0.145, 
compound: -0.5411, neg: 0.164, neu: 0.786, pos: 0.05, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3802, neg: 0.0, neu: 0.909, pos: 0.091, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3818, neg: 0.0, neu: 0.843, pos: 0.157, 
compound: 0.4767, neg: 0.088, neu: 0.737, pos: 0.176, 
compound: -0.2023, neg: 0.101, neu: 0.899, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.7717, neg: 0.0, neu: 0.438, pos: 0.562, 
compound: 0.4019, 

In [55]:
for sentence in sentencesHT:
    scores = SIA.polarity_scores(sentence)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print()

compound: 0.296, neg: 0.0, neu: 0.862, pos: 0.138, 
compound: -0.6103, neg: 0.162, neu: 0.711, pos: 0.127, 
compound: -0.2831, neg: 0.234, neu: 0.537, pos: 0.229, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.5106, neg: 0.0, neu: 0.875, pos: 0.125, 
compound: 0.0428, neg: 0.089, neu: 0.746, pos: 0.166, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.1027, neg: 0.141, neu: 0.704, pos: 0.156, 
compound: -0.5983, neg: 0.15, neu: 0.85, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.3818, neg: 0.0, neu: 0.852, pos: 0.148, 
compound: 0.2732, neg: 0.144, neu: 0.68, pos: 0.176, 
compound: -0.2023, neg: 0.087, neu: 0.913, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.4926, neg: 0.0, neu: 0.892, pos: 0.108, 
compound: 0.6597, neg: 0.0, neu: 0.565, pos: 0.435, 
compound: 0.5574, neg: 0.0, neu:

Finally, in these two steps, we loop through each text sentence by sentence and use `SentimentIntensityAnalyzer` to calculate each sentence's polarity and intensity scores. For each sentence, we then print a negative (`neg`), neutral (`neu`), positive (`pos`), and compound (combined) score.
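One way to compare the two outputs is to pair up the `compound` scores sentence by sentence. The sketch below uses only the first five values from each output above, and it assumes that sentence *i* in one text corresponds to sentence *i* in the other; that will not hold if the tokenizer splits the two texts differently, so the alignment should be checked before relying on the result.

```python
from statistics import mean

# Compound scores of the first five sentences, copied from the two outputs above.
compound_mt = [0.296, -0.6688, 0.1171, 0.0, 0.5719]
compound_ht = [0.296, -0.6103, -0.2831, 0.0, 0.5106]

# Assumption: the two lists are aligned, so position i is the same source sentence.
gaps = [round(ht - mt, 4) for ht, mt in zip(compound_ht, compound_mt)]

print("per-sentence gap (HT - MT):", gaps)
print("mean gap:", round(mean(gaps), 4))
```

A gap that changes sign, as in the third sentence here, points to a sentence worth reading side by side in both versions.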

Note that `neg`, `neu`, and `pos` are scored between 0 and 1, and `compound` is scored between -1 and 1.
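The range of `compound` follows from how the score is normalised. As far as we know, the original English VADER code squashes the summed word scores with the function below, using a constant `alpha` of 15; this sketch assumes the French port does the same, which we have not checked here.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# The larger the raw sum, the closer the result gets to 1 without reaching it.
for raw in (1.0, 5.0, 50.0, 200.0):
    print(raw, round(normalize(raw), 4))
```

Because a long text accumulates a large raw sum, its `compound` score is pushed towards the ends of the range, which may be why both whole-text scores earlier in this notebook sit so close to 1. Sentence-level scores are less affected and are therefore more informative for comparison.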

### Rights

This notebook was produced in spring/summer 2022 by Isobel Lester as part of a [Digital Humanities Internship](https://www.southampton.ac.uk/study/facilities/digital-humanities-facilities) funded by the [School of Humanities](https://www.southampton.ac.uk/about/faculties-schools-departments/school-of-humanities) at the [University of Southampton](https://www.southampton.ac.uk/).

This notebook builds on Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," *Programming Historian* 7 (2018), [https://doi.org/10.46430/phen0079](https://doi.org/10.46430/phen0079).

This notebook is released under a [CC-BY](https://creativecommons.org/licenses/by/4.0/deed.en) license.