**Marina Albert, Ramón Carreño, Charlotte Puopolo**

# Machine Translation Project
How might a target language change over time if MT is used persistently? We automatically
compare the MT output with the original human translation of a large set of texts to check
whether any words or structures are used more or less frequently.

## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Remember to set the right paths if you want to rerun inference using corpora files.

In [1]:
!pip install tqdm
!pip install sentencepiece
!pip install transformers
!pip install sacremoses
!pip install scattertext


In [None]:
import pandas as pd
from tqdm import tqdm
from transformers import pipeline

## Translations

In [None]:
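# Read and write operations
def read_corpus(filepath):
    # Read one example per line, stripping surrounding whitespace.
    # UTF-8 is set explicitly so accented characters load the same on every platform.
    with open(filepath, 'r', encoding='utf-8') as file:
        return [line.strip() for line in file]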


def write_translations(filepath, translations):
    with open(filepath, "w", encoding="utf-8") as file:
        for sentence in translations:
            file.write(sentence + "\n")

In [None]:
path_to_english_file = 'corpora/esxnli-all-en.csv'
path_to_spanish_file = 'corpora/esxnli-all-es.csv'

en_corpus = read_corpus(path_to_english_file)
es_corpus = read_corpus(path_to_spanish_file)

In [None]:
# Don't rerun unless needed: translating the full corpus took about 58 minutes in the logged run
es_en_model = 'Helsinki-NLP/opus-mt-es-en'
translator = pipeline(task="translation", model=es_en_model)
#translation = translator(es_corpus[0:5])

# Iterate through each sentence in es_corpus and translate
tr_en_texts = []
#for sentence in tqdm(es_corpus[0:1000]):
for sentence in tqdm(es_corpus):
    translation = translator(sentence)
    tr_en_texts.append(translation)

100%|██████████| 3320/3320 [58:34<00:00,  1.06s/it]
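The loop above sends one sentence per call. A minimal sketch of batching the calls follows, assuming the translator accepts a list of strings and returns one `{"translation_text": ...}` dict per input (the commented-out `translator(es_corpus[0:5])` call suggests list input works). The helper `translate_in_batches` and the stand-in `fake_translator` are hypothetical names introduced for illustration, and any speed-up on the real model is not measured here.

```python
from typing import Callable, List


def translate_in_batches(
    translate_fn: Callable[[List[str]], List[dict]],
    sentences: List[str],
    batch_size: int = 32,
) -> List[str]:
    """Translate `sentences` by passing up to `batch_size` of them per call.

    Assumption: `translate_fn` takes a list of strings and returns one
    {"translation_text": ...} dict per input, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        outputs = translate_fn(batch)
        results.extend(item["translation_text"] for item in outputs)
    return results


# Stand-in translator so the sketch runs without downloading a model.
def fake_translator(batch):
    return [{"translation_text": text.upper()} for text in batch]


print(translate_in_batches(fake_translator, ["hola", "adiós", "gracias"], batch_size=2))
```

With the real pipeline, `translate_fn` would be the `translator` object itself, and the result is already a flat list of strings, so the `text[0]["translation_text"]` unpacking step would not be needed.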


In [None]:
translations = [text[0]["translation_text"] for text in tr_en_texts]
print(translations[0])
#write_translations("translations_esxnli_1000_es_en.txt", translations)
write_translations("translations_esxnli_all_es_en.txt", translations)

Let's see if we all have to go on strike until we collect whatever we want.


## Creation of a comparison file

In [None]:
#spanish = es_corpus[:1000]
spanish = es_corpus

In [None]:
#english = en_corpus[:1000]
english = en_corpus



In [None]:
# The English file has one extra stray line ('sentence2,') at index 830; remove it so the English and Spanish lists align
print(english[830])
english.pop(830)

'sentence2,'
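The hard-coded index 830 only works for this exact file. A small sketch of a more defensive check follows: it flags suspiciously short lines that end in a comma and verifies that all lists have the same length before a DataFrame is built. The helpers `find_suspect_lines` and `check_alignment`, and the `max_chars=12` heuristic, are assumptions made for illustration and are not part of the original notebook.

```python
def find_suspect_lines(lines, max_chars=12):
    """Return (index, line) pairs for very short lines that end with a comma.

    Heuristic only (an assumption): a stray CSV header fragment such as
    'sentence2,' is short and ends with a comma, unlike a normal sentence.
    """
    return [
        (index, line)
        for index, line in enumerate(lines)
        if len(line) <= max_chars and line.endswith(",")
    ]


def check_alignment(**columns):
    """Raise ValueError unless all named lists have the same length."""
    lengths = {name: len(rows) for name, rows in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Misaligned corpora: {lengths}")
    return lengths


# Toy data standing in for the real corpora.
english_demo = ["A sentence.", "sentence2,", "Another sentence."]
spanish_demo = ["Una frase.", "Otra frase."]

suspects = find_suspect_lines(english_demo)
print(suspects)

# Remove from the end so earlier indices stay valid.
for index, _ in reversed(suspects):
    english_demo.pop(index)

print(check_alignment(spanish=spanish_demo, english=english_demo))
```

Any flagged line should still be inspected by hand before removal, since a genuine short sentence could match the heuristic.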

In [None]:
translations = read_corpus("translations_esxnli_all_es_en.txt")

In [None]:
comparison_df = pd.DataFrame({'spanish' : spanish,
                                'english_human' : english,
                                'english_mt' : translations },
                                columns=['spanish','english_human', 'english_mt'])


In [None]:
comparison_df

Unnamed: 0,spanish,english_human,english_mt
0,A ver si nos tenemos que poner todos en huelga...,Maybe we will all have to go on strike until w...,Let's see if we all have to go on strike until...
1,"""Profesor de FP y funcionario de carrera de la...","""He is a professor of Vocational Training and ...","""Professor of FP and career officer of the Dep..."
2,"""Dos horas después estaba controlado, pero tod...","""Two hours later it was under control, but the...","""Two hours later it was controlled, but the ca..."
3,Estos que se hacen llamar periodistas se creen...,"""They call themselves journalists and they bel...",Those who call themselves journalists think th...
4,"""Por eso, un buen día, explota, y a lomos de s...","""Therefore, one day he explodes, and on the ba...","""That's why one good day, it explodes, and on ..."
...,...,...,...
3315,Cynthia ha mantenido una relación sentimental ...,Cynthia has maintained a sentimental relations...,Cynthia has had a sentimental relationship wit...
3316,Cynthia no tiene ningún abogado que la represe...,"Cynthia does not have a lawyer to represent her.,",Cynthia doesn't have a lawyer to represent her.
3317,Al-Fayed y Eva Mendes son los propietarios de ...,Al-Fayed and Eva Mendes own the department sto...,Al-Fayed and Eva Mendes are the owners of the ...
3318,Al-Fayed firmó a Eva Mendes imágenes de la cam...,Al-Fayed signed pictures of the campaign for E...,Al-Fayed signed Eva Mendes images of the campa...


In [None]:
comparison_df.to_csv('comparison-all-esxnli.csv', index=True)

## Automatic analysis

### Comparison of unigrams (words)

In [None]:
from collections import Counter
import nltk
from nltk import word_tokenize

nltk.download('punkt')


# Tokenize function
def tokenize(text):
    return word_tokenize(text.lower())

# Function to count word frequencies
def count_word_frequencies(texts):
    all_words = []
    for text in texts:
        tokens = tokenize(text)
        all_words.extend(tokens)
    return Counter(all_words)

# Calculate word frequencies for human translations
human_word_freq = count_word_frequencies(comparison_df['english_human'])

# Calculate word frequencies for machine translations
mt_word_freq = count_word_frequencies(comparison_df['english_mt'])

# Compare word frequencies
common_words = set(human_word_freq.keys()) & set(mt_word_freq.keys())

# Identify significant differences in word frequencies
differences = {}
for word in common_words:
    human_freq = human_word_freq[word]
    mt_freq = mt_word_freq[word]
    if human_freq != mt_freq:
        differences[word] = (human_freq, mt_freq)

# Set the threshold for significant frequency difference
threshold = 20

# Output the significant differences
print("Significant differences in word frequencies by count:")
for word, (human_freq, mt_freq) in differences.items():
    if abs(human_freq - mt_freq) >= threshold:
        print(f"Word: {word}, Human Frequency: {human_freq}, MT Frequency: {mt_freq}")

# Calculate the total number of instances (the larger of the two counts for each differing word)
total_instances = sum(max(human_freq, mt_freq) for human_freq, mt_freq in differences.values())

# Define the percentage threshold
percentage_threshold = 0.0005  # adjust as needed

# The resulting count threshold is the same for every word, so compute it once
threshold_p = percentage_threshold * total_instances

print("\nSignificant differences in word frequencies by percentage:")
for word, (human_freq, mt_freq) in differences.items():
    if abs(human_freq - mt_freq) >= threshold_p:
        print(f"Word: {word}, Human Frequency: {human_freq}, MT Frequency: {mt_freq}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Significant differences in word frequencies by count:
Word: been, Human Frequency: 81, MT Frequency: 133
Word: this, Human Frequency: 179, MT Frequency: 159
Word: don, Human Frequency: 38, MT Frequency: 2
Word: n't, Human Frequency: 83, MT Frequency: 218
Word: ., Human Frequency: 848, MT Frequency: 2375
Word: 're, Human Frequency: 10, MT Frequency: 42
Word: to, Human Frequency: 1100, MT Frequency: 1155
Word: during, Human Frequency: 42, MT Frequency: 20
Word: they, Human Frequency: 206, MT Frequency: 157
Word: s, Human Frequency: 104, MT Frequency: 2
Word: an, Human Frequency: 139, MT Frequency: 118
Word: will, Human Frequency: 313, MT Frequency: 271
Word: ', Human Frequency: 23, MT Frequency: 45
Word: from, Human Frequency: 165, MT Frequency: 144
Word: the, Human Frequency: 3376, MT Frequency: 3557
Word: is, Human Frequency: 895, MT Frequency: 803
Word: ``, Human Frequency: 1733, MT Frequency: 814
Word: 's, Human Frequency: 154, MT Frequency: 349
Word: has, Human Frequency: 264, MT Fr
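The comparison above uses raw counts and only looks at words present in both corpora, so a word that appears in just one of them is never reported, and differences in corpus size are not accounted for. A hedged sketch of an alternative follows: it ranks words by the smoothed log2 ratio of their relative frequencies over the union of both vocabularies. The function name `relative_frequency_shifts`, the add-one smoothing, and the `min_total` cut-off are choices made here for illustration, not the notebook's method.

```python
import math
from collections import Counter


def relative_frequency_shifts(human_counts, mt_counts, min_total=20):
    """Rank words by the smoothed log2 ratio of MT vs. human relative frequency.

    Positive scores mean the word is relatively more frequent in the MT text,
    negative scores mean it is relatively more frequent in the human text.
    """
    human_total = sum(human_counts.values())
    mt_total = sum(mt_counts.values())
    vocab = set(human_counts) | set(mt_counts)
    rows = []
    for word in vocab:
        h = human_counts.get(word, 0)
        m = mt_counts.get(word, 0)
        if h + m < min_total:
            continue  # too rare for the ratio to be informative
        # Add-one smoothing keeps the ratio finite when one count is zero.
        h_rate = (h + 1) / (human_total + len(vocab))
        m_rate = (m + 1) / (mt_total + len(vocab))
        rows.append((word, h, m, math.log2(m_rate / h_rate)))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)


# Toy counts, not taken from the corpus.
human_demo = Counter({"a": 50, "b": 50})
mt_demo = Counter({"a": 50, "b": 10, "c": 40})

for word, h, m, score in relative_frequency_shifts(human_demo, mt_demo):
    print(f"{word}: human={h}, mt={m}, log2 ratio={score:+.2f}")
```

In the notebook, `human_word_freq` and `mt_word_freq` could be passed in directly, since both are `Counter` objects.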

### Estimating lexical richness: Perplexity & Type-Token Ratio

In [None]:
import numpy as np
from collections import Counter

def calculate_unigram_probabilities(texts):
    # Count the frequency of each word
    word_freq = count_word_frequencies(texts)

    # Calculate unigram model probabilities
    total_words = sum(word_freq.values())  # total number of words in corpus
    probs = {word: count / total_words for word, count in word_freq.items()}

    return probs

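def perplexity(probs):
    # NOTE: this averages log2-probabilities with equal weight per word type,
    # so rare words dominate; it is not the token-weighted unigram perplexity.
    log_probabilities = np.log2(probs)
    entropy = -np.mean(log_probabilities)
    return 2 ** entropy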

def ttr(texts):  # type-token ratio
    # Obtain all corpus words
    words = []
    for text in texts:
        tokens = tokenize(text)
        words.extend(tokens)

    # Count the number of unique words (types)
    unique_words = set(words)
    num_unique_words = len(unique_words)

    # Count the total number of words (tokens)
    num_tokens = len(words)

    # Calculate the TTR
    ttr = num_unique_words / num_tokens

    return ttr

# Calculate probabilities
human_probs = calculate_unigram_probabilities(comparison_df['english_human'])
mt_probs = calculate_unigram_probabilities(comparison_df['english_mt'])

# Calculate perplexities
human_perp = perplexity(list(human_probs.values()))
mt_perp = perplexity(list(mt_probs.values()))

# Calculate TTRs
human_ttr = ttr(comparison_df['english_human'])
mt_ttr = ttr(comparison_df['english_mt'])

# Comparison time!
print(f"Human texts PERPLEXITY: {human_perp} | MT texts PERPLEXITY: {mt_perp}")  # does this make any sense? what is our baseline?
print(f"Human texts TTR: {human_ttr} | MT texts TTR: {mt_ttr}")  # 15.4% vs 13.6% - significant?

Human texts PERPLEXITY: 25220.061396570567 | MT texts PERPLEXITY: 21416.966481893287
Human texts TTR: 0.15371484150141923 | MT texts TTR: 0.13619065443882963
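The `perplexity` function above takes an unweighted mean of log-probabilities over word types, which may explain why the values are so large. For comparison, here is a sketch of the more conventional unigram perplexity, where the entropy is weighted by each word's probability: $H = -\sum_w p(w)\log_2 p(w)$ and perplexity $= 2^H$. The function name `unigram_perplexity` is introduced here for illustration, and the model is evaluated on the same tokens it was estimated from, so this is a descriptive statistic rather than a held-out evaluation.

```python
import math
from collections import Counter


def unigram_perplexity(tokens):
    """Perplexity of the maximum-likelihood unigram model on its own tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one token")
    # Shannon entropy in bits, weighting each word by its relative frequency.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy


# Four equally likely words behave like a fair four-way choice.
print(unigram_perplexity(["a", "b", "c", "d"]))
# A skewed distribution is more predictable, so perplexity drops.
print(unigram_perplexity(["a", "a", "a", "b"]))
```

With this definition the value is bounded above by the vocabulary size, which gives a natural baseline: a corpus that used every word equally often would reach that maximum.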


### Comparison of bigrams

In [None]:
from nltk import bigrams
from collections import Counter
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
import nltk

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


# Function to count bigram frequencies, excluding stopwords and punctuation
def count_bigram_frequencies_clean(texts):
    all_bigrams = []
    for text in texts:
        # Lowercase, then drop stopwords and tokens made up only of ASCII punctuation.
        # string.punctuation does not include curly quotes such as “ and ”, so those survive.
        tokens = [token.lower() for token in word_tokenize(text)]
        tokens = [token for token in tokens
                  if token not in stop_words
                  and not all(char in string.punctuation for char in token)]
        all_bigrams.extend(bigrams(tokens))
    return Counter(all_bigrams)

# Calculate bigram frequencies for human translations (excluding stopwords and punctuation)
human_bigram_freq_clean = count_bigram_frequencies_clean(comparison_df['english_human'])

# Calculate bigram frequencies for machine translations (excluding stopwords and punctuation)
mt_bigram_freq_clean = count_bigram_frequencies_clean(comparison_df['english_mt'])

# Get the top 50 bigrams for human translations
top_50_human_bigrams_clean = human_bigram_freq_clean.most_common(50)

# Get the top 50 bigrams for machine translations
top_50_mt_bigrams_clean = mt_bigram_freq_clean.most_common(50)

# Output the top 50 bigrams for human translations
print("Top 50 meaningful bigrams for human translations:")
for bigram, freq in top_50_human_bigrams_clean:
    print(f"Bigram: {bigram}, Frequency: {freq}")

# Output the top 50 bigrams for machine translations
print("\nTop 50 meaningful bigrams for machine translations:")
for bigram, freq in top_50_mt_bigrams_clean:
    print(f"Bigram: {bigram}, Frequency: {freq}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Top 50 meaningful bigrams for human translations:
Bigram: ('”', '“'), Frequency: 12
Bigram: ('ca', "n't"), Frequency: 10
Bigram: ('real', 'estate'), Frequency: 10
Bigram: ('company', "'s"), Frequency: 8
Bigram: ('michael', 'jackson'), Frequency: 8
Bigram: ('world', 'cup'), Frequency: 8
Bigram: ('city', 'council'), Frequency: 7
Bigram: ('united', 'states'), Frequency: 7
Bigram: ('three', 'months'), Frequency: 7
Bigram: ('long', 'time'), Frequency: 7
Bigram: ('de', 'azúa'), Frequency: 7
Bigram: ('years', 'old'), Frequency: 6
Bigram: ('young', 'people'), Frequency: 6
Bigram: ('wo', "n't"), Frequency: 6
Bigram: ('children', "'s"), Frequency: 6
Bigram: ('last', 'year'), Frequency: 6
Bigram: ('million', 'people'), Frequency: 6
Bigram: ('fair', 'trade'), Frequency: 6
Bigram: ('’', 'know'), Frequency: 6
Bigram: ('louis', 'vuitton'), Frequency: 6
Bigram: ('last', 'year.'), Frequency: 6
Bigram: ('real', 'madrid'), Frequency: 5
Bigram: ('basque', 'country'), Frequency: 5
Bigram: ('would', 'like')

### Comparison of trigrams

In [None]:
from nltk import trigrams

# Function to count trigram frequencies, excluding stopwords and punctuation
def count_trigram_frequencies_clean(texts):
    all_trigrams = []
    for text in texts:
        # Same cleaning as for bigrams: lowercase, drop stopwords and ASCII-punctuation-only tokens
        tokens = [token.lower() for token in word_tokenize(text)]
        tokens = [token for token in tokens
                  if token not in stop_words
                  and not all(char in string.punctuation for char in token)]
        all_trigrams.extend(trigrams(tokens))
    return Counter(all_trigrams)

# Calculate trigram frequencies for human translations (excluding stopwords and punctuation)
human_trigram_freq_clean = count_trigram_frequencies_clean(comparison_df['english_human'])

# Calculate trigram frequencies for machine translations (excluding stopwords and punctuation)
mt_trigram_freq_clean = count_trigram_frequencies_clean(comparison_df['english_mt'])

# Get the top 50 trigrams for human translations
top_50_human_trigrams_clean = human_trigram_freq_clean.most_common(50)

# Get the top 50 trigrams for machine translations
top_50_mt_trigrams_clean = mt_trigram_freq_clean.most_common(50)

# Output the top 50 trigrams for human translations
print("Top 50 meaningful trigrams for human translations:")
for trigram, freq in top_50_human_trigrams_clean:
    print(f"Trigram: {trigram}, Frequency: {freq}")

# Output the top 50 trigrams for machine translations
print("\nTop 50 meaningful trigrams for machine translations:")
for trigram, freq in top_50_mt_trigrams_clean:
    print(f"Trigram: {trigram}, Frequency: {freq}")


Top 50 meaningful trigrams for human translations:
Trigram: ('increase', 'agricultural', 'production'), Frequency: 4
Trigram: ('condoleeza', 'rice', 'mahmud'), Frequency: 4
Trigram: ('long', 'time', 'ago'), Frequency: 4
Trigram: ('real', 'estate', 'market'), Frequency: 4
Trigram: ('“', 'gladiator', '”'), Frequency: 4
Trigram: ('eighty', 'desirable', 'men'), Frequency: 4
Trigram: ('john', 'edward', 'thomas'), Frequency: 4
Trigram: ('33', 'chilean', 'miners'), Frequency: 4
Trigram: ('rock', 'band', 'aerosmith'), Frequency: 4
Trigram: ('finnish', 'computer', 'engineer'), Frequency: 3
Trigram: ('consumer', 'protection', 'improvement'), Frequency: 3
Trigram: ('contained', 'small', 'capsules'), Frequency: 3
Trigram: ('agricultural', 'production', 'developing'), Frequency: 3
Trigram: ('sexual', 'identity', 'law'), Frequency: 3
Trigram: ('rice', 'mahmud', 'abas'), Frequency: 3
Trigram: ('homosexual', 'couples', 'rights'), Frequency: 3
Trigram: ('couples', 'rights', 'heterosexual'), Frequency: 
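The bigram and trigram functions differ only in the n-gram length. A minimal sketch of a single counter parameterised by `n` follows. It works on already tokenised and cleaned sentences and uses only the standard library; the name `count_ngrams` is hypothetical and the tokenisation and filtering steps are assumed to happen beforehand.

```python
from collections import Counter


def count_ngrams(token_lists, n):
    """Count n-grams over an iterable of token lists.

    N-grams never cross sentence boundaries, because each token list is
    processed separately.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for tokens in token_lists:
        # Zip the list against n shifted copies of itself to form the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Toy, pre-cleaned sentences.
demo_sentences = [["real", "estate", "market"], ["real", "estate"]]
print(count_ngrams(demo_sentences, 2).most_common(2))
print(count_ngrams(demo_sentences, 3).most_common(1))
```

Sentences shorter than `n` tokens simply contribute nothing, so no special-casing is needed.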

## Creation of comparison file for manual linguistic analysis

In [None]:
import pandas as pd

# Load the original DataFrame
# The file was saved with its index, so read the first column back as the index
comparison_df = pd.read_csv('comparison-all-esxnli.csv', index_col=0)

# Create a new DataFrame for the two corpora with a category column
corpus_df = pd.concat([comparison_df[['spanish', 'english_human']].rename(columns={'english_human': 'text'}),
                       comparison_df[['spanish', 'english_mt']].rename(columns={'english_mt': 'text'})],
                      ignore_index=True)

# Add a category column indicating human or machine translation
corpus_df['category'] = ['human-translation'] * len(comparison_df) + ['machine-translation'] * len(comparison_df)

# Save the new DataFrame to a CSV file
corpus_df.to_csv('corpus_with_category.csv', index=False)


## Scattertext

In [None]:
import pandas as pd
import scattertext as st

corpus_df = pd.read_csv('corpus_with_category.csv')

# creating a Scattertext Corpus
corpus = st.CorpusFromPandas(corpus_df,
                              category_col='category',
                              text_col='text',
                              nlp=st.whitespace_nlp_with_sentences
                             ).build()

# generate the HTML visualization
html = st.produce_scattertext_explorer(corpus,
                                       category='human-translation',
                                       category_name='Human Translation',
                                       not_category_name='Machine Translation',
                                       width_in_pixels=1000,
                                       metadata=corpus_df['spanish'])

# Save the HTML file
with open('mt-scattertext_visualization-1.html', 'w', encoding='utf-8') as f:
    f.write(html)