# Parallel Corpora and LMs

You receive a small parallel corpus extracted from EuroParl, a very influential corpus in Machine Translation which contains speeches delivered at the European Parliament translated in a variety of languages: you'll work with English, Dutch, and Italian. The corpus comes as a .json file containing a dictionary mapping IDs to sentences in the three target languages, indicated as 'en', 'it', and 'nl'. Sentences under each language come as a list of strings, and sentences have been tokenized already, with tokens separated by white spaces.

I've resolved some incongruencies from the first attempt that were identified in the discussion board, so make sure to re-read the assignment carefully and to respect naming conventions: there are examples to guide you, and if in doubt ask on the discussion board.

I've also decided to eliminate the sub-task on splitting training and test set: nearly everybody completed it correctly, so it wasn't really making a difference and was introducing potential sources of divergence in the results because multiple splitting procedures were possible given the specification. You now get two files, one for training and one for testing, that you should pre-process as indicated. The 1 point awarded for correctly splitting the original corpus is awarded by default to everybody, so the assignment is still out of 40 points and everybody starts with 1 point.

You should carry out the following tasks:

1. read the input .json files for training and test, then:
	- pre-process sentences in all three languages by 
        * lowercasing everything (1pt)
        * replacing each digit with the capital letter D (1pt)
        * removing all characters that aren't letters (mind all sorts of special characters!) or white spaces (2pts)
    
> 4 points available, assigned as indicated above if the step is carried out correctly (everything is lowercased, all numbers replaced, only letters and white spaces). 


2. train a total of four character-level Statistical Language Models using the LM class provided in Notebook04 (make sure the resulting object is of class LM and has attributes _counts_, _vocab_, and _vocab\_size_), with add-k smoothing and k=0.01:
	- a model predicting the current character based on the two previous characters
	- a model predicting the current character based on the four previous characters

  given the following inputs:
    - the English sentences in the training set, after getting rid of all white spaces
    - the word types extracted from the English sentences in the training set



- !!! Replace any character which occurs fewer than 20 times in the English sentences from the training set with the string '?'. 
- !!! Remember that language models can only be compared if they have the same vocabulary: make sure that all models are trained using the vocabulary of the models trained on English sentences, not word types.
- !!! Get inspiration from the Corpus and LM classes introduced in class, but edit them to fit the task.
- !!! remember to set BoS and EoS correctly.
    
At the end of task 2 you should have four LMs all having the same vocabulary:
* a character-level language model trained on full English sentences without white spaces to predict the next character given the two preceding ones
* a character-level language model trained on word types from the English training sentences to predict the next character given the two preceding ones
* a character-level language model trained on full English sentences without white spaces to predict the next character given the four preceding ones
* a character-level language model trained on word types from the English training sentences to predict the next character given the four preceding ones

You should submit a .pkl file for each LM, dumping the LMs to .pkl files and naming files using the template Name(Initial)Surname\_[words|sents]\_[2gr|4gr]\_en.pkl ( the | symbol means OR ). Therefore, John K. Doe should name his model trained on sentences and predicting based on two preceding characters JohnKDoe\_sents\_2gr\_en.pkl. If you don't have a middle name, just use NameSurname. If you have multiple surnames, add them as NameSurname1Surname2, with no intervening spaces. The notebook contains the code backbone to save a .pkl file, you need to edit it to choose the correct object to save and the appropriate file name given your name and surname.

> 4 points available: we will automatically check whether 5 random transition counts and the vocabulary of your models check out with ours. For each model where the check succeeds, you will receive one point.


3. Compute the perplexity of all four Language Models on:
	- the English sentences from the training set
	- the English word types from the training set
	- the English sentences from the test set
	- the Dutch sentences from the test set
	- the Italian sentences from the test set

You should submit a .csv file with the following structure, column names, and values ([2/4] means either 2 for LMs predicting based on two previous characters or 4 for LMs predicting based on four previous characters, the options under test_data indicate the five sets to be used to compute perplexity):

|ngram_size|training_data|test_data|perplexity|
|---|---|---|---|
|[2/4]|[words/sents]|[ITtest/NLtest/ENtest/ENtrain\_sents/ENtrain\_words]|float (rounded at 4 decimal places)|
|---|---|---|---|

The file should be named according to the template Name(Initial)Surname\_perplexities.csv

> 5 points available: you get 1 point if all four LMs yield the correct perplexity scores for a test_dataset.


4. Out of all Italian and Dutch word types in the test sentences, restricting attention to word types consisting of at least 5 characters and with at least 5 occurrences in the Italian/Dutch test sentences, find:
	- the word in each language with the lowest perplexity according to each of the four LMs
	- the word in each language with the highest perplexity according to each of four LMs

You should submit two .csv files (one for the lowest perplexities, one for highest perplexities) with the following structure, column names, and values ([it/nl] indicates the language, with it indicating italian and nl indicating dutch, str indicates that the word should appear as a string, [2/4] means either 2 for LMs predicting based on two previous characters or 4 for LMs predicting based on four previous characters, [words/sents] indicates whether the model identifying that particular word on that language was trained on word types or sentences):

| lang | word | ngram_size | training_data | perplexity |
|---|---|---|---|---|
|[it/nl]|str|[2/4]|[words/sents]|float (rounded at 4 decimal places)|
|---|---|---|---|

The files should be named according to the template Name(Initial)Surname\_perplexities\_[max|min].csv, so Jane Smith should submit a file named JaneSmith\_perplexities\_max.csv containing 8 rows each storing the word with the highest perplexity according to each of the four LMs per language.

> 4 points available: you get 0.25 points for each correct word identified


5. Answer the following questions:
	- a. compare LMs' perplexity on the English training sets, sentences and words, then explain the differences in perplexity considering what changes between the two training set-ups. (5 pts, 150 words)
	- b. which LM trained on sentences generalizes better to unseen sentences in the same language, bigram or tetragram? explain why this is the case. (5 pts, 150 words)
	- c. compare LMs trained on English in their ability to fit Italian and Dutch sentences: which factor between ngram size and training corpus (words or sentences) affects perplexity the most? Explain why we observe this pattern. (4 pts, 100 words)
	- d. what patterns can you identify in the words with the lowest perplexity in Dutch and Italian? (4 pts, 100 words)
	- e. what patterns can you identify in the words with the highest perplexity in Dutch and Italian? (4 pts, 100 words)
    
> 22 points in total, see specifications next to each question


Summing up, you will have to submit 8 files:
- 1 python notebook in .ipynb format - name the file as \Name(Initial)Surname\_CLassignment.ipynb
- 4 .pkl files each containing an LM object with attributes _counts_, _vocab_, and _vocab\_size_ (you can of course add multiple attributes if it helps you, but these three have to be there, with those exact names!)
- 3 .csv files, one storing the perplexity for each model on the five possible test sets (Italian, Dutch, English sentences from the test set; English sentences from training set, and English word types from the training set); one storing the words in Dutch and Italian with the highest perplexity according to each LM; one storing the words in Dutch and Italian with the lowest perplexity according to each LM.

## Task1

1. read the input .json files for training and test, then:
	- pre-process sentences in all three languages by 
        * lowercasing everything (1pt)
        * replacing each digit with the capital letter D (1pt)
        * removing all characters that aren't letters (mind all sorts of special characters!) or white spaces (2pts)
    
> 4 points available, assigned as indicated above if the step is carried out correctly (everything is lowercased, all numbers replaced, only letters and white spaces). 


In [4]:
import json
import pickle as pkl
import re
import numpy as np
from collections import defaultdict, Counter
import pandas as pd
import csv

In [5]:
# Define the file paths for training and testing datasets
train_file_path = '/Users/belizpekkan/Desktop/Programming/Computational Linguistics/Assignment Submission 2/training_parallel_sentences_en-nl-it.json'
test_file_path = '/Users/belizpekkan/Desktop/Programming/Computational Linguistics/Assignment Submission 2/test_parallel_sentences_en-nl-it.json'

def preprocess_text(text):
    text = text.lower()  # Lowercase 
    text = re.sub(r'\d', 'D', text)  # Replace each digit with 'D'    
    text = re.sub(r'[^a-zàèéìòóùëïöüáí\s]', '', text) # Remove non-letter characters except accented ones and white spaces
    return text

def preprocess_sentences(sentences):
    return [preprocess_text(sentence) for sentence in sentences]

def load_and_preprocess_corpus(file_path):
    with open(file_path, 'r') as file:
        corpus = json.load(file)

    for sentence_id in corpus:
        for lang in corpus[sentence_id]:
            corpus[sentence_id][lang] = preprocess_sentences(corpus[sentence_id][lang])
    return corpus

def replace_infrequent_chars(sentences, threshold=20):
    all_chars = [char for sentence in sentences for char in sentence]
    char_freq = Counter(all_chars)
    frequent_chars = {char for char, freq in char_freq.items() if freq >= threshold}
    processed_sentences = [
        ''.join([char if char in frequent_chars else '?' for char in sentence])
        for sentence in sentences
    ]
    return processed_sentences, frequent_chars

# Load and preprocess the training and testing datasets
train_corpus = load_and_preprocess_corpus(train_file_path)
test_corpus = load_and_preprocess_corpus(test_file_path)

# Output the size of the corpora
print(f"Training corpus size: {len(train_corpus)}")
print(f"Testing corpus size: {len(test_corpus)}")

# Example to print a preprocessed sentence from the training corpus
example_sentence_id = next(iter(train_corpus))
print(f"Example preprocessed sentence (train corpus):")
for lang, sentences in train_corpus[example_sentence_id].items():
    print(f"{lang}: {sentences[0]}")

Training corpus size: 6117
Testing corpus size: 1529
Example preprocessed sentence (train corpus):
en: member of the commission   i agree with your way of reasoning and i think it is essential that we see the euro as a key policy instrument for economic policy and sustainable growth in europe and  at the same time  that we look at it not only as a symbol but also as a bond for europeans in building the common european home 
nl: lid van de commissie    en  ik ben het eens met uw redenering  en ik denk dat het essentieel is dat we de euro zien als een onmisbaar beleidsinstrument voor economisch beleid en duurzame groei in europa en dat we hem tegelijkertijd niet alleen zien als een symbool  maar ook als een verbindende factor voor europeanen bij het bouwen van ons gemeenschappelijk europees thuis 
it: signor presidente  concordo con tale ragionamento e penso che sia essenziale vedere l euro come strumento politico essenziale per la politica economica e la crescita sostenibile in europa e

In [6]:
# Preprocess test data
english_test_sentences = [''.join(test_corpus[sentence_id]['en']) for sentence_id in test_corpus]
english_test_sentences_no_spaces = [list(sentence.replace(' ', '')) for sentence in english_test_sentences]

dutch_test_sentences = [''.join(test_corpus[sentence_id]['nl']) for sentence_id in test_corpus]
dutch_test_sentences_no_spaces = [list(sentence.replace(' ', '')) for sentence in dutch_test_sentences]

italian_test_sentences = [''.join(test_corpus[sentence_id]['it']) for sentence_id in test_corpus]
italian_test_sentences_no_spaces = [list(sentence.replace(' ', '')) for sentence in italian_test_sentences]

## Task2

2. train a total of four character-level Statistical Language Models using the LM class provided in Notebook04 (make sure the resulting object is of class LM and has attributes _counts_, _vocab_, and _vocab\_size_), with add-k smoothing and k=0.01:
	- a model predicting the current character based on the two previous characters
	- a model predicting the current character based on the four previous characters

  given the following inputs:
    - the English sentences in the training set, after getting rid of all white spaces
    - the word types extracted from the English sentences in the training set



- !!! Replace any character which occurs fewer than 20 times in the English sentences from the training set with the string '?'. 
- !!! Remember that language models can only be compared if they have the same vocabulary: make sure that all models are trained using the vocabulary of the models trained on English sentences, not word types.
- !!! Get inspiration from the Corpus and LM classes introduced in class, but edit them to fit the task.
- !!! remember to set BoS and EoS correctly.
    
At the end of task 2 you should have four LMs all having the same vocabulary:
* a character-level language model trained on full English sentences without white spaces to predict the next character given the two preceding ones
* a character-level language model trained on word types from the English training sentences to predict the next character given the two preceding ones
* a character-level language model trained on full English sentences without white spaces to predict the next character given the four preceding ones
* a character-level language model trained on word types from the English training sentences to predict the next character given the four preceding ones

You should submit a .pkl file for each LM, dumping the LMs to .pkl files and naming files using the template Name(Initial)Surname\_[words|sents]\_[2gr|4gr]\_en.pkl ( the | symbol means OR ). Therefore, John K. Doe should name his model trained on sentences and predicting based on two preceding characters JohnKDoe\_sents\_2gr\_en.pkl. If you don't have a middle name, just use NameSurname. If you have multiple surnames, add them as NameSurname1Surname2, with no intervening spaces. The notebook contains the code backbone to save a .pkl file, you need to edit it to choose the correct object to save and the appropriate file name given your name and surname.

> 4 points available: we will automatically check whether 5 random transition counts and the vocabulary of your models check out with ours. For each model where the check succeeds, you will receive one point.

In [7]:
class Corpus:
    def __init__(self, sentences, t=20, n=2, bos_eos=True):
        self.sentences = sentences
        self.t = t
        self.ngram_size = n
        self.bos_eos = bos_eos
        
        self.sentences, self.vocab = replace_infrequent_chars(self.sentences, self.t)
        
        if self.bos_eos:
            self.sentences = self.add_bos_eos()

    def add_bos_eos(self):
        """
        Adds the necessary number of BOS symbols and one EOS symbol.
        """
        r = self.ngram_size - 1
        padded_sentences = []
        for sentence in self.sentences:
            padded_sentence = ['#bos#']*r + list(sentence) + ['#eos#']
            padded_sentences.append(padded_sentence)
        return padded_sentences

# LM class for language model
def default_int_dict():
    return defaultdict(int)

class LM:
    def __init__(self, corpus_sentences, n, k=0.01):
        self.ngram_size = n
        self.k = k
        
        all_chars = [char for sentence in corpus_sentences for char in sentence]
        char_freq = Counter(all_chars)
        
        frequent_chars = {char for char, freq in char_freq.items() if freq >= 20}
        
        processed_sentences = [
            [char if char in frequent_chars else '?' for char in sentence]
            for sentence in corpus_sentences
        ]
        
        self.vocab = frequent_chars.union({'?'})
        self.vocab_size = len(self.vocab)
        self.counts = defaultdict(default_int_dict)
        self.update_counts(processed_sentences)

    def update_counts(self, sentences):
        for sentence in sentences:
            sentence = ['#bos#'] * (self.ngram_size - 1) + list(sentence) + ['#eos#']
            for i in range(len(sentence) - self.ngram_size + 1):
                ngram = tuple(sentence[i:i+self.ngram_size])
                prefix = ngram[:-1]
                char = ngram[-1]
                self.counts[prefix][char] += 1

    def calculate_probability(self, prefix, char):
        prefix_counts = sum(self.counts[prefix].values()) + self.k * self.vocab_size
        char_count = self.counts[prefix].get(char, 0) + self.k
        return char_count / prefix_counts

    def perplexity(self, sentences):
        log_prob = 0
        num_chars = 0
        
        for sentence in sentences:
            sentence = ['#bos#'] * (self.ngram_size - 1) + list(sentence) + ['#eos#']
            num_chars += len(sentence) - self.ngram_size + 1
            for i in range(len(sentence) - self.ngram_size + 1):
                ngram = tuple(sentence[i:i+self.ngram_size])
                prefix = ngram[:-1]
                char = ngram[-1]
                prob = self.calculate_probability(prefix, char)
                log_prob += np.log(prob)
        
        return np.exp(-log_prob / num_chars)

    def save(self, file_name):
        with open(file_name, 'wb') as f_out:
            pkl.dump(self, f_out)

In [8]:
# English sentences in the training set, after getting rid of all white spaces
english_training_sentences = [''.join(train_corpus[sentence_id]['en']) for sentence_id in train_corpus]
english_training_sentences_no_spaces = [list(sentence.replace(' ', '')) for sentence in english_training_sentences]
print(english_training_sentences_no_spaces[0])

all_chars = ''.join([''.join(sentence) for sentence in english_training_sentences_no_spaces])
char_freq = Counter(all_chars)
frequent_chars = {char for char, freq in char_freq.items() if freq >= 20}
english_words = set(word for sentence in english_training_sentences for word in sentence.split())

english_word_types_prepared = [
    ''.join([char if char in frequent_chars else '?' for char in word])
    for word in english_words
]

english_word_types_sentences = [list(word) for word in english_word_types_prepared]
sentence_as_lists_of_chars_per_word = [
    [list(word) for word in sentence.split()] for sentence in english_training_sentences
]
print(sentence_as_lists_of_chars_per_word[0])

# Train the LMs
corpus_sent = Corpus(english_training_sentences_no_spaces, t=20, n=2, bos_eos=True).sentences
lm_sent_2gram = LM(corpus_sent, n=2, k=0.01)

corpus_word_types = english_word_types_sentences
lm_word_2gram = LM(corpus_word_types, n=2, k=0.01)

corpus_sent = Corpus(english_training_sentences_no_spaces, t=20, n=4, bos_eos=True).sentences
lm_sent_4gram = LM(corpus_sent, n=4, k=0.01)

corpus_word_types = english_word_types_sentences
lm_word_4gram = LM(corpus_word_types, n=4, k=0.01)

# Save the LMs
lm_sent_2gram.save('BelizPekkan_sents_2gr_en.pkl')
lm_word_2gram.save('BelizPekkan_words_2gr_en.pkl')
lm_sent_4gram.save('BelizPekkan_sents_4gr_en.pkl')
lm_word_4gram.save('BelizPekkan_words_4gr_en.pkl')

# Output the size of the corpora and example preprocessed sentence
print(f"Training corpus size: {len(train_corpus)}")
print(f"Example preprocessed sentence (train corpus): {english_training_sentences_no_spaces[0]}")

['m', 'e', 'm', 'b', 'e', 'r', 'o', 'f', 't', 'h', 'e', 'c', 'o', 'm', 'm', 'i', 's', 's', 'i', 'o', 'n', 'i', 'a', 'g', 'r', 'e', 'e', 'w', 'i', 't', 'h', 'y', 'o', 'u', 'r', 'w', 'a', 'y', 'o', 'f', 'r', 'e', 'a', 's', 'o', 'n', 'i', 'n', 'g', 'a', 'n', 'd', 'i', 't', 'h', 'i', 'n', 'k', 'i', 't', 'i', 's', 'e', 's', 's', 'e', 'n', 't', 'i', 'a', 'l', 't', 'h', 'a', 't', 'w', 'e', 's', 'e', 'e', 't', 'h', 'e', 'e', 'u', 'r', 'o', 'a', 's', 'a', 'k', 'e', 'y', 'p', 'o', 'l', 'i', 'c', 'y', 'i', 'n', 's', 't', 'r', 'u', 'm', 'e', 'n', 't', 'f', 'o', 'r', 'e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', 'p', 'o', 'l', 'i', 'c', 'y', 'a', 'n', 'd', 's', 'u', 's', 't', 'a', 'i', 'n', 'a', 'b', 'l', 'e', 'g', 'r', 'o', 'w', 't', 'h', 'i', 'n', 'e', 'u', 'r', 'o', 'p', 'e', 'a', 'n', 'd', 'a', 't', 't', 'h', 'e', 's', 'a', 'm', 'e', 't', 'i', 'm', 'e', 't', 'h', 'a', 't', 'w', 'e', 'l', 'o', 'o', 'k', 'a', 't', 'i', 't', 'n', 'o', 't', 'o', 'n', 'l', 'y', 'a', 's', 'a', 's', 'y', 'm', 'b', 'o', 'l',

## Task 3

3. Compute the perplexity of all four Language Models on:
	- the English sentences from the training set
	- the English word types from the training set
	- the English sentences from the test set
	- the Dutch sentences from the test set
	- the Italian sentences from the test set

You should submit a .csv file with the following structure, column names, and values ([2/4] means either 2 for LMs predicting based on two previous characters or 4 for LMs predicting based on four previous characters, the options under test_data indicate the five sets to be used to compute perplexity):

|ngram_size|training_data|test_data|perplexity|
|---|---|---|---|
|[2/4]|[words/sents]|[ITtest/NLtest/ENtest/ENtrain\_sents/ENtrain\_words]|float (rounded at 4 decimal places)|
|---|---|---|---|

The file should be named according to the template Name(Initial)Surname\_perplexities.csv

> 5 points available: you get 1 point if all four LMs yield the correct perplexity scores for a test_dataset.




In [9]:
def compute_perplexities(lm, test_sentences, ngram_size, training_data, test_data_label):
    perplexity = lm.perplexity(test_sentences)
    return {
        "ngram_size": ngram_size,
        "training_data": training_data,
        "test_data": test_data_label,
        "perplexity": round(perplexity, 4)
    }

# List to store perplexity results
perplexity_results = []

# Compute perplexities for each combination
perplexity_results.append(compute_perplexities(lm_sent_2gram, english_training_sentences_no_spaces, 2, "sents", "ENtrain_sents"))
perplexity_results.append(compute_perplexities(lm_word_2gram, english_training_sentences_no_spaces, 2, "words", "ENtrain_sents"))
perplexity_results.append(compute_perplexities(lm_sent_4gram, english_training_sentences_no_spaces, 4, "sents", "ENtrain_sents"))
perplexity_results.append(compute_perplexities(lm_word_4gram, english_training_sentences_no_spaces, 4, "words", "ENtrain_sents"))

perplexity_results.append(compute_perplexities(lm_sent_2gram, english_word_types_sentences, 2, "sents", "ENtrain_words"))
perplexity_results.append(compute_perplexities(lm_word_2gram, english_word_types_sentences, 2, "words", "ENtrain_words"))
perplexity_results.append(compute_perplexities(lm_sent_4gram, english_word_types_sentences, 4, "sents", "ENtrain_words"))
perplexity_results.append(compute_perplexities(lm_word_4gram, english_word_types_sentences, 4, "words", "ENtrain_words"))

perplexity_results.append(compute_perplexities(lm_sent_2gram, english_test_sentences_no_spaces, 2, "sents", "ENtest"))
perplexity_results.append(compute_perplexities(lm_word_2gram, english_test_sentences_no_spaces, 2, "words", "ENtest"))
perplexity_results.append(compute_perplexities(lm_sent_4gram, english_test_sentences_no_spaces, 4, "sents", "ENtest"))
perplexity_results.append(compute_perplexities(lm_word_4gram, english_test_sentences_no_spaces, 4, "words", "ENtest"))

perplexity_results.append(compute_perplexities(lm_sent_2gram, dutch_test_sentences_no_spaces, 2, "sents", "NLtest"))
perplexity_results.append(compute_perplexities(lm_word_2gram, dutch_test_sentences_no_spaces, 2, "words", "NLtest"))
perplexity_results.append(compute_perplexities(lm_sent_4gram, dutch_test_sentences_no_spaces, 4, "sents", "NLtest"))
perplexity_results.append(compute_perplexities(lm_word_4gram, dutch_test_sentences_no_spaces, 4, "words", "NLtest"))

perplexity_results.append(compute_perplexities(lm_sent_2gram, italian_test_sentences_no_spaces, 2, "sents", "ITtest"))
perplexity_results.append(compute_perplexities(lm_word_2gram, italian_test_sentences_no_spaces, 2, "words", "ITtest"))
perplexity_results.append(compute_perplexities(lm_sent_4gram, italian_test_sentences_no_spaces, 4, "sents", "ITtest"))
perplexity_results.append(compute_perplexities(lm_word_4gram, italian_test_sentences_no_spaces, 4, "words", "ITtest"))

# Define the output CSV file name
csv_file_name = "BelizPekkan_perplexities.csv"

# Write the results to the CSV file
with open(csv_file_name, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["ngram_size", "training_data", "test_data", "perplexity"])
    writer.writeheader()
    for result in perplexity_results:
        writer.writerow(result)

print(f"Perplexity results saved to {csv_file_name}")

Perplexity results saved to BelizPekkan_perplexities.csv


## Task4

4. Out of all Italian and Dutch word types in the test sentences, restricting attention to word types consisting of at least 5 characters and with at least 5 occurrences in the Italian/Dutch test sentences, find:
	- the word in each language with the lowest perplexity according to each of the four LMs
	- the word in each language with the highest perplexity according to each of four LMs

You should submit two .csv files (one for the lowest perplexities, one for highest perplexities) with the following structure, column names, and values ([it/nl] indicates the language, with it indicating italian and nl indicating dutch, str indicates that the word should appear as a string, [2/4] means either 2 for LMs predicting based on two previous characters or 4 for LMs predicting based on four previous characters, [words/sents] indicates whether the model identifying that particular word on that language was trained on word types or sentences):

| lang | word | ngram_size | training_data | perplexity |
|---|---|---|---|---|
|[it/nl]|str|[2/4]|[words/sents]|float (rounded at 4 decimal places)|
|---|---|---|---|

The files should be named according to the template Name(Initial)Surname\_perplexities\_[max|min].csv, so Jane Smith should submit a file named JaneSmith\_perplexities\_max.csv containing 8 rows each storing the word with the highest perplexity according to each of the four LMs per language.

> 4 points available: you get 0.25 points for each correct word identified


In [11]:
def preprocess_words(word, frequent_chars):
    return ''.join([char if char in frequent_chars else '?' for char in word])

def extract_frequent_words(sentences, min_length=5, min_occurrences=5):
    words = [''.join(word) for sentence in sentences for word in sentence.split()]
    filtered_words = [word for word in words if len(word) >= min_length]
    word_counts = Counter(filtered_words)
    return [word for word, count in word_counts.items() if count >= min_occurrences]

# Extract word types from Italian and Dutch test sentences
italian_words = extract_frequent_words(italian_test_sentences, min_length=5, min_occurrences=5)
dutch_words = extract_frequent_words(dutch_test_sentences, min_length=5, min_occurrences=5)

def compute_word_perplexity(model, word):
    word_processed = preprocess_words(word, model.vocab)
    word_ngrams = ['#bos#'] * (model.ngram_size - 1) + list(word_processed) + ['#eos#']
    log_prob = 0.0
    for i in range(len(word_ngrams) - model.ngram_size + 1):
        ngram = tuple(word_ngrams[i:i + model.ngram_size])
        history = ngram[:-1] if model.ngram_size > 1 else ()
        char = ngram[-1]
        prob = model.calculate_probability(history, char)
        log_prob += np.log(prob if prob > 0 else 1e-10)
    perplexity = np.exp(-log_prob / len(word_ngrams))
    return perplexity

def filter_eligible_words(sentences):
    word_counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, count in word_counts.items() if len(word) > 4 and count >= 5}

# Initialize models dictionary
models = {
    'sent_2gram': lm_sent_2gram,
    'word_2gram': lm_word_2gram,
    'sent_4gram': lm_sent_4gram,
    'word_4gram': lm_word_4gram,
}

# Initialize results list
results = []

for lang, eligible_words in [("it", italian_words), ("nl", dutch_words)]:
    for word in eligible_words:
        for model_key, model in models.items():
            parts = model_key.split("_")
            ngram_size = "2" if "2gram" in model_key else "4"
            training_data_type = "words" if "word" in model_key else "sents"
            perplexity = compute_word_perplexity(model, word)
            results.append([lang, word, ngram_size, training_data_type, round(perplexity, 4)])

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['lang', 'word', 'ngram_size', 'training_data', 'perplexity'])

# Find min and max perplexities
min_perplexities_df = results_df.loc[results_df.groupby(['lang', 'ngram_size', 'training_data'])['perplexity'].idxmin()]
max_perplexities_df = results_df.loc[results_df.groupby(['lang', 'ngram_size', 'training_data'])['perplexity'].idxmax()]

# Define the output CSV file names
min_csv_file_name = "BelizPekkan_perplexities_min.csv"
max_csv_file_name = "BelizPekkan_perplexities_max.csv"

# Write the lowest perplexities to the CSV file
min_perplexities_df.to_csv(min_csv_file_name, index=False)

# Write the highest perplexities to the CSV file
max_perplexities_df.to_csv(max_csv_file_name, index=False)

print(f"Lowest perplexity results saved to {min_csv_file_name}")
print(f"Highest perplexity results saved to {max_csv_file_name}")

Lowest perplexity results saved to BelizPekkan_perplexities_min.csv
Highest perplexity results saved to BelizPekkan_perplexities_max.csv


## Task 5

Answer questions in the separate markdown blocks below.

5. Answer the following questions:
	- a. compare LMs' perplexity on the English training sets, sentences and words, then explain the differences in perplexity considering what changes between the two training set-ups. (5 pts, 150 words)
	- b. which LM trained on sentences generalizes better to unseen sentences in the same language, bigram or tetragram? explain why this is the case. (5 pts, 150 words)
	- c. compare LMs trained on English in their ability to fit Italian and Dutch sentences: which factor between ngram size and training corpus (words or sentences) affects perplexity the most? Explain why we observe this pattern. (4 pts, 100 words)
	- d. what patterns can you identify in the words with the lowest perplexity in Dutch and Italian? (4 pts, 100 words)
	- e. what patterns can you identify in the words with the highest perplexity in Dutch and Italian? (4 pts, 100 words)
    
> 22 points in total, see specifications next to each question

#### 5a

2-gram models trained on sentences have lower perplexity (11.8622) than those trained on word types (15.7788), due to the broader context sentences provide. 4-gram models trained on sentences have even lower perplexity (5.3673), benefiting from longer sequences. However, 4-gram models trained on word types have higher perplexity (21.1391), as the advantage of longer n-grams is less effective for isolated word types. For English training word types, 2-gram models trained on word types have lower perplexity (12.4359) than those trained on sentences (28.4246), suggesting word-based models are more effective for shorter n-grams. However, 4-gram models trained on word types have lower perplexity (6.1067) than 2-gram models, indicating longer n-grams benefit from additional context. Overall, sentence-trained models perform better with longer n-grams, while word type models perform better with shorter n-grams.

#### 5b

The tetragram LM trained on sentences has a perplexity of 5.3673, which is significantly lower than the perplexity of the bigram LM trained on sentences, which is 11.8622. A lower perplexity indicates that the model is better at predicting the next character in a sequence, suggesting it generalizes better to unseen data. This is the case because tetragram models capture more context and dependencies in the data. While a bigram model only considers the immediate previous character, a tetragram model takes into account the previous three characters, providing a richer context for making predictions. This additional context allows the tetragram model to better understand the structure and patterns within the language, leading to more accurate predictions and lower perplexity.

#### 5c

N-gram size has a more significant effect on perplexity than the training corpus when comparing how well LMs trained on English fit Italian and Dutch sentences. This is evident from the consistently lower perplexity values for 2-gram models compared to 4-gram models, irrespective of whether they were trained on words or sentences. The reason for this pattern is that shorter n-gram models (like 2-gram models) are less sensitive to the specific sequences within a language, making them more robust to cross-linguistic variations. Longer n-grams (like 4-grams), while providing richer context, are more specialized and thus less effective when applied to a different language with potentially different syntactic and morphological patterns.

#### 5d

The words with the lowest perplexities in Dutch and Italian (examples include 'congres', 'rating' and 'international') are influenced by structural simplicity, similarity to English, frequent use in multiple languages, and the context provided by the training data. These factors contribute to the predictability and lower perplexity scores for these words. The effectiveness of 4-gram models and the focus provided by word-level training also play a significant role in achieving lower perplexity.

#### 5e

The words with the highest perplexity in both Italian and Dutch tend to have more complex structures, including characters and accents not found in English ('reëel'). These words often have letter repetitions ('ridurre') and are specific or less frequent in usage ('aangemoedigd'), making them harder to predict for a model trained on English data. The challenges posed by these words are exacerbated in 4-gram models, which, while capturing more context, also demand more specific training data for accurate predictions.