# **Module 5: Natural language processing**
## DAT410

### Group 29 
### David Laessker, 980511-5012, laessker@chalmers.se

### Oskar Palmgren, 010529-4714, oskarpal@chalmers.se



We hereby declare that we have both actively participated in solving every exercise. All solutions are entirely our own work, without having taken part of other solutions.

___


## 1) Reading and reflection

### a)

In general, life would get easier if computers could take over some simple intellectual tasks that only humans are currently able to do, so that these tasks can be automated. In addition to translation, problems such as image and speech recognition have been researched a lot and have a very wide range of applications. The task of converting spoken language into text has seen approaches ranging from hidden Markov models to deep neural networks, including CNNs and RNNs.

### b)

Rule-based systems explicitly use linguistic rules and dictionaries, while neural systems learn linguistic patterns from large datasets. Both approaches aim to translate accurately by mapping structures and meanings between languages, but through different means. The two kinds of model may produce the same output, but the reason behind that output is completely different. Essentially, both reflect patterns of speech in a society: one has learned them by observation, while the other has had the rules and patterns defined for it.

### c)

In situations where there is not a lot of data available, a neural network or statistical model may not perform very well. For example, for the minority language Meänkieli, which is spoken by around 70 000 people in the northern parts of Sweden, there is little data for a statistical or neural model to train on, and the model may be confused because the language is very close to Finnish and is often used interchangeably with Finnish and Swedish. The same goes for Faroese, the language spoken in the Faroe Islands. It looks similar to Icelandic and is also spoken by around 70 000 people. In these cases, a rule-based model would most likely be the preferred method. Training a statistical or neural model on small and often confusing data may lead to a broken or poorly working model, whereas a rule-based model will always obey its rules, even though it lacks versatility.

## 2) Implementation

In [1]:
import numpy as np
import random
from collections import Counter, defaultdict
import re
from html import unescape

In [2]:
swe_eng_file_path = 'data/europarl-v7.sv-en.lc.sv'
eng_swe_file_path = 'data/europarl-v7.sv-en.lc.en'

ger_eng_file_path = 'data/europarl-v7.de-en.lc.de'
eng_ger_file_path = 'data/europarl-v7.de-en.lc.en'

fre_eng_file_path = 'data/europarl-v7.fr-en.lc.fr'
eng_fre_file_path = 'data/europarl-v7.fr-en.lc.en'

### a) Warmup

We implement a function that calculates the word frequency in a file. We go through the file line by line and update a word counter every time a word appears. The result is a dictionary with all the words and their respective frequencies.

We noticed that punctuation marks (periods and commas) were in the top 10 in all languages. The French data also contained the HTML entity `&apos;` (apostrophe), so we added an argument that makes it possible to calculate the word frequency without special symbols. However, this is only used in this task, because these symbols may be important and handled differently in the translation model, for example punctuation for sentence boundary detection.

In [3]:
def word_frequency(file, remove_symbols=False):
    
    word_counter = Counter()

    with open(file, 'r', encoding='utf-8') as f:
    
        for line in f:

            if remove_symbols:
                line = unescape(line)
                line = re.sub(r'[^\w\s]', '', line)
            
            words = line.split()
            word_counter.update(words)
    
    return word_counter
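
To make the effect of `remove_symbols` concrete, here is a minimal sketch on a single made-up line in the style of the French data. The helper `tokenize` and the example sentence are assumptions for illustration only; the cleaning steps are the same two calls used in `word_frequency`. Note that the regular expression also removes hyphens inside words, which is one more reason to use it only for counting.

```python
import re
from collections import Counter
from html import unescape

def tokenize(line, remove_symbols=False):
    # Same cleaning steps as word_frequency, applied to one line
    if remove_symbols:
        line = unescape(line)                 # "&apos;" becomes "'"
        line = re.sub(r"[^\w\s]", "", line)   # drop everything except word characters and whitespace
    return line.split()

# Hypothetical example line, not taken from the corpus
line = "l &apos; union européenne , c &apos; est nous ."

raw = Counter(tokenize(line))
clean = Counter(tokenize(line, remove_symbols=True))

print(raw.most_common(3))
print(sorted(clean))
```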

In [4]:
swe_frequency = word_frequency(swe_eng_file_path)

swe_top_10 = swe_frequency.most_common(10)

swe_top_10


[('.', 9648),
 ('att', 9181),
 (',', 8876),
 ('och', 7038),
 ('i', 5949),
 ('det', 5687),
 ('som', 5028),
 ('för', 4959),
 ('av', 4013),
 ('är', 3840)]

In [5]:
eng_frequency1 = word_frequency(eng_swe_file_path)
eng_frequency2 = word_frequency(eng_ger_file_path)
eng_frequency3 = word_frequency(eng_fre_file_path)

eng_total_frequency = eng_frequency1 + eng_frequency2 + eng_frequency3

eng_top_10 = eng_total_frequency.most_common(10)

eng_top_10


[('the', 58790),
 (',', 42043),
 ('.', 29542),
 ('of', 28406),
 ('to', 26842),
 ('and', 21459),
 ('in', 18485),
 ('is', 13331),
 ('that', 13219),
 ('a', 13090)]

In [6]:
ger_frequency = word_frequency(ger_eng_file_path)

ger_top_10 = ger_frequency.most_common(10)

ger_top_10

[(',', 18549),
 ('die', 10521),
 ('.', 9733),
 ('der', 9374),
 ('und', 7028),
 ('in', 4175),
 ('zu', 3168),
 ('den', 2976),
 ('wir', 2863),
 ('daß', 2738)]

In [7]:
fre_frequency = word_frequency(fre_eng_file_path)

fre_top_10 = fre_frequency.most_common(10)

fre_top_10

[('&apos;', 16729),
 (',', 15402),
 ('de', 14520),
 ('la', 9746),
 ('.', 9734),
 ('et', 6619),
 ('l', 6536),
 ('le', 6174),
 ('les', 5585),
 ('à', 5500)]

In [8]:
eur_parl_frequency = swe_frequency + fre_frequency + ger_frequency + eng_total_frequency

eur_parl_top_10 = eur_parl_frequency.most_common(10)

eur_parl_top_10


[(',', 84870),
 ('the', 58812),
 ('.', 58657),
 ('of', 28414),
 ('to', 26843),
 ('in', 22808),
 ('and', 21463),
 ('&apos;', 20013),
 ('de', 17536),
 ('i', 14894)]

In [9]:
words_amount = sum(eur_parl_frequency.values())

speaker_probability = (eur_parl_frequency['speaker'] / words_amount ) * 100
zebra_probability = (eur_parl_frequency['zebra'] / words_amount ) * 100

# calculate probability for "speaker" and "zebra"
print(f'Probability of speaker: {speaker_probability:.5f} %')
print(f'Probability of zebra: {zebra_probability} %')


Probability of speaker: 0.00193 %
Probability of zebra: 0.0 %


Now we look at the most common words and probability without all the symbols.

In [10]:
#Without symbols

file_paths = ['data/europarl-v7.sv-en.lc.sv',
              'data/europarl-v7.sv-en.lc.en',
              'data/europarl-v7.de-en.lc.de',
              'data/europarl-v7.de-en.lc.en',
              'data/europarl-v7.fr-en.lc.fr',
              'data/europarl-v7.fr-en.lc.en']


#Initialize the word counter 
eur_parl_frequency_without_symbols = word_frequency(file_paths[0], remove_symbols=True)


for path in file_paths[1:]:

    eur_parl_frequency_without_symbols += word_frequency(path, remove_symbols=True)


eur_parl_frequency_without_symbols.most_common(10)




[('the', 58812),
 ('of', 28414),
 ('to', 26843),
 ('in', 22808),
 ('and', 21463),
 ('de', 17536),
 ('i', 14907),
 ('a', 14853),
 ('is', 13340),
 ('that', 13219)]

In [11]:
words_amount_without_symbols = sum(eur_parl_frequency_without_symbols.values())

speaker_probability_without_symbols = (eur_parl_frequency_without_symbols['speaker'] / words_amount_without_symbols ) * 100

print(f'Probability of speaker: {speaker_probability_without_symbols:.5f} %')

Probability of speaker: 0.00217 %


The probability of the word *speaker* is slightly higher, which is reasonable: removing the symbols lowers the total number of tokens that the count is divided by. The word *zebra* had zero occurrences in the original word frequency.

### b) Language modeling

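In this part we implement a bigram model: for every word we count which words follow it in the training sentences, and we generate text by repeatedly sampling the next word in proportion to those counts.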

In [12]:
def read_file(file):
    '''
    Reads the file, each sentence line by line and splits the words from the sentences. 
    Returns a list of lists with words from each sentence
    '''
    
    sentences_list = []

    with open(file, 'r', encoding='utf-8') as f:
    
        for line in f:
            
            words = line.strip().split()
            sentences_list.append(words)
    
    return sentences_list


In [13]:
class BigramModel:
    def __init__(self):
        self.bigram_counts = defaultdict(Counter)
        self.starting_words = []

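    def train(self, sentence_list):
        # Each sentence is already a list of word tokens (see read_file)
        for sentence in sentence_list:
            if not sentence:
                continue  # Skip empty lines, which have no first word

            self.starting_words.append(sentence[0])

            # Count how often each word is followed by each other word
            for i in range(len(sentence) - 1):
                self.bigram_counts[sentence[i]][sentence[i+1]] += 1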
        

    def predict_next_word(self, word):
        if word not in self.bigram_counts:
            return None
        next_words = self.bigram_counts[word]
        # Sample the next word in proportion to its bigram count
        # (random.choices normalizes the weights itself)
        candidates = list(next_words.keys())
        weights = list(next_words.values())
        return random.choices(candidates, weights=weights)[0]

    def generate_text(self, start_word, length=10):

        if start_word.lower() not in self.bigram_counts and not self.starting_words:
            return "Model not trained or start word not in corpus."
        
        if start_word.lower() in self.bigram_counts:
            current_word = start_word.lower()

        else:
            current_word = random.choice(self.starting_words)
        
        
        generated_text = [current_word]
        
        for _ in range(length - 1):
            next_word = self.predict_next_word(current_word)
            if next_word is None:
                break  # End if no next word is found
            generated_text.append(next_word)
            current_word = next_word
        return ' '.join(generated_text)



In [14]:
eng_sentences = read_file(eng_swe_file_path)

model = BigramModel()
model.train(eng_sentences)

start_word = "You"
generated_text = model.generate_text(start_word, 10)
print(generated_text)


you the matter of lead to thank you are the


In [15]:
start_word = ""
generated_text = model.generate_text(start_word, 100)
print(generated_text)

i believe it outlines the agenda ? will do . this . this has been almost four conclusions that parliament , especially regarding the european union , with this , mainly towards cultural and i believe there is a sustainable energy transmission rather they begin implementing it and , growth , ladies and therefore of action against europeanist dictatorship know , we should react against money from 1995 , not adding a functioning of demand-orientated economic conversion of every area . there is so these parameters of food in paris or underwater tunnels in the eu from the conditions and


If the starting word does not appear in the bigrams, the model chooses a random starting word instead.
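
The same kind of bigram counts can also be used to score a whole sentence instead of generating one. The block below is a minimal sketch of that idea and is not part of our model above: `train_bigrams`, `sentence_log_prob`, the `<s>`/`</s>` boundary markers, the toy sentences and the add-`alpha` smoothing are all assumptions made for illustration. It works in log space so that long sentences do not underflow, and the smoothing keeps unseen bigrams from giving probability zero.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(sentences):
    # Count bigrams, with assumed boundary markers <s> and </s>
    counts = defaultdict(Counter)
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts

def sentence_log_prob(words, counts, vocab_size, alpha=1.0):
    # Sum of log P(cur | prev) with add-alpha smoothing
    padded = ["<s>"] + words + ["</s>"]
    log_p = 0.0
    for prev, cur in zip(padded, padded[1:]):
        followers = counts.get(prev, Counter())  # .get avoids adding new keys
        numerator = followers[cur] + alpha
        denominator = sum(followers.values()) + alpha * vocab_size
        log_p += math.log(numerator / denominator)
    return log_p

# Hypothetical toy corpus
toy_sentences = [["i", "agree"], ["i", "agree", "fully"], ["we", "agree"]]
toy_counts = train_bigrams(toy_sentences)
# Possible next tokens: every word plus the end marker
toy_vocab_size = len({w for s in toy_sentences for w in s}) + 1

likely = sentence_log_prob(["i", "agree"], toy_counts, toy_vocab_size)
unlikely = sentence_log_prob(["agree", "i"], toy_counts, toy_vocab_size)
print(likely, unlikely)
```

On this toy corpus the word order that was actually observed receives the higher score.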

### c) Translation modeling

We implemented a translation model based on the pseudocode for IBM Model 1. The main component is the `train` method, where the EM algorithm is implemented with count collection and probability updates. We also created methods that load the sentence pairs, initialize the translation probabilities, and find the top translations for a word. The initialization is deterministic rather than random: every co-occurring word pair starts at `1 / len(english_sentence)`.


During the translation generation phase, the model uses the learned probabilities $P(f \mid e)$ to generate English translations by selecting the English words that are most likely to translate into the observed foreign words. This is done by the `decode` method, which searches for the best English sentence given the foreign sentence using the probabilities learned by the model.

Regarding the conditional probability $P(f \mid e)$: the objective of the statistical translation model is to find the most probable English translation of a foreign sentence. By using this probability, we are essentially asking, "given the English word $e$, what is the probability of each foreign word $f$ being its correct translation?"
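
One standard way to see why $P(f \mid e)$ is useful when translating *into* English is Bayes' rule, written here as a sketch of the noisy-channel formulation. The language model term $P(e)$ is not used by our decoder; it is included only to show the decomposition.

$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_{e} P(f \mid e)\,P(e)
$$

The denominator $P(f)$ can be dropped because it does not depend on $e$.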

The IBM model used here simplifies translation by focusing on word alignments from English to the foreign language, and does not consider word order or syntax. The model thus learns to align and translate based on word co-occurrence probabilities, which are easier to estimate and optimize.
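
To illustrate how EM sharpens the translation table, here is a minimal, self-contained sketch on a three-sentence toy corpus. It is a simplified illustration under stated assumptions, not the class used for our results: the function `train_ibm1` and the toy pairs are hypothetical, there is no NULL word, and the table starts at `1 / |foreign vocabulary|`.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    # t[e][f] approximates P(f | e); start uniform over the foreign vocabulary
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: defaultdict(float))
    for fs, es in pairs:
        for e in es:
            for f in fs:
                t[e][f] = 1.0 / len(foreign_vocab)

    for _ in range(iterations):
        count_fe = defaultdict(lambda: defaultdict(float))
        count_e = defaultdict(float)

        # E-step: distribute each foreign word over the English words in its sentence
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[e][f] for e in es)
                for e in es:
                    delta = t[e][f] / norm
                    count_fe[e][f] += delta
                    count_e[e] += delta

        # M-step: renormalize the expected counts per English word
        for e in count_fe:
            for f in count_fe[e]:
                t[e][f] = count_fe[e][f] / count_e[e]
    return t

# Hypothetical toy corpus: (foreign sentence, English sentence)
toy_pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

t = train_ibm1(toy_pairs)
best_for_the = max(t["the"], key=t["the"].get)
print(best_for_the, dict(t["the"]))
```

Even though *the* co-occurs with *haus* and *buch*, the evidence from the other sentence pairs pushes most of its probability mass to *das*.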

In [16]:
class TranslationModel:
    
    def __init__(self):
        
        self.translation_probabilities = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        self.languages = []


    def add_sentence_pairs(self, foreign_file_path, english_file_path, language):
        
        if language in self.languages:
            print(f'Language \"{language}\" is already implemented.')
            return

        self.languages.append(language)
        
        sentence_pairs = []
        
        with open(foreign_file_path, 'r', encoding='utf-8') as foreign_file, \
             open(english_file_path, 'r', encoding='utf-8') as english_file:
            
            for foreign_sentence, english_sentence in zip(foreign_file, english_file):
                
                foreign_sentence = foreign_sentence.split()
                english_sentence = english_sentence.split() + ['NULL']  # NULL lets a foreign word align to no English word
                
                sentence_pairs.append((foreign_sentence, english_sentence))

        self.train(sentence_pairs, language)


    def train(self, sentence_pairs, language, iterations=10):
        
        translation_probabilities = self.initialize_translation_probabilities(sentence_pairs)
        
        #Expectation-maximization algorithm
        for iteration in range(iterations):
            
            counts_fe = defaultdict(lambda: defaultdict(float))
            counts_e = defaultdict(float)
            total_f = defaultdict(float)
            
            #collect counts
            for (foreign_sentence, english_sentence) in sentence_pairs:

                for f in foreign_sentence:
                    
                    total_f[f] = sum(translation_probabilities[e][f] for e in english_sentence)

                for e in english_sentence:
                    for f in foreign_sentence:
                        
                        count = translation_probabilities[e][f] / total_f[f]
                        counts_fe[e][f] += count
                        counts_e[e] += count
            
            #update probabilities
            for e in counts_fe:
                for f in counts_fe[e]:
                    
                    translation_probabilities[e][f] = counts_fe[e][f] / counts_e[e]
        
        self.translation_probabilities[language] = translation_probabilities


    def initialize_translation_probabilities(self, sentence_pairs):

        translation_probabilities = defaultdict(lambda: defaultdict(float))
        
        for (foreign_sentence, english_sentence) in sentence_pairs:
            
            uniform_probability = 1 / len(english_sentence)
            
            for f in foreign_sentence:
                for e in english_sentence:
                    
                    translation_probabilities[e][f] = uniform_probability
        
        return translation_probabilities


    def find_top_translations(self, word, language, n=10):
        
        if language not in self.languages:
            
            print(f'Language \"{language}\" not found.')
            return
       
        translations = [(f, prob) for f, prob in self.translation_probabilities[language][word].items()]
        translations.sort(key=lambda x: x[1], reverse=True)

        if len(translations) == 0:
            
            print(f'Translation of \"{word}\" not found.')
            return

        return translations[:n]
    

    def decode(self, foreign_sentence, language):

        english_translation = []

        for f in foreign_sentence.split():

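            # Reverse lookup: score every English word e by P(f|e), using 0 if the pair was never seen
            possible_translations = {e: self.translation_probabilities[language][e].get(f, 0) for e in self.translation_probabilities[language]}
            
            # Choose the English word under which the foreign word f is most probable
            e = max(possible_translations, key=possible_translations.get)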
            
            english_translation.append(e)
        
        return ' '.join(english_translation)

    

In [17]:
translation_model = TranslationModel()

In [18]:
translation_model.add_sentence_pairs(swe_eng_file_path, eng_swe_file_path, 'se')

In [19]:
translation_model.find_top_translations('european', 'se')

[('europeiska', 0.8358811726335589),
 ('europeisk', 0.07793251129315405),
 ('den', 0.01234342827816025),
 ('i', 0.01145802549009602),
 ('att', 0.0073175106656019956),
 ('en', 0.0064164076171784445),
 ('till', 0.006403088625106279),
 ('det', 0.005713358673491642),
 (',', 0.005560543111116395),
 ('för', 0.0047190385350298)]

In [20]:
translation_model.add_sentence_pairs(ger_eng_file_path, eng_ger_file_path, 'de')

In [21]:
translation_model.find_top_translations('european', 'de')

[('europäischen', 0.6583104427978661),
 ('europäische', 0.2994256203430025),
 ('der', 0.015614335052903423),
 ('die', 0.006707869728434155),
 (',', 0.0046853348392669105),
 ('den', 0.002090016898065694),
 ('in', 0.002076244515776026),
 ('.', 0.001742672454766282),
 ('union', 0.0017056204984250926),
 ('das', 0.0015701739731939042)]

In [22]:
translation_model.add_sentence_pairs(fre_eng_file_path, eng_fre_file_path, 'fr')

In [23]:
translation_model.find_top_translations('european', 'fr')

[('européenne', 0.46255524097583534),
 ('européen', 0.2740681214755836),
 ('l', 0.07615189999366892),
 ('&apos;', 0.06673961686785357),
 ('de', 0.043416375856401186),
 (',', 0.012650462639585984),
 ('le', 0.011782554822424903),
 ('la', 0.008875011271880097),
 ('au', 0.00688588253565706),
 ('d', 0.005003206485329585)]

### d) Decoding

Decoding is challenging because the search space is very large. From the outputs above, we see that for each foreign word there are multiple possible translations in English. In IBM Model 1, word alignments are assumed to be independent of each other, so word order and context are not considered, which leads to less natural sentence translations. 


Since our model is trained with the translation probabilities $P(f|e)$ (the probability of a foreign word $f$ given an English word $e$), we cannot simply look up the English word with the highest translation probability $P(e|f)$. A solution would be to also compute the "reverse" translation probabilities $P(e|f)$ in the model. However, in our decoder, we instead use the trained model in a reverse-lookup manner: we keep the existing translation probabilities and, for each foreign word $f$, select the English word $e$ that maximizes $P(f|e)$.


We do this for each word, and all these words are then combined into the output sentence. This means that the model does not consider word order or grammatical structure by looking at the neighboring words. This simplification impacts the translation quality because the word with the highest probability will not always be the word that fits the context of the sentence, which leads to a translation that is less fluent or less grammatically correct.
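
One possible way to bring context into the decoder is to combine the translation probability with a bigram language model score. The block below is only a sketch of that idea and not what our `decode` method does: the functions `word_by_word` and `greedy_decode`, the `floor` value for unseen events, and all numbers in the two toy tables are hypothetical and chosen for illustration.

```python
import math

def word_by_word(foreign_words, t):
    # Baseline: pick argmax_e P(f|e) for each foreign word independently
    return [max(t, key=lambda e: t[e].get(f, 0.0)) for f in foreign_words]

def greedy_decode(foreign_words, t, bigram, floor=1e-6):
    # Greedy left-to-right search: score each candidate e by
    # log P(f|e) + log P(e | previous English word)
    prev, output = "<s>", []
    for f in foreign_words:
        def score(e):
            p_trans = t[e].get(f, floor)
            p_lm = bigram.get(prev, {}).get(e, floor)
            return math.log(p_trans) + math.log(p_lm)
        best = max(t, key=score)
        output.append(best)
        prev = best
    return output

# Hypothetical toy tables: t[e][f] stands for P(f|e), bigram[prev][e] for P(e|prev)
t = {
    "i": {"jag": 0.9},
    "am": {"är": 0.5},
    "is": {"är": 0.6},
    "a": {"en": 0.4},
    "one": {"en": 0.5},
    "politician": {"politiker": 0.17},
    "politicians": {"politiker": 0.45},
}
bigram = {
    "<s>": {"i": 0.5},
    "i": {"am": 0.6, "is": 0.01},
    "am": {"a": 0.5, "one": 0.05},
    "a": {"politician": 0.1, "politicians": 0.001},
}

sentence_words = "jag är en politiker".split()
baseline = word_by_word(sentence_words, t)
with_context = greedy_decode(sentence_words, t, bigram)
print(" ".join(baseline))
print(" ".join(with_context))
```

With these made-up numbers, the language model term is enough to outweigh the slightly higher translation probabilities of *is*, *one* and *politicians*. A greedy search like this can still miss the best sentence, since it never revisits an earlier choice.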


In [24]:
sentence = 'jag är en politiker'

translation_model.decode(sentence, 'se')

'i is one politicians'

In the example sentence above, the Swedish sentence *Jag är en politiker* should in reality translate to *I am a politician* in English. The decoder output conveys the concept but is obviously not grammatically correct. This is a result of the decoder implementation, which picks the English word with the highest probability for each foreign word. The word *politiker* can be both singular and plural in Swedish, and the plural *politicians* gives *politiker* a higher probability than the singular *politician* does, as seen below.

In [25]:
print('Politician:', translation_model.find_top_translations('politician', 'se')[0])
print('Politicians:', translation_model.find_top_translations('politicians', 'se')[0])


Politician: ('politiker', 0.16913921184700276)
Politicians: ('politiker', 0.4493980991701494)


## 3) Discussion

### a)

For a translation to be good, it should be as understandable and as accurate as possible. Sometimes direct translations do not exist, and in that case it is more important to convey the message of the original sentence than to attempt a direct translation. It can also be useful to simplify sentences, to avoid misunderstandings and increase the likelihood that the translation is correct.

Evaluating translations can be done manually or automatically. For manual evaluation, a person who is fluent in both languages is a great asset, as this person can properly understand both sentences, judge how well they are translated, and perhaps even suggest better translations. The advantage is that the evaluation is very accurate; however, it is also very time-consuming and expensive. It is simply infeasible to have people evaluate every single sentence translation.

To evaluate translations automatically, we can use a set of test data in which each sentence in the source language is paired with a reference sentence in the target language. We then compare the system's translations with these “correct” reference translations. These sentences should preferably not be part of the training data, so that the evaluation is fair. It would however still be interesting to evaluate the translation system on sentences from the training data, as the system can sometimes get confused even there. The limitation of automatic evaluation is that it cannot be as accurate as a human fluent in both languages. A sentence can often be translated in several different ways that all get the same message across and are all correct, and it is difficult to account for all such variants when evaluating automatically.
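
As an illustration of automatic evaluation against a reference, the sketch below computes clipped n-gram precision, which is one ingredient of BLEU-style scores. The function name `clipped_ngram_precision` is hypothetical, and a full metric would also combine several n-gram orders and penalize translations that are too short. The example also shows the limitation discussed above: any wording that differs from the single reference is counted as wrong.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    # Fraction of candidate n-grams that also occur in the reference,
    # where each reference n-gram can be matched at most as often as it appears
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0  # candidate is shorter than n words

    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total

system_output = "i is one politicians"
reference_translation = "i am a politician"

unigram_precision = clipped_ngram_precision(system_output, reference_translation, n=1)
bigram_precision = clipped_ngram_precision(system_output, reference_translation, n=2)
print(unigram_precision, bigram_precision)
```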

When talking about translation in 2024 it is also important to consider LLMs, as these are very good at “understanding” what is being said and can often translate a message into other languages very well. If the translation system is not already LLM-based, LLMs could be used for evaluating its translations. For example, a model could be asked whether the message is properly conveyed, or whether the sentence is properly translated, and to output its judgment in whatever format is requested. However, these models also vary in quality, speed and cost, and their output can be somewhat random.

### b)

Although we are not experts in the Estonian language, we could argue that it is both a feature and a bug. It depends on what we prioritize in our translation. Perhaps it is difficult to translate into different genders in Estonian, or perhaps gender changes how the noun is written? In that case the model may be correct more often if it simply uses a gender-neutral form. However, this may cause the translation to sometimes not convey the message properly. If a text is talking about a male and a female, and it refers to either of them simply as *he* or *she*, the gender-neutral translation would confuse the reader as to which person is meant.

Another interesting point is that there could have been a bias in the training dataset for the translation model. Historically, medicine and technology were predominantly practiced by men, even though the fields have become more diverse today. This by itself may lead to a bias in the translation model, favoring a specific gender depending on the context. A simple solution to this issue is to use better and more inclusive data, although this may prove to be a challenge in itself.

### c)

In the first example, the sentence has a clear context where the ball is being hit with the bat, which indicates that it is a baseball bat and not the animal. The second sentence also gives clear context by saying that this bat eats insects, indicating that we are talking about the animal. In the last sentence, however, the translation gets confused. We think what particularly confuses it is the use of "forest" in the text: the word *trä* in *slagträ* means wood, and wood comes from trees, which are in the forest. So using *trä* makes sense, as in *slagträ*, but a *slagträ* does not live anywhere, so that does not make sense either. Instead it comes up with a word combination (combining words is normal in Swedish), making up a word that does not exist so that it fits both a tree in the forest and an animal bat living there.
