# **Module 5: Natural language processing**
## DAT410

### Group 29 
### David Laessker, 980511-5012, laessker@chalmers.se

### Oskar Palmgren, 010529-4714, oskarpal@chalmers.se



We hereby declare that we have both actively participated in solving every exercise. All solutions are entirely our own work, without having taken part of other solutions.

___


## 1) Reading and reflection

a) Like speech recognition and image recognition?

b) Systems that are rule-based explicitly use linguistic rules and dictionaries, while neural systems learn these linguistic patterns from large datasets. Both approaches aim to accurately translate languages by mapping structures and meanings, but through different means.

c) Possibly when the dataset is small. With scarce data, a modern neural system may not capture the grammatical patterns of the language, whereas a rule-based system will offer more predictable and interpretable results.

## 2) Implementation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import random
from collections import Counter, defaultdict
import re
from html import unescape

In [2]:
swe_eng_file_path = 'data/europarl-v7.sv-en.lc.sv'
eng_swe_file_path = 'data/europarl-v7.sv-en.lc.en'

ger_eng_file_path = 'data/europarl-v7.de-en.lc.de'
eng_ger_file_path = 'data/europarl-v7.de-en.lc.en'

fre_eng_file_path = 'data/europarl-v7.fr-en.lc.fr'
eng_fre_file_path = 'data/europarl-v7.fr-en.lc.en'

### (a) Warmup

We implement a function that counts how often each word occurs in a file.


We noticed that punctuation marks (periods and commas) were in the top 10 in all languages. The French text also contained the HTML entity "&apos;" (apostrophe), so we added an argument that lets us compute the word frequencies without special symbols. However, this option is only used in this task, because these symbols may be important and are handled differently in the translation model; punctuation, for example, is useful for sentence boundary detection.

In [3]:
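def word_frequency(file, remove_symbols=False):
    '''
    Counts how often each whitespace-separated token occurs in the file.
    If remove_symbols is True, HTML entities are unescaped and all
    punctuation is stripped before counting.
    '''
    word_counter = Counter()

    # Explicit encoding so that characters such as å, ä, ö, ß and é are read correctly on any platform
    with open(file, 'r', encoding='utf-8') as f: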
    
        for line in f:

            if remove_symbols:
                line = unescape(line)
                line = re.sub(r'[^\w\s]', '', line)
            
            words = line.split()
            word_counter.update(words)
    
    return word_counter

In [4]:
swe_frequency = word_frequency(swe_eng_file_path)

swe_top_10 = swe_frequency.most_common(10)

swe_top_10


[('.', 9648),
 ('att', 9181),
 (',', 8876),
 ('och', 7038),
 ('i', 5949),
 ('det', 5687),
 ('som', 5028),
 ('för', 4959),
 ('av', 4013),
 ('är', 3840)]

In [5]:
eng_frequency1 = word_frequency(eng_swe_file_path)
eng_frequency2 = word_frequency(eng_ger_file_path)
eng_frequency3 = word_frequency(eng_fre_file_path)

eng_total_frequency = eng_frequency1 + eng_frequency2 + eng_frequency3

eng_top_10 = eng_total_frequency.most_common(10)

eng_top_10


[('the', 58790),
 (',', 42043),
 ('.', 29542),
 ('of', 28406),
 ('to', 26842),
 ('and', 21459),
 ('in', 18485),
 ('is', 13331),
 ('that', 13219),
 ('a', 13090)]

In [6]:
ger_frequency = word_frequency(ger_eng_file_path)

ger_top_10 = ger_frequency.most_common(10)

ger_top_10

[(',', 18549),
 ('die', 10521),
 ('.', 9733),
 ('der', 9374),
 ('und', 7028),
 ('in', 4175),
 ('zu', 3168),
 ('den', 2976),
 ('wir', 2863),
 ('daß', 2738)]

In [7]:
fre_frequency = word_frequency(fre_eng_file_path)

fre_top_10 = fre_frequency.most_common(10)

fre_top_10

[('&apos;', 16729),
 (',', 15402),
 ('de', 14520),
 ('la', 9746),
 ('.', 9734),
 ('et', 6619),
 ('l', 6536),
 ('le', 6174),
 ('les', 5585),
 ('à', 5500)]

In [8]:
eur_parl_frequency = swe_frequency + fre_frequency + ger_frequency + eng_total_frequency

eur_parl_top_10 = eur_parl_frequency.most_common(10)

eur_parl_top_10


[(',', 84870),
 ('the', 58812),
 ('.', 58657),
 ('of', 28414),
 ('to', 26843),
 ('in', 22808),
 ('and', 21463),
 ('&apos;', 20013),
 ('de', 17536),
 ('i', 14894)]

In [9]:
words_amount = sum(eur_parl_frequency.values())

#print(words_amount)
#print(eur_parl_frequency['speaker'])
#print(eur_parl_frequency['zebra'])

speaker_probability = (eur_parl_frequency['speaker'] / words_amount ) * 100
zebra_probability = (eur_parl_frequency['zebra'] / words_amount ) * 100

# calculate probability for "speaker" and "zebra"
print(f'Probability of speaker: {speaker_probability:.5f} %')
print(f'Probability of zebra: {zebra_probability} %')


Probability of speaker: 0.00193 %
Probability of zebra: 0.0 %


Now the same computation without symbols:

In [10]:
#Without symbols

file_paths = ['data/europarl-v7.sv-en.lc.sv',
              'data/europarl-v7.sv-en.lc.en',
              'data/europarl-v7.de-en.lc.de',
              'data/europarl-v7.de-en.lc.en',
              'data/europarl-v7.fr-en.lc.fr',
              'data/europarl-v7.fr-en.lc.en']


#Initialize the word counter 
eur_parl_frequency_without_symbols = word_frequency(file_paths[0], remove_symbols=True)


for path in file_paths[1:]:

    eur_parl_frequency_without_symbols += word_frequency(path, remove_symbols=True)


eur_parl_frequency_without_symbols.most_common(10)




[('the', 58812),
 ('of', 28414),
 ('to', 26843),
 ('in', 22808),
 ('and', 21463),
 ('de', 17536),
 ('i', 14907),
 ('a', 14853),
 ('is', 13340),
 ('that', 13219)]

In [11]:
words_amount_without_symbols = sum(eur_parl_frequency_without_symbols.values())

speaker_probability_without_symbols = (eur_parl_frequency_without_symbols['speaker'] / words_amount_without_symbols ) * 100

print(f'Probability of speaker: {speaker_probability_without_symbols:.5f} %')

Probability of speaker: 0.00217 %


In [12]:
#Possible issue: removing symbols merges the parts of hyphenated words

print(eur_parl_frequency['vice-president'])
print(eur_parl_frequency['vicepresident'])

print(eur_parl_frequency_without_symbols['vice-president'])
print(eur_parl_frequency_without_symbols['vicepresident'])

43
0
0
43
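
The counts above show that stripping every symbol turns "vice-president" into "vicepresident". One possible way around this, sketched below, is to keep hyphens that sit between two word characters and drop all other symbols. The helper name `strip_symbols_keep_hyphens` is hypothetical and is not part of the notebook; this is an illustration under that assumption, not the method used for the results above.

```python
import re
from html import unescape

def strip_symbols_keep_hyphens(line):
    """Hypothetical variant of the symbol removal used in word_frequency."""
    # Turn HTML entities such as &apos; into the characters they stand for
    line = unescape(line)
    # Remove everything that is not a word character, whitespace, or a hyphen
    line = re.sub(r'[^\w\s-]', '', line)
    # Remove hyphens that are not surrounded by word characters on both sides
    line = re.sub(r'(?<!\w)-|-(?!\w)', '', line)
    return line.split()

tokens = strip_symbols_keep_hyphens("the vice-president , l &apos; union - yes .")
print(tokens)
```

With this variant, "vice-president" stays a single token, while free-standing hyphens, commas, periods, and apostrophes are dropped.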


### (b) Language modeling

In [13]:
def read_file(file):
    '''
    Reads the file, each sentence line by line and splits the words from the sentences. 
    Returns a list of lists with words from each sentence
    '''
    
    sentences_list = []

    with open(file, 'r', encoding='utf-8') as f:
    
        for line in f:
            
            words = line.strip().split()
            sentences_list.append(words)
    
    return sentences_list


In [14]:
class BigramModel:
    def __init__(self):
        self.bigram_counts = defaultdict(Counter)
        self.starting_words = []

    def train(self, sentence_list):
        # Each sentence is already tokenized into a list of words

        for sentence in sentence_list:
            if not sentence:
                continue  # Skip empty lines, which have no first word
            self.starting_words.append(sentence[0])

            # Count bigrams in the sentence
            for i in range(len(sentence) - 1):
                self.bigram_counts[sentence[i]][sentence[i+1]] += 1
        

    def predict_next_word(self, word):
        if word not in self.bigram_counts:
            return None
        next_words = self.bigram_counts[word]
        # Sample the next word in proportion to its bigram count;
        # random.choices normalizes the weights itself
        candidates = list(next_words.keys())
        weights = list(next_words.values())
        return random.choices(candidates, weights=weights)[0]

    def generate_text(self, start_word, length=10):
        
        #print(self.starting_words)
        #print(self.bigram_counts['jag'])


        if start_word.lower() not in self.bigram_counts and not self.starting_words:
            return "Model not trained or start word not in corpus."
        
        if start_word.lower() in self.bigram_counts:
            current_word = start_word.lower()

        else:
            current_word = random.choice(self.starting_words)
        
        
        generated_text = [current_word]
        
        for _ in range(length - 1):
            next_word = self.predict_next_word(current_word)
            if next_word is None:
                break  # End if no next word is found
            generated_text.append(next_word)
            current_word = next_word
        return ' '.join(generated_text)





In [15]:
swe_sentences = read_file(swe_eng_file_path)

model = BigramModel()
model.train(swe_sentences)

start_word = "jag"
generated_text = model.generate_text(start_word, 10)
print(generated_text)


jag , att gå igenom . ( h-0041 / 1999
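
The sampling in `predict_next_word` draws from the maximum-likelihood bigram distribution, where the probability of a word given its predecessor is the bigram count divided by the total count of bigrams starting with that predecessor. The following is a minimal sketch on a made-up toy corpus; the function names `count_bigrams` and `bigram_probability` are hypothetical helpers for illustration and do not appear in the notebook.

```python
from collections import Counter, defaultdict

def count_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for first, second in zip(sentence, sentence[1:]):
            counts[first][second] += 1
    return counts

def bigram_probability(counts, first, second):
    """Maximum-likelihood estimate of P(second | first)."""
    if first not in counts:
        return 0.0  # The first word was never seen with a successor
    total = sum(counts[first].values())
    return counts[first][second] / total

# Toy corpus (invented for this example)
toy_sentences = [['jag', 'vill', 'tacka'],
                 ['jag', 'vill', 'säga'],
                 ['jag', 'tror', 'det']]

toy_counts = count_bigrams(toy_sentences)
# Two of the three bigrams starting with 'jag' continue with 'vill'
print(bigram_probability(toy_counts, 'jag', 'vill'))
```

Unseen bigrams get probability zero under this estimate, which is the same sparsity problem seen for "zebra" in the warmup.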


**Open question: should punctuation be removed here too?**

In [16]:
'''

def createBigram(data):
   listOfBigrams = []
   bigramCounts = {}
   unigramCounts = {}
   for sentence in data:
      for i in range(len(sentence)-1):
         if i < len(sentence) - 1 and sentence[i+1].islower():

            listOfBigrams.append((sentence[i], sentence[i + 1]))

            if (sentence[i], sentence[i+1]) in bigramCounts:
               bigramCounts[(sentence[i], sentence[i + 1])] += 1
            else:
               bigramCounts[(sentence[i], sentence[i + 1])] = 1

         if sentence[i] in unigramCounts:
            unigramCounts[sentence[i]] += 1
         else:
            unigramCounts[sentence[i]] = 1
   return listOfBigrams, unigramCounts, bigramCounts


def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):
    listOfProb = {}
    for bigram in listOfBigrams:
        word1 = bigram[0]
        word2 = bigram[1]
        listOfProb[bigram] = (bigramCounts.get(bigram))/(unigramCounts.get(word1))
    return listOfProb
'''


### c) Translation modeling

Model description

Regarding the conditional probability $P(f \mid e)$, 

In [17]:
class TranslationModel:
    
    def __init__(self):
        
        self.translation_probabilities = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        self.languages = []


    def add_sentence_pairs(self, foreign_file_path, english_file_path, language): #add source_language='en'?
        
        if language in self.languages:
            print(f'Language \"{language}\" is already implemented.')
            return

        #if language not in self.languages:
        self.languages.append(language)
        
        sentence_pairs = []
        
        with open(foreign_file_path, 'r', encoding='utf-8') as foreign_file, \
             open(english_file_path, 'r', encoding='utf-8') as english_file:
            
            for foreign_sentence, english_sentence in zip(foreign_file, english_file):
                
                foreign_sentence = foreign_sentence.split()
                english_sentence = english_sentence.split() + ['NULL'] # NULL lets a foreign word align to no English word
                
                sentence_pairs.append((foreign_sentence, english_sentence))

        self.train(sentence_pairs, language)


    def train(self, sentence_pairs, language, iterations=10):
        
        translation_probabilities = self.initialize_translation_probabilities(sentence_pairs)
        
        #Expectation-maximization algorithm
        for iteration in range(iterations):
            
            counts_fe = defaultdict(lambda: defaultdict(float))
            counts_e = defaultdict(float)
            total_f = defaultdict(float)
            
            #collect counts
            for (foreign_sentence, english_sentence) in sentence_pairs:
            
                #foreign_words = foreign_sentence.split()
                #english_words = english_sentence.split()

                for f in foreign_sentence:
                    
                    total_f[f] = sum(translation_probabilities[e][f] for e in english_sentence)

                for e in english_sentence:
                    for f in foreign_sentence:
                        
                        count = translation_probabilities[e][f] / total_f[f]
                        counts_fe[e][f] += count
                        counts_e[e] += count
            
            #update probabilities
            for e in counts_fe:
                for f in counts_fe[e]:
                    
                    translation_probabilities[e][f] = counts_fe[e][f] / counts_e[e]
        
        self.translation_probabilities[language] = translation_probabilities


    def initialize_translation_probabilities(self, sentence_pairs):

        translation_probabilities = defaultdict(lambda: defaultdict(float))
        
        for (foreign_sentence, english_sentence) in sentence_pairs:
            
            #foreign_words = foreign_sentence.split()
            #english_words = english_sentence.split()
            
            uniform_probability = 1 / len(english_sentence)
            
            for f in foreign_sentence:
                for e in english_sentence:
                    
                    translation_probabilities[e][f] = uniform_probability
        
        return translation_probabilities


    def find_top_translations(self, word, language, n=10):
        
        if language not in self.languages:
            
            print(f'Language \"{language}\" not found.')
            return
       
        translations = [(f, prob) for f, prob in self.translation_probabilities[language][word].items()]
        translations.sort(key=lambda x: x[1], reverse=True)

        if len(translations) == 0:
            
            print(f'Translation of \"{word}\" not found.')
            return

        return translations[:n]
    

    def decode(self):
        pass


In [18]:
translation_model = TranslationModel()

In [19]:
translation_model.add_sentence_pairs(swe_eng_file_path, eng_swe_file_path, 'se')

In [20]:
translation_model.find_top_translations('european', 'se')

[('europeiska', 0.8358811726335589),
 ('europeisk', 0.07793251129315405),
 ('den', 0.01234342827816025),
 ('i', 0.01145802549009602),
 ('att', 0.0073175106656019956),
 ('en', 0.0064164076171784445),
 ('till', 0.006403088625106279),
 ('det', 0.005713358673491642),
 (',', 0.005560543111116395),
 ('för', 0.0047190385350298)]

In [21]:
translation_model.add_sentence_pairs(ger_eng_file_path, eng_ger_file_path, 'de')

In [22]:
translation_model.find_top_translations('european', 'de')

[('europäischen', 0.6583104427978661),
 ('europäische', 0.2994256203430025),
 ('der', 0.015614335052903423),
 ('die', 0.006707869728434155),
 (',', 0.0046853348392669105),
 ('den', 0.002090016898065694),
 ('in', 0.002076244515776026),
 ('.', 0.001742672454766282),
 ('union', 0.0017056204984250926),
 ('das', 0.0015701739731939042)]

In [23]:
translation_model.add_sentence_pairs(fre_eng_file_path, eng_fre_file_path, 'fr')

In [24]:
translation_model.find_top_translations('european', 'fr')

[('européenne', 0.46255524097583534),
 ('européen', 0.2740681214755836),
 ('l', 0.07615189999366892),
 ('&apos;', 0.06673961686785357),
 ('de', 0.043416375856401186),
 (',', 0.012650462639585984),
 ('le', 0.011782554822424903),
 ('la', 0.008875011271880097),
 ('au', 0.00688588253565706),
 ('d', 0.005003206485329585)]
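
The EM procedure behind these tables can be checked on a corpus small enough to follow by hand. The sketch below is a standalone version of the same idea, under assumptions that differ from the class above: it initializes $P(f \mid e)$ uniformly over the foreign vocabulary, it omits the NULL token, and the function name `ibm1_em` and the three-sentence toy corpus are made up for illustration.

```python
from collections import defaultdict

def ibm1_em(sentence_pairs, iterations=10):
    """Estimate t[e][f] = P(f | e) with IBM Model 1 style EM."""
    foreign_vocab = {f for foreign, _ in sentence_pairs for f in foreign}
    uniform = 1.0 / len(foreign_vocab)
    # Entries that have not been estimated yet default to the uniform value
    t = defaultdict(lambda: defaultdict(lambda: uniform))

    for _ in range(iterations):
        counts_fe = defaultdict(lambda: defaultdict(float))
        counts_e = defaultdict(float)

        # E-step: distribute each foreign word over the English words
        # of the same sentence, in proportion to the current t[e][f]
        for foreign, english in sentence_pairs:
            for f in foreign:
                norm = sum(t[e][f] for e in english)
                for e in english:
                    delta = t[e][f] / norm
                    counts_fe[e][f] += delta
                    counts_e[e] += delta

        # M-step: renormalize the expected counts for each English word
        for e, row in counts_fe.items():
            for f, count in row.items():
                t[e][f] = count / counts_e[e]
    return t

# Toy parallel corpus as (foreign words, English words), invented for this example
toy_pairs = [(['das', 'haus'], ['the', 'house']),
             (['das', 'buch'], ['the', 'book']),
             (['ein', 'buch'], ['a', 'book'])]

t = ibm1_em(toy_pairs)
for e in ['the', 'house', 'book', 'a']:
    best = max(t[e], key=t[e].get)
    print(e, '->', best)
```

Even though no alignment is given, the co-occurrence pattern is enough for EM to pull each English word towards one foreign word, which is the same effect seen for "european" in the outputs above.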

### d) Decoding

Decoding is challenging because the search space is very large. From the outputs above, we see that each word has multiple possible translations, so the number of candidate sentences grows rapidly with the sentence length. In IBM Model 1, word alignments are assumed to be independent of each other, so word order and context are not considered, which leads to less natural sentence translations.
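
To make the search problem concrete, here is a minimal sketch of a greedy decoder. It is not the notebook's decoder (the `decode` method above is left unimplemented), and it relies on strong assumptions: one English word per foreign word, the same word order in both languages, and hand-made toy probabilities. The names `greedy_decode`, `toy_translation_table`, and `toy_bigram_prob` are hypothetical.

```python
def greedy_decode(foreign_words, translation_table, bigram_prob, start='<s>'):
    """
    Pick, for each foreign word in order, the English word e that maximizes
    P(f | e) * P(e | previous English word).
    translation_table[e][f] holds P(f | e); bigram_prob(prev, e) returns P(e | prev).
    """
    output = []
    previous = start
    for f in foreign_words:
        best_word, best_score = None, 0.0
        for e, row in translation_table.items():
            score = row.get(f, 0.0) * bigram_prob(previous, e)
            if score > best_score:
                best_word, best_score = e, score
        if best_word is None:
            best_word = f  # No known translation: copy the foreign word
        output.append(best_word)
        previous = best_word
    return output

# Toy probabilities, invented for this example
toy_translation_table = {'the':   {'das': 0.9, 'haus': 0.1},
                         'house': {'haus': 0.8, 'das': 0.2}}
toy_bigrams = {('<s>', 'the'): 0.6, ('<s>', 'house'): 0.4,
               ('the', 'house'): 0.9, ('the', 'the'): 0.1}

def toy_bigram_prob(previous, word):
    # Small floor value so that unseen bigrams are unlikely but not impossible
    return toy_bigrams.get((previous, word), 0.01)

print(greedy_decode(['das', 'haus'], toy_translation_table, toy_bigram_prob))
```

A greedy choice like this only ever explores one path. A fuller decoder would also have to consider reordering, words that translate to zero or several words, and alternative earlier choices, which is where the large search space comes from.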

## 3) Discussion