# **Module 5: Natural language processing**
## DAT410

### Group 29 
### David Laessker, 980511-5012, laessker@chalmers.se

### Oskar Palmgren, 010529-4714, oskarpal@chalmers.se



We hereby declare that we have both actively participated in solving every exercise. All solutions are entirely our own work, and we have not consulted other solutions.

___


## 1) Reading and reflection

a) Like speech recognition and image recognition?

b) Rule-based systems use explicit linguistic rules and dictionaries, whereas neural systems learn linguistic patterns from large datasets. Both approaches aim to translate accurately by mapping structures and meanings between languages, but by different means.

c) Possibly when training data is scarce. With little data, a modern neural system may fail to capture the grammatical patterns of the language, so a rule-based system will offer more predictable and interpretable results.

## 2) Implementation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

In [2]:
swe_eng_file_path = 'data/europarl-v7.sv-en.lc.sv'
eng_swe_file_path = 'data/europarl-v7.sv-en.lc.en'

ger_eng_file_path = 'data/europarl-v7.de-en.lc.de'
eng_ger_file_path = 'data/europarl-v7.de-en.lc.en'

# As for the other pairs: fre_eng is the French side, eng_fre the English side
fre_eng_file_path = 'data/europarl-v7.fr-en.lc.fr'
eng_fre_file_path = 'data/europarl-v7.fr-en.lc.en'

In [4]:
def word_frequency(file):
    """Count how often each whitespace-separated token occurs in a file."""
    word_counter = Counter()

    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            word_counter.update(words)

    return word_counter

In [5]:
swe_frequency = word_frequency(swe_eng_file_path)

swe_top_10 = swe_frequency.most_common(10)

swe_top_10


[('.', 9648),
 ('att', 9181),
 (',', 8876),
 ('och', 7038),
 ('i', 5949),
 ('det', 5687),
 ('som', 5028),
 ('för', 4959),
 ('av', 4013),
 ('är', 3840)]

In [8]:
eng_frequency = word_frequency(eng_swe_file_path) #maybe we should combine all the english files?

eng_top_10 = eng_frequency.most_common(10)

eng_top_10


[('the', 19322),
 (',', 13514),
 ('.', 9774),
 ('of', 9312),
 ('to', 8801),
 ('and', 6946),
 ('in', 6090),
 ('is', 4400),
 ('that', 4357),
 ('a', 4269)]
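
The comment in the cell above asks whether the English files should be combined. A minimal sketch of how several `Counter` objects could be merged is shown here. `combine_counts` is a hypothetical helper and the counts are toy values, not corpus results. Note that if the English files share sentences, those sentences would be counted more than once.

```python
from collections import Counter

def combine_counts(counters):
    """Merge several word-frequency Counters into one (hypothetical helper)."""
    total = Counter()
    for c in counters:
        total.update(c)  # update() adds counts instead of replacing them
    return total

# Toy counts for illustration only, not taken from the corpus
eng_a = Counter({'the': 3, 'of': 1})
eng_b = Counter({'the': 2, 'to': 4})
combined = combine_counts([eng_a, eng_b])
print(combined['the'], combined['to'], combined['of'])
```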

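Both top-10 lists are dominated by punctuation and function words, so we need to remove punctuation, and possibly stop words, before the counts say much about content.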
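
One way to filter a frequency table is sketched below. `filter_counts` and the stop-word set are assumptions made for illustration. A real stop-word list would have to be chosen per language, and the counts are toy values.

```python
import string
from collections import Counter

def filter_counts(counts, stop_words=frozenset()):
    """Return a copy of `counts` without punctuation-only tokens and stop words."""
    punctuation = set(string.punctuation)
    return Counter({
        word: n for word, n in counts.items()
        if word not in stop_words
        and not all(ch in punctuation for ch in word)
    })

# Toy counts for illustration only, not taken from the corpus
toy = Counter({'the': 5, ',': 4, '.': 3, 'parliament': 2, '--': 1})
filtered = filter_counts(toy, stop_words={'the'})
print(filtered.most_common())
```

Tokens such as `2000-buggen` are kept because they contain characters other than punctuation.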

In [18]:
eng_words_amount = sum(eng_frequency.values())

print(eng_words_amount)
#print(eng_frequency['speaker'])
#print(eng_frequency['zebra'])

# Unigram probability = word count / total number of tokens
speaker_probability = eng_frequency['speaker'] / eng_words_amount
zebra_probability = eng_frequency['zebra'] / eng_words_amount

# calculate probability for "speaker" and "zebra" in the english text, but we need to look at all the languages?
print(f'Probability of speaker: {speaker_probability:.6f}')
print(f'Probability of zebra: {zebra_probability}')





281476
Probability of speaker: 0.000036
Probability of zebra: 0.0
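
A word that never occurs in the corpus, such as "zebra", gets probability 0 with plain relative frequencies. A common remedy is add-alpha (Laplace) smoothing. The sketch below is one possible variant, not the method used in this notebook: `smoothed_probability` is a hypothetical helper, and treating all unseen words as a single extra vocabulary entry is an assumption.

```python
from collections import Counter

def smoothed_probability(word, counts, alpha=1.0):
    """Add-alpha estimate: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(counts.values())
    # Assumption: one extra vocabulary slot shared by all unseen words
    vocab_size = len(counts) + 1
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Toy counts for illustration only, not taken from the corpus
toy = Counter({'speaker': 3, 'the': 7})
p_speaker = smoothed_probability('speaker', toy)  # (3 + 1) / (10 + 3)
p_zebra = smoothed_probability('zebra', toy)      # (0 + 1) / (10 + 3)
print(p_speaker, p_zebra)
```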


In [23]:
words_list = []

with open(swe_eng_file_path, 'r', encoding='utf-8') as f:
    for line in f:
        # Each sentence becomes a list of whitespace-separated tokens
        words = line.strip().split()
        words_list.append(words)





['som',
 'ni',
 'kunnat',
 'konstatera',
 'ägde',
 '&quot;',
 'den',
 'stora',
 'år',
 '2000-buggen',
 '&quot;',
 'aldrig',
 'rum',
 '.',
 'däremot',
 'har',
 'invånarna',
 'i',
 'ett',
 'antal',
 'av',
 'våra',
 'medlemsländer',
 'drabbats',
 'av',
 'naturkatastrofer',
 'som',
 'verkligen',
 'varit',
 'förskräckliga',
 '.']

In [69]:
import random
from collections import defaultdict, Counter

class BigramModel:
    def __init__(self):
        # bigram_counts[w1][w2] = number of times w2 directly follows w1
        self.bigram_counts = defaultdict(Counter)
        # First word of every training sentence, used as fallback start words
        self.starting_words = []

    def train(self, sentence_list):
        """Count bigrams in a list of tokenized sentences (lists of words)."""
        for sentence in sentence_list:
            if not sentence:
                continue  # skip empty lines, which have no first word
            self.starting_words.append(sentence[0])

            # Count bigrams in the sentence
            for i in range(len(sentence) - 1):
                self.bigram_counts[sentence[i]][sentence[i+1]] += 1

    def predict_next_word(self, word):
        """Sample a successor of `word` in proportion to its bigram count."""
        if word not in self.bigram_counts:
            return None
        next_words = self.bigram_counts[word]
        # random.choices accepts unnormalized weights, so raw counts suffice
        return random.choices(list(next_words.keys()), weights=list(next_words.values()))[0]

    def generate_text(self, start_word, length=10):
        """Generate up to `length` words, starting from `start_word` if it is known."""
        start_word = start_word.lower()

        if start_word in self.bigram_counts:
            current_word = start_word
        elif self.starting_words:
            # Unknown start word: fall back to a random sentence-initial word
            current_word = random.choice(self.starting_words)
        else:
            return "Model not trained or start word not in corpus."

        generated_text = [current_word]

        for _ in range(length - 1):
            next_word = self.predict_next_word(current_word)
            if next_word is None:
                break  # End if no next word is found
            generated_text.append(next_word)
            current_word = next_word
        return ' '.join(generated_text)







In [77]:
# Example usage

model = BigramModel()
model.train(words_list)

start_word = "jag"
generated_text = model.generate_text(start_word, 100)
print(generated_text)


jag trodde att hjälpa människor som vi beklagar hans betänkande om dessa medel att godta ändringsförslag faktiskt inte att efterfrågan för att den europeiska unionen , när en allvarlig begränsning av roth-behrendt sade , och de människor måste samtidigt de summor av personer , är det irländska myndigheterna , för att man är den också viktigt som vi får inte någon stadga över vad det var lönsamt att kväva teatergrupper , deltar i strid med hur brådskande ärende , och kommissionen gjort sig därför har utsatts för någon betoning på att visa att den risken för kvestorernas möte där nationella


In [78]:


def createBigram(data):
    """Collect bigrams and their counts from one tokenized sentence.

    `data` must be a flat list of word strings, not a list of sentences.
    """
    listOfBigrams = []
    bigramCounts = {}
    unigramCounts = {}
    for i in range(len(data) - 1):
        # Keep only bigrams whose second token is lowercase text
        # (tokens without any letters, e.g. '.' or ',', are skipped)
        if data[i + 1].islower():
            bigram = (data[i], data[i + 1])
            listOfBigrams.append(bigram)
            bigramCounts[bigram] = bigramCounts.get(bigram, 0) + 1

        # The last token is not counted here: it never starts a bigram
        unigramCounts[data[i]] = unigramCounts.get(data[i], 0) + 1
    return listOfBigrams, unigramCounts, bigramCounts


def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):
    """Estimate P(word2 | word1) = count(word1, word2) / count(word1)."""
    listOfProb = {}
    for bigram in listOfBigrams:
        word1 = bigram[0]
        listOfProb[bigram] = bigramCounts[bigram] / unigramCounts[word1]
    return listOfProb


In [80]:
listOfBigrams, unigramCounts, bigramCounts = createBigram(words_list[0])

print("\n All the possible Bigrams are ")
print(listOfBigrams)

print("\n Bigrams along with their frequency ")
print(bigramCounts)

print("\n Unigrams along with their frequency ")
print(unigramCounts)

bigramProb = calcBigramProb(listOfBigrams, unigramCounts, bigramCounts)

print("\n Bigrams along with their probability ")
print(bigramProb)

inputList = "This is my cat"
splt = inputList.split()
outputProb1 = 1

# Bigrams of the input sentence
bilist = [(splt[i], splt[i + 1]) for i in range(len(splt) - 1)]

print("\n The bigrams in given sentence are ")
print(bilist)

# Sentence probability = product of its bigram probabilities;
# a bigram that was never seen contributes a factor of 0
for bigram in bilist:
    outputProb1 *= bigramProb.get(bigram, 0)
print('\n' + 'Probability of sentence \"This is my cat\" = ' + str(outputProb1))

AttributeError: 'list' object has no attribute 'islower'
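
The `AttributeError` above suggests that, when the cell ran, `createBigram` received a list of sentences (a list of lists) rather than one tokenized sentence, so `data[i+1]` was itself a list without an `.islower()` method. One possible remedy, sketched here under that assumption, is to flatten the input before building bigrams. `flatten_tokens` is a hypothetical helper, not part of the notebook.

```python
def flatten_tokens(data):
    """Return a flat list of word strings.

    Hypothetical helper: accepts either one tokenized sentence,
    e.g. ['a', 'b'], or a list of sentences, e.g. [['a', 'b'], ['c']].
    """
    flat = []
    for item in data:
        if isinstance(item, str):
            flat.append(item)
        else:
            flat.extend(item)  # item is itself a list of words
    return flat

print(flatten_tokens([['jag', 'trodde'], ['att', '.']]))
print(flatten_tokens(['jag', 'trodde']))
```

Flattening several sentences into one list also creates bigrams that span sentence boundaries, so calling `createBigram` once per sentence may be the better choice.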