# MMI_2024_NLP - Week 1

#Lab 1: Part 1

# Introduction

Before we start, please change the name of the notebook to the following format : **Firstname_LASTNAME_Lab1_A_naive_bayes.ipynb**


In some cells and files you will see code blocks that look like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

You should replace the `pass` statement with your own code and leave the blocks intact, like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
y = m * x + b
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

# (A) Naive Bayes model

In this lab, we will implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [30]:
import io, sys, math, re
from collections import defaultdict
from typing import List, Tuple, Dict


import numpy as np

The next function is used to load the data. Each line of the data consist of a label (corresponding to a language), followed by some text, written in that language. Here is an example of data:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```


In [16]:
def load_data(filename:str)->List[Tuple]:
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    for line in fin:
        tokens = line.split()
        data.append((tokens[0], tokens[1:]))
    return data

You can now try loading the first dataset `train1.txt` and look what examples look like.

In [18]:
data = load_data("train1.txt")
data[1]

('__label__de', ['Tom', 'ist', 'an', 'Kunst', 'völlig', 'uninteressiert.'])

Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label

In [40]:
def count_words(data:str)->Dict:
    n_examples = 0 # label_total
    n_words_per_label = defaultdict(lambda: 0) #word_total
    label_counts = defaultdict(lambda: 0) 
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for example in data:
        label, sentence = example
        ##########################################################################
        #                      TODO: Implement this function                     #
        ##########################################################################
        # Replace "pass" statement with your code
        # Counts labels:
        label_counts[label] +=1
        # Counts words:
        for word in sentence:
            word_counts[label][word] +=1.0
            n_words_per_label[label] +=1.0
    n_examples = len(label_counts.keys())
        ##########################################################################
        #                            END OF YOUR CODE                            #
        ##########################################################################
    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

In [46]:
n_examples = count_words(data=data)['n_examples'] #label_total
word_counts = count_words(data=data)['word_counts']
label_counts = count_words(data=data)['label_counts']
n_words_per_label = count_words(data=data)['n_words_per_label']

word_counts['__label__en']['is']

261.0

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [49]:
import operator
def predict(sentence:List, mu:float, label_counts:Dict, word_counts:Dict, n_examples:int, n_words_per_label:Dict)->str:
    best_label = None
    best_score = float('-inf')

    label_outputs = defaultdict(lambda: 0)
    for label in word_counts.keys():
        score = 0.0
        prior = label_counts[label] / sum(label_counts.values())
        #P(Class | Word) = P(Class) * P(word | Class)
        ##########################################################################
        #                      TODO: Implement this function                     #
        ##########################################################################
        # Replace "pass" statement with your code
        sentence_count = len(word_counts[label].values())
        prod = 0
        for i in sentence:
            prod += np.log((word_counts[label][i]+ mu)/(sentence_count+ (mu*n_words_per_label[label])))
        
        label_prob = np.log(label_counts[label]/sum(list(label_counts.values())))

        label_outputs[label]=prod + label_prob
    
    # best_score = max(list(label_outputs.values()))
    best_score = max(list(label_outputs.values()))
    sorted_x = sorted(label_outputs.items(), key=operator.itemgetter(1))
    best_label = sorted_x[-1][0]
        ##########################################################################
        #                            END OF YOUR CODE                            #
        ##########################################################################

    return best_label

In [48]:
for example in range(len(data)):
        label, sentence = data[example]
        print(' '.join(sentence),predict(sentence, 3, label_counts, word_counts, n_examples, n_words_per_label))
        if example == 10:
            break

Ich würde alles tun, um dich zu beschützen. __label__de
Tom ist an Kunst völlig uninteressiert. __label__de
Végeztem Tomival. __label__hu
„Wird das in der Werkstatt gemacht?“ – „Nein, das muss an Ort und Stelle erledigt werden.“ __label__de
У меня есть яблоко. __label__ru
Non possiamo lasciarle lì. __label__it
Том считает, что школа — это пустая трата времени. __label__ru
My fathers don't speak Dutch. __label__hu
El niño no sabe cómo comportarse. __label__es
Она думала, что он переночует у неё. __label__ru
Helikopter neden kentin üstünde uçuyor? __label__hu


The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [53]:
def compute_accuracy(valid_data:str, mu:float, counts:Dict)->float:
    accuracy = 0.0
    for label, sentence in valid_data:
      ##########################################################################
      #                      TODO: Implement this function                     #
      ##########################################################################
      # Replace "pass" statement with your code
      accuracy =0.0
      for label, sentence in valid_data:
        predicted_label = predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label)
        if predicted_label==label:
            accuracy += 1.0
      ##########################################################################
      #                            END OF YOUR CODE                            #
      ##########################################################################

    return  accuracy/len(valid_data) # Replace "..." statement with your code

In [54]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("train1.txt")
valid_data = load_data("valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.949



# Now, it is your turn, try to do it with train2.txt and valid2.txt.


In [None]:
#Write your code here.

### OOP for Naive'Bayes.