# AMMI_2024_NLP - Week 1

# Lab 1: Part 1

# (A) Naive Bayes model

In this lab, we will implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [None]:
import io, sys, math, re
from collections import defaultdict
from typing import List, Tuple, Dict
import numpy as np


The next function loads the data. Each line of the data consists of a label (identifying a language) followed by some text written in that language. Here is an example line:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
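
To make the format concrete, here is a small sketch of how that example line splits into a label and a list of tokens using plain whitespace splitting. The variable names are only illustrative.

```python
# Illustrative only: split the example line into a label and its tokens.
line = "__label__de Zur Namensdeutung gibt es mehrere Varianten."

tokens = line.split()                  # whitespace tokenisation
label, words = tokens[0], tokens[1:]   # first token is the label

print(label)   # __label__de
print(words)   # ['Zur', 'Namensdeutung', 'gibt', 'es', 'mehrere', 'Varianten.']
```

Note that punctuation stays attached to the neighbouring word (`'Varianten.'`), since no further tokenisation is applied.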


In [None]:
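def load_data(filename: str) -> List[Tuple[str, List[str]]]:
    """Read one example per line and return (label, tokens) pairs."""
    data = []
    # The context manager closes the file even if reading fails.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines, which carry no label
                continue
            data.append((tokens[0], tokens[1:]))
    return data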



You can now load the first dataset, `train1.txt`, and inspect what an example looks like.

In [None]:
data = load_data("train1.txt")
print(data[0])



Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label
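
As a sanity check on these definitions, the following sketch computes the four quantities on a tiny made-up dataset. The toy data and the use of `collections.Counter` are assumptions for illustration, not the implementation expected in the lab.

```python
from collections import Counter, defaultdict

# Hypothetical toy data in the (label, tokens) format used in this lab.
toy_data = [
    ("__label__en", ["the", "cat", "the"]),
    ("__label__fr", ["le", "chat"]),
    ("__label__en", ["the", "dog"]),
]

n_examples = len(toy_data)
label_counts = Counter(label for label, _ in toy_data)
n_words_per_label = Counter()
word_counts = defaultdict(Counter)
for label, sentence in toy_data:
    n_words_per_label[label] += len(sentence)   # tokens seen with this label
    word_counts[label].update(sentence)         # per-label word frequencies

print(n_examples)                          # 3
print(label_counts["__label__en"])         # 2
print(n_words_per_label["__label__en"])    # 5
print(word_counts["__label__en"]["the"])   # 3
```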

In [None]:
def count_words(data: List[Tuple]) -> Dict:
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for label, sentence in data:
        n_examples += 1

        # Total number of tokens seen with this label.
        n_words_per_label[label] += len(sentence)

        # The defaultdicts start missing keys at zero, so no membership
        # test is needed before incrementing.
        label_counts[label] += 1

        # Count every occurrence, including repeated words in one sentence.
        for word in sentence:
            word_counts[label][word] += 1

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}



Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.
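
The role of `mu` can be seen in isolation with a small sketch of additive smoothing. The helper `smoothed_log_prob` and its parameter `n_outcomes` (the number of outcomes the probability mass is spread over, typically the vocabulary size) are hypothetical names, and the numbers are made up.

```python
import math

def smoothed_log_prob(word_count: int, total_words: int, mu: float, n_outcomes: int) -> float:
    """Log of the additively smoothed estimate (word_count + mu) / (total_words + mu * n_outcomes)."""
    return math.log((word_count + mu) / (total_words + mu * n_outcomes))

# Hypothetical numbers: a label with 1000 tokens and 50 possible outcomes.
seen = smoothed_log_prob(word_count=20, total_words=1000, mu=0.001, n_outcomes=50)
unseen = smoothed_log_prob(word_count=0, total_words=1000, mu=0.001, n_outcomes=50)

print(seen > unseen)          # True: observed words score higher
print(math.isfinite(unseen))  # True: mu > 0 keeps the argument of log() positive
```

With `mu = 0`, a word never seen with a label would have probability zero, and `math.log(0)` raises a `ValueError`, which is why a small positive value is used.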

In [None]:
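def predict(sentence:List, mu:float, label_counts:Dict, word_counts:Dict, n_examples:int, n_words_per_label:Dict)->str:
    best_label = None
    best_score = float('-inf')

    for label in word_counts.keys():
        score = 0.0
        # Prior P(label): fraction of training examples carrying this label.
        prior = label_counts[label] / n_examples
        # P(label | sentence) is proportional to P(label) * product of P(word | label).
        # We sum logs rather than multiply probabilities to avoid underflow.

        for word in sentence:
            # Use .get so that unseen words do not add entries to the counts.
            word_count = word_counts[label].get(word, 0)

            # Smoothed likelihood P(word | label). Note: the smoothing mass in the
            # denominator is mu times the number of labels; textbook additive
            # smoothing would use the vocabulary size there instead.
            likelihood = (word_count + mu) / (n_words_per_label[label] + mu * len(word_counts.keys()))

            score += math.log(likelihood)
        score += math.log(prior)

        # Keep the label with the highest log-score.
        if score > best_score:
            best_score = score
            best_label = label

    return best_label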

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.
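
Because the accuracy depends on `mu`, one natural use of such a function is to compare a few candidate values and keep the best one. The sketch below shows that selection step only; `pick_best_mu` is a hypothetical helper, and `fake_accuracy` is a stand-in with made-up numbers that would be replaced by a real validation run.

```python
from typing import Callable, Iterable, Tuple

def pick_best_mu(candidates: Iterable[float],
                 accuracy_fn: Callable[[float], float]) -> Tuple[float, float]:
    """Return the (mu, accuracy) pair with the highest accuracy."""
    scored = [(mu, accuracy_fn(mu)) for mu in candidates]
    return max(scored, key=lambda pair: pair[1])

# Stand-in for a real validation run: the accuracies here are invented.
def fake_accuracy(mu: float) -> float:
    table = {0.001: 0.90, 0.01: 0.93, 0.1: 0.91}
    return table[mu]

best_mu, best_acc = pick_best_mu([0.001, 0.01, 0.1], fake_accuracy)
print(best_mu, best_acc)   # 0.01 0.93
```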

In [None]:
def compute_accuracy(valid_data:List[Tuple], mu:float, counts:Dict)->float:
    accuracy = 0.0
    for label, sentence in valid_data:
        predicted_label = predict(sentence, mu, counts['label_counts'], counts['word_counts'], 
                                  counts['n_examples'], counts['n_words_per_label'])
        if predicted_label == label:
            accuracy += 1

    accuracy = accuracy / len(valid_data)
     
    return accuracy 



In [None]:
print("")
print("** Naive Bayes **")
print("")

mu = 0.001
train_data = load_data("train1.txt")
valid_data = load_data("valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")



# Now it is your turn: try it with `train2.txt` and `valid2.txt`.


In [None]:
#Write your code here.

print("")
print("** Naive Bayes **")
print("")

mu = 0.001
train_data = load_data("train2.txt")
valid_data = load_data("valid2.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")