
The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [5]:
import io, sys, math, re
from collections import defaultdict

The next function loads the data. Each line of the data consists of a label (corresponding to a language) followed by some text written in that language. Here is an example:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
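
As a small illustration of this format, the sketch below splits one such line into its label and its whitespace-separated tokens. The helper name `parse_line` and the step that strips the `__label__` prefix are assumptions made for this example; the lab code keeps the full `__label__de` string as the label.

```python
def parse_line(line, prefix="__label__"):
    """Split one line into (language code, tokens). Hypothetical helper."""
    tokens = line.split()
    label = tokens[0]
    # Strip the prefix only to make the language code easier to read.
    lang = label[len(prefix):] if label.startswith(prefix) else label
    return lang, tokens[1:]

lang, words = parse_line("__label__de Zur Namensdeutung gibt es mehrere Varianten.")
print(lang)
print(words)
```

Note that splitting on whitespace leaves punctuation attached to words (for example `Varianten.`), so `Varianten` and `Varianten.` count as different words.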


In [6]:
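def load_data(filename):
    """Read one example per line and return a list of (label, tokens) pairs."""
    data = []
    # The context manager closes the file even if parsing fails.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines, which have no label
                continue
            data.append((tokens[0], tokens[1:]))
    return data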

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


You can now load the first dataset, `train1.txt`, and inspect what the examples look like.

In [8]:
data = load_data("/content/drive/MyDrive/Lab_Week1_NLP/train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label
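
To make these four quantities concrete, here is a minimal sketch that computes them on a three-example toy dataset. The data is made up for illustration, and `Counter` is used here only for brevity; it is not necessarily how the lab function is written.

```python
from collections import Counter, defaultdict

# Toy data in the same (label, tokens) shape that load_data returns.
toy_data = [
    ("__label__en", ["the", "cat", "sat"]),
    ("__label__en", ["the", "dog"]),
    ("__label__fr", ["le", "chat"]),
]

label_counts = Counter()
n_words_per_label = Counter()
word_counts = defaultdict(Counter)

for label, sentence in toy_data:
    label_counts[label] += 1
    n_words_per_label[label] += len(sentence)
    word_counts[label].update(sentence)

n_examples = len(toy_data)
print(n_examples)                         # 3
print(label_counts["__label__en"])        # 2
print(n_words_per_label["__label__en"])   # 5
print(word_counts["__label__en"]["the"])  # 2
```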

In [None]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(int)
    label_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(float))

    for label, sentence in data:
        n_examples += 1
        label_counts[label] += 1
        n_words_per_label[label] += len(sentence)
        # Count how often each word occurs with this label.
        for word in sentence:
            word_counts[label][word] += 1

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

# Print a short summary rather than the full nested dictionaries.
counts = count_words(data)
print(counts['n_examples'], dict(counts['label_counts']))

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.
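
Written out, the score that `predict` is meant to maximize can be sketched as follows. This is the standard Naive Bayes decision rule in log space with additive smoothing; the symbols $n_y$, $N_y$, $V_y$ and $c(w, y)$ are notation introduced here for illustration, not names from the lab.

$$
\hat{y} = \arg\max_{y} \left[ \log \frac{n_y}{n} + \sum_{w \in \text{sentence}} \log \frac{c(w, y) + \mu}{N_y + \mu V_y} \right]
$$

Here $n_y$ corresponds to `label_counts[y]`, $n$ to `n_examples`, $c(w, y)$ to `word_counts[y][w]`, and $N_y$ to `n_words_per_label[y]`. $V_y$ is the vocabulary size used for smoothing; taking it to be the number of distinct words seen with label $y$ is one option, and the size of the full training vocabulary is another common choice. Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long sentences.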

In [26]:
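def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')

    for label in word_counts.keys():
        # Number of distinct words seen with this label, used as the vocabulary size.
        n = len(word_counts[label])
        # Smoothed denominator; it does not depend on the word (assumes mu > 0).
        t_ck = n_words_per_label[label] + mu * n
        # Start from the log prior log P(label) so that every term is a log-probability.
        score = math.log(label_counts[label] / n_examples)
        for word in sentence:
            # .get() avoids inserting unseen words into the defaultdict,
            # which would silently change n on later calls.
            c_k = word_counts[label].get(word, 0.0) + mu
            score += math.log(c_k / t_ck)
        if score > best_score:
            best_score = score
            best_label = label

    return best_label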

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [27]:
def compute_accuracy(valid_data, mu, counts):
    correct = 0
    for label, sentence in valid_data:
        # counts holds exactly the keyword arguments that predict expects.
        pred = predict(sentence, mu, **counts)
        if pred == label:
            correct += 1

    return correct / len(valid_data)
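
To see counting, prediction and accuracy working together without the Drive files, here is a minimal self-contained sketch on made-up data. All `toy_*` names are hypothetical and exist only for this example. As a simplifying assumption, it smooths with the size of the full training vocabulary, and it evaluates a few values of `mu` to show how the parameter could be compared on a validation set.

```python
import math
from collections import Counter, defaultdict

# Made-up data in the same (label, tokens) shape that load_data returns.
toy_train = [
    ("__label__en", ["the", "cat", "sat"]),
    ("__label__en", ["the", "dog", "ran"]),
    ("__label__fr", ["le", "chat", "dort"]),
    ("__label__fr", ["le", "chien", "court"]),
]
toy_valid = [
    ("__label__en", ["the", "dog", "sat"]),
    ("__label__fr", ["le", "chat", "court"]),
]

def toy_fit(examples):
    """Count labels and (label, word) co-occurrences."""
    label_counts, word_counts = Counter(), defaultdict(Counter)
    for label, sentence in examples:
        label_counts[label] += 1
        word_counts[label].update(sentence)
    return label_counts, word_counts

def toy_predict(sentence, mu, label_counts, word_counts):
    """Return the label with the highest smoothed log-score (assumes mu > 0)."""
    n_examples = sum(label_counts.values())
    # Assumption: smooth over the full training vocabulary.
    vocab_size = len({w for c in word_counts.values() for w in c})
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + mu * vocab_size
        score = math.log(label_counts[label] / n_examples)  # log prior
        # Counter returns 0 for unseen words without inserting them.
        score += sum(math.log((counts[w] + mu) / total) for w in sentence)
        scores[label] = score
    return max(scores, key=scores.get)

def toy_accuracy(examples, mu, model):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(toy_predict(s, mu, *model) == y for y, s in examples)
    return correct / len(examples)

model = toy_fit(toy_train)
results = {mu: toy_accuracy(toy_valid, mu, model) for mu in (0.1, 1.0, 10.0)}
for mu, acc in results.items():
    print("mu=%.1f accuracy=%.3f" % (mu, acc))
```

On data this small and this cleanly separated every value of `mu` classifies both validation sentences correctly; differences between values only become visible on a realistic validation set.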

In [29]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("/content/drive/MyDrive/Lab_Week1_NLP/train1.txt")
valid_data = load_data("/content/drive/MyDrive/Lab_Week1_NLP/valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")