Implement from scratch the Naive Bayes algorithm. Then improve the model obtained using standard Naive Bayes algorithm; The improvement can include: adding new features such as n-grams (phrases of n words, for some n that is tunable hyperparameter). 
(in Python using Jupyter notebook).

**Report**

**1. explain and motivate the chosen representation & data preprocessing**

The dataset contains project descriptions labeled with four categories: Web Development (W), Game Development (G), Security (S), and AI/ML (A).

Each description was lowercased and split into individual words. Stop words (like “the”, “and”, “is”, "we") were removed to reduce noise. The initial model used only unigrams (single words) to represent the text.

However, Naive Bayes assumes that words appear independently, which is often unrealistic in natural language. To improve the model, extended the model to use bigrams—pairs of adjacent words (like “web development” or “decision making”). This helps capture more contextual meaning.

Applied Laplace smoothing to avoid assigning zero probabilities to words or phrases not seen in the training data.

Also noted that the training data is imbalanced. Over half the labels are ‘W’ (52%), while only 3% are ‘S’. This makes the model biased toward predicting more common labels.


**2. explain the idea behind the model improvements and their implementation (including the implementation of the standard Naive Bayes)**

The standard model is a basic Naive Bayes classifier. It calculates:
 - Prior probabilities: how often each label appears (e.g., P(W), P(S))
 - Conditional probabilities: how often each word appears given a label (e.g., P(word | W))

To improve the model:
 - Laplace smoothing: This helps when the model sees new words or phrases it didn’t learn during training. Instead of giving them zero probability, Laplace smoothing gives them a small value.
 - N-grams (bigrams): Instead of using only single words, added bigrams to help the model learn word pairs. This helps it understand common phrases used in project descriptions.
 - Also used log probabilities when multiplying values, to avoid very small numbers (underflow problems) during calculation.


**3. explain the evaluation procedure (e.g., cross-validation or training/validation split)**

Splitted the dataset into two parts: one for training and one for testing. Trained the model using the training set and then checked how well it worked on the testing set.

Accuracy was used as the main score to compare different versions of the model: Standard Naive Bayes, Naive Bayes with Laplace smoothing, Naive Bayes with n-gram features (bigrams) and Naive Bayes with n-gram with Laplace smoothing


**4. include and explain the training/validation results for the standard and improved Naive Bayes model. You can summarize results using tables (or plots), but all results have to be explained descriptively as well.**

The accuracy of each model:
Standard Naive Bayes:	0.91
Standard Naive Bayes with Laplace:	0.90
Naive Bayes with bigrams:	0.885
Naive Bayes with bigrams with Laplace: 0.92

The best result came from using bigrams combined with Laplace smoothing, giving an accuracy of 0.92. 
This suggests that capturing word pairs and handling unseen phrases helps the model make better predictions.


**5. be written in plain English and should not be longer than two A4 pages (export the notebook as pdf to see if the report section fits in two pages).**

In [215]:
import pandas as pd
import numpy as np
import nltk 
import sklearn 
import math

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) #common words as 'a', 'we', 'the'
#stop_words

training_data = pd.read_csv("train.csv")
#training_data

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\higher\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [216]:
#Artificial Intelligence and Machine Learning (A)
#Privacy and Security (S)
#Game Development (G)
#Web Development (W)

#separate rows for each class, only includes description
wd = []
gd = []
ps = []
aiml = []
for i in range(len(training_data)):
    if (training_data["Class"][i] == "W"):
        wd.append(training_data["Description"][i])
    if (training_data["Class"][i] == "G"):
        gd.append(training_data["Description"][i])
    if (training_data["Class"][i] == "S"):
        ps.append(training_data["Description"][i])
    if (training_data["Class"][i] == "A"):
        aiml.append(training_data["Description"][i])
#wd

pwd = len(wd) / len(training_data) #prior probabilities = number of label W in the dataset / all rows in the dataset
pgd = len(gd) / len(training_data)
pps = len(ps) / len(training_data)
paiml = len(aiml) / len(training_data)
#pwd

def preprocess(text):
    text = text.lower() #all into lowercase
    words = text.split() #split word for word
    filtered_words = [word for word in words if word not in stop_words] #remove stop words
    return filtered_words
#preprocess(wd[0])


word_by_label = {'W': {}, 'G': {}, 'S': {}, 'A': {}} #dictionary to store word counts for each label
for label in word_by_label: #for each label

    description_list = []
    if label == 'W':
        description_list = wd
    if label == 'G':
        description_list = gd
    if label == 'S':
        description_list = ps
    if label == 'A':
        description_list = aiml

    for row in description_list: #for each row in description_list
        words = preprocess(row) #preprocess each row of text
        for word in words: #for each words,  
            if word in word_by_label[label]: #if already exist in the dictionary, increment count
                word_by_label[label][word] += 1
            else: #if new unique word, add to dictionary with count 1
                word_by_label[label][word] = 1
#word_by_label['W'] #number of each word given label = 'W'
#word_by_label['A']['developed']
#len(word_by_label['W']) #number of total words given W



all_words = {} #dictionary to store word counts for each label
all_description_list = training_data['Description'] #sample size description list

for row in all_description_list: #for all rows in description column
    words = preprocess(row) #preprocess each row of text
    for word in words: #for each words,  
        if word in all_words: #if (key) already exist in the dictionary, increment count
            all_words[word] += 1
        else: #if new unique word, add (key) to dictionary with count 1
            all_words[word] = 1
#all_words #number of all words of the dataset
#all_words['developed'] #number of all 'developed' total (sample size)

tot_unique_words = len(all_words) #total unique words of all total (sample size)
#tot_unique_words


In [217]:
# training data conditional probabilities
#we use the training data numbers to calculate the labels for the testing data

#calculate for each label 
#for each row of test data
#and choose the largest probability

testing_data = pd.read_csv("test.csv")
#testing_data


labels = [] #predicted labels 
labels_laplace = []
for row in testing_data['Description']: #for each Description row in testing data

    probabilities_by_label = {'W': {}, 'G': {}, 'S': {}, 'A': {}} #dictionary to store conditional probabilites for each label (standard Naive Bayes algorithm)
    probabilities_by_label_laplace = {'W': {}, 'G': {}, 'S': {}, 'A': {}} #laplace smoothing
    for label in word_by_label: #for each label

        posterior_probabilites = 0
        posterior_probabilites_laplace = 0
        if label == 'W':
            posterior_probabilites = pwd
            posterior_probabilites_laplace = pwd
        if label == 'G':
            posterior_probabilites = pgd
            posterior_probabilites_laplace = pgd
        if label == 'S':
            posterior_probabilites = pps
            posterior_probabilites_laplace = pps
        if label == 'A':
            posterior_probabilites = paiml
            posterior_probabilites_laplace = paiml

        words = preprocess(row)
        for word in words:
            if word in word_by_label[label]: #if the word exists in the training data
                p_word_given_label = word_by_label[label][word] / len(word_by_label[label]) #number of all 'word' given label / number of total words of the given
                p_word_given_label_laplace = ((word_by_label[label][word]) + 1) / (len(word_by_label[label]) + tot_unique_words) #laplace smoothing

            posterior_probabilites = posterior_probabilites * p_word_given_label 
            posterior_probabilites_laplace = posterior_probabilites_laplace * p_word_given_label_laplace
        probabilities_by_label[label] = posterior_probabilites
        probabilities_by_label_laplace[label] = posterior_probabilites_laplace

        #so there will be 8 probabilities for each row of test data (4 for without smoothing, and 4 for with laplace smoothing)
    max_label = max(probabilities_by_label, key=probabilities_by_label.get) #get the largest probability out of the labels
    #max_value = probabilities_by_label[max_word]
    labels.append(max_label)

    max_label_laplace = max(probabilities_by_label_laplace, key=probabilities_by_label_laplace.get)
    labels_laplace.append(max_label_laplace)
#labels #0.91
#labels_laplace 0.90636
labels == labels_laplace #not the same

False

In [218]:
# Create DataFrame
df = pd.DataFrame({
    'Id': range(1, len(labels) + 1),
    'content': labels
})
# Save to CSV
df.to_csv('output.csv', index=False)



# Create DataFrame
df = pd.DataFrame({
    'Id': range(1, len(labels_laplace) + 1),
    'content': labels_laplace
})
# Save to CSV
df.to_csv('output_laplace.csv', index=False)

In [219]:
from nltk.util import ngrams

def preprocess_with_ngrams(text, n):
    text = text.lower()
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]

    ngram_tokens = ['_'.join(gram) for gram in ngrams(filtered_words, n)]  #join words e.g. ['web_development', 'development_using']
    return filtered_words + ngram_tokens  # include both unigrams and bigrams
#preprocess_with_ngrams(testing_data['Description'][0], n=2)


# Count all unique unigrams + n-grams from training data
all_words_ngrams = {}
for row in all_description_list:
    words = preprocess_with_ngrams(row, n=4)
    for word in words: 
        if word in all_words_ngrams: 
            all_words_ngrams[word] += 1
        else:
            all_words_ngrams[word] = 1

tot_unique_words_ngram = len(all_words_ngrams)
total_vocab_size_ngram = tot_unique_words + tot_unique_words_ngram  # <-- Laplace fix
#total_vocab_size_ngram

In [220]:
labels_ngrams = []
labels_laplace_ngrams = []

for row in testing_data['Description']:
    p_by_label = {'W': 0, 'G': 0, 'S': 0, 'A': 0}
    p_by_label_laplace = {'W': 0, 'G': 0, 'S': 0, 'A': 0}

    words = preprocess_with_ngrams(row, n=2)

    for label in word_by_label:
        if label == 'W':
            pp = math.log(pwd)
        if label == 'G':
            pp = math.log(pgd)
        if label == 'S':
            pp = math.log(pps)
        if label == 'A':
            pp = math.log(paiml)

        pp_laplace = pp

        for word in words:
            if word in word_by_label[label]:
                p_word_given_label = word_by_label[label][word] / len(word_by_label[label])
                pp += math.log(p_word_given_label)

            count = word_by_label[label].get(word, 0)
            prob = (count + 1) / (len(word_by_label[label]) + total_vocab_size_ngram)
            pp_laplace += math.log(prob)

        p_by_label[label] = pp
        p_by_label_laplace[label] = pp_laplace

    max_label_ngrams = max(p_by_label, key=p_by_label.get)
    labels_ngrams.append(max_label_ngrams)

    max_label_laplace = max(p_by_label_laplace, key=p_by_label_laplace.get)
    labels_laplace_ngrams.append(max_label_laplace)
#labels_ngrams
#labels_laplace_ngrams

In [221]:
# Create DataFrame
df = pd.DataFrame({
    'Id': range(1, len(labels_ngrams) + 1),
    'content': labels_ngrams
})
# Save to CSV
df.to_csv('output_ngrams.csv', index=False)


# Create DataFrame
df = pd.DataFrame({
    'Id': range(1, len(labels_laplace_ngrams) + 1),
    'content': labels_laplace_ngrams
})
# Save to CSV
df.to_csv('output_laplace_ngrams.csv', index=False)