# HW2: Classification

---




## Name: Mohammad Mahdi Heydari Nasab
## Student ID: 99105389

In [1]:
# Install the dependencies you use in the notebook here:
import nltk
!pip install gensim
from nltk.corpus import stopwords
from nltk.stem import wordnet
import numpy as np
import pandas as pd
from evaluation import *
from tqdm import *



# 0. GloVe Embeddings

In this section we will prepare GloVe embeddings for our classification task.
First we will download the embeddings file, then we will initialize a word2vec based on GloVe embeddings.

Finally we will write a funcation to produce the text's GloVe embeddings.

GloVe embeddings are available in diferent vector sizes, for this assignment we will use the 50d version.

You can read more about GloVe embeddings in [this Medium post](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010).

## 0.0. Preprocessing
use the preprocessing functions from your last homework to define the following function.

In [2]:
def preprocess_text(text):
    """
    Gets a text and returns the preprocessed version of it.

    Parameters:
        text (str): input text

    Returns:
        text (str): preprocessed text
    """
    punctuations = "``''?:!.,;،؛؟#"
    text = text.lower()
    text = nltk.tokenize.word_tokenize(text)
    text = [word for word in text if word not in punctuations]
    english_stopwords = set(stopwords.words('english'))
    text = [word for word in text if word not in english_stopwords]
    english_lemmatizer = wordnet.WordNetLemmatizer()
    text = [english_lemmatizer.lemmatize(word) for word in text]
    return text

In [3]:
def test_preprocess_text():
    text = "Dude! imagine being a #bug\n and accidentally getting STUCK! in a #car and   driving far af away from everything you know!!!"
    try:
        print(preprocess_text(text))
    except NotImplementedError:
        print("Run this cell after implementation of the above cell!")

test_preprocess_text()

['dude', 'imagine', 'bug', 'accidentally', 'getting', 'stuck', 'car', 'driving', 'far', 'af', 'away', 'everything', 'know']


## 0.1. Download Data
Download the GloVe data using the cell below. 

In [4]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2023-02-04 14:40:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-02-04 14:40:43--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
ERROR: cannot verify nlp.stanford.edu's certificate, issued by `/C=US/ST=MI/L=Ann Arbor/O=Internet2/OU=InCommon/CN=InCommon RSA Server CA':
  Self-signed certificate encountered.
To connect to nlp.stanford.edu insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


## 0.2. Make glove model

In [5]:
from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = 'glove.6B.50d.txt'
word2vec_output_file = 'glove.6B.50d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

  glove2word2vec(glove_input_file, word2vec_output_file)


(400000, 50)

In [6]:
from gensim.models import KeyedVectors # load the Stanford GloVe model

filename = 'glove.6B.50d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

## 0.3. Word embedding


In [7]:
def get_token_embedding(token):
    """
    Gets a token and returns its GloVe embedding.

    Parameters:
        token (str): input token

    Returns:
        embedding_vector (np.array): embedding vector in numpy format.
    """
    if model.__contains__(token):
        return model.get_vector(token, norm = True)
    else:
        return np.zeros((50,))

In [8]:
def test_get_token_embedding():
    token = "frog"
    try:
        assert get_token_embedding(token).shape == (50,), "Wrong embedding initialization"
    except NotImplementedError:
        print("Run this cell after implementation of the above cell!")
    print('Embedding of', token, ':\n ', get_token_embedding(token))

test_get_token_embedding()

Embedding of frog :
  [ 0.12718841 -0.04325256 -0.14992847  0.1860879   0.06768462  0.15954083
  0.03779937 -0.06894321  0.16497736 -0.06598011  0.00232193  0.09462761
  0.33323455  0.00281182 -0.01951356  0.04010192  0.05230232  0.23498537
 -0.22706708 -0.08941196 -0.23602724 -0.18850714  0.11704467 -0.01218248
  0.20852165 -0.08130198 -0.08681977  0.15361671 -0.11215618 -0.20002617
  0.14154759 -0.12305214  0.02793902  0.11309178 -0.07629679  0.00312105
 -0.05201059 -0.16896775  0.01644189 -0.20327474 -0.13834901 -0.03856619
 -0.1816495   0.06414223  0.26753366 -0.03101465  0.12956388 -0.31443903
  0.03038536 -0.06601761]


## 0.4. Vectorize input

In [9]:
def vectorize_text(tokens):
    """
    Gets a list of tokens and returns the average GloVe
    embedding of all the tokens.

    Parameters:
        tokens (List(str)): list of input tokens

    Returns:
        embedding_vector (np.array): embedding vector in numpy format.
    """
    average_vector = np.zeros((50,))
    for i in range(len(tokens)):
        average_vector += get_token_embedding(tokens[i])
    average_vector = average_vector / len(tokens)
    return average_vector

In [10]:
def test_vectorize_text():
    text = "The Princess and the Frog"
    tokens = preprocess_text(text)
    try:
        assert vectorize_text(tokens).shape == (50,), "Wrong embeddings"
    except NotImplementedError:
        print("Run this cell after implementation of the above cell!")
    print('Embedding of', tokens, ':\n ', vectorize_text(tokens))
test_vectorize_text()

Embedding of ['princess', 'frog'] :
  [ 0.19482759  0.11889462 -0.17737214  0.15396611  0.08941954  0.17433513
  0.00553212  0.06158997  0.07124736 -0.09829862  0.05330768  0.11427242
  0.18295492 -0.03992577  0.05355072  0.06239018 -0.09932238  0.11463517
 -0.12799442 -0.01127088 -0.05557535 -0.01588819  0.03568342  0.0324892
  0.15007959 -0.14938772 -0.20444039  0.06785171 -0.0632048  -0.1077086
  0.15069643 -0.00078949  0.03694061  0.15214353  0.01963824 -0.01524018
 -0.02847135 -0.09666358 -0.03960238 -0.16786506 -0.02946135 -0.01346679
 -0.04725478 -0.07911661  0.14332134 -0.05626741  0.0072868  -0.33822565
  0.07844153 -0.0640892 ]


# 1. Dataset
There is a dataset `classification_dataset.zip` file adjacent to this notebook. The data contains more news documents, each one consisting of a title, body, and a category (1: World, 2: Sports, 3: Business, 4: Science/Technology). In the following cells load the dataset in Pandas dataframe format.

## 1.0. Load in dataframe format

In [11]:
def load_dataset(dataset_path):
    """
    Loads the dataset file and returns it in pandas format.

    Parameters:
        dataset_path (str): path to dataset file

    Retuns:
        dataset (pd.DataFrame): pandas dataframe object.
    """
    return pd.read_json(dataset_path)

In [12]:
train_dataset = load_dataset("classification_dataset/train.json")
val_dataset = load_dataset("classification_dataset/validation.json")

train_dataset.head()

Unnamed: 0,body,category,title
0,"Every day, cubicle-dwellers get up from their ...",4,MobileAccess Networks Strengthens Signals for ...
1,New 1.8-inch hard drives may boost battery lif...,4,Hitachi Drives Consumer Storage
2,A hearing into allegations of racism against t...,1,Cricket: Zim race probe halted
3,The prospect that a tropical storm and a hurri...,4,Simultaneous Tropical Storms are Very Rare
4,Second seed Jiri Novak and number three Guille...,2,NOVAK AND CANAS SET UP SHOWDOWN


## 1.1. Vectorize dataset
In this part, for getting the GloVe embedding, simply concatenate the text in the title and body of each entry and consider them as one text input.

In [13]:
def vectorize_dataset(dataset):
    """
    Vectorize all text inputs of the dataset. Each vector should be
    the average GloVe embedding of the text presented in its 
    respective row.
    The output will be in form of pandas with one additional column
    named `embedding`.

    Parameters:
        dataset (pd.DataFrame): path to dataset file

    Retuns:
        new_dataset (pd.DataFrame): new dataset with the added embeddings column
    """
    vectors = list()
    for i in range(len(dataset)):
        vectors.append(vectorize_text(preprocess_text(dataset['title'][i] + dataset['body'][i])))
    
    dataset['Embedding'] = vectors
    return dataset

In [14]:
train_dataset_vect = vectorize_dataset(train_dataset)
val_dataset_vect = vectorize_dataset(val_dataset)

train_dataset_vect.head()

Unnamed: 0,body,category,title,Embedding
0,"Every day, cubicle-dwellers get up from their ...",4,MobileAccess Networks Strengthens Signals for ...,"[0.04450523899868131, 0.03620941671761102, 0.0..."
1,New 1.8-inch hard drives may boost battery lif...,4,Hitachi Drives Consumer Storage,"[0.01512077826863298, -0.02261329721659422, 0...."
2,A hearing into allegations of racism against t...,1,Cricket: Zim race probe halted,"[-0.030633848692689623, -0.05615816019209368, ..."
3,The prospect that a tropical storm and a hurri...,4,Simultaneous Tropical Storms are Very Rare,"[0.03878044509328902, 0.020403445764289548, 0...."
4,Second seed Jiri Novak and number three Guille...,2,NOVAK AND CANAS SET UP SHOWDOWN,"[-0.040298306429758665, 0.015299352444708347, ..."


# 2. KNN

In the following cells, you are asked to implement the KNN algorithm. 

You will write a function that gets the training samples and a specific query and a parameter `k` defining the count of nearest neighbours.

The function should predict the query's label based on the top `k` nearest embedding vectors labels.

## 2.0. Training

In [15]:
def get_distance(v1, v2):
    """
    Returns the euclidian distance between two vectors.

    Parameters:
        v1 (np.array) : vector 1
        v2 (np.array) : vector 2

    Returns:
        dist (float) : euclidian distance between the vectors
    """
    norm = np.linalg.norm(v1)
    if norm != 1:
        v1 = v1 / norm
    norm = np.linalg.norm(v2)
    if norm != 1:
        v2 = v2 / norm
    return np.linalg.norm(v1 - v2)

In [16]:
def KNN(query, k, train_dataset):
    """
    Predict the query label based on its `k` nearest neighbours in the training set.

    Parameters:
        query (np.array) : input query embedding to predict.
        k (int) : defines the nearest neighbours to considered
        train_dataset (pd.DataFrame) : training dataset samples

    Returns:
        label (int) : Predicted label
    """
    query_vector = vectorize_text(query)
    distances = dict()
    for i in range(len(train_dataset)):
        distances[i] = get_distance(query_vector, train_dataset['Embedding'][i])
    top_k = {key: value for key, value in sorted(distances.items(), key=lambda item: item[1])}
    count = 0
    classes = list()
    for key in top_k:
        classes.append(train_dataset['category'][key])
        count += 1
        if count == k:
            break
    return max(set(classes), key = classes.count)



In [17]:
def test_KNN():
    text = "The Princess and the Frog"
    tokens = preprocess_text(text)
    query = vectorize_text(tokens)
    try:
        print(KNN(query, 10, train_dataset_vect))
    except NotImplementedError:
        print("Run this cell after implementation of the above cell!")

test_KNN()

  v1 = v1 / norm


4


## 2.1. Prediction and Evaluation
We will implement a function that gets the validation set and calculates classification metrics on its predictions based on a specific `k`.

After that you should report the metrics for `k` values of `{1, 5, 10, 20}` on the validation set.

In [18]:
def evaluate_KNN(k, val_dataset, train_dataset):
    """
    Evaluate KNN with the given `k` on the validation set.
    Then return the classification metrics in a dictionary.
    Keys will be the metric name in `str` format and the respective
    values are the metric's values.
    For example {'f1': 0.65}

    Parameters:
        k (int) : defines the nearest neighbours to considered
        val_dataset (pd.DataFrame) : validation dataset samples
        train_dataset (pd.DataFrame) : training dataset samples

    Returns:
        metrics (dict) : dictionary containing prediction based on metrics.
    """
    classes = list()
    for i in tqdm(range(len(val_dataset))):
        query = preprocess_text(val_dataset['title'][i] + val_dataset['body'][i])
        classes.append(KNN(query, k, train_dataset))
    y_true = list()
    y_predicted = list()
    for i in range(1, 5):
        y_true_i = list()
        y_predicted_i = list()
        for j in range(len(classes)):
            if val_dataset['category'][j] == i:
                y_true_i.append(True)
            else:
                y_true_i.append(False)
            if classes[j] == i:
                y_predicted_i.append(True)
            else:
                y_predicted_i.append(False)
        y_true.append(y_true_i)
        y_predicted.append(y_predicted_i)
    metrics = dict()
    metrics['accuracy'] = get_accuracy(classes, list(val_dataset['category']))
    metrics['micro-precision'] = get_precision(y_true, y_predicted, 'micro')
    metrics['macro-precision'] = get_precision(y_true, y_predicted, 'macro')
    metrics['micro-recall'] = get_recall(y_true, y_predicted, 'micro')
    metrics['macro-recall'] = get_recall(y_true, y_predicted, 'macro')
    metrics['micro-f1'] = get_f1_score(y_true, y_predicted, 'micro')
    metrics['macro-f1'] = get_f1_score(y_true, y_predicted, 'macro')
    return metrics


In [19]:
metrics_for_each_k = dict()
for k in [1, 5, 10, 20]:
    try:
        metrics_for_each_k[k] = evaluate_KNN(k, val_dataset_vect, train_dataset_vect)
    except NotImplementedError:
        print("Run this cell after implementation of the above cell!")

100%|██████████████████████████████████████████████████████████████████████████████| 3000/3000 [27:03<00:00,  1.85it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3000/3000 [26:17<00:00,  1.90it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3000/3000 [26:11<00:00,  1.91it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3000/3000 [25:48<00:00,  1.94it/s]


## 2.2. Sklearn KNN
In the following cell use the Sklearn KNN on your training set and then compare its results with your own results.

In [20]:
# TODO: use sklearn KNN
from sklearn.neighbors import KNeighborsClassifier
metrics_for_each_k_2 = dict()
for k in [1, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(list(train_dataset_vect['Embedding']), list(train_dataset_vect['category']))
    prediction = knn.predict(list(val_dataset_vect['Embedding']))
    y_true = list()
    y_predicted = list()
    for i in range(1, 5):
        y_true_i = list()
        y_predicted_i = list()
        for j in range(len(prediction)):
            if val_dataset_vect['category'][j] == i:
                y_true_i.append(True)
            else:
                y_true_i.append(False)
            if prediction[j] == i:
                y_predicted_i.append(True)
            else:
                y_predicted_i.append(False)
        y_true.append(y_true_i)
        y_predicted.append(y_predicted_i)
    metrics = dict()
    metrics['accuracy'] = get_accuracy(prediction, list(val_dataset_vect['category']))
    metrics['micro-precision'] = get_precision(y_true, y_predicted, 'micro')
    metrics['macro-precision'] = get_precision(y_true, y_predicted, 'macro')
    metrics['micro-recall'] = get_recall(y_true, y_predicted, 'micro')
    metrics['macro-recall'] = get_recall(y_true, y_predicted, 'macro')
    metrics['micro-f1'] = get_f1_score(y_true, y_predicted, 'micro')
    metrics['macro-f1'] = get_f1_score(y_true, y_predicted, 'macro')
    metrics_for_each_k_2[k] = metrics

print('My code results:\n')
for k in [1, 5, 10, 20]:
    print('k =', k, ': ')
    print('Accuracy: ', metrics_for_each_k[k]['accuracy'])
    print('Micro-precision: ', metrics_for_each_k[k]['micro-precision'])
    print('Macro-precision: ', metrics_for_each_k[k]['macro-precision'])
    print('Micro-recall: ', metrics_for_each_k[k]['micro-recall'])
    print('Macro-recall: ', metrics_for_each_k[k]['macro-recall'])
    print('Micro-f1: ', metrics_for_each_k[k]['micro-f1'])
    print('Macro-f1: ', metrics_for_each_k[k]['macro-f1'])
    print('\n')
print('\n\n\nsklearn knn results:\n')
for k in [1, 5, 10, 20]:
    print('k =', k, ': ')
    print('Accuracy: ', metrics_for_each_k_2[k]['accuracy'])
    print('Micro-precision: ', metrics_for_each_k_2[k]['micro-precision'])
    print('Macro-precision: ', metrics_for_each_k_2[k]['macro-precision'])
    print('Micro-recall: ', metrics_for_each_k_2[k]['micro-recall'])
    print('Macro-recall: ', metrics_for_each_k_2[k]['macro-recall'])
    print('Micro-f1: ', metrics_for_each_k_2[k]['micro-f1'])
    print('Macro-f1: ', metrics_for_each_k_2[k]['macro-f1'])
    print('\n')

My code results:

k = 1 : 
Accuracy:  0.8443333333333334
Micro-precision:  0.8443333333333334
Macro-precision:  0.8447431874464275
Micro-recall:  0.8443333333333334
Macro-recall:  0.8443333333333334
Micro-f1:  0.8443333333333334
Macro-f1:  0.8445382106643672


k = 5 : 
Accuracy:  0.8763333333333333
Micro-precision:  0.8763333333333333
Macro-precision:  0.8756541043207571
Micro-recall:  0.8763333333333333
Macro-recall:  0.8763333333333333
Micro-f1:  0.8763333333333333
Macro-f1:  0.8759935871617028


k = 10 : 
Accuracy:  0.8813333333333333
Micro-precision:  0.8813333333333333
Macro-precision:  0.8810118432236447
Micro-recall:  0.8813333333333333
Macro-recall:  0.8813333333333334
Micro-f1:  0.8813333333333333
Macro-f1:  0.8811725589550932


k = 20 : 
Accuracy:  0.8733333333333333
Micro-precision:  0.8733333333333333
Macro-precision:  0.8731380598532117
Micro-recall:  0.8733333333333333
Macro-recall:  0.8733333333333334
Micro-f1:  0.8733333333333333
Macro-f1:  0.8732356856764799





sklea

## 2.3. Speed up

As you probably have noticed, sklearn performs the KNN much faster than our implemented KNN.
One of the main reasons is that it uses LSH. Read about this algorithm and write down a paragraph about it, explain why it will boost the speed of the KNN and how it should be implemented.

Write your answer here (fa or en)

این الگوریتم، داده ها را به گونه ای هش میکند که داده هایی که نزدیک به هم هستند با احتمال زیادی به یک جایگاه یکسان هش شوند. در نتیجه از این طریق میتوان داک های مشابه را در یک بخش هش کرد و دسترسی به آنها سریع تر میشود و سرعت اجرای الگوریتم نزدیک ترین همسایه ها افزایش می یابد

# 3. SVM Model
Now train an SVM model using sklearn library. For more information read [SVC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

## 3.1. Choose hyperparameters
For choosing the best hyperparameters, you can use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [21]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
"""
For choosing proper hyperparameters for SVC model, we should split train dataset into two parts:
1- train set 2- validation set. Then train model on train set and measure performance on validation set. 
Finally we should choose hyperparameters that has the best performance on validation set.
We can do these process using GridSearchCV.
"""
#Specify different values for different hyperparameters for GridSearchCV to choose among them
parameters = {
    'kernel':('linear', 'poly', 'rbf'),
    'C':[1,10,100]
}

svc = SVC()
clf = GridSearchCV(svc, parameters)
#Get the best set of hyperparameters
#TODO
print()
X_train, X_test, y_train, y_test = train_test_split(list(train_dataset_vect['Embedding']), list(train_dataset_vect['category']), test_size=0.5)
clf.fit(X_train, y_train)
print(clf.best_params_)


{'C': 10, 'kernel': 'rbf'}


## 3.2. Train SVM model

In [22]:
#Train SVC model with the chosen hyperparameters
#TODO
svc = SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
svc.fit(list(train_dataset_vect['Embedding']), list(train_dataset_vect['category']))

SVC(C=10)

## 3.3. Measure performance of the SVM model
In this section, first get predictions on the test set. Then, obtain the evaluation metrics using both what you implemented from scrach in the Model evaluation part and built-in functions.

In [23]:
#Get prediction on test set
#TODO
prediction = svc.predict(list(val_dataset_vect['Embedding']))

In [24]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

"""
Measure model's accuracy, f1_score, precision, and recall both using sklearn library and functions you have
implemented in the Model evaluation part. Then compare the SVM model's performance with other models you have
implemented on given dataset. 
"""
#TODO
y_true = list()
y_predicted = list()
for i in range(1, 5):
    y_true_i = list()
    y_predicted_i = list()
    for j in range(len(prediction)):
        if val_dataset_vect['category'][j] == i:
            y_true_i.append(True)
        else:
            y_true_i.append(False)
        if prediction[j] == i:
            y_predicted_i.append(True)
        else:
            y_predicted_i.append(False)
    y_true.append(y_true_i)
    y_predicted.append(y_predicted_i)

print('With sklearn metric functions: ')
print('Accuracy: ', accuracy_score(list(val_dataset_vect['category']), prediction))
print('Micro-Precision: ', precision_score(list(val_dataset_vect['category']), prediction, average='micro'))
print('Macro-Precision: ', precision_score(list(val_dataset_vect['category']), prediction, average='macro'))
print('Micro-Recall: ', recall_score(list(val_dataset_vect['category']), prediction, average='micro'))
print('Macro-Recall: ', recall_score(list(val_dataset_vect['category']), prediction, average='macro'))
print('Micro-F1: ', f1_score(list(val_dataset_vect['category']), prediction, average='micro'))
print('Macro-F1: ', f1_score(list(val_dataset_vect['category']), prediction, average='macro'))
      
print('With our implemented metric functions: ')
print('Accuracy: ', get_accuracy(prediction, list(val_dataset_vect['category'])))
print('Micro-Precision: ', get_precision(y_true, y_predicted, 'micro'))
print('Macro-Precision: ', get_precision(y_true, y_predicted, 'macro'))
print('Micro-Recall: ', get_recall(y_true, y_predicted, 'micro'))
print('Macro-Recall: ', get_recall(y_true, y_predicted, 'macro'))
print('Micro-F1: ', get_f1_score(y_true, y_predicted, 'micro'))
print('Macro-F1: ', get_f1_score(y_true, y_predicted, 'macro'))

With sklearn metric functions: 
Accuracy:  0.885
Micro-Precision:  0.885
Macro-Precision:  0.8852500363605169
Micro-Recall:  0.885
Macro-Recall:  0.885
Micro-F1:  0.885
Macro-F1:  0.8847599763073059
With our implemented metric functions: 
Accuracy:  0.885
Micro-Precision:  0.885
Macro-Precision:  0.8852500363605169
Micro-Recall:  0.885
Macro-Recall:  0.885
Micro-F1:  0.885
Macro-F1:  0.8851250005222495


نتیجه به دست آمده به صورت زیر است

accuracy(svm) > accuracy(KNN, k=20|10|5|1)

های 10 و 20 بسیار به اس وی ام نزدیک هستند k البته مقادیر به ازای 

## 4. Naive Bayes
In this section, you are asked to implement the Naive Bayes classifier on the given data from scratch.
In your implementation, also use Laplacian smoothing.In this approach, you shall calculate the probabilty of each term belonging to each category as follows:
$$P[t, c] = {\alpha + T_{t, c} \over \alpha|V| + \sum_{t \in V} T_{t, c}},$$
where $t$ represents each term, $V$ the whole vocabulary, $c$ each categroy, $\alpha$ the smoothing factor, and $T_{t, c}$ the frequency of the term $t$ in the class $c$.

Also, to avoid calculation errors in multiplication of probabilities over terms, you may use the log of probabilites for the final score of a document. 

## 4.1 Prediction
Here, implement yout Naive Bayes classifier. Note that this part does not use glove embeddings, but uses the original documents in the datasets.

In [25]:
def naive_bayes(train_dataset, title_weight, body_weight, smoothing_factor, val_dataset):
    """
    This function performs Naive Bayes classification. First, build term-per-category counts as described in the
    folrmulation above for the training data. Notice the body and title weights in counting terms in the training
    data, that is the number of occurrence for terms in the title is multiplied by title_weight and likewise for
    body_weight). Then using term-per-category counts obtained, classify the documents in the validation dataset.

    Parameters:
        train_dataset (pd.DataFrame) : training dataset samples
        title_weight (int): weight for each term occurrence in the title of eash sample
        body_weight (int): weight for each term occurrence in the body of eash sample
        smoothing_factor (float): Laplacian smoothing factor for Naive Bayes
        val_dataset (pd.DataFrame) : validation dataset samples
    Returns:
        category (list) : list of predicted categories for validation data
    """
    # TODO
    term_frequency = dict()
    number_of_all_terms_in_a_class = np.zeros(5)
    for i in range(len(train_dataset)):
        if title_weight != 0:
            title_tokens = preprocess_text(train_dataset['title'][i])
            number_of_all_terms_in_a_class[train_dataset['category'][i]] += title_weight * len(title_tokens)
            for token in title_tokens:
                if token in term_frequency:
                    term_frequency[token][train_dataset['category'][i]] += title_weight
                else:
                    term_frequency[token] = np.zeros(5)
                    term_frequency[token][train_dataset['category'][i]] = title_weight
        if body_weight != 0:
            body_tokens = preprocess_text(train_dataset['body'][i])
            number_of_all_terms_in_a_class[train_dataset['category'][i]] += body_weight * len(body_tokens)
            for token in body_tokens:
                if token in term_frequency:
                    term_frequency[token][train_dataset['category'][i]] += body_weight
                else:
                    term_frequency[token] = np.zeros(5)
                    term_frequency[token][train_dataset['category'][i]] = body_weight
    predictions = list()
    smoothing_factor_vector = np.array([0, smoothing_factor, smoothing_factor, smoothing_factor, smoothing_factor])
    for i in range(len(val_dataset)):
        probability = np.zeros(5)
        if title_weight != 0:
            title_tokens = preprocess_text(val_dataset['title'][i])
            for token in title_tokens:
                term_frequency_for_token = np.zeros(5)
                if token in term_frequency:
                    term_frequency_for_token = term_frequency[token]
                probability += title_weight * np.log(np.divide((smoothing_factor_vector + term_frequency_for_token), (len(term_frequency) * smoothing_factor_vector + number_of_all_terms_in_a_class)))
        if body_weight != 0:
            body_tokens = preprocess_text(val_dataset['body'][i])
            for token in body_tokens:
                term_frequency_for_token = np.zeros(5)
                if token in term_frequency:
                    term_frequency_for_token = term_frequency[token]
                probability += body_weight * np.log(np.divide((smoothing_factor_vector + term_frequency_for_token), (len(term_frequency) * smoothing_factor_vector + number_of_all_terms_in_a_class)))
        probability[0] = -np.Inf
        predictions.append(np.argmax(probability))
    return predictions

## 4.2 Evaluation
In this part, we will consider different weighting schemes for evaluation on our validation dataset as follows:

In [26]:
val_predictions = []
# Considering title words only
val_predictions.append(naive_bayes(train_dataset, 1, 0, 1.5, val_dataset))

# Considering body words only
val_predictions.append(naive_bayes(train_dataset, 0, 1, 1.5, val_dataset))

# Considering title and body words equally
val_predictions.append(naive_bayes(train_dataset, 1, 1, 1.5, val_dataset))

# Considering title words twice more important than body words
val_predictions.append(naive_bayes(train_dataset, 2, 1, 1.5, val_dataset))

  probability += title_weight * np.log(np.divide((smoothing_factor_vector + term_frequency_for_token), (len(term_frequency) * smoothing_factor_vector + number_of_all_terms_in_a_class)))
  probability += body_weight * np.log(np.divide((smoothing_factor_vector + term_frequency_for_token), (len(term_frequency) * smoothing_factor_vector + number_of_all_terms_in_a_class)))


In [27]:
def evaluate_naive_bayes(val_dataset, val_prediction):
    """
    Calculates the metrics implemented in Model Evaluation for the Naive Bayes classifier.
    Parameters:
        val_dataset (pd.DataFrame) : validation dataset samples
        val_prediction (list) :list of predicted categories for validation data

    Returns:
        metrics (dict) : dictionary of the metrics on the given predictions
    """
    y_true = list()
    y_predicted = list()
    for i in range(1, 5):
        y_true_i = list()
        y_predicted_i = list()
        for j in range(len(prediction)):
            if val_dataset['category'][j] == i:
                y_true_i.append(True)
            else:
                y_true_i.append(False)
            if val_prediction[j] == i:
                y_predicted_i.append(True)
            else:
                y_predicted_i.append(False)
        y_true.append(y_true_i)
        y_predicted.append(y_predicted_i)
    metrics = dict()
    metrics['accuracy'] = get_accuracy(val_prediction, list(val_dataset_vect['category']))
    metrics['micro-precision'] = get_precision(y_true, y_predicted, 'micro')
    metrics['macro-precision'] = get_precision(y_true, y_predicted, 'macro')
    metrics['micro-recall'] = get_recall(y_true, y_predicted, 'micro')
    metrics['macro-recall'] = get_recall(y_true, y_predicted, 'macro')
    metrics['micro-f1'] = get_f1_score(y_true, y_predicted, 'micro')
    metrics['macro-f1'] = get_f1_score(y_true, y_predicted, 'macro')
    return metrics
    #TODO

In [28]:
metrics = []
for val_prediction in val_predictions:
    metrics.append(evaluate_naive_bayes(val_dataset, val_prediction))
print(metrics)

[{'accuracy': 0.824, 'micro-precision': 0.824, 'macro-precision': 0.8237543419953393, 'micro-recall': 0.824, 'macro-recall': 0.8240000000000001, 'micro-f1': 0.824, 'macro-f1': 0.8238771526855179}, {'accuracy': 0.8836666666666667, 'micro-precision': 0.8836666666666667, 'macro-precision': 0.8827922383215231, 'micro-recall': 0.8836666666666667, 'macro-recall': 0.8836666666666667, 'micro-f1': 0.8836666666666668, 'macro-f1': 0.8832292360653603}, {'accuracy': 0.8993333333333333, 'micro-precision': 0.8993333333333333, 'macro-precision': 0.8988576942887748, 'micro-recall': 0.8993333333333333, 'macro-recall': 0.8993333333333333, 'micro-f1': 0.8993333333333333, 'macro-f1': 0.8990954509054736}, {'accuracy': 0.899, 'micro-precision': 0.899, 'macro-precision': 0.8984458801745248, 'micro-recall': 0.899, 'macro-recall': 0.899, 'micro-f1': 0.899, 'macro-f1': 0.8987228546747378}]


## 4.3 Analysis
Answer the following questions:
1. Compare the weighting scenrios used above given the resulting metrics. (fa or en)

مشاهده میکنیم که به ازای حالت اول که فقط عنوان ها را لحاظ میکنیم به مقدار های کمتری به نسبت حالات دیگر دست پیدا میکنیم. اما در باقی حالات متریک های به دست آمده تقریبا مساوی یکدیگر است اما در حالتی که فقط بدنه را لحاظ میکنیم کمی کمتر از دو حالتی است که هم عنوان و هم بدنه را لحاظ میکنیم. همچنین در بین دو حالتی که هم بدنه و هم عنوان را لحاظ میکنیم مقدار دقت در حالتی که عنوان ضریب 2 دارد بیشتر است اما مقدار پرسیژن و ریکال در حالت با ضرایب مساوی کمی بیشتر است

2. Explain the reason for using the smoothing factor as descrbied above. (fa or en)

اگر این ضریب را اضافه نکنیم کوئری ما ترمی داشته باشد که در یک کلاس ظاهر نشده باشد، احتمال حضور در آن کلاس صفر خواهد شد و به باقی ترم ها وابسته نخواهد بود و اگر ترمی وجود داشته باشد که در هیچ کلاسی نبوده باشد احتمال حضور در همه ی کلاس ها صفر خواهد شد که قابل قبول نیست