# TP1 - Discriminative and Generative Models

### ING5, BDA02
- ABUL KALAM Simon : [simon.abulkalam@edu.ece.fr](mailto:simon.abulkalam@edu.ece.fr)
- BARITEAU Yanis : [yanis.bariteau@edu.ece.fr](mailto:yanis.bariteau@edu.ece.fr)
- PUY Guillaume : [guillaume.puy@edu.ece.fr](mailto:guillaume.puy@edu.ece.fr)

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
from collections import Counter

# Google Colab setup (skip this part if you are not using Google Colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd drive/MyDrive/Machine\ Learning\ II/TP1/Homework

/content/drive/.shortcut-targets-by-id/1XgIATrl_rKxs0DZGjxyc-T_vuoVJelnF/Machine Learning II/TP1/Homework


In [None]:
cd drive/MyDrive/'Colab Notebooks'/Machine\ Learning\ II/TP1/Homework

/content/drive/MyDrive/Colab Notebooks/Machine Learning II/TP1/Homework


# Importing the Dataset

In [None]:
df = pd.read_csv('messages.txt', sep="\t", names=['Label', 'Email'])

## 1. Divide the data in two groups: training and test examples.

In [None]:
X_train = df[:round(len(df)*0.9)].reset_index(drop=True)
X_test = df[round(len(df)*0.9):].reset_index(drop=True)

## 2. Parse both the training and test examples to generate both the spam and ham datasets.

In [None]:
print( X_train['Label'].value_counts(normalize=True) )
print("\n")
print( X_test['Label'].value_counts(normalize=True) )

ham     0.865051
spam    0.134949
Name: Label, dtype: float64


ham     0.87
spam    0.13
Name: Label, dtype: float64


In [None]:
spam_train = X_train[X_train['Label'] == 'spam'].reset_index(drop=True)
ham_train = X_train[X_train['Label'] == 'ham'].reset_index(drop=True)

In [None]:
spam_test = X_test[X_test['Label'] == 'spam'].reset_index(drop=True)
ham_test = X_test[X_test['Label'] == 'ham'].reset_index(drop=True)

## 3. Generate a dictionary from the training data.

In [None]:
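X_train['Email'] = X_train['Email'].str.replace(r'\W', ' ', regex=True) # Replaces non-word characters (punctuation) with spaces; regex=True is required on recent pandas versions
X_train['Email'] = X_train['Email'].str.lower() # To Lower Case

X_test['Email'] = X_test['Email'].str.replace(r'\W', ' ', regex=True) # Replaces non-word characters (punctuation) with spaces
X_test['Email'] = X_test['Email'].str.lower() # To Lower Case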

In [None]:
def make_Dictionary(df, most_common_words=3000):
    """
    Builds a dictionary of the most frequently used words, together with the number of times each word is used.

    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        DataFrame from which the dictionary is built.
    most_common_words : int
        Number of words to keep (default: 3000).

    Returns
    -------
    dictionary : list
        List of tuples of the form (word, number of occurrences).
    """
    all_words = []
    for oneEmail in df['Email']: # We go through all emails
        for word in oneEmail.split(): # We go through the words in an email
            all_words.append(word) # We add it to our list
    dictionary = Counter(all_words) # Using the `Counter()` function, we create a dictionary with the registered words

    for item in list(dictionary): # For each word in the dictionary...
        if item.isalpha() == False: # If the word does not consist of alphabetical letters...
            del dictionary[item] # We delete it
        elif len(item) == 1: # If the word is only one element...
            del dictionary[item] # We delete it
    dictionary = dictionary.most_common(most_common_words) # We keep only the `most_common_words` most used words in our dictionary.
    return dictionary

In [None]:
dictionary = make_Dictionary(X_train)

## 4. Extract features from both the training data and test data.

In [None]:
def extract_features(list_emails, dictionary):
    """
    Records, for each email, which dictionary words are present.
    Returns a dataset that gathers this information.

    Parameters
    ----------
    list_emails : list
        List of emails.
    dictionary : list
        Word dictionary created with the make_Dictionary() function.

    Returns
    -------
    features_matrix : pandas.core.frame.DataFrame
        DataFrame containing the extracted features.
    """
    features_matrix = pd.DataFrame(columns=[d[0] for i, d in enumerate(dictionary)], index=range(len(list_emails))).fillna(0) # Creation of a DataFrame filled with 0
    
    docID = 0
    for one_email in list_emails: # We go through all the emails
        list_words_in_the_email = one_email.split()
        for word in list_words_in_the_email: # We go through the words in an email
            for i, d in enumerate(dictionary): # For each word in the dictionary...
                if d[0] == word: # Equal to 1 if the dictionary word is present in the mail, 0 otherwise.
                    features_matrix.loc[docID, d[0]] = 1
                    
        docID += 1
    return features_matrix


In [None]:
# For spam & hams, we perform feature extraction to obtain the word dataset
spam_train_clean = extract_features(spam_train['Email'], dictionary)
ham_train_clean = extract_features(ham_train['Email'], dictionary)
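
The feature extraction above scans the whole dictionary for every word of every email, which can be slow with 3000 dictionary words. As a minimal sketch of a possible alternative, the same 0/1 presence matrix can be built with set membership. The names `extract_features_fast`, `emails` and `vocabulary` are hypothetical and not part of the notebook, and the function assumes a plain list of words (for example `[word for word, _ in dictionary]`) rather than the list of tuples used above.

```python
import pandas as pd

def extract_features_fast(emails, vocabulary):
    """Build a 0/1 presence matrix: one row per email, one column per vocabulary word."""
    rows = []
    for email in emails:
        words_in_email = set(email.split())  # set lookup instead of scanning the dictionary
        rows.append([1 if word in words_in_email else 0 for word in vocabulary])
    return pd.DataFrame(rows, columns=vocabulary)

# Hypothetical toy data, only to illustrate the shape of the result
emails = ["free call now", "see you at home"]
vocabulary = ["free", "call", "home"]
print(extract_features_fast(emails, vocabulary))
```

The result has the same meaning as the matrix produced by `extract_features()`: a cell is 1 if the word appears at least once in the email, 0 otherwise.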

In [None]:
spam_train_clean

Unnamed: 0,to,you,the,and,in,is,me,it,my,for,your,of,call,that,have,on,are,now,so,can,not,but,or,we,get,do,will,ur,be,at,if,just,with,no,this,how,gt,lt,up,what,...,cutting,drpd,deeraj,deepak,received,explicit,secs,reckon,transport,rule,roommates,bitch,items,caring,clearly,rec,gain,dearly,blessings,appreciated,chip,falls,panic,beloved,hor,realise,silver,mnths,shocking,fair,stopsms,oil,stuck,virgin,txtstop,feelin,puttin,kaiez,managed,option
0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
602,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
603,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
604,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
605,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## 5. Implement the Naïve Bayes from scratch, and fit it to the training data.

In [None]:
def indicator_function_of_word_in_specific_type_email(spam_or_ham_extracted_dataframe):
    """
    Computes the indicator function of each word over all spams/hams.
    Returns a dataset summarising these counts.

    Parameters
    ----------
    spam_or_ham_extracted_dataframe : pandas.core.frame.DataFrame
        DataFrame of emails on which feature extraction has been performed.

    Returns
    -------
    indicator_df : pandas.core.frame.DataFrame
        DataFrame with, for each word, the number of emails in which it appears.
    """
    columns_name = spam_or_ham_extracted_dataframe.columns # Retrieving dictionary words

    liste = list( spam_or_ham_extracted_dataframe.sum() ) # For each word, we count the number of emails where this word is present

    dico = {columns_name[i]:liste[i] for i in range(len(liste))}
    indicator_df = pd.DataFrame(data=dico, columns=columns_name, index=[0]) # Creation of a DataFrame to put in the calculations made previously
    
    return(indicator_df)

In [None]:
def calcul_phi_n_sachant_type_Email(df_indicator, df_spam_or_ham, laplace_smoothing=1) :
    """
    Computes the parameters Phi_n given SPAM or HAM.

    Parameters
    ----------
    df_indicator : pandas.core.frame.DataFrame
        DataFrame with the occurrences computed by the indicator_function_of_word_in_specific_type_email() function.
    df_spam_or_ham : pandas.core.frame.DataFrame
        DataFrame listing the spams or the hams.
    laplace_smoothing : int
        Laplace Smoothing value (default: 1).
    
    Returns
    -------
    df_parameters : pandas.core.frame.DataFrame
        DataFrame gathering the computed parameter for each word.
    """
    columns_names = df_indicator.columns # Retrieving dictionary words

    liste = list(df_indicator[col][0] for col in df_indicator.columns) # Converting our DataFrame to a list (more convenient)

    dico = {columns_names[i]:((liste[i]+laplace_smoothing)/(len(df_spam_or_ham)+2*laplace_smoothing)) for i in range(len(columns_names))} # Calculation of the two parameters using the course formula
    df_parameters = pd.DataFrame( data=dico, columns=columns_names, index=range(1) ) # Creation of a DataFrame to put in the calculations made previously

    return(df_parameters)

In [None]:
def naiveBayes_fit(X_train, spam_train, spam_train_clean, ham_train_clean, laplace_smoothing=0.1):
    """
    Fits the model by estimating the 3 Naïve Bayes parameters from the training data.

    Parameters
    ----------
    X_train : pandas.core.frame.DataFrame
        DataFrame with the emails of the training set.
    spam_train : pandas.core.frame.DataFrame
        DataFrame containing all the spams of the training set.
    spam_train_clean : pandas.core.frame.DataFrame
        DataFrame of the features extracted from the spams.
    ham_train_clean : pandas.core.frame.DataFrame
        DataFrame of the features extracted from the hams.
    laplace_smoothing : float
        Laplace Smoothing value (default: 0.1).
    
    Returns
    -------
    phi_y : float
        Parameter representing P(Y = 1).
    phi_n_given_spam : pandas.core.frame.DataFrame
        DataFrame of the parameters representing P(Xn = 1 | Y = 1).
    phi_n_given_ham : pandas.core.frame.DataFrame
        DataFrame of the parameters representing P(Xn = 1 | Y = 0).
    """
    
    # Calculation of word occurrences
    number_occurrences_of_word_in_all_spams = indicator_function_of_word_in_specific_type_email(spam_train_clean)
    number_occurrences_of_word_in_all_hams = indicator_function_of_word_in_specific_type_email(ham_train_clean)

    # Calculation of parameters
    phi_y = ( len(spam_train) + laplace_smoothing ) / ( len(X_train) + 2*laplace_smoothing )
    phi_n_given_spam = calcul_phi_n_sachant_type_Email(number_occurrences_of_word_in_all_spams, spam_train, laplace_smoothing)
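    phi_n_given_ham = calcul_phi_n_sachant_type_Email(number_occurrences_of_word_in_all_hams, ham_train_clean, laplace_smoothing) # ham_train_clean has one row per ham email, so this avoids relying on the global variable ham_train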

    return( phi_y, phi_n_given_spam, phi_n_given_ham )

In [None]:
phi_y, phi_n_given_spam, phi_n_given_ham = naiveBayes_fit(X_train, spam_train, spam_train_clean, ham_train_clean, 0.1)

## 6. Make predictions for the test data.

In [None]:
def naiveBayes_predict(X_test, phi_y, phi_n_given_spam, phi_n_given_ham):
    """
    Predicts the label of each email using the 3 parameters estimated during training.

    Parameters
    ----------
    X_test : pandas.core.frame.DataFrame
        DataFrame with the emails of the test set.
    phi_y : float
        Parameter representing P(Y = 1).
    phi_n_given_spam : pandas.core.frame.DataFrame
        DataFrame of the parameters representing P(Xn = 1 | Y = 1).
    phi_n_given_ham : pandas.core.frame.DataFrame
        DataFrame of the parameters representing P(Xn = 1 | Y = 0).
    
    Returns
    -------
    y_pred : pandas.core.frame.DataFrame
        DataFrame gathering the predictions for each email of the test set.
    """
    liste_proba_spam, liste_proba_ham, prediction = [], [], []
    columns_name = phi_n_given_spam.columns.tolist()

    for oneEmail in X_test['Email']: # We go through all the emails
        listWords = oneEmail.split() # Separate words in an email
        produit_spam, produit_ham = 1, 1

        for oneWord in listWords: # We go through the words in the email
            if oneWord in columns_name: # If the word is present in the dictionary
                produit_spam *= phi_n_given_spam[oneWord][0] # Product of probabilities for spam
                produit_ham *= phi_n_given_ham[oneWord][0] # Product of probabilities for ham
                
        liste_proba_spam.append(produit_spam*phi_y / (produit_spam*phi_y + produit_ham*(1-phi_y))) # Calculate the probability P(Y = 1 | X), then add this probability to the list
        liste_proba_ham.append(produit_ham*(1-phi_y) / (produit_spam*phi_y + produit_ham*(1-phi_y))) # Calculate the probability P(Y = 0 | X), then add this probability to the list

    # Term-to-term probability comparison (argmax P(Y | X))
    for i in range( len(liste_proba_ham) ):
        if liste_proba_spam[i] > liste_proba_ham[i] : # If P(Y = 1 | X) > P(Y = 0 | X)
            prediction.append( "spam" )
        else: # If P(Y = 1 | X) < P(Y = 0 | X)
            prediction.append( "ham" )
    
    return( pd.DataFrame(data={'Prediction':prediction}) )

In [None]:
y_pred = naiveBayes_predict(X_test, phi_y, phi_n_given_spam, phi_n_given_ham)
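
The prediction above multiplies many probabilities smaller than 1, so for long emails the products may underflow towards 0. A common workaround, shown here only as a hedged sketch, is to compare sums of logarithms instead of products. The function `predict_log_space` and the toy parameters below are hypothetical: they use plain dictionaries rather than the DataFrames returned by `naiveBayes_fit()`, and, like `naiveBayes_predict()`, they only use the words present in the email.

```python
import math

def predict_log_space(words, phi_y, phi_spam, phi_ham):
    """Classify one tokenized email by comparing log-posteriors (up to a shared constant)."""
    log_spam = math.log(phi_y)       # log P(Y = 1)
    log_ham = math.log(1 - phi_y)    # log P(Y = 0)
    for word in words:
        if word in phi_spam:  # words outside the dictionary are ignored
            log_spam += math.log(phi_spam[word])
            log_ham += math.log(phi_ham[word])
    # The normalising denominator is the same for both classes, so it can be dropped
    return "spam" if log_spam > log_ham else "ham"

# Hypothetical parameters for a three-word dictionary
phi_spam = {"free": 0.30, "call": 0.25, "home": 0.01}
phi_ham = {"free": 0.02, "call": 0.05, "home": 0.08}

print(predict_log_space(["free", "call", "now"], 0.135, phi_spam, phi_ham))
print(predict_log_space(["home"], 0.135, phi_spam, phi_ham))
```

Because the logarithm is increasing, comparing the two sums gives the same decision as comparing the two products, as long as the products have not underflowed.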

In [None]:
y_pred

Unnamed: 0,Prediction
0,ham
1,ham
2,ham
3,ham
4,ham
...,...
495,ham
496,ham
497,ham
498,ham


## 7. Measure the spam-filtering performance for each approach through the confusion matrix, precision, and recall.

In [None]:
def measures(X_test, y_pred):
    """
    Displays the performance of the classifier.

    Parameters
    ----------
    X_test : pandas.core.frame.DataFrame
        DataFrame with the emails of the test set.
    y_pred : pandas.core.frame.DataFrame
        DataFrame with the predictions for the emails.
    
    Returns
    -------
    Nothing
    """

    # Accuracy Calculation
    accuracy = 0
    y_test = X_test["Label"].values.tolist()
    y_pred = y_pred["Prediction"].values.tolist()

    for i in range( len(y_pred) ):
        if y_pred[i] == y_test[i]:
            accuracy += 1

    # Confusion Matrix Calculation
    tp, fp, tn, fn = 0,0,0,0

    for i in range( len(X_test) ):
        if y_pred[i] == "spam":
            if y_pred[i] == y_test[i]:
                tp += 1
            else:
                fp += 1
        else:
            if y_pred[i] == y_test[i]:
                tn += 1
            else:
                fn += 1

    # Displaying Information
    print("Confusion Matrix")
    print(f"[{tp}, {fn}]\n[{fp}, {tn}]")
    print("----------------")
    print( f"Accuracy = { round(accuracy*100/len(y_pred), 3) } %" )
    print( f"Precision = { round(tp*100/(tp+fp), 3) } %" )
    print( f"Recall = { round(tp*100/(tp+fn), 3) } %" )

In [None]:
measures(X_test, y_pred)

Confusion Matrix
[59, 6]
[16, 419]
----------------
Accuracy = 95.6 %
Precision = 78.667 %
Recall = 90.769 %


In [None]:
# Searching for the Laplace Smoothing value to obtain the best results.
for ls in range(5, 104, 5):
  print(f"Laplace Smoothing = {ls/100}")
  phi_y, phi_n_given_spam, phi_n_given_ham = naiveBayes_fit(X_train, spam_train, spam_train_clean, ham_train_clean, ls/100)
  y_pred = naiveBayes_predict(X_test, phi_y, phi_n_given_spam, phi_n_given_ham)
  measures(X_test, y_pred)
  print("\n--------------------------------------------")

Laplace Smoothing = 0.05
Confusion Matrix
[57, 8]
[13, 422]
----------------
Accuracy = 95.8 %
Precision = 81.429 %
Recall = 87.692 %

--------------------------------------------
Laplace Smoothing = 0.1
Confusion Matrix
[59, 6]
[16, 419]
----------------
Accuracy = 95.6 %
Precision = 78.667 %
Recall = 90.769 %

--------------------------------------------
Laplace Smoothing = 0.15
Confusion Matrix
[61, 4]
[20, 415]
----------------
Accuracy = 95.2 %
Precision = 75.309 %
Recall = 93.846 %

--------------------------------------------
Laplace Smoothing = 0.2
Confusion Matrix
[61, 4]
[22, 413]
----------------
Accuracy = 94.8 %
Precision = 73.494 %
Recall = 93.846 %

--------------------------------------------
Laplace Smoothing = 0.25
Confusion Matrix
[61, 4]
[25, 410]
----------------
Accuracy = 94.2 %
Precision = 70.93 %
Recall = 93.846 %

--------------------------------------------
Laplace Smoothing = 0.3
Confusion Matrix
[61, 4]
[25, 410]
----------------
Accuracy = 94.2 %
Precision

## 8. Discuss your results.

* Laplace Smoothing is a smoothing technique that solves the zero-probability problem in Naïve Bayes: without it, a word that never appears in one class of the training set gets an estimated `P(X = 1 | Y = 1)` or `P(X = 1 | Y = 0)` equal to 0, which cancels the whole product.
  * By varying the Laplace Smoothing value between 0.05 and 1, we obtain the best results when it is between roughly 0.1 and 0.2.
  * When the Laplace Smoothing is equal to 0.1, the confusion matrix contains 59 True Positives, 419 True Negatives and 22 errors (16 False Positives and 6 False Negatives).
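
As a small illustration of the point above, the sketch below applies the formula used in `calcul_phi_n_sachant_type_Email()`, namely `(count + laplace_smoothing) / (number of emails + 2 * laplace_smoothing)`, to a word that never appears in a class. The helper `smoothed_estimate` and the figure of 600 emails are hypothetical and chosen only for the example.

```python
def smoothed_estimate(count, n_emails, alpha):
    """Estimate P(Xn = 1 | Y) as (count + alpha) / (n_emails + 2 * alpha)."""
    return (count + alpha) / (n_emails + 2 * alpha)

# Hypothetical case: a word that appears in none of 600 emails of a class
for alpha in (0.0, 0.1, 1.0):
    print(f"alpha = {alpha}: {smoothed_estimate(0, 600, alpha)}")
```

With no smoothing the estimate is exactly 0, while any positive value gives a small non-zero probability that grows with the smoothing value.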

* We set `Laplace Smoothing = 0.1`.
  * Precision measures the percentage of emails marked as spam that are actually spam. In our case, the precision is equal to 78.7%. This means that when our algorithm identifies an email as spam, it has a 78.7% chance of being correct.
  * Recall measures the percentage of actual spam that has been correctly classified. In our case, we have a recall equal to 90.8%. This means that our algorithm correctly identifies 90.8% of all spam.
* Possible sources of error include:
  * Misspelled words.
  * The Naïve Bayes assumption that the words are independent of one another given the class, which rarely holds in real text.
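
The metrics discussed above can be recomputed directly from the confusion matrix obtained with `Laplace Smoothing = 0.1`. This is only a quick arithmetic check; the variable names are chosen for the example.

```python
# Values read from the confusion matrix printed for Laplace Smoothing = 0.1
tp, fn, fp, tn = 59, 6, 16, 419

accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all emails correctly classified
precision = tp / (tp + fp)                  # share of predicted spam that is really spam
recall = tp / (tp + fn)                     # share of real spam that is detected

print(f"Accuracy  = {accuracy:.1%}")   # 95.6%
print(f"Precision = {precision:.1%}")  # 78.7%
print(f"Recall    = {recall:.1%}")     # 90.8%
```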