# Machine Learning 2024-2025 - UMONS

# Gaussian Discriminant Analysis and Naive Bayes

In [None]:
import pandas as pd
import numpy as np

## 1. Gaussian Discriminant Analysis (GDA)

We consider the *Iris* dataset, a classic benchmark (https://fr.wikipedia.org/wiki/Iris_de_Fisher) that ships with `scikit-learn`. It contains 150 samples covering 3 species of iris (*Iris setosa*, *Iris virginica*, and *Iris versicolor*). Each sample has 4 features: sepal length, sepal width, petal length, and petal width.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, train_size=0.75)

We plot the data using the first two features.

In [None]:
import matplotlib.pyplot as plt

_, ax = plt.subplots()
scatter = ax.scatter(dataset.data[:, 0], dataset.data[:, 1], c=dataset.target)
ax.set(xlabel=dataset.feature_names[0], ylabel=dataset.feature_names[1])
_ = ax.legend(
    scatter.legend_elements()[0], dataset.target_names, loc="lower right", title="Classes"
)

We want to predict the iris species using Gaussian discriminant analysis. To do so, compute the three parameters ($\pi$, $\mu$, and $\Sigma$) that describe the Gaussian of each iris class. You can use `np.mean()` and `np.cov()`.
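
As a reminder, the per-class estimates can be written as follows. The notation is an assumption introduced here for illustration: $n$ is the number of training samples, $n_k$ the number of samples in class $k$, and $(x_i, y_i)$ the $i$-th training pair.

$$
\pi_k = \frac{n_k}{n}, \qquad
\mu_k = \frac{1}{n_k} \sum_{i \,:\, y_i = k} x_i, \qquad
\Sigma_k = \frac{1}{n_k - 1} \sum_{i \,:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top
$$

The factor $\frac{1}{n_k - 1}$ matches the default behaviour of `np.cov()`, which returns the unbiased estimate. The maximum-likelihood estimate divides by $n_k$ instead; the difference is small unless a class has very few samples.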

In [None]:
def fit(x_train, y_train):
    n = y_train.shape[0] # Number of training examples.

    x_train = x_train.reshape(n, -1)
    p = x_train.shape[1] # Number of input features. In our case, 4.
    class_label = len(np.unique(y_train.reshape(-1))) # Number of classes. In our case, 3.
    
    mu = np.zeros((class_label, p))
    sigma = np.zeros((class_label, p, p))
    pi = np.zeros(class_label)

    for label in range(class_label):
        indices = (y_train == label)
        
        pi[label] = float(np.sum(indices)) / n
        mu[label] = np.mean(x_train[indices, :], axis=0)
        sigma[label] = np.cov(x_train[indices, :], rowvar=0)
    
    return pi, mu, sigma

Using the parameters computed by the `fit` function, implement a `predict` function that returns the most probable class for each test sample.

In [None]:
from scipy.stats import multivariate_normal

def predict(x_tests, pi, mu, sigma):
    # flatten the test data
    x_tests = x_tests.reshape(x_tests.shape[0], -1)
    class_label = mu.shape[0] # Number of classes. In our case, k = 3.
    scores = np.zeros((x_tests.shape[0], class_label)) 
    for label in range(class_label):
        # Frozen Gaussian of this class; its logpdf method returns the log-density.
        normal_distribution_prob = multivariate_normal(mean=mu[label], cov=sigma[label])
        # logpdf accepts the whole test matrix, so one call scores every test sample.
        scores[:, label] = np.log(pi[label]) + normal_distribution_prob.logpdf(x_tests)
    predictions = np.argmax(scores, axis=1)
    return predictions
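
The score maximised by `predict` is $\log \pi_k + \log \mathcal{N}(x; \mu_k, \Sigma_k)$. The sketch below spells out the Gaussian log-density with NumPy only, to show what `logpdf` computes. The helper names `gaussian_logpdf` and `gda_scores` and the two-class demo parameters are assumptions made up for this illustration; they are not part of the lab.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of x."""
    x = np.atleast_2d(x)
    d = mean.shape[0]
    diff = x - mean
    # Solve cov @ z = diff.T rather than inverting cov explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis)

def gda_scores(x, pi, mu, sigma):
    """One column per class: log prior plus Gaussian log-likelihood."""
    return np.column_stack([
        np.log(pi[k]) + gaussian_logpdf(x, mu[k], sigma[k])
        for k in range(len(pi))
    ])

# Hypothetical two-class parameters, for illustration only.
pi_demo = np.array([0.5, 0.5])
mu_demo = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma_demo = np.array([np.eye(2), np.eye(2)])
x_demo = np.array([[0.1, -0.2], [2.9, 3.3]])
pred_demo = np.argmax(gda_scores(x_demo, pi_demo, mu_demo, sigma_demo), axis=1)
```

Working with log-densities rather than densities avoids numerical underflow when the number of features grows.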

Test your `predict` function on the test data. To assess the quality of your predictions, compute their $F_1$ score. You can use the `f1_score` function from `scikit-learn` (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

In [None]:
from sklearn.metrics import f1_score

pi, mu, sigma = fit(x_train, y_train)
y_predict = predict(x_test, pi, mu, sigma)
score = f1_score(y_test, y_predict, average="weighted")
print("f1 score of our model: ", score)
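
To make the `average="weighted"` option concrete, here is a minimal pure-Python sketch of a weighted $F_1$ score: the per-class $F_1$ values are averaged with weights equal to the number of true samples in each class. The helper `weighted_f1` and the toy labels are assumptions made up for this illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support[label] * f1
    return total / len(y_true)

# Toy labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
score_demo = weighted_f1(y_true_demo, y_pred_demo)
```

On these toy labels the per-class scores are $0.5$, $0.8$, and $2/3$, and every class has the same support, so the weighted score is their plain average.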

Compare your predictions with those of `LinearDiscriminantAnalysis` from `scikit-learn` (https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html).
Your predictions may differ slightly: the `fit` function above estimates one covariance matrix per class, whereas `LinearDiscriminantAnalysis` assumes a single covariance matrix shared by all classes.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(x_train, y_train)
y_predict_sk = lda.predict(x_test)
print("f1 score of scikit-learn model is: ", f1_score(y_test, y_predict_sk, average="weighted"))
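
`LinearDiscriminantAnalysis` models all classes with one shared covariance matrix. The sketch below shows one common way to estimate such a matrix, by pooling the within-class scatter. The helper `pooled_covariance` and the synthetic data are assumptions for illustration, and the normalisation used internally by `scikit-learn` may differ.

```python
import numpy as np

def pooled_covariance(x, y):
    """Within-class scatter summed over classes, divided by n - K."""
    classes = np.unique(y)
    n, p = x.shape
    pooled = np.zeros((p, p))
    for label in classes:
        x_k = x[y == label]
        centered = x_k - x_k.mean(axis=0)
        pooled += centered.T @ centered
    return pooled / (n - len(classes))

# Synthetic data with three equally sized classes, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(30, 2))
y_demo = np.repeat([0, 1, 2], 10)
sigma_shared = pooled_covariance(x_demo, y_demo)
```

Passing the same `sigma_shared` for every class to the hand-written `predict` would make its decision boundaries linear, which is the setting that LDA assumes.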

## 2. Spam filtering with naive Bayes

In [None]:
import nltk 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report 
from sklearn.feature_extraction.text import CountVectorizer

Using `pandas`, we load a dataset of emails labeled as spam or non-spam (the value $1$ marks spam and the value $0$ marks non-spam).

In [None]:
df = pd.read_csv('data/spam_email_raw_text_for_NLP.csv')
df.head()

The "FILE_NAME" column is not useful, so we drop it.

In [None]:
df.drop('FILE_NAME',axis=1,inplace=True)
df.head()

We count the number of spam and non-spam emails.

In [None]:
df.CATEGORY.value_counts()

Stopwords are short, very common words that appear in most texts and carry little information for spam classification.

In [None]:
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')
nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

We build the corpus. We simplify it by removing non-alphanumeric characters, converting everything to lowercase, and collapsing whitespace. The lemmatizer maps inflected forms of a word to a common base form, for example "dogs" and "dog". This step can take a while.

In [None]:
from tqdm.notebook import tqdm

corpus = []
# Build the stopword set once; rebuilding it for every word is very slow.
stop_words = set(stopword)
for i in tqdm(range(len(df))):
    # Replace every non-alphanumeric character with a space.
    message = re.sub('[^a-zA-Z0-9]', ' ', df['MESSAGE'][i])
    # Convert the message to lowercase.
    message = message.lower()
    # Split the message into words for lemmatization.
    message = message.split()
    # Remove stopwords and lemmatize the remaining words.
    message = [lemmatizer.lemmatize(word) for word in message
               if word not in stop_words]
    # Join the words back into a single string.
    message = ' '.join(message)
    # Add the preprocessed message to the corpus.
    corpus.append(message)

print(corpus[:5])
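
To see what the cleaning steps do to a single message, here is a small standard-library sketch. It skips lemmatization and uses a tiny stopword set, `TOY_STOPWORDS`, invented for this illustration; the real pipeline above uses the NLTK stopword list and `WordNetLemmatizer`.

```python
import re

# Tiny hypothetical stopword set, for illustration only.
TOY_STOPWORDS = {"you", "the", "to", "a", "is"}

def clean(text, stop_words=TOY_STOPWORDS):
    """Strip non-alphanumerics, lowercase, and drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text).lower()
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = clean("You WON $10,000!!! Reply to claim the prize.")
print(cleaned)
```

Note that punctuation inside a number also becomes a space, so "10,000" is split into the two tokens "10" and "000".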

For each email, we build a binary vector indicating the presence or absence of each word (or sequence of words) in the message. We use `CountVectorizer` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). To limit the size of the vectors, we keep only the $2500$ most frequent words.

In [None]:
cv = CountVectorizer(max_features = 2500, binary = True)
X = cv.fit_transform(corpus).toarray()
y = df['CATEGORY']
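
The following pure-Python sketch illustrates the kind of binary presence vector that this step produces. The helper `binary_bag_of_words`, the four-word vocabulary, and the two documents are assumptions made up for this illustration; `CountVectorizer` additionally handles tokenisation and vocabulary selection.

```python
def binary_bag_of_words(documents, vocabulary):
    """Return one 0/1 row per document: 1 if the vocabulary word occurs in it."""
    index = {word: j for j, word in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:
                row[index[word]] = 1  # presence only; repeats do not add up
        vectors.append(row)
    return vectors

# Hypothetical vocabulary and documents, for illustration only.
vocab_demo = ["money", "meeting", "won", "monday"]
docs_demo = ["won money money now", "meeting monday"]
vectors_demo = binary_bag_of_words(docs_demo, vocab_demo)
```

Words outside the vocabulary, such as "now" here, are simply ignored, and a repeated word still yields a single 1.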

We split the data again, keeping $80 \%$ for training and $20 \%$ for testing.

With the `cv` object, we can easily transform an email into a binary vector, and we can see which words the $2500$ features correspond to.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1, stratify=y)
print(cv.get_feature_names_out()[1100:1120]) # Print the words corresponding to 20 of the features.
message = ["You won 10000 dollars, please provide your account details so that we can transfer the money"]
print(cv.transform(message))
print(cv.get_feature_names_out()[1783], cv.get_feature_names_out()[2294])

### 2.1 Naive Bayes by hand

Implement the *naive Bayes* method seen in class to predict whether a given message is spam. Test your functions on the test data. Analyze your results with `classification_report`.
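
As a reminder, for binary features the decision rule can be written as below. The notation is an assumption introduced here for illustration: $x_j \in \{0, 1\}$ indicates whether word $j$ is present, $\pi_y$ is the prior of class $y$, and $\phi_{j \mid y}$ is the probability that word $j$ appears in a message of class $y$.

$$
\hat{y} = \arg\max_{y \in \{0, 1\}} \left[ \log \pi_y + \sum_{j=1}^{p} \Big( x_j \log \phi_{j \mid y} + (1 - x_j) \log \big(1 - \phi_{j \mid y}\big) \Big) \right]
$$

The sum of logarithms replaces a product of $p$ probabilities, which would underflow for $p = 2500$.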

In [None]:
def fit(x_train, y_train):
    n = y_train.shape[0] # Number of training examples.
    p = x_train.shape[1] # Number of input features. In our case, 2500.
    x_train = x_train.reshape(n, -1)

    # Crude Laplace-style smoothing: append one dummy example per class in
    # which every word is present. Without it, a word never seen in a class
    # would get probability 0, and log(0) would make that class's score -inf.
    x_train = np.append(x_train, [np.ones(p),np.ones(p)], axis=0)
    y_train = np.append(y_train, [0,1])
    n += 2
    
    pi = np.zeros(2)
    phi_y = np.zeros((2, p))

    for label in range(2):
        indices = (y_train == label)     
        pi[label] = float(np.sum(indices)) / n
        phi_y[label] = np.mean(x_train[indices, :], axis=0)

    return pi, phi_y

In [None]:
def predict(x_tests, pi, phi_y):
    p = x_tests.shape[1] # Number of input features. In our case, 2500.
    number_tests = x_tests.shape[0]
    # flatten the test data
    x_tests = x_tests.reshape(number_tests, -1)
    scores = np.zeros((number_tests, 2)) 
    for i in range(number_tests):
        for label in range(2):
            scores[i, label] = np.log(pi[label])            
            for j in range(p):
                if x_tests[i,j]:
                    scores[i, label] += np.log(phi_y[label][j])
                else:
                    scores[i, label] += np.log(1 - phi_y[label][j])
    predictions = np.argmax(scores, axis=1)
    return predictions    
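
The triple loop above is easy to read but slow for $2500$ features. A vectorised sketch of the same score, using two matrix products, is shown below. The name `predict_vectorized` and the demo parameters are assumptions for illustration; the function expects a dense array, so the sparse output of `cv.transform` would first need `.toarray()`.

```python
import numpy as np

def predict_vectorized(x_tests, pi, phi_y):
    """Bernoulli naive Bayes prediction using matrix products."""
    x = np.asarray(x_tests, dtype=float)
    log_phi = np.log(phi_y)            # log P(word present | class)
    log_not_phi = np.log(1.0 - phi_y)  # log P(word absent | class)
    # scores[i, c] = log pi[c] + sum_j [x_ij log phi_cj + (1 - x_ij) log(1 - phi_cj)]
    scores = np.log(pi) + x @ log_phi.T + (1.0 - x) @ log_not_phi.T
    return np.argmax(scores, axis=1)

# Hypothetical parameters for two classes and two words, for illustration only.
pi_demo = np.array([0.5, 0.5])
phi_demo = np.array([[0.1, 0.8], [0.9, 0.2]])
x_demo = np.array([[1, 0], [0, 1]])
pred_demo = predict_vectorized(x_demo, pi_demo, phi_demo)
```

The first demo message contains only the word that is frequent in class 1, and the second only the word that is frequent in class 0, so the predictions are class 1 and class 0 respectively.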

In [None]:
pi, phi_y = fit(x_train, y_train)
train_pred = predict(x_train, pi, phi_y)
test_pred = predict(x_test, pi, phi_y)
# classification_report expects the true labels first, then the predictions.
print(classification_report(y_train, train_pred))
print(classification_report(y_test, test_pred))

In [None]:
print('Predicting...')
message = ["You won 10000 dollars, please provide your account details so that we can transfer the money"]
message_vector = cv.transform(message)
category = predict(message_vector, pi, phi_y)
print("The message is", "spam" if category == 1 else "not spam")

print('Predicting...')
message = ["hey Laura, the meeting is postponed to Monday"]
message_vector = cv.transform(message)
category = predict(message_vector, pi, phi_y)
print("The message is", "spam" if category == 1 else "not spam")

The results are not very good. In practice, the [*Laplace smoothing*](https://en.wikipedia.org/wiki/Additive_smoothing) can be improved:

In [None]:
def fit_smooth(x_train, y_train, alpha = 1):
    n = y_train.shape[0] # Number of training examples.
    p = x_train.shape[1] # Number of input features. In our case, 2500.
    x_train = x_train.reshape(n, -1)

    pi = np.zeros(2)
    phi_y = np.zeros((2, p))

    for label in range(2):
        indices = (y_train == label)
        sum_label = np.sum(indices)
        pi[label] = float(sum_label) / n
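        for j in range(p):
            # Add-alpha estimate for a binary feature: the denominator uses
            # 2 * alpha because each feature takes two values (present or absent).
            phi_y[label][j] = (alpha + np.sum(x_train[indices,j])) / (sum_label + 2 * alpha)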

    return pi, phi_y
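
For comparison, the textbook add-$\alpha$ estimate for a binary feature is $(\text{count} + \alpha) / (n_y + 2\alpha)$, where $n_y$ is the number of messages in the class and the factor 2 accounts for the two possible values of the feature. The sketch below works through one case; the helper `smoothed_phi` and the counts are assumptions made up for this illustration.

```python
import math

def smoothed_phi(word_count, class_count, alpha=1.0):
    """Add-alpha estimate of P(word present | class) for a binary feature."""
    return (word_count + alpha) / (class_count + 2 * alpha)

# Hypothetical counts: a word that appears in 0 of 50 spam messages.
unsmoothed = 0 / 50               # 0.0, whose logarithm is undefined
smoothed = smoothed_phi(0, 50)    # 1 / 52
log_smoothed = math.log(smoothed)  # finite, so the class score stays usable

# A word present in all 50 messages stays strictly below 1 as well.
always_present = smoothed_phi(50, 50)  # 51 / 52
```

Keeping every estimate strictly between 0 and 1 matters because `predict` takes both $\log \phi$ and $\log(1 - \phi)$.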

In [None]:
phi, phi_y = fit_smooth(x_train, y_train)
train_pred = predict(x_train, phi, phi_y)
test_pred = predict(x_test, phi, phi_y)
# classification_report expects the true labels first, then the predictions.
print(classification_report(y_train, train_pred))
print(classification_report(y_test, test_pred))

# Words that contribute the most to the spam score
indices = np.argsort(phi_y[1])[::-1]
print("Top 10 words that contribute the highest score to spams:")
for i in range(10):
    print(cv.get_feature_names_out()[indices[i]], phi_y[1][indices[i]])

In [None]:
print('Predicting...')
message = ["You won 10000 dollars, please provide your account details so that we can transfer the money"]
message_vector = cv.transform(message)
category = predict(message_vector, phi, phi_y)
print("The message is", "spam" if category == 1 else "not spam")

print('Predicting...')
message = ["hey Laura, the meeting is postponed to Monday"]
message_vector = cv.transform(message)
category = predict(message_vector, phi, phi_y)
print("The message is", "spam" if category == 1 else "not spam")

### 2.2 Naive Bayes with scikit-learn

Use `MultinomialNB` from `scikit-learn` (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) to perform the same task as in Section 2.1. Note that the hand-written model of Section 2.1 is a Bernoulli model, so the predictions of `MultinomialNB` on the same binary features need not match it exactly.

Run the method twice with different values of the `alpha` parameter, which controls the *Laplace smoothing*, and compare the results.

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha = 0.1)

In [None]:
model.fit(x_train, y_train)

In [None]:
train_pred = model.predict(x_train)
test_pred = model.predict(x_test)

In [None]:
# classification_report expects the true labels first, then the predictions.
print(classification_report(y_train, train_pred))
print(classification_report(y_test, test_pred))

In [None]:
print('Predicting...')
message = ["You won 10000 dollars, please provide your account details so that we can transfer the money"]
message_vector = cv.transform(message)
category = model.predict(message_vector)
print("The message is", "spam" if category == 1 else "not spam")

print('Predicting...')
message = ["hey Laura, the meeting is postponed to Monday"]
message_vector = cv.transform(message)
category = model.predict(message_vector)
print("The message is", "spam" if category == 1 else "not spam")