# **Practical assignment for Topic 7**

**Answer the following questions by modifying and running the code given in the lecture.**

### Question 1
According to the model `nb`, which we trained using scikit-learn library in the lecture on the email spam dataset, what is the probability of an email being a spam if the email contains just the word "click" 10 times without any other words present? And what if the email contains the word "click" 20 times? Explain the results.

In [10]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Vocabulary order (according to previous class): hi, dear, buy, sell, question, project, pill, click
data = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 1]])
X = data[:, 0:8]
y = data[:, 8]

nb = BernoulliNB()
nb.fit(X, y)

click_only = np.array([[0, 0, 0, 0, 0, 0, 0, 1]])
spam_probability = nb.predict_proba(click_only)[0, 1]
spam_probability


np.float64(0.603558277011681)

The model gives a spam probability of about 0.60 when "click" is the only word present.
Bernoulli Naive Bayes only records whether a word appears at least once (scikit-learn's `BernoulliNB` binarises its input, with a default threshold of `binarize=0.0`), so repeating "click" 10 or 20 times maps to the same binary feature vector as a single occurrence. Both cases therefore give the same probability (about 60%), and the message is labelled as spam. The result would be identical for 100 or 1000 repetitions, as long as no other word is present.
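
The cell above only queries a single occurrence of "click". As a minimal check of the claim about repeated words, the sketch below passes raw counts of 1, 10 and 20 to the same kind of model. The names `spam_by_count` and `query` are introduced here for illustration and are not part of the lecture notebook.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same email dataset as in the cell above (last column is the spam label).
data = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 1]])
X, y = data[:, :8], data[:, 8]

# With the default binarize=0.0, every value greater than 0 is mapped to 1.
nb = BernoulliNB()
nb.fit(X, y)

spam_by_count = {}  # hypothetical helper dict: click count -> P(spam)
for click_count in (1, 10, 20):
    query = np.zeros((1, X.shape[1]))
    query[0, 7] = click_count  # index 7 is "click" in the vocabulary order
    spam_by_count[click_count] = nb.predict_proba(query)[0, 1]
    print(f"click x{click_count}: P(spam) = {spam_by_count[click_count]:.6f}")
```

All three queries should print the same probability, since the model sees the same binary vector each time.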


### Question 2
Now try the same with our own implementation given in the last cell of the notebook. You will get results that are different from those you got in question 1, which means that something is incorrect. What is the modification necessary in our code to get correct results?

In [11]:
# Bernoulli Naive Bayes algo from the lecture notebook
py = [np.mean(y == 0), np.mean(y == 1)]  # class priors
alpha = 1

X_manual = X.copy()
X_manual[X_manual != 0] = 1  # ensure binary features
m = X_manual.shape[1]

for click_count in [1, 10, 20]:
    xq = np.zeros((1, m))
    xq[0, 7] = click_count

    log_probs = []
    for c in range(0, 2):
        pxy = np.empty(m)
        Ny = np.sum(y == c)
        for j in range(0, m):
            Nj = np.sum(X_manual[y == c, j] == xq[0, j])
            pxy[j] = (Nj + alpha) / (Ny + alpha * 2)
        log_p = np.log(py[c]) + np.sum(np.log(pxy))
        log_probs.append(log_p)

    probs = np.exp(log_probs)
    probs /= np.sum(probs)
    print(f"Manual model with click count {click_count}: {probs}")


Manual model with click count 1: [0.39644172 0.60355828]
Manual model with click count 10: [0.66335889 0.33664111]
Manual model with click count 20: [0.66335889 0.33664111]


In [12]:
# Manual model after binarising the query vector like in scikit-learn
for click_count in [1, 10, 20]:
    xq = np.zeros((1, m))
    xq[0, 7] = click_count
    xq_binary = xq.copy()
    xq_binary[xq_binary != 0] = 1

    log_probs = []
    for c in range(0, 2):
        pxy = np.empty(m)
        Ny = np.sum(y == c)
        for j in range(0, m):
            Nj = np.sum(X_manual[y == c, j] == xq_binary[0, j])
            pxy[j] = (Nj + alpha) / (Ny + alpha * 2)
        log_p = np.log(py[c]) + np.sum(np.log(pxy))
        log_probs.append(log_p)

    probs = np.exp(log_probs)
    probs /= np.sum(probs)
    print(f"Manual model after binarising for click count {click_count}: {probs}")


Manual model after binarising for click count 1: [0.39644172 0.60355828]
Manual model after binarising for click count 10: [0.39644172 0.60355828]
Manual model after binarising for click count 20: [0.39644172 0.60355828]


The manual implementation gives a different probability from scikit-learn when 'click' is repeated several times, because the query vector keeps its raw counts instead of being binarised.
`BernoulliNB` binarises its input, so only presence or absence matters and all three click counts give the same result.

Before binarisation, the manual model returns one result for a count of 1 and another for counts of 10 and 20. The comparison `X_manual[y == c, j] == xq[0, j]` never matches when the query value is 10 or 20, because the training matrix contains only 0 and 1. `Nj` is then 0 in both classes, so the likelihood of the "click" feature falls back to the smoothing term `alpha / (Ny + alpha * 2)`, which is why 10 and 20 give the same (incorrect) probabilities.

To correct this, we binarise the vector `xq` before computing the probabilities and then use `xq_binary` in the loop.
This modification aligns the manual results with those of scikit-learn and removes the unexpected change in output.
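
As a compact cross-check of this fix, the sketch below wraps the corrected computation in a function and compares it with scikit-learn. The function name `manual_bernoulli_proba` and its vectorised form are assumptions made for this example; the lecture code uses explicit loops.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same email dataset as in the lecture (last column is the spam label).
data = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 1]])
X, y = data[:, :8], data[:, 8]


def manual_bernoulli_proba(X_train, y_train, xq, alpha=1.0):
    """Class probabilities for one query under Bernoulli NB with smoothing.

    Hypothetical helper written for this example, not the lecture's code.
    """
    X_bin = (X_train > 0).astype(int)
    xq_bin = (np.asarray(xq) > 0).astype(int)  # the fix: binarise the query too
    log_probs = []
    for c in np.unique(y_train):
        Xc = X_bin[y_train == c]
        Ny = Xc.shape[0]
        # Smoothed P(x_j = 1 | y = c) for every feature at once.
        p1 = (Xc.sum(axis=0) + alpha) / (Ny + 2 * alpha)
        # Use P(x_j = 1) where the word is present, P(x_j = 0) where it is absent.
        pxy = np.where(xq_bin == 1, p1, 1 - p1)
        log_probs.append(np.log(Ny / len(y_train)) + np.log(pxy).sum())
    log_probs = np.array(log_probs)
    probs = np.exp(log_probs - log_probs.max())  # subtract max for stability
    return probs / probs.sum()


nb = BernoulliNB().fit(X, y)
for click_count in (1, 10, 20):
    xq = np.zeros(X.shape[1])
    xq[7] = click_count  # index 7 is "click"
    manual = manual_bernoulli_proba(X, y, xq)
    reference = nb.predict_proba(xq.reshape(1, -1))[0]
    print(click_count, manual, np.allclose(manual, reference))
```

Binarising inside the function means a caller cannot forget the step, which is the mistake the original loop made.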

### Question 3
With the email spam dataset, compute macro-averaged F1-score of the model `nb` (the one which we trained using scikit-learn library) evaluated using Leave-One-Out Cross-Validation (do not implement LOOCV yourself, just use scikit-learn functionality). What F1-score did you get? Now do the same but with Multinomial Naive Bayes. What F1-score did you get now? Which of the two algorithms would you recommend for this data? Give at least two distinct reasons.

In [13]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

loo = LeaveOneOut()
bernoulli_f1 = cross_val_score(BernoulliNB(), X, y, cv=loo, scoring='f1_macro').mean()
multinomial_f1 = cross_val_score(MultinomialNB(), X, y, cv=loo, scoring='f1_macro').mean()

print(f"BernoulliNB LOOCV macro F1: {bernoulli_f1:.6f}")
print(f"MultinomialNB LOOCV macro F1: {multinomial_f1:.6f}")


BernoulliNB LOOCV macro F1: 0.923077
MultinomialNB LOOCV macro F1: 0.846154


BernoulliNB performs better on this dataset (0.923 versus 0.846). First, the dataset only encodes the presence or absence of each word, which is exactly the binary model that BernoulliNB assumes; MultinomialNB is designed for word counts.
Second, BernoulliNB explicitly uses the absence of a word as evidence, which helps on a small dataset (here, only 13 emails); MultinomialNB does not have this advantage because it only models the frequencies of the words that occur.
For this dataset we would therefore recommend BernoulliNB: the higher score supports the idea that it fits this type of binary feature better. MultinomialNB could still be interesting, but we would have to change the methodology, use word counts instead of binary features, and collect many more emails.
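
One caveat about the scores above: each LOOCV fold holds a single email, so the per-fold macro F1 is either 0 or 1, and its mean equals the fraction of correctly classified emails (12/13 ≈ 0.923 and 11/13 ≈ 0.846). A possible alternative, sketched here as an assumption rather than as the lecture's method, pools the out-of-fold predictions with `cross_val_predict` and computes a single macro F1 over all 13 emails. The dictionary names `pooled_f1` and `loo_accuracy` are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Same email dataset as in the lecture (last column is the spam label).
data = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 1]])
X, y = data[:, :8], data[:, 8]

pooled_f1 = {}     # model name -> macro F1 over pooled LOOCV predictions
loo_accuracy = {}  # model name -> fraction of emails predicted correctly
for model in (BernoulliNB(), MultinomialNB()):
    name = type(model).__name__
    # One out-of-fold prediction per email, collected into a single vector.
    y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    pooled_f1[name] = f1_score(y, y_pred, average="macro")
    loo_accuracy[name] = np.mean(y_pred == y)
    print(f"{name}: pooled macro F1 = {pooled_f1[name]:.6f}, "
          f"LOOCV accuracy = {loo_accuracy[name]:.6f}")
```

The pooled score can differ slightly from the per-fold mean because precision and recall are computed once per class over all emails instead of once per fold.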

### Question 4
Now go to the top of the notebook file from the lecture where we worked with the Iris dataset. Try replacing the model `nb` we used with Multinomial Naive Bayes (with appropriate import) and try to train the model. Do not modify any other code. You will get an error message. Explain the reasons behind the message specifically in the context of Naive Bayes.

In [14]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB

iris = load_iris()
X_iris = iris.data[:, (2, 3)]
y_iris = iris.target_names[iris.target] == 'versicolor'

scaler = StandardScaler().fit(X_iris)
X_scaled = scaler.transform(X_iris)
print('Min values after scaling:', X_scaled.min(axis=0))

try:
    nb_multi = MultinomialNB()
    nb_multi.fit(X_scaled, y_iris)
except ValueError as err:
    print('MultinomialNB fitting error:', err)


Min values after scaling: [-1.56757623 -1.44707648]
MultinomialNB fitting error: Negative values in data passed to MultinomialNB (input X).


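MultinomialNB models each feature as a count (or frequency) of events, so it assumes non-negative values: the per-class feature probabilities are estimated from the feature totals in each class, and a negative "count" has no meaning in that model. The standardization of the Iris dataset centres each feature on zero and therefore produces negative values (minimums of about -1.57 and -1.45 here), so the algorithm rejects the data.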
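
To illustrate the point, the sketch below shows two ways to avoid the error. Both are assumptions that go beyond the assignment, which asks not to modify any other code: keeping the standardized features but using `GaussianNB`, which models continuous values, or rescaling to a non-negative range with `MinMaxScaler` before `MultinomialNB`. The second option runs, but it treats measurements as pseudo-counts, so it is not a natural model for this data.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = load_iris()
X_iris = iris.data[:, (2, 3)]  # petal length and petal width
y_iris = iris.target_names[iris.target] == "versicolor"

# Standardized features have zero mean, so some values are negative.
X_std = StandardScaler().fit_transform(X_iris)
# Min-max scaling maps every feature to [0, 1], so no value is negative.
X_minmax = MinMaxScaler().fit_transform(X_iris)

# Option 1: Gaussian Naive Bayes accepts negative, continuous features.
gnb = GaussianNB().fit(X_std, y_iris)

# Option 2: MultinomialNB accepts the data once it is non-negative.
mnb = MultinomialNB().fit(X_minmax, y_iris)

print("Min after standardization:", X_std.min(axis=0))
print("Min after min-max scaling:", X_minmax.min(axis=0))
print("GaussianNB training accuracy:", gnb.score(X_std, y_iris))
print("MultinomialNB training accuracy:", mnb.score(X_minmax, y_iris))
```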