# 13 Naive Bayes classifier, Exercises

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

## Lesson 1

In [None]:
d1 = dict()
d1['one'] = 0
try:
    d1['one'] += 1  # it's ok, since we already initialized this key
    d1['two'] += 1  # it will be an error since key 'two' is undefined
    print("Finished successfully")  # this line will not be executed
except KeyError as e:
    print("KeyError", e)

In [None]:
from collections import defaultdict
d2 = defaultdict(int)
d2['one'] = 0
try:
    d2['one'] += 1  # it's ok, since we already initialized this key
    d2['two'] += 1  # it is also ok since default value of int (zero) will be assumed 
    print("Finished successfully") 
except KeyError as e:
    print("KeyError", e)  # this line will not be executed

Let us recall Python sets.

A set does not allow duplicates. Thus when a list is converted to a set, all duplicates are removed.

Then the set can be converted back to a list if required.

In [None]:
s1 = ['one', 'one', 'one', 'two', 'two', 'three']
print(s1)
s2 = set(s1)
print(s2)
s3 = list(s2)
print(s3)

Sets are convenient when we need to store only unique values.

In [None]:
vocab = set()
print(vocab)
vocab.add("one")
print(vocab)
vocab.add("one")
print(vocab)
vocab.update(["one", "two", "three"])
print(vocab)

We can check whether a value belongs to a set using the `in` operator:

In [None]:
print('one' in vocab)
print('one' not in vocab)
print('once' not in vocab)

Let us recall regular expressions.

The pattern `[a-z']` matches any single lowercase letter or an apostrophe. If we add a plus, as in `[a-z']+`, the pattern matches one or more consecutive occurrences of these symbols. This pattern is the simplest way to match words in a text.

Usually we ignore the difference between lowercase and capital letters. Since the pattern itself matches only lowercase letters, it is convenient to convert the text to lowercase first using the `.lower()` method of strings.

In [None]:
import re

txt = """I've a cat named Vesters,
And he eats all day.
He always lays around,
And never wants to play.
"""

rge = re.compile(r"[a-z']+")
print(rge.findall(txt.lower()))

Two final technical remarks. We will compute the probability that a word $W_i$ appears in spam and in ham messages.

Assume that this word appears only in spam messages. Then the estimated probability of seeing it in ham messages will be zero.

Since we compute the product of probabilities $P(W_1|H)P(W_2|H)P(W_3|H)\ldots$, a single vanishing factor, say $P(W_2|H)=0$, makes the whole product zero.

In this case every message containing $W_i$ will be classified as spam, since the probability of the opposite class is always zero.

To avoid this, we add a pseudocount $k$ when computing the probabilities:

$$
P(W_i | S) = \frac{k + \text{number of spam messages with $W_i$}}{2k + \text{total number of spam messages}}
$$

Usually $k=1$.
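
The smoothing formula above can be sketched as a small function. This is only an illustration: the function name `smoothed_probability` and the message counts are hypothetical and are not taken from the spam filter used in the course.

```python
def smoothed_probability(count_with_word, total_messages, k=1.0):
    """Estimate P(W_i | class) with pseudocount k, following the formula above."""
    return (k + count_with_word) / (2 * k + total_messages)

# Hypothetical counts: the word occurs in 0 of 50 ham messages
# and in 10 of 20 spam messages
p_ham = smoothed_probability(0, 50)
p_spam = smoothed_probability(10, 20)

print(p_ham)   # small, but not zero
print(p_spam)
```

With the pseudocount, a word that never occurred in ham messages still gets a small nonzero probability, so it no longer zeroes the whole product.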

To classify a message that contains a set of tokens (words)
$(W_1, W_2, \ldots W_n)$, we need to compute the likelihoods that it is spam and that it is ham:

$$
P(S | W_1,W_2,\ldots W_n) \propto P(W_1 | S) P(W_2 | S) \ldots P(W_n | S) P(S)
$$

$$
P(H | W_1,W_2,\ldots W_n) \propto P(W_1 | H) P(W_2 | H) \ldots P(W_n | H) P(H)
$$

Then if $P(S | W_1,W_2,\ldots W_n) > P(H | W_1,W_2,\ldots W_n)$ we classify the message as spam; otherwise we classify it as ham.

These expressions contain many multiplications:

$$
P(W_1 | S) P(W_2 | S) \ldots P(W_n | S) P(S), \; P(W_1 | H) P(W_2 | H) \ldots P(W_n | H) P(H)
$$

Such computations are numerically unstable.

Each factor is less than one, so the product of many factors approaches zero and can easily underflow. Numerical errors can also grow enormously.
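
A minimal sketch of the underflow problem is shown below. The probabilities are made up for illustration: we assume a message with 100 tokens, each with probability $10^{-5}$.

```python
# Hypothetical example: 100 tokens, each with probability 1e-5
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p

# The exact result, 1e-500, is far below the smallest positive float,
# so the computed product collapses to zero
print(product)
```

Once both products collapse to zero, the spam and ham scores can no longer be compared.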

We can avoid this by using logarithms. Recall that:

$$
\log(a \times b) = \log a + \log b
$$

This means that if we sum the logarithms of the factors, we obtain the logarithm of their product.

Given the logarithms of the probabilities, we do not need to convert them back to the probabilities themselves.

We only want to compare the probabilities, and the logarithm is a monotonically increasing function, so we can accumulate the sums of logarithms and compare them instead.
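
The comparison of log sums can be sketched as follows. All names (`log_score`, `classify`, `spam_probs`, `ham_probs`) and all probability values are hypothetical and serve only to illustrate the idea; in a real filter the word probabilities would be estimated from training data with a pseudocount.

```python
import math

def log_score(tokens, word_probs, prior):
    """Return log P(class) plus the sum of log P(W_i | class) over the tokens."""
    return math.log(prior) + sum(math.log(word_probs[w]) for w in tokens)

# Hypothetical smoothed word probabilities for each class
spam_probs = {"free": 0.5, "meeting": 0.05, "prize": 0.4}
ham_probs = {"free": 0.1, "meeting": 0.3, "prize": 0.02}

def classify(tokens, p_spam=0.5):
    """Compare the log scores of the two classes (p_spam is an assumed prior)."""
    log_s = log_score(tokens, spam_probs, p_spam)
    log_h = log_score(tokens, ham_probs, 1.0 - p_spam)
    return "spam" if log_s > log_h else "ham"

print(classify(["free", "prize"]))
print(classify(["meeting"]))
```

The log scores are large negative numbers rather than tiny positive ones, so they stay well within the range of floating point values even for long messages.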

### Exercises

1\. Describe in writing what assumption is made when a naive Bayes classifier is created. Why is the classifier called naive?

2\. Describe in writing what maximum likelihood means.

3\. Make a copy of the naive Bayes classifier that we used above to create a spam filter and try to improve its performance.
Split the data set into training, validation and test data. Select the best model using the validation data and then compute your final score on the test data. To improve the model, for example, the whole message content can be taken into account instead of the subject only. The lengths of the tokens that are taken into account can also be varied. It may be interesting to split the messages into bigrams: pairs of words that follow one another. And so on.

4\. Try to improve the Gaussian naive Bayes classifier. Split the data set into training, validation and test data. Select the best model using the validation dataset and then compute your final score on the testing data. 

5\. Previously we discussed that in most cases data must be standardized before a machine learning model is created. Why does standardization not influence the performance of a Gaussian naive Bayes classifier?