# Tutorial 08 – Baseline Model

In this tutorial we will look at how baselines help us assess whether a more complex model is worth its complexity, and we will train a naïve Bayes spam filter.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # make plots nicer

## Baselines

Baseline is a general term for any threshold that we aim to beat with our machine learning model. It is usually either 1) a result obtained from a simple statistic of the label/target distribution or 2) a previously obtained result. Case 1) is common when you are tackling a new problem for the first time and have no idea how any machine learning model will perform. Case 2) is common when improving on previous results, e.g., results published in the literature. Note that in case 2) the baseline can be the result of an arbitrarily complex model, even a state-of-the-art neural network, so the baseline is not always easy to beat.

Let's now focus on case 1) and use the Titanic dataset for demonstration.

In [2]:
titanic = sns.load_dataset("titanic")
# drop redundant columns
titanic = titanic.drop(columns=["embarked", "who", "class", "alive"])
titanic.head()

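| | survived | pclass | sex | age | sibsp | parch | fare | adult_male | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | True | NaN | Southampton | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | False | C | Cherbourg | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | False | NaN | Southampton | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | False | C | Southampton | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | True | NaN | Southampton | True |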


Add a new category for missing categorical values so we can later fit more complex models.

In [3]:
titanic["deck"] = titanic.deck.cat.add_categories("missing")
titanic["deck"] = titanic.deck.fillna("missing")
titanic["embark_town"] = titanic.embark_town.fillna("missing")
titanic["embark_town"] = titanic.embark_town.astype("category")

Split the dataset into training and test subsets.

In [4]:
from sklearn.model_selection import train_test_split

titanic_X, titanic_y = titanic.drop(columns="survived"), titanic.survived

titanic_train_X, titanic_test_X, titanic_train_y, titanic_test_y = train_test_split(
    titanic_X, titanic_y, test_size=0.2, random_state=42
)

The simplest baseline is to toss a coin and predict labels completely at random. This is sensible if the classes are balanced (they have roughly the same number of examples). With imbalanced classes it is better to adjust the probabilities so that the generated labels follow the label distribution of the training set. If the classes are highly imbalanced, this can be simplified further to always predicting the most frequent class label.

`scikit-learn` has [Dummy estimators](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) that do exactly this.
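The accuracy each of these strategies can be expected to reach can also be worked out on paper. The following is a minimal sketch, assuming a binary problem in which the majority class makes up the same fraction `p` of both the training and the test labels. The helper name `expected_accuracy` and the value `p = 0.6` are illustrative assumptions, not values taken from the dataset.

```python
def expected_accuracy(strategy: str, p: float) -> float:
    """Expected accuracy of a dummy strategy when the majority class has share p."""
    if strategy == "uniform":
        # A fair coin matches the true label half of the time, whatever p is.
        return 0.5
    if strategy == "stratified":
        # Prediction and truth are drawn independently from the same distribution,
        # so they agree with probability p*p + (1-p)*(1-p).
        return p * p + (1 - p) * (1 - p)
    if strategy == "most_frequent":
        # Always predicting the majority class is right exactly p of the time.
        return p
    raise ValueError(f"unknown strategy: {strategy}")


# p = 0.6 is a hypothetical majority-class share used only for illustration.
for name in ("uniform", "stratified", "most_frequent"):
    print(name, round(expected_accuracy(name, p=0.6), 2))
```

On a finite test set the two random strategies will scatter around these expected values, which is why a single run can land a few points above or below them.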

<div class="alert alert-block alert-warning"><b>Exercise 1</b></div>

Compute the baseline **accuracy on the test set** of simple (dummy) classifiers that predict labels **completely at random**, **proportionally to the label distribution in the training data**, and by always **predicting the most frequent** label.

You should get roughly the following accuracies (the exact numbers will differ due to randomness).
* completely random: $0.50 ±0.08$
* proportional: $0.52 ±0.07$
* most frequent: $0.58$

In [14]:
# TODO: your code goes here...
from sklearn.dummy import DummyClassifier

completely_random = DummyClassifier(strategy='uniform', random_state=42)
completely_random.fit(titanic_train_X, titanic_train_y)
print(round(completely_random.score(titanic_test_X, titanic_test_y), 2))

proportional = DummyClassifier(strategy='stratified', random_state=42)
proportional.fit(titanic_train_X, titanic_train_y)
print(round(proportional.score(titanic_test_X, titanic_test_y), 2))

most_frequent = DummyClassifier(strategy='most_frequent', random_state=42)
most_frequent.fit(titanic_train_X, titanic_train_y)
print(round(most_frequent.score(titanic_test_X, titanic_test_y), 2))

0.45
0.51
0.59


Now let's fit a decision tree and see if it beats the baselines.

In [15]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import KNNImputer

dt_pipeline = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), ["sex"]),
        (OneHotEncoder(), ["deck", "embark_town"]),
        remainder="passthrough",
    ),
    KNNImputer(),
    DecisionTreeClassifier(max_depth=5),
)
dt_pipeline.fit(titanic_train_X, titanic_train_y)
round(dt_pipeline.score(titanic_test_X, titanic_test_y), 2)

0.79

The decision tree achieves an accuracy of 0.79, which is 20 percentage points higher than the best baseline (0.59). This is a very good result and an indication that the decision tree is worth using.

## Bayesian statistics
To get to naïve Bayes we first need to be familiar with Bayesian statistics. In Bayesian statistics, a probability expresses the degree of belief that some event will happen. This allows us to update our probability estimates based on new evidence (this process is called [Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference)). The fundamental theorem of Bayesian statistics is Bayes' theorem.

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

We can use Bayes' theorem to update our belief in event $A$ based on the observation of event $B$. We start with the prior probability $P(A)$, and after observing event $B$ we update our belief and end up with the posterior probability $P(A|B)$ of event $A$.

**The following exercise is not medical advice!**

Let's practice this type of inference on a simple example. Say you start coughing and have a fever, you take a coronavirus test, and the test is positive. How likely are you to actually have the coronavirus?

<div class="alert alert-block alert-warning"><b>Exercise 2</b></div>

Calculate the posterior probability $P(C=True|T=+)$, given that the prevalence of coronavirus $P(C=True)$ is 0.001 (0.1 % of the population) and the test accuracy $P(T=+|C=True) = P(T=-|C=False)$ is 0.99 (99 %).

You can use the fact that the probability of a person testing positive is $P(T=+) = P(T=+|C=True) P(C=True) + P(T=+|C=False) P(C=False)$.

In [16]:
covid19_prevalence = 0.001  # P(C=True)
test_accuracy = 0.99  # P(T=+|C=True) = P(T=-|C=False)
# TODO: your code goes here...
P_plus_true = test_accuracy
P_true = covid19_prevalence
P_false = 1 - P_true
P_plus = P_plus_true * P_true + (1 - P_plus_true) * P_false
P_true_plus = P_plus_true * P_true / P_plus
print(round(P_true_plus, 2))

0.09


Yes, it really is less than 10 %. The main reason is the very small percentage of infected people: the prior $P(C=True)$ is tiny, so the few false positives among the many healthy people outnumber the true positives. You would need multiple tests (more evidence) to overcome the prior and get a more definitive result.
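To see how additional evidence shifts the belief, the posterior after one test can be reused as the prior for the next one. The sketch below assumes that repeated tests are conditionally independent given the infection status and that all of them come back positive. Both are simplifying assumptions, and the helper name `update` is illustrative.

```python
def update(prior: float, sensitivity: float, specificity: float) -> float:
    """Posterior P(C=True | T=+) after one positive test, via Bayes' theorem."""
    # Total probability of a positive test: true positives plus false positives.
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive


belief = 0.001  # prevalence, used as the initial prior
posteriors = []
for _ in range(3):  # three positive tests in a row (assumed independent)
    belief = update(belief, sensitivity=0.99, specificity=0.99)
    posteriors.append(belief)

print([round(p, 3) for p in posteriors])
```

Under these assumptions the belief rises from about 0.09 after the first positive test to about 0.91 after the second and above 0.99 after the third.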

## Naïve Bayes
Naïve Bayes methods use exactly this Bayesian inference to classify an example based on its features (the evidence). Using Bayesian inference, the probability of class $y$ is given by

$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)} {P(x_1, \dots, x_n)} $$
                                 
where $x_1, \dots, x_n$ are the feature values. Naïve Bayes introduces the "naïve" assumption that all features are conditionally independent given the class, which results in the following probability.

$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)} {P(x_1, \dots, x_n)} $$
                                 
Since the denominator does not depend on the class whose probability we are estimating, we can simplify the calculation and predict the class with the largest numerator. This is the final formula used in naïve Bayes classifiers.

$$ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) $$
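The decision rule above can be traced by hand on a toy problem. The following is a minimal sketch in which the priors and the per-feature likelihoods are hypothetical numbers chosen only for illustration (in practice they are estimated from training data). Scores are summed in log space, a common trick that avoids numerical underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical class priors P(y) and likelihoods P(x_i | y), for illustration only.
priors = {"ham": 0.8, "spam": 0.2}
likelihoods = {
    "ham": {"free": 0.05, "meeting": 0.30},
    "spam": {"free": 0.60, "meeting": 0.02},
}


def predict(features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        log_score = math.log(prior)
        for feature in features:
            log_score += math.log(likelihoods[label][feature])
        scores[label] = log_score
    return max(scores, key=scores.get)


print(predict(["free"]))
print(predict(["free", "meeting"]))
```

With these numbers, "free" alone favours `spam` (0.2 · 0.60 = 0.12 against 0.8 · 0.05 = 0.04), while adding "meeting" tips the decision back to `ham` (0.8 · 0.05 · 0.30 = 0.012 against 0.2 · 0.60 · 0.02 = 0.0024).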



There are different variants of naïve Bayes classifiers based on how they model $P(x_i \mid y)$. You can read more about each of them in the [scikit-learn guide](https://scikit-learn.org/stable/modules/naive_bayes.html). Now let's use a naïve Bayes classifier for spam detection in SMS messages.

In [17]:
spam_train = pd.read_csv(
    "https://www.fi.muni.cz/~xcechak1/IB031/datasets/spam_train.csv"
)
spam_test = pd.read_csv("https://www.fi.muni.cz/~xcechak1/IB031/datasets/spam_test.csv")

In [18]:
spam_train_X, spam_train_y = spam_train.text, spam_train.type
spam_test_X, spam_test_y = spam_test.text, spam_test.type

In [19]:
spam_train_X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: text, dtype: object

We cannot use the classifier on raw strings, so we need to extract some features from the text first. The most common and simplest method is bag-of-words, where we transform each string into a vector. Each element of this vector records how many times a given word occurs in the string. The example below should make it clear.

In [20]:
sample_sentences = [
    "the black cat",
    "the cat and dog",
    "black dog",
    "black dog black dog",
]

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words = CountVectorizer().fit(sample_sentences)
print(bag_of_words.transform(sample_sentences).toarray())
print(bag_of_words.vocabulary_)

[[0 1 1 0 1]
 [1 0 1 1 1]
 [0 1 0 1 0]
 [0 2 0 2 0]]
{'the': 4, 'black': 1, 'cat': 2, 'and': 0, 'dog': 3}


Here the first column corresponds to occurrences of the word "and", the second to "black", and so on.
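The same matrix can be rebuilt by hand, which shows what the vectorizer does conceptually. This is a minimal sketch assuming plain whitespace tokenization and an alphabetically sorted vocabulary. The real `CountVectorizer` additionally lowercases the text and, by default, ignores single-character tokens, neither of which matters for these sentences.

```python
from collections import Counter

sample_sentences = [
    "the black cat",
    "the cat and dog",
    "black dog",
    "black dog black dog",
]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocabulary = sorted({word for sentence in sample_sentences for word in sentence.split()})

# One row per sentence, one column per vocabulary word, holding the word count.
vectors = []
for sentence in sample_sentences:
    counts = Counter(sentence.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```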

<div class="alert alert-block alert-warning"><b>Exercise 3</b></div>

Use `CountVectorizer` and [Multinomial Naïve Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) to train a spam classifier on the training set, then evaluate the classifier on the test set using the `score` method.

You should get an accuracy of $0.986$.

In [33]:
# TODO: your code goes here...
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

spam_filter_pipeline = make_pipeline(
    CountVectorizer(),
    MultinomialNB()
)

spam_filter_pipeline.fit(spam_train_X, spam_train_y)
round(spam_filter_pipeline.score(spam_test_X, spam_test_y), 3)

0.986

This looks very good, but the dataset has imbalanced classes: there are far fewer spam messages than normal ones.

<div class="alert alert-block alert-warning"><b>Exercise 4</b></div>

Train a baseline model that always predicts the majority class label `ham` (i.e., not spam) and evaluate it using the `score` method.

In [34]:
# TODO: your code goes here...
most_frequent = DummyClassifier(strategy='most_frequent', random_state=42)
most_frequent.fit(spam_train_X, spam_train_y)
print(round(most_frequent.score(spam_test_X, spam_test_y), 2))

0.87


As you can see, even this simple classifier obtains an accuracy of about 87 %. This makes the naïve Bayes classifier's results a bit less impressive, but they are still good.

Of course, accuracy is not a good metric for imbalanced data: just labeling everything as `ham` already gives an accuracy of 0.87. The F1 measure is better suited for this job. Use the F1 score to evaluate the spam filter and the dummy model.

In [35]:
from sklearn.metrics import f1_score

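# f1_score expects the true labels first, then the predictions;
# average=None returns one F1 score per class
f1_score(spam_test_y, spam_filter_pipeline.predict(spam_test_X), average=None)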

array([0.99174236, 0.94413408])

The classifier does a really great job of identifying `ham` messages, but we are mostly concerned with false positives (`ham` messages labeled as `spam`), which the filter might block so that the user never receives them.
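For comparison, the always-`ham` baseline can be scored the same way. The sketch below uses hypothetical labels (87 `ham` and 13 `spam`, chosen to mirror the 0.87 baseline accuracy, not the real test set) and computes F1 from raw counts. The helper name `f1_for` is illustrative, and defining an undefined F1 as 0.0 is an assumption of this sketch.

```python
# Hypothetical labels: 87 ham and 13 spam, mirroring the 0.87 baseline accuracy.
y_true = ["ham"] * 87 + ["spam"] * 13
y_pred = ["ham"] * len(y_true)  # the baseline never predicts spam


def f1_for(label, y_true, y_pred):
    """F1 score of one class, computed from raw counts (0.0 if undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1_for("ham", y_true, y_pred), f1_for("spam", y_true, y_pred))
```

The baseline keeps its 0.87 accuracy, but its F1 score for `spam` is 0 because it never finds a single spam message, a failure that accuracy alone hides.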

In [36]:
# hold out 20 % of the training data as a validation set
spam_train_X, spam_validation_X, spam_train_y, spam_validation_y = train_test_split(
    spam_train_X, spam_train_y, test_size=0.2, random_state=42
)

<div class="alert alert-block alert-danger"><b>Exercise 5</b></div>

Experiment with various settings of `CountVectorizer`, or try the [TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), together with various naïve Bayes implementations. See which combination minimizes false positives on the validation dataset, using a confusion matrix for this evaluation. After deciding on a pipeline, confirm that the number of false positives is also very low on the test set.

In [46]:
# TODO: your code goes here...
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

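pipeline = make_pipeline(
    # CountVectorizer(),  # alternative vectorizer to compare against
    TfidfVectorizer(),
    MultinomialNB(),
)

pipeline.fit(spam_train_X, spam_train_y)

# Assuming the labels are "ham" and "spam": scikit-learn sorts them, so `fp`
# counts ham messages that were wrongly flagged as spam.
pred = pipeline.predict(spam_validation_X)
tn, fp, fn, tp = confusion_matrix(y_true=spam_validation_y, y_pred=pred).ravel()
print(fp)  # false positives on the validation set

pred = pipeline.predict(spam_test_X)
tn, fp, fn, tp = confusion_matrix(y_true=spam_test_y, y_pred=pred).ravel()
print(fp)  # false positives on the test set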

0
0
