In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
%matplotlib inline
plt.rcParams["figure.figsize"] = [12,8]

## Training (Naive) Bayes

As shown in the previous lectures, training a Bayes classifier amounts to estimating the conditional probability distribution of the features $\mathbf{x}$ given the class label $c$:

$$P(\mathbf{X}=\mathbf{x}|C=c)$$

In the case of the Naive Bayes classifier the $n_f$ features are assumed to be conditionally independent given the class

$$P(\mathbf{X}=\mathbf{x}|C=c)=\prod_{i=0}^{n_f-1} P(X_i=x_i|C=c)$$

so we can estimate the probability distribution of each feature separately.
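To make the factorisation concrete, here is a minimal sketch with a made-up model of two classes and two categorical features. All the numbers, as well as the names `cond_probs`, `priors` and `likelihood`, are assumptions chosen only for illustration. The likelihood is the product of the per-feature probabilities, and Bayes' theorem then gives the posterior over the classes.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 2 categorical features.
# cond_probs[i][c, j] = P(X_i = j | C = c); every row sums to 1.
cond_probs = [
    np.array([[0.7, 0.3],          # feature 0 has 2 categories
              [0.2, 0.8]]),
    np.array([[0.5, 0.4, 0.1],     # feature 1 has 3 categories
              [0.1, 0.3, 0.6]]),
]
priors = np.array([0.6, 0.4])      # P(C = c), also made up


def likelihood(x, c):
    """P(X = x | C = c) as a product over the individual features."""
    return np.prod([p[c, x_i] for p, x_i in zip(cond_probs, x)])


x = (1, 2)
joint = np.array([priors[c] * likelihood(x, c) for c in range(2)])
posterior = joint / joint.sum()    # Bayes' theorem
print(posterior)
```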

In the case of categorical features each $X_i$ takes one of a finite number $m_i$ of possible values (categories) which, without any loss of generality, we can label $0,\ldots,m_i-1$. In the same way we will assume that the class labels take $n_c$ integer values $c=0,\ldots,n_c-1$.

So for each feature $i$ we have to estimate $n_c\times m_i$ probabilities

$$p^{(c)}_{ij} = P(X_i = j|C = c)$$

Of course they are not all independent. For every feature $i$ and class $c$ normalisation requires

$$\sum_{j=0}^{m_i-1} p^{(c)}_{ij}=1$$

Let $\mathbf{x}$ denote the $n_s \times n_f$ matrix of training data, where $n_s$ is the number of samples. So $x_{hi}$ denotes the value of the $i$-th feature in sample $h$. Let $y_h$ denote the corresponding label

$$x_{hi} =0,\ldots,m_i-1,\quad  y_h=0,\ldots,n_c-1$$

Let's introduce some more notation. 

$\mathbf{x}^{c}$ will denote the set of all data points with label $c$:

$$\mathbf{x}^c\equiv\{\mathbf{x}_h: y_h=c\}$$ 

The number of elements in $\mathbf{x}^{c}$ will be denoted by $n^c$.

Let $n^{(c)}_{ij}$ denote the number of times that feature $i$ had value $j$ in samples with  class label $c$:

$$n^{(c)}_{ij} = 
\sum_{h=0}^{n_s-1} \delta_{y_h,c}\,\delta_{x_{hi},j},\qquad \delta_{a,b}=
\begin{cases}
1 & a=b\\
0 & a\neq b
\end{cases}
$$

We will use a smoothed estimator 

$$p^{(c)}_{ij} = \frac{n^{(c)}_{ij}+\alpha}{n^c+m_i\alpha} $$

where $\alpha\ge 0$ is a smoothing parameter. A non-zero smoothing parameter ensures a non-vanishing probability even when $n^{(c)}_{ij}=0$. The choice $\alpha=1$ is known as Laplace smoothing.
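The estimator can be sketched in a few lines of NumPy. The function name `smoothed_probs` and the tiny data set below are made up for illustration. The function handles a single feature column and returns an array `p[c, j]` that estimates $P(X_i=j|C=c)$.

```python
import numpy as np


def smoothed_probs(x_col, y, n_categories, n_classes, alpha=1.0):
    """Return p[c, j], the smoothed estimate of P(X_i = j | C = c) for one feature."""
    counts = np.zeros((n_classes, n_categories))
    # n_ij^(c): number of samples with label c whose feature value is j
    np.add.at(counts, (y, x_col), 1)
    class_sizes = counts.sum(axis=1, keepdims=True)   # n^c
    return (counts + alpha) / (class_sizes + n_categories * alpha)


# Made-up data: one feature with 3 categories, 2 classes.
x_col = np.array([0, 0, 1, 2, 2, 2])
y = np.array([0, 0, 0, 1, 1, 1])
p = smoothed_probs(x_col, y, n_categories=3, n_classes=2, alpha=1.0)
print(p)
```

Each row of `p` sums to one, and with $\alpha>0$ even the combinations that never occur in the data (e.g. value 2 in class 0) get a non-zero probability.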

## Example: Car evaluation data set

As an example we will use the [car evaluation dataset](http://archive.ics.uci.edu/ml/datasets/Car+Evaluation) from the [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/). It contains 1728 samples with six attributes (features) each. The class label is the evaluation of the car: unacc, acc, good, vgood.

All six attributes are categorical and the data contains exactly one sample for each possible combination of attribute values (in this respect it is quite a peculiar dataset).

As before we will use pandas to read and process the data.

In [None]:
cars_data = pd.read_csv("../../Data/Cars/car_data.csv", names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'])

In [None]:
cars_data.head()

In [None]:
cars_data.info()

The `groupby` method divides the data frame into groups based on the values of the given column(s)

In [None]:
cars_by_class = cars_data.groupby('class')

The size of each group can be calculated using the `size` method

In [None]:
cars_by_class.size()

As we can see, the classes are not very well balanced, with a relatively small number of cars in the two best classes. So I have decided to join those two classes together, introducing a new classification

In [None]:
def bargain(c):
    if c in ['good', 'vgood']:
        return 'good'
    elif c=='acc':
        return 'fair'
    else:
        return 'bad'

In [None]:
cars_data['bargain'] = cars_data['class'].apply(bargain)

We start by dividing the data set into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
seed = 678565

In [None]:
cars_train, cars_test = train_test_split(cars_data, train_size=0.75, random_state=seed)

In [None]:
cars_train['bargain'].value_counts()

In [None]:
cars_test['bargain'].value_counts()

The `train_test_split` function has an option to _stratify_ the data based on the values of one column

In [None]:
cars_train, cars_test = train_test_split(cars_data, train_size=0.75, stratify=cars_data['bargain'],
                                         random_state = seed)

In this way the split is done separately for each class label. As a result the training and test sets contain (almost) the same proportions of the classes as the full data set
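The idea of a stratified split can be illustrated by hand. The sketch below is not how `train_test_split` is implemented. It is only a minimal version of "split each class separately", with a hypothetical helper `stratified_indices` and made-up, imbalanced labels.

```python
import numpy as np


def stratified_indices(labels, train_fraction, rng):
    """Split sample indices into train/test separately within each class."""
    train, test = [], []
    for c in np.unique(labels):
        # shuffle the indices of the samples belonging to class c
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(train_fraction * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)


rng = np.random.default_rng(0)
labels = np.array(['bad'] * 70 + ['fair'] * 22 + ['good'] * 8)  # made-up labels
train_idx, test_idx = stratified_indices(labels, 0.75, rng)

# fraction of each class in the training and in the test part
for c in np.unique(labels):
    print(c, np.mean(labels[train_idx] == c), np.mean(labels[test_idx] == c))
```

The class fractions in the two parts agree up to rounding, which is what the `stratify` option is meant to guarantee.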

In [None]:
cars_train['bargain'].value_counts()

In [None]:
cars_test['bargain'].value_counts()

We will use the Naive Bayes classifier.
There are many ways to calculate the estimators. We can start by grouping the data frame according to the class label and the feature values

In [None]:
cars_training_grouped = cars_train.groupby(['bargain', 
                                            'buying',
                                            'maint',
                                            'doors', 
                                            'persons', 
                                            'lug_boot',
                                            'safety'])

and count the size of each group

In [None]:
group_counts=cars_training_grouped.size()

Partial sums over the index levels (see the Titanic problem) can then be used to extract the $n^{(c)}_{ij}$ values

In [None]:
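# The level argument lists the levels that are kept in the result; the others are summed over.
# (`Series.sum(level=...)` was removed in pandas 2.0, so we group by the levels instead.)
group_counts.groupby(level=['bargain', 'buying']).sum()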

and finally calculate the probabilities

In [None]:
# alpha = 1 and 'buying' takes m_i = 4 values
(group_counts.groupby(level=['bargain', 'buying']).sum() + 1) / (group_counts.groupby(level='bargain').sum() + 4)

However, if you look closely you will notice one problem: the 'good' class has no entries for the 'high' and 'vhigh' values of `buying`! That of course means that the corresponding counts are zero, but because those rows are missing from the result it complicates the calculations. Instead of fixing this by hand we will use the tools from the `scikit-learn` library.
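For completeness, one possible way of fixing this by hand is to turn the counts into a table with explicit zeros. The miniature data frame `toy` below is made up, and the sketch assumes that every category occurs at least once in some class (otherwise its column would still be missing).

```python
import pandas as pd

# Made-up miniature data set: class 'good' never occurs with buying == 'high'.
toy = pd.DataFrame({
    'bargain': ['bad', 'bad', 'bad', 'good', 'good'],
    'buying': ['high', 'low', 'high', 'low', 'low'],
})

alpha = 1
# Rows: class, columns: feature value; absent combinations become explicit zeros.
counts = toy.groupby(['bargain', 'buying']).size().unstack(fill_value=0)
# Smoothed estimator: (n_ij + alpha) / (n_c + m_i * alpha), row by row.
probs = (counts + alpha).div(counts.sum(axis=1) + alpha * counts.shape[1], axis=0)
print(probs)
```

The `good`/`high` entry is now present and, thanks to smoothing, non-zero.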

The `scikit-learn` library has a class implementing just what we need

In [None]:
from sklearn.naive_bayes import CategoricalNB

In [None]:
cnb = CategoricalNB()

However, this classifier requires the class labels and attributes to be integers counted from zero. Fortunately scikit-learn also has a class for converting labels to ordinals.

In [None]:
from sklearn.preprocessing import  OrdinalEncoder

In [None]:
features_encoder = OrdinalEncoder()
encoded_features = features_encoder.fit_transform(cars_train.loc[:,'buying':'safety'])

In [None]:
class_encoder = OrdinalEncoder()
encoded_class = np.ravel(class_encoder.fit_transform(cars_train.loc[:,'bargain':]) )
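To see what kind of mapping such an encoding produces, here is a small NumPy sketch on made-up labels. It is not the `OrdinalEncoder` implementation. As far as I know, with default settings the encoder numbers the categories of each column in sorted order, and `np.unique` does the same for a single array.

```python
import numpy as np

labels = np.array(['fair', 'bad', 'good', 'bad', 'fair'])
# np.unique returns the sorted distinct values and, for each element,
# the position of its value among them.
categories, codes = np.unique(labels, return_inverse=True)
print(categories)   # ['bad' 'fair' 'good']
print(codes)        # [1 0 2 0 1]
```

Under this assumption class 0 corresponds to 'bad', 1 to 'fair' and 2 to 'good'. The actual mapping can be read from the `categories_` attribute of a fitted encoder.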

In [None]:
np.bincount(encoded_features[encoded_class==0][:,0].astype('int64'))

In [None]:
cnb.fit(encoded_features, np.ravel(encoded_class))

We can view the learned probabilities of this classifier using its `feature_log_prob_` attribute:

In [None]:
np.exp( cnb.feature_log_prob_[0] )

Comparing with our calculations we see an almost perfect agreement. However, we now get all four probabilities in the last row, as required. The two values that were missing from our table are not zero because of smoothing.

After training the classifier we can use it to  make predictions on the test set

In [None]:
encoded_test_features = features_encoder.transform(cars_test.loc[:,'buying':'safety'])
predicted_test_class = cnb.predict(encoded_test_features)

In [None]:
encoded_test_class =  np.ravel(class_encoder.transform(cars_test.loc[:,'bargain':]))

and compare the predictions with the true labels

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(encoded_test_class, predicted_test_class, normalize=True)

Actually the classifier has a `score` method that makes the predictions and measures the accuracy in one step

In [None]:
cnb.score(encoded_test_features, encoded_test_class)

which unsurprisingly gives the same result.

As a last check I will look at the class distribution in the predicted and real labels, because with unbalanced classes it may happen that one class is e.g. totally misclassified without much effect on the accuracy.

In [None]:
np.bincount(predicted_test_class.astype('int64'))

In [None]:
np.bincount(encoded_test_class.astype('int64'))
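The two counts above are the column and row sums of the confusion matrix, which additionally shows where the errors go. Below is a minimal NumPy sketch on made-up labels. The helper name `confusion` is hypothetical.

```python
import numpy as np


def confusion(y_true, y_pred, n_classes):
    """m[t, p] = number of samples with true class t predicted as class p."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(m, (y_true, y_pred), 1)
    return m


# Made-up labels: class 2 is never predicted.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])
cm = confusion(y_true, y_pred, 3)
print(cm)
print(cm.sum(axis=1))   # distribution of the true labels
print(cm.sum(axis=0))   # distribution of the predicted labels
```

In this toy example the accuracy is 5/8 although the smallest class is misclassified completely, which is exactly the situation the check is meant to detect.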

## Multiclass metrics

Our approach to measuring the performance of the classifier was a little haphazard. The common and more systematic way of doing this is to treat a $k$-class classification problem as $k$ binary classification problems: class $C_i$ against the rest. We then combine the binary metrics of the individual problems into the final score.

In [None]:
pred = predicted_test_class.astype('int64')
true = encoded_test_class.astype('int64')

In [None]:
def stat(y_true, y_pred, c):
    """Return (TP, FP, FN, TN) for the binary problem: class c against the rest."""
    lbl_true = np.where(y_true == c, 1, 0)
    lbl_pred = np.where(y_pred == c, 1, 0)
    TP = np.sum(lbl_true * lbl_pred)
    FP = np.sum((1 - lbl_true) * lbl_pred)
    TN = np.sum((1 - lbl_true) * (1 - lbl_pred))
    FN = np.sum(lbl_true * (1 - lbl_pred))
    return TP, FP, FN, TN

In [None]:
for i in range(3):
    print( stat(true,pred, i))

Once we have the statistics for each binary classifier we can combine them together. We will consider _micro_ and _macro_ averaging.

### Micro averaging

In micro averaging we first sum the values of TP, FP, FN and TN over all the binary problems and use the totals to calculate the score.
We will start with _recall_, which is just another name for the true positive rate.

$$Recall_\mu = \frac{\sum_i TP_i}{\sum_i(TP_i+FN_i)}$$

In [None]:
num = 0
den = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    num += tp
    den += tp+fn
print(num, den, num/den    )

$$Precision_\mu = \frac{\sum_i TP_i}{\sum_i(TP_i+FP_i)}$$

In [None]:
num = 0
den = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    num += tp
    den += tp+fp
print(num, den, num/den    )

and $F_1$ is then the harmonic mean of the two

$$F_\mu = 2\cdot\frac{Precision_\mu\cdot Recall_\mu}{Precision_\mu + Recall_\mu}$$

__Problem__  Show that $Recall_\mu = Precision_\mu = Accuracy$.
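Before attempting the proof it may help to check the claim numerically. The sketch below uses random, made-up labels for three classes. It is of course only a sanity check, not a solution of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 3, size=200)   # made-up true labels
y_pred = rng.integers(0, 3, size=200)   # made-up predictions

# per-class counts for the three "class c against the rest" problems
tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in range(3)])
fp = np.array([np.sum((y_true != c) & (y_pred == c)) for c in range(3)])
fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in range(3)])

recall_micro = tp.sum() / (tp + fn).sum()
precision_micro = tp.sum() / (tp + fp).sum()
accuracy = np.mean(y_true == y_pred)
print(recall_micro, precision_micro, accuracy)
```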

It is no surprise that the scikit-learn library has functions to calculate those metrics

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score

In [None]:
print(
    recall_score(encoded_test_class, predicted_test_class, average='micro'),
    precision_score(encoded_test_class, predicted_test_class, average='micro'),
    f1_score(encoded_test_class, predicted_test_class, average='micro'))

### Macro averaging

With macro averaging we calculate the score for each binary classifier separately and then average them. So for recall

$$Recall_M = \frac{1}{k}\sum_{i=0}^{k-1}\frac{TP_i}{TP_i+FN_i}$$

In [None]:
tot = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    tot +=tp/(tp+fn)
rec = tot/3    
print(rec)

In [None]:
recall_score(encoded_test_class, predicted_test_class, average='macro')

and for precision

$$Precision_M = \frac{1}{k}\sum_{i=0}^{k-1}\frac{TP_i}{TP_i+FP_i}$$

In [None]:
tot = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    tot +=tp/(tp+fp)
prec =  tot/3   
print(prec)

In [None]:
precision_score(encoded_test_class, predicted_test_class, average='macro')

and $F_M$ score is

$$F_M = \frac{1}{k}\sum_{i=0}^{k-1}\frac{2\cdot TP_i}{TP_i+FP_i +TP_i +FN_i}$$
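Each term of this sum is the ordinary $F_1$ score of the $i$-th binary problem. As a short check, writing $P_i = TP_i/(TP_i+FP_i)$ and $R_i = TP_i/(TP_i+FN_i)$ for the per-class precision and recall (the symbols $P_i$, $R_i$ and $F_i$ are introduced here only for this derivation):

$$
\begin{aligned}
F_i = \frac{2\, P_i R_i}{P_i + R_i}
&= \frac{2\cdot TP_i^2}{TP_i\,(TP_i+FN_i) + TP_i\,(TP_i+FP_i)}\\
&= \frac{2\cdot TP_i}{TP_i+FP_i +TP_i +FN_i}
\end{aligned}
$$

So $F_M$ is the average of the per-class $F_1$ scores, which in general differs from the harmonic mean of $Precision_M$ and $Recall_M$.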

In [None]:
tot = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    tot +=2 * tp/(tp+fp+tp+fn)
f =  tot/3   
print(f)

In [None]:
f1_score(encoded_test_class, predicted_test_class, average='macro')

### Weighted averaging 

Finally, weighted averaging is like macro averaging, but we weight the average by the support of each class, _i.e._ the number of samples whose true label is that class. E.g. for precision
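Written as a formula, with $s_i$ denoting the support of class $i$ (the names $s_i$ and $Precision_W$ are my notation, not used elsewhere in these notes), the weighted precision reads

$$Precision_W = \frac{\sum_{i=0}^{k-1} s_i\,\frac{TP_i}{TP_i+FP_i}}{\sum_{i=0}^{k-1} s_i},\qquad s_i = TP_i+FN_i$$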

In [None]:
tot = 0
den = 0
for i in range(3):
    tp, fp, fn, tn =  stat(true,pred, i)
    tot +=tp/(tp+fp) *(tp+fn)
    den += (tp+fn)
prec = tot/den
print(prec)

In [None]:
precision_score(encoded_test_class, predicted_test_class, average='weighted')

__Problem__ Show that weighted averaging for recall gives the same result as micro averaging.