### Bayesian Classifier

- model evaluation
- train/test split
- types: Gaussian, Multinomial, Bernoulli

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.max_columns', None)
import seaborn as sns
sns.set_theme()

In [2]:
from scipy.stats import gaussian_kde

#### Bayes' Theorem

$
\text{for two random variables}\;X,\,Y\;\text{with}\;P\left(X\right) > 0\text{,}
\; P\left(Y|X\right) = \frac{P\left(Y\right)\,P\left(X|Y\right)}{P\left(X\right)}
$
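
As a quick sanity check of the formula, here is a minimal numeric sketch. The class names and all probabilities are hypothetical toy values chosen only for illustration; they are not taken from the Iris data.

```python
# Hypothetical toy numbers, chosen only to illustrate Bayes' theorem.
p_y = {"a": 0.5, "b": 0.5}           # prior P(Y)
p_x_given_y = {"a": 0.8, "b": 0.2}   # likelihood P(X|Y) for one observed X

# Evidence P(X), obtained with the law of total probability.
p_x = sum(p_y[c] * p_x_given_y[c] for c in p_y)

# Posterior P(Y|X) for each class.
posterior = {c: p_y[c] * p_x_given_y[c] / p_x for c in p_y}
print(posterior)
```

The posteriors sum to 1 because dividing by $P\left(X\right)$ only rescales the products $P\left(Y\right)\,P\left(X|Y\right)$.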

#### Iris dataset

In [3]:
iris = sns.load_dataset("iris")
iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


#### Inference

$
\text{1.}\;P\left(X\right)\;\text{does not depend on the class, so it can be dropped:}\;
P\left(Y|X\right) \propto P\left(Y\right)\,P\left(X\,|Y\right)
$

$
\text{2. the most probable class is }\;\hat{c} = \arg\max_{c\in Y}{P\left(Y=c\,|X\right)} = \arg\max_{c\in Y}{P\left(Y=c\right)\;P\left(X\,|Y=c\right)}
$

$
\text{3. in this particular case the classes are balanced, }\;\forall{c\in Y},\;P\left(Y=c\right) = \tfrac{1}{3}\text{, thus }\;P\left(Y=c\,|X\right) \propto P\left(X\,|Y=c\right)
$
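
The inference steps can be sketched in a few lines of NumPy. The likelihood values below are hypothetical placeholders, not densities computed from the data; the point is only that with a uniform prior the maximum-posterior class and the maximum-likelihood class coincide.

```python
import numpy as np

classes = ["setosa", "versicolor", "virginica"]
prior = np.array([1 / 3, 1 / 3, 1 / 3])     # uniform prior P(Y=c)
likelihood = np.array([0.02, 0.90, 0.35])   # hypothetical P(X|Y=c) values

# Drop P(X) and work with the unnormalized posterior.
unnormalized = prior * likelihood

# Class with the largest (unnormalized) posterior.
map_class = classes[int(np.argmax(unnormalized))]

# With a uniform prior, the likelihood alone gives the same answer.
ml_class = classes[int(np.argmax(likelihood))]
print(map_class, ml_class)
```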

#### Conditional distribution: $P\left(X|Y\right)$

In [4]:
setosa = iris.loc[iris.species == 'setosa'][['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

In [5]:
pdf_setosa = gaussian_kde(setosa.transpose(), bw_method = 'scott')

In [6]:
virginica = iris.loc[iris.species == 'virginica'][['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

In [7]:
pdf_virginica = gaussian_kde(virginica.transpose(), bw_method = 'scott')

In [8]:
versicolor = iris.loc[iris.species == 'versicolor'][['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

In [9]:
pdf_versicolor = gaussian_kde(versicolor.transpose(), bw_method = 'scott')
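
A note on why the data frames are transposed: `gaussian_kde` expects its data with shape `(n_features, n_samples)`. The sketch below illustrates this with synthetic data (the means, the spread and the two query points are assumptions made up for the example) and checks that the estimated density is higher near the bulk of the samples than far from it.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic stand-in for one class: 50 samples, 4 features, shape (n, d).
samples = rng.normal(loc=[5.0, 3.4, 1.5, 0.2], scale=0.3, size=(50, 4))

# gaussian_kde wants shape (d, n), hence the transpose.
kde = gaussian_kde(samples.T, bw_method="scott")

near_point = np.array([5.0, 3.4, 1.5, 0.2])   # close to the sample mean
far_point = np.array([7.0, 3.0, 6.0, 2.0])    # far from every sample

density_near = kde.evaluate(near_point)[0]
density_far = kde.evaluate(far_point)[0]
print(kde.d, kde.n, density_near > density_far)
```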

#### Model evaluation

In [10]:
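# The order of classNames must match the order of the densities evaluated below.
classNames = ['setosa', 'virginica', 'versicolor']
def evaluate_lkh(row):
    # Feature vector of one sample (the true label in row.species is not used here).
    values = tuple(row[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
    # Pick the class whose estimated density P(X|Y=c) is largest at this point.
    maxLikelihood = np.argmax([pdf_setosa.evaluate(values), pdf_virginica.evaluate(values), pdf_versicolor.evaluate(values)])
    return classNames[maxLikelihood]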

In [11]:
iris['predicted'] = iris.apply(evaluate_lkh, axis = 1)
iris.groupby('species').predicted.value_counts()

species     predicted 
setosa      setosa        50
versicolor  versicolor    50
virginica   virginica     50
Name: predicted, dtype: int64

- The model predicts 100% of the examples correctly.
- Of course! The model was estimated using 100% of the examples.
- What happens if we split the dataset into a training set and a test set?

### Model estimation with a train/test split

In [12]:
from sklearn.model_selection import train_test_split

- independent variables (features)

In [13]:
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

- dependent variable (target)

In [14]:
y = iris[['species']]

- split: train 80%, test 20%

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 27)

- training set

In [16]:
train = pd.concat((X_train, y_train), axis = 1)
train.groupby('species').species.value_counts()

species     species   
setosa      setosa        43
versicolor  versicolor    39
virginica   virginica     38
Name: species, dtype: int64
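
The class counts above are unequal because the split is purely random. If balanced classes in both subsets are wanted, one option is the `stratify` argument of `train_test_split`. The sketch below is an alternative, not what this notebook does: it uses the copy of Iris bundled with scikit-learn (integer labels 0, 1, 2) and hypothetical variable names ending in `_s`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris as shipped with scikit-learn: 150 samples, 50 per class.
X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the class proportions equal in train and test.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.20, random_state=27, stratify=y_all
)

train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print(train_counts, test_counts)
```

With a stratified split the empirical prior on the training set stays uniform, so the choice between the two approaches depends on whether the class frequencies in the sample are believed to reflect the real ones.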

#### Re-estimate the prior $P\left(Y\right)$ (it is no longer uniform!)
- even so, we could still treat it as uniform (... if we were convinced that it really is)

In [17]:
prior = y_train.value_counts(normalize = True)
prior

species   
setosa        0.358333
versicolor    0.325000
virginica     0.316667
dtype: float64

#### Re-estimate the likelihood functions (using only the training set!)

In [18]:
setosa = X_train.loc[y_train.species == 'setosa']
lkh_setosa = gaussian_kde(setosa.transpose(), bw_method = 'scott')

In [19]:
virginica = X_train.loc[y_train.species == 'virginica']
lkh_virginica = gaussian_kde(virginica.transpose(), bw_method = 'scott')

In [20]:
versicolor = X_train.loc[y_train.species == 'versicolor']
lkh_versicolor = gaussian_kde(versicolor.transpose(), bw_method = 'scott')

#### Redefine the evaluation function to use posteriors

In [21]:
# The order of classNames must match the order of the posteriors passed to argmax.
classNames = ['setosa', 'virginica', 'versicolor']
def evaluate_post(row):
    # Feature vector of one sample (the true label in row.species is not used here).
    values = tuple(row[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
    # Unnormalized posterior for each class: P(Y=c) * P(X|Y=c).
    post_setosa = prior['setosa'] * lkh_setosa.evaluate(values)
    post_virginica = prior['virginica'] * lkh_virginica.evaluate(values)
    post_versicolor = prior['versicolor'] * lkh_versicolor.evaluate(values)
    maxPosterior = np.argmax([post_setosa, post_virginica, post_versicolor])
    return classNames[maxPosterior]
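
With more features or narrower kernels, products of priors and densities can underflow, so a common variant compares log-posteriors instead. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic two-class data, and the names `train_by_class`, `log_prior`, `kdes` and `predict_map` are hypothetical. It relies on the `logpdf` method of `gaussian_kde`.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Hypothetical training data: two classes, two features, unequal sizes.
train_by_class = {
    "a": rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2)),
    "b": rng.normal(loc=[4.0, 4.0], scale=1.0, size=(60, 2)),
}

# Empirical log-prior: class frequency in the training data.
n_total = sum(len(x) for x in train_by_class.values())
log_prior = {c: np.log(len(x) / n_total) for c, x in train_by_class.items()}

# One KDE per class, fitted on data of shape (d, n).
kdes = {c: gaussian_kde(x.T, bw_method="scott") for c, x in train_by_class.items()}

def predict_map(point):
    """Return the class with the largest log P(Y=c) + log P(X|Y=c)."""
    scores = {c: log_prior[c] + kdes[c].logpdf(point)[0] for c in kdes}
    return max(scores, key=scores.get)

print(predict_map(np.array([0.0, 0.0])), predict_map(np.array([4.0, 4.0])))
```

Because the logarithm is monotonic, the argmax is the same as with the plain product whenever the product does not underflow.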

#### Re-evaluate the model (using the test set!)

In [22]:
test = pd.concat((X_test, y_test), axis = 1)
test.species.value_counts()

virginica     12
versicolor    11
setosa         7
Name: species, dtype: int64

In [23]:
test['predicted'] = test.apply(evaluate_post, axis = 1)
test.groupby('species').predicted.value_counts()

species     predicted 
setosa      setosa         7
versicolor  versicolor    11
virginica   virginica     11
            versicolor     1
Name: predicted, dtype: int64
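
The table above can be condensed into an accuracy figure and a confusion matrix. The sketch below rebuilds the true and predicted labels from the counts reported in that table (it does not rerun the model) and uses `accuracy_score` and `confusion_matrix` from scikit-learn; the variable names are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["setosa", "versicolor", "virginica"]

# Labels rebuilt from the counts in the table above.
y_true = ["setosa"] * 7 + ["versicolor"] * 11 + ["virginica"] * 12
y_pred = (["setosa"] * 7 + ["versicolor"] * 11
          + ["virginica"] * 11 + ["versicolor"] * 1)

acc = accuracy_score(y_true, y_pred)                   # 29 correct out of 30
cm = confusion_matrix(y_true, y_pred, labels=labels)   # rows: true, columns: predicted
print(round(acc, 4))
print(cm)
```

On the 30 held-out samples, a single virginica is classified as versicolor, so the accuracy is 29/30, about 96.7%.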