## Iris flower recognition using naive Bayesian inference

As mentioned above, we will use naive Bayesian inference to determine the variety of a given flower from its petal and sepal dimensions.

Although this method might not be the best choice for this problem (the features are highly correlated*, so naive Bayesian inference gives a rather poor approximation of the posterior probabilities), it is a good opportunity to dive into the topic.

\* See `Statistical analysis of data.ipynb` for an in-depth analysis.
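
For reference, the naive Bayes rule behind the rest of this notebook can be written as below. The symbols are notation introduced here for illustration: $x_1, \dots, x_4$ are the four measurements, $c$ is a class, and $\hat{p}$ is an estimated per-feature density.

```latex
P(c \mid x_1, \dots, x_4) \;\propto\; P(c) \prod_{i=1}^{4} \hat{p}(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{4} \log \hat{p}(x_i \mid c) \Big]
```

The product form is exact only when the features are independent given the class, which is why correlated features distort the posterior estimates even when the predicted class is still correct.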

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt

In [2]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
columns = features + ['classname']
dataset = pd.read_csv("iris-data/iris.data", names=columns)
classes = dataset["classname"].unique()

#### Preparing the training and testing sets

In [3]:
# Training set: the first 40 records of each class (.loc slices include both endpoints)
train_set = pd.concat([dataset.loc[0:39], dataset.loc[50:89], dataset.loc[100:139]])
train_set.reset_index(drop=True, inplace=True)

# Testing set: the remaining 10 records of each class
test_set = pd.concat([dataset.loc[40:49], dataset.loc[90:99], dataset.loc[140:149]])
test_set.reset_index(drop=True, inplace=True)

# Shuffle the testing set (unseeded, so the order differs between runs)
test_set = test_set.reindex(np.random.permutation(test_set.index))
test_set.reset_index(drop=True, inplace=True)
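
The shuffle above draws from NumPy's global random state, so the record order changes on every run. Here is a minimal sketch of a reproducible alternative, assuming a fixed seed is acceptable. The toy frame and the seed value `0` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_set (hypothetical values, not the Iris data)
toy = pd.DataFrame({"petal_length": [1.4, 4.5, 6.0, 1.5],
                    "classname": ["a", "b", "c", "a"]})

# A seeded generator yields the same permutation on every run
rng = np.random.default_rng(0)
shuffled = toy.iloc[rng.permutation(len(toy))].reset_index(drop=True)
```

The same pattern would apply to `test_set` in place of `toy`.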

#### Extracting KDEs

In [4]:
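# kdes[class][feature] holds a one-dimensional Gaussian kernel density estimate
kdes = {}

# Note: the estimates are fitted on the full dataset, not only on train_set
for c in classes:
    filtr = (dataset['classname'] == c)
    subset = dataset.loc[filtr]
    kdes[c] = {}
    
    for feature in features:
        kdes[c][feature] = stats.gaussian_kde(np.asarray(subset[feature]))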
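
To make the role of `stats.gaussian_kde` concrete, here is a small sketch on a made-up one-dimensional sample (the numbers are hypothetical, not Iris measurements). It shows that `pdf` returns an array even for a scalar input and that the estimate integrates to 1.

```python
import numpy as np
from scipy import stats

# Hypothetical one-dimensional sample (not taken from the Iris data)
sample = np.array([4.9, 5.0, 5.1, 5.4, 5.8, 6.1])

# Bandwidth is chosen automatically (Scott's rule by default)
kde = stats.gaussian_kde(sample)

# pdf() accepts a scalar or an array and returns an array in both cases
density = kde.pdf(5.2)

# A probability density integrates to 1 over the whole real line
total_mass = kde.integrate_box_1d(-np.inf, np.inf)
```

Because `pdf` returns a one-element array for a scalar input, values derived from it stay arrays unless they are converted explicitly.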

#### Naive Bayesian inference function

In [5]:
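def recognize(series, verbose=False):
    # Accumulated log-likelihood per class; starting all at 0 means a uniform prior
    hypos = {'Iris-setosa': 0., 'Iris-virginica': 0., 'Iris-versicolor': 0.}
    
    # Naive assumption: features are treated as independent given the class,
    # so the per-feature log-densities simply add up
    for feature in features:
        for c in classes:
            # The 0.003 offset keeps the logarithm finite where the KDE is close to zero
            hypos[c] += np.log(kdes[c][feature].pdf(series[feature]) + 0.003)
    
    # Most likely class (with equal priors, also the maximum a posteriori class)
    mle = max(hypos, key=hypos.get)
    
    if verbose:
        norm_const = 0
        logmle = float(hypos[mle])

        # Subtract the largest log-likelihood before exponentiating to avoid underflow
        for species in hypos:
            hypos[species] -= logmle
            hypos[species] = np.exp(hypos[species])
            norm_const += hypos[species]

        # Normalize so that the posteriors sum to 1;
        # each value is a one-element array, hence the `*` when printing
        for species in hypos:
            hypos[species] /= norm_const
            print("Posterior probability of '{}': \t{}".format(species, *hypos[species]))
    
    return mle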
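
The verbose branch above subtracts the largest log-likelihood before exponentiating. The sketch below shows why that step matters, using made-up log-likelihoods and `scipy.special.logsumexp` as a ready-made alternative to the manual normalization. The class names `A`, `B`, `C` and the numbers are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical per-class log-likelihoods (made-up numbers)
log_likelihoods = {"A": -1050.0, "B": -1052.0, "C": -1060.0}

names = list(log_likelihoods)
values = np.array([log_likelihoods[n] for n in names])

# Naive exponentiation underflows to 0 for every class
naive = np.exp(values)

# Subtracting log(sum(exp(values))) gives normalized posteriors (uniform prior assumed)
posteriors = dict(zip(names, np.exp(values - logsumexp(values))))
best = max(posteriors, key=posteriors.get)
```

Only the differences between log-likelihoods affect the result, so shifting all of them by the same constant leaves the posteriors unchanged.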

## Recognition

#### Testing set (verbose)

In [6]:
for i, test_record in test_set.iterrows():
    true = test_record['classname']
    predicted = recognize(test_record, verbose=True)
    print("[{}] True:      {}\n     Predicted: {}\n".format(
        "OK" if true == predicted else "NO", true, predicted))

Posterior probability of 'Iris-setosa': 	3.177495167077363e-08
Posterior probability of 'Iris-virginica': 	0.9989810271809818
Posterior probability of 'Iris-versicolor': 	0.001018941044066571
[OK] True:      Iris-virginica
     Predicted: Iris-virginica

Posterior probability of 'Iris-setosa': 	5.453431122904182e-07
Posterior probability of 'Iris-virginica': 	0.00034006716521284167
Posterior probability of 'Iris-versicolor': 	0.9996593874916748
[OK] True:      Iris-versicolor
     Predicted: Iris-versicolor

Posterior probability of 'Iris-setosa': 	4.414917618741479e-08
Posterior probability of 'Iris-virginica': 	0.9996910977357467
Posterior probability of 'Iris-versicolor': 	0.0003088581150769452
[OK] True:      Iris-virginica
     Predicted: Iris-virginica

Posterior probability of 'Iris-setosa': 	7.440282765213478e-08
Posterior probability of 'Iris-virginica': 	0.9999812560224951
Posterior probability of 'Iris-versicolor': 	1.8669574677310635e-05
[OK] True:      Iris-virginica
     

#### Testing set (accuracy)

In [15]:
incorrect = 0
size = test_set.shape[0]

for i, test_record in test_set.iterrows():
    true = test_record['classname']
    predicted = recognize(test_record, verbose=False)
    if true != predicted:
        incorrect += 1

print("Bad answers: {} / {}".format(incorrect, size))
print("Accuracy:    {:.2f}%".format(100 * (size - incorrect) / size))

Bad answers: 0 / 30
Accuracy:    100.00%


#### All data (accuracy)

In [16]:
incorrect = 0
size = dataset.shape[0]

for i, test_record in dataset.iterrows():
    true = test_record['classname']
    predicted = recognize(test_record, verbose=False)
    if true != predicted:
        incorrect += 1

print("Bad answers: {} / {}".format(incorrect, size))
print("Accuracy:    {:.2f}%".format(100 * (size - incorrect) / size))

Bad answers: 5 / 150
Accuracy:    96.67%
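
An overall accuracy figure does not show which classes get confused with each other. Here is a hedged sketch of a per-class breakdown with `pd.crosstab`. The two label lists are made up for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical labels (not the notebook's actual predictions)
true_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
predicted_labels = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(pd.Series(true_labels, name="true"),
                        pd.Series(predicted_labels, name="predicted"))

# Fraction of records whose prediction matches the true label
accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
```

In this notebook, the two lists could be collected inside the loop over `dataset.iterrows()`.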
