# Example application

Here we will go through the steps of a simple machine learning application based on the Iris data set. This involves:
* A simple analysis and visualization of the data
* Building the model
* Making predictions
* Initial thoughts on how to measure the quality of the developed model/application

## Load the data and print a bit of information

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The `iris_dataset` is a so-called `Bunch` object, which behaves similarly to a dictionary with keys and values:

In [None]:
print(iris_dataset.keys())
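As a small aside on the dictionary-like behavior, the sketch below illustrates that the entries of a `Bunch` can be reached either by key or as attributes. The variable name `same_object` is introduced here purely for illustration.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key-style access (as with a dictionary) and attribute-style access
# both return the very same underlying array.
same_object = iris_dataset['data'] is iris_dataset.data
print("Key and attribute access give the same object: {}".format(same_object))
```

The rest of this walkthrough sticks to key-style access.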

The value of `DESCR` is a description of the data:

In [None]:
print(iris_dataset['DESCR'][:1200] + "...\n")

The target labels, i.e., the iris species (given as a numpy array of strings):

In [None]:
print(iris_dataset['target_names'])

... and the features (given as a list of strings):

In [None]:
print(iris_dataset['feature_names'])

### The data itself

The data itself is contained under the keys `data` and `target`, both of which are numpy arrays:

In [None]:
print("Type of data: {}".format(type(iris_dataset['data'])))

The rows of `data` correspond to flowers, while the columns correspond to the four features:

In [None]:
print("Number of flowers and number of features: {}".format(iris_dataset['data'].shape))

First ten measurements:

In [None]:
print(iris_dataset['data'][:10])

We can also calculate some summary statistics for the features (even though this information was already provided in the `DESCR` text printed above):

In [None]:
import numpy as np

print("Mean values: {}".format(np.mean(iris_dataset['data'], axis=0)))
print("Standard deviations: {}".format(np.std(iris_dataset['data'], axis=0)))

The `target` (numpy) array contains the species of the individual flowers:

In [None]:
print("Shape of the target array: {}".format(iris_dataset['target'].shape))

The species are encoded as 0, 1, and 2, and the flowers are sorted according to their species category:

In [None]:
print("Target array: {}".format(iris_dataset['target']))
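To complement the printed target array, here is a minimal sketch that counts the number of flowers per species and checks that the labels are indeed sorted. The names `counts` and `is_sorted` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']

# Count how many flowers carry each of the labels 0, 1, and 2.
counts = np.bincount(target)
for name, count in zip(iris_dataset['target_names'], counts):
    print("{}: {}".format(name, count))

# The labels are sorted if they never decrease from one flower to the next.
is_sorted = bool(np.all(np.diff(target) >= 0))
print("Labels sorted by species: {}".format(is_sorted))
```

Because the flowers are ordered by species, any split into training and test data should shuffle (or stratify) the rows first.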

#### Visualization

To create the scatter plots, we first convert the data into a pandas dataframe and then use pandas for the plotting:

In [None]:
import pandas as pd

# Create dataframe
iris_df = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])
iris_df[0:10]

In [None]:
# Create scatter plot and color by class label.
%matplotlib inline
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(iris_df, c=iris_dataset['target'], figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 20}, s=60, alpha=.8)

## Building a model

For illustrating the process we use the naïve Bayes model (more on this later). This model
* defines a probability distribution over the features and the target variable
* assumes that the features are conditionally independent of each other given the class/target
* assumes that, conditional on the class, each feature follows a Gaussian distribution

$$ f(x_i|y) = \frac{1}{\sqrt{2\cdot \pi\cdot \sigma_y^2}}\exp \bigg (\frac{-(x_i-\mu_y)^2}{2\cdot \sigma_y^2}\bigg )$$

<img src="normal.png">
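To make the density formula concrete, the following sketch evaluates it with plain Python. The function name `gaussian_density` and the parameter values are hypothetical and chosen only for illustration; they are not estimates from the Iris data.

```python
import math


def gaussian_density(x, mu, sigma2):
    """Evaluate the Gaussian density with mean mu and variance sigma2 at x."""
    normalizer = math.sqrt(2 * math.pi * sigma2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / normalizer


# Hypothetical class-conditional parameters (illustration only).
mu, sigma2 = 5.0, 0.25

at_mean = gaussian_density(5.0, mu, sigma2)       # density at the mean
one_sd_away = gaussian_density(5.5, mu, sigma2)   # one standard deviation above
print("Density at the mean: {:.4f}".format(at_mean))
print("Density one standard deviation away: {:.4f}".format(one_sd_away))
```

The density is highest at the mean and drops by a factor of $e^{-1/2}$ one standard deviation away, which matches the bell shape in the figure.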

In scikit-learn, each ML algorithm is implemented in its own class (for naïve Bayes it is `GaussianNB` in `sklearn.naive_bayes`), which must be instantiated before use.

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

Use the `fit` method to learn the model; this method takes as arguments the training data and the corresponding labels:

In [None]:
gnb.fit(iris_dataset['data'], iris_dataset['target'])
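To give an impression of what fitting amounts to for this model, the sketch below inspects two of the fitted attributes, `class_prior_` and `theta_`, and compares the stored means with per-class averages computed directly from the data. The helper name `manual_means` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

gnb = GaussianNB()
gnb.fit(X, y)

# Per-class feature means computed directly from the data.
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])

print("Class priors: {}".format(gnb.class_prior_))
print("Means stored by the model:\n{}".format(gnb.theta_))
print("Means match the per-class averages: {}".format(
    np.allclose(gnb.theta_, manual_means)))
```

Each row of `theta_` belongs to one species and each column to one feature, so the model keeps one Gaussian mean per class and feature.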

## Making predictions

We use the learned model to make predictions about new data instances for which we do not know the labels:

In [None]:
# New data organized in a two-dimensional array 
x_new = np.array([[5, 2.9, 1, 0.2]])

In [None]:
gnb.predict_proba(x_new)

In [None]:
predict = gnb.predict(x_new)

In [None]:
print("Prediction: {}".format(predict))
print("Target name: {}".format(iris_dataset['target_names'][predict]))
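The two prediction methods used above are closely related. As a minimal, self-contained sketch (refitting the model so that the block runs on its own), the class returned by `predict` should coincide with the class that receives the highest probability from `predict_proba`. The names `probabilities`, `most_probable`, and `predicted` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris_dataset = load_iris()
gnb = GaussianNB().fit(iris_dataset['data'], iris_dataset['target'])

x_new = np.array([[5, 2.9, 1, 0.2]])
probabilities = gnb.predict_proba(x_new)

# Pick, for each row, the class with the highest posterior probability.
most_probable = gnb.classes_[np.argmax(probabilities, axis=1)]
predicted = gnb.predict(x_new)

print("Same class from both routes: {}".format(
    np.array_equal(most_probable, predicted)))
print("Predicted species: {}".format(iris_dataset['target_names'][predicted]))
```

Looking at the probabilities in addition to the predicted label gives a sense of how confident the model is in its prediction.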

## Model criticism

One would typically make predictions on a collection of (known) instances. The performance of the algorithm on these instances would (usually) trigger a feedback loop, where you would
* go back and reanalyze/process your data
* revise your model
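One common way to obtain such a collection of known instances is to hold out part of the data before fitting. The sketch below is one possible setup, not the only one: the 25% test fraction, the fixed `random_state`, and the use of stratification are assumptions made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris_dataset = load_iris()

# Hold out a quarter of the flowers; stratify so that every species
# appears in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, random_state=0, stratify=iris_dataset['target'])

# Fit on the training part only, then measure accuracy on unseen flowers.
gnb = GaussianNB().fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)
print("Test set accuracy: {:.2f}".format(accuracy))
```

Accuracy on the held-out flowers is only a first indication of quality; a low score would be the trigger for revisiting the data or the model as described above.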