[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lfmartins/introduction-to-computational-mathematics/blob/main/19-introduction-to-machine-learning.ipynb)
# Introduction

In this notebook, we explore Machine Learning (ML) algorithms for data classification. In this kind of problem, the data is organized as follows:

- An array `X` with one data point per row. We let $n_X$ be the number of data points, which is the number of rows of `X`.
- For each data point, the values of the observed _features_. The number of features is denoted by $n_f$ and is the number of columns of `X`.
- An array `y` of _labels_ or _targets_. Array `y` has $n_X$ entries, and entry `y[i]` represents the class to which data point `X[i]` belongs.

The goal of classification is to define a _model_ that can accurately predict to which class a data point belongs, based on the values of the observed features.
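
As a small illustration of this layout, the following sketch builds a made-up dataset with four data points and two features. The numbers are hypothetical and serve only to show how `X`, `y`, $n_X$ and $n_f$ relate to each other.

```python
import numpy as np

# Hypothetical toy data: 4 data points (rows), 2 features (columns).
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9],
              [6.7, 3.1]])
# One label per data point; here there are two classes, 0 and 1.
y = np.array([0, 0, 1, 1])

n_X, n_f = X.shape  # number of rows, number of columns
print(f'n_X = {n_X}, n_f = {n_f}')
print(f'Data point 2 has features {X[2]} and belongs to class {y[2]}')
```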

We start by importing the tools we will use.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import datasets
import pandas as pd

# The Iris Dataset

For this example, we will use the Iris Dataset, a small dataset that is well suited to experimenting with classification algorithms. For a detailed description of the dataset, see the following [Wikipedia article](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The Iris Dataset is provided as a "toy dataset" in the `sklearn` library:

In [None]:
from sklearn import datasets
iris = datasets.load_iris(as_frame=True)

Let's take a peek at the data:


In [None]:
iris.frame

Each row corresponds to an observed iris flower. The features observed are the first four columns of the data frame. The last column specifies the class to which each data point belongs.

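The features and labels are also available separately, which is the form expected by ML algorithms. Because we loaded the data with `as_frame=True`, they are `pandas` objects rather than plain `numpy` arrays; `sklearn` accepts both.
The `iris.data` data frame contains the features observed for each data point: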

In [None]:
iris.data[:5]

The `iris.target` series contains the labels associated with each data point:

In [None]:
iris.target[:5]

The members `iris.feature_names` and `iris.target_names` contain the names of the features and classes:

In [None]:
print(f'Feature names: {iris.feature_names}')
print(f'Class names: {iris.target_names}')
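
The integer labels and the class names can be combined. The sketch below, which reloads the dataset so that it runs on its own, uses the labels as indices into `iris.target_names` and counts the data points in each class. The variable name `species` is introduced here only for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)

# iris.target holds integer codes and iris.target_names is an array of strings,
# so indexing the names with the codes gives the class name of each data point.
species = iris.target_names[iris.target.to_numpy()]
print(species[:5])

# Number of data points in each class.
class_counts = iris.target.value_counts().sort_index()
print(class_counts)
```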

A detailed description of the data set is also available:

In [None]:
print(iris.DESCR)

To visualize the dataset, let's plot a scatter matrix:

In [None]:
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_data, c=iris.target)
None

Looking at the plots, we can see that the three classes seem to be well separated by the given features. We will now construct a neural network model to perform the classification.

# Neural Network Model

An _Artificial Neural Network_ (ANN) is a model originally inspired by the structure of the human brain. Although it is a crude approximation, it is remarkably useful for many ML tasks. When used for classification tasks, the network consists of:

- An _input layer_, containing one node for each feature.
- One or more _hidden layers_, which perform the computations associated with the classification task.
- An _output layer_, with one node for each of the possible classes.

The parameters of an ANN are the strengths of the links between nodes. A detailed description of ANNs is beyond what can be presented here, but some information is available at [this link](https://www.ibm.com/cloud/learn/neural-networks).
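
To make the layer structure concrete, here is a minimal sketch of how one data point could flow through such a network. It is not how `sklearn` implements its networks internally. The layer sizes, the random untrained weights, the ReLU activation in the hidden layers and the softmax in the output layer are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer sizes: 4 input features, two hidden layers, 3 output classes.
layer_sizes = [4, 10, 5, 3]

# Random (untrained) link strengths: one weight matrix and one bias vector
# for each pair of consecutive layers.
weights = [rng.normal(size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Propagate one data point through the network; return class probabilities."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ W + b)   # hidden layer with ReLU activation
    z = a @ weights[-1] + biases[-1]     # output layer, one value per class
    e = np.exp(z - z.max())              # softmax, shifted for numerical stability
    return e / e.sum()

x = np.array([5.1, 3.5, 1.4, 0.2])  # four measurements of one flower
probs = forward(x)
print(probs, probs.argmax())
```

Because the weights are random, the output probabilities carry no meaning yet. Training consists of adjusting the weights and biases so that the largest probability falls on the correct class.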

To build a model for the iris classification task, we start by splitting the dataset into a _training set_ and a _test set_. The training set is used to fit the ANN parameters, and the test set is used to evaluate how well the model predicts the class of data it has not seen during training. This is illustrated in the next cell:

In [None]:
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

This splits the data into two subsets, with the test set containing 20% of the data:

In [None]:
len(train_data), len(test_data)
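
A random split does not guarantee that each class is equally represented in both subsets. As a possible variation, `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below reloads the data so that it runs on its own; the names `X_train`, `X_test`, `y_train` and `y_test` are chosen here only to avoid overwriting the variables used in the rest of the notebook.

```python
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)

# stratify=iris.target requests the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print('Training set class counts:', sorted(train_counts.items()))
print('Test set class counts:    ', sorted(test_counts.items()))
```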

To define the network, we use the `MLPClassifier` class from `sklearn`. (MLP stands for _multi-layer perceptron_, a common type of ANN.)

In [None]:
model = MLPClassifier(hidden_layer_sizes=[10, 5], max_iter=2000, random_state=42)
model.fit(train_data, train_labels)

_Note_: you may receive a message regarding convergence of the algorithm. The results may still be usable, but increasing the value of `max_iter` can improve convergence.
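
After fitting, the model exposes a few attributes that help judge convergence and inspect the fitted parameters. The sketch below repeats the split and the fit from the cells above so that it can run on its own, and then prints the number of iterations used, the final value of the loss, and the shape of each weight matrix.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris(as_frame=True)
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

model = MLPClassifier(hidden_layer_sizes=[10, 5], max_iter=2000, random_state=42)
model.fit(train_data, train_labels)

# If n_iter_ equals max_iter, the optimizer stopped because it hit the limit.
print(f'Iterations used: {model.n_iter_} (limit: {model.max_iter})')
print(f'Final training loss: {model.loss_:.4f}')

# One weight matrix per pair of consecutive layers: 4 -> 10 -> 5 -> 3.
weight_shapes = [W.shape for W in model.coefs_]
print(weight_shapes)
```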

We are now ready to make predictions for the data. We start with the test set, printing each prediction next to the corresponding true label:

In [None]:
predictions_test = model.predict(test_data)
for p, t in zip(predictions_test, test_labels):
    print(p, t)

We can see that our predictions are highly accurate. In the next code cell, we quantify the accuracy of the predictions on both the training set and the test set.

In [None]:
predictions_train = model.predict(train_data)
# accuracy_score expects the true labels first, then the predictions
train_score = accuracy_score(train_labels, predictions_train)
print(f'Training set accuracy: {train_score}')
predictions_test = model.predict(test_data)
test_score = accuracy_score(test_labels, predictions_test)
print(f'Test set accuracy: {test_score}')

We can see that the accuracies are high and of comparable value for the training set and the test set. The accuracy on the test set is typically somewhat lower than on the training set, although with a dataset this small it can occasionally match or exceed it.
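
The accuracy is simply the fraction of predictions that agree with the true labels. The following sketch checks this on a small set of made-up labels and predictions (hypothetical values, not produced by the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for six data points.
true_labels = np.array([0, 1, 2, 2, 1, 0])
predicted = np.array([0, 1, 2, 1, 1, 0])

# Fraction of positions where the prediction matches the true label.
manual_accuracy = np.mean(predicted == true_labels)
library_accuracy = accuracy_score(true_labels, predicted)
print(f'Manual: {manual_accuracy:.3f}, accuracy_score: {library_accuracy:.3f}')
```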

Another way to display the accuracy of the model is to compute the _confusion matrix_:

In [None]:
confusion_matrix(train_labels, predictions_train)

Row `i` of the matrix corresponds to true class `i`, and column `j` to predicted class `j`. The diagonal entries are the number of correct classifications in each class, and the other entries represent incorrect classifications. For the test set we have:

In [None]:
confusion_matrix(test_labels, predictions_test)
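
The confusion matrix contains more information than the overall accuracy. When the function is called as `confusion_matrix(true, predicted)`, each row collects the data points of one true class, so dividing the diagonal by the row sums gives the fraction of each class that was classified correctly. The sketch below uses made-up labels and predictions (hypothetical values) to show the computation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for eight data points.
true_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])
predicted = np.array([0, 0, 1, 2, 1, 2, 2, 1])

cm = confusion_matrix(true_labels, predicted)
print(cm)

# Overall accuracy: correct classifications (diagonal) over all data points.
overall = cm.trace() / cm.sum()
# Per-class accuracy: diagonal entry over the number of points in that true class.
per_class = cm.diagonal() / cm.sum(axis=1)
print(f'Overall accuracy: {overall}')
print(f'Fraction correct in each class: {per_class}')
```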

The fact that the classification is so remarkably good is a consequence of using a simple dataset whose classes are clearly delineated. We should not expect such high accuracy for a more complex dataset.

# Exercises

Experiment with the configuration of the ANN in the call to the constructor `MLPClassifier` and compare the results you get for different configurations. The documentation for the constructor can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Here are some of the things that can be tried:

- Experiment with the relative sizes of the training set and test set.
- Change the number of hidden layers.
- Change the number of neurons in each layer.
- Try different activation functions. The default activation function is `relu`, which is zero for negative values and linear for positive values. Two other alternatives are `logistic` and `tanh`.
- Try different optimization algorithms, selected with the `solver` parameter. `sgd` is a stochastic gradient descent algorithm, and `lbfgs` is a quasi-Newton method. The default is `adam`, which is a stochastic gradient-based optimizer.
- Experiment with the `alpha` parameter. `alpha` is a regularization parameter. Regularization is a procedure used to prevent overfitting, and is described in [this link](https://www.simplilearn.com/tutorials/machine-learning-tutorial/regularization-in-machine-learning).
- Experiment with `learning_rate_init`, which sets the initial size of each step taken by the `sgd` and `adam` solvers, and with `learning_rate`, which selects the schedule used to adjust the step size when the solver is `sgd`.
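
One possible way to organize these experiments is to loop over a list of configurations and record the test accuracy of each. The sketch below is only a starting point. The three configurations are arbitrary examples, and convergence warnings are silenced so that the output stays readable.

```python
import warnings
from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris(as_frame=True)
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Arbitrary example configurations; edit this list to try other settings.
configs = [
    dict(hidden_layer_sizes=(10, 5), activation='relu', solver='adam'),
    dict(hidden_layer_sizes=(20,), activation='tanh', solver='adam'),
    dict(hidden_layer_sizes=(10, 5), activation='logistic', solver='lbfgs'),
]

results = []
for config in configs:
    model = MLPClassifier(max_iter=2000, random_state=42, **config)
    with warnings.catch_warnings():
        # Hide convergence messages for this comparison only.
        warnings.simplefilter('ignore', category=ConvergenceWarning)
        model.fit(train_data, train_labels)
    score = accuracy_score(test_labels, model.predict(test_data))
    results.append((config, score))
    print(f'{score:.3f}  {config}')
```

With a test set of only 30 data points, small differences in accuracy between configurations may be due to chance, so it is worth repeating the comparison with several values of `random_state`.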
