In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Model Building and Feature Selection


## Read Data and Pre-process

In [None]:
# Read data
df = pd.read_csv('https://datahub.io/machine-learning/iris/r/iris.csv')

# Create new column to make numeric values of the three classes of irises
reclass = {'Iris-setosa': 1, 'Iris-versicolor': 3, 'Iris-virginica': 5}
df['reclass'] = df['class'].map(reclass)
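
As a quick sanity check on the mapping step, note that `Series.map` with a dictionary leaves any label missing from the dictionary as `NaN`. The sketch below uses a small hand-made Series instead of the downloaded data; the `'Iris-unknown'` label is hypothetical and is included only to trigger the gap.

```python
import pandas as pd

# Hand-made labels; 'Iris-unknown' is a hypothetical value not in the mapping
labels = pd.Series(['Iris-setosa', 'Iris-versicolor',
                    'Iris-virginica', 'Iris-unknown'])
reclass = {'Iris-setosa': 1, 'Iris-versicolor': 3, 'Iris-virginica': 5}

# Labels absent from the dictionary become NaN
codes = labels.map(reclass)
n_unmapped = int(codes.isna().sum())
print(f'{n_unmapped} label(s) could not be mapped')
```

Running the same `isna().sum()` check on `df['reclass']` would confirm that every row received a numeric class.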

In [None]:
df

## Feature Selection

Inspect data to find features that are:
* irrelevant and noisy
* weakly relevant and redundant
* weakly relevant and non-redundant
* strongly relevant

Look at a heatmap of the correlation coefficients.

In [None]:
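# numeric_only=True skips the string 'class' column, which pandas >= 2.0
# would otherwise fail to convert
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap=plt.cm.coolwarm_r, vmin=-1, vmax=1);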

No feature looks too weakly relevant. The feature with the lowest correlation is anti-correlated with the other variables, so it likely has discriminatory value and is worth keeping.
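
To read the same information off as numbers instead of colors, one option is to rank the features by the absolute value of their correlation with the numeric class column. This is a minimal sketch that assumes scikit-learn's bundled copy of the iris data as a stand-in for the CSV; the column names are renamed by hand to match the notebook, and `demo`, `corr_with_class`, and `ranked` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for the downloaded CSV
iris = load_iris(as_frame=True)
demo = iris.data.copy()
demo.columns = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']

# Recreate the numeric class column (bundled targets are coded 0, 1, 2)
demo['reclass'] = iris.target.map({0: 1, 1: 3, 2: 5})

# Correlation of each feature with the class, then rank by magnitude
corr_with_class = demo.corr()['reclass'].drop('reclass')
ranked = corr_with_class.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value keeps a strongly anti-correlated feature near the top, which matches how the heatmap should be read.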

## Split Data

1. Split the data into features and a target.

2. Split the data into training and testing data sets.

Scikit-Learn: https://scikit-learn.org/stable/index.html

In [None]:
# First split data into a feature matrix (X) and a target vector (y)
X_iris = df['sepallength'].values[:, np.newaxis]
y_iris = df['class']

In [None]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)

In [None]:
print(f'The shape of the feature training matrix is {Xtrain.shape}.')
print(f'The shape of the target training matrix is {ytrain.shape}.')

In [None]:
print(f'The shape of the feature testing matrix is {Xtest.shape}.')
print(f'The shape of the target testing matrix is {ytest.shape}.')
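
The split above relies on the default `test_size` of 0.25. A possible variant, sketched here on scikit-learn's bundled iris data (an assumption, used so the block runs without the CSV), passes `stratify` so that each class keeps roughly the same share in the training and testing sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data, targets coded 0, 1, 2
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1)

print(X_tr.shape, X_te.shape)

# Number of test samples per class
test_counts = np.bincount(y_te)
print(test_counts)
```

With only 150 samples, an unstratified split can leave one class noticeably under-represented in the test set, which makes per-class scores noisier.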

## Model

### Basics of the Scikit-Learn API

Most commonly, the steps in using the Scikit-Learn estimator API are as follows
(we will step through a handful of detailed examples in the sections that follow).

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the ``fit()`` method of the model instance.
5. Apply the Model to new data:
   - For supervised learning, often we predict labels for unknown data using the ``predict()`` method.
   - For unsupervised learning, we often transform or infer properties of the data using the ``transform()`` or ``predict()`` method.

We will now step through several simple examples of applying supervised and unsupervised learning methods.

*From the Python Data Science Handbook section 5.2*
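
The rest of this notebook only exercises the supervised path, so here is a minimal sketch of the same five steps for an unsupervised estimator. It assumes PCA as the example model and scikit-learn's bundled iris data; neither choice comes from the handbook excerpt above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA        # 1. choose model class

pca = PCA(n_components=2)                    # 2. instantiate with hyperparameters
X_demo, _ = load_iris(return_X_y=True)       # 3. arrange data (no target needed)
pca.fit(X_demo)                              # 4. fit model to data
X_2d = pca.transform(X_demo)                 # 5. transform the data

print(X_2d.shape)
```

The only structural difference from the supervised case is that `fit()` takes just the features matrix and the last step calls `transform()` rather than `predict()`.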

In [None]:
# Step 3 (arranging X and y) was done in the Split Data section
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 4. fit model to data
y_model = model.predict(Xtest)             # 5. predict on new data

## Model Evaluation

Let's start by comparing each true value with its modelled value and flagging any mismatches.

In [None]:
# Use a loop to check each value. We only have 38 values reserved for
# testing, so this isn't too onerous.
for i in range(len(y_model)):
    print(f'True value: {ytest.iloc[i]:15s} Modelled Value: {y_model[i]}')
    if ytest.iloc[i] != y_model[i]:
        print('Missed prediction')
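
For larger test sets, a vectorized comparison scales better than printing every row. The following is a sketch on scikit-learn's bundled iris data (an assumption, so the block is self-contained), using only the first column, sepal length, to mirror the one-feature model; names such as `missed` and `missed_idx` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled iris data; column 0 is sepal length
X_all, y_all = load_iris(return_X_y=True)
X_one = X_all[:, [0]]  # keep a 2-D shape of (n_samples, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_one, y_all, random_state=1)
y_hat = GaussianNB().fit(X_tr, y_tr).predict(X_te)

# Boolean mask of misclassified samples, their count and positions
missed = y_te != y_hat
n_missed = int(missed.sum())
missed_idx = np.flatnonzero(missed)
print(f'{n_missed} of {len(y_te)} test samples were misclassified')
```

In this notebook the equivalent mask would be `ytest.values != y_model`, since `ytest` is a pandas Series and `y_model` is a NumPy array.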

### Confusion Matrix
A confusion matrix is a compact way of checking how the predictions compare to the known values. Since we have three possible categories, we should get a 3x3 matrix. Ideally all of the counts fall on the diagonal, which would imply a perfect prediction.

In [None]:
from sklearn.metrics import confusion_matrix

# Confusion matrix function makes easy work of obtaining matrix
mat = confusion_matrix(ytest, y_model)

# Use seaborn to make a heatmap of the confusion matrix. The matrix is
# transposed (mat.T) so that true labels run along the x-axis.
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=model.classes_, yticklabels=model.classes_, cmap=plt.cm.BuPu)
plt.xlabel('true label')
plt.ylabel('predicted label');
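
To make the layout concrete, here is a tiny hand-built example with made-up labels `'a'`, `'b'`, and `'c'` (not iris data). `confusion_matrix` puts true labels on the rows and predicted labels on the columns, which is why the plot above transposes the matrix before labelling the x-axis "true label".

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

# Rows are true labels, columns are predicted labels
toy_mat = confusion_matrix(y_true, y_pred, labels=['a', 'b', 'c'])
print(toy_mat)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Correct predictions sit on the diagonal
n_correct = int(toy_mat.trace())
print(f'{n_correct} of {toy_mat.sum()} predictions are correct')
```

Here the first row says that of the two true `'a'` samples, one was predicted as `'a'` and one as `'b'`.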

## Build a better model

Now let's keep more than one predictor to see if we can improve our overall model.

Start by re-subsetting the data, dropping only the unneeded columns ('class', 'reclass'), and then re-splitting into train and test data sets.

In [None]:
# Get feature matrix (four predictors)
X_iris = df.drop(columns=['class', 'reclass'])

# Get target matrix
y_iris = df['class']

# Split dataset into train, test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)

In [None]:
# Re-instantiate the model just to be sure
model = GaussianNB()                       # 2. instantiate model

# Fit with new training data
model.fit(Xtrain, ytrain)                  # 4. fit model to data

# Predict the test data to evaluate model
y_model = model.predict(Xtest)             # 5. predict on new data
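
A single train/test split can be lucky or unlucky, so one way to check whether the extra predictors really help is to compare cross-validated scores. This is a sketch, not part of the original workflow: it assumes scikit-learn's bundled iris data and 5-fold cross-validation, and the names `scores_one` and `scores_all` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled iris data; column 0 is sepal length
X_all, y_all = load_iris(return_X_y=True)

# Accuracy on each of 5 folds, one predictor versus all four
scores_one = cross_val_score(GaussianNB(), X_all[:, [0]], y_all, cv=5)
scores_all = cross_val_score(GaussianNB(), X_all, y_all, cv=5)

print(f'sepal length only: {scores_one.mean():.3f}')
print(f'all four features: {scores_all.mean():.3f}')
```

Averaging over folds gives a steadier comparison than the 38-sample test set alone.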

### Other Evaluation Methods

Accuracy score is the fraction of test samples whose predicted class matches the true class.

Classification report gives a more detailed look, listing precision, recall, and F1-score for each class in the prediction scheme.

Other measures: https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(ytest, y_model)

In [None]:
print(classification_report(ytest, y_model))
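
The numbers in the report can be recovered from a confusion matrix by hand. The sketch below uses made-up labels (not iris data) and assumes scikit-learn's convention of true labels on rows and predicted labels on columns: precision divides the diagonal by the column sums, and recall divides it by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']
labels = ['a', 'b', 'c']

mat = confusion_matrix(y_true, y_pred, labels=labels)
diag = np.diag(mat)

precision = diag / mat.sum(axis=0)   # correct / everything predicted as the class
recall = diag / mat.sum(axis=1)      # correct / everything truly in the class
accuracy = diag.sum() / mat.sum()    # correct / all samples

# Cross-check against scikit-learn's per-class scores
sk_precision = precision_score(y_true, y_pred, labels=labels, average=None)
sk_recall = recall_score(y_true, y_pred, labels=labels, average=None)
print(precision, recall, accuracy)
```

A class with high recall but low precision is being over-predicted, which is easy to miss when looking at accuracy alone.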

In [None]:
# Show the confusion matrix for our better model
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=model.classes_, yticklabels=model.classes_, cmap=plt.cm.BuPu)
plt.xlabel('true label')
plt.ylabel('predicted label');

## Further Reading

For a deeper dive into the basics of Scikit-Learn, hyperparameters, model evaluation, and feature engineering, see Sections 5.2, 5.3, and 5.4 of the Python Data Science Handbook, especially the Google Colab notebooks.