# Innovate Data Academy
[Krisolis](http://www.krisolis.ie)

## Simple Predictive Modelling in Python

This notebook demonstrates a simple predictive model using **decision trees** in **scikit-learn**.

### Package Imports

To build predictive models in Python we use a set of libraries that are imported here. Of these, **pandas** and **sklearn** are particularly important.

In [None]:
# General data handling
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 1000) 
pd.set_option('display.max_colwidth', 200)
import numpy as np

# Drawing plots
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning with scikit-learn
import sklearn
import sklearn.impute
import sklearn.model_selection
import sklearn.metrics
import sklearn.tree

# System packages for saving Python objects
import os
import pickle

### Load Data

To support data exploration and manipulation it is easiest to load datasets as **Pandas DataFrames**. In this example we will load **breast-cancer-wisconsin.csv**. This dataset was collected by analysing cells taken from patients suspected of having breast cancer. The task for the predictive model is to determine whether a sample belongs to a healthy ('*benign*') cell or a cancerous ('*malignant*') cell. The columns in this dataset are:

- **ID**: Unique patient ID
- **Clump Thickness**: A measure of clumping present in the cell (measured from 1 - 10)
- **Uniformity of Cell Size**: A measure of uniformity of the sizes of cells in the sample (measured from 1 - 10)
- **Uniformity of Cell Shape**: A measure of uniformity of the shapes of cells in the sample (measured from 1 - 10)
- **Marginal Adhesion**: The degree to which cells stick together (measured from 1 - 10)
- **Single Epithelial Cell Size**: The size of a single measured cell  (measured from 1 - 10)
- **Bare Nuclei**: The degree to which nuclei devoid of cytoplasm are present in the sample (measured from 1 - 10)
- **Bland Chromatin**: The degree of texture present in chromatin (measured from 1 - 10)
- **Normal Nucleoli**: Assessment of size of cells  (measured from 1 - 10)
- **Mitoses**: The degree to which mitoses are present (measured from 1 - 10)
- **Class**: One of 'benign' or 'malignant'. The target feature. 

Load the data from a csv file to a Pandas DataFrame and examine it. 

In [None]:
development_dataset_filename = '../data/breast-cancer-wisconsin.csv'
query_dataset_filename = '../data/breast-cancer-wisconsin-query.csv'
target_feature_name = 'Class'
index_col = 0

In [None]:
dataset = pd.read_csv(development_dataset_filename, index_col = index_col)
print(dataset.shape)
print(dataset.columns.to_list())
display(dataset.head())
display(dataset.tail())

Examine the distribution of the target feature.

In [None]:
dataset[target_feature_name].value_counts()
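It is often easier to judge class balance from proportions than from raw counts. The sketch below uses a small made-up Series (`target_demo` is a hypothetical stand-in for `dataset[target_feature_name]`) to show the `normalize` option of **value_counts**.

```python
import pandas as pd

# Hypothetical stand-in for dataset[target_feature_name]
target_demo = pd.Series(['benign', 'benign', 'malignant', 'benign'], name='Class')

# Absolute number of rows per class
counts = target_demo.value_counts()

# Fraction of rows per class (the fractions sum to 1)
proportions = target_demo.value_counts(normalize=True)

print(counts)
print(proportions)
```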

### Prepare Data for Modelling

There are two basic jobs that we need to do to prepare datasets for modelling using machine learning:
- Separate the dataset into two sets, one for the descriptive features and one for the target feature.
- Divide the data into two samples, one for training the model and one for evaluating it.

#### Separate Features

**sklearn** expects datasets to be contained in two sets: one for descriptive features (typically called `X`) and one for the target feature (typically called `y`). Here we extract the descriptive features for `X` and the target feature for `y`.

In [None]:
X = dataset.loc[:, dataset.columns != target_feature_name]
y = dataset[target_feature_name]

Examine `X` (a DataFrame) and `y` (a Series).

In [None]:
print(X.shape)
print(X.columns.to_list())
display(X.head(10))

In [None]:
print(y.shape)
display(y.head(10))
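An equivalent way to separate the features is to drop the target column by name. The sketch below uses a tiny made-up DataFrame (`toy`, `X_toy` and `y_toy` are hypothetical names used only for illustration) and checks that the rows of the two results stay aligned through the shared index.

```python
import pandas as pd

# Tiny made-up dataset with the same layout: an ID index, features, and a target
toy = pd.DataFrame(
    {'Clump Thickness': [5, 3, 8],
     'Mitoses': [1, 1, 4],
     'Class': ['benign', 'benign', 'malignant']},
    index=pd.Index([101, 102, 103], name='ID'))
target_feature_name = 'Class'

# Descriptive features: every column except the target
X_toy = toy.drop(columns=[target_feature_name])

# Target feature: a single column, returned as a Series
y_toy = toy[target_feature_name]

print(X_toy.columns.to_list())
print(X_toy.index.equals(y_toy.index))
```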

#### Partition Data

When building machine learning models it is very common to divide a dataset into **training**, **validation** and **test** partitions. These partitions are used for different roles in modelling. The **train_test_split** function from **sklearn** provides easy functionality to do this. The main parameters of the **train_test_split** function are:

- **X**: The dataset descriptive feature values.
- **y**: The dataset target feature values.
- **test_size**=None: The size of the test partition (either a value between 0.0 and 1.0 giving a proportion of the data, or an actual number of instances). 
- **train_size**=None: The size of the training partition (either a value between 0.0 and 1.0 giving a proportion of the data, or an actual number of instances). 
- **shuffle**=True: Should the data be shuffled before being partitioned?
- **stratify**=None: Allows us to set a target feature so that its distributions remain the same in the created partitions. 

We create a **training set** and a **validation set**.

In [None]:
X_train, X_valid, y_train, y_valid \
    = sklearn.model_selection.train_test_split(X, 
                                               y, 
                                               random_state = 0, 
                                               train_size = 0.7,
                                               stratify = y)

Examine the partitions created. 

In [None]:
print(X_train.shape)
print(X_train.columns.to_list())
display(X_train.head(10))

In [None]:
print(X_valid.shape)
print(X_valid.columns.to_list())
display(X_valid.head(10))

In [None]:
print(y_train.shape)
display(y_train.head())

In [None]:
print(y_valid.shape)
display(y_valid.head())
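Only two partitions are created above. If a separate **test** partition is also wanted, one possible approach is to call **train_test_split** twice. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` and the 60/20/20 proportions are assumptions chosen for illustration, not part of the original notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features scored 1-10, 65/35 class balance
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 3))
y_demo = np.array(['benign'] * 65 + ['malignant'] * 35)

# Step 1: hold back 20% of the data as the test partition
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Step 2: split the remaining 80% into training (60% overall)
# and validation (20% overall), i.e. 25% of the remainder
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(y_tr), len(y_val), len(y_test))
```

Stratifying both calls keeps the class balance similar in all three partitions.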

### Modelling: A Very Simple Decision Tree

To train a classification model **sklearn** uses a two-step process:

1. Create the model object with hyperparameters
2. Fit the model to the data (using the **fit** method). 

In this first example we will build a decision tree using the **DecisionTreeClassifier** object from **sklearn**. The key parameters passed when creating a DecisionTreeClassifier are:

- **criterion** = "gini": The criterion used to measure the quality of candidate splits during training. Common choices are "gini" and "entropy".
- **splitter** = "best": The strategy used to choose the split at each node in the tree. One of either "best" or "random".
- **max_depth** = None: The maximum depth that the tree is allowed to grow to. 
- **min_samples_split** = 2: The minimum number of samples required to split an internal node. 
- **min_samples_leaf** = 1: The minimum number of samples required to be at a leaf node.
- **class_weight** = None: A set of weights for classes - often used to handle imbalanced target values. 
- **ccp_alpha** = 0.0: Complexity parameter used for Minimal Cost-Complexity Pruning. 

We create a simple decision tree using the '*entropy*' splitting criterion. 

In [None]:
my_tree = sklearn.tree.DecisionTreeClassifier(criterion="entropy")
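To give a feel for what the "entropy" and "gini" criteria measure, the sketch below computes both impurity scores by hand for a list of class labels. The helper functions `entropy` and `gini` are hypothetical names written for this illustration and are not part of **sklearn**. Both scores are 0 for a pure node and largest when the classes are evenly mixed.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

def gini(labels):
    """Gini impurity of the class distribution in labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(p ** 2 for p in proportions)

mixed = ['benign', 'benign', 'malignant', 'malignant']   # evenly mixed node
pure = ['benign', 'benign', 'benign', 'benign']          # pure node

print(entropy(mixed), gini(mixed))
print(entropy(pure), gini(pure))
```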

We train the model using the **fit** method, passing it the training data sample created earlier. 

In [None]:
my_tree.fit(X_train, y_train)

Print the decision tree in a nice text format using the sklearn **export_text** function. The key parameters passed to **export_text** are:

- **decision_tree**: The fitted tree to be exported.
- **feature_names** = None: A list of the names of the features used in the tree.

In [None]:
print(sklearn.tree.export_text(my_tree, 
                               feature_names=X_train.columns.to_list()))

One of the advantages of using decision trees is that we can draw nice pictures of them. The **plot_tree** function from the **sklearn.tree** module does a nice job of this for a fitted **DecisionTreeClassifier**. The key parameters passed to **plot_tree** are:

- **decision_tree**: The tree to be plotted.
- **feature_names** = None: A list of the names of the features used in the tree.
- **class_names** = None: Names of each of the target classes in ascending numerical order.
- **filled** = False: Whether or not to colour the tree based on the target distribution at each node. 

In [None]:
fig = plt.figure(figsize=(10,10))
# Pass plain lists, which all sklearn versions accept for these parameters
_ = sklearn.tree.plot_tree(my_tree, 
                           feature_names = X_train.columns.to_list(),
                           class_names = list(my_tree.classes_),
                           filled = True)

### Evaluating Model Performance

To assess the performance of a trained model we typically use it to make predictions on a dataset for which we know the correct target values. For any **sklearn** model we use the **predict** method to make predictions. The key parameter for **predict** is **X**, which is the data for which predictions will be made. **predict** returns an array of predictions, one for each row in **X**. 

In [None]:
y_pred = my_tree.predict(X_train)

This gives us back one prediction for each row in the **X** DataFrame. (Note that these predictions are returned as a NumPy array, not a Pandas object.)

In [None]:
y_pred

Once we have a set of predictions we can calculate performance measures by comparing these to the ground truth. **sklearn** provides functions for a wide range of performance measures.  They almost all follow the same pattern - we provide them a list of ground truth target values and a list of predictions. Here we calculate simple accuracy using **accuracy_score**. The key parameters passed to **accuracy_score** are:

- **y_true**: The ground truth for predictions.
- **y_pred**: The predictions made by the model.

In [None]:
accuracy = sklearn.metrics.accuracy_score(y_train, y_pred) 
print("Accuracy: {}".format(accuracy))

This looks very good, but maybe too good to be true! Rather than assessing the performance of the tree on the training data sample we should do it on another sample that was not used during model training. Here we assess the performance of the tree on the **validation dataset**.

In [None]:
y_pred = my_tree.predict(X_valid)
accuracy = sklearn.metrics.accuracy_score(y_valid, y_pred) 
print("Accuracy: {}".format(accuracy))

There is a reasonably large difference between the accuracy score achieved with the training data sample and the accuracy achieved using the validation data sample. This is evidence of **overfitting**.
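Accuracy is simply the fraction of predictions that match the ground truth. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are illustrative lists, not taken from the dataset) to show the calculation by hand alongside **accuracy_score**.

```python
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions for five samples
y_true_demo = ['benign', 'malignant', 'benign', 'benign', 'malignant']
y_pred_demo = ['benign', 'malignant', 'malignant', 'benign', 'malignant']

# Count the positions where the prediction equals the ground truth
matches = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = matches / len(y_true_demo)

# The library function should give the same value
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```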

### Building Alternative Trees

We can change the hyperparameters used by a machine learning model to change its behaviour, sometimes quite dramatically. Here we create another decision tree, this time limiting its depth to 2 using the **max_depth** parameter. **max_depth** is an example of a hyperparameter of the decision tree algorithm. Set to 2, it only allows the algorithm to build trees with 2 levels. 

In [None]:
my_tree = sklearn.tree.DecisionTreeClassifier(criterion = "entropy", 
                                              max_depth = 2)
my_tree = my_tree.fit(X_train,y_train)

Print the decision tree. 

In [None]:
print(sklearn.tree.export_text(my_tree, 
                               feature_names=list(X_train.columns)))

Visualise the decision tree so we can see what it is doing!

In [None]:
fig = plt.figure(figsize=(10,10))
# Pass plain lists, which all sklearn versions accept for these parameters
_ = sklearn.tree.plot_tree(my_tree, 
                           feature_names = X_train.columns.to_list(),
                           class_names = list(my_tree.classes_),
                           filled = True)

Assess the performance of the decision tree on the **training set**.

In [None]:
y_pred = my_tree.predict(X_train)

accuracy = sklearn.metrics.accuracy_score(y_train, y_pred) 
print("Accuracy: {}".format(accuracy))

Assess the performance of the decision tree on the **validation set**.

In [None]:
y_pred = my_tree.predict(X_valid)

accuracy = sklearn.metrics.accuracy_score(y_valid, y_pred) 
print("Accuracy: {}".format(accuracy))

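A useful rule of thumb is to only allow trees to grow as far as having 5% - 10% of the data at any leaf node. We can approximate this using the **min_samples_split** hyperparameter, which can be given as a fraction of the training data: a node is then only split if it contains at least that fraction of the training samples. (The **min_samples_leaf** hyperparameter constrains the size of the leaves directly.)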

In [None]:
my_tree = sklearn.tree.DecisionTreeClassifier(criterion = "entropy", 
                                              min_samples_split = 0.1)
my_tree = my_tree.fit(X_train,y_train)

Visualise the decision tree so we can see what it is doing!

In [None]:
fig = plt.figure(figsize=(10,10))
_ = sklearn.tree.plot_tree(my_tree, 
                           feature_names = X_train.columns.to_list(),
#                           class_names = list(my_tree.classes_),
                           filled = True)
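The difference between **min_samples_split** and **min_samples_leaf** can be checked by looking at the size of the smallest leaf in a fitted tree. The sketch below uses synthetic data from **make_classification** (the data, the 0.1 setting and the helper `smallest_leaf` are assumptions made for illustration). With `min_samples_leaf = 0.1` no leaf can hold fewer than 10% of the training samples, whereas `min_samples_split = 0.1` only limits which nodes may be split, so smaller leaves can still occur.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 9 features, some label noise
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

def smallest_leaf(fitted_tree):
    """Return the number of training samples in the smallest leaf."""
    t = fitted_tree.tree_
    is_leaf = t.children_left == -1   # leaves have no left child
    return int(t.n_node_samples[is_leaf].min())

# Limit which nodes may be split: at least 10% of the samples
split_tree = DecisionTreeClassifier(min_samples_split=0.1,
                                    random_state=0).fit(X_demo, y_demo)

# Limit the leaves directly: every leaf holds at least 10% of the samples
leaf_tree = DecisionTreeClassifier(min_samples_leaf=0.1,
                                   random_state=0).fit(X_demo, y_demo)

print(smallest_leaf(split_tree), smallest_leaf(leaf_tree))
```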

Assess the performance of the decision tree on the **training set**.

In [None]:
y_pred = my_tree.predict(X_train)

accuracy = sklearn.metrics.accuracy_score(y_train, y_pred) 
print("Accuracy: {}".format(accuracy))

Assess the performance of the decision tree on the **validation set**.

In [None]:
y_pred = my_tree.predict(X_valid)

accuracy = sklearn.metrics.accuracy_score(y_valid, y_pred) 
print("Accuracy: {}".format(accuracy))
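Rather than refitting by hand for each setting, the training and validation accuracies can be compared across several hyperparameter values in a loop. The sketch below does this for **max_depth** on synthetic data (the data from **make_classification** and the list of depths are assumptions made for illustration). A training accuracy that keeps rising while the validation accuracy stalls or falls is the overfitting pattern described earlier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0)

results = {}
for depth in [1, 2, 4, 8, None]:   # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(criterion="entropy",
                                  max_depth=depth,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    valid_acc = accuracy_score(y_val, tree.predict(X_val))
    results[depth] = (train_acc, valid_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, valid={valid_acc:.3f}")
```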

### Deployment

The simplest way to deploy a machine learning model is to save the trained model out to a file using the Python **pickle** module. From the **pickle** module the **dump** function writes the data of any Python object to a file. The key parameters of the function are:

- **obj**: The object to write to the file.
- **file**: A file object to write to.

We open the file first using the built-in **open** function, to which we pass a file name and the '*wb*' mode to write binary data. Opening the file in a **with** block ensures that it is closed once the object has been written, even if an error occurs. 

In [None]:
with open("my_model.bin", 'wb') as file_object:
    pickle.dump(my_tree, file_object)

Now we can reload the saved object using the **load** function from the **pickle** module. It takes a file object as its only required parameter and returns the loaded object. 

In [None]:
# The with block closes the file automatically after loading
with open('my_model.bin', 'rb') as file_object:
    loaded_model = pickle.load(file_object)
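A quick way to check the round trip is to confirm that the reloaded model makes the same predictions as the original. The sketch below does this with a tiny made-up training set and a temporary directory (the data, `demo_model.bin` and the variable names are hypothetical, chosen so that the example does not touch `my_model.bin`).

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: two features, two classes
X_demo = [[1, 1], [2, 1], [8, 7], [9, 10]]
y_demo = ['benign', 'benign', 'malignant', 'malignant']
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.bin")

    # Save the trained model; the with block closes the file for us
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back into a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should behave exactly like the original
same_predictions = list(model.predict(X_demo)) == list(restored.predict(X_demo))
print(same_predictions)
```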

To use the model we can now load a dataset containing only the descriptive features so that we can make predictions for it. 

In [None]:
X_query = pd.read_csv(query_dataset_filename, index_col = index_col)
print(X_query.shape)
print(X_query.columns.to_list())
display(X_query.head())
display(X_query.tail())

As a simple way to use the model outputs we can make predictions for the data loaded and collect the results in a DataFrame, which could then be written out to a file. 

In [None]:
y_pred = loaded_model.predict(X_query)
predictions = pd.DataFrame({'ID' : X_query.index, 'prediction' : y_pred})
print(predictions.head())
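To write such a table out, the **to_csv** method of a DataFrame can be used. The sketch below writes a small made-up predictions table to an in-memory buffer so that the CSV text can be inspected (the table contents and the file name in the final comment are hypothetical).

```python
import io

import pandas as pd

# Made-up predictions table with one row per query sample
predictions_demo = pd.DataFrame({'ID': [1001, 1002],
                                 'prediction': ['benign', 'malignant']})

# Write to an in-memory buffer; index=False leaves out the row numbers
buffer = io.StringIO()
predictions_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write to disk instead, pass a path (hypothetical file name):
# predictions_demo.to_csv('predictions.csv', index=False)
```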