This notebook will give a brief overview of various ML topics I am experimenting with and learning about. Hope you enjoy it.

I will be playing with Haberman's well-known dataset on breast cancer surgeries performed at the University of Chicago from 1958 to 1970. The data was retrieved from the UCI Machine Learning Repository.

In [1]:
#start with importing your libraries - yes, it's a lot, but sklearn is an absolute joy to use!
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [2]:
habermans_dataset = pd.read_csv("haberman.csv")

In [3]:
habermans_dataset.head(1)

Unnamed: 0,30,64,1,1.1
0,30,62,3,1


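The file has no header row, so pandas has used the first record (30, 64, 1, 1) as the column names. It would be better if we had descriptive column names. This can be achieved easily! (Passing `header=None` to `read_csv` would also keep that first record as data.)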

In [4]:
habermans_dataset.columns = ["patient_age", "operation_year", "nodes_detected", "survival_status"]

In [5]:
habermans_dataset.head(1)

Unnamed: 0,patient_age,operation_year,nodes_detected,survival_status
0,30,62,3,1
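
As an alternative to renaming after the fact, here is a minimal sketch of naming the columns at load time. It assumes the file has no header row (which is what the first `head(1)` output suggests) and uses an in-memory stand-in built from the two records shown in this notebook rather than the real `haberman.csv`. The names `raw_csv`, `column_names` and `sample` are mine, purely for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for haberman.csv, holding only the two records
# visible in this notebook (the old header row and the first data row).
raw_csv = io.StringIO("30,64,1,1\n30,62,3,1\n")

column_names = ["patient_age", "operation_year", "nodes_detected", "survival_status"]

# header=None tells pandas there is no header row, so the first record
# stays in the data instead of being used as column names.
sample = pd.read_csv(raw_csv, header=None, names=column_names)

print(sample.shape)
print(sample.head(1))
```

Loaded this way, no record is lost to the header, so the real file should come out one row longer than the shape reported below.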


In [6]:
habermans_dataset.shape

(305, 4)

Gorgeous!

Let's see if we have any missing data!

In [7]:
# count the missing values in each column
for column in habermans_dataset.columns:
    missing_data = habermans_dataset[column].isnull().sum()
    print(column, missing_data)

patient_age 0
operation_year 0
nodes_detected 0
survival_status 0


Great, no missing data! Let's proceed!

Looks like we have a supervised learning problem here, as we have a labelled outcome - survival status. Let's check what type of classification problem this is, i.e. binary or multi-class.

In [8]:
habermans_dataset["survival_status"].unique()

array([1, 2])

Sweet, looks like a binary variable. According to the dataset documentation, 1 corresponds to a patient surviving 5 years or longer after surgery and 2 corresponds to a patient dying within 5 years of surgery.

Let's get to splitting our data into training and test sets! We are going to use the sklearn library - let the fun begin!

In [9]:
array = habermans_dataset.values
X = array[:,:3] #features!
y = array[:,3] #outcome variable.
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2) #we will be using 80% of the data to train our models!
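
The split above is random, so it changes on every run and the class ratio in the test set can drift. Here is a hedged sketch of a repeatable, stratified split. The data is synthetic (100 made-up rows with a 75/25 class ratio), the seed 42 is arbitrary, and the `_demo` names are mine; `stratify` and `random_state` are regular `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 integer features, 75/25 class imbalance.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(100, 3))
y_demo = np.array([1] * 75 + [2] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the 75/25 class ratio in both splits
    random_state=42,   # arbitrary seed, fixed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print((y_te == 1).sum(), (y_te == 2).sum())  # class counts in the test set
```

With an unbalanced outcome like survival status, stratifying makes sure the rarer class shows up in the test set in the same proportion as in the full data.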

We will now implement a for loop that will go through various classification algorithms, evaluate each one with cross-validation and report the accuracy! This saves us a lot of time and gives us a glimpse into which algorithm might be best suited for this dataset. Please note: this is a very clean dataset; usually we will be dealing with extremely large and disorganised datasets. It is important to take the time to understand the characteristics of your dataset before proceeding to this step. You must ensure that you fully understand the assumptions implicit in each algorithm and their implications for your data.

Let's first create a list of the algorithms we wish to test out. This will allow for quick looping!

In [10]:
algos = []
algos.append(('LR', LogisticRegression()))
algos.append(('LDA', LinearDiscriminantAnalysis()))
algos.append(('KNN', KNeighborsClassifier()))
algos.append(('CART', DecisionTreeClassifier()))
algos.append(('NB', GaussianNB()))
algos.append(('SVM', SVC()))
algos.append(('Neural Net', MLPClassifier()))
algos.append(('RFC', RandomForestClassifier()))
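
One example of an assumption worth checking: distance-based models such as KNN and SVM are sensitive to feature scale, and our features (age, year, node count) live on different ranges. Below is a rough sketch of wrapping such models in a scaling pipeline. It runs on synthetic data from `make_classification`, not on the Haberman data, and the `scaled_algos` and `scaled_scores` names are mine. Whether scaling helps on this dataset is something you would have to test.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 3 features, like our feature matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Each pipeline standardises the features before fitting the classifier.
# Inside cross-validation the scaler is fitted on the training folds only.
scaled_algos = [
    ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("SVM", make_pipeline(StandardScaler(), SVC())),
]

scaled_scores = {}
for name, pipeline in scaled_algos:
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    scaled_scores[name] = scores
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```

Pipelines like these can be appended to the `algos` list in the same `(name, estimator)` form as the plain models.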

In [11]:
# evaluate each model in turn with 20-fold cross-validation on the training set
results = []
names = []
cross_validation = model_selection.KFold(n_splits=20)  # unshuffled, so every model sees the same folds
for name, algorithm in algos:
    eval_metrics = model_selection.cross_val_score(algorithm, X_train, y_train, cv=cross_validation, scoring='accuracy')
    results.append(eval_metrics)
    names.append(name)
    # mean accuracy, with the standard deviation across folds in parentheses
    print("%s: %f (%f)" % (name, eval_metrics.mean(), eval_metrics.std()))

LR: 0.748718 (0.140509)
LDA: 0.744551 (0.146461)
KNN: 0.748397 (0.117009)
CART: 0.663462 (0.141251)
NB: 0.757051 (0.132708)
SVM: 0.740705 (0.116703)
Neural Net: 0.745192 (0.136719)
RFC: 0.716346 (0.105699)
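
A note on those standard deviations: our training set has about 244 rows (80% of 305), so each of the 20 folds holds only about 12 samples, and a single misclassification moves a fold's accuracy by roughly 8 points. Here is a small sketch for comparing fold counts using `StratifiedKFold`, which also keeps the class ratio similar in every fold. The data is synthetic (generated to be about the same size and imbalance), and the `fold_scores` name and the seed are my own choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 244 rows, 3 features, roughly 73/27 class imbalance.
X_demo, y_demo = make_classification(
    n_samples=244, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)

fold_scores = {}
for n_splits in (5, 20):
    # shuffle=True with a fixed seed gives repeatable, randomised folds
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")
    fold_scores[n_splits] = scores
    print("%d folds: %f (%f)" % (n_splits, scores.mean(), scores.std()))
```

Fewer, larger folds usually give less jumpy per-fold scores, at the cost of training each model on a bit less data.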


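Looks like a few of our models had similar mean accuracy. Naive Bayes had the highest mean accuracy (0.757), although with standard deviations of 0.11 to 0.15 the differences between the top models are well within the noise. Let's proceed with Naive Bayes!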

Let's make a prediction on our test set!

In [12]:
nb =  GaussianNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions, target_names=[">=5 years","<5 years"]))

0.704918032787
[[40  2]
 [16  3]]
             precision    recall  f1-score   support

  >=5 years       0.71      0.95      0.82        42
   <5 years       0.60      0.16      0.25        19

avg / total       0.68      0.70      0.64        61



Remember the following definitions:

- Precision: true positives/(true positives + false positives). We can interpret this as the positive predictive value.


- Recall: true positives/(true positives + false negatives). We can interpret this as model sensitivity.



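- F1 Score: $2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. This is the harmonic mean of precision and recall, so it is only high when both are high. It is not the same thing as accuracy.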
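
To make the definitions concrete, here is a short worked example that recomputes the report's numbers by hand from the confusion matrix printed above. It treats "<5 years" as the positive class. The variable names are mine, and the majority-class baseline is an extra comparison I added.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes, in the order [">=5 years", "<5 years"].
cm = np.array([[40, 2],
               [16, 3]])

accuracy = np.trace(cm) / cm.sum()                # (40 + 3) / 61
majority_baseline = cm.sum(axis=1).max() / cm.sum()  # always predict ">=5 years": 42 / 61

# Treat "<5 years" (index 1) as the positive class.
tp = cm[1, 1]  # died within 5 years, predicted so
fp = cm[0, 1]  # survived, but predicted "<5 years"
fn = cm[1, 0]  # died within 5 years, but predicted ">=5 years"

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: %.3f (majority baseline: %.3f)" % (accuracy, majority_baseline))
print("precision: %.2f, recall: %.2f, f1: %.2f" % (precision, recall, f1))
```

The precision, recall and F1 values match the "<5 years" row of the classification report. Note that the accuracy is only slightly above what we would get by always predicting ">=5 years", and the recall shows that the model misses most of the patients who died within 5 years.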