# Performance Metrics

Olatomiwa Bifarin. <br>
PhD Candidate Biochemistry and Molecular Biology <br>
@ The University of Georgia

_This is a draft copy, a work in progress_

In [36]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import style
#For Seaborn plots
#import seaborn as sns; sns.set(style='white')

# Sharper, more legible graphics
%config InlineBackend.figure_format = 'retina'

import time

# Set seaborn figure labels to 'talk', to be more visible. 
#sns.set_context('talk', font_scale=0.8)

## Notebook Outline
1. [Problem Definition](#1)
2. [A Classification Problem](#2)
3. [Accuracy](#3)

## 1. Problem Definition
<a id="1"></a>

## 2. A Classification Problem
<a id="2"></a>

In [3]:
# Let's import the Breast Cancer Wisconsin dataset [link]
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data["data"]
y = data["target"]
feature_names = data["feature_names"]

df = pd.DataFrame(data=X, columns=feature_names)
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [29]:
df.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [11]:
X.shape

(569, 30)

In [12]:
y.shape

(569,)

In [28]:
# In scikit-learn's encoding, target 0 is malignant and target 1 is benign
print("Number of malignant samples:", list(y).count(0))
print("Number of benign samples:", list(y).count(1))

Number of malignant samples: 212
Number of benign samples: 357
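As a quick check on the label encoding, the sketch below reads the class names from the dataset object instead of hard-coding them. It assumes the `target_names` attribute bundled with scikit-learn's `load_breast_cancer`, where index 0 is `malignant` and index 1 is `benign`. The name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data["target"]

# np.bincount counts how many samples carry each integer label (0, 1, ...)
counts = np.bincount(y)

# Pair each integer label with its class name from the dataset metadata
class_counts = {str(name): int(n) for name, n in zip(data["target_names"], counts)}

for name, n in class_counts.items():
    print(f"{name}: {n} samples ({n / len(y):.1%})")
```

The classes are imbalanced (roughly 37% malignant, 63% benign), which matters when interpreting accuracy later on.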


In [7]:
from sklearn.model_selection import train_test_split
# Split into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

Training Features Shape: (455, 30)
Training Labels Shape: (455,)
Testing Features Shape: (114, 30)
Testing Labels Shape: (114,)
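Because the two classes are not equally frequent, a plain random split can leave the training and test sets with slightly different class proportions. As an optional variation (not the split used in this notebook), the sketch below passes `stratify=y` to `train_test_split` so that both subsets keep roughly the same proportions as the full dataset. The variable names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen here to avoid overwriting the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks for the class proportions of y to be preserved in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The labels are 0/1, so the mean of each label array is the fraction of class 1
for name, labels in [("full", y), ("train", y_tr), ("test", y_te)]:
    print(f"{name:>5}: fraction of class 1 = {np.mean(labels):.3f}")
```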


## 3. Accuracy
<a id="3"></a>

In [30]:
from sklearn import svm
linsvm = svm.SVC(kernel = 'linear', probability=True, random_state=42)

In [39]:
#start = time.time()
from sklearn.model_selection import cross_val_score
scores = cross_val_score(linsvm, X_train, y_train,
                         cv=5, scoring="accuracy")
accuracy_mean = np.mean(scores)

print("ML model (SVM) Accuracy: {0:.2%}".format(accuracy_mean))
print("Cross validation score: {0:.2%} (+/- {1:.2%})".format(np.mean(scores), np.std(scores)*2))
#print("Execution time: {0:.5} seconds \n".format(end-start))

ML model (SVM) Accuracy: 95.37%
Cross validation score: 95.37% (+/- 4.96%)
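For reference, accuracy is the fraction of predictions that match the true labels. Each fold score returned by `cross_val_score` with `scoring="accuracy"` is this quantity, computed on the held-out fold. The sketch below computes it by hand and compares the result with scikit-learn's `accuracy_score`. The label arrays are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only (not from the dataset)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = np.mean(y_true == y_pred)

# scikit-learn's implementation of the same metric
sk_accuracy = accuracy_score(y_true, y_pred)

print(f"Manual accuracy:       {manual_accuracy:.2%}")
print(f"accuracy_score result: {sk_accuracy:.2%}")
```

Six of the eight hypothetical predictions are correct, so both values come out to 75%.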


Let's compare this to a baseline model that ignores the features and always predicts class 0.

In [40]:
from sklearn.base import BaseEstimator
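class CancerClassifier(BaseEstimator):
    """Baseline that ignores the features and always predicts class 0."""
    def fit(self, X, y=None):
        # Nothing to learn; return self to follow the scikit-learn convention
        return self
    def predict(self, X):
        # One prediction per sample, as a 1-D array of zeros
        return np.zeros(len(X), dtype=int)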

In [42]:
cancer_clf = CancerClassifier()
cross_val_score(cancer_clf, X_train, y_train, cv=5, scoring="accuracy")

array([0.32967033, 0.30769231, 0.3956044 , 0.46153846, 0.36263736])
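The baseline scores only about 31% to 46% per fold because class 0 is the minority class in this dataset, so a constant prediction of 0 is correct only for that minority. As a sketch of an alternative baseline, scikit-learn's `DummyClassifier` can predict either a fixed constant or the most frequent training class. The names `always_zero`, `most_frequent`, `zero_scores`, and `freq_scores` are introduced here for illustration. Individual fold scores may differ from those above, since scikit-learn recognizes `DummyClassifier` as a classifier and, as far as I know, then stratifies the folds by default.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline 1: always predict class 0, like the hand-written classifier above
always_zero = DummyClassifier(strategy="constant", constant=0)

# Baseline 2: always predict whichever class is most frequent in the training fold
most_frequent = DummyClassifier(strategy="most_frequent")

zero_scores = cross_val_score(always_zero, X_train, y_train, cv=5, scoring="accuracy")
freq_scores = cross_val_score(most_frequent, X_train, y_train, cv=5, scoring="accuracy")

print("Always class 0: {0:.2%}".format(np.mean(zero_scores)))
print("Most frequent:  {0:.2%}".format(np.mean(freq_scores)))
```

A majority-class baseline is usually the more informative yardstick: a trained model should beat it clearly before its accuracy is considered meaningful.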


## References and Resources
- Introduction to Statistical Learning, Chapter on Metrics
- Cross-validation: evaluating estimator performance, [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection)
- Wikipedia, Resampling (statistics)