## Binary Classification Example: Breast Cancer Wisconsin Data

The task for this dataset is to predict whether a tissue sample is cancerous from the measurement values of its cells. It contains 569 observations and 30 input features. The target feature, "diagnosis", has two classes: 212 "malignant" and 357 "benign", denoted by "M" and "B" respectively.

The dataset has no missing values and all features are numeric other than the target feature (which is binary).

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

In [2]:
cancer_df = pd.read_csv('data/breast_cancer_wisconsin.csv')
cancer_df.head(5)

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,diagnosis
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,M
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,M
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,M
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,M
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,M
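
The claims about missing values and numeric features can be checked directly with `pandas`. The sketch below uses a tiny made-up frame (`toy_df`, with invented values) as a stand-in for `cancer_df`, so the column names and numbers are illustrative assumptions only; the same three checks should carry over to the real data.

```python
import pandas as pd

# Hypothetical stand-in for cancer_df: two numeric features plus the target.
# The values are invented for illustration.
toy_df = pd.DataFrame({
    "feature_a": [1.5, 2.5, 0.5],
    "feature_b": [10.0, 30.0, 20.0],
    "diagnosis": ["M", "M", "B"],
})

# Total number of missing cells across the whole frame.
n_missing = int(toy_df.isna().sum().sum())

# Names of descriptive features that are not numeric (expected to be empty).
features = toy_df.drop(columns="diagnosis")
non_numeric = [col for col in features.columns
               if not pd.api.types.is_numeric_dtype(features[col])]

# Number of observations in each class of the target feature.
class_counts = toy_df["diagnosis"].value_counts().to_dict()

print(n_missing, non_numeric, class_counts)
```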


## Partitioning the Dataset into Descriptive Features and the Target Feature

In [3]:
Data = cancer_df.drop(columns='diagnosis')
target = cancer_df['diagnosis']

## Encoding the Target Feature

Keep in mind that `scikit-learn` generally works with numeric data, so the target needs to be encoded as 0 and 1.

In [4]:
from sklearn import preprocessing

target = preprocessing.LabelEncoder().fit_transform(target)

Note that `LabelEncoder` assigns integer codes in alphabetical order of the labels. That is, "B" is encoded as 0 whereas "M" is encoded as 1.

In [5]:
np.unique(target, return_counts=True)

(array([0, 1]), array([357, 212], dtype=int64))
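
To see the mapping explicitly, it can help to keep a reference to the fitted encoder rather than discarding it. The short sketch below uses a hand-written label list (an assumption for illustration, not the real target column) and inspects the encoder's `classes_` attribute, where the position of each label is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real target column has 569 entries.
encoder = LabelEncoder()
codes = encoder.fit_transform(["M", "B", "B", "M"])

# classes_ holds the original labels in sorted order; the index is the code.
print(list(encoder.classes_))

# The encoded version of the input labels.
print(list(codes))

# inverse_transform maps integer codes back to the original labels.
decoded = list(encoder.inverse_transform([0, 1]))
print(decoded)
```

Keeping the encoder around also makes it possible to translate model predictions back into "B" and "M" later on.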

## Scaling Descriptive Features

It is generally a good idea to scale your descriptive features before fitting models, particularly distance-based ones such as nearest neighbors. Here, we use min-max scaling so that each descriptive feature is scaled to be between 0 and 1. In the rest of this tutorial, we work with the scaled data.

In [6]:
Data = preprocessing.MinMaxScaler().fit_transform(Data)

In [7]:
Data[0,:]

array([0.52103744, 0.0226581 , 0.54598853, 0.36373277, 0.59375282,
       0.7920373 , 0.70313964, 0.73111332, 0.68636364, 0.60551811,
       0.35614702, 0.12046941, 0.3690336 , 0.27381126, 0.15929565,
       0.35139844, 0.13568182, 0.30062512, 0.31164518, 0.18304244,
       0.62077552, 0.14152452, 0.66831017, 0.45069799, 0.60113584,
       0.61929156, 0.56861022, 0.91202749, 0.59846245, 0.41886396])
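
For intuition, min-max scaling maps each column value $x$ to $(x - \min) / (\max - \min)$, where the minimum and maximum are taken per column. The sketch below reproduces that arithmetic with NumPy on a small made-up array; it assumes no column is constant (which would cause a division by zero here) and is meant as an illustration of the formula rather than a replacement for `MinMaxScaler`.

```python
import numpy as np

# Made-up data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Column-wise minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Broadcasting applies the per-column formula to every row.
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, both columns span exactly 0 to 1 regardless of their original units. In a stricter workflow, the minimum and maximum would be computed on the training rows only and then applied to the test rows.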

## Splitting Data into Training and Test Sets

We split the descriptive features and the target feature into a training set and a test set by a ratio of 70:30. That is, we use 70% of the data to build our classifiers and evaluate their performance on the remaining 30%. Measuring performance on unseen data gives a more honest estimate of how well a model generalizes and helps reveal overfitting. We also set a random state value so that we can replicate our results later on.

In [8]:
from sklearn.model_selection import train_test_split

D_train, D_test, t_train, t_test = train_test_split(Data,
                                                   target,
                                                   test_size=0.3,
                                                   random_state=999)
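
The idea behind a seeded split can be sketched in plain Python: shuffle the row indices with a fixed seed, then cut off the test portion. The helper `split_indices` below is hypothetical and is not how `train_test_split` is implemented internally (scikit-learn uses NumPy's random generator, so the actual rows selected will differ), but it shows why a fixed seed makes the partition repeatable and how the 70:30 ratio turns into row counts.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test size up to a whole number of rows.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=569, test_size=0.3, seed=999)
print(len(train_idx), len(test_idx))  # 398 171

# Calling the helper again with the same seed yields the same partition.
same_again = split_indices(569, 0.3, 999) == (train_idx, test_idx)
print(same_again)  # True
```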

## Fitting a Nearest Neighbor Classifier

Let's fit a nearest neighbor classifier with 5 neighbors using the Euclidean distance. We fit the model on the train data and evaluate its performance on the test data.
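
The distance used by a nearest neighbor classifier is often expressed as a Minkowski distance with a power parameter `p`, where `p=1` gives the Manhattan distance and `p=2` gives the Euclidean distance. The helper `minkowski_distance` below is a hypothetical pure-Python sketch of that formula, included only to make the difference between the two concrete.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [0.0, 0.0]
b = [3.0, 4.0]

# p=1 sums the absolute coordinate differences: 3 + 4.
manhattan = minkowski_distance(a, b, p=1)

# p=2 is the straight-line distance: sqrt(3^2 + 4^2).
euclidean = minkowski_distance(a, b, p=2)

print(manhattan, euclidean)  # 7.0 5.0
```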

Below, the `score` method returns the accuracy of the classifier on the test data. Accuracy is defined as the ratio of correctly predicted observations to the total number of observations.

In [10]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=5, p=2)  # p=1 is Manhattan distance, whereas p=2 is Euclidean
knn_classifier.fit(D_train, t_train)
knn_classifier.score(D_test, t_test)

0.9707602339181286
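
Accuracy itself is simple to compute by hand. The function `accuracy` below is a hypothetical helper written for illustration (it is not part of scikit-learn), applied to made-up label lists.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: four of the five predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 0.8
```

Assuming the 30% test split holds 171 of the 569 observations, the score reported above corresponds to 166 correct predictions, since 166/171 is approximately 0.9708.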

If you would like to see which parameters are available for a classifier while working in a Jupyter notebook or IPython session, just type the name of the classifier followed by a question mark, e.g. `KNeighborsClassifier?`

In [12]:
# KNeighborsClassifier?

## Fitting a Decision Tree Classifier

Let's fit a decision tree classifier with the entropy split criterion and a maximum depth of 4 on the train data, and then evaluate its performance on the test data.

In [14]:
from sklearn.tree import DecisionTreeClassifier
# DecisionTreeClassifier?

In [15]:
dt_classifier = DecisionTreeClassifier(criterion='entropy', max_depth=4)
dt_classifier.fit(D_train, t_train)
dt_classifier.score(D_test, t_test)

0.9239766081871345
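
For readers unfamiliar with the entropy criterion, the sketch below computes the Shannon entropy (in bits) of a set of class counts. A decision tree using this criterion prefers splits that reduce entropy the most. The helper `entropy` is hypothetical and written only for illustration; the counts 357 and 212 are the class sizes of the full dataset stated at the start of this example.

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class-count distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# Entropy of the full dataset: 357 benign and 212 malignant observations.
h_full = entropy([357, 212])

# A pure node (one class only) has zero entropy.
h_pure = entropy([10, 0])

# An evenly split binary node has the maximum entropy of 1 bit.
h_even = entropy([5, 5])

print(round(h_full, 3), h_pure, h_even)
```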

## Fitting a Random Forest Classifier

An ensemble method combines many sub-classifiers, and the final outcome is determined by a majority vote among them. The random forest classifier is a popular ensemble method based on the idea of "bagging", where the sub-classifiers are decision trees. Let's fit a random forest classifier with 100 decision trees.

In [16]:
from sklearn.ensemble import RandomForestClassifier
# RandomForestClassifier?

# No random_state is set here, so the score can vary slightly between runs.
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(D_train, t_train)
rf_classifier.score(D_test, t_test)

0.9415204678362573
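
Majority voting can be sketched in a few lines of plain Python. The predictions below are made up, and `majority_vote` is a hypothetical helper. Note also that, as far as the scikit-learn documentation describes it, `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is an illustration of the general idea and not of the library's exact procedure.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label among the votes."""
    return Counter(votes).most_common(1)[0][0]

# Made-up predictions: one row per tree, one column per observation.
tree_votes = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
]

# zip(*tree_votes) regroups the votes by observation.
ensemble_prediction = [majority_vote(col) for col in zip(*tree_votes)]
print(ensemble_prediction)  # [1, 0, 1]
```

Using an odd number of voters avoids ties in a binary problem.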

## Fitting a Gaussian Naive Bayes Classifier

Another model we would like to fit to the breast cancer dataset is the Gaussian Naive Bayes classifier with a variance smoothing value of $10^{-3}$.

In [17]:
from sklearn.naive_bayes import GaussianNB

nb_classifier = GaussianNB(var_smoothing=1e-3)
nb_classifier.fit(D_train, t_train)
nb_classifier.score(D_test, t_test)

0.935672514619883
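
According to the scikit-learn documentation, `var_smoothing` is the portion of the largest feature variance that is added to all variances for numerical stability. The NumPy sketch below illustrates that idea on a made-up array with one constant column; it is a simplified assumption about the mechanism, not a copy of the library's internals (which work with per-class variances).

```python
import numpy as np

# Made-up data: the second feature is constant, so its variance is zero.
X = np.array([[0.1, 0.5],
              [0.2, 0.5],
              [0.9, 0.5]])

var_smoothing = 1e-3

# Per-feature variances.
variances = X.var(axis=0)

# The smoothing term is a fraction of the largest feature variance.
epsilon = var_smoothing * variances.max()

# Adding epsilon keeps every variance strictly positive.
smoothed = variances + epsilon

print(variances, epsilon, smoothed)
```

Without the smoothing term, a zero variance would lead to a division by zero when evaluating the Gaussian density.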

## Fitting a Support Vector Machine

The last model we fit is a support vector machine (SVM) with all parameters left at their default values.

In [18]:
from sklearn.svm import SVC

svm_classifier = SVC()
svm_classifier.fit(D_train, t_train)
svm_classifier.score(D_test, t_test)

0.9883040935672515
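
To find out what "default values" means for a particular installation, the parameters can be read off an unfitted estimator with `get_params()`. The snippet below prints three of them; in recent scikit-learn versions the defaults should be an RBF kernel with `C=1.0`, but the exact values (especially `gamma`) depend on the installed version.

```python
from sklearn.svm import SVC

# get_params() returns the constructor arguments as a dictionary.
params = SVC().get_params()

for name in ("C", "kernel", "gamma"):
    print(name, "=", params[name])
```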

## Making Predictions with a Fitted Model

Once a model is built, predictions can be made using the `predict` method of the fitted classifier.

For example, suppose we would like to use the fitted nearest neighbor classifier as our model, and we would like to find out the model's prediction for the first three rows in the input data. Of course, we already know the labels of these rows (which are all malignant), so this is just to illustrate how you would make a prediction for a new observation.

In [21]:
new_obs = Data[0:3]
print(new_obs)
knn_classifier.predict(new_obs)

[[0.52103744 0.0226581  0.54598853 0.36373277 0.59375282 0.7920373
  0.70313964 0.73111332 0.68636364 0.60551811 0.35614702 0.12046941
  0.3690336  0.27381126 0.15929565 0.35139844 0.13568182 0.30062512
  0.31164518 0.18304244 0.62077552 0.14152452 0.66831017 0.45069799
  0.60113584 0.61929156 0.56861022 0.91202749 0.59846245 0.41886396]
 [0.64314449 0.27257355 0.61578329 0.50159067 0.28987993 0.18176799
  0.20360825 0.34875746 0.37979798 0.14132266 0.15643672 0.08258929
  0.12444047 0.12565979 0.11938675 0.08132304 0.0469697  0.25383595
  0.08453875 0.0911101  0.60690146 0.30357143 0.53981772 0.43521431
  0.34755332 0.15456336 0.19297125 0.63917526 0.23358959 0.22287813]
 [0.60149557 0.3902604  0.59574321 0.44941676 0.51430893 0.4310165
  0.46251172 0.63568588 0.50959596 0.21124684 0.22962158 0.09430251
  0.18037035 0.16292179 0.15083115 0.2839547  0.09676768 0.38984656
  0.20569032 0.12700551 0.55638563 0.36007463 0.50844166 0.37450845
  0.48358978 0.38537513 0.35974441 0.83505155 0.

array([1, 1, 1])

In [20]:
target[0:3]

array([1, 1, 1])

The model's prediction for these three rows is that they are all "1", that is, they are all "malignant". Thus, in this particular case, we observe that the model correctly predicts the first three rows in the input data.
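
Besides hard class predictions, most scikit-learn classifiers also offer `predict_proba`, which for a nearest neighbor model with uniform weights returns the fraction of neighbors in each class. The sketch below uses a made-up one-dimensional dataset (not the breast cancer data) and a hypothetical `code_to_label` dictionary to translate the integer codes back into the "B" and "M" labels.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional data: class 0 clusters near 1, class 1 near 11.
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

new_obs = [[1.5], [6.4]]

# Hard class predictions and per-class neighbor fractions.
pred = knn.predict(new_obs)
proba = knn.predict_proba(new_obs)

# Translate integer codes back to the original labels (0 is "B", 1 is "M").
code_to_label = {0: "B", 1: "M"}
labels = [code_to_label[int(code)] for code in pred]

print(labels)
print(proba)
```

For the second observation, two of the three nearest neighbors belong to class 1, so the predicted probability of class 1 is 2/3.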

## Summary

This tutorial illustrates that Python and Scikit-Learn together provide a unified interface for model fitting and evaluation, which greatly simplifies the machine learning workflow.

Of course, there is a whole lot more to supervised machine learning than what is shown here, such as:

1. Other classification algorithms
2. Solving prediction problems where the target feature is numeric (a.k.a. regression problems)
3. Using other model performance metrics (e.g., precision, recall, mean squared error for regression, etc.)
4. More sophisticated model performance assessment methods (such as cross-validation)
5. How model parameters can be optimized (also known as hyperparameter tuning)