## Classification problem

This notebook works through an example of a binary classification problem.
Its results are used in the notebooks on measuring the performance of classification models (for the book on DL).

**Import libraries** and **set the seed**

In [516]:
import numpy as np
import pandas as pd

nint = 113
np.random.seed(nint)

#### Load the data

Breast cancer data:

- binary classification
- 30 features, numeric

In [517]:
from sklearn.datasets import load_breast_cancer

(X, y) = load_breast_cancer(return_X_y=True)

In [518]:
print("N. of examples", len(y))

N. of examples 569


In [538]:
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

[[  0 212]
 [  1 357]]
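
The class counts above show that the two classes are not perfectly balanced. As a small sketch (using only the counts printed above; the variable names are illustrative), the class proportions give the accuracy that a trivial classifier would reach by always predicting the most frequent class, which is a useful reference point when reading the performance figures later on:

```python
import numpy as np

# class counts taken from the output above: label 0 -> 212, label 1 -> 357
counts = np.array([212, 357])

# fraction of examples in each class
proportions = counts / counts.sum()

# accuracy of a classifier that always predicts the majority class
majority_baseline = proportions.max()

print("Class proportions:", proportions.round(3))
print("Majority-class baseline accuracy:", round(float(majority_baseline), 3))
```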


#### Subset columns

Take a random subset of features:

In [519]:
import random

n_vars = 10 ## n. of features to select (randomly)
lst = list(range(0,X.shape[1]))

In [520]:
random.seed(nint)
selected_cols = random.sample(lst, n_vars)

In [521]:
X = X[:,selected_cols]
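
The selected columns are only column indices at this point. A possible way to see which features they correspond to is sketched below; it assumes the dataset is loaded without `return_X_y=True`, so that the `feature_names` attribute is available, and it repeats the seeding and sampling steps so that it can be run on its own:

```python
import random
from sklearn.datasets import load_breast_cancer

# load the full dataset object (not just X and y) to access the feature names
data = load_breast_cancer()

# same seed and sampling call as in the cells above
random.seed(113)
cols = random.sample(list(range(0, data.data.shape[1])), 10)

# map the sampled column indices to their names
selected_names = [str(data.feature_names[i]) for i in cols]
print(selected_names)
```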

#### Data normalization

Center and scale the matrix of features `X`

In [522]:
random.seed(nint)
print(random.random())

0.030555320187374058


In [523]:
avg = np.mean(X, axis=0)
std = np.std(X, axis=0)

In [524]:
X_norm = (X-avg)/std

In [525]:
print("Mean of 1st feature:",X_norm[:,0].mean())
print("Standard deviation of 1st feature:", X_norm[:,0].std())

Mean of 1st feature: -1.3736327053358703e-16
Standard deviation of 1st feature: 1.0
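
The manual centering and scaling above uses the population standard deviation (the default of `np.std`). As a hedged cross-check, the sketch below compares it with scikit-learn's `StandardScaler` on the full, unsubsetted feature matrix; the names `X_full`, `manual` and `scaled` are introduced here for illustration only:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_full, _ = load_breast_cancer(return_X_y=True)

# manual standardization, as in the cells above
manual = (X_full - np.mean(X_full, axis=0)) / np.std(X_full, axis=0)

# the same operation through scikit-learn
scaled = StandardScaler().fit_transform(X_full)

# True when the two results agree up to floating-point tolerance
print(np.allclose(manual, scaled))
```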


#### Training / test data split

Randomly split the data into a training set (80%) and a test set (20%):

In [526]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.2, random_state=nint)

In [527]:
print("N. of training examples:", len(X_train))

N. of training examples: 455


In [528]:
print("N. of test examples:", len(X_test))

N. of test examples: 114


In [529]:
y_train[0:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 0])
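
In the cells above, `avg` and `std` are computed on the full dataset before the split, so the test rows also contribute to the scaling statistics. A common alternative is to estimate them on the training rows only and reuse them for the test rows. The sketch below illustrates that idea on hypothetical random data (`X_demo` and all other names are assumptions made for the example, not part of this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical feature matrix: 100 examples, 3 features
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# random 80/20 split of the row indices
idx = rng.permutation(len(X_demo))
train_idx, test_idx = idx[:80], idx[80:]

# scaling statistics estimated on the training rows only
train_avg = X_demo[train_idx].mean(axis=0)
train_std = X_demo[train_idx].std(axis=0)

# the same statistics are applied to both subsets
X_train_demo = (X_demo[train_idx] - train_avg) / train_std
X_test_demo = (X_demo[test_idx] - train_avg) / train_std

print("Training means:", X_train_demo.mean(axis=0).round(6))
print("Test means:", X_test_demo.mean(axis=0).round(3))
```

With this approach the training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately standardized.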

### Logistic regression model

Fit the logistic regression model:

In [530]:
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=nint)

In [531]:
# fit the model with data
logreg.fit(X_train, y_train)

#### Get predictions on the test set

In [532]:
y_pred = logreg.predict(X_test)
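
For a binary problem, the labels returned by `predict` correspond to thresholding the predicted probability of class 1 at 0.5. The sketch below illustrates this on a hypothetical synthetic dataset created with `make_classification`; the data and the names `X_demo`, `y_demo`, `model` and `labels_from_proba` are assumptions for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# hypothetical binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(random_state=0)
model.fit(X_demo, y_demo)

# column 1 holds the predicted probability of class 1
proba = model.predict_proba(X_demo)

# thresholding at 0.5 gives the predicted labels
labels_from_proba = (proba[:, 1] > 0.5).astype(int)

print(np.array_equal(labels_from_proba, model.predict(X_demo)))
```

Keeping the probabilities, rather than only the labels, allows a different threshold to be chosen when one type of error is more costly than the other.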

#### Measure model performance

In [533]:
# import the metrics module
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[44,  6],
       [ 2, 62]])
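
In scikit-learn's convention, rows of the confusion matrix are the true labels and columns are the predicted labels, so with class 1 taken as the positive class the matrix above reads as 44 true negatives, 6 false positives, 2 false negatives and 62 true positives. As a minimal sketch, the usual summary metrics can be derived directly from these four numbers (the matrix values are copied from the output above):

```python
import numpy as np

# confusion matrix copied from the output above
cm = np.array([[44, 6],
               [2, 62]])

# for a 2x2 matrix, ravel() returns TN, FP, FN, TP in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # fraction of correct predictions
precision = tp / (tp + fp)           # correct among predicted positives
recall = tp / (tp + fn)              # correct among actual positives
specificity = tn / (tn + fp)         # correct among actual negatives
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:    %.3f" % accuracy)
print("Precision:   %.3f" % precision)
print("Recall:      %.3f" % recall)
print("Specificity: %.3f" % specificity)
print("F1 score:    %.3f" % f1)
```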

In [539]:
y_test

array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1])

In [540]:
y_pred

array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1])
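
Comparing the two arrays element by element shows where the model was wrong. As a short sketch, the example below uses only the first 22 values of `y_test` and `y_pred` printed above (copied by hand into `y_true_head` and `y_pred_head`, names chosen for illustration) and locates the misclassified positions:

```python
import numpy as np

# first 22 true and predicted labels, copied from the outputs above
y_true_head = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
y_pred_head = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1])

# positions where prediction and truth disagree
wrong = np.flatnonzero(y_true_head != y_pred_head)

# split the errors by type (class 1 taken as the positive class)
false_neg = np.flatnonzero((y_true_head == 1) & (y_pred_head == 0))
false_pos = np.flatnonzero((y_true_head == 0) & (y_pred_head == 1))

print("Misclassified positions:", wrong)
print("False negatives:", false_neg)
print("False positives:", false_pos)
```

The same expressions can be applied to the full `y_test` and `y_pred` arrays.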