## Classification problem

This notebook works through a binary classification example.
Its results (test-set labels, predicted classes and predicted probabilities) are used in the notebooks on measuring the performance of classification models (for the book on DL).

**Import libraries** and **set the seed**

In [976]:
import numpy as np
import pandas as pd

nint = 119
np.random.seed(nint)

#### Load the data

Breast cancer data:

- binary classification
- 30 features, numeric

In [977]:
from sklearn.datasets import load_breast_cancer

(X, y) = load_breast_cancer(return_X_y=True)

In [978]:
print("N. of examples", len(y))

N. of examples 569


In [979]:
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

[[  0 212]
 [  1 357]]


#### Subset columns

Take a random subset of features:

In [980]:
import random

n_vars = 5 ## n. of features to select (randomly)
lst = list(range(0,X.shape[1]))

In [981]:
random.seed(nint)
selected_cols = random.sample(lst, n_vars)

In [982]:
X = X[:,selected_cols]
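
The feature subset above is reproducible only because the generator is seeded before sampling. A minimal sketch of that idea, using a hypothetical helper `sample_columns` (not part of the notebook) with its own local generator so the global random state is left untouched:

```python
import random

def sample_columns(n_total, n_select, seed):
    """Return n_select distinct column indices drawn from range(n_total)."""
    rng = random.Random(seed)  # local generator, independent of the global one
    return rng.sample(range(n_total), n_select)

# Hypothetical values mirroring the notebook: 30 features, 5 selected, seed 119
first = sample_columns(30, 5, seed=119)
second = sample_columns(30, 5, seed=119)

print(first == second)  # the same seed gives the same columns
```

Keeping the generator local is a design choice: other cells that call `random` cannot change which columns are selected.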

#### Data normalization

Center and scale each column of the feature matrix `X` (zero mean, unit standard deviation):

In [983]:
random.seed(nint)
print(random.random())

0.9263128585070141


In [984]:
avg = np.mean(X, axis=0)
std = np.std(X, axis=0)

In [985]:
X_norm = (X-avg)/std

In [986]:
print("Mean of 1st feature:",X_norm[:,0].mean())
print("Standard deviation of 1st feature:", X_norm[:,0].std())

Mean of 1st feature: -5.744282222313639e-16
Standard deviation of 1st feature: 1.0
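
The check above looks at the first feature only. As a rough sketch, the same check can be run on all columns at once. The array `X_demo` is synthetic stand-in data (an assumption, not the notebook's `X`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 100 examples, 5 features
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 5))

avg_demo = X_demo.mean(axis=0)
std_demo = X_demo.std(axis=0)  # population standard deviation (ddof=0), NumPy's default
X_demo_norm = (X_demo - avg_demo) / std_demo

# Every column should now have mean ~0 and standard deviation ~1
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))
```

The means are zero only up to floating-point error, which is why the notebook prints a value of the order of `1e-16` rather than an exact `0`.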


#### Training / test data split

Randomly split the data into a training set (80%) and a test set (20%):

In [987]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.2, random_state=nint)

In [988]:
print("N. of training examples:", len(X_train))

N. of training examples: 455


In [989]:
print("N. of test examples:", len(X_test))

N. of test examples: 114
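
The split sizes follow from simple arithmetic. A small sketch, assuming the test-set size is rounded up and the remainder goes to training (this matches the counts printed above):

```python
import math

n_samples = 569
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # 113.8 is rounded up
n_train = n_samples - n_test               # the remainder is used for training

print(n_test, n_train)  # 114 455
```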


In [990]:
y_train[0:10]

array([1, 1, 1, 1, 1, 1, 0, 1, 0, 1])

### Logistic regression model

Fit the logistic regression model:

In [991]:
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=nint)

In [992]:
# fit the model with data
logreg.fit(X_train, y_train)

#### Get predictions on the test set

In [993]:
y_pred = logreg.predict(X_test)

#### Measure model performance

In [994]:
# import the metrics module
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[34,  7],
       [ 4, 69]])
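
In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes. As a worked sketch, the usual summary metrics can be derived from the four counts above, treating class 1 (benign in this dataset) as the positive class. The variable names are illustrative:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[34, 7],
               [4, 69]])

# For a binary problem the flattened order is: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of the predicted positives, how many are correct
recall = tp / (tp + fn)       # of the true positives, how many are found
specificity = tn / (tn + fp)  # of the true negatives, how many are found

print(f"accuracy:    {accuracy:.3f}")     # 0.904
print(f"precision:   {precision:.3f}")    # 0.908
print(f"recall:      {recall:.3f}")       # 0.945
print(f"specificity: {specificity:.3f}")  # 0.829
```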

In [995]:
lr_probs = logreg.predict_proba(X_test)

In [996]:
#this cell mounts the user's google drive in the specified folder,
#but only once (doing more than once would generate an error)
import os

gdrive_folder = '/content/gdrive'
project_folder = '/content/gdrive/MyDrive/projects/book_DL' ## !! IMPORTANT: change this depending on data iteration !!

if not os.path.isdir(gdrive_folder):
  from google.colab import drive
  drive.mount(gdrive_folder)

In [997]:
y_all = np.vstack((y_test, y_pred)).T

In [998]:
df = pd.DataFrame(np.hstack((y_all, lr_probs)), columns=['y_test', 'y_pred', 'prob_0', 'prob1'])
df.head()

   y_test  y_pred    prob_0     prob1
0     1.0     1.0  0.013791  0.986209
1     1.0     1.0  0.022589  0.977411
2     1.0     1.0  0.340036  0.659964
3     0.0     0.0  0.999988  0.000012
4     1.0     1.0  0.029998  0.970002
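
Two properties of this table are worth checking: the two probabilities in each row sum to 1, and for a binary logistic regression the predicted class corresponds to thresholding the probability of class 1 at 0.5. A minimal sketch on the five rows shown above, with `df_demo` as an illustrative copy rather than the notebook's `df`:

```python
import pandas as pd

# The five rows displayed above, copied by hand
df_demo = pd.DataFrame({
    "y_pred": [1.0, 1.0, 1.0, 0.0, 1.0],
    "prob_0": [0.013791, 0.022589, 0.340036, 0.999988, 0.029998],
    "prob1": [0.986209, 0.977411, 0.659964, 0.000012, 0.970002],
})

# Class probabilities sum to 1 (up to the rounding of the displayed values)
sums_to_one = ((df_demo["prob_0"] + df_demo["prob1"] - 1.0).abs() < 1e-6).all()

# Thresholding prob1 at 0.5 recovers the predicted class
thresholded = (df_demo["prob1"] > 0.5).astype(float)
matches_pred = (thresholded == df_demo["y_pred"]).all()

print(sums_to_one, matches_pred)
```

Storing the probabilities alongside the predictions makes it possible to try thresholds other than 0.5 later without refitting the model.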


In [999]:
df.reset_index(drop=True, inplace=True)

In [1000]:
def writeout_results(res, filename):
    """Write the DataFrame `res` to `filename` as CSV, creating the parent folder if needed."""
    basedir = os.path.dirname(filename)
    # an empty basedir means the current directory, which already exists
    if not basedir or os.path.isdir(basedir):
        msg = "Creating file '{}' and writing results to it".format(os.path.basename(filename))
    else:
        os.makedirs(basedir)
        msg = "Creating folder '{}' and writing results to file {}".format(basedir, os.path.basename(filename))

    res.to_csv(filename, mode='w', header=True, index=False)
    return msg

In [1001]:
## write the test-set labels, predicted classes and class probabilities
## to a CSV file in the project folder

print(" - saving predictions")
fname = os.path.join(project_folder, "predictions.csv")
print("writing results to: ", fname)
writeout_results(df, fname)

 - saving predictions
writing results to:  /content/gdrive/MyDrive/projects/book_DL/predictions.csv


"Creating file 'predictions.csv' and writing results to it"