
# Exercise 5: Breast Cancer Prediction

The goal of this exercise is to use Logistic Regression to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names to the DataFrame manually.

Preliminary:

- If needed, replace missing values with the median of the column.

- Handle the column `Sample code number`. This column won't be used to train the model as it doesn't contain information on breast cancer. There are two solutions: drop it or set it as index.
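
The two preliminary steps can be sketched on a tiny hand-made frame. The values below are invented for illustration and are not taken from the data set; only the column names come from the exercise. Setting the identifier as the index is one of the two options mentioned above; dropping the column would work as well.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry in 'Bare Nuclei'
toy = pd.DataFrame({
    "Sample code number": [1001, 1002, 1003, 1004],
    "Bare Nuclei": [1.0, np.nan, 10.0, 3.0],
    "Class": [2, 2, 4, 2],
})

# Keep the identifier as the index so it is not used as a feature
toy = toy.set_index("Sample code number")

# Fill each missing value with the median of its own column
toy = toy.fillna(toy.median())

print(toy)
```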

1. Print the proportion of class `Benign`. What would the accuracy be if the model always predicted `Benign`?
Later this week we will learn about other metrics, such as AUC, that help to deal with highly imbalanced data sets.

2. Using `train_test_split`, split the data set into a train set and a test set (20%). Both sets should have approximately the same proportion of class `Benign`. Use `random_state = 43`.

3. Fit the logistic regression on the train set. Predict on the train set and test set. Compute the score on the train set and test set. 92-97% accuracy is expected on the test set.

4. Compute the confusion matrix on both sets (train and test). Analyse the number of false negatives and false positives.

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/


In [30]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# missing entries (written as '?' in the file) cannot be parsed as numbers and become NaN
file = np.genfromtxt("breast-cancer-wisconsin.data", delimiter=",")

# Attribute Information:
# 
# 1. Sample code number: id number
# 2. Clump Thickness: 1 - 10
# 3. Uniformity of Cell Size: 1 - 10
# 4. Uniformity of Cell Shape: 1 - 10
# 5. Marginal Adhesion: 1 - 10
# 6. Single Epithelial Cell Size: 1 - 10
# 7. Bare Nuclei: 1 - 10
# 8. Bland Chromatin: 1 - 10
# 9. Normal Nucleoli: 1 - 10
# 10. Mitoses: 1 - 10
# 11. Class: (2 for benign, 4 for malignant)

# we create a data frame with all the columns
df = DataFrame(file, columns=['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class'])

# the ID column is not a feature: set it as the index (dropping it would also work)
df = df.set_index('Sample code number')

# replace the missing values in the column (the exercise asks for the median; this code uses the mean)
print(df.isnull().sum())
df['Bare Nuclei'] = df['Bare Nuclei'].fillna(df['Bare Nuclei'].mean())
print('\nall good (removed):', df.isnull().sum().sum())

# 1. proportion of class Benign (coded as 2)
print('portion of class benign:', (df['Class'] == 2).sum()/len(df['Class']))
print('this means that if we predict Benign your accuracy will be 66%')


Clump Thickness                 0
Uniformity of Cell Size         0
Uniformity of Cell Shape        0
Marginal Adhesion               0
Single Epithelial Cell Size     0
Bare Nuclei                    16
Bland Chromatin                 0
Normal Nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

all good (removed): 0
portion of class benign: 0.6552217453505007
this means that if we predict Benign your accuracy will be 66%
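
The 66% figure is the accuracy of a baseline that ignores the features and always answers `Benign`. A minimal sketch in plain Python, assuming class counts of 458 benign and 241 malignant samples (inferred from the proportion printed above and the totals visible in this notebook's outputs); the helper name `majority_baseline_accuracy` is made up for this illustration:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Assumed class counts: 458 benign (2) and 241 malignant (4)
labels = [2] * 458 + [4] * 241
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 4))
```

Any trained model should beat this number to be considered useful on this data set.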


In [117]:
# 2.
X, y = df.drop(['Class'], axis=1), df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=43, stratify=y)

# check that both sets keep roughly the same proportion of class Benign
print('portion of class benign test:', (y_test == 2).sum()/len(y_test))
print('portion of class benign train:', (y_train == 2).sum()/len(y_train))


portion of class benign test: 0.6571428571428571
portion of class benign train: 0.6547406082289803
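
The near-identical proportions come from `stratify=y`. The idea behind a stratified split can be sketched in plain Python: split each class separately, then merge the pieces. This is only an illustration of the principle, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`; the function `stratified_split` and the class counts (458 benign, 241 malignant) are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=43):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)

    # Group the row positions by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def benign_share(labels, indices):
    """Fraction of the selected rows whose label is 2 (benign)."""
    return sum(labels[i] == 2 for i in indices) / len(indices)

labels = [2] * 458 + [4] * 241  # assumed class counts
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(round(benign_share(labels, train_idx), 3), round(benign_share(labels, test_idx), 3))
```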


In [122]:
# 3.
# X_train = X_train.to_numpy().reshape(-1, 1)
clf = LogisticRegression().fit(X_train, y_train)
# Train
predict = clf.predict(X_train)
print('predict:', predict[:10])

prob_predict = clf.predict_proba(X_train)
print('probability predict:', prob_predict[:10])

print('score:', clf.score(X_train, y_train))

# Test
predict_test = clf.predict(X_test)
print('predict:', predict_test[:10])

prob_predict = clf.predict_proba(X_test)
print('probability predict:', prob_predict[:10])

print('score:', clf.score(X_test, y_test))


predict: [4. 2. 4. 2. 2. 2. 2. 4. 2. 2.]
probability predict: [[4.65649710e-03 9.95343503e-01]
 [9.90960042e-01 9.03995849e-03]
 [7.05444060e-05 9.99929456e-01]
 [9.94699773e-01 5.30022688e-03]
 [9.79006762e-01 2.09932383e-02]
 [9.94168773e-01 5.83122708e-03]
 [9.63204039e-01 3.67959608e-02]
 [4.96441067e-03 9.95035589e-01]
 [9.92180521e-01 7.81947864e-03]
 [9.89521488e-01 1.04785115e-02]]
score: 0.9695885509838998
predict: [2. 2. 2. 4. 2. 4. 2. 2. 2. 4.]
probability predict: [[9.82821870e-01 1.71781299e-02]
 [7.83881983e-01 2.16118017e-01]
 [9.93012989e-01 6.98701126e-03]
 [2.17405344e-01 7.82594656e-01]
 [9.98443347e-01 1.55665329e-03]
 [1.59078715e-03 9.98409213e-01]
 [6.60767146e-01 3.39232854e-01]
 [9.87891047e-01 1.21089534e-02]
 [9.95641306e-01 4.35869376e-03]
 [2.89811233e-04 9.99710189e-01]]
score: 0.9642857142857143
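
Each row of `predict_proba` holds one probability per class, in the order of `clf.classes_` (here benign `2.` first, then malignant `4.`). For a binary problem, `predict` amounts to choosing the class with the larger probability. The sketch below reproduces that by hand for the first four test rows printed above, with the probabilities rounded; the function `label_from_proba` is hypothetical and exists only for this illustration.

```python
classes = [2.0, 4.0]  # assumed order of clf.classes_: benign, malignant

# First four rows of the test-set probabilities shown above, rounded
probabilities = [
    [0.9828, 0.0172],
    [0.7839, 0.2161],
    [0.9930, 0.0070],
    [0.2174, 0.7826],
]

def label_from_proba(row, classes):
    """Pick the class whose probability is the largest in the row."""
    best = max(range(len(row)), key=lambda i: row[i])
    return classes[best]

labels = [label_from_proba(row, classes) for row in probabilities]
print(labels)  # [2.0, 2.0, 2.0, 4.0]
```

The result agrees with the first four entries of the test predictions shown above.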


In [124]:
# 4. Compute the confusion matrix on both sets. Analyse the number of false negatives and false positives.
from sklearn.metrics import confusion_matrix

# rows are the true classes, columns the predicted classes (order: 2 = benign, 4 = malignant)
# train predict: [4. 2. 4. 2. 2. 2. 2. 4. 2. 2.]
# test predict: [2. 2. 2. 4. 2. 4. 2. 2. 2. 4.]

confusion_test = confusion_matrix(y_test, predict_test)
confusion_train = confusion_matrix(y_train, predict)
print(confusion_train)
print(confusion_test)

[[357   9]
 [  8 185]]
[[90  2]
 [ 3 45]]
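
To analyse the two matrices, it helps to name the four cells. With malignant (4) taken as the positive class, scikit-learn's layout is `[[TN, FP], [FN, TP]]`. The sketch below, in plain Python, reads the false negatives and false positives from the matrices printed above and derives accuracy, recall and precision; the helper `summarize` is a made-up name for this illustration.

```python
def summarize(matrix):
    """Unpack a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "false_positives": fp,   # benign samples predicted as malignant
        "false_negatives": fn,   # malignant samples predicted as benign
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

train = summarize([[357, 9], [8, 185]])
test = summarize([[90, 2], [3, 45]])
print(train)
print(test)
```

On the test set this gives 3 false negatives and 2 false positives. In a medical setting the false negatives are usually the more costly error, since they correspond to malignant cases reported as benign.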
