# Case Study for Breast Cancer Detection Using Logistic Regression

In this case study, we learn how to detect breast cancer by applying a logistic regression model to a real-world dataset and predicting whether a tumor is benign (not breast cancer) or malignant (breast cancer) based on its characteristics. The dataset is taken from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29

On the website, we can learn about the different characteristics of the data.

In [24]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
# Show both images side by side; a second imshow on the same axes would hide the first
img_1 = mpimg.imread('image_1.png')
img_2 = mpimg.imread('image_2.png')
fig, axes = plt.subplots(1, 2)
axes[0].imshow(img_1)
axes[1].imshow(img_2)
plt.show()

In [23]:
from IPython.display import Image
Image(url="image_2.png", width=300, height=400)

This is a multivariate dataset, which means that it has more than one independent variable. We will use these independent variables to predict the dependent variable: whether the tumor is benign or malignant.

Attribute Characteristics: Integer. This means that the independent variables are integer-valued.

Associated Tasks: Classification. This is because we will be predicting one of two classes, benign or malignant, rather than a continuous value.

Number of Instances: 699. This is the number of observations in the dataset. We will use the relationships between the attributes of each observation to tell whether the corresponding tumor is benign or malignant.

The repository page also notes that the dataset contains some missing values.
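As a minimal sketch of how missing values could be counted and dropped with pandas, the snippet below assumes that missing entries are marked with `?`, as in the original repository file. The first two rows are taken from the preview of this dataset (restricted to three columns for brevity); the third row, with sample code 9999999, is hypothetical and exists only to show a missing entry.

```python
import io
import pandas as pd

# Two rows from the dataset preview plus one HYPOTHETICAL row (9999999)
# whose Bare Nuclei value is marked as missing with "?".
raw = """Sample code number,Clump Thickness,Bare Nuclei,Class
1000025,5,1,2
1002945,5,10,2
9999999,4,?,2
"""

# na_values="?" tells pandas to treat the "?" marker as a missing value
df = pd.read_csv(io.StringIO(raw), na_values="?")

# Count the missing entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing entries
complete_rows = df.dropna()
print(len(df), "rows before,", len(complete_rows), "rows after dropping incomplete rows")
```

Dropping incomplete rows is only one option; whether it is appropriate depends on how many rows are affected.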

The dataset is dated 1992-07-15, so it is quite old.

## Importing the Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [15]:
dataset = pd.read_csv('breast_cancer.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

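The last column in the dataset, Class, indicates whether the tumor is benign or malignant (the dataset documentation codes these as 2 and 4, respectively), so we store it in a new variable "y". The first column, the sample code number, is only an identifier, so "X" holds the feature columns in between.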

In [16]:
dataset.head()

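| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |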
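To make the `iloc` slicing from the import cell concrete, the sketch below applies the same two slices to the five preview rows above, loaded from an inline string instead of `breast_cancer.csv`. The names `preview`, `X_preview`, and `y_preview` are introduced here only for illustration.

```python
import io
import pandas as pd

# The five preview rows, in the same column order as the dataset
preview_csv = """Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
"""
preview = pd.read_csv(io.StringIO(preview_csv))

# Columns 1 to second-to-last: skips the sample code number and the Class column
X_preview = preview.iloc[:, 1:-1].values
# Last column only: the Class label
y_preview = preview.iloc[:, -1].values

print(X_preview.shape)  # 5 rows, 9 feature columns
print(y_preview)        # the Class value of each preview row
```

The slice `1:-1` is what keeps the identifier column out of the features, so the model sees nine inputs per row.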


## Splitting the dataset into the Training Set and Test Set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Training the Logistic Regression Model on the Training Set

In [18]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## Predicting the Test Set Results

In [19]:
y_pred = classifier.predict(X_test)

#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

## Making the Confusion Matrix

In [20]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[84  3]
 [ 3 47]]
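The matrix above can be read as follows: in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, with labels in sorted order. Assuming the class codes are 2 for benign and 4 for malignant, as in the dataset documentation, the first row is the benign tumors and the second row is the malignant ones. The sketch below recomputes a few summary numbers from the printed matrix; treating malignant as the "positive" class is a choice made here for illustration.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted class
cm = np.array([[84, 3],
               [3, 47]])

# Unpack with malignant as the positive class (assumed label order: benign, malignant)
tn, fp, fn, tp = cm.ravel()

total = cm.sum()                  # number of test samples
accuracy = (tn + tp) / total      # share of all test samples predicted correctly
sensitivity = tp / (tp + fn)      # share of malignant tumors that were caught
specificity = tn / (tn + fp)      # share of benign tumors correctly cleared

print("Test samples:", total)
print("Accuracy: {:.4f}".format(accuracy))
print("Sensitivity: {:.4f}".format(sensitivity))
print("Specificity: {:.4f}".format(specificity))
```

The three false negatives (malignant tumors predicted as benign) are the errors that matter most in a screening setting, which is why sensitivity is worth checking alongside accuracy.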


## Finding the Accuracy on the Test Set

In [21]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(accuracy * 100, '%')

95.62043795620438 %


## Computing the accuracy with k-Fold Cross Validation

In [25]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train , cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.70 %
Standard Deviation: 1.97 %


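Thus, 10-fold cross-validation on the training set gives a mean accuracy of 96.70% with a standard deviation of 1.97%. The mean is close to the 95.62% measured on the test set, and the small standard deviation shows that the accuracy varies little from fold to fold.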
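To show what `cross_val_score` does with `cv = 10`, here is a minimal sketch that runs the folds by hand and compares the result with the library call. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, which is what `StratifiedKFold(n_splits=10)` reproduces. The data here is synthetic, generated with `make_classification`, and the names `X_demo`, `y_demo`, and `model` are placeholders, so the scores will not match the ones reported for the breast cancer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with nine features (NOT the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
model = LogisticRegression(random_state=0)

# Manual 10-fold loop: fit on nine folds, score accuracy on the held-out fold
fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
fold_scores = np.array(fold_scores)

# The same computation done by the library helper
reference = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=10)

print("Manual mean accuracy: {:.2f} %".format(fold_scores.mean() * 100))
print("Manual standard deviation: {:.2f} %".format(fold_scores.std() * 100))
print("Matches cross_val_score:", np.allclose(fold_scores, reference))
```

Each sample is used for validation exactly once, so the mean summarizes ten separate train-and-evaluate runs rather than a single split.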

To predict the result for a specific person, we need that person's Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses values, in the same order as the columns used for training. We might not know what these terms mean, but we know that they can be used to predict whether a person has breast cancer. Below is a template of the code we would use to make a prediction for a specific person.

In [28]:
# Replace each placeholder with the person's measured value before uncommenting.
# classifier.predict([[clump_thickness, uniformity_of_cell_size, uniformity_of_cell_shape, marginal_adhesion, single_epithelial_cell_size, bare_nuclei, bland_chromatin, normal_nucleoli, mitoses]])
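The model was trained on the nine feature columns shown in the dataset preview, so a prediction needs all nine values in that same order. One way to avoid ordering mistakes is a small helper that builds the input row from named measurements. The helper `build_feature_row` and the constant `FEATURE_ORDER` are hypothetical names introduced for this sketch; the example values are those of the first preview row (sample 1000025).

```python
# Feature columns in the order used for training (columns 1 to second-to-last)
FEATURE_ORDER = [
    "Clump Thickness",
    "Uniformity of Cell Size",
    "Uniformity of Cell Shape",
    "Marginal Adhesion",
    "Single Epithelial Cell Size",
    "Bare Nuclei",
    "Bland Chromatin",
    "Normal Nucleoli",
    "Mitoses",
]

def build_feature_row(measurements):
    """Return a one-row 2D list in training column order, the shape predict() expects."""
    missing = [name for name in FEATURE_ORDER if name not in measurements]
    if missing:
        raise ValueError("Missing measurements: {}".format(missing))
    return [[measurements[name] for name in FEATURE_ORDER]]

# Values of the first preview row of the dataset
first_preview_row = {
    "Clump Thickness": 5,
    "Uniformity of Cell Size": 1,
    "Uniformity of Cell Shape": 1,
    "Marginal Adhesion": 1,
    "Single Epithelial Cell Size": 2,
    "Bare Nuclei": 1,
    "Bland Chromatin": 3,
    "Normal Nucleoli": 1,
    "Mitoses": 1,
}

row = build_feature_row(first_preview_row)
print(row)
# With the fitted classifier from this notebook, the call would be:
# classifier.predict(row)
```

Passing the row to `classifier.predict` returns the predicted Class code for that person, and leaving out any of the nine values raises an error instead of silently shifting the columns.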

This concludes the breast cancer case study.