# Predict breast cancer


Predict the class of breast cancer (malignant or ‘bad’ versus benign or ‘good’) from the features of images taken from breast samples. Ten biological attributes of the cancer cell nuclei have been calculated from the images, as described below:

| Attribute | Domain |
|---|---|
| 1. Sample code number | id number |
| 2. Clump Thickness | 1 - 10 |
| 3. Uniformity of Cell Size | 1 - 10 |
| 4. Uniformity of Cell Shape | 1 - 10 |
| 5. Marginal Adhesion | 1 - 10 |
| 6. Single Epithelial Cell Size | 1 - 10 |
| 7. Bare Nuclei | 1 - 10 |
| 8. Bland Chromatin | 1 - 10 |
| 9. Normal Nucleoli | 1 - 10 |
| 10. Mitoses | 1 - 10 |
| 11. Class | (2 for benign, 4 for malignant) |

Questions:

    1. What are the factors that predict malignant cancer? (i.e. which variables significantly predict malignancy, p < 0.05)
    2. Create a classification report and confusion matrix of predicted and observed values. What is the accuracy, precision, recall and F1-score of the model on the (a) training and (b) test data?
    3. Plot a Receiver Operating Characteristic (ROC) curve on the test data.
    4. What is overdispersion?


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import os
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
cancer_prediction = pd.read_csv("cancer.data")

In [3]:
cancer_prediction.head(10)

Unnamed: 0,1000025,5,1,1.1,1.2,2,1.3,3,1.4,1.5,2.1
0,1002945,5,4,4,5,7,10,3,2,1,2
1,1015425,3,1,1,1,2,2,3,1,1,2
2,1016277,6,8,8,1,3,4,3,7,1,2
3,1017023,4,1,1,3,2,1,3,1,1,2
4,1017122,8,10,10,8,7,10,9,7,1,4
5,1018099,1,1,1,1,2,10,3,1,1,2
6,1018561,2,1,2,1,2,1,3,1,1,2
7,1033078,2,1,1,1,2,1,1,1,5,2
8,1033078,4,2,1,1,2,1,2,1,1,2
9,1035283,1,1,1,1,1,1,3,1,1,2


In [4]:
cancer_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   1000025  698 non-null    int64 
 1   5        698 non-null    int64 
 2   1        698 non-null    int64 
 3   1.1      698 non-null    int64 
 4   1.2      698 non-null    int64 
 5   2        698 non-null    int64 
 6   1.3      698 non-null    object
 7   3        698 non-null    int64 
 8   1.4      698 non-null    int64 
 9   1.5      698 non-null    int64 
 10  2.1      698 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


We can see that there are 698 rows and that column 6 (labelled 1.3) has the object dtype, while every other column is int64.
We will change that later; for now we will rename the columns.
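
The numeric column names in the output above suggest that cancer.data has no header row, so `read_csv` used the first record as the header (pandas then renamed the repeated values to 1.1, 1.2 and so on). A minimal sketch of one way to avoid this is to pass `header=None` with explicit `names`. The three sample rows below are reconstructed from the output above and stand in for the real file, and the `column_names` list is an assumption based on the attribute table.

```python
import io

import pandas as pd

# Column names taken from the attribute table (assumed to match the file's column order).
column_names = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
    "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class",
]

# Stand-in for cancer.data: the first three records, with no header line.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
)

# header=None keeps the first record as data; names supplies the labels.
df = pd.read_csv(sample, header=None, names=column_names)
print(df.shape)
print(df.head())
```

Loaded this way, the real file would keep its first record as data, giving one more row than the 698 reported above, and the separate rename step would no longer be needed.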

In [5]:
cancer_prediction.rename(inplace=True, columns={
    '1000025':'Sample code number',
    '5':'Clump Thickness',
    '1':'Uniformity of Cell Size',
    '1.1':'Uniformity of Cell Shape',
    '1.2':'Marginal Adhesion',
    '2':'Single Epithelial Cell Size',
    '1.3':'Bare Nuclei',
    '3':'Bland Chromatin',
    '1.4':'Normal Nucleoli',
    '1.5':'Mitoses',
    '2.1':'Class'
})   

Now that the columns have been renamed, we look at the object values in column 6 (Bare Nuclei) and remove every row that has a missing value. In this dataset a missing value is represented by '?'.

In [35]:
cancer_prediction = cancer_prediction[~cancer_prediction["Bare Nuclei"].isin(['?'])]

In [36]:
cancer_prediction = cancer_prediction.astype(object).astype(float)

We have now converted column 6 (Bare Nuclei), along with every other column, to float. The conversion was needed because the '?' placeholder is a string, which made pandas read the whole column as object even though its remaining values are integers.
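
As an alternative to filtering on '?' and then casting, `pd.to_numeric` with `errors="coerce"` turns any non-numeric entry into NaN, which makes the missing rows easy to count before dropping them. This is a sketch on a small hypothetical frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy rows; "?" marks a missing Bare Nuclei value as in the dataset.
df = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "4"],
    "Class": [2, 4, 2, 4],
})

# Non-numeric strings such as "?" become NaN instead of raising an error.
df["Bare Nuclei"] = pd.to_numeric(df["Bare Nuclei"], errors="coerce")

n_missing = int(df["Bare Nuclei"].isna().sum())  # rows that held "?"
clean = df.dropna(subset=["Bare Nuclei"])        # keep complete rows only

print(n_missing)
print(clean.dtypes)
```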

In [8]:
cancer_prediction.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1002945.0,5.0,4.0,4.0,5.0,7.0,10.0,3.0,2.0,1.0,2.0
1,1015425.0,3.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0
2,1016277.0,6.0,8.0,8.0,1.0,3.0,4.0,3.0,7.0,1.0,2.0
3,1017023.0,4.0,1.0,1.0,3.0,2.0,1.0,3.0,1.0,1.0,2.0
4,1017122.0,8.0,10.0,10.0,8.0,7.0,10.0,9.0,7.0,1.0,4.0


In [19]:
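# Recode the target: 2 (benign) -> 0, 4 (malignant) -> 1.
# Assigning the result avoids relying on an in-place change to a column selection.
cancer_prediction['Class'] = cancer_prediction['Class'].replace({2: 0, 4: 1})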

In [21]:
cancer_prediction.describe()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0
mean,1076833.0,4.441349,3.153959,3.218475,2.832845,3.23607,3.548387,3.445748,2.872434,1.604106,0.35044
std,621092.6,2.822751,3.066285,2.989568,2.865805,2.224214,3.645226,2.451435,3.054065,1.733792,0.477458
min,63375.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,877454.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0
50%,1171820.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,0.0
75%,1238741.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,1.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0


In [27]:
cancer_prediction = cancer_prediction[~cancer_prediction["Bare Nuclei"].isin(['?'])]

In [28]:
cancer_prediction = cancer_prediction.astype(float)

In [33]:
cancer_prediction.drop(['Sample code number'], axis=1)

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5.0,4.0,4.0,5.0,7.0,10.0,3.0,2.0,1.0,0.0
1,3.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0,0.0
2,6.0,8.0,8.0,1.0,3.0,4.0,3.0,7.0,1.0,0.0
3,4.0,1.0,1.0,3.0,2.0,1.0,3.0,1.0,1.0,0.0
4,8.0,10.0,10.0,8.0,7.0,10.0,9.0,7.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
693,3.0,1.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0,0.0
694,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,0.0
695,5.0,10.0,10.0,3.0,7.0,3.0,8.0,10.0,2.0,1.0
696,4.0,8.0,6.0,4.0,3.0,4.0,10.0,6.0,1.0,1.0


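We drop the Sample code number column from this view because it is only an identifier and carries no predictive information. Note that drop returns a new DataFrame and the result is not assigned here, so cancer_prediction itself still contains the column.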

In [37]:
cancer_prediction.groupby(by='Class').size()

Class
0.0    443
1.0    239
dtype: int64

Grouping the data by the Class column shows that most observations are benign (443). We will focus on the 239 malignant cases, which make up 35% of the dataset.

The total of 682 rows is lower than the 698 rows loaded originally because we have eliminated every row containing '?'. This means that a total of 16 rows have been removed.
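
The arithmetic behind these figures can be checked directly from the counts printed above (443 benign, 239 malignant, 698 rows loaded):

```python
# Counts copied from the groupby and info outputs above.
class_counts = {"benign": 443, "malignant": 239}
rows_loaded = 698

rows_kept = sum(class_counts.values())                  # rows left after removing '?'
rows_removed = rows_loaded - rows_kept                  # rows that contained '?'
malignant_share = class_counts["malignant"] / rows_kept

print(rows_kept, rows_removed, f"{malignant_share:.1%}")
```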

In [38]:
cancer_malignant = cancer_prediction[cancer_prediction.Class == 1]
cancer_malignant.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239 entries, 4 to 697
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Sample code number           239 non-null    float64
 1   Clump Thickness              239 non-null    float64
 2   Uniformity of Cell Size      239 non-null    float64
 3   Uniformity of Cell Shape     239 non-null    float64
 4   Marginal Adhesion            239 non-null    float64
 5   Single Epithelial Cell Size  239 non-null    float64
 6   Bare Nuclei                  239 non-null    float64
 7   Bland Chromatin              239 non-null    float64
 8   Normal Nucleoli              239 non-null    float64
 9   Mitoses                      239 non-null    float64
 10  Class                        239 non-null    float64
dtypes: float64(11)
memory usage: 22.4 KB


In [14]:
cancer_malignant.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class


I used subsetting to create another table that includes only the malignant cases. The info output shows 239 rows, which matches the number of malignant cases found when grouping by the Class column.

In [15]:
cancer_malignant.describe()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,,,,,,,,,,
std,,,,,,,,,,,
min,,,,,,,,,,,
25%,,,,,,,,,,,
50%,,,,,,,,,,,
75%,,,,,,,,,,,
max,,,,,,,,,,,


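From the describe() summary, the attribute with the lowest mean is Mitoses (2.58). Note that the output shown above is empty (count 0), most likely because the cell was last run before Class was recoded to 0/1, so the filter Class == 1 matched no rows at that point. A low Mitoses mean within the malignant cases does not by itself show that lower Mitoses makes malignancy more probable; that would require comparing the malignant cases against the benign ones.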

In [16]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

X = cancer_malignant['Clump Thickness']
y = cancer_malignant.drop(['Sample code number', 'Clump Thickness'], axis=1)

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

ValueError: zero-size array to reduction operation maximum which has no identity

In [None]:
x = cancer_malignant['Clump Thickness']
y = cancer_malignant.drop(['Sample code number', 'Clump Thickness'], axis=1)


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = cancer_malignant['Clump Thickness']
y = cancer_malignant.drop(['Sample code number', 'Clump Thickness'], axis=1)

X = X.to_numpy().reshape(-1, 1)  # a Series has no reshape(); convert to a 2-D NumPy array
poly = PolynomialFeatures(degree=2)
poly_data = poly.fit_transform(X)
model = LinearRegression()
model.fit(poly_data,y)
coef = model.coef_
intercept = model.intercept_

In [None]:
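plt.scatter(X, y, color='red', label='Original')
# Reuse the already-fitted transformer instead of refitting it on the same data
plt.plot(X, model.predict(poly.transform(X)), color='blue', label='Prediction')
plt.legend()  # labels are attached to each artist, so they cannot be swapped
plt.show()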

In [None]:
cancer_malignant.mean()  # column-wise means of the malignant subset

In [None]:
cancer_malignant['Clump Thickness']

In [None]:
sns.pairplot(cancer_malignant, hue='Class', height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
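
Questions 2 and 3 ask for a classification report, a confusion matrix and a ROC curve. The sketch below shows one way to produce them with scikit-learn, using the `LogisticRegression` and `train_test_split` imports from the first cell. It runs on synthetic data; the simulated features, the 70/30 split and the helper `plot_roc` are assumptions for illustration, not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic 1-10 scores for three hypothetical attributes.
X = rng.integers(1, 11, size=(n, 3)).astype(float)
log_odds = -6.0 + 0.7 * X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Stratify so both splits keep the same benign/malignant ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy, precision, recall and F1 on (a) training and (b) test data.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    print(name)
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred, target_names=["benign", "malignant"]))

# ROC curve on the test data, from the predicted probability of class 1.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = roc_auc_score(y_test, scores)


def plot_roc(fpr, tpr, auc_value):
    """Draw the ROC curve with the chance diagonal for reference."""
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"ROC (AUC = {auc_value:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

Calling `plot_roc(fpr, tpr, roc_auc)` draws the curve. On the real data, `X` and `y` would come from cancer_prediction instead of the simulated arrays.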

### Overdispersion

Overdispersion occurs when the variation observed in the data is greater than the variation the model anticipates, for example greater than the variance implied by a binomial or Poisson model. This can happen when significant predictors are missing from the model.
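
One common check is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom. Values near 1 are consistent with the assumed variance, and values well above 1 suggest overdispersion. The sketch below applies it to hypothetical count data under a constant-mean Poisson model, where the variance is assumed equal to the mean. The check is meaningful for counts or grouped binomial data; for an ungrouped 0/1 outcome such as Class, the variance is fixed by the mean, so it cannot be assessed this way.

```python
from statistics import mean


def pearson_dispersion(counts):
    """Pearson chi-square / degrees of freedom for a constant-mean Poisson model."""
    mu = mean(counts)                                # fitted mean (the only parameter)
    chi2 = sum((y - mu) ** 2 / mu for y in counts)   # Poisson variance equals mu
    return chi2 / (len(counts) - 1)                  # n observations minus 1 parameter


# Hypothetical counts with the same mean but very different spread.
even_counts = [4, 5, 6, 5, 4, 6, 5, 5]
spread_counts = [0, 1, 14, 2, 0, 12, 1, 10]

print(pearson_dispersion(even_counts))
print(pearson_dispersion(spread_counts))
```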