
## Analysis of K-means Clustering on Wisconsin Breast Cancer Data

Phase 3.0 | 2018 December 2

Bill Screen, Ha-Lan Nguyen, Tarun Rawat | Indiana University | M.S. Data Science

#### PROBLEM STATEMENT: 
Breast cancer is a growing concern among women. A cancer’s stage is a crucial factor in deciding which treatment options to recommend and in determining the patient’s prognosis. In the United States today, a woman's lifetime risk of developing breast cancer is approximately one in eight. An analysis of the most recent data has shown that the survival rate is 88% five years after diagnosis and 80% ten years after diagnosis. With early detection and treatment, this type of cancer can go into remission. In that case, the worst fear of a cancer patient is recurrence of the cancer.

#### OBJECTIVE: 
This report demonstrates how a "k-means" algorithm can be used to separate benign and malignant cells into two groups.

#### Import Libraries

In [0]:
# Import Libraries

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statistics import stdev
from statistics import mean
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#### Load Data

In [0]:
# Load data file into pandas Dataframe
df = pd.read_csv('breast-cancer-wisconsin.csv')

# Inspect data
print(df.sample(n=10))

         Scn  A2  A3  A4  A5  A6  A7  A8  A9  A10  CLASS
78   1133136   3   1   1   1   2   3   3   1    1      2
459  1267898   5   1   3   1   2   1   1   1    1      2
99   1166630   7   5   6  10   5  10   7   9    4      4
120  1174057   1   1   2   2   2   1   3   1    1      2
498  1204558   4   1   1   1   2   1   2   1    1      2
24   1059552   1   1   1   1   2   1   3   1    1      2
509  1297522   2   1   1   1   2   1   1   1    1      2
219  1223967   6   1   3   1   2   1   3   1    1      2
103  1168359   8   2   3   1   6   3   7   1    1      4
409  1237674   3   1   2   1   2   1   2   1    1      2


#### Impute missing values

In [0]:
# Set NA character
na_value_char = '?' 

# Check the number of NaN values in the DataFrame before replacement of ? with NaN
print('\n Column A7 contains {0} NaN rows before replacement'.format(df['A7'].isnull().sum()))

# Replace ? with NaN in column A7 (assign the result back instead of using inplace on a column)
df['A7'] = df['A7'].replace(na_value_char, np.nan)

# Convert column A7 back to numeric
df['A7'] = pd.to_numeric(df['A7'])

# Check the number of NaN values in the DataFrame after replacement of ? with NaN
print('\n Column A7 contains {0} NaN rows after replacement'.format(df['A7'].isnull().sum()))

# Replace NaN values in each column with that column's mean
df.fillna(df.mean(skipna=True), inplace=True)

# Convert all columns of the DataFrame to numeric types
df = df.apply(pd.to_numeric)

# Check the number of NaN values in the DataFrame after replacement of NaN with Mean
print('\n Column A7 contains {0} NaN rows after replacement of NaN with Mean'.format(df['A7'].isnull().sum()))


 Column A7 contains 0 NaN rows before replacement

 Column A7 contains 16 NaN rows after replacement

 Column A7 contains 0 NaN rows after replacement of NaN with Mean
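
The same replace-then-fill idea can be sketched more compactly with `pd.to_numeric(..., errors='coerce')`, which turns any non-numeric entry into NaN in one step. The small `toy` frame below is a hypothetical stand-in for column A7, not the real data.

```python
import pandas as pd

# Hypothetical column that uses '?' as the missing-value marker (values read as strings)
toy = pd.DataFrame({'A7': ['1', '10', '?', '4', '?', '5']})

# errors='coerce' converts anything non-numeric, including '?', to NaN
toy['A7'] = pd.to_numeric(toy['A7'], errors='coerce')
n_missing = int(toy['A7'].isnull().sum())
print('Missing values after conversion:', n_missing)

# Fill the gaps with the mean of the observed values (1, 10, 4, 5)
toy['A7'] = toy['A7'].fillna(toy['A7'].mean())
print(toy['A7'].tolist())
```

Another option is to pass `na_values='?'` to `pd.read_csv`, so the marker is treated as missing when the file is loaded.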


#### Use KMeans algorithm

In [0]:
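# Use only columns A2-CLASS
# Note: CLASS stays in this frame, so the fit below also sees the true label as a feature
df = df.loc[:, 'A2':'CLASS']

# Use KMeans algorithm
# Set max_iter to 500, n_init (number of runs with different initial centroids) to 20, and n_clusters to 2
kmeans = KMeans(n_init=20, max_iter=500, n_clusters=2)

# Fit model to the data
kmeans.fit(df)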

# Predict labels
# Save predicted clusters into a variable labels
labels = kmeans.predict(df)

# Create a new column in the dataframe and add the KMeans labels.
df['kmeans_labels'] = labels

# Change label values so that 0 becomes 2 and 1 becomes 4.
# Cluster ids are arbitrary, so this assumes cluster 0 is the benign group in this run.
df['kmeans_labels'] = df['kmeans_labels'].replace({0: 2, 1: 4})

# Print the first 15 records from your dataframe
print(df.head(15))

# Count how many 2 and 4 values in KMeans label columns.
print('\n Distinct counts for CLASS')
print(df['CLASS'].value_counts())
print('\n Distinct counts for kmeans_labels')
print(df['kmeans_labels'].value_counts())

# Count the CLASS values within each kmeans_labels group (e.g. rows labelled 2 whose CLASS is 4)
print('\n CLASS counts within each kmeans_labels group')
labels_class = df.groupby('kmeans_labels')['CLASS'].value_counts()
print(labels_class)


    A2  A3  A4  A5  A6    A7  A8  A9  A10  CLASS  kmeans_labels
0    5   1   1   1   2   1.0   3   1    1      2              2
1    5   4   4   5   7  10.0   3   2    1      2              4
2    3   1   1   1   2   2.0   3   1    1      2              2
3    6   8   8   1   3   4.0   3   7    1      2              4
4    4   1   1   3   2   1.0   3   1    1      2              2
5    8  10  10   8   7  10.0   9   7    1      4              4
6    1   1   1   1   2  10.0   3   1    1      2              2
7    2   1   2   1   2   1.0   3   1    1      2              2
8    2   1   1   1   2   1.0   1   1    5      2              2
9    4   2   1   1   2   1.0   2   1    1      2              2
10   1   1   1   1   1   1.0   3   1    1      2              2
11   2   1   1   1   2   1.0   2   1    1      2              2
12   5   3   3   3   2   3.0   4   4    1      4              2
13   1   1   1   1   2   3.0   3   1    1      2              2
14   8   7   5  10   7   9.0   5   5    
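
The relabelling step above hard-codes 0 as benign and 1 as malignant, but K-means numbers its clusters arbitrarily, so the mapping can flip between runs. The sketch below shows one way to derive the mapping from the data by giving each cluster the most common CLASS among its members. It also fits on the feature columns only. The `toy` frame, its three columns, and `random_state=0` are assumptions made for illustration; this is not the code that produced the output shown here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups on the 1-10 scoring scale
rng = np.random.default_rng(0)
benign = rng.integers(1, 4, size=(30, 3))      # low scores, 1..3
malignant = rng.integers(7, 11, size=(20, 3))  # high scores, 7..10
toy = pd.DataFrame(np.vstack([benign, malignant]), columns=['A2', 'A3', 'A4'])
toy['CLASS'] = [2] * 30 + [4] * 20

# Fit on the feature columns only, so CLASS is not used as an input
features = toy.loc[:, 'A2':'A4']
kmeans = KMeans(n_clusters=2, n_init=20, max_iter=500, random_state=0)
cluster_ids = pd.Series(kmeans.fit_predict(features), index=toy.index)

# Map each cluster id to the most common CLASS value among its members
cluster_to_class = (
    toy['CLASS'].groupby(cluster_ids).agg(lambda s: s.mode().iloc[0]).to_dict()
)
toy['kmeans_labels'] = cluster_ids.map(cluster_to_class)
print(cluster_to_class)
```

Because the mapping is learned from the cluster contents, the labels stay comparable with CLASS whichever cluster K-means happens to call 0.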

#### Error Rate Function

In [0]:
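def ErrorRate(column_labels, column_class):
    """Return (benign, malignant, total) error rates, each rounded to 2 decimals.

    Benign rate: CLASS 2 rows labelled 4, divided by the number of rows labelled 2.
    Malignant rate: CLASS 4 rows labelled 2, divided by the number of rows labelled 4.
    Total rate: rows whose label differs from CLASS, divided by all rows.
    """
    # create a temp dataframe to be able to perform different calculations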
    temp_df = pd.DataFrame()
    temp_df['column_labels'] = column_labels
    temp_df['column_class'] = column_class
    
    # Calculate the counts used in the formulas
    count_label_2 = len(temp_df[temp_df['column_labels']==2])
    count_label_4 = len(temp_df[temp_df['column_labels']==4])
    count_data_points = len(temp_df)
    
    count_label_4_class_2 = len(temp_df[ (temp_df['column_labels']==4) & (temp_df['column_class']==2) ])
    count_label_2_class_4 = len(temp_df[ (temp_df['column_labels']==2) & (temp_df['column_class']==4) ])
    count_label_not_equal_class = len(temp_df[ (temp_df['column_labels']!=temp_df['column_class']) ])
    
    # Use calculated values in formulae to calculate error
    error_rate_benign_cells = round(count_label_4_class_2/count_label_2,2)
    error_rate_malign_cells = round(count_label_2_class_4/count_label_4,2)
    total_error_rate = round(count_label_not_equal_class/count_data_points,2)
    
    return error_rate_benign_cells, error_rate_malign_cells, total_error_rate
    
# Call the ErrorRate function and store values in variables    
error_rate_benign_cells, error_rate_malign_cells, total_error_rate = ErrorRate(df['kmeans_labels'],df['CLASS'])

# Print the error rates
print('Error Rates ')

print('Error Rate for Benign : '+str(error_rate_benign_cells))
print('Error Rate for Malign : '+str(error_rate_malign_cells))
print('Total Error Rate : '+str(total_error_rate))

Error Rates 
Error Rate for Benign : 0.02
Error Rate for Malign : 0.06
Total Error Rate : 0.04
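
As an alternative view of the same calculation, the counts can be read from a cross-tabulation of labels against classes. This is a minimal sketch on a hypothetical eight-row frame; it uses the same denominators as `ErrorRate` (the number of rows carrying each K-means label) and skips the rounding.

```python
import pandas as pd

# Hypothetical labels for eight samples (2 = benign, 4 = malignant)
toy = pd.DataFrame({
    'CLASS':         [2, 2, 2, 2, 2, 4, 4, 4],
    'kmeans_labels': [2, 2, 2, 2, 4, 4, 4, 2],
})

# Rows are K-means labels, columns are true classes
table = pd.crosstab(toy['kmeans_labels'], toy['CLASS'])
print(table)

label_2_total = table.loc[2].sum()  # rows labelled benign
label_4_total = table.loc[4].sum()  # rows labelled malignant

# table.loc[row_label, column_class]
error_rate_benign = table.loc[4, 2] / label_2_total     # CLASS 2 labelled 4, over label-2 count
error_rate_malignant = table.loc[2, 4] / label_4_total  # CLASS 4 labelled 2, over label-4 count
total_error_rate = (table.loc[4, 2] + table.loc[2, 4]) / len(toy)
```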


#### Report

Using the K-Means algorithm, 462 samples were labelled benign and 237 malignant, whereas the actual counts are 458 benign and 241 malignant. The algorithm labelled 15 actually malignant samples as benign and 11 actually benign samples as malignant.

The total error rate with the K-Means algorithm is 0.04. The error rate is 0.02 for benign samples and 0.06 for malignant samples. We consider these error rates low, and the model good enough to classify samples as benign or malignant. The model appears more reliable for benign samples (error rate 0.02) than for malignant samples (error rate 0.06).
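
The three rates can be reproduced by hand from the counts quoted in this report, using the same formulas as the `ErrorRate` function. The variable names below are introduced only for this check.

```python
# Counts quoted in the report
labelled_benign = 462      # samples K-means labelled 2
labelled_malignant = 237   # samples K-means labelled 4
benign_as_malignant = 11   # CLASS 2 but labelled 4
malignant_as_benign = 15   # CLASS 4 but labelled 2
total_samples = labelled_benign + labelled_malignant  # 699

error_rate_benign = round(benign_as_malignant / labelled_benign, 2)
error_rate_malignant = round(malignant_as_benign / labelled_malignant, 2)
total_error_rate = round(
    (benign_as_malignant + malignant_as_benign) / total_samples, 2
)

print(error_rate_benign, error_rate_malignant, total_error_rate)  # 0.02 0.06 0.04
```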

These results come from the training step only. To see how well the model works, we also need a testing step that applies it to additional samples with known outcomes and checks how it classifies them. If the error rates on those samples stay within the rates reported here, the model is good and we can use it.
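
One possible shape for such a testing step is sketched below: hold out part of the labelled data, fit K-means on the rest, and score the held-out samples. The data here is a hypothetical stand-in with two well-separated groups, and the 70/30 split, `random_state` values, and majority-vote mapping are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 60 low-scoring (class 2) and 40 high-scoring (class 4) samples
rng = np.random.default_rng(1)
X = np.vstack([rng.integers(1, 4, size=(60, 9)), rng.integers(7, 11, size=(40, 9))])
y = np.array([2] * 60 + [4] * 40)

# Hold out 30% of the samples; stratify keeps the class mix similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Training step: fit the clusters and learn which cluster id matches which class
kmeans = KMeans(n_clusters=2, n_init=20, max_iter=500, random_state=0).fit(X_train)
cluster_to_class = {}
for cluster_id in range(2):
    members = y_train[kmeans.labels_ == cluster_id]
    values, counts = np.unique(members, return_counts=True)
    cluster_to_class[cluster_id] = values[np.argmax(counts)]

# Testing step: assign held-out samples to the nearest centroid and score them
test_labels = np.array([cluster_to_class[c] for c in kmeans.predict(X_test)])
test_error_rate = float(np.mean(test_labels != y_test))
print('Held-out total error rate:', round(test_error_rate, 2))
```

On real data the held-out error rate would then be compared with the training error rates reported above.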