
## Analysis of K-means Clustering on Wisconsin Breast Cancer Data

Phase 2.0 | 2018 November 13

Bill Screen, Ha-Lan Nguyen, Tarun Rawat | Indiana University | M.S. Data Science

#### PROBLEM STATEMENT: 
Breast cancer is a growing health concern among women. A cancer’s stage is a crucial factor in deciding which treatment options to recommend and in determining the patient’s prognosis. In the United States today, a woman’s lifetime risk of developing breast cancer is approximately one in eight. Recent data show a survival rate of 88% five years after diagnosis and 80% ten years after diagnosis. With early detection and treatment, this type of cancer may go into remission. In that case, the patient’s worst fear is that the cancer will recur.

#### OBJECTIVE: 
This report demonstrates how a k-means algorithm can be used to separate benign and malignant cells into two groups.

#### Import Libraries

In [0]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statistics
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#### Load Data

In [0]:
# Load data file into pandas Dataframe
df = pd.read_csv('breast-cancer-wisconsin.csv')

# Inspect data
print(df.sample(n=10))

         Scn  A2  A3  A4  A5  A6  A7  A8  A9  A10  CLASS
324   740492   1   1   1   1   2   1   3   1    1      2
604   188336   5   3   2   8   5  10   8   1    2      4
133  1180523   3   1   1   1   2   1   2   2    1      2
233  1232225  10   4   5   5   5  10   4   1    1      4
396  1176187   3   1   1   1   2   1   3   1    1      2
63   1116132   6   3   4   1   5   2   3   9    1      4
465  1296572  10   9   8   7   6   4   7  10    3      4
54   1110524  10   5   5   6   8   8   7   1    1      4
42   1100524   6  10  10   2   8  10   7   3    3      4
491  1119189   5   8   9   4   3  10   7   1    1      4


#### Impute missing values

In [0]:
# Set NA character
na_value_char = '?' 

# Count NaN values in column A7 before replacing ? with NaN
# (0 is expected here: missing entries are still stored as the string '?')
print('\n Column A7 contains {0} NaN rows before replacement'.format(df['A7'].isnull().sum()))

# Replace ? with NaN in column A7; assign the result back instead of
# calling replace(inplace=True) on a column selection
df['A7'] = df['A7'].replace(na_value_char, np.nan)

# Convert column A7 back to numeric
df['A7'] = pd.to_numeric(df['A7'])

# Check the number of NaN values in the DataFrame after replacement of ? with NaN
print('\n Column A7 contains {0} NaN rows after replacement'.format(df['A7'].isnull().sum()))

# Replace NaN values with the mean of their column (A7 is the column with missing values)
df.fillna(df.mean(skipna=True), inplace=True)

# Ensure every column of the DataFrame is numeric
df = df.apply(pd.to_numeric)

# Check the number of NaN values in the DataFrame after replacement of NaN with Mean
print('\n Column A7 contains {0} NaN rows after replacement of NaN with Mean'.format(df['A7'].isnull().sum()))


 Column A7 contains 0 NaN rows before replacement

 Column A7 contains 16 NaN rows after replacement

 Column A7 contains 0 NaN rows after replacement of NaN with Mean
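
The first count above is 0 because the `?` placeholders are read as strings, not as missing values. As an alternative sketch (not the approach used in this notebook), `pd.read_csv` accepts an `na_values` argument that parses the placeholder as NaN at load time. The three-row CSV below is hypothetical and only stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real data file
toy_csv = io.StringIO("A6,A7\n2,1\n5,?\n2,3\n")

# na_values tells read_csv to parse '?' as NaN, so A7 is numeric from the start
toy = pd.read_csv(toy_csv, na_values="?")
n_missing = int(toy["A7"].isnull().sum())

# Fill the missing entry with the column mean, as in the cell above
toy["A7"] = toy["A7"].fillna(toy["A7"].mean())
print(n_missing, toy["A7"].tolist())
```

With this option the separate replace and `pd.to_numeric` steps would no longer be needed, at the cost of hiding how many placeholders were present unless the count is taken right after loading.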


#### Use KMeans algorithm

In [0]:
# Use only columns A2-A10
data = df.loc[:, 'A2':'A10']

# Use KMeans algorithm
kmeans = KMeans(n_clusters=4)

# Fit the model to the data
kmeans.fit(data)
           
# Calculate centroids
centroids = kmeans.cluster_centers_

# Print centroids
print('\n Calculated centroids:\n {0}'.format(pd.DataFrame(centroids)))

# Validate that the centroids array is a 4 x 9 matrix
cluster_shape_assert = (centroids.shape[0] == kmeans.n_clusters and centroids.shape[1] == len(data.columns))
print('\nThe centroids array is a 4 x 9 matrix: {0}'.format(cluster_shape_assert))



 Calculated centroids:
           0         1         2         3         4         5         6  \
0  2.944934  1.244493  1.365639  1.292952  2.039648  1.351401  2.061674   
1  7.234043  4.851064  5.042553  4.861702  4.117021  9.382979  5.265957   
2  7.464789  7.028169  6.676056  4.197183  5.535211  3.453220  5.380282   
3  6.762500  8.387500  8.425000  7.750000  6.775000  9.212500  7.375000   

          7         8  
0  1.200441  1.077093  
1  3.787234  1.648936  
2  6.816901  2.309859  
3  7.737500  3.787500  

The centroids array is a 4 x 9 matrix: True
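
The cell above does not set `random_state`, so the centroid values and their order may differ between runs. The following is a minimal sketch, assuming synthetic data in place of columns A2-A10, of how a fixed seed and the `labels_` attribute could be used to get repeatable clusters and to see how many rows fall into each one. The names `toy_data`, `toy_kmeans`, and `cluster_sizes` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for columns A2-A10: 200 rows, 9 integer features in 1..10
rng = np.random.default_rng(0)
toy_data = rng.integers(1, 11, size=(200, 9)).astype(float)

# random_state fixes the centroid initialisation so reruns give the same clusters
toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(toy_data)

# labels_ holds the cluster index (0-3) assigned to each row
cluster_sizes = np.bincount(toy_kmeans.labels_, minlength=4)
print(toy_kmeans.cluster_centers_.shape, cluster_sizes)
```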


#### Find the optimal number of clusters

In [0]:
# Find the optimal number of clusters
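
This cell is still a placeholder. One common way to approach it is the elbow method: fit k-means for a range of k and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below is an assumption about how this step could look, not the authors' implementation, and it runs on synthetic data from `make_blobs` instead of the breast cancer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 9 features, standing in for columns A2-A10
X, _ = make_blobs(n_samples=300, centers=3, n_features=9, random_state=0)

# Fit one model per candidate k and record the within-cluster sum of squares
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Print k next to its inertia; the "elbow" is where the decrease levels off
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` with `matplotlib` usually makes the elbow easier to spot than the printed numbers.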

#### Revise data variation

In [0]:
# Revise data variation
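
This cell is also a placeholder. If "data variation" here refers to how spread out each feature is, a first look could compare per-column variance and standard deviation. The sketch below assumes that reading; the toy values are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy values; column names mirror the dataset, values do not
toy = pd.DataFrame({"A2": [1, 5, 10, 3], "A10": [1, 1, 2, 1]})

# Sample variance and standard deviation per column (pandas uses ddof=1 by default)
variances = toy.var()
std_devs = toy.std()
print(variances)
print(std_devs)
```

Features with much larger variance contribute more to the Euclidean distances that k-means minimises, which is the usual motivation for normalising before clustering.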

#### Implement normalization

In [0]:
# Implement normalization
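
This cell is a placeholder as well. Since `make_pipeline` and `StandardScaler` are already imported at the top of the notebook, one plausible shape for this step is a pipeline that standardises each feature to zero mean and unit variance before clustering. The sketch below is an assumption along those lines, run on synthetic data; `toy_data`, `pipeline`, and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for columns A2-A10: 200 rows, 9 integer features in 1..10
rng = np.random.default_rng(0)
toy_data = rng.integers(1, 11, size=(200, 9)).astype(float)

# Scale first, then cluster; the pipeline applies both steps in order
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(toy_data)

# Check the scaling step: each column should have mean ~0 and std ~1
scaled = pipeline.named_steps["standardscaler"].transform(toy_data)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that centroids from a scaled model are expressed in standardised units, so they are not directly comparable with the unscaled centroids printed earlier.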