
## Analysis of K-means Clustering on Wisconsin Breast Cancer Data

Phase 1.0 | 2018 October 24

Bill Screen, Ha-Lan Nguyen, Tarun Rawat | Indiana University | M.S. Data Science

#### PROBLEM STATEMENT: 
Breast cancer is a growing health concern among women. A cancer’s stage is a crucial factor in deciding which treatment options to recommend and in determining the patient’s prognosis. In the United States today, approximately one in eight women is at risk of developing breast cancer during her lifetime. An analysis of the most recent data has shown that the survival rate is 88% at 5 years after diagnosis and 80% at 10 years after diagnosis. With early detection and treatment, this type of cancer may go into remission. In that case, the patient’s worst fear is a recurrence of the cancer.

#### OBJECTIVE: 
This report demonstrates how a k-means algorithm can be used to separate benign and malignant cells into two groups.
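
To make the objective concrete, the following is a minimal sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the implementation used later in this project: the helper name `kmeans_sketch`, its parameters, and the toy points are assumptions made purely for illustration.

```python
import numpy as np

def kmeans_sketch(points, k=2, n_iter=100, seed=0):
    """Hypothetical helper: plain k-means loop, not the project's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (made up for illustration): two well-separated groups
toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

Note that k-means only produces cluster indices; which index corresponds to benign or malignant has to be determined afterwards, for example by comparing against the CLASS column.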

#### PHASE 1 TASKS

- Download the data and load it in Python
- Add Headers (Scn A2 A3 A4 A5 A6 A7 A8 A9 A10 CLASS)
- Impute missing values 
- Plot basic graphs
- Compute data statistics
---

#### Import Libraries

In [0]:
# Import Libraries
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the recommended interface; pylab is discouraged
import numpy as np
import statistics

#### Load Data

In [0]:
# Set header column names
column_names = ['Scn','A2','A3','A4','A5','A6','A7','A8','A9','A10','CLASS']

# Set NA character
na_value_char = '?' 

# Load data file into pandas Dataframe, and replace '?' chars with NaN
df = pd.read_csv('breast-cancer-wisconsin.data', names=column_names, na_values=na_value_char)

# Check the number of NaN values in the DataFrame
print(df.isnull().sum())

# Inspect data
print(df.head(n=20))
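
As a small illustration of what `na_values` does during loading, the sketch below parses three made-up rows from memory instead of the real data file. The rows and the names `sample_csv` and `sample_df` are assumptions for the example only.

```python
import io
import pandas as pd

# Made-up rows in the same comma-separated layout; not taken from the real file
sample_csv = io.StringIO(
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,4,5,1,2,?,7,3,1,4\n"
    "1003,3,1,1,1,2,2,3,1,1,2\n"
)
names = ['Scn', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'CLASS']
sample_df = pd.read_csv(sample_csv, names=names, na_values='?')

# The '?' is parsed as NaN, so A7 is read as a float column
missing_per_column = sample_df.isnull().sum()
print(missing_per_column[missing_per_column > 0])
print(sample_df['A7'].dtype)
```

Filtering the counts to the non-zero entries makes it easier to see which columns actually need imputation.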

#### Impute missing values

In [0]:
# Validate column A7 is float (before replacing with mean)
print('\n Column A7 is datatype {0}'.format(df['A7'].dtypes))

# Replace NaN values with the mean value in column A7
df['A7'] = df['A7'].fillna(df['A7'].mean())

# Validate column A7 is float (after replacing with mean)
print('\n Column A7 is datatype {0}'.format(df['A7'].dtypes))
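
The effect of mean imputation can be seen on a tiny example. The sketch below uses an invented five-value column and, for comparison only, also fills the gap with the median, which is less sensitive to extreme values. The median variant is an alternative shown for contrast, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are invented for illustration)
toy = pd.Series([1.0, 2.0, np.nan, 10.0, 3.0], name='A7')

# mean() and median() skip NaN by default, so both use the 4 observed values
mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())  # alternative, less affected by the 10.0

print(mean_filled.tolist())
print(median_filled.tolist())
```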

#### Explore Dataset

In [0]:
# Provide the summary statistics
print('\nProvide the summary statistics:')
print(df.describe())

# Find number of columns and number of rows
print('\nThe Dataframe has {0} rows, and {1} columns'.format(df.shape[0], df.shape[1]))

# Report how many unique values in each column
for column in column_names:
  print('\nColumn {0} has {1} unique values.'.format(column, len(df[column].unique())))

#### Plot basic graphs

In [0]:
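# Draw a bar plot for CLASS (number of samples per class label)
df['CLASS'].value_counts().sort_index().plot(kind='bar', color="b", alpha=0.5, figsize=(8,4), title='CLASS')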

# Plot histograms for attributes A2 to A10 (nine histograms)
for column in column_names[column_names.index('A2') : column_names.index('A10') + 1]:
  df[[column]].plot(kind='hist', bins=30, color="b", alpha=0.5, figsize=(8,4))

#### Compute data statistics

In [0]:
# Find the mean, median, standard deviation and variance of each of the attributes A2 to A10.

# Set column headers
column_headers = ['Attribute','Mean','Median','Std. Deviation','Variance']

# Create array of row data
row_data = []

# Iterate over dataset and append column statistics to array
# (ddof=0 gives the population standard deviation and variance)
for column in column_names[column_names.index('A2') : column_names.index('A10') + 1]:
  row_data.append([column, df[column].mean(), df[column].median(), df[column].std(ddof=0), df[column].var(ddof=0)])

# Convert array to DataFrame
stats_df = pd.DataFrame(row_data, columns=column_headers)

# Display results
print(stats_df)
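
A table of this kind can also be built without an explicit loop, because pandas computes each statistic column-wise. The sketch below assumes a small invented frame (`toy_df`) rather than the real data, and uses `ddof=0` (population formulas) to match the choice made above; pandas defaults to `ddof=1` (sample formulas) if the argument is omitted.

```python
import pandas as pd

# Invented values for two attributes, for illustration only
toy_df = pd.DataFrame({'A2': [1, 2, 3, 4], 'A3': [2, 2, 4, 4]})

# Each call returns one value per column; the dict keys become table headers
summary = pd.DataFrame({
    'Mean': toy_df.mean(),
    'Median': toy_df.median(),
    'Std. Deviation': toy_df.std(ddof=0),
    'Variance': toy_df.var(ddof=0),
})
print(summary)
```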