
## Analysis of K-means Clustering on Wisconsin Breast Cancer Data

Phase 1.0 | 2018 October 24

Bill Screen, Ha-Lan Nguyen, Tarun Rawat | Indiana University | M.S. Data Science

#### PROBLEM STATEMENT: 
Breast cancer is a rising issue among women. A cancer's stage is a crucial factor in deciding which treatment options to recommend and in determining the patient's prognosis. In the United States today, approximately one in eight women is at risk of developing breast cancer over her lifetime. An analysis of the most recent data has shown that the survival rate is 88% five years after diagnosis and 80% ten years after diagnosis. With early detection and treatment, this type of cancer can go into remission. In that case, the worst fear of a cancer patient is recurrence of the cancer.

#### OBJECTIVE: 
This report demonstrates how a k-means algorithm can be used to separate benign and malignant cells into two groups.
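
To make the objective concrete, here is a minimal sketch of the k-means idea (Lloyd's algorithm) in NumPy. It is not part of the Phase 1 tasks and is not the project's implementation: the function name `kmeans`, the farthest-first initialisation, and the toy points are all assumptions chosen to keep the example small and deterministic.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Sketch of Lloyd's algorithm: alternate assignment and centroid update."""
    # Farthest-first initialisation (an assumption made here for determinism):
    # start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[nearest.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data (not taken from the Wisconsin file): two separated groups
toy_points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                       [8.0, 9.0], [9.0, 8.0], [9.0, 9.0]])
labels, centroids = kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

With k = 2, the two cluster labels play the role of the two groups named in the objective. Which label corresponds to benign or malignant would still have to be checked against the CLASS column.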

#### PHASE 1 TASKS

- Download the data and load it in Python
- Add Headers (Scn A2 A3 A4 A5 A6 A7 A8 A9 A10 CLASS)
- Impute missing values 
- Plot basic graphs
- Compute data statistics
---

#### Import Libraries

In [0]:
# Import Libraries
# %matplotlib inline
import pandas as pd
# import pylab as plt
import matplotlib.pyplot as plt
import numpy as np
import statistics

#### Load Data

In [0]:
# Set header column names
column_names = ['Scn','A2','A3','A4','A5','A6','A7','A8','A9','A10','CLASS']

# Load data file into pandas Dataframe
df = pd.read_csv('breast-cancer-wisconsin.data', names=column_names)

# Inspect data
print(df.head(n=120))

#### Impute missing values

In [0]:
# Set NA character
na_value_char = '?' 

# Replace ? by NaN in column A7
df['A7'] = df['A7'].replace(na_value_char, np.nan)

# Convert column A7 back to numeric
df['A7'] = pd.to_numeric(df['A7'])

# Check the number of NaN values in the DataFrame
print('\n Column A7 contains {0} NaN rows'.format(df['A7'].isnull().sum()))

# Replace NaN values in each column with that column's mean (A7 is the column converted above)
df.fillna((df.mean(skipna=True)), inplace=True)

# Ensure every column of the DataFrame is numeric
df = df.apply(pd.to_numeric)
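
As an alternative sketch, the `?` marker can be declared as missing when the file is read, using the `na_values` parameter of `pd.read_csv`, so the column arrives as numeric and no separate replace step is needed. The three-column sample below is hypothetical and only mimics the comma-separated layout of the data file.

```python
import io
import pandas as pd

# Hypothetical sample in the same comma-separated layout (not real rows)
sample = io.StringIO("1,5,2\n2,?,4\n3,1,2\n")

# Treat '?' as missing while parsing; A7 is then read as a float column
toy = pd.read_csv(sample, names=['Scn', 'A7', 'CLASS'], na_values='?')
print(toy['A7'].dtype)

# Fill the gap with the mean of the observed A7 values
toy['A7'] = toy['A7'].fillna(toy['A7'].mean())
print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance slightly, which is worth keeping in mind when the statistics are read later.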

#### Explore Dataset

In [0]:
# Provide the summary statistics using the describe() function
print('\nDataFrame Summary Statistics:')
print(df.describe())

# Find number of columns and number of rows
print('\nThe Dataframe has {0} rows, and {1} columns'.format(df.shape[0], df.shape[1]))

# Report how many unique values in each column
for column in column_names:
  print('\nColumn {0} has {1} unique values.'.format(column, len(df[column].unique())))
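
A shorter way to get the same kind of report is `DataFrame.nunique()`. The sketch below uses a hypothetical toy frame to show one difference to be aware of: `len(column.unique())` counts NaN as a value, while `nunique()` ignores NaN by default. After imputation the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in A7
toy = pd.DataFrame({'A7': [1.0, 1.0, np.nan, 10.0], 'CLASS': [2, 2, 4, 4]})

# Loop-style count, as in the cell above: NaN is counted as a distinct value
counts_with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# Vectorised count: NaN is excluded unless dropna=False is passed
counts_without_nan = toy.nunique()

print(counts_with_nan)
print(counts_without_nan)
```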

#### Compute data statistics

In [0]:
# Find the mean, median, standard deviation and variance of each of the attributes A2 to A10.

# Set column headers
column_headers = ['', 'Mean', 'Median', 'Std. Deviation', 'Variance']

# Create array of row data
row_data = []

# Iterate over the attribute columns and append each column's statistics
# (ddof=0 gives the population standard deviation and variance)
for column in column_names[column_names.index('A2') : column_names.index('A10') + 1]:
  row_data.append([column, df[column].mean(), df[column].median(), df[column].std(ddof=0), df[column].var(ddof=0)])

# Convert array to DataFrame
stats_df = pd.DataFrame(row_data, columns=column_headers)

# Display results
print(stats_df)
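
The cell above passes `ddof=0`, which selects the population formula (divide by n). pandas defaults to `ddof=1`, the sample formula (divide by n - 1). The sketch below illustrates the difference with the standard-library `statistics` module on a hypothetical list of values; the variable names are illustrative only.

```python
import math
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data, not from the dataset

# Population versions divide by n (matches ddof=0 in pandas)
pop_var = statistics.pvariance(values)
pop_std = statistics.pstdev(values)

# Sample versions divide by n - 1 (matches the pandas default, ddof=1)
samp_var = statistics.variance(values)
samp_std = statistics.stdev(values)

print(pop_var, samp_var)
print(pop_std, samp_std)
```

With several hundred rows the two versions differ only slightly, but it helps to state which one a report uses.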

#### Plot basic graphs

In [0]:
# Plot histograms for attributes A2 to A10 (nine histograms)
index_A2 = list(df.columns).index('A2')
index_A10 = list(df.columns).index('A10') + 1

# Turn grid off in histograms 
hist = df.iloc[:, index_A2:index_A10].hist(bins=20, color="g", alpha=0.5, grid=False)

# Adjust layout to fit better
plt.tight_layout(rect=(0, 0, 1.2, 1.2))
plt.show()

# Draw a bar plot for CLASS to see counts of benign and malignant values 
df.groupby('CLASS')['CLASS'].count().plot.bar()
plt.title('Benign and Malignant Class Counts')
plt.xlabel('Class (2 for benign, 4 for malignant)')
plt.ylabel('Totals')
plt.show()

# Draw a scatterplot of any two columns

# 1. Bare Nuclei and Normal Nucleoli
df.plot.scatter(x='A7', y='A9', c='green')
plt.title('Bare Nuclei and Normal Nucleoli')
plt.xlabel('Bare Nuclei (A7)')
plt.ylabel('Normal Nucleoli (A9)')
plt.show()

# 2. Clump Thickness and Marginal Adhesion
df.plot.scatter(x='A2', y='A5', c='green')
plt.title('Clump Thickness and Marginal Adhesion')
plt.xlabel('Clump Thickness (A2)')
plt.ylabel('Marginal Adhesion (A5)')
plt.show()

#### SUMMARY

Summarize by reporting which values might need standardization in the future (too much variation) and any other observations that you may discover as a Data Scientist.
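
If standardization does turn out to be needed, one common option is the z-score: subtract each column's mean and divide by its standard deviation. The sketch below is a minimal illustration on a hypothetical two-column frame. The column names `x` and `y` are placeholders, and no conclusion about the actual dataset is implied.

```python
import pandas as pd

# Hypothetical columns on very different scales
toy = pd.DataFrame({'x': [1.0, 5.0, 10.0], 'y': [100.0, 300.0, 800.0]})

# Z-score each column using the population standard deviation (ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized)
print(standardized.mean())
print(standardized.std(ddof=0))
```

After this transformation every column has mean 0 and standard deviation 1, so no single attribute dominates a distance-based method such as k-means.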
