K-means Clustering
==================

Importing the required Python modules
---------------------------------

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import normalize, scale
from sklearn import metrics

The following libraries are used:
- **pandas**: an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- **NumPy**: the fundamental package for scientific computing with Python.
- **Matplotlib**: a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments.
- **scikit-learn** (`sklearn`): a machine learning library featuring various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.


Retrieving the dataset
---------------------------------

In [None]:
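# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('heart.csv', header=None)

# Keep columns 0, 3 and 4 as features, then standardize each one
x = df.iloc[:, 0:5]
x = x.drop(x.columns[1:3], axis=1)
x = pd.DataFrame(scale(x))  # columns are renumbered 0, 1, 2

# Column 13 holds the class label; subtract 1 from every value
y = df.iloc[:, 13]
y = y - 1

1. The dataset is read from `heart.csv` into a pandas DataFrame.
2. The attributes (x) are taken from columns 0, 3 and 4 and standardized with `scale`.
3. The labels (y) are taken from column 13 and decremented by 1.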
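
The `scale` call standardizes each attribute column to zero mean and unit variance, so that no single attribute dominates the Euclidean distances k-means relies on. The sketch below reproduces that computation with plain NumPy. The `features` matrix is made up for illustration and is not taken from `heart.csv`.

```python
import numpy as np

# Hypothetical toy matrix: 4 samples, 2 attributes on very different scales
features = np.array([[63.0, 233.0],
                     [37.0, 250.0],
                     [41.0, 204.0],
                     [56.0, 236.0]])

# Standardize column by column: subtract the mean, divide by the
# population standard deviation (ddof=0)
col_mean = features.mean(axis=0)
col_std = features.std(axis=0)
standardized = (features - col_mean) / col_std

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```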

Plotting the Dataset
---------------------------------

In [None]:
fig = plt.figure()

ax1 = fig.add_subplot(1,2,1)
ax1.scatter(x[1],x[2], c=y)
ax1.set_title("Original Data")

Matplotlib is used to draw a scatter plot of columns 1 and 2 of the standardized attributes, with each point colored by its label.

Learning from the data
---------------------------------

In [None]:
clusters = 2

model = KMeans(init='k-means++', n_clusters=clusters, n_init=10, random_state=100)

# Cluster ids are arbitrary, so this score depends on how ids line up with labels
scores = cross_val_score(model, x, y, scoring='accuracy', cv=10)
print("10-Fold Accuracy : ", scores.mean() * 100)

model.fit(x)

predicts = model.predict(x)
# The mean of a boolean array is the fraction of matching entries
print("Accuracy(Total) = ", np.mean(predicts == np.array(y)) * 100)
centroids = model.cluster_centers_

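Here **model** is an instance of the `KMeans` class from `sklearn.cluster`.
The number of clusters to form is set to 2.
The initial cluster centers are chosen with k-means++ initialization (`init='k-means++'`), which spreads the starting centers apart to speed up convergence.
10-fold cross-validation is used to check the results. Because k-means numbers its clusters arbitrarily, the accuracy is meaningful only when the cluster ids happen to coincide with the class labels.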
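
Comparing cluster ids directly with class labels can understate the agreement, since cluster 0 may correspond to class 1 and vice versa. One possible way to account for this, not part of the original notebook, is to try every one-to-one mapping of cluster ids to labels and keep the best accuracy. The helper `aligned_accuracy` and the toy arrays below are hypothetical, and the sketch assumes labels are the integers `0 .. n_clusters-1`.

```python
from itertools import permutations

import numpy as np


def aligned_accuracy(y_true, cluster_ids, n_clusters):
    """Best accuracy over all one-to-one mappings of cluster ids to labels."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    best = 0.0
    for mapping in permutations(range(n_clusters)):
        # mapping[c] is the class label assigned to cluster c
        relabeled = np.array(mapping)[cluster_ids]
        best = max(best, float(np.mean(relabeled == y_true)))
    return best


# Toy example: the clustering is mostly right, but its ids are swapped
y_true = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 1, 0, 0, 1]

raw = float(np.mean(np.asarray(cluster_ids) == np.asarray(y_true)))
aligned = aligned_accuracy(y_true, cluster_ids, n_clusters=2)
print(raw, aligned)
```

Trying all permutations is practical only for a small number of clusters, which is the case here with two.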

In [None]:
ax1.scatter(centroids[:, 1], centroids[:, 2],
            marker='x', s=169, linewidths=3,
            color='b', zorder=10)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("KMeans Clustering")
ax2.scatter(x[1],x[2], c=predicts)
ax2.scatter(centroids[:, 1], centroids[:, 2],
            marker='x', s=169, linewidths=3,
            color='b', zorder=10)

The learned cluster centroids are drawn as crosses on both plots. The second plot colors each point by its predicted cluster instead of its original label.

In [None]:
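# Rows are true labels, columns are predicted cluster ids
cm = metrics.confusion_matrix(y, predicts)
# Divide by the sample count to show fractions instead of raw counts
print(cm / len(y))
print(metrics.classification_report(y, predicts))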

plt.show()

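The confusion matrix, printed as fractions of all samples, shows how the predicted clusters compare with the true labels. The classification report then summarizes the main classification metrics (precision, recall, and F1-score) as text.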
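
As a small illustration of how to read the normalized confusion matrix, the sketch below uses made-up labels and predictions (not results from `heart.csv`). Rows correspond to true labels and columns to predictions, so the diagonal holds the fraction of samples that match.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for five samples
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

cm = metrics.confusion_matrix(y_true, y_pred)
fractions = cm / cm.sum()  # each cell as a share of all samples

print(cm)
print(fractions)

# The diagonal sums to the overall accuracy
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```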