K-means Clustering with Naive Bayes Classifier
==============================================

Importing required python modules
---------------------------------

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20;
# the same utilities now live in sklearn.model_selection.
from sklearn.model_selection import KFold, cross_val_score, train_test_split

The following libraries are used:
- **Pandas**: an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- **NumPy**: the fundamental package for scientific computing with Python.
- **Matplotlib**: a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments.
- **Scikit-learn** (`sklearn`): provides classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with NumPy and SciPy.


Retrieving the dataset
---------------------------------

In [None]:
data = pd.read_csv('heart.csv', header=None)
df = pd.DataFrame(data)

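1. The dataset is read from `heart.csv`. Because `header=None` is passed, the first row is treated as data and the columns get integer labels.
2. The result is wrapped in a pandas DataFrame named `df`. `pd.read_csv` already returns a DataFrame, so this step only rebinds the data under a new name.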
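
As a quick check of how `header=None` behaves, the sketch below parses a small in-memory CSV. The two rows are made-up values, not taken from `heart.csv`, and the names `csv_text` and `sample` are only for illustration.

```python
import io

import pandas as pd

# Two made-up rows standing in for a headerless CSV file (hypothetical values).
csv_text = "70,1,4,130\n67,0,3,115\n"

sample = pd.read_csv(io.StringIO(csv_text), header=None)

# read_csv already returns a DataFrame, and with header=None the
# columns are labelled with integers starting at 0.
print(type(sample).__name__)
print(list(sample.columns))
```

Integer column labels are what make positional selections such as `df.iloc[:, 13]` and label selections such as `df[13]` refer to the same column.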

Plotting the Dataset
---------------------------------

In [None]:
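fig = plt.figure()

# Assumption: x holds the 13 attribute columns and y the label column,
# matching the split used later in nbkmh().
x = df.iloc[:, 0:13]
y = df.iloc[:, 13]

ax1 = fig.add_subplot(1,2,1)
ax1.scatter(x[1],x[2], c=y)
ax1.set_title("Original Data")

# Running confusion-matrix totals, accumulated across folds by nbkmh().
FP = 0
FN = 0
TN = 0
TP = 0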

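Matplotlib is used to draw a scatter plot of attribute columns 1 and 2, colored by class label. The four counters hold confusion-matrix entries that are accumulated across the cross-validation folds.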

Learning from the data:
---------------------------------

In [None]:
def nbkmh(train_index, test_index):

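This function trains and evaluates the hybrid classifier on one cross-validation fold, given the row indices of the training and test sets. The cells that follow make up its body.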

In [None]:
	x_kmeans = df.iloc[:, 0:5]
	x_kmeans = x_kmeans.drop(x_kmeans.columns[1:3], axis=1)
	x_kmeans = pd.DataFrame(scale(x_kmeans))

	x_naive = df.iloc[:, 0:13]
	y = df.iloc[:, 13]
	y = y-1

	y_train = pd.Series(y.iloc[train_index])
	y_test = pd.Series(y.iloc[test_index])

	x_train_kmeans = x_kmeans.iloc[train_index, :]
	x_test_kmeans = x_kmeans.iloc[test_index, :]

	x_train_naive = x_naive.iloc[train_index, :]
	x_test_naive = x_naive.iloc[test_index, :]

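Attributes and labels are extracted separately for K-means and Naive Bayes, then split by the fold's train and test indices.
K-means clustering uses only the continuous attributes (columns 0, 3 and 4, left after dropping columns 1 and 2), standardized with `scale`. Naive Bayes uses all 13 attributes. The labels in column 13 are shifted down by one, so that classes coded as 1 and 2 become 0 and 1.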
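
The column selection and scaling can be checked on a toy frame. This is a minimal sketch on made-up numbers; `toy`, `subset` and `kept_columns` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical 4-row frame with five integer-labelled columns.
toy = pd.DataFrame(np.arange(20, dtype=float).reshape(4, 5))

subset = toy.iloc[:, 0:5]
subset = subset.drop(subset.columns[1:3], axis=1)  # removes columns 1 and 2
kept_columns = list(subset.columns)

# scale() standardizes each column to zero mean and unit variance.
scaled = pd.DataFrame(scale(subset))

print(kept_columns)
print(np.allclose(scaled.mean(), 0.0), np.allclose(scaled.std(ddof=0), 1.0))
```

Note that `scale` is applied here to the whole frame at once, as in the notebook; fitting the scaling on the training rows only would be a stricter alternative.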

In [None]:
	clusters = 5
	model_kmeans = KMeans(init='k-means++', n_clusters=clusters, n_init=10,random_state=10000)
	model_kmeans.fit(x_train_kmeans)
	kmean_predictions = model_kmeans.predict(x_train_kmeans)

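K-means clustering is fitted on the training fold to partition it into 5 clusters. The initial cluster centers are chosen with k-means++ initialization, which spreads them out to speed up convergence, and `n_init=10` repeats the run ten times and keeps the best result. The training rows are then assigned to their clusters.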
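
To see the fit and predict steps in isolation, here is a small sketch on two made-up, well-separated groups of points. The data and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated made-up groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

km = KMeans(init='k-means++', n_clusters=2, n_init=10, random_state=0)
km.fit(points)

# predict() returns the index of the nearest cluster center for each row.
labels = km.predict(points)
new_label = km.predict(np.array([[4.9, 5.0]]))[0]

print(labels, new_label)
```

The cluster indices themselves are arbitrary; only which rows share an index is meaningful.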

In [None]:
	# DataFrame.append and Series.append were removed in pandas 2.0,
	# so each cluster's rows are selected with a boolean mask instead.
	x = [x_train_naive[kmean_predictions == k] for k in range(0,clusters)]
	y = [y_train[kmean_predictions == k] for k in range(0,clusters)]

The attributes (`x`) and labels (`y`) of the training fold are then grouped by the cluster that K-means assigned to each row.
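
A small illustration of grouping rows by cluster index, on made-up data. The names `features`, `labels`, `cluster_ids`, `x_groups` and `y_groups` are hypothetical.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({0: [10, 20, 30, 40], 1: [1, 2, 3, 4]})
labels = pd.Series([0, 1, 1, 0])
cluster_ids = np.array([1, 0, 1, 1])  # hypothetical k-means assignments

n_clusters = 2
# One boolean mask per cluster selects that cluster's rows and labels.
x_groups = [features[cluster_ids == k] for k in range(n_clusters)]
y_groups = [labels[cluster_ids == k] for k in range(n_clusters)]

for k in range(n_clusters):
    print(k, len(x_groups[k]), y_groups[k].tolist())
```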

In [None]:
	clstr_n = [MultinomialNB(alpha=2,fit_prior=True) for ii in range(0,clusters)]
	for i in range(0,clusters):
		clstr_n[i].fit(x[i], y[i])

A Multinomial Naive Bayes classifier is then fitted on each cluster separately. The additive (Laplace/Lidstone) smoothing parameter `alpha` is set to 2, and class prior probabilities are learned from the data (`fit_prior=True`).
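
The effect of `alpha` can be verified by hand on a tiny count matrix. This is a sketch on made-up counts; with additive smoothing, each feature probability is (count + alpha) / (class total + alpha * number of features).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count features: two samples per class.
X = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y = np.array([0, 0, 1, 1])

nb = MultinomialNB(alpha=2, fit_prior=True).fit(X, y)

# Class 0 has feature counts [5, 1]: (5 + 2) / (6 + 2 * 2) = 0.7 and 0.3.
# Class 1 has feature counts [1, 7]: (1 + 2) / (8 + 2 * 2) = 0.25 and 0.75.
probs = np.exp(nb.feature_log_prob_)

# With fit_prior=True the priors are the class frequencies, here 0.5 each.
priors = np.exp(nb.class_log_prior_)

print(probs)
print(priors)
```

`MultinomialNB` expects non-negative feature values, which is worth keeping in mind when choosing which columns to pass to it.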

In [None]:
	predicts = []
	c=0
	for i in range(len(x_test_kmeans)):
		# Route the test row to its nearest cluster.
		prediction = model_kmeans.predict(
                    x_test_kmeans.iloc[i, :].values.reshape(1,-1))
		prediction = int(prediction[0])
		# Classify it with that cluster's Naive Bayes model.
		pred_naive = clstr_n[prediction].predict(
                    x_test_naive.iloc[i, :].values.reshape(1,-1))[0]
		predicts.append(pred_naive)
		if pred_naive == y_test.iloc[i]:
			c+=1
	print ((c*100.0)/len(x_test_kmeans))

Each test row is assigned to its nearest cluster and then classified by that cluster's Naive Bayes model. The fold accuracy is printed as a percentage.
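
The routing idea can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the random data, the two clusters, and the helper `hybrid_predict` are not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)

# Synthetic data: two well-separated groups for clustering, plus
# non-negative count-like features for Naive Bayes.
X_cluster = np.vstack([rng.normal(0, 0.3, (20, 2)),
                       rng.normal(5, 0.3, (20, 2))])
X_nb = rng.randint(0, 10, size=(40, 3))
y = np.tile([0, 1], 20)  # alternating labels, so each cluster sees both classes

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_cluster)
assign = km.labels_

# One Naive Bayes model per cluster, trained only on that cluster's rows.
models = [MultinomialNB(alpha=2).fit(X_nb[assign == k], y[assign == k])
          for k in range(2)]

def hybrid_predict(x_cluster_row, x_nb_row):
    """Pick the nearest cluster, then classify with that cluster's model."""
    k = km.predict(x_cluster_row.reshape(1, -1))[0]
    return models[k].predict(x_nb_row.reshape(1, -1))[0]

pred = hybrid_predict(X_cluster[0], X_nb[0])
print(pred)
```

A cluster that receives no training rows, or rows from only one class, would need special handling that this sketch does not include.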

In [None]:
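	predicts = np.array(predicts)
	# Rows are true labels, columns are predicted labels;
	# dividing by the fold size turns counts into fractions.
	cm = metrics.confusion_matrix(y_test, predicts, labels=[0, 1])/len(y_test)
	# print (cm)
	global FP
	global FN
	global TN
	global TP
	TN += cm[0][0]
	FP += cm[0][1]
	FN += cm[1][0]
	TP += cm[1][1]
	return ((c*100.0)/len(x_test_kmeans))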

The confusion matrix for the fold is computed and normalized by the fold size, and its four entries are added to the running totals. The function returns the fold accuracy as a percentage.
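
For reference, scikit-learn lays out the binary confusion matrix with true labels as rows and predicted labels as columns. The sketch below shows this on made-up labels.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: three negatives and four positives.
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])

cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])

# Row-major order of the 2x2 matrix is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print(cm)              # [[2 1]
                       #  [1 3]]
print(tn, fp, fn, tp)  # 2 1 1 3
```

Passing `labels=[0, 1]` keeps the matrix 2x2 even when a fold happens to contain only one class.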

In [None]:
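def main():
	scores = []
	# In sklearn.model_selection, KFold takes n_splits and yields
	# index pairs from split() rather than being iterated directly.
	kf = KFold(n_splits=10)
	for i,(train_index,test_index) in enumerate(kf.split(df)):
		print("Iteration " + str(i+1) + " : ")
		scores.append(nbkmh(train_index, test_index))
	print("\n 10 Fold Accuracy",np.array(scores).mean())
	# Each total is a sum of 10 per-fold fractions, so *10 gives the mean percentage.
	print("FP", FP*10)
	print("FN", FN*10)
	print("TN", TN*10)
	print("TP", TP*10)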

if __name__ == '__main__':
	main()

`main()` runs 10-fold cross-validation: it calls `nbkmh()` once per fold, then prints the mean accuracy and the confusion-matrix entries averaged over the folds, as percentages.
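
How the folds are formed can be inspected on a small index range. This sketch uses `KFold` from `sklearn.model_selection` on 20 made-up row positions; `rows` and `folds` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(20)  # stand-in for 20 dataset rows

kf = KFold(n_splits=10)
folds = list(kf.split(rows))

# Each fold is a (train_index, test_index) pair of positional indices.
first_train, first_test = folds[0]
print(len(folds), first_test, len(first_train))

# Without shuffling, the test folds are consecutive blocks that
# together cover every row exactly once.
all_test = np.concatenate([test for _, test in folds])
print(all_test)
```

Because the folds are consecutive blocks by default, a file sorted by class would give unbalanced folds; `KFold(n_splits=10, shuffle=True, random_state=...)` is one option in that case.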