 ![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)
 
 # Welcome to Apache Spark with Python

> Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
http://spark.apache.org/

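In this notebook, we'll practice K-means clustering, first with scikit-learn on the Iris dataset and then with Spark MLlib.
K-means partitions the data into k clusters, where k is the number of clusters chosen in advance.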
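
To make the idea concrete, here is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm) in plain Python. It is not how scikit-learn or Spark MLlib implement it; the function names `squared_distance` and `kmeans` and the `toy_points` data are hypothetical and chosen only for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10, seed=0):
    """Illustrative K-means: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Start from k distinct data points picked at random
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        for i, group in enumerate(groups):
            if group:  # keep the old center if a group ends up empty
                centers[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centers

# Hypothetical toy data: two well-separated groups of three points
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(toy_points, k=2)
print(sorted(centers))
```

On data this well separated, the two centers should settle on the means of the two groups; on harder data the result can depend on the random starting centers.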

> Test Code Assignment

In [2]:
from sklearn import datasets, cluster
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [3]:
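# Seed NumPy's global random number generator so that the K-means
# initialization below is reproducible
np.random.seed(2)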

In [7]:
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

In [8]:
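# Fit K-means with 3 clusters, one per Iris species
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
labels = k_means.labels_
# Cluster ids are arbitrary, so this count is only meaningful when they
# happen to line up with the species ids in y_iris
correct_labels = sum(y_iris == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y_iris.size))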

Result: 134 out of 150 samples were correctly labeled.
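
Because K-means numbers its clusters arbitrarily, a direct comparison such as `y_iris == labels` can undercount matches when the cluster ids are permuted relative to the class ids. One possible way to guard against this, sketched below under the assumption that the number of clusters equals the number of classes, is to try every relabeling and keep the best count. The helper `best_label_matches` and the toy label lists are hypothetical and not part of the original notebook.

```python
from itertools import permutations

def best_label_matches(true_labels, cluster_labels, n_clusters):
    """Best match count over all ways of renaming cluster ids to class ids."""
    best = 0
    for mapping in permutations(range(n_clusters)):
        # mapping[c] is the class id assigned to cluster id c
        matches = sum(1 for t, c in zip(true_labels, cluster_labels)
                      if mapping[c] == t)
        best = max(best, matches)
    return best

# Hypothetical toy labels: the clustering is good but its ids are permuted
toy_true = [0, 0, 1, 1, 2, 2]
toy_clusters = [2, 2, 0, 0, 1, 0]

naive = sum(1 for t, c in zip(toy_true, toy_clusters) if t == c)
aligned = best_label_matches(toy_true, toy_clusters, 3)
print("naive matches:", naive)
print("aligned matches:", aligned)
```

Trying all permutations is fine for three clusters but grows factorially, so for many clusters an assignment solver would be a better fit.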


> K-Means Assignment

In [17]:
from numpy import array
from math import sqrt

from pyspark.mllib.clustering import KMeans, KMeansModel


In [18]:
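# Load and parse the data: one point per line, coordinates separated by spaces
# (`sc` is assumed to be the SparkContext provided by the notebook environment)
data = sc.textFile("kmeans_data.txt")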
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

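# Evaluate the clustering. Note: error() returns the Euclidean distance
# (the square root of the squared error), so the total below is a sum of
# distances to the nearest center rather than a true sum of squared errors.
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))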

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Within Set Sum of Squared Error = 0.6928203230275529
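
The `error` function in the cell above takes a square root, so the reported figure is a sum of Euclidean distances rather than a sum of squared errors. The following plain-Python sketch shows the difference between the two quantities on hypothetical toy data; `toy_points`, `toy_centers`, and the helper names are assumptions made for illustration and do not come from `kmeans_data.txt`.

```python
from math import sqrt

def squared_error(point, center):
    """Squared Euclidean distance between a point and a center."""
    return sum((p - c) ** 2 for p, c in zip(point, center))

def nearest_center(point, centers):
    """Center with the smallest squared distance to the point."""
    return min(centers, key=lambda c: squared_error(point, c))

# Hypothetical toy data: every point lies exactly 2 units from its center
toy_points = [(0.0, 0.0), (0.0, 4.0), (10.0, 10.0), (10.0, 14.0)]
toy_centers = [(0.0, 2.0), (10.0, 12.0)]

# Within Set Sum of Squared Errors: no square root inside the sum
wssse = sum(squared_error(p, nearest_center(p, toy_centers))
            for p in toy_points)

# Sum of distances: what you get when each term is square-rooted first
total_distance = sum(sqrt(squared_error(p, nearest_center(p, toy_centers)))
                     for p in toy_points)

print("WSSSE =", wssse)
print("Sum of distances =", total_distance)
```

Both are valid measures of cluster tightness, but they are not interchangeable, so it helps to label the output according to the one actually computed.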
