# Exercise Clustering

(1) Read the file "clustering_exercise.txt" (TSV format) and visualize the original data and its groups;
(2) Cluster the data;
(3) Visually compare the resulting clusters with the original data.

## Loading all libraries

In [1]:
# import spark
from pyspark import SparkContext
# initialize a new Spark Context to use for the execution of the script
sc = SparkContext(appName="MY-APP-NAME", master="local[*]")
# prevent useless logging messages
sc.setLogLevel("ERROR")

In [2]:
%matplotlib inline

import numpy as np
from pyspark.mllib.clustering import KMeans
import matplotlib.pyplot as plt

## Read the input file

In [4]:
raw_data_rdd = sc.textFile("clustering_exercise.txt")

Check the content of `raw_data_rdd`. Each line holds a group label followed by two coordinates, separated by tabs.

In [5]:
raw_data_rdd.take(10)

['0\t-3.124335669555818\t-0.8238869204280516',
 '0\t-1.631655282541037\t-2.813150868834941',
 '0\t-1.1312513766433991\t-1.2842727920630375',
 '0\t-3.4837770197998412\t-0.5950548777092937',
 '0\t-1.026738204511106\t-2.2128310684806842',
 '0\t-1.966876746731852\t-1.1103973218532115',
 '0\t-1.547588455852239\t-1.1604632124538172',
 '0\t-1.0317494649457666\t-1.2686345312148963',
 '0\t-1.236075185439029\t-2.527746243219358',
 '0\t-1.978076992374207\t-1.3106375096991711']

Transform the RDD of lines into an RDD of float triples (the label and the two coordinates)

In [7]:
cleaned_data_rdd = raw_data_rdd.map(lambda row: [float(x) for x in row.split("\t")])
cleaned_data_rdd.take(10)

[[0.0, -3.124335669555818, -0.8238869204280516],
 [0.0, -1.631655282541037, -2.813150868834941],
 [0.0, -1.1312513766433991, -1.2842727920630375],
 [0.0, -3.4837770197998412, -0.5950548777092937],
 [0.0, -1.026738204511106, -2.2128310684806842],
 [0.0, -1.966876746731852, -1.1103973218532115],
 [0.0, -1.547588455852239, -1.1604632124538172],
 [0.0, -1.0317494649457666, -1.2686345312148963],
 [0.0, -1.236075185439029, -2.527746243219358],
 [0.0, -1.978076992374207, -1.3106375096991711]]
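The mapping above assumes that every line has exactly three tab-separated numeric fields. As a minimal sketch of a stricter alternative, the hypothetical helper `parse_line` (not part of the original notebook) checks the field count and keeps the label as an integer:

```python
def parse_line(line):
    """Parse 'label<TAB>x<TAB>y' into (int label, float x, float y)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(
            "expected 3 tab-separated fields, got {}".format(len(fields))
        )
    label, x, y = fields
    return int(label), float(x), float(y)


# One line copied from the output of raw_data_rdd.take(10).
print(parse_line("0\t-3.124335669555818\t-0.8238869204280516"))
```

Under these assumptions the helper could be passed to `raw_data_rdd.map` in place of the lambda; rows would then be tuples with an integer label instead of lists of floats.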

Compute the number of known clusters. Labels start at 0, so the count is the largest label plus one.

In [8]:
number_clusters = 1 + cleaned_data_rdd.map(lambda triple: int(triple[0])).max()
print("There are {} clusters".format(number_clusters))

There are 5 clusters
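The exercise asks for a plot of the original data and its groups. One possible sketch, assuming the data is small enough to collect on the driver: group the rows by label in plain Python, then draw one scatter series per group. The helper names `group_by_label` and `plot_groups` are hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def group_by_label(rows):
    """Group [label, x, y] rows into {label: ([x, ...], [y, ...])}."""
    groups = defaultdict(lambda: ([], []))
    for label, x, y in rows:
        xs, ys = groups[int(label)]
        xs.append(x)
        ys.append(y)
    return dict(groups)


def plot_groups(groups, title):
    """Draw one scatter series per group; requires matplotlib."""
    # Imported lazily so that group_by_label works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    for label in sorted(groups):
        xs, ys = groups[label]
        ax.scatter(xs, ys, s=10, label="group {}".format(label))
    ax.set_title(title)
    ax.legend()
    return ax


# Toy rows in the same [label, x, y] layout as cleaned_data_rdd.
sample = [[0.0, -3.1, -0.8], [0.0, -1.6, -2.8], [1.0, 2.0, 2.5]]
groups = group_by_label(sample)
print(groups)
```

In the notebook, a call such as `plot_groups(group_by_label(cleaned_data_rdd.collect()), "Original groups")` would plot the real data; `collect()` is only reasonable here because the file is assumed to fit in driver memory.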


## Train & Predict

In [7]:
# keep only the two coordinates; the known label must not be used as a feature
training_rdd = cleaned_data_rdd.map(lambda row: row[1:])

In [8]:
clusters = KMeans.train(training_rdd, number_clusters, maxIterations=20, initializationMode="random")
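Training returns a model; assigning each point to a cluster is a separate step. In Spark this is available through `clusters.predict(...)` and `clusters.clusterCenters`. The plain-Python sketch below illustrates the same idea on toy data: each point goes to its nearest center, and a cross-tabulation then compares the assignments with the original labels. The helpers `nearest_center` and `cross_tab`, the toy rows, and the toy centers are all hypothetical. Note that k-means cluster ids are arbitrary, so an original group 0 may well end up as cluster 1.

```python
from collections import Counter


def nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))

    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


def cross_tab(labels, assignments):
    """Count how often each (original label, cluster id) pair occurs."""
    return Counter(zip(labels, assignments))


# Toy data: two original groups and two hypothetical centers.
rows = [[0, -1.0, -1.0], [0, -1.2, -0.8], [1, 2.0, 2.0], [1, 2.2, 1.9]]
centers = [[2.1, 2.0], [-1.1, -0.9]]

labels = [row[0] for row in rows]
assignments = [nearest_center(row[1:], centers) for row in rows]
table = cross_tab(labels, assignments)
print(table)
```

A table in which each original label pairs with a single cluster id indicates that the clustering recovered the groups, even if the ids themselves differ.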

In [9]:
print("SSE is {:.2f}".format(clusters.computeCost(training_rdd)))

SSE is 1540.72
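`computeCost` returns the sum of squared distances from each point to its nearest cluster center. Because the model above is trained with random initialization and no fixed seed, the reported value can differ between runs. As an illustration only, the hypothetical helper `sse` below computes the same quantity in plain Python on toy points:

```python
def sse(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        )
    return total


# Toy points on a line, with hypothetical centers.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
one_center = [[4.0, 0.0]]
two_centers = [[1.0, 0.0], [10.0, 0.0]]

print("SSE with 1 center: {:.2f}".format(sse(points, one_center)))   # 56.00
print("SSE with 2 centers: {:.2f}".format(sse(points, two_centers)))  # 2.00
```

Adding well-placed centers lowers the SSE, so the value is mainly useful for comparing runs with the same number of clusters.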
