# Week 10 - Clustering with Spark

I used this guide to set up PySpark and IPython notebook integration: https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python

## Intro and Basic Setup

In [1]:
# Create PySpark context
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

In [2]:
# Simple example of a distributed computation: count the primes below 1,000,000
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # only odd divisors need checking: start at 3 and go up to the
    # square root of n in steps of 2
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of the numbers 0 to 999,999 (xrange excludes the upper bound)
nums = sc.parallelize(xrange(1000000))

# Compute the number of primes in the RDD
print nums.filter(isprime).count()

78498
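As a sanity check on the count above, the same number can be computed without Spark. The following is a minimal sketch using a sieve of Eratosthenes, which is a different technique from the trial division in `isprime`. The function name `count_primes_below` is made up for this illustration.

```python
def count_primes_below(limit):
    """Count primes p with 2 <= p < limit using a sieve of Eratosthenes."""
    if limit < 2:
        return 0
    # sieve[i] == 1 means "i has not been crossed out yet"
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # cross out every multiple of i, starting at i * i
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return sum(sieve)

# Same range as the RDD above: 0 to 999,999
print(count_primes_below(1000000))
```

This should agree with the 78498 reported by the Spark job, since both count the primes in the same range.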


In [3]:
# Load in the Shakespeare text file
text = sc.textFile("shakespeare.txt")
print text


MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2


In [4]:
# Define a function that takes in a string and returns a list of tokens. Could use nltk here
def tokenize(text):
    return text.split()

# Apply the function above to every line in the RDD and flatten the resulting lists into one RDD of words
words = text.flatMap(tokenize)
print words

PythonRDD[4] at RDD at PythonRDD.scala:43


In [5]:
# Map each word to a tuple of (word, 1) to indicate that this word appeared once
wc = words.map(lambda x: (x,1))
print wc.toDebugString()

(1) PythonRDD[5] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  shakespeare.txt HadoopRDD[2] at textFile at NativeMethodAccessorImpl.java:-2 []


In [6]:
# Reduce by key, using add to sum the values for each key
from operator import add
counts = wc.reduceByKey(add)
print counts.toDebugString()

(1) PythonRDD[10] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[9] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[8] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(1) PairwiseRDD[7] at reduceByKey at <ipython-input-6-f50e20538e62>:3 []
    |  PythonRDD[6] at reduceByKey at <ipython-input-6-f50e20538e62>:3 []
    |  MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  shakespeare.txt HadoopRDD[2] at textFile at NativeMethodAccessorImpl.java:-2 []


In [8]:
# Save counts to folder wc, and look into output folder
counts.saveAsTextFile("wc")
!ls -al wc

total 6076
drwxrwxr-x 2 james james    4096 Nov  1 13:31 .
drwxrwxr-x 5 james james    4096 Nov  1 13:31 ..
-rw-r--r-- 1 james james 6159858 Nov  1 13:31 part-00000
-rw-rw-r-- 1 james james   48132 Nov  1 13:31 .part-00000.crc
-rw-r--r-- 1 james james       0 Nov  1 13:31 _SUCCESS
-rw-rw-r-- 1 james james       8 Nov  1 13:31 ._SUCCESS.crc


In [9]:
# Show output from the word count and stop the Spark context
!head wc/part-00000
sc.stop()

(u'kingrichardiii@18311', 1)
(u'troilusandcressida@83747', 1)
(u'considered.', 2)
(u'kinghenryviii@7731', 1)
(u'hamlet@141843', 1)
(u'othello@36737', 1)
(u'kinghenryviii@7732', 1)
(u'othello@36738', 1)
(u'romeoandjuliet@1862', 1)
(u'coriolanus@166868', 1)
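To make the three steps of the word count easier to follow, here is a rough plain-Python sketch of what `flatMap`, `map`, and `reduceByKey` compute on a small in-memory list. It only mirrors the semantics on one machine and is not how Spark executes the job. The `word_count` helper and the sample lines are made up for this illustration.

```python
from operator import add

def tokenize(text):
    return text.split()

def word_count(lines):
    # flatMap(tokenize): tokenize every line, then flatten into one list of words
    words = [word for line in lines for word in tokenize(line)]
    # map(lambda x: (x, 1)): pair every word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey(add): combine the values of identical keys with add
    counts = {}
    for key, value in pairs:
        counts[key] = add(counts[key], value) if key in counts else value
    return counts

# Hypothetical sample input, not taken from shakespeare.txt
sample = ["to be or not to be", "that is the question"]
print(sorted(word_count(sample).items()))
```

Because tokenizing is a plain `split()`, punctuation stays attached to words, which is why entries such as `considered.` appear in the Spark output above.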


## Clustering with K-Means Example

Follow the example from http://spark.apache.org/docs/latest/mllib-clustering.html

In [10]:
import os

from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("%s/data/mllib/kmeans_data.txt" % os.environ['SPARK_HOME'])
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
        runs=10, initializationMode="random")

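# Evaluate the clustering. Note that error() returns the Euclidean distance
# to the nearest center (the square root of the squared error), so the total
# below is a sum of distances rather than a true Within Set Sum of Squared Errors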
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")

Within Set Sum of Squared Error = 0.692820323028
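The value printed above is a sum of Euclidean distances, because `error` takes a square root before the values are added up. The sketch below contrasts that with a sum of squared distances in plain Python. It assumes that `kmeans_data.txt` holds the six points listed in the code (an assumption, since the file contents are not shown in this notebook) and uses the cluster centers that the notebook prints in the next cell.

```python
from math import sqrt

# Assumed contents of kmeans_data.txt (not shown in this notebook)
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]

# Cluster centers as printed by the notebook
centers = [(9.1, 9.1, 9.1), (0.1, 0.1, 0.1)]

def squared_distance(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def nearest_squared_distance(p):
    # squared distance from p to its closest center
    return min(squared_distance(p, c) for c in centers)

# What the notebook's error() adds up: plain Euclidean distances
sum_of_distances = sum(sqrt(nearest_squared_distance(p)) for p in points)

# Sum of squared distances, which is what the name WSSSE suggests
wssse = sum(nearest_squared_distance(p) for p in points)

print("Sum of distances         = %s" % sum_of_distances)
print("Sum of squared distances = %s" % wssse)
```

Under these assumptions the first number comes out at about 0.6928, matching the notebook output, while the sum of squared distances is about 0.12.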


In [18]:
# Show clusters
print("Cluster centers: %s" % clusters.centers)

# Confirm we can predict new arrays
print("(1, 1, 1) is clustered to: %s" % clusters.centers[clusters.predict(array([1, 1, 1]))])
print("(8, 8, 8) is clustered to: %s" % clusters.centers[clusters.predict(array([8, 8, 8]))])

Cluster centers: [array([ 9.1,  9.1,  9.1]), array([ 0.1,  0.1,  0.1])]
(1, 1, 1) is clustered to: [ 0.1  0.1  0.1]
(8, 8, 8) is clustered to: [ 9.1  9.1  9.1]
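For intuition about what `KMeans.train` is doing, here is a minimal single-machine sketch of Lloyd's algorithm, the classic K-Means iteration. It is not MLlib's implementation. It uses a fixed initialization (the first and last point) instead of the random initialization chosen above, and it assumes the same six points for `kmeans_data.txt` as an illustration. The helper names are made up.

```python
def squared_distance(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def mean(points):
    # coordinate-wise mean of a non-empty list of points
    n = float(len(points))
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, initial_centers, max_iterations=10):
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center
        groups = [[] for center in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (a center with an empty group stays where it is)
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers

def predict(centers, point):
    # index of the center closest to point
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

# Assumed contents of kmeans_data.txt (not shown in this notebook)
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]

centers = kmeans(points, [points[0], points[-1]])
print("Cluster centers: %s" % centers)
print("(1, 1, 1) is clustered to: %s" % (centers[predict(centers, (1, 1, 1))],))
```

With this initialization the centers should settle near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), up to floating-point rounding, which is consistent with the MLlib result above.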


## K-Means for Mini Project

Let's repeat the K-Means steps above using the data from previous weeks.