# Week 10 - Clustering with Spark

I used this guide to set up PySpark and IPython notebook integration: https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python

## Intro and Basic Setup

In [42]:
# Create the PySpark context
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

In [29]:
# Simple example of a distributed computation: count the primes below 1,000,000
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # only odd divisors need checking, from 3 up to the square root of n
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of the integers from 0 to 999,999
nums = sc.parallelize(xrange(1000000))

# Compute the number of primes in the RDD
print nums.filter(isprime).count()

78498
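The count above can be cross-checked without Spark. The following is a minimal sketch in plain Python using a sieve of Eratosthenes; the function name `count_primes_below` is hypothetical and not part of the notebook.

```python
def count_primes_below(limit):
    """Count the primes p with 0 <= p < limit using a sieve of Eratosthenes."""
    if limit < 3:
        return 0  # there are no primes below 2
    # is_prime[i] == 1 means i is still a prime candidate
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # cross out the multiples of p, starting at p * p
            is_prime[p * p::p] = bytearray(len(range(p * p, limit, p)))
    return sum(is_prime)


prime_count = count_primes_below(1000000)  # 78498, matching the Spark result
```

The sieve is much faster than trial division on a single machine; the `filter(isprime)` version is used in the notebook because it shows how an arbitrary Python function is applied across an RDD.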


In [30]:
# Load the Shakespeare text file
text = sc.textFile("shakespeare.txt")
print text


MapPartitionsRDD[54] at textFile at NativeMethodAccessorImpl.java:-2


In [31]:
# Define a function that takes a string and returns a list of tokens. NLTK could be used here instead.
def tokenize(text):
    return text.split()

# Apply the function to every line in the RDD and flatten the results into a single RDD of words
words = text.flatMap(tokenize)
print words

PythonRDD[55] at RDD at PythonRDD.scala:43
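To see why `flatMap` rather than `map` is used here, the difference can be sketched on an ordinary Python list, with no Spark required. The sample `lines` are made up for illustration.

```python
from itertools import chain


def tokenize(text):
    return text.split()


# Hypothetical sample standing in for the lines of the text file
lines = ["to be or not", "to be"]

# map: one output element per input line, so the result is a list of lists
mapped = [tokenize(line) for line in lines]

# flatMap: the per-line lists are concatenated into one flat sequence of words
flat_mapped = list(chain.from_iterable(tokenize(line) for line in lines))
```

Here `mapped` holds two token lists, while `flat_mapped` holds six words, which is the shape the word count needs.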


In [32]:
# Map each word to a tuple of (word, 1) to indicate that this word appeared once
wc = words.map(lambda x: (x,1))
print wc.toDebugString()

(1) PythonRDD[56] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[54] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  shakespeare.txt HadoopRDD[53] at textFile at NativeMethodAccessorImpl.java:-2 []


In [38]:
# Reduce by key, using add to sum the values for each key
from operator import add
counts = wc.reduceByKey(add)
print counts.toDebugString()

(1) PythonRDD[73] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[72] at mapPartitions at PythonRDD.scala:346 []
 |  ShuffledRDD[71] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(1) PairwiseRDD[70] at reduceByKey at <ipython-input-38-21cc0e4dacc5>:1 []
    |  PythonRDD[69] at reduceByKey at <ipython-input-38-21cc0e4dacc5>:1 []
    |  MapPartitionsRDD[54] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  shakespeare.txt HadoopRDD[53] at textFile at NativeMethodAccessorImpl.java:-2 []
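As a rough mental model of what `reduceByKey(add)` computes, here is a single-machine sketch in plain Python. The helper `reduce_by_key` is hypothetical, and it ignores partitioning and the shuffle step visible in the debug output above.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func, one pair at a time."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical sample: the (word, 1) pairs for a short phrase
pairs = [(word, 1) for word in "to be or not to be".split()]
local_counts = reduce_by_key(pairs, add)
```

In Spark the same combining function is applied within each partition first and then again after the shuffle, so `func` should be associative and commutative, as `add` is.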


In [34]:
# Save the counts to the folder wc and list its contents
counts.saveAsTextFile("wc")
!ls -al wc

total 6076
drwxrwxr-x 2 james james    4096 Nov  1 12:55 .
drwxrwxr-x 4 james james    4096 Nov  1 12:55 ..
-rw-r--r-- 1 james james 6159858 Nov  1 12:55 part-00000
-rw-rw-r-- 1 james james   48132 Nov  1 12:55 .part-00000.crc
-rw-r--r-- 1 james james       0 Nov  1 12:55 _SUCCESS
-rw-rw-r-- 1 james james       8 Nov  1 12:55 ._SUCCESS.crc


In [41]:
# Show the first lines of the word count output and stop the Spark context
!head wc/part-00000
sc.stop()

(u'kingrichardiii@18311', 1)
(u'troilusandcressida@83747', 1)
(u'considered.', 2)
(u'kinghenryviii@7731', 1)
(u'hamlet@141843', 1)
(u'othello@36737', 1)
(u'kinghenryviii@7732', 1)
(u'othello@36738', 1)
(u'romeoandjuliet@1862', 1)
(u'coriolanus@166868', 1)
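The output shows that whitespace splitting keeps punctuation (`considered.`) and counts tokens such as `hamlet@141843`, which appear to be per-line labels in the input file. A stricter tokenizer could be sketched as below; the name `tokenize_normalized`, the regular expression, and the rule of skipping any token containing `@` are assumptions for illustration, not part of the original notebook.

```python
import re

# Runs of lowercase letters and apostrophes; everything else is dropped
WORD_RE = re.compile(r"[a-z']+")


def tokenize_normalized(text):
    """Lowercase the text, strip punctuation, and skip label-like tokens."""
    words = []
    for token in text.split():
        if "@" in token:
            continue  # assumed to be a line label such as hamlet@141843
        words.extend(WORD_RE.findall(token.lower()))
    return words


sample = tokenize_normalized("hamlet@141843 Well considered. O'er there!")
```

It could be passed to `flatMap` in place of `tokenize`, so that `considered.` and `considered` are counted as the same word.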


## Clustering with K-Means

In [None]:
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')