# Preparing the environment

## Graphics and plotting

In [3]:
# This line configures matplotlib to show figures embedded in the notebook, 
# instead of opening a new window for each figure. 
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# general graphics settings
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (16, 9)

## Spark

This IPython notebook comes with [Spark][1] preinstalled and already initialized. A global [SparkContext][2] is available in the variable `sc`:

[1]: http://spark.apache.org/
[2]: http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark

In [1]:
# SparkContext
sc

<pyspark.context.SparkContext at 0x7f47a414b750>

----

# *Word count* using Spark

We are first going to write a simple "word count" script in plain Python, then rework it to use Spark.

a) Read the whole corpus of Shakespeare's works into a single long Python string:

In [42]:
def load_text_file(filename):
    # use a context manager so the file is closed after reading
    with open(filename, 'r') as f:
        return f.read()

text = load_text_file('/srv/nfs/datasets/shakespeare.txt')

# print sample text
print text[:100]

﻿The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare



In [44]:
# it is a long string indeed
print len(text)

5465102


b) Now try to normalize the text:

  * first convert it to lowercase:

In [45]:
text2 = text.lower()

# sample first two lines
print text2[:100]

﻿the project gutenberg ebook of the complete works of william shakespeare, by
william shakespeare



  * then remove punctuation (except for `-`, to keep character sequences like "Stratford-upon-Avon" as a single word):

In [47]:
# see: https://docs.python.org/2/library/re.html
import re
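# match any character that is neither a word character nor a hyphen
# (no re.M flag needed: the pattern has no ^ or $ anchors)
punctuation = re.compile(r'[^\w-]')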
text3 = punctuation.sub(' ', text2)

# sample first two lines
print text3[:100]

   the project gutenberg ebook of the complete works of william shakespeare  by william shakespeare 


c) Now we can split the text into words at whitespace boundaries:

In [48]:
words = text3.split()

print words[:20]

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'complete', 'works', 'of', 'william', 'shakespeare', 'by', 'william', 'shakespeare', 'this', 'ebook', 'is', 'for', 'the', 'use']
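
Steps b) and c) can be folded into a single helper. The following is a minimal sketch: the name `tokenize` and the sample sentence are illustrative choices that do not appear elsewhere in this notebook, and the code is written to run under both Python 2 and Python 3.

```python
import re

# any character that is neither a word character nor a hyphen
PUNCTUATION = re.compile(r'[^\w-]')

def tokenize(text):
    """Lowercase `text`, blank out punctuation (keeping '-'), split on whitespace."""
    return PUNCTUATION.sub(' ', text.lower()).split()

sample = "Welcome to Stratford-upon-Avon, the Bard's town!"
tokens = tokenize(sample)
print(tokens)
```

Note that the apostrophe counts as punctuation under this pattern, so "Bard's" ends up as the two tokens `bard` and `s`.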


d) Finally, count the words. We want to build a `counts` map from words to integers: we start with an empty map that defaults to zero for unknown keys, then increment the value associated with each word every time we see it. For an overview of idioms for counting things in Python, see <http://treyhunner.com/2015/11/counting-things-in-python/>.

In [40]:
from collections import defaultdict

counts = defaultdict(int)
for word in words:
    counts[word] += 1
        
print counts.items()[:20]

[('fawn', 14), ('considered-', 1), ('nunnery', 6), ('gag', 1), ('woods', 15), ('cxsar', 1), ('spiders', 4), ('hanging', 38), ('offendeth', 1), ('beadsmen', 1), ('scold', 8), ('mustachio', 2), ('mutinies', 5), ('rests-that', 1), ('out-night', 1), ('benvolio', 17), ('slothful', 1), ('appropriation', 1), ('strictest', 1), ('bringing', 17)]


To sample the output, print the counts of two very common words:

In [41]:
print counts['the'], counts['a']

27820 14665
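
Among the counting idioms, `collections.Counter` from the standard library may be worth a look: it behaves much like the `defaultdict(int)` used above and also offers `most_common()` to rank items by frequency. A small sketch on a made-up word list (the sample data is illustrative, not taken from the corpus):

```python
from collections import Counter

sample_words = ['the', 'play', 'is', 'the', 'thing', 'the', 'play']

# Counter tallies every item of the iterable in one call
sample_counts = Counter(sample_words)

# the two most frequent words, as (word, count) pairs
top_two = sample_counts.most_common(2)
print(top_two)
```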


----

Now we build the same with Spark. You may want to follow along with the [Spark API Overview](http://spark.apache.org/docs/latest/quick-start.html).

a) Spark already provides a method to read a text file; unlike Python's `read()` method, which returns the whole file as a single string, it splits the content into lines.

In [22]:
text = sc.textFile('hdfs:/data/shakespeare.txt')

# sample first few items
print text.take(5)

[u'The Project Gutenberg EBook of The Complete Works of William Shakespeare, by', u'William Shakespeare', u'', u'This eBook is for the use of anyone anywhere at no cost and with', u'almost no restrictions whatsoever.  You may copy it, give it away or']
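
The difference between the two ways of reading can be reproduced in plain Python, without Spark. The sketch below uses an in-memory `io.StringIO` stream as a stand-in for a file; the sample text is made up for illustration.

```python
import io

sample = u'The Complete Works\nWilliam Shakespeare\n\nThis eBook is free\n'
stream = io.StringIO(sample)

whole = stream.read()        # one long string, newlines included
lines = whole.splitlines()   # one item per line, newlines stripped

print(len(whole))
print(len(lines))
```

As in the Spark output above, the empty line survives as an empty string.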


b.1) Use Spark's [flatMap](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap) RDD method to turn text lines into words. We'll fold case and remove punctuation later. 

In [24]:
%%time

def mapper(line):
    return line.split()

words = text.flatMap(mapper)

# sample first few items
print words.take(10)

[u'The', u'Project', u'Gutenberg', u'EBook', u'of', u'The', u'Complete', u'Works', u'of', u'William']
CPU times: user 3.16 ms, sys: 161 µs, total: 3.32 ms
Wall time: 83.7 ms
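
To see why `flatMap` rather than `map` is used here, it may help to compare the two in plain Python. This is only an analogy built with a list comprehension and `itertools.chain`, not Spark code, and the two sample lines are made up.

```python
from itertools import chain

def mapper(line):
    return line.split()

lines = ['The Project Gutenberg', 'William Shakespeare']

# "map"-like: one result per input line, so we get a list of lists
mapped = [mapper(line) for line in lines]

# "flatMap"-like: the per-line lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(mapper(line) for line in lines))

print(mapped)
print(flat_mapped)
```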


b.2) Now convert words to lowercase; we could also apply other transformations like removing punctuation here. Notice the CPU timings; where does the actual work happen and why?

In [26]:
%%time

def lowercase(s):
    return s.lower()

words2 = words.map(lowercase)

CPU times: user 19 µs, sys: 3 µs, total: 22 µs
Wall time: 26 µs


In [28]:
%%time

print words2.take(10)

[u'the', u'project', u'gutenberg', u'ebook', u'of', u'the', u'complete', u'works', u'of', u'william']
CPU times: user 3.23 ms, sys: 0 ns, total: 3.23 ms
Wall time: 80.1 ms
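
The timings hint that `map` only records what should be done, while `take` triggers the actual computation. Python generators behave in a loosely similar way, which the following sketch uses as an analogy. It is not how Spark is implemented, and the `calls` list is just an illustrative device for recording when the function runs.

```python
from itertools import islice

calls = []

def lowercase(s):
    calls.append(s)   # record that the function actually ran on this item
    return s.lower()

words = ['The', 'Project', 'Gutenberg', 'EBook']

# building the generator does no work yet, much like declaring a transformation
lazy = (lowercase(w) for w in words)
calls_before = list(calls)

# asking for two items runs the function on those two items only
first_two = list(islice(lazy, 2))

print(calls_before)
print(first_two)
print(calls)
```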


c) For each word, we output a pair *(word, count)*. For simplicity, the count is always 1. If we were processing a larger fileset, it could help to output a real "local" count here, to limit the amount of data transferred over the network.

In [30]:
def to_key_value(word):
    return (word, 1)

words3 = words2.map(to_key_value)

# sample first few items
words3.take(10)

[(u'the', 1),
 (u'project', 1),
 (u'gutenberg', 1),
 (u'ebook', 1),
 (u'of', 1),
 (u'the', 1),
 (u'complete', 1),
 (u'works', 1),
 (u'of', 1),
 (u'william', 1)]
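
A "local" count could be computed over a whole chunk of lines before any pairs are emitted. The sketch below shows the idea in plain Python; `local_counts` is a hypothetical helper, and wiring it into Spark (for instance through `RDD.mapPartitions`) is an assumption rather than something shown here.

```python
from collections import Counter

def local_counts(lines):
    """Return an iterator of (word, count) pairs for a whole chunk of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return iter(counter.items())

chunk = ['the cat and the dog', 'the end']
pairs = dict(local_counts(chunk))
print(pairs['the'])
```

Seven words go in, but only five pairs come out, which is the kind of saving the "local" count is after.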

d) Perform a "reduce" step: aggregate the key/value pairs by key (i.e., by word) and reduce each key's values with the operation `add`. Note that Spark assumes the operation is commutative and associative, since there is no guarantee that the values will be combined in any particular order, but it has no way of checking or enforcing this. If you pass in a function that is not both associative and commutative, you will get unpredictable results or errors.

In [32]:
def add(a,b):
    return a+b

counts = words3.reduceByKey(add)

# sample first few items
counts.take(10)

[(u'fawn', 11),
 (u'considered,', 2),
 (u'considered.', 3),
 (u'mustachio', 1),
 (u'fleeces', 1),
 (u'woods', 8),
 (u'sending.', 3),
 (u'hanging', 27),
 (u'offendeth', 1),
 (u'dance;', 4)]
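
What `reduceByKey` computes can be mimicked in plain Python on a small list of pairs. The `reduce_by_key` function below is a hypothetical single-machine stand-in, not Spark's implementation, and the sample pairs are made up. The last lines also illustrate why a non-associative operation such as subtraction is a poor fit: the result depends on how the values are grouped.

```python
from functools import reduce
from operator import add

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [('the', 1), ('of', 1), ('the', 1), ('the', 1)]
result = reduce_by_key(pairs, add)
print(result['the'])

# subtraction is not associative: the grouping changes the answer
left_grouped = (10 - 3) - 2
right_grouped = 10 - (3 - 2)
print(left_grouped == right_grouped)
```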