# Preparing the environment

**Note: this notebook requires the _ESC403 2016 v7_ (or later) VM.**
    
It will *not* run with earlier versions of the VM.

## Graphics and plotting

In [1]:
# This line configures matplotlib to show figures embedded in the notebook, 
# instead of opening a new window for each figure. 
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# general graphics settings
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (16, 9)

## Spark

This IPython notebook comes with [Spark][1] preinstalled and already initialized.  A global [SparkContext][2] is available in variable `sc`:

[1]: http://spark.apache.org/
[2]: http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark

In [2]:
# SparkContext
sc


Note, however, that Spark has been configured to use the *local* executor, so the level of parallelism is limited by the number of CPUs available to the system.
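
As a rough check of that limit, the number of CPUs visible to Python can be queried with the standard library. This is a plain-Python sketch that does not touch Spark; the variable name `n_cpus` is illustrative.

```python
import multiprocessing

# number of CPUs the operating system reports to this Python process;
# with the local executor, this bounds the useful level of parallelism
n_cpus = multiprocessing.cpu_count()
print(n_cpus)
```

On the VM, `sc.defaultParallelism` reports the default number of partitions Spark itself uses, which may be lower than the CPU count depending on how the local executor was configured.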

----

# Word count with Spark

We now define a single function that computes the word counts for a given text file. Refer to the notebook "Word Count with Spark" for more details on word counting and how to do it in PySpark.

In [None]:
# see: https://docs.python.org/2/library/re.html
import re
from operator import add

# matches any character that is not a letter, digit, or underscore
punctuation = re.compile(r'[^\w]')

def wordcount(filename):
    """Return an RDD of (word, count) pairs for the text file at HDFS path `filename`."""
    # make a Spark RDD from a text file
    lines1 = sc.textFile('hdfs://' + filename)
    # normalize (lowercase, replace punctuation with spaces)
    lines2 = lines1.map(lambda line: punctuation.sub(' ', line).lower())
    # break each line into words (creates a new RDD);
    # lines are already lowercase at this point
    words1 = lines2.flatMap(lambda line: line.split())
    # final map/reduce step
    words2 = words1.map(lambda word: (word, 1))
    counts = words2.reduceByKey(add)
    return counts
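
To see what each stage of the pipeline does without a Spark cluster, the same steps can be sketched in plain Python on an in-memory list of lines. The helper `wordcount_local` and the sample text are hypothetical and not part of the notebook; `collections.Counter` stands in for the map/reduce step.

```python
import re
from collections import Counter

# same pattern as in `wordcount`: anything that is not a word character
punctuation = re.compile(r'[^\w]')

def wordcount_local(lines):
    """Plain-Python analogue of `wordcount` for a list of strings."""
    counts = Counter()
    for line in lines:
        # normalize: replace punctuation with spaces, then lowercase
        normalized = punctuation.sub(' ', line).lower()
        # split into words and add one occurrence per word
        counts.update(normalized.split())
    return counts

sample = ["To be, or not to be:", "that is the question."]
local_counts = wordcount_local(sample)
print(local_counts["to"])
```

Unlike the Spark version, this sketch holds everything in memory, so it is only meant for small inputs.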

We can now get the word counts for the [complete works of William Shakespeare][1] (downloaded from [Project Gutenberg][2]) in a single line:

[1]: http://www.gutenberg.org/ebooks/100
[2]: http://www.gutenberg.org/

In [None]:
wc1 = wordcount('/data/shakespeare.txt')

wc1.take(5)

How can we get *sorted* output, with the most frequently used words first?  Spark's [takeOrdered()][1] method provides a solution:

[1]: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.takeOrdered

In [None]:
wc1.takeOrdered(5)

It turns out that `takeOrdered()` uses the normal Python ordering for values in the RDD: in this case, 2-tuples are sorted lexicographically, i.e., according to the string value of the "word" part first.  To sort on the *second* item in the *(word, count)* pair, we need to pass a custom `key` function.  Also, the sort order is always ascending, so we need to negate the counts to get a descending order.

In [None]:
# to sort on the *second* item in a tuple, we need to pass a custom `key` function
wc1.takeOrdered(5, lambda (k,v): -v)
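
The effect of the `key` function can be tried out without Spark. The sketch below uses `heapq.nsmallest` from the standard library, which, like `takeOrdered()`, returns the first few items in ascending order; the word counts are made up for illustration. The `lambda kv: ...` form is used instead of `lambda (k,v): ...` only because it is valid in both Python 2 and Python 3.

```python
import heapq

# made-up (word, count) pairs, for illustration only
pairs = [("the", 27), ("zebra", 1), ("and", 26), ("apple", 3)]

# default ordering: tuples compare lexicographically, so the word decides
by_word = heapq.nsmallest(2, pairs)

# negated count as key: ascending order on -count is descending order on count
by_count = heapq.nsmallest(2, pairs, key=lambda kv: -kv[1])

print(by_word)
print(by_count)
```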

Now let's compute the word frequencies in the Dickens corpus instead:

In [None]:
wc2 = wordcount('/data/dickens.txt')

# show the number of distinct words contained
print wc2.count()

In [None]:
wc2.takeOrdered(5, lambda (k,v): -v)

---

The lists of most-used words are quite similar.  How can we find differences in word usage?

**First approach:** We have two tables with a common column -- use a "join" operation.

In [None]:
wc = wc1.join(wc2)

wc.take(5)

The first attempt yields unexpected results though:

In [None]:
only_in_shakespeare = wc.filter(lambda (k, (v1, v2)): v2 == 0)

only_in_shakespeare.count()

However, there *are* words which are used in Shakespeare's plays and not in Dickens' novels:

In [None]:
print "Occurrences of 'thou' in Dickens:", wc2.filter(lambda (k,v): k=='thou').take(1)
print "Occurrences of 'thou' in Shakespeare:", wc1.filter(lambda (k,v): k=='thou').take(1)
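
The filter on `v2 == 0` cannot match anything, because an inner join only emits keys that are present on *both* sides: a word missing from one corpus is dropped rather than paired with a zero. A plain-Python sketch with dictionaries in place of RDDs (the helper `inner_join` and the counts are hypothetical):

```python
def inner_join(left, right):
    """Keep only keys present in both dicts, pairing up their values."""
    return {k: (left[k], right[k]) for k in left if k in right}

# made-up counts, for illustration only
shakespeare = {"the": 10, "thou": 5}
dickens = {"the": 12, "coach": 3}

joined = inner_join(shakespeare, dickens)
# "thou" and "coach" are dropped entirely instead of being kept with a count of 0
print(joined)
```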

Maybe we need an "outer join" instead?  (The dataset is too large for running a "full outer join" on the computer we're running this notebook on, so we first reduce each dataset to its 1000 most frequent words to keep the computation tractable.)

In [None]:
# reduce dataset first, to complete in a reasonable time
def top(rdd, num=1000):
    return sc.parallelize(rdd.top(num, lambda (k,v): v))

top_wc1 = top(wc1)
top_wc2 = top(wc2)

In [None]:
wc = top_wc1.fullOuterJoin(top_wc2)

wc.take(5)

In [None]:
only_in_shakespeare = wc.filter(lambda (k, (v1, v2)): v2 is None)

only_in_shakespeare.take(5)
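
A full outer join keeps every key from either side and fills the missing side with `None`, which is why the filter tests for `None` rather than 0. A plain-Python sketch on made-up counts (the helper `full_outer_join` is hypothetical):

```python
def full_outer_join(left, right):
    """Keep every key from either dict; a missing side becomes None."""
    keys = set(left) | set(right)
    return {k: (left.get(k), right.get(k)) for k in keys}

# made-up counts, for illustration only
shakespeare = {"the": 10, "thou": 5}
dickens = {"the": 12, "coach": 3}

joined = full_outer_join(shakespeare, dickens)
# words whose Dickens side is missing occur only in Shakespeare
only_in_shakespeare = sorted(k for k, (v1, v2) in joined.items() if v2 is None)
print(only_in_shakespeare)
```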

As it turns out, PySpark already has a method [subtractByKey][1] to take the difference:

[1]: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.subtractByKey

In [None]:
only_in_dickens = top_wc2.subtractByKey(top_wc1)

only_in_dickens.take(5)
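
Conceptually, `subtractByKey` keeps the pairs of the first dataset whose key does not occur in the second one. A plain-Python sketch on made-up counts (the helper `subtract_by_key` is hypothetical):

```python
def subtract_by_key(left, right):
    """Keep the entries of `left` whose key does not appear in `right`."""
    return {k: v for k, v in left.items() if k not in right}

# made-up counts, for illustration only
shakespeare = {"the": 10, "thou": 5}
dickens = {"the": 12, "coach": 3}

only_in_dickens = subtract_by_key(dickens, shakespeare)
print(only_in_dickens)
```

Note that the values of the second dataset play no role; only its keys matter.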

**Second approach:** (suggested during class) 

- Normalize the counts, dividing by the total number of words in the corpus.  In other words, we seek percentages expressing "usage popularity" of a word.
- Compare these "usage popularity" values across the two sets; e.g., select only words that differ significantly in usage.  Here "significantly" could mean "difference of usage popularity is over a certain threshold."

In [None]:
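def normalize_by_total_count(wc):
    # total number of words in the corpus = sum of all per-word counts
    # (`wc.count()` would only give the number of *distinct* words)
    tot_words = wc.values().sum()
    wfreq = wc.map(lambda (k,v): (k, 100.0*v/tot_words))
    return wfreq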

wf1 = normalize_by_total_count(wc1)
wf2 = normalize_by_total_count(wc2)

wf = wf1.join(wf2)

wf.filter(lambda (k, (v1,v2)): v1-v2 > 0.1).take(5)
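
The two steps of this approach can also be sketched in plain Python on made-up counts. The names `normalize` and `threshold` are illustrative, and, as an assumption of this sketch, the comparison uses the absolute difference so that words more popular in *either* corpus are reported.

```python
def normalize(counts):
    """Convert raw counts to percentages of the corpus total."""
    total = float(sum(counts.values()))
    return {word: 100.0 * n / total for word, n in counts.items()}

# made-up counts, for illustration only
corpus_a = {"the": 50, "thou": 30, "and": 20}
corpus_b = {"the": 50, "thou": 5, "and": 45}

freq_a = normalize(corpus_a)
freq_b = normalize(corpus_b)

threshold = 10.0  # percentage points; an arbitrary choice for this toy data
differing = sorted(
    word for word in freq_a
    if word in freq_b and abs(freq_a[word] - freq_b[word]) > threshold
)
print(differing)
```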

### Test Zipf's law

Zipf's law asserts that a word's frequency is inversely proportional to its rank.
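
In symbols, writing $f(r)$ for the number of occurrences of the word of rank $r$ and $C$ for a corpus-dependent constant (notation introduced here for illustration, not taken from the notebook):

$$
f(r) \approx \frac{C}{r}
\quad\Longrightarrow\quad
\frac{f(r)}{f(1)} \approx \frac{1}{r}
$$

Dividing by the count of the most frequent word removes $C$, so counts scaled this way can be compared directly against the curve $y = 1/x$.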

In [None]:
import numpy as np

x_max = 1000

# ideal Zipf curve y = 1/x for ranks 1..x_max
y = 1.0 / np.arange(1, x_max + 1)

print y[:5]

In [None]:
top_words_with_count1 = wc1.top(x_max, lambda (k,v): v)

top_words_with_count1[:5]

In [None]:
top_freq1 = float(top_words_with_count1[0][1])

top_freq1

In [None]:
freq1 = np.array([(occurrences / top_freq1) for (word, occurrences) in top_words_with_count1])

print freq1[:5]

Repeat the computation with the Dickens corpus:

In [None]:
top_words_with_count2 = wc2.top(x_max, lambda (k,v): v)
top_freq2 = float(top_words_with_count2[0][1])
freq2 = np.array([(occurrences / top_freq2) for (word, occurrences) in top_words_with_count2])

print freq2[:5]

In [None]:
x_max_display = 100

xs = np.arange(1, x_max_display+1)

plt.figure()
plt.plot(xs, y[:x_max_display], 'r', label='ideal (Zipf)')
plt.plot(xs, freq1[:x_max_display], 'b', label='Shakespeare')
plt.plot(xs, freq2[:x_max_display], 'g', label='Dickens')
plt.legend()
plt.xlabel('Rank')
plt.ylabel('Relative frequency')
plt.title('Relative word frequency in different text corpora')
plt.show()
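
Beyond visual comparison, one common way to summarize the fit is the slope of a least-squares line through the points in log-log space: under an exact Zipf law that slope is -1. The following is a minimal sketch in plain Python, assuming a list of relative frequencies ordered by rank; the helper `loglog_slope` is hypothetical and the input here is synthetic, not taken from either corpus.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# synthetic data following y = 1/rank exactly
ideal = [1.0 / rank for rank in range(1, 101)]
slope = loglog_slope(ideal)
print(slope)
```

Applied to `freq1` or `freq2`, a slope close to -1 would suggest that the corpus follows Zipf's law reasonably well over the ranks considered.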