# Word Count Lab: Building a word count application
This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.

## Part 1: Creating a base RDD and pair RDDs

### (1a) Create a base RDD

In [1]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print(type(wordsRDD))

<class 'pyspark.rdd.RDD'>


### (1b) Pluralize and test

Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' appended. Please replace `<FILL IN>` with your solution. If you have trouble, the next cell has the solution. After you have defined `makePlural`, you can run the third cell, which contains a test. If your implementation is correct, it will print `1 test passed`.

This is the general form that exercises will take, except that no example solution will be provided. Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections. The cell that needs to be modified will have `# TODO: Replace <FILL IN> with appropriate code` on its first line. Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution. The last code cell before the next markdown section will contain the tests.

In [2]:
def makePlural(word):
    """Adds an 's' to `word`.

    Note:
        This is a simple function that only adds an 's'.  No attempt is made to follow proper
        pluralization rules.

    Args:
        word (str): A string.

    Returns:
        str: A string with 's' added to it.
    """
    return word + 's'

print(makePlural('cat'))

cats
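
The test cell mentioned above is not reproduced in this lab text. As a rough sketch of what such a check could look like, the snippet below uses a plain `assert`; the helper name `check_equal` and its message format are hypothetical stand-ins, not the course's actual test API.

```python
def makePlural(word):
    """Adds an 's' to `word` (same logic as the cell above)."""
    return word + 's'


def check_equal(actual, expected, message):
    """Hypothetical stand-in for the lab's test helper.

    Raises AssertionError with `message` on mismatch, otherwise
    returns the status string the lab text says should be printed.
    """
    assert actual == expected, '{0}: expected {1!r}, got {2!r}'.format(
        message, expected, actual)
    return '1 test passed.'


result = check_equal(makePlural('rat'), 'rats',
                     'incorrect result: makePlural does not add an s')
print(result)  # 1 test passed.
```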


### (1c) Apply makePlural to the base RDD

In [3]:
pluralRDD = wordsRDD.map(makePlural)
print(pluralRDD.collect())

['cats', 'elephants', 'rats', 'rats', 'cats']


### (1d) Pass a lambda function to map

In [4]:
pluralLambdaRDD = wordsRDD.map(lambda x: x + 's')
print(pluralLambdaRDD.collect())

['cats', 'elephants', 'rats', 'rats', 'cats']


### (1e) Length of each word

In [5]:
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print(pluralLengths)

[4, 9, 4, 4, 4]


### (1f) Pair RDDs

In [6]:
wordPairs = wordsRDD.map(lambda x: (x, 1))
print(wordPairs.collect())

[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]


## Part 2: Counting with pair RDDs

### (2a) `groupByKey()` approach

In [7]:
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print('{0}: {1}'.format(key, list(value)))

rat: [1, 1]
cat: [1, 1]
elephant: [1]


### (2b) Use `groupByKey()` to obtain the counts

In [8]:
wordCountsGrouped = wordsGrouped.map(lambda x: (x[0], sum(x[1])))
print(wordCountsGrouped.collect())

[('rat', 2), ('cat', 2), ('elephant', 1)]


### (2c) Counting using `reduceByKey`

In [9]:
wordCounts = wordPairs.reduceByKey(lambda x, y: x+y)
print(wordCounts.collect())

[('rat', 2), ('cat', 2), ('elephant', 1)]
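
Both approaches give the same counts, but they move different amounts of data. `groupByKey()` sends every `(word, 1)` pair across the network before summing, whereas `reduceByKey()` first combines values within each partition. The following is a plain-Python sketch of that difference, not Spark code: the two-partition layout and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed layout: the five (word, 1) pairs spread over two partitions.
partitions = [
    [('cat', 1), ('elephant', 1)],
    [('rat', 1), ('rat', 1), ('cat', 1)],
]


def group_then_sum(parts):
    """groupByKey-style: shuffle every pair, then sum each group."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    counts = {key: sum(values) for key, values in grouped.items()}
    return counts, len(shuffled)


def combine_then_merge(parts):
    """reduceByKey-style: sum inside each partition, then merge partials."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    merged = defaultdict(int)
    for key, value in shuffled:
        merged[key] += value
    return dict(merged), len(shuffled)


grouped_counts, grouped_records = group_then_sum(partitions)
reduced_counts, reduced_records = combine_then_merge(partitions)
print(sorted(grouped_counts.items()), grouped_records)
# [('cat', 2), ('elephant', 1), ('rat', 2)] 5
print(sorted(reduced_counts.items()), reduced_records)
# [('cat', 2), ('elephant', 1), ('rat', 2)] 4
```

On this tiny input the saving is one record; on a corpus with many repeated words per partition, the gap grows with the amount of repetition.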


### (2d) All together

In [10]:
wordCountsCollected = (wordsRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print(wordCountsCollected)

[('rat', 2), ('elephant', 1), ('cat', 2)]


## Part 3: Finding unique words and a mean value

### (3a) Unique words

In [11]:
uniqueWords = wordCounts.count()
print(uniqueWords)

3


### (3b) Mean using `reduce`

In [12]:
from operator import add
totalCount = (wordCounts
              .map(lambda x: x[1])
              .reduce(add))
average = totalCount / float(wordCounts.count())
print(totalCount)
print(round(average, 2))

5
1.67


## Part 4: Apply word count to a file

### (4a) `wordCount` function

In [13]:
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    return wordListRDD.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y)
                   
print(wordCount(wordsRDD).collect())

[('rat', 2), ('elephant', 1), ('cat', 2)]


### (4b) Capitalization and punctuation

In [14]:
import re
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    return re.sub(r'[^A-Za-z0-9 ]', '', text).lower().strip()
print(removePunctuation('Hi, you!'))
print(removePunctuation(' No under_score!'))
print(removePunctuation(' *      Remove punctuation then spaces  * '))

hi you
no underscore
remove punctuation then spaces
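
A few more inputs help show what the character class `[^A-Za-z0-9 ]` keeps and drops. The sketch below assumes Python 3 and repeats the function body from the cell above; the sample strings are made up for illustration. Note that tabs and non-ASCII letters fall outside the allowed set, so they are deleted rather than treated as separators or letters.

```python
import re


def removePunctuation(text):
    """Same logic as the cell above: keep ASCII letters, digits, and spaces."""
    return re.sub(r'[^A-Za-z0-9 ]', '', text).lower().strip()


# Illustrative inputs (not taken from the Shakespeare file).
samples = ["it's", 'self-substantial', 'tab\tseparated', 'caf\u00e9']
cleaned = [removePunctuation(s) for s in samples]
print(cleaned)
# ['its', 'selfsubstantial', 'tabseparated', 'caf']
```

If the input may contain tabs or accented letters, the pattern would need to be adjusted; the lab's specification only asks for spaces, letters, and numbers.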


### (4c) Load a text file

In [16]:
import os.path
fileName = "hdfs:/" + os.path.join('user', 'root', 'shakespeare.txt')

shakespeareRDD = sc.textFile(fileName, 8).map(removePunctuation)
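print('\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                # Index into the pair instead of unpacking in the lambda signature,
                # which Python 3 does not allow.
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(15)))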

0: 1609
1: 
2: the sonnets
3: 
4: by william shakespeare
5: 
6: 
7: 
8: 1
9: from fairest creatures we desire increase
10: that thereby beautys rose might never die
11: but as the riper should by time decease
12: his tender heir might bear his memory
13: but thou contracted to thine own bright eyes
14: feedst thy lights flame with selfsubstantial fuel


### (4d) Words from lines

In [18]:
shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split(' '))
shakespeareWordCount = shakespeareWordsRDD.count()
print(shakespeareWordsRDD.top(5))
print(shakespeareWordCount)

[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']
927631
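
`flatMap()` is used here rather than `map()` because splitting a line yields a list of words, and the goal is one RDD element per word, not one list per line. A plain-Python analogue of the two behaviours, using two lines from the output above (the variable names are illustrative only):

```python
from itertools import chain

lines = ['the sonnets', 'by william shakespeare']

# map-style: one result per input line, so the result is a list of lists.
mapped = [line.split(' ') for line in lines]

# flatMap-style: the per-line lists are concatenated into one flat sequence.
flat = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)  # [['the', 'sonnets'], ['by', 'william', 'shakespeare']]
print(flat)    # ['the', 'sonnets', 'by', 'william', 'shakespeare']
```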


### (4e) Remove empty elements

In [19]:
shakeWordsRDD = shakespeareWordsRDD.filter(lambda x: x != '')
shakeWordCount = shakeWordsRDD.count()
print(shakeWordCount)

882996
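
The count drops after filtering because `split(' ')` emits an empty string for every blank line and for every extra space between words. A small plain-Python sketch of where those empty elements come from (the sample lines are illustrative, with one containing a double space):

```python
lines = ['', 'the sonnets', 'no  underscore']

# Splitting on a single space keeps empty strings.
with_empties = [word for line in lines for word in line.split(' ')]
print(with_empties)       # ['', 'the', 'sonnets', 'no', '', 'underscore']

# Equivalent of the filter step in the cell above.
non_empty = [word for word in with_empties if word != '']
print(len(with_empties), len(non_empty))  # 6 4

# For comparison: split() with no argument splits on runs of whitespace
# and never returns empty strings.
whitespace_split = [word for line in lines for word in line.split()]
print(whitespace_split)   # ['the', 'sonnets', 'no', 'underscore']
```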


### (4f) Count the words

In [20]:
top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, key=lambda x: -x[1])
print('\n'.join('{0}: {1}'.format(w, c) for w, c in top15WordsAndCounts))

the: 27361
and: 26028
i: 20681
to: 19150
of: 17463
a: 14593
you: 13615
my: 12481
in: 10956
that: 10890
is: 9134
not: 8497
with: 7771
me: 7769
it: 7678
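
`takeOrdered(n, key=...)` returns the `n` smallest elements under the given key, so negating the count puts the most frequent words first. A plain-Python analogue using `heapq`, with five of the counts printed above in shuffled order (this is a local sketch, not Spark code):

```python
import heapq

counts = [('of', 17463), ('the', 27361), ('to', 19150),
          ('and', 26028), ('i', 20681)]

# n smallest by negated count == n largest by count.
top3 = heapq.nsmallest(3, counts, key=lambda x: -x[1])
print(top3)  # [('the', 27361), ('and', 26028), ('i', 20681)]

# The same selection expressed directly as "largest by count".
top3_alt = heapq.nlargest(3, counts, key=lambda x: x[1])
print(top3 == top3_alt)  # True
```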
