#WORD COUNT APPLICATION

- Here we will find the most common words in "Harry Potter and the Sorcerer's Stone" by J.K. Rowling using Spark.

###Let's first start with some basics

■ Let's get some words and create a base RDD

In [3]:
words=['cat','mouse','dog','dog','cat']
wordsRDD=sc.parallelize(words)

#print the type of wordsRDD
print type(wordsRDD)

<class 'pyspark.rdd.RDD'>


■ Define a function that adds 's' to every word

In [4]:
def plural(word):
    """Add 's' to word.
        Returns: string"""
    return word + 's'

print plural('cat')

cats


■ Now apply the plural function to wordsRDD

In [8]:
pluralRDD=wordsRDD.map(lambda x:plural(x))

print pluralRDD.collect()

['cats', 'mouses', 'dogs', 'dogs', 'cats']


- Looks great!!!

■ Calculate the length of each word

In [11]:
pluralLength=pluralRDD.map(lambda x:len(x))

print pluralLength.collect()

[4, 6, 4, 4, 4]


■ Create a pair RDD

- Create a pair `(word, 1)` for each word in wordsRDD

In [12]:
wordPairs=wordsRDD.map(lambda x:(x,1))

print wordPairs.collect()

[('cat', 1), ('mouse', 1), ('dog', 1), ('dog', 1), ('cat', 1)]


■ Counting with a pair RDD

- Let's count the number of times each word appears in the RDD

In [17]:
#use groupByKey: it needs no arguments
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))

mouse: [1]
dog: [1, 1]
cat: [1, 1]


- Sum the counts for every word

In [18]:
wordCountsGrouped = wordsGrouped.map(lambda x:(x[0],sum(x[1])))
print wordCountsGrouped.collect()

[('mouse', 1), ('dog', 2), ('cat', 2)]


- Let's try another method

In [19]:
from operator import add

#use reduceByKey
wordCounts = wordPairs.reduceByKey(add)

print wordCounts.collect()

[('mouse', 1), ('dog', 2), ('cat', 2)]
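
Both approaches give the same counts, but they get there differently: `groupByKey` gathers every value for a key into a list before anything is summed, while `reduceByKey` folds values into a running total per key, so less data has to be held (and, in Spark, shuffled). The plain-Python sketch below only mimics that difference on a local list. `group_by_key` and `reduce_by_key` are hypothetical helper names for illustration, not Spark APIs.

```python
from collections import defaultdict
from operator import add

# Same pairs as wordPairs above, held in a local list instead of an RDD.
pairs = [('cat', 1), ('mouse', 1), ('dog', 1), ('dog', 1), ('cat', 1)]

def group_by_key(pairs):
    """Collect every value for a key into a list (what groupByKey hands back)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(pairs, func):
    """Keep one running result per key, combining values as they arrive."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

grouped = group_by_key(pairs)
summed = {key: sum(values) for key, values in grouped.items()}
reduced = reduce_by_key(pairs, add)

print(grouped)
print(summed == reduced)
```

On five words the difference is invisible, but on a whole novel the grouped lists for frequent words become large while the running totals stay a single number per key.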


- Expert version: use map, reduceByKey and collect in one statement

In [20]:
wordCountsCollected = wordsRDD.map(lambda s:(s,1)).reduceByKey(add).collect()
print wordCountsCollected


[('mouse', 1), ('dog', 2), ('cat', 2)]


■ Find unique words and mean value

- Get the number of unique words in wordsRDD

In [22]:
uniqueWords = len(wordCountsCollected)
print uniqueWords

3


- Get the mean using reduce
- Find the mean number of occurrences per unique word in `wordCounts`

In [23]:
totalCount = wordCounts.map(lambda s:s[1]).reduce(add)
average = totalCount / float(uniqueWords)

print totalCount

print round(average, 2)

5
1.67
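
The `float(uniqueWords)` cast matters because this notebook runs on Python 2, where `/` between two integers truncates. A small sketch of the difference, using `//` to show the truncating behaviour in a way that is identical on Python 2 and 3 (the variable names here are illustrative copies of the values printed above):

```python
total_count = 5    # total occurrences, as printed above
unique_words = 3   # number of distinct words

# Integer (floor) division: what a bare `/` would do on two ints in Python 2.
truncated = total_count // unique_words

# Casting one operand to float forces true division.
average = total_count / float(unique_words)

print(truncated)
print(round(average, 2))
```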


###Let's apply word count to a file

**NOTE: Do not call `collect()` on every RDD. It returns all the elements to the driver, which can slow down processing on a large dataset.**

1. Define a function for word counting

In [24]:
from operator import add

def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words."""
    return wordListRDD.map(lambda s: (s, 1)).reduceByKey(add)

print wordCount(wordsRDD).collect()


[('mouse', 1), ('dog', 2), ('cat', 2)]


2. Define the function `removePunctuation` that converts all text to lower case, removes leading and trailing spaces, and removes any punctuation.
- Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space.

In [25]:
import re
def removePunctuation(text):
    """Removes punctuation, changes to lowercase, and strips leading and trailing spaces."""
    
    text=text.strip().lower()              
    return re.sub(r'[^A-Za-z0-9\s]+', '', text)

print removePunctuation('Hi, you!')
print removePunctuation(" The Elephant's 4 cats. ")

hi you
the elephants 4 cats
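
One edge case worth knowing about: `removePunctuation` strips spaces before it deletes punctuation, so a line that ends in a space followed by punctuation keeps a trailing space. The sketch below compares the order used above with a hypothetical variant that strips last. Both function names are made up for this illustration, and the variant is only a possible alternative, not what the notebook ran.

```python
import re

def strip_then_remove(text):
    """Same order as removePunctuation above: strip, lowercase, then delete punctuation."""
    text = text.strip().lower()
    return re.sub(r'[^A-Za-z0-9\s]+', '', text)

def remove_then_strip(text):
    """Hypothetical variant: delete punctuation first, strip spaces last."""
    return re.sub(r'[^a-z0-9\s]+', '', text.lower()).strip()

sample = 'Wait -'
print(repr(strip_then_remove(sample)))
print(repr(remove_then_strip(sample)))
```

For the two test strings used above, both orders return the same result. Switching to the variant on the full text could change how many empty strings appear after splitting, so the counts reported in the following cells apply only to the original order.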


**GET THE DATA FILE**

In [26]:
potterRDD = (sc.textFile('../HIMANSHU/Harry_Potter_and_the_Sorcerer_s_Stone.txt', 8).map(removePunctuation))
print '\n'.join(potterRDD.zipWithIndex().map(lambda (l, num): '{0}: {1}'.format(num, l)).take(15))

0: 
1: 
2: 1harry potter and the sorcerers stonechapter onethe boy who livedmr and mrs dursley of number four privet drive were proud to saythat they were perfectly normal thank you very much they were the lastpeople youd expect to be involved in anything strange or mysteriousbecause they just didnt hold with such nonsensemr dursley was the director of a firm called grunnings which madedrills he was a big beefy man with hardly any neck although he didhave a very large mustache mrs dursley was thin and blonde and hadnearly twice the usual amount of neck which came in very useful as shespent so much of her time craning over garden fences spying on theneighbors the dursleys had a small son called dudley and in theiropinion there was no finer boy anywherethe dursleys had everything they wanted but they also had a secret andtheir greatest fear was that somebody would discover it they didntthink they could bear it if anyone found out about the potters mrspotter was mrs dursleys sister but th
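
The `lambda (l, num): ...` form relies on tuple parameter unpacking, which exists only in Python 2. A rough plain-Python sketch of the same numbering step that also runs on Python 3, assuming a small hypothetical list `lines` in place of the RDD:

```python
# Hypothetical stand-in for the first few cleaned lines of the file.
lines = ['', '', 'harry potter and the sorcerers stone']

# zipWithIndex pairs each element with its position as (element, index).
with_index = [(line, num) for num, line in enumerate(lines)]

# Index into the pair instead of unpacking it in the lambda signature.
formatted = ['{0}: {1}'.format(pair[1], pair[0]) for pair in with_index]

print('\n'.join(formatted))
```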

**WORDS FROM LINE**

Apply a transformation that will split each element of the RDD by its spaces.

In [41]:
potterWordsRDD = potterRDD.flatMap(lambda x: x.split(" "))
potterWordCount = potterWordsRDD.count()

print potterWordsRDD.top(5)

print potterWordCount

[u'zoorestaurant', u'zooming', u'zooming', u'zoomed', u'zoom']
72046


- REMOVE EMPTY ELEMENTS

- Remove all entries where the word is `''`

In [42]:
potWordsRDD = potterWordsRDD.filter(lambda x:x!='')
        
potWordCount = potWordsRDD.count()
print potWordCount

70370
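
The empty strings come from `split(" ")` itself: with an explicit separator, Python returns `''` for an empty line and for every gap between two consecutive spaces. A short plain-Python sketch, using a hypothetical three-line sample rather than the real file:

```python
# Hypothetical cleaned lines: one empty, one with a double space.
lines = ['', 'the boy  who lived', 'mr dursley']

# Equivalent of flatMap(lambda x: x.split(" ")) on a local list.
words = [w for line in lines for w in line.split(' ')]

# Equivalent of filter(lambda x: x != '').
non_empty = [w for w in words if w != '']

# split() with no argument splits on runs of whitespace and never yields ''.
whitespace_split = [w for line in lines for w in line.split()]

print(words)
print(len(words), len(non_empty))
```

Calling `split()` without an argument would avoid the separate filter step, at the cost of also splitting on tabs and other whitespace.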


**COUNT THE WORDS**

- We now have an RDD that contains only words.
- Next, let's apply the `wordCount()` function to produce a list of word counts.
- We can view the top 20 words by using the `takeOrdered()` action.
- Since the elements of the RDD are pairs, we need a custom sort key that orders by the value part of the pair (negated, so the largest counts come first).

In [43]:
top20WordsAndCounts = wordCount(potWordsRDD).takeOrdered(20,lambda s:-1*s[1])

print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top20WordsAndCounts))

the: 3036
and: 1662
to: 1651
a: 1473
he: 1259
of: 1117
was: 1044
in: 866
harry: 813
his: 808
it: 806
said: 739
you: 692
had: 620
at: 558
on: 556
that: 509
i: 499
they: 461
as: 454
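
`takeOrdered(n, key)` returns the `n` smallest elements according to `key`, which is why the count is negated to get the largest first. A local analogue, sketched with `heapq.nsmallest` on a handful of the pairs printed above (the list `counts` is just a hand-copied sample):

```python
import heapq

# A few (word, count) pairs copied from the output above, deliberately unsorted.
counts = [('harry', 813), ('the', 3036), ('to', 1651), ('a', 1473), ('and', 1662)]

# Smallest by negated count == largest by count.
top3 = heapq.nsmallest(3, counts, key=lambda s: -1 * s[1])

print(top3)
```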


**You'll notice that many of the words are common English words. These are called stopwords.**

- GET THE STOPWORDS FILE 

In [44]:
stopwords = set(sc.textFile('../HIMANSHU/stopwords.txt').collect())

print 'These are the stopwords: %s' % stopwords

These are the stopwords: set([u'all', u'just', u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'with', u'had', u'should', u'to', u'only', u'under', u'ours', u'has', u'do', u'them', u'his', u'very', u'they', u'not', u'during', u'now', u'him', u'nor', u'did', u'these', u't', u'each', u'where', u'because', u'doing', u'theirs', u'some', u'are', u'our', u'ourselves', u'out', u'what', u'for', u'below', u'does', u'above', u'between', u'she', u'be', u'we', u'after', u'here', u'hers', u'by', u'on', u'about', u'of', u'against', u's', u'or', u'own', u'into', u'yourself', u'down', u'your', u'from', u'her', u'whom', u'there', u'been', u'few', u'too', u'themselves', u'was', u'until', u'more', u'himself', u'that', u'but', u'off', u'herself', u'than', u'those', u'he', u'me', u'myself', u'this', u'up', u'will', u'while', u'can', u'were', u'my', u'and', u'then', u'is', u'in', u'am', u'it', u'an', u'as', u'itself', u'at', u'have', u'further', u'their', u'if', u'again', u'no', u

**An implementation of tokenization that excludes stopwords**

- Define a function that will remove stopwords from an RDD

In [45]:
def tokenize(data):
    #data: RDD of words
    #return: RDD with the stopwords removed
    return data.filter(lambda x:x not in stopwords)    

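- LET'S REMOVE STOPWORDS FROM **[`potterWordsRDD`]** (this RDD still contains the empty strings, so they are filtered out again in the cell after this one)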

In [46]:
emmaWordsRDD=tokenize(potterWordsRDD)

emmaWordCount = emmaWordsRDD.count()

print emmaWordsRDD.top(5)

print emmaWordCount

[u'zoorestaurant', u'zooming', u'zooming', u'zoomed', u'zoom']
42500


In [47]:
emmaWordsRDD = emmaWordsRDD.filter(lambda x:x!='')
        
emmaWordCount = emmaWordsRDD.count()
print emmaWordCount

40824
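
The drop from 42500 to 40824 is 1676 entries, the same number of empty strings removed earlier (72046 - 70370). That is consistent with `tokenize` having been applied to `potterWordsRDD`, which still held the empty strings, and `''` not being in the stopword list. A plain-Python sketch of the effect, with a hypothetical mini stopword set and word list:

```python
# Hypothetical stand-ins for the stopword set and the unfiltered word list.
stopwords = {'the', 'and', 'to', 'a', 'he', 'of', 'was'}
words = ['harry', 'was', 'the', 'boy', '', 'and', 'ron', '']

# Same predicate as tokenize(): keep anything that is not a stopword.
kept = [w for w in words if w not in stopwords]

# Second pass, as in the cell above: drop the empty strings that slipped through.
cleaned = [w for w in kept if w != '']

print(kept)
print(cleaned)
```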


**NOW COUNT THE WORDS**

In [48]:
top15WordsAndCounts = wordCount(emmaWordsRDD).takeOrdered(15,lambda s:-1*s[1])

print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))

harry: 813
said: 739
ron: 311
hagrid: 244
back: 220
one: 210
hermione: 175
got: 170
could: 167
didnt: 165
get: 165
like: 163
know: 161
looked: 147
see: 146


#THAT'S IT 

Top 5 words and counts from the novel, after removing stopwords:

harry: 813

said: 739

ron: 311

hagrid: 244

back: 220
