# PySpark word-frequency counter
by Héctor Ramírez

What are the most frequent words in the complete works of William Shakespeare?

[The complete works were downloaded from http://www.gutenberg.org/ebooks/100]

### =============================================================

In [1]:
# We import a list of stop words
import pandas as pd
stop_words = pd.read_csv('stop_words.csv', header=None, index_col=False).iloc[:, 0].values.tolist()

In [2]:
# Locate the local Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Creating Spark Context
from pyspark import SparkContext
sc = SparkContext("local", "first app")

# Read the complete works of Shakespeare, one RDD element per line
baseRDD = sc.textFile('Complete_Shakespeare.txt')
print('The first few works in the contents list are:\n')
print('=====================================')
for line in baseRDD.take(20):
    print(line)
print('=====================================\n')

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split())

# Count the total number of words
print("Total number of words in the books:", splitRDD.count())

The first few works in the contents list are:

Contents



               THE SONNETS

               ALL’S WELL THAT ENDS WELL

               THE TRAGEDY OF ANTONY AND CLEOPATRA

               AS YOU LIKE IT

               THE COMEDY OF ERRORS

               THE TRAGEDY OF CORIOLANUS

               CYMBELINE

               THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


Total number of words in the books: 957798
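
The splitting step relies on `flatMap` rather than `map`. Here is a minimal plain-Python sketch of the difference, with no Spark required. The `sample_lines` list is a hypothetical stand-in for `baseRDD`, not the real file contents.

```python
from itertools import chain

# Hypothetical sample lines standing in for baseRDD (not the real file contents).
sample_lines = [
    "               THE SONNETS",
    "",
    "From fairest creatures we desire increase,",
]

# map-style splitting: one list of words per line, so the result is nested
# and blank lines become empty lists.
per_line = [line.split() for line in sample_lines]

# flatMap-style splitting: the per-line lists are concatenated into a single
# flat sequence of words, which is what gets counted.
flat_words = list(chain.from_iterable(per_line))

print(per_line)
print(flat_words)
print("Total number of words:", len(flat_words))
```

Because `str.split()` with no arguments splits on any run of whitespace, indentation and blank lines add nothing to the count, but punctuation stays attached to each word.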


In [3]:
# Remove stop words. Each word is lowercased only for the comparison,
# so the words kept in the RDD retain their original casing and punctuation.
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# Map each word to a (word, 1) pair
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)

# Display the first 10 words and their frequencies
print('The first 10 words and their frequencies:\n')
for word in resultRDD.take(10):
    print(word)

# Swap the keys and values 
resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

print('\n')

# Show the top 10 most frequent words and their frequencies
print('The most frequent words:\n')
for word in resultRDD_swap_sort.take(10):
    print("{} has {} counts".format(word[1], word[0]))

The first 10 words and their frequencies:

('Contents', 20)
('SONNETS', 2)
('ALL’S', 2)
('WELL', 4)
('ENDS', 2)
('TRAGEDY', 16)
('ANTONY', 23)
('CLEOPATRA', 7)
('LIKE', 2)
('COMEDY', 2)


The most frequent words:

thou has 4514 counts
thy has 3918 counts
shall has 3246 counts
good has 2169 counts
would has 2132 counts
Enter has 2005 counts
thee has 1888 counts
hath has 1720 counts
like has 1642 counts
you, has 1568 counts
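
Entries such as `Enter` and `you,` appear in the list because tokens are counted with their original casing and attached punctuation. One possible refinement is to normalize each token before filtering and counting. Below is a small sketch in plain Python so it can be checked without Spark. `normalize_token`, `sample_stop_words` and `sample_tokens` are hypothetical names and data introduced only for illustration.

```python
import string
from collections import Counter

# Hypothetical stand-ins: a tiny stop-word set and a few tokens, not the real data.
sample_stop_words = {"the", "and", "to"}
sample_tokens = ["Thou", "thou", "you,", "you", "The", "and", "Enter", "enter."]

def normalize_token(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation.

    Typographic characters such as the curly apostrophe are not in
    string.punctuation and are left untouched.
    """
    return token.lower().strip(string.punctuation)

# Normalize, drop empty strings and stop words, then count.
normalized = [normalize_token(t) for t in sample_tokens]
kept = [w for w in normalized if w and w not in sample_stop_words]
counts = Counter(kept)

print(counts.most_common(3))
```

In the notebook, the same function could presumably be applied with `splitRDD.map(normalize_token)` ahead of the stop-word filter. Doing so would merge variants such as `you,` and `you`, so the resulting counts would differ from the ones printed above.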


In [4]:
# Stopping Spark Context
sc.stop()