# WordCount Example

In this example we'll use the *Back to the Future* transcript, in which each line is formatted as `Character: Line`. For example:

`Doc: Marty, is that you?`

In the first part we'll count how often each word occurs in the transcript (filtering out the character names) and sort the words from most to least frequently used.

In the second part we'll filter out common words, known as stop words, by importing a Python package using pip. 

Finally, we'll find the most common words used by each character.

## Part 1: Simple Word Count

In [1]:
import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

In [2]:
// Load the transcript using SparkContext.textFile
// This will return an RDD of strings - one for each line in the transcript 
val lines = sc.textFile("file:///usr/data/backtothefuture_transcript.txt")

In [None]:
// This function will be called for each line in the transcript
// First we strip the character name prefix (e.g. "Marty:")
// Then we strip every character that is not a letter, a space, or an apostrophe
// Finally, we split on whitespace and return an array of the non-empty words
def parseLine(line: String): Array[String] = {
    val spoken = line.replaceAll("^[^:]+:", "")
    val cleaned = spoken.replaceAll("[^a-zA-Z ']", "")
    cleaned.split("\\s+").filterNot(word => word == "")
}
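
As a quick check, the following minimal sketch applies the same parsing steps to the sample line from the introduction. The function is repeated so the cell runs on its own, and the name `sample` is introduced here only for illustration.

```scala
// Same parsing logic as parseLine above, repeated so this cell is self-contained
def parseLine(line: String): Array[String] = {
    val spoken = line.replaceAll("^[^:]+:", "")        // drop the "Character:" prefix
    val cleaned = spoken.replaceAll("[^a-zA-Z ']", "") // keep letters, spaces, apostrophes
    cleaned.split("\\s+").filterNot(word => word == "")
}

// Sample line from the introduction (hypothetical variable name)
val sample = parseLine("Doc: Marty, is that you?")
println(sample.mkString(", ")) // Marty, is that you
```

The text after the colon starts with a space, so `split` produces a leading empty string; the `filterNot` call is what removes it. Note that capitalization is preserved, so `The` and `the` would be counted as different words.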

In [None]:
// flatMap can map each input to 0 or more outputs
// In this case each line of text will be mapped to 0 or more words
val words = lines.flatMap(line => parseLine(line))
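
To see why `flatMap` is used rather than `map`, here is a small sketch on a plain Scala collection instead of an RDD. The sample data and the names `sampleLines`, `nested`, and `flat` are made up for illustration.

```scala
// Hypothetical two-line sample standing in for the RDD of transcript lines
val sampleLines = Seq("great scott", "this is heavy")

// map yields exactly one output per input: a collection of arrays
val nested = sampleLines.map(line => line.split(" "))

// flatMap flattens those arrays into a single collection of words
val flat = sampleLines.flatMap(line => line.split(" "))

println(nested.length) // 2
println(flat.length)   // 5
```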

In [None]:
// Map each word to a (key, 1) pair, where key is the word
var wordCounts = words.map(x => (x, 1))

In [None]:
// reduceByKey merges all pairs that share a key into a single pair,
// combining their values two at a time with the given function
// Here that function is x + y, giving us the total count for each word (the key)
wordCounts = wordCounts.reduceByKey((x, y) => x + y)
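
The pair-then-reduce pattern can be sketched on a plain Scala collection. This is only an illustration of the semantics, not how Spark executes it: the sample words and the names `sampleWords`, `pairs`, and `counts` are assumptions, and `groupBy` stands in for the per-key merging that `reduceByKey` performs.

```scala
// Hypothetical word list standing in for the `words` RDD
val sampleWords = Seq("time", "flux", "time", "doc", "time", "flux")

// Step 1: pair every word with an initial count of 1
val pairs = sampleWords.map(word => (word, 1))

// Step 2: group the pairs by word, then combine each group's counts with x + y,
// which mirrors what reduceByKey does for each key
val counts = pairs
    .groupBy { case (word, _) => word }
    .map { case (word, group) => (word, group.map(_._2).reduce((x, y) => x + y)) }

println(counts("time")) // 3
println(counts("flux")) // 2
println(counts("doc"))  // 1
```

Because the values are combined pairwise in no particular order, the function passed to `reduceByKey` should be associative and commutative, which addition is.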

In [None]:
// Here we swap the elements of each pair, so instead of (word, count)
// they will be stored as (count, word)
// This will allow us to sort by the key (count)
val wordCountsReversed = wordCounts.map(x => (x._2, x._1))

In [None]:
// Sort by key (which is now count) descending
val wordCountsSorted = wordCountsReversed.sortByKey(false)

In [None]:
// Take the 10 most frequent words (returned as an array of (count, word) pairs)
wordCountsSorted.take(10)
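
The swap, sort, and take steps can be sketched on a plain Scala collection as follows. The sample counts and the names `sampleCounts`, `swapped`, and `topTwo` are hypothetical, and the counts are chosen to be distinct so the order is unambiguous.

```scala
// Hypothetical (word, count) pairs standing in for the `wordCounts` RDD
val sampleCounts = Seq(("doc", 1), ("time", 3), ("flux", 2))

// Swap each pair to (count, word), as in the notebook's swap step
val swapped = sampleCounts.map { case (word, count) => (count, word) }

// Sort by the count in descending order and keep the first two entries
val topTwo = swapped.sortBy { case (count, _) => -count }.take(2)

println(topTwo) // List((3,time), (2,flux))
```

On an RDD, `sortBy` can also sort on the count directly, for example `wordCounts.sortBy(_._2, ascending = false)`, which would avoid the swap step.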