# Simple WordCount

## Outline

A simple word count: MapReduce-style processing and working with tuples in Spark


### Transformations and Actions

**Actions** return values

- `collect`
- `count`

**Transformations** return pointers to new RDDs

- `map`
- `reduceByKey`
- `glom`
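
Transformations are lazy: Spark only records them, and no work happens until an action asks for a result. Python's built-in `map` behaves in a similar way, so the plain-Python sketch below may help as an analogy. It does not use Spark at all, and `tag` and `calls` are illustrative names, not part of any API.

```python
# Plain-Python analogy for lazy transformations; no Spark involved.
calls = []  # records every element that has actually been processed


def tag(word):
    """Pair a word with the count 1 and log that the work happened."""
    calls.append(word)
    return (word, 1)


words = ['dog', 'cat', 'cat', 'bird', 'dog']

# Like a transformation: builds a lazy iterator, processes nothing yet.
lazy_pairs = map(tag, words)
calls_before = len(calls)

# Like an action: forces evaluation and returns concrete values.
pairs = list(lazy_pairs)
calls_after = len(calls)

print(calls_before, calls_after)  # prints: 0 5
```

The count of processed elements stays at zero until `list` consumes the iterator, much as an RDD pipeline stays idle until `collect` or `count` is called.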

Everything starts with a `SparkContext`

In [1]:
import pyspark
sc = pyspark.SparkContext('local[*]')


In [2]:
wordsList = ['dog', 'cat', 'cat', 'bird', 'dog']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print(type(wordsRDD))

<class 'pyspark.rdd.RDD'>



- `glom`: Returns a new RDD in which the elements of each partition are gathered into a single list.
- `collect`: Returns all elements of an RDD to the driver as a list.

In [3]:
for x in wordsRDD.glom().collect():
    print(x) 

['dog']
['cat']
['cat']
['bird', 'dog']
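
The grouping above is consistent with splitting the list into contiguous index ranges of near-equal size. The sketch below is plain Python and only an assumption about the slicing rule; `slice_into_partitions` is a hypothetical helper, not a Spark function, and Spark's real behaviour may differ for other inputs.

```python
def slice_into_partitions(items, num_slices):
    """Split items into num_slices contiguous chunks by index range.

    Hypothetical helper: an assumed slicing rule, not Spark's API.
    """
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


words = ['dog', 'cat', 'cat', 'bird', 'dog']
partitions = slice_into_partitions(words, 4)
for part in partitions:
    print(part)
```

With five elements and four slices, the last slice receives the extra element, which matches the `glom` output shown above.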


Use `map` to turn each word into a `(word, 1)` tuple

In [4]:
# One transformation (map) on top of the parallelized RDD; nothing runs yet
tupleRDD = wordsRDD.map(lambda x: (x, 1))

# One action: collect triggers the computation and returns a list
print(tupleRDD.collect())


[('dog', 1), ('cat', 1), ('cat', 1), ('bird', 1), ('dog', 1)]


Use `reduceByKey` to sum the counts for each word

In [5]:
reducedRDD = tupleRDD.reduceByKey(lambda x, y: x+y)
print(type(reducedRDD))
print(reducedRDD.collect())


<class 'pyspark.rdd.PipelinedRDD'>
[('cat', 2), ('bird', 1), ('dog', 2)]
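
To make the semantics concrete, here is a minimal plain-Python sketch of what `reduceByKey` computes: values that share a key are folded together with the supplied function. `reduce_by_key` is a hypothetical helper written for illustration. It runs on a single list, whereas Spark combines values within each partition first and then merges across partitions, which is why the function should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key with func (hypothetical helper, not Spark)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            # Key seen before: combine the running value with the new one.
            totals[key] = func(totals[key], value)
        else:
            # First occurrence of this key: start from its value.
            totals[key] = value
    return list(totals.items())


pairs = [('dog', 1), ('cat', 1), ('cat', 1), ('bird', 1), ('dog', 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts))  # prints: [('bird', 1), ('cat', 2), ('dog', 2)]
```

The result is sorted before printing because the order of keys returned by `collect` after a shuffle should not be relied on.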


### Put it all together

In [6]:
wordsList = ['dog', 'cat', 'cat', 'bird', 'dog']
wordsRDD = sc.parallelize(wordsList, 4)

wordCountsCollected = (wordsRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print(wordCountsCollected)


[('cat', 2), ('bird', 1), ('dog', 2)]
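
For a small input like this one, the result can be sanity-checked on the driver with `collections.Counter`. The sketch below is plain Python; `spark_counts` is simply the output above copied in by hand, and `ranked` is an illustrative name.

```python
from collections import Counter

words_list = ['dog', 'cat', 'cat', 'bird', 'dog']

# Result copied from the Spark pipeline output above.
spark_counts = [('cat', 2), ('bird', 1), ('dog', 2)]

# Count the same words locally and compare, ignoring order.
expected = Counter(words_list)
assert dict(spark_counts) == expected

# Rank by count (descending), breaking ties alphabetically.
ranked = sorted(spark_counts, key=lambda kv: (-kv[1], kv[0]))
print(ranked)  # prints: [('cat', 2), ('dog', 2), ('bird', 1)]
```

This kind of check only makes sense when the data fits in driver memory; for large datasets the counting should stay in Spark.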
