# Wordcount example (save to CSV)

This notebook shows the classic wordcount example, in which we count how many times each word appears in a text.
The example also shows how to use basic PySpark tools like DataFrame and SQL.

In [None]:
# Locate the PySpark installation and make it importable
import findspark
findspark.init()

In [None]:
# Creating Spark Context
from pyspark import SparkContext
sc = SparkContext("local", "Wordcount")

In the step below we read a local input file that will be our data source. The textFile() and wholeTextFiles() methods read files into RDDs, the low-level data abstraction of Spark (other methods exist to read directly into a DataFrame).
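
The difference between the two methods lies in the shape of the records they return: textFile() yields one record per line, while wholeTextFiles() yields one (path, content) pair per file. The sketch below is only a plain-Python analogy of those shapes and does not use Spark; the helper names `read_lines` and `read_whole_files` and the sample file are made up for illustration.

```python
import os
import tempfile


def read_lines(path):
    """Analogy of sc.textFile: one record per line of the file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_whole_files(directory):
    """Analogy of sc.wholeTextFiles: one (path, content) record per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        with open(full_path, encoding="utf-8") as f:
            records.append((full_path, f.read()))
    return records


# Hypothetical sample file, created in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    line_records = read_lines(sample_path)
    file_records = read_whole_files(tmp_dir)

print(line_records)
print(len(file_records))
```

For a wordcount, the line-per-record shape is the convenient one, because each record can be split into words independently.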

Each line of the text file is a *row*. We can apply a series of chained operations:
1. flatMap splits each line into words and produces a new dataset of <word> elements
2. map produces a new dataset of pairs in the form <word, count>
3. reduceByKey aggregates the pairs by summing the counts of rows with the same key
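
To see what each of the three steps computes, they can be traced on a tiny in-memory input. This is a minimal plain-Python sketch of the same logic, not Spark code; the sample lines are made up, and Spark would run these steps in a distributed fashion across partitions.

```python
from collections import defaultdict

sample_lines = ["to be or not to be", "to do"]  # made-up sample input

# flatMap: split every line and flatten the results into one list of words
words = [word for line in sample_lines for word in line.split(" ")]

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts of all pairs that share the same word
totals = defaultdict(int)
for word, count in pairs:
    totals[word] += count

word_counts = dict(totals)
print(word_counts)
```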

In [None]:
# Calculating words count
text_file = sc.textFile("doveconviene_info.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

We now print each word together with its number of occurrences.

In [None]:
# Printing each word with its respective count
output = counts.collect()
for (word, occurs) in output:
    print("%s: %i" % (word, occurs))
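
Once the counts are collected into a Python list, a few summary figures are easy to derive. The sketch below assumes a small made-up list of (word, count) pairs in place of the real `output`; on a large RDD, methods such as `takeOrdered` or `sortBy` would avoid collecting everything to the driver.

```python
# Made-up (word, count) pairs standing in for the collected output
sample_counts = [("csv", 1), ("spark", 4), ("python", 2)]

# Total number of word occurrences and number of distinct words
total_words = sum(count for _, count in sample_counts)
distinct_words = len(sample_counts)

# The two most frequent words, highest count first
top_two = sorted(sample_counts, key=lambda pair: pair[1], reverse=True)[:2]

print("total words: %i" % total_words)
print("distinct words: %i" % distinct_words)
for word, occurs in top_two:
    print("%s: %i" % (word, occurs))
```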

In [None]:
# Join the fields of one record into a comma-separated line (no quoting is applied)
def toCSVLine(data):
    return ','.join(str(d) for d in data)

In [None]:
# Convert each (word, count) pair into a CSV line
lines = counts.map(toCSVLine)

In [None]:
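# Write the lines to the 'output_csv' directory as part-* files; fails if the directory already exists
lines.saveAsTextFile('output_csv')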
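
Joining fields with a plain comma produces an ambiguous line whenever a word itself contains a comma or a double quote. A possible alternative, sketched here with the standard `csv` module, quotes such fields; the name `to_csv_line` and the sample records are made up for illustration and are not part of the original notebook.

```python
import csv
import io


def to_csv_line(data):
    """Format one record as a single CSV line, quoting fields when needed."""
    buffer = io.StringIO()
    # An empty lineterminator keeps the result free of a trailing newline
    csv.writer(buffer, lineterminator="").writerow(data)
    return buffer.getvalue()


plain = to_csv_line(("hello", 3))
quoted = to_csv_line(("hello,world", 2))

print(plain)
print(quoted)
```

A function like this could be passed to `counts.map(...)` in place of `toCSVLine`, since it also takes one record and returns one string.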