# INTRODUCTION TO SPARK

The core data structure in **Spark** is the *resilient distributed dataset (RDD)*. As the name suggests, an RDD is Spark's representation of a dataset that is distributed across the RAM (memory) of a cluster of machines. An RDD object is essentially a collection of elements, and those elements can be tuples, dictionaries, lists, and so on. As with a pandas DataFrame, we can load a dataset into an RDD and then call any of the methods available on that object.

# PySpark

PySpark is a toolkit that allows us to work with RDDs from Python.

In [None]:
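# Load the file into an RDD with one element per line.
# `sc` is assumed to be a SparkContext that already exists in this session.
raw_data = sc.textFile("daily_show.tsv")
raw_data.take(5)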

### Data preparation using pipelines

In [None]:
# Use map to split each tab-separated line into a list of fields
daily_show = raw_data.map(lambda line: line.split('\t'))
daily_show.take(5)
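
To make the effect of the `map` step concrete, here is a minimal sketch in plain Python (no Spark required) that applies the same lambda to a few lines. The sample lines are made up for illustration and are not taken from `daily_show.tsv`.

```python
# Hypothetical tab-separated lines standing in for the contents of the file
sample_lines = [
    "YEAR\tOccupation\tGuest",
    "1999\tactor\tGuest A",
    "2000\t\tGuest B",
]

# Same lambda as in the Spark pipeline, applied with Python's built-in map
split_lines = list(map(lambda line: line.split('\t'), sample_lines))

for fields in split_lines:
    print(fields)
```

Note that an empty field between two tabs survives as an empty string, which is why a later step can filter on `line[1] != ''`.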

In [None]:
# Count the rows (guest appearances) per year: map each row to (year, 1), then sum the 1s for each key
total = daily_show.map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
total.take(total.count())  # take is an action, so it forces the lazy PySpark pipeline to run
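
The semantics of `reduceByKey` can be sketched in plain Python as follows. This is only an illustration of what the operation computes, not how Spark implements it (Spark merges values per partition and across the cluster). The helper `reduce_by_key` and the sample pairs are hypothetical.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like RDD.reduceByKey on a single machine."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())

# Hypothetical (year, 1) pairs, as produced by the map step
pairs = [("1999", 1), ("2000", 1), ("1999", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts))
```

Because the function passed in is applied pairwise, it should be associative and commutative, as addition is here.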

In [None]:
# Remove the header row, whose first field is the string 'YEAR' rather than a number
def filter_year(line):
    # Keep every row except the header
    return line[0] != 'YEAR'

filtered_daily_show = daily_show.filter(filter_year)

In [None]:
# Drop rows whose second field is empty, lowercase that field, and count how often each value occurs
filtered_daily_show.filter(lambda line: line[1] != '') \
                   .map(lambda line: (line[1].lower(), 1)) \
                   .reduceByKey(lambda x,y: x+y) \
                   .take(5)
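
For comparison, the whole chain (drop the header, drop empty values, lowercase, count) can be sketched on a single machine with generators and `collections.Counter`. The rows and the name `occupation_counts` are hypothetical, and treating the second column as an occupation is an assumption made for this example.

```python
from collections import Counter

# Hypothetical rows that have already been split into fields
rows = [
    ["YEAR", "Occupation"],
    ["1999", "Actor"],
    ["1999", ""],
    ["2000", "actor"],
    ["2000", "Comedian"],
]

def occupation_counts(rows):
    # Generators are lazy, loosely similar to Spark transformations
    data_rows = (row for row in rows if row[0] != "YEAR")
    with_value = (row for row in data_rows if row[1] != "")
    # Counter consumes the generators, playing the role of the action
    return Counter(row[1].lower() for row in with_value)

result = occupation_counts(rows)
print(result.most_common())
```

Lowercasing before counting is what lets "Actor" and "actor" fall under the same key.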