# Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is also fault-tolerant, and its elements can be operated on in parallel.
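To make the idea of a read-only, partitioned collection more concrete, here is a minimal plain-Python sketch. It is not Spark code: `split_into_partitions` is a hypothetical helper written only for illustration, and the way Spark actually assigns records to partitions may differ. The sketch only shows that each partition can be processed independently and the partial results combined afterwards.

```python
def split_into_partitions(records, num_partitions):
    """Split a list into contiguous chunks of near-equal size (illustrative only)."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        # Tuples are immutable, mirroring the read-only nature of an RDD.
        partitions.append(tuple(records[start:end]))
        start = end
    return partitions


records = list(range(10))
partitions = split_into_partitions(records, 3)

# Each partition could be summed on a different node of the cluster...
partial_sums = [sum(partition) for partition in partitions]
# ...and the partial results combined into the final answer.
total = sum(partial_sums)
print(partitions, partial_sums, total)
```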

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark uses RDDs to make MapReduce-style operations faster and more efficient.

# Setting up SparkContext
SparkContext (aka Spark context) is the heart of a Spark application.

You can also think of a SparkContext instance as the Spark application itself.

Spark context sets up internal services and establishes a connection to a Spark execution environment.

Once a SparkContext is created, you can use it to create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped).

A Spark context is essentially a client of Spark's execution environment.

In [None]:
#import os, sys
#os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2.1"
#sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python'))
#sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python/lib/py4j-0.10.4-src.zip'))

In [None]:
import pyspark

In [None]:
#sparkConf = pyspark.SparkConf() \
#    .set("spark.executor.memory", "2560m")\
#    .set("spark.driver.memory", "2560m")\
#    .set("spark.yarn.executor.memoryOverhead", 3584)\
#    .set("spark.yarn.driver.memoryOverhead", 3584)\
#    .set("spark.python.worker.memory", "1536m")\
#    .set("spark.executor.instances", 11)\
#    .set("spark.default.parallelism", 300)

Other configuration properties are listed [here](https://spark.apache.org/docs/latest/configuration.html).

In [None]:
#sc = pyspark.SparkContext(
    #master='yarn-client',
    #appName='seminar3-rdd',
    #conf=sparkConf
#)
sc = pyspark.SparkContext()
sc

The web UI (also called the Application UI or Spark UI) is the web interface of a running Spark application. Use it to monitor and inspect Spark job executions in a web browser.

In [None]:
#port = sc.uiWebUrl.split(':')[-1]
#print('http://cluster1:{}'.format(port))

# Getting the Data Files

In this notebook, we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, which contains nearly half a million network interactions. The file is provided as a Gzip archive that we will download locally.

The KDD Cup 1999 competition dataset is described in detail [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99).

In [None]:
! wget "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz" -O "./kddcup.data_10_percent.gz"

Put the data into HDFS.

In [None]:
#! hdfs dfs -put /data/kddcup.data_10_percent.gz ./

## Creating an RDD from a File
The most common way of creating an RDD is to load it from a file. Notice that Spark's textFile can handle compressed files directly.

In [None]:
import os
data_path = 'kddcup.data_10_percent.gz'
raw_data = sc.textFile('file://' + os.path.abspath(os.curdir) + '/' + data_path)

Now we have our data file loaded into the raw_data RDD.

Without getting into Spark transformations and actions yet, the most basic way to check that the RDD contents are right is to count() the lines loaded from the file and look at a few of them.

In [None]:
raw_data.count()

In [None]:
raw_data.take(5)

Another way of creating an RDD is to parallelize an already existing list.

In [None]:
rdd_list = sc.parallelize([x + 5 for x in range(100)])
print(rdd_list.count())
rdd_list.take(5)

# RDD Basic Operations
This section introduces three basic but essential Spark operations. Two of them are the transformations map and filter. The third is the action collect. Along the way, we will also introduce the concept of persistence in Spark.

### The filter Transformation
This transformation keeps only the elements of an RDD that satisfy a certain condition. More concretely, a function is evaluated on every element of the original RDD, and the resulting RDD contains only those elements for which the function returns True.

For example, imagine we want to count how many interactions tagged 'normal.' we have in our dataset. We can filter our raw_data RDD as follows.

In [None]:
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)

In [None]:
%%time
normal_raw_data.count()
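As a rough illustration of what the filter above computes, the following plain-Python sketch applies the same predicate to a handful of lines. The sample lines are made up and shortened for readability (real records in the dataset have many more fields), and no Spark call is involved.

```python
# Hypothetical, shortened records; real rows in the dataset are much longer.
sample_lines = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,udp,private,SF,105,146,normal.",
]

# Same predicate as in the Spark cell: a substring test on the raw line.
is_normal = lambda line: 'normal.' in line

# Keep only the elements for which the predicate returns True.
normal_lines = [line for line in sample_lines if is_normal(line)]
normal_count = len(normal_lines)
print(normal_count)
```

Note that the predicate is a substring test, so it matches 'normal.' anywhere in the line, not only in the tag field.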

### The map Transformation
By using the map transformation in Spark, we can apply a function to every element in our RDD. Python's lambdas are particularly expressive for this purpose.

In this case we want to treat our data file as CSV. We can do this by applying a lambda function to each element in the RDD as follows.

In [None]:
csv_data = raw_data.map(lambda x: x.split(","))
csv_data.take(1)[0]
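The key property of map is that it is one-to-one: every input element produces exactly one output element. A small plain-Python sketch of the same split, using made-up, shortened lines rather than the real dataset, could look like this.

```python
# Hypothetical, shortened records used only for illustration.
sample_lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
]

# Same function as in the Spark cell: split each line on commas.
split_line = lambda line: line.split(",")

# One output row per input line, in the same order.
csv_rows = [split_line(line) for line in sample_lines]
print(csv_rows[0])
```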

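### The flatMap Transformation
With flatMap, each element can be mapped to zero or more new elements: the function returns a sequence for each input, and the results are flattened into a single RDD.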


In [None]:
numbers = sc.parallelize([1, 2, 3])
# Repeat each number x exactly x times; flatMap flattens the resulting lists
copies = numbers.flatMap(lambda x: [x for _ in range(x)])
copies.collect()
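To see how flatMap differs from map, it can help to compare the two on the same input. The sketch below models both in plain Python with `itertools.chain`; it mirrors the semantics only and does not call Spark.

```python
from itertools import chain

numbers = [1, 2, 3]
expand = lambda x: [x for _ in range(x)]  # repeat x exactly x times

# map keeps one output per input, so the result is a list of lists.
mapped = [expand(x) for x in numbers]

# flatMap concatenates the per-element results into one flat sequence.
flat_mapped = list(chain.from_iterable(expand(x) for x in numbers))

print(mapped)       # [[1], [2, 2], [3, 3, 3]]
print(flat_mapped)  # [1, 2, 2, 3, 3, 3]
```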

#### Using map to create PairRDD
If the elements of your RDD are tuples of length 2, you can use the \*ByKey operations on it: the first value of each tuple is treated as the key and the second as the value. Let's create such an RDD.

Of course, we can pass named functions to map, not just lambdas. Imagine we want each element in the RDD to be a key-value pair where the key is the tag (e.g. 'normal.') and the value is the whole list of elements that represents the row in the CSV file. We could proceed as follows.

In [None]:
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)

You can change the key with the standard map function. Let's say we want to aggregate the data by tag and protocol.

In [None]:
def protocol_key(x):
    tag = x[0]
    proto = x[1][1]
    return '{}_{}'.format(tag, proto), 1

type_protocol = key_csv_data.map(protocol_key)
protocols_by_type = dict(type_protocol.reduceByKey(lambda x, y: x + y).collect())
protocols_by_type
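The reduceByKey call above merges all values that share a key by repeatedly applying the given two-argument function. A minimal plain-Python model of that behavior is sketched below; `reduce_by_key` is a hypothetical helper and the sample pairs are made up, so this illustrates the semantics rather than how Spark executes it across partitions.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func (illustrative stand-in, not Spark)."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged in.
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical (key, 1) pairs like those produced by protocol_key.
pairs = [
    ('normal._tcp', 1),
    ('smurf._icmp', 1),
    ('normal._tcp', 1),
    ('normal._udp', 1),
]

counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```

Because the merge function is applied pairwise in an unspecified order, it should be associative and commutative, as addition is here.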

Another way to achieve this is to use groupByKey. In this case, the second value of each resulting tuple is an iterable containing all the values for the corresponding key.

In [None]:
grouped_by = key_csv_data.groupByKey()
grouped_by.take(2)
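In contrast to reduceByKey, groupByKey does not combine values; it only collects them per key. The plain-Python sketch below models that with a `defaultdict`. The helper `group_by_key` and the shortened sample rows are assumptions made for illustration, not part of the Spark API.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key (illustrative stand-in, not Spark)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical (tag, row) pairs with shortened rows.
pairs = [
    ('normal.', ['0', 'tcp']),
    ('smurf.', ['0', 'icmp']),
    ('normal.', ['0', 'udp']),
]

grouped = group_by_key(pairs)
print(grouped['normal.'])
```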

We can then map the grouped values to the desired statistic. Write a function that gives the same results as above.

In [None]:
def protocol_counter(values):
    # Task 1
    pass

assert protocol_counter([(0, 'udp'), (0, 'udp'), (0, 'tcp')])['udp'] == 2

In [None]:
protocols_by_type2 = dict(grouped_by.mapValues(protocol_counter).collect())
assert protocols_by_type2['normal.']['udp'] == protocols_by_type['normal._udp']

# Task 2: Word Count in Spark

In [None]:
texts = ['Apache Spark has as its architectural foundation the resilient distributed dataset RDD a read only multiset of data items distributed over a cluster of machines that is maintained in a fault tolerant way 2 The Dataframe API was released as an abstraction on top of the RDD followed by the Dataset API In Spark 1 x the RDD was the primary application programming interface API but as of Spark 2 x use of the Dataset API is encouraged 3 even though the RDD API is not deprecated 4 5 The RDD technology still underlies the Dataset API 6 7 ',
 'Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm which forces a particular linear dataflow structure on distributed programs MapReduce programs read input data from disk map a function across the data reduce the results of the map and store reduction results on disk Spark s RDDs function as a working set for distributed programs that offers a deliberately restricted form of distributed shared memory 8 ',
 'Spark facilitates the implementation of both iterative algorithms which visit their data set multiple times in a loop and interactive exploratory data analysis i e the repeated database style querying of data The latency of such applications may be reduced by several orders of magnitude compared to Apache Hadoop MapReduce implementation 2 9 Among the class of iterative algorithms are the training algorithms for machine learning systems which formed the initial impetus for developing Apache Spark 10 ',
 'Apache Spark requires a cluster manager and a distributed storage system For cluster management Spark supports standalone native Spark cluster Hadoop YARN or Apache Mesos 11 For distributed storage Spark can interface with a wide variety including Alluxio Hadoop Distributed File System HDFS 12 MapR File System MapR FS 13 Cassandra 14 OpenStack Swift Amazon S3 Kudu or a custom solution can be implemented Spark also supports a pseudo distributed local mode usually used only for development or testing purposes where distributed storage is not required and the local file system can be used instead in such a scenario Spark is run on a single machine with one executor per CPU core ',
 'Spark Core',
 'Spark Core is the foundation of the overall project It provides distributed task dispatching scheduling and basic I O functionalities exposed through an application programming interface for Java Python Scala and R centered on the RDD abstraction the Java API is available for other JVM languages but is also usable for some other non JVM languages such as Julia 15 that can connect to the JVM This interface mirrors a functional higher order model of programming a driver program invokes parallel operations such as map filter or reduce on an RDD by passing a function to Spark which then schedules the function s execution in parallel on the cluster 2 These operations and additional ones such as joins take RDDs as input and produce new RDDs RDDs are immutable and their operations are lazy fault tolerance is achieved by keeping track of the lineage of each RDD the sequence of operations that produced it so that it can be reconstructed in the case of data loss RDDs can contain any type of Python Java or Scala objects ',
 'Besides the RDD oriented functional style of programming Spark provides two restricted forms of shared variables broadcast variables reference read only data that needs to be available on all nodes while accumulators can be used to program reductions in an imperative style 2 ',
 'A typical example of RDD centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones Each map flatMap a variant of map and reduceByKey takes an anonymous function that performs a simple operation on a single data item or a pair of items and applies its argument to transform an RDD into a new RDD '
]

text_rdd = sc.parallelize(texts)

In [None]:
word_count = ...

In [None]:
assert dict(word_count.collect())['spark'] == 17