# RDD creation

**NOTE: This notebook is worth 10% of the grade of project 2.**

#### [Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)

In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the **Resilient Distributed Dataset** or **RDD**. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

#### References

The reference book for these and other Spark-related topics is *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.  

The KDD Cup 1999 competition dataset is described in detail [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99).

Fill in the following cells where a *TODO* is given. Then run each cell by pressing SHIFT + ENTER. The results are printed below each cell.

## Getting the data files  

In this notebook we will use the reduced dataset (1 percent) provided for the KDD Cup 1999, containing roughly 50,000 network interactions. The file is provided as a *Gzip* file in the local directory.  

## Creating an RDD from a file  

The most common way of creating an RDD is to load it from a file. Notice that Spark's `textFile` can handle compressed files directly.    

In [1]:
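import os
from pyspark import SparkContext

# Build a file:// URI for the compressed dataset, one directory above the working directory.
data_file = "file://" + os.getcwd() + "/../kddcup.data_1_percent.gz"
# Reuse the running SparkContext if one exists; otherwise start a new one.
sc = SparkContext.getOrCreate()
# textFile returns an RDD with one element per line of the file.
raw_data = sc.textFile(data_file)
print(type(raw_data))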


<class 'pyspark.rdd.RDD'>


Now we have our data file loaded into the `raw_data` RDD.

Without getting into Spark *transformations* and *actions*, the most basic thing we can do to check that we got our RDD contents right is to `count()` the number of lines loaded from the file into the RDD.  

In [2]:


# This step may take a while.
# TODO: store the count of 'raw_data' into a variable called 'count'
count = raw_data.count()
print("raw_data has " + str(count) + " entries.")



raw_data has 49402 entries.
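
As an independent sanity check on a line count like the one above, the same kind of file can be read with Python's standard `gzip` module, without Spark. The sketch below is only illustrative: `count_lines_gz` and the tiny `sample.gz` file are hypothetical stand-ins created for the example, not part of the KDD Cup data. Pointing the helper at the real dataset path should give a number comparable to `raw_data.count()`, assuming every record ends with a newline.

```python
import gzip
import os
import tempfile


def count_lines_gz(path):
    """Count the lines of a gzip-compressed text file without using Spark."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


# Build a tiny stand-in file so the sketch runs anywhere.
# These three records are made up; they are NOT KDD Cup data.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.gz")
with gzip.open(sample_path, "wt") as handle:
    handle.write("a,1\nb,2\nc,3\n")

sample_count = count_lines_gz(sample_path)
print("sample file has " + str(sample_count) + " entries.")
```

Reading the file this way happens on a single machine, so it is only practical for small files such as the 1 percent sample.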


In the following notebooks, we will use this raw data to learn about the different Spark transformations and actions.  

## Creating an RDD using `parallelize`

Another way of creating an RDD is to parallelize an already existing list.  

In [3]:
a = range(100)
# TODO: Parallelize 'a' into an RDD called 'data'. Hint: use sc.parallelize(<list>)
list_ = list(a)
data = sc.parallelize(list_)
print(type(data))



<class 'pyspark.rdd.RDD'>


We can access the first few elements of our RDD with `take`.  

In [6]:
data.take(5)

[0, 1, 2, 3, 4]