# RDD creation

In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the **Resilient Distributed Dataset** or **RDD**. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

### References

The reference book for these and other Spark-related topics is *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

The KDD Cup 2010 competition dataset is described in detail [here](https://pslcdatashop.web.cmu.edu/KDDCup/).

## Getting the data files  

In this notebook we will use the dataset provided for the KDD Cup 2010. The task is to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. It is technically challenging, has practical importance, and is of scientific interest. The data is distributed as a *zip* archive that we download and extract locally.
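
The archive has to be unpacked before Spark can read the text file inside it. A minimal sketch of that step with Python's standard `zipfile` module is shown here. The helper name `extract_archive` and the archive name `algebra_2005_2006.zip` are assumptions made for illustration; substitute the name of the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Unpack a zip archive into dest_dir and return the archived member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Hypothetical archive name; adjust it to match the downloaded file.
    archive_path = Path("algebra_2005_2006.zip")
    if archive_path.exists():
        for name in extract_archive(archive_path):
            print(name)
```

Extracting into the notebook's working directory keeps relative paths to the data file short.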

## Creating an RDD from a file

The most common way of creating an RDD is to load it from a file.

In [None]:
import findspark
findspark.init()

In [None]:
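import pyspark

# The SparkContext is the entry point for creating RDDs.
sc = pyspark.SparkContext(appName="KDDCup10")

# Path to the extracted training file, relative to the working directory.
data_file = "algebra_2005_2006/algebra_2005_2006_train.txt"
# textFile creates an RDD with one element per line of the file.
raw_data = sc.textFile(data_file)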

Now we have our data file loaded into the `raw_data` RDD.

Without getting into Spark *transformations* and *actions* yet, the simplest way to check that the RDD contents are right is to `count()` the number of lines loaded from the file.

In [None]:
raw_data.count()

We can also check the first few entries in our data.  

In [None]:
raw_data.take(5)

In the following notebooks, we will use this raw data to learn about the different Spark transformations and actions.  

## Creating an RDD using `parallelize`

Another way of creating an RDD is to parallelize an existing collection, such as a list.

In [None]:
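# Build an explicit list (in Python 3, range() alone returns a lazy range object).
a = list(range(100))

# Distribute the list's elements as an RDD.
data = sc.parallelize(a)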
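
`parallelize` also accepts an optional second argument, `numSlices`, which sets how many partitions the collection is split into. To build some intuition for what a partition is, the pure-Python sketch below cuts a sequence into contiguous chunks. The helper `split_into_slices` is hypothetical and only imitates the idea. It is not Spark's implementation, and Spark's actual slice boundaries may differ.

```python
def split_into_slices(items, num_slices):
    """Cut items into num_slices contiguous chunks of near-equal size."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    items = list(items)
    n = len(items)
    # Chunk i covers indices [i * n // num_slices, (i + 1) * n // num_slices).
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


slices = split_into_slices(range(100), 4)
print([len(chunk) for chunk in slices])  # [25, 25, 25, 25]
print(slices[1][:3])                     # [25, 26, 27]
```

In PySpark itself the corresponding call would be `sc.parallelize(a, 4)`, and `getNumPartitions()` on the resulting RDD reports how many partitions were created.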

As we did before, we can `count()` the number of elements in the RDD.

In [None]:
data.count()

We can also access the first few elements of the RDD with `take()`.

In [None]:
data.take(5)