### Loading data into an RDD

Let's see what it takes to read a small file that resides on a web site into an RDD.

#### First, download the URL to a local file

In [1]:
%%time
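import os
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
if not os.path.exists("kddcup.data.gz"):
    # Download only once; later runs reuse the local copy.
    urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")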
    
!ls -l kdd*

-rw-rw-r-- 1 501 dialout  2144903 Jun  6 23:58 kddcup.data_10_percent.gz
-rw-rw-r-- 1 501 dialout 18115902 Jun  7 05:12 kddcup.data.gz
CPU times: user 144 ms, sys: 492 ms, total: 636 ms
Wall time: 11.9 s


#### Second, use `textFile` to open the file
Note that this takes very little time because nothing is actually read yet.

In [2]:
%%time
data_file = "./kddcup.data.gz"
raw_data = sc.textFile(data_file)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 821 ms


In [3]:
%%time
raw_data

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.9 µs


MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2

#### Third, define the parser that will break each line into fields
Still, nothing is actually computed; `map` only adds a step to the execution plan.

In [4]:
%%time 
csv_data = raw_data.map(lambda x: x.split(","))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 145 µs
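
To make the parser concrete, here is a minimal plain-Python sketch of what the function passed to `map` does to a single line. The sample record is shortened and purely illustrative; real lines in the file have many more fields.

```python
# Illustrative, shortened record (an assumption, not copied from the file).
sample_line = "0,tcp,http,SF,215,45076,normal."

# The same function that is passed to raw_data.map in the cell above.
parse = lambda x: x.split(",")

fields = parse(sample_line)
print(len(fields))      # number of fields in this record
print(fields[1])        # second field
print(type(fields[4]))  # split returns strings, even for numeric fields
```

Every field comes back as a string, so numeric columns would still need an explicit conversion such as `int` or `float` in a later step.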


#### Finally, Spark has to do something
Counting the number of records requires issuing tasks to the workers. The file is actually read (but not cached) and the number of lines is computed.

In [5]:
%%time
raw_data.count()

CPU times: user 0 ns, sys: 24 ms, total: 24 ms
Wall time: 12.3 s


4898431
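
The pattern seen in the timings above (instant transformations, a slow action) can be imitated in plain Python. The sketch below is only an analogy and is not how Spark is implemented; `LazyLines`, `read_source`, and the three sample lines are hypothetical names and data introduced for illustration.

```python
class LazyLines(object):
    """Toy stand-in for an RDD: records steps and runs them only on an action."""

    def __init__(self, source, steps=()):
        self.source = source        # zero-argument callable that yields lines
        self.steps = tuple(steps)   # the recorded "execution plan"

    def map(self, fn):
        # Transformation: returns a new object with a longer plan; no work is done.
        return LazyLines(self.source, self.steps + (fn,))

    def _run(self):
        # Read the source and apply every recorded step to each item.
        for item in self.source():
            for fn in self.steps:
                item = fn(item)
            yield item

    def count(self):
        # Action: only now is the source actually read.
        return sum(1 for _ in self._run())


reads = []  # grows by one entry each time the source is really read

def read_source():
    reads.append("read")
    return iter(["0,tcp,http", "0,udp,domain_u", "1,tcp,smtp"])

raw = LazyLines(read_source)
parsed = raw.map(lambda x: x.split(","))
reads_before = len(reads)        # no reads yet: map only extended the plan

n = parsed.count()               # first action: the source is read once
reads_after_first = len(reads)

parsed.count()                   # nothing was remembered, so it is read again
reads_after_second = len(reads)

print(reads_before, n, reads_after_first, reads_after_second)
```

The second `count` reads the source again, which mirrors the "read but not remembered" behavior described above.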

## S3
The two central parts of the Amazon cloud (AWS) are:
* **EC2**: compute nodes
* **S3**: storage

The homework server is a set of EC2 nodes, which consist of both physical and virtual computers.

**S3** is a distributed file system. It is organized in *buckets*, each of which contains files and directories. Bucket names have to be unique across all of AWS.

One terabyte of storage on S3 costs about \$30 per month. There is also a long-term backup service, called **Glacier**, in which one terabyte costs about \$7 per month (March 2016 prices).
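
As a rough worked example of these prices, the sketch below estimates the monthly bill for a hypothetical 5-terabyte data set. It assumes a flat per-terabyte price equal to the approximate figures quoted above and ignores any other charges, so treat the result as a back-of-the-envelope estimate only.

```python
# Approximate prices per terabyte per month, taken from the text (March 2016).
S3_PRICE_PER_TB = 30.0
GLACIER_PRICE_PER_TB = 7.0

def monthly_cost(size_tb, price_per_tb):
    """Return the approximate monthly cost in dollars for size_tb terabytes."""
    return size_tb * price_per_tb

size_tb = 5  # hypothetical data set size, chosen only for illustration
s3_cost = monthly_cost(size_tb, S3_PRICE_PER_TB)
glacier_cost = monthly_cost(size_tb, GLACIER_PRICE_PER_TB)
print("S3: $%.2f per month, Glacier: $%.2f per month" % (s3_cost, glacier_cost))
```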

In order to read a file from **S3** on an **EC2** instance, you first need to mount it on the instance's file system. To do that, you need *AWS credentials*. In the setup we use in this class, all of this is handled for you. When you want to use AWS yourself, you will need to figure this out.