# Downloading data and starting with SparkR

[**Introduction to Apache Spark with R by J. A. Dianes**](https://github.com/jadianes/spark-r-notebooks)

In this notebook we will download the 2013 American Community Survey dataset and start up a [SparkR](http://spark.apache.org/docs/latest/sparkr.html) cluster. Both steps are required to run the rest of the notebooks. Once the files are downloaded they are stored locally, so we won't need to download them again. However, we will need to initialise the cluster in each notebook in order to use it.  

## Getting and reading data

Let's first download the data files using R as follows.

In [1]:
population_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip'
housing_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip'

In [2]:
library(RCurl)

population_data_file <- getBinaryURL(population_data_files_url)

Loading required package: bitops


In [3]:
housing_data_file <- getBinaryURL(housing_data_files_url)

Now we want to persist the files, so we don't need to download them again in further notebooks.  

In [4]:
population_data_file_path <- '/nfs/data/2013-acs/csv_pus.zip'
population_data_file_local <- file(population_data_file_path, open = "wb")
writeBin(population_data_file, population_data_file_local)
close(population_data_file_local)

In [5]:
housing_data_file_path <- '/nfs/data/2013-acs/csv_hus.zip'
housing_data_file_local <- file(housing_data_file_path, open = "wb")
writeBin(housing_data_file, housing_data_file_local)
close(housing_data_file_local)
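
As an alternative to holding each archive in memory with `getBinaryURL`, the download and the write to disk can be combined and skipped when the file is already there. The following is a minimal sketch using base R's `download.file`; the helper name `download_if_missing` is hypothetical and not part of the original notebooks.

```r
# Hypothetical helper (not part of the original notebooks): download `url`
# to `dest_path` only when that file does not exist yet.
download_if_missing <- function(url, dest_path) {
  if (file.exists(dest_path)) {
    return(invisible(FALSE))  # already on disk, nothing to do
  }
  # Create the target folder if it is not there yet
  dir.create(dirname(dest_path), recursive = TRUE, showWarnings = FALSE)
  # mode = "wb" writes the zip archive as binary on every platform
  download.file(url, destfile = dest_path, mode = "wb")
  invisible(TRUE)
}
```

A call such as `download_if_missing(population_data_files_url, population_data_file_path)` could then be repeated at the top of any notebook without fetching the archive twice.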

From the previous steps we got two zip files, `csv_pus.zip` and `csv_hus.zip`. We can now unzip them.

In [6]:
data_file_path <- '/nfs/data/2013-acs'
unzip(population_data_file_path, exdir=data_file_path)

In [7]:
unzip(housing_data_file_path, exdir=data_file_path)

Once you unzip the contents of both files you will see up to six files. Each zip contains three files: a PDF explanatory document and two data files in `csv` format. Each housing/population data set is divided into two pieces, "a" and "b" (where "a" contains states 1 to 25 and "b" contains states 26 to 50). Therefore:  

- `ss13husa.csv`: housing data for states from 1 to 25.  
- `ss13husb.csv`: housing data for states from 26 to 50.  
- `ss13pusa.csv`: population data for states from 1 to 25.  
- `ss13pusb.csv`: population data for states from 26 to 50.  

We will work with these four files in our notebooks.  
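
A quick check that the four data files ended up in the data folder can save confusion in later notebooks. This is a small sketch in base R; the helper name `missing_data_files` is an assumption made for illustration, and the folder in the commented example is the one used in this notebook.

```r
# The four data files named in the list above
expected_files <- c('ss13husa.csv', 'ss13husb.csv',
                    'ss13pusa.csv', 'ss13pusb.csv')

# Hypothetical helper: return the expected files that are NOT found in `dir`
missing_data_files <- function(dir, files = expected_files) {
  files[!file.exists(file.path(dir, files))]
}

# An empty character vector means all four files are present:
# missing_data_files('/nfs/data/2013-acs')
```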

## Starting up a SparkR cluster

In further notebooks, we will explore our data by loading them into SparkSQL data frames. But first we need to init a SparkR cluster and use it to init a SparkSQL context.  

The first thing we need to do is to set up some environment variables and library paths as follows. Remember to replace the value assigned to `SPARK_HOME` with your Spark home folder.  

In [8]:
Sys.setenv(SPARK_HOME='/home/cluster/spark-1.5.0-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
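
If loading `SparkR` fails because the package cannot be found, a wrong `SPARK_HOME` is a likely cause. Below is a minimal sketch of a sanity check, assuming the SparkR package sits under `R/lib/SparkR` inside the Spark home folder (the same `R/lib` path added to `.libPaths()` above). The function name `sparkr_lib_found` is hypothetical.

```r
# Hypothetical helper: TRUE when the SparkR package folder exists under
# the given Spark home (assumed layout: <SPARK_HOME>/R/lib/SparkR)
sparkr_lib_found <- function(spark_home = Sys.getenv('SPARK_HOME')) {
  nzchar(spark_home) &&
    file.exists(file.path(spark_home, 'R', 'lib', 'SparkR'))
}

# Example: stop early with a readable message instead of a failed library() call
# if (!sparkr_lib_found()) stop('SparkR not found, check SPARK_HOME')
```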

Now we can load the `SparkR` library as follows.

In [9]:
library(SparkR)


Attaching package: ‘SparkR’

The following object is masked from ‘package:RCurl’:

    base64

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform



And now we can initialise the Spark context as follows. In our case we are using a standalone Spark cluster with one master and seven workers. If you are running Spark in local mode, just use `master='local'`.

In [10]:
sc <- sparkR.init(master='spark://169.254.206.2:7077')

Launching java with spark-submit command /home/cluster/spark-1.5.0-bin-hadoop2.6/bin/spark-submit   sparkr-shell /tmp/RtmpPm0py4/backend_port29c24c141b34 
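
Since the master URL differs between a cluster and a single machine, one option is to read it from an environment variable and fall back to local mode. This is only a sketch: the variable name `SPARK_MASTER_URL` and the helper `spark_master` are assumptions for illustration, not SparkR settings.

```r
# Hypothetical helper: use the SPARK_MASTER_URL environment variable
# (an assumed name) when it is set, otherwise fall back to local mode
spark_master <- function(default = 'local') {
  url <- Sys.getenv('SPARK_MASTER_URL')
  if (nzchar(url)) url else default
}

# With SparkR loaded, the context could then be created as:
# sc <- sparkR.init(master = spark_master())
```

This keeps the notebook code identical on both setups; only the environment changes.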


And finally we can start the SparkSQL context as follows.

In [11]:
sqlContext <- sparkRSQL.init(sc)

And that's it. Once we get to this point, we are ready to load data into SparkSQL data frames. We will do this in the next notebook.