<DIV ALIGN=CENTER>

# Introduction to Spark
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

Previously in this course, we have discussed doing data science at the
Unix command line, and with Python, primarily by using Pandas. We also
have discussed other Python libraries that bring new functionalities to
the Python data science stack. Other _big data_ technologies, however,
also exist and can be relevant to particular data science
investigations, depending on the scale of data. Of these other
technologies, one of the most promising is [**Spark**][sp].

Spark is a cluster computing system that leverages [Hadoop][sh]
technologies like [HDFS][shdfs] for high performance storage and
[YARN][sy] for cluster management. While some may see Spark as a
replacement for Hadoop, an alternative argument can be made that [Spark
is simply another compute engine][sce] for Hadoop, in addition to
MapReduce.

In this IPython Notebook, we explore using Spark to perform data
processing in a similar manner to our previous efforts with Pandas. For
this we will use the airline data, which has been stored in an HDFS
system that is accessible from within our Spark cluster. [Other][dw]
tutorials exist, although they often focus on Scala examples since Spark
is written in that language.

-----
[sp]: http://spark.apache.org
[sh]: http://hadoop.apache.org
[sy]: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[shdfs]: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[sce]: http://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/
[dw]: https://github.com/deanwampler/spark-workshop/tree/master/tutorial

### Initialization

In this class, we have a dedicated Spark cluster running to allow
students to explore Spark from within our IPython Notebook environment.
Since our Spark cluster has limited resources, we need to manage them
carefully. In particular, we need to ensure that any SparkContext
previously used by this Jupyter Server is properly released before
starting a new one. After this, we will initialize a new SparkContext so
that this dockerized IPython Notebook can interact with the Spark
cluster.

----- 

In [3]:
# We release the SparkContext if it exists.
try:
    sc
except NameError:
    # No SparkContext has been defined yet, so there is nothing to release
    pass
else:
    sc.stop()

# Now handle initial import statements
from pyspark import SparkConf, SparkContext

# Create new Spark Configuration (port numbers might need to be adjusted from defaults.)
myconf = SparkConf()
myconf.setMaster('local[*]')
myconf.setAppName("INFO490 SP16 W14-NB3: Professor Brunner")
myconf.set('spark.executor.memory', '1g')

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 1.6.0


-----

### Data Processing

Previously in this Notebook, we have used Spark to create simple RDDs
that demonstrated Spark transformations and actions on small data. Now
we will change approaches and analyze the airline data, first starting
with the single 2001 flight data file. We can create a new RDD by
reading in the data as a text file. Because Spark transformations are
evaluated lazily, nothing is actually read or parsed until we call an
action, so we trigger the RDD creation by counting the number of rows in
the RDD. We subsequently display the first few rows of data by using the
`take` method. Finally, we can use the built-in `help` to see the list
of supported RDD methods.

-----



In [4]:
filename = '/home/data_scientist/data/2001/2001-1.csv'

text_file = sc.textFile(filename)

col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

cols = col_data.filter(lambda line: 'NA' not in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

In [5]:
# Should be 480106 if everything works correctly
fields.count()

480106

In [6]:
fields.take(5)

[(2001, 1, 17, 1806, -3, -4, 'BWI', 'CLT', 361),
 (2001, 1, 18, 1805, 4, -5, 'BWI', 'CLT', 361),
 (2001, 1, 19, 1821, 23, 11, 'BWI', 'CLT', 361),
 (2001, 1, 20, 1807, 10, -3, 'BWI', 'CLT', 361),
 (2001, 1, 21, 1810, 20, 0, 'BWI', 'CLT', 361)]
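
To make the transformation chain above easier to follow, the following is a minimal plain-Python sketch (no Spark required) that applies the same three steps to one line at a time: select nine columns by position, drop the header and any row with an `NA` field, and convert the numeric fields to integers. The helper names `parse_line` and `make_line` and the sample lines are hypothetical and only serve as an illustration; columns the chain never reads are filled with `x`.

```python
# Plain-Python sketch of the map/filter chain, applied to single lines.
# parse_line and make_line are hypothetical helpers, not part of Spark.

def parse_line(line):
    """Return the typed 9-field tuple for one CSV line, or None if it is dropped."""
    p = line.split(",")
    # Keep the same nine columns, by position, as the second map step
    row = (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])
    # Tuple membership compares whole fields, so this drops the header row
    # and any row in which a selected field is exactly 'NA'
    if 'Year' in row or 'NA' in row:
        return None
    # Convert every field except the two airport codes (positions 6, 7) to int
    return tuple(v if i in (6, 7) else int(v) for i, v in enumerate(row))


def make_line(values):
    """Build a made-up 19-column CSV line from a {column index: value} dict."""
    return ",".join(str(values.get(i, "x")) for i in range(19))


# Made-up sample lines: a complete row, a row with missing values, a header
good = make_line({0: 2001, 1: 1, 2: 17, 4: 1806, 14: -3, 15: -4,
                  16: "BWI", 17: "CLT", 18: 361})
missing = make_line({0: 2001, 1: 1, 2: 17, 4: "NA", 14: "NA", 15: "NA",
                     16: "BWI", 17: "CLT", 18: 361})
header = make_line({0: "Year"})

for line in (good, missing, header):
    print(parse_line(line))
```

Note that `'NA' not in line` is evaluated on a tuple, so it tests whether any whole field equals `'NA'`, not whether the characters `NA` appear somewhere in the text.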

-----
### Student Activity

In the preceding cells, we introduced Spark DataFrames, Spark SQL, and Basic Statistics with Spark. Now that you have run the Notebook, go back and make the
following changes to see how the results change.

1. Change the DataFrame to ...

2. New SQL query

3. Compute Statistics for Poisson. Now switch to lognormal and calculate statistics.
`Numbers`.

4. Add an index column to this Spark DataFrame, which sequentially
increases.

Additional, more advanced problems:

1. Create a DataFrame containing the 'Year', 'Month', 'DayofMonth', 'dDelay',
and 'Origin' columns for the airline data.

2. Filter this DataFrame to contain only flight data for flights leaving Willard airport.

-----


### Ending the Spark Session

We must stop the `SparkContext` in order to release resources on the
instructional cluster before exiting this Notebook.

-----

In [None]:
sc.stop()

-----

## Spark Statistics


-----

In [7]:
from pyspark.mllib.stat import Statistics

sdt = fields.map(lambda p: (p[2], p[3], p[4], p[5], p[8]))

summary = Statistics.colStats(sdt)

mus = summary.mean()
mns = summary.min()
mxs = summary.max()

vrs = summary.variance()
nnzs = summary.numNonzeros()

In [8]:
cols = ['Day', 'Dep. Time', 'Arr. Delay', 'Dep. Delay', 'Distance']

# Print out Header
print('{0:>20s}{1:>12s}{2:>8s}{3:>10s}{4:>12s}'\
      .format('Mean', 'Variance', 'Min', 'Max', 'Non Zeroes'))
print(65*'-')

# Print out summary statistics
for idx, (m, v, mn, mx, n) in enumerate(zip(mus, vrs, mns, mxs, nnzs)):
    print('{5:10s}{0:10.2f}{1:12.2f}{2:8.2f}{3:10.2f}{4:12d}'\
          .format(m, v, mn, mx, int(n), cols[idx]))

                Mean    Variance     Min       Max  Non Zeroes
-----------------------------------------------------------------
Day            16.01       79.87    1.00     31.00      480106
Dep. Time    1359.66   237399.85    1.00   2400.00      480106
Arr. Delay      6.38      964.02  -80.00   1688.00      461157
Dep. Delay      8.78      782.11  -59.00   1692.00      393503
Distance      716.99   323369.33   21.00   4962.00      480106
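
As a rough cross-check of what each column of this table means, the sketch below computes the same five quantities in plain Python for the arrival and departure delays of the five rows displayed earlier by `take(5)`. It assumes that the variance reported by `colStats` is the sample variance (dividing by n - 1), which is our reading of the MLlib documentation; `column_summary` is a hypothetical helper name.

```python
from statistics import mean, variance

# (arrival delay, departure delay) from the five rows shown by take(5)
rows = [(-3, -4), (4, -5), (23, 11), (10, -3), (20, 0)]


def column_summary(rows):
    """Return (mean, sample variance, min, max, non-zero count) per column."""
    summaries = []
    for col in zip(*rows):  # transpose the rows into columns
        non_zero = sum(1 for v in col if v != 0)
        summaries.append((mean(col), variance(col), min(col), max(col), non_zero))
    return summaries


names = ['Arr. Delay', 'Dep. Delay']
for name, s in zip(names, column_summary(rows)):
    print('{0:10s}{1:10.2f}{2:12.2f}{3:8.2f}{4:10.2f}{5:12d}'.format(name, *s))
```

The non-zero count explains why the last column of the table is smaller than the row count for the two delay columns: flights with a delay of exactly zero minutes are not counted.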


-----

### Correlations


-----

In [9]:
# Demonstrate Correlation Measurements

# Sample Data
x = sc.parallelize([0, 1, 2])
y = sc.parallelize([1, 2, 4])
z = sc.parallelize([2, 1, 0])

print('x = ', x.collect())
print('y = ', y.collect())
print('z = ', z.collect())

print('\nPearson Correlation Tests')
print(25*'-')
print('x corr x = {0:+5.3f}'\
      .format(Statistics.corr(x, x, method='pearson')))

print('x corr y = {0:+5.3f}'\
      .format(Statistics.corr(x, y, method='pearson')))

print('x corr z = {0:+5.3f}'\
      .format(Statistics.corr(x, z, method='pearson')))

x =  [0, 1, 2]
y =  [1, 2, 4]
z =  [2, 1, 0]

Pearson Correlation Tests
-------------------------
x corr x = +1.000
x corr y = +0.982
x corr z = -1.000
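
The Pearson coefficient is the covariance of the two samples divided by the product of their standard deviations. The plain-Python sketch below (the function name `pearson` is our own, not a Spark call) applies that definition to the same three small samples and should agree with the values printed above.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sums of products of deviations; the common 1/n factors cancel in the ratio
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


x, y, z = [0, 1, 2], [1, 2, 4], [2, 1, 0]

for label, other in (('x', x), ('y', y), ('z', z)):
    print('x corr {0} = {1:+5.3f}'.format(label, pearson(x, other)))
```

Because `y` grows faster than linearly with `x`, the coefficient is slightly below one, while `z` is an exact decreasing linear function of `x` and gives minus one.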


In [10]:
# Set print precision of matrices
import numpy as np
np.set_printoptions(precision=3)

# Compute correlation of three columns in RDD
cd = sdt.map(lambda p: (p[1], p[2], p[3]))

print('Departure Time, Arrival Delay, Departure Delay')

print('\nPearson Correlation Matrix:')
print(Statistics.corr(cd, method='pearson'))

print('\nSpearman Correlation Matrix:')
print(Statistics.corr(cd, method='spearman'))

Departure Time, Arrival Delay, Departure Delay

Pearson Correlation Matrix:
[[ 1.     0.134  0.167]
 [ 0.134  1.     0.904]
 [ 0.167  0.904  1.   ]]

Spearman Correlation Matrix:
[[ 1.     0.109  0.173]
 [ 0.109  1.     0.616]
 [ 0.173  0.616  1.   ]]
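
The two matrices differ because Spearman correlation is the Pearson correlation of the ranks of the data rather than of the values themselves, which makes it sensitive to any monotonic relationship and less sensitive to extreme delays. The sketch below illustrates this in plain Python; the helper names are hypothetical, and we assume tied values share the average of their ranks, which is the usual convention.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


x, y = [0, 1, 2], [1, 2, 4]
print('Pearson  = {0:+5.3f}'.format(pearson(x, y)))
print('Spearman = {0:+5.3f}'.format(spearman(x, y)))
print('Ranks with a tie:', average_ranks([10, 20, 20, 30]))
```

For the small `x` and `y` samples used earlier, the ranks are identical, so the Spearman coefficient is exactly one even though the Pearson coefficient is not.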


-----

### Random Data and Sampling

-----

In [11]:
from pyspark.mllib.random import RandomRDDs

ud = RandomRDDs.uniformRDD(sc, 1000, seed=23)

nd = RandomRDDs.normalRDD(sc, 1000, seed=23)

pd = RandomRDDs.poissonRDD(sc, mean=2.0, size=1000, seed=23)

In [12]:
print('Uniform Distribution Statistics\n', ud.stats())

Uniform Distribution Statistics
 (count: 1000, mean: 0.495907509202282, stdev: 0.298581265498, max: 0.99957542053, min: 0.000220626980565)


In [13]:
print('Normal Distribution Statistics\n', nd.stats())

Normal Distribution Statistics
 (count: 1000, mean: -0.01951879687296531, stdev: 0.936332160006, max: 2.76048478382, min: -3.10768336984)


In [14]:
print('Poisson Distribution Statistics\n', pd.stats())

Poisson Distribution Statistics
 (count: 1000, mean: 2.0089999999999995, stdev: 1.45771019068, max: 9.0, min: 0.0)
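
A quick sanity check is to compare these sample statistics with the theoretical values for each distribution: a uniform distribution on the unit interval has mean 1/2 and standard deviation sqrt(1/12), a standard normal has mean 0 and standard deviation 1, and a Poisson distribution with mean 2 has standard deviation sqrt(2). The sketch below does this comparison in plain Python, with the observed values copied (rounded) from the three `stats()` outputs above; the dictionary names are our own.

```python
from math import sqrt

# Theoretical (mean, standard deviation) for each generator used above
expected = {
    'Uniform': (0.5, sqrt(1.0 / 12.0)),  # U(0, 1)
    'Normal': (0.0, 1.0),                # standard normal
    'Poisson': (2.0, sqrt(2.0)),         # Poisson with mean 2.0
}

# Observed (mean, stdev), rounded from the stats() outputs shown above
observed = {
    'Uniform': (0.4959, 0.2986),
    'Normal': (-0.0195, 0.9363),
    'Poisson': (2.0090, 1.4577),
}

print('{0:10s}{1:>12s}{2:>12s}{3:>12s}{4:>12s}'.format(
    '', 'Mean (exp)', 'Mean (obs)', 'Std (exp)', 'Std (obs)'))
for name in ('Uniform', 'Normal', 'Poisson'):
    (mu, sigma), (m, s) = expected[name], observed[name]
    print('{0:10s}{1:12.4f}{2:12.4f}{3:12.4f}{4:12.4f}'.format(
        name, mu, m, sigma, s))
```

With only 1000 samples per distribution, differences of a few percent between the observed and theoretical values are expected.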


In [15]:
# Sample without replacement

frac = 0.25

ds = nd.sample(False, frac)
print(ds.stats())

(count: 262, mean: -0.07762383584203582, stdev: 0.953321948519, max: 2.10552321418, min: -2.4877326251)


In [16]:
# Sample with replacement
ds = nd.sample(True, frac)
print(ds.stats())

(count: 272, mean: -0.017971298189434968, stdev: 1.01272571273, max: 2.70606620358, min: -2.98699042684)
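
Neither sample contains exactly 250 values, even though we asked for a fraction of 0.25 of 1000 values. To our understanding, this is because `sample` treats the fraction as a per-element probability rather than an exact size: without replacement each element is kept independently with that probability (Bernoulli sampling), and with replacement each element is repeated a Poisson-distributed number of times. The plain-Python sketch below mimics both strategies under that assumption; the function names are hypothetical and this is not Spark's actual implementation.

```python
import math
import random


def sample_without_replacement(items, fraction, rng):
    """Bernoulli sampling: keep each item independently with probability fraction."""
    return [item for item in items if rng.random() < fraction]


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_with_replacement(items, fraction, rng):
    """Poisson sampling: repeat each item a Poisson(fraction) number of times."""
    out = []
    for item in items:
        out.extend([item] * poisson_draw(fraction, rng))
    return out


rng = random.Random(23)  # fixed seed so the sketch is repeatable
items = list(range(1000))

without = sample_without_replacement(items, 0.25, rng)
with_repl = sample_with_replacement(items, 0.25, rng)

print('Without replacement:', len(without), 'items,',
      len(set(without)), 'distinct')
print('With replacement:   ', len(with_repl), 'items,',
      len(set(with_repl)), 'distinct')
```

In both cases the sample size is itself a random quantity centered on 250, which is consistent with the counts of 262 and 272 reported above.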
