# Data Processing in Spark
  
This notebook continues exploring Spark to perform data processing in a manner similar to your previous experience with Pandas in Python. We will use the airline data, which is stored in HDFS on the EMR cluster and is accessible from the Spark cluster. You will be asked to solve some simple problems at the end. There are also some challenge activities at the end that you can try for extra credit.

Coming back to Spark, it is a cluster computing system that leverages Hadoop technologies like HDFS for high performance storage and Yarn for cluster management. While some may see Spark as a replacement for Hadoop, an alternative argument can be made that Spark is simply another compute engine for Hadoop, in addition to Map-Reduce.

### Initialization

Before starting a new SparkContext, always make sure that any SparkContext previously used by the Jupyter server has been properly released. Let's initialize a new SparkContext to interact with the Spark cluster.

----- 

In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .master("local") \
         .appName("flights-practice") \
         .getOrCreate()

sc = spark.sparkContext

In [None]:
from pyspark.sql import SQLContext
# SQLContext wraps a SparkContext, so we pass `sc` rather than the SparkSession
sqlContext = SQLContext(sc)

### Using Spark

Spark is a framework for processing large-data tasks; in general, this means petabytes (or more) of data. Spark can read from HDFS, which can be set up to chunk files into blocks and to replicate these blocks across a cluster's storage to improve performance and fault tolerance. Spark abstracts these details, however, allowing us to develop an application on a small system and scale up to large data on a cluster.

In Spark, communication moves between a driver process and the executor processes. This communication is handled for us by a `SparkContext`, which requests resources, such as the number of cores, from the Spark master process; these resources are reserved to complete our Spark tasks. Once a SparkContext is active, we can use the Spark console (web UI) to monitor jobs and the overall Spark infrastructure.

In [None]:
data = range(50)
# In Python 3, range is lazy; convert it to a list to display the values
print(list(data))

In [None]:
myRDD = sc.parallelize(data, 8)

In the previous code cell, we created a parallelized collection by using the `parallelize` method, which splits the data into partitions (8 in this case) that can be processed across the cores in a cluster. As a general rule, the Spark documentation suggests 2-4 partitions for each CPU in the cluster.
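
To make the idea of partitioning concrete, here is a minimal plain-Python sketch (not Spark code) that splits a collection into contiguous slices. The helper name `split_into_partitions` is hypothetical, and the slicing rule is an assumption made for illustration; Spark's own partitioning logic may differ in detail.

```python
def split_into_partitions(items, num_partitions):
    """Split a sequence into contiguous, nearly equal slices (illustrative only)."""
    items = list(items)
    size = len(items)
    # Slice i covers the index range [i * size // n, (i + 1) * size // n)
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


partitions = split_into_partitions(range(50), 8)
partition_sizes = [len(p) for p in partitions]
print(partition_sizes)
```

On a real RDD, `myRDD.getNumPartitions()` and `myRDD.glom().collect()` show how Spark actually distributed the elements.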


Next, we call two methods on the RDD: `id` returns the RDD's unique ID, which shows when new RDDs are created, and `setName` names the RDD so that it is easy to find in the Spark cluster management software.

In [None]:
print("Initial RDD id: {0}".format(myRDD.id()))

In [None]:
myRDD.setName("DSA RDD")

-----

Now, given this simple RDD, we can apply a transformation; in this case
we simply add one to each element in the RDD. This transformation doesn't
actually happen until we call an action method, which first occurs in
the third code cell below when we call the `collect` method. The new RDD
has been created, however, as indicated by its new id.

-----

In [None]:
myaddRDD = myRDD.map(lambda a: a + 1)

In [None]:
print(myaddRDD.toDebugString())

In [None]:
print(myaddRDD.collect())
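
The deferred evaluation described above can be imitated in plain Python, where the built-in `map` is also lazy. This is only an analogy, not Spark code; the helper `add_one` and the `calls` list are hypothetical names introduced to record when the function actually runs.

```python
calls = []

def add_one(a):
    calls.append(a)  # record each time the function is actually executed
    return a + 1

lazy = map(add_one, range(5))   # like a transformation: nothing runs yet
calls_before = len(calls)

result = list(lazy)             # like an action: forces the evaluation
calls_after = len(calls)

print(calls_before, result, calls_after)
```

The function is not called at all until the result is materialized, which mirrors how the `map` transformation waits for `collect`.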

-----

We can now apply a second transformation; in this case we apply a
filter, which selects values from the RDD based on a condition (in this
example we select values that are evenly divisible by 5). The
transformation doesn't occur, however, until we once again call the
`collect` method, which _collects_ the results of the different
transformations.

-----

In [None]:
myfilterRDD = myaddRDD.filter(lambda x: (x % 5) == 0)

In [None]:
myfilterRDD.collect()

In [None]:
print(myfilterRDD.toDebugString())

-----

Transformations, however, can be chained together in a process called
pipelining. Doing so can produce long code strings, which can be
difficult to follow (or debug). Thus, it is considered good style
to break pipelined operations such that each transformation occurs on a
separate line. The following code combines the previous Spark tasks
into a single statement, written in the recommended style.

-----

In [None]:
(sc
 .parallelize(data)
 .map(lambda x: x + 1)
 .filter(lambda x: (x % 5) == 0)
 .collect())
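
As a sanity check on the pipeline above, the same logic can be written in plain Python without Spark. This is only a sketch for comparison; since none of these steps reorder the elements, we would expect the Spark result to match this list.

```python
data = range(50)

incremented = (x + 1 for x in data)                          # the map step
multiples_of_five = [x for x in incremented if x % 5 == 0]   # the filter step

print(multiples_of_five)
```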

-----

### Data Processing

Previously in this Notebook, we have used Spark to create simple RDDs
that demonstrated Spark transformations and actions on small data. Now
we will change approaches and analyze the airline data, first starting
with the single 2001 flight data file. We can create a new RDD by
reading in the data as a text file, after which we force the RDD to be
evaluated by counting the number of lines in it. We subsequently
apply several other RDD methods to display the first few rows of data by
using the `take` method. Finally, we use the built-in `help` to see the
list of supported RDD methods.

-----



In [None]:
filename = '/dsa/data/all_datasets/flights.csv'

text_file = sc.textFile(filename)

In [None]:
text_file.count()

In [None]:
text_file.take(5)

In [None]:
# Display help info on spark rdd
help(text_file)

-----

With this text RDD, we can begin to process the data. Since our data is,
at this point, simply a list of strings, we first need to split each
line into columns, extract the columns of interest, and remove the
header row. These steps are pipelined to create a single RDD that isn't
processed until we execute an action method, in this case, the `first`
method that displays the first row in the new RDD.

-----

In [None]:
col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[7], p[8], p[10], p[11], p[17], p[22])) \
            .filter(lambda line: 'YEAR' not in line)

In [None]:
col_data.first()

In [None]:
col_data.take(5)
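
The three steps in this pipeline can be tried on a few lines in plain Python. The sample lines and column positions below are made up for illustration (the real file has many more columns); only the logic mirrors the Spark cell.

```python
# Made-up sample lines; the real file has many more columns
lines = [
    "YEAR,MONTH,DAY,ORIGIN,DESTINATION,DELAY",
    "2001,1,1,ORD,LAX,12",
    "2001,1,2,JFK,ORD,-3",
]

# Split each line into fields, keep a subset of columns, then drop the header
split_rows = [line.split(",") for line in lines]
selected = [(p[0], p[1], p[3], p[5]) for p in split_rows]
col_rows = [row for row in selected if 'YEAR' not in row]

print(col_rows)
```

Note that `'YEAR' not in row` is applied to a tuple, so it tests whether any field equals `'YEAR'` exactly; it is not a substring search.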

Count the number of rows in `col_data`.

In [None]:
col_data.count()

-----

Unlike Pandas, a Spark RDD does not handle NA values for us; a missing
field is simply an empty string. Thus we need an additional transform to
remove lines from our RDD that contain missing data. We can accomplish
this by using an appropriate filter.

-----

In [None]:
cols = col_data.filter(lambda line: '' not in line)

Count the number of rows in `col_data` that do not have any missing values.

In [None]:
cols.count()

Count the number of rows in `col_data` that have missing values.

In [None]:
na_rows = col_data.filter(lambda line: '' in line)
na_rows.count()

Make sure that the number of rows with missing values plus the number of rows without missing values equals the total number of rows in `col_data`.

In [None]:
5714008+105071
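
The missing-value filter and the consistency check can be reproduced on a handful of rows in plain Python. The rows below are made up for illustration; the point is that the two filters split the data into complementary groups.

```python
# Made-up rows; an empty string marks a missing field
rows = [
    ('2001', '1', 'ORD', '12'),
    ('2001', '1', 'JFK', ''),
    ('2001', '2', 'LAX', '-3'),
]

complete = [r for r in rows if '' not in r]
missing = [r for r in rows if '' in r]

# Every row lands in exactly one of the two groups
print(len(complete), len(missing), len(rows))

# Caution: on a plain string, '' in s is always True, so the same test
# applied to unsplit lines would flag every line as missing
always_true_on_strings = '' in '2001,1,ORD,12'
print(always_true_on_strings)
```

This is why the filter is applied after the lines have been split into tuples of fields.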

-----

To analyze these data, however, we need to convert the columns to the
appropriate data types. In this case, we can simply apply one final
transformation.

-----

In [None]:
fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), p[3],
                          p[4], int(p[5]), int(p[6]), int(p[7]), int(p[8])))

In [None]:
fields.take(1)

-----

## Spark DataFrame

Spark supports a simplified [Data Frame][spdf] as part of the [Spark
SQL][spsql] library. We can create a Data Frame from an existing RDD by
also specifying the column labels and data types. The data types must
be one of the pre-defined [Spark SQL types][spdt]. After creating the
new DataFrame (which is backed by an RDD), we can perform many of the
same tasks with Spark that we performed with Pandas (but not all, and
not always as simply). The following code cells show how we
can take our 2001 flight data RDD and create a new Data Frame, which we
use in several subsequent code cells.

-----
[spdf]: https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
[spsql]: https://spark.apache.org/sql/
[spdt]: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types

In [None]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

schemaString = "Year Month DayOfMonth Origin Destination DepTime DepDelay Distance ArrDelay"

fieldTypes = [IntegerType(), IntegerType(), IntegerType(), \
              StringType(), StringType(), IntegerType(), \
              IntegerType(), IntegerType(), IntegerType()]

f_data = [StructField(field_name, field_type, True) \
          for field_name, field_type in zip(schemaString.split(), fieldTypes)]

schema = StructType(f_data)

In [None]:
df = sqlContext.createDataFrame(fields, schema)
print(df)

-----

In the following three code cells, we `show` the first few lines of the
DataFrame, then use the `head` method, which returns the rows as `Row`
objects with named fields, and finally use the `describe` method, whose
summary is not displayed until the `show` action is invoked. While the
output is less visually attractive than the Pandas result, we still
obtain the necessary information.

After these code cells, we access the DataFrame schema by using
the `printSchema` method to nicely output the schema. Because we have
named our DataFrame columns, we can also access a column directly by
name (for example, `df['Origin']`).

-----

In [None]:
df.show(5)

In [None]:
df.head(4)

In [None]:
df.describe().show()

In [None]:
df.printSchema()

-----

We can extract data from the DataFrame by using similar techniques to
what we used with Pandas. One difference is that we need to `filter` the
DataFrame, as opposed to directly accessing rows. Below, we first filter
rows to extract flights that left O'Hare, and then those flights
that left O'Hare more than two hours late. In the second case, we also
transform the output to `select` the _Destination_ column and a new
column that is the _Distance_ in kilometers (approximated by multiplying by 1.6).

-----

In [None]:
df.filter(df['Origin'] == 'ORD').count()

In [None]:
(df
 .filter(df['Origin'] == 'ORD')
 .filter(df['DepDelay'] > 120)
 .select(df['Destination'], df['Distance'] * 1.6)
 .show(10))

-----

## Spark SQL

Given a Spark DataFrame, we can apply SQL statements directly against
the DataFrame by registering the DataFrame as a Spark temporary SQL
table. The following code cell demonstrates this, as we register our
DataFrame as a `flights` table, and execute a SQL statement to select
the same rows we obtained from our previous DataFrame filter. Since the
data are unordered, different rows may be displayed via the `show`
method.

-----

In [None]:
df = sqlContext.createDataFrame(fields, schema)

# createOrReplaceTempView replaces the deprecated registerTempTable
df.createOrReplaceTempView("flights")

# SQL can be run over DataFrames that have been registered as a table.
sql_q = "SELECT Destination, Distance FROM flights WHERE Origin = 'ORD' AND DepDelay > 120"

results = sqlContext.sql(sql_q)

# The results of SQL queries are DataFrames and support all the normal DataFrame operations.
results.show(10)
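
The SQL statement itself is ordinary SQL, so it can be tried without a Spark cluster. The sketch below runs the same query against an in-memory SQLite table from the Python standard library; the table contents are made-up rows, and SQLite is used here only as a stand-in for Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(Origin TEXT, Destination TEXT, DepDelay INTEGER, Distance INTEGER)"
)

# Made-up rows for illustration only
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        ("ORD", "LAX", 150, 1745),
        ("ORD", "SFO", 30, 1846),
        ("JFK", "ORD", 200, 740),
        ("ORD", "DEN", 121, 888),
    ],
)

sql_q = "SELECT Destination, Distance FROM flights WHERE Origin = 'ORD' AND DepDelay > 120"
rows = conn.execute(sql_q).fetchall()
print(rows)
conn.close()
```

Only the flights that left `ORD` with a departure delay above 120 minutes are returned, which is the same selection the DataFrame `filter` calls perform.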

-----

## Spark Statistics

The simplest type of data analysis is to compute basic statistical
measures of sequences of data. The Spark MLlib package includes a 
[basic statistical][sbs] component that can be easily used to obtain
statistical measurements of multiple columns in a Spark RDD. We
demonstrate this in the following code cells, where we create an RDD
from numeric columns in our `fields` RDD. We use the `colStats` function
from the `Statistics` object to compute a range of statistical measures
in one pass for all columns in the `sdt` RDD. In the second code cell,
we simply provide a nicely formatted display of these quantities for
each column.

-----

[sbs]: https://spark.apache.org/docs/latest/mllib-statistics.html

In [None]:
from pyspark.mllib.stat import Statistics

# Extract numeric columns and compute statistics
sdt = fields.map(lambda p: (p[2], p[5], p[6], p[7], p[8]))
summary = Statistics.colStats(sdt)

# Extract individual statistics for RDD
mus = summary.mean()
mns = summary.min()
mxs = summary.max()
vrs = summary.variance()
nnzs = summary.numNonzeros()

In [None]:
# Labels for display (named col_labels so the `cols` RDD is not overwritten)
col_labels = ['Day', 'Dep. Time', 'Dep. Delay', 'Distance', 'Arr. Delay']

# Print out Header
print('{0:>20s}{1:>12s}{2:>8s}{3:>10s}{4:>12s}'\
      .format('Mean', 'Variance', 'Min', 'Max', 'Non Zeroes'))
print(65*'-')

# Printout summary statistics
for idx, (m, v, mn, mx, n) in enumerate(zip(mus, vrs, mns, mxs, nnzs)):
    print('{5:10s}{0:10.2f}{1:12.2f}{2:8.2f}{3:10.2f}{4:12d}'\
          .format(m, v, mn, mx, int(n), col_labels[idx]))
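
To see what each of these summary quantities means, we can compute them for one small column in plain Python. The delay values below are made up, and we assume here that the variance reported by `colStats` is the sample variance (dividing by n - 1), which is what `statistics.variance` computes.

```python
import statistics

# Made-up departure delays (minutes) standing in for one RDD column
delays = [0, 5, -2, 0, 30, 12]

mean = statistics.mean(delays)
variance = statistics.variance(delays)      # sample variance (divides by n - 1)
minimum, maximum = min(delays), max(delays)
num_nonzeros = sum(1 for d in delays if d != 0)

print('{0:10.2f}{1:12.2f}{2:8.2f}{3:10.2f}{4:12d}'
      .format(mean, variance, minimum, maximum, num_nonzeros))
```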

-----

### Correlations

Another useful operation is computing the correlation between different
data sequences. The Spark MLlib package includes the `corr` method
within the Statistics component to compute correlations between
individual data sequences, or via the columns in an RDD. The `corr`
method can also calculate either the _Pearson_ correlation, which is the
default, or the _Spearman_ correlation. In the first code cell, we
create several data sequences, turn them into Spark data structures via
the `parallelize` method, and compute the Pearson correlation
coefficient between the different data sequences. In the second code
cell, we create a new RDD from three columns in the `sdt` RDD, and
compute both the Pearson and Spearman correlations between the columns
in this RDD.

-----

In [None]:
# Demonstrate Correlation Measurements

# Sample Data
x = sc.parallelize([0, 1, 2])
y = sc.parallelize([1, 2, 4])
z = sc.parallelize([2, 1, 0])

print('x = ', x.collect())
print('y = ', y.collect())
print('z = ', z.collect())

print('\nPearson Correlation Tests')
print(25*'-')
print('x corr x = {0:+5.3f}'\
      .format(Statistics.corr(x, x, method='pearson')))

print('x corr y = {0:+5.3f}'\
      .format(Statistics.corr(x, y, method='pearson')))

print('x corr z = {0:+5.3f}'\
      .format(Statistics.corr(x, z, method='pearson')))

In [None]:
# Set print precision of matrices
import numpy as np
np.set_printoptions(precision=3)

# Compute correlation of three columns in RDD
cd = sdt.map(lambda p: (p[1], p[2], p[4]))

print('Departure Time, Departure Delay, Arrival Delay')

print('\nPearson Correlation Matrix:')
print(Statistics.corr(cd, method='pearson'))

print('\nSpearman Correlation Matrix:')
print(Statistics.corr(cd, method='spearman'))
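
For the three small sample sequences used above, the correlations can be checked by hand. The following plain-Python sketch uses the textbook definition of the Pearson coefficient and treats Spearman as Pearson applied to ranks; the helper names `pearson` and `ranks` are hypothetical, and the rank helper assumes there are no ties.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Rank values from 1 to n (assumes no ties, which holds for this sample)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

x, y, z = [0, 1, 2], [1, 2, 4], [2, 1, 0]

print('x corr x = {0:+5.3f}'.format(pearson(x, x)))
print('x corr y = {0:+5.3f}'.format(pearson(x, y)))
print('x corr z = {0:+5.3f}'.format(pearson(x, z)))

# Spearman is Pearson applied to the ranks of the values
print('x spearman y = {0:+5.3f}'.format(pearson(ranks(x), ranks(y))))
```

Because `y` increases whenever `x` does, the rank-based Spearman value is exactly 1 even though the Pearson value is slightly below 1.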

### Student Activities

Make the following changes to see how the results change.

**Note**: The hints and code insertion markings are not all-inclusive. You may need to add code before and after the hints.  They are just hints, not fully structured _fill-in-the-blank_.


**Activity 1:** Change the `myRDD` example to start with all integers from 0 to 399. Then use an appropriate lambda function to convert this RDD to a new RDD that has all odd integers from 1 to 399.

In [None]:
## Your code for activity 1 goes below this comment
# ----------------------------------------------------

<YourCodeHere>


myRDD = <YourCodeHere>

oddRDD = <YourCodeHere>

**Activity 2:** Filter the previous RDD to contain only entries that are divisible by 9.

In [None]:
## Your code for activity 2 goes below this comment
# ----------------------------------------------------

ninesRDD = <YourCodeHere>



**Activity 3:** Convert this RDD to a Spark DataFrame, specifying the column name as `Numbers`.

In [None]:
## Your code for activity 3 goes below this comment
# ----------------------------------------------------

<YourCodeHere>

df = <???>.createDataFrame(<???>)


**Activity 4:** Change the DataFrame to include different columns from the flights data. You might review the original [airline data set](http://stat-computing.org/dataexpo/2009/) website to see the column descriptions.
 * [Make sure you are consulting the Spark Documentation](https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)

In [None]:
## Your code for activity 4 goes below this comment
# ----------------------------------------------------





**Activity 5:** Use a SQL query on the `df` DataFrame to compute the mean distance between all flights from O'Hare to Los Angeles International Airport (LAX).

In [None]:
## Your code for activity 5 goes below this comment
# ----------------------------------------------------





### Additional, more advanced problems:

**Note**: Consult the [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html) for additional help, hints, and general exploration.

**Activity 6:** Add an index column to the Spark DataFrame created in activity 3, which sequentially increases.

In [None]:
## Your code for activity 6 goes below this comment
# ----------------------------------------------------




**Activity 7:** Create an RDD containing the 'Year', 'Month', 'DayofMonth', 'dDelay',
and 'Origin' columns for the airline data.

In [None]:
## Your code for activity 7 goes below this comment
# ----------------------------------------------------





-----

# Save your notebook

**Note**: If you do not do the extra material, you should still scroll down and execute the command to release the SparkContext.

# Optional/ Extra material

## Machine Learning

The bulk of the MLlib package is focused on performing machine learning
at scale by using Spark. With functions for computing classification,
regression, clustering, dimensionality reduction, and more, the library
extends considerable power to the Spark user. Since we have already
covered these concepts by using Python and scikit-learn, in the rest of
this Notebook, we will present one specific machine learning algorithm
in order to demonstrate the basic concepts required to work with the
tools in the Spark MLlib package.

-----

### Linear Modeling

One of the simplest machine learning techniques is [linear regression][slr].
The main difference when using Spark is that for this supervised
learning technique our data must be in a Spark specific data structure
called [`LabeledPoint`][slp]. Spark provides several data structures to
simplify the application of distributed machine learning algorithms at
scale. The labeled nature refers to the label, used for training, that
is associated with the point. The first item in the data structure is
the label, while the second item is the set of feature columns.

In the following code cells, we first create a new data structure that
extracts the arrival delay to be the label and the departure delay as
the feature. These data are turned into a Spark sequence containing
`LabeledPoint` values for each row in the original RDD whose departure
delay exceeds `min_delay`. Next we display the first rows in the new
sequence, and then we train the linear regressor (using stochastic
gradient descent, SGD, in this case) and specify a number of iterations
and step size. You should feel free to modify these values and see the
impact on the resulting performance. Finally, we compute several
regression metrics to quantify the performance of this method on these
data (recall that the data span a large range, hence the RMSE is quite
reasonable).

-----

[slp]: https://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
[slr]: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.ml.regression import LinearRegressionModel

# Minimum departure delay
min_delay = 5.
data = fields.filter(lambda p: p[6] > min_delay).map(lambda p: LabeledPoint(p[8], [p[6]]))

In [None]:
data.take(5)

In [None]:
lr_model = LinearRegressionWithSGD.train(data, iterations=100, step=0.00000001)

In [None]:
# RegressionMetrics expects (prediction, observation) pairs, in that order
vnp = data.map(lambda lp: (float(lr_model.predict(lp.features)), lp.label))

In [None]:
vnp.take(5)

In [None]:
from pyspark.mllib.evaluation import RegressionMetrics

tm = RegressionMetrics(vnp)

print('RMSE = {0:5.1f}'.format(tm.rootMeanSquaredError))
print('MSE = {0:5.1f}'.format(tm.meanSquaredError))
print('MAE = {0:5.1f}'.format(tm.meanAbsoluteError))
print('r2 = {0:5.1f}'.format(tm.r2))
print('EV = {0:5.1f}'.format(tm.explainedVariance))
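
The metrics printed above follow standard definitions, which can be sketched in plain Python on a few made-up (prediction, observation) pairs. The helper `regression_metrics` is a hypothetical name, and the formulas are the textbook ones, which we assume match what `RegressionMetrics` reports. The sketch also shows that r2 changes when the two roles are swapped, so the order of each pair matters.

```python
import math

# Made-up (prediction, observation) pairs
pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]

def regression_metrics(pred_obs):
    """Compute RMSE, MSE, MAE, and r2 from (prediction, observation) pairs."""
    n = len(pred_obs)
    errors = [p - o for p, o in pred_obs]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(o for _, o in pred_obs) / n
    ss_tot = sum((o - mean_obs) ** 2 for _, o in pred_obs)
    r2 = 1 - (mse * n) / ss_tot
    return {'RMSE': math.sqrt(mse), 'MSE': mse, 'MAE': mae, 'r2': r2}

metrics = regression_metrics(pairs)
swapped = regression_metrics([(o, p) for p, o in pairs])

for name, value in metrics.items():
    print('{0} = {1:5.3f}'.format(name, value))
print('r2 with the pair order swapped = {0:5.3f}'.format(swapped['r2']))
```

The error-based metrics (RMSE, MSE, MAE) are symmetric in the two values, while r2 depends on which value is treated as the observation.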

In [None]:
print(lr_model)
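
To build intuition for how the number of iterations and the step size interact, here is a minimal plain-Python sketch of batch gradient descent for a one-feature model without an intercept. It is not MLlib's algorithm (which samples the data and uses its own step schedule); the data and the helper `fit_weight` are made up for illustration.

```python
# Made-up data that follows y = 2 * x exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def fit_weight(xs, ys, iterations, step):
    """Fit y = w * x (no intercept) by plain batch gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad
    return w

w_reasonable = fit_weight(xs, ys, iterations=100, step=0.01)
w_tiny_step = fit_weight(xs, ys, iterations=100, step=0.00000001)

print('step=0.01  -> w = {0:.6f}'.format(w_reasonable))
print('step=1e-08 -> w = {0:.6f}'.format(w_tiny_step))
```

On this small-valued data, the tiny step barely moves the weight in 100 iterations. The gradient grows with the square of the feature values, so features with large values, such as delays in minutes, call for a much smaller step than features near 1.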

### Additional, more advanced problems:

**Activity 8:** Add more columns into the Linear Regression demonstrated in this Notebook. In particular, include departure time and distance into the calculation.

In [None]:
## Your code for activity 8 goes below this comment
# ----------------------------------------------------





In [None]:
sc.stop()

# Save your notebook, then `File > Close and Halt`