<DIV ALIGN=CENTER>

# Introduction to Spark
## DataFrames, SQL, and Basic Data Analysis
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In this IPython Notebook, we explore using Spark to perform data
processing in a similar manner to our previous efforts with Pandas. For
this we will use the airline data, which has been stored in an HDFS
system that is accessible from within our Spark cluster. [Other][dw]
tutorials exist, although they often focus on Scala examples since Spark
is written in that language.

-----
[sp]: http://spark.apache.org
[sh]: http://hadoop.apache.org
[sy]: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[shdfs]: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[sce]: http://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/
[dw]: https://github.com/deanwampler/spark-workshop/tree/master/tutorial

In [3]:
from pyspark import SparkContext, SparkConf

sc = SparkContext('local[*]')

In [4]:
sc.stop()

In [8]:
# We release the SparkContext if it exists.
try:
    sc
except NameError:
    pass
else:
    sc.stop()

# Now handle initial import statements

from os import environ
from pyspark import SparkConf, SparkContext

# Obtain initial environment variables.

port_usage = '''
To use our Spark cluster, we must manage access by nodes in the
JupyterHub Server to the Spark cluster. This may require modification to
the port numbers used by the SparkContext to communicate with the Spark
cluster. We first try to automatically set the port numbers, and display
the allowed range for this particular Notebook in order to allow you to
modify them manually as necessary. The two port values to set are 
spark.driver.port and spark.blockManager.port.

Allowed port range: {0}-{1}
'''

#bsp = environ['SPARK_PORT_BEGIN']
#esp = environ['SPARK_PORT_END']

#if bsp == None:
bsp = 7077

#if esp == None:
esp = 7077
    
#print(port_usage.format(bsp, esp))

slip = environ.get('HOSTNAME') # Spark Local IP; None if the variable is unset

if slip is None:
    slip = 'localhost' #141.142.236.173'

sdh = "spark{}".format(slip.split('.')[-1]) # Spark Driver Hostname

print(sdh)

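# NOTE: 'local[*]' is a Spark master URL rather than a hostname; passing it as
# spark.driver.host below is the likely cause of the 'Unresolved address' error
# in this cell's output.
sdh = 'local[*]' #'spark://141.142.236.173'
# Create a new Spark configuration (port numbers might need to be adjusted from the defaults).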
myconf = SparkConf()
myconf.set("spark.driver.port", int(bsp))
myconf.set("spark.blockManager.port", int(esp))
myconf.set("spark.driver.host", sdh)
myconf.set("spark.local.ip", slip)
myconf.set("spark.cores.max",2)

myconf.setAppName("Practical Data Mining, UI RP: Brunner")

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))

spark85e5306aa1c0


Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.net.SocketException: Unresolved address
	at sun.nio.ch.Net.translateToSocketException(Net.java:137)
	at sun.nio.ch.Net.translateException(Net.java:163)
	at sun.nio.ch.Net.translateException(Net.java:169)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:107)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:217)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	... 12 more
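
The commented-out lines in the configuration cell read the allowed port range from the `SPARK_PORT_BEGIN` and `SPARK_PORT_END` environment variables. A minimal sketch of reading them defensively follows. The helper name `port_range` and the fallback of 7077 (the value hard-coded in the cell) are assumptions for illustration, and the mapping is passed in explicitly so the function can be tried without a cluster.

```python
from os import environ

DEFAULT_PORT = 7077  # same value the configuration cell hard-codes


def port_range(env=environ, default=DEFAULT_PORT):
    """Return (begin, end) ports as integers, using `default` when a variable is unset."""
    begin = int(env.get('SPARK_PORT_BEGIN', default))
    end = int(env.get('SPARK_PORT_END', default))
    if begin > end:
        raise ValueError("port range is reversed: {0}-{1}".format(begin, end))
    return begin, end


# Hypothetical values, used only to show the behavior.
print(port_range({'SPARK_PORT_BEGIN': '7100', 'SPARK_PORT_END': '7110'}))
print(port_range({}))
```

Using `get` with a default avoids the `KeyError` that indexing `environ` directly raises when a variable is missing.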


-----

### Data Processing

Previously in this Notebook, we have used Spark to create simple RDDs
that demonstrated Spark transformations and actions on small data. Now
we will change approaches and analyze the airline data, starting
with the single 2001 flight data file. We can create a new RDD by
reading in the data as a text file, after which we trigger the RDD
creation by counting the number of lines in the RDD. We subsequently
apply several other RDD methods and display the first few rows of data
by using the `take` method. Finally, we use the built-in `help` to see
the list of supported RDD methods.

-----



In [18]:
text_file = sc.textFile("hdfs://10.0.3.113:9000/home/ubuntu/data/2001.csv")

col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

cols = col_data.filter(lambda line: 'NA' not in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

In [23]:
fields.count()

5723673
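
The chain of `map` and `filter` calls above can also be expressed as one named function, which is easier to test without a cluster. The sketch below is an assumption-laden illustration rather than the notebook's method. The names `KEEP`, `TEXT_FIELDS`, and `parse_flight` are hypothetical, and unlike the original pipeline it drops short or non-numeric rows instead of raising an error. Only the kept fields of the sample row come from the output shown in this Notebook; the remaining fields are made up.

```python
# Positions of the columns the notebook keeps from each CSV line.
KEEP = (0, 1, 2, 4, 14, 15, 16, 17, 18)
# Positions, within the kept fields, that stay as strings (Origin, Destination).
TEXT_FIELDS = (6, 7)


def parse_flight(line):
    """Return a typed tuple for one CSV line, or None if the row is unusable."""
    parts = line.split(",")
    if len(parts) <= max(KEEP):
        return None  # too few fields
    picked = [parts[i] for i in KEEP]
    if 'Year' in picked or 'NA' in picked:
        return None  # header row or missing value
    try:
        return tuple(value if pos in TEXT_FIELDS else int(value)
                     for pos, value in enumerate(picked))
    except ValueError:
        return None  # a numeric field did not parse


# Illustrative lines; fields outside KEEP are hypothetical.
header = ("Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
          "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
          "ArrDelay,DepDelay,Origin,Dest,Distance")
sample = "2001,1,17,3,1806,1810,1931,1934,US,375,N700,85,84,60,-3,-4,BWI,CLT,361"

print(parse_flight(header))
print(parse_flight(sample))
```

With an active `SparkContext`, this would slot into the pipeline as `text_file.map(parse_flight).filter(lambda r: r is not None)`.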

-----

### Spark DataFrame

Spark supports a simplified [DataFrame][spdf] as part of the [Spark
SQL][spsql] library. We can create a DataFrame from an existing RDD by
also specifying the column labels and data types. The data types must
be one of the pre-defined [Spark SQL types][spdt]. After creating the
new DataFrame (which is backed by an RDD), we can perform many of the
same tasks with Spark that we performed with Pandas (although not all
of them, and not always as simply). The following code cells show how we
can take our 2001 flight data RDD and create a new DataFrame, which we
use in several subsequent code cells.

-----
[spdf]: https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
[spsql]: https://spark.apache.org/sql/
[spdt]: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types

In [24]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

schemaString = "Year Month DayOfMonth DepTime ArrDelay DepDelay Origin Destination Distance"

fieldTypes = [IntegerType(), IntegerType(), IntegerType(), IntegerType(), IntegerType(), IntegerType(), \
              StringType(), StringType(), IntegerType()]

f_data = [StructField(field_name, field_type, True) \
          for field_name, field_type in zip(schemaString.split(), fieldTypes)]

schema = StructType(f_data)

In [25]:
df = sqlContext.createDataFrame(fields, schema)
df

DataFrame[Year: int, Month: int, DayOfMonth: int, DepTime: int, ArrDelay: int, DepDelay: int, Origin: string, Destination: string, Distance: int]
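
The schema above pairs each column name with a type by zipping two lists. The same pairing can be sketched in plain Python to check a row before handing it to Spark. This is only an illustration: the names `python_types`, `expected`, and `mismatches` are hypothetical, and the Python types stand in for the Spark SQL types under the assumption that `IntegerType` corresponds to `int` and `StringType` to `str`.

```python
schema_string = "Year Month DayOfMonth DepTime ArrDelay DepDelay Origin Destination Distance"

# Python stand-ins for the Spark SQL field types, in the same order.
python_types = [int, int, int, int, int, int, str, str, int]

# Same zip-based pairing the notebook uses to build its StructField list.
expected = list(zip(schema_string.split(), python_types))


def mismatches(row):
    """Return the names of fields whose Python type disagrees with the schema."""
    if len(row) != len(expected):
        return ["<wrong number of fields>"]
    return [name for (name, typ), value in zip(expected, row)
            if not isinstance(value, typ)]


good_row = (2001, 1, 17, 1806, -3, -4, 'BWI', 'CLT', 361)
bad_row = (2001, 1, 17, 1806, -3, -4, 'BWI', 'CLT', '361')  # Distance left as text

print(mismatches(good_row))
print(mismatches(bad_row))
```

A check like this can help locate the offending column when DataFrame creation rejects a row because of a type mismatch.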

-----

In the following three code cells, we `show` the first few lines of the
DataFrame, then use the `head` method, which displays more syntactic
information for each row, and finally use the `describe` method, which
doesn't execute until the `show` action is invoked. While the output is
less visually attractive than the Pandas result, we still obtain the
necessary information.

After these code cells, we access the DataFrame schema, first by using
the `printSchema` method to nicely output the schema, and next access a
column directly, which we can now do since we have named our DataFrame
columns.

-----

In [26]:
df.show(5)

+----+-----+----------+-------+--------+--------+------+-----------+--------+
|Year|Month|DayOfMonth|DepTime|ArrDelay|DepDelay|Origin|Destination|Distance|
+----+-----+----------+-------+--------+--------+------+-----------+--------+
|2001|    1|        17|   1806|      -3|      -4|   BWI|        CLT|     361|
|2001|    1|        18|   1805|       4|      -5|   BWI|        CLT|     361|
|2001|    1|        19|   1821|      23|      11|   BWI|        CLT|     361|
|2001|    1|        20|   1807|      10|      -3|   BWI|        CLT|     361|
|2001|    1|        21|   1810|      20|       0|   BWI|        CLT|     361|
+----+-----+----------+-------+--------+--------+------+-----------+--------+
only showing top 5 rows



In [27]:
df.head(4)

[Row(Year=2001, Month=1, DayOfMonth=17, DepTime=1806, ArrDelay=-3, DepDelay=-4, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=18, DepTime=1805, ArrDelay=4, DepDelay=-5, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=19, DepTime=1821, ArrDelay=23, DepDelay=11, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=20, DepTime=1807, ArrDelay=10, DepDelay=-3, Origin='BWI', Destination='CLT', Distance=361)]

In [28]:
df.describe().show()

+-------+-------+-----------------+-----------------+------------------+------------------+------------------+-----------------+
|summary|   Year|            Month|       DayOfMonth|           DepTime|          ArrDelay|          DepDelay|         Distance|
+-------+-------+-----------------+-----------------+------------------+------------------+------------------+-----------------+
|  count|5723673|          5723673|          5723673|           5723673|           5723673|           5723673|          5723673|
|   mean| 2001.0|6.291580773394986|15.71320251873229|1348.6880443729053| 5.528248731190619| 8.115271609681406| 735.173682004545|
| stddev|    0.0|3.381754330822876|8.827993155975875|   482.63871515896|31.429288422846703|28.234080794004345|574.8151318384248|
|    min|   2001|                1|                1|                 1|             -1116|              -204|               21|
|    max|   2001|               12|               31|              2400|              1688|      

In [29]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayOfMonth: integer (nullable = true)
 |-- DepTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Destination: string (nullable = true)
 |-- Distance: integer (nullable = true)



In [30]:
df.Year

Column<b'Year'>

-----

We can extract data from the DataFrame by using techniques similar to
those we used with Pandas. One difference is that we need to `filter` the
DataFrame, as opposed to directly accessing rows. First, we filter
rows to extract flights that left O'Hare, and second, those flights
that left O'Hare more than two hours late. In the second case, we also
transform the output to `select` the _Destination_ column and a new
column that is the _Distance_ converted from miles to kilometers
(approximated by multiplying by 1.6).

-----

In [31]:
df.filter(df['Origin'] == 'ORD').count()

321784

In [32]:
df.filter(df['Origin'] == 'ORD').filter(df['DepDelay'] > 120).select(df['Destination'], df['Distance'] * 1.6).show(10)

+-----------+-----------------+
|Destination| (Distance * 1.6)|
+-----------+-----------------+
|        PHL|           1084.8|
|        CLT|958.4000000000001|
|        MEM|            785.6|
|        MEM|            785.6|
|        MEM|            785.6|
|        STL|            412.8|
|        STL|            412.8|
|        PVD|           1358.4|
|        LAX|           2792.0|
|        LAX|           2792.0|
+-----------+-----------------+
only showing top 10 rows
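
For comparison with pure Python, the `filter` and `select` steps above can be sketched as a list comprehension over in-memory tuples. This is a hedged illustration only. The function name `late_departures` is hypothetical, and the two O'Hare rows are invented for the example (only their destination and distance pairs appear in the output above).

```python
MILES_TO_KM = 1.6  # the notebook's approximation; 1.609344 is the exact factor

# Rows in the notebook's column order:
# (Year, Month, DayOfMonth, DepTime, ArrDelay, DepDelay, Origin, Destination, Distance)
rows = [
    (2001, 1, 17, 1806, -3, -4, 'BWI', 'CLT', 361),   # from the output shown earlier
    (2001, 1, 5, 2115, 140, 150, 'ORD', 'PHL', 678),  # hypothetical
    (2001, 1, 6, 900, 2, 5, 'ORD', 'STL', 258),       # hypothetical
]


def late_departures(rows, origin='ORD', min_delay=120):
    """Return (Destination, distance in km) for flights leaving `origin` late."""
    return [(r[7], r[8] * MILES_TO_KM)
            for r in rows
            if r[6] == origin and r[5] > min_delay]


print(late_departures(rows))
```

The list comprehension runs immediately on data held in local memory, whereas the Spark `filter` and `select` calls only describe the computation, which runs across the cluster when an action such as `show` or `count` is invoked.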



-----

### Spark SQL

Given a Spark DataFrame, we can apply SQL statements directly against
the DataFrame by registering the DataFrame as a Spark temporary SQL
table. The following code cell demonstrates this, as we register our
DataFrame as a `flights` table, and execute a SQL statement to select
the same rows we obtained from our previous DataFrame filter. Since the
data are unordered, the `show` method is not guaranteed to display the
same rows each time, although here it does. Note that the _Distance_
column is displayed without the conversion to kilometers.

-----

In [33]:
df = sqlContext.createDataFrame(fields, schema)

df.registerTempTable("flights")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT Destination, Distance FROM flights WHERE Origin = 'ORD' AND DepDelay > 120")

# The results of SQL queries are DataFrames, so DataFrame methods such as show are available.
results.show(10)

+-----------+--------+
|Destination|Distance|
+-----------+--------+
|        PHL|     678|
|        CLT|     599|
|        MEM|     491|
|        MEM|     491|
|        MEM|     491|
|        STL|     258|
|        STL|     258|
|        PVD|     849|
|        LAX|    1745|
|        LAX|    1745|
+-----------+--------+
only showing top 10 rows
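
The SQL statement itself is ordinary SQL, so it can be tried on a tiny table without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Spark SQL. The table holds only the four columns the query touches, and the rows are hypothetical.

```python
import sqlite3

# An in-memory database standing in for the registered `flights` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(Origin TEXT, Destination TEXT, DepDelay INTEGER, Distance INTEGER)"
)

# Hypothetical rows, used only to exercise the WHERE clause.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        ('BWI', 'CLT', -4, 361),
        ('ORD', 'PHL', 150, 678),
        ('ORD', 'STL', 5, 258),
    ],
)

# Same statement as the notebook passes to sqlContext.sql.
query = ("SELECT Destination, Distance FROM flights "
         "WHERE Origin = 'ORD' AND DepDelay > 120")
results = conn.execute(query).fetchall()
conn.close()

print(results)
```

Only the O'Hare flight delayed by more than 120 minutes survives the `WHERE` clause, which mirrors the two chained `filter` calls in the DataFrame version.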



-----
## Breakout Session

During this breakout, you should work with the previous Spark examples
in order to better learn how Spark works, and how it is different from
pure Python approaches like Pandas. Specific problems you can attempt
include the following:

1. Change the `myRDD` example to start with all integers from 0 to 199.
Use an appropriate lambda function to convert this RDD to a new RDD that
has all odd integers from 1 to 399.

2. Filter the previous RDD to contain only entries that are divisible by
9.

3. Convert this RDD to a Spark DataFrame, specify the column name as
`Numbers`.

4. Add an index column to this Spark DataFrame, which sequentially
increases.

Additional, more advanced problems:

1. Create an RDD containing the 'Year', 'Month', 'DayofMonth', 'dDelay',
and 'Origin' columns for the airline data for all years 1990-2005.

2. Filter this RDD to contain only flight data for flights leaving O'Hare
airport.

3. Implement a linear fit to the airline flight data in this RDD.

-----

-----

### Ending the Spark Session

We must stop the `SparkContext` in order to release resources on the
instructional cluster before exiting this Notebook.

-----

In [40]:
sc.stop()

### Additional References

1. [Official Spark Documentation][osd].
2. [Spark][sn] for Data Science Notebook.
3. [Pandas and Spark][psd1] Comparison.
4. Another [Pandas & Spark][psd2] Comparison.
5. [IPython Spark][ipys] Docker image to simplify learning.
-----
[osd]: https://spark.apache.org/docs/latest/index.html
[sn]: https://github.com/donnemartin/data-science-ipython-notebooks/blob/master/spark/spark.ipynb
[psd1]: https://github.com/christophebourguignat/notebooks/blob/master/Spark-Pandas-Differences.ipynb
[psd2]: https://lab.getbase.com/pandarize-spark-dataframes/
[ipys]: https://github.com/Lab41/ipython-spark-docker

### Return to the [Week Two](index.ipynb) index.

-----