## Introduction to Spark DataFrame

#### In Apache Spark: 
* A **_DataFrame_** is a distributed collection of rows organized into named columns.
* It is similar to a relational table, a pandas DataFrame, an R data frame, or an Excel sheet with column headers.

#### Similar to RDD:
* Immutable: a DataFrame cannot be changed in place; a transformation returns a new DataFrame.
* Lazy evaluation: transformations are not executed until an *action* is called.
* Distributed: the rows are partitioned across the nodes of the cluster.

#### Different from RDD:
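* A DataFrame is designed to process structured data: every row follows the same schema.
* Because the schema is known, Spark can optimize a query (through the Catalyst optimizer) before running it.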

### RDD vs DataFrame
<img src="./images/rdd_vs_dataframe.jpg" width="600" height="400" /> 

#### How to create a DataFrame:
* Load data from a file in one of various formats: JSON, CSV, XML, ...
* Convert an existing RDD (a kind of transformation)
* Load data from various databases
* Specify the schema programmatically

<img src="./images/DataFrame-in-Spark.png" width="600" height="400" /> 

#### Example: Loading a CSV file into a DataFrame

In [1]:
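from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row

# Reuse the SparkContext provided by the notebook, or create one if none exists.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)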


In [2]:
dividendsDF = sqlContext.read.load('data/NYSE_dividends_A.csv', format='csv', header=True, inferSchema=True)
dividendsDF.show(10)
#dividendsDF.printSchema()

+--------+------------+--------------------+---------+
|exchange|stock_symbol|                date|dividends|
+--------+------------+--------------------+---------+
|    NYSE|         AIT|2009-11-12 00:00:...|     0.15|
|    NYSE|         AIT|2009-08-12 00:00:...|     0.15|
|    NYSE|         AIT|2009-05-13 00:00:...|     0.15|
|    NYSE|         AIT|2009-02-11 00:00:...|     0.15|
|    NYSE|         AIT|2008-11-12 00:00:...|     0.15|
|    NYSE|         AIT|2008-08-13 00:00:...|     0.15|
|    NYSE|         AIT|2008-05-13 00:00:...|     0.15|
|    NYSE|         AIT|2008-02-13 00:00:...|     0.15|
|    NYSE|         AIT|2007-11-13 00:00:...|     0.15|
|    NYSE|         AIT|2007-08-13 00:00:...|     0.15|
+--------+------------+--------------------+---------+
only showing top 10 rows



In [3]:
dailyPricesDF = sqlContext.read.load('data/NYSE_daily_prices_A.csv', format='csv', header=True, inferSchema=True)
dailyPricesDF.show()

+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|exchange|stock_symbol|                date|stock_price_open|stock_price_high|stock_price_low|stock_price_close|stock_volume|stock_price_adj_close|
+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|    NYSE|         AEA|2010-02-08 00:00:...|            4.42|            4.42|           4.21|             4.24|      205500|                 4.24|
|    NYSE|         AEA|2010-02-05 00:00:...|            4.42|            4.54|           4.22|             4.41|      194300|                 4.41|
|    NYSE|         AEA|2010-02-04 00:00:...|            4.55|            4.69|           4.39|             4.42|      233800|                 4.42|
|    NYSE|         AEA|2010-02-03 00:00:...|            4.65|            4.69|            4.5|             4.55|

### DataFrame Manipulation

You can manipulate a DataFrame in two ways:
1. Using DataFrame functions
2. Using SQL, after registering the DataFrame as a table or view

#### How to count the rows and columns of a DataFrame

In [4]:
dividendsDF.count(), dailyPricesDF.count()
#len(dividendsDF.columns), dividendsDF.columns
#len(dailyPricesDF.columns), dailyPricesDF.columns

(8719, 735026)

#### Basic statistics (count, mean, standard deviation, min, max) of numerical columns

In [5]:
dividendsDF.describe('dividends').show()

+-------+-------------------+
|summary|          dividends|
+-------+-------------------+
|  count|               8719|
|   mean|0.22300571957793303|
| stddev| 0.6983857030438609|
|    min|                0.0|
|    max|             34.958|
+-------+-------------------+



#### Select column(s) from a DataFrame

In [41]:
dividendsDF.select('stock_symbol').show()

+------------+
|stock_symbol|
+------------+
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
|         AIT|
+------------+
only showing top 20 rows



#### Filter the rows 

In [6]:
dailyPricesDF.filter(dailyPricesDF['stock_price_close'] > 200).show()
# equivalently
# dailyPricesDF.filter(dailyPricesDF.stock_price_close > 200).show()

+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|exchange|stock_symbol|                date|stock_price_open|stock_price_high|stock_price_low|stock_price_close|stock_volume|stock_price_adj_close|
+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|    NYSE|         ALX|2010-02-08 00:00:...|          280.05|           280.7|         272.74|           273.24|        4000|               273.24|
|    NYSE|         ALX|2010-02-05 00:00:...|           272.7|          281.32|         272.01|           281.32|       11400|               281.32|
|    NYSE|         ALX|2010-02-04 00:00:...|          278.06|          278.06|         271.55|           271.77|        5300|               271.77|
|    NYSE|         ALX|2010-02-03 00:00:...|          287.73|          287.73|         278.55|           280.06|

#### GroupBy, Aggregate, and OrderBy

In [7]:
dailyPricesDF.groupBy('stock_symbol').agg({'stock_price_close': 'max'}).orderBy('max(stock_price_close)', ascending=False).show(10)

+------------+----------------------+
|stock_symbol|max(stock_price_close)|
+------------+----------------------+
|         ALX|                467.25|
|         ADI|                182.62|
|         ACL|                175.47|
|         AXP|                 169.0|
|         AZO|                166.82|
|         AIG|                 156.5|
|         AET|                153.93|
|         AVB|                148.52|
|         AEG|                147.58|
|         APA|                 146.8|
+------------+----------------------+
only showing top 10 rows



### Running SQL Queries

The *sql* function enables applications to run SQL queries and returns the result as a DataFrame.

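* Temporary view: `createOrReplaceTempView` registers a DataFrame under a name that SQL queries in the current session can use.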


In [8]:
dailyPricesDF.createOrReplaceTempView('daily_prices')
dividendsDF.createOrReplaceTempView('dividends')

In [9]:
price_result = sqlContext.sql('SELECT * FROM daily_prices LIMIT 10')
price_result.show()

#price_result.filter(price_result['stock_price_close'] > 2).show()
#result_1 = price_result.rdd.map(lambda row: (row, 1)).toDF()
#result_1.show()

+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|exchange|stock_symbol|                date|stock_price_open|stock_price_high|stock_price_low|stock_price_close|stock_volume|stock_price_adj_close|
+--------+------------+--------------------+----------------+----------------+---------------+-----------------+------------+---------------------+
|    NYSE|         AEA|2010-02-08 00:00:...|            4.42|            4.42|           4.21|             4.24|      205500|                 4.24|
|    NYSE|         AEA|2010-02-05 00:00:...|            4.42|            4.54|           4.22|             4.41|      194300|                 4.41|
|    NYSE|         AEA|2010-02-04 00:00:...|            4.55|            4.69|           4.39|             4.42|      233800|                 4.42|
|    NYSE|         AEA|2010-02-03 00:00:...|            4.65|            4.69|            4.5|             4.55|

In [10]:
dividend_result = sqlContext.sql('SELECT * FROM dividends')
dividend_result.show()


+--------+------------+--------------------+---------+
|exchange|stock_symbol|                date|dividends|
+--------+------------+--------------------+---------+
|    NYSE|         AIT|2009-11-12 00:00:...|     0.15|
|    NYSE|         AIT|2009-08-12 00:00:...|     0.15|
|    NYSE|         AIT|2009-05-13 00:00:...|     0.15|
|    NYSE|         AIT|2009-02-11 00:00:...|     0.15|
|    NYSE|         AIT|2008-11-12 00:00:...|     0.15|
|    NYSE|         AIT|2008-08-13 00:00:...|     0.15|
|    NYSE|         AIT|2008-05-13 00:00:...|     0.15|
|    NYSE|         AIT|2008-02-13 00:00:...|     0.15|
|    NYSE|         AIT|2007-11-13 00:00:...|     0.15|
|    NYSE|         AIT|2007-08-13 00:00:...|     0.15|
|    NYSE|         AIT|2007-05-11 00:00:...|     0.12|
|    NYSE|         AIT|2007-02-13 00:00:...|     0.12|
|    NYSE|         AIT|2006-11-13 00:00:...|     0.12|
|    NYSE|         AIT|2006-08-11 00:00:...|     0.12|
|    NYSE|         AIT|2006-05-11 00:00:...|     0.12|
|    NYSE|

#### Join on two views
* List the closing prices when companies paid dividends

In [11]:
join = sqlContext.sql('''SELECT div.exchange, div.stock_symbol, div.date, div.dividends,
prices.stock_price_close  FROM dividends div INNER JOIN daily_prices prices
ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date) LIMIT 10''')
join.show()


+--------+------------+--------------------+---------+-----------------+
|exchange|stock_symbol|                date|dividends|stock_price_close|
+--------+------------+--------------------+---------+-----------------+
|    NYSE|         AEA|2009-11-20 00:00:...|    0.063|             6.24|
|    NYSE|         AEA|2009-08-21 00:00:...|    0.063|              5.9|
|    NYSE|         AEA|2009-05-21 00:00:...|    0.063|             4.35|
|    NYSE|         AEA|2009-02-20 00:00:...|    0.063|             1.01|
|    NYSE|         AEA|2008-11-21 00:00:...|    0.063|              1.5|
|    NYSE|         AEA|2008-08-22 00:00:...|    0.125|             4.94|
|    NYSE|         AEA|2008-05-22 00:00:...|    0.125|             6.99|
|    NYSE|         AEA|2008-02-22 00:00:...|    0.125|             7.07|
|    NYSE|         AEA|2007-11-23 00:00:...|    0.125|             8.37|
|    NYSE|         AEA|2007-08-24 00:00:...|    0.125|            12.86|
+--------+------------+--------------------+-------

#### Join and GroupBy
* What are the maximum, minimum, and average closing prices on the dates when dividends were paid?

In [12]:
join_group = sqlContext.sql('''SELECT div.stock_symbol, max(prices.stock_price_close) as max_close FROM dividends div 
INNER JOIN daily_prices prices ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date)
GROUP BY div.stock_symbol LIMIT 10''')
join_group.show()
    

+------------+---------+
|stock_symbol|max_close|
+------------+---------+
|         APX|    11.75|
|         AIV|    55.58|
|         AVY|    69.12|
|         AVX|    92.62|
|         AXP|   147.25|
|         ARL|    16.44|
|         AAV|    20.58|
|         ARM|    30.06|
|         ASH|     68.3|
|         AEB|     25.8|
+------------+---------+



In [13]:
join_group_agg = sqlContext.sql('''SELECT div.stock_symbol, max(prices.stock_price_close) maximum,
min(prices.stock_price_close) minimum, avg(prices.stock_price_close) average FROM dividends div 
INNER JOIN daily_prices prices ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date) 
GROUP BY div.stock_symbol LIMIT 10''')
join_group_agg.show()

+------------+-------+-------+------------------+
|stock_symbol|maximum|minimum|           average|
+------------+-------+-------+------------------+
|         APX|  11.75|   5.98|  9.57041666666669|
|         AIV|  55.58|    7.7| 33.90645161290322|
|         AVY|  69.12|  18.62| 44.17613636363635|
|         AVX|  92.62|   8.08|  19.7403448275862|
|         AXP| 147.25|  14.44| 45.93118518518519|
|         ARL|  16.44|   4.13| 10.42578947368421|
|         AAV|  20.58|   2.23| 12.31511111111111|
|         ARM|  30.06|   2.66|17.342058823529413|
|         ASH|   68.3|   6.82| 40.91733944954128|
|         AEB|   25.8|    5.0|18.488750000000003|
+------------+-------+-------+------------------+

