## Introduction to Spark DataFrame

#### In Apache Spark: 
* A **_DataFrame_** is a distributed collection of rows organized into named columns.
* It is similar to a relational table, a pandas DataFrame in Python, a data frame in R, or an Excel sheet with column headers.

#### Similar to RDD:
* Immutable: A DataFrame cannot be changed in place; a transformation returns a new DataFrame.
* Lazy evaluation: Transformations are not executed until an *action* is called.
* Distributed: The rows are partitioned across the nodes of the cluster.

#### Different from RDD:
* A DataFrame is designed to process structured data: every row follows the same schema.
* Because the schema is known, Spark can optimize a query before running it.

### RDD vs DataFrame
<img src="./images/rdd_vs_dataframe.jpg" width="600" height="400" /> 

#### How to create a DataFrame:
* Loading data from a file in one of various formats: JSON, CSV, XML, ...
* Converting an existing RDD (a kind of transformation)
* Loading data from various databases

When a DataFrame is loaded from a file such as JSON or CSV, or converted from an existing RDD, its schema can be inferred from the data or specified programmatically.

<img src="./images/DataFrame-in-Spark.png" width="600" height="400" /> 

#### Example: Loading a CSV file into a DataFrame

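Spark versions before 2.0 need an `SQLContext`, built from an existing `SparkContext` named `sc`, instead of the `spark` session used below:

`from pyspark import SparkContext` <br>
`from pyspark.sql import SQLContext` <br>
`from pyspark.sql import Row` <br>
`sqlContext = SQLContext(sc)`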


In [None]:
# With an SQLContext (older Spark):
# dividendsDF = sqlContext.read.load('data/NYSE_dividends_A.csv', format='csv', header=True, inferSchema=True)
dividendsDF = spark.read.load('data/NYSE_dividends_A.csv', format='csv', header=True, inferSchema=True)
dividendsDF.show(10)  # display the first 10 rows
# dividendsDF.printSchema()

In [None]:
dailyPricesDF = spark.read.load('data/NYSE_daily_prices_A.csv', format='csv', header=True, inferSchema=True)
dailyPricesDF.show()

### DataFrame Manipulation

You can manipulate a DataFrame in two ways:
1. Using DataFrame methods
2. Using SQL, after registering the DataFrame as a table or view

#### How to count the number of rows and columns in a DataFrame

In [None]:
dividendsDF.count(), dailyPricesDF.count()
#len(dividendsDF.columns), dividendsDF.columns
#len(dailyPricesDF.columns), dailyPricesDF.columns

#### Basic statistics (mean, standard deviation, min, max, count) of numerical columns

In [None]:
dividendsDF.describe('dividends').show()

#### Select column(s) from a DataFrame

In [None]:
dividendsDF.select('stock_symbol', 'dividends').show()

#### Filter the rows 

In [None]:
dailyPricesDF.filter(dailyPricesDF['stock_price_close'] > 200).show()
# equivalently
# dailyPricesDF.filter(dailyPricesDF.stock_price_close > 200).show()

#### GroupBy, Aggregate, and OrderBy

In [None]:
dailyPricesDF.groupBy('stock_symbol').agg({'stock_price_close': 'max'}).orderBy('max(stock_price_close)', ascending=False).show(10)

### Running SQL Queries

The *sql* function enables applications to run SQL queries and returns the result as a DataFrame.

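* Temporary view: `createOrReplaceTempView` registers a DataFrame as a session-scoped view that SQL queries can refer to by name. A global temporary view, shared across sessions, is created with `createOrReplaceGlobalTempView` and queried through the `global_temp` database.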


In [None]:
dailyPricesDF.createOrReplaceTempView('daily_prices')
dividendsDF.createOrReplaceTempView('dividends')

In [None]:
price_result = spark.sql('SELECT * FROM daily_prices LIMIT 10')
price_result.show()



# price_result.filter(price_result['stock_price_close'] > 2).show()
# result_1 = price_result.rdd.map(lambda row: (row, 1)).toDF()
# result_1.show()

In [None]:
dividend_result = spark.sql('SELECT * FROM dividends')
dividend_result.show()


#### Join on two views
* List the closing prices when companies paid dividends

In [None]:
join = spark.sql('''SELECT div.exchange, div.stock_symbol, div.date, div.dividends,
prices.stock_price_close  FROM dividends div INNER JOIN daily_prices prices
ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date) LIMIT 10''')
join.show()


##### Join and GroupBy 
* What are the maximum, minimum, and average closing prices on the dates when dividends were paid?

In [None]:
join_group = spark.sql('''SELECT div.stock_symbol, max(prices.stock_price_close) as max_close FROM dividends div 
INNER JOIN daily_prices prices ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date)
GROUP BY div.stock_symbol LIMIT 10''')
join_group.show()
    

In [None]:
join_group_agg = spark.sql('''SELECT div.stock_symbol, max(prices.stock_price_close) maximum,
min(prices.stock_price_close) minimum, avg(prices.stock_price_close) average FROM dividends div 
INNER JOIN daily_prices prices ON(div.stock_symbol=prices.stock_symbol AND div.date=prices.date) 
GROUP BY div.stock_symbol LIMIT 10''')
join_group_agg.show()

In [None]:
result = join_group_agg.collect()

In [None]:
for item in result:
    print(item)