# Spark DataFrames

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the Spark SQL interfaces give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations.

There are several ways to interact with Spark SQL, including SQL, the DataFrames API, and the Datasets API.
The same execution engine computes the result regardless of which API or language expresses the computation. This unification means that developers can switch back and forth between the APIs, choosing whichever provides the most natural way to express a given transformation.

# Setup

In [1]:
import os, sys
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

## Setup Spark

In [2]:
# %load ../01_Distributed_Computing_HDFS_Distributed_Data_Sets/pyspark_init_arc.py
#
# This configuration works for Spark on arc.insight.gsu.edu
#
import os, sys
# set OS environment variable
os.environ["SPARK_HOME"] = '/usr/hdp/2.4.2.0-258/spark'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell'

# add Spark library to Python
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python'))

# import package
import pyspark
from pyspark import SparkContext, SparkConf

import atexit
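def stop_my_spark():
    # 'global' is required: without it, 'del sc' makes sc a local name
    # and sc.stop() raises UnboundLocalError.
    global sc
    sc.stop()
    del sc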

# Register exit    
atexit.register(stop_my_spark)

# Configure and start Spark ... but only once.
if 'sc' not in globals():
    conf = SparkConf()
    conf.setAppName('MyFirstSpark') ## you may want to change this
    conf.setMaster('yarn-client')
    ##conf.set('spark.ui.port', '%d'%(52000+np.int(np.random.rand(1)*10000)))
    sc = SparkContext(conf=conf)
    print("Launched Spark version %s with ID %s" % (sc.version, sc.applicationId))
   

Launched Spark version 1.6.1 with ID application_1508160140652_0078


In [3]:
print("http://arc.insight.gsu.edu:8088/cluster/app/%s" % sc.applicationId)

http://arc.insight.gsu.edu:8088/cluster/app/application_1508160140652_0078


## Add SQL Context and a couple of classes

In [None]:
from pyspark.sql import SQLContext, Row, DataFrame
sqlCtx = SQLContext(sc)

In [None]:
user_df = sqlCtx.read.json('/data/yelp/user')
user_df.printSchema()

How many records?

In [None]:
user_df.count()

In [None]:
# Project three columns; show() triggers execution and prints the first rows
user_df.select('name', 'average_stars', 'compliments').show()

In [None]:
user_df.registerTempTable('users')
sqlCtx.sql("SELECT name, average_stars, compliments FROM users WHERE average_stars > 4").show()

In [None]:
review_df = sqlCtx.read.json('/data/yelp/review')
review_df.printSchema()

In [None]:
review_df.registerTempTable('reviews')

In [None]:
jnt_df = sqlCtx.sql("""
SELECT business_id,
       AVG(stars) AS Mstars,
       VARIANCE(stars) AS Vstars,
       COUNT(*) AS n
FROM users
JOIN reviews
  ON users.user_id = reviews.user_id
GROUP BY business_id
HAVING COUNT(*) > 20
""")
jnt_df.printSchema()

In [None]:
jnt_df.sort('n', ascending=False).show()

In [None]:
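# Spark 1.6 has no built-in CSV reader; use the spark-csv package loaded via PYSPARK_SUBMIT_ARGS
Employees_df = sqlCtx.read.format('com.databricks.spark.csv').load('/user/pmolnar/data/AdventureWorks/Employees.csv.gz')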

# Adventure Works

In [None]:
# %load adventureworks_spark.py
if 'sqlCtx' not in globals():
    sqlCtx = SQLContext(sc)

Employees_df = sqlCtx.read.format('com.databricks.spark.csv')\
    .options(header=True, inferschema=True)\
    .load('/user/pmolnar/data/AdventureWorks/Employees.csv.gz')

Territory_df = sqlCtx.read.format('com.databricks.spark.csv')\
    .options(header=True, inferschema=True)\
    .load('/user/pmolnar/data/AdventureWorks/SalesTerritory.csv.gz')

Orders_df = sqlCtx.read.format('com.databricks.spark.csv')\
    .options(header=True, inferschema=True)\
    .load('/user/pmolnar/data/AdventureWorks/ItemsOrdered.csv.gz')

Customers_df = sqlCtx.read.format('com.databricks.spark.csv')\
    .options(header=True, inferschema=True)\
    .load('/user/pmolnar/data/AdventureWorks/Customer.csv.gz')



In [None]:
Employees_df.printSchema()