# SparkSession

A SparkSession is created automatically when you start a notebook (e.g. Zeppelin, Databricks) and is available as the variable `spark`.

<img src="https://i.imgur.com/5Ai45fb.jpg" width=500px>

In [1]:
%spark
//Scala SparkSession
spark

In [2]:
%spark.pyspark
#PySpark SparkSession
spark

 

# Show DataFrame

`df.show()` is the native Spark API for displaying data, but its plain-text output is not pretty.

`z.show(df)` is a Zeppelin built-in feature that shows a df result in a nicely formatted way.

In [4]:
%spark.pyspark

#List all hive tables in a df
tables_df = spark.sql("show tables")

In [5]:
%spark.pyspark

tables_df.show()

In [6]:
%spark.pyspark
z.show(tables_df)

 

# Spark SQL vs DataFrame

`%sql` is the Spark SQL interpreter

`%spark.pyspark` is the PySpark interpreter

`%spark` is the Spark Scala interpreter

In [8]:
%sql

select count(1) from wdi_csv_parquet

In [9]:
%spark.pyspark

#Read Hive data to a df (this is lazy)
wdi_df = spark.sql("SELECT * from wdi_csv_parquet")
#Persist df in memory for fast future access
wdi_df = wdi_df.cache()
wdi_df.printSchema()

#Spark action is eager
z.show(wdi_df.count())
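
The lazy/eager distinction noted in the comments above can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: a generator stands in for a lazy transformation, and consuming it stands in for an action such as `count()`. The `rows` list, the `is_canada` helper, and the `evaluated` log are hypothetical names made up for this sketch.

```python
# Plain-Python analogy for lazy transformations and eager actions.
# This does not use Spark; all names and data here are hypothetical.
rows = [
    {"countryname": "Canada", "year": 2000},
    {"countryname": "Canada", "year": 2001},
    {"countryname": "Brazil", "year": 2000},
]

evaluated = []  # records which rows have actually been inspected


def is_canada(row):
    evaluated.append(row["year"])
    return row["countryname"] == "Canada"


# "Transformation": building the generator inspects no rows yet.
canada_rows = (row for row in rows if is_canada(row))
inspected_before_action = len(evaluated)

# "Action": consuming the generator forces every row to be inspected.
canada_count = sum(1 for _ in canada_rows)
inspected_after_action = len(evaluated)

print(inspected_before_action, inspected_after_action, canada_count)
```

In the same spirit, `spark.sql(...)` and `cache()` only describe work to be done; nothing is read until an action such as `count()` runs.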

 

# Show Historical GDP for Canada

- Rewrite the Hive query (the `%sql` cell below) using a PySpark df

In [11]:
%sql
SELECT year, IndicatorValue as GDP
FROM wdi_csv_parquet
WHERE indicatorCode = 'NY.GDP.MKTP.KD.ZG' and countryName = 'Canada'
ORDER BY year


In [12]:
%spark.pyspark
from pyspark.sql.functions import col

#Filter first, then sort ascending by year to match ORDER BY year in the Hive query
wdi_canada_df = wdi_df\
    .where(col('countryname') == 'Canada')\
    .where(col('indicatorcode') == 'NY.GDP.MKTP.KD.ZG')\
    .sort('year')


#use z.show to display df result and draw a bar chart
z.show(wdi_canada_df.select("year", col("IndicatorValue").alias("GDP")))

 

# Show GDP for Each Country and Sort By Year

- Rewrite the Hive query (the `%sql` cell below) using a PySpark df
    - hint: you can create multiple DFs

In [14]:
%sql
SELECT countryname,
       year,
       indicatorcode,
       indicatorvalue
FROM wdi_csv_parquet
WHERE indicatorcode = 'NY.GDP.MKTP.KD.ZG'
DISTRIBUTE BY countryname
SORT BY countryname, year


In [15]:
%spark.pyspark

#Your solution goes here (please remove this comment from your copy)
from pyspark.sql.functions import col

wdi_gdp_df = wdi_df\
    .where(col('indicatorcode') == 'NY.GDP.MKTP.KD.ZG')\
    .sort(['countryname', 'year'], ascending=True)

#use z.show to display df result and draw a bar chart
z.show(wdi_gdp_df.select("countryname", "year", "indicatorcode", "IndicatorValue"))
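
The Hive query uses `DISTRIBUTE BY` with `SORT BY`, which sends all rows of one country to the same partition and sorts only within each partition, whereas `sort` on a DataFrame produces a single total ordering. On a DataFrame, the closer equivalent would likely be `repartition('countryname')` followed by `sortWithinPartitions('countryname', 'year')`. The sketch below models the partition-then-sort semantics in plain Python, not Spark; the sample rows, the `distribute_and_sort` helper, and the CRC-based partitioner are assumptions made for illustration.

```python
import zlib
from collections import defaultdict

# Hypothetical sample rows: (countryname, year, indicatorvalue)
rows = [
    ("Canada", 2001, 1.8),
    ("Brazil", 2000, 4.4),
    ("Canada", 2000, 5.2),
    ("Japan", 2001, 0.4),
    ("Brazil", 2001, 1.4),
    ("Japan", 2000, 2.8),
]


def distribute_and_sort(rows, num_partitions):
    """Model DISTRIBUTE BY countryname SORT BY countryname, year."""
    partitions = defaultdict(list)
    for row in rows:
        # Deterministic hash, so a given country always lands in one partition.
        key = zlib.crc32(row[0].encode("utf-8")) % num_partitions
        partitions[key].append(row)
    # Sort inside each partition only; no order is imposed across partitions.
    return {
        pid: sorted(part, key=lambda r: (r[0], r[1]))
        for pid, part in partitions.items()
    }


partitioned = distribute_and_sort(rows, num_partitions=2)
for partition_id, part in sorted(partitioned.items()):
    print(partition_id, part)
```

Each country's rows stay together and are ordered by year, but rows from different partitions are not ordered relative to each other. A global `sort` gives a stronger guarantee at the cost of a full shuffle and sort.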

# Find the highest GDP for each country

- Rewrite the Hive query (the `%sql` cell below) using a PySpark df


In [17]:
%sql

SELECT wdi_csv_parquet.indicatorvalue AS value, 
       wdi_csv_parquet.year           AS year, 
       wdi_csv_parquet.countryname    AS country 
FROM   (SELECT Max(indicatorvalue) AS ind, 
               countryname 
        FROM   wdi_csv_parquet 
        WHERE  indicatorcode = 'NY.GDP.MKTP.KD.ZG' 
               AND indicatorvalue <> 0 
        GROUP  BY countryname) t1 
       INNER JOIN wdi_csv_parquet 
               ON t1.ind = wdi_csv_parquet.indicatorvalue 
                  AND t1.countryname = wdi_csv_parquet.countryname


In [18]:
%spark.pyspark
from pyspark.sql.functions import col
#Alias max so it does not shadow Python's built-in max
from pyspark.sql.functions import max as max_

#Find the max non-zero GDP value per country, then join back to recover the year
wdi_max_gdp_df = wdi_df\
    .where(col('indicatorcode') == 'NY.GDP.MKTP.KD.ZG')\
    .where(col('indicatorvalue') != 0)\
    .groupby(col('countryname'))\
    .agg(max_('indicatorvalue').alias('indicatorvalue'))\
    .join(wdi_df, ['countryname', 'indicatorvalue'])

    
z.show(wdi_max_gdp_df.select("indicatorvalue", "year", "countryname"))
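
The aggregate-then-join pattern used above can be checked by hand on a few rows. The following plain-Python sketch (not Spark code) mirrors the two steps: compute the maximum non-zero value per country, then join back on country and value to recover the year. The sample rows and the `OTHER.INDICATOR` code are hypothetical.

```python
GDP_GROWTH = "NY.GDP.MKTP.KD.ZG"

# Hypothetical sample rows mirroring the columns used in the query.
rows = [
    {"countryname": "Canada", "year": 1999, "indicatorcode": GDP_GROWTH, "indicatorvalue": 5.1},
    {"countryname": "Canada", "year": 2000, "indicatorcode": GDP_GROWTH, "indicatorvalue": 5.2},
    {"countryname": "Canada", "year": 2001, "indicatorcode": GDP_GROWTH, "indicatorvalue": 0.0},
    {"countryname": "Brazil", "year": 2000, "indicatorcode": GDP_GROWTH, "indicatorvalue": 4.4},
    {"countryname": "Brazil", "year": 2001, "indicatorcode": GDP_GROWTH, "indicatorvalue": 1.4},
    {"countryname": "Brazil", "year": 2001, "indicatorcode": "OTHER.INDICATOR", "indicatorvalue": 9.9},
]

# Step 1: filter, then keep the maximum value per country (the GROUP BY subquery).
max_by_country = {}
for row in rows:
    if row["indicatorcode"] == GDP_GROWTH and row["indicatorvalue"] != 0:
        current = max_by_country.get(row["countryname"])
        if current is None or row["indicatorvalue"] > current:
            max_by_country[row["countryname"]] = row["indicatorvalue"]

# Step 2: inner join back to all rows on (countryname, indicatorvalue)
# to recover the year in which the maximum occurred.
result = [
    (row["indicatorvalue"], row["year"], row["countryname"])
    for row in rows
    if max_by_country.get(row["countryname"]) == row["indicatorvalue"]
]

print(sorted(result, key=lambda r: r[2]))
```

Because the join goes back to the unfiltered table, a row from a different indicator that happens to share the same country and value would also match. Filtering the joined side by `indicatorcode` would rule that out.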

