# SparkSession

A SparkSession is created automatically when you start a notebook (e.g. Zeppelin, Databricks) and is available as the variable `spark`.

<img src="https://i.imgur.com/5Ai45fb.jpg" width=500px>

In [1]:
%spark
//Scala SparkSession
spark

 

# Show DataFrame

`df.show()` is Spark's native API for displaying data. It prints a plain-text table, which is not pretty.

`z.show(df)` is a Zeppelin built-in feature that renders a DataFrame result as a formatted table.

In [3]:
%spark
sqlContext.sql("show tables").show()


In [4]:
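%spark.pyspark
# Define tables_df in the PySpark interpreter before displaying it
tables_df = spark.sql("show tables")
tables_df.show()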

In [5]:
%spark.pyspark
z.show(tables_df)

 

# Spark SQL vs DataFrame

`%sql` is the Spark SQL interpreter

`%spark.pyspark` is the PySpark interpreter

`%spark` is the Spark Scala interpreter

In [7]:
%sql

select count(1) from wdi_csv_parquet

In [8]:
%spark

val wdi_df = spark.sql("SELECT * FROM wdi_csv_parquet")
wdi_df.show()


In [9]:
%spark
wdi_df.printSchema()
z.show(wdi_df)

 

# Show Historical GDP for Canada


In [11]:
%sql
SELECT year, IndicatorValue as GDP
FROM wdi_csv_parquet
WHERE indicatorCode = 'NY.GDP.MKTP.KD.ZG' and countryName = 'Canada'
ORDER BY year


In [12]:
%spark
val filterDF = wdi_df
  .filter($"countryName" === "Canada")
  .filter($"indicatorCode" === "NY.GDP.MKTP.KD.ZG")
z.show(filterDF.selectExpr("year", "IndicatorValue as GDP"))

 

# Show GDP for Each Country and Sort By Year



In [14]:
%sql
SELECT countryname,
       year,
       indicatorcode,
       indicatorvalue
FROM wdi_csv_parquet
WHERE indicatorcode = 'NY.GDP.MKTP.KD.ZG'
DISTRIBUTE BY countryname
SORT BY countryname, year


In [15]:
%spark
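val gdp_country_part_DF = wdi_df
  .filter($"indicatorCode" === "NY.GDP.MKTP.KD.ZG")  // filter first so less data is shuffled
  .repartition($"countryName")                        // DISTRIBUTE BY countryname
  .sortWithinPartitions("countryName", "year")        // SORT BY countryname, year (per partition, not global)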
z.show(gdp_country_part_DF.selectExpr("countryname", "year", "indicatorcode", "indicatorvalue"))
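
`DISTRIBUTE BY` followed by `SORT BY` sends all rows with the same key to one partition and then sorts inside each partition; it does not produce one globally ordered result the way `ORDER BY` does. The sketch below is a plain-Python model of that behaviour, not Spark code: the helper names `distribute_by` and `sort_by`, the toy hash, and the sample rows are all assumptions made up for illustration.

```python
# Plain-Python model of: DISTRIBUTE BY countryname SORT BY countryname, year
# Not Spark API. The hash below is a toy stand-in for Spark's partitioner.

def distribute_by(rows, key, num_partitions):
    """Send every row with the same key value to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Toy deterministic hash: sum of character codes modulo partition count.
        index = sum(ord(ch) for ch in row[key]) % num_partitions
        partitions[index].append(row)
    return partitions


def sort_by(partitions, keys):
    """Sort rows inside each partition; partitions are never merged."""
    return [
        sorted(part, key=lambda r: tuple(r[k] for k in keys))
        for part in partitions
    ]


# Invented sample rows (country and year only).
rows = [
    {"countryname": "Canada", "year": 2011},
    {"countryname": "Brazil", "year": 2011},
    {"countryname": "Canada", "year": 2010},
    {"countryname": "Brazil", "year": 2010},
    {"countryname": "Chile", "year": 2010},
]

result = sort_by(distribute_by(rows, "countryname", 2), ["countryname", "year"])
for i, part in enumerate(result):
    print(i, [(r["countryname"], r["year"]) for r in part])
```

Each printed partition is ordered by country and year, and a given country only ever appears in one partition, but reading the partitions one after another is not guaranteed to give a globally sorted list.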

# Find the highest GDP for each country

- Rewrite the Hive query in the `%sql` cell using the Spark DataFrame API (the `%spark` cell that follows it uses Scala)


In [17]:
%sql

SELECT wdi_csv_parquet.indicatorvalue AS value, 
       wdi_csv_parquet.year           AS year, 
       wdi_csv_parquet.countryname    AS country 
FROM   (SELECT Max(indicatorvalue) AS ind, 
               countryname 
        FROM   wdi_csv_parquet 
        WHERE  indicatorcode = 'NY.GDP.MKTP.KD.ZG' 
               AND indicatorvalue <> 0 
        GROUP  BY countryname) t1 
       INNER JOIN wdi_csv_parquet 
               ON t1.ind = wdi_csv_parquet.indicatorvalue 
                  AND t1.countryname = wdi_csv_parquet.countryname


In [18]:
%spark
// Subquery t1: highest non-zero GDP value per country
val highest_gdp_country_df = wdi_df
  .filter($"indicatorCode" === "NY.GDP.MKTP.KD.ZG")
  .filter($"indicatorvalue" =!= 0)
  .groupBy("countryName").agg(expr("max(indicatorvalue) as ind"))
  
// Join back to the full table to recover the year of each maximum
val gdp_df = highest_gdp_country_df.as("t1").join(
    wdi_df.as("wdi"), 
    $"t1.ind" === $"wdi.indicatorvalue"
        && $"t1.countryName" === $"wdi.countryName",
    "inner"
)


z.show(gdp_df.selectExpr("wdi.indicatorvalue as value", "wdi.year", "t1.countryName as country"))
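
The query above follows a common two-step pattern: compute the maximum per group, then join back to the original rows to recover the other columns (here, the year). The following is a minimal plain-Python sketch of that pattern, not Spark code; the sample rows and the names `max_by_country` and `highest` are invented for illustration.

```python
# Plain-Python model of "max per group, then join back" (not Spark API).
# The sample rows are invented for illustration only.
rows = [
    {"countryname": "A", "year": 2000, "indicatorvalue": 1.5},
    {"countryname": "A", "year": 2001, "indicatorvalue": 3.0},
    {"countryname": "A", "year": 2002, "indicatorvalue": 3.0},  # ties the maximum
    {"countryname": "B", "year": 2000, "indicatorvalue": 0.0},  # excluded by <> 0
    {"countryname": "B", "year": 2001, "indicatorvalue": -2.0},
]

# Step 1 (subquery t1): maximum non-zero value per country.
max_by_country = {}
for r in rows:
    value = r["indicatorvalue"]
    if value != 0:
        country = r["countryname"]
        max_by_country[country] = max(max_by_country.get(country, value), value)

# Step 2 (inner join): keep the original rows whose value equals that maximum.
highest = [
    (r["indicatorvalue"], r["year"], r["countryname"])
    for r in rows
    if max_by_country.get(r["countryname"]) == r["indicatorvalue"]
]

print(highest)
# [(3.0, 2001, 'A'), (3.0, 2002, 'A'), (-2.0, 2001, 'B')]
```

Because the join matches on value, a country whose maximum occurs in more than one year yields one output row per tied year, as country `A` does here.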
