# SparkSession

A `SparkSession` is created automatically when you start a notebook (e.g. Zeppelin, Databricks) and is available as the variable `spark`.

<img src="https://i.imgur.com/5Ai45fb.jpg" width=500px>

In [1]:
%spark
//Scala SparkSession
spark

In [2]:
%sh
spark-submit --version
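
Outside a notebook the session has to be built by hand. The following is a minimal sketch of that step, assuming local mode and a made-up application name (`session-sketch`); inside a notebook, `getOrCreate()` is expected to hand back the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode and a hypothetical app name, for illustration only.
// getOrCreate() reuses an existing session if there is one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("session-sketch")
  .getOrCreate()

// Spark version of the running session.
println(session.version)

// Tables and views visible in the current database.
session.catalog.listTables().show()
```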

 

# Show DataFrame

`df.show()` is Spark's native API for displaying data. It prints a plain-text table (20 rows by default, with long values truncated), which is not pretty.

`z.show(df)` is a Zeppelin built-in feature that renders a DataFrame as an interactive table with charting options.

In [4]:
%spark
// List all hive tables in a df
val tables_df = spark.sql("show tables")

In [5]:
%spark
tables_df.show()

In [6]:
%spark
z.show(tables_df)
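
The plain-text output of `show()` can be tuned with its overloads. The sketch below uses a small hand-made DataFrame (the rows and the long table name are hypothetical stand-ins for the `show tables` result) and assumes Spark 2.3 or later for the vertical layout.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the notebook session, or start a local one.
val session = SparkSession.builder().master("local[*]").appName("show-sketch").getOrCreate()
import session.implicits._

// Hypothetical rows standing in for the output of "show tables".
val sampleDf = Seq(
  ("default", "wdi_csv_parquet", false),
  ("default", "a_table_with_a_very_long_name_for_truncation", false)
).toDF("database", "tableName", "isTemporary")

sampleDf.show()            // up to 20 rows, strings longer than 20 characters are truncated
sampleDf.show(1)           // only the first row
sampleDf.show(5, false)    // up to 5 rows, no truncation
sampleDf.show(5, 0, true)  // up to 5 rows, no truncation, one line per column
```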

 

# Spark SQL vs DataFrame

`%sql` is the Spark SQL interpreter

`%spark.pyspark` is the PySpark interpreter

`%spark` is the Spark Scala interpreter

In [8]:
%sql

select count(1) from wdi_csv_parquet

In [9]:
%spark

// Load the whole table as a DataFrame; val is enough because it is never reassigned
val wdi_df = spark.sql("SELECT * from wdi_csv_parquet")
wdi_df.show()

In [10]:
%spark
wdi_df.printSchema()
z.show(wdi_df)
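
SQL text and the DataFrame API run on the same engine, so the same question can be asked either way. Here is a minimal sketch on made-up data: the view name `wdi_sample` and all of its values are hypothetical and only mimic the shape of `wdi_csv_parquet`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the notebook session, or start a local one.
val session = SparkSession.builder().master("local[*]").appName("sql-vs-df-sketch").getOrCreate()
import session.implicits._

// Hypothetical rows, not real World Bank figures.
val sample = Seq(
  ("Canada", 2015, "NY.GDP.MKTP.KD.ZG", 0.7),
  ("Canada", 2016, "NY.GDP.MKTP.KD.ZG", 1.0),
  ("Brazil", 2016, "NY.GDP.MKTP.KD.ZG", -3.3)
).toDF("countryName", "year", "indicatorCode", "indicatorValue")

// Register the DataFrame so that SQL can see it.
sample.createOrReplaceTempView("wdi_sample")

// The same row count, once through SQL and once through the DataFrame API.
val viaSql = session.sql("SELECT count(1) AS n FROM wdi_sample")
val viaApi = session.table("wdi_sample").count()

viaSql.show()
println(viaApi)
```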

 

# Show Historical GDP for Canada

- Rewrite the Hive query (the `%sql` cell) using the Spark DataFrame API (Scala here)

In [12]:
%sql
SELECT year, IndicatorValue as GDP
FROM wdi_csv_parquet
WHERE indicatorCode = 'NY.GDP.MKTP.KD.ZG' and countryName = 'Canada'
ORDER BY year


In [13]:
%spark
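// Keep only the rows for this indicator and for Canada
val wdi_canada_df = wdi_df.filter($"IndicatorCode" === "NY.GDP.MKTP.KD.ZG")
                    .filter($"countryName" === "Canada")

// Alias the value column as GDP, as the SQL query does, and order by year
z.show(wdi_canada_df.selectExpr("year", "IndicatorValue AS GDP").sort($"year"))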

 

# Show GDP for Each Country and Sort By Year

- Rewrite the Hive query (the `%sql` cell) using the Spark DataFrame API (Scala here)
    - hint: you can create multiple DFs

In [15]:
%sql
SELECT countryname,
       year,
       indicatorcode,
       indicatorvalue
FROM wdi_csv_parquet
WHERE indicatorcode = 'NY.GDP.MKTP.KD.ZG'
DISTRIBUTE BY countryname
SORT BY countryname, year


In [16]:
%spark

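// Filter first so that only the rows for this indicator are shuffled
val gdp_country_df = wdi_df.filter($"IndicatorCode" === "NY.GDP.MKTP.KD.ZG")
                    .repartition($"countryName")                  // DISTRIBUTE BY countryname
                    .sortWithinPartitions("countryName", "year")  // SORT BY countryname, year

z.show(gdp_country_df.selectExpr("countryName", "year", "indicatorCode", "indicatorValue"))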
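
Hive's `SORT BY` orders rows only within each partition, while `ORDER BY` produces one global ordering. The sketch below contrasts the two DataFrame counterparts, `sortWithinPartitions` and `orderBy`, on a few made-up rows (the values are hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the notebook session, or start a local one.
val session = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import session.implicits._

// Hypothetical rows, not real World Bank figures.
val sample = Seq(
  ("Canada", 2016, 1.0),
  ("Brazil", 2016, -3.3),
  ("Canada", 2015, 0.7),
  ("Brazil", 2015, -3.5)
).toDF("countryName", "year", "indicatorValue")

// DISTRIBUTE BY countryname SORT BY countryname, year:
// rows of one country land in the same partition and are sorted inside it.
val perPartition = sample
  .repartition($"countryName")
  .sortWithinPartitions("countryName", "year")

// ORDER BY countryname, year: a single ordering across all partitions.
val global = sample.orderBy("countryName", "year")

perPartition.show()
global.show()
```

Only the `orderBy` result guarantees an overall order when the rows are collected; the per-partition version avoids the cost of a global sort.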

# Find the highest GDP for each country

- Rewrite the Hive query (the `%sql` cell) using the Spark DataFrame API (Scala here)


In [18]:
%sql

SELECT wdi_csv_parquet.indicatorvalue AS value, 
       wdi_csv_parquet.year           AS year, 
       wdi_csv_parquet.countryname    AS country 
FROM   (SELECT Max(indicatorvalue) AS ind, 
               countryname 
        FROM   wdi_csv_parquet 
        WHERE  indicatorcode = 'NY.GDP.MKTP.KD.ZG' 
               AND indicatorvalue <> 0 
        GROUP  BY countryname) t1 
       INNER JOIN wdi_csv_parquet 
               ON t1.ind = wdi_csv_parquet.indicatorvalue 
                  AND t1.countryname = wdi_csv_parquet.countryname


In [19]:
%spark

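import org.apache.spark.sql.functions.max

// Highest non-zero indicator value per country
val max_gdp_country_df = wdi_df.filter($"indicatorCode" === "NY.GDP.MKTP.KD.ZG")
                        .filter($"indicatorValue" =!= 0)
                        .groupBy("countryName").agg(max("indicatorValue").as("maxValue"))

max_gdp_country_df.show(5)

// Join back to the full table to recover the year of each maximum.
// === mirrors SQL's "="; <=> would be null-safe equality, which also matches NULL to NULL.
val find_gdp_df = max_gdp_country_df.as("m_gdp")
                .join(wdi_df.as("wdi"),
                $"m_gdp.maxValue" === $"wdi.indicatorValue"
                && $"m_gdp.countryName" === $"wdi.countryName", "inner")

z.show(find_gdp_df.selectExpr("wdi.indicatorValue AS value", "wdi.year AS year", "wdi.countryName AS country"))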
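
A window function is a different technique that can answer the same question without a self-join. This is only a sketch on made-up rows (all values are hypothetical), not a translation of the query above: it attaches each country's maximum to every row and keeps the rows that equal it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// Assumption: reuse the notebook session, or start a local one.
val session = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
import session.implicits._

// Hypothetical rows, not real World Bank figures.
val sample = Seq(
  ("Canada", 2015, 0.7),
  ("Canada", 2016, 1.0),
  ("Brazil", 2015, -3.5),
  ("Brazil", 2016, -3.3)
).toDF("countryName", "year", "indicatorValue")

// One window per country, covering all of that country's rows.
val byCountry = Window.partitionBy("countryName")

val maxPerCountry = sample
  .withColumn("maxValue", max(col("indicatorValue")).over(byCountry))
  .filter(col("indicatorValue") === col("maxValue"))
  .select("indicatorValue", "year", "countryName")

maxPerCountry.show()
```

As with the join, a country whose maximum occurs in several years yields one row per tied year.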

