<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with the Spark SQL API</li>
    <li>Apply some transformations using the Spark SQL API</li>
</ul>    
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

Open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [None]:
import findspark
findspark.init()

I'm changing the pandas max column width option so that long values are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession
By setting this environment variable we can include extra libraries in our Spark cluster. Here no extra libraries are needed, so the value is just `pyspark-shell`.

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' pyspark-shell'
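
As a minimal sketch of how extra libraries could be passed through this variable: `spark-submit` accepts a `--packages` option with a comma-separated list of Maven coordinates, and the value is expected to end with `pyspark-shell`. The helper name `build_submit_args` and the Avro coordinate are assumptions made for illustration; they are not part of the lab, and the version would have to match your Spark installation.

```python
def build_submit_args(packages=()):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper).

    packages: iterable of Maven coordinates ("group:artifact:version").
    """
    args = []
    if packages:
        # --packages takes one comma-separated list of coordinates
        args += ["--packages", ",".join(packages)]
    # The value is expected to end with "pyspark-shell"
    args.append("pyspark-shell")
    return " ".join(args)


# Example coordinate only; the version here is an assumption
submit_args = build_submit_args(["org.apache.spark:spark-avro_2.12:3.5.0"])
print(submit_args)
```

The result would be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the SparkSession is created.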

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
.appName("Pokemons - SQL - Lab.ipynb")
.config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")
.getOrCreate())

<a id='2'></a>
## 2. Lab

Right now the metastore should be empty: no tables, and only the 'default' database.

Also check the HDFS path, it should be empty as well.

http://localhost:50070/explorer.html#/warehouse

Lists all the databases

In [None]:
spark.sql("show databases").toPandas()

Shows the current database in use

In [None]:
spark.sql("select current_database()").toPandas()

Lists all tables in the current database

In [None]:
spark.sql("show tables").toPandas()

### 2.1 Creating managed tables 

Let's create a managed table called pokemons.

Since we don't specify a database, it will belong to the 'default' database.

In [None]:
pokemons = spark.read.parquet("hdfs://localhost:9000/datalake/std/pokemon/pokemon-data/")
spark.sql("drop table if exists pokemons")
pokemons.write.mode("overwrite").saveAsTable("pokemons")

Check the HDFS directory again:

http://localhost:50070/explorer.html#/warehouse

Now there is a pokemons folder with files within it.

Let's create a managed table called sightings.

Since we don't specify a database, it will belong to the 'default' database.

In [None]:
sightings = spark.read.parquet("hdfs://localhost:9000/datalake/raw/pokemon/pokemon-sightings/")
spark.sql("drop table if exists sightings")
sightings.write.mode("overwrite").saveAsTable("sightings")

Let's check the information in the metastore

In [None]:
spark.sql("show tables").toPandas()
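
`show tables` lists table names but does not say whether a table is managed or external. As a rough sketch, the catalog API can report that: `spark.catalog.listTables` returns entries with `name` and `tableType` fields. The helper name `table_types` is an assumption made for this example.

```python
def table_types(spark, database="default"):
    """Map each table name in `database` to its type (hypothetical helper).

    Relies on spark.catalog.listTables, whose entries expose `name` and
    `tableType` (for example 'MANAGED' or 'EXTERNAL').
    """
    return {t.name: t.tableType for t in spark.catalog.listTables(database)}
```

In the notebook, `table_types(spark)` should report `MANAGED` for the two tables created with `saveAsTable`.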

If we drop a managed table, it is deleted both from the metastore and from the file system.

In [None]:
spark.sql("drop table pokemons")
spark.sql("drop table sightings")

Check the HDFS directory; it should be empty again.

http://localhost:50070/explorer.html#/warehouse

### 2.2 Creating external tables 

In this case I want to organize the tables in a custom database rather than the 'default' one

In [None]:
spark.sql("create database if not exists pokemons")

In [None]:
spark.sql("show databases").toPandas()

In [None]:
spark.sql("select current_database()").toPandas()

In [None]:
spark.sql("use pokemons")
spark.sql("select current_database()").toPandas()

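We can create external tables from data that is already stored in HDFS. Because a `location` is given, Spark only registers the table in the metastore and leaves the files where they are.

This is the preferred way of working with tables in Spark.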
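
The `create table ... using parquet location ...` statement used in this section has the same shape for every table, so it could be generated by a small helper. This is only a sketch: the name `external_table_ddl` is hypothetical, and `if not exists` is added here so that the cell could be re-run without failing, which the original statements do not do.

```python
def external_table_ddl(database, table, location, fmt="parquet"):
    """Return a CREATE TABLE statement over existing files (hypothetical helper)."""
    return (
        f"create table if not exists {database}.{table}\n"
        f"using {fmt}\n"
        f"location '{location}'"
    )


ddl = external_table_ddl(
    "pokemons",
    "sightings",
    "hdfs://localhost:9000/datalake/raw/pokemon/pokemon-sightings/",
)
print(ddl)
```

The resulting string would then be passed to `spark.sql(ddl)`.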

In [None]:
spark.sql("""
create table pokemons.pokemons
using parquet
location 'hdfs://localhost:9000/datalake/std/pokemon/pokemon-data/'
""")

In [None]:
spark.sql("show tables").toPandas()

In [None]:
spark.sql("select * from pokemons.pokemons").limit(10).toPandas()

In [None]:
spark.sql("""
create table pokemons.sightings
using parquet
location 'hdfs://localhost:9000/datalake/raw/pokemon/pokemon-sightings/'
""")

In [None]:
spark.sql("select * from pokemons.sightings").limit(10).toPandas()

Let's check the information in the metastore

In [None]:
spark.sql("show tables").toPandas()

If we drop external tables, they are deleted from the metastore but the data remains in the file system.

In [None]:
spark.sql("drop table if exists pokemons.pokemons")
spark.sql("drop table if exists pokemons.sightings")

Check the HDFS directories; they still have the files.

http://localhost:50070/explorer.html#/datalake/std/pokemon/pokemon-data/

http://localhost:50070/explorer.html#/datalake/raw/pokemon/pokemon-sightings/



In [None]:
spark.sql("show tables").toPandas()

Let's create them again to keep working.

In [None]:
spark.sql("""
create table pokemons.pokemons
using parquet
location 'hdfs://localhost:9000/datalake/std/pokemon/pokemon-data/'
""")

spark.sql("""
create table pokemons.sightings
using parquet
location 'hdfs://localhost:9000/datalake/raw/pokemon/pokemon-sightings/'
""")

### 2.3 DataFrame Transformations

Find all legendary pokemons from generations 1 and 2

In [None]:
spark.sql("select * from pokemons.pokemons where generation in (1,2) and legendary order by total desc").show()

Create a new column with emojis based on type_1

In [None]:
spark.sql("""
            select
                *,
                case 
                    when type_1='Water' then '💧'
                    when type_1='Fire' then '🔥'
                    when type_1='Electric' then '⚡️'
                    when type_1='Ice' then '❄️'
                    when type_1='Grass' then '🌿'
                    when type_1='Dragon' then '🐉'
                    when type_1='Psychic' then '🧠'
                    when type_1='Ghost' then '👻'
                    when type_1='Bug' then '🐛'
                    when type_1='Poison' then '☠️'
                end as emoji
                from pokemons.pokemons
          """).toPandas()

We can do the same thing by registering our own custom function (UDF).

In [None]:
# BooleanType is used by a UDF registered later in the lab
from pyspark.sql.types import BooleanType, StringType

# Emoji for each primary pokemon type
EMOJIS = {
    "Water": "💧",
    "Fire": "🔥",
    "Electric": "⚡️",
    "Ice": "❄️",
    "Grass": "🌿",
    "Dragon": "🐉",
    "Psychic": "🧠",
    "Ghost": "👻",
    "Bug": "🐛",
    "Poison": "☠️",
}

def emoji(type_name):
    # Returns None for types without an emoji (and for null values)
    return EMOJIS.get(type_name)

spark.udf.register("emoji_sql", emoji, StringType())

spark.sql("select *, emoji_sql(type_1) as emoji from pokemons.pokemons").toPandas()
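
Python UDFs are generally slower than built-in SQL expressions, because rows have to be handed over to a Python worker process. One possible way to keep the mapping in a single place and still run a plain `case` expression is to generate the SQL from a dictionary. This is a sketch only: `case_expression` is a hypothetical helper, and the dictionary is a subset of the mapping used in this lab.

```python
def case_expression(column, mapping, alias):
    """Build a SQL `case ... end as alias` expression from a dict (hypothetical helper).

    Keys and values are inlined without escaping, so this is only suitable
    for trusted, quote-free literals like the ones below.
    """
    branches = "\n".join(
        f"    when {column} = '{key}' then '{value}'"
        for key, value in mapping.items()
    )
    return f"case\n{branches}\nend as {alias}"


# Subset of the type-to-emoji mapping, for illustration
emoji_by_type = {"Water": "💧", "Fire": "🔥", "Grass": "🌿"}

query = f"select *, {case_expression('type_1', emoji_by_type, 'emoji')} from pokemons.pokemons"
print(query)
```

The generated `query` could be passed to `spark.sql(query)` in place of the UDF call.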

Practicing date/time functions

Add three columns: today, tomorrow and now

In [None]:
spark.sql("""
           select *,
             current_date() as today,
             date_add(current_date(),1) as tomorrow,
             current_timestamp() as now
           from pokemons.pokemons
          """).toPandas()

Calculate basic stats per pokemon type

In [None]:
spark.sql("""
            select
            type_1,
            count(*) as count,
            max(hp) as max_hp,
            round(avg(hp),2) as avg_hp,
            min(hp) as min_hp
            from pokemons.pokemons
            group by type_1
          """).show()

Join pokemons and sightings

Find all pokemon sightings in Spain

In [None]:
def in_spain(longitude, latitude):
    # Spain bounding box (aka bbox): west/east limits on longitude, south/north limits on latitude
    return (-18.3936845 < longitude < 4.5918885) and (27.4335426 < latitude < 43.9933088)

spark.udf.register("in_spain_sql", in_spain, BooleanType())
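
A bounding-box test is easy to check in plain Python before it is registered as a UDF. The sketch below is an illustration under assumptions: `in_bbox` and `SPAIN_BBOX` are hypothetical names, the city coordinates are approximate, and missing coordinates are treated as "not in the box", which the lab function does not handle.

```python
# (min_longitude, min_latitude, max_longitude, max_latitude), values taken from the lab
SPAIN_BBOX = (-18.3936845, 27.4335426, 4.5918885, 43.9933088)


def in_bbox(longitude, latitude, bbox=SPAIN_BBOX):
    """Return True if the point lies strictly inside the bounding box."""
    if longitude is None or latitude is None:
        # Assumption: rows without coordinates are treated as outside the box
        return False
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon < longitude < max_lon and min_lat < latitude < max_lat


print(in_bbox(-3.7038, 40.4168))  # Madrid -> True
print(in_bbox(2.3522, 48.8566))   # Paris -> False
print(in_bbox(-9.1393, 38.7223))  # Lisbon -> True
```

The Lisbon result shows the limitation of this approach: the box is a rectangle, so it also matches points in neighbouring countries.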

In [None]:
pokemons_in_spain = spark.sql("""
                                select
                                id, name, total, hp, 
                                location.coordinates[0] as longitude,
                                location.coordinates[1] as latitude
                                from pokemons.pokemons l join pokemons.sightings r
                                on l.id=r.pokemonId
                                where in_spain_sql(location.coordinates[0],location.coordinates[1])
                                  """).cache()

In [None]:
pokemons_in_spain.limit(10).toPandas()

### 2.4 Functions

In [None]:
spark.sql("show user functions").toPandas()

In [None]:
pd.set_option('display.max_rows', None)
spark.sql("show system functions").toPandas()

<a id='3'></a>
## 3. Tear Down

Once we complete the lab we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

Open a terminal and execute:
```sh
hadoop-stop.sh
```