<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Files](#2.1)
* [3. Dataset Documentation](#3)
* [4. DataFrame Creation](#4)
* [5. DataFrame Inspection](#5)
  * [5.1 Schema Inspection](#5.1)
  * [5.2 Content Inspection](#5.2)
* [6. DataFrame Transformations](#6)
  * [6.1 Column Manipulation](#6.1)
  * [6.2 Row Filtering](#6.2)
  * [6.3 Row Sorting](#6.3)
* [7. Tear Down](#7)
  * [7.1 Stop Hadoop](#7.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with Spark DataFrames API</li>
    <li>Apply some transformations using Spark DataFrames API</li>
</ul>    
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [None]:
import findspark
findspark.init()

Set the pandas maximum column width to unlimited so that long values are displayed in full:

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession
The `PYSPARK_SUBMIT_ARGS` environment variable lets us pass extra options, such as additional libraries, to our Spark cluster. Here we pass none:

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' pyspark-shell'

The first step is always to create the SparkSession:

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
.appName("Pokemon - DataFrames - Lab")
.config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")
.getOrCreate())

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Files

To complete this lab you must first complete **'Pokemon - RAW to STD - DataFrames'**.<br/>
Check that the data is ready in HDFS:

http://localhost:50070/explorer.html#/datalake/std/pokemon/

<a id='3'></a>
## 3. Dataset Documentation

### Metadata

id: ID for each pokemon <br/>
name: Name of each pokemon<br/>
type_1: Each pokemon has a type, this determines weakness/resistance to attacks<br/>
type_2: Some pokemon are dual type and have 2<br/>
total: sum of all stats that come after this, a general guide to how strong a pokemon is<br/>
hp: hit points, or health, defines how much damage a pokemon can withstand before fainting<br/>
attack: the base modifier for normal attacks (e.g. Scratch, Punch)<br/>
defense: the base damage resistance against normal attacks<br/>
sp_atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)<br/>
sp_def: the base damage resistance against special attacks<br/>
speed: determines which pokemon attacks first each round<br/>
generation: pokemon generation<br/>
legendary: determines if the pokemon is legendary or not<br/>

<a id='4'></a>
## 4. DataFrame Creation

We can create a DataFrame from external sources (such as HDFS or databases) using the DataSource API, which is exposed through `read` on the SparkSession.
#### read: creates a Spark DataFrame using the Spark DataSource API

In [None]:
df = spark.read.parquet("hdfs://localhost:9000/datalake/std/pokemon/pokemon-data/")

#### createDataFrame: creates a Spark DataFrame from a python collection
We can also create a DataFrame from a python data structure like lists:

In [None]:
from pyspark.sql import Row
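# Import only the types used in the schema below
from pyspark.sql.types import (
    BooleanType, IntegerType, StringType, StructField, StructType
)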

data = [Row(0,"MewThree","Diamond","Energy", 500, 500, 500,500,500,500,500,0,True)]

schema = StructType([
    StructField("id",IntegerType(),True),
    StructField("name",StringType(),True),
    StructField("type_1",StringType(),True),
    StructField("type_2",StringType(),True),
    StructField("total",IntegerType(),True),
    StructField("hp",IntegerType(),True),
    StructField("attack",IntegerType(),True),
    StructField("defense",IntegerType(),True),
    StructField("sp_atk",IntegerType(),True),
    StructField("sp_def",IntegerType(),True),
    StructField("speed",IntegerType(),True),
    StructField("generation",IntegerType(),True),
    StructField("legendary",BooleanType(),True)
])

df2 = spark.createDataFrame(data, schema)
df2.toPandas()

We can also specify the schema using a DDL string

In [None]:
ddl = """
            id int,
            name string,
            type_1 string,
            type_2 string,
            total int,
            hp int,
            attack int,
            defense int,
            sp_atk int,
            sp_def int,
            speed int,
            generation int,
            legendary boolean
        """
df3 = spark.createDataFrame(data, ddl)
df3.toPandas()

<a id='5'></a>
## 5. DataFrame Inspection
<a id='5.1'></a>
### 5.1 Schema Inspection

#### schema : Returns a Spark schema object

In [None]:
df.schema

In [None]:
type(df.schema)

#### dtypes : Returns a list of tuples with column names and data types

In [None]:
df.dtypes

#### columns : Returns a list of column names

In [None]:
df.columns

#### printSchema : Prints the DataFrame schema

In [None]:
df.printSchema()

<a id='5.2'></a>
### 5.2 Content Inspection
Actions are functions that <b>return values, calculations or information</b> about a DataFrame.<br/>
Actions <b>are eager</b> and force DataFrame computation immediately.

#### show: Shows the first n rows formatted as a table

In [None]:
df.show()

Values in the name column are truncated because of their length.<br/>
We can tell Spark not to truncate column values:

In [None]:
df.show(truncate=False)

#### collect : Retrieves all the rows to the driver
<span style="color:red">CAUTION: Collecting a very large DataFrame (millions of rows) can exhaust the driver's memory and crash your application</span>

In [None]:
rows = df.collect()
rows

We now have all the rows (and their data) in the `rows` Python variable.<br/>
We can access the values by key, as in a Python dictionary, or by attribute:

In [None]:
rows[0]["name"]
rows[0].name

#### toPandas : Retrieves all the rows to the driver and returns a pandas.DataFrame
<span style="color:red">CAUTION: Converting a very large DataFrame (millions of rows) to pandas can exhaust the driver's memory and crash your application</span>

In [None]:
df.toPandas()

#### take : Returns n first rows

In [None]:
df.take(2)

#### tail : Returns n last rows

In [None]:
df.tail(2)

#### head : Returns the first row 

In [None]:
df.head()

#### first : Returns the first row 

In [None]:
df.first()

#### count: Count total number of rows

In [None]:
df.count()

#### describe: Returns a new DataFrame with stats
This is a transformation!

In [None]:
stats_df = df.describe()
stats_df.toPandas()

#### summary: Returns a new DataFrame with stats 
This is a transformation!

In [None]:
summary_df = df.summary()
summary_df.toPandas()

We can use summary to just compute the stats we need

In [None]:
df.summary("stddev").toPandas()

<a id='6'></a>
## 6. DataFrame Transformations

Transformations are **functions** that can be applied to DataFrames **returning**  **new DataFrames**<br/>
Transformations **are lazy** and when applied they are just stacked until an **action** triggers the computation

<a id='6.1'></a>

### 6.1 Column Manipulation

#### select: Returns a new DataFrame with the column expressions

**Flavour 1** select transformation passing a list of column names (**list of strings**):

In [None]:
df.select("name", "type_1").show(5,False)

In [None]:
df.select(["name", "type_1"]).show(5,False)

In [None]:
df.select("*").show(5,False)

**Flavour 2** select transformation passing a list of Spark columns (**list of Column**):<br/>
Spark comes with a bunch of functions out of the box<br/>
<a href="http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions">pyspark.sql.functions</a>

In [None]:
import pyspark.sql.functions as F

type(F.upper(df['name']))

In [None]:
df.select(F.upper(df['name'])).show(5,False)

We can also create a Column expression with the **expr** function:

In [None]:
df.select(F.expr("upper(name)")).show(5,False)

#### withColumn: Returns a new DataFrame with the new column expressions

In [None]:
df.withColumn("new",F.expr("total + hp")).toPandas()

#### withColumnRenamed: Returns a new DataFrame with a column renamed

In [None]:
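# Rename an existing column; renaming a column that does not exist is silently a no-op
df.withColumnRenamed("sp_atk", "special_attack").toPandas()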

#### drop: Returns a new DataFrame removing the specified columns

In [None]:
df.withColumn("new",F.expr("total + hp")).drop("new").toPandas()

<a id='6.2'></a>
### 6.2 Row Filtering

#### filter: Returns a new DataFrame with the rows that match the predicate
#### where: Alias of filter function

In [None]:
df.filter(df['type_1']=='Fire').show(5,False)

In [None]:
df.filter("type_1='Fire'").show(5,False)

Filter predicates can be as complex as we need:

In [None]:
df.where((df['type_1']=='Fire') & ((df['hp'] > 100) | (df['defense'] > 100 ))).show(5,False)

In [None]:
df.where(df.name.like('Mew%')).show()

In [None]:
df.where("name like 'Mew%'").show()

#### distinct: Returns a new DataFrame removing duplicated rows

In [None]:
df.distinct().count()

#### dropDuplicates: Returns a new DataFrame removing duplicated rows

In [None]:
df.dropDuplicates().count()

#### limit: Returns a new DataFrame with a fixed number of rows

In [None]:
df.limit(10).toPandas()

<a id='6.3'></a>
### 6.3 Row Sorting

#### sort: Returns a new DataFrame with the rows sorted

#### orderBy: Alias of sort

In [None]:
df.sort(F.col("name")).toPandas()

In descending order:

In [None]:
df.sort(F.col("name").desc()).toPandas()

<a id='7'></a>
## 7. Tear Down

Once we complete the lab we can stop all the services.

<a id='7.1'></a>
### 7.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```