<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Files](#2.1)
  * [2.2 Exercises](#2.2)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with Spark DataFrames API</li>
    <li>Apply some transformations using Spark DataFrames API</li>
</ul>    
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

Open a terminal and run:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is only required because we are working in the course environment: `findspark` locates the Spark installation and makes `pyspark` importable.

In [None]:
import findspark
findspark.init()

Set the pandas maximum column width to unlimited so that long values are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession
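Setting the `PYSPARK_SUBMIT_ARGS` environment variable lets us pass extra options, such as additional libraries, to Spark. No extra libraries are needed for this lab, so only the mandatory `pyspark-shell` suffix is set.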

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' pyspark-shell'
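
As a rough illustration of how extra libraries could be passed through this variable, the sketch below builds a `PYSPARK_SUBMIT_ARGS` value with a `--packages` option. The helper name `build_submit_args` and the Avro package coordinates are assumptions chosen for the example. This lab does not need any extra package.

```python
def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper, not part of PySpark)."""
    # --packages takes a comma-separated list of Maven coordinates.
    options = f"--packages {','.join(packages)} " if packages else ""
    # The value has to end with 'pyspark-shell'.
    return f"{options}pyspark-shell"


# Illustrative coordinates only; pick the artifact matching your Spark/Scala version.
example_args = build_submit_args(["org.apache.spark:spark-avro_2.12:3.2.1"])
print(example_args)
```

To take effect, such a value would have to be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the SparkSession is created.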

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = (
    SparkSession.builder
    .appName("Pokemon - DataFrames - Kata")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/warehouse")
    .getOrCreate()
)

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Files

To complete this lab you must first complete **'Pokemon - RAW to STD - DataFrames'**.<br/>
Check that the data is ready in HDFS:

http://localhost:50070/explorer.html#/datalake/std/pokemon/

### Dataset Documentation

### Metadata

id: ID for each pokemon <br/>
name: Name of each pokemon<br/>
type_1: Each pokemon has a type, this determines weakness/resistance to attacks<br/>
type_2: Some pokemon are dual type and have 2<br/>
total: sum of all stats that come after this, a general guide to how strong a pokemon is<br/>
hp: hit points, or health, defines how much damage a pokemon can withstand before fainting<br/>
attack: the base modifier for normal attacks (e.g. Scratch, Punch)<br/>
defense: the base damage resistance against normal attacks<br/>
sp_atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)<br/>
sp_def: the base damage resistance against special attacks<br/>
speed: determines which pokemon attacks first each round<br/>
generation: pokemon generation<br/>
legendary: determines if the pokemon is legendary or not<br/>

In [None]:
df = spark.read.parquet("hdfs://localhost:9000/datalake/std/pokemon/pokemon-data/")

<a id='2.2'></a>
### 2.2 Exercises

If you need a Spark built-in function, see the reference:

https://spark.apache.org/docs/3.2.1/api/python/reference/pyspark.sql.html#functions



Find all legendary pokemons

Find all possible pokemon types (type_1)

Find the top 10 most powerful pokemons (by `total`)

Calculate summary statistics for all numerical columns (from `total` to `speed`)

Add a new column called `power` with the ratio between `total` and `hp`

Transform pokemon names to upper case

Find all pokemons whose names consist of more than one word

Double the speed of electric pokemons 

Save the DataFrame in JSON format to the following HDFS path: `hdfs://localhost:9000/datalake/work/pokemon-lab/`

Read the DataFrame back

Drop the `generation` column

Show the DataFrame schema

Show the first 10 rows

Transform the Spark DataFrame into a Pandas DataFrame

<a id='3'></a>
## 3. Tear Down

Once we complete the lab, we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

Open a terminal and run the following to stop Hadoop:
```sh
hadoop-stop.sh
```