<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Read DataFrame from CSV files](#2.1)
  * [2.2 Write DataFrame in multiple different formats](#2.2)
  * [2.3 Read DataFrame from multiple different formats](#2.3)
  * [2.4 Read DataFrame from image files](#2.4)  
  * [2.5 Read DataFrame from any binary file format](#2.5)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goal of this notebook is to get familiar with Spark's Data Source API, the API for creating DataFrames from external data sources and saving them back.</div>
<div>In this notebook we are going to work with <b>HDFS</b>.</div>
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [None]:
import findspark
findspark.init()

Set the pandas maximum column width to unlimited so that long values are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession
Setting this environment variable before the SparkSession is created lets us include extra libraries in our Spark application, here the Avro and Delta Lake packages.

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS']="--packages org.apache.spark:spark-avro_2.12:3.2.1,io.delta:delta-core_2.12:1.2.1 pyspark-shell"
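
The value of `PYSPARK_SUBMIT_ARGS` follows a simple pattern: `--packages`, a comma-separated list of Maven coordinates, and a trailing `pyspark-shell`. The sketch below assembles that string from a list so packages are easier to add or remove. `build_submit_args` is a hypothetical helper, not part of PySpark, and the variable only takes effect if it is set before the SparkSession is created.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value (hypothetical helper)."""
    return f"--packages {','.join(packages)} pyspark-shell"


# The same two packages used in this notebook
packages = [
    "org.apache.spark:spark-avro_2.12:3.2.1",
    "io.delta:delta-core_2.12:1.2.1",
]

submit_args = build_submit_args(packages)
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```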

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder\
            .appName("Pokemon - DataSources - Lab")\
            .getOrCreate()

This notebook works with two Pokemon datasets. Before continuing, please ingest the files into HDFS in the following folders:

/datalake/raw/pokemon/pokemon-data

/datalake/raw/pokemon/pokemon-images

You can ingest them either with the batch NiFi dataflow or directly through the HDFS UI, your choice!

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Read DataFrame from CSV files

In [None]:
from pyspark.sql.functions import col

csv_df = (spark.read.option("inferSchema", "true")
                    .option("header", "true")
                    .csv("hdfs://localhost:9000/datalake/raw/pokemon/pokemon-data/"))

In [None]:
csv_df.printSchema()

In [None]:
csv_df = (spark.read.format("csv")
                    .option("inferSchema", "true")
                    .option("header", "true")
                    .load("hdfs://localhost:9000/datalake/raw/pokemon/pokemon-data/"))

In [None]:
csv_df.show(n=10,truncate=False)

In [None]:
csv_df.show(5, False)

In [None]:
csv_df.printSchema()

Some column names contain spaces, which typically causes problems if we later use SQL. Let's fix that by renaming the columns: spaces become underscores, dots are removed, names are lowercased, and `#` becomes `id`.

In [None]:
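def rename(column):
    """Turn a CSV header into a SQL-friendly column name."""
    if column == "#":
        return "id"
    # drop dots, replace spaces with underscores, lowercase the result
    return column.replace('.', '').replace(' ', '_').lower()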
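
As a quick check of the renaming rule, the sketch below applies the same function to a few header names. The names in `sample_columns` are made up for illustration and are not necessarily the headers of the real dataset.

```python
def rename(column):
    """Turn a CSV header into a SQL-friendly column name."""
    if column == "#":
        return "id"
    # drop dots, replace spaces with underscores, lowercase the result
    return column.replace(".", "").replace(" ", "_").lower()


# Hypothetical header names, used only for illustration
sample_columns = ["#", "Name", "Type 1", "Sp. Atk"]
renamed = [rename(c) for c in sample_columns]
print(renamed)  # ['id', 'name', 'type_1', 'sp_atk']
```

The backticks placed around the original name in the `select` cells matter for headers that contain a dot: without them Spark would parse the dot as access to a nested field.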

In [None]:
csv_df.columns

In [None]:
import pyspark.sql.functions as F
[F.col("`" + c + "`").alias(rename(c)) for c in csv_df.columns]

In [None]:
# renaming columns
pokemon_df = csv_df.select([col("`" + c + "`").alias(rename(c)) for c in csv_df.columns])

In [None]:
pokemon_df.show(5,False)

<a id='2.2'></a>
### 2.2 Write DataFrame in multiple different formats.

This code writes the DataFrame in several file formats.

In [None]:
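def save(df, fmt):
    """Write df to HDFS in the given format, replacing any previous output."""
    (df.write.mode("overwrite")
             .format(fmt)
             .save(f"hdfs://localhost:9000/datalake/raw/pokemon/pokemon-data.{fmt}/"))

# avro and delta rely on the packages configured in PYSPARK_SUBMIT_ARGS
for fmt in ["json", "parquet", "orc", "avro", "delta"]:
    save(pokemon_df, fmt)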
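
Of the five formats written here, JSON, Parquet and ORC ship with Spark, while Avro and Delta come from the two packages added in section 1.3. The sketch below makes that dependency explicit and shows the directory each format is written to. `EXTRA_PACKAGES`, `required_package` and `target_path` are hypothetical names introduced for illustration.

```python
# Formats that need an extra package, with the coordinates used in section 1.3
EXTRA_PACKAGES = {
    "avro": "org.apache.spark:spark-avro_2.12:3.2.1",
    "delta": "io.delta:delta-core_2.12:1.2.1",
}


def required_package(fmt):
    """Return the Maven coordinate a format depends on, or None if it is built in."""
    return EXTRA_PACKAGES.get(fmt)


def target_path(fmt, dataset="pokemon-data"):
    """Build the HDFS directory a dataset is written to for a given format."""
    return f"hdfs://localhost:9000/datalake/raw/pokemon/{dataset}.{fmt}/"


for fmt in ["json", "parquet", "orc", "avro", "delta"]:
    print(fmt, target_path(fmt), required_package(fmt) or "built in")
```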

Check the contents of the pokemon directories in HDFS:

http://localhost:50070/explorer.html#/datalake/raw/pokemon

<a id='2.3'></a>
### 2.3 Read DataFrame from multiple different formats.


In [None]:
def load_and_show(fmt):
    """Read the dataset back in the given format and show the first rows."""
    print(fmt)
    path = f"hdfs://localhost:9000/datalake/raw/pokemon/pokemon-data.{fmt}/"
    spark.read.format(fmt).load(path).show(5, False)

for fmt in ["json", "parquet", "orc", "avro", "delta"]:
    load_and_show(fmt)

<a id='2.4'></a>
### 2.4 Read DataFrame from image files

In [None]:
images_df=spark.read.format("image").load("hdfs://localhost:9000/datalake/raw/pokemon/pokemon-images/")

The schema has a single column called `image`, which is a complex type (struct). A struct is similar to a Python dictionary, except that its fields are fixed and typed.
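
To make the dictionary analogy concrete, the sketch below models one row as nested Python dictionaries and resolves a dotted path such as `image.origin`, the same notation used with `select`. The field names follow Spark's image schema; the file name, the sizes and the `select_field` helper are made up for illustration.

```python
# Hypothetical row shaped like the image data source output (values are made up)
sample_row = {
    "image": {
        "origin": "hdfs://localhost:9000/datalake/raw/pokemon/pokemon-images/example.png",
        "height": 120,
        "width": 120,
        "nChannels": 4,
        "mode": 24,
        "data": b"",
    }
}


def select_field(row, dotted_path):
    """Walk nested dictionaries following a dotted path such as 'image.origin'."""
    value = row
    for key in dotted_path.split("."):
        value = value[key]
    return value


print(select_field(sample_row, "image.origin"))
print(list(sample_row["image"]))
```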

In [None]:
images_df.printSchema()

In [None]:
images_df.select("image.origin").show(5,False)

In [None]:
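def save_images(df, fmt):
    """Write the images DataFrame to HDFS in the given format."""
    (df.write.mode("overwrite")
             .format(fmt)
             .save(f"hdfs://localhost:9000/datalake/raw/pokemon/pokemon-images.{fmt}/"))

# coalesce(1) collapses the data into one partition, so each write produces a single part file
for fmt in ["json", "parquet", "orc", "avro", "delta"]:
    save_images(images_df.coalesce(1), fmt)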

Check the contents of the pokemon-images directories in HDFS:

http://localhost:50070/explorer.html#/datalake/raw/pokemon/

<a id='2.5'></a>
### 2.5 Read DataFrame from any binary file format

This reader handles any other binary file format, such as videos or PDFs. In this case we are going to read the images again, this time with this reader.

In [None]:
binary_df=spark.read.format("binaryFile").load("hdfs://localhost:9000/datalake/raw/pokemon/pokemon-images/")

The schema has four columns: `path`, `modificationTime`, `length` and `content`.
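
As a rough local analogy (not how Spark implements the reader), the sketch below builds a record with the same four fields for a file on the local disk, using only the standard library. `describe_binary_file` and the temporary `sample.bin` file are hypothetical and exist only for illustration.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def describe_binary_file(path):
    """Return a dict with the same four fields as a row of the binaryFile source."""
    p = Path(path).resolve()
    stat = p.stat()
    return {
        "path": p.as_uri(),                                         # file location as a URI
        "modificationTime": datetime.fromtimestamp(stat.st_mtime),  # last modification time
        "length": stat.st_size,                                     # size in bytes
        "content": p.read_bytes(),                                  # raw file bytes
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\x89PNG")
    record = describe_binary_file(sample)

print(list(record))
print(record["length"])
```

Because `content` holds the raw bytes of every file, selecting only `path` or `length`, as the notebook does when it shows the paths, keeps the displayed output small.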

In [None]:
binary_df.printSchema()

In [None]:
binary_df.select("path").show(5,False)

<a id='3'></a>
## 3. Tear Down

Once we complete the lab, we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```