# Download Datasets

In [0]:
%sh 
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/bank.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/vehicles.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/characters.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/planets.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/species.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/melb_data.csv'

In [0]:
dbutils.fs.mkdirs("/dataset")
dbutils.fs.cp('file:/databricks/driver/bank.csv','dbfs:/dataset/bank.csv')
dbutils.fs.cp('file:/databricks/driver/vehicles.csv','dbfs:/dataset/vehicles.csv')
dbutils.fs.cp('file:/databricks/driver/characters.csv','dbfs:/dataset/characters.csv')
dbutils.fs.cp('file:/databricks/driver/planets.csv','dbfs:/dataset/planets.csv')
dbutils.fs.cp('file:/databricks/driver/species.csv','dbfs:/dataset/species.csv')
dbutils.fs.cp('file:/databricks/driver/melb_data.csv','dbfs:/dataset/melb_data.csv')

# Window Partitioning

---



## Example 1

In [0]:
dbutils.fs.head("dbfs:/dataset/bank.csv")

Read the `bank.csv` file into a DataFrame.

In [0]:
from pyspark.sql.functions import *

bank_df = spark.read.format("csv") \
  .option("sep", ";") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load("/dataset/bank.csv")

In [0]:
bank_df.display()

Get the balance of the two youngest people in each job.


In [0]:
from pyspark.sql.window import Window

byJob = Window.partitionBy("job").orderBy("age")

bank_df \
  .withColumn("new_column_job", row_number().over(byJob)) \
  .filter(col("new_column_job") <= 2) \
  .select("age", "job", "balance") \
  .orderBy("job", "age") \
  .display()

## Exercise 1

Using the DataFrame built from the `bank.csv` file, get the top 3 highest balances for each marital status (`marital`).

---




## Exercise 2



Load the `vehicles.csv` file into a DataFrame.

---

In [0]:
dbutils.fs.head("dbfs:/dataset/vehicles.csv")

For each vehicle, compute the difference between its price (`cost_in_credits`) and the price of the cheapest vehicle in the same vehicle class.


---



# Data Cleaning

---



## Exercise 3
---
1. Read the file `melb_data.csv`.
2. Get the number of houses built per year. Order the result by `YearBuilt`.
3. Drop every row that contains a `null` value. Repeat the grouping from point 2.
4. Drop only the rows where `YearBuilt` is `null`. Repeat the grouping from point 2.
5. Replace `null` values in column `YearBuilt` with `1900`. Repeat the grouping from point 2.

In [0]:
dbutils.fs.head("/dataset/melb_data.csv")

# Joins

## Exercise 4

1. Create DataFrames from the files `characters.csv` and `planets.csv`.
2. Get the planet gravity for each character, selecting only the character name, planet name, and gravity.


---




In [0]:
dbutils.fs.head("/dataset/characters.csv")

In [0]:

dbutils.fs.head("/dataset/planets.csv")

## Exercise 5

1. Review Exercise 4. Which join type is being used? Why?
2. After checking the execution plan, execute the following instructions:

---

In [0]:
spark.conf.get("spark.databricks.adaptive.autoBroadcastJoinThreshold")

In [0]:
spark.conf.set("spark.databricks.adaptive.autoBroadcastJoinThreshold", "-1")

In [0]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

In [0]:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

**Run the query from Exercise 4 again**

In [0]:
spark.conf.set("spark.databricks.adaptive.autoBroadcastJoinThreshold", "30MB")

In [0]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "30MB")

## Exercise 6

1. Create a DataFrame from `species.csv`.
2. Repartition the species and characters DataFrames to 100 partitions.

---



## Exercise 7

- Set the `spark.databricks.optimizer.adaptive.enabled` property to `False`.
- Get the species classification for each character.
- Select only the character name and its classification.<br>
**Use the DataFrames repartitioned in Exercise 6.**


---



## Exercise 8

1. Execute the following statement on the DataFrame built in Exercise 7, where `classDF` is the output DataFrame of that exercise.
2. Check how the distribution of rows across partitions changes.

---



# Spark 3: Adaptive Query Execution (AQE)

## Exercise 9
---
**Coalescing partitions**<br>
Run the following query twice:
- For the first run, set the configuration parameter `spark.databricks.optimizer.adaptive.enabled` to `False`.
- Then set the same parameter to `True` and run the query again.


In [0]:
spark.conf.get("spark.databricks.optimizer.adaptive.enabled")

In [0]:
spark.conf.set("spark.databricks.optimizer.adaptive.enabled",False)

In [0]:
simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
    ("Michael", "Sales", "NY", 86000, 56, 20000),
    ("Robert", "Sales", "CA", 81000, 30, 23000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
    ("Raman", "Finance", "CA", 99000, 40, 24000),
    ("Scott", "Finance", "NY", 83000, 36, 19000),
    ("Jen", "Finance", "NY", 79000, 53, 15000),
    ("Jeff", "Marketing", "CA", 80000, 25, 18000),
    ("Kumar", "Marketing", "NY", 91000, 50, 21000),
]

df = spark.sparkContext.parallelize(simpleData).toDF(['name','department','zip','max_salary','age','min_salary'])

df1 = df.groupBy("department").count()

df1.show()

In [0]:
spark.conf.set("spark.databricks.optimizer.adaptive.enabled",True)

## Example 2
---
**Tail**

In [0]:
df.tail(2)

## Example 3
---
**Repartition in SQL**

In [0]:
print("Before re-partition :" + str(df.rdd.getNumPartitions()))
df.createOrReplaceTempView("RANGE_TABLE")
df2=spark.sql("SELECT /*+ REPARTITION(20) */ * FROM RANGE_TABLE")
print("After re-partition :" + str(df2.rdd.getNumPartitions()))