# PySpark Training Notebook
##### You are now running an AWS Glue Studio notebook. To start using it, you need to start an AWS Glue interactive session.

#### Run these cells to configure your interactive session

In [None]:
%idle_timeout 60
%glue_version 5.0
%worker_type G.1X
%number_of_workers 2

In [None]:
%%configure
{
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://dip-pyspark-training/spark_ui_tmp/",
    "--enable-metrics": "true",
    "--enable-observability-metrics": "true",
    "--conf": "spark.sql.codegen.comments=true --conf spark.sql.codegen.fallback=true --conf spark.sql.codegen.wholeStage=true --conf spark.sql.ui.explainMode=extended --conf spark.sql.ui.retainedExecutions=100 --conf spark.ui.retainedJobs=1000 --conf spark.ui.retainedStages=1000 --conf spark.ui.retainedTasks=10000 --conf spark.ui.showAdditionalMetrics=true"
}
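
Most JSON parsers, including Python's `json` module, keep only the last value when a key is repeated, so several Spark settings passed through `%%configure` have to share a single `"--conf"` entry. A commonly used pattern in Glue is to chain them inside one value, separated by ` --conf `. The snippet below is a minimal sketch of both points; `build_conf_value` is a hypothetical helper written for this notebook, not part of any Glue API.

```python
import json

# A JSON object with a repeated key: Python's json module keeps only the last value.
raw = '{"--conf": "spark.ui.retainedJobs=1000", "--conf": "spark.ui.retainedStages=1000"}'
parsed = json.loads(raw)


def build_conf_value(settings):
    """Chain Spark settings into one string for a single "--conf" key (hypothetical helper)."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


conf_value = build_conf_value({
    "spark.ui.retainedJobs": "1000",
    "spark.ui.retainedStages": "1000",
})

print(parsed)      # only one "--conf" entry survives parsing
print(conf_value)  # one string that carries both settings
```

After the session starts, `spark.sparkContext.getConf().getAll()` is a quick way to check which of the settings actually took effect.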

### Start Spark session

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

### Get Spark's configuration

In [None]:
# Fetch the SparkConf once and reuse it for every lookup.
conf = spark.sparkContext.getConf()

dynamic_allocation_enabled = conf.get('spark.dynamicAllocation.enabled')
dynamic_min_executors = conf.get('spark.dynamicAllocation.minExecutors')
dynamic_max_executors = conf.get('spark.dynamicAllocation.maxExecutors')
dynamic_initial_executors = conf.get('spark.dynamicAllocation.initialExecutors')

executor_instances = conf.get('spark.executor.instances')
executor_cores = conf.get('spark.executor.cores')
executor_memory = conf.get('spark.executor.memory')

driver_cores = conf.get('spark.driver.cores')
driver_memory = conf.get('spark.driver.memory')

print(f'''
Dynamic allocation enabled: {dynamic_allocation_enabled}
Dynamic min executors: {dynamic_min_executors}
Dynamic max executors: {dynamic_max_executors}
Dynamic initial executors: {dynamic_initial_executors}
----------------------------------------
Executor instances: {executor_instances}
Executor cores: {executor_cores}
Executor memory: {executor_memory}
----------------------------------------
Driver cores: {driver_cores}
Driver memory: {driver_memory}
''')

In [None]:
spark.sparkContext.getConf().getAll()

### Import libraries

In [None]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import datetime

### Load the customers and transactions datasets

In [None]:
c_df = spark.read.format('parquet').load('s3://dip-pyspark-training/data/small/customers/')
c_df.rdd.getNumPartitions()

In [None]:
c_df.schema

In [None]:
c_df.show()

In [None]:
t_df = spark.read.format('parquet').load('s3://dip-pyspark-training/data/small/transactions/')
t_df.rdd.getNumPartitions()

In [None]:
t_df.show()

### Examples of narrow transformations

In [None]:
tmp_01_df = c_df.withColumn('first_name', F.split('name', ' ').getItem(0))

In [None]:
tmp_02_df = tmp_01_df.withColumn('last_name', F.split('name', ' ').getItem(1))

In [None]:
tmp_03_df = tmp_02_df.select(['cust_id', 'first_name', 'last_name', 'city', 'gender', 'birthday'])

In [None]:
tmp_04_df = tmp_03_df.filter(F.col('city') == 'chicago')

In [None]:
tmp_04_df.show()

In [None]:
tmp_04_df.explain(True)

#### Parsed logical plan
```text
== Parsed Logical Plan ==
'Project ['cust_id, 'first_name, 'last_name, 'city, 'gender, 'birthday]
+- Project [cust_id#88, name#89, age#90, gender#91, birthday#92, zip#93, city#94, first_name#579, split(name#89,  , -1)[1] AS last_name#588]
   +- Project [cust_id#88, name#89, age#90, gender#91, birthday#92, zip#93, city#94, split(name#89,  , -1)[0] AS first_name#579]
      +- Filter (city#94 = chicago)
         +- Relation [cust_id#88,name#89,age#90,gender#91,birthday#92,zip#93,city#94] parquet
```
#### Analyzed logical plan
```text
== Analyzed Logical Plan ==
cust_id: string, first_name: string, last_name: string, city: string, gender: string, birthday: string
Project [cust_id#88, first_name#579, last_name#588, city#94, gender#91, birthday#92]
+- Project [cust_id#88, name#89, age#90, gender#91, birthday#92, zip#93, city#94, first_name#579, split(name#89,  , -1)[1] AS last_name#588]
   +- Project [cust_id#88, name#89, age#90, gender#91, birthday#92, zip#93, city#94, split(name#89,  , -1)[0] AS first_name#579]
      +- Filter (city#94 = chicago)
         +- Relation [cust_id#88,name#89,age#90,gender#91,birthday#92,zip#93,city#94] parquet
```
#### Optimized logical plan
```text
== Optimized Logical Plan ==
Project [cust_id#88, split(name#89,  , -1)[0] AS first_name#579, split(name#89,  , -1)[1] AS last_name#588, city#94, gender#91, birthday#92]
+- Filter (isnotnull(city#94) AND (city#94 = chicago))
   +- Relation [cust_id#88,name#89,age#90,gender#91,birthday#92,zip#93,city#94] parquet
```
#### Physical plan
```text
== Physical Plan ==
*(1) Project [cust_id#88, split(name#89,  , -1)[0] AS first_name#579, split(name#89,  , -1)[1] AS last_name#588, city#94, gender#91, birthday#92]
+- *(1) Filter (isnotnull(city#94) AND (city#94 = chicago))
   +- *(1) ColumnarToRow
      +- FileScan parquet [cust_id#88,name#89,gender#91,birthday#92,city#94] Batched: true, DataFilters: [isnotnull(city#94), (city#94 = chicago)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3://dip-pyspark-training/data/small/customers], PartitionFilters: [], PushedFilters: [IsNotNull(city), EqualTo(city,chicago)], ReadSchema: struct<cust_id:string,name:string,gender:string,birthday:string,city:string>
```
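
The physical plan above contains no `Exchange` node: `withColumn`, `select`, and `filter` are narrow transformations, so each output partition is computed from exactly one input partition. The following pure-Python sketch imitates that behaviour with plain lists standing in for partitions. The sample rows and the `narrow_pipeline` function are made up for illustration, and this is not how Spark executes the plan internally.

```python
# Two hypothetical partitions of customer rows (illustrative data only).
partitions = [
    [
        {"cust_id": "C1", "name": "Ann Lee", "city": "chicago"},
        {"cust_id": "C2", "name": "Bob Ray", "city": "boston"},
    ],
    [
        {"cust_id": "C3", "name": "Cy Dunn", "city": "chicago"},
    ],
]


def narrow_pipeline(rows):
    """Filter and project one partition without looking at any other partition."""
    out = []
    for row in rows:
        if row["city"] != "chicago":
            continue
        first, last = row["name"].split(" ")[:2]
        out.append({
            "cust_id": row["cust_id"],
            "first_name": first,
            "last_name": last,
            "city": row["city"],
        })
    return out


# Each input partition maps to exactly one output partition; no rows are exchanged.
result = [narrow_pipeline(part) for part in partitions]
print([len(part) for part in result])
```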

### Examples of wide transformations

#### Repartition

In [None]:
t_df.rdd.getNumPartitions()

In [None]:
t_df.repartition(20).explain(True)

In [None]:
t_df.repartition('city').explain(True)

In [None]:
t_df.repartition('city').rdd.getNumPartitions()

In [None]:
t_df.repartition(2, 'city').explain(True)

#### Coalesce

In [None]:
t_df.rdd.getNumPartitions()

In [None]:
t_df.coalesce(4).rdd.getNumPartitions()

In [None]:
t_df.coalesce(4).explain(True)

In [None]:
t_df.coalesce(1).explain(True)
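
`repartition` redistributes rows across partitions through a shuffle, while `coalesce` to a smaller number merges existing partitions without a full shuffle. The sketch below mimics both on plain Python lists. It is a conceptual model only: Spark uses its own hash function and its own logic for grouping partitions, and `stable_hash`, `hash_repartition`, and `coalesce` are hypothetical helpers written for this illustration.

```python
import zlib


def stable_hash(value):
    """Deterministic stand-in hash (assumption: not the function Spark uses)."""
    return zlib.crc32(str(value).encode("utf-8"))


def hash_repartition(partitions, num_partitions, key):
    """Wide: any row may move, and rows with the same key end up together."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[stable_hash(row[key]) % num_partitions].append(row)
    return out


def coalesce(partitions, num_partitions):
    """Narrow: each input partition goes whole into one output partition."""
    out = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        out[index % num_partitions].extend(part)
    return out


# Four hypothetical input partitions (illustrative data only).
parts = [
    [{"city": "chicago", "amt": 10}, {"city": "boston", "amt": 5}],
    [{"city": "chicago", "amt": 7}],
    [{"city": "denver", "amt": 3}],
    [{"city": "boston", "amt": 1}],
]

by_city = hash_repartition(parts, 2, "city")
merged = coalesce(parts, 2)
print([len(p) for p in by_city], [len(p) for p in merged])
```

In the `explain` output, the shuffle shows up as an `Exchange` node for `repartition`, whereas `coalesce` appears as a `Coalesce` node.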

#### Join

In [None]:
t_df.show()

In [None]:
c_df.show()

In [None]:
# Temporarily disable automatic broadcast joins (-1 turns the size threshold off)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

In [None]:
joined_df = t_df.join(
    other=c_df,
    how='inner',
    on='cust_id')

In [None]:
joined_df.show()

In [None]:
joined_df.explain(True)
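
With automatic broadcast disabled, Spark typically plans an equi-join like this one as a sort-merge join: both sides are shuffled on the join key, sorted, and then merged. The following is a minimal pure-Python sketch of the sort and merge steps for an inner join. The sample rows and `sort_merge_inner_join` are illustrative assumptions, and Spark's actual operator is considerably more involved.

```python
def sort_merge_inner_join(left, right, key):
    """Inner-join two lists of dicts on `key` by sorting both sides and merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left-side row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical sample rows (illustrative data only).
transactions = [
    {"cust_id": "C2", "txn_id": "T1", "amt": 5.0},
    {"cust_id": "C1", "txn_id": "T2", "amt": 9.5},
    {"cust_id": "C1", "txn_id": "T3", "amt": 1.5},
    {"cust_id": "C9", "txn_id": "T4", "amt": 2.0},
]
customers = [
    {"cust_id": "C1", "city": "chicago"},
    {"cust_id": "C2", "city": "boston"},
]

joined_rows = sort_merge_inner_join(transactions, customers, "cust_id")
print(len(joined_rows))
```

The transaction for `C9` has no matching customer, so an inner join drops it.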

#### GroupBy

In [None]:
g_df = t_df.groupBy('cust_id').agg({'txn_id': 'count', 'amt': 'sum'})

In [None]:
g_df.explain(True)

In [None]:
g_df.show()

In [None]:
t_df.schema

In [None]:
t_df.select('cust_id').distinct().count()

In [None]:
t_df.repartition(64, 'cust_id').groupBy('cust_id').agg({'txn_id': 'count', 'amt': 'sum'}).explain(True)
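
A `groupBy(...).agg(...)` usually shows up in the physical plan as a partial aggregation inside each partition, an `Exchange` on the grouping key, and a final aggregation that merges the partial results. The sketch below imitates those two aggregation phases in plain Python. The sample rows and the `partial_aggregate` and `final_aggregate` helpers are hypothetical, and the shuffle between the phases is only implied.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Phase 1: count and sum per cust_id within a single partition."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc[row["cust_id"]][0] += 1
        acc[row["cust_id"]][1] += row["amt"]
    return dict(acc)


def final_aggregate(partials):
    """Phase 2: merge the partial results that share a key."""
    total = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for cust_id, (count, amount) in partial.items():
            total[cust_id][0] += count
            total[cust_id][1] += amount
    return {cust_id: tuple(values) for cust_id, values in total.items()}


# Two hypothetical partitions of transaction rows (illustrative data only).
partitions = [
    [{"cust_id": "C1", "amt": 2.0}, {"cust_id": "C2", "amt": 4.0}],
    [{"cust_id": "C1", "amt": 3.0}, {"cust_id": "C1", "amt": 5.0}],
]

totals = final_aggregate(partial_aggregate(part) for part in partitions)
print(totals)
```

Aggregating inside each partition first means that only one row per key and partition has to cross the network, rather than every input row.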