# PySpark Training Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Run these cells to configure your interactive session

In [5]:
%idle_timeout 30
%glue_version 5.0
%worker_type G.1X
%number_of_workers 4

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.7 
Current idle_timeout is None minutes.
idle_timeout has been set to 30 minutes.
Setting Glue version to: 5.0
Previous worker type: None
Setting new worker type to: G.1X
Previous number of workers: None
Setting new number of workers to: 4


In [7]:
%%configure
{
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://dip-pyspark-training/spark_ui_tmp/",
    "--enable-metrics": "true",
    "--enable-observability-metrics": "true",
    "--conf": "spark.sql.codegen.comments=true",
    "--conf": "spark.sql.codegen.fallback=true",
    "--conf": "spark.sql.codegen.wholeStage=true",
    "--conf": "spark.sql.ui.explainMode=extended",
    "--conf": "spark.sql.ui.retainedExecutions=100",
    "--conf": "spark.ui.retainedJobs=1000",
    "--conf": "spark.ui.retainedStages=1000",
    "--conf": "spark.ui.retainedTasks=10000",
    "--conf": "spark.ui.showAdditionalMetrics=true"
}

The following configurations have been updated: {'--enable-continuous-cloudwatch-log': 'true', '--enable-spark-ui': 'true', '--spark-event-logs-path': 's3://dip-pyspark-training/spark_ui_tmp/', '--enable-metrics': 'true', '--enable-observability-metrics': 'true', '--conf': 'spark.ui.showAdditionalMetrics=true'}
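
Note that the confirmation above lists only one `--conf` entry, `spark.ui.showAdditionalMetrics=true`. A JSON object holds a single value per key, so when `--conf` is repeated the last entry wins and the earlier Spark settings are dropped. The sketch below reproduces this with Python's `json` module, using two of the entries from the cell. The merged form at the end, which chains settings inside one `--conf` value, is a commonly used workaround and is shown here as an assumption. Check it against the AWS Glue documentation before relying on it.

```python
import json

# Two of the duplicated "--conf" entries from the %%configure cell above.
raw = '''
{
    "--conf": "spark.ui.retainedTasks=10000",
    "--conf": "spark.ui.showAdditionalMetrics=true"
}
'''

# json.loads keeps only the last value seen for a repeated key.
parsed = json.loads(raw)
print(parsed)

# object_pairs_hook receives every (key, value) pair, so nothing is dropped.
pairs = json.loads(raw, object_pairs_hook=list)
settings = [value for key, value in pairs if key == "--conf"]

# Assumption: several Spark settings can be chained in a single "--conf" value.
merged = {"--conf": " --conf ".join(settings)}
print(json.dumps(merged, indent=4))
```

After changing the cell, the "Applying the following default arguments" list printed at session start is a quick way to confirm which settings were actually applied.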


### Start Spark session

In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 4
Idle Timeout: 30
Session ID: a2c53a3a-2ab6-455f-a5e0-0c7bbe25a5ba
Applying the following default arguments:
--glue_kernel_version 1.0.7
--enable-glue-datacatalog true
--enable-continuous-cloudwatch-log true
--enable-spark-ui true
--spark-event-logs-path s3://dip-pyspark-training/spark_ui_tmp/
--enable-metrics true
--enable-observability-metrics true
--conf spark.ui.showAdditionalMetrics=true
Waiting for session a2c53a3a-2ab6-455f-a5e0-0c7bbe25a5ba to get into ready status...
Session a2c53a3a-2ab6-455f-a5e0-0c7bbe25a5ba has been created.



### Get Spark's configuration

In [2]:
# Fetch the SparkConf once and reuse it for every lookup
conf = spark.sparkContext.getConf()

dynamic_allocation_enabled = conf.get('spark.dynamicAllocation.enabled')
dynamic_min_executors = conf.get('spark.dynamicAllocation.minExecutors')
dynamic_max_executors = conf.get('spark.dynamicAllocation.maxExecutors')
dynamic_initial_executors = conf.get('spark.dynamicAllocation.initialExecutors')

executor_instances = conf.get('spark.executor.instances')
executor_cores = conf.get('spark.executor.cores')
executor_memory = conf.get('spark.executor.memory')

driver_cores = conf.get('spark.driver.cores')
driver_memory = conf.get('spark.driver.memory')

print(f'''
Dynamic allocation enabled: {dynamic_allocation_enabled}
Dynamic min executors: {dynamic_min_executors}
Dynamic max executors: {dynamic_max_executors}
Dynamic initial executors: {dynamic_initial_executors}
----------------------------------------
Executor instances: {executor_instances}
Executor cores: {executor_cores}
Executor memory: {executor_memory}
----------------------------------------
Driver cores: {driver_cores}
Driver memory: {driver_memory}
''')


Dynamic allocation enabled: false
Dynamic min executors: 1
Dynamic max executors: 3
Dynamic initial executors: 3
----------------------------------------
Executor instances: 3
Executor cores: 4
Executor memory: 10g
----------------------------------------
Driver cores: 4
Driver memory: 10g
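
The printed values can be turned into a rough capacity estimate. The sketch below copies the three executor values from the output above and assumes the default of one CPU per task (`spark.task.cpus=1`), so the number of task slots equals executors times cores. The three executors are consistent with `%number_of_workers 4` if one worker hosts the driver, which is an assumption about how Glue allocates workers.

```python
# Values copied from the printed output above; getConf().get() returns strings.
executor_instances = "3"
executor_cores = "4"
executor_memory = "10g"

# Assumption: spark.task.cpus is left at its default of 1,
# so each executor core can run one task at a time.
total_task_slots = int(executor_instances) * int(executor_cores)

# Strip the trailing "g" unit to get the per-executor heap size in GB.
total_executor_memory_gb = int(executor_instances) * int(executor_memory.rstrip("g"))

print(f"Concurrent task slots: {total_task_slots}")
print(f"Total executor memory: {total_executor_memory_gb} GB")
```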


### Import libraries

In [6]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import datetime




### Loading the New York taxi dataset

In [3]:
# partitioned file
p_df = spark.read.format('parquet').load('s3://dip-pyspark-training/data/big/ny-taxi-dataset-partitioned/')




In [9]:
# p_df.rdd.getNumPartitions()

In [7]:
# Define the data as lists
vendors = ['VTS', 'CMT', 'DDS', 'VTS', 'CMT', 'DDS']
payment_type = ['CASH', 'CASH', 'CASH', 'CREDIT', 'CREDIT', 'CREDIT']
extra_col = ['A', 'B', 'C', 'D', 'E', 'F']

# Define the schema of the dataframe
schema = T.StructType([
    T.StructField("vendor_id", T.StringType(), False),
    T.StructField("payment_type", T.StringType(), False),
    T.StructField("extra_col_from_m", T.StringType(), False)
])

# Create a list of tuples
data = list(zip(vendors, payment_type, extra_col))

# Create a PySpark dataframe
m_df = spark.createDataFrame(data, schema)
# m_df.show()

+---------+------------+----------------+
|vendor_id|payment_type|extra_col_from_m|
+---------+------------+----------------+
|      VTS|        CASH|               A|
|      CMT|        CASH|               B|
|      DDS|        CASH|               C|
|      VTS|      CREDIT|               D|
|      CMT|      CREDIT|               E|
|      DDS|      CREDIT|               F|
+---------+------------+----------------+


In [8]:
# to_join_location = 's3://dip-pyspark-training/data/big/to_join_data/'
# m_df.write.format('parquet').mode('overwrite').save(to_join_location)




In [12]:
# spark_application_id = spark.sparkContext.applicationId.split('-')[-1]
# tmp_table_name = f'{spark_application_id}_tmp_table'
# tmp_table_name

'1736866877970_tmp_table'


In [13]:
# Register table on the spark catalog
# spark.sql(f"""
# CREATE TABLE {tmp_table_name} (
#     vendor_id STRING,
#     payment_type STRING,
#     extra_col_from_m STRING
# )
# STORED AS PARQUET
# LOCATION '{to_join_location}'
# """)

DataFrame[]


In [14]:
# Make sure we obtain the metadata needed to fetch the size of this table only
# spark.sql(f'ANALYZE TABLE {tmp_table_name} COMPUTE STATISTICS')

DataFrame[]


In [15]:
# m_df_from_catalog = spark.sql(f'SELECT * FROM {tmp_table_name}')
# m_df_from_catalog.show()

+---------+------------+----------------+
|vendor_id|payment_type|extra_col_from_m|
+---------+------------+----------------+
|      VTS|      CREDIT|               D|
|      CMT|      CREDIT|               E|
|      DDS|      CREDIT|               F|
|      VTS|        CASH|               A|
|      CMT|        CASH|               B|
|      DDS|        CASH|               C|
+---------+------------+----------------+


In [17]:
# READ: spark.sql.autoBroadcastJoinThreshold https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
# READ: Join Hints https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html#join-hints-types
# p_joined_df = p_df.join(other=m_df, how='inner', on = ['vendor_id', 'payment_type'])
p_joined_df = p_df.join(other=m_df.hint('broadcast'), how='inner', on = ['vendor_id', 'payment_type'])
# p_joined_df = p_df.join(other=m_df_from_catalog, how='inner', on = ['vendor_id', 'payment_type'])
p_joined_df.explain(True)

== Parsed Logical Plan ==
'Join UsingJoin(Inner, [vendor_id, payment_type])
:- Relation [pickup_datetime#0,dropoff_datetime#1,passenger_count#2,trip_distance#3,pickup_longitude#4,pickup_latitude#5,rate_code_id#6,store_and_fwd_flag#7,dropoff_longitude#8,dropoff_latitude#9,payment_type#10,fare_amount#11,extra#12,mta_tax#13,tip_amount#14,tolls_amount#15,total_amount#16,vendor_id#17] parquet
+- ResolvedHint (strategy=broadcast)
   +- LogicalRDD [vendor_id#36, payment_type#37, extra_col_from_m#38], false

== Analyzed Logical Plan ==
vendor_id: string, payment_type: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, passenger_count: int, trip_distance: double, pickup_longitude: double, pickup_latitude: double, rate_code_id: int, store_and_fwd_flag: string, dropoff_longitude: double, dropoff_latitude: double, fare_amount: double, extra: double, mta_tax: double, tip_amount: double, tolls_amount: double, total_amount: double, extra_col_from_m: string
Project [vendor_id#17, payment
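
The output above is cut off before the physical plan, which is where the chosen join strategy appears. As a minimal sketch, the helper below captures what `explain()` prints and looks for the `BroadcastHashJoin` operator. The function names are hypothetical, and the approach assumes that `DataFrame.explain()` writes to Python's standard output in this kernel.

```python
import contextlib
import io


def physical_plan_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def uses_broadcast_join(df):
    """Return True when the printed physical plan mentions BroadcastHashJoin."""
    return "BroadcastHashJoin" in physical_plan_text(df)


# Usage in this notebook (requires the Spark session started above):
# print(uses_broadcast_join(p_joined_df))
```

Running the same check on the join without `hint('broadcast')` is one way to see whether Spark would have picked a broadcast join on its own.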

In [18]:
ts = datetime.datetime.now()
output_file_path_partitioned = 's3://dip-pyspark-training/output/merged-dataset-02/'
p_joined_df.write.format('parquet').mode('overwrite').save(output_file_path_partitioned)
# total_seconds() covers the whole duration; .seconds alone drops any full days
p_pt = int((datetime.datetime.now() - ts).total_seconds())
print(f'The processing time was {p_pt} seconds')

The processing time was 150 seconds
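
When several variants of the join are timed against each other, a small reusable timer keeps the measurement consistent between cells. The sketch below is one way to do it. `timed` is a hypothetical helper, not part of PySpark or Glue, and it uses `time.perf_counter()`, which is monotonic and therefore unaffected by system clock adjustments.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f'{label} took {elapsed:.1f} seconds')


# Stand-in workload so the example runs anywhere.
with timed('Sleeping'):
    time.sleep(0.1)

# Usage in this notebook (requires the Spark session started above):
# with timed('Writing the joined dataset'):
#     p_joined_df.write.format('parquet').mode('overwrite').save(output_file_path_partitioned)
```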


In [20]:
# p_tmp_df = spark.read.format('parquet').load(output_file_path_partitioned)
# p_tmp_df.show(5)

+---------+------------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+------------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+----------------+
|vendor_id|payment_type|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude|rate_code_id|store_and_fwd_flag|dropoff_longitude|dropoff_latitude|fare_amount|extra|mta_tax|tip_amount|tolls_amount|total_amount|extra_col_from_m|
+---------+------------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+------------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+----------------+
|      VTS|        CASH|2009-12-25 18:42:00|2009-12-25 18:57:00|              5|         3.52|       73.978847|      40.761798|        NULL|              NULL|       -74.00605

In [21]:
# p_tmp_df.count()

71448102


In [11]:
# spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

'10485760b'
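
The threshold is reported in bytes with a trailing `b`. The sketch below converts the value shown above into MiB. Spark broadcasts a join side automatically when its estimated size is below this threshold, and setting the threshold to `-1` disables automatic broadcasting.

```python
# Value returned by spark.conf.get("spark.sql.autoBroadcastJoinThreshold") above.
threshold = '10485760b'

# Drop the trailing unit suffix and convert bytes to MiB.
threshold_bytes = int(threshold.rstrip('b'))
threshold_mib = threshold_bytes / (1024 * 1024)

print(f'Auto-broadcast threshold: {threshold_mib:.0f} MiB')
```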


In [6]:
# spark.conf.get("spark.sql.join.preferSortMergeJoin")

'false'


In [22]:
# Drop the temporary table
# spark.sql(f'DROP TABLE {tmp_table_name}')

DataFrame[]
