# PySpark Training Notebook
##### Refreshing of basic concepts

####  Run these cells to configure your interactive session

In [5]:
%idle_timeout 60
%glue_version 5.0
%worker_type G.1X
%number_of_workers 2

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.7 
Current idle_timeout is None minutes.
idle_timeout has been set to 60 minutes.
Setting Glue version to: 5.0
Previous worker type: None
Setting new worker type to: G.1X
Previous number of workers: None
Setting new number of workers to: 2


In [7]:
%%configure
{
    "--spark-event-logs-path": "s3://dip-pyspark-training/spark_ui_tmp/"
}

The following configurations have been updated: {'--spark-event-logs-path': 's3://dip-pyspark-training/spark_ui_tmp/'}


### Start spark session 

In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 60
Session ID: 6abe1590-24b2-489e-96b2-fe8bf106612f
Applying the following default arguments:
--glue_kernel_version 1.0.7
--enable-glue-datacatalog true
--spark-event-logs-path s3://dip-pyspark-training/spark_ui_tmp/
Waiting for session 6abe1590-24b2-489e-96b2-fe8bf106612f to get into ready status...
Session 6abe1590-24b2-489e-96b2-fe8bf106612f has been created.



### Spark's Core components

In [2]:
executor_instances = spark.sparkContext.getConf().get('spark.executor.instances')
executor_cores = spark.sparkContext.getConf().get('spark.executor.cores')
executor_memory = spark.sparkContext.getConf().get('spark.executor.memory')

driver_cores = spark.sparkContext.getConf().get('spark.driver.cores')
driver_memory = spark.sparkContext.getConf().get('spark.driver.memory')

print(f'''
----------------------------------------
Executor instances: {executor_instances}
Executor cores: {executor_cores}
Executor memory: {executor_memory}
----------------------------------------
Driver cores: {driver_cores}
Driver memory: {driver_memory}
----------------------------------------
''')


----------------------------------------
Executor instances: 1
Executor cores: 4
Executor memory: 10g
----------------------------------------
Driver cores: 4
Driver memory: 10g
----------------------------------------


In [3]:
spark.sparkContext.getConf().getAll()

[('spark.network.timeout', '600'), ('spark.files.useFetchCache', 'false'), ('spark.dynamicAllocation.minExecutors', '1'), ('spark.yarn.dist.archives', 'file:///tmp/glue-job-13064934523853194364_glue_venv.zip#python_environment'), ('spark.driver.cores', '4'), ('spark.glue.additionalParams.PROXY_DISABLED', 'false'), ('spark.hadoop.fs.AbstractFileSystem.s3.impl', 'org.apache.hadoop.fs.s3.EMRFSDelegate'), ('spark.eventLog.enabled', 'true'), ('spark.sql.shuffle.partitions', '4'), ('spark.eventLog.dir', 'file:///var/log/spark/apps'), ('spark.executor.extraJavaOptions', "-Djava.net.preferIPv6Addresses=false -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70 -XX:OnOutOfMemoryError='kill -9 %p' -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/

### Import libraries

In [4]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import datetime




### Spark's Unified Framework

In [5]:
data = [('Alice', 34), ('Bob', 45), ('Cathy', 29), ('David', 50), ('Eve', 28), ('Frank', 20), ('Grace', 42), ('Hank', 21), ('Ivy', 26), ('Jack', 40), ('Karen', 19), ('Leo', 29), ('Mona', 35), ('Nina', 48), ('Javier', 38)]
columns = ['Name', 'Age']




In [6]:
# DataFrames Pyspark
df = spark.createDataFrame(data, columns)
df.show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 34|
|   Bob| 45|
| Cathy| 29|
| David| 50|
|   Eve| 28|
| Frank| 20|
| Grace| 42|
|  Hank| 21|
|   Ivy| 26|
|  Jack| 40|
| Karen| 19|
|   Leo| 29|
|  Mona| 35|
|  Nina| 48|
|Javier| 38|
+------+---+


In [13]:
# df.rdd.glom().collect()

[[Row(Name='Alice', Age=34), Row(Name='Bob', Age=45), Row(Name='Cathy', Age=29)], [Row(Name='David', Age=50), Row(Name='Eve', Age=28), Row(Name='Frank', Age=20)], [Row(Name='Grace', Age=42), Row(Name='Hank', Age=21), Row(Name='Ivy', Age=26)], [Row(Name='Jack', Age=40), Row(Name='Karen', Age=19), Row(Name='Leo', Age=29), Row(Name='Mona', Age=35), Row(Name='Nina', Age=48), Row(Name='Javier', Age=38)]]


In [7]:
df.filter(F.col('Age') > 30).show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 34|
|   Bob| 45|
| David| 50|
| Grace| 42|
|  Jack| 40|
|  Mona| 35|
|  Nina| 48|
|Javier| 38|
+------+---+


In [8]:
# DataFrames SQL
df.createOrReplaceTempView('friends')




In [9]:
spark.sql(
    '''
    SELECT *
    FROM friends
    WHERE Age > 30
    ''').show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 34|
|   Bob| 45|
| David| 50|
| Grace| 42|
|  Jack| 40|
|  Mona| 35|
|  Nina| 48|
|Javier| 38|
+------+---+


In [10]:
# RDDs
rdd = spark.sparkContext.parallelize(data)
rdd.collect()

[('Alice', 34), ('Bob', 45), ('Cathy', 29), ('David', 50), ('Eve', 28), ('Frank', 20), ('Grace', 42), ('Hank', 21), ('Ivy', 26), ('Jack', 40), ('Karen', 19), ('Leo', 29), ('Mona', 35), ('Nina', 48), ('Javier', 38)]


In [11]:
filtered_rdd = rdd.filter(lambda x: x[1] > 30)




In [12]:
filtered_rdd.collect()

[('Alice', 34), ('Bob', 45), ('David', 50), ('Grace', 42), ('Jack', 40), ('Mona', 35), ('Nina', 48), ('Javier', 38)]


### RDDs - Example 1

In [46]:
this_is_a_variable = [i for i in range(10**5)]
this_is_a_variable[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [47]:
rdd2 = spark.sparkContext.parallelize(this_is_a_variable)




In [48]:
rdd2.getNumPartitions()

4


In [51]:
# l = rdd2.glom().collect()




In [31]:
filtered_rdd = rdd2.filter(lambda x: x > 10**6 - 100)




In [33]:
filtered_rdd.collect()

[999901, 999902, 999903, 999904, 999905, 999906, 999907, 999908, 999909, 999910, 999911, 999912, 999913, 999914, 999915, 999916, 999917, 999918, 999919, 999920, 999921, 999922, 999923, 999924, 999925, 999926, 999927, 999928, 999929, 999930, 999931, 999932, 999933, 999934, 999935, 999936, 999937, 999938, 999939, 999940, 999941, 999942, 999943, 999944, 999945, 999946, 999947, 999948, 999949, 999950, 999951, 999952, 999953, 999954, 999955, 999956, 999957, 999958, 999959, 999960, 999961, 999962, 999963, 999964, 999965, 999966, 999967, 999968, 999969, 999970, 999971, 999972, 999973, 999974, 999975, 999976, 999977, 999978, 999979, 999980, 999981, 999982, 999983, 999984, 999985, 999986, 999987, 999988, 999989, 999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999]


In [35]:
filtered_rdd.getNumPartitions()

4


In [39]:
filtered_rdd.glom().collect()

[[], [], [], [999901, 999902, 999903, 999904, 999905, 999906, 999907, 999908, 999909, 999910, 999911, 999912, 999913, 999914, 999915, 999916, 999917, 999918, 999919, 999920, 999921, 999922, 999923, 999924, 999925, 999926, 999927, 999928, 999929, 999930, 999931, 999932, 999933, 999934, 999935, 999936, 999937, 999938, 999939, 999940, 999941, 999942, 999943, 999944, 999945, 999946, 999947, 999948, 999949, 999950, 999951, 999952, 999953, 999954, 999955, 999956, 999957, 999958, 999959, 999960, 999961, 999962, 999963, 999964, 999965, 999966, 999967, 999968, 999969, 999970, 999971, 999972, 999973, 999974, 999975, 999976, 999977, 999978, 999979, 999980, 999981, 999982, 999983, 999984, 999985, 999986, 999987, 999988, 999989, 999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999]]


### RDDs - Example 2

In [40]:
# Lazy Transformation
time_to_retirement_rdd = rdd.map(lambda x: 67 - x[1])




In [41]:
time_to_retirement_rdd.collect()

[33, 22, 38, 17, 39, 47, 25, 46, 41, 27, 48, 38, 32, 19, 29]


In [42]:
rdd.getNumPartitions()

4


In [43]:
rdd.glom().collect()

[[('Alice', 34), ('Bob', 45), ('Cathy', 29)], [('David', 50), ('Eve', 28), ('Frank', 20)], [('Grace', 42), ('Hank', 21), ('Ivy', 26)], [('Jack', 40), ('Karen', 19), ('Leo', 29), ('Mona', 35), ('Nina', 48), ('Javier', 38)]]


In [30]:
time_to_retirement_rdd.getNumPartitions()

4


In [44]:
time_to_retirement_rdd.glom().collect()

[[33, 22, 38], [17, 39, 47], [25, 46, 41], [27, 48, 38, 32, 19, 29]]


### Lazy vs eager transformations

In [21]:
df.show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 34|
|   Bob| 45|
| Cathy| 29|
| David| 50|
|   Eve| 28|
| Frank| 20|
| Grace| 42|
|  Hank| 21|
|   Ivy| 26|
|  Jack| 40|
| Karen| 19|
|   Leo| 29|
|  Mona| 35|
|  Nina| 48|
|Javier| 38|
+------+---+


In [22]:
df.rdd.getNumPartitions()

8


In [19]:
df2 = df.withColumn('retirement_in', 67 - F.col('Age'))




In [25]:
df3 = df2.withColumn('older_than_30', F.when(F.col('Age')>30, F.lit(True)).otherwise(F.lit(False)))




In [27]:
df3.show()

+------+---+-------------+-------------+
|  Name|Age|retirement_in|older_than_30|
+------+---+-------------+-------------+
| Alice| 34|           33|         true|
|   Bob| 45|           22|         true|
| Cathy| 29|           38|        false|
| David| 50|           17|         true|
|   Eve| 28|           39|        false|
| Frank| 20|           47|        false|
| Grace| 42|           25|         true|
|  Hank| 21|           46|        false|
|   Ivy| 26|           41|        false|
|  Jack| 40|           27|         true|
| Karen| 19|           48|        false|
|   Leo| 29|           38|        false|
|  Mona| 35|           32|         true|
|  Nina| 48|           19|         true|
|Javier| 38|           29|         true|
+------+---+-------------+-------------+


In [29]:
df3.explain()

== Physical Plan ==
*(1) Project [Name#0, Age#1L, (67 - Age#1L) AS retirement_in#46L, ((Age#1L > 30) <=> true) AS older_than_30#85]
+- *(1) Scan ExistingRDD[Name#0,Age#1L]


In [31]:
df4 = df3.groupBy('older_than_30').agg(F.count('Name').alias('total_people'), F.mean('Age').alias('mean_age'), F.stddev('Age').alias('stddev_age'))




In [32]:
df4.show()

+-------------+------------+------------------+-----------------+
|older_than_30|total_people|          mean_age|       stddev_age|
+-------------+------------+------------------+-----------------+
|         true|           8|              41.5|5.855400437691199|
|        false|           7|24.571428571428573|4.429339411136566|
+-------------+------------+------------------+-----------------+


In [33]:
df2.rdd.getNumPartitions()

8


In [35]:
df4.explain(True)

== Parsed Logical Plan ==
'Aggregate ['older_than_30], ['older_than_30, count('Name) AS total_people#129, avg('Age) AS mean_age#131, 'stddev('Age) AS stddev_age#132]
+- Project [Name#0, Age#1L, retirement_in#46L, CASE WHEN (Age#1L > cast(30 as bigint)) THEN true ELSE false END AS older_than_30#85]
   +- Project [Name#0, Age#1L, (cast(67 as bigint) - Age#1L) AS retirement_in#46L]
      +- LogicalRDD [Name#0, Age#1L], false

== Analyzed Logical Plan ==
older_than_30: boolean, total_people: bigint, mean_age: double, stddev_age: double
Aggregate [older_than_30#85], [older_than_30#85, count(Name#0) AS total_people#129L, avg(Age#1L) AS mean_age#131, stddev(cast(Age#1L as double)) AS stddev_age#132]
+- Project [Name#0, Age#1L, retirement_in#46L, CASE WHEN (Age#1L > cast(30 as bigint)) THEN true ELSE false END AS older_than_30#85]
   +- Project [Name#0, Age#1L, (cast(67 as bigint) - Age#1L) AS retirement_in#46L]
      +- LogicalRDD [Name#0, Age#1L], false

== Optimized Logical Plan ==
Aggregat