
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma-separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma-separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma-separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma-separated list of additional JARs to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0).                              |
| %security_configuration     |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.32 and you have 0.30 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::850188643689:role/gluenotebook_role
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: f79d85e6-de39-44e2-9566-b5fd2c14a21d
Applying the following default arguments:
--glue_kernel_version 0.30
--enable-glue-datacatalog true
Waiting for session f79d85e6-de39-44e2-9566-b5fd2c14a21d to get into

In [2]:
%idle_timeout 180

You are already connected to session c324f95f-a113-471c-8d0d-71a9258458af. Your change will not reflect in the current session, but it will affect future new sessions. 

Current idle_timeout is None minutes.
idle_timeout has been set to 180 minutes.


# Spark UDFs

1. Take all A* companies from S3
2. Use the `%%time` cell magic so that all operations in the cell are measured
3. Create a new column Month, based on Date using regexp
4. Create a second column Month2, based on Date, but using UDF
5. Compare the performance

https://drive.google.com/file/d/1xuN9twPlaz_Y9pOyG3GfnohvH_PnReVC/view

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *




In [2]:
spark = SparkSession.builder.master("local[1]").appName("AWSpark").getOrCreate()




### 1. Take all A* companies from S3

In [3]:
df = spark.read.option("header","true").csv("s3a://aws-stocks-dataset/")




In [4]:
df = df.withColumn("stock_index", input_file_name())
df = df.withColumn('stock_index', regexp_replace('stock_index', 's3a://aws-stocks-dataset/', ''))
# Escape the dot and anchor the pattern so that only a trailing ".csv" is removed
df = df.withColumn('stock_index', regexp_replace('stock_index', r'\.csv$', ''))
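
The second argument of `regexp_replace` is a regular expression, so an unescaped `.` matches any character, not only a literal dot. The following is a minimal plain-Python sketch of the same clean-up, using the standard `re` module as a stand-in for Spark's regex engine. The helper name `ticker_from_path` and the file name `ABcsv.csv` are made up for illustration.

```python
import re

BUCKET_PREFIX = "s3a://aws-stocks-dataset/"  # prefix taken from the read path above


def ticker_from_path(path):
    """Strip the bucket prefix and a trailing '.csv' from a file path (hypothetical helper)."""
    name = path[len(BUCKET_PREFIX):] if path.startswith(BUCKET_PREFIX) else path
    # Escaped dot plus end anchor: only a real '.csv' suffix is removed.
    return re.sub(r"\.csv$", "", name)


# Unescaped pattern, for comparison: '.' matches any character before 'csv'.
loose = re.sub(r".csv", "", "ABcsv.csv")
strict = ticker_from_path("ABcsv.csv")

print(ticker_from_path("s3a://aws-stocks-dataset/AAPL.csv"))
print(repr(loose), repr(strict))
```

For the ticker names shown in this notebook both patterns give the same result; the difference only appears when a name itself contains `csv`.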




In [5]:
df.show(2)

+----------+-------------------+----+------+-------------------+-------------------+-------------------+-----------+
|      Date|                Low|Open|Volume|               High|              Close|     Adjusted Close|stock_index|
+----------+-------------------+----+------+-------------------+-------------------+-------------------+-----------+
|21-02-1973|0.39506199955940247| 0.0| 15188|0.39506199955940247|0.39506199955940247|0.39506199955940247|       DIOD|
|22-02-1973| 0.3703700006008148| 0.0|  9113| 0.3703700006008148| 0.3703700006008148| 0.3703700006008148|       DIOD|
+----------+-------------------+----+------+-------------------+-------------------+-------------------+-----------+
only showing top 2 rows


In [6]:
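%time
# Note: a bare %time line times an empty statement, and filter() is lazy, so the figure below does not measure the filtering itself
df = df.filter(df.stock_index.startswith("A"))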

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.48 µs
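
Timings in the microsecond range are expected here for two reasons: the timing magic is not wrapped around the statement, and Spark transformations such as `filter` only describe work that runs later, when an action such as `show()` or `count()` is called. The sketch below is an analogy in plain Python, not Spark code: a generator expression plays the role of a lazy transformation, and `list()` plays the role of the action. The ticker list is sample data.

```python
import time

stats = {"calls": 0}  # counts how often the predicate actually runs


def starts_with_a(name):
    """Predicate that records each evaluation."""
    stats["calls"] += 1
    return name.startswith("A")


names = ["AAPL", "DIOD", "ALCO", "MSFT"]  # sample tickers for illustration

# Building the pipeline: nothing is evaluated yet (comparable to filter()).
start = time.perf_counter()
lazy_filtered = (n for n in names if starts_with_a(n))
build_seconds = time.perf_counter() - start
calls_after_build = stats["calls"]

# Consuming the pipeline: the work happens here (comparable to show() or count()).
start = time.perf_counter()
result = list(lazy_filtered)
run_seconds = time.perf_counter() - start

print(f"build: {build_seconds:.6f}s, predicate calls so far: {calls_after_build}")
print(f"run:   {run_seconds:.6f}s, predicate calls in total: {stats['calls']}")
```

To compare two approaches fairly, the timed cell should therefore contain an action that forces the computation.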



In [7]:
df.show(3)

+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+
|      Date|                Low|               Open|   Volume|               High|              Close|     Adjusted Close|stock_index|
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+
|12-12-1980| 0.1283479928970337| 0.1283479928970337|469033600| 0.1289059966802597| 0.1283479928970337|0.10003948211669922|       AAPL|
|15-12-1980|0.12165199965238571|0.12221000343561172|175884800|0.12221000343561172|0.12165199965238571|0.09482034295797348|       AAPL|
|16-12-1980|0.11272300034761429| 0.1132809966802597|105728000| 0.1132809966802597|0.11272300034761429|0.08786075562238693|       AAPL|
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+
only showing top 3 rows


In [7]:
%time
df.createOrReplaceTempView("stocks")
query = "SELECT * FROM stocks WHERE stock_index LIKE 'A%'"
spark.sql(query).show(5)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+
|      Date|                Low|               Open|   Volume|               High|              Close|     Adjusted Close|stock_index|
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+
|12-12-1980| 0.1283479928970337| 0.1283479928970337|469033600| 0.1289059966802597| 0.1283479928970337|0.10003948211669922|       AAPL|
|15-12-1980|0.12165199965238571|0.12221000343561172|175884800|0.12221000343561172|0.12165199965238571|0.09482034295797348|       AAPL|
|16-12-1980|0.11272300034761429| 0.1132809966802597|105728000| 0.1132809966802597|0.11272300034761429|0.08786075562238693|       AAPL|
|17-12-1980|0.11551299691200256|0.11551299691200256| 86441600|0.11607100069522858|0.11551299691200256|0.09003540128469467|    

### 3. Create a new column Month, based on Date using regexp

In [8]:
%time
df = df.withColumn("Month", substring("Date", 4, 2))
df.show(5)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.48 µs
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+-----+
|      Date|                Low|               Open|   Volume|               High|              Close|     Adjusted Close|stock_index|Month|
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+-----+
|12-12-1980| 0.1283479928970337| 0.1283479928970337|469033600| 0.1289059966802597| 0.1283479928970337|0.10003948211669922|       AAPL|   12|
|15-12-1980|0.12165199965238571|0.12221000343561172|175884800|0.12221000343561172|0.12165199965238571|0.09482034295797348|       AAPL|   12|
|16-12-1980|0.11272300034761429| 0.1132809966802597|105728000| 0.1132809966802597|0.11272300034761429|0.08786075562238693|       AAPL|   12|
|17-12-1980|0.11551299691200256|0.11551299691200256| 86441600|0.11607100069522858|0.115512
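
The cell above takes the month with `substring`, while the task asks for a regular expression. A minimal sketch of the regex variant follows, written in plain Python with the `re` module so that it can be tried without a Spark session. The helper name `month_by_regex` is hypothetical, and the pattern assumes the `DD-MM-YYYY` layout seen in the `Date` column. In Spark, a pattern of this shape could presumably be passed to `regexp_extract` with group index 1.

```python
import re

# DD-MM-YYYY, with the month captured as group 1 (layout assumed from the Date column).
DATE_PATTERN = re.compile(r"^\d{2}-(\d{2})-\d{4}$")


def month_by_regex(date_string):
    """Return the two-digit month, or None when the string does not match."""
    match = DATE_PATTERN.match(date_string)
    return match.group(1) if match else None


print(month_by_regex("12-12-1980"))
print(month_by_regex("03-05-1973"))
```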

### 4. Create a second column Month2, based on Date, but using UDF

In [9]:
# https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/ 
def extractMonth(date_string):
    month = date_string.split('-')[1]
    return month




In [10]:
udf_extractMonth = udf(extractMonth)




In [11]:
df = df.withColumn('Month2',  udf_extractMonth('Date'))




In [11]:
%time
def extractMonth(date_string):
    month = date_string.split('-')[1]
    return month

udf_extractMonth = udf(extractMonth)

df = df.withColumn('Month2',  udf_extractMonth('Date'))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.48 µs
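
A Python UDF receives `None` for null cells, and `extractMonth` as written would raise an `AttributeError` on such a row. The sketch below shows one possible null-safe body; the name `extract_month_safe` is hypothetical, and returning `None` for malformed values is an assumption about the desired behaviour. The function is plain Python, so it can be checked before wrapping it with `udf(...)`.

```python
def extract_month_safe(date_string):
    """Return the month part of a DD-MM-YYYY string, or None for null/malformed input."""
    if date_string is None:
        return None
    parts = date_string.split("-")
    # Expect exactly day, month, year; anything else is treated as malformed.
    return parts[1] if len(parts) == 3 else None


print(extract_month_safe("15-12-1980"))
print(extract_month_safe(None))
```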



In [12]:
%time
df.show(5)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+-----+------+
|      Date|                Low|               Open|   Volume|               High|              Close|     Adjusted Close|stock_index|Month|Month2|
+----------+-------------------+-------------------+---------+-------------------+-------------------+-------------------+-----------+-----+------+
|12-12-1980| 0.1283479928970337| 0.1283479928970337|469033600| 0.1289059966802597| 0.1283479928970337|0.10003948211669922|       AAPL|   12|    12|
|15-12-1980|0.12165199965238571|0.12221000343561172|175884800|0.12221000343561172|0.12165199965238571|0.09482034295797348|       AAPL|   12|    12|
|16-12-1980|0.11272300034761429| 0.1132809966802597|105728000| 0.1132809966802597|0.11272300034761429|0.08786075562238693|       AAPL|   12|    12|
|17-12-1980|0.11551299691200256|0.11551299691200

# Spark partitions

1. Read 100k rows from the aws-stocks-dataset S3 bucket and save them locally to improve performance (or use an existing local file)
2. Get the number of Spark partitions with rdd.getNumPartitions
3. Get max "high" per partition: .withColumn("partition", spark_partition_id()) + group by
4. Repartition the dataset by field stock_index
5. Review the dataset
6. Use Glue write_dynamic_frame with the partitionKeys option set to Date to write the data to your OWN folder in the S3 bucket aws-stocks-dataset-output. It should be partitioned at the folder level by date
7. Try to filter data from S3 using a Date filter and other fields. Check the performance difference with %%time

### 1. Read 100k rows from aws-stocks-dataset S3 Bucket and save them locally to improve performance (or take already existing local file)

In [13]:
df = df.limit(100000)




In [14]:
df.count()

100000


In [15]:
# AnalysisException: 'Attribute name "Adjusted Close" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'
df = df.withColumnRenamed("Adjusted Close", "Adjusted_close")




In [16]:
df.columns

['Date', 'Low', 'Open', 'Volume', 'High', 'Close', 'Adjusted_close', 'stock_index', 'Month', 'Month2']


In [11]:
# df.write.parquet("/media/ubi20/SanDisk/code/data_engineer/03_module_etl2_spark/stock.parquet")

In [10]:
# df.write.csv("/home/Downloads/stock.csv")

In [16]:
# df.write.option("header",True).mode("overwrite").csv("home/Downloads/stock.csv")




### 2. Get number of Spark partitions with rdd.getNumPartitions

In [None]:
# rdd = df.rdd

In [17]:
df.rdd.getNumPartitions() 

1


### 3. Get max "high" per partition: .withColumn("partition", spark_partition_id()) + group by

In [18]:
df = df.withColumn("partition", spark_partition_id())




In [19]:
df.show(5)

+----------+------+------+------+------+------+-----------------+-----------+-----+------+---------+
|      Date|   Low|  Open|Volume|  High| Close|   Adjusted_close|stock_index|Month|Month2|partition|
+----------+------+------+------+------+------+-----------------+-----------+-----+------+---------+
|03-05-1973| 3.625| 3.625|  2000|3.8125| 3.625|2.070335865020752|       ALCO|   05|    05|        0|
|04-05-1973| 3.625| 3.625|  1600|3.8125| 3.625|2.070335865020752|       ALCO|   05|    05|        0|
|07-05-1973| 3.625| 3.625|  1600|3.8125| 3.625|2.070335865020752|       ALCO|   05|    05|        0|
|08-05-1973|3.6875|3.6875|  1600| 3.875|3.6875|2.106030225753784|       ALCO|   05|    05|        0|
|09-05-1973|3.6875|3.6875|  6000| 3.875|3.6875|2.106030225753784|       ALCO|   05|    05|        0|
+----------+------+------+------+------+------+-----------------+-----------+-----+------+---------+
only showing top 5 rows


In [20]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- Volume: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Adjusted_close: string (nullable = true)
 |-- stock_index: string (nullable = false)
 |-- Month: string (nullable = true)
 |-- Month2: string (nullable = true)
 |-- partition: integer (nullable = false)


In [25]:
#from pyspark.sql.types import DoubleType
#df = df.withColumn("High",col("High").cast("double"))




In [21]:
df = df.withColumn("High_int",col("High").cast("int"))




In [22]:
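# Note: "High" is still a string column here, so max() compares values lexicographically rather than numerically
df.groupBy("partition").agg(max("High").alias("max_high")).show()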

+---------+------------+
|partition|    max_high|
+---------+------------+
|        0|99943.203125|
+---------+------------+
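
According to the schema printed above, `High` is a string column, so the maximum is taken in character order rather than numeric order. A small plain-Python illustration of the difference, using made-up sample values:

```python
# Sample values for illustration only; they are stored as strings, like the High column.
highs = ["9.997777938842773", "120.5", "99.3"]

lexicographic_max = max(highs)                 # compares character by character
numeric_max = max(float(h) for h in highs)     # compares the parsed numbers

print(lexicographic_max, numeric_max)
```

In the notebook, casting the column to `double` before aggregating (as in the commented-out cell above) would give the numeric maximum.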


### 4. Repartition the dataset by field stock_index

In [23]:
%time 
df_reparted = df.repartition(5, "stock_index")
df_reparted.rdd.getNumPartitions()

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs
5


### 5. Review the dataset

In [24]:
df_reparted = df_reparted.withColumn("partition", spark_partition_id())




In [25]:
%time 
df_reparted.groupBy("partition").agg(max("High").alias("max_high")).show(5)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs
+---------+-----------------+
|partition|         max_high|
+---------+-----------------+
|        1|99.98999786376953|
|        3|9.997777938842773|
|        2|99.30000305175781|
|        4|9.989999771118164|
|        0| 99.9000015258789|
+---------+-----------------+


In [28]:
df_reparted.select('partition').distinct().count()

5


In [29]:
df_reparted.rdd.getNumPartitions() 

5
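
Repartitioning by a column sends every row with the same key to the same partition, which is why five partitions can hold many more than five tickers. The sketch below imitates that idea in plain Python. `zlib.crc32` is only a deterministic stand-in for Spark's internal hash function, so the partition numbers will not match Spark's, and the rows are sample data.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition index (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# (stock_index, High) pairs; values are illustrative.
rows = [("AAPL", 0.128), ("ALCO", 3.8125), ("AAPL", 0.122), ("DIOD", 0.395)]

partitions = defaultdict(list)
for ticker, high in rows:
    partitions[partition_for(ticker, 5)].append((ticker, high))

for index in sorted(partitions):
    print(index, partitions[index])
```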


### 6. Use Glue write_dynamic_frame with the partitionKeys option set to Date to write the data to your OWN folder in the S3 bucket aws-stocks-dataset-output. It should be partitioned at the folder level by date

In [26]:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame




In [40]:
df_reparted_date = df.repartition(50, "Date")




In [41]:
df_reparted_date = df_reparted_date.withColumn("partition", spark_partition_id())




In [42]:
df_reparted_date.show(5)

+----------+--------------------+--------------------+--------+-------------------+--------------------+-------------------+-----------+-----+------+---------+--------+
|      Date|                 Low|                Open|  Volume|               High|               Close|     Adjusted_close|stock_index|Month|Month2|partition|High_int|
+----------+--------------------+--------------------+--------+-------------------+--------------------+-------------------+-----------+-----+------+---------+--------+
|19-01-1981| 0.14676299691200256| 0.14676299691200256|41574400|0.14732100069522858| 0.14676299691200256|0.11439288407564163|       AAPL|   01|    01|        0|       0|
|10-03-1981| 0.10044600069522858| 0.10100399702787399|28380800|0.10100399702787399| 0.10044600069522858|0.07829158008098602|       AAPL|   03|    03|        0|       0|
|26-08-1981| 0.08482100069522858| 0.08537899702787399|33600000|0.08537899702787399| 0.08482100069522858|0.06611282378435135|       AAPL|   08|    08|      

In [44]:
df_reparted_date.rdd.getNumPartitions() 

50


In [None]:
ddf = DynamicFrame.fromDF(df_reparted_date.limit(10), glueContext, "ddf")

In [None]:
glueContext.write_dynamic_frame.from_options(ddf, connection_type="s3", connection_options={"path": "s3://aws-stocks-dataset-output/attila/stocks", "partitionKeys": ["Date"]}, format="parquet")
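
With `partitionKeys` set, the writer places each row under a Hive-style `key=value` folder below the target path. The helper below is a hypothetical plain-Python sketch of how such a folder name is derived from a row; it does not call Glue or S3. The sample row uses a date that appears in the output above.

```python
def partition_path(base_path, partition_keys, row):
    """Build the Hive-style folder for one row (illustrative helper, not a Glue API)."""
    segments = [f"{key}={row[key]}" for key in partition_keys]
    return "/".join([base_path.rstrip("/")] + segments) + "/"


row = {"Date": "19-01-1981", "stock_index": "AAPL"}
path = partition_path("s3://aws-stocks-dataset-output/attila/stocks", ["Date"], row)
print(path)
```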

### 7. Try to filter data from S3 using a Date filter and other fields. Check the performance difference with %%time
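
A filter on the partition column lets a reader skip whole folders, whereas a filter on any other field has to open every file. The sketch below illustrates that folder selection in plain Python. The helper names are hypothetical, the folder list is built from dates seen earlier in the notebook, and nothing here talks to S3 or Spark.

```python
def partition_values(prefix):
    """Parse 'key=value' path segments of a folder prefix into a dict."""
    segments = prefix.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


def prune_prefixes(prefixes, key, wanted):
    """Keep only the folders whose partition value matches the filter."""
    return [p for p in prefixes if partition_values(p).get(key) == wanted]


BASE = "s3://aws-stocks-dataset-output/attila/stocks"
prefixes = [
    f"{BASE}/Date=19-01-1981/",
    f"{BASE}/Date=10-03-1981/",
    f"{BASE}/Date=26-08-1981/",
]

selected = prune_prefixes(prefixes, "Date", "10-03-1981")
print(selected)
```

When timing the real queries, each timed cell needs an action such as `count()` so that the read is actually executed.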

# Working with nested data

1. Read the JSON data from S3 Bucket s3://spark-concerts-json/zipcodes.json
2. Print schema and make sure that it's inferred correctly
3. Unnest operator field and get normalized dataset

In [1]:
df = spark.read.json("s3://spark-concerts-json/zipcodes.json")

NameError: name 'spark' is not defined

In [6]:
df.printSchema()

Exception encountered while retrieving session: An error occurred (ExpiredTokenException) when calling the GetSession operation: The security token included in the request is expired 
Traceback (most recent call last):
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/aws_glue_interactive_sessions_kernel/glue_pyspark/GlueKernel.py", line 688, in get_current_session
    current_session = self.glue_client.get_session(Id=self.get_session_id())["Session"]
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredTokenException) when calling the GetSession operation: The security token included in the request is expired
Failed to retrieve session status 
Ex

In [None]:
# https://sparkbyexamples.com/pyspark/pyspark-select-nested-struct-columns/
df.select('operator.name')
# df_name = df.select('operator.*')

In [None]:
df.withColumn("name", col("operator").getField("name"))
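
Unnesting turns each field of a struct into its own top-level column. The following sketch shows the idea on a plain Python dictionary. The record is invented for illustration: only `operator` and its `name` field come from the cells above, and every other field and value is an assumption about the data.

```python
def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts into a single-level dict."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record; the real schema of zipcodes.json may differ.
record = {"zipcode": "10001", "operator": {"name": "Example Operator", "id": 7}}
flat = flatten(record)
print(flat)
```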