In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config" on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0, 3.0 and 4.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config` on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
    %%tags        Dictionary          Specify a json-formatted dictionary consisting of tags to use in the session.
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X, G.2X, G.4X and G.8X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



In [1]:
import warnings

warnings.filterwarnings(action='ignore')

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::819320734790:role/AmazonSageMakerExecutionRole-06146
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 8f8620a4-2a25-4f1a-a71c-11c7e398a189
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
Waiting for session 8f8620a4-2a25-4f1a-a71c-11c7e398a189 to get into ready status...
Session 8f8620a4-2a25-4f1a-a71c-11c7e398a189 has been created.



### Importing GlueContext

In [2]:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import *
from pyspark.sql import Row




In [3]:
glueContext = GlueContext(SparkContext.getOrCreate())




## Dataset 1

In [4]:
order_list = [
               ['1005', '623', 'YES', '1418901234', '75091'],
               ['1006', '547', 'NO',  '1418901256', '75034'],
               ['1007', '823', 'YES', '1418901300', '75023'],
               ['1008', '912', 'NO',  '1418901400', '82091'],
               ['1009', '321', 'YES', '1418902000', '90093']
             ]




In [5]:
# Define schema for the order_list
order_schema = StructType([  
                      StructField("order_id", StringType()),
                      StructField("customer_id", StringType()),
                      StructField("essential_item", StringType()),
                      StructField("timestamp", StringType()),
                      StructField("zipcode", StringType())
                    ])




In [6]:
# Create a Spark Dataframe from the python list and the schema
df_orders = spark.createDataFrame(order_list, schema=order_schema)




In [7]:
df_orders.show()

+--------+-----------+--------------+----------+-------+
|order_id|customer_id|essential_item| timestamp|zipcode|
+--------+-----------+--------------+----------+-------+
|    1005|        623|           YES|1418901234|  75091|
|    1006|        547|            NO|1418901256|  75034|
|    1007|        823|           YES|1418901300|  75023|
|    1008|        912|            NO|1418901400|  82091|
|    1009|        321|           YES|1418902000|  90093|
+--------+-----------+--------------+----------+-------+


In [8]:
df_orders.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- essential_item: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- zipcode: string (nullable = true)


### [DynamicFrame](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html)

A `DynamicFrame` is similar to a Spark `DataFrame`, except that each record is self-describing, so no schema is required initially.<br/>
Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type. You can resolve these inconsistencies to make your datasets compatible with data stores that require a fixed schema.
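
To make the idea of a choice type concrete, here is a minimal plain-Python sketch. It is not Glue's implementation: records are modeled as dictionaries, and the `infer_schema` helper and the sample records are hypothetical. It collects every type seen per field and reports a "choice" whenever a field has more than one type.

```python
from collections import defaultdict


def infer_schema(records):
    """Collect the set of Python type names observed for each field."""
    seen = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            seen[field].add(type(value).__name__)
    # A single observed type stays as-is; several types become a "choice".
    return {
        field: next(iter(types)) if len(types) == 1 else {"choice": sorted(types)}
        for field, types in seen.items()
    }


# Hypothetical records: cust_id arrives as an int in one record and a str in another.
records = [
    {"cust_id": 623, "address": "108 Park Street, TX"},
    {"cust_id": "231", "address": "763 Marsh Ln, TX"},
]
schema = infer_schema(records)
print(schema)
```

A fixed-schema store could not accept `cust_id` as is, which is why such inconsistencies need to be resolved before writing.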

In [9]:
dyf_orders = DynamicFrame.fromDF(df_orders, glueContext, "dyf")




In [10]:
dyf_orders.printSchema()

root
|-- order_id: string
|-- customer_id: string
|-- essential_item: string
|-- timestamp: string
|-- zipcode: string


## AWS Glue transform functions

### [ApplyMapping](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ApplyMapping.html)

Applies a mapping in a `DynamicFrame`.

In [11]:
dyf_applyMapping = ApplyMapping.apply(frame=dyf_orders, mappings=[
    ("order_id", "String", "order_id", "Long"),
    ("customer_id", "String", "customer_id", "String"),
    ("essential_item", "String", "essential_item", "String"),
    ("timestamp", "String", "timestamp", "Long"),
    ("zipcode", "String", "zip", "Long")
])




ℹ️ **[describeArgs(cls)](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-GlueTransform.html#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs)**

The `describeArgs` function of the `GlueTransform` class returns a list of dictionaries, each corresponding to a named argument.

In [12]:
ApplyMapping.describeArgs()

[{'name': 'frame', 'type': 'DynamicFrame', 'description': 'DynamicFrame to transform', 'optional': False, 'defaultValue': None}, {'name': 'mappings', 'type': 'DynamicFrame', 'description': 'List of mapping tuples (source col, source type, target col, target type)', 'optional': False, 'defaultValue': None}, {'name': 'case_sensitive', 'type': 'Boolean', 'description': 'Whether ', 'optional': True, 'defaultValue': 'False'}, {'name': 'transformation_ctx', 'type': 'String', 'description': 'A unique string that is used to identify stats / state information', 'optional': True, 'defaultValue': ''}, {'name': 'info', 'type': 'String', 'description': 'Any string to be associated with errors in the transformation', 'optional': True, 'defaultValue': '""'}, {'name': 'stageThreshold', 'type': 'Integer', 'description': 'Max number of errors in the transformation until processing will error out', 'optional': True, 'defaultValue': '0'}, {'name': 'totalThreshold', 'type': 'Integer', 'description': 'Max n

In [13]:
import pprint

pprint.pprint(ApplyMapping.describeArgs())

[{'defaultValue': None,
  'description': 'DynamicFrame to transform',
  'name': 'frame',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': None,
  'description': 'List of mapping tuples (source col, source type, target col, '
                 'target type)',
  'name': 'mappings',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': 'False',
  'description': 'Whether ',
  'name': 'case_sensitive',
  'optional': True,
  'type': 'Boolean'},
 {'defaultValue': '',
  'description': 'A unique string that is used to identify stats / state '
                 'information',
  'name': 'transformation_ctx',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '""',
  'description': 'Any string to be associated with errors in the '
                 'transformation',
  'name': 'info',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '0',
  'description': 'Max number of errors in the transformation until processing '
                 'will error out',
  'n

In [14]:
dyf_applyMapping.printSchema()

root
|-- order_id: long
|-- customer_id: string
|-- essential_item: string
|-- timestamp: long
|-- zip: long
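
Each mapping tuple reads as (source field, source type, target field, target type). The following plain-Python sketch applies such tuples to a single dictionary record. The `apply_mapping` helper and the `CASTS` table are hypothetical, and dropping fields that are not listed is our assumption about the transform's behavior, not something this notebook demonstrates.

```python
# Hypothetical cast table for this sketch; Glue supports more types than shown here.
CASTS = {"string": str, "long": int}


def apply_mapping(record, mappings):
    """Rename and cast fields; fields not listed in mappings are dropped (assumption)."""
    out = {}
    for source, _source_type, target, target_type in mappings:
        if source in record:
            out[target] = CASTS[target_type.lower()](record[source])
    return out


mappings = [
    ("order_id", "String", "order_id", "Long"),
    ("zipcode", "String", "zip", "Long"),
]
row = {"order_id": "1005", "zipcode": "75091", "customer_id": "623"}
mapped = apply_mapping(row, mappings)
print(mapped)
```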


### [Filter](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-filter.html)

Builds a new `DynamicFrame` that contains records from the input `DynamicFrame` that satisfy a specified predicate function.

We now want to prioritize our order delivery for essential items. We can achieve that using the Filter function:

In [15]:
dyf_filter = Filter.apply(frame=dyf_applyMapping, f=lambda x: x["essential_item"] == 'YES')




In [16]:
dyf_filter.toDF().show()

+--------------+-----------+-----+----------+--------+
|essential_item|customer_id|  zip| timestamp|order_id|
+--------------+-----------+-----+----------+--------+
|           YES|        623|75091|1418901234|    1005|
|           YES|        823|75023|1418901300|    1007|
|           YES|        321|90093|1418902000|    1009|
+--------------+-----------+-----+----------+--------+


### [Map](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-map.html)

Builds a new `DynamicFrame` by applying a function to all records in the input `DynamicFrame`.

`Map` lets us apply a transformation to each record of a `DynamicFrame`. For our case, we want to flag a certain zip code for next-day air shipping. We implement a simple `next_day_air` function and pass it to `Map.apply`:

In [17]:
# Takes a DynamicFrame record and checks whether its zip code is 75034.
# If so, adds a "next_day_air" field set to True.
def next_day_air(rec):
    if rec["zip"] == 75034:
        rec["next_day_air"] = True
    return rec




In [18]:
mapped_dyF = Map.apply(frame=dyf_applyMapping, f=next_day_air)




In [19]:
mapped_dyF.toDF().show()

+--------------+-----------+------------+-----+----------+--------+
|essential_item|customer_id|next_day_air|  zip| timestamp|order_id|
+--------------+-----------+------------+-----+----------+--------+
|           YES|        623|        null|75091|1418901234|    1005|
|            NO|        547|        true|75034|1418901256|    1006|
|           YES|        823|        null|75023|1418901300|    1007|
|            NO|        912|        null|82091|1418901400|    1008|
|           YES|        321|        null|90093|1418902000|    1009|
+--------------+-----------+------------+-----+----------+--------+
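
The `null` values in the output come from records for which the function never sets the new field. A plain-Python sketch shows the same effect, with dictionaries standing in for DynamicFrame records (an assumption made only for illustration):

```python
def next_day_air(rec):
    # Only the targeted zip code gets the extra field.
    if rec["zip"] == 75034:
        rec["next_day_air"] = True
    return rec


rows = [{"order_id": 1005, "zip": 75091}, {"order_id": 1006, "zip": 75034}]
# Copy each record before passing it in, because the function mutates its argument.
mapped_rows = [next_day_air(dict(r)) for r in rows]
# A missing key reads as None here, which corresponds to null in the table above.
flags = [r.get("next_day_air") for r in mapped_rows]
print(flags)
```

If every record should carry an explicit value, the function could set `rec["next_day_air"] = False` in an `else` branch.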


## Dataset 2

In [20]:
jsonStr1 = u'{ "zip": 75091, "customers": [{ "id": 623, "address": "108 Park Street, TX"}, { "id": 231, "address": "763 Marsh Ln, TX" }]}'
jsonStr2 = u'{ "zip": 82091, "customers": [{ "id": 201, "address": "771 Peek Pkwy, GA" }]}'
jsonStr3 = u'{ "zip": 75023, "customers": [{ "id": 343, "address": "66 P Street, NY" }]}'
jsonStr4 = u'{ "zip": 90093, "customers": [{ "id": 932, "address": "708 Fed Ln, CA"}, { "id": 102, "address": "807 Deccan Dr, CA" }]}'




In [21]:
df_row = spark.createDataFrame([
    Row(json=jsonStr1),
    Row(json=jsonStr2),
    Row(json=jsonStr4)
])




In [22]:
df_json = spark.read.json(df_row.rdd.map(lambda r: r.json))




In [23]:
df_json.show()

+--------------------+-----+
|           customers|  zip|
+--------------------+-----+
|[{108 Park Street...|75091|
|[{771 Peek Pkwy, ...|82091|
|[{708 Fed Ln, CA,...|90093|
+--------------------+-----+


In [24]:
df_json.printSchema()

root
 |-- customers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |-- zip: long (nullable = true)


In [25]:
dyf_json = DynamicFrame.fromDF(df_json, glueContext, "dyf_json")




In [26]:
dyf_json.printSchema()

root
|-- customers: array
|    |-- element: struct
|    |    |-- address: string
|    |    |-- id: long
|-- zip: long


### [SelectFields](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFields.html)

The `SelectFields` class creates a new `DynamicFrame` from an existing `DynamicFrame`, and keeps only the fields that you specify. `SelectFields` provides similar functionality to a SQL `SELECT` statement.

To join with the order list, we don’t need all the columns, so we use `SelectFields` to keep only the ones we need. In our use case, we need only the `zip` column, but we can select more because the `paths` argument accepts a list:

In [27]:
dyf_selectFields = SelectFields.apply(frame=dyf_filter, paths=['zip'])




In [28]:
dyf_selectFields.toDF().show()

+-----+
|  zip|
+-----+
|75091|
|75023|
|90093|
+-----+


### [Join](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-join.html)

Performs an equality join on two `DynamicFrames`.

The `Join` function is straightforward and handles duplicate column names. Both datasets have a column named `zip`, which we use as the join key:

In [29]:
dyf_join = Join.apply(dyf_json, dyf_selectFields, 'zip', 'zip')




ℹ️ AWS Glue added a period (`.`) in one of the duplicate column names to avoid errors

In [30]:
dyf_join.toDF().show()

+--------------------+-----+-----+
|           customers| .zip|  zip|
+--------------------+-----+-----+
|[{708 Fed Ln, CA,...|90093|90093|
|[{108 Park Street...|75091|75091|
+--------------------+-----+-----+
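
Only zip codes present in both frames survive the join, which is why 82091 and 75023 are missing from the result. The sketch below reproduces that logic on lists of dictionaries. The `equality_join` helper is hypothetical, and the `"."` prefix only imitates the column name seen in the output; it does not describe how Glue picks names internally.

```python
def equality_join(left, right, left_key, right_key):
    """Inner join two lists of dicts where left[left_key] == right[right_key]."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[left_key], []):
            row = dict(l)
            # Keep the right-hand key under a prefixed name to avoid a clash.
            row["." + right_key] = r[right_key]
            joined.append(row)
    return joined


customers = [{"zip": 75091}, {"zip": 82091}, {"zip": 90093}]
essential = [{"zip": 75091}, {"zip": 75023}, {"zip": 90093}]
result = equality_join(customers, essential, "zip", "zip")
print(result)
```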


### [DropFields](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-DropFields.html)

Drops fields within a `DynamicFrame`.

Because we don’t need two columns with the same name, we can use DropFields to drop one or multiple columns all at once. The backticks (<code>`</code>) around .zip inside the function call are needed because the column name contains a period (.):

In [31]:
dyf_dropfields = DropFields.apply(frame=dyf_join, paths="`.zip`")




In [32]:
dyf_dropfields.toDF().show()

+--------------------+-----+
|           customers|  zip|
+--------------------+-----+
|[{708 Fed Ln, CA,...|90093|
|[{108 Park Street...|75091|
+--------------------+-----+


### [Relationalize](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Relationalize.html)

Flattens a nested schema in a `DynamicFrame` and pivots out array columns from the flattened frame.

The `Relationalize` function can flatten nested structures and create multiple DynamicFrames. The `customers` column from the previous operation is a nested structure, and `Relationalize` can convert it into multiple flattened DynamicFrames:

In [33]:
temp_dir = 's3://sagemaker-us-east-1-123456789012/glue/temp/'
dyf_relationalize = dyf_dropfields.relationalize("root", temp_dir)




In [34]:
import pprint

pprint.pprint(Relationalize.describeArgs())

[{'defaultValue': None,
  'description': 'The DynamicFrame to relationalize',
  'name': 'frame',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': None,
  'description': 'path to store partitions of pivoted tables in csv format',
  'name': 'staging_path',
  'optional': True,
  'type': 'String'},
 {'defaultValue': 'roottable',
  'description': 'Name of the root table',
  'name': 'name',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '{}',
  'description': 'dict of optional parameters for relationalize',
  'name': 'options',
  'optional': True,
  'type': 'Dictionary'},
 {'defaultValue': '',
  'description': 'A unique string that is used to identify stats / state '
                 'information',
  'name': 'transformation_ctx',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '""',
  'description': 'Any string to be associated with errors in the '
                 'transformation',
  'name': 'info',
  'optional': True,
  'type': 'String'},
 {'defaultV

In [35]:
dyf_relationalize.keys()

dict_keys(['root', 'root_customers'])
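
To show what the two keys hold, here is a plain-Python sketch of the same idea: the array column in the root table is replaced by a join key, and each array element becomes a row in a child table. The `relationalize` helper is hypothetical, handles only one array field, and assigns join keys by position; Glue assigns its own keys.

```python
def relationalize(records, root_name, array_field):
    """Split records into a root table and a child table for one array field."""
    root, child = [], []
    for join_id, rec in enumerate(records, start=1):
        flat = {k: v for k, v in rec.items() if k != array_field}
        flat[array_field] = join_id  # the array is replaced by a join key
        root.append(flat)
        for index, element in enumerate(rec[array_field]):
            row = {"id": join_id, "index": index}
            for key, value in element.items():
                row[f"{array_field}.val.{key}"] = value
            child.append(row)
    return {root_name: root, f"{root_name}_{array_field}": child}


records = [
    {"zip": 90093, "customers": [{"id": 932, "address": "708 Fed Ln, CA"},
                                 {"id": 102, "address": "807 Deccan Dr, CA"}]},
    {"zip": 75091, "customers": [{"id": 623, "address": "108 Park Street, TX"},
                                 {"id": 231, "address": "763 Marsh Ln, TX"}]},
]
tables = relationalize(records, "root", "customers")
print(sorted(tables))
print(tables["root"])
```

The `id` column in the child table matches the `customers` value in the root table, so the two tables can be joined back together.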


### [SelectFromCollection](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFromCollection.html)

Selects one `DynamicFrame` in a `DynamicFrameCollection`.

The `SelectFromCollection` function allows us to retrieve a specific DynamicFrame from a collection of DynamicFrames by its key. For this use case, we retrieve both DynamicFrames produced by the previous operation, one at a time.

In [36]:
dyf_selectFromCollection = SelectFromCollection.apply(dyf_relationalize, "root")




In [37]:
dyf_selectFromCollection.toDF().show()

+---------+-----+
|customers|  zip|
+---------+-----+
|        1|90093|
|        2|75091|
+---------+-----+


In [38]:
dyf_selectFromCollection = SelectFromCollection.apply(dyf_relationalize, "root_customers")




### [RenameField](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-RenameField.html)

Renames a node within a `DynamicFrame`.

In [39]:
dyf_renameField_1 = RenameField.apply(dyf_selectFromCollection, "`customers.val.address`", "address")




In [40]:
dyf_renameField_2 = RenameField.apply(dyf_renameField_1, "`customers.val.id`", "cust_id")




In [41]:
dyf_renameField_2.toDF().show()

+---+-----+-------------------+-------+
| id|index|            address|cust_id|
+---+-----+-------------------+-------+
|  2|    0|108 Park Street, TX|    623|
|  2|    1|   763 Marsh Ln, TX|    231|
|  1|    0|     708 Fed Ln, CA|    932|
|  1|    1|  807 Deccan Dr, CA|    102|
+---+-----+-------------------+-------+


In [42]:
dyf_dropfields_rf = DropFields.apply(
  frame = dyf_renameField_2,
  paths = ["index", "id"]
)




In [43]:
dyf_dropfields_rf.toDF().show()

+-------------------+-------+
|            address|cust_id|
+-------------------+-------+
|108 Park Street, TX|    623|
|   763 Marsh Ln, TX|    231|
|     708 Fed Ln, CA|    932|
|  807 Deccan Dr, CA|    102|
+-------------------+-------+


### [ResolveChoice](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html)

Resolves a choice type within a `DynamicFrame`.

`ResolveChoice` can gracefully handle column type ambiguities.

In [44]:
import pprint

pprint.pprint(ResolveChoice.describeArgs())

[{'defaultValue': None,
  'description': 'DynamicFrame to transform',
  'name': 'frame',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': None,
  'description': 'List of specs (path, action)',
  'name': 'specs',
  'optional': True,
  'type': 'List'},
 {'defaultValue': '',
  'description': 'resolve choice option',
  'name': 'choice',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '',
  'description': 'Glue catalog database name, required for MATCH_CATALOG '
                 'choice',
  'name': 'database',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '',
  'description': 'Glue catalog table name, required for MATCH_CATALOG choice',
  'name': 'table_name',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '',
  'description': 'A unique string that is used to identify stats / state '
                 'information',
  'name': 'transformation_ctx',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '""',
  'description': 'Any string

In [45]:
dyf_resolveChoice = dyf_dropfields_rf.resolveChoice(specs = [('cust_id','cast:String')])




In [46]:
dyf_resolveChoice.printSchema()

root
|-- address: string
|-- cust_id: string
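
A `cast:<type>` spec forces every value of a field to one type. The plain-Python sketch below models only that action; the `resolve_choice` helper, the `CASTS` table, and the mixed-type sample are hypothetical, and other actions such as `make_cols`, `make_struct`, and `project` are not modeled.

```python
# Hypothetical subset of target types for this sketch.
CASTS = {"string": str, "long": int, "double": float}


def resolve_choice(records, specs):
    """Apply (path, 'cast:<type>') specs to every record."""
    resolved = []
    for rec in records:
        rec = dict(rec)  # leave the input untouched
        for path, action in specs:
            kind, _, type_name = action.partition(":")
            if kind != "cast":
                raise ValueError(f"unsupported action in this sketch: {action}")
            if rec.get(path) is not None:
                rec[path] = CASTS[type_name.lower()](rec[path])
        resolved.append(rec)
    return resolved


# Hypothetical input where cust_id has two different types.
mixed = [{"cust_id": 623}, {"cust_id": "231"}]
resolved = resolve_choice(mixed, [("cust_id", "cast:String")])
print(resolved)
```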


In [47]:
warehouse_inventory_list = [
              ['TX_WAREHOUSE', '{\
                          "strawberry":"220",\
                          "pineapple":"560",\
                          "mango":"350",\
                          "pears":null}'
              ],
              ['CA_WAREHOUSE', '{\
                         "strawberry":"34",\
                         "pineapple":"123",\
                         "mango":"42",\
                         "pears":null}'
              ],
              ['CO_WAREHOUSE', '{\
                         "strawberry":"340",\
                         "pineapple":"180",\
                         "mango":"2",\
                         "pears":null}'
              ]
            ]




In [48]:
warehouse_schema = StructType([StructField("warehouse_loc", StringType()),
                              StructField("data", StringType())])




In [49]:
df_warehouse = spark.createDataFrame(warehouse_inventory_list, schema=warehouse_schema)




In [50]:
dyf_warehouse = DynamicFrame.fromDF(df_warehouse, glueContext, "dyf_warehouse")




In [51]:
dyf_warehouse.printSchema()

root
|-- warehouse_loc: string
|-- data: string


### [Unbox](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html)

Unboxes (reformats) a string field in a `DynamicFrame`.

We use `Unbox` to extract JSON from `String` format for the new data.

In [52]:
import pprint

pprint.pprint(Unbox.describeArgs())

[{'defaultValue': None,
  'description': 'The DynamicFrame on which to call Unbox',
  'name': 'frame',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': None,
  'description': 'full path to the StringNode to unbox',
  'name': 'path',
  'optional': False,
  'type': 'String'},
 {'defaultValue': None,
  'description': 'file format -- "avro" or "json" only',
  'name': 'format',
  'optional': False,
  'type': 'String'},
 {'defaultValue': '',
  'description': 'A unique string that is used to identify stats / state '
                 'information',
  'name': 'transformation_ctx',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '""',
  'description': 'Any string to be associated with errors in the '
                 'transformation',
  'name': 'info',
  'optional': True,
  'type': 'String'},
 {'defaultValue': '0',
  'description': 'Max number of errors in the transformation until processing '
                 'will error out',
  'name': 'stageThreshold',
  'optional'

In [53]:
dyf_unbox = Unbox.apply(frame=dyf_warehouse, path="data", format="json")




In [54]:
dyf_unbox.printSchema()

root
|-- warehouse_loc: string
|-- data: struct
|    |-- strawberry: string
|    |-- pineapple: string
|    |-- mango: string
|    |-- pears: null


In [55]:
dyf_unbox.toDF().show()

+-------------+--------------------+
|warehouse_loc|                data|
+-------------+--------------------+
| TX_WAREHOUSE|{220, 560, 350, n...|
| CA_WAREHOUSE| {34, 123, 42, null}|
| CO_WAREHOUSE| {340, 180, 2, null}|
+-------------+--------------------+
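
Conceptually, unboxing a JSON string field is the same as parsing it in place. A plain-Python sketch with the standard `json` module, using a dictionary in place of a DynamicFrame record (the `unbox` helper is hypothetical):

```python
import json


def unbox(record, path):
    """Return a copy of the record with the JSON string at `path` parsed."""
    out = dict(record)
    out[path] = json.loads(record[path])
    return out


row = {
    "warehouse_loc": "TX_WAREHOUSE",
    "data": '{"strawberry":"220","pineapple":"560","mango":"350","pears":null}',
}
unboxed = unbox(row, "data")
print(unboxed["data"])
```

The JSON `null` becomes Python `None`, which corresponds to the `null` type shown for `pears` in the schema above.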


### [UnnestFrame](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-UnnestFrame.html)

Unnests a `DynamicFrame`, flattens nested objects to top-level elements, and generates join keys for array objects.

`UnnestFrame` allows us to flatten a single DynamicFrame to a more relational table format. We apply `UnnestFrame` to the nested structure from the previous operation and flatten it:

In [56]:
dyf_unnest = UnnestFrame.apply(frame=dyf_unbox)




In [57]:
dyf_unnest.toDF().printSchema()

root
 |-- warehouse_loc: string (nullable = true)
 |-- data.strawberry: string (nullable = true)
 |-- data.pineapple: string (nullable = true)
 |-- data.mango: string (nullable = true)
 |-- data.pears: null (nullable = true)


In [58]:
dyf_unnest.toDF().show()

+-------------+---------------+--------------+----------+----------+
|warehouse_loc|data.strawberry|data.pineapple|data.mango|data.pears|
+-------------+---------------+--------------+----------+----------+
| TX_WAREHOUSE|            220|           560|       350|      null|
| CA_WAREHOUSE|             34|           123|        42|      null|
| CO_WAREHOUSE|            340|           180|         2|      null|
+-------------+---------------+--------------+----------+----------+
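
The dotted column names come from joining each nested key to its parent's name. A plain-Python sketch of that flattening for nested dictionaries (the `unnest` helper is hypothetical and ignores arrays):

```python
def unnest(record, prefix=""):
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(unnest(value, name))
        else:
            flat[name] = value
    return flat


nested = {"warehouse_loc": "CA_WAREHOUSE",
          "data": {"strawberry": "34", "pears": None}}
flat = unnest(nested)
print(flat)
```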


### [DropNullFields](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-DropNullFields.html)

Drops all null fields in a `DynamicFrame` whose type is `NullType`. These are fields with missing or null values in every record in the `DynamicFrame` dataset.

The `DropNullFields` function makes it easy to drop columns in which every value is null. Our warehouse data shows that every warehouse is out of pears, so the `pears` column can be dropped. We apply `DropNullFields` to the DynamicFrame, which automatically identifies the all-null columns and drops them:

In [59]:
dyf_dropNullfields = DropNullFields.apply(frame=dyf_unnest)

null_fields ['`data.pears`']


In [60]:
dyf_dropNullfields.toDF().show()

+-------------+---------------+--------------+----------+
|warehouse_loc|data.strawberry|data.pineapple|data.mango|
+-------------+---------------+--------------+----------+
| TX_WAREHOUSE|            220|           560|       350|
| CA_WAREHOUSE|             34|           123|        42|
| CO_WAREHOUSE|            340|           180|         2|
+-------------+---------------+--------------+----------+
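
A plain-Python sketch of the same idea: find the fields that are null in every record, then remove them. The `drop_null_fields` helper is hypothetical and inspects values, whereas Glue works from the field's `NullType` in the schema.

```python
def drop_null_fields(records):
    """Drop every field that is None (or missing) in all records."""
    fields = {name for rec in records for name in rec}
    null_fields = sorted(
        f for f in fields if all(rec.get(f) is None for rec in records)
    )
    kept = [{k: v for k, v in rec.items() if k not in null_fields} for rec in records]
    return kept, null_fields


inventory = [
    {"warehouse_loc": "TX_WAREHOUSE", "data.mango": "350", "data.pears": None},
    {"warehouse_loc": "CA_WAREHOUSE", "data.mango": "42", "data.pears": None},
]
kept, null_fields = drop_null_fields(inventory)
print(null_fields)
```

A field that is null in only some records is kept, because it still carries data.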


### [SplitFields](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SplitFields.html)

Splits a `DynamicFrame` into two new ones, by specified fields. The function takes the field names for the first DynamicFrame that we want to generate (`paths`), followed by the names of the two resulting DynamicFrames (`name1` and `name2`):

In [61]:
dyf_splitFields = SplitFields.apply(frame=dyf_dropNullfields,
                                    paths=["`data.strawberry`", "`data.pineapple`"],
                                    name1="a", name2="b")




In [62]:
dyf_retrieve_a = SelectFromCollection.apply(dyf_splitFields, "a")
dyf_retrieve_a.toDF().show()

+---------------+--------------+
|data.strawberry|data.pineapple|
+---------------+--------------+
|            220|           560|
|             34|           123|
|            340|           180|
+---------------+--------------+


In [63]:
dyf_retrieve_b = SelectFromCollection.apply(dyf_splitFields, "b")
dyf_retrieve_b.toDF().show()

+-------------+----------+
|warehouse_loc|data.mango|
+-------------+----------+
| TX_WAREHOUSE|       350|
| CA_WAREHOUSE|        42|
| CO_WAREHOUSE|         2|
+-------------+----------+
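
In plain Python, the split amounts to partitioning each record's fields by membership in `paths`. The `split_fields` helper below is hypothetical and works on dictionaries rather than DynamicFrames:

```python
def split_fields(records, paths):
    """Return (records with only `paths`, records with everything else)."""
    wanted = set(paths)
    first = [{k: v for k, v in r.items() if k in wanted} for r in records]
    second = [{k: v for k, v in r.items() if k not in wanted} for r in records]
    return first, second


inventory = [{
    "warehouse_loc": "TX_WAREHOUSE",
    "data.strawberry": "220",
    "data.pineapple": "560",
    "data.mango": "350",
}]
frame_a, frame_b = split_fields(inventory, ["data.strawberry", "data.pineapple"])
print(frame_a)
print(frame_b)
```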


### [SplitRows](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SplitRows.html)

Creates a `DynamicFrameCollection` that contains two `DynamicFrames`. One `DynamicFrame` contains only the specified rows to be split, and the other contains all remaining rows.<br/>
`SplitRows` lets us split the dataset on a column condition. Here, rows whose pineapple count is greater than 100 and less than 200 go into the first DynamicFrame, and all remaining rows go into the second:

In [64]:
dyf_splitRows = SplitRows.apply(frame=dyf_dropNullfields,
                               comparison_dict={"`data.pineapple`": {
                                   ">": "100", 
                                   "<": "200"}},
                               name1='pa_200_less',
                               name2='pa_200_more')




In [65]:
dyf_pa_200_less = SelectFromCollection.apply(dyf_splitRows, "pa_200_less")
dyf_pa_200_less.toDF().show()

+-------------+---------------+--------------+----------+
|warehouse_loc|data.strawberry|data.pineapple|data.mango|
+-------------+---------------+--------------+----------+
| CA_WAREHOUSE|             34|           123|        42|
| CO_WAREHOUSE|            340|           180|         2|
+-------------+---------------+--------------+----------+


In [66]:
dyf_pa_200_more = SelectFromCollection.apply(dyf_splitRows, "pa_200_more")
dyf_pa_200_more.toDF().show()

+-------------+---------------+--------------+----------+
|warehouse_loc|data.strawberry|data.pineapple|data.mango|
+-------------+---------------+--------------+----------+
| TX_WAREHOUSE|            220|           560|       350|
+-------------+---------------+--------------+----------+
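
The `comparison_dict` can be read as a set of conditions that must all hold for a row to land in the first frame. A plain-Python sketch of that reading follows. The `split_rows` helper and the `OPS` table are hypothetical, and converting both sides to `int` before comparing is our assumption for this example, not a statement about how Glue compares values.

```python
import operator

# Comparators supported by this sketch only.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def split_rows(records, comparison_dict):
    """Return (rows matching every comparison, all remaining rows)."""
    matched, rest = [], []
    for rec in records:
        ok = all(
            OPS[op](int(rec[field]), int(bound))
            for field, checks in comparison_dict.items()
            for op, bound in checks.items()
        )
        (matched if ok else rest).append(rec)
    return matched, rest


inventory = [
    {"warehouse_loc": "TX_WAREHOUSE", "data.pineapple": "560"},
    {"warehouse_loc": "CA_WAREHOUSE", "data.pineapple": "123"},
    {"warehouse_loc": "CO_WAREHOUSE", "data.pineapple": "180"},
]
in_range, out_of_range = split_rows(
    inventory, {"data.pineapple": {">": "100", "<": "200"}}
)
print([r["warehouse_loc"] for r in in_range])
print([r["warehouse_loc"] for r in out_of_range])
```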


### [Spigot](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-spigot.html)

Writes sample records to a specified destination to help you verify the transformations performed by your AWS Glue job.<br/>
`Spigot` allows you to write a sample of the dataset to a destination during transformation. For our use case, we write the first 10 records to an S3 path:

In [67]:
import pprint

pprint.pprint(Spigot.describeArgs())

[{'defaultValue': None,
  'description': 'spigot this DynamicFrame',
  'name': 'frame',
  'optional': False,
  'type': 'DynamicFrame'},
 {'defaultValue': None,
  'description': 'file path to write spigot',
  'name': 'path',
  'optional': False,
  'type': 'string'},
 {'defaultValue': None,
  'description': 'topk -> first k records, prob -> probability of picking any '
                 'record',
  'name': 'options',
  'optional': True,
  'type': 'Json'}]


In [68]:
temp_dir = 's3://sagemaker-us-east-1-123456789012/glue/Spigot/'
# 'topk' selects the first k records (see the describeArgs output above)
dyf_splitFields = Spigot.apply(dyf_pa_200_less, temp_dir, {'topk': 10})




### [Write Dynamic Frame](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html)

The `write_dynamic_frame` function writes a DynamicFrame using the specified connection and format. For our use case, we write JSON to Amazon S3, using a `connection_type` of `s3` and passing the target path in `connection_options`:

In [69]:
glueContext.write_dynamic_frame.from_options(frame=dyf_splitFields,
                                             connection_options={'path': 's3://sagemaker-us-east-1-123456789012/glue/GlueOutput/'},
                                             connection_type='s3',
                                             format='json')

<awsglue.dynamicframe.DynamicFrame object at 0x7fcb32fe7fd0>


In [3]:
%stop_session

Stopping session: 8f8620a4-2a25-4f1a-a71c-11c7e398a189
Stopped session.


## Reference

* [Building an AWS Glue ETL pipeline locally without an AWS account (2020-09-21)](https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/)
* [Program AWS Glue ETL scripts in PySpark](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html)