# Read SOREL data
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0 and 3.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X and G.2X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



####  Run this cell to set up and start your interactive session.


In [1]:
%idle_timeout 2880
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 5
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 880307b5-1e29-4ab6-a835-fe7e589ab0f4
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
Waiting for session 880307b5-1e29-4ab6-a835-fe7e589ab0f4 to get into ready status...
Session 880307b5-1e29-4ab6-a835-fe7e589ab0f4 has been created.



#### Create a DataFrame using the binaryFile data source


In [None]:
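# TODO: pass job parameters externally instead of hard-coding them here.
# Alternative: a single object
#args = {"s3_bucket": "sorel-20m", "s3_key": "09-DEC-2020/binaries/0000029bfead495a003e43a7ab8406c6209ffb7d5e59dd212607aa358bfd66ea"}
# Active: the whole binaries prefix
args = {"s3_bucket": "sorel-20m", "s3_key": "09-DEC-2020/binaries"}
# Alternative: the demo bucket
#args = {"s3_bucket": "sorel-20m-demo", "s3_key": "tmp/binaries"}

bucket = args["s3_bucket"]
key = args["s3_key"]
key_path = "s3://{}/{}".format(bucket, key)

# Each row of a binaryFile DataFrame holds one file: path, modificationTime, length, content.
df = spark.read.format("binaryFile").load(key_path)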
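
The TODO in the cell above asks for job parameters to be passed in externally. A minimal sketch of one way to do that follows, assuming the parameters arrive as `--s3_bucket` and `--s3_key` command-line arguments. `resolve_job_args` is a hypothetical helper, not part of this notebook: it calls `getResolvedOptions` (already imported in the setup cell) when `awsglue` is importable, and falls back to `argparse` elsewhere so the logic can be tried outside Glue.

```python
import argparse


def resolve_job_args(argv, names):
    """Return {name: value} for the `--name value` pairs found in argv.

    argv is expected to look like sys.argv: argv[0] is the script name.
    Hypothetical helper; the fallback branch is an assumption for local use.
    """
    try:
        # Available inside AWS Glue jobs and interactive sessions.
        from awsglue.utils import getResolvedOptions
        resolved = getResolvedOptions(argv, names)
        return {name: resolved[name] for name in names}
    except ImportError:
        # Local fallback: parse the same `--name value` pairs with argparse.
        parser = argparse.ArgumentParser()
        for name in names:
            parser.add_argument("--{}".format(name), required=True)
        known, _unknown = parser.parse_known_args(argv[1:])
        return {name: getattr(known, name) for name in names}


# Example with an explicit argument list; in a job this would be sys.argv.
args = resolve_job_args(
    ["job.py", "--s3_bucket", "sorel-20m", "--s3_key", "09-DEC-2020/binaries"],
    ["s3_bucket", "s3_key"],
)
key_path = "s3://{}/{}".format(args["s3_bucket"], args["s3_key"])
```

In an interactive session `sys.argv` will not normally carry these parameters, so the explicit list form shown here is the more likely fit until the notebook is run as a job.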

## Read binary files

In [19]:
# Same read as df above; the Write Parquet cell below uses df, not binary_df.
binary_df = spark.read.format("binaryFile").load(key_path)
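
Spark's binaryFile source exposes each file's location in a `path` column. The commented-out single-object key in the parameters cell ends in a 64-character hexadecimal name, so a small helper can pull that name out of a path and check its shape. This is only a sketch: `object_name` and `looks_like_hex64` are hypothetical names, and the assumption that every object under the prefix follows the same naming is not confirmed by this notebook.

```python
import re
from posixpath import basename
from urllib.parse import urlparse

# 64 lowercase hexadecimal characters, as in the example key in the parameters cell.
_HEX64 = re.compile(r"[0-9a-f]{64}")


def object_name(path):
    """Return the last path component of a URI such as s3://bucket/a/b/name."""
    return basename(urlparse(path).path)


def looks_like_hex64(name):
    """True if name is exactly 64 lowercase hexadecimal characters."""
    return _HEX64.fullmatch(name) is not None


example_path = (
    "s3://sorel-20m/09-DEC-2020/binaries/"
    "0000029bfead495a003e43a7ab8406c6209ffb7d5e59dd212607aa358bfd66ea"
)
name = object_name(example_path)
```

Both functions are plain Python, so they could be wrapped in a Spark UDF and applied to the `path` column if a per-file identifier is wanted alongside `content`.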




## Write Parquet

In [None]:
# "overwrite" replaces any existing data under the output prefix.
df.write.mode("overwrite").parquet("s3://sorel-20m-demo/output")

In [22]:
%status
%list_sessions

There is no current session.
The first 25 sessions are:
0aa03fd3-ab6f-4d85-b882-0be986a1fd22
144285e4-a0c6-40c7-ad12-cd7df56562e3
1816a59b-d194-41f3-9d12-02bf90a098c5
1e367fb8-bf26-461f-b4b9-93a68db402b6
260bffd6-27e8-4487-8fe9-e547b1f79a2d
2fcace20-cec9-432d-9a60-6f67f5ff168b
3a26e41a-f615-4e0b-af93-751f9f952e7b
3edbd1f0-507c-4f00-8fce-5efe9815f4c8
416957e6-debf-426a-bd40-eda3df323fad
5618b7b1-d706-4b71-9aad-85dc985bd5fa
5f3b1b5d-59eb-4114-9503-9b68058b4841
62c54cec-660e-480b-b9cf-de0bc24f00dd
6766e8da-9786-41fe-89a9-978c7d71a053
697427db-8828-4441-82e9-e0cdf670eca0
716a100f-e125-4d4b-b69a-d18724c86304
82ae531d-206e-4d03-873e-f9733ec0cc71
8692fbe3-e31c-43db-b4bd-7788c69fdf65
8fd1ccd1-e001-463d-b26c-2c94b4d670d8
94be9d3a-1aad-47df-8566-c4a786313278
aa598c53-03c6-4dc6-94be-5733e8cdd801
bfe58f27-8c26-4e54-ac5c-7c5a81561e68
cf345940-2c40-453e-9c3c-656dbd454444
d832a791-052b-4491-bf7b-a4e4957a0c0e
da84b4df-2bab-419b-904b-932edc29b267
df629284-b003-4637-822f-4e4d64339007


In [19]:
%stop_session

There is no current session.
