
# Glue Studio Notebook
You are now running a **Glue Studio** notebook. Before you can use it, you *must* start an interactive session. The second cell in this notebook contains all the magics needed to start your session, so all you need to do is execute it.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional JARs to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0.                                                       |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %delete_session             |              |  Deletes the current session and kills the cluster. User stops being charged.                                                                             |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |

In [4]:
#studentcell:keep
# Execute this cell to configure and start your interactive session.
%session_id_prefix my-session-yftht
%idle_timeout 60
%%configure
{
"region": "us-east-1",
"iam_role": "arn:aws:iam::877047854295:role/LabRole"
}

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

It looks like there is a newer version of the kernel available. The latest version is stderr='ERROR:Exception:\nTraceback(mostrecentcalllast and you have 0.24 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Setting session ID prefix to my-session-yftht
Current idle_timeout is None minutes.
idle_timeout has been set to 60 minutes.
The following configurations have been updated: {'region': 'us-east-1', 'iam_role': 'arn:aws:iam::877047854295:role/LabRole'}


In [None]:
#studentcell:keep
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Authenticating with profile=default
glue_role_arn defined by user: arn:aws:iam::877047854295:role/LabRole
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: my-session-yftht-55db0f82-712f-4d30-bfd4-3b3f6bbc010c
Applying the following default arguments:
--glue_kernel_version 0.24
--enable-glue-datacatalog true
Waiting for session my-session-yftht-55db0f82-712f-4d30-bfd4-3b3f6bbc010c to get into ready status...
Session my-session-yftht-55db0f82-712f-4d30-bfd4-3b3f6bbc010c has been created




In [8]:
#studentcell:keep
from pyspark.sql.functions import *




In [22]:
#studentcell:keep
df = spark.read.parquet('s3://bigdatarepo/data/reddit/subreddits/AskReddit/ym_partition=202106')




In [23]:
#studentcell:keep
df.printSchema()

root
 |-- all_awardings: string (nullable = true)
 |-- associated_award: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_created_utc: double (nullable = true)
 |-- author_flair_background_color: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_richtext: string (nullable = true)
 |-- author_flair_template_id: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_flair_text_color: string (nullable = true)
 |-- author_flair_type: string (nullable = true)
 |-- author_fullname: string (nullable = true)
 |-- author_patreon_flair: boolean (nullable = true)
 |-- author_premium: boolean (nullable = true)
 |-- awarders: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- can_mod_post: boolean (nullable = true)
 |-- collapsed: boolean (nullable = true)
 |-- collapsed_because_crowd_control: string (nullable = true)
 |-- collapsed_reason: 

In [24]:
#studentcell:keep
df.show(1)

+-------------+----------------+-----------+------------------+-----------------------------+----------------------+---------------------+------------------------+-----------------+-----------------------+-----------------+---------------+--------------------+--------------+--------+----------------+--------+------------+---------+-------------------------------+----------------+------------+----------------+-----------+-------------+------+------+--------+-------+------------+---------+------+---------+---------+--------------------+-----------+--------------+------------+-----+------------+--------+------------+-----------------------+--------------+----------------+---------------------+--------------+--------------+--------+--------------+
|all_awardings|associated_award|     author|author_created_utc|author_flair_background_color|author_flair_css_class|author_flair_richtext|author_flair_template_id|author_flair_text|author_flair_text_color|author_flair_type|author_fullname|author_

In [14]:
#studentcell:keep
df.select('created_utc').show(5)

+-----------+
|created_utc|
+-----------+
| 1624393692|
| 1624393693|
| 1624393693|
| 1624393693|
| 1624393693|
+-----------+
only showing top 5 rows


Convert the Unix timestamp in `created_utc` into a datetime column named `created_utc_dt`.

In [32]:
#studentcell:keep
df = df.withColumn('created_utc_dt',                   )

In [32]:
#studentcell:drop
df = df.withColumn('created_utc_dt', from_unixtime(col('created_utc')))
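
As a sanity check on the conversion, the first `created_utc` value from the sample output can be converted with Python's standard library. This is only a sketch: it assumes UTC, whereas Spark's `from_unixtime` formats the result in the session time zone and returns it as a string column.

```python
from datetime import datetime, timezone

# First value shown in the created_utc sample output above.
created_utc = 1624393692

# Interpret the epoch seconds as a timezone-aware UTC datetime.
created_dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)

# Same 'yyyy-MM-dd HH:mm:ss' layout that from_unixtime uses by default.
formatted = created_dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2021-06-22 20:28:12
```

The date falls in June 2021, which matches the `ym_partition=202106` folder the data was read from.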




Use the `datetime` library to build the `time_current` and `time_window` variables.

In [45]:
#studentcell:keep
from datetime import datetime, timedelta
time_current = datetime(           ) # June 30, 2021
time_window = timedelta(         ) # 24 hours




In [None]:
#studentcell:drop
from datetime import datetime, timedelta
time_current = datetime(2021,6,30)
time_window = timedelta(hours = 24)
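
Subtracting a `timedelta` from a `datetime` gives the cutoff used in the next step. A minimal sketch, where `cutoff` is a name introduced here for illustration and is not defined in the notebook:

```python
from datetime import datetime, timedelta

time_current = datetime(2021, 6, 30)  # midnight at the start of June 30, 2021
time_window = timedelta(hours=24)

# Everything created after this instant falls inside the 24-hour window.
cutoff = time_current - time_window
print(cutoff)  # 2021-06-29 00:00:00
```

Because `time_current` is midnight at the start of June 30, the window covers June 29 onward.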

Keep only the Reddit comments whose `score` is greater than 5000 and that were created after the `time_current - time_window` cutoff.

In [None]:
#studentcell:keep
high_posts_df = df.filter(                                                  )

In [40]:
#studentcell:drop
high_posts_df = df.filter((col('score') > lit(5000)) & (time_current - time_window  < col('created_utc_dt')))
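
The two conditions can be checked on a handful of rows in plain Python. The rows below are hypothetical and exist only to show which combinations pass; they are not taken from the dataset.

```python
from datetime import datetime, timedelta

time_current = datetime(2021, 6, 30)
time_window = timedelta(hours=24)
cutoff = time_current - time_window

# Hypothetical rows, made up for illustration.
rows = [
    {"score": 6200, "created_utc_dt": datetime(2021, 6, 29, 13, 5)},    # passes both tests
    {"score": 6200, "created_utc_dt": datetime(2021, 6, 28, 23, 59)},   # older than the cutoff
    {"score": 120, "created_utc_dt": datetime(2021, 6, 29, 18, 0)},     # score too low
    {"score": 5000, "created_utc_dt": datetime(2021, 6, 29, 18, 0)},    # not strictly greater than 5000
]

# Same logic as the Spark filter: high score AND newer than the cutoff.
kept = [r for r in rows if r["score"] > 5000 and cutoff < r["created_utc_dt"]]
print(len(kept))  # 1
```

In the PySpark version each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `<` in Python. Note also that neither version sets an upper bound, so rows dated after `time_current` would pass as well.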




In [42]:
#studentcell:keep
if high_posts_df.count() > 0:
    print('high count posts to save')

1334


Save your results to a separate folder whose name is tagged with the date.

In [None]:
#studentcell:keep
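# Tag the folder with the date only; str(time_current) would add a space and colons to the S3 key.
high_posts_df.write.parquet(f's3://[[YOUR-BUCKET-NAME]]/high_posts/{time_current:%Y-%m-%d}')

In [44]:
#studentcell:drop
high_posts_df.write.parquet(f's3://aem303test/high_posts/{time_current:%Y-%m-%d}')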
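
Calling `str()` on a `datetime` produces a value with a space and colons, which is awkward inside an S3 key. The sketch below shows one way to build a date-only tag; the helper `dated_prefix` and the bucket name `example-bucket` are hypothetical and are not part of the notebook.

```python
from datetime import datetime


def dated_prefix(bucket: str, run_time: datetime) -> str:
    """Build an S3 prefix tagged with the date part of run_time only."""
    return f"s3://{bucket}/high_posts/{run_time:%Y-%m-%d}"


time_current = datetime(2021, 6, 30)

print(str(time_current))                             # 2021-06-30 00:00:00
print(dated_prefix("example-bucket", time_current))  # s3://example-bucket/high_posts/2021-06-30
```

A date-only tag also keeps one folder per day, which makes reruns for the same day easy to find.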


