### Kinesis Streams Data Processing

This notebook demonstrates how to read data from AWS Kinesis streams with PySpark, clean it, and write it to Delta tables. The code performs the following operations:

1. Reads the AWS access key and secret key from a CSV file.
2. Initialises stream readers for three Kinesis streams: "streaming-0a48d8473ced-pin", "streaming-0a48d8473ced-geo", and "streaming-0a48d8473ced-user".
3. Retrieves and displays DataFrames from the respective Kinesis streams.
4. Cleans each DataFrame.
5. Writes the cleaned data to Delta tables.


#### Configuration for AWS
The following code reads the AWS access key and secret key from a CSV file. These credentials allow the notebook to access the Kinesis streams and retrieve their data.

In [0]:
# List the uploaded files to confirm the credentials CSV is present
dbutils.fs.ls("/FileStore/tables")

from pyspark.sql.types import *
from pyspark.sql.functions import *
import urllib.parse

# CSV read options
file_type = "csv"
first_row_is_header = "true"
delimiter = ","

# Read the CSV file into a Spark DataFrame
aws_keys_df = spark.read.format(file_type) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load("/FileStore/tables/authentication_credentials.csv")

# Get the AWS access key and secret key from the Spark DataFrame
ACCESS_KEY = aws_keys_df.where(col('User name') == 'databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name') == 'databricks-user').select('Secret access key').collect()[0]['Secret access key']
# URL-encode the secret key; this form is only needed if the key is embedded in a URL.
# The Kinesis readers below use the raw SECRET_KEY.
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
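
Calling urllib.parse.quote with safe="" percent-encodes every reserved character, including "/" and "+", which AWS secret keys can contain. A minimal sketch of the effect follows; the key below is made up for illustration and is not a real credential.

```python
from urllib.parse import quote, unquote

# Hypothetical secret key for illustration only; not a real credential.
example_secret = "abc/DEF+ghi"

# safe="" means no character is exempt, so "/" is encoded as well.
encoded = quote(example_secret, safe="")
print(encoded)

# Decoding restores the original key unchanged.
decoded = unquote(encoded)
print(decoded == example_secret)
```

Without safe="", quote leaves "/" untouched, which would break a URL that embeds the key.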

#### Initialise Stream Readers
Next, initialise stream readers for the three Kinesis streams: "streaming-0a48d8473ced-pin", "streaming-0a48d8473ced-geo", and "streaming-0a48d8473ced-user".

In [0]:
df_pin = spark \
.readStream \
.format('kinesis') \
.option('streamName','streaming-0a48d8473ced-pin') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

display(df_pin)

df_geo = spark \
.readStream \
.format('kinesis') \
.option('streamName','streaming-0a48d8473ced-geo') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

display(df_geo)

df_user = spark \
.readStream \
.format('kinesis') \
.option('streamName','streaming-0a48d8473ced-user') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

display(df_user)

partitionKey,data,stream,shardId,sequenceNumber,approximateArrivalTimestamp
5730,eyJpbmQiOjU3MzAsImZpcnN0X25hbWUiOiJSYWNoZWwiLCJsYXN0X25hbWUiOiJEYXZpcyIsImFnZSI6MzYsImRhdGVfam9pbmVkIjoiMjAxNS0xMi0wOCAyMDowMjo0MyJ9,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122891451991030310891572690946,2023-10-17T21:48:21.945+0000
8304,eyJpbmQiOjgzMDQsImZpcnN0X25hbWUiOiJDaGFybGVzIiwibGFzdF9uYW1lIjoiQmVycnkiLCJhZ2UiOjI1LCJkYXRlX2pvaW5lZCI6IjIwMTUtMTItMjggMDQ6MjE6MzkifQ==,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122892660916849925726905827330,2023-10-17T21:48:24.748+0000
7554,eyJpbmQiOjc1NTQsImZpcnN0X25hbWUiOiJDaGVyeWwiLCJsYXN0X25hbWUiOiJIdWVydGEiLCJhZ2UiOjIwLCJkYXRlX2pvaW5lZCI6IjIwMTctMDQtMTEgMTY6MzU6MzMifQ==,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122905959100865688297095036930,2023-10-17T21:48:49.167+0000
3156,eyJpbmQiOjMxNTYsImZpcnN0X25hbWUiOiJBbmRyZXciLCJsYXN0X25hbWUiOiJCYWtlciIsImFnZSI6MjIsImRhdGVfam9pbmVkIjoiMjAxNS0xMi0yMSAwODowNjo1NCJ9,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122913212655783376965496471554,2023-10-17T21:49:02.116+0000
2074,eyJpbmQiOjIwNzQsImZpcnN0X25hbWUiOiJBbm5ldHRlIiwibGFzdF9uYW1lIjoiRm9yYmVzIiwiYWdlIjoyMSwiZGF0ZV9qb2luZWQiOiIyMDE2LTAxLTAzIDE1OjQyOjEyIn0=,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122916839433242221334056927234,2023-10-17T21:49:08.510+0000
9979,eyJpbmQiOjk5NzksImZpcnN0X25hbWUiOiJLYXlsZWUiLCJsYXN0X25hbWUiOiJNaWxsZXIiLCJhZ2UiOjMxLCJkYXRlX2pvaW5lZCI6IjIwMTYtMTEtMDkgMTk6NTA6NTEifQ==,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122918048359061836031951110146,2023-10-17T21:49:10.388+0000
10138,eyJpbmQiOjEwMTM4LCJmaXJzdF9uYW1lIjoiQ2Fyb2wiLCJsYXN0X25hbWUiOiJTaWx2YSIsImFnZSI6MjIsImRhdGVfam9pbmVkIjoiMjAxNS0xMi0zMSAxNDo1NzowMiJ9,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122919257284881450798564769794,2023-10-17T21:49:12.197+0000
8887,eyJpbmQiOjg4ODcsImZpcnN0X25hbWUiOiJBdXN0aW4iLCJsYXN0X25hbWUiOiJSb2RyaWd1ZXoiLCJhZ2UiOjI0LCJkYXRlX2pvaW5lZCI6IjIwMTYtMDMtMzEgMjA6NTY6MzkifQ==,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118122922884062340295304564178946,2023-10-17T21:49:20.621+0000
5730,eyJpbmQiOjU3MzAsImZpcnN0X25hbWUiOiJSYWNoZWwiLCJsYXN0X25hbWUiOiJEYXZpcyIsImFnZSI6MzYsImRhdGVfam9pbmVkIjoiMjAxNS0xMi0wOCAyMDowMjo0MyJ9,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118127591755577692149651527958530,2023-10-17T22:27:17.555+0000
8304,eyJpbmQiOjgzMDQsImZpcnN0X25hbWUiOiJDaGFybGVzIiwibGFzdF9uYW1lIjoiQmVycnkiLCJhZ2UiOjI1LCJkYXRlX2pvaW5lZCI6IjIwMTUtMTItMjggMDQ6MjE6MzkifQ==,streaming-0a48d8473ced-user,shardId-000000000000,49645573708681571896462118127592964503511764418141618178,2023-10-17T22:27:20.461+0000
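
The data column in the output above is binary, and the notebook display renders it as Base64 text. As a quick check outside Spark, the first payload shown above can be decoded with the Python standard library to reveal the JSON record it carries. This is only an inspection aid, not part of the streaming pipeline.

```python
import base64
import json

# First `data` value from the streaming-0a48d8473ced-user output above.
encoded_payload = (
    "eyJpbmQiOjU3MzAsImZpcnN0X25hbWUiOiJSYWNoZWwiLCJsYXN0X25hbWUiOiJEYXZpcyIs"
    "ImFnZSI6MzYsImRhdGVfam9pbmVkIjoiMjAxNS0xMi0wOCAyMDowMjo0MyJ9"
)

# Base64 text -> raw bytes -> UTF-8 JSON text -> Python dict.
record = json.loads(base64.b64decode(encoded_payload).decode("utf-8"))
print(record["first_name"], record["last_name"], record["age"])
```

The decoded fields match the ind, first_name, last_name, age, and date_joined schema used later when cleaning df_user.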


#### Display Data Frames
This code snippet casts the payload of each Kinesis stream to a string and displays the resulting DataFrames.

Kinesis delivers each record's payload in the binary data column. The selectExpr("CAST(data as STRING)") operation converts that payload into a JSON string, which is easier to parse in the cleaning steps.
The subsequent display calls show the contents of df_pin, df_geo, and df_user.

In [0]:
df_pin = df_pin.selectExpr("CAST(data as STRING)")
display(df_pin)
df_geo = df_geo.selectExpr("CAST(data as STRING)")
display(df_geo)
df_user = df_user.selectExpr("CAST(data as STRING)")
display(df_user)


### Data Cleaning

The following code snippets demonstrate the data cleaning process for each DataFrame. These operations ensure the quality and consistency of the data in preparation for subsequent analysis and processing.

#### Data Cleaning for df_pin

In [0]:
from pyspark.sql.functions import from_json, col, regexp_replace
from pyspark.sql.types import StringType, StructField, StructType

# Parse the JSON string into a DataFrame with the corresponding schema
# Define the schema based on the JSON structure
pin_schema = StructType([
    StructField("index", StringType(), True),
    StructField("unique_id", StringType(), True),
    StructField("title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("poster_name", StringType(), True),
    StructField("follower_count", StringType(), True),
    StructField("tag_list", StringType(), True),
    StructField("is_image_or_video", StringType(), True),
    StructField("image_src", StringType(), True),
    StructField("downloaded", StringType(), True),
    StructField("save_location", StringType(), True),
    StructField("category", StringType(), True)
    ])
# Extract fields from the JSON string and create separate columns
df_pin = df_pin.select(from_json(col("data"), pin_schema).alias("data")).select("data.*")

# Replace empty or irrelevant entries with None
df_pin = df_pin.replace('', None)
df_pin = df_pin.replace(' ', None)
df_pin = df_pin.replace('nan', None)
# Convert follower_count to integer, expanding the 'k' and 'M' suffixes
df_pin = df_pin.withColumn('follower_count', regexp_replace('follower_count', 'k', '000'))
df_pin = df_pin.withColumn('follower_count', regexp_replace('follower_count', 'M', '000000'))
df_pin = df_pin.withColumn('follower_count', df_pin['follower_count'].cast('int'))
# Clean the save_location column by removing the 'Local save in ' prefix
df_pin = df_pin.withColumn('save_location', regexp_replace('save_location', 'Local save in ', ''))
# Rename the index column to 'ind'
df_pin = df_pin.withColumnRenamed('index', 'ind')
# Reorder the DataFrame columns; the 'downloaded' column is not selected and is therefore dropped
df_pin = df_pin.select('ind', 'unique_id', 'title', 'description', 'follower_count', 'poster_name', 'tag_list', 'is_image_or_video', 'image_src', 'save_location', 'category')
display(df_pin)

ind,unique_id,title,description,follower_count,poster_name,tag_list,is_image_or_video,image_src,save_location,category
5730.0,1e1f0c8b-9fcf-460b-9154-c775827206eb,Island Oasis Coupon Organizer,"Description Coupon Organizer in a fun colorful fabric -island oasis, Great Size for the ""basic"" couponer - holds up to 500 coupons with ease, and is made long enough so that you…",0.0,Consuelo Aguirre,"Grocery Items,Grocery Coupons,Care Organization,Coupon Organization,Extreme Couponing,Couponing 101,Life Binder,Save My Money,Love Coupons",image,https://i.pinimg.com/originals/65/bb/ea/65bbeaf458907bb079317d8303c4fa0e.jpg,/data/finance,finance
8304.0,5b6d0913-25e4-43ab-839d-85d5516f78a4,The #1 Reason You’re Not His Priority Anymore - Matthew Coast,#lovequotes #matchmaker #matchmadeinheaven #loveyourself #respectyourself,51000.0,Commitment Connection,"Wise Quotes,Quotable Quotes,Words Quotes,Wise Words,Quotes To Live By,Great Quotes,Motivational Quotes,Inspirational Quotes,Funny Quotes",image,https://i.pinimg.com/originals/c6/64/ee/c664ee71524fb5a6e7b7b49233f93b43.png,/data/quotes,quotes
7554.0,c6fa12f4-0d4a-4b07-a335-5bf9f37f8281,Craig Style,imgentleboss: “ - More about men’s fashion at @Gentleboss - GB’s Facebook - ”,940.0,iElylike ..✿◕‿◕✿ஐ✿◕‿◕✿,"Mens Fashion Blog,Look Fashion,Autumn Fashion,Fashion News,Fashion Sale,80s Fashion,Paris Fashion,Runway Fashion,Fashion Trends",image,https://i.pinimg.com/originals/e7/6e/8e/e76e8ed6cc838b84a934c6948a5caff7.jpg,/data/mens-fashion,mens-fashion
3156.0,fa6e31a4-18c2-4eca-a6d8-e903eee2c2a4,Handprint Reindeer Ornaments - Crafty Morning,"This post may contain affiliate links, read our Disclosure Policy for more information. As an Amazon Associate I earn from qualifying purchases, thank you! Make some cute handpr…",892000.0,Michelle {CraftyMorning.com},"Christmas Gifts For Parents,Christmas Decorations For Kids,Christmas Crafts For Toddlers,Preschool Christmas,Christmas Crafts For Gifts,Christmas Activities,Toddler Crafts,Kids Christmas,Christmas Feeling",image,https://i.pinimg.com/originals/ff/fe/38/fffe384f3ec18a0d87cb2d80cc8c1499.jpg,/data/diy-and-crafts,diy-and-crafts
2074.0,86ed09a7-842d-496d-9501-010c654eb340,35 Christmas Decorating Ideas We Bet You Haven't Thought Of,20 Christmas Decorating Ideas We Bet You Haven't Thought Of via @PureWow,868000.0,PureWow,"Holiday Centerpieces,Xmas Decorations,Centerpiece Ideas,Table Centerpieces,Valentine Decorations,Wedding Centerpieces,Outdoor Decorations,Christmas Centerpieces With Candles,Christmas Dining Table Decorations",image,https://i.pinimg.com/originals/e9/b9/f0/e9b9f01cc3b2cf41948b45854335396c.jpg,/data/christmas,christmas
9979.0,2b2abc85-fc51-481f-8ae6-17681993da28,Paris in the Summer. 10 fun things to do in Paris in the Summertime • Petite in Paris,"Are you traveling to Paris during the summer? Find out what to do in Paris, France during the summer. Fun summertime activities in Paris. Enjoy the incredible outdoors when trav…",3000.0,Petite in Paris,"Torre Eiffel Paris,Tour Eiffel,Picnic In Paris,Hello France,Voyage Europe,Destination Voyage,Beautiful Places To Travel,Travel Aesthetic,Paris Travel",image,https://i.pinimg.com/originals/6c/4c/90/6c4c90bba27ebf8c8bfe4c1acfb9f07a.jpg,/data/travel,travel
10138.0,927c4658-cc3f-4b92-9b5c-70743d0c238d,"14 Amazing Things To Do In Costa Rica | Volcanoes, Waterfalls, Wildlife And More","This Costa Rica itinerary is the ultimate guide to spending two weeks in Costa Rica. Find out about visiting La Fortuna, Arenal, Monteverde, Naranjo, Corcovado National Park, Or…",10000.0,"Wanderlust Chloe ✈️ Travel guides, inspo and adventure travel ✈️","Costa Rica Travel,Rio Celeste Costa Rica,Dream Vacations,Vacation Spots,Vacation Travel,Travel Pictures,Travel Photos,Fortuna Costa Rica,Costa Rica Pictures",image,https://i.pinimg.com/originals/30/93/cb/3093cb01d9de2d125fda8ba5e3e41946.jpg,/data/travel,travel
8887.0,5df9f6e5-07f5-4ce8-a82e-96586bbc05d8,25 Ultra Sexy Back Tattoo Ideas For Girls,Tattoos are one of the most efficient ways through which one decides to express themselves…,4000.0,RapidLeaks,"Dream Tattoos,Body Art Tattoos,New Tattoos,Small Tattoos,Cross Tattoos,Random Tattoos,Fashion Tattoos,Bird Tattoos,Fitness Tattoos",image,https://i.pinimg.com/originals/ab/8e/50/ab8e505b04d4abc8f23e273c15f8a65d.jpg,/data/tattoos,tattoos
7922.0,a584581c-1b38-4731-a1cc-f36115ecf229,45 Top Life Quotes School Did Not Teach You,summcoco gives you inspiration for the women fashion trends you want. Thinking about a new look or lifestyle? This is your ultimate resource to get the hottest trends. 45 Top Li…,306000.0,"Sumcoco | Decor Ideas, Hairstyles, Nails Fashion Advice","Life Quotes Love,Inspirational Quotes About Love,Mood Quotes,Motivational Quotes,Tears Quotes,Quotes About Sadness,Deep Quotes About Life,Quotes Quotes,Quote Life",image,https://i.pinimg.com/originals/bb/c0/e6/bbc0e6a797079505f11ac12bcb0b8c66.jpg,/data/quotes,quotes
5730.0,1e1f0c8b-9fcf-460b-9154-c775827206eb,Island Oasis Coupon Organizer,"Description Coupon Organizer in a fun colorful fabric -island oasis, Great Size for the ""basic"" couponer - holds up to 500 coupons with ease, and is made long enough so that you…",0.0,Consuelo Aguirre,"Grocery Items,Grocery Coupons,Care Organization,Coupon Organization,Extreme Couponing,Couponing 101,Life Binder,Save My Money,Love Coupons",image,https://i.pinimg.com/originals/65/bb/ea/65bbeaf458907bb079317d8303c4fa0e.jpg,/data/finance,finance
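
The follower_count conversion above expands the 'k' and 'M' suffixes by text substitution, which assumes the counts are whole numbers: a value such as 1.5k, if one ever appeared, would become 1.5000 rather than 1500. The plain-Python sketch below shows a multiplier-based alternative. The helper name parse_follower_count and the sample inputs are hypothetical and not part of the notebook; in Spark the same idea would need a UDF or column arithmetic.

```python
from typing import Optional

# Multipliers for the suffixes handled in the notebook ('k' and 'M').
SUFFIX_MULTIPLIERS = {"k": 1_000, "M": 1_000_000}


def parse_follower_count(raw: Optional[str]) -> Optional[int]:
    """Convert strings such as '51k' or '5M' to an integer count."""
    if raw is None or raw.strip() == "":
        return None
    text = raw.strip()
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before parsing the number
    try:
        # float() accepts decimals such as '1.5' before scaling.
        return int(round(float(text) * multiplier))
    except (ValueError, OverflowError):
        # Non-numeric or non-finite input maps to None, like a failed cast.
        return None


# Hypothetical sample values, including one non-numeric entry.
for value in ["940", "51k", "5M", "1.5k", "not a number"]:
    print(value, "->", parse_follower_count(value))
```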


#### Data Cleaning for df_geo

In [0]:
# Take the JSON string and convert into a data frame with the corresponding schema
# Define the schema based on the JSON structure
geo_schema = StructType([
    StructField("ind", StringType(), True),
    StructField("timestamp", StringType(), True),
    StructField("latitude", StringType(), True),
    StructField("longitude", StringType(), True),
    StructField("country", StringType(), True)
    ])    

# Extract fields from the JSON string and create separate columns
df_geo = df_geo.select(from_json(col("data"), geo_schema).alias("data")).select("data.*")

# Create a new column that contains an array built from the latitude and longitude columns
df_geo = df_geo.withColumn('coordinates', array('latitude', 'longitude'))
df_geo = df_geo.drop('latitude', 'longitude')
# Convert the timestamp column from a string to a timestamp data type
df_geo = df_geo.withColumn('timestamp', df_geo['timestamp'].cast('timestamp'))
# Reorder the DataFrame columns
df_geo = df_geo.select('ind', 'country', 'coordinates', 'timestamp')
display(df_geo)

ind,country,coordinates,timestamp
5730.0,Colombia,"List(-77.015, -101.437)",2021-04-19T17:37:03.000+0000
8304.0,French Guiana,"List(-28.8852, -164.87)",2019-09-13T04:50:29.000+0000
7554.0,Sudan,"List(-51.2172, -77.9768)",2019-03-20T03:15:07.000+0000
3156.0,Armenia,"List(-84.738, -160.795)",2018-01-13T19:33:49.000+0000
2074.0,Central African Republic,"List(-52.3213, -50.11)",2019-11-03T05:41:59.000+0000
9979.0,Dominican Republic,"List(14.9967, -120.682)",2018-07-18T19:01:46.000+0000
10138.0,Austria,"List(-72.142, -74.3545)",2019-08-03T00:59:29.000+0000
8887.0,Botswana,"List(-28.0137, -160.708)",2021-09-19T05:27:43.000+0000
5730.0,Colombia,"List(-77.015, -101.437)",2021-04-19T17:37:03.000+0000
8304.0,French Guiana,"List(-28.8852, -164.87)",2019-09-13T04:50:29.000+0000
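
The same geo transformation can be sketched on a single record in plain Python, which is handy for checking the logic without a running stream. The sample payload is an assumption reconstructed from the first output row above; the raw JSON layout and timestamp format of the geo stream are not shown in this notebook, and clean_geo_record is a hypothetical helper name.

```python
import json
from datetime import datetime

# Assumed raw payload, rebuilt from the first row of the output above.
raw_json = (
    '{"ind": 5730, "timestamp": "2021-04-19 17:37:03", '
    '"latitude": -77.015, "longitude": -101.437, "country": "Colombia"}'
)


def clean_geo_record(payload: str) -> dict:
    """Mirror the df_geo steps: merge coordinates, parse timestamp, reorder."""
    record = json.loads(payload)
    return {
        "ind": record["ind"],
        "country": record["country"],
        # latitude and longitude are combined into one list, then dropped
        "coordinates": [record["latitude"], record["longitude"]],
        # assumed 'YYYY-MM-DD HH:MM:SS' format
        "timestamp": datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S"),
    }


cleaned = clean_geo_record(raw_json)
print(cleaned)
```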


#### Data Cleaning for df_user

In [0]:
# Take the JSON string and convert into a data frame with the corresponding schema
# Define the schema based on the JSON structure
user_schema = StructType([
    StructField("ind", StringType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("age", StringType(), True),
    StructField("date_joined", StringType(), True)
    ])

# Extract fields from the JSON string and create separate columns
df_user = df_user.select(from_json(col("data"), user_schema).alias("data")).select("data.*")

# Clean the df_user DataFrame
# Create a new column user_name that concatenates the first_name and last_name columns
df_user = df_user.withColumn('user_name', concat(df_user['first_name'], lit(' '), df_user['last_name']))
df_user = df_user.drop('first_name', 'last_name')
# Convert the date_joined column from a string to a timestamp data type
df_user = df_user.withColumn('date_joined', df_user['date_joined'].cast('timestamp'))
# Reorder the DataFrame columns
df_user = df_user.select('ind', 'user_name', 'age', 'date_joined')
display(df_user)

ind,user_name,age,date_joined
5730.0,Rachel Davis,36.0,2015-12-08T20:02:43.000+0000
8304.0,Charles Berry,25.0,2015-12-28T04:21:39.000+0000
7554.0,Cheryl Huerta,20.0,2017-04-11T16:35:33.000+0000
3156.0,Andrew Baker,22.0,2015-12-21T08:06:54.000+0000
2074.0,Annette Forbes,21.0,2016-01-03T15:42:12.000+0000
9979.0,Kaylee Miller,31.0,2016-11-09T19:50:51.000+0000
10138.0,Carol Silva,22.0,2015-12-31T14:57:02.000+0000
8887.0,Austin Rodriguez,24.0,2016-03-31T20:56:39.000+0000
5730.0,Rachel Davis,36.0,2015-12-08T20:02:43.000+0000
8304.0,Charles Berry,25.0,2015-12-28T04:21:39.000+0000


#### Save Cleaned Data as Delta Tables

The cleaned data is saved as Delta tables. Each of the DataFrames, namely df_pin, df_geo, and df_user, is written to its own Delta table so that the cleaned data is stored for future analysis and processing.
Previously saved checkpoints are deleted so that each run starts the streaming queries from a clean state.
The option("checkpointLocation", ...) argument specifies where a streaming query records its progress, which provides fault tolerance and data consistency in the event of failures. Each streaming query needs its own checkpoint directory, so the three writers use separate subdirectories of /tmp/kinesis/_checkpoints/.


In [0]:
# Delete previous checkpoints so the queries start from a clean state
dbutils.fs.rm("/tmp/kinesis/_checkpoints/", True)

# Save df_pin as Delta table
df_pin.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/tmp/kinesis/_checkpoints/0a48d8473ced_pin_table/").table("0a48d8473ced_pin_table")

# Save df_geo as Delta table
df_geo.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/tmp/kinesis/_checkpoints/0a48d8473ced_geo_table/").table("0a48d8473ced_geo_table")

# Save df_user as Delta table
df_user.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/tmp/kinesis/_checkpoints/0a48d8473ced_user_table/").table("0a48d8473ced_user_table")
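
Structured Streaming expects every query to have a checkpoint directory of its own, because the directory stores that query's offsets and state. One way to keep the paths distinct is to derive them from the table names, as in the sketch below. The helper name checkpoint_for is hypothetical; the root path and table names are the ones used in this notebook.

```python
# Root directory used by the notebook for streaming checkpoints.
CHECKPOINT_ROOT = "/tmp/kinesis/_checkpoints"

# Target Delta table names from the notebook.
TABLE_NAMES = [
    "0a48d8473ced_pin_table",
    "0a48d8473ced_geo_table",
    "0a48d8473ced_user_table",
]


def checkpoint_for(table_name: str) -> str:
    """Return a checkpoint directory that is unique to one table."""
    return f"{CHECKPOINT_ROOT}/{table_name}/"


checkpoints = {name: checkpoint_for(name) for name in TABLE_NAMES}
for name, path in checkpoints.items():
    print(name, "->", path)

# In the notebook, each path would be passed to its own writer, for example:
# .option("checkpointLocation", checkpoints["0a48d8473ced_pin_table"])
```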
