## Mounting an S3 drive and Data Analysis with PySpark

This PySpark notebook analyses three datasets retrieved from AWS S3: Pinterest posts, user demographics, and geolocation data. The code reads and cleans these datasets, then derives insights such as the most popular categories by country and age group, follower counts, and the number of users who joined each year.

In [None]:
dbutils.fs.ls("/FileStore/tables")

### Mount S3 Drive

The first step reads the AWS access key and secret key from a credentials CSV file stored under `/FileStore/tables`, URL-encodes the secret key, and builds the source URL for the S3 bucket. The code then mounts the bucket within the Databricks notebook environment. Once mounted, the bucket's contents can be read through the mount path like any other directory in subsequent cells.

In [None]:
from pyspark.sql.functions import *
import urllib.parse

file_type = "csv"
first_row_is_header = "true"
delimiter = ","
# Read the CSV file to spark dataframe
aws_keys_df = spark.read.format(file_type)\
.option("header", first_row_is_header)\
.option("sep", delimiter)\
.load("/FileStore/tables/authentication_credentials.csv")

# Get the AWS access key and secret key from the spark dataframe
ACCESS_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Secret access key').collect()[0]['Secret access key']
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
AWS_S3_BUCKET = "user-0a48d8473ced-bucket"
MOUNT_NAME = "/mnt/user-0a48d8473ced-bucket"
SOURCE_URL = "s3n://{0}:{1}@{2}".format(ACCESS_KEY, ENCODED_SECRET_KEY, AWS_S3_BUCKET)
# Mount the drive
#dbutils.fs.mount(SOURCE_URL, MOUNT_NAME) #Only need to mount the drive once
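
The secret key is URL-encoded before it goes into the source URL because AWS secret keys can contain characters such as `/` and `+`, which would otherwise be read as URL delimiters. A minimal plain-Python sketch of that step follows. The credentials and bucket name are made-up placeholders, not real values:

```python
import urllib.parse

# Placeholder values for illustration only; these are not real credentials.
access_key = "AKIAEXAMPLEKEY"
secret_key = "abc/def+ghi"
bucket = "example-bucket"

# safe="" makes quote() percent-encode "/" as well, which it leaves alone by default.
encoded_secret_key = urllib.parse.quote(string=secret_key, safe="")
source_url = "s3n://{0}:{1}@{2}".format(access_key, encoded_secret_key, bucket)

print(encoded_secret_key)  # abc%2Fdef%2Bghi
print(source_url)          # s3n://AKIAEXAMPLEKEY:abc%2Fdef%2Bghi@example-bucket
```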

In [None]:
# List the DBFS root (two levels above the mount point) to confirm that the mnt/ directory exists
display(dbutils.fs.ls("/mnt/user-0a48d8473ced-bucket/../.."))

path,name,size,modificationTime
dbfs:/FileStore/,FileStore/,0,1697799066425
dbfs:/Volume/,Volume/,0,0
dbfs:/Volumes/,Volumes/,0,0
dbfs:/databricks-datasets/,databricks-datasets/,0,0
dbfs:/databricks-results/,databricks-results/,0,0
dbfs:/delta/,delta/,0,1697799066425
dbfs:/df_pin.csv/,df_pin.csv/,0,1697799066425
dbfs:/local_disk0/,local_disk0/,0,1697799066425
dbfs:/mnt/,mnt/,0,1697799066425
dbfs:/pin_kinesis_events/,pin_kinesis_events/,0,1697799066425


### Data Retrieval from JSON Files

The next block of code retrieves JSON data from specified file locations in the mounted AWS S3 bucket. The asterisk (*) is a wildcard: every JSON file in each designated directory is read into a Spark DataFrame. Spark infers the structure of the JSON data automatically, so no schema has to be declared by hand.

In [None]:
# Function to read in jsons from mounted S3 bucket
def read_from_s3(file_location, file_type, infer_schema):
  return spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .load(file_location)

# The asterisk (*) matches every file with a .json extension in the specified directory
file_location_pin = "/mnt/user-0a48d8473ced-bucket/topics/0a48d8473ced.pin/partition=0/*.json"
file_location_geo = "/mnt/user-0a48d8473ced-bucket/topics/0a48d8473ced.geo/partition=0/*.json"
file_location_user = "/mnt/user-0a48d8473ced-bucket/topics/0a48d8473ced.user/partition=0/*.json"

file_type = "json"
# Ask Spark to infer the schema
infer_schema = "true"

# Read in JSONs from mounted S3 bucket
df_pin = read_from_s3(file_location_pin, file_type, infer_schema)
df_geo = read_from_s3(file_location_geo, file_type, infer_schema)
df_user = read_from_s3(file_location_user, file_type, infer_schema)


# Display Spark dataframe to check its content
display(df_pin.limit(5))
display(df_geo.limit(5))
display(df_user.limit(5))

category,description,downloaded,follower_count,image_src,index,is_image_or_video,poster_name,save_location,tag_list,title,unique_id
event-planning,Το όνομα που επέλεξε η μαμά Ανδριανή για τη γλυκιά Τιτίκα δεν είναι καθόλου τυχαίο. Και φυσικά δεν άφησε τίποτα στην τύχη ούτε την ημέρα της βάπτισης. Ανέθεσε την οργάνωση στην…,1,4,https://i.pinimg.com/originals/db/aa/d2/dbaad28fa85012a4ea6958540d98a8e5.jpg,4387,image,Manosbojana Katsareas,Local save in /data/event-planning,"Diy Flowers,Flower Diy,Baptism Decorations,Christening,Event Planning,Wedding Planner,Baptism Ideas,Birthday,Party",Βάπτιση: H παραμυθένια βάπτιση της Τιτίκας με θέμα το μονόκερο από την e.m. for you,ae5e7377-f1bd-4ac5-94de-bee317f51a43
home-decor,"Традиционные шведские коттеджи, обычно с красным фасадом — это настоящее воплощением идеального зимнего уюта. Они обычно оформлены очень просто и ✌PUFIK. Beautiful Interiors. On…",1,136k,https://i.pinimg.com/originals/32/eb/72/32eb72e4fd8654c115a64528bd1f34b4.png,6717,image,PUFIK Interiors & Inspirations,Local save in /data/home-decor,"Scandinavian Cottage,Swedish Cottage,Swedish Home Decor,Swedish Farmhouse,Swedish Style,Swedish Kitchen,Kitchen Black,Swedish House,Cozy Cottage",〚 Уютные шведские коттеджи от Carina Olander 〛 ◾ Фото ◾ Идеи ◾ Дизайн,bc5ab9ee-505e-44f6-92ba-677fe4fdf3e3
event-planning,"15.1k Likes, 83 Comments - THE EVENT COLLECTIVE ✖️ (@theeventcollectivex) on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by…”",1,311,https://i.pinimg.com/originals/91/0b/5c/910b5c120f7d1570ffc840302d7b49f4.jpg,4858,image,Marie Bradford,Local save in /data/event-planning,"Diy Birthday Decorations,Balloon Decorations,Table Decorations,Emerald Green Decor,40th Birthday Parties,24th Birthday,Surprise Birthday,Brunch Decor,Quinceanera Themes",THE EVENT COLLECTIVE ✖️ on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by @basicallycuteevents @inspiredengravings for the acrylic…”,58101415-9273-4311-a5bd-0015a56579b4
event-planning,"Wow your guests! Our backdrops are a great option for providing a personalized, stylish and fun addition to your party .It will be the focal point in any event! They are great a…",1,1k,https://i.pinimg.com/originals/15/1f/93/151f93d662dc158ca2c9bbfed198f556.jpg,4608,image,"Iconica Design | Personalized Event Decor, Stationery & Gifts",Local save in /data/event-planning,"Christmas Party Backdrop,Holiday Banner,Birthday Backdrop,Circus First Birthday,First Birthday Banners,Dinasour Birthday,Birthday Bash,Banner Backdrop,Photo Booth Backdrop","Virtual Baby Shower Little Man Baby Shower Banner, Mustache Baby Shower Backdrop, Oh Boy, Any Color, Printed Or Printable File BBS0035 - 10x8 ft / Top Pole Pocket",d234e56f-5b18-4ef3-905b-44103f7719d9
home-decor,"6,636 Likes, 141 Comments - The Cottage Journal (@thecottagejournal) on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are g…",1,394,https://i.pinimg.com/originals/8c/17/a2/8c17a257b70780480bb89c3699363144.jpg,6633,image,Sarah Martin,Local save in /data/home-decor,"Diy Kitchen Cabinets,Kitchen Redo,Home Decor Kitchen,New Kitchen,Home Kitchens,Kitchen Remodeling,Aqua Kitchen,Kitchen Counters,Kitchen Islands",The Cottage Journal on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are giving. If you could paint your cabinets any…”,d136f6bc-840d-44f8-bbad-115eb7e6c51e


country,ind,latitude,longitude,timestamp
British Indian Ocean Territory (Chagos Archipelago),9455,-82.9272,-150.346,2022-03-15 01:46:32
British Indian Ocean Territory (Chagos Archipelago),6814,-86.5675,-149.565,2022-09-02 11:34:28
British Indian Ocean Territory (Chagos Archipelago),7151,-14.6744,-75.3714,2020-06-05 23:37:24
British Indian Ocean Territory (Chagos Archipelago),8221,-20.5574,-54.4834,2021-12-29 06:33:46
British Indian Ocean Territory (Chagos Archipelago),7569,-86.5675,-149.565,2018-10-16 08:40:26


age,date_joined,first_name,ind,last_name
42,2017-02-18 00:31:22,Christopher,6353,Hernandez
27,2016-03-08 13:38:37,Christopher,2015,Bradshaw
59,2017-05-12 21:22:17,Alexander,10673,Cervantes
48,2016-02-27 16:57:44,Christopher,1857,Hamilton
45,2016-09-15 06:02:53,Christopher,10020,Hawkins


### Data Cleaning

The following code snippets demonstrate the data cleaning process for the three DataFrames. These operations make the data consistent (null handling, data types, column names, and column order) in preparation for the subsequent analysis.

#### Data Cleaning for df_pin

In [None]:
# Replace empty or irrelevant entries with None
df_pin = df_pin.replace('', None)
df_pin = df_pin.replace(' ', None)
df_pin = df_pin.replace('nan', None)
# Convert follower_count to an integer, expanding the 'k' and 'M' suffixes
df_pin = df_pin.withColumn('follower_count', regexp_replace('follower_count', 'k', '000'))
df_pin = df_pin.withColumn('follower_count', regexp_replace('follower_count', 'M', '000000'))
df_pin = df_pin.withColumn('follower_count', df_pin['follower_count'].cast('int'))
# Clean the save_location column by removing the 'Local save in ' prefix
df_pin = df_pin.withColumn('save_location', regexp_replace('save_location', 'Local save in ', ''))
# Rename the index column to 'ind'
df_pin = df_pin.withColumnRenamed('index', 'ind')
# Reorder the columns; 'downloaded' is not selected and is therefore dropped
df_pin = df_pin.select('ind', 'unique_id', 'title', 'description', 'follower_count', 
                       'poster_name', 'tag_list', 'is_image_or_video', 'image_src', 'save_location', 'category')

display(df_pin.limit(5))

ind,unique_id,title,description,follower_count,poster_name,tag_list,is_image_or_video,image_src,save_location,category
4387,ae5e7377-f1bd-4ac5-94de-bee317f51a43,Βάπτιση: H παραμυθένια βάπτιση της Τιτίκας με θέμα το μονόκερο από την e.m. for you,Το όνομα που επέλεξε η μαμά Ανδριανή για τη γλυκιά Τιτίκα δεν είναι καθόλου τυχαίο. Και φυσικά δεν άφησε τίποτα στην τύχη ούτε την ημέρα της βάπτισης. Ανέθεσε την οργάνωση στην…,4,Manosbojana Katsareas,"Diy Flowers,Flower Diy,Baptism Decorations,Christening,Event Planning,Wedding Planner,Baptism Ideas,Birthday,Party",image,https://i.pinimg.com/originals/db/aa/d2/dbaad28fa85012a4ea6958540d98a8e5.jpg,/data/event-planning,event-planning
6717,bc5ab9ee-505e-44f6-92ba-677fe4fdf3e3,〚 Уютные шведские коттеджи от Carina Olander 〛 ◾ Фото ◾ Идеи ◾ Дизайн,"Традиционные шведские коттеджи, обычно с красным фасадом — это настоящее воплощением идеального зимнего уюта. Они обычно оформлены очень просто и ✌PUFIK. Beautiful Interiors. On…",136000,PUFIK Interiors & Inspirations,"Scandinavian Cottage,Swedish Cottage,Swedish Home Decor,Swedish Farmhouse,Swedish Style,Swedish Kitchen,Kitchen Black,Swedish House,Cozy Cottage",image,https://i.pinimg.com/originals/32/eb/72/32eb72e4fd8654c115a64528bd1f34b4.png,/data/home-decor,home-decor
4858,58101415-9273-4311-a5bd-0015a56579b4,THE EVENT COLLECTIVE ✖️ on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by @basicallycuteevents @inspiredengravings for the acrylic…”,"15.1k Likes, 83 Comments - THE EVENT COLLECTIVE ✖️ (@theeventcollectivex) on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by…”",311,Marie Bradford,"Diy Birthday Decorations,Balloon Decorations,Table Decorations,Emerald Green Decor,40th Birthday Parties,24th Birthday,Surprise Birthday,Brunch Decor,Quinceanera Themes",image,https://i.pinimg.com/originals/91/0b/5c/910b5c120f7d1570ffc840302d7b49f4.jpg,/data/event-planning,event-planning
4608,d234e56f-5b18-4ef3-905b-44103f7719d9,"Virtual Baby Shower Little Man Baby Shower Banner, Mustache Baby Shower Backdrop, Oh Boy, Any Color, Printed Or Printable File BBS0035 - 10x8 ft / Top Pole Pocket","Wow your guests! Our backdrops are a great option for providing a personalized, stylish and fun addition to your party .It will be the focal point in any event! They are great a…",1000,"Iconica Design | Personalized Event Decor, Stationery & Gifts","Christmas Party Backdrop,Holiday Banner,Birthday Backdrop,Circus First Birthday,First Birthday Banners,Dinasour Birthday,Birthday Bash,Banner Backdrop,Photo Booth Backdrop",image,https://i.pinimg.com/originals/15/1f/93/151f93d662dc158ca2c9bbfed198f556.jpg,/data/event-planning,event-planning
6633,d136f6bc-840d-44f8-bbad-115eb7e6c51e,The Cottage Journal on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are giving. If you could paint your cabinets any…”,"6,636 Likes, 141 Comments - The Cottage Journal (@thecottagejournal) on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are g…",394,Sarah Martin,"Diy Kitchen Cabinets,Kitchen Redo,Home Decor Kitchen,New Kitchen,Home Kitchens,Kitchen Remodeling,Aqua Kitchen,Kitchen Counters,Kitchen Islands",image,https://i.pinimg.com/originals/8c/17/a2/8c17a257b70780480bb89c3699363144.jpg,/data/home-decor,home-decor
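
The string replacements above work for whole-number abbreviations such as `136k` or `1k`. If the data also contained fractional abbreviations such as `1.5k` (an assumption; none appear in the sample rows), appending zeros would no longer give the right value. A hedged plain-Python sketch of a multiplier-based conversion follows. `parse_follower_count` is a hypothetical helper and not part of the notebook:

```python
from typing import Optional

# Multipliers for the two suffixes handled in the notebook ('k' and 'M').
_SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_follower_count(raw: Optional[str]) -> Optional[int]:
    """Convert strings such as '136k', '1.5M' or '394' to an integer."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    # Split off a trailing suffix, if there is one.
    if text[-1] in _SUFFIXES:
        number, multiplier = text[:-1], _SUFFIXES[text[-1]]
    else:
        number, multiplier = text, 1
    try:
        return round(float(number) * multiplier)
    except (ValueError, OverflowError):
        # Not a finite number (for example 'nan' or free text).
        return None


print(parse_follower_count("136k"))  # 136000
print(parse_follower_count("1.5M"))  # 1500000
print(parse_follower_count("394"))   # 394
```

In Spark, a function like this could be wrapped in a UDF, though UDFs generally run slower than the built-in column functions used in the cell above.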


#### Data Cleaning for df_geo

In [None]:
#Create a new column that contains an array based on the latitude and longitude columns
df_geo = df_geo.withColumn('coordinates', array('latitude', 'longitude'))
df_geo = df_geo.drop('latitude', 'longitude')
#Convert the timestamp column from a string to a timestamp data type
df_geo = df_geo.withColumn('timestamp', df_geo['timestamp'].cast('timestamp'))
#Reorder the DataFrame columns to have the following column order:
df_geo = df_geo.select('ind', 'country', 'coordinates', 'timestamp')

display(df_geo.limit(5))

ind,country,coordinates,timestamp
9455,British Indian Ocean Territory (Chagos Archipelago),"List(-82.9272, -150.346)",2022-03-15T01:46:32.000+0000
6814,British Indian Ocean Territory (Chagos Archipelago),"List(-86.5675, -149.565)",2022-09-02T11:34:28.000+0000
7151,British Indian Ocean Territory (Chagos Archipelago),"List(-14.6744, -75.3714)",2020-06-05T23:37:24.000+0000
8221,British Indian Ocean Territory (Chagos Archipelago),"List(-20.5574, -54.4834)",2021-12-29T06:33:46.000+0000
7569,British Indian Ocean Territory (Chagos Archipelago),"List(-86.5675, -149.565)",2018-10-16T08:40:26.000+0000


#### Data Cleaning for df_user

In [None]:
#Clean the df_user dataframe
#Create a new column user_name that concatenates the first_name and last_name columns
df_user = df_user.withColumn('user_name', concat(df_user['first_name'], lit(' '), df_user['last_name']))
df_user = df_user.drop('first_name', 'last_name')
#Convert the date_joined column from a string to a timestamp data type
df_user = df_user.withColumn('date_joined', df_user['date_joined'].cast('timestamp'))
#Reorder the DataFrame columns to have the following column order:
df_user = df_user.select('ind', 'user_name', 'age', 'date_joined')

display(df_user.limit(5))

ind,user_name,age,date_joined
6353,Christopher Hernandez,42,2017-02-18T00:31:22.000+0000
2015,Christopher Bradshaw,27,2016-03-08T13:38:37.000+0000
10673,Alexander Cervantes,59,2017-05-12T21:22:17.000+0000
1857,Christopher Hamilton,48,2016-02-27T16:57:44.000+0000
10020,Christopher Hawkins,45,2016-09-15T06:02:53.000+0000


### Data Analysis

This section runs a series of analytical queries on the cleaned DataFrames.

The queries join the DataFrames on the `ind` column, then group and aggregate the results. They cover popular content categories by country, year, and age group, follower statistics, and the number of users who joined each year.

This analysis offers a view of the underlying patterns and trends within the dataset, which can inform decisions about further business development and optimisation.

#### Find the most popular Pinterest category people post to based on their country.


In [None]:
#Use the ind column as the join key
df_category_country = df_geo.join(df_pin, df_geo.ind == df_pin.ind).select(df_geo.country, df_pin.category)

df_category_country = df_category_country.groupBy('country', 'category').count()
df_category_country = df_category_country.withColumnRenamed('count', 'category_count')
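df_category_country = df_category_country.orderBy('category_count', ascending=False)
# Caution: Spark does not guarantee that dropDuplicates keeps the first (highest-count) row after the sort
df_category_country = df_category_country.dropDuplicates(['country'])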
df_category_country = df_category_country.select('country', 'category', 'category_count')

display(df_category_country.limit(10))

country,category,category_count
Afghanistan,education,29
Albania,art,40
Algeria,quotes,53
American Samoa,tattoos,18
Andorra,tattoos,11
Angola,diy-and-crafts,7
Anguilla,diy-and-crafts,9
Antarctica (the territory South of 60 deg S),christmas,11
Antigua and Barbuda,travel,8
Argentina,tattoos,22
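
To sanity-check this query on a handful of rows without a cluster, the intended result (the single most frequent category per country) can be reproduced in plain Python. The rows below are invented for illustration. Ties are broken alphabetically, which is an assumption, as the notebook does not specify a tie-breaking rule:

```python
from collections import Counter

# Invented (country, category) pairs standing in for the joined DataFrame.
rows = [
    ("Albania", "art"), ("Albania", "art"), ("Albania", "travel"),
    ("Algeria", "quotes"), ("Algeria", "tattoos"), ("Algeria", "quotes"),
    ("Andorra", "travel"), ("Andorra", "art"),
]

# Equivalent of groupBy('country', 'category').count()
counts = Counter(rows)

# For each country keep the (category, count) pair with the highest count.
top_by_country = {}
for (country, category), n in counts.items():
    best = top_by_country.get(country)
    # On a tie, keep the alphabetically first category (assumed rule).
    if best is None or n > best[1] or (n == best[1] and category < best[0]):
        top_by_country[country] = (category, n)

for country in sorted(top_by_country):
    print(country, *top_by_country[country])
```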


#### Find how many posts each category had between 2018 and 2022.

In [None]:
df_category_year = df_geo.join(df_pin, df_geo.ind == df_pin.ind) \
        .select(year(df_geo.timestamp).alias('post_year'), df_pin.category)

df_category_year = df_category_year.groupBy('post_year', 'category').count()
df_category_year = df_category_year.withColumnRenamed('count', 'category_count')
df_category_year = df_category_year.orderBy('post_year', ascending=False)
df_category_year = df_category_year.select('post_year', 'category', 'category_count')
df_category_year = df_category_year.filter(df_category_year.post_year >= 2018)
df_category_year = df_category_year.filter(df_category_year.post_year <= 2022)

display(df_category_year.limit(10))

post_year,category,category_count
2022,beauty,41
2022,tattoos,39
2022,finance,43
2022,education,31
2022,travel,38
2022,quotes,50
2022,event-planning,28
2022,christmas,58
2022,home-decor,26
2022,vehicles,27


#### Find the user with the most followers in each country.

In [None]:
df_most_followers_per_country = df_geo.join(df_pin, df_geo.ind == df_pin.ind) \
        .select(df_geo.country, df_pin.poster_name, df_pin.follower_count)
df_most_followers_per_country = df_most_followers_per_country.groupBy('country', 'poster_name') \
        .max('follower_count')
df_most_followers_per_country = df_most_followers_per_country \
        .withColumnRenamed('max(follower_count)', 'follower_count')
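df_most_followers_per_country = df_most_followers_per_country.orderBy('follower_count', ascending=False)
# Caution: Spark does not guarantee that dropDuplicates keeps the first (highest-count) row after the sort
df_most_followers_per_country = df_most_followers_per_country.dropDuplicates(['country'])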
df_most_followers_per_country = df_most_followers_per_country.select('country', 'poster_name', 'follower_count')

display(df_most_followers_per_country.limit(10))

country,poster_name,follower_count
Afghanistan,9GAG,3000000
Albania,The Minds Journal,5000000
Algeria,Apartment Therapy,5000000
American Samoa,Mamas Uncut,8000000
Andorra,Teachers Pay Teachers,1000000
Angola,Tastemade,8000000
Anguilla,We Heart It,15000000
Antarctica (the territory South of 60 deg S),Refinery29,1000000
Antigua and Barbuda,Country Living Magazine,1000000
Argentina,Cheezburger,2000000


#### Find the country with the user with the most followers.

In [None]:
df_most_followed = df_most_followers_per_country.groupBy('country').max('follower_count')
df_most_followed = df_most_followed.withColumnRenamed('max(follower_count)', 'follower_count')
df_most_followed = df_most_followed.orderBy('follower_count', ascending=False)
#limit to one entry
df_most_followed = df_most_followed.limit(1)

display(df_most_followed)

country,follower_count
Anguilla,15000000


#### Find the most popular category for different age groups.

In [None]:
#The following age groups are used:
# 18-24 25-35 36-50 50+
df_age_groups_category = df_pin.join(df_user, df_pin.ind == df_user.ind).select(df_user.age, df_pin.category)
df_age_groups_category = df_age_groups_category.withColumn(
        'age_group', when((df_age_groups_category.age >= 18) & (df_age_groups_category.age <= 24), '18-24')
        .when((df_age_groups_category.age >= 25) & (df_age_groups_category.age <= 35), '25-35')
        .when((df_age_groups_category.age >= 36) & (df_age_groups_category.age <= 50), '36-50')
        .when(df_age_groups_category.age > 50, '50+')
        .otherwise('Unknown'))
df_age_groups_category = df_age_groups_category.groupBy('age_group', 'category').count()
df_age_groups_category = df_age_groups_category.withColumnRenamed('count', 'category_count')
df_age_groups_category = df_age_groups_category.orderBy('age_group', ascending=True)
df_age_groups_category = df_age_groups_category.select('age_group', 'category', 'category_count')

display(df_age_groups_category.limit(10))

age_group,category,category_count
18-24,tattoos,148
18-24,travel,74
18-24,education,109
18-24,beauty,52
18-24,diy-and-crafts,120
18-24,event-planning,55
18-24,christmas,112
18-24,quotes,124
18-24,mens-fashion,119
18-24,finance,76


#### Find the median follower count for each age group.


In [None]:
df_median_followers = df_pin.join(df_user, df_pin.ind == df_user.ind).select(df_user.age, df_pin.follower_count)
df_median_followers = df_median_followers \
        .withColumn('age_group', when((df_median_followers.age >= 18) & (df_median_followers.age <= 24), '18-24') \
        .when((df_median_followers.age >= 25) & (df_median_followers.age <= 35), '25-35') \
        .when((df_median_followers.age >= 36) & (df_median_followers.age <= 50), '36-50') \
        .when(df_median_followers.age > 50, '50+') \
        .otherwise('Unknown'))
#calculate median follower count
df_median_followers = df_median_followers.groupBy('age_group') \
        .agg(expr('percentile(follower_count, 0.5)').alias('median_follower_count'))
df_median_followers = df_median_followers.orderBy('age_group', ascending=True)

display(df_median_followers)

age_group,median_follower_count
18-24,130000.0
25-35,26000.0
36-50,7000.0
50+,877.0
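
The `percentile(follower_count, 0.5)` expression computes an exact rather than approximate percentile. At 0.5 this should correspond to the conventional median, which `statistics.median` in the Python standard library also returns. A small check on invented numbers, with missing values dropped first (Spark aggregates skip nulls):

```python
import statistics

# Invented follower counts; None stands in for a null follower_count.
raw_counts = [100, None, 300, 500, 900]

# Drop missing values before aggregating.
values = [v for v in raw_counts if v is not None]

median_even = statistics.median(values)      # mean of the two middle values
median_odd = statistics.median(values[:3])   # the middle value itself

print(median_even)  # 400.0
print(median_odd)   # 300
```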


#### Find how many users have joined each year.

In [None]:
df_users_joined = df_user.select(year(df_user.date_joined).alias('post_year'))
df_users_joined = df_users_joined.groupBy('post_year').count()
df_users_joined = df_users_joined.withColumnRenamed('count', 'number_users_joined')
df_users_joined = df_users_joined.orderBy('post_year', ascending=True)

display(df_users_joined)

post_year,number_users_joined
2015,894
2016,1004
2017,359


#### Find median follower count of users based on the year they joined.

In [None]:
#For years joined between 2015 and 2020
df_users_joined_median_follower_count = df_user.join(df_pin, df_user.ind == df_pin.ind) \
        .select(year(df_user.date_joined).alias('post_year'), df_pin.follower_count)
df_users_joined_median_follower_count = df_users_joined_median_follower_count \
        .groupBy('post_year').agg(expr('percentile(follower_count, 0.5)').alias('median_follower_count'))
df_users_joined_median_follower_count = df_users_joined_median_follower_count \
        .orderBy('post_year', ascending=True)
df_users_joined_median_follower_count = df_users_joined_median_follower_count \
        .filter(df_users_joined_median_follower_count.post_year >= 2015)
df_users_joined_median_follower_count = df_users_joined_median_follower_count \
        .filter(df_users_joined_median_follower_count.post_year <= 2020)

display(df_users_joined_median_follower_count)

post_year,median_follower_count
2015,163000.0
2016,18000.0
2017,4000.0


#### Find the median follower count of users that have joined between 2015 and 2020, based on which age group they are part of.

In [None]:
df_med_followers_for_year_age_group = df_user.join(df_pin, df_user.ind == df_pin.ind) \
        .select(df_user.age, year(df_user.date_joined).alias('post_year'), df_pin.follower_count)
df_med_followers_for_year_age_group = df_med_followers_for_year_age_group \
        .withColumn('age_group', when((df_med_followers_for_year_age_group.age >= 18) & (
                df_med_followers_for_year_age_group.age <= 24), '18-24') \
        .when((df_med_followers_for_year_age_group.age >= 25) & (
                df_med_followers_for_year_age_group.age <= 35), '25-35') \
        .when((df_med_followers_for_year_age_group.age >= 36) & (
                df_med_followers_for_year_age_group.age <= 50), '36-50') \
        .when(df_med_followers_for_year_age_group.age > 50, '50+') \
        .otherwise('Unknown'))
df_med_followers_for_year_age_group = df_med_followers_for_year_age_group \
        .groupBy('post_year', 'age_group').agg(expr('percentile(follower_count, 0.5)') \
                .alias('median_follower_count'))
df_med_followers_for_year_age_group = df_med_followers_for_year_age_group.orderBy('post_year', ascending=True)
df_med_followers_for_year_age_group = df_med_followers_for_year_age_group \
        .filter(df_med_followers_for_year_age_group.post_year >= 2015)
df_med_followers_for_year_age_group = df_med_followers_for_year_age_group \
        .filter(df_med_followers_for_year_age_group.post_year <= 2020)

display(df_med_followers_for_year_age_group)

post_year,age_group,median_follower_count
2015,36-50,11000.0
2015,25-35,44000.0
2015,50+,14000.0
2015,18-24,228000.0
2016,36-50,9500.0
2016,25-35,24000.0
2016,50+,457.0
2016,18-24,37000.0
2017,36-50,3000.0
2017,50+,2000.0
