## Reddit Practice Problem: Surging Subreddits ##

### The Scenario ###
You're an Analytics Engineering Lead at Reddit. The "Discovery" team wants to identify subreddits that are experiencing a sudden surge in user engagement. They hypothesize that a rapid increase in the number of comments per post is a key indicator of a subreddit "going viral" or becoming a hot topic.

Your task is to build a data pipeline that identifies the top 10 subreddits with the highest week-over-week percentage growth in their average comments per post.

### Input DataFrames ###
You are given two primary sources of data:

posts DataFrame: Contains information about each post submitted.

- post_id (string)
- subreddit_id (string)
- created_utc (timestamp)

comments DataFrame: A log of all comments made on posts.

- comment_id (string)
- post_id (string)
- created_utc (timestamp)

### The Task ###
Write a PySpark script that performs the following steps:

Calculate Weekly Comment Counts: For each post, determine how many comments it received. Then, aggregate this data to find the total number of comments and the total number of posts for each subreddit, for each week.

A "week" can be defined using the weekofyear function.

Calculate Average Comments Per Post: Using the weekly aggregates from the previous step, calculate the average number of comments per post for each subreddit for each week.
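The two aggregation steps above can be sketched in plain Python to make the logic concrete before translating it to Spark. This is a minimal sketch under stated assumptions, not a reference solution: it assigns each post to the week of its own created_utc (the task does not say whether post time or comment time defines the week), it keys weeks by their Monday, and it keeps posts with no comments in the denominator. The names `week_start`, `weekly_avg_comments`, and the post `p7` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def week_start(d):
    """Monday of the week containing d (assumed week definition)."""
    return d - timedelta(days=d.weekday())

def weekly_avg_comments(posts, comments):
    """posts: (post_id, subreddit_id, created_date); comments: (comment_id, post_id).

    Returns {(subreddit_id, week_start): (total_posts, total_comments, avg)}.
    """
    # Comments per post; posts with no comments are simply absent here.
    per_post = Counter(post_id for _, post_id in comments)

    totals = defaultdict(lambda: [0, 0])  # key -> [posts, comments]
    for post_id, subreddit_id, created in posts:
        key = (subreddit_id, week_start(created))
        totals[key][0] += 1
        # .get(..., 0) plays the role of a left join followed by a null-to-zero fill.
        totals[key][1] += per_post.get(post_id, 0)

    return {k: (p, c, c / p) for k, (p, c) in totals.items()}

posts = [
    ("p1", "sub1", date(2025, 8, 20)),
    ("p2", "sub1", date(2025, 8, 28)),
    ("p7", "sub1", date(2025, 8, 28)),  # hypothetical post with no comments
]
comments = [("c1", "p1"), ("c2", "p1"), ("c3", "p2")]
weekly = weekly_avg_comments(posts, comments)
```

In Spark the same effect would likely come from a left join of posts onto per-post comment counts. An inner join would silently drop `p7` and inflate the average for that week.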

Find Previous Week's Average: For each subreddit and each week, you need to find the average comments per post from the previous week. This is the key step and will require a window function partitioned by subreddit and ordered by week.
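One subtlety worth sketching: a `lag` over rows ordered by week returns the previous row, which is the previous calendar week only if the subreddit has a row for every week. The plain-Python sketch below looks up the row exactly 7 days earlier instead, assuming weeks are keyed by their Monday date. The function name `with_previous_week` and the sample values are hypothetical.

```python
from datetime import date, timedelta

def with_previous_week(weekly_avg):
    """weekly_avg: {(subreddit_id, week_start): avg_comments_per_post}.

    Returns rows (subreddit_id, week_start, avg, previous_avg), where
    previous_avg is None when the subreddit has no row exactly 7 days earlier.
    """
    rows = []
    for (subreddit_id, week), avg in sorted(weekly_avg.items()):
        previous = weekly_avg.get((subreddit_id, week - timedelta(days=7)))
        rows.append((subreddit_id, week, avg, previous))
    return rows

weekly_avg = {
    ("sub1", date(2025, 8, 18)): 2.0,
    ("sub1", date(2025, 8, 25)): 1.0,
    ("sub2", date(2025, 8, 11)): 4.0,  # hypothetical: sub2 has no row for Aug 18
    ("sub2", date(2025, 8, 25)): 3.0,
}
rows = with_previous_week(weekly_avg)
```

Here a naive lag would pair sub2's Aug 25 row with its Aug 11 row, comparing weeks that are two weeks apart. With a Spark window function, one way to guard against this might be to also lag the week column and null out the previous average when the gap is not exactly one week.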

Calculate Week-over-Week Growth: Calculate the percentage growth from the previous week's average to the current week's average. The formula is: ((current_avg - previous_avg) / previous_avg) * 100.

Handle cases where the previous week's average was zero to avoid division errors.
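The growth formula and its zero guard can be written as a small function. This sketch returns `None` when the previous average is missing or zero, on the assumption that "undefined" is a more honest answer than 0% or infinity; the function name is hypothetical.

```python
def wow_growth_percentage(current_avg, previous_avg):
    """((current - previous) / previous) * 100, or None if undefined."""
    if previous_avg is None or previous_avg == 0:
        return None  # no baseline to divide by
    return (current_avg - previous_avg) / previous_avg * 100.0

doubled = wow_growth_percentage(3.0, 1.5)
halved = wow_growth_percentage(0.5, 1.0)
no_baseline = wow_growth_percentage(2.0, 0.0)
```

In PySpark the same guard is commonly expressed with `when(previous > 0, ...)` and no `otherwise`, which leaves the column null for the unmatched rows.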

Filter and Rank: Filter for the most recent complete week in the dataset. Then, rank the subreddits by their week-over-week growth in descending order and return the top 10.
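The final filter and rank can be sketched as follows. Two assumptions are baked in and both are open to debate: the week containing the latest event in the dataset is always treated as incomplete, and rows with undefined growth are excluded before ranking. The ranking here assigns consecutive positions, so ties are not given equal ranks. All names are hypothetical.

```python
from datetime import date, timedelta

def latest_complete_week(max_event_date):
    """Monday of the week before the one containing max_event_date."""
    current_week = max_event_date - timedelta(days=max_event_date.weekday())
    return current_week - timedelta(days=7)

def top_n_by_growth(rows, week, n=10):
    """rows: (subreddit_id, week_start, growth_or_None). Rank 1 = highest growth."""
    candidates = [r for r in rows if r[1] == week and r[2] is not None]
    candidates.sort(key=lambda r: r[2], reverse=True)
    return [(rank, *r) for rank, r in enumerate(candidates[:n], start=1)]

rows = [
    ("sub1", date(2025, 8, 18), 50.0),
    ("sub2", date(2025, 8, 18), 200.0),
    ("sub3", date(2025, 8, 18), None),   # no previous week, so growth is undefined
    ("sub2", date(2025, 8, 25), 900.0),  # falls in the still-incomplete week
]
target_week = latest_complete_week(date(2025, 8, 30))
top = top_n_by_growth(rows, target_week)
```

If the dataset is known to end exactly on a week boundary, the "always incomplete" rule would discard a valid week, so the cutoff is worth confirming with the data owners.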

### Expected Final Output ###
A DataFrame with the following schema, showing the top 10 surging subreddits for the most recent week:

- subreddit_id
- week
- avg_comments_per_post
- previous_week_avg_comments
- wow_growth_percentage
- rank

### Follow-up Questions for a Lead Role ###
Optimization: This job could be slow if the comments table is huge. How would you optimize the initial join between posts and comments? What if there's data skew in a few very popular posts? (Probes knowledge of partitioning, broadcasting, and salting).
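To make the salting idea concrete, here is a plain-Python sketch of a salted join. It is an illustration of the technique, not Spark code: the large side (comments) gets a random salt appended to its join key, and the small side (posts) is replicated once per salt so every salted key still finds its match. `NUM_SALTS`, the function names, and the sample data are hypothetical.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed fan-out; tune to the observed skew

def salt_comments(comments, rng):
    """Big side: append a random salt so one hot post_id spreads over NUM_SALTS keys."""
    return [((post_id, rng.randrange(NUM_SALTS)), comment_id)
            for comment_id, post_id in comments]

def explode_posts(posts):
    """Small side: replicate each post once per salt so every salted key matches."""
    return {(post_id, salt): subreddit_id
            for post_id, subreddit_id in posts
            for salt in range(NUM_SALTS)}

posts = [("p_hot", "sub1"), ("p2", "sub2")]
comments = [(f"c{i}", "p_hot") for i in range(1000)] + [("c_x", "p2")]

salted = salt_comments(comments, random.Random(0))
lookup = explode_posts(posts)

# Join on the salted key, then drop the salt from the output.
joined = [(comment_id, key[0], lookup[key]) for key, comment_id in salted]

# The hot post's comments are now split across several distinct join keys.
hot_key_counts = Counter(key for key, _ in salted if key[0] == "p_hot")
```

For this particular problem, counting comments per post_id before the join may remove most of the skew on its own, since the large side then has one row per post. If the posts table is small enough to fit in executor memory, a broadcast join avoids shuffling the comments at all.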

Data Modeling: How would you productionize this logic? Would you create an intermediate, aggregated table? What would be the "grain" of that table (daily, weekly)? How would you handle late-arriving data?

Edge Cases: What are the flaws in this "surge" logic? What if a subreddit is new and has no "previous week" data? How should its growth be represented? What if a subreddit's activity is so low that going from 1 comment to 2 registers as 100% growth? How might you refine the logic to account for this?
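One possible refinement, sketched below, is to require a minimum baseline volume before a growth figure is reported. The threshold `min_previous_posts=20` is an arbitrary placeholder rather than a recommended value, and the function name is hypothetical.

```python
def surge_score(current_avg, previous_avg, previous_posts, min_previous_posts=20):
    """Return growth %, or None when the baseline is missing, zero, or too thin."""
    if previous_avg is None or previous_avg == 0:
        return None  # new subreddit or empty baseline: growth is undefined
    if previous_posts < min_previous_posts:
        return None  # baseline too small to trust (e.g. 1 comment -> 2 comments)
    return (current_avg - previous_avg) / previous_avg * 100.0

tiny_baseline = surge_score(2.0, 1.0, previous_posts=1)
solid_baseline = surge_score(6.0, 4.0, previous_posts=50)
new_subreddit = surge_score(3.0, None, previous_posts=None)
```

Returning `None` for new subreddits keeps them out of the growth ranking. Whether they deserve a separate "new and active" list is a product decision rather than a data one.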

Definition of "Week": We used weekofyear. What are the potential issues with this function, especially around the new year? (e.g., Week 52 vs. Week 1). What is a more robust way to define a "week"? (Probes deeper SQL/Spark function knowledge, like using date_trunc).
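The new-year problem can be demonstrated with the standard library. Spark's `weekofyear` follows ISO 8601 week numbering, which Python exposes through `date.isocalendar()`. The sketch below compares it with a week-start date, which is what truncating a timestamp to the week produces. The helper names are hypothetical.

```python
from datetime import date, timedelta

def iso_week(d):
    """ISO 8601 week number, the numbering Spark's weekofyear follows."""
    return d.isocalendar()[1]

def week_start(d):
    """Monday of the week containing d; comparable to truncating to 'week'."""
    return d - timedelta(days=d.weekday())

sunday = date(2024, 12, 29)
monday = date(2024, 12, 30)

# Consecutive days, but the week number drops from 52 to 1 ...
week_numbers = (iso_week(sunday), iso_week(monday))
# ... while the week-start dates stay exactly 7 days apart and sort correctly.
week_starts = (week_start(sunday), week_start(monday))
```

A bare week number also repeats every year, so ordering a window by it mixes weeks from different years. A week-start date, such as the one `date_trunc` with a `'week'` unit returns, avoids both problems.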

In [1]:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, count, weekofyear, lag, when, lit, desc, rank

# --- 1. Setup Spark Session and Create Dummy Data ---
spark = SparkSession.builder.appName("RedditSurgeAnalysis").getOrCreate()

posts_data = [("p1", "sub1", "2025-08-20 10:00:00"), ("p2", "sub1", "2025-08-28 11:00:00"),
              ("p3", "sub2", "2025-08-21 12:00:00"), ("p4", "sub2", "2025-08-29 13:00:00"),
              ("p5", "sub3", "2025-08-22 14:00:00"), ("p6", "sub3", "2025-08-30 15:00:00")]
posts = spark.createDataFrame(posts_data, ["post_id", "subreddit_id", "created_utc"]) \
    .withColumn("created_utc", col("created_utc").cast("timestamp"))

comments_data = [("c1", "p1", "2025-08-30 15:01:00"), ("c2", "p1", "2025-08-30 15:02:00"), ("c3", "p2", "2025-08-30 15:03:00"), # sub1
                 ("c4", "p3", "2025-08-30 15:04:00"), ("c5", "p4", "2025-08-30 15:05:00"), ("c6", "p4", "2025-08-30 15:06:00"), ("c7", "p4", "2025-08-30 15:07:00"), # sub2
                 ("c8", "p5", "2025-08-30 15:08:00"), ("c9", "p6", "2025-08-30 15:09:00"), ("c10", "p6", "2025-08-30 15:11:00")] # sub3
comments = spark.createDataFrame(comments_data, ["comment_id", "post_id", "created_utc"]) \
    .withColumn("created_utc", col("created_utc").cast("timestamp"))

