# Tweets' Rolling Averages  
**Twitter SQL Interview Question**  

---

### Question  
This is the same question as problem #10 in the SQL Chapter of *Ace the Data Science Interview*!  

Given a table of tweet data over a specified time period, calculate the 3-day rolling average of tweets for each user. Output the user ID, tweet date, and rolling averages rounded to 2 decimal places.  

---

### Notes  
- A rolling average, also known as a moving average or running mean, is a time-series technique that smooths short-term fluctuations by averaging the values inside a fixed-length window that slides over the data.  
- Here, the window for each user covers the current day and the two days before it, so each output row is the mean tweet count over at most 3 days.  
- Effective April 7th, 2023, the problem statement, solution and hints for this question have been revised.  

---

### `tweets` Table:

| Column Name  | Type      |
|--------------|-----------|
| user_id      | integer   |
| tweet_date   | timestamp |
| tweet_count  | integer   |

---

### Example Input:

| user_id | tweet_date          | tweet_count |
|---------|---------------------|-------------|
| 111     | 06/01/2022 00:00:00 | 2           |
| 111     | 06/02/2022 00:00:00 | 1           |
| 111     | 06/03/2022 00:00:00 | 3           |
| 111     | 06/04/2022 00:00:00 | 4           |
| 111     | 06/05/2022 00:00:00 | 5           |

---

### Example Output:

| user_id | tweet_date          | rolling_avg_3d |
|---------|---------------------|----------------|
| 111     | 06/01/2022 00:00:00 | 2.00           |
| 111     | 06/02/2022 00:00:00 | 1.50           |
| 111     | 06/03/2022 00:00:00 | 2.00           |
| 111     | 06/04/2022 00:00:00 | 2.67           |
| 111     | 06/05/2022 00:00:00 | 4.00           |

---

### Explanation  
For each user and date, the rolling average is calculated based on the current date and the two previous dates (i.e., a 3-day window). For instance, on 06/03/2022, the average is calculated as:  
(2 + 1 + 3) / 3 = 2.00  

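As we proceed through time, the window slides forward so that the average reflects only the three most recent days of tweet activity. The first two dates have fewer than three days available, so the average covers only the rows that exist: 2 / 1 = 2.00 on 06/01/2022 and (2 + 1) / 2 = 1.50 on 06/02/2022.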
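
The window arithmetic can be checked by hand with a short plain-Python sketch. This is an illustration only, not part of the original solution: the helper name `rolling_avg_3` is hypothetical, and the sketch assumes one row per user per day, as in the example input.

```python
from datetime import date

# Sample rows from the Example Input: (user_id, tweet_date, tweet_count)
rows = [
    (111, date(2022, 6, 1), 2),
    (111, date(2022, 6, 2), 1),
    (111, date(2022, 6, 3), 3),
    (111, date(2022, 6, 4), 4),
    (111, date(2022, 6, 5), 5),
]

def rolling_avg_3(rows):
    """Average each row's count with up to two preceding rows of the same user."""
    history = {}  # user_id -> counts seen so far, in date order
    result = []
    for user_id, tweet_date, tweet_count in sorted(rows, key=lambda r: (r[0], r[1])):
        counts = history.setdefault(user_id, [])
        counts.append(tweet_count)
        window = counts[-3:]  # current row plus at most two earlier rows
        result.append((user_id, tweet_date, round(sum(window) / len(window), 2)))
    return result

for user_id, tweet_date, avg in rolling_avg_3(rows):
    print(user_id, tweet_date, f"{avg:.2f}")
```

The printed averages are 2.00, 1.50, 2.00, 2.67 and 4.00, matching the Example Output.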


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
from pyspark.sql.functions import *
from datetime import datetime

# Initialize Spark session
spark = SparkSession.builder.master('local[1]').appName("TweetsRollingAverages").getOrCreate()

# Define schema
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("tweet_date", TimestampType(), True),
    StructField("tweet_count", IntegerType(), True)
])

# Create data
data = [
    (111, datetime.strptime("06/01/2022 00:00:00", "%m/%d/%Y %H:%M:%S"), 2),
    (111, datetime.strptime("06/02/2022 00:00:00", "%m/%d/%Y %H:%M:%S"), 1),
    (111, datetime.strptime("06/03/2022 00:00:00", "%m/%d/%Y %H:%M:%S"), 3),
    (111, datetime.strptime("06/04/2022 00:00:00", "%m/%d/%Y %H:%M:%S"), 4),
    (111, datetime.strptime("06/05/2022 00:00:00", "%m/%d/%Y %H:%M:%S"), 5)
]

# Create DataFrame
tweets_df = spark.createDataFrame(data, schema)

# Show DataFrame
tweets_df.show(truncate=False)


+-------+-------------------+-----------+
|user_id|tweet_date         |tweet_count|
+-------+-------------------+-----------+
|111    |2022-06-01 00:00:00|2          |
|111    |2022-06-02 00:00:00|1          |
|111    |2022-06-03 00:00:00|3          |
|111    |2022-06-04 00:00:00|4          |
|111    |2022-06-05 00:00:00|5          |
+-------+-------------------+-----------+



In [8]:
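from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Row-based frame: the current row plus the two preceding rows per user
winspec = Window.partitionBy('user_id').orderBy('tweet_date').rowsBetween(-2, 0)
# F.round/F.avg are the Spark column functions, not Python's built-in round
tweets_df.withColumn('rolling_avg_3d', F.round(F.avg('tweet_count').over(winspec), 2)) \
    .drop('tweet_count').show()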

+-------+-------------------+--------------+
|user_id|         tweet_date|rolling_avg_3d|
+-------+-------------------+--------------+
|    111|2022-06-01 00:00:00|           2.0|
|    111|2022-06-02 00:00:00|           1.5|
|    111|2022-06-03 00:00:00|           2.0|
|    111|2022-06-04 00:00:00|          2.67|
|    111|2022-06-05 00:00:00|           4.0|
+-------+-------------------+--------------+
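
The output above prints `2.0` and `1.5` where the Example Output lists `2.00` and `1.50`. The values are equal; the rounded column is a double, and a double carries no trailing zeros when printed. A minimal plain-Python sketch of the same distinction follows. The Spark-side remedy, such as casting the column to `decimal(10,2)`, is an assumption here and not something the original notebook does.

```python
from decimal import Decimal, ROUND_HALF_UP

# The five window averages from the example, before rounding
averages = [2.0, 1.5, 2.0, 8 / 3, 4.0]

# Rounding to a float keeps the value but not the trailing zeros
as_floats = [round(a, 2) for a in averages]

# A fixed-scale decimal keeps exactly two digits after the point
as_decimals = [
    Decimal(str(a)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    for a in averages
]

print(as_floats)
print([str(d) for d in as_decimals])
```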



In [3]:
tweets_df.createOrReplaceTempView('tweets')
# Range-based frame: rows dated within the 2 days up to and including the current row
spark.sql('''
    SELECT user_id, tweet_date,
           ROUND(AVG(tweet_count) OVER (
               PARTITION BY user_id ORDER BY tweet_date
               RANGE BETWEEN INTERVAL '2 day' PRECEDING AND CURRENT ROW
           ), 2) AS rolling_avg_3d
    FROM tweets
''').show()

+-------+-------------------+--------------+
|user_id|         tweet_date|rolling_avg_3d|
+-------+-------------------+--------------+
|    111|2022-06-01 00:00:00|           2.0|
|    111|2022-06-02 00:00:00|           1.5|
|    111|2022-06-03 00:00:00|           2.0|
|    111|2022-06-04 00:00:00|          2.67|
|    111|2022-06-05 00:00:00|           4.0|
+-------+-------------------+--------------+
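
The DataFrame solution uses a row-based frame (`rowsBetween`), while the SQL solution uses a range-based frame (`RANGE BETWEEN INTERVAL '2 day' PRECEDING AND CURRENT ROW`). They agree on the sample data because every date is present. When a user has days with no row, the two frames would be expected to diverge. The plain-Python sketch below mimics both frames for a single user. The data and the helper names `rows_frame_avg` and `range_frame_avg` are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

# Hypothetical rows for one user, sorted by date, with no rows for 06/03 and 06/04
rows = [
    (date(2022, 6, 1), 2),
    (date(2022, 6, 2), 1),
    (date(2022, 6, 5), 6),
]

def rows_frame_avg(rows, i, preceding=2):
    """Mimic a ROWS frame: take the current row and up to two rows before it."""
    window = [count for _, count in rows[max(0, i - preceding): i + 1]]
    return round(sum(window) / len(window), 2)

def range_frame_avg(rows, i, days=2):
    """Mimic a RANGE frame: take rows dated within `days` before the current date."""
    current = rows[i][0]
    start = current - timedelta(days=days)
    window = [count for d, count in rows if start <= d <= current]
    return round(sum(window) / len(window), 2)

for i, (d, _) in enumerate(rows):
    print(d, rows_frame_avg(rows, i), range_frame_avg(rows, i))
```

For 06/05 the row-based frame averages all three rows (3.0), while the range-based frame sees only that day's row (6.0). Which result is wanted depends on whether "3-day" means three records or three calendar days.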

