# Average Post Hiatus (Part 1)

## Facebook SQL Interview Question

### Question

Given a table of Facebook posts, for each user who posted at least twice in 2021, write a query to find the number of days between the user's first and last posts of 2021. Output the user and the number of days between those two posts.

---


### Table: `posts`

| Column Name   | Type       |
|---------------|------------|
| user_id       | integer    |
| post_id       | integer    |
| post_content  | text       |
| post_date     | timestamp  |

---

### Example Input for `posts` Table:

| user_id | post_id | post_content                                                                 | post_date               |
|---------|---------|-------------------------------------------------------------------------------|-------------------------|
| 151652  | 599415  | Need a hug                                                                   | 07/10/2021 12:00:00     |
| 661093  | 624356  | Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend | 07/29/2021 13:00:00     |
| 004239  | 784254  | Happy 4th of July!                                                           | 07/04/2021 11:00:00     |
| 661093  | 442560  | Just going to cry myself to sleep after watching Marley and Me.               | 07/08/2021 14:00:00     |
| 151652  | 111766  | I'm so done with covid - need travelling ASAP!                               | 07/12/2021 19:00:00     |

---

### Example Output:

| user_id | days_between |
|---------|--------------|
| 151652  | 2            |
| 661093  | 21           |

---

### Explanation

For **user 151652**, the first post was on **07/10/2021** and the last post was on **07/12/2021**. The number of days between these posts is **2**.  
For **user 661093**, the first post was on **07/08/2021** and the last post was on **07/29/2021**. The number of days between these posts is **21**.
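
Since the question asks for a SQL query, the explanation above can be sketched as one. This is a minimal sketch, assuming SQLite (through Python's built-in `sqlite3` module) as a stand-in database and ISO-formatted text timestamps. The original question does not name a dialect, and `julianday`, `date`, and `strftime` are SQLite-specific, so other engines would need their own date functions.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real `posts` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (user_id INTEGER, post_id INTEGER, post_content TEXT, post_date TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (151652, 599415, "Need a hug", "2021-07-10 12:00:00"),
        (661093, 624356, "Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend", "2021-07-29 13:00:00"),
        (4239, 784254, "Happy 4th of July!", "2021-07-04 11:00:00"),
        (661093, 442560, "Just going to cry myself to sleep after watching Marley and Me.", "2021-07-08 14:00:00"),
        (151652, 111766, "I'm so done with covid - need travelling ASAP!", "2021-07-12 19:00:00"),
    ],
)

query = """
SELECT user_id,
       -- Truncate both timestamps to dates so the gap counts calendar days.
       CAST(julianday(date(MAX(post_date))) - julianday(date(MIN(post_date))) AS INTEGER)
           AS days_between
FROM posts
WHERE strftime('%Y', post_date) = '2021'   -- only posts made in 2021
GROUP BY user_id
HAVING COUNT(post_id) >= 2                 -- only users with at least two posts
ORDER BY user_id
"""

rows = conn.execute(query).fetchall()
print(rows)  # [(151652, 2), (661093, 21)]
```

User 4239 has a single post, so the `HAVING` clause removes it from the result.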


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
from datetime import datetime

# Create Spark session
spark = SparkSession.builder.master('local[1]').getOrCreate()
sc = spark.sparkContext

# Define the rows of the posts table as an RDD of
# (user_id, post_id, post_content, post_date) tuples
df = sc.parallelize([
    (151652, 599415, "Need a hug", datetime(2021, 7, 10, 12, 0)),
    (661093, 624356, "Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend", datetime(2021, 7, 29, 13, 0)),
    (4239, 784254, "Happy 4th of July!", datetime(2021, 7, 4, 11, 0)),
    (661093, 442560, "Just going to cry myself to sleep after watching Marley and Me.", datetime(2021, 7, 8, 14, 0)),
    (151652, 111766, "I'm so done with covid - need travelling ASAP!", datetime(2021, 7, 12, 19, 0))
])

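# `df` stays an RDD because later cells call RDD methods on it.
# Convert it to a DataFrame only for display; without explicit column names
# Spark labels the columns _1 through _4.
df.toDF().show(truncate=False)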


+------+------+----------------------------------------------------------------------------------------------------------------+-------------------+
|_1    |_2    |_3                                                                                                              |_4                 |
+------+------+----------------------------------------------------------------------------------------------------------------+-------------------+
|151652|599415|Need a hug                                                                                                      |2021-07-10 12:00:00|
|661093|624356|Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend|2021-07-29 13:00:00|
|4239  |784254|Happy 4th of July!                                                                                              |2021-07-04 11:00:00|
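|661093|442560|Just going to cry myself to sleep after watching Marley and Me.                                                 |2021-07-08 14:00:00|
|151652|111766|I'm so done with covid - need travelling ASAP!                                                                  |2021-07-12 19:00:00|
+------+------+----------------------------------------------------------------------------------------------------------------+-------------------+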

In [None]:
# First attempt: track the (latest, earliest) post timestamp per user and subtract.
df2 = df.map(lambda x: (x[0], (x[3], x[3]))) \
        .reduceByKey(lambda x, y: (max(x[0], y[0]), min(x[1], y[1]))) \
        .filter(lambda x: x[1][0] != x[1][1]) \
        .map(lambda x: (x[0], x[1][0] - x[1][1]))

df2.toDF(['user_id','day_between']).show(truncate=False)
# This does not give the expected answer. The question counts calendar days, so a gap of
# 20 days 23 hours should be reported as 21 days. Converting each timestamp to a date
# before subtracting fixes this, as in the next cell.

+-------+------------------------------------+
|user_id|day_between                         |
+-------+------------------------------------+
|151652 |INTERVAL '2 07:00:00' DAY TO SECOND |
|661093 |INTERVAL '20 23:00:00' DAY TO SECOND|
+-------+------------------------------------+
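
The interval for user 661093 above is 20 days 23 hours, one hour short of the expected 21 days. A small plain-Python sketch of the arithmetic, using only the two timestamps from the example input, shows why truncating to dates before subtracting changes the result:

```python
from datetime import datetime

# First and last post timestamps for user 661093, taken from the example input.
first_post = datetime(2021, 7, 8, 14, 0)
last_post = datetime(2021, 7, 29, 13, 0)

# Subtracting full timestamps keeps the time of day: 20 days, 23:00:00.
timestamp_gap = last_post - first_post

# Subtracting dates ignores the time of day and counts calendar days.
calendar_gap = last_post.date() - first_post.date()

print(timestamp_gap.days)  # 20
print(calendar_gap.days)   # 21
```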



In [8]:
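# Convert each timestamp to a date first so the difference counts calendar days.
# Note: the filter treats "first date != last date" as "posted at least twice", and
# no year filter is applied. Both shortcuts are safe for the example rows, which are
# all from 2021 and never repeat a date for the same user.
df2 = df.map(lambda x: (x[0], (x[3].date(), x[3].date()))) \
        .reduceByKey(lambda x, y: (max(x[0], y[0]), min(x[1], y[1]))) \
        .filter(lambda x: x[1][0] != x[1][1]) \
        .map(lambda x: (x[0], (x[1][0] - x[1][1]).days))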

df2.toDF(['user_id','day_between']).show(truncate=False)

+-------+-----------+
|user_id|day_between|
+-------+-----------+
|151652 |2          |
|661093 |21         |
+-------+-----------+
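
The cell above detects "posted at least twice" by checking that the first and last dates differ, and it does not restrict rows to 2021. A user with two posts on the same day would therefore be dropped, and posts from other years would be counted. The following is a plain-Python sketch of the same per-user reduction with an explicit post count and a year filter. The `merge` helper and the rows for users 777 and 888 are hypothetical additions for illustration and are not part of the original example.

```python
from datetime import datetime

posts = [
    (151652, datetime(2021, 7, 10, 12, 0)),
    (661093, datetime(2021, 7, 29, 13, 0)),
    (4239, datetime(2021, 7, 4, 11, 0)),
    (661093, datetime(2021, 7, 8, 14, 0)),
    (151652, datetime(2021, 7, 12, 19, 0)),
    # Hypothetical extra rows, not in the original example:
    (777, datetime(2021, 3, 1, 9, 0)),     # two posts on the same day
    (777, datetime(2021, 3, 1, 18, 0)),
    (888, datetime(2020, 12, 31, 23, 0)),  # only one of these two is in 2021
    (888, datetime(2021, 1, 5, 8, 0)),
]


def merge(a, b):
    """Combine two (last_date, first_date, post_count) summaries for one user."""
    return (max(a[0], b[0]), min(a[1], b[1]), a[2] + b[2])


summaries = {}
for user_id, ts in posts:
    if ts.year != 2021:          # keep only posts made in 2021
        continue
    day = ts.date()              # drop the time of day
    single = (day, day, 1)
    if user_id in summaries:
        summaries[user_id] = merge(summaries[user_id], single)
    else:
        summaries[user_id] = single

# Keep users with at least two 2021 posts and compute the calendar-day gap.
result = {
    user_id: (last - first).days
    for user_id, (last, first, count) in summaries.items()
    if count >= 2
}
print(result)  # {151652: 2, 661093: 21, 777: 0}
```

Because `merge` is associative and commutative, the same function could in principle be passed to `reduceByKey` on an RDD of `(user_id, (date, date, 1))` pairs, after a `filter` on the year.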

