# Sending vs. Opening Snaps

## Snapchat SQL Interview Question

### Question  
This is the same question as problem #25 in the SQL Chapter of *Ace the Data Science Interview*!

Assume you're given tables with information on Snapchat users, including their ages and time spent sending and opening snaps.

Write a query to obtain a breakdown of the time spent **sending** vs. **opening** snaps as a **percentage of total time** spent on these activities, grouped by **age group**.  
Round the percentages to 2 decimal places in the output.

---

### Notes:
- Calculate the following:
  - Time spent sending / (Time spent sending + Time spent opening)
  - Time spent opening / (Time spent sending + Time spent opening)
- To avoid integer division, **multiply by 100.0**, not 100.
- Effective April 15th, 2023, the solution has been updated and optimized.
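
The integer-division note can be illustrated with a minimal plain-Python sketch. The whole-number totals below are hypothetical (the real `time_spent` column is a float), and Python's `//` operator only stands in for the truncating integer division of dialects such as PostgreSQL.

```python
# Hypothetical whole-number totals, chosen only to illustrate the pitfall.
send_time, open_time = 3, 5
total = send_time + open_time

# Python's // truncates, like integer / integer in PostgreSQL.
truncated = send_time // total * 100      # 3 // 8 is 0, so the result is 0

# Multiplying by 100.0 first promotes the whole expression to floating point.
exact = round(100.0 * send_time / total, 2)

print(truncated, exact)
```

With these assumed values the truncated version reports 0 while the floating-point version reports 37.5, which is why the multiplier is written as `100.0`.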

---

### `activities` Table:

| Column Name   | Type     |
|---------------|----------|
| activity_id   | integer  |
| user_id       | integer  |
| activity_type | string ('send', 'open', 'chat') |
| time_spent    | float    |
| activity_date | datetime |

---

### `activities` Example Input:

| activity_id | user_id | activity_type | time_spent | activity_date        |
|-------------|---------|----------------|------------|----------------------|
| 7274        | 123     | open           | 4.50       | 06/22/2022 12:00:00  |
| 2425        | 123     | send           | 3.50       | 06/22/2022 12:00:00  |
| 1413        | 456     | send           | 5.67       | 06/23/2022 12:00:00  |
| 1414        | 789     | chat           | 11.00      | 06/25/2022 12:00:00  |
| 2536        | 456     | open           | 3.00       | 06/25/2022 12:00:00  |

---

### `age_breakdown` Table:

| Column Name | Type   |
|-------------|--------|
| user_id     | integer|
| age_bucket  | string ('21-25', '26-30', '31-35') |

---

### `age_breakdown` Example Input:

| user_id | age_bucket |
|---------|------------|
| 123     | 31-35      |
| 456     | 26-30      |
| 789     | 21-25      |

---

### Example Output:

| age_bucket | send_perc | open_perc |
|------------|-----------|-----------|
| 26-30      | 65.40     | 34.60     |
| 31-35      | 43.75     | 56.25     |

---

### Explanation:

Using the **age bucket 26-30** as an example:  
- Time spent sending: **5.67**  
- Time spent opening: **3.00**  
- Total = 8.67

Percentages:  
- `send_perc = (5.67 / 8.67) * 100 = 65.40%`  
- `open_perc = (3.00 / 8.67) * 100 = 34.60%`
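
As a quick check of the arithmetic above, here is a minimal plain-Python sketch that recomputes the breakdown from the sample rows. It is not the notebook's Spark solution; the names `activities`, `age_bucket`, `totals`, and `breakdown` are illustrative only.

```python
from collections import defaultdict

# Sample rows copied from the example input tables above:
# (user_id, activity_type, time_spent)
activities = [
    (123, "open", 4.50),
    (123, "send", 3.50),
    (456, "send", 5.67),
    (789, "chat", 11.00),
    (456, "open", 3.00),
]
age_bucket = {123: "31-35", 456: "26-30", 789: "21-25"}

# Total send/open time per age bucket; chat rows are ignored.
totals = defaultdict(lambda: {"send": 0.0, "open": 0.0})
for user_id, activity_type, time_spent in activities:
    if activity_type in ("send", "open"):
        totals[age_bucket[user_id]][activity_type] += time_spent

# Convert each bucket's totals into percentages rounded to 2 decimals.
breakdown = {}
for bucket, t in sorted(totals.items()):
    total = t["send"] + t["open"]
    breakdown[bucket] = (
        round(100.0 * t["send"] / total, 2),
        round(100.0 * t["open"] / total, 2),
    )

for bucket, (send_perc, open_perc) in breakdown.items():
    print(f"{bucket}  send={send_perc:.2f}  open={open_perc:.2f}")
```

The `21-25` bucket never appears because its only activity is `chat`, which matches the example output.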


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, TimestampType
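# Explicit imports; note that sum and round deliberately shadow the Python builtins here
from pyspark.sql.functions import col, round, sum, when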
from datetime import datetime

# Create Spark session
spark = SparkSession.builder.master('local[1]').appName("SendingVsOpeningSnaps").getOrCreate()

# Define schema for activities table
activities_schema = StructType([
    StructField("activity_id", IntegerType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("activity_type", StringType(), True),
    StructField("time_spent", FloatType(), True),
    StructField("activity_date", TimestampType(), True),
])

# Sample data for activities table
activities_data = [
    (7274, 123, "open", 4.50, datetime(2022, 6, 22, 12, 0, 0)),
    (2425, 123, "send", 3.50, datetime(2022, 6, 22, 12, 0, 0)),
    (1413, 456, "send", 5.67, datetime(2022, 6, 23, 12, 0, 0)),
    (1414, 789, "chat", 11.00, datetime(2022, 6, 25, 12, 0, 0)),
    (2536, 456, "open", 3.00, datetime(2022, 6, 25, 12, 0, 0)),
]

# Create activities DataFrame
activities_df = spark.createDataFrame(activities_data, schema=activities_schema)

# Define schema for age_breakdown table
age_schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("age_bucket", StringType(), True),
])

# Sample data for age_breakdown table
age_data = [
    (123, "31-35"),
    (456, "26-30"),
    (789, "21-25"),
]

# Create age_breakdown DataFrame
age_df = spark.createDataFrame(age_data, schema=age_schema)



In [12]:
# Keep only send/open rows, total each activity per age bucket, then convert to percentages
activities_df.join(age_df, ['user_id'])\
    .where(col('activity_type').isin('send', 'open'))\
    .groupBy('age_bucket').agg(sum(when(col('activity_type') == 'send', col('time_spent')).otherwise(0)).alias('send_spent'),
                               sum(when(col('activity_type') == 'open', col('time_spent')).otherwise(0)).alias('open_spent'))\
    .withColumn('send_perc', round(col('send_spent') / (col('open_spent') + col('send_spent')) * 100.0, 2))\
    .withColumn('open_perc', round(col('open_spent') / (col('open_spent') + col('send_spent')) * 100.0, 2))\
    .drop('send_spent', 'open_spent')\
    .show()

+----------+---------+---------+
|age_bucket|send_perc|open_perc|
+----------+---------+---------+
|     31-35|    43.75|    56.25|
|     26-30|     65.4|     34.6|
+----------+---------+---------+



In [9]:
age_df.createOrReplaceTempView('age_breakdown')
activities_df.createOrReplaceTempView('activities')

spark.sql(
'''
WITH cte AS (
  SELECT
    age_bucket,
    SUM(CASE WHEN activity_type = 'send' THEN time_spent ELSE 0 END) AS send_spent,
    SUM(CASE WHEN activity_type = 'open' THEN time_spent ELSE 0 END) AS open_spent
  FROM activities
  JOIN age_breakdown USING (user_id)
  -- Ignore 'chat' rows so every remaining bucket has send or open time
  WHERE activity_type IN ('send', 'open')
  GROUP BY age_bucket
)
SELECT
  age_bucket,
  ROUND(send_spent / (open_spent + send_spent) * 100.0, 2) AS send_perc,
  ROUND(open_spent / (open_spent + send_spent) * 100.0, 2) AS open_perc
FROM cte
'''
).show()


+----------+---------+---------+
|age_bucket|send_perc|open_perc|
+----------+---------+---------+
|     31-35|    43.75|    56.25|
|     26-30|     65.4|     34.6|
+----------+---------+---------+

