# Problem: Tweet Histogram per User in 2022

**Objective:**  
Write a SQL query to obtain a histogram of tweets posted per user in 2022.  
Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.

---

### Table: `tweets`

| Column Name | Type     |
|-------------|----------|
| tweet_id    | integer  |
| user_id     | integer  |
| msg         | string   |
| tweet_date  | timestamp|

---

### Example Input:

| tweet_id | user_id | msg                                                         | tweet_date           |
|----------|---------|-------------------------------------------------------------|----------------------|
| 214252   | 111     | Am considering taking Tesla private at $420. Funding secured.| 12/30/2021 00:00:00 |
| 739252   | 111     | Despite the constant negative press covfefe                | 01/01/2022 00:00:00 |
| 846402   | 111     | Following @NickSinghTech on Twitter changed my life!       | 02/14/2022 00:00:00 |
| 241425   | 254     | If the salary is so competitive why won’t you tell me what it is? | 03/01/2022 00:00:00 |
| 231574   | 148     | I no longer have a manager. I can't be managed             | 03/23/2022 00:00:00 |

---

### Expected Output:

| tweet_bucket | users_num |
|--------------|-----------|
| 1            | 2         |
| 2            | 1         |

---

### Instructions:
Write a query to:
- Filter tweets from the year 2022.
- Count how many tweets each user posted.
- Group users by the number of tweets they posted (bucket).
- Count the number of users in each bucket.
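
The four steps above map onto a nested aggregation. As a minimal sketch, the query can be tried against an in-memory SQLite database from Python. The SQLite setup, the ISO-formatted date strings, and the names `HISTOGRAM_SQL`, `per_user`, and `tweet_count` are assumptions made for illustration; the target database may need a different date filter (for example an `EXTRACT(YEAR FROM tweet_date)` expression).

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the real `tweets` table,
# with tweet_date stored as ISO-8601 text so string comparison orders by time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER, user_id INTEGER, msg TEXT, tweet_date TEXT)"
)
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        (214252, 111, "Am considering taking Tesla private at $420. Funding secured.", "2021-12-30 00:00:00"),
        (739252, 111, "Despite the constant negative press covfefe", "2022-01-01 00:00:00"),
        (846402, 111, "Following @NickSinghTech on Twitter changed my life!", "2022-02-14 00:00:00"),
        (241425, 254, "If the salary is so competitive why won't you tell me what it is?", "2022-03-01 00:00:00"),
        (231574, 148, "I no longer have a manager. I can't be managed", "2022-03-23 00:00:00"),
    ],
)

# Inner query: tweets per user in 2022. Outer query: users per tweet count.
HISTOGRAM_SQL = """
SELECT tweet_count AS tweet_bucket,
       COUNT(*)    AS users_num
FROM (
    SELECT user_id, COUNT(*) AS tweet_count
    FROM tweets
    WHERE tweet_date >= '2022-01-01' AND tweet_date < '2023-01-01'
    GROUP BY user_id
) AS per_user
GROUP BY tweet_count
ORDER BY tweet_bucket
"""

rows = conn.execute(HISTOGRAM_SQL).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

The half-open date range keeps every 2022 timestamp and excludes the 2021 tweet, so user 111 lands in bucket 2 while users 254 and 148 share bucket 1.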


In [None]:
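from datetime import datetime

from pyspark.sql import SparkSession

# Create a local Spark session and get its SparkContext for the RDD API
spark = SparkSession.builder.master('local[1]').getOrCreate()
sc = spark.sparkContext

# Define the sample data as an RDD of (tweet_id, user_id, msg, tweet_date) tuples.
# Note: despite its name, `df` is an RDD, not a DataFrame.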
df = sc.parallelize([
    (214252, 111, "Am considering taking Tesla private at $420. Funding secured.", datetime(2021, 12, 30, 0, 0)),
    (739252, 111, "Despite the constant negative press covfefe", datetime(2022, 1, 1, 0, 0)),
    (846402, 111, "Following @NickSinghTech on Twitter changed my life!", datetime(2022, 2, 14, 0, 0)),
    (241425, 254, "If the salary is so competitive why won’t you tell me what it is?", datetime(2022, 3, 1, 0, 0)),
    (231574, 148, "I no longer have a manager. I can't be managed", datetime(2022, 3, 23, 0, 0))
])

# Convert to a DataFrame for display only; columns get the default names _1 to _4
df.toDF().show(truncate=False)



+------+---+-----------------------------------------------------------------+-------------------+
|_1    |_2 |_3                                                               |_4                 |
+------+---+-----------------------------------------------------------------+-------------------+
|214252|111|Am considering taking Tesla private at $420. Funding secured.    |2021-12-30 00:00:00|
|739252|111|Despite the constant negative press covfefe                      |2022-01-01 00:00:00|
|846402|111|Following @NickSinghTech on Twitter changed my life!             |2022-02-14 00:00:00|
|241425|254|If the salary is so competitive why won’t you tell me what it is?|2022-03-01 00:00:00|
|231574|148|I no longer have a manager. I can't be managed                   |2022-03-23 00:00:00|
+------+---+-----------------------------------------------------------------+-------------------+



In [52]:
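# RDD solution: count tweets per user, then count users per tweet count.
# Row order after reduceByKey is not guaranteed.
df2 = (
    df
    .filter(lambda x: x[3].year == 2022)   # keep only tweets posted in 2022
    .map(lambda x: (x[1], 1))              # (user_id, 1)
    .reduceByKey(lambda x, y: x + y)       # (user_id, tweet_count)
    .map(lambda x: (x[1], 1))              # (tweet_count, 1)
    .reduceByKey(lambda x, y: x + y)       # (tweet_count, users_num)
)

df2.toDF(['tweet_bucket', 'users_num']).show()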

+------------+---------+
|tweet_bucket|users_num|
+------------+---------+
|           2|        1|
|           1|        2|
+------------+---------+
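
The RDD pipeline above is a "count of counts": the first `reduceByKey` yields tweets per user, the second yields users per tweet count. A minimal plain-Python sketch of the same idea, useful for checking the logic without Spark, could look like this. The names `tweets`, `tweets_per_user`, `histogram`, and `result` are hypothetical, and the message column is omitted because it plays no role in the count.

```python
from collections import Counter
from datetime import datetime

# Same sample rows as above, reduced to (tweet_id, user_id, tweet_date).
tweets = [
    (214252, 111, datetime(2021, 12, 30)),
    (739252, 111, datetime(2022, 1, 1)),
    (846402, 111, datetime(2022, 2, 14)),
    (241425, 254, datetime(2022, 3, 1)),
    (231574, 148, datetime(2022, 3, 23)),
]

# Step 1: tweets per user, restricted to 2022 (mirrors filter + first reduceByKey).
tweets_per_user = Counter(
    user_id for _, user_id, tweet_date in tweets if tweet_date.year == 2022
)

# Step 2: users per tweet count (mirrors the second map + reduceByKey).
histogram = Counter(tweets_per_user.values())

# Sort by bucket so the order matches the expected output table.
result = sorted(histogram.items())
print(result)  # [(1, 2), (2, 1)]
```

Sorting here is explicit; the Spark version would need something like `sortByKey()` to get the same stable ordering.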



In [None]:
%%sql
