## 1241. Number of Comments per Post
## Table: Submissions

| Column Name | Type |
|-------------|------|
| sub_id      | int  |
| parent_id   | int  |

There is no primary key for this table; it may have duplicate rows.  
Each row can be a post or a comment on a post.  
parent_id is null for posts.  
parent_id for comments is the sub_id of another post in the table.

Write an SQL query to find the number of comments for each post.  

The result table should contain post_id and its corresponding number_of_comments, sorted by post_id in ascending order.

Submissions may contain duplicate comments. You should count the number of unique comments per post.

Submissions may contain duplicate posts. You should treat them as one post.

The query result format is in the following example:

### Submissions table:

| sub_id | parent_id |
|--------|-----------|
| 1      | Null      |
| 2      | Null      |
| 1      | Null      |
| 12     | Null      |
| 3      | 1         |
| 5      | 2         |
| 3      | 1         |
| 4      | 1         |
| 9      | 1         |
| 10     | 2         |
| 6      | 7         |

### Result table:

| post_id | number_of_comments |
|---------|--------------------|
| 1       | 3                  |
| 2       | 2                  |
| 12      | 0                  |

The post with id 1 has three comments in the table, with ids 3, 4 and 9. The comment with id 3 is repeated in the table, so it is counted only once.  
The post with id 2 has two comments in the table, with ids 5 and 10.  
The post with id 12 has no comments in the table.  
The comment with id 6 is a comment on a deleted post with id 7, so it is ignored.
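
The rules above (deduplicate posts, deduplicate comments, ignore comments whose parent post is missing) can be cross-checked with a small plain-Python sketch. This is only an illustrative reference, not part of the notebook's Spark solution; the function name `comments_per_post` and the `rows` list are hypothetical names introduced here.

```python
def comments_per_post(rows):
    """rows: list of (sub_id, parent_id) tuples; parent_id is None for posts."""
    # A set collapses duplicate post rows into a single post.
    posts = {sub_id for sub_id, parent_id in rows if parent_id is None}
    # One set of comment ids per post, so repeated comments count once.
    comments = {post_id: set() for post_id in posts}
    for sub_id, parent_id in rows:
        # Posts (parent_id None) and comments on missing posts are skipped here.
        if parent_id in comments:
            comments[parent_id].add(sub_id)
    # Sort by post_id to match the required output order.
    return [(post_id, len(comments[post_id])) for post_id in sorted(posts)]


# Same rows as the Submissions example table.
rows = [(1, None), (2, None), (1, None), (12, None), (3, 1), (5, 2),
        (3, 1), (4, 1), (9, 1), (10, 2), (6, 7)]
print(comments_per_post(rows))  # [(1, 3), (2, 2), (12, 0)]
```

The output matches the result table: post 12 keeps a count of 0, and comment 6 is dropped because post 7 is not in the table.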

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import col, countDistinct, asc

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("sub_id", IntegerType(), True),
    StructField("parent_id", IntegerType(), True)
])

# Sample data
data = [
    (1, None),
    (2, None),
    (1, None),
    (12, None),
    (3, 1),
    (5, 2),
    (3, 1),
    (4, 1),
    (9, 1),
    (10, 2),
    (6, 7)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("Submissions")
df.display()


In [0]:

# SQL logic
spark.sql("""
    WITH posts AS (
        SELECT DISTINCT sub_id AS post_id
        FROM Submissions
        WHERE parent_id IS NULL
    ),
    comments AS (
        SELECT DISTINCT sub_id, parent_id
        FROM Submissions
        WHERE parent_id IS NOT NULL
    )
    SELECT
        p.post_id,
        COUNT(DISTINCT c.sub_id) AS number_of_comments
    FROM posts p
    LEFT JOIN comments c
        ON p.post_id = c.parent_id
    GROUP BY p.post_id
""").createOrReplaceTempView("PostCommentStats")

# Display result; the ORDER BY sits on the final query so the ordering does not depend on the view
display(spark.sql("SELECT * FROM PostCommentStats ORDER BY post_id ASC"))

In [0]:
%sql
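-- Distinct comments per parent_id; rows with a null parent_id are posts, so they are excluded
select parent_id, count(distinct sub_id) as number_of_comments
from Submissions
where parent_id is not null
group by parent_id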

In [0]:
%sql
with cte as (
  select distinct sub_id as post from Submissions where parent_id is null
),
cte2 as (
  select parent_id, count(distinct sub_id) as number_of_comments
  from Submissions where parent_id is not null group by parent_id
)
-- coalesce turns the missing count of a post without comments into 0
select c1.post as post_id, coalesce(c2.number_of_comments, 0) as number_of_comments
from cte c1 left join cte2 c2 on c1.post = c2.parent_id
order by post_id
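
The two-CTE approach (distinct posts, then distinct comment counts per parent, joined with a left join) can also be sanity-checked without a Spark cluster. The sketch below runs an equivalent query in Python's built-in `sqlite3` module, assuming SQLite's dialect is close enough for this query; the helper name `count_comments_sqlite` and the `sample_rows` list are hypothetical and not part of the notebook.

```python
import sqlite3


def count_comments_sqlite(rows):
    """Run the posts/comments CTE query on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Submissions (sub_id INTEGER, parent_id INTEGER)")
    conn.executemany("INSERT INTO Submissions VALUES (?, ?)", rows)
    query = """
        WITH posts AS (
            SELECT DISTINCT sub_id AS post_id
            FROM Submissions
            WHERE parent_id IS NULL
        ),
        counts AS (
            SELECT parent_id, COUNT(DISTINCT sub_id) AS number_of_comments
            FROM Submissions
            WHERE parent_id IS NOT NULL
            GROUP BY parent_id
        )
        SELECT p.post_id, COALESCE(c.number_of_comments, 0) AS number_of_comments
        FROM posts p
        LEFT JOIN counts c ON p.post_id = c.parent_id
        ORDER BY p.post_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Same rows as the Submissions example table; None becomes SQL NULL.
sample_rows = [(1, None), (2, None), (1, None), (12, None), (3, 1), (5, 2),
               (3, 1), (4, 1), (9, 1), (10, 2), (6, 7)]
print(count_comments_sqlite(sample_rows))  # [(1, 3), (2, 2), (12, 0)]
```

The left join is what keeps post 12 in the output, and the join condition on `posts` is what discards the comment on the deleted post 7.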

In [0]:
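# Distinct posts: rows whose parent_id is null
post = df.filter(col("parent_id").isNull()).selectExpr("sub_id as post_id").distinct()
# Distinct comment count per parent_id (post rows excluded)
comment = df.filter(col("parent_id").isNotNull())\
   .groupBy("parent_id").agg(countDistinct("sub_id").alias("number_of_comments"))
# Left join keeps posts without comments; coalesce turns their null count into 0
post.join(comment, col("post_id") == col("parent_id"), "left")\
   .selectExpr("post_id", "coalesce(number_of_comments, 0) as number_of_comments")\
   .orderBy("post_id").display()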
        
