# AS02 Reading PTT data

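In this assignment, you are given a file of PTT data collected by g0v: the posts, and their comments, published on the Gossiping board on 2020-01-02. The file is stored in jsonl format, where each line is one JSON object containing a post's title, author, publication time, content, comments, and other fields. Your task is to read the file and answer questions such as which author published the most posts that day and how many posts that author published.
- Data link: https://github.com/p4css/py4css/blob/main/data/2020-01-02.jsonl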

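## Q1. Reading the jsonl data

How many posts does the dataset contain in total? (Answering this question correctly shows that you have successfully loaded the data, and earns the basic 40 points.)

Ask ChatGPT3.5 how to read jsonl data, then read the file into a list or dict. Ideally, each post is a dict. Within each post, one key, comments, maps to a value that is itself a list containing all of that post's comments.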

In [1]:
import json

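# Create an empty list to store the post dicts
post_list = []

# Open the JSONL file, then read and parse it one line at a time
with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            # Parse the JSON object and append the resulting dict to the list
            post = json.loads(line)
            post_list.append(post)
        except json.JSONDecodeError as e:
            # Report lines that cannot be parsed
            print(f"Error while parsing JSON: {e}")

# post_list now contains all the dicts; each dict represents one post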

In [6]:
type(post_list)
post_list[0]
post_list[0]['comments']
print("Number of comments: ", len(post_list[0]['comments']))

Number of comments:  234
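
Q1 asks for the number of posts, which is simply the length of the list built by the reader loop. The following is a minimal sketch of that step on made-up data: `sample_jsonl`, `read_jsonl`, and the two sample posts are hypothetical stand-ins for the real file, so the count printed here says nothing about the actual dataset.

```python
import io
import json

# Hypothetical two-line JSONL sample standing in for the real file.
sample_jsonl = io.StringIO(
    '{"title": "post A", "author": "user1", "comments": [{"author": "u2", "text": "hi"}]}\n'
    '{"title": "post B", "author": "user3", "comments": []}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line and return them as a list."""
    posts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing empty line
        posts.append(json.loads(line))
    return posts


sample_posts = read_jsonl(sample_jsonl)
n_posts = len(sample_posts)  # one list element per post
print("Number of posts:", n_posts)
```

With the real data, the same idea applies: `len(post_list)` gives the number of posts once the file has been read.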


## Q2. How many comments does the dataset contain in total? (10 points)



In [8]:
# The answer is written with list comprehension, which is a bit more advanced.
# You can also use a for loop to do the same thing.
# Ask ChatGPT or me if you have any questions.

sum([len(post['comments']) for post in post_list])

6501
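
For readers who prefer the for-loop version mentioned in the comments, here is a rough equivalent on a tiny made-up list. `toy_posts` is a hypothetical stand-in with the same nested shape as `post_list`, and the use of `.get` is an added precaution in case a post has no comments key, which may never happen in the real file.

```python
# Hypothetical miniature post list with the same nested shape as the real data.
toy_posts = [
    {"author": "a", "comments": [{"author": "x"}, {"author": "y"}]},
    {"author": "b", "comments": []},
    {"author": "a", "comments": [{"author": "x"}]},
]

total_comments = 0
for post in toy_posts:
    # .get returns an empty list if the "comments" key is missing (assumed precaution)
    total_comments += len(post.get("comments", []))

print(total_comments)
```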

## Q3. Which fields do the posts in the dataset have? Which fields do the comments have? (10 points)

In [16]:
print("Fields of posts: \n", ", ".join(list(post_list[0].keys())))
print()
print("Fields of comments: \n", ", ".join(list(post_list[0]['comments'][0].keys())))


Fields of posts: 
 version, canonical_url, title, author, connect_from, published_at, first_seen_at, last_updated_at, id, producer_id, text, urls, image_urls, hashtags, keywords, tags, metadata, comments

Fields of comments: 
 id, reaction, author, text, published_at, connect_from
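
The cell above inspects only the first post and its first comment, which is enough if every record shares the same keys. If that cannot be taken for granted, one option is to collect the union of keys over all posts and comments. The sketch below does this on made-up data; `toy_posts` and its field values are hypothetical.

```python
# Hypothetical posts whose keys differ slightly from one another.
toy_posts = [
    {"title": "t1", "author": "a", "comments": [{"author": "x", "text": "hi"}]},
    {"title": "t2", "author": "b", "urls": [], "comments": [{"author": "y", "reaction": "like"}]},
]

post_fields = set()
comment_fields = set()
for post in toy_posts:
    post_fields.update(post.keys())  # add every key seen on this post
    for comment in post.get("comments", []):
        comment_fields.update(comment.keys())  # add every key seen on this comment

print("Fields of posts:", ", ".join(sorted(post_fields)))
print("Fields of comments:", ", ".join(sorted(comment_fields)))
```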


## Q4. How many distinct post authors are there in the dataset? (Note that the same user may have published more than one post.) (10 points)

The answer should be like this:
```
There are totally n users has posts in this dataset.
```

In [19]:
# Here we use a list comprehension to get all the authors of the posts.
# Then we use the Counter class to count the number of unique authors.
# You can also use a for loop to do the same thing.

from collections import Counter
n_author = len(Counter([post['author'] for post in post_list]))
print(f"There are totally {n_author} users has posts in this dataset.")
    

There are totally 113 users has posts in this dataset.
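
Counting distinct authors only needs a set. A `Counter` becomes useful when you also want to know who posted the most, as the assignment introduction asks. Both are sketched below on made-up data; `toy_posts` and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical posts: author "a" appears twice.
toy_posts = [{"author": "a"}, {"author": "b"}, {"author": "a"}]

# A set keeps each author once, so its length is the number of distinct authors.
unique_authors = {post["author"] for post in toy_posts}
n_unique = len(unique_authors)

# A Counter additionally records how many posts each author published.
post_counter = Counter(post["author"] for post in toy_posts)
top_author, top_count = post_counter.most_common(1)[0]

print(f"{n_unique} distinct authors; top poster: {top_author} with {top_count} posts")
```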


## Q5. Who is the commenter with the most comments? (10 points)

The answer should be like this:
```
{'author': 'sirius9453', 'count': 10}
```


In [21]:
c_author_counter = Counter()
for post in post_list:
    c_authors = [comment['author'] for comment in post['comments']]
    c_author_counter.update(c_authors)

top_c_authors = c_author_counter.most_common(10)
top_c_authors_list = [{'author': author, 'count': count} for author, count in top_c_authors]
top_c_authors_list[0]

{'author': 'valentian', 'count': 40}

## Q6. Who are the top 10 commenters by number of comments, and how many comments did each of them post? (20 points)

Print out the top 10 authors with the most comments. The output format is as follows:

```
Top 1: commenter_name, 100 comments
Top 2: commenter_name, 99 comments

```

In [22]:
c_author_counter = Counter()
for post in post_list:
    c_authors = [comment['author'] for comment in post['comments']]
    c_author_counter.update(c_authors)

top_c_authors = c_author_counter.most_common(10)
for i, (author, count) in enumerate(top_c_authors, 1):
    print(f"Top {i}: {author}, {count} comments")

Top 1: valentian, 40 comments
Top 2: electronicyi, 27 comments
Top 3: q12341234, 25 comments
Top 4: lasekoutkast, 24 comments
Top 5: Tchachavsky, 22 comments
Top 6: THAO168, 18 comments
Top 7: soria, 17 comments
Top 8: scratch01, 16 comments
Top 9: kent, 16 comments
Top 10: zebra01, 16 comments
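
In the output above, ranks 8 to 10 all have 16 comments, so it is possible that other commenters with the same count were cut off by `most_common(10)`, which returns exactly ten entries. If ties at the cutoff matter, one approach is to keep everyone whose count is at least that of the n-th entry. The sketch below assumes a positive `n`; `top_n_with_ties` and `toy_counter` are hypothetical names and data, not part of the assignment.

```python
from collections import Counter

# Hypothetical comment counts with a three-way tie at 3.
toy_counter = Counter({"u1": 5, "u2": 3, "u3": 3, "u4": 3, "u5": 1})


def top_n_with_ties(counter, n):
    """Return the top n entries, plus any entries tied with the n-th one."""
    ranked = counter.most_common()  # all entries, highest count first
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # count of the n-th ranked entry
    return [(name, count) for name, count in ranked if count >= cutoff]


result = top_n_with_ties(toy_counter, 2)
for i, (author, count) in enumerate(result, 1):
    print(f"Top {i}: {author}, {count} comments")
```

Here a request for the top 2 returns four entries, because three commenters share the second-highest count.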
