# Homework 01-5 - Download the comments on Gossiping board articles

## Import the packages

Load Pandas, Requests, and BeautifulSoup.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Tag

## Extract articles

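Define an `extract_articles` function that extracts the summary of each article on an article list page.

- Title: selected with `div.title`
- Author: selected with `div.author`
- Date: selected with `div.date`
- Link: selected from the `a` child of `div.title`. The `href` is a bare pathname, so the `https://www.ptt.cc` origin is prepended.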
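
The last bullet relies on plain string concatenation. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a pathname against a base URL and leaves an already absolute URL untouched. The helper name `to_absolute` and the constant `BASE_URL` are hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

# Same origin that the notebook prepends by hand.
BASE_URL = "https://www.ptt.cc"


def to_absolute(href: str) -> str:
    """Resolve a PTT pathname (or an absolute URL) against the site root."""
    return urljoin(BASE_URL, href)


# A pathname as found in the article list.
from_path = to_absolute("/bbs/Gossiping/M.1743402072.A.75D.html")

# An absolute URL is returned unchanged.
from_absolute = to_absolute("https://www.ptt.cc/bbs/Gossiping/index.html")

print(from_path)
print(from_absolute)
```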

In [2]:
def extract_articles(soup: BeautifulSoup) -> pd.DataFrame:
    articles = soup.find_all("div", class_="r-ent")
    results = []

    for i, article in enumerate(articles):
        assert isinstance(article, Tag)

        title = article.find("div", class_="title")
        author = article.find("div", class_="author")
        date = article.find("div", class_="date")
        link = article.select_one(".title > a")

        assert isinstance(title, Tag)
        assert isinstance(author, Tag)
        assert isinstance(date, Tag)

        if link is None:
            print(f"Article {i} has no link")
            continue

        assert isinstance(link, Tag)

        title_text = title.text.strip()
        author_text = author.text.strip()
        date_text = date.text.strip()
        link_text = link.get("href")

        assert isinstance(link_text, str)

        results.append({
            "文章標題": title_text,
            "作者": author_text,
            "文章日期": date_text,
            "文章連結": "https://www.ptt.cc" + link_text
        })

    results_df = pd.DataFrame(results)

    return results_df 

## Extract comments from an article

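Given an article page, extract the information of its comments.

- Link: used as the key for merging with the dataframe returned by `extract_articles`
- Reaction: 推 / 噓 / → (upvote, downvote, or neutral arrow)
- Commenter
- Comment content: with the leading `: ` removed.
- Comment time (without the IP)
    - A regex separates the IP from the date and time: `([\d.]+) (\d{2}/\d{2} \d{2}:\d{2})`
    - Only the date and time are kept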
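
The regex step can be tried in isolation. The sketch below is a variant, not the notebook's pattern: it assumes that some comments might carry no IP, so the IP group is made optional and both parts are read through named groups. The names `IP_DATETIME` and `split_ip_datetime` are hypothetical.

```python
import re

# Assumption: the IP may be absent, so the whole IP part is optional.
IP_DATETIME = re.compile(
    r"(?:(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+)?"
    r"(?P<datetime>\d{2}/\d{2} \d{2}:\d{2})"
)


def split_ip_datetime(text: str):
    """Return (ip, datetime); either part is None when it cannot be found."""
    match = IP_DATETIME.search(text.strip())
    if match is None:
        return None, None
    return match.group("ip"), match.group("datetime")


with_ip = split_ip_datetime(" 111.240.96.24 03/29 22:49\n")
without_ip = split_ip_datetime("03/29 22:49")

print(with_ip)
print(without_ip)
```

Returning `None` instead of asserting lets the caller decide whether to skip the comment or keep it with a missing time.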

In [3]:
import re

# 111.240.96.24 03/29 22:49
IP_DATETIME_EXTRACTOR = re.compile(r"([\d.]+) (\d{2}/\d{2} \d{2}:\d{2})")
COMMENT_CONTENT_EXTRACTOR = re.compile(r": (.*)")

def extract_comments(link: str, soup: BeautifulSoup) -> pd.DataFrame:
    comments = soup.find_all("div", class_="push")
    results = []

    for comment in comments:
        assert isinstance(comment, Tag)

        # Reaction tag: 推 / 噓 / →
        push_tag = comment.find("span", class_="push-tag")
        assert isinstance(push_tag, Tag)
        push_tag_text = push_tag.text.strip()
        assert isinstance(push_tag_text, str)

        # Commenter
        push_user_id = comment.find("span", class_="push-userid")
        assert isinstance(push_user_id, Tag)
        push_user_id_text = push_user_id.text.strip()
        assert isinstance(push_user_id_text, str)

        # Comment content
        push_content = comment.find("span", class_="push-content")
        assert isinstance(push_content, Tag)
        push_content_text_raw = push_content.text.strip()
        assert isinstance(push_content_text_raw, str)

        push_content_matches = COMMENT_CONTENT_EXTRACTOR.search(push_content_text_raw)
        if push_content_matches is None:
            print(f"warning: push_content_text does not start with ':', ignoring: {push_content_text_raw}")
            continue
        assert isinstance(push_content_matches, re.Match)
        push_content_text = push_content_matches.group(1)
        assert isinstance(push_content_text, str)

        # Comment time (the raw text still includes the IP)
        push_ipdatetime = comment.find("span", class_="push-ipdatetime")
        assert isinstance(push_ipdatetime, Tag)
        push_ipdatetime_text = push_ipdatetime.text.strip()
        assert isinstance(push_ipdatetime_text, str)

        push_datetime_matches = IP_DATETIME_EXTRACTOR.search(push_ipdatetime_text)
        assert isinstance(push_datetime_matches, re.Match)
        push_datetime = push_datetime_matches.group(2)
        assert isinstance(push_datetime, str)
        
        results.append({
            "連結": link,
            "喜好程度": push_tag_text,
            "留言者": push_user_id_text,
            "留言內容": push_content_text,
            "留言時間": push_datetime
        })

    return pd.DataFrame(results)

## Get previous page URL

Define a `get_previous_page_url` function that extracts the "previous page" link from a page.

In [4]:
def get_previous_page_url(soup: BeautifulSoup) -> str:
    # Find the element whose text is '‹ 上頁' (previous page)
    prev_page_link = soup.find("a", string="‹ 上頁")
    assert isinstance(prev_page_link, Tag)

    prev_page_path = prev_page_link.get('href')
    assert isinstance(prev_page_path, str)

    return "https://www.ptt.cc" + prev_page_path
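
The function above asserts that the link exists and carries an `href`. A more forgiving variant might return `None` instead, which would help if the button is ever rendered without an `href`. The sketch below assumes that such a disabled button looks like the `html_disabled` snippet, which is a guess about the markup rather than something shown in this notebook. The name `find_previous_page_url` is hypothetical, and `html.parser` is used only to keep the example free of the `lxml` dependency.

```python
from typing import Optional

from bs4 import BeautifulSoup, Tag


def find_previous_page_url(soup: BeautifulSoup) -> Optional[str]:
    """Return the absolute URL of the previous page, or None if unavailable."""
    link = soup.find("a", string="‹ 上頁")
    if not isinstance(link, Tag):
        return None  # no such button on the page

    href = link.get("href")
    if not isinstance(href, str):
        return None  # button present but without a usable href

    return "https://www.ptt.cc" + href


# Hypothetical snippets: one normal button, one assumed "disabled" button.
html_with_link = '<a class="btn wide" href="/bbs/Gossiping/index38954.html">‹ 上頁</a>'
html_disabled = '<a class="btn wide disabled">‹ 上頁</a>'

with_link = find_previous_page_url(BeautifulSoup(html_with_link, "html.parser"))
disabled = find_previous_page_url(BeautifulSoup(html_disabled, "html.parser"))

print(with_link)
print(disabled)
```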

## Main application

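Fetch the latest page of Gossiping.

- The `over18=1` cookie must be sent; otherwise the request gets stuck at the over-18 confirmation page.
- `lxml` is used as the parser for faster parsing.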
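
The notebook sends the cookie as a raw `cookie` header. Requests also accepts a `cookies` dictionary, which should produce the same header. The sketch below only prepares the request without sending it, so the resulting header can be inspected offline.

```python
import requests

url = "https://www.ptt.cc/bbs/Gossiping/index.html"

# Build the request but do not send it; prepare() fills in the final headers.
prepared = requests.Request("GET", url, cookies={"over18": "1"}).prepare()

cookie_header = prepared.headers["Cookie"]
print(cookie_header)
```

Passing `cookies={"over18": "1"}` to `requests.get` in the same way keeps the cookie separate from any other headers.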

In [5]:
# First Page

url = "https://www.ptt.cc/bbs/Gossiping/index.html"
resp = requests.get(url, headers={"cookie": "over18=1"})
soup = BeautifulSoup(resp.text, "lxml")

Next, use the function defined above to extract the articles on the latest page.

In [6]:
pd_1 = extract_articles(soup)
pd_1

|  | 文章標題 | 作者 | 文章日期 | 文章連結 |
| --- | --- | --- | --- | --- |
| 0 | Re: [問卦] 女友想當全職主婦不想去工作怎麼辦？ | zangetsu9006 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402072.... |
| 1 | [問卦] 現在是不是該刪APP了？ | Workforme | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402085.... |
| 2 | [問卦] 國動不想拍YT 怎麼還是一直拍？ | frank110306 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402097.... |
| 3 | [問卦] 現在小孩到底能不能揍？ | sophia748596 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402287.... |
| 4 | [問卦] 可以改推水龍敬風格AI圖嗎？ | wang111283 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402293.... |
| 5 | Re: [問卦] 女生說：今晚 我爸媽不在家 是什麼意思 | aventardorsv | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402378.... |
| 6 | [問卦] 王心凌廣州搭機返台，有像學生妹 | doig | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402426.... |
| 7 | [問卦] 一天吃8顆tomato的情況有多嚴重? | lichuer | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402506.... |
| 8 | [問卦] 20240805那天發生什麼事？ | KyrieIrving1 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402618.... |
| 9 | [問卦] 一年前台海明明沒有要戰爭啊? | EcstasyE | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402635.... |


Next, get the URL of the previous page.

In [7]:
previous_page_url = get_previous_page_url(soup)
previous_page_url

'https://www.ptt.cc/bbs/Gossiping/index38954.html'

Fetch the previous page of Gossiping.

In [8]:
# Second Page
resp = requests.get(previous_page_url, headers={"cookie": "over18=1"})
soup = BeautifulSoup(resp.text, "lxml")

Parse it the same way.

In [9]:
pd_2 = extract_articles(soup)
pd_2

|  | 文章標題 | 作者 | 文章日期 | 文章連結 |
| --- | --- | --- | --- | --- |
| 0 | [問卦] 戚夫人被做成人彘之後 | hwsbetty | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401016.... |
| 1 | [問卦] 畫三角形 外資跑光預言成真 接下來呢? | wwewcwwwf | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401161.... |
| 2 | [問卦] 要出遠門 在阿嬤脖子上掛一個大餅可以嗎 | poeta | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401218.... |
| 3 | [爆卦] 台股史上五大跌點 | ncc5566 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401296.... |
| 4 | [問卦] 台股 前五大跌幅都在這兩年 各位有頭緒嗎 | a5687920 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401354.... |
| 5 | [問卦] 20元的雞蛋水餃股 會不會跌到下市？ | ppp123 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401396.... |
| 6 | [問卦] 吉卜力風格大家看膩了咩 | Bastille | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401430.... |
| 7 | [新聞] 徐巧芯座車駛入河濱單車道　北市府確定 | JioJoin | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401507.... |
| 8 | [問卦] 來點 奶子 露毛照 撫慰受傷的心情？ | kinve1014 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401510.... |
| 9 | [問卦] 預言外資跑光光的是不是先知? | jeff0025 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743401536.... |


Concatenate the dataframes of the two pages.
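
By default `pd.concat` keeps each frame's own row labels, so the combined frame can contain repeated index values. The toy example below, with made-up rows and a hypothetical `title` column, shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two tiny stand-ins for the per-page dataframes.
page_1 = pd.DataFrame({"title": ["a", "b"]})
page_2 = pd.DataFrame({"title": ["c", "d"]})

# Default: the original labels 0, 1 are kept for both frames.
kept = pd.concat([page_1, page_2])

# ignore_index=True: rows are relabelled 0..n-1.
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels do not affect the later merge, which joins on the link column rather than on the index.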

In [10]:
# merge two pd
pd_articles = pd.concat([pd_1, pd_2])
pd_articles

|  | 文章標題 | 作者 | 文章日期 | 文章連結 |
| --- | --- | --- | --- | --- |
| 0 | Re: [問卦] 女友想當全職主婦不想去工作怎麼辦？ | zangetsu9006 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402072.... |
| 1 | [問卦] 現在是不是該刪APP了？ | Workforme | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402085.... |
| 2 | [問卦] 國動不想拍YT 怎麼還是一直拍？ | frank110306 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402097.... |
| 3 | [問卦] 現在小孩到底能不能揍？ | sophia748596 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402287.... |
| 4 | [問卦] 可以改推水龍敬風格AI圖嗎？ | wang111283 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402293.... |
| 5 | Re: [問卦] 女生說：今晚 我爸媽不在家 是什麼意思 | aventardorsv | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402378.... |
| 6 | [問卦] 王心凌廣州搭機返台，有像學生妹 | doig | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402426.... |
| 7 | [問卦] 一天吃8顆tomato的情況有多嚴重? | lichuer | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402506.... |
| 8 | [問卦] 20240805那天發生什麼事？ | KyrieIrving1 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402618.... |
| 9 | [問卦] 一年前台海明明沒有要戰爭啊? | EcstasyE | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402635.... |


Next, collect the comments of each extracted article, one by one.

In [11]:
results = []

for index, row in pd_articles.iterrows():
    print(f"Extracting comments of {row['文章連結']}")
    
    link = row["文章連結"]
    resp = requests.get(link, headers={"cookie": "over18=1"})
    soup = BeautifulSoup(resp.text, "lxml")

    pd_comments = extract_comments(link, soup)
    results.append(pd_comments)

pd_comments_all = pd.concat(results)
pd_comments_all

Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402072.A.75D.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402085.A.A5C.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402097.A.7DE.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402287.A.BE0.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402293.A.E99.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402378.A.4F4.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402426.A.70B.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402506.A.99C.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402618.A.513.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402635.A.96C.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402684.A.460.html
Extracting comments of https://www.ptt.cc/bbs/Gossiping/M.1743402710.A.263.html
Extracting comments of https://www.ptt.c

|  | 連結 | 喜好程度 | 留言者 | 留言內容 | 留言時間 |
| --- | --- | --- | --- | --- | --- |
| 0 | https://www.ptt.cc/bbs/Gossiping/M.1743402097.... | → | gn134679 | 錢 | 03/31 14:21 |
| 0 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | HCYPMGO | 推 | 11/02 21:36 |
| 1 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | LYS5566 | 老人才回被騙 | 11/02 21:36 |
| 2 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | dan5120 | $$ | 11/02 21:37 |
| 3 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | k385476916 | 錢 | 11/02 21:37 |
| ... | ... | ... | ... | ... | ... |
| 130 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | rose01 | 推 | 03/30 18:53 |
| 131 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | yylzvv | 錢 | 03/30 20:43 |
| 132 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | jkl85621 | 錢 | 03/30 21:46 |
| 133 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | Tommy92C | 有錢發的繼續連任吧 | 03/31 03:54 |


Merge everything into one large `pd_all`, joining `文章連結` from the articles with `連結` from the comments.
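
`pd.merge` performs an inner join by default, so an article without any extracted comment does not appear in the result. The toy frames below (made-up links `u1`, `u2`) contrast that with `how="left"`, which keeps such articles and fills the comment columns with missing values. The `validate="one_to_many"` argument is an optional safeguard that raises if an article link appears more than once on the left side.

```python
import pandas as pd

# Made-up data: article u2 has no comments.
articles = pd.DataFrame({"文章連結": ["u1", "u2"], "文章標題": ["A", "B"]})
comments = pd.DataFrame({"連結": ["u1", "u1"], "留言內容": ["x", "y"]})

# Default inner join: only articles with at least one comment survive.
inner = pd.merge(articles, comments, left_on="文章連結", right_on="連結")

# Left join: every article is kept; missing comments become NaN.
left = pd.merge(
    articles,
    comments,
    left_on="文章連結",
    right_on="連結",
    how="left",
    validate="one_to_many",
)

print(len(inner), len(left))
```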

In [12]:
pd_all = pd.merge(pd_articles, pd_comments_all, left_on="文章連結", right_on="連結")
pd_all

|  | 文章標題 | 作者 | 文章日期 | 文章連結 | 連結 | 喜好程度 | 留言者 | 留言內容 | 留言時間 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [問卦] 國動不想拍YT 怎麼還是一直拍？ | frank110306 | 3/31 | https://www.ptt.cc/bbs/Gossiping/M.1743402097.... | https://www.ptt.cc/bbs/Gossiping/M.1743402097.... | → | gn134679 | 錢 | 03/31 14:21 |
| 1 | Fw: [公告] 請留意新註冊帳號使用信件詐騙 | ubcs | 11/02 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | HCYPMGO | 推 | 11/02 21:36 |
| 2 | Fw: [公告] 請留意新註冊帳號使用信件詐騙 | ubcs | 11/02 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | LYS5566 | 老人才回被騙 | 11/02 21:36 |
| 3 | Fw: [公告] 請留意新註冊帳號使用信件詐騙 | ubcs | 11/02 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | dan5120 | $$ | 11/02 21:37 |
| 4 | Fw: [公告] 請留意新註冊帳號使用信件詐騙 | ubcs | 11/02 | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | https://www.ptt.cc/bbs/Gossiping/M.1730554547.... | 推 | k385476916 | 錢 | 11/02 21:37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 290 | [公告] 八卦板主徵選規則修正&截止時間 | ubcs | 3/28 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | rose01 | 推 | 03/30 18:53 |
| 291 | [公告] 八卦板主徵選規則修正&截止時間 | ubcs | 3/28 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | yylzvv | 錢 | 03/30 20:43 |
| 292 | [公告] 八卦板主徵選規則修正&截止時間 | ubcs | 3/28 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | jkl85621 | 錢 | 03/30 21:46 |
| 293 | [公告] 八卦板主徵選規則修正&截止時間 | ubcs | 3/28 | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | https://www.ptt.cc/bbs/Gossiping/M.1743146619.... | 推 | Tommy92C | 有錢發的繼續連任吧 | 03/31 03:54 |


Export the merged result as `comment-detail.csv`.

In [13]:
# Write the merged articles-and-comments frame, not just the article list.
pd_all.to_csv("comment-detail.csv", index=False)
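
Because the columns and values contain Chinese text, the encoding of the CSV can matter to whatever opens it later. As a hedged suggestion, `encoding="utf-8-sig"` writes a UTF-8 byte order mark at the start of the file, which some spreadsheet programs use to detect UTF-8. The sketch below uses a one-row stand-in frame and a temporary directory rather than the notebook's real data.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the merged dataframe.
frame = pd.DataFrame({"留言者": ["gn134679"], "留言內容": ["錢"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "comment-detail.csv")

    # utf-8-sig prepends the UTF-8 byte order mark to the file.
    frame.to_csv(path, index=False, encoding="utf-8-sig")

    with open(path, "rb") as f:
        starts_with_bom = f.read().startswith(b"\xef\xbb\xbf")

    # Reading with the same encoding strips the mark again.
    round_trip = pd.read_csv(path, encoding="utf-8-sig")

print(starts_with_bom)
print(round_trip.columns.tolist())
```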