# PTT Sports Bra Data Analysis

This notebook cleans and analyzes sports bra posts collected from PTT.

## Loading Packages

In [1]:
# Load the required packages
import pandas as pd

## Reading and Preprocessing the Data

In [2]:
# Read the raw sports bra data
ptt_data = pd.read_csv("datasets/PTT_運動內衣_onepage資料.csv")
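
The `Unnamed: 0` column that shows up in the `info()` output further down usually means the CSV was saved with its row index. A minimal sketch of one way to avoid loading it as a regular column, assuming the first column of the file really is a saved index (the two-row sample below is hypothetical and only stands in for the real file):

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# The empty first header cell is what pandas names "Unnamed: 0".
sample_csv = io.StringIO(
    ",標題,總留言數\n"
    "0,sample title A,3\n"
    "1,sample title B,5\n"
)

# index_col=0 uses the first column as the row index instead of
# loading it as an ordinary data column.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

The rest of the notebook does not rely on this, so the original `read_csv` call works as written.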

In [3]:
# Drop duplicate rows and rows with missing values
ptt_data = ptt_data.drop_duplicates()
ptt_data = ptt_data.dropna()
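
It can help to know how many rows each cleaning step removes. The sketch below uses a small made-up frame (`sample_df` and its values are illustrative, not the PTT data) and records the row count after each step:

```python
import pandas as pd

# Made-up sample: rows 0 and 1 are exact duplicates, row 3 has a missing title.
sample_df = pd.DataFrame(
    {
        "標題": ["A", "A", "B", None],
        "內容": ["x", "x", "y", "z"],
    }
)

rows_start = len(sample_df)

# drop_duplicates() keeps the first occurrence of each identical row.
deduped = sample_df.drop_duplicates()
rows_after_dedup = len(deduped)

# dropna() removes any row that has a missing value in ANY column;
# pass subset=[...] to restrict the check to specific columns.
cleaned = deduped.dropna()
rows_after_dropna = len(cleaned)

print(rows_start, rows_after_dedup, rows_after_dropna)
```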

## Data Type Conversion and Processing

In [4]:
# Inspect the data
ptt_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  21 non-null     int64 
 1   標題          21 non-null     object
 2   時間          21 non-null     object
 3   內容          21 non-null     object
 4   類別          21 non-null     object
 5   版名          21 non-null     object
 6   文章ID        21 non-null     object
 7   作者          21 non-null     object
 8   IP          21 non-null     object
 9   總留言數        21 non-null     int64 
 10  留言內容        21 non-null     object
 11  推推總數        21 non-null     int64 
 12  噓聲總數        21 non-null     int64 
 13  中立總數        21 non-null     int64 
dtypes: int64(5), object(9)
memory usage: 2.4+ KB


In [5]:
# Convert the 時間 (time) column to datetime
ptt_data["時間"] = pd.to_datetime(ptt_data["時間"])
ptt_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Unnamed: 0  21 non-null     int64         
 1   標題          21 non-null     object        
 2   時間          21 non-null     datetime64[ns]
 3   內容          21 non-null     object        
 4   類別          21 non-null     object        
 5   版名          21 non-null     object        
 6   文章ID        21 non-null     object        
 7   作者          21 non-null     object        
 8   IP          21 non-null     object        
 9   總留言數        21 non-null     int64         
 10  留言內容        21 non-null     object        
 11  推推總數        21 non-null     int64         
 12  噓聲總數        21 non-null     int64         
 13  中立總數        21 non-null     int64         
dtypes: datetime64[ns](1), int64(5), object(8)
memory usage: 2.4+ KB
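
Scraped timestamps are not always well formed. As a hedged sketch, `errors="coerce"` turns unparseable values into `NaT` instead of raising, so they can be counted and inspected. The sample strings below are assumptions; the real timestamp format in the dataset is not shown in this notebook:

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp and one malformed entry.
raw_times = pd.Series(["2023-02-23 23:52:00", "not a date"])

# errors="coerce" converts values that cannot be parsed into NaT.
parsed = pd.to_datetime(raw_times, errors="coerce")

# Count how many values failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```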


In [6]:
# Compute the total comment count (pushes + boos + neutral)
ptt_data["留言總數"] = (
    ptt_data["推推總數"] + ptt_data["噓聲總數"] + ptt_data["中立總數"]
)
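
The dataset already has a `總留言數` column, so the newly computed `留言總數` can serve as a cross-check. The sketch below assumes both columns are meant to count the same thing, which the notebook does not confirm, and uses made-up numbers:

```python
import pandas as pd

# Made-up counts; the second row is deliberately inconsistent.
sample_df = pd.DataFrame(
    {
        "總留言數": [10, 4],
        "推推總數": [6, 1],
        "噓聲總數": [1, 0],
        "中立總數": [3, 2],
    }
)

# Sum the three reaction columns row by row.
sample_df["留言總數"] = sample_df[["推推總數", "噓聲總數", "中立總數"]].sum(axis=1)

# Rows where the computed total disagrees with the scraped total.
mismatch = sample_df[sample_df["留言總數"] != sample_df["總留言數"]]
print(mismatch.index.tolist())
```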

In [7]:
# Drop posts whose category is 公告 (announcement)
ptt_data = ptt_data[ptt_data["類別"] != "公告"]

## Text Content Processing

In [8]:
# Concatenate the 標題 (title) and 內容 (content) text into a new column, 所有文 (all text)
ptt_data["所有文"] = ptt_data["標題"] + ptt_data["內容"]

In [9]:
# Remove meaningless characters: first define the list of strings to strip (extend as needed)
removeword = [
    "span",
    "class",
    "f3",
    "https",
    "imgur",
    "h1",
    "_   blank",
    "href",
    "rel",
    "nofollow",
    "target",
    "cdn",
    "cgi",
    "b4",
    "jpg",
    "hl",
    "b1",
    "f5",
    "f4",
    "goo.gl",
    "f2",
    "email",
    "map",
    "f1",
    "f6",
    "__cf___",
    "data",
    "bbshtml",
    "cf",
    "f0",
    "b2",
    "b3",
    "b5",
    "b6",
    "原文內容",
    "原文連結",
    "作者標題",
    "時間",
    "看板",
    "<",
    ">",
    "，",
    "。",
    "？",
    "—",
    "閒聊",
    "・",
    "/",
    " ",
    "=",
    '"',
    "\n",
    "」",
    "「",
    "！",
    "[",
    "]",
    "：",
    "‧",
    "╦",
    "╔",
    "╗",
    "║",
    "╠",
    "╬",
    "╬",
    ":",
    "╰",
    "╩",
    "╯",
    "╭",
    "╮",
    "│",
    "╪",
    "─",
    "《",
    "》",
    ".",
    "、",
    "（",
    "）",
    "　",
    "*",
    "※",
    "~",
    "○",
    '"',
    '"',
    "～",
    "@",
    "＋",
    "\r",
    "▁",
    ")",
    "(",
    "-",
    "═",
    "?",
    ",",
    "!",
    "…",
    "&",
    ";",
    "『",
    "』",
    "#",
    "＝",
    "＃",
    "\\",
    "\\n",
    '"',
    "的",
    "^",
    "︿",
    "＠",
    "$",
    "＄",
    "%",
    "％",
    "＆",
    "＊",
    "＿",
    "+",
    "'",
    "{",
    "}",
    "｛",
    "｝",
    "|",
    "｜",
    "．",
    "‵",
    "`",
    "；",
    "●",
    "§",
    "※",
    "○",
    "△",
    "▲",
    "◎",
    "☆",
    "★",
    "◇",
    "◆",
    "□",
    "■",
    "▽",
    "▼",
    "㊣",
    "↑",
    "↓",
    "←",
    "→",
    "↖",
    "XD",
    "XDD",
    "QQ",
    "【",
    "】",
]

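# Treat every entry as a literal string: many entries ("[", "(", "*", "?", ...)
# are regex metacharacters and would fail or misbehave as patterns.
for word in removeword:
    ptt_data["所有文"] = ptt_data["所有文"].str.replace(word, "", regex=False)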
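
Looping over the list makes one pass over the column per entry, and the order matters: removing `"XD"` before `"XDD"` leaves a stray `"D"` behind. One possible alternative, sketched here on a shortened, illustrative list (`removeword_sample` is not the full list above), is to build a single escaped pattern with the longest entries first:

```python
import re

import pandas as pd

# Shortened, illustrative subset of the removal list.
removeword_sample = ["XD", "XDD", "https", "[", "]", "?"]

# Longest entries first, so "XDD" is matched before its prefix "XD".
ordered = sorted(removeword_sample, key=len, reverse=True)

# re.escape makes every entry a literal, even regex metacharacters.
pattern = re.compile("|".join(re.escape(word) for word in ordered))

texts = pd.Series(["[心得]好穿XDD", "https推薦?"])

# A compiled pattern must be used with regex=True.
cleaned_texts = texts.str.replace(pattern, "", regex=True)
print(cleaned_texts.tolist())
```

This is a design choice rather than a required change; the loop is easier to read and is fast enough for a few dozen posts.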

## Chinese Word Segmentation with jieba

In [10]:
# Segment the 所有文 column into keywords
import jieba

jieba.set_dictionary("dict/dict.txt.big")
ptt_data = ptt_data.dropna(subset=["所有文"])
ptt_data["關鍵字"] = ptt_data["所有文"].apply(lambda x: list(jieba.cut(x)))

Building prefix dict from /Volumes/Dev/nkust/nkust-homework/semester-6/marketing/01-scrapers/dict/dict.txt.big ...
Loading model from cache /var/folders/qj/62r8d09n5hn3nm_bdzf0dcpr0000gn/T/jieba.u29065df1b72c0fc2cbec4ddb88c2e368.cache
Loading model cost 0.171 seconds.
Prefix dict has been built successfully.


In [11]:
# Inspect the processed data
ptt_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, 4 to 20
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Unnamed: 0  17 non-null     int64         
 1   標題          17 non-null     object        
 2   時間          17 non-null     datetime64[ns]
 3   內容          17 non-null     object        
 4   類別          17 non-null     object        
 5   版名          17 non-null     object        
 6   文章ID        17 non-null     object        
 7   作者          17 non-null     object        
 8   IP          17 non-null     object        
 9   總留言數        17 non-null     int64         
 10  留言內容        17 non-null     object        
 11  推推總數        17 non-null     int64         
 12  噓聲總數        17 non-null     int64         
 13  中立總數        17 non-null     int64         
 14  留言總數        17 non-null     int64         
 15  所有文         17 non-null     object        
 16  關鍵字         17 non-null     object        
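
Once each post has a token list in `關鍵字`, a natural next step is to count token frequencies. A minimal sketch, using hand-written token lists in place of real jieba output:

```python
from collections import Counter

import pandas as pd

# Hand-written token lists standing in for the 關鍵字 column.
sample_tokens = pd.Series(
    [
        ["運動", "內衣", "推薦"],
        ["內衣", "尺寸"],
        ["運動", "內衣"],
    ]
)

# Flatten all token lists and count each token.
counts = Counter(token for tokens in sample_tokens for token in tokens)

# The two most frequent tokens with their counts.
top_two = counts.most_common(2)
print(top_two)
```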

## Saving the Processed Data

In [12]:
# Save as CSV
ptt_data.to_csv("output/PTT_運動內衣_onepage資料clear.csv", encoding="UTF-8-sig")

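## Processing the Comment Data

In [13]:
# Expand the comments of a single post (iloc[1] selects the second row)
comment = ptt_data["留言內容"].iloc[1]  # a str holding a list literal
comment1 = eval(comment)  # parsed into a list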
new = pd.DataFrame(comment1)
new

Unnamed: 0,type,user,content,ipdatetime
0,推,yalisa61037,美美的,02/23 23:52
1,推,wenyu66,真的都很美欸 不能再看了嗚嗚嗚 好想買,02/24 00:23
2,推,sd929598,感覺穿起來會刺刺的,02/24 01:27
3,推,mocc,*\(^o^)/*,02/24 09:23
4,推,Valentine17,樓樓上sd 大 莎露的蕾絲真的穿起來不會刺！這大概也,02/24 12:15
5,→,Valentine17,是為什麼可以賣這麼貴的原因XD,02/24 12:15
6,推,r40296,有了第一套就會有第三套第四套～,02/24 12:42
7,推,pchome0503,符合我要的尺寸 又是華麗的龍袍內衣～,02/24 13:13
8,→,hjkl369,莎露穿起來真的完全不刺，非常親膚，再加上版型很適合我,02/24 15:51
9,→,hjkl369,，所以覺得錢花得值得,02/24 15:51
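
`eval` executes whatever expression the string contains, which is risky for text scraped from the web. A hedged alternative is `ast.literal_eval`, which only accepts Python literals. The sketch assumes each `留言內容` value is a plain list-of-dicts literal, as the output above suggests; the sample string is made up:

```python
import ast

import pandas as pd

# Made-up string in the same shape as a 留言內容 value.
comment_str = (
    "[{'type': '推', 'user': 'user_a', "
    "'content': 'hello', 'ipdatetime': '02/23 23:52'}]"
)

# literal_eval parses literals only and raises ValueError on anything else.
comment_list = ast.literal_eval(comment_str)

comment_df = pd.DataFrame(comment_list)
print(comment_df.columns.tolist())
```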


In [14]:
# Expand the comments of every post
newlist = []
for i in range(len(ptt_data)):
    new = pd.DataFrame(eval(ptt_data["留言內容"].iloc[i]))
    newlist.append(new)

In [15]:
# Concatenate all comments into one DataFrame
ptt_comment = pd.concat(newlist)
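
A plain `pd.concat(newlist)` loses track of which post each comment came from. One way to keep that link, sketched with hypothetical article IDs and comment frames, is to pass a dict keyed by `文章ID` and turn the key level into a column:

```python
import pandas as pd

# Hypothetical article IDs mapped to their comment frames.
frames = {
    "M.0000000001.A.AAA": pd.DataFrame(
        {"user": ["user_a", "user_b"], "content": ["c1", "c2"]}
    ),
    "M.0000000002.A.BBB": pd.DataFrame({"user": ["user_c"], "content": ["c3"]}),
}

# The dict keys become the outer index level, named 文章ID here;
# reset_index then moves that level into an ordinary column.
combined = pd.concat(frames, names=["文章ID", "row"]).reset_index(level="文章ID")
print(combined.columns.tolist())
```

In the notebook this would mean collecting the frames in a dict keyed by `ptt_data["文章ID"]` instead of appending them to `newlist`.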

In [16]:
# Save to CSV
ptt_comment.to_csv("output/PTT_運動內衣_onepage資料留言內容.csv", encoding="UTF-8-sig")