# Document and Term Analysis
**Author:** 李畇彤<br>
**Date created:** 2023/04/12<br>
**Last modified:** 2023/04/12<br>

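## Outline
1. Packages
2. Data preprocessing
    - Data cleaning
    - Tokenizing and tidying the articles
3. TF-IDF
    - 3.1 Count the words in each article
    - 3.2 Compute tf-idf values
    - 3.3 Inspect the results
4. Building a dictionary with jieba tokenization and N-grams
    - 4.1 Bigram
    - 4.2 Trigram
5. Using the custom dictionary
6. Building an N-gram prediction model
7. Bigram visualization
8. Pairwise correlation
    - 8.1 Find highly correlated words
    - 8.2 Draw the relationship graph
9. Computing article similarity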

## 1. Packages

In [None]:
# pip install -U networkx
# pip install nltk
# pip install -U scikit-learn scipy matplotlib



In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import re
import jieba
import jieba.analyse
import math
from nltk import ngrams
from nltk import FreqDist
from collections import Counter, namedtuple
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from numpy.linalg import norm

In [13]:
plt.rcParams['font.sans-serif']=['SimHei'] # render Chinese text in figures correctly
plt.rcParams['axes.unicode_minus']=False # render minus signs correctly

## 2. Data Preprocessing

In [14]:
df = pd.read_csv("data/Tech_Job_OriginalData.csv")
df.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt
1,2,https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A...,[新聞]【掙扎片】年薪百萬的科技人　卻因這幾點,2018-01-02 00:28:49,Angels5566,Tech_Job,有網友在mobile01分享，自己在科技業已工作9年，年薪約百萬，但最近老婆還是希望他\n去...,"[{""cmtStatus"": ""→"", ""cmtPoster"": ""latin0126"", ...",113.196.174.254,2018-04-27 12:51:21,ptt
2,3,https://www.ptt.cc/bbs/Tech_Job/M.1515382875.A...,[徵才]高雄昇雷科技股份有限公司誠徵工程師,2018-01-08 03:35:11,qqgreenmoon,Tech_Job,【公司名稱】\n昇雷科技股份有限公司\n\n【工作職缺】\n1、硬體工程師\n2、系統工程師...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""ohmypig"", ""c...",1.173.60.158,2018-04-27 12:51:21,ptt


### Removing special characters and punctuation

In [15]:
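# Remove URLs (everything from the scheme to the end of the line)
df['sentence'] = df.artContent.apply(lambda x: re.sub(r'(http|https)://.*', '', x))
# Keep only Chinese characters; operate on `sentence` so the URL removal above is not overwritten
df['sentence'] = df.sentence.apply(lambda x: re.sub(r'[^\u4e00-\u9fa5]+', '', x))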
df.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource,sentence
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...
1,2,https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A...,[新聞]【掙扎片】年薪百萬的科技人　卻因這幾點,2018-01-02 00:28:49,Angels5566,Tech_Job,有網友在mobile01分享，自己在科技業已工作9年，年薪約百萬，但最近老婆還是希望他\n去...,"[{""cmtStatus"": ""→"", ""cmtPoster"": ""latin0126"", ...",113.196.174.254,2018-04-27 12:51:21,ptt,有網友在分享自己在科技業已工作年年薪約百萬但最近老婆還是希望他去考公職但自己卻感到苦惱不知是...
2,3,https://www.ptt.cc/bbs/Tech_Job/M.1515382875.A...,[徵才]高雄昇雷科技股份有限公司誠徵工程師,2018-01-08 03:35:11,qqgreenmoon,Tech_Job,【公司名稱】\n昇雷科技股份有限公司\n\n【工作職缺】\n1、硬體工程師\n2、系統工程師...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""ohmypig"", ""c...",1.173.60.158,2018-04-27 12:51:21,ptt,公司名稱昇雷科技股份有限公司工作職缺硬體工程師系統工程師天線工程師軟體工程師工作內容設計硬體...
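
The two cleaning steps can be tried on a single string without loading the dataset. The sketch below is only an illustration: `clean_text` and the sample string are hypothetical and not part of the notebook, and it assumes the same two patterns as the cell above.

```python
import re

def clean_text(text: str) -> str:
    """Drop URLs (through the end of their line), then keep only Chinese characters."""
    # '.' does not match a newline, so only the rest of the URL's line is removed
    no_url = re.sub(r'(http|https)://.*', '', text)
    # Remove every run of characters outside the \u4e00-\u9fa5 range
    return re.sub(r'[^\u4e00-\u9fa5]+', '', no_url)

# Hypothetical sample: a URL, digits, Latin letters, and full-width punctuation
sample = "面試心得 https://example.com/post\n年薪100萬，offer已拿到！"
cleaned = clean_text(sample)
print(cleaned)
```

Digits and Latin letters are dropped along with punctuation, which is why `mobile01` disappears from the second article in the output above.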


### Tokenizing and tidying the articles

In [16]:
# Use the Traditional Chinese dictionary
jieba.set_dictionary('./dict/dict.txt.big')

# Load stop words
# jieba.analyse.set_stop_words('./dict/stop_words.txt') # only affects jieba.analyse.extract_tags
with open('./dict/stopwords.txt',encoding="utf-8") as f:
    stopWords = [line.strip() for line in f.readlines()]

In [17]:
# Define the tokenization function
def getToken(row):
    seg_list = jieba.lcut(row)
    seg_list = [w for w in seg_list if w not in stopWords and len(w)>1] # drop stop words and single-character tokens
    return seg_list
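
To see what the filter keeps, here is a minimal sketch that does not need jieba. The `tokens` list stands in for segmenter output and `stop_words` is a made-up list; both are assumptions for illustration, as is the helper name `filter_tokens`.

```python
def filter_tokens(tokens, stop_words):
    """Keep tokens that are not stop words and are longer than one character."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    return [w for w in tokens if w not in stop_set and len(w) > 1]

# Hypothetical segmenter output and stop-word list
tokens = ["各位", "年薪", "的", "新年快樂", "是", "小弟"]
stop_words = ["各位", "的"]

kept = filter_tokens(tokens, stop_words)
print(kept)
```

The notebook keeps `stopWords` as a list; converting it to a set, as done here, is one possible speed-up when the stop-word file is large.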

In [18]:
data = df.copy()
data['word'] = data.sentence.apply(getToken)

# Expand the word column so that each row holds a single token
data = data.explode('word')

data.head(3)

Building prefix dict from d:\Programs\Python\NSYSU\MIS581_SocialMediaAnalysis\dict\dict.txt.big ...
Dumping model to file cache C:\Users\s2568\AppData\Local\Temp\jieba.uf1d59ac34902ab31556946386c5e7328.cache
Loading model cost 2.108 seconds.
Prefix dict has been built successfully.


Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource,sentence,word
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,年薪
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,新年快樂
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,小弟


## 3. TF-IDF
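TF-IDF is a statistical measure of how important a word is to a document within a collection of documents.  
- **TF** (Term Frequency): how often a term appears in a given document  
    - TF = occurrences of the term in the document / total number of terms in the document  
- **IDF** (Inverse Document Frequency): the logarithm of the total number of documents divided by the number of documents that mention the term (the code in section 3.2 uses a base-10 logarithm)  
    - IDF = log( total number of documents / number of documents containing the term ) 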
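
The two formulas can be checked by hand on a toy corpus. The sketch below assumes a base-10 logarithm and three made-up documents; the `docs` dictionary and the `tf_idf` helper are illustrative names, not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical corpus: document id -> list of tokens
docs = {
    "doc1": ["面試", "主管", "面試", "薪水"],
    "doc2": ["面試", "裁員"],
    "doc3": ["薪水", "裁員", "裁員", "公職"],
}

def tf_idf(docs):
    """Return {(doc_id, word): tf * idf} using a base-10 logarithm."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        total = len(tokens)
        for word, count in Counter(tokens).items():
            tf = count / total
            idf = math.log10(n_docs / doc_freq[word])
            scores[(doc_id, word)] = tf * idf
    return scores

scores = tf_idf(docs)
for key, value in sorted(scores.items()):
    print(key, round(value, 4))
```

In `doc1`, `面試` is the most frequent word but appears in two of the three documents, so it scores lower than `主管`, which appears only in `doc1`. A word present in every document would get an IDF of 0.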

### 3.1 Total word count per article and per-word counts

In [19]:
# Total number of words in each article
total_words = data.groupby(['artUrl'],as_index=False).size()
total_words.rename(columns={'size': 'total'}, inplace=True)
total_words

Unnamed: 0,artUrl,total
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,272
1,https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A...,166
2,https://www.ptt.cc/bbs/Tech_Job/M.1515382875.A...,199
3,https://www.ptt.cc/bbs/Tech_Job/M.1515470624.A...,153
4,https://www.ptt.cc/bbs/Tech_Job/M.1515484201.A...,28
...,...,...
1683,https://www.ptt.cc/bbs/Tech_Job/M.1672102664.A...,306
1684,https://www.ptt.cc/bbs/Tech_Job/M.1672146819.A...,186
1685,https://www.ptt.cc/bbs/Tech_Job/M.1672167246.A...,120
1686,https://www.ptt.cc/bbs/Tech_Job/M.1672305717.A...,242


In [20]:
# Number of times each word appears in each article
word_count = data.groupby(['artUrl','word'],as_index=False).size()
word_count.rename(columns={'size': 'count'}, inplace=True)
word_count

Unnamed: 0,artUrl,word,count
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1
...,...,...,...
237242,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,類別,1
237243,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,顯示,1
237244,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,風險,1
237245,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,高層,2


#### Merge the columns we need
- Merge **the count of each word in each article** with **the total word count of each article**

In [21]:
job_words = word_count.merge(total_words,on = 'artUrl',how = 'left')
job_words

Unnamed: 0,artUrl,word,count,total
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1,272
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1,272
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1,272
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2,272
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1,272
...,...,...,...,...
237242,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,類別,1,202
237243,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,顯示,1,202
237244,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,風險,1,202
237245,https://www.ptt.cc/bbs/Tech_Job/M.1672493477.A...,高層,2,202


### 3.2 Compute tf-idf values
- Compute the tf-idf value of each word, treating each article as one document  
    - tf-idf = tf * idf

In [22]:
# Compute tf: occurrences of the word in the article / total words in the article
job_words_tf_idf = job_words.assign(tf = job_words['count']/job_words['total'])
job_words_tf_idf.head()

Unnamed: 0,artUrl,word,count,total,tf
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1,272,0.003676
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1,272,0.003676
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1,272,0.003676
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2,272,0.007353
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1,272,0.003676


In [23]:
# Count how many articles each word appears in
idf_df = job_words.groupby(['word'],as_index=False).size()
job_words_tf_idf = job_words_tf_idf.merge(idf_df,on = 'word',how = 'left')
job_words_tf_idf.head()

Unnamed: 0,artUrl,word,count,total,tf,size
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1,272,0.003676,179
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1,272,0.003676,40
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1,272,0.003676,96
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2,272,0.007353,3
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1,272,0.003676,87


In [24]:
# Compute idf: base-10 log of (number of articles / number of articles containing the word)
job_words_tf_idf = job_words_tf_idf.assign(idf = job_words_tf_idf['size']
                                               .apply(lambda x: math.log((len(total_words)/x),10)))

job_words_tf_idf = job_words_tf_idf.drop(labels=['size'],axis=1)
job_words_tf_idf.head()

Unnamed: 0,artUrl,word,count,total,tf,idf
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1,272,0.003676,0.974519
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1,272,0.003676,1.625312
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1,272,0.003676,1.245101
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2,272,0.007353,2.750251
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1,272,0.003676,1.287853


In [25]:
# Compute tf * idf
job_words_tf_idf = job_words_tf_idf.assign(tf_idf = job_words_tf_idf['tf'] * job_words_tf_idf['idf'])
job_words_tf_idf.head()

Unnamed: 0,artUrl,word,count,total,tf,idf,tf_idf
0,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一次,1,272,0.003676,0.974519,0.003583
1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一線,1,272,0.003676,1.625312,0.005975
2,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一間,1,272,0.003676,1.245101,0.004578
3,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,一階,2,272,0.007353,2.750251,0.020222
4,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,三年,1,272,0.003676,1.287853,0.004735


### 3.3 Inspect the results

In [26]:
# For each article, select the five words with the highest tf-idf
group = job_words_tf_idf.groupby("artUrl").apply(lambda x : x.nlargest(5, "tf_idf"))

In [27]:
group.loc[:,["word","tf_idf"]][0:15]

Unnamed: 0_level_0,Unnamed: 1_level_0,word,tf_idf
artUrl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A.569.html,94,瑞鼎,0.059327
https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A.569.html,151,類比,0.047387
https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A.569.html,64,應為,0.043034
https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A.569.html,118,聘書,0.031591
https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A.569.html,135,責任制,0.030166
https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A.240.html,170,公職,0.147597
https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A.240.html,201,幾萬,0.044263
https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A.240.html,176,初考,0.038884
https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A.240.html,169,公務員,0.035216
https://www.ptt.cc/bbs/Tech_Job/M.1514853292.A.240.html,198,年薪,0.035069


#### Words that most often rank high by tf-idf across the corpus

In [28]:
# From each article, pick the ten words with the highest tf-idf
(job_words_tf_idf.groupby("artUrl").apply(lambda x : x.nlargest(10, "tf_idf")).reset_index(drop=True)
# Count how many times each word was picked
.groupby(['word'],as_index=False).size()
).sort_values('size', ascending=False).head(10) # sort and show the top ten

Unnamed: 0,word,size
9636,面試,115
598,主管,54
2509,員工,41
7433,網友,40
519,中國,30
8145,裁員,29
7902,英文,28
2069,半導體,28
7547,美國,28
9798,馬斯克,26
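
The pick-then-count logic above can be sketched in plain Python on a few rows. The `rows` data and the `top_words_per_doc` helper are made up for illustration; the notebook itself does this with pandas `groupby` and `nlargest`.

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical (doc_id, word, tf_idf) rows
rows = [
    ("a", "面試", 0.30), ("a", "主管", 0.20), ("a", "公司", 0.05),
    ("b", "面試", 0.25), ("b", "裁員", 0.40), ("b", "公司", 0.01),
    ("c", "裁員", 0.10), ("c", "面試", 0.35), ("c", "英文", 0.05),
]

def top_words_per_doc(rows, k):
    """Return {doc_id: [the k words with the highest tf-idf, best first]}."""
    by_doc = defaultdict(list)
    for doc_id, word, score in rows:
        by_doc[doc_id].append((score, word))
    return {doc: [w for _, w in heapq.nlargest(k, pairs)]
            for doc, pairs in by_doc.items()}

top2 = top_words_per_doc(rows, 2)
# Count how many documents selected each word
selected = Counter(w for words in top2.values() for w in words)
print(selected.most_common())
```

A word that ranks near the top in many documents, such as `面試` here, ends up with the highest selection count even if it is never the single best word in some of them.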


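## 4. Building a dictionary with jieba tokenization and N-grams
An N-gram is a sequence of n consecutive words in a text. N-grams reveal which words frequently appear together, which helps decide whether a combination should be added to the custom dictionary.  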
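
As a minimal sketch of the idea, the snippet below builds bigrams and trigrams from an already tokenized list with plain slicing, then counts them. The token lists are hypothetical and `make_ngrams` is an illustrative helper; the notebook uses `nltk.ngrams` for the same step.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list for one article
tokens = ["工作", "內容", "應徵", "條件"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)

# Counting across articles shows which combinations recur
counts = Counter(bigrams + make_ngrams(["介紹", "工作", "內容"], 2))
print(counts.most_common(1))
```

A list shorter than `n` yields no n-grams, so very short articles simply contribute nothing to the counts.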

### 4.1 Bigram

In [29]:
# Define the bigram tokenization function
def bigram_getToken(row):
    seg_list = jieba.lcut(row)
    seg_list = [w for w in seg_list if w not in stopWords and len(w)>1]
    seg_list = ngrams(seg_list, 2)
    seg_list = [" ".join(w) for w in list(seg_list)]
    return seg_list

In [30]:
job_bigram = df.copy()

job_bigram["word"] = job_bigram.sentence.apply(bigram_getToken)
job_bigram = job_bigram.explode('word')
job_bigram.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource,sentence,word
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,年薪 新年快樂
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,新年快樂 小弟
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,小弟 發文


#### Most frequent bigram combinations

In [31]:
# Count how often each combination appears
job_bigram_count = job_bigram.groupby(["word"],as_index=False).size()
job_bigram_count.head()

Unnamed: 0,word,size
0,一一 一一,3
1,一一 人力,3
2,一一 分享,1
3,一一 分析,1
4,一一 列出,1


In [37]:
# Drop bigram combinations that contain Latin letters or digits
jb_filter = list(job_bigram_count["word"].apply(lambda x: re.search("[0-9a-zA-Z]", x) is None))
job_bigram_count[jb_filter].sort_values(by=['size'], ascending=False).head(30)

Unnamed: 0,word,size
90788,工作 內容,410
19242,人力 銀行,301
91697,工作 經驗,176
255237,面試 過程,169
91436,工作 機會,147
254084,面試 主管,146
180410,科技 公司,131
149911,比特 大陸,114
204677,英文 履歷,114
14611,主管 面試,110


### 4.2 Trigram

In [33]:
# Define the trigram tokenization function
def trigram_getToken(row):
    seg_list = jieba.lcut(row)
    seg_list = [w for w in seg_list if w not in stopWords and len(w)>1]
    seg_list = ngrams(seg_list, 3)
    seg_list = [" ".join(w) for w in list(seg_list)]
    return seg_list

In [35]:
job_trigram = df.copy()

job_trigram["word"] = job_trigram.sentence.apply(trigram_getToken)
job_trigram = job_trigram.explode('word')
job_trigram.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource,sentence,word
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,年薪 新年快樂 小弟
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,新年快樂 小弟 發文
0,1,https://www.ptt.cc/bbs/Tech_Job/M.1514831112.A...,[請益]研替offer（類比科/力智/鈺創/瑞鼎）,2018-01-01 18:19:10,elohaxup6xl3,Tech_Job,各位年薪千萬的大大好，新年快樂。\n小弟是第一次發文的新鮮人\n目前研替面試一個段落\n拿到...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""hsucheng"", ""...",223.141.230.104,2018-04-27 12:51:21,ptt,各位年薪千萬的大大好新年快樂小弟是第一次發文的新鮮人目前研替面試一個段落拿到以下幾家豬屎屋的...,小弟 發文 新鮮


#### Most frequent trigram combinations

In [39]:
# Count how often each combination appears
job_trigram_count = job_trigram.groupby(["word"],as_index=False).size()
# Drop trigram combinations that contain Latin letters or digits
jb_filter = list(job_trigram_count["word"].apply(lambda x: re.search("[0-9a-zA-Z]", x) is None))
job_trigram_count[jb_filter].sort_values(by=['size'], ascending=False).head(30)

Unnamed: 0,word,size
23710,人力 銀行 調查,33
238387,網站 投遞 履歷,31
187521,求職網 發言人 楊宗斌,31
59284,勞動部 勞動力 發展署,31
113121,工作 內容 應徵,28
253920,英文 履歷 格式,28
41026,內容 應徵 條件,27
27931,介紹 工作 內容,26
214303,直接 網站 投遞,26
69556,台幣 月薪 台幣,26
