## Lesson 13 - Extract abstracts and keywords from Chinese text



### Table of Contents
* [FastTextRank](#FastTextRank)
* [Large-scale content keyword (tag) extraction](#Large-scaling-content-keyword)



<a id="FastTextRank"></a>
# FastTextRank
FastTextRank extracts abstracts and keywords from Chinese text. It uses an *optimized iterative algorithm* to improve running **speed** and can *selectively use word vectors* to improve **accuracy**.
## PageRank
PageRank is Google's algorithm for ranking web pages.<br/>
PageRank was originally used to calculate the importance of web pages. The entire World Wide Web can be seen as a directed graph in which each node is a web page.<br/>
The algorithm calculates the importance of every node from its connections to other nodes.<br/>
* My algorithm changes the iteration scheme to run much faster: it takes about 10 ms per article, whereas TextRank4ZH takes about 80 ms on my data.<br/>
* My algorithm can also use word2vec to make the abstract more accurate, at the cost of a longer running time: with word2vec it takes about 40 ms per article on the same data.
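
To make the ranking idea concrete, here is a minimal sketch of plain PageRank by power iteration. It is not the optimized iteration used by FastTextRank; the function name `pagerank`, the damping factor of 0.85, and the toy graph `web` are assumptions chosen for illustration. The `max_iter` and `tol` arguments play the same role as the parameters of the same name described later in this lesson.

```python
def pagerank(graph, damping=0.85, max_iter=100, tol=1e-4):
    """Score the nodes of a directed graph given as {node: [outgoing neighbours]}.

    Every neighbour must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Every node starts each round with the "random jump" share.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's score evenly among the pages it links to.
                share = damping * scores[node] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A node without outgoing links spreads its score over all nodes.
                share = damping * scores[node] / n
                for target in nodes:
                    new_scores[target] += share
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tol:  # stop once the scores have stabilised
            break
    return scores


# Hypothetical four-page web: "c" is linked to by every other page.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph the page `c` collects links from all other pages and ends up ranked first, while `d`, which nobody links to, keeps only the random-jump share.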

## FastTextRank4Sentence
### Introduction
1. Split the article into sentences.
2. Calculate the similarity between sentences, using either:
   * the cosine similarity of word vectors, or
   * the words the two sentences have in common.
3. Build a graph from the sentence similarities.
4. Calculate the importance of each sentence with the improved iterative algorithm.
5. Select the top-ranked sentences as the abstract.
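
The steps above can be sketched in a few lines for the common-words variant. This is an illustration only: the similarity formula is the one from the classic TextRank paper (shared words divided by the sum of the log sentence lengths), which may differ from what the package computes, and the ranking loop is an ordinary weighted PageRank rather than the package's improved iteration. The sentences are assumed to be already segmented into word lists, and all function names are hypothetical.

```python
import math


def common_word_similarity(words_a, words_b):
    """Classic TextRank similarity: shared words over the log sentence lengths."""
    if not words_a or not words_b:
        return 0.0
    common = len(set(words_a) & set(words_b))
    denom = math.log(len(words_a)) + math.log(len(words_b))
    if common == 0 or denom <= 0:
        return 0.0
    return common / denom


def build_similarity_graph(sentences):
    """Return a symmetric matrix of pairwise sentence similarities."""
    n = len(sentences)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = common_word_similarity(sentences[i], sentences[j])
            graph[i][j] = graph[j][i] = weight
    return graph


def rank_nodes(graph, damping=0.85, max_iter=100, tol=1e-4):
    """Weighted PageRank over a similarity matrix."""
    n = len(graph)
    out_weight = [sum(row) for row in graph]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(graph[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1.0 - damping) / n + damping * incoming)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy input: four sentences that are already cut into words.
sentences = [
    ["外送員", "可愛", "女大生"],
    ["女大生", "外送員", "電話", "LINE"],
    ["網友", "留言", "熊貓"],
    ["外送員", "電話", "對講機"],
]
graph = build_similarity_graph(sentences)
scores = rank_nodes(graph)
# Indices of the two highest-ranked sentences form the abstract.
top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2]
print(top)
```

The second sentence shares words with two others, so it receives the highest score; the third shares nothing and stays at the bottom.
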
### API
* use_stopword: boolean, default True
* stop_words_file: str, default None.
The stop words file you want to use. If it is None, the package's built-in stop words are used.
* use_w2v: boolean, default False
If it is True, you must also pass the dict_path parameter.
* dict_path: str, default None.
Path to the word-vector file; required when use_w2v is True.
* max_iter: maximum number of iterations
* tol: maximum tolerated error
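
With `use_w2v` enabled, sentence similarity is based on the cosine similarity of word vectors. The sketch below shows one way this can work, assuming each sentence is represented by the average of its word vectors; that averaging step is an assumption, and the package may combine vectors differently. The `toy_vectors` dictionary is made up for illustration and stands in for the vectors loaded from `dict_path`.

```python
import math


def sentence_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional word vectors (real models use hundreds of dimensions).
toy_vectors = {
    "外送員": [1.0, 0.0, 0.0],
    "熊貓": [0.9, 0.1, 0.0],
    "電話": [0.0, 1.0, 0.0],
    "LINE": [0.0, 0.9, 0.1],
}
s1 = sentence_vector(["外送員", "熊貓"], toy_vectors)
s2 = sentence_vector(["電話", "LINE"], toy_vectors)
s3 = sentence_vector(["熊貓", "外送員", "未知詞"], toy_vectors)  # unknown word is skipped

print(cosine_similarity(s1, s2))  # low: the sentences are about different things
print(cosine_similarity(s1, s3))  # high: same known words
```

Unlike the common-words measure, this can give a non-zero similarity to sentences that share no words at all, which is where the accuracy gain comes from.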

## FastTextRank4Word

### Introduction
1. Split the article into words.
2. Calculate the similarity between words:
   if two words occur within the window distance of each other, add 1.0 to the weight of the graph edge between them. The window size is set by the user.
3. Build a graph from the word similarities.
4. Calculate the importance of each word with the improved iterative algorithm.
5. Select the top-ranked words as the keywords.
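
Step 2 can be sketched as follows. The code assumes that "within the window distance" means the two words appear inside the same run of `window` consecutive words; the package may measure the distance differently. The function name `build_cooccurrence_graph` and the word list are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Add 1.0 to the edge between two different words each time they co-occur."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        # Look ahead at the words that share a window with position i.
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other == word:
                continue  # no self-loops
            graph[word][other] += 1.0
            graph[other][word] += 1.0
    return {w: dict(neighbours) for w, neighbours in graph.items()}


# Toy input: an article that has already been cut into words.
words = ["熊貓", "外送員", "可愛", "外送員", "電話"]
graph = build_cooccurrence_graph(words, window=2)
print(graph["外送員"])

wider = build_cooccurrence_graph(words, window=3)
print(wider["熊貓"])
```

A larger window links words that are farther apart, which makes the graph denser and the ranking less dependent on immediate neighbours.
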

### API
* use_stopword: boolean, default True
* stop_words_file: str, default None.
The stop words file you want to use. If it is None, the package's built-in stop words are used.
* max_iter: maximum number of iterations
* tol: maximum tolerated error
* window: int, default 2
The window size used to decide whether two words are related.

### Extract keywords from an article (automatic tagging)

In [1]:
from src.FastTextRank.FastTextRank4Word import FastTextRank4Word
import codecs
import datetime
import types

Building prefix dict from D:\Programming\Python\課程教學\David Python 教育訓練\dict\dict.big.txt ...
Loading model from cache C:\Users\princ\AppData\Local\Temp\jieba.u9fe11b2227c896c5bbad65bb17730314.cache
Loading model cost 1.411 seconds.
Prefix dict has been built succesfully.


In [2]:
content_string = '''外送平台正夯，日前一位熊貓外送員騎乘重型機車外送爆紅，最近也有一位熊貓外送員因為太可愛，被客人加LINE後滿臉問號。
一位世新大學女學生在社群網站《Dcard》發文「嚇死熊貓外送員」，指出她透過foodpanda點宵夜，外送員抵達後發現對方是位可愛男
讓她被電到內心吶喊「天啊！太可愛了吧！」女大生拿完餐點後隨即跟室友說「外送員超可愛的啦」馬上讓宿舍內暴動。
豈料，說完3秒後手機響起，就是那位可愛外送員打來，室友見狀紛紛搶電話說「我要接！我要接！」
沒想到接起電話後可愛外送員說「那個你們的對講機是不是沒關好，因為在樓下的對講機都可以聽到你們講話的聲音。」讓女大生心生尷尬。
▲▼校園午餐外送潮。（圖／記者呂佳賢攝）
▲美食外送平台式許多學生的最愛。（圖／記者呂佳賢攝）
後來女大生把可愛外送員電話存進手機，結果LINE也跟著自動加入，3分鐘後女大生的LINE響起，竟然是可愛外送員傳了「？」圖示來
讓女大生嚇到直接刪掉。「我真的是超尷尬超丟臉...很想挖洞鑽」。
對此外送經驗，網友留言說「下一篇：我的男友是熊貓」、「過幾天...半夜肚子餓叫熊貓竟然脫魯了！！」
「閃光get？？？」、「要是男女互換，就是變態了呢」。
'''

In [3]:
mod = FastTextRank4Word(tol=0.0001, window=2)
print('關鍵詞彙:')  # "Keywords:"
old_time = datetime.datetime.now()
po = mod.summarize(content_string, 10)  # extract the top 10 keywords
print(po)
print(datetime.datetime.now() - old_time)  # elapsed time

關鍵詞彙:
['外送員', '大生', '男友', 'LINE', '美食', '是不是', 'Dcard', '日前', '外送潮', '午餐']
0:00:00.218737


### Extract key sentences from an article as a summary (Sentence)

In [4]:
from src.FastTextRank.FastTextRank4Sentence import FastTextRank4Sentence
import codecs
import datetime

In [5]:
# Without Word2Vec
mod = FastTextRank4Sentence(use_w2v=False, tol=0.0001)
print('摘要:')  # "Summary:"
old_time = datetime.datetime.now()
po = mod.summarize(content_string, 4)  # extract the 4 top-ranked sentences
print(po)
print(datetime.datetime.now() - old_time)  # elapsed time

摘要:
['（圖／記者呂佳賢攝）後來女大生把可愛外送員電話存進手機，結果LINE也跟著自動加入，3分鐘後女大生的LINE響起，竟然是可愛外送員傳了「？', '一位世新大學女學生在社群網站《Dcard》發文「嚇死熊貓外送員」，指出她透過foodpanda點宵夜，外送員抵達後發現對方是位可愛男，讓她被電到內心吶喊「天啊！', '」沒想到接起電話後可愛外送員說「那個你們的對講機是不是沒關好，因為在樓下的對講機都可以聽到你們講話的聲音。', '對此外送經驗，網友留言說「下一篇：我的男友是熊貓」、「過幾天...半夜肚子餓叫熊貓竟然脫魯了！']
0:00:00.013392


In [6]:
# With Word2Vec
mod = FastTextRank4Sentence(use_w2v=True, dict_path='model/Word2Vec/word2vec_news.bin', tol=0.0001)
print('摘要:')  # "Summary:"
old_time = datetime.datetime.now()
po = mod.summarize(content_string, 4)  # extract the 4 top-ranked sentences
print(po)
print(datetime.datetime.now() - old_time)  # elapsed time

摘要:
['」、「要是男女互換，就是變態了呢」。', '豈料，說完3秒後手機響起，就是那位可愛外送員打來，室友見狀紛紛搶電話說「我要接！', '一位世新大學女學生在社群網站《Dcard》發文「嚇死熊貓外送員」，指出她透過foodpanda點宵夜，外送員抵達後發現對方是位可愛男，讓她被電到內心吶喊「天啊！', '對此外送經驗，網友留言說「下一篇：我的男友是熊貓」、「過幾天...半夜肚子餓叫熊貓竟然脫魯了！']
0:00:00.014879


<a id="Large-scaling-content-keyword"></a>
## Large-scale content keyword (tag) extraction

In [70]:
import pandas as pd
import codecs
import datetime
from src.FastTextRank.FastTextRank4Word import FastTextRank4Word

input_file_name = "data/document/UnsupervisedTextData.txt"
# Read one document per line; the with block closes the file automatically.
with open(input_file_name, 'r', encoding="utf8") as file:
    c_list = file.read().splitlines()

df_ = pd.DataFrame(c_list, columns=['content'])
df_.head()

| | content |
|---|---|
| 0 | 美股費半及道瓊走高激勵下，周五台指期夜盤同步上攻至 10455 點。台股受美股走高影響加權指... |
| 1 | 這天是忙碌的一天 同一天有兩場記者會發表 另一是長期的友愛品牌RMK Taiwan #RMK... |
| 2 | 亞幣走強外資偏匯入 新台幣升5.6分（中央社記者潘姿羽台北18日電）隨著韓元走強，新台幣兌美... |
| 3 | 雅詩蘭黛精選賣爆唇色推出楊冪限定款唷唷唷 想跟楊冪一樣渾身飄仙氣從這幾色下手準沒錯 不過唇彩... |


In [73]:
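import time
# multiprocessing.dummy exposes the multiprocessing API but runs on threads, not processes
import multiprocessing.dummy as multiprocessing

cpu_count = 4  # number of worker threads in the pool

def MultiFastTextRank(x):
    # Extract the top 10 keywords from one document
    mod = FastTextRank4Word(tol=0.0001, window=2)
    po = mod.summarize(x, 10)
    return po

t1 = time.time()
with multiprocessing.Pool(processes=cpu_count) as pool:
    res_list = pool.map(MultiFastTextRank, df_['content'])
t2 = time.time()
tm_cost = t2 - t1
print('time cost: {}'.format(tm_cost))
# Note: str.len() counts characters, not bytes, so this figure is really characters per second
print('speed %s bytes/second' % (sum(df_['content'].str.len())/tm_cost))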

time cost: 0.6407473087310791
speed 2038.2450026772203 bytes/second
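
The speed printed above is computed from `str.len()`, which counts characters. For Chinese text the difference matters, because most Chinese characters take three bytes in UTF-8. A small sketch that reports both figures, with a hypothetical helper name `throughput` and made-up sample strings:

```python
def throughput(texts, elapsed_seconds):
    """Return (characters per second, UTF-8 bytes per second)."""
    chars = sum(len(text) for text in texts)
    utf8_bytes = sum(len(text.encode("utf-8")) for text in texts)
    return chars / elapsed_seconds, utf8_bytes / elapsed_seconds


# Two short sample documents and an assumed processing time of 2 seconds.
docs = ["美股走高", "RMK Taiwan"]
chars_per_s, bytes_per_s = throughput(docs, elapsed_seconds=2.0)
print(chars_per_s, bytes_per_s)
```

The four Chinese characters count as 4 characters but 12 bytes, while the ASCII string counts as 10 in both measures.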


In [74]:
res_list

[['走高', '今日', '回測', '加上', '波幅', '上攻', '道瓊', '連末', '指期', '夜盤'],
 ['RMK', '美麗', '事情', '味道', '珠光', '疑惑', '外出', '眼影', '染色', '唇部'],
 ['升值', '表示', '可能', '今天', '亞幣', '使得', '台北', '推升', '央行', '控管'],
 ['楊冪', '效果', '神器', '搭配', '打底', '推出', '唇部', '唇膏', '乾燥', '嘴唇']]

### Homework
- If the content string contains too few words, how does the extraction perform?
- Load the Google Play comments and try to extract keywords from them:
  "data/sentimental/googleplay/google_play_15_500.xlsx"
- How can the classification of question-type comments be enhanced?

In [84]:
# Hint
import pandas as pd
import codecs
import datetime
from src.FastTextRank.FastTextRank4Word import FastTextRank4Word

# Load the Google Play comments and rename the text column to match the earlier cells
df_ = pd.read_excel("data/sentimental/googleplay/google_play_15_500.xlsx")
df_ = df_.rename(columns={'comments': 'content'})
df_.head()

| | content | ratings |
|---|---|---|
| 0 | 越做越爛，廣告一直入，又不是沒儲值，還強制觀看廣告，操，爛公司爛遊戲，在不更改強制性播放廣... | 1 |
| 1 | 手機是三星note5。要我輸入配對金鑰結果數字鍵只有到6，6以後都不見了剛好我就差一個8。... | 1 |
| 2 | 幹拎涼咧～從戰鬥力23萬~43萬，一路升到77萬（77萬夠誇張了吧！），打PVP一樣是那個... | 1 |
| 3 | 儲值310元，戰爭基金8000金幣沒給我，第二次了，都沒給。都沒人在管理，差，給負評 | 1 |
| 4 | 已經更新至最新版本 為什麼功能都沒顯示 運費卷全憑空消失 請處理 | 1 |


In [80]:
sum(df_['content'].str.len())

34145