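## Today's lesson: text preprocessing
* Some of this material may have appeared in earlier chapters; here we tie together the techniques that preprocessing requires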

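## Preprocessing steps in order
1. Import packages
2. Load the data
3. Remove unwanted characters with `re.sub()` and convert to lowercase with `str.lower()`
4. Tokenize into words and sentences: `nltk.word_tokenize()` for English, `jieba.cut()` for Chinese
5. Remove stopwords: the word list comes from `nltk.download('stopwords')`
6. Stemming (English): `PorterStemmer()`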

## Prediction
1. Convert to a bag of words: `CountVectorizer()`
2. Split into training and test sets: `train_test_split()`
3. Train: `classifier.fit()`
4. Predict: `classifier.predict()`

In [21]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords

In [39]:
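# A .tsv file separates fields with tab characters
# quoting=3 is csv.QUOTE_NONE, so quote characters in the reviews are treated as ordinary text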
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
print('review before preprocessing : {}'.format(dataset['Review'][0]))

review before preprocessing : Wow... Loved this place.


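## Removing characters with re.sub
    re.sub(target, replace, text)
* First argument: the pattern of characters to remove. Adding `^` at the start of a character class inverts it, so the pattern matches every character except the ones listed, and the listed characters are kept
* Second argument: what each matched character is replaced with; here we want a single space
* Third argument: the text to strip the characters from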
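
The effect of `^` inside the character class can be seen by running the same call with and without it. This is a minimal sketch on the first review from the dataset; the variable names `letters_removed` and `letters_kept` are only for illustration.

```python
import re

text = "Wow... Loved this place."

# Without ^: the class matches letters, so the letters are what gets replaced.
letters_removed = re.sub('[a-zA-Z]', ' ', text)

# With ^: the class matches everything EXCEPT letters, so the letters are kept
# and punctuation and whitespace are replaced by spaces.
letters_kept = re.sub('[^a-zA-Z]', ' ', text)

print(repr(letters_removed))
print(repr(letters_kept))
```

Each matched character is replaced one for one, which is why the three dots in the review turn into three spaces rather than a single one.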

In [43]:
import re

review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
print('review after re.sub : {}'.format(review))

review after re.sub : Wow    Loved this place 


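## Converting all letters to lowercase

* In most settings, distinguishing upper and lower case adds no extra information
* This is similar to computer vision (CV), where we convert images to grayscale to reduce complexity when color adds no extra information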
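
One way to see the reduction in complexity is to count distinct words before and after lowercasing. The token list below is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical tokens: the same two words written with different casing.
tokens = ['Good', 'good', 'GOOD', 'place', 'Place']

# Case-sensitive counting treats each spelling as a separate word.
case_sensitive = Counter(tokens)

# Lowercasing first merges the variants into one vocabulary entry per word.
case_insensitive = Counter(token.lower() for token in tokens)

print(len(case_sensitive))    # 5 distinct entries
print(len(case_insensitive))  # 2 distinct entries
```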

In [44]:
review = review.lower()
print('review after lower : {}'.format(review))

review after lower : wow    loved this place 


---
## Tokenization

In [45]:
import nltk
# Split review into individual words
print('review after split : {}'.format(review.split()))

review after split : ['wow', 'loved', 'this', 'place']


**tokenize is a better choice than split: for example, split cannot separate the period from "word."**

In [46]:
test_str = 'Wow... Loved this place.'

print('review after split : {}'.format(test_str.split()))
print('review after tokenized : {}'.format(nltk.word_tokenize(test_str)))

review after split : ['Wow...', 'Loved', 'this', 'place.']
review after tokenized : ['Wow', '...', 'Loved', 'this', 'place', '.']
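
To get a feel for why a tokenizer separates punctuation while `split` does not, here is a rough regex-based approximation. `simple_tokenize` is a hypothetical helper written for this note; it is not how `nltk.word_tokenize` works internally, and it only happens to agree with it on this simple sentence.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and a run of punctuation characters is another token.
    return re.findall(r"\w+|[^\w\s]+", text)

test_str = 'Wow... Loved this place.'

# split only cuts on whitespace, so punctuation stays glued to the words.
print(test_str.split())

# The regex treats punctuation as its own token.
print(simple_tokenize(test_str))
```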


In [47]:
review = nltk.word_tokenize(review)
print('review after tokenized : {}'.format(review))

review after tokenized : ['wow', 'loved', 'this', 'place']


**For Chinese, use jieba**

In [48]:
import jieba
jieba.set_dictionary('dict.txt')

In [49]:
review_ = '哇！我好喜歡這個地方'
cut_result = jieba.cut(review_, cut_all=False, HMM=False)
print("output: {}".format('|'.join(cut_result)))

Building prefix dict from /Users/jiaping/Desktop/Coding/1st-NLP100Days/homework/Day014/dict.txt ...
Loading model from cache /var/folders/55/dc9c0nvd6sl1f2ncrh90zcx40000gn/T/jieba.ud4a3c199cf528fdf91e7a951490248e7.cache
Loading model cost 0.519 seconds.
Prefix dict has been built successfully.


output: 哇|！|我|好|喜歡|這|個|地方


---
## Stopwords: removing filler words
* This is one of the key preprocessing steps: too many stopwords add no information and can interfere with model training

In [50]:
# NLTK ships a prebuilt word list that helps us remove unwanted words
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiaping/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**`stopwords.words('english')`**
**is a prebuilt list of common English stopwords**

In [51]:
stopwords.words('english')[:5]

['i', 'me', 'my', 'myself', 'we']

In [52]:
english_stopwords = set(stopwords.words('english'))  # build the set once, not once per word
review = [word for word in review if word not in english_stopwords]
print('review after removing stopwords : {}'.format(review))

review after removing stopwords : ['wow', 'loved', 'place']


**We can also build our own stopwords list**

In [53]:
# source: https://github.com/tomlinNTUB/Machine-Learning

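# utf-8-sig strips the byte order mark (BOM) at the start of the file,
# which would otherwise stay glued to the first stopword as '\ufeff,'
with open('停用詞-繁體中文.txt', 'r', encoding='utf-8-sig') as file:
    stop_words = file.readlines()
    
stop_words = [word.strip('\n') for word in stop_words]
stop_words[:20]

[',',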
 '?',
 '、',
 '。',
 '“',
 '”',
 '《',
 '》',
 '！',
 '，',
 '：',
 '；',
 '？',
 '人民',
 '末##末',
 '啊',
 '阿',
 '哎',
 '哎呀',
 '哎喲']

In [55]:
practice_sentence = ['哈哈','!','現在','好想','學習', '新課程','啊']
stop_word_set = set(stop_words)  # build the set once for fast membership tests
practice_sentence = [word for word in practice_sentence if word not in stop_word_set]

print('practice_sentence after removing stopwords : {}'.format(practice_sentence))

practice_sentence after removing stopwords : ['現在', '好想', '學習', '新課程']


---
## Stemming
 * e.g. loves and loved both become love
 * Chinese does not need stemming

In [37]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
review = [ps.stem(word) for word in review]

In [38]:
print('review after stemming : {}'.format(review))

review after stemming : ['wow', 'love', 'place']


---
## Practice: cleaning every sentence

In [56]:
#dataset=pd.read_csv('movie_feedback.csv',encoding = 'Big5',names=['feedback', 'label'] )
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)

In [57]:
corpus = []
row = len(dataset)
ps = PorterStemmer()  # create the stemmer once instead of on every iteration

for i in range(0, row):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    
    ## Skip stopword removal here: it would drop many negation words in the reviews (e.g. "isn't good" would become "good")
    review = [ps.stem(word) for word in review]
    review = ' '.join(review)
    corpus.append(review)
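
A middle ground between skipping stopword removal entirely and losing negations is to remove negation words from the stopword set before filtering. The sketch below uses a tiny made-up stopword set and a hypothetical helper `remove_stopwords`; with NLTK you would start from `set(stopwords.words('english'))` instead, and which words count as negations is a choice you would make for your own data.

```python
# A tiny illustrative stopword set (an assumption, not NLTK's full list).
stop_words = {'the', 'is', 'this', 'not', 'no', 'a'}

# Negation words we want to keep even though they are listed as stopwords.
negations = {'not', 'no'}
filtered_stop_words = stop_words - negations

def remove_stopwords(tokens, stop_set):
    # Keep only the tokens that are not in the given stopword set.
    return [token for token in tokens if token not in stop_set]

tokens = ['crust', 'is', 'not', 'good']
print(remove_stopwords(tokens, stop_words))           # ['crust', 'good']
print(remove_stopwords(tokens, filtered_stop_words))  # ['crust', 'not', 'good']
```

With the full set, a negative review collapses to the same tokens as a positive one; with the filtered set, the negation survives.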

In [60]:
dataset[:5]

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [61]:
corpus[:5]

['wow love thi place',
 'crust is not good',
 'not tasti and the textur wa just nasti',
 'stop by dure the late may bank holiday off rick steve recommend and love it',
 'the select on the menu wa great and so were the price']

---
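## Manually selecting high-frequency words
* Normally we do not need to do this step ourselves: APIs that convert text to vectors or indices usually have a parameter for it. It is included here for practice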

In [62]:
from collections import Counter

In [63]:
# Collect every word from the whole corpus

whole_words = []
for sentence in corpus:
    for words in nltk.word_tokenize(sentence):
        whole_words.append(words)
        
whole_words[:10]

['wow', 'love', 'thi', 'place', 'crust', 'is', 'not', 'good', 'not', 'tasti']

In [64]:
# Take the top_k most frequent words
top_k = 1000
top_k_words = []
for item in Counter(whole_words).most_common(top_k):
    top_k_words.append(item[0])

top_k_words[:10]

['the', 'and', 'i', 'wa', 'a', 'to', 'is', 'it', 'thi', 'of']

### Using the first sentence in the corpus as an example

In [65]:
rm_low_freq_word = ' '.join([word for word in nltk.word_tokenize(corpus[0]) if word in set(top_k_words)])

In [66]:
print('Before removing low frequency words:\n {}'.format(corpus[0]))
print('\n')
print('After removing low frequency words:\n {}'.format(rm_low_freq_word))

Before removing low frequency words:
 wow love thi place


After removing low frequency words:
 wow love thi place
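
The first sentence is unchanged because all of its words fall within the top 1000. To see the filter actually drop something, the counting and filtering steps can be wrapped in one function and run with a small `top_k`. `keep_top_k` and `toy_corpus` are hypothetical names for this sketch, and it tokenizes with `str.split()` rather than `nltk.word_tokenize` to stay self-contained.

```python
from collections import Counter

def keep_top_k(sentences, top_k):
    # Count every whitespace-separated token across the whole corpus.
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Keep the top_k most frequent words as a set for fast membership tests.
    vocabulary = {word for word, _ in counts.most_common(top_k)}
    # Rebuild each sentence from the words that survive the filter.
    return [' '.join(w for w in sentence.split() if w in vocabulary)
            for sentence in sentences]

toy_corpus = ['the food wa good', 'the servic wa slow', 'the food wa cold']
print(keep_top_k(toy_corpus, 3))
# ['the food wa', 'the wa', 'the food wa']
```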


---
## Converting to bag-of-words vectors

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
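# Create the bag-of-words model
# CountVectorizer also handles tokenization

# max_features sets how many columns to build; the most frequent words across the corpus are kept
cv = CountVectorizer(max_features=1000)

# fit_transform returns a sparse matrix; toarray converts it to a dense NumPy array
# X is still mostly zeros, which is typical of bag-of-words features
X = cv.fit_transform(corpus).toarray()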
y = dataset.iloc[:, 1].values

In [68]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
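
To make the meaning of the rows and columns concrete, here is a small pure-Python sketch of the bag-of-words idea. It is a simplification and not how `CountVectorizer` is implemented: it has no `max_features` limit and keeps single-character tokens, which `CountVectorizer` drops by default. `bag_of_words` and `X_toy` are hypothetical names.

```python
from collections import Counter

def bag_of_words(sentences):
    # Vocabulary: every distinct token, sorted so the column order is stable.
    vocabulary = sorted({word for s in sentences for word in s.split()})
    rows = []
    for s in sentences:
        counts = Counter(s.split())
        # One column per vocabulary word; the value is how often it occurs.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, X_toy = bag_of_words(
    ['wow love thi place', 'crust is not good', 'not tasti not good'])

print(vocabulary)
for row in X_toy:
    print(row)
```

Each row is one review and each column is one word, so most entries are zero because a review uses only a few words of the vocabulary.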

In [75]:
dataset.iloc[:, 1]

0      1
1      0
2      0
3      1
4      1
      ..
995    0
996    0
997    0
998    0
999    0
Name: Liked, Length: 1000, dtype: int64

In [69]:
y

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,

---
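## Optional exercise
* Feed the processed data into a naive_bayes model and predict whether a review is positive or negative. The underlying theory is explained in later chapters.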

## Training

In [76]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)

# Feature Scaling

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

## Inference

In [77]:
message = 'I really like this!!'

# Apply the same preprocessing that was used for training
review = re.sub('[^a-zA-Z]', ' ', message)
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
review = ' '.join(review)
input_ = cv.transform([review]).toarray()
prediction = classifier.predict(input_)

In [78]:
prediction ## 1 means a positive review

array([1])

In [84]:
message = 'All dishes are disgusting !!'

review = re.sub('[^a-zA-Z]', ' ', message)
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
review = ' '.join(review)
input_ = cv.transform([review]).toarray()
prediction = classifier.predict(input_)

In [85]:
prediction ## 0 means a negative review

array([0])

In [88]:
message = 'All dishes are great !!'

review = re.sub('[^a-zA-Z]', ' ', message)
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
review = ' '.join(review)
input_ = cv.transform([review]).toarray()
prediction = classifier.predict(input_)

In [89]:
prediction

array([1])
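
The three inference cells repeat the same preprocessing lines, so one option is to wrap them in a function. `clean_text` is a hypothetical helper, not part of the original notebook. To keep the sketch runnable on its own, the stemmer is passed in as a function and defaults to leaving words unchanged; in the notebook you would pass `ps.stem`.

```python
import re

def clean_text(message, stem=lambda word: word):
    # Keep letters only, lowercase, split on whitespace, stem, and rejoin.
    letters_only = re.sub('[^a-zA-Z]', ' ', message)
    words = letters_only.lower().split()
    return ' '.join(stem(word) for word in words)

# With the default (no stemming) the words are only cleaned and lowercased.
print(clean_text('All dishes are great !!'))  # all dishes are great
```

In the notebook, a prediction would then look like `classifier.predict(cv.transform([clean_text(message, stem=ps.stem)]).toarray())`, which keeps training and inference preprocessing in one place.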