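## Today's lesson - Text preprocessing
* Some of this material may have appeared in earlier chapters; here we chain together the techniques needed for preprocessing.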

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords

In [18]:
dataset = pd.read_csv('movie_feedback.csv', header=None, encoding='Big5')
X = dataset[0].values
Y = dataset[1].values

In [3]:
dataset

Unnamed: 0,0,1
0,the rock is destined to be the 21st century's ...,1
1,"the gorgeously elaborate continuation of "" the...",1
2,effective but too-tepid biopic,1
3,if you sometimes like to go to the movies to h...,1
4,"emerges as something rare , an issue movie tha...",1
...,...,...
10657,a terrible movie that some people will neverth...,0
10658,there are many definitions of 'time waster' bu...,0
10659,"as it stands , crocodile hunter has the hurrie...",0
10660,the thing looks like a made-for-home-video qui...,0


In [6]:
print('review before preprocessing :\n{}'.format(X[0]))

review before preprocessing :
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 


---
## re.sub(): removing unwanted characters

In [7]:
import re 
# Replace every character other than a-z and A-Z with a space ' '
review = re.sub('[^a-zA-Z]', ' ', X[0])

In [8]:
print('review after re.sub : {}'.format(review))

review after re.sub : the rock is destined to be the   st century s new   conan   and that he s going to make a splash even greater than arnold schwarzenegger   jean claud van damme or steven segal   
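
The pattern `[^a-zA-Z]` also removes digits and apostrophes, which is why `21st` became `st` and `he's` became `he s` in the output above. Here is a minimal sketch of that effect. The wider pattern `[^a-zA-Z0-9']` is only an illustrative alternative (an assumption, not what the rest of this notebook uses):

```python
import re

text = "the 21st century's new \" conan \" , jean-claud ."

# Pattern used in this notebook: keep only ASCII letters.
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

# Illustrative alternative (an assumption, not used later on):
# also keep digits and apostrophes.
keep_more = re.sub(r"[^a-zA-Z0-9']", " ", text)

print(letters_only.split())
print(keep_more.split())
```

Which pattern is appropriate depends on whether numbers and contractions carry signal for the task at hand.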


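## Convert all letters to lowercase
* In most scenarios, distinguishing upper and lower case provides no additional information.
* This is similar to computer vision, where we convert an image to grayscale when color carries no extra information, in order to reduce complexity.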

In [9]:
review = review.lower()
print('review after lower : {}'.format(review))

review after lower : the rock is destined to be the   st century s new   conan   and that he s going to make a splash even greater than arnold schwarzenegger   jean claud van damme or steven segal   


## Tokenization

In [10]:
import nltk
# Split the review into individual words
print('review after split : {}'.format(review.split()))

review after split : ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', 'st', 'century', 's', 'new', 'conan', 'and', 'that', 'he', 's', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', 'jean', 'claud', 'van', 'damme', 'or', 'steven', 'segal']


`nltk.word_tokenize` is usually a better choice than `split`: for example, `split` cannot separate the word from the period in a case like `word.`

In [11]:
review = nltk.word_tokenize(review)
print('review after tokenize : {}'.format(review))

review after tokenize : ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', 'st', 'century', 's', 'new', 'conan', 'and', 'that', 'he', 's', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', 'jean', 'claud', 'van', 'damme', 'or', 'steven', 'segal']
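
For this review the two outputs are identical, because `re.sub` had already replaced all punctuation with spaces. The difference only appears on raw text. The sketch below uses a small regex tokenizer as a stand-in for `nltk.word_tokenize`. That is an assumption made to keep the example dependency-free; NLTK's tokenizer handles many more cases, such as contractions:

```python
import re

raw = "a terrible movie. truly terrible!"

# str.split only cuts on whitespace, so punctuation stays glued to words.
by_split = raw.split()

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
by_regex = re.findall(r"\w+|[^\w\s]", raw)

print(by_split)   # ['a', 'terrible', 'movie.', 'truly', 'terrible!']
print(by_regex)   # ['a', 'terrible', 'movie', '.', 'truly', 'terrible', '!']
```

With `split`, `movie.` and `movie` would end up as two different vocabulary entries.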


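## Stopwords: removing filler words
* This is one of the important preprocessing steps: too many filler words not only add no information, they can also interfere with model training.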

In [12]:
# NLTK provides a ready-made stopword corpus that helps us remove unwanted words
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiaping/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
review = [word for word in review if word not in stop_words]
print('review after removing stopwords : {}'.format(review))

review after removing stopwords : ['rock', 'destined', 'st', 'century', 'new', 'conan', 'going', 'make', 'splash', 'even', 'greater', 'arnold', 'schwarzenegger', 'jean', 'claud', 'van', 'damme', 'steven', 'segal']


## Stemming
 * e.g. loves and loved both become love
 * Chinese has no need for stemming

In [14]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
review = [ps.stem(word) for word in review]

In [15]:
print('review after stemming : {}'.format(review))

review after stemming : ['rock', 'destin', 'st', 'centuri', 'new', 'conan', 'go', 'make', 'splash', 'even', 'greater', 'arnold', 'schwarzenegg', 'jean', 'claud', 'van', 'damm', 'steven', 'segal']


---
## Exercise: clean all the sentences

In [19]:
# dataset = pd.read_csv('movie_feedback.csv', encoding = 'Big5', names=['feedback', 'label'] )
X = dataset[0].values

In [20]:
X

array(['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . ',
       'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . ',
       'effective but too-tepid biopic', ...,
       "as it stands , crocodile hunter has the hurried , badly cobbled look of the 1959 godzilla , which combined scenes of a japanese monster flick with canned shots of raymond burr commenting on the monster's path of destruction . ",
       'the thing looks like a made-for-home-video quickie . ',
       "enigma is well-made , but it's just too dry and too placid . "],
      dtype=object)

In [21]:
corpus = []
row = len(X)
ps = PorterStemmer()  # create the stemmer once and reuse it for every review

for i in range(0, row):
    review = re.sub('[^a-zA-Z]', ' ', X[i])
    review = review.lower()
    review = review.split()
    # Skip stopword removal here: it would strip many negation words from the reviews,
    # e.g. "isn't good" would become "good"
    review = [ps.stem(word) for word in review]
    review = ' '.join(review)
    corpus.append(review)
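
The concern about negations can be made concrete. The sketch below uses a tiny hand-written stopword set (hypothetical, not NLTK's list) and a hypothetical `NEGATIONS` set. It shows how removing every stopword can flip the apparent sentiment, and one possible workaround: keep the negation words.

```python
import re

# Tiny illustrative stopword list (hypothetical; NLTK's list is much longer).
STOP_WORDS = {"this", "is", "the", "a", "not", "no"}
# Words we choose to keep because they carry sentiment (an assumption).
NEGATIONS = {"not", "no"}

def clean(text, stop_words):
    """Letters only, lowercase, split, then drop the given stopwords."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in stop_words]

text = "This movie is not good"
print(clean(text, STOP_WORDS))              # ['movie', 'good']
print(clean(text, STOP_WORDS - NEGATIONS))  # ['movie', 'not', 'good']
```

Skipping stopword removal altogether, as this notebook does, is the simpler option; filtering with a reduced list is a middle ground.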

## Convert to bag-of-words vectors

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
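# Create the bag-of-words model
# Tokenization: CountVectorizer splits each document into tokens itself

# max_features sets how many columns to build; terms are selected by how frequently they occur across the corpus
cv = CountVectorizer(max_features=1500)

# toarray() converts the sparse result into a dense NumPy array
# The resulting matrix is highly sparse in content, i.e. most of its entries are zero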
X_ = cv.fit_transform(corpus).toarray()
Y_ = dataset[1].values
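
To make the bag-of-words idea concrete, here is a simplified plain-Python sketch of what a count vectorizer with a `max_features` limit does on a toy corpus. It is an approximation under stated assumptions, not scikit-learn's implementation: tokens come from a plain `split()` (whereas `CountVectorizer`'s default token pattern ignores single-character tokens), and tie-breaking between equally frequent terms is not modeled.

```python
from collections import Counter

# Toy corpus of already-cleaned, stemmed reviews (made up for illustration).
corpus = ["good movi", "bad movi", "good good stori"]
max_features = 2

# Count every term over the whole corpus and keep the most frequent ones.
totals = Counter(word for doc in corpus for word in doc.split())
vocab = sorted(word for word, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary term.
vectors = [[doc.split().count(term) for term in vocab] for doc in corpus]

print(vocab)    # ['good', 'movi']
print(vectors)  # [[1, 1], [0, 1], [2, 0]]
```

Terms outside the kept vocabulary (`bad`, `stori`) simply vanish from the vectors, which is the trade-off of a small `max_features`.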

---
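## Optional exercise
* Feed the processed data into a naive Bayes model and predict whether a review is positive or negative. The underlying theory is explained in later chapters.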

## Training

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_, Y_, test_size = 0.1)

# Feature Scaling

#Naive Bayes
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()
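
The held-out split `X_test`, `y_test` is not used in the cells shown here. One common way to check a classifier is a confusion matrix. The plain-Python sketch below shows what such a matrix counts, using made-up labels rather than results from this dataset:

```python
# Made-up labels for illustration only (not results from this dataset).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# matrix[t][p] = number of samples with true label t and predicted label p.
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Correct predictions sit on the diagonal.
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix)
print(accuracy)
```

In the notebook itself, `confusion_matrix(y_test, classifier.predict(X_test))`, using the `confusion_matrix` imported from `sklearn.metrics` in the first cell, should give the same layout: true labels as rows, predicted labels as columns.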

## Inference

In [32]:
message='I like this movie!!'

# Apply the same preprocessing that was used for the training data
review = re.sub('[^a-zA-Z]', ' ', message)
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
review = ' '.join(review)

input_ = cv.transform([review]).toarray()
prediction = classifier.predict(input_)

In [33]:
prediction  # 1 means a positive review

array([1])

In [34]:
message='A terrible movie  !!'

review = re.sub('[^a-zA-Z]', ' ', message)
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
review = ' '.join(review)

input_ = cv.transform([review]).toarray()
prediction = classifier.predict(input_)

In [35]:
prediction  # 0 means a negative review

array([0])
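
Because training and inference must share exactly the same preprocessing, the repeated steps could be wrapped in one function. The helper below is a hypothetical sketch: `preprocess` and its `stem` parameter are names introduced here, and the default stemmer is an identity function so the example runs without NLTK.

```python
import re

def preprocess(text, stem=lambda word: word):
    """Letters only -> lowercase -> split -> stem -> re-join with spaces."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stem(word) for word in words)

print(preprocess("I like this movie!!"))   # i like this movie
print(preprocess("A terrible movie  !!"))  # a terrible movie
```

In this notebook one would pass the Porter stemmer, e.g. `preprocess(message, stem=PorterStemmer().stem)`, and feed the result to `cv.transform([...])` as in the cells above.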