## Topic: Classifying e-commerce product reviews as positive or negative with machine learning<br>

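Description:<br>
1. The input files are positive.review and negative.review, both XML. We read them with BeautifulSoup, extract each review_text, and tokenize it with a custom tokenizer built on NLTK. We first build a word-to-index map and then build the word-frequency vectors.<br>
2. Next we shuffle the data and create the train/test split, holding out 100 samples for testing.<br>
3. We then fit a Logistic Regression classifier and report the accuracy on the training set and the test set. After that we can inspect the positive or negative weight of each word and set a threshold, for example an absolute value greater than 0.5, to keep only the words whose sentiment is pronounced.<br>
4. Finally, we find the positive review and the negative review that the current model misclassifies most badly.<br>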

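Extension: try different tokenizers, different tokens_to_vector functions, and different ML classifiers, and compare how each affects accuracy. Finally, use your model to predict the contents of the unlabeled.review file.<br>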
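
The comparison suggested in the extension can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `texts` and `labels` are made up and stand in for the real reviews, and scikit-learn's built-in `CountVectorizer`, `TfidfVectorizer`, and `MultinomialNB` are swapped in for the notebook's own tokenizer and vector code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real review texts and labels.
texts = [
    "great sound and easy to use",
    "stopped working after a month",
    "excellent quality for the price",
    "bad cable, waste of money",
    "works perfectly, highly recommend",
    "defective unit, would not buy again",
    "fast shipping and great price",
    "the company never fixed the problem",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Each candidate pairs one vectorizer with one classifier.
candidates = {
    "counts + logistic regression": make_pipeline(CountVectorizer(), LogisticRegression()),
    "tf-idf + logistic regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "counts + naive Bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
}

def compare(texts, labels, n_test):
    """Fit every candidate on all but the last n_test samples and score it on the rest."""
    scores = {}
    for name, model in candidates.items():
        model.fit(texts[:-n_test], labels[:-n_test])
        scores[name] = model.score(texts[-n_test:], labels[-n_test:])
    return scores

scores = compare(texts, labels, n_test=2)
for name, acc in scores.items():
    print(name, acc)
```

With only a handful of samples the printed accuracies mean little; the point is the structure, which lets one component be swapped at a time while the split stays fixed.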


---


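Sample program file: sentiment_情緒分析.py, which performs sentiment analysis with LogisticRegression.<br>
Modules: sklearn, bs4, numpy, nltk<br>
Input files: stopwords.txt, plus positive.review and negative.review under /electronics<br>
Score: classification accuracy (percentage)<br>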


---


Note: nltk needs the punkt and wordnet resources. The code below also loads the NLTK stopwords corpus.<br>
import nltk<br>
nltk.download('punkt')<br>
nltk.download('wordnet')<br>
nltk.download('stopwords')<br>


In [7]:
from __future__ import print_function, division
from future.utils import iteritems
from builtins import range

import nltk
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
import numpy as np
from sklearn.utils import shuffle
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

wordnet_lemmatizer = WordNetLemmatizer()

In [10]:
# Read the positive and negative reviews
# data courtesy of http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
with open('/content/drive/MyDrive/NLP/electronics/positive.review', encoding='utf-8') as f:
    positive_reviews = BeautifulSoup(f.read(), features="html5lib")
positive_reviews = positive_reviews.find_all('review_text')

with open('/content/drive/MyDrive/NLP/electronics/negative.review', encoding='utf-8') as f:
    negative_reviews = BeautifulSoup(f.read(), features="html5lib")
negative_reviews = negative_reviews.find_all('review_text')
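
To see what the extraction step returns, here is a small sketch on an inline string. The sample markup (the `<review>` and `<rating>` wrappers and the review wording) is an assumption made for illustration, and the built-in `html.parser` is used instead of `html5lib` so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample in the assumed shape of the .review files.
sample = """
<review>
<rating>5.0</rating>
<review_text>
Great cable, works well.
</review_text>
</review>
<review>
<rating>1.0</rating>
<review_text>
Stopped working after a month.
</review_text>
</review>
"""

soup = BeautifulSoup(sample, features="html.parser")
# find_all returns one Tag per <review_text>; .text gives the raw string inside it.
texts = [tag.text.strip() for tag in soup.find_all("review_text")]
print(texts)
```

Each element is the text of one review, which is what the tokenizer receives later through `review.text`.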

In [17]:
# Custom tokenizer built on NLTK (uses POS tags)
from nltk.corpus import wordnet # provides the ADJ/NOUN/VERB/ADV constants used below

def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format the lemmatizer expects."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def my_tokenizer_with_postag(s):
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split the string into tokens
    tokens = [t for t in tokens if len(t) > 2] # drop short tokens
    tokens = [wordnet_lemmatizer.lemmatize(t,get_wordnet_pos(t)) for t in tokens] # lemmatize using the POS tag
    tokens = [t for t in tokens if t not in stopwords] # remove stopwords
    return tokens

In [19]:
# Custom tokenizer built on NLTK

def my_tokenizer(s):
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split the string into tokens
    tokens = [t for t in tokens if len(t) > 2] # drop short tokens
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # lemmatize (default POS is noun)
    tokens = [t for t in tokens if t not in stopwords] # remove stopwords
    return tokens
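
As one possible alternative tokenizer for comparison, the following sketch uses only a regular expression. It is an assumption-laden stand-in: `STOPWORDS` is a made-up mini list rather than NLTK's English list, and no lemmatization is performed.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "and", "was", "this", "that", "with"}

def simple_tokenizer(s):
    """Lowercase, keep runs of letters/apostrophes, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z']+", s.lower())
    tokens = [t for t in tokens if len(t) > 2]  # same length filter as my_tokenizer
    return [t for t in tokens if t not in STOPWORDS]

print(simple_tokenizer("This was the BEST cable, and it works with my TV!"))
# ['best', 'cable', 'works']
```

Because it skips lemmatization, inflected forms such as "works" and "worked" stay separate features, which is one of the differences worth measuring.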

In [20]:
# Build the word-to-index map first; the word-frequency vectors come next
# Keep the tokenized reviews so tokenization does not have to be repeated
word_index_map = {}
current_index = 0
positive_tokenized = [] # tokens of each positive review
negative_tokenized = [] # tokens of each negative review
orig_reviews = [] # all original review texts

for review in positive_reviews:
    orig_reviews.append(review.text)
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index+=1


for review in negative_reviews:
    orig_reviews.append(review.text)
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index+=1

print("len(word_index_map):", len(word_index_map))

len(word_index_map): 11297
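
The indexing logic can be checked on a toy input. In this sketch, `build_word_index_map` and `docs` are illustrative names that do not appear in the notebook; the function does the same first-seen numbering as the two loops above.

```python
def build_word_index_map(tokenized_docs):
    """Assign each new token the next free index, in order of first appearance."""
    word_index_map = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index_map:
                word_index_map[token] = len(word_index_map)
    return word_index_map

# Hypothetical tokenized reviews.
docs = [["great", "cable", "great"], ["bad", "cable"]]
index_map = build_word_index_map(docs)
print(index_map)  # {'great': 0, 'cable': 1, 'bad': 2}
```

Using `len(word_index_map)` as the next index removes the need for a separate counter such as `current_index`.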


In [21]:
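# now let's create our input matrices
def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1) # the last element holds the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    total = x.sum()
    if total > 0: # guard against reviews that end up with no tokens
        x = x / total # normalize counts to proportions so review length does not dominate
    x[-1] = label
    return x

N = len(positive_tokenized) + len(negative_tokenized)
# (N x D+1) matrix - features and label are kept together so they are shuffled as one
data = np.zeros((N, len(word_index_map) + 1))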
i = 0
for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)
    data[i,:] = xy
    i += 1

for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i,:] = xy
    i += 1
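
A worked example of the vector step, on a made-up three-word vocabulary. `to_frequency_vector` is a hypothetical helper that leaves out the label column so the normalization is easier to see.

```python
import numpy as np

# Hypothetical vocabulary, standing in for the real word_index_map.
toy_index_map = {"great": 0, "cable": 1, "bad": 2}

def to_frequency_vector(tokens, index_map):
    """Count each token, then divide by the total so the entries sum to 1."""
    x = np.zeros(len(index_map))
    for t in tokens:
        x[index_map[t]] += 1
    total = x.sum()
    # An empty token list would otherwise divide 0 by 0 and produce NaN.
    return x / total if total > 0 else x

vec = to_frequency_vector(["great", "cable", "great"], toy_index_map)
print(vec)  # "great" gets 2/3, "cable" gets 1/3, "bad" gets 0
```

A long review and a short review with the same word proportions map to the same vector, which is the effect the normalization is meant to have.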

In [22]:
# shuffle the data and create the train/test split
orig_reviews, data = shuffle(orig_reviews, data)

X = data[:,:-1]
Y = data[:,-1]

# the last 100 rows are held out for testing
Xtrain = X[:-100]
Ytrain = Y[:-100]
Xtest = X[-100:]
Ytest = Y[-100:]
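
The shuffle has to apply one permutation to both the review texts and the data matrix, otherwise the misclassified rows found later would point at the wrong text. The sketch below checks that property on made-up inputs; passing `random_state` is an addition not present in the notebook, and it would make the split (and therefore the reported accuracies) reproducible.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical inputs: review "r3" belongs to the row whose value is 3, and so on.
reviews = ["r0", "r1", "r2", "r3", "r4"]
rows = np.arange(5).reshape(5, 1)

# Both sequences are reordered with the same permutation.
reviews_s, rows_s = shuffle(reviews, rows, random_state=0)

aligned = all(r == "r%d" % row[0] for r, row in zip(reviews_s, rows_s))
print(aligned)  # True
```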


In [23]:
LR = LogisticRegression(random_state=1)
LR.fit(Xtrain, Ytrain)
print("Train accuracy:", LR.score(Xtrain, Ytrain))
print("Test accuracy:", LR.score(Xtest, Ytest))

Train accuracy: 0.7947368421052632
Test accuracy: 0.81
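
For reference, `score` on a scikit-learn classifier reports mean accuracy, that is, the fraction of samples whose predicted label equals the true label. The sketch below verifies this on synthetic data; the random features and the labeling rule are made up and have nothing to do with the review data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: the label depends only on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(float)

model = LogisticRegression().fit(X_demo, y_demo)

# Accuracy computed by hand: share of predictions that match the labels.
manual_accuracy = np.mean(model.predict(X_demo) == y_demo)
print(manual_accuracy, model.score(X_demo, y_demo))
```

With only 100 held-out reviews, each test sample moves the test accuracy by a full percentage point, so small gaps between the train and test figures should be read with caution.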


In [24]:
# print the weight of every word whose magnitude exceeds the threshold
# try different threshold values!
threshold = 0.5
for word, index in iteritems(word_index_map):
    weight = LR.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

unit -0.5915704831501671
bad -0.6248628581718502
cable 0.5278136456596043
time -0.6624396695007301
used 0.9742193542402741
've 0.5961302300635231
month -0.6404595335265224
problem 0.5237838279780997
need 0.5842710324763746
good 1.9141913566366175
sound 1.1005783405300855
like 0.6639907319545435
lot 0.6206479032444179
n't -1.8547321869367932
easy 1.2796688861055274
case 0.5584096450685518
get -1.0830047256835995
use 1.3929140808656175
quality 1.1638328539232914
company -0.5323035799687027
best 0.9280170851000734
item -0.9358914753079649
working -0.5139693053769487
well 1.010490058294626
wa -1.2679184563257488
perfect 0.901549496751304
fast 0.7629842222419804
ha 0.6077946661799492
price 2.254989164770666
great 3.3557038918586577
money -0.8610392889022679
memory 0.7493247570250994
would -0.629000510489913
buy -0.948243208325204
worked -0.8117477438572105
happy 0.5228389621823937
pretty 0.5491661788996735
doe -1.0282908667135213
two -0.6328203074147392
highly 0.8309417932859127
recommend 0
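
The list above is printed in vocabulary order, which makes the strongest words hard to spot. A possible refinement is to sort by absolute weight, sketched here. The `weights` dictionary is a hand-made stand-in with rounded values for a few of the words above plus one invented low-weight entry ("box"); in the notebook the values would come from `LR.coef_[0]`.

```python
# Hypothetical weights, rounded from the printed output; "box" is invented.
weights = {"great": 3.36, "price": 2.25, "n't": -1.85,
           "good": 1.91, "cable": 0.53, "box": 0.02}

def significant_words(weights, threshold):
    """Keep words with |weight| > threshold, strongest first."""
    kept = [(w, v) for w, v in weights.items() if abs(v) > threshold]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)

ranked = significant_words(weights, threshold=0.5)
for word, weight in ranked:
    print(word, weight)
```

Using `abs(v) > threshold` is equivalent to the two-sided test `weight > threshold or weight < -threshold` in the cell above.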

In [28]:
# find the misclassified examples
preds = LR.predict(X)
P = LR.predict_proba(X)[:,1] # p(y = 1 | x), the predicted probability of the positive class

In [38]:
# keep only the worst case of each kind
minP_whenYis1 = 1
maxP_whenYis0 = 0
wrong_positive_review = None
wrong_negative_review = None
wrong_positive_prediction = None
wrong_negative_prediction = None
idx_p = None # stays None if no positive review is misclassified
idx_n = None # stays None if no negative review is misclassified
for i in range(N):
    p = P[i]
    y = Y[i]
    if y == 1 and p < 0.5: # true label is 1 but p(y = 1 | x) is below 0.5: false negative
        if p < minP_whenYis1:
            wrong_positive_review = orig_reviews[i]
            wrong_positive_prediction = preds[i]
            minP_whenYis1 = p
            idx_p=i
    elif y == 0 and p > 0.5: # true label is 0 but p(y = 1 | x) is above 0.5: false positive
        if p > maxP_whenYis0:
            wrong_negative_review = orig_reviews[i]
            wrong_negative_prediction = preds[i]
            maxP_whenYis0 = p
            idx_n=i

print("Most wrong positive review is NO.%s (prob = %s, pred = %s):" % (idx_p,minP_whenYis1, wrong_positive_prediction))
print(wrong_positive_review)
print("Most wrong negative review is NO.%s (prob = %s, pred = %s):" % (idx_n, maxP_whenYis0, wrong_negative_prediction))
print(wrong_negative_review)


Most wrong positive review is NO.461 (prob = 0.40135203639398737, pred = 0.0):

This was a defective unit. Got new unit and it works as expected

Most wrong negative review is NO.1456 (prob = 0.6676719016677286, pred = 1.0):

The Voice recorder meets all my expectations and more
Easy to use, easy to transfer great results
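
The same search can be written without an explicit loop. This is a sketch using NumPy boolean masks on made-up probabilities and labels; `most_wrong`, `P_demo`, and `Y_demo` are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical predicted probabilities p(y = 1 | x) and true labels.
P_demo = np.array([0.9, 0.4, 0.2, 0.7, 0.6])
Y_demo = np.array([1, 1, 1, 0, 0])

def most_wrong(P, Y):
    """Return (index of worst false negative, index of worst false positive)."""
    fn = np.where((Y == 1) & (P < 0.5))[0]  # positives predicted as negative
    fp = np.where((Y == 0) & (P > 0.5))[0]  # negatives predicted as positive
    worst_fn = int(fn[np.argmin(P[fn])]) if fn.size else None
    worst_fp = int(fp[np.argmax(P[fp])]) if fp.size else None
    return worst_fn, worst_fp

print(most_wrong(P_demo, Y_demo))  # (2, 3)
```

Returning `None` when a category has no errors avoids referencing an index that was never set.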

