### Assignment objective: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (a text classification task).

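The dataset scores each review from 1 to 5. In this assignment we regroup the scores into negative, neutral, and positive reviews (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).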

### Load packages

In [24]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so the code below uses only the first 5,000 of the 5,269 records for this exercise.

In [26]:
all_reviews = []
with open("./All_Beauty_5.json", "r") as f:
    for review in f:
        all_reviews.append(json.loads(review))
        
all_reviews[0]

{'asin': 'B0000530HU',
 'overall': 5.0,
 'reviewText': 'As advertised. Reasonably priced',
 'reviewTime': '09 1, 2016',
 'reviewerID': 'A3CIUOJXQ5VDQ2',
 'reviewerName': 'Shelly F',
 'style': {'Flavor:': ' Classic Ice Blue', 'Size:': ' 7.0 oz'},
 'summary': 'Five Stars',
 'unixReviewTime': 1472688000,
 'verified': True}

In [27]:
len(all_reviews)

5269

In [28]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews[:5000]:
  if review.get("reviewText", False) and review.get("overall", False):
    corpus.append(review["reviewText"])
    labels.append(review["overall"])
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
#NOTE: the mapping below is commented out, so the labels keep the original 1-5 star ratings

###<your code>###
# for i, label in enumerate(labels):
#   if label == 1 or label == 2:
#     labels[i] = 1
#   elif label == 3:
#     labels[i] = 2
#   else:
#     labels[i] = 3
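
The regrouping described in the comment above can be sketched as a small helper. This is a minimal illustration rather than the assignment's reference solution: `to_sentiment_class` is a hypothetical name, and the sample ratings are made up.

```python
def to_sentiment_class(rating):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive)."""
    if rating <= 2:
        return 1
    if rating == 3:
        return 2
    return 3

# Made-up sample ratings in the same float format as the "overall" field
sample_labels = [1.0, 2.0, 3.0, 4.0, 5.0]
mapped = [to_sentiment_class(r) for r in sample_labels]
print(mapped)  # [1, 1, 2, 3, 3]
```

Building a new list with `labels = [to_sentiment_class(r) for r in labels]` avoids mutating the list while iterating over it and would need to run before the train/test split.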

In [29]:
#preprocessing data
#remove email addresses, punctuation, and newline symbols (\n)
###<your code>###
pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9 ]"
for i, review in enumerate(corpus):
  fil_review = [w for w in re.sub(pattern, " ", review).split(" ") if w != ""]
  corpus[i] = " ".join(fil_review)
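
To see what the pattern above does, it helps to run it on a single string. The sketch below reuses the same regular expression; `clean_text` is a hypothetical helper name and the sample sentence is made up.

```python
import re

# Same pattern as the cell above: e-mail-like tokens, literal "\n" sequences,
# and any character that is not an ASCII letter, digit, or space
pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9 ]"

def clean_text(text):
    """Replace every match with a space, then collapse repeated spaces."""
    return " ".join(re.sub(pattern, " ", text).split())

sample = "Great product!\nContact me@example.com, 5 stars."
cleaned = clean_text(sample)
print(cleaned)  # Great product Contact 5 stars
```

Calling `split()` with no argument drops empty strings on its own, so the explicit `if w != ""` filter is not needed in this variant. Real newline characters are removed by the `[^a-zA-Z0-9 ]` branch, while the `\\n` branch only matches a backslash followed by the letter `n`.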

In [30]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)

len(x_train), len(x_test), len(y_train), len(y_test)

(3996, 999, 3996, 999)

In [31]:
#change corpus into vector
#you can use tfidf or BoW here

###<your code>###
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train).toarray()
x_test = tfidf_vec.transform(x_test).toarray()
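
The comment above mentions bag-of-words (BoW) as an alternative to TF-IDF. A minimal sketch with `CountVectorizer` on a made-up toy corpus shows the same fit-on-train, transform-on-both pattern; the documents here are assumptions for illustration, not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy documents
train_docs = ["good product", "bad product", "good good value"]
test_docs = ["good unknown"]

bow_vec = CountVectorizer()
# The vocabulary is learned from the training documents only
bow_train = bow_vec.fit_transform(train_docs)
# Words never seen during fit ("unknown") are ignored at transform time
bow_test = bow_vec.transform(test_docs)

vocab = sorted(bow_vec.vocabulary_)
print(vocab)               # ['bad', 'good', 'product', 'value']
print(bow_train.toarray())
print(bow_test.toarray())  # [[0 1 0 0]]
```

Fitting the vectorizer on the training split alone keeps information from the test reviews out of the feature space, which is why the cell above calls `fit` on `x_train` before transforming both splits.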

### Training and prediction

In [32]:
#build classification model (decision tree, random forest, or adaboost)
#start training

###<your code>###
tree = DecisionTreeClassifier(max_depth = 6, criterion='entropy')
tree.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=6, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [33]:
#start inference
y_pred = tree.predict(x_test)

In [34]:
#calculate accuracy
###<your code>###
print(f"Accuracy: {tree.score(x_test, y_test)}")

Accuracy: 0.9129129129129129


In [35]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

         1.0       1.00      0.67      0.80        24
         2.0       1.00      0.45      0.62        11
         3.0       0.60      0.15      0.24        20
         4.0       0.75      0.19      0.30        64
         5.0       0.92      1.00      0.95       880

    accuracy                           0.91       999
   macro avg       0.85      0.49      0.58       999
weighted avg       0.90      0.91      0.89       999

[[ 16   0   0   1   7]
 [  0   5   0   1   5]
 [  0   0   3   0  17]
 [  0   0   0  12  52]
 [  0   0   2   2 876]]
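
The figures in the report can be recomputed by hand from the confusion matrix, which makes the relationship between the two outputs explicit. The sketch below copies the matrix printed above; rows are true ratings and columns are predicted ratings, following scikit-learn's convention.

```python
# Confusion matrix copied from the output above
cm = [
    [16, 0, 0, 1, 7],
    [0, 5, 0, 1, 5],
    [0, 0, 3, 0, 17],
    [0, 0, 0, 12, 52],
    [0, 0, 2, 2, 876],
]
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.9129

precision, recall = {}, {}
for i, rating in enumerate(ratings):
    true_count = sum(cm[i])                 # row sum: samples whose true rating is this class
    pred_count = sum(row[i] for row in cm)  # column sum: samples predicted as this class
    recall[rating] = cm[i][i] / true_count
    precision[rating] = cm[i][i] / pred_count
    print(f"{rating}: precision={precision[rating]:.2f} recall={recall[rating]:.2f}")
```

The row sums (24, 11, 20, 64, 880) match the `support` column, and the last column shows that most misclassified reviews were predicted as 5 stars.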


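The results above show that the model handles the highest rating well (precision 0.92 and recall 1.00 for 5-star reviews) but does much worse on the other ratings. Recall is 0.67 and 0.45 for 1- and 2-star reviews and only 0.15 and 0.19 for 3- and 4-star reviews, and most of the misclassified reviews are predicted as 5 stars. The report lists five classes rather than three because the label regrouping in cell In [28] is commented out.
Try the methods you have learned to improve the model's performance.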
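
One direction to explore, offered as an assumption rather than the assignment's intended answer, is a random forest with class weighting so that rare classes count for more during training. The sketch below uses a made-up, deliberately imbalanced toy corpus; the documents, labels, and hyperparameters are all illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus: five positive reviews, two negative ones
docs = [
    "love it", "great value", "works great", "love this product",
    "great product love it", "terrible waste", "broke quickly terrible",
]
y = [5, 5, 5, 5, 5, 1, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # scikit-learn tree ensembles accept sparse matrices

forest = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,           # fixed seed so repeated runs give the same forest
)
forest.fit(X, y)

pred = forest.predict(vec.transform(["terrible product", "love it"]))
print(pred)
```

Whether this helps on the real data has to be checked with the classification report; stratifying the train/test split (`stratify=labels` in `train_test_split`) is another option worth comparing.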