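### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review Data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (text classification).

Reviews in the data are rated 1, 2, 3, 4, or 5. In this assignment we regroup them into negative, neutral, and positive reviews (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).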
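As a small illustration of the regrouping rule, the mapping can be written as a single function. This is only a sketch; the name `rating_to_class` and the sample ratings are made up for the example and are not part of the assignment template.

```python
def rating_to_class(rating):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive)."""
    if rating <= 2:
        return 1
    if rating == 3:
        return 2
    return 3

# Hypothetical sample ratings, one per star level.
sample_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_classes = [rating_to_class(r) for r in sample_ratings]
print(sample_classes)  # [1, 1, 2, 3, 3]
```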

### Load packages

In [1]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so we use only the first 10000 records for this exercise.

In [2]:
#load json data
all_reviews = []
###<your code>###
with open('All_Beauty.json','r',encoding='utf-8') as f:
    for line in f.readlines():
        all_reviews.append(json.loads(line))
        
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}
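The cell above parses the whole file even though only the first 10000 records are used later. Since each line is a standalone JSON object, one possible alternative is to stop reading after `n` lines. The sketch below assumes that layout; `read_first_n` is a hypothetical helper, and the in-memory `fake_file` with made-up records stands in for `All_Beauty.json`.

```python
import io
import json
from itertools import islice

def read_first_n(lines, n):
    """Parse at most n JSON-lines records from an iterable of text lines."""
    return [json.loads(line) for line in islice(lines, n)]

# Stand-in for the real file: one JSON object per line (made-up data).
fake_file = io.StringIO(
    '{"overall": 5.0, "reviewText": "love it"}\n'
    '{"overall": 1.0, "reviewText": "broke fast"}\n'
    '{"overall": 3.0, "reviewText": "okay"}\n'
)
first_two = read_first_n(fake_file, 2)
print(len(first_two), first_two[0]["overall"])  # 2 5.0
```

With the real data, the same function could be called on the open file handle, for example `read_first_n(f, 10000)` inside the `with open(...)` block.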

In [3]:
all_reviews[0].get('overall')

1.0

In [4]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews[:10000]:
    if review.get('overall',False) and review.get('reviewText',False):
        corpus.append(review['reviewText'])
        labels.append(review['overall'])
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
for i in range(len(labels)):
    if labels[i] == 1 or labels[i]==2:
        labels[i] = 1
    elif labels[i]==3:
        labels[i] = 2
    else:
        labels[i]=3
labels[:5]
###<your code>###

[1, 3, 3, 3, 3]

In [5]:
corpus[:5]

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"]

In [6]:
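#preprocessing data
#replace every non-letter character (punctuation, digits, \n) with a space, then lowercase
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
#word_tokenize, pos_tag, and WordNetLemmatizer need their own NLTK resources as well
#(assumed to be installed already; resource names vary by NLTK version)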

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

#use lemmatizer
lemmatizer = WordNetLemmatizer()
def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format expected by the lemmatizer."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def clean_content(X):
    X_clean = [re.sub('[^a-zA-Z]',' ',x).lower() for x in X]#re.sub replaces every non-letter character with a space
    #tokenize
    X_token = [nltk.word_tokenize(x) for x in X_clean]
    #stopwords_lemmatizer
    X_stopwords_lemmatizer = []
    stop_words = set(stopwords.words('english'))
    for content in X_token:
        content_clean = []
        for word in content:
            if word not in stop_words:
                word = lemmatizer.lemmatize(word,get_wordnet_pos(word))
                content_clean.append(word)
        X_stopwords_lemmatizer.append(content_clean)
    X_output = [' '.join(x) for x in X_stopwords_lemmatizer]
    
    return X_output


###<your code>###

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/linrongwei/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
corpus = clean_content(corpus)
corpus[:5]

['great',
 'husband want reading negro baseball great addition library library haveinformation book start tthank',
 'book informative cover aspect game',
 'already baseball fan knew bit negro league learn lot reading book',
 'good story black league bought book teach high school reading class found informative excite would recommend anyone interested history black league well write unlike book fact mckissack continue write good book young audience also enjoy adult']
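To see what the first step of `clean_content` does on its own, the sketch below applies the same regular expression to a made-up sentence. It uses `str.split()` as a simple stand-in for `nltk.word_tokenize`, so it runs without any NLTK resources; `strip_non_letters` is a hypothetical name.

```python
import re

def strip_non_letters(text):
    """Replace every non-letter character with a space, then lowercase."""
    return re.sub('[^a-zA-Z]', ' ', text).lower()

# Made-up input with punctuation, a line break, an apostrophe, and digits.
raw = "Tthank you!\nIt's 100% great."
cleaned = strip_non_letters(raw)
tokens = cleaned.split()  # whitespace split, standing in for word_tokenize
print(tokens)  # ['tthank', 'you', 'it', 's', 'great']
```

Note that digits disappear entirely and the apostrophe splits "It's" into "it" and "s", while misspellings such as "tthank" pass through unchanged.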

In [8]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)
len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)
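The split above is purely random. When the classes are unbalanced, one option worth trying is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses made-up labels (`toy_labels`) rather than the assignment data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, deliberately unbalanced labels: 10 negative, 10 neutral, 80 positive.
toy_labels = [1] * 10 + [2] * 10 + [3] * 80
toy_docs = [f"doc {i}" for i in range(len(toy_labels))]

docs_tr, docs_te, y_tr, y_te = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

# Each class keeps its 10% / 10% / 80% share in the 20-sample test set.
test_counts = Counter(y_te)
print(sorted(test_counts.items()))
```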

In [9]:
#change corpus into vector
#you can use tfidf or BoW here
cv = CountVectorizer(max_features = 1500)
###<your code>###
#tf = TfidfVectorizer()
#tf.fit(x_train)
#transform training and testing corpus into vector form
#fit the vocabulary on the training set only, then reuse it for the test set
x_train = cv.fit_transform(x_train).toarray()
x_test = cv.transform(x_test).toarray()
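A held-out set is normally transformed with the vocabulary learned from the training set, so that column *i* means the same word in both matrices. The sketch below shows the pattern on made-up documents; `cv_demo`, `train_docs`, and `test_docs` are illustrative names only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the cleaned reviews.
train_docs = ["great product love it", "bad product broke fast"]
test_docs = ["great value", "broke after a week"]

cv_demo = CountVectorizer()
train_vec = cv_demo.fit_transform(train_docs)  # learns the vocabulary
test_vec = cv_demo.transform(test_docs)        # reuses it, no refitting

# Both matrices share the same columns; unseen test words are ignored.
print(train_vec.shape, test_vec.shape)
print(sorted(cv_demo.vocabulary_))
```

Words that appear only in the test documents ("value", "after", "week") have no column, which is the intended behavior: the model never saw them during training.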

### Training and prediction

In [10]:
x_train.shape

(7996, 1500)

In [11]:
x_test.shape

(1999, 1500)

In [12]:
#build classification model (decision tree, random forest, or adaboost)
#start training
forest_cls = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=4,
                                           min_samples_split=5, min_samples_leaf=5)
'''adaboost_cls = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                                        max_depth=3,
                                                                        min_samples_split=10,
                                                                        min_samples_leaf=5),
                                  n_estimators=50,
                                  learning_rate=0.8)'''
forest_cls.fit(x_train,y_train)
###<your code>###

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=4, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
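The printed parameters show `class_weight=None`, so every training sample counts equally. When classes are unbalanced, one option to try is `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up labels: 1 negative, 1 neutral, 8 positive.
toy_y = [1] * 1 + [2] * 1 + [3] * 8
weights = balanced_class_weights(toy_y)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
# {1: 3.333, 2: 3.333, 3: 0.417}
```

Rare classes receive larger weights, so mistakes on them cost more during training. Whether this helps on the review data is something to verify experimentally.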

In [13]:
#start inference
y_pred = forest_cls.predict(x_test)

In [14]:
#calculate accuracy
###<your code>###
print('Accuracy:', forest_cls.score(x_test,y_test))

Accuracy: 0.896448224112056


In [15]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00       134
           2       0.00      0.00      0.00        73
           3       0.90      1.00      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.30      0.33      0.32      1999
weighted avg       0.80      0.90      0.85      1999

[[   0    0  134]
 [   0    0   73]
 [   0    0 1792]]




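The results above show that the model predicts every test review as positive (class 3): recall for positive reviews is 1.00 and precision is 0.90, while precision and recall for negative and neutral reviews are both 0. All negative and neutral reviews are misclassified as positive, so the 0.90 accuracy only reflects the share of positive reviews in the test set.
Try the methods you have learned to improve the model's performance.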
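The per-class figures in the report can be recomputed directly from the confusion matrix printed above, which makes it easier to see where the accuracy comes from. The sketch below copies that matrix by hand; `safe_div` is a hypothetical helper that returns 0 when a class has no predictions, matching the zeros shown in the report.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [[0, 0, 134],
      [0, 0, 73],
      [0, 0, 1792]]

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

f1_scores = []
for i in range(n_classes):
    true_pos = cm[i][i]
    predicted = sum(cm[r][i] for r in range(n_classes))  # column sum
    actual = sum(cm[i])                                   # row sum
    precision = safe_div(true_pos, predicted)
    recall = safe_div(true_pos, actual)
    f1_scores.append(safe_div(2 * precision * recall, precision + recall))

macro_f1 = sum(f1_scores) / n_classes
print(round(accuracy, 4), round(macro_f1, 2))  # 0.8964 0.32
```

The accuracy equals 1792 / 1999, the fraction of positive reviews in the test set, while the macro-averaged F1 of 0.32 shows how little the model learns about the other two classes.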