<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W06-text-classification-with-scikit-learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/nlptown/nlp-notebooks/blob/master/Traditional%20text%20classification%20with%20Scikit-learn.ipynb) included in the [nlptown/nlp-notebooks](https://github.com/nlptown/nlp-notebooks) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2022-05-27


# "Traditional" Text Classification with Scikit-learn

In this notebook, we're going to experiment with a few "traditional" approaches to text classification. These approaches pre-date the deep learning revolution in Natural Language Processing, but are often quick and effective ways of training a text classifier. 

There are numerous use cases for text classification, including

- Email spam detector
![](https://i1.wp.com/www.opinosis-analytics.com/wp-content/uploads/2020/08/document_classification.png?resize=872%2C436&ssl=1)

- Hate speech detector
![](https://i1.wp.com/www.opinosis-analytics.com/wp-content/uploads/2020/08/facebook_hatespeech.png?resize=721%2C548&ssl=1)

- Customer sentiment analysis
![](https://d33wubrfki0l68.cloudfront.net/9e1b2a906ae6b01cfe2d5d237e1e51f5d41864e3/2a5f9/static/348bb1d70089176ca2f61ea402094382/50bf7/main.png)

- Customer support system
![](https://www.opinosis-analytics.com/wp-content/uploads/2020/07/big_data_strategy_ticket_routing-1024x717.png)

- News classification
![](https://miro.medium.com/max/700/1*HgXA9v1EsqlrRDaC_iORhQ.png)

- Chatbot intent recognition 
![](https://assets-global.website-files.com/5e29a0c20f2d35836e6bc609/5eafc053bd54499b92d23c9d_Intent-Classification.png)

## Dataset

In this tutorial, we'll be using a dataset of 50K online reviews for 5 product categories. Each review comes with a product category (`cat`) and a binary sentiment label (`label`). See [this post](https://howard-haowen.rohan.tw/blog.ai/spacy/text-classification/sentiment-analysis/customer-reviews/fasttext/facets/2021/03/12/Classifying-customer-reviews-with-spaCy-v3.html#Preparing-the-dataset) of mine for details on how the dataset was prepared.

In [None]:
!wget -O reviews.csv https://github.com/howard-haowen/NLP-demos/raw/main/online_shopping_5_cats_tra.csv

--2022-05-27 07:42:27--  https://github.com/howard-haowen/NLP-demos/raw/main/online_shopping_5_cats_tra.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/howard-haowen/NLP-demos/main/online_shopping_5_cats_tra.csv [following]
--2022-05-27 07:42:27--  https://raw.githubusercontent.com/howard-haowen/NLP-demos/main/online_shopping_5_cats_tra.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8062412 (7.7M) [text/plain]
Saving to: ‘reviews.csv’


2022-05-27 07:42:27 (286 MB/s) - ‘reviews.csv’ saved [8062412/8062412]



In [None]:
import pandas as pd
pd.options.plotting.backend = "plotly"

In [None]:
df = pd.read_csv('reviews.csv')
df

Unnamed: 0,cat,label,review
0,平板,1,﻿很不錯。。。。。。很好的平板
1,平板,1,幫同學買的，同學說感覺挺好，質量也不錯
2,平板,1,東西不錯，一看就是正品包裝，還沒有開機，相信京東，都是老顧客，還是京東值得信賴，給五星好評
3,平板,1,總體而言，產品還是不錯的。
4,平板,1,好，不錯，真的很好不錯
...,...,...,...
49995,酒店,0,我們去鹽城的時候那裡的最低氣溫只有4度，晚上冷得要死，居然還不開空調，投訴到酒店客房部，得到...
49996,酒店,0,房間很小，整體設施老化，和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差，隔兩間房間...
49997,酒店,0,我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的！
49998,酒店,0,房間時間長，進去有點異味！服務員是不是不夠用啊！我在一樓找了半個小時以上才找到自己房間，想找...


### Proportion of categories


In [None]:
cat_counts = df['cat'].value_counts()
cat_counts.plot.bar()

### Train/test split


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
TRAIN_SIZE= 0.7
RANDOM_STATE = 500
train, test = train_test_split(df, 
                               train_size=TRAIN_SIZE,
                               random_state=RANDOM_STATE)
print(f"The training set size: {train.shape}")
print(f"The valid_test set size: {test.shape}")

The training set size: (35000, 3)
The valid_test set size: (15000, 3)


In [None]:
train['cat'].value_counts().plot.bar()

In [None]:
test['cat'].value_counts().plot.bar()

## Preprocessing

The first step in the development of any NLP model is text preprocessing. This means we're going to transform our texts from word sequences into feature vectors: numeric arrays that hold one value for each of a large number of features.

In this experiment, we're going to work with so-called "bag-of-words" approaches. Bag-of-words methods treat every text as an unordered collection of words (or optionally, n-grams), and the raw feature vectors simply tell us how often each word (or n-gram) occurs in a text. In Scikit-learn, we can construct these raw feature vectors with the `CountVectorizer`, which tokenizes the texts and, for each text, counts how many times it contains every token in the corpus vocabulary.
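To make the idea concrete, here is a minimal sketch of bag-of-words counting in plain Python. The two toy documents are made up for illustration and are assumed to be already tokenized and joined by spaces, like the output of our preprocessing step.

```python
from collections import Counter

# Two toy "documents" (hypothetical), already tokenized and joined by spaces
docs = ["房間 不錯 服務 不錯", "房間 很小"]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One count vector per document, aligned with the vocabulary order
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[tok] for tok in vocab])

for doc, vector in zip(docs, vectors):
    print(doc, "->", dict(zip(vocab, vector)))
```

Note that `CountVectorizer` does a little more than this sketch: by default it lowercases the text and, with its default `token_pattern`, ignores tokens shorter than two characters, so single-character Chinese tokens are dropped unless you change that setting.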

However, these raw counts are not very informative yet. This is because the raw feature vectors of most texts in the same language will be very similar. For example, most texts in English contain many instances of relatively uninformative words, such as *a*, *the* or *be*. Instead, what we're interested in are words like *computer* or *hardware*: words that occur often in one text, but not very often in the corpus as a whole. Therefore we're going to weight all features by their [tf-idf score](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which multiplies the number of times a token appears in a text (the term frequency) by the logarithm of the inverse of the fraction of corpus documents that contain that token (the inverse document frequency). This weighting is performed by Scikit-learn's `TfidfTransformer`.
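The weighting can be sketched in a few lines of plain Python. This uses the textbook formula, tf × log(N / df), on a made-up three-document corpus. Scikit-learn's default differs in the details: it uses the smoothed variant ln((1 + N) / (1 + df)) + 1 and then L2-normalizes every document vector, so its numbers will not match the ones below.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
docs = [
    ["the", "computer", "hardware"],
    ["the", "computer", "the"],
    ["the", "hotel"],
]
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                    # term frequency in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(n_docs / df)       # textbook tf-idf, no smoothing

# Weights for the tokens of the first document
weights = {term: tf_idf(term, docs[0]) for term in docs[0]}
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

The word *the* occurs in every document, so its weight drops to zero, while *hardware*, which occurs in only one document, receives the highest weight.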

One way to obtain the weighted feature vectors is to combine the `CountVectorizer` and `TfidfTransformer` in a `Pipeline`, fit this pipeline on the training data, and then transform both the training texts and the test texts into weighted feature vectors. Scikit-learn also has a `TfidfVectorizer`, which achieves the same result as the pipeline in a single step. That is the class we use below.
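The equivalence of the two routes can be checked on a small example. The sketch below uses a made-up toy corpus and default settings for all three classes; the step names `"counts"` and `"tfidf"` are arbitrary labels chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Hypothetical toy data, tokenized and joined by spaces
toy_train = ["房間 不錯 服務 不錯", "房間 很小 設施 老化", "蘋果 新鮮 好吃"]
toy_test = ["服務 不錯 蘋果 好吃"]

# Route 1: raw counts followed by tf-idf weighting
two_step = Pipeline([("counts", CountVectorizer()), ("tfidf", TfidfTransformer())])
# Route 2: both steps in a single class
one_step = TfidfVectorizer()

train_a = two_step.fit_transform(toy_train)
train_b = one_step.fit_transform(toy_train)
test_a = two_step.transform(toy_test)
test_b = one_step.transform(toy_test)

same_train = np.allclose(train_a.toarray(), train_b.toarray())
same_test = np.allclose(test_a.toarray(), test_b.toarray())
print(same_train, same_test)
```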

In [None]:
!pip install -U -q pip setuptools wheel
!pip install -U -q spacy
!python -m spacy download zh_core_web_sm
!git clone -l -s https://github.com/L706077/jieba-zh_TW.git jieba_tw
%cd jieba_tw
import jieba
%cd ../

Here, we're replacing spaCy's built-in tokenizer with another one, `jieba-zh_TW` in this case. You can plug in any tokenizer you have access to in the same manner 👍!

### Tokenization


In [None]:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("zh_core_web_sm")

class TwTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words =  jieba.lcut(text)
        spaces = [False] * len(words)        
        return Doc(self.vocab, words=words, spaces=spaces)

nlp.tokenizer = TwTokenizer(nlp.vocab)

In [None]:
text = "宜家家居新店店店長的名字好長喔！"
doc = nlp(text)
tokens = [tok.text for tok in doc]
" | ".join(tokens)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.286 seconds.
Prefix dict has been built succesfully.


'宜家 | 家居 | 新店店 | 店長 | 的 | 名字 | 好長 | 喔 | ！'

- First Version

In [None]:
def preprocess_texts(raw_texts):
    # nlp.pipe() is more efficient than nlp()
    clean_texts = []

    for doc in nlp.pipe(raw_texts, disable=["ner", "parser", "tagger"]):
        tokens = [tok.text for tok in doc if (
                    not tok.is_stop 
                    and not tok.is_punct
                    and not tok.is_currency
                    and not tok.is_space
                    and not tok.like_num
                )
        ]
        clean_text = " ".join(tokens)
        clean_texts.append(clean_text)

    return clean_texts

In [None]:
sample_texts = train['review'][-5:]
sample_texts

19389                                   又舊又殘！！不新鮮！不會再買啦！！！
3790     平板應該還是不錯的，華為也還可以，還沒用，等待測試。一次買了兩個。大小還算合適。快遞也還不錯...
41233    環境優雅安靜，湖景房確實美景如畫，早晚在掛滿了詠風嘆景對聯的長廊散步閒坐，身心極度放鬆。房價...
44865    這次去廣西在金都住了2天，一次普通房，一次豪華房。規規矩矩4星級。服務很好，房間不錯，離火車...
17335                      外面都黃了，看起來一點都不新鮮，好傷心哦！這個樣子送人很丟臉，
Name: review, dtype: object

In [None]:
preprocess_texts(sample_texts)

['舊 殘 新鮮 不會 買',
 '平板 應該 還是 不錯 華為 還可以 還 沒用 等待 測試 次 買 兩 個 大小 還 算 合適 快遞 還 不錯 調貨 天',
 '環境 優雅 安靜 湖景 房 確實 美景 如畫 早晚 掛滿 詠風 嘆景 對聯 長廊 散步 閒坐 身心 極度 放鬆 房價 雖 高 環境 確非 鎮中 酒店 小住 有傍湖 泳池 任遊 滿目 湖景 連 遊湖 省 早餐 夠 水準',
 '這 次 廣西 金都 住了 天 次 普通房 次 豪華 房 規規矩矩 星級 服務 房間 不錯 離 火車站 步行 分鐘 價效 當然 廣西 整體 房價 高 推薦',
 '外面 黃 看起來 一點 新鮮 傷心 這 個 樣子 送人 丟臉']

- Second Version

In [None]:
import re

def preprocess_texts(raw_texts):
    # nlp.pipe() is more efficient than nlp()
    clean_texts = []

    for doc in nlp.pipe(raw_texts, disable=["ner", "parser", "tagger"]):
        tokens = [tok.text for tok in doc if (
                    not tok.is_stop 
                    and not tok.is_punct
                    and not tok.is_currency
                    and not tok.is_space
                    and not tok.like_num
                )
        ]
        # strip ASCII digits and letters from every token
        # (A-Z, not A-z: the range A-z would also match [ \ ] ^ _ `)
        clean_tokens = [re.sub(r'[0-9a-zA-Z]', '', tok) for tok in tokens]
        # filter empty tokens
        clean_tokens = [tok for tok in clean_tokens if tok] 
        # add a space between tokens
        clean_text = " ".join(clean_tokens)
        clean_texts.append(clean_text)

    return clean_texts
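Character ranges in a filter like the one above deserve a second look. The range `A-z` spans ASCII codes 65 to 122, which covers not only letters but also the six characters between `Z` and `a` (the brackets, backslash, caret, underscore and backtick), whereas `A-Z` next to `a-z` matches letters only. A small check on a made-up token shows the difference:

```python
import re

token = "wifi_5G^快"  # hypothetical token mixing letters, digits and symbols

# Range A-z also swallows the underscore and the caret
loose = re.sub(r"[0-9a-zA-z]", "", token)
# Ranges A-Z and a-z remove ASCII letters and digits only
strict = re.sub(r"[0-9a-zA-Z]", "", token)

print(repr(loose), repr(strict))
```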

In [None]:
preprocess_texts(sample_texts)

['蘋果 夠甜 酥脆 可口 還可以 包裝 個 沒 壞給 好評',
 '先定 標間 感覺 比較 差 連 三星 換到 豪華 標間 樓棟 環境 要好 酒店 給人 整體 感覺 平庸 沒有 特點',
 '酒店 四星 服務 不錯 晚上 還 送 點心 酒店 設施 四星 標準 時間 比較 長了 設施 有點 老化 顯得 比較 舊 不過 整體 來 說 還是 不錯',
 '住 床房 四星 標準 衡量 話 傢俱 太 舊 房間 太 早餐 品種 單一 別 總體 來 說 還可以 地段 不錯 服務 滿好',
 '來 江門 出差 住 這 個 酒店 覺得 價效 還可以']

In [None]:
tokenized_train = preprocess_texts(train['review'])
tokenized_test = preprocess_texts(test['review'])

### Vectorization


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_vec = TfidfVectorizer()

**IMPORTANT**: Call the `fit_transform` method on the training data.

In [None]:
print("Vectorizing training data...")
dtmatrix_train = tfidf_vec.fit_transform(tokenized_train)
print(f"The shape of the document-term matrix for the training data is: {dtmatrix_train.shape}")

Vectorizing training data...
The shape of the document-term matrix for the training data is: (35000, 32409)


**IMPORTANT**: Call the `transform` method on the test data.

In [None]:
print("Vectorizing test data...")
dtmatrix_test = tfidf_vec.transform(tokenized_test)
print(f"The shape of the document-term matrix for the test data is: {dtmatrix_test.shape}")

Vectorizing test data...
The shape of the document-term matrix for the test data is: (15000, 32409)
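The reason for the two different calls is that the vocabulary (and, for tf-idf, the document frequencies) must be learned from the training data only. The test data is then mapped onto that same vocabulary, so both matrices have the same number of columns and nothing about the test set leaks into the features. Here is a minimal sketch with `CountVectorizer` on made-up toy data; `TfidfVectorizer` follows the same pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["房間 不錯", "蘋果 新鮮"]
toy_test = ["房間 新鮮 螢幕"]  # "螢幕" never occurs in the training texts

vec = CountVectorizer()
X_train = vec.fit_transform(toy_train)  # learns a 4-token vocabulary
X_test = vec.transform(toy_test)        # reuses it; unseen tokens are dropped

print(X_train.shape, X_test.shape)
print(int(X_test.toarray().sum()))      # only the two known tokens are counted
```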


## Training

Next, we train a text classifier on the preprocessed training data. We're going to experiment with three classic text classification models: Naive Bayes, Support Vector Machines and Logistic Regression. 

### Naive Bayes classifiers

![](https://miro.medium.com/max/1200/1*39U1Ln3tSdFqsfQy6ndxOA.png)

Naive Bayes classifiers are extremely simple classifiers that assume all features are independent of each other. They just learn how frequent all classes are and how frequently each feature occurs in a class. To classify a new text, they multiply the prior probability of each class $C_k$ by the probabilities of every feature $x_i$ given that class, and pick the class that gives the highest result: 

\begin{equation*}
\hat y = \operatorname*{argmax}_k\  p(C_k) \prod_{i=1}^n p(x_i \mid C_k)
\end{equation*}

Naive Bayes classifiers are very quick to train, but they often lag behind more advanced models in accuracy. 
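The decision rule above can be sketched from scratch on a made-up training set. Two practical details are assumptions of this sketch rather than part of the formula: log-probabilities are summed instead of multiplying probabilities (to avoid numerical underflow), and add-one smoothing (`alpha=1.0`, which is also the default in `MultinomialNB`) keeps unseen word/class combinations from getting zero probability.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, class label)
train_docs = [
    (["房間", "早餐", "房間"], "酒店"),
    (["房間", "服務"], "酒店"),
    (["蘋果", "新鮮"], "水果"),
    (["蘋果", "好吃", "新鮮"], "水果"),
]

class_counts = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for tokens, label in train_docs:
    word_counts[label].update(tokens)
vocab = {tok for tokens, _ in train_docs for tok in tokens}

def predict(tokens, alpha=1.0):
    scores = {}
    for label, n_label_docs in class_counts.items():
        total = sum(word_counts[label].values())
        # log prior: how frequent the class is
        score = math.log(n_label_docs / len(train_docs))
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            # smoothed estimate of p(token | class)
            p = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["房間", "新鮮", "早餐"]))
print(predict(["蘋果", "好吃"]))
```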

> Read more about this type of classifier on [GeeksforGeeks](https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar).

### Support Vector Machines

![](https://forum.huawei.com/enterprise/en/data/attachment/forum/202104/14/171007vu658w418brbj1hy.png?21.PNG)

Support Vector Machines are more sophisticated than Naive Bayes classifiers. They try to find the hyperplane in the feature space that best separates the data from the different classes. They do so by picking the hyperplane that maximizes the distance to the nearest data point on each side. When the classes are not linearly separable, kernel SVMs implicitly map the data into a higher-dimensional space where a linear separation can hopefully be found. The `LinearSVC` we use below skips this mapping and looks for a linear separator directly in the tf-idf feature space. SVMs often achieve very good performance in text classification tasks.
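The quantity being maximized can be illustrated with a small calculation. The hyperplane and the four labelled points below are made up; the sketch only evaluates the margin of one given hyperplane and does not train an SVM.

```python
import math

# Hypothetical 2-D hyperplane w·x + b = 0 and four labelled points (label is -1 or +1)
w, b = (1.0, 1.0), -3.0
points = [((1.0, 1.0), -1), ((0.0, 2.0), -1), ((3.0, 2.0), +1), ((2.0, 2.0), +1)]

def geometric_margin(x, y):
    # Signed distance to the hyperplane; positive when x lies on the side of its label
    return y * (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

margins = [geometric_margin(x, y) for x, y in points]
print([round(m, 3) for m in margins])
print("margin of this hyperplane:", round(min(margins), 3))
```

All margins are positive, so this hyperplane separates the two classes, and the smallest value is the distance to the nearest point. An SVM searches for the `w` and `b` that make this smallest distance as large as possible, while soft-margin variants such as `LinearSVC` tolerate some violations at a cost.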

> Read more about this type of classifier on [GeeksforGeeks](https://www.geeksforgeeks.org/support-vector-machine-algorithm/?ref=gcse)

### Logistic Regression

![](https://miro.medium.com/max/966/1*KoAzQLM1zDi5s9yTR9V6hw.png)

Logistic Regression models, finally, model the log-odds $l$, or $\log(p/(1-p))$, of a class as a linear model and estimate the parameters $\beta$ of the model during training: 

\begin{equation*}
l = \beta_0 + \sum_{i=1}^n \beta_i x_i
\end{equation*}

Like SVMs, they often achieve great performance in text classification.
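To turn the log-odds back into a probability, the model applies the logistic (sigmoid) function, $p = 1/(1 + e^{-l})$. The sketch below does this for one feature vector; the parameter values are made up for illustration, not learned from data.

```python
import math

def predict_proba(x, beta0, beta):
    # Linear model for the log-odds: l = beta0 + sum_i beta_i * x_i
    log_odds = beta0 + sum(b_i * x_i for b_i, x_i in zip(beta, x))
    # The sigmoid function maps log-odds to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical parameters and feature values
p = predict_proba(x=[0.5, 0.0, 0.2], beta0=-1.0, beta=[4.0, -2.0, 1.0])
print(round(p, 4))
```

In a one-vs-rest setup with several classes, one such model is trained per class, and the class whose model gives the highest score is predicted.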

> Read more about this type of classifier on [GeeksforGeeks](https://www.geeksforgeeks.org/understanding-logistic-regression/?ref=gcse)

We train our three classifiers in Scikit-learn with the `fit` method, passing it the tf-idf matrix of the training texts and the correct class for each text.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [None]:
nb_classifier = MultinomialNB()
svm_classifier = LinearSVC()
lr_classifier = LogisticRegression(multi_class="ovr") # for one-vs-rest
target_train = train['cat']

print("Training Naive Bayes classifier...")
nb_classifier.fit(dtmatrix_train, target_train)

print("Training SVM classifier...")
svm_classifier.fit(dtmatrix_train, target_train)

print("Training Logistic Regression classifier...")
lr_classifier.fit(dtmatrix_train, target_train)

Training Naive Bayes classifier...
Training SVM classifier...
Training Logistic Regression classifier...


LogisticRegression(multi_class='ovr')

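There are many more classifiers that you might want to experiment with. One worth trying is XGBoost. The `xgboost` library is preinstalled on Colab. Use the following snippet to build an XGBoost model. Because `XGBClassifier` expects integer class labels, the category names are encoded with `LabelEncoder` first; call `label_encoder.inverse_transform` on the predictions to get the category names back.

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects integer class labels (0, 1, 2, ...), so encode the category names
label_encoder = LabelEncoder()
xgb_classifier = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_classifier.fit(dtmatrix_train, label_encoder.fit_transform(target_train))
```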

## Simple evaluation

Let's find out how well each classifier performs. To find out, we have each classifier `predict` the label for all texts in our preprocessed test set.

In [None]:
nb_predictions = nb_classifier.predict(dtmatrix_test)
svm_predictions = svm_classifier.predict(dtmatrix_test)
lr_predictions = lr_classifier.predict(dtmatrix_test)

Now we can compute the accuracy of each model: the proportion of test texts for which the predicted label is the same as the target label. All three classifiers assign the correct label in roughly 87% to 88% of the cases, with Naive Bayes slightly ahead.

In [None]:
import numpy as np

In [None]:
target_test = test['cat']
print("Evaluation scores for the test data >>>")
print("NB Accuracy:", np.mean(nb_predictions == target_test))
print("SVM Accuracy:", np.mean(svm_predictions == target_test))
print("LR Accuracy:", np.mean(lr_predictions == target_test))

Evaluation scores for the test data >>>
NB Accuracy: 0.8776666666666667
SVM Accuracy: 0.8741333333333333
LR Accuracy: 0.8732666666666666


## Extensive evaluation

### Detailed scores

So far we've only looked at the accuracy of our models: the proportion of test examples for which their prediction is correct. This is fine as a first evaluation, but it doesn't give us much insight into what mistakes the models make and why. We'll therefore perform a much more extensive evaluation, in three steps. Let's start by computing the precision, recall and F1-score of the Naive Bayes classifier, the most accurate of our three models, for the individual classes:

- Precision is the number of times the classifier predicted a class correctly, divided by the total number of times it predicted this class. 
- Recall is the proportion of documents with a given class that were labelled correctly by the classifier. 
- The F1-score is the harmonic mean of precision and recall: $2PR/(P+R)$

The classification report below shows, for example, that the hotel class was the easiest to predict, while the shampoo class proved much more difficult. 
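The three definitions can be sketched directly from two lists of labels. The labels below are made up for illustration, and the sketch assumes every class is both present and predicted at least once, so no division by zero occurs.

```python
# Hypothetical true and predicted labels for six documents
y_true = ["酒店", "酒店", "酒店", "水果", "水果", "平板"]
y_pred = ["酒店", "酒店", "水果", "水果", "平板", "平板"]

def scores_for(label):
    # Documents of this class that were also predicted as this class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the class was predicted
    actual = sum(t == label for t in y_true)     # documents that truly have the class
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in sorted(set(y_true)):
    precision, recall, f1 = scores_for(label)
    print(f"{label}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```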

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print("Classification report on the Naive Bayes classifier >>>")
print(classification_report(target_test,   # true labels
                            nb_predictions # model predictions
                            )
)

Classification report on the Naive Bayes classifier >>>
              precision    recall  f1-score   support

          平板       0.86      0.81      0.83      3025
          水果       0.90      0.86      0.88      3036
         洗髮水       0.79      0.87      0.82      2968
          衣服       0.88      0.87      0.87      2985
          酒店       0.97      0.98      0.98      2986

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



### Confusion matrix

Second, we're going to visualize our results in even more detail, using a so-called confusion matrix. A confusion matrix helps us better understand the errors our classifier makes. Its rows display the actual labels, its columns show the predictions of our classifier. This means all correct predictions will lie on the diagonal, where the actual label and the predicted label are the same. The predictions elsewhere in the matrix help us understand what classes are often mixed up by our classifier. Our confusion matrix shows, for example, that 311 reviews with the label `平板` (tablet) incorrectly received the label `洗髮水` (shampoo).

In [None]:
import plotly.express as px

In [None]:
target_labels = sorted(target_test.unique().tolist())
target_labels

['平板', '水果', '洗髮水', '衣服', '酒店']

In [None]:
conf_matrix = confusion_matrix(target_test,    # true labels
                               nb_predictions, # model predictions
                               )
conf_matrix

array([[2449,   81,  311,  165,   19],
       [ 115, 2608,  199,  100,   14],
       [ 163,  121, 2572,   90,   22],
       [ 110,   87,  165, 2597,   26],
       [   9,    8,   21,    9, 2939]])

In [None]:
conf_matrix_df = pd.DataFrame(conf_matrix, 
                              index=target_labels, 
                              columns=target_labels)
conf_matrix_df

Unnamed: 0,平板,水果,洗髮水,衣服,酒店
平板,2449,81,311,165,19
水果,115,2608,199,100,14
洗髮水,163,121,2572,90,22
衣服,110,87,165,2597,26
酒店,9,8,21,9,2939


In [None]:
fig = px.imshow(conf_matrix_df)
fig.show()
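The numbers in the matrix can also be inspected programmatically. The sketch below copies the Naive Bayes confusion matrix printed above into a plain list, recomputes the overall accuracy from its diagonal, and looks up the largest off-diagonal cell, which is the most frequent confusion.

```python
labels = ["平板", "水果", "洗髮水", "衣服", "酒店"]
# Rows are true labels, columns are predicted labels (values copied from the output above)
matrix = [
    [2449, 81, 311, 165, 19],
    [115, 2608, 199, 100, 14],
    [163, 121, 2572, 90, 22],
    [110, 87, 165, 2597, 26],
    [9, 8, 21, 9, 2939],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
# Correct predictions lie on the diagonal
accuracy = sum(matrix[i][i] for i in range(n)) / total

# Largest off-diagonal cell: (count, true label, predicted label)
count, true_label, predicted_label = max(
    (matrix[i][j], labels[i], labels[j])
    for i in range(n)
    for j in range(n)
    if i != j
)
print(round(accuracy, 4), count, true_label, predicted_label)
```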

### Explainability

Finally, we'd like to perform a more qualitative evaluation of our model by taking a look at the features that it assigns the highest weight for each of the classes. This will help us understand if the model indeed captures the phenomena we'd like it to capture. A great Python library to do this is `eli5`, which works together seamlessly with `scikit-learn`. Its `explain_weights` function takes a trained model, a list of feature names and a list of target names, and displays the features that have the highest positive weights for each of the targets. The results convince us that our SVM indeed models the correct information: it sees a strong link between the `平板` (tablet) class and words such as _平板_ (tablet) and _螢幕_ (screen), or between the `酒店` (hotel) class and words such as _酒店_ (hotel), _房間_ (room), and _早餐_ (breakfast), and so on. 

In [None]:
!pip install -q eli5

In [None]:
import eli5

In [None]:
feature_names = tfidf_vec.get_feature_names_out()
feature_names[-30:]

array(['鼠疫', '鼻子', '鼻涕', '鼻血', '鼾聲', '齊備', '齊全', '齊鳴', '齊齊', '齷齪', '龍之夢',
       '龍卡', '龍城', '龍巖', '龍河', '龍湖', '龍灣', '龍熙', '龍眼', '龍華', '龍藝訂', '龍門',
       '龍閣', '龍陽', '龍頭', '龍騰', '龍鳳', '龐大', '龜爬', '龜速'], dtype=object)

In [None]:
len(feature_names)

32409

In [None]:
target_names = sorted(target_train.unique().tolist())
target_names

['平板', '水果', '洗髮水', '衣服', '酒店']

- Explaining the SVM model

In [None]:
eli5.explain_weights(svm_classifier, 
                     feature_names = feature_names,
                     target_names = target_names
                    )

Weight?,Feature
+4.915,華為
+4.849,平板
+4.350,螢幕
+3.439,解析度
+3.434,手機
+3.215,耳機
+3.072,機子
+2.997,機器
+2.745,音質
+2.650,宕機

Weight?,Feature
+6.023,蘋果
+4.386,水果
+4.356,好吃
+4.230,新鮮
+4.153,火龍果
+3.909,口感
+3.783,個頭
+3.494,生鮮
+3.324,果子
+2.540,難吃

Weight?,Feature
+5.019,洗髮
+3.383,沙宣
+3.259,清揚
+3.228,吹風機
+3.199,頭髮
+3.135,頭皮屑
+2.956,瓶子
+2.956,頭皮
+2.811,小瓶
+2.796,潘婷

Weight?,Feature
+5.868,褲子
+4.560,衣服
+3.335,尺碼
+3.157,穿著
+3.129,面料
+2.976,穿上
+2.746,掉色
+2.660,布料
+2.587,牛仔褲
+2.541,合身

Weight?,Feature
+6.393,房間
+6.272,酒店
+3.721,入住
+3.344,早餐
+3.114,環境
+2.912,機場
+2.894,設施
+2.673,位置
+2.672,交通
+2.666,衛生


- Explaining the logistic regression model

In [None]:
eli5.explain_weights(lr_classifier, 
                     feature_names = feature_names,
                     target_names = target_names
                    )

Weight?,Feature
+9.553,華為
+9.341,平板
+8.472,螢幕
+8.381,手機
+5.955,解析度
+5.313,耳機
+4.923,遊戲
+4.891,機子
+4.749,影片
+4.431,機器

Weight?,Feature
+14.622,蘋果
+9.311,水果
+9.077,新鮮
+9.011,好吃
+7.832,火龍果
+7.690,個頭
+6.991,口感
+5.551,果子
+5.480,生鮮
+4.511,難吃

Weight?,Feature
+10.129,洗髮
+7.257,頭髮
+6.151,頭皮屑
+6.052,沙宣
+5.787,清揚
+5.621,頭皮
+5.397,吹風機
+4.822,洗完
+4.737,瓶子
+4.665,小瓶

Weight?,Feature
+12.390,褲子
+9.764,衣服
+5.827,面料
+5.761,質量
+5.633,尺碼
+5.583,穿著
+5.478,穿上
+5.336,掉色
+5.034,布料
+4.776,顏色

Weight?,Feature
+14.956,酒店
+14.650,房間
+7.327,入住
+7.171,早餐
+6.521,環境
+5.917,機場
+5.858,設施
+5.568,位置
+5.546,衛生
+5.473,前臺


## Conclusion

This notebook has demonstrated how you can quickly train a text classifier. Although the types of models we've looked at predate the deep learning revolution in NLP, they're often a quick and effective way of training a first classifier for your text classification problem. Not only can they provide a good baseline and help you understand your data and problem better, but in some cases you may also find they are quite hard to beat even with state-of-the-art deep learning models.

# Assignment

Here are some ideas for the assignment. Pick one (or more) from the following list. 

*   Rerun the whole notebook, but use spaCy's built-in tokenizer instead. Then compare the difference in model performance.

In [None]:
# HINT: comment out the line that swaps in the custom tokenizer
# nlp.tokenizer = TwTokenizer(nlp.vocab)

*   Rerun the whole notebook, but train classifiers on the same dataset in simplified Chinese instead. Then compare the difference in model performance. Here's how to download the simplified version of the dataset.

In [None]:
# !wget -O reviews.csv https://github.com/howard-haowen/NLP-demos/raw/main/online_shopping_5_cats_sim.csv

*   Rerun the whole notebook, but instead of training models on all 35K samples of the training data, use just 25K and then 15K of them. Then compare the difference in model performance.

In [None]:
# HINT
# RANDOM_STATE = 500
# train = train.sample(n=25_000, random_state=RANDOM_STATE)

In [None]:
# HINT
# RANDOM_STATE = 500
# train = train.sample(n=15_000, random_state=RANDOM_STATE)





* Rerun the whole notebook, but this time set the `max_features` parameter of `TfidfVectorizer` to 20K and then to 10K. Then compare the difference in model performance.


In [None]:
# HINT
# tfidf_vec = TfidfVectorizer(max_features=20_000)

In [None]:
# HINT
# tfidf_vec = TfidfVectorizer(max_features=10_000)

- Use whatever parameters you consider appropriate to train sentiment classifiers that can distinguish positive reviews from negative ones. 

In [None]:
# HINT
# target_train = train['label'] # 1 for positive 0 for negative
# target_test = test['label'] # 1 for positive 0 for negative