# Workshop Notebook 2: Text Classification on True Voice Intent Dataset with PyThaiNLP



Updated: 31 October 2019


----

True Voice Intent Dataset : https://github.com/PyThaiNLP/truevoice-intent

Intent Dataset from Customer Service Phone Calls Transcribed by TrueVoice's Mari.


#### Required packages


Visualization
  - `matplotlib`
  - `seaborn`
  
Machine Learning
  - `scikit-learn`
  
Dataframe, Data structure
  - `pandas`
  - `numpy`

In [None]:
!pip install --upgrade --user -q --pre pythainlp

In [None]:
!pip install --user -q matplotlib==3.1.0 numpy pandas scikit-learn seaborn

In [None]:
import os
import re
from functools import partial

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.preprocessing import Normalizer

from sklearn.svm import LinearSVC
from pythainlp.tokenize import word_tokenize
from pythainlp.ulmfit import ungroup_emoji

import matplotlib.pyplot as plt
import seaborn as sns

## 1. Explore the dataset



### 1.1 Load dataset

In [None]:
TRUEVOICE_INTENT_DIR = "../data/truevoice_intent"

truevoice_dataset_path = { 
    "train": os.path.join(TRUEVOICE_INTENT_DIR, "mari_train.csv"),
    "test": os.path.join(TRUEVOICE_INTENT_DIR, "mari_test.csv")
}
truevoice_dataset_path


In [None]:
truevoice_dataset = {
    "train": pd.read_csv(truevoice_dataset_path["train"]),
    "test": pd.read_csv(truevoice_dataset_path["test"])
}

### 1.2 Customer voice transcription and destination

In [None]:
truevoice_dataset["train"].head(10)

### 1.3 Data Statistics

#### Number of examples in the training and test sets

In [None]:
truevoice_dataset["train"].describe()

In [None]:
truevoice_dataset["test"].describe()

#### Percentage of class labels

In [None]:
for set_name in ["train", "test"]:
    print("set:", set_name)
    print("")
    print(truevoice_dataset[set_name]['destination'].value_counts() / truevoice_dataset[set_name].shape[0] * 100)
    print("\n\n")

## 2. Data Preprocessing

In this tutorial, we will use `CountVectorizer` from scikit-learn [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

- `CountVectorizer` converts a collection of text documents to a matrix of token counts

```python
documents = ["ฉันขึ้นรถไฟ", "ฉันชอบรถไฟฟ้า", "ฉันชอบรถไฟ รถไฟ"]

vocabulary = {'ขึ้น', 'ฉัน', 'ชอบ', 'รถไฟ', 'รถไฟฟ้า'}
```
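
To make the idea of a "matrix of token counts" concrete, here is a minimal sketch in plain Python. It assumes the three documents have already been split into tokens by hand (in the notebook this is done by `word_tokenize`), and `count_vector` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from collections import Counter

# Hand-tokenized versions of the three example documents (an assumption made
# for illustration; the notebook obtains tokens from `word_tokenize`).
tokenized_documents = [
    ["ฉัน", "ขึ้น", "รถไฟ"],
    ["ฉัน", "ชอบ", "รถไฟฟ้า"],
    ["ฉัน", "ชอบ", "รถไฟ", "รถไฟ"],
]

# The vocabulary is the sorted list of distinct tokens; each token gets one column.
vocabulary = sorted({token for tokens in tokenized_documents for token in tokens})


def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary term occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# One row per document, one column per vocabulary term.
count_matrix = [count_vector(tokens, vocabulary) for tokens in tokenized_documents]

for tokens, row in zip(tokenized_documents, count_matrix):
    print(tokens, row)
```

Each row has one entry per vocabulary term, so the third document gets a count of 2 in the column for `รถไฟ`.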


In [None]:
documents = ["ฉันขึ้นรถไฟ", "ฉันชอบรถไฟฟ้า", "ฉันชอบรถไฟ รถไฟ"]

In [None]:
vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(documents)

In [None]:
# `get_feature_names()` was removed in scikit-learn 1.2; use `get_feature_names_out()`.
vectorizer.get_feature_names_out()

In [None]:
X.toarray()

In [None]:
print(vectorizer.get_feature_names_out(), "\n")
for i, document in enumerate(documents):
    tokens = word_tokenize(document)
    print(document,"\n", tokens,"\n", X.toarray()[i])
    print("")
                             


#### Define a function to process texts

In [None]:
def process_text(text):
    text = text.lower()
    words = word_tokenize(text, keep_whitespace=False)    
    return words

In [None]:
process_text("Hello ฉัน")

## 3. Model Training

#### 3.1 Define preprocessing/training pipeline

1. Vectorize text with `CountVectorizer`.
2. Normalize Count Vector with L2 norm.
3. Fit the training data with __Linear Support Vector Classification__ ([LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)).
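
Step 2 rescales every count vector so that its Euclidean (L2) length is 1, which stops long transcriptions from dominating short ones. The following is a minimal sketch of that rescaling in plain Python; `l2_normalize` is a hypothetical helper for illustration, and the input vector is made up.

```python
import math


def l2_normalize(vector):
    """Divide each component by the vector's L2 norm (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return list(vector)
    return [value / norm for value in vector]


counts = [0, 1, 1, 2, 0]           # an illustrative count vector
normalized = l2_normalize(counts)  # same direction, unit length

print(normalized)
print(sum(value * value for value in normalized))  # squared length, approximately 1
```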


![title](images/svc.png)

Image from: https://scikit-learn.org/stable/modules/svm.html#svm-classification

In [None]:
classifier = Pipeline([
    ('count_vectorizer', CountVectorizer(tokenizer=process_text,
                                         ngram_range=(1,2))),
    ('normalizer', Normalizer()),
    ('classifier', LinearSVC(max_iter=25000, random_state=1, class_weight="balanced")),
])
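
The `ngram_range=(1,2)` argument asks `CountVectorizer` to count single tokens (unigrams) as well as pairs of adjacent tokens (bigrams). A minimal sketch of that expansion, assuming already-tokenized input; `word_ngrams` is a hypothetical helper for illustration and not a scikit-learn function:

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of length min_n..max_n, joining tokens with a space."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Three hand-written tokens used as an illustrative input.
features = word_ngrams(["ฉัน", "ชอบ", "รถไฟ"])
print(features)
```

Bigrams let the classifier use short phrases as features, at the cost of a larger vocabulary.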


In [None]:
X_train, y_train = truevoice_dataset["train"]['texts'], truevoice_dataset["train"]["destination"]
X_test, y_test = truevoice_dataset["test"]['texts'], truevoice_dataset["test"]["destination"]

#### 3.2 Fit the training data

In [None]:
classifier.fit(X_train, y_train)

#### 3.3 Make predictions on the test data

In [None]:
predictions = classifier.predict(X_test)

### Example predictions

In [None]:
for index, x in enumerate(X_test[0:5]):
    print("question: {}".format(x))
    print("groundtruth: {}".format(y_test[index]))
    print("prediction: {}".format(predictions[index]))
    print("")

#### __Question 1:__ How many examples in the test set `y_test` are predicted incorrectly?


Hint:
```python
>>> print(predictions.shape, y_test.shape)
(3236,) (3236,)

>>> print(predictions[0])
promotions

>>> print(y_test[0])
promotions
```

In [None]:
## Write down the code to find the answer








##

__Solution__:

In [None]:
indices = X_test[predictions != y_test].index
print("Number of examples predicted incorrectly = {}".format(len(indices)))

In [None]:
for index in indices[:15]:
    print(index, X_test.iloc[index])
    print(" groundtruth:", y_test.iloc[index])
    print(" prediction:", predictions[index])
    print("")

## 4. Model Evaluation

In [None]:
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(handle_unknown='ignore')

#### 4.1 Metrics

__Per-class Accuracy__

$$
\text{class_accuracy}(i) = \frac{\text{Number of samples whose membership in class } i \text{ is predicted correctly}}{\text{Number of samples}}
$$


__Per-class F1__

$$
\text{class_f1}(i) = \frac{ 2 \cdot (\text{class_precision}(i) \cdot \text{class_recall}(i)) }{ \text{class_precision}(i) + \text{class_recall}(i) }
$$


__Per-class Precision__

$$
\text{class_precision}(i) =  \frac{\text{Number of correct predictions for class } i}{\text{Number of correct predictions for class } i + \text{Number of samples in other classes predicted as class } i \text{ (False Positive)}}
$$

__Per-class Recall__

$$
\text{class_recall}(i) =  \frac{\text{Number of correct predictions for class } i}{\text{Number of correct predictions for class } i + \text{Number of samples in class } i \text{ predicted as other classes (False Negative)}}
$$
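
As a worked example of the precision, recall and F1 definitions, the sketch below counts true positives, false positives and false negatives for one class in plain Python. The labels and predictions are made-up toy data (only `promotions` is taken from this notebook's output), and `per_class_metrics` is a hypothetical helper for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_positive = sum(1 for t, p in pairs if t == label and p == label)
    false_positive = sum(1 for t, p in pairs if t != label and p == label)
    false_negative = sum(1 for t, p in pairs if t == label and p != label)

    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: five examples, three hypothetical classes.
y_true = ["billing", "billing", "promotions", "promotions", "internet"]
y_pred = ["billing", "promotions", "promotions", "promotions", "billing"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "promotions")
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
```

For `promotions` there are 2 true positives, 1 false positive and 0 false negatives, giving precision 2/3, recall 1 and F1 0.8.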

----

Reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

In [None]:
# OneHotEncoder expects a 2-D array of shape (n_samples, 1), so convert the Series first.
y_test_2d = truevoice_dataset["test"]["destination"].to_numpy()[:, None]
onehot_encoder_fit = onehot_encoder.fit(y_test_2d)
predictions_onehot = onehot_encoder_fit.transform(predictions[:, None]).toarray()

y_onehot = onehot_encoder_fit.transform(y_test_2d).toarray()

nb_class = y_onehot.shape[1]  # one column per class
for i in range(nb_class):
    print("Class: ", i)
    print("Accuracy: {:.2f} ".format((predictions_onehot[:, i] == y_onehot[:, i]).mean()))
    # scikit-learn metrics take (y_true, y_pred); the order matters for precision and recall.
    print("F1-score: {:.2f} ".format(f1_score(y_onehot[:, i], predictions_onehot[:, i])))
    print("Precision: {:.2f} ".format(precision_score(y_onehot[:, i], predictions_onehot[:, i])))
    print("Recall: {:.2f} ".format(recall_score(y_onehot[:, i], predictions_onehot[:, i])))
    print("")

#### Overall Accuracy

$$
\text{Overall accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of samples}}
$$
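
The formula amounts to the fraction of positions where the prediction equals the ground truth. A minimal sketch in plain Python, using made-up toy labels; `overall_accuracy` is a hypothetical helper for illustration.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the ground-truth label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy data: 3 of the 5 predictions match the ground truth.
toy_true = ["billing", "billing", "promotions", "promotions", "internet"]
toy_pred = ["billing", "promotions", "promotions", "promotions", "billing"]

toy_accuracy = overall_accuracy(toy_true, toy_pred)
print("Overall accuracy: {:.2f}".format(toy_accuracy))
```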

In [None]:
print("Overall accuracy ")
accuracy_score(y_onehot, predictions_onehot)

#### Confusion Matrix



Change from one hot encoding (e.g. `[0, 0, 0, 0, 0, 0, 1]`)
to the original label (e.g. `"true money"`).
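
The round trip between a label and its one-hot vector can be sketched in a few lines of plain Python. The three category names below are an illustrative subset chosen for this example, and `one_hot` and `inverse_one_hot` are hypothetical helpers, not scikit-learn functions.

```python
# Categories in sorted order, mirroring how OneHotEncoder orders them by default.
categories = sorted(["promotions", "true money", "billing"])


def one_hot(label, categories):
    """Return a vector with a 1 in the position of `label` and 0 elsewhere."""
    return [1 if category == label else 0 for category in categories]


def inverse_one_hot(vector, categories):
    """Return the category whose position holds the 1."""
    return categories[vector.index(1)]


encoded = one_hot("true money", categories)
decoded = inverse_one_hot(encoded, categories)
print(encoded, "->", decoded)
```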

In [None]:
predictions_orig = onehot_encoder.inverse_transform(predictions_onehot)
predictions_orig

In [None]:
y_orig = onehot_encoder.inverse_transform(y_onehot)
y_orig

In [None]:
confusion_matrix(y_orig, predictions_orig)
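
To read the matrix, note that in scikit-learn's convention row `i` is the true class and column `j` is the predicted class. The sketch below builds the same kind of table in plain Python from made-up toy labels; `build_confusion_matrix` is a hypothetical helper for illustration.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of examples with true label labels[i] predicted as labels[j]."""
    position = {label: index for index, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix


# Toy data with three hypothetical classes.
toy_labels = ["billing", "internet", "promotions"]
toy_true = ["billing", "billing", "promotions", "promotions", "internet"]
toy_pred = ["billing", "promotions", "promotions", "promotions", "billing"]

toy_matrix = build_confusion_matrix(toy_true, toy_pred, toy_labels)
for label, row in zip(toy_labels, toy_matrix):
    print(label, row)
```

Correct predictions sit on the diagonal, so off-diagonal cells show which classes are confused with each other.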

In [None]:
labels = list(onehot_encoder.categories_[0])
print("labels", labels)


plt.figure(figsize = (8,8))
ax = plt.subplot(111, aspect = 'equal')

sns.heatmap(confusion_matrix(y_orig, predictions_orig),
            annot=True, cmap="rocket", fmt="d",
            xticklabels=labels,
            yticklabels=labels,
            square=True)

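# Rows are ground-truth labels; columns are predicted labels.
ax.set_xlabel("Predicted label")
ax.set_ylabel("Ground-truth label")

# Workaround for matplotlib 3.1.1, which cuts off the top and bottom rows of the heatmap.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)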