# Quality metrics for NLP

Chatbot quality can be measured at several levels. In this notebook we first check the grammar and spelling of the chatbot's output. Next, we describe the measures that relate strictly to the machine learning models. The last part is dedicated to sentiment analysis, which is crucial in many cases.

## Grammar and spelling

There are several tools for checking spelling and grammar. We don't want our chatbot to reply with bad grammar or spelling errors. In Python we can use SpellChecker to check the spelling, pytypo to correct typos, and language_check to check the grammar of a given sentence. We should check the grammar and spelling as often as possible.

### Spell checking

Spell checking is one of the basic tools for checking the output of our chatbot. In many cases it is not needed; it is useful only for some generative chatbots.

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

words = ['sample', 'words', 'heri', 'here']

for word in words:
    print(spell.correction(word))
    print(spell.candidates(word))

### Fixing typos

We can also easily fix some simple typos with pytypo.

In [None]:
import pytypo

pytypo.correct_sentence('this traiining is great!!!')

### Grammar check

A more complex tool for checking grammar is LanguageTool, which supports more than 25 languages. It is written in Java, but Python wrappers such as language_check are available.

In [None]:
import language_check

tool = language_check.LanguageTool('en-US')

tool.check("the are trainings")

## Model quality measures

Quality measures are more about the input and output of the model. Depending on how we divide the data set, we can measure the quality of our model in different ways. The output can also be measured with several metrics, of which accuracy is the most popular.

### Dataset preparation

One of the common problems that every data scientist faces is how to divide the data set into training and testing data sets. To understand the following equations we need to introduce some notation. Let $\mathcal{L}_{n}$ be our training data set of size $n$, $T_{m}$ our testing data set of size $m$, $M_{e}$ the number of misclassified cases, $\mathcal{I}$ an indicator function that returns 1 if the condition in its argument holds and 0 otherwise, and $e(d)$ the error rate of classifier $d$. We also use the sets $X$ and $Y$ that we have already explained. We can write the error rate as follows:
\begin{equation}
e(d)=\frac{M_{e}}{m}.
\end{equation}
It is the complement of accuracy, which is described later in this section. The error rate is calculated differently depending on which method of data set preparation is used. There are a few commonly used approaches to handling the training, testing and validation data sets:

- resubstitution -- R-method,
- hold-out -- H-method,
- cross-validation -- $\pi$-method,
- bootstrap,
- jackknife.

The first method is a very simple one: we use the same data set for training and testing. It is not the best solution if we want a solid classifier. The error rate can be written as follows:
\begin{equation}
e_{R}(d)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{I}(d(X_{j};\mathcal{L}_{n})\neq Y_{j}).
\end{equation}
It means that we check the prediction for each element $j$ of our training data set and add 1 for each misclassified case. We then divide the sum by $n$, the number of elements in the training data set.
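
As a minimal sketch of the resubstitution estimate, the snippet below trains a toy one-dimensional nearest-centroid classifier and tests it on the very same data. The classifier, the helper names (`train_centroids`, `predict`, `resubstitution_error`) and the sample data are illustrative assumptions, not part of the original notebook.

```python
def train_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of every label."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def resubstitution_error(xs, ys):
    """Train and test on the same data set and return the error rate."""
    centroids = train_centroids(xs, ys)
    # Add 1 for every misclassified element, then divide by n
    misclassified = sum(predict(centroids, x) != y for x, y in zip(xs, ys))
    return misclassified / len(xs)


# Hypothetical data: one feature per element and a binary label
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 4.0]
ys = [0, 0, 0, 1, 1, 1, 1]
print(resubstitution_error(xs, ys))
```

Because the classifier has already seen every element it is tested on, this estimate tends to be optimistic.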

The second method divides the data set into two data sets, in half or in other proportions. One set is our training data set and the other is our testing data set. We can swap those sets and calculate the average of both error rates. The error rate can be calculated as follows:
\begin{equation}
e_{\tau}(\hat{d})=\frac{1}{m}\sum_{j=1}^{m}\mathcal{I}(\hat{d}(X_{j}^{t};\mathcal{L}_{n})\neq Y_{j}^{t}).
\end{equation}
Compared to the resubstitution method, it uses only the testing data set to calculate the error.

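
The hold-out procedure, including the swap of both sets, can be sketched as below. This is only an illustration: the toy nearest-centroid classifier, the helper names and the data are assumptions, and the two parts are fixed by hand here, while in practice the split is usually random.

```python
def train_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of every label."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def error_rate(centroids, xs, ys):
    """Fraction of the given elements that the classifier gets wrong."""
    misclassified = sum(predict(centroids, x) != y for x, y in zip(xs, ys))
    return misclassified / len(xs)


# Hypothetical data set, already divided into two parts (features, labels)
part_a = ([1.0, 2.0, 6.0, 7.0, 5.0], [0, 0, 1, 1, 0])
part_b = ([3.0, 8.0, 3.5, 5.0], [0, 1, 0, 1])

# Train on the first part and test on the second one
error_ab = error_rate(train_centroids(*part_a), *part_b)
# Swap the roles of both parts
error_ba = error_rate(train_centroids(*part_b), *part_a)
# Average of both hold-out estimates
average_error = (error_ab + error_ba) / 2
print(error_ab, error_ba, average_error)
```
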
Cross-validation is the most common approach. It is also called the rotation method. We divide the data set into $k$ subsets whose elements are chosen randomly. One of those subsets is taken as the testing set, while the others are merged into a training set. This is repeated $k$ times, once for each subset. The error rate can be calculated as follows, where $\mathcal{L}_{n}^{(-j)}$ denotes the data set without the subset that contains element $j$:
\begin{equation}
e_{CV}(d)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{I}(\hat{d}(X_{j};\mathcal{L}_{n}^{(-j)})\neq Y_{j}).
\end{equation}
A special case is when $k=n$. It means that each subset consists of just one element. This approach is known as leave-one-out or the U-method.

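
A minimal sketch of $k$-fold cross-validation, with leave-one-out as the special case, could look as follows. The toy nearest-centroid classifier, the function name `cross_validation_error`, the `seed` parameter and the data are assumptions made for illustration, and the sketch expects $k \geq 2$.

```python
import random


def train_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of every label."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def cross_validation_error(xs, ys, k, seed=0):
    """Estimate the error rate with k-fold cross-validation (k >= 2)."""
    indices = list(range(len(xs)))
    # Assign the elements to the k subsets at random
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    misclassified = 0
    for fold in folds:
        held_out = set(fold)
        # Train on all the elements outside the current subset
        train_xs = [xs[i] for i in indices if i not in held_out]
        train_ys = [ys[i] for i in indices if i not in held_out]
        centroids = train_centroids(train_xs, train_ys)
        # Test on the subset that was left out
        misclassified += sum(predict(centroids, xs[i]) != ys[i] for i in fold)
    return misclassified / len(xs)


# Hypothetical data: one feature per element and a binary label
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 4.0]
ys = [0, 0, 0, 1, 1, 1, 1]
print(cross_validation_error(xs, ys, k=3))
# Leave-one-out: every subset consists of a single element
print(cross_validation_error(xs, ys, k=len(xs)))
```
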
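The bootstrap method can be considered an extension of resubstitution. The goal is to generate multiple sets from the main set by random selection with replacement. We train the classifier on each generated set $\mathcal{L}_{n}^{\star b}$, $b=1,\dots,B$, test it on the elements $Z_{j}=(X_{j},Y_{j})$ that were not drawn into that set, and calculate the average error at the end: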
\begin{equation}
e_{B}(d)=\frac{1}{B}\sum_{b=1}^{B}\frac{\sum_{j=1}^{n}\mathcal{I}(Z_{j}\notin\mathcal{L}_{n}^{\star b})\mathcal{I}(d(X_{j};\mathcal{L}^{\star b}_{n})\neq Y_{j})}{\sum_{j=1}^{n}\mathcal{I}(Z_{j}\notin\mathcal{L}^{\star b}_{n})}.
\end{equation}
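
The bootstrap formula above can be sketched in a few lines. Here `n_samples` plays the role of $B$ and `left_out` holds the elements with $Z_{j}\notin\mathcal{L}_{n}^{\star b}$. The toy nearest-centroid classifier, the function name `bootstrap_error`, the data, and the decision to skip samples that leave no element out are assumptions made for illustration.

```python
import random


def train_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of every label."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def bootstrap_error(xs, ys, n_samples=200, seed=0):
    """Average the error on the elements left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(n_samples):
        # Draw n indices with replacement to build one bootstrap sample
        sample = [rng.randrange(n) for _ in range(n)]
        drawn = set(sample)
        left_out = [j for j in range(n) if j not in drawn]
        if not left_out:
            # Assumption: skip the rare samples that leave no element out
            continue
        centroids = train_centroids([xs[j] for j in sample],
                                    [ys[j] for j in sample])
        misclassified = sum(predict(centroids, xs[j]) != ys[j] for j in left_out)
        errors.append(misclassified / len(left_out))
    return sum(errors) / len(errors)


# Hypothetical data: one feature per element and a binary label
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 4.0]
ys = [0, 0, 0, 1, 1, 1, 1]
print(bootstrap_error(xs, ys))
```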

### Output quality metrics

There are several metrics to show the quality of our classification model:

- ROC that stands for Receiver Operating Characteristic curve,
- AUC -- Area Under Curve,
- $F_{1}$ score,
- Precision,
- Recall.

To calculate the metrics we need the confusion matrix:

|                      |condition positive |condition negative |
|----------------------|-------------------|-------------------|
|**predicted positive**|True Positive (TP) |False Positive (FP)|         
|**predicted negative**|False Negative (FN)|True Negative (TN) |

The most common metric is accuracy. It can be calculated as follows:
\begin{equation}
ACC=\frac{\#TP+\#TN}{\#TP+\#TN+\#FP+\#FN}.
\end{equation}
The next metric is the True Positive Rate (TPR). It can be calculated as follows:
\begin{equation}
TPR=\frac{\#TP}{\#TP+\#FN}.
\end{equation}
TPR is also called sensitivity or recall and measures the fraction of positive cases that were predicted correctly. By $\#TP, \#FP$ we mean the number of True Positive and False Positive cases. Its counterpart is specificity, also called TNR, which stands for True Negative Rate. It can be calculated as follows:
\begin{equation}
TNR=\frac{\#TN}{\#TN+\#FP}.
\end{equation}
It says how good we are at predicting negative cases. Another important metric is precision, also known as Positive Predictive Value (PPV):
\begin{equation}
PPV=\frac{\#TP}{\#TP+\#FP}.
\end{equation}
It is the ratio of correctly predicted positive cases to all cases predicted as positive, including those predicted incorrectly. Its counterpart is the Negative Predictive Value (NPV):
\begin{equation}
NPV=\frac{\#TN}{\#TN+\#FN}.
\end{equation}
We can also calculate the False Positive Rate (FPR), known as fall-out. It is the fraction of negative cases that were incorrectly predicted as positive:
\begin{equation}
FPR=1-TNR.
\end{equation}
The counterpart of FPR is the False Negative Rate (FNR), the fraction of positive cases that were incorrectly predicted as negative:
\begin{equation}
FNR=1-TPR.
\end{equation}
Another popular metric is the $F_{1}$ score, the harmonic mean of precision and recall. It takes PPV and TPR to calculate the score:
\begin{equation}
F_{1}=2\frac{PPV\cdot TPR}{TPR+PPV}.
\end{equation}
Like all the previous metrics, the $F_{1}$ value lies between 0 and 1; for $F_{1}$, 1 is the best.
An interesting measure is the Matthews Correlation Coefficient (MCC), which describes the correlation between observed and predicted values. The value of MCC is between -1 and 1. A perfect classifier gets MCC=1, a random classifier MCC=0, and a classifier that is always wrong MCC=-1. This measure can be calculated as follows:
\begin{equation}
MCC=\frac{\#TP\cdot\#TN-\#FP\cdot\#FN}{\sqrt{(\#TP+\#FP)(\#TP+\#FN)(\#TN+\#FP)(\#TN+\#FN)}}.
\end{equation}

In [None]:
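def calculate_quality_metrics(data_set, predicted_set):
    # data_set holds the true labels, predicted_set the predicted ones;
    # labels greater than 0 are treated as positive
    tp = tn = fp = fn = 0
    for actual, predicted in zip(data_set, predicted_set):
        if actual > 0:
            if actual == predicted:
                tp += 1
            else:
                # A positive case that was not predicted as such
                fn += 1
        else:
            if actual == predicted:
                tn += 1
            else:
                # A negative case that was not predicted as such
                fp += 1
    # The divisions below raise ZeroDivisionError if a denominator is 0
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    fpr = 1.0 - tnr
    fnr = 1.0 - tpr
    f1 = 2 * ppv * tpr / (tpr + ppv)
    return [acc, tpr, tnr, ppv, npv, fpr, fnr, f1]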
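
The function above does not return MCC. As a minimal sketch, it can be computed from the same four counts. The helper name `matthews_correlation` and the choice to return 0 when the denominator is zero are assumptions, not part of the original notebook.

```python
import math


def matthews_correlation(tp, tn, fp, fn):
    """Return MCC computed from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: report 0 when a row or column of the matrix is empty
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for three extreme classifiers
perfect = matthews_correlation(tp=50, tn=50, fp=0, fn=0)
always_wrong = matthews_correlation(tp=0, tn=0, fp=50, fn=50)
chance_level = matthews_correlation(tp=25, tn=25, fp=25, fn=25)
print(perfect, always_wrong, chance_level)
```

The three calls correspond to the perfect, totally wrong and random classifiers described earlier.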

## Sentiment analysis

If we want to deploy our chatbot to production, it is very important to measure the sentiment of both the customers' messages and the chatbot's replies. We don't want to send our customers a message with a negative sentiment. Two popular libraries for sentiment analysis are CoreNLP and TextBlob. Their models are trained on datasets that often differ from our data, so they do not always give the expected results. This is why we often need to build our own model. Before we build one, we try TextBlob to get the main idea of sentiment analysis.

In [None]:
example = "The weather is good outside."

We just get the sentiment for the example text:

In [None]:
from textblob import TextBlob

text = TextBlob(example)
text.sentiment

A negative polarity means a negative sentiment, a positive polarity means a positive sentiment; the value is between -1 and 1. The subjectivity tells whether the sentence is objective or subjective; its value is between 0 (objective) and 1 (subjective).
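
If a discrete label is needed instead of a continuous polarity, one possible approach is a simple threshold. The sketch below is only an illustration: the function name `polarity_to_label` and the threshold of 0.1 are assumptions, not values recommended by TextBlob.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity from [-1, 1] to -1 (negative), 0 (neutral) or 1 (positive)."""
    if polarity > threshold:
        return 1
    if polarity < -threshold:
        return -1
    # Values close to zero are treated as neutral
    return 0


# Hypothetical polarity values
for value in (0.7, 0.05, -0.5):
    print(value, polarity_to_label(value))
```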

### Own sentiment analysis library

We want to measure three different sentiments:

In [None]:
SENTIMENT_TO_LABEL_MAPPING = {
    "negative": -1,
    "neutral": 0,
    "positive": 1
}

We use a classic example: tweets from airline customers. First, we need to clean the tweets a bit by separating the emojis and punctuation marks from the words and collapsing duplicated characters.

In [None]:
import re
import html

# https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1
EMOJI_REGEX = re.compile("([\U00010000-\U0010ffff])", re.UNICODE)
DUPLICATED_LETTER_REGEX = re.compile(r"([^a-z0-9])\1+", re.UNICODE | re.I)
PUNCTUATION_MARKS_REGEX = re.compile(r"([,\.\!\?\[\]\(\)])", re.UNICODE)
LEADING_CHARACTER_REGEX = re.compile(r"([^a-z0-9])([a-z0-9][^\s]*)", re.UNICODE | re.I)


def preprocess_text(raw_text):
    # Convert all the letters to lowercase
    text = raw_text.lower()
    # Decode HTML entities, e.g. "&amp;" becomes "&"
    text = html.unescape(text)
    # Remove hashtag symbol and "at" for user mentions
    text = text.replace("#", "")
    text = text.replace("@", "")
    # Separate the emojis written in a row with spaces
    text = EMOJI_REGEX.sub("\\1 ", text)
    # Remove quotation marks
    text = text.replace("\"", "")
    text = text.replace("'", "")
    # Surround punctuation marks with spaces, so they become separate tokens
    text = PUNCTUATION_MARKS_REGEX.sub(" \\1 ", text)
    # Separate a leading special character from the word that follows it
    text = LEADING_CHARACTER_REGEX.sub("\\1 \\2", text)
    # Collapse runs of the same character other than a-z and 0-9
    # (including spaces) into a single occurrence
    text = DUPLICATED_LETTER_REGEX.sub("\\1", text)
    # Return preprocessed value
    return text.strip()

Before loading the tweets, make sure you have the dataset downloaded.

In [None]:
import pandas as pd

# Load the dataset
raw_tweets = pd.read_csv("datasets/twitter-airlines-sentiment.csv")

# Preprocess the data with the function declared previously; copy the
# selected columns so the assignments below do not modify a view
tweets = raw_tweets[["airline_sentiment", "text"]].copy()
tweets.columns = ["sentiment", "text"]
tweets["text"] = tweets["text"].map(preprocess_text)
tweets["sentiment"] = tweets["sentiment"].map(lambda x: SENTIMENT_TO_LABEL_MAPPING[x])

We use Random Forest as the classifier, but any other classifier could be used instead. We check it with different split criteria, numbers of estimators and numbers of features considered at each split.

![random forest](images/random-forest.png)

The code below can take some time to finish.

In [None]:
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


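N_ESTIMATORS = (5, 10, 25, 50, 100)
CRITERION = ("gini", "entropy")
# Note: "auto" means "sqrt" for classifiers; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, where "sqrt" has to be used instead
MAX_FEATURES = ("auto", "log2", None)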

# Divide the dataset into train and test fraction
train_messages, test_messages, train_targets, test_targets = train_test_split(tweets["text"], 
                                                                              tweets["sentiment"],
                                                                              test_size=0.2)

# Vectorize the preprocessed sentences once; the features do not depend
# on the classifier configuration
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_messages)
test_features = vectorizer.transform(test_messages)

for n_estimators, criterion, max_features in itertools.product(N_ESTIMATORS,
                                                               CRITERION,
                                                               MAX_FEATURES):
    # Define the classifier instance
    classifier = RandomForestClassifier(random_state=2018,
                                        n_estimators=n_estimators,
                                        criterion=criterion,
                                        max_features=max_features)

    # Train the model
    %time fit = classifier.fit(train_features.toarray(), train_targets)

    # Check the accuracy of the model on train and test data and display it
    train_predictions = fit.predict(train_features.toarray())
    train_accuracy = accuracy_score(train_targets, train_predictions)
    test_predictions = fit.predict(test_features.toarray())
    test_accuracy = accuracy_score(test_targets, test_predictions)
    print("Configuration: n_estimators = {}, criterion = {}, max_features = {}\n"
          "Train accuracy score: {}\n"
          "Test accuracy score: {}\n".format(n_estimators, criterion, max_features,
                                             train_accuracy, test_accuracy))

It turned out that the following configuration achieves the best accuracy on our test dataset:

n_estimators = 100, criterion = gini, max_features = auto

For that reason we are going to create a simple application that uses these parameters for training. It will be a console application that reads sentences from the user and classifies their sentiment.

#### Feature importance

One of the biggest advantages of the Random Forest classifier is its ability to describe the importance of the features it uses. This allows us to check which variables have the most predictive power and to understand how the model makes its decisions. The following code snippet visualizes the feature importance for our model: