# Intro to Text Classification

In this notebook we experiment with methods to transform sentences into vectors, and use vectors for text classification.

While a lot of contemporary NLP is neural, discussing simpler techniques first will help us introduce some terminology and build some intuition about what works, what doesn't, and why.

You are encouraged to play around with the code and modify or rebuild parts of it as you see fit: there is NO substitute for "tinkering with code" to understand how all the concepts fit together. Note that all this code is written for pedagogical purposes, so some functions are re-used from previous lectures to keep the notebook self-sufficient.

In [None]:
# some global imports
# we import more specific libraries when they are needed
import json
import glob
import os
import numpy as np
import pandas as pd
from collections import Counter
from matplotlib import pyplot as plt
from random import choice

In [None]:
%matplotlib inline

## Data loading

In [None]:
# make sure you have the datasets library installed
# see: https://github.com/huggingface/datasets

# !pip install datasets

In [None]:
import string

# some utils function
def get_finance_sentiment_dataset(split: str='sentences_allagree'):
    # load financial dataset from HF
    from datasets import load_dataset
    # https://huggingface.co/datasets/financial_phrasebank
    # by default, load just sentences for which all annotators agree
    dataset = load_dataset("financial_phrasebank", split)
    
    return dataset['train']


def get_finance_sentences():
    dataset = get_finance_sentiment_dataset()
    cleaned_dataset = [[pre_process_sentence(_['sentence']), _['label']] for _ in dataset]
    # debug 
    print("{} cleaned sentences from finance dataset\n".format(len(cleaned_dataset)))
    
    return cleaned_dataset


def pre_process_sentence(sentence: str):
    # these choices are VERY important. Here, we take a simplified
    # view: remove punctuation and lower-case everything
    lower_sentence = sentence.lower()
    # remove punctuation
    # nice suggestion from https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
    # if we change the exclude set, we can control what to exclude
    exclude = set(string.punctuation)
    return ''.join(ch for ch in lower_sentence if ch not in exclude)

In [None]:
finance_dataset = get_finance_sentences()
# print out the first items in the dataset, to check the format
finance_dataset[:2]

In [None]:
# get sentences without label for vectorizer part
finance_dataset_sentences = [_[0] for _ in finance_dataset]

## Introducing count and tf-idf vectorizer

_Computers do NOT understand words, so the first thing to do is to map the sentences found in the dataset to vectors of numbers._

The oldest trick for manipulating text is the so-called "vector space model" (still very much in use in Information Retrieval), which is simply the idea that you can map documents to vectors and use "geometric notions" to compute interesting metrics, such as semantic similarity. The oldest mapping between words and vectors is a simple count transformation:

D1: 'The bill is passed.'
D2: 'Jesse proposed the bill.'

D1: The=1, bill=1, is=1, passed=1, Jesse=0, proposed=0
D2: The=1, bill=1, is=0, passed=0, Jesse=1, proposed=1
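
The count transformation above can be reproduced by hand in a few lines. This is a minimal sketch, assuming the two sentences have already been lower-cased and stripped of punctuation; `build_vocabulary` and `count_vector` are hypothetical helper names, not part of any library.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Collect the unique tokens, in order of first appearance."""
    vocabulary = []
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def count_vector(tokens, vocabulary):
    """Map a tokenized document to a list of per-word counts."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# lower-cased, punctuation-free versions of D1 and D2
d1 = "the bill is passed".split()
d2 = "jesse proposed the bill".split()

vocabulary = build_vocabulary([d1, d2])
d1_vector = count_vector(d1, vocabulary)
d2_vector = count_vector(d2, vocabulary)

print(vocabulary)
print(d1_vector)
print(d2_vector)
```

Each document becomes a vector with one slot per vocabulary word, which is exactly the table shown for D1 and D2.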

A more sophisticated version is TF-IDF, which is the foundation of most search engines (SOLR, Elastic etc.): the vector entries are no longer raw counts but weights, as we discussed in class: words that appear in many documents are weighted down, rarer words are weighted up.
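
To make the idea of "weights" concrete, here is a hand-rolled sketch of TF-IDF on the two toy sentences. The smoothed IDF formula and the L2 normalization are meant to mirror scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), but treat the details as an assumption and check the documentation before relying on them; `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # document frequency: in how many documents each word appears
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # smoothed inverse document frequency (assumed to match the sklearn default)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[w] * idf[w] for w in vocabulary]
        # L2-normalize so that every document vector has length 1
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        vectors.append([x / norm for x in weights])
    return vocabulary, vectors


toy_docs = ["the bill is passed".split(), "jesse proposed the bill".split()]
toy_vocabulary, toy_vectors = tfidf_vectors(toy_docs)

for word, weight in zip(toy_vocabulary, toy_vectors[0]):
    print("{:>10}: {:.3f}".format(word, weight))
```

Words shared by both documents ("the", "bill") end up with a lower weight than words that appear in only one of them ("is", "passed").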

Let's see the difference with some Python examples.

In [None]:
# difference between count and tf-idf vectors
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
# example docs
docs = finance_dataset_sentences[:2]
# instantiate the vectorizer object
countvectorizer = CountVectorizer(analyzer='word')
tfidfvectorizer = TfidfVectorizer(analyzer='word')
# convert the documents into a matrix
count_wm = countvectorizer.fit_transform(docs)
tfidf_wm = tfidfvectorizer.fit_transform(docs)
# display the difference
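# get_feature_names() was removed in scikit-learn 1.2:
# get_feature_names_out() is the replacement (available since 1.0)
count_tokens = countvectorizer.get_feature_names_out()
tfidf_tokens = tfidfvectorizer.get_feature_names_out()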
df_countvect = pd.DataFrame(data=count_wm.toarray(),
                            index=['Doc{}'.format(_) for _ in range(len(docs))], 
                            columns=count_tokens)
df_tfidfvect = pd.DataFrame(data=tfidf_wm.toarray(),
                            index=['Doc{}'.format(_) for _ in range(len(docs))], 
                            columns=tfidf_tokens)
print("Count Vectorizer\n")
print(df_countvect)
print("\nTF-IDF Vectorizer\n")
print(df_tfidfvect)

* Important observation # 1: the size of the vector for each doc is equal to the vocabulary size, so it grows as we add documents with new words.
* Important observation # 2: the vectors are mostly zeros (it is a "sparse" representation). As we shall see in future lectures, neural networks produce "dense" representations.

In [None]:
# see how vector dimension changes with more docs
docs = finance_dataset_sentences[:100]
countvectorizer = CountVectorizer(analyzer='word')
count_wm = countvectorizer.fit_transform(docs)
count_tokens = countvectorizer.get_feature_names_out()
df_countvect = pd.DataFrame(data=count_wm.toarray(), 
                            index=['Doc{}'.format(_) for _ in range(len(docs))], 
                            columns=count_tokens)
print("Count Vectorizer\n")
print(df_countvect)

If documents are points in a |V|-dimensional space (where |V| is the vocabulary size), their similarity can be measured by the angle between their vectors: we first show how to compute cosine similarity, and then wrap it up in a function:

In [None]:
# calculate similarity using cosine similarity
from sklearn.metrics.pairwise import linear_kernel
docs = finance_dataset_sentences[:5]
tfidfvectorizer = TfidfVectorizer(analyzer='word')
tfidf_wm = tfidfvectorizer.fit_transform(docs)
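# TfidfVectorizer L2-normalizes each row by default, so the plain dot
# product computed by linear_kernel is already the cosine similarity
cosine_similarities = linear_kernel(tfidf_wm, tfidf_wm)
# print out for debug - we keep the full (5 x 5) matrix: the diagonal is 1
# as all documents are perfectly similar to themselves
cosine_similarities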
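
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, applied to the count vectors of D1 and D2 from the toy example (the helper name `cosine_similarity` is ours):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # by convention, an all-zero vector is similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# count vectors over the vocabulary (the, bill, is, passed, jesse, proposed)
d1_counts = [1, 1, 1, 1, 0, 0]
d2_counts = [1, 1, 0, 0, 1, 1]

self_similarity = cosine_similarity(d1_counts, d1_counts)
cross_similarity = cosine_similarity(d1_counts, d2_counts)
print(self_similarity, cross_similarity)
```

A document compared with itself scores 1, while D1 and D2, which share two of their four words, score 0.5.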

In [None]:
# wrap up similarity calculation in a function
def most_similar_docs_tf_idf(target_doc_index, docs, tfidf, top_k=3, debug=False):
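    # similarity between the target doc and every doc in the collection
    sims = linear_kernel(tfidf[target_doc_index:target_doc_index + 1], tfidf).flatten()
    # argsort is ascending, so we walk backwards from the end: this returns
    # top_k + 1 indices, since the target doc itself typically ranks first
    indices = sims.argsort()[:-top_k-2:-1]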
    
    if debug:
        print(indices)
    
    return [docs[i] for i in indices]
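
The slice `[:-top_k-2:-1]` is compact but cryptic. The sketch below replays it on a plain Python list, which supports the same slicing semantics; `ascending_order` stands in for the output of `argsort()` and the similarity values are made up for illustration.

```python
# hypothetical similarities of doc 1 against five docs (doc 1 included)
toy_sims = [0.1, 1.0, 0.4, 0.7, 0.0]
top_k = 2

# indices sorted by increasing similarity, like numpy's argsort
ascending_order = sorted(range(len(toy_sims)), key=lambda i: toy_sims[i])

# start from the last element and step backwards, stopping before
# position -(top_k + 2): this keeps the top_k + 1 most similar indices
top_indices = ascending_order[:-top_k - 2:-1]
print(ascending_order)
print(top_indices)
```

The first index returned is the target document itself, which is why the calling code drops it with `[1:]`.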

In [None]:
for idx, d in enumerate(docs):
    print("\n======== Most similar docs to: {}\n".format(d))
    print(most_similar_docs_tf_idf(idx, docs, tfidf_wm)[1:])

_Let's run on the whole dataset now!_

In [None]:
all_text_data = finance_dataset_sentences

* Important observation: the scikit-learn vectorizer comes with MANY options (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). As we discussed in class, there are non-equivalent ways of cleaning text, tokenizing it, etc. The removal of stop words, how frequent a word should be to be included in the model, and what n-gram size to use are all governed by parameters that can be tweaked (`stop_words`, `min_df` / `max_df` and `ngram_range`, respectively).
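
To see how much these choices change the resulting feature set, here is a small plain-Python sketch of the three knobs: stop word removal, a minimum document frequency, and n-grams. It is an illustration of the concepts, not of scikit-learn's internals; the stop word list, the helper names and the toy sentences are all hypothetical.

```python
from collections import Counter

# hypothetical, tiny stop word list (real lists are much longer)
TOY_STOP_WORDS = frozenset({"the", "of", "to", "a", "in"})


def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def build_features(sentences, stop_words=frozenset(), min_df=1, n_max=1):
    """Features kept after stop word removal and a document-frequency cut."""
    doc_freq = Counter()
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in stop_words]
        # count each n-gram at most once per document
        doc_freq.update(set(extract_ngrams(tokens, 1, n_max)))
    return sorted(f for f, df in doc_freq.items() if df >= min_df)


toy_sentences = [
    "the profit of the company rose",
    "the profit of the company fell",
    "sales in the quarter rose",
]

all_words = build_features(toy_sentences)
no_stop_words = build_features(toy_sentences, stop_words=TOY_STOP_WORDS)
frequent_only = build_features(toy_sentences, stop_words=TOY_STOP_WORDS, min_df=2)
with_bigrams = build_features(toy_sentences, stop_words=TOY_STOP_WORDS, n_max=2)

print(len(all_words), len(no_stop_words), len(frequent_only), len(with_bigrams))
print(frequent_only)
```

Even on three sentences, the vocabulary shrinks or grows substantially depending on the settings, so the same classifier may see very different inputs.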

In [None]:
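finance_tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
# make sure we fit the vectorizer we just created (with stop word removal),
# not the one left over from the previous cells
finance_tfidf_wm = finance_tfidfvectorizer.fit_transform(all_text_data)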

We pick a random sentence from our data, and compute the most similar sentences according to TF-IDF...

In [None]:
from random import randint
# randint includes both endpoints, so the upper bound is len - 1
rnd_idx = randint(0, len(all_text_data) - 1)
print("\n======== Most similar docs to:\n {}-{}\n\n".format(rnd_idx, all_text_data[rnd_idx]))
top_docs = most_similar_docs_tf_idf(rnd_idx, all_text_data, finance_tfidf_wm, debug=True)
for t in top_docs[1:]:
    print(t, '\n')

## Applied Vectorization: Text Classification

Once we have vectors, we can proceed to build a classifier pretty much exactly like you did for other supervised learning projects in the class.

In [None]:
from sklearn.model_selection import train_test_split

finance_dataset_text = [_[0] for _ in finance_dataset]
finance_dataset_label = [_[1] for _ in finance_dataset]
all_labels = set(finance_dataset_label)
print("All labels are: {}".format(all_labels))
X_train, X_test, y_train, y_test = train_test_split(finance_dataset_text, 
                                                    finance_dataset_label, 
                                                    test_size=0.1, 
                                                    random_state=42)

In [None]:
print(len(X_train), len(X_test), X_train[:3], y_train[:3])

In [None]:
final_tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
final_tfidf_train = final_tfidfvectorizer.fit_transform(X_train)
final_tfidf_train.shape

In [None]:
X_test_transformed = final_tfidfvectorizer.transform(X_test)

#### Bonus: Chi-Square selection

If the model is too big (remember: the model grows with vocabulary size!), we can check which words are most predictive of the target label, and trim the model accordingly (i.e. we are doing feature selection on our text data - see for example https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
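
The statistic behind this selection can be sketched by hand: for each word, compare the per-class totals we observe with the totals we would expect if the word were independent of the label. The code below follows our understanding of how the score is computed in scikit-learn, so treat it as an illustration rather than a reference implementation; the count matrix and labels are made up.

```python
def chi2_scores(X, y):
    """Chi-square score of each column of X (non-negative) against labels y."""
    classes = sorted(set(y))
    n_samples = len(X)
    n_features = len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # total weight of feature j inside class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # what we would expect if feature and class were independent
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_totals[j]
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# hypothetical counts for the words: profit, loss, the
toy_words = ["profit", "loss", "the"]
toy_X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
toy_y = [1, 1, 0, 0]

toy_scores = chi2_scores(toy_X, toy_y)
# keep the k words with the highest score
top_two = sorted(range(len(toy_scores)), key=lambda j: toy_scores[j], reverse=True)[:2]
print(dict(zip(toy_words, toy_scores)))
print([toy_words[j] for j in top_two])
```

A word like "the", spread evenly across classes, scores zero and is the first to be trimmed, while "profit" and "loss" are kept.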

In [None]:
# get_feature_names() was removed in scikit-learn 1.2, use get_feature_names_out()
feature_names = final_tfidfvectorizer.get_feature_names_out()
print(len(feature_names), feature_names[-3:])

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=25)
X_chi_train = ch2.fit_transform(final_tfidf_train, y_train)
X_chi_test = ch2.transform(X_test_transformed)
new_feature_names =  np.array(feature_names)[ch2.get_support()]
print(len(new_feature_names), new_feature_names[:10])

_We now instantiate a classifier, train it, and then predict unseen test cases as usual._

In [None]:
from sklearn.naive_bayes import MultinomialNB
# from https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

model = MultinomialNB()
# train and test the classifier
model.fit(final_tfidf_train, y_train)
predicted = model.predict(X_test_transformed)
predicted_prob = model.predict_proba(X_test_transformed)

In [None]:
# debug
print(predicted[:3], predicted_prob[:3])
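
To build some intuition on what a multinomial Naive Bayes model computes, here is a minimal sketch in plain Python. It works on raw word counts with add-one (Laplace) smoothing, whereas the cell above feeds TF-IDF weights to scikit-learn, so the numbers will not match `MultinomialNB` exactly; the function names, sentences and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(tokenized_docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(tokenized_docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def predict_multinomial_nb(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (unknown words are skipped)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


toy_train = ["profit rose", "sales rose sharply", "profit fell", "operating loss widened"]
toy_train_labels = ["positive", "positive", "negative", "negative"]

priors, likelihoods = train_multinomial_nb([s.split() for s in toy_train], toy_train_labels)
prediction_loss = predict_multinomial_nb("loss widened".split(), priors, likelihoods)
prediction_sales = predict_multinomial_nb("sales rose".split(), priors, likelihoods)
print(prediction_loss, prediction_sales)
```

The model simply adds up, for each class, how "typical" every word of the sentence is for that class, and picks the class with the highest total.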

* Important observation: thanks to scikit-learn's consistent API, we can re-run this code swapping in and out different classifiers, and compare their performance easily - for example, we could try a LinearSVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)!
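
A minimal sketch of such a swap, on a made-up toy dataset so that it runs on its own (the sentences and labels are hypothetical and far too few to say anything about real performance):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# hypothetical toy data, just to exercise the API
toy_train = [
    "profit rose in the first quarter",
    "sales increased sharply",
    "operating loss widened",
    "revenue fell compared to last year",
]
toy_labels = [2, 2, 0, 0]
toy_test = ["profit increased", "revenue fell sharply"]

toy_vectorizer = TfidfVectorizer(analyzer='word')
X_toy_train = toy_vectorizer.fit_transform(toy_train)
X_toy_test = toy_vectorizer.transform(toy_test)

# every classifier exposes the same fit / predict interface
toy_predictions = {}
for classifier in (MultinomialNB(), LinearSVC()):
    classifier.fit(X_toy_train, toy_labels)
    toy_predictions[type(classifier).__name__] = classifier.predict(X_toy_test)

for name, preds in toy_predictions.items():
    print(name, preds)
```

One caveat when swapping: `LinearSVC` does not provide `predict_proba`, so any cell that relies on `predicted_prob` needs to be adapted (for example by using `decision_function` scores instead).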

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

def calculate_confusion_matrix_and_report(y_predicted, y_golden, with_plot=True):
    # calculate confusion matrix: 
    cm = confusion_matrix(y_golden, y_predicted)
    # build a readable report;
    # https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
    print('\nClassification Report')
    print(classification_report(y_golden, y_predicted))
    # plot the matrix
    if with_plot:
        plot_confusion_matrix(cm)
                                          
    return
                                          
def plot_confusion_matrix(c_matrix):
    plt.imshow(c_matrix, cmap=plt.cm.Blues)
    plt.xlabel("Predicted labels")
    plt.ylabel("True labels")
    plt.xticks([], [])
    plt.yticks([], [])
    plt.title("Confusion matrix")
    plt.colorbar()
    plt.show()
    
    return

As usual, we first try to understand the model behavior on unseen data through some rough quantitative measures and visualization. 

In [None]:
print("Total of # {} test cases".format(len(y_test)))
calculate_confusion_matrix_and_report(predicted, y_test)
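
To make the numbers in the report concrete, precision and recall for a class can be read directly off the confusion matrix. A minimal sketch in plain Python, using the same convention as scikit-learn (rows are true labels, columns are predicted labels) and made-up labels:

```python
def confusion_counts(y_true, y_pred, labels):
    """Confusion matrix as nested lists: rows = true, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def precision_recall(matrix, class_index):
    """Precision and recall of one class, given the confusion matrix."""
    true_positives = matrix[class_index][class_index]
    predicted_as_class = sum(row[class_index] for row in matrix)  # column sum
    actually_class = sum(matrix[class_index])  # row sum
    precision = true_positives / predicted_as_class if predicted_as_class else 0.0
    recall = true_positives / actually_class if actually_class else 0.0
    return precision, recall


# hypothetical gold labels and predictions
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_cm = confusion_counts(toy_true, toy_pred, labels=[0, 1, 2])
print(toy_cm)
print(precision_recall(toy_cm, 1))
```

Class 1 is never missed (recall 1.0), but one of the three sentences predicted as class 1 actually belongs to class 0, so its precision is lower.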

### Error analysis

In [None]:
assert len(X_test) == len(predicted)
# manual inspection
mistakes = [(x, p, y, prob) for x, p, y, prob in zip(X_test, predicted, y_test, predicted_prob) if p != y]
print("Total of mistakes: {}".format(len(mistakes)))
# debug
print("Sentence: {}\nPredicted: {}, but it was: {}\nProbs: {}".format(*mistakes[0]))

In [None]:
for _ in range(3):
    rnd_mistake = choice(mistakes)
    print("Sentence: {}\nPredicted: {}, but it was: {}\nProbs: {}\n=======\n".format(*rnd_mistake))

In [None]:
# data slicing: instead of considering the performance on the whole dataset,
# we split it according to categories important for our business

# for example, let's say we slice sentences by the quarter they mention
slices = {
    "first quarter": [[], []],
    "second quarter": [[], []],
    "third quarter": [[], []],
    "fourth quarter": [[], []]
}

for x, p, y in zip(X_test, predicted, y_test):
    for _s in slices.keys():
        if _s in x:
            slices[_s][0].append(p)
            slices[_s][1].append(y)
            
# debug 
# print(slices["first quarter"])
            
for _slice, test_set in slices.items():
    if test_set[0]:
        print("Total of # {} cases in slice: {}".format(len(test_set[0]), _slice))
        calculate_confusion_matrix_and_report(test_set[0], test_set[1], with_plot=False)
        print("\n===========\n")
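
The same slicing idea can be packaged in a small reusable helper that reports the size and accuracy of each slice. This is a sketch on made-up data; `accuracy_by_slice` is a hypothetical name, and a keyword match is only one of many possible ways to define a slice.

```python
def accuracy_by_slice(sentences, predictions, golden, slice_keywords):
    """Map each keyword to (number of matching cases, accuracy on them)."""
    results = {}
    for keyword in slice_keywords:
        pairs = [
            (p, y)
            for x, p, y in zip(sentences, predictions, golden)
            if keyword in x
        ]
        # skip empty slices, as there is nothing to measure
        if pairs:
            correct = sum(1 for p, y in pairs if p == y)
            results[keyword] = (len(pairs), correct / len(pairs))
    return results


# hypothetical test sentences, predictions and gold labels
toy_sentences = [
    "sales rose in the first quarter",
    "first quarter profit fell",
    "third quarter loss widened",
    "the company hired a new ceo",
]
toy_predicted = [2, 2, 0, 1]
toy_golden = [2, 0, 0, 1]

slice_report = accuracy_by_slice(
    toy_sentences,
    toy_predicted,
    toy_golden,
    ["first quarter", "second quarter", "third quarter"],
)
print(slice_report)
```

A model can look fine on aggregate and still do poorly on a slice we care about, which is exactly what this kind of report is meant to surface.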

We should be able to adapt the “black box testing” from traditional software systems to ML systems: it should be possible to evaluate the performance of a complex system by treating it as a black box, and only supply input-output pairs that are relevant for our qualitative understanding.

(see for example the excellent paper: https://arxiv.org/abs/2005.04118)

In [None]:
# Test for edge cases / interesting cases / regression errors

# CASE 1:
# I'm particularly interested in some company, say
# https://en.wikipedia.org/wiki/Comptel
# and want to make sure we are doing well there!

companies_I_care_about = ['comptel']

for company in companies_I_care_about:
    print("\n======\nFocus on target company: {}\n".format(company))
    for x, p, y in zip(X_test, predicted, y_test):
        if company in x:
            print("For '{}' =>\ngolden {}, predicted {}\n".format(
                x,
                y, 
                p))

In [None]:
# CASE 2:
# Assuming we have some specific sentences to monitor, let's check for that!
sentences_I_care_about = [
    'the company slipped to an operating loss of eur 26 million from a profit of eur 13 million',
    'revenue in the quarter fell 8 percent to  euro  24 billion compared to a year earlier'
]
labels_I_care_about = [0, 0]

for x, p in zip(X_test, predicted):
    if x.strip() in sentences_I_care_about:
        # first the label we expect, then the label the model predicted
        print("For '{}', I expect {}, it was {}\n".format(
            x,
            labels_I_care_about[sentences_I_care_about.index(x.strip())],
            p
        ))

In text classification, we want a system that is robust to alternative phrasings of the input string. For example, we expect the model's responses to:

"revenue in the quarter fell 8 percent to  euro  24 billion compared to a year earlier"

AND

"revenue in the quarter diminished by 8 percent to  euro  24 billion compared to the previous year"

would be identical.

While a full-fledged treatment of this problem is out of scope, let's see how this intuition works with working code.

In [None]:
# test for perturbations
test_sentences = [
    'the company slipped to an operating loss of eur 26 million from a profit of eur 13 million',
    'revenue in the quarter fell 8 percent to  euro  24 billion compared to a year earlier'
]

perturbated_sentences = [
    'operating loss surged to eur 26 million from a profit of eur 13 million',
    'revenue in the quarter diminished by 8 percent to  euro  24 billion compared to the previous year'
]

test_predicted = model.predict(final_tfidfvectorizer.transform(test_sentences))
perturbated_predicted = model.predict(final_tfidfvectorizer.transform(perturbated_sentences))

for s, t, p in zip(test_sentences, test_predicted, perturbated_predicted):
    print("\nFor sentence '{}', prediction was: {}, under perturbation: {}".format(
        s, t, p
    ))
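
This kind of check can be turned into a small reusable "invariance test": given pairs of (original, perturbed) sentences, report the pairs on which the prediction changes. The sketch below uses a hypothetical keyword-based classifier as a stand-in for a real model, so that it runs on its own.

```python
def invariance_failures(predict_fn, sentence_pairs):
    """Return the pairs whose prediction changes under perturbation."""
    originals = [original for original, _ in sentence_pairs]
    perturbed = [variant for _, variant in sentence_pairs]
    original_preds = predict_fn(originals)
    perturbed_preds = predict_fn(perturbed)
    return [
        (o, v, po, pv)
        for o, v, po, pv in zip(originals, perturbed, original_preds, perturbed_preds)
        if po != pv
    ]


def toy_keyword_classifier(sentences):
    # hypothetical stand-in for a trained model: 0 = negative, 1 = anything else
    return [0 if ("fell" in s or "loss" in s) else 1 for s in sentences]


toy_pairs = [
    ("revenue fell 8 percent", "revenue diminished by 8 percent"),
    ("the company slipped to an operating loss", "the company reported an operating loss"),
]

failures = invariance_failures(toy_keyword_classifier, toy_pairs)
for original, variant, pred_original, pred_variant in failures:
    print("'{}' -> {}, but '{}' -> {}".format(original, pred_original, variant, pred_variant))
```

With the objects defined in this notebook, `predict_fn` could be something like `lambda s: model.predict(final_tfidfvectorizer.transform(s))`.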

Bonus: a quick and easy way to generate perturbations is called back-translation.

The idea is that you can use machine translation to go:

SOURCE -> TARGET -> NEW_SOURCE

where NEW_SOURCE is a semantically equivalent, but not identical version of source. For example:

'hi' -> Italian target: 'ciao' -> 'hello'

'hi' and 'hello' have the same meaning so, 'hello' may be considered a perturbation of the original text.

A quick and dirty example follows (not "real" code!)

In [None]:
# !pip install BackTranslation

In [None]:
from BackTranslation import BackTranslation
trans = BackTranslation(url=[
      'translate.google.com',
      'translate.google.co.kr',
    ])
for t in test_sentences:
    result = trans.translate(t, src='en', tmp = 'zh-cn')
    print("Original is: {}\nNew sentence is: {}\n\n".format(t, result.result_text))