# DIGI405 - Text Classification Introduction

See the [README](README.md) for further notes on this notebook (e.g. installing required libraries if you are not using the class JupyterHub). See the [CHANGELOG](CHANGELOG.md) for version number and a history of changes.  

This lab notebook will introduce text classification using **[scikit-learn](https://scikit-learn.org/stable/)**. The scikit-learn library is a flexible and comprehensive machine learning framework. It is helpful, because it has great documentation and examples, which you can explore to supplement your learning. It also encourages the use of pipelines, making it easy to understand what our classifier is doing and apply a classifier to new data. 

We are supplementing scikit-learn's functionality for working with text data with [Textplumber](https://geoffford.nz/textplumber/), which provides pipeline components for scikit-learn to make it easier to extract relevant features from text data, including tokens, parts of speech, lexicon scores, document-level statistics and embeddings. In this notebook, we will use Textplumber's `NLTKPreprocessor`, which implements tokenization and parts of speech tagging using [NLTK](https://www.nltk.org/), and `TokensVectorizer`, to extract token-based features from the text data.

This is quite a long notebook. It will take you through loading and inspecting the texts in your corpus, setting up feature extraction, classifying your texts and evaluating your text classification model. 

*Note*: You will need to jump around the notebook to change settings and rerun the classification to find good settings for feature extraction and to evaluate your model. 

**Important:** Each time you change settings below, you need to rerun the cells that follow the change, in order, so that the classification pipeline picks up the new settings.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 0:</strong> Throughout the notebook there are defined tasks for you to do. Watch out for them - they will have a box around them like this! Make sure you take some notes as you go.
</div>

## Setup

Below we are importing required libraries. 

We will use the [Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html). We will also use scikit-learn's different feature extraction methods based on counts or tf-idf weights. The [NLTK](https://www.nltk.org/) library is used for pre-processing.

In [None]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words_sklearn

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

from textplumber.preprocess import NLTKPreprocessor
from textplumber.tokens import TokensVectorizer
from textplumber.core import get_stop_words
from textplumber.report import preview_row_text, plot_confusion_matrix, preview_pipeline_features
from textplumber.store import TextFeatureStore

This cell defines functions that will be used later in the notebook. 

In [None]:
def nb_binary_display_most_informative_features(pipeline, dataset, features_to_show=20):
	""" Display features indicative of each class for a 
	binary classification using a Naive Bayes classifier. """

	vect = pipeline.named_steps['vectorizer']
	clf = pipeline.named_steps['classifier']
	feature_names = vect.get_feature_names_out()
	logodds=clf.feature_log_prob_[1]-clf.feature_log_prob_[0]

	df = pd.DataFrame({
		'Feature': feature_names,
		'Log-Odds': logodds,
	})

	print("Features most indicative of",dataset.target_names[0])
	print('============================' + '='*len(dataset.target_names[0]))

	sorted_df = df.sort_values('Log-Odds', ascending=True).head(features_to_show)
	display(sorted_df)

	print("Features most indicative of",dataset.target_names[1])
	print('============================' + '='*len(dataset.target_names[1]))

	sorted_df = df.sort_values('Log-Odds', ascending=False).head(features_to_show)
	display(sorted_df)

def get_feature_frequencies(pipeline, text):
	""" Get feature frequencies for a text using the provided pipeline. """
	preprocessor = Pipeline(pipeline.steps[:-1])
	frequency = preprocessor.transform([text]).toarray()[0].T
	df = pd.DataFrame(frequency, index=preprocessor.named_steps['vectorizer'].get_feature_names_out(), columns=['frequency'])
	df = df[df['frequency'] > 0].sort_values('frequency', ascending=False)
	if len(df) < 1:
		return 'No features extracted from this document.'
	else:
		return df

## Preview stop word lists

As you've already learned in this course, pre-processing can have a major influence on the results of text classification tasks. 

In particular, you should put thought into whether a stop word list is sensible for your task. The scikit-learn website also makes this point at https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words and recommends caution about using its stop word list! That page also links to a recent paper discussing practical issues with stop word lists, including whether the way you are tokenising your documents matches the tokenisation approach used in your stop word list.

Using the cells below you can preview the stop word lists supplied by scikit-learn and NLTK, which we have used previously in class. You will notice the lists are different.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 1:</strong> For each stop word list, think of a text classification task where words in the stop word list could be informative and where it would be a bad idea to remove them.
</div>

Here is the scikit-learn stop word list:

In [None]:
stop_words_sklearn = list(stop_words_sklearn)
print(stop_words_sklearn)

Here is the NLTK stop word list:

In [None]:
stop_words_nltk = get_stop_words(save_to = 'stop_words_nltk.txt')
print(stop_words_nltk)

## Load corpus and set train/test split

The scikit-learn library is packaged with a number of standard data-sets used in machine learning and provides a way to load other data.

We will begin by loading texts from two categories in the **[20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)** to work through an example classifying documents related to politics and religion.

*What is a newsgroup?* We are stretching back into internet history here - way before people talked to strangers on Facebook and X and other social media, there were Usenet Newsgroups! [Here is a link to a Deja News page from 1998](https://web.archive.org/web/19980127204536/http://emarket.dejanews.com/emarket/about/idgs/aboutidgs.shtml) and also a [Wikipedia article](https://en.wikipedia.org/wiki/Usenet_newsgroup) that explains what Newsgroups are all about. 

This data-set was built from discussions between real people on the internet in the 1990s. Please be aware that within this data-set are texts that include racist, sexist, and other offensive language use. 

Here are the categories available for the 20 newsgroups dataset.

In [None]:
categories = fetch_20newsgroups().target_names
for category in categories:
    print(category)

The first time you run this notebook you should use `cats = ['talk.politics.misc', 'talk.religion.misc']`. You will be able to try other categories later from the list above, but use these as the starting point.

In [None]:
# this chooses the categories to load
cats = ['talk.politics.misc', 'talk.religion.misc']

This cell loads the data. Once you've worked through the notebook the first time you will need to come back to this cell and make a change before running the notebook again.

In [None]:
# this downloads/loads the data
dataset = fetch_20newsgroups(subset='train', categories=cats)
#dataset = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)

**Note:** This cell sets the following train/test split: **80% of the data is used for training and 20% is used for testing.** The documents are assigned to each group randomly. 
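
As a small sketch of what this split does, the example below applies `train_test_split` to ten made-up documents. The toy documents and labels are assumptions for illustration only. It shows the 80/20 sizes and that passing the same `random_state` gives the same split each time.

```python
from sklearn.model_selection import train_test_split

# Ten toy documents with alternating labels (made up for illustration).
docs = [f"document {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Two splits with identical settings, including the random seed.
split_a = train_test_split(docs, labels, test_size=0.2, random_state=42)
split_b = train_test_split(docs, labels, test_size=0.2, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print("training documents:", len(X_train_toy))
print("test documents:", len(X_test_toy))
print("same split both times:", split_a == split_b)
```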

In [None]:
# assign the train/test split - 0.2 is 80% for training, 20% for testing
test_size = 0.2

# do the train test split ...
# X_train and X_test are the documents
# y_train and y_test are the labels
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, 
                                                    test_size = test_size, random_state=42)

## Inspect documents and labels

In the next cells we create a dataframe to allow us to inspect the data we have imported. Firstly, we will preview the document labels and a brief excerpt for the first few rows.

In [None]:
# combining the labels and text into a dataframe
df = pd.DataFrame(list(zip(y_train, X_train)), columns =['label', 'text'])
# using the class names for the labels
df['label'] = df['label'].apply(lambda x: dataset.target_names[x])

# setting the display width to show more of the text - change this to see more or less
pd.set_option('display.max_colwidth', 100)
# showing the first 15 rows
display(df.head(15))

### You can use the following cell to inspect a specific document and its label based on its index in the training set. 

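Note: The train/test split shuffles the documents, so these indexes do not match the order of the original dataset. Because the split uses a fixed `random_state`, the indexes stay the same between runs unless you load different data or change the split settings.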

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 2:</strong> Inspect some of the documents in each class and think about the kinds of words that might be useful features in this text classification task.
</div>

In [None]:
train_id = 0 # change this to see different rows
preview_row_text(df, train_id)

## Preprocessing

**This next section of the notebook steps you through some key kinds of pre-processing for text classification using Naive Bayes and a bag of words (BoW) model.**

On the first run you should read about each setting, but leave the settings as they are. You will come back to this section to tune your model.

Preprocessing is handled by Textplumber's `NLTKPreprocessor` and features are extracted by `TokensVectorizer`. 

### Tokenization: Set how you are tokenising the text

This notebook uses NLTK's [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html) functionality.

### Stemming / Lemmatising

This setting allows you to use NLTK stemmers or lemmatisers (or neither). Valid options are shown below. Look for more information on the NLTK website: https://www.nltk.org/api/nltk.stem.html. Note that stemming and lemmatising (in particular) require more processing time.

```
normalizer = None
```
or
```
normalizer = 'PorterStemmer'
```
or
```
normalizer = 'SnowballStemmer'
```
or
```
normalizer = 'WordNetLemmatizer'
```
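
To get a feel for what a stemmer does, here is a minimal sketch that calls NLTK's `PorterStemmer` directly on a few made-up tokens. It only illustrates the idea; it is not how Textplumber applies the `normalizer` setting internally.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three inflected forms of the same word (toy example).
tokens = ["run", "runs", "running"]
stems = [stemmer.stem(token) for token in tokens]

# All three forms collapse to one stem, so a bag of words model
# would count them as a single feature.
for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

Collapsing word forms reduces the number of features, but it can also merge words you would prefer to keep apart.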

In [None]:
normalizer = None

### Lowercase

Setting lowercase to True will transform all document text to lowercase. Setting it to False will not do this transformation.

In [None]:
lowercase = True

### Filter based on token length

By default all tokens are included regardless of their length. You can set a minimum token length to remove short tokens. 

In [None]:
min_token_length = 0

### Remove punctuation

To remove punctuation, set `remove_punctuation = True`. To keep it, set it to `False`.

In [None]:
remove_punctuation = False

### Remove numbers

Numbers can be removed by setting `remove_numbers = True`. To keep them, set it to `False`.
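
The three token filters in this part of the notebook (token length, punctuation and numbers) can be sketched in plain Python. The helper `filter_tokens` below is hypothetical and written only for illustration; Textplumber's own rules for what counts as punctuation or a number may differ.

```python
import string

def filter_tokens(tokens, min_token_length=0, remove_punctuation=False, remove_numbers=False):
    """Hypothetical helper: drop tokens according to three simple rules."""
    kept = []
    for token in tokens:
        # Rule 1: drop tokens shorter than the minimum length.
        if len(token) < min_token_length:
            continue
        # Rule 2: drop tokens made up entirely of punctuation characters.
        if remove_punctuation and all(ch in string.punctuation for ch in token):
            continue
        # Rule 3: drop tokens made up entirely of digits.
        if remove_numbers and token.isdigit():
            continue
        kept.append(token)
    return kept

tokens = ["in", "1993", ",", "the", "senate", "voted", "!"]

unfiltered = filter_tokens(tokens)
filtered = filter_tokens(tokens, min_token_length=3, remove_punctuation=True, remove_numbers=True)

print(unfiltered)
print(filtered)
```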

In [None]:
remove_numbers = False

### Configure stop words

Hopefully you have read the notes on stop word lists above and previewed the different lists. 

Do you want to apply a stop word list? Valid values for `stop_word_list` are:
```
stop_word_list = None
```
or
```
stop_word_list = 'nltk'
```
or
```
stop_word_list = 'sklearn'
```

In [None]:
stop_word_list = None

You can also add extra stop words to any of the lists above.
For example:
```
extra_stop_words = ['stopword1','stopword2','stopword3']
```
If you don't want extra stop words, then the next cell should look like:
```
extra_stop_words = []
```

In [None]:
extra_stop_words = []

### Filter features based on document frequency

The following settings allow you to remove features that occur in many documents or in only a few documents.

Firstly, `min_df` ignores terms that occur in fewer than a minimum proportion of documents. For example, 0.01 would ignore terms that occur in fewer than 1% of documents.

In [None]:
min_df = 0.0

`max_df` allows you to ignore terms above a maximum proportion of documents. For example, 0.95 would ignore terms that occur in more than 95% of documents.

In [None]:
max_df = 1.0

### Choose between token counts or tf-idf weights

You can choose to vectorize your text using frequency or tf-idf weights. Textplumber's `TokensVectorizer` uses scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) or [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) internally. Valid values are:
```
vectorizer_type = 'count'
```
or
```
vectorizer_type = 'tfidf'
```

In [None]:
vectorizer_type = 'count'  # Set the vectorization method you want to use

### Set a maximum number of features

Set `max_features` to `None` for no limit, or to the maximum number of features to keep. The most frequent features are kept (e.g. setting it to 1000 would use the 1000 most frequent features).

In [None]:
max_features = 1000

### Ngrams

With ngram_range set to (1,1) you will use unigrams as features i.e. each feature will be a token. If you set it to (1,2) you will use unigrams and bigrams. (1,3) will use unigrams, bigrams and trigrams. If you just want bigrams you would use (2,2). Please note: increasing the ngram range from (1,1) will add more time to preprocessing, as there will be more features.

In [None]:
ngram_range = (1,1)

## Set up the feature extraction and classification pipeline

This cell sets up a scikit-learn pipeline for feature extraction and classification.

**Important Note 1:** When you change settings above or reload your dataset you should rerun this cell to update the pipeline.
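
For comparison, here is a stripped-down sketch of the same idea using only scikit-learn components: a `CountVectorizer` stands in for the Textplumber steps, and the training documents and labels are made up. It is not the pipeline used in this notebook, but it shows how a pipeline chains feature extraction and classification behind a single `fit` and `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (made up): 0 = politics, 1 = religion.
train_docs = [
    "the senate passed the tax bill",
    "congress debated the budget",
    "the church held a prayer service",
    "the bible study group met",
]
train_labels = [0, 0, 1, 1]

# Each step is a (name, component) pair; the last step is the classifier.
toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# fit() runs the vectorizer and then trains the classifier on its output.
toy_pipeline.fit(train_docs, train_labels)

# predict() applies the same vectorizer to new text before classifying.
predictions = toy_pipeline.predict(["a new tax budget", "prayer and bible reading"])
print(predictions)
```

Because the vectorizer is part of the pipeline, new documents are always transformed with the vocabulary learned from the training data.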

In [None]:
# the feature store is used to reduce preprocessing time after the first run
feature_store = TextFeatureStore('text-classification-introduction.sqlite')

In [None]:
# prepare stop words
if stop_word_list == 'nltk':
    stop_words = stop_words_nltk
elif stop_word_list == 'sklearn':
    stop_words = stop_words_sklearn
else:
    stop_words = []

# add any extra stop words (list() guards against a set being supplied)
if len(extra_stop_words) > 0:
    stop_words = list(stop_words) + extra_stop_words

In [None]:
# you shouldn't need to change anything in this cell!

pipeline = Pipeline([
    ('preprocessor', NLTKPreprocessor(feature_store = feature_store)),
    ('vectorizer', TokensVectorizer(feature_store = feature_store,
                                   vectorizer_type = vectorizer_type,
                                   lowercase = lowercase,
                                   min_token_length = min_token_length,
                                   remove_punctuation = remove_punctuation,
                                   remove_numbers = remove_numbers,
                                   stop_words = stop_words,
                                   min_df = min_df,
                                   max_df = max_df,
                                   max_features = max_features,
                                   ngram_range = ngram_range,
                                   normalizer = normalizer
                                   )),
    ('classifier', MultinomialNB()), #here is where you would specify an alternative classifier
])

display(pipeline)

**Important Note 2:** This cell outputs the settings you used above, which you can cut and paste into a document to keep track of changes you are making and their effects.

In [None]:
# you shouldn't need to change anything in this cell!

print('Classifier settings')
print('===================')
print('Classes:', dataset.target_names)
print()
print('Pipeline Components')
for i, step in enumerate(pipeline.named_steps):
    print(f'\tStep {i + 1}: {pipeline.named_steps[step].__class__.__name__}')

print()

print('vectorizer_type:', vectorizer_type)
print()

print('normalizer:', normalizer)
print('lowercase:', lowercase)
print('stop_word_list:', stop_word_list)
print('extra_stop_words:', extra_stop_words)
print('min_token_length:', min_token_length)
print('remove_punctuation:', remove_punctuation)
print('remove_numbers:', remove_numbers)

print()

print('min_df:', min_df)
print('max_df:', max_df)
print('max_features:', max_features)
print('ngram_range:', ngram_range)


## Train the classifier and predict labels on test data

This cell does the work of training the classifier and predicting labels on test data. It also outputs evaluation metrics, a confusion matrix and features indicative of each class.

**Important Note:** You can cut and paste the model output into a document (with the settings above) to keep track of changes you are making and their effects.

In [None]:
# you shouldn't need to change anything in this cell!
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)

You've got a trained model now and predictions on the held-out test data. We will use this to evaluate the model. First, here are the evaluation metrics.
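
As a sketch of where the numbers in a classification report come from, the example below computes accuracy, precision, recall and F1 by hand for made-up labels and checks them against scikit-learn's metric functions. Class 1 is treated as the positive class; the labels are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy true and predicted labels (made up).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Count the four outcomes, treating class 1 as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the documents predicted as class 1, how many were right
recall = tp / (tp + fn)      # of the true class 1 documents, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# scikit-learn's functions give the same values.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```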

In [None]:
# print report
print(metrics.classification_report(y_test, y_predicted, target_names = dataset.target_names, digits=3))

A confusion matrix is helpful to understand predictions that are correct and incorrect for each class.
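
To see how the cells of a confusion matrix are laid out before reading the plot, here is a minimal sketch using scikit-learn's `confusion_matrix` on made-up labels. In scikit-learn's convention, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels (made up).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Row 0: true class 0 -> 3 predicted as 0 (correct), 1 predicted as 1 (error)
# Row 1: true class 1 -> 2 predicted as 0 (error), 4 predicted as 1 (correct)
correct = cm.diagonal().sum()
print("correct predictions:", correct, "of", cm.sum())
```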

In [None]:
plot_confusion_matrix(y_test = y_test, y_predicted = y_predicted, target_classes = [0, 1], target_names = dataset.target_names)

The Naive Bayes classifier exposes a `feature_log_prob_` attribute, which holds the log probability of each feature given each class. This can be used to calculate a log-odds ratio for each feature in our model, here the difference between a feature's log probability in the second class and in the first. The sign of the log-odds ratio indicates the class related to the feature (negative for the first class, positive for the second), while the absolute value indicates the relative importance of the feature in distinguishing between the classes.
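
The calculation can be sketched on a toy count matrix. The feature names and counts below are made up for illustration; the sketch fits scikit-learn's `MultinomialNB` with its default smoothing (`alpha=1.0`), takes the difference of the two rows of `feature_log_prob_`, and reproduces the same values by hand.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows are documents, columns are the features named below.
feature_names = ["senate", "church", "the"]
X = np.array([
    [3, 0, 2],  # class 0
    [2, 1, 3],  # class 0
    [0, 4, 2],  # class 1
    [1, 2, 3],  # class 1
])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X, y)  # alpha=1.0 (add-one smoothing) by default
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]

# Reproduce by hand: add 1 to each class's feature counts, normalise, take logs.
smoothed = np.array([X[y == 0].sum(axis=0), X[y == 1].sum(axis=0)]) + 1.0
probs = smoothed / smoothed.sum(axis=1, keepdims=True)
manual = np.log(probs[1]) - np.log(probs[0])

for name, value in zip(feature_names, log_odds):
    print(f"{name}: {value:+.3f}")
```

`senate` comes out negative (indicative of class 0), `church` positive (indicative of class 1), and `the` close to zero because it is about equally likely in both classes.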

In [None]:
features_to_show = 10

nb_binary_display_most_informative_features(pipeline, dataset, features_to_show)

## List all features

You can use a Textplumber function, `preview_pipeline_features`, to see how data moves through your pipeline and preview the features used for classification.

In [None]:
preview_pipeline_features(pipeline)

## Comments about the results on your first run of this notebook (with defaults)

You've probably got close to 0.900 for accuracy. This is pretty good! However, we should ask "why?"!

## Inspect correctly/incorrectly classified documents

The output in the next cell is quite long. It will show you a preview of the correctly and incorrectly classified documents.

In [None]:
# creating dataframe from y_predicted, y_test and the text
predictions_df = pd.DataFrame(data = {'true': y_test, 'predicted': y_predicted})
predictions_df['predicted'] = predictions_df['predicted'].apply(lambda x: dataset.target_names[x])
predictions_df['true'] = predictions_df['true'].apply(lambda x: dataset.target_names[x])
predictions_df['correct'] = predictions_df['true'] == predictions_df['predicted']
predictions_df['text'] = X_test

# output a preview of docs for each cell of confusion matrix ...
for true_target in range(len(dataset.target_names)):
    for predicted_target in range(len(dataset.target_names)):
        if true_target == predicted_target:
            print(f'\nCORRECTLY CLASSIFIED: {dataset.target_names[true_target]}')
        else:
            print(f'\n{dataset.target_names[true_target]} INCORRECTLY CLASSIFIED as: {dataset.target_names[predicted_target]}')
        print('=================================================================')

        display(predictions_df[(predictions_df['true'] == dataset.target_names[true_target]) & (predictions_df['predicted'] == dataset.target_names[predicted_target])])


## Preview document and its features

Use this cell to preview a document using its index in the test set. You can see the predicted label, its actual label, the full text and the features for this specific document.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 3:</strong> Inspect documents that were correctly and incorrectly classified. Loaded question: Are there features that are not related to the topics that are making it easier for the model to predict the class? (Hint: this is a "loaded question" because the answer is yes!)
</div>

In [None]:
test_id = 5 # preview a text from the cell above using its index

preview_row_text(predictions_df, test_id)

print('Features')
print('========')

print(get_feature_frequencies(pipeline, X_test[test_id]))


## Answer: Why this classification task is perhaps easier than it should be!

You will notice features like `edu`, `com` and other portions of email addresses, as well as names that appear in the headers of these newsgroup messages. Because the documents include message headers, footers and quoted material, the classifier can lean on message metadata rather than on the topics themselves. If we are interested in topic-based classification, features like the email addresses of people who post often in a newsgroup are not good signals. To make this more interesting and more challenging for the rest of the lab, we need to load the data with just the message text itself.

## Classifying based on the "text"

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 4:</strong> Return to the cell where you loaded the corpus and set the train/test split. Comment out the line that loads all the data and uncomment the line below it that loads the messages without headers, footers and quotes (note: you can use the keyboard shortcut <code>CTRL + /</code> to comment/uncomment lines). It should look like this after you make the change:
    <pre>
# this downloads/loads the data
# dataset = fetch_20newsgroups(subset='train', categories=cats)
dataset = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)</pre>
</div>

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 5:</strong> If you haven't already, re-run the cell that loads the data and then run the other cells to classify the texts. Note that your accuracy has decreased.
</div>

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 6:</strong> Now we really get into the task of classifying the texts! You can change the settings to try different feature extraction/pre-processing. Pay attention to the way that preprocessing affects the results. Make sure you evaluate your model and inspect the output as you go. Keep notes on what you observe. Once you have improved the accuracy of your model chat to your tutor and also check the Online Text Classification Task on Learn!
</div>

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 7:</strong> What documents are the most difficult to classify and why?! Is it to do with specific documents? Or, is it about the classes and the kind of talk that appears in them? Are there certain sub topics that are challenging?
</div>

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 8:</strong> Carefully read through the instructions for the Online Text Classification Task on Learn. You can start this task whenever you like.
</div>