# DIGI405 - Text Classification with Feature Selection and Grid Search

See the [README](README.md) for further notes on this notebook (e.g. installing required libraries if you are not using the class JupyterHub). See the [CHANGELOG](CHANGELOG.md) for version number and a history of changes.  

What can text classification techniques tell us about sentiment or tone? Can text classification help us find distinguishing features between two groups of texts?  

This notebook introduces you to:

1. A new data-set relevant to sentiment classification, loaded using the Huggingface Datasets library.  
2. Feature selection using [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
3. Automating parameter tuning using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

**Remember:** Each time you change settings below, you will need to rerun the cells that create the pipeline and do the classification.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task:</strong> Throughout the notebook there are defined tasks for you to do. Watch out for them - they will have a box around them like this! Make sure you take some notes as you go.
</div>

## Setup

Below we are importing required libraries. We will be using [scikit-learn](https://scikit-learn.org) for text classification in DIGI405. We will use the Naive Bayes Classifier. Scikit-learn has different feature extraction methods based on counts or tf-idf weights. We will also use NLTK for pre-processing.

In [None]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words_sklearn

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

from datasets import load_dataset

from textplumber.preprocess import NLTKPreprocessor
from textplumber.tokens import TokensVectorizer
from textplumber.core import get_stop_words
from textplumber.report import preview_row_text, plot_confusion_matrix, preview_pipeline_features, preview_dataset, preview_split_by_label_column
from textplumber.store import TextFeatureStore

import warnings

# in the interests of readability, ignoring this warning
warnings.filterwarnings("ignore", message="Your stop_words may be inconsistent with your preprocessing")

In [None]:
stop_words_sklearn = list(stop_words_sklearn)
stop_words_nltk = get_stop_words(save_to = 'stop_words_nltk.txt')

In [None]:
def nb_binary_display_features(pipeline, label_names, features_to_show=20):
	vect = pipeline.named_steps['vectorizer']
	clf = pipeline.named_steps['classifier']
	feature_names = vect.get_feature_names_out()
	logodds=clf.feature_log_prob_[1]-clf.feature_log_prob_[0]

	# if selector in pipeline
	if 'selector' in pipeline.named_steps:
		# Get SelectKBest feature scores
		features = pipeline.named_steps['selector']
		# get top k feature indices
		cols = features.get_support(indices=True)
		# get corresponding feature scores
		top_k_feature_scores = [features.scores_[i] for i in cols]
		feature_names = [feature_names[i] for i in cols]

	df = pd.DataFrame({
		'Feature': feature_names,
		'Log-Odds': logodds,
	})

	if 'selector' in pipeline.named_steps:
		# if scoring func is mi 
		if pipeline.named_steps['selector'].score_func == mutual_info_classif:
			score_column_name = 'MI Score'
		else:
			score_column_name = 'Feature Score'

		df[score_column_name] = top_k_feature_scores

	if 'selector' in pipeline.named_steps:
		print('Top features by information gain')
		print('================================')
		sorted_df = df.sort_values([score_column_name], ascending=False).head(features_to_show)
		display(sorted_df)

	print("Features most indicative of", label_names[0])
	print('============================' + '='*len(label_names[0]))

	sorted_df = df.sort_values('Log-Odds', ascending=True).head(features_to_show)
	display(sorted_df)

	print("Features most indicative of", label_names[1])
	print('============================' + '='*len(label_names[1]))

	sorted_df = df.sort_values('Log-Odds', ascending=False).head(features_to_show)
	display(sorted_df)

def get_feature_frequencies(pipeline, text):
	preprocessor = Pipeline(pipeline.steps[:-1])
	frequency = preprocessor.transform([text]).toarray()[0].T
	feature_names = preprocessor.named_steps['vectorizer'].get_feature_names_out()
	
	if 'selector' in pipeline.named_steps:
		cols = pipeline.named_steps['selector'].get_support(indices=True)
		feature_names = [feature_names[i] for i in cols]

	df = pd.DataFrame(frequency, index=feature_names, columns=['frequency'])
	df = df[df['frequency'] > 0].sort_values('frequency', ascending=False)
	if len(df) < 1:
		return 'No features extracted from this document.'
	else:
		return df


## Load corpus and set train/test split

Today we will work with a movie reviews data-set. The reviews are annotated with sentiment polarities "pos" and "neg". The Sentiment Polarity Dataset Version 2.0 is distributed with NLTK, but we are using it here to introduce Huggingface's [Datasets library](https://huggingface.co/docs/datasets/en/index). We will use the datasets library to load the data. The Huggingface website has a [dataset page](https://huggingface.co/datasets/polsci/sentiment-polarity-dataset-v2.0) with more information, a preview, and links and citation information from the creators of the data-set. Take a look now to help understand the data-set and how it was created.  

In [None]:
dataset = load_dataset('polsci/sentiment-polarity-dataset-v2.0') 

Here we are using a Textplumber function to get a summary of splits, features and the number of rows. 

In [None]:
preview_dataset(dataset)

The features listed for the train split show the names of the features and the label names. Notice that the field called `label` has type `ClassLabel`, which means it is defined as a label. Not all HuggingFace datasets will have this defined.

If we preview our train split as a dataframe we can see that the label is stored as a numeric value.

In [None]:
df = dataset['train'].to_pandas()
display(df.head())

Here we get a count for each label. 

In [None]:
preview_split_by_label_column(dataset, 'label')

In [None]:
X = list(dataset['train']['text'])
y = np.array(dataset['train']['label'])

label_names = dataset['train'].features['label'].names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The cell above also sets the train/test split: 80% of the data is used for training and 20% is used for testing. The documents are assigned to each group randomly, and `random_state=42` makes that assignment repeatable. 
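
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on ten made-up documents. The `toy_` names and the placeholder texts are assumptions for illustration only and are not part of the movie reviews data.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the review texts and labels (illustrative only)
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0, 1] * 5

# Two splits with the same settings and the same seed
split_a = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=42)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=42)

toy_train, toy_test, toy_y_train, toy_y_test = split_a
print(len(toy_train), len(toy_test))  # 80% / 20% of ten documents
print(split_a[1] == split_b[1])       # same seed, so the same test documents
```

Changing `random_state` to another value would shuffle the documents differently, which is why results are only comparable between runs when the seed is kept fixed.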

## Inspect documents and labels

In the next cells we can look at the data we have imported. Firstly, we will preview the document labels and a brief excerpt.

In [None]:
# combining the labels and text into a dataframe
df = pd.DataFrame(list(zip(y_train, X_train)), columns =['label', 'text'])
# using the label_names for the labels
df['label'] = df['label'].apply(lambda x: label_names[x])

# setting the display width to show more of the text - change this to see more or less
pd.set_option('display.max_colwidth', 100)
# showing the first 10 rows
display(df.head(10))

You can use this cell to inspect a specific document and its label based on its index in the training set. Note: Because the split above uses a fixed `random_state`, the indexes stay the same between runs; they will change if you alter `random_state` or `test_size`.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 1:</strong> Inspect some of the documents in each class and think about the kinds of words that might be useful features in this text classification task.
</div>

In [None]:
train_id = 1 # Change this to the index of the document you want to preview
preview_row_text(df, train_id)

## Define preprocessing and feature extraction settings

You can consult the [text classification introduction notebook](https://github.com/polsci/text-classification-introduction) for more information on each setting below.

On the first run through, just use these settings.

In [None]:
normalizer = None
lowercase = True
min_token_length = 0
remove_punctuation = True
remove_numbers = False
stop_word_list = 'nltk'
extra_stop_words = []
min_df = 0.0
max_df = 1.0
vectorizer_type = 'count' # either 'count' or 'tfidf'
max_features = 1000
ngram_range = (1, 1) 

### New settings for the feature selection step

The feature selection step selects the top features based on [univariate statistical tests](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). Here we are using [mutual information scores](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) to assess the dependency between each feature and the class labels. 

The value below sets the number of features to select and use in our classifier. In this case we will start with 100 features, based on the mutual information score.

In [None]:
kbest = 100
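
As a rough illustration of what this selection step does, the sketch below runs `SelectKBest` with mutual information on a tiny made-up count matrix. The words, counts and labels are invented for illustration. The matrix is stored as a sparse matrix because `mutual_info_classif` then treats the columns as discrete counts under its default `discrete_features='auto'` setting.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Made-up vocabulary and counts for four documents (illustrative only)
toy_features = ['great', 'movie', 'plot']
toy_counts = csr_matrix(np.array([
    [1, 1, 0],   # positive review
    [1, 1, 1],   # positive review
    [0, 1, 0],   # negative review
    [0, 1, 1],   # negative review
]))
toy_labels = np.array([1, 1, 0, 0])

# Keep only the single highest-scoring feature
toy_selector = SelectKBest(score_func=mutual_info_classif, k=1)
toy_selector.fit(toy_counts, toy_labels)

scores = toy_selector.scores_
selected = toy_selector.get_support(indices=True)
for name, score in zip(toy_features, scores):
    print(f'{name}: {score:.3f}')
print('selected:', [toy_features[i] for i in selected])
```

Here "great" appears only in positive reviews, so it carries the most information about the label. "movie" appears in every document and "plot" appears equally often in both classes, so neither helps separate the classes.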

## Set up a pipeline: feature extraction → feature selection → classifier

This is similar to the Scikit-learn pipeline we set up in the introductory notebook, but there is a new pipeline component for feature selection prior to training the classifier. The TokensVectorizer class takes a tokenized text as input and outputs either tf-idf weights or counts, depending on how you set `vectorizer_type` above.

**Important Note 1:** When you change settings above or reload your dataset you should rerun these cells again!

In [None]:
# the feature store is used to reduce preprocessing time after the first run
feature_store = TextFeatureStore('text-classification-feature-selection.sqlite')

In [None]:
# prepare stop words
if stop_word_list == 'nltk':
    stop_words = stop_words_nltk
elif stop_word_list == 'sklearn':
    stop_words = stop_words_sklearn
else:
    stop_words = []

if len(extra_stop_words) > 0:
	stop_words = stop_words + extra_stop_words

In [None]:
# you shouldn't need to change anything in this cell!

pipeline = Pipeline([
    ('preprocessor', NLTKPreprocessor(feature_store = feature_store)),
    ('vectorizer', TokensVectorizer(feature_store = feature_store,
                                   vectorizer_type = vectorizer_type,
								   lowercase = lowercase,
								   min_token_length = min_token_length,	
								   remove_punctuation = remove_punctuation,	
								   remove_numbers = remove_numbers,
								   stop_words = stop_words, 
                                   min_df = min_df,
								   max_df = max_df,
								   max_features = max_features,
								   ngram_range = ngram_range,
                                   normalizer = normalizer
                                   )),
    ('selector', SelectKBest(score_func = mutual_info_classif, k=kbest)),
    ('classifier', MultinomialNB()), #here is where you would specify an alternative classifier
])

display(pipeline)

**Important Note 2:** This cell outputs the settings you used above, which you can cut and paste into a document to keep track of changes you are making and their effects.

In [None]:
# you shouldn't need to change anything in this cell!

print('Classifier settings')
print('===================')
print('Classes:', label_names)
print()
print('Pipeline Components')
for i, step in enumerate(pipeline.named_steps):
    print(f'\tStep {i + 1}: {pipeline.named_steps[step].__class__.__name__}')

print()

print('vectorizer_type:', vectorizer_type)
print()

print('normalizer:', normalizer)
print('lowercase:', lowercase)
print('stop_word_list:', stop_word_list)
print('extra_stop_words:', extra_stop_words)
print('min_token_length:', min_token_length)
print('remove_punctuation:', remove_punctuation)
print('remove_numbers:', remove_numbers)

print()

print('min_df:', min_df)
print('max_df:', max_df)
print('max_features:', max_features)
print('ngram_range:', ngram_range)

print()

print('kbest:', kbest)

## Train the classifier and predict labels on test data

Because we are adding the feature selection step, training will be slower, as the pipeline has to calculate MI scores for each feature and rank them. The time taken increases with the number of features you extract.

**Important Note:** You can cut and paste the model output into a document (with the settings above) to keep track of changes you are making and their effects.

In [None]:
# you shouldn't need to change anything in this cell!
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)

Inspect the evaluation metrics on the held-out data ...

In [None]:
# print report
print(metrics.classification_report(y_test, y_predicted, target_names = label_names, digits=3))

Examine the correct and incorrect predictions ...

In [None]:
plot_confusion_matrix(y_test = y_test, y_predicted = y_predicted, target_classes = [0, 1], target_names = label_names)

We can now look at the features ranked by information gain (MI) and by class.

In [None]:
features_to_show = 10
nb_binary_display_features(pipeline, label_names, features_to_show)
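
The Log-Odds column in the tables above is the difference between the Naive Bayes log probabilities of a feature in the two classes. A minimal sketch on made-up counts follows; the feature names and numbers are assumptions for illustration, with class 0 standing in for "neg" and class 1 for "pos".

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts for three features in four documents (illustrative only)
toy_features = ['bad', 'film', 'good']
toy_X = np.array([
    [3, 1, 0],   # class 0
    [2, 1, 0],   # class 0
    [0, 1, 3],   # class 1
    [0, 1, 2],   # class 1
])
toy_y = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(toy_X, toy_y)

# Positive values point towards class 1, negative values towards class 0
log_odds = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
for name, value in zip(toy_features, log_odds):
    print(f'{name}: {value:+.3f}')
```

A value near zero, as for "film" here, means the feature is about equally likely in both classes and says little about which class a document belongs to.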

## List all features

You can inspect how the text data moves through the pipeline below, including which token features were output by the vectorizer, and which features were selected based on information gain. 

In [None]:
preview_pipeline_features(pipeline)

## Inspect correctly/incorrectly classified documents

In [None]:
# creating dataframe from y_predicted, y_test and the text
predictions_df = pd.DataFrame(data = {'true': y_test, 'predicted': y_predicted})
predictions_df['predicted'] = predictions_df['predicted'].apply(lambda x: label_names[x])
predictions_df['true'] = predictions_df['true'].apply(lambda x: label_names[x])
predictions_df['correct'] = predictions_df['true'] == predictions_df['predicted']
predictions_df['text'] = X_test

# output a preview of docs for each cell of confusion matrix ...
for true_target in range(len(label_names)):
    for predicted_target in range(len(label_names)):
        if true_target == predicted_target:
            print(f'\nCORRECTLY CLASSIFIED: {label_names[true_target]}')
        else:
            print(f'\n{label_names[true_target]} INCORRECTLY CLASSIFIED as: {label_names[predicted_target]}')
        print('=================================================================')

        display(predictions_df[(predictions_df['true'] == label_names[true_target]) & (predictions_df['predicted'] == label_names[predicted_target])])


<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 2:</strong> Inspect documents that were correctly and incorrectly classified. Why are some documents incorrectly classified?
</div>

## Preview document and its features

Use this cell to preview a document using its index in the test set. You can see the predicted label, its actual label, the full text and the features for this specific document.

In [None]:
test_id = 11 # preview a text from the cell above using its index

preview_row_text(predictions_df, test_id)

print('Features')
print('========')

print(get_feature_frequencies(pipeline, X_test[test_id]))


<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 3:</strong> Try changing the tokenisation to include punctuation. What punctuation emerges as useful features? How are these punctuation features being used?
</div>    
<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 4:</strong>    Now increase the number of most frequent tokens (<code>max_features</code>) so that the feature selection step can inspect and score many more of the less frequent words. What number of frequent tokens improves the features that can be identified? 
</div>    
<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 5:</strong>    After you’ve identified the best settings to improve the performance metrics of the classifier, review the incorrectly classified documents again. Identify any unexpected word features and decide whether they may be true indicators of sentiment, or just coincidence.
</div>    

# Automated Parameter Tuning

In the remainder of the lab we will work through automated parameter tuning. Warning: this may take some time!

## Define the Pipeline for the Gridsearch

In [None]:
# you shouldn't need to change anything in this cell!

pipeline_for_gridsearch = Pipeline([
    ('preprocessor', NLTKPreprocessor(feature_store = feature_store)),
    ('vectorizer', TokensVectorizer(feature_store = feature_store,
                                   vectorizer_type = vectorizer_type,
								   lowercase = lowercase,
								   min_token_length = min_token_length,	
								   remove_punctuation = remove_punctuation,	
								   remove_numbers = remove_numbers,
								   stop_words = stop_words, 
                                   min_df = min_df,
								   max_df = max_df,
								   max_features = max_features,
								   ngram_range = ngram_range,
                                   normalizer = normalizer
                                   )),
    ('selector', SelectKBest(score_func = mutual_info_classif, k=kbest)),
    ('classifier', MultinomialNB()), #here is where you would specify an alternative classifier
])

display(pipeline_for_gridsearch)

## Create Search Space

This step is to define the space of parameters and estimators we want to search through. We do this in the form of a dictionary and we use double underscore notation (__) to refer to the parameters of different steps in our pipeline. We will be trying out different values of k for the feature selector SelectKBest. However, this is not an exhaustive list of parameters that we can search. We could search parameters for the feature extraction steps as well. Some example parameters are commented out and you could test with them, but note that for every line you uncomment you are increasing the number of combinations and the time it will take to run the grid search.

In [None]:
search_space = [{#'vectorizer__normalizer'  : [None, 'SnowballStemmer'],
                 #'vectorizer__stop_words'  : [None, stop_words_nltk],
                 #'vectorizer__vectorizer_type' : ['count', 'tfidf'],
                 #'vectorizer__min_df' : [0.0, 0.01, 0.5],
                 #'vectorizer__max_df' : [0.5, 0.75, 1.0],
                 #'vectorizer__max_features' : range(1000, 5001, 1000), # this starts at 1000 and ends at 5000 with steps of 1000
                 #'vectorizer__ngram_range' : [(1, 1), (1, 2)],
                 'selector__k'              : range(50, 701, 50), #this starts at 50 and ends at 700 with steps of 50
                 #'selector__score_func'     : [mutual_info_classif, chi2],
                 #'classifier': [MultinomialNB(), LogisticRegression(max_iter=1000)]
                }]
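
To see which names are valid keys for a search space, you can list a pipeline's parameters. The sketch below uses a small stand-in pipeline built from scikit-learn's `CountVectorizer` (an assumption, used in place of the Textplumber components so the example is self-contained); the same `get_params` and `set_params` calls work on any scikit-learn pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in pipeline with the same step names as the one in this notebook
toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('selector', SelectKBest(score_func=chi2, k=10)),
    ('classifier', MultinomialNB()),
])

# Every tunable parameter is addressed as <step name>__<parameter name>
tunable = sorted(name for name in toy_pipeline.get_params() if '__' in name)
print([name for name in tunable if name.startswith('selector__')])

# The same notation is used to set a value directly
toy_pipeline.set_params(selector__k=5)
print(toy_pipeline.named_steps['selector'].k)
```

A key in the search space whose step name or parameter name is not in this list will raise an error when the grid search is fitted.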

The scorers can be either one of the predefined metric strings or a scorer callable, like the one returned by `make_scorer`.

In [None]:
scoring = {'Accuracy': make_scorer(accuracy_score)}
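
As a small check that the two forms are interchangeable for accuracy, the sketch below compares the predefined `'accuracy'` string (looked up with `get_scorer`) with `make_scorer(accuracy_score)`. The data is made up and the `DummyClassifier` baseline is only a stand-in estimator for illustration.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, get_scorer, make_scorer

# Made-up data: three documents of class 1 and one of class 0
toy_X = [[0], [1], [2], [3]]
toy_y = [1, 1, 1, 0]

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(toy_X, toy_y)

# A scorer is called as scorer(estimator, X, y)
string_score = get_scorer('accuracy')(baseline, toy_X, toy_y)
callable_score = make_scorer(accuracy_score)(baseline, toy_X, toy_y)
print(string_score, callable_score)
```

`make_scorer` becomes more useful when a metric needs extra arguments or when you want to wrap your own metric function.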

## Run the GridSearch 

This is where the computation happens! We will now pass our pipeline into GridSearchCV to test our search space (of feature preprocessing, feature selection, model selection, and hyperparameter tuning combinations) using cross-validation with 3 folds. If we had more time, we would probably increase the number of folds (cv) to 5 or 10.

Setting `refit='Accuracy'` refits an estimator on the whole training set passed to `fit`, using the parameter setting that has the best cross-validated Accuracy score.
That estimator is made available at ``gridsearch.best_estimator_``, along with attributes like ``gridsearch.best_score_``, ``gridsearch.best_params_`` and ``gridsearch.best_index_``.

In [None]:
gridsearch = GridSearchCV(estimator = pipeline_for_gridsearch, 
                    param_grid         = search_space, 
                    scoring            = scoring,
                    cv                 = 3, 
                    refit              = 'Accuracy',
                    return_train_score = True,
                    verbose            = 3)

gridsearch.fit(X_train, y_train)
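
The running time of a grid search grows with the number of parameter combinations multiplied by the number of folds. A quick way to estimate this before running a search is scikit-learn's `ParameterGrid`. The sketch below counts the combinations for the default search space and for a hypothetical larger one that also searches `vectorizer__max_features`.

```python
from sklearn.model_selection import ParameterGrid

n_folds = 3

# The default search space: 14 values of k
toy_space = [{'selector__k': range(50, 701, 50)}]
n_candidates = len(ParameterGrid(toy_space))
print(n_candidates, 'candidates,', n_candidates * n_folds, 'fits')

# A hypothetical larger space: every k is tried with every max_features value
bigger_space = [{'selector__k': range(50, 701, 50),
                 'vectorizer__max_features': range(1000, 5001, 1000)}]
n_bigger = len(ParameterGrid(bigger_space))
print(n_bigger, 'candidates,', n_bigger * n_folds, 'fits')
```

With `refit` set, one further fit is made at the end using the best combination.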

## Get the Results

We can access the best result of our search using the best_estimator_ attribute.

In [None]:
display(gridsearch.best_estimator_)
display(gridsearch.best_params_)
display(gridsearch.best_score_)

## Extracting from the cv_results_ dictionary

Demonstrating methods of extracting values from the cv_results_ dictionary.

In [None]:
means = gridsearch.cv_results_['mean_test_Accuracy']
stds  = gridsearch.cv_results_['std_test_Accuracy']

for mean, std, params in zip(means, stds, gridsearch.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
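
Another common way to read `cv_results_` is to load it into a pandas DataFrame and sort by rank. The sketch below does this for a small self-contained search on randomly generated counts; the data, the `toy_` names and the choice of `chi2` as score function are assumptions for illustration, not the search run above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up count data: 60 documents, 8 features, first feature informative
rng = np.random.default_rng(0)
toy_y = np.array([0, 1] * 30)
toy_X = rng.poisson(1.0, size=(60, 8))
toy_X[:, 0] += 3 * toy_y

toy_search = GridSearchCV(
    Pipeline([('selector', SelectKBest(chi2)), ('classifier', MultinomialNB())]),
    param_grid={'selector__k': [1, 4, 8]},
    scoring={'Accuracy': 'accuracy'},
    refit='Accuracy',
    cv=3,
)
toy_search.fit(toy_X, toy_y)

# One row per parameter combination, best-ranked first
toy_results = pd.DataFrame(toy_search.cv_results_)
columns = ['param_selector__k', 'mean_test_Accuracy',
           'std_test_Accuracy', 'rank_test_Accuracy']
print(toy_results[columns].sort_values('rank_test_Accuracy'))
```

The same two lines at the end should work on `gridsearch.cv_results_` from the cell above, since the column names follow the scorer name given in `scoring`.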

In [None]:
y_best_predicted = gridsearch.best_estimator_.predict(X_test)

# print report
print(metrics.classification_report(y_test, y_best_predicted, target_names = label_names, digits=3))

In [None]:
plot_confusion_matrix(y_test = y_test, y_predicted = y_best_predicted, target_classes = [0, 1], target_names = label_names)

## Visualise the results

Plot code taken from https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html

Note that the x- and y-axis ranges are set below. You may need to change them depending on the values you chose above. 

In [None]:
# get our results
results = gridsearch.cv_results_

plt.figure(figsize=(16, 16))
plt.title("GridSearchCV evaluating parameters using the Accuracy scorer.",
          fontsize=16)

plt.xlabel("k")
plt.ylabel("Accuracy")

ax = plt.gca()

# adjust these according to your accuracy results and range values.
ax.set_xlim(0, 700)
ax.set_ylim(0.600, 1)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_selector__k'].data, dtype=float)

for scorer, color in zip(sorted(scoring), ['b']):
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = results['mean_test_%s' % scorer][best_index]

    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.3f with k=%d" % (best_score, X_axis[best_index]),
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()

If you got to this point in the lab, try changing search_space above to search more parameters.