# DIGI405 Text Analysis Project Notebook

Version 0.2

You should use this notebook as a starting point for your DIGI405 project. It provides code to select your dataset and run a complete text classification pipeline with [textplumber](https://geoffford.nz/textplumber/), a package that provides an easy-to-use interface to methods covered in this course.

**Name:**  
**Student ID:**  
**Project option:** (ONE of 'Essay' or 'Sentiment' or 'Genre')  
**Project submission date:**  

Please also add your name to your notebook filename (where it says 'NAME').

### Notebook structure

Sections 1-4 provide code you should modify or extend. In your report, you can refer to code sections by their section number, e.g. 2.1.

## 1. Setup

You must select the Python 3.12 kernel to run the code in this notebook. 

In [None]:
from datasets import load_dataset, ClassLabel, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report

from textplumber.core import *
from textplumber.clean import *
from textplumber.preprocess import *
from textplumber.tokens import *
from textplumber.pos import *
from textplumber.embeddings import *
from textplumber.report import *
from textplumber.store import *
from textplumber.lexicons import *
from textplumber.textstats import *

from imblearn.under_sampling import RandomUnderSampler 

import warnings

# in the interests of readability, ignoring this warning
warnings.filterwarnings("ignore", message="Your stop_words may be inconsistent with your preprocessing")

These settings control the display of Pandas dataframes in the notebook.

In [None]:
pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_colwidth', 500) # increase this to see more text in the dataframe

Get word lists:
* The stop word list is from NLTK.
* All of the word lists (including the stop word list) can be used to extract lexicon count features, i.e. features that count how often words from a given list occur in each text.

In [None]:
stop_words = get_stop_words()
stop_words_lexicon = {'stop_words': stop_words}
empath_lexicons = get_empath_lexicons()
vader_lexicons = get_sentiment_lexicons()
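
To make the idea of a lexicon count feature concrete, here is a minimal pure-Python sketch. It is not how `LexiconCountVectorizer` is implemented: the function name `count_lexicon_hits`, the regex tokenizer and the toy word lists are assumptions made for illustration only.

```python
import re
from collections import Counter

def count_lexicon_hits(text, lexicons):
    """Count how many tokens in `text` belong to each named word list.

    `lexicons` maps a lexicon name to a list of words, mirroring the
    shape of `stop_words_lexicon` above (a hypothetical helper).
    """
    # crude tokenizer for illustration: lowercase runs of letters/apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    token_counts = Counter(tokens)
    return {
        name: sum(token_counts[word] for word in set(words))
        for name, words in lexicons.items()
    }

# toy word lists, not the NLTK, Empath or sentiment lexicons loaded above
toy_lexicons = {
    'stop_words': ['the', 'a', 'was', 'it'],
    'positive': ['excellent', 'great'],
}
lexicon_counts = count_lexicon_hits('It was excellent! The cast was great.', toy_lexicons)
print(lexicon_counts)
```

Each lexicon becomes one numeric feature per text, so a handful of word lists yields a small, interpretable feature set.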

## 2. Load and inspect data

### 2.1 Choose a dataset and preview the labels

Below you can select a dataset for the assignment. The options are `sentiment`, `essay` and `genre`. Change the value of `dataset_option` below. Datasets hosted on Huggingface.co are downloaded automatically, and a link to the dataset card with more information is printed. The `genre` dataset was distributed with this notebook.   

Note:  The `movie_reviews` dataset is being used to demonstrate the notebook and is not one of your options for the assignment.  

In [None]:
# Choose 'essay', 'sentiment', or 'genre' ('movie_reviews' is just for testing/demonstration)
dataset_option = 'movie_reviews' 

if dataset_option == 'movie_reviews':
	dataset_name = 'polsci/sentiment-polarity-dataset-v2.0'
	dataset_dir = None
	target_labels = ['neg', 'pos']
	text_column = 'text'
	label_column = 'label'
	train_split_name = 'train'
	test_split_name = 'train'
	print('The movie_reviews dataset is for demonstrating the notebook and is not an assignment option.')
elif dataset_option == 'sentiment':
	dataset_name = 'cardiffnlp/tweet_eval'
	dataset_dir = 'sentiment'
	target_labels = ['negative', 'neutral', 'positive']
	text_column = 'text'
	label_column = 'label'
	train_split_name = 'train'
	test_split_name = 'validation'
	print('You selected the sentiment dataset. Read more about this at https://huggingface.co/datasets/cardiffnlp/tweet_eval')
elif dataset_option == 'essay':
	dataset_name = 'polsci/ghostbuster-essay-cleaned'
	dataset_dir = None
	target_labels = ['claude', 'gpt', 'human']
	text_column = 'text'
	label_column = 'label'
	train_split_name = 'train'
	test_split_name = 'test'
	print('You selected the essay dataset. Read more about this at https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned')
elif dataset_option == 'genre':
    dataset_name = 'genre'
    dataset_type = 'json'
    dataset_dir = 'genre_dataset.json'
    target_labels = ['Fiction', 'Letter', 'Notice', 'Obituary', 'Poetry or verse', 'Recipe', 'Review']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'test'
    print('You selected the genre dataset.')
else:
	print('Try again! That was not an option!')

#### Important notes about specific datasets:

* Make sure you go to the relevant Huggingface page to read more about the [essay](https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned) and [sentiment](https://huggingface.co/datasets/cardiffnlp/tweet_eval/viewer/sentiment) datasets. Note the sentiment dataset is one subset of the larger 'tweet_eval' dataset.  
* For the *sentiment* dataset, it is challenging to get good accuracy with three classes. If you like you can remove the `neutral` class. There is a cell below that does this for you - don't change the cell above.
* For the *essay* dataset, there are differences in punctuation between classes. You should use `character_replacements = {"’": "'", '“': '"', '”': '"',}` in the `TextCleaner` component in your pipeline to make sure you are not overfitting to a quirk of the data.
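
As a rough illustration of what that replacement mapping achieves, the sketch below applies it with Python's built-in `str.translate`. This is an assumption about the effect, not `TextCleaner`'s actual implementation, and `normalise_quotes` is a hypothetical helper name.

```python
# the mapping suggested above: curly quotes -> straight quotes
character_replacements = {"’": "'", '“': '"', '”': '"'}

def normalise_quotes(text, replacements):
    """Replace each single character in `replacements` with its value."""
    return text.translate(str.maketrans(replacements))

sample = '“It’s fine,” she said.'
normalised = normalise_quotes(sample, character_replacements)
print(normalised)
```

After normalisation, a curly apostrophe can no longer act as a shortcut feature that separates one class from another.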

This loads the dataset. 

In [None]:
if dataset_option != 'genre': # if loading from huggingface ...
    dataset = load_dataset(dataset_name, data_dir=dataset_dir)
else: # if loading the genre dataset from the provided json file
    dataset = load_dataset(dataset_type, data_files=dataset_dir)
    train_dataset = dataset['train'].filter(lambda example: example['split'] == 'train')
    test_dataset = dataset['train'].filter(lambda example: example['split'] == 'test')
    dataset = DatasetDict({
        'train': train_dataset,
        'test': test_dataset
        })

This cell will show you information on the dataset fields and the splits.

In [None]:
preview_dataset(dataset)

Here is the breakdown of labels in each dataset split.

In [None]:
# casting label column to ClassLabel if not already
cast_column_to_label(dataset, label_column)
label_names = get_label_names(dataset, label_column)

dfs = {}
for split in dataset.keys():
    dfs[split] = dataset[split].to_pandas()
    dfs[split].insert(1, 'label_name', dfs[split][label_column].apply(lambda x: dataset[split].features[label_column].int2str(x)))
    print('Labels for {}:'.format(split))
    preview_label_counts(dfs[split], label_column, label_names)

### 2.2 Configure the labels (optional)

* You can override the default labels for the dataset here to make the task more or less challenging. High accuracy does not guarantee a high grade. 
* See the assignment instructions and the dataset card or corresponding paper for explanations of the data.  
* Read the comments below, uncomment the relevant line for your dataset, and amend the label names if needed.
* Remember, this is optional.

In [None]:
# for the movie reviews dataset (this is just for testing/demonstration) - there are 2 labels and that is it!

# for the sentiment dataset - there are 3 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['negative', 'neutral']
#target_labels = ['negative', 'positive']
#target_labels = ['neutral', 'positive']

# for the essay dataset - there are 3 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['claude', 'gpt']
#target_labels = ['human', 'gpt'] 
#target_labels = ['human', 'claude']

# for the genre dataset - there are 7 labels - you can turn the task into one or more binary classification problems using options such as:
#target_labels = ['Letter', 'Notice']
#target_labels = ['Letter', 'Fiction']
#target_labels = ['Review', 'Fiction']
#target_labels = ['Notice', 'Obituary']

print(target_labels)

### 2.3 Prepare the train and test splits

* This cell handles the train-test split for you.
* Some of the datasets are imbalanced. This cell will balance the training data using under-sampling.

In [None]:
target_classes = [label_names.index(name) for name in target_labels]
target_names = [label_names[i] for i in target_classes]

if train_split_name == test_split_name:
    X = dataset[train_split_name].to_pandas()
    X.insert(1, 'label_name', dfs[train_split_name][label_column].apply(lambda x: dataset[train_split_name].features[label_column].int2str(x)))
    y = np.array(dataset[train_split_name][label_column])

    mask = np.isin(y, target_classes)
    X = X.loc[mask]
    y = y[mask]

    # creating df splits with original data first  - so can look at the train data if needed
    dfs['train'], dfs['test'], y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # we're just using the text for features
    X_train = np.array(dfs['train'][text_column])
    X_test = np.array(dfs['test'][text_column])
else:
    X_train = np.array(dataset[train_split_name][text_column])
    y_train = np.array(dataset[train_split_name][label_column])
    X_test = np.array(dataset[test_split_name][text_column])
    y_test = np.array(dataset[test_split_name][label_column])

    mask = np.isin(y_train, target_classes)
    mask_test = np.isin(y_test, target_classes)

    X_train = X_train[mask]
    y_train = y_train[mask]
    X_test = X_test[mask_test]
    y_test = y_test[mask_test]

# this cell undersamples all but the minority class to balance the training data
X_train = X_train.reshape(-1, 1)
X_train, y_train = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
X_train = X_train.reshape(-1)

preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
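
The under-sampling step can be pictured with the following pure-Python sketch: every class is cut down to the size of the smallest class by random selection. The helper `undersample` is hypothetical, and the rows that `RandomUnderSampler` actually keeps will differ.

```python
import random
from collections import Counter, defaultdict

def undersample(texts, labels, seed=0):
    """Randomly keep only as many texts per class as the smallest class has."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    n_min = min(len(items) for items in by_label.values())
    balanced = []
    for label in sorted(by_label):
        # sample without replacement from each class
        for text in rng.sample(by_label[label], n_min):
            balanced.append((text, label))
    return [t for t, _ in balanced], [l for _, l in balanced]

# toy data: 7 documents of class 0 and 3 of class 1
toy_texts = [f'doc {i}' for i in range(10)]
toy_labels = [0] * 7 + [1] * 3
balanced_texts, balanced_labels = undersample(toy_texts, toy_labels)
print(Counter(balanced_labels))
```

Only the training data is balanced in this notebook; the test split keeps its original class distribution.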

### 2.4 Preview the texts

Time to get to know your data. We will only preview the train split.

In [None]:
y_train_names = {label_names[i] for i in np.unique(y_train)} # names of the labels present in the training data
display(dfs['train'][dfs['train']['label_name'].isin(y_train_names)].sample(10))

Enter the index (the number in the first column) as `selected_index` to see the row. The `limit` value controls how much of the text you see. Set a higher limit to see more of the text or set it to 0 to see all of the text.

In [None]:
# We can display the full text of a selected article by dataframe index
selected_index = 10

preview_row_text(dfs['train'], selected_index, text_column = text_column, limit=400) # change limit to see more of the text if needed

## 3. Create a classification pipeline and train a model

Create a scikit-learn pipeline to preprocess the texts and train a classification model. The pipeline components are introduced as you work through the notebook. A number of pipeline components are available through the `textplumber` package. You will have an opportunity to learn about these in labs, and documentation is [available here](https://geoffford.nz/textplumber).

To speed up preprocessing, some of the pipeline components cache their output so that repeated processing (e.g. tokenization with spaCy) does not have to be recomputed. Run this as is - it will create an SQLite file named after your dataset option in the directory of the notebook.

In [None]:
feature_store = TextFeatureStore(f'assignment-{dataset_option}.sqlite')
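
The general caching idea can be sketched with the standard-library `sqlite3` module. This is not `TextFeatureStore`'s schema or API: the class `ToyFeatureCache`, its table layout and the hashing scheme are assumptions chosen for illustration.

```python
import hashlib
import json
import sqlite3

class ToyFeatureCache:
    """Hypothetical cache: stores a computed value per text in SQLite."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)'
        )
        self.misses = 0  # how many times we had to compute from scratch

    def get_or_compute(self, text, compute):
        # the hash of the text is the lookup key
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        row = self.conn.execute(
            'SELECT value FROM cache WHERE key = ?', (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        self.misses += 1
        value = compute(text)
        self.conn.execute(
            'INSERT INTO cache (key, value) VALUES (?, ?)', (key, json.dumps(value))
        )
        self.conn.commit()
        return value

cache = ToyFeatureCache()
first = cache.get_or_compute('It was excellent!', str.split)
second = cache.get_or_compute('It was excellent!', str.split)  # served from the cache
print(first, second, cache.misses)
```

The second call returns the stored tokens without calling `compute` again, which is where the time saving comes from when a slow step is re-run on the same texts.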

The pipeline below includes a number of different components. Most are commented out on the first run of the notebook. There are lots of options for each component. You will need to look at the documentation and examples in labs to learn about these. These components can extract different kinds of features, any of which can be applied to build a model. The feature types include:

* Token features  
* Bigram features  
* Parts of speech features
* Lexicon-based features  
* Document-level statistics  
* Text embeddings
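
As a small illustration of the first two feature types, the sketch below counts single tokens and bigrams in plain Python. It only approximates what a count vectorizer does; `ngram_counts` and its regex tokenizer are hypothetical and are not part of `textplumber`.

```python
import re
from collections import Counter

def ngram_counts(text, n=1, stop_words=frozenset()):
    """Count n-grams of lowercased tokens, optionally dropping stop words."""
    tokens = [
        t for t in re.findall(r"[a-z']+", text.lower())
        if t not in stop_words
    ]
    # join each window of n consecutive tokens into one feature name
    return Counter(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = 'The plot was thin but the acting was great'
unigrams = ngram_counts(text, n=1, stop_words={'the', 'was', 'but'})
bigrams = ngram_counts(text, n=2)
print(unigrams)
print(bigrams)
```

Each distinct token or bigram becomes a column in the feature matrix, which is why options such as `max_features` matter for keeping the model small.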


In [None]:
pipeline = Pipeline([
	('cleaner', TextCleaner(strip_whitespace=True)), # for the essay dataset you should use character_replacements = {"’": "'", '“': '"', '”': '"',}
	('spacy', SpacyPreprocessor(feature_store=feature_store)),
	('features', FeatureUnion([
		('tokens', # token features - these can be single tokens or ngrams of tokens using TokensVectorizer - see textplumber documentation for examples
			Pipeline([
				('spacy_token_vectorizer', TokensVectorizer(feature_store = feature_store, vectorizer_type='count', max_features=100, lowercase = True, remove_punctuation = True, stop_words = stop_words, min_df=0.0, max_df=1.0, ngram_range=(1, 1))),
				# ('selector', SelectKBest(score_func=mutual_info_classif, k=100)), # uncomment for feature selection
				# ('scaler', StandardScaler(with_mean=False)),
				], verbose = True)),

		# ('pos', # pos features - these can be a single label or ngrams of pos tags using POSVectorizer - see textplumber documentation for examples
		# 	Pipeline([
		# 		('spacy_pos_vectorizer', POSVectorizer(feature_store=feature_store)),
		# 		#('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
		# 		('scaler', StandardScaler(with_mean=False)),
		# 		], verbose = True)),

		#('textstats', # document-level text statistics using TextstatsTransformer - see textplumber documentation for examples
		# 	Pipeline([
		# 		('textstats_vectorizer', TextstatsTransformer(feature_store=feature_store)),
		# 		('scaler', StandardScaler(with_mean=False)),
		# 		], verbose = True)),

		# ('lexicon', # lexicon features - defined above are empath_lexicons, vader_lexicons and stop_words_lexicon - see textplumber documentation for examples
		# 	Pipeline([
		# 		('lexicon_vectorizer', LexiconCountVectorizer(feature_store=feature_store, lexicons=empath_lexicons)), # the notebook has already provided example lexicons right at the top!
		#  		#('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
		# 		('scaler', StandardScaler(with_mean=False)),
		# 		], verbose = True)),

		# ('embeddings', Model2VecEmbedder(feature_store=feature_store)), # extract embeddings using Model2Vec - see textplumber documentation for examples

		], verbose = True)),
	
	('classifier', LogisticRegression(max_iter=5000, random_state=42)) # for logistic regression - only select one classifier!
    #('classifier', DecisionTreeClassifier(max_depth = 3, random_state=42)) # for decision tree - only select one classifier!
], verbose = True) # using verbose because I like to see what is going on

display(pipeline)


In [None]:
pipeline.fit(X_train, y_train)

Run the predictions and output model metrics and a confusion matrix using this cell.

In [None]:
y_predicted = pipeline.predict(X_test)
print(classification_report(y_test, y_predicted, target_names = target_names, digits=3, zero_division=0))
plot_confusion_matrix(y_test, y_predicted, target_classes, target_names)
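
To help read the classification report, here is a minimal sketch of how per-class precision, recall and F1 (and the macro-averaged F1) are derived from true and predicted labels. The helper `per_class_metrics` and the toy labels are illustrative assumptions; undefined ratios are treated as 0, matching `zero_division=0` above.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each label from paired predictions."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return metrics

# toy example: one 'neg' mistaken for 'pos' and one 'pos' mistaken for 'neg'
toy_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
toy_pred = ['neg', 'neg', 'pos', 'pos', 'pos', 'neg']
toy_metrics = per_class_metrics(toy_true, toy_pred, ['neg', 'pos'])
macro_f1 = sum(m['f1'] for m in toy_metrics.values()) / len(toy_metrics)
print(toy_metrics)
print(round(macro_f1, 3))
```

Macro F1 averages the per-class F1 scores with equal weight, so a poorly predicted class lowers the score regardless of how many test documents it has.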

The cell below is commented out, but you have the option to uncomment it to run a grid search based on the pipeline you've created above.

In [None]:
# # Note: if you get a warning about tokenizers and parallelism - uncomment these two lines
# # import os
# # os.environ["TOKENIZERS_PARALLELISM"] = "false"

# # setup gridsearch to test different max_features
# from sklearn.model_selection import GridSearchCV
# param_grid = {
#     'features__tokens__spacy_token_vectorizer__max_features': [50, 100, 150, 200, 250, 300],  # this assumes you are using the tokens part of the pipeline
#     # 'features__tokens__selector__k': [50, 100, 150, 200, 250, 300],  # this assumes you have enabled the selector for tokens
# }
# grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_macro', verbose=100, n_jobs=1)
# grid_search.fit(X_train, y_train)

# print('\n-----------------------------------------------------------------')
# print("Best parameters found: ", grid_search.best_params_)
# print("Best score found: ", grid_search.best_score_)
# print('-----------------------------------------------------------------\n')

# y_pred = grid_search.predict(X_test)

# print(classification_report(y_test, y_pred, target_names = target_names, digits=3))
# plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
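
The sketch below shows, in plain Python, how a parameter grid expands into candidate settings and how the double-underscore names address nested pipeline steps. The `expand_grid` helper is hypothetical, and adding `classifier__C` to the grid is an illustrative assumption rather than part of the cell above.

```python
from itertools import product

def expand_grid(param_grid):
    """List every combination of the values in `param_grid`."""
    names = sorted(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]

toy_grid = {
    'features__tokens__spacy_token_vectorizer__max_features': [50, 100, 150],
    'classifier__C': [0.1, 1.0],  # assumed extra parameter for illustration
}
candidates = expand_grid(toy_grid)
n_fits = len(candidates) * 3  # each candidate is fitted once per fold with cv=3

# each '__' steps one level deeper: step -> sub-step -> component -> parameter
path = 'features__tokens__spacy_token_vectorizer__max_features'.split('__')
print(len(candidates), n_fits)
print(path)
```

Because the number of fits is the product of all value lists times the number of folds, adding parameters to the grid increases run time quickly.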

## 4. Evaluate your model and investigate model predictions

You already have some metrics in the cell above. Below is some additional reporting to help you understand your model.

### 4.1 Classifier-specific features

If you are using a Decision Tree classifier in your pipeline, this will plot it ...

In [None]:
if pipeline.named_steps['classifier'].__class__.__name__ == 'DecisionTreeClassifier':
    plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, 'classifier', 'features')
else:
    print('The classifier is not a decision tree - so no plot is shown!')

If you are using a Logistic Regression classifier in your pipeline, this will plot the coefficients of the features in the model.


In [None]:
if pipeline.named_steps['classifier'].__class__.__name__ == 'LogisticRegression':
	plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name = 'classifier', features_step_name = 'features')
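
To interpret the coefficient plot, it may help to see how coefficients turn feature values into a probability. The sketch below covers the binary case only, with made-up coefficients; `predict_positive_probability` is a hypothetical helper, and multi-class models generalise this with one set of coefficients per class.

```python
import math

def predict_positive_probability(features, coefficients, intercept):
    """Binary logistic regression: sigmoid of the weighted sum of features."""
    score = intercept + sum(
        coefficients[name] * value
        for name, value in features.items()
        if name in coefficients  # features the model has never seen contribute nothing
    )
    return 1.0 / (1.0 + math.exp(-score))

# made-up coefficients: positive values push towards the positive class
toy_coefficients = {'excellent': 1.2, 'terrible': -1.5}
p_pos = predict_positive_probability({'excellent': 1, 'movie': 1}, toy_coefficients, intercept=0.0)
p_neg_doc = predict_positive_probability({'terrible': 1}, toy_coefficients, intercept=0.0)
print(round(p_pos, 3), round(p_neg_doc, 3))
```

A large positive coefficient means the feature pushes predictions towards that class, and a large negative coefficient pushes away from it. If features are scaled in the pipeline, the coefficients apply to the scaled values.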

### 4.2 Investigate correct and incorrect predictions

To see the predictions of your model, run this cell. The output can be quite long depending on the dataset and the number of misclassifications. The pandas `display.max_rows` option is set at the top of the cell to restrict the length of the output. You can adjust this as required. It is reset to the pandas default at the end of the cell.

In [None]:
# adjust max rows
pd.set_option('display.max_rows', 5) # limit the number of rows shown per dataframe - increase to see more

# creating dataframe from y_predicted, y_test and the text
predictions_df = pd.DataFrame(data = {'true': y_test, 'predicted': y_predicted})
y_predicted_probs = pipeline.predict_proba(X_test)
y_predicted_probs = np.round(y_predicted_probs, 3)
columns = [f'{target_names[i]}_prob' for i in range(len(target_names))]
predictions_df['predicted'] = predictions_df['predicted'].apply(lambda x: label_names[x])
predictions_df['true'] = predictions_df['true'].apply(lambda x: label_names[x])
predictions_df['correct'] = predictions_df['true'] == predictions_df['predicted']
predictions_df['text'] = X_test
predictions_df = pd.concat([predictions_df, pd.DataFrame(y_predicted_probs, columns=columns)], axis=1)

# output a preview of docs for each cell of confusion matrix ...
for true_target in range(len(target_names)):
    for predicted_target in range(len(target_names)):
        if true_target == predicted_target:
            print(f'\nCORRECTLY CLASSIFIED: {target_names[true_target]}')
        else:
            print(f'\n{target_names[true_target]} INCORRECTLY CLASSIFIED as: {target_names[predicted_target]}')
        print('=================================================================')

        display(predictions_df[(predictions_df['true'] == target_names[true_target]) & (predictions_df['predicted'] == target_names[predicted_target])])

pd.set_option('display.max_rows', 60) # setting back to the default

### 4.3 Run inference on new (or old) data

You can also run inference on new data (or any of the texts from training/validation) by changing the contents of the `texts` list below. This outputs a prediction, the probabilities of each class and the features present within the text that are used by the model to make its predictions. The numbers for each feature are the input to the final step of the pipeline. They may be scaled or transformed depending on the pipeline components you've chosen.

In [None]:
texts = ['''
It was excellent!
''',
		'''
This was a terrible movie!
''',
	'''
This might not not be the best movie ever made, or it could be the best movie of no time.
''',
]

y_inference = pipeline.predict(texts)

preprocessor = Pipeline(pipeline.steps[:-1])
feature_names = preprocessor.named_steps['features'].get_feature_names_out()

for i, text in enumerate(texts):
	print(f"Text {i}: {text}")
	
	print(f"\tPredicted class: {label_names[y_inference[i]]}")
	print()

	y_inference_proba = pipeline.predict_proba([text])
	for k, prob in enumerate(y_inference_proba[0]): # k avoids shadowing the outer loop variable i
		print(f"\tProbability of class {target_names[k]}: {prob:.2f}")

	print()
	print("\tFeatures:")

	embeddings = 0
    
	frequencies = preprocessor.transform([text])
	if not isinstance(frequencies, np.ndarray):
		frequencies = frequencies.toarray()
	frequencies = frequencies[0].T
    
	for j, freq in enumerate(frequencies):
		if feature_names[j].startswith('embeddings_'):
			embeddings += 1
		elif freq > 0:
			print(f"\t{feature_names[j]}: {freq:.2f}")
	if embeddings > 0:
		print(f"\tFeatures also include {embeddings} embedding dimensions")

	print()
