# Evaluation

In this notebook, we evaluate the fine-tuned models on the held-out test set using mean squared error (MSE) and quadratic weighted kappa (QWK).

In [None]:
# Import the tokenizer class and the pipeline factory used for model inference
from transformers import AutoTokenizer, pipeline

# tqdm is a progress bar that visualizes inference progress
from tqdm.auto import tqdm

# Import DatasetDict, which helps us prepare our own dataset for training and evaluating machine learning models
from datasets import DatasetDict

# Import library that helps us work with arrays
import numpy as np

# Import functions for model evaluation
from sklearn.metrics import mean_squared_error, cohen_kappa_score

In [20]:
# Load the DatasetDict object we created in the previous notebook. 
datadict = DatasetDict.load_from_disk('../data/ellipse.hf/')

# We are specifically interested in using the test set since we are in our model evaluation phase
ds = datadict['test']

In [64]:
# This function will compare the model's predictions to the actual labels, evaluate the model's performance using two different metrics,
# and print out the results.
def evaluate_performance(labels, predictions):
    mse = mean_squared_error(labels, predictions)
    qwk = cohen_kappa_score(np.round(labels),
                            np.round(predictions),
                            weights='quadratic')
    
    print('Mean squared error:', mse)
    print('Quadratic Weighted Kappa:', qwk)
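
To make the second metric concrete, here is a minimal pure-Python sketch of quadratic weighted kappa. The helper `quadratic_weighted_kappa` is hypothetical (it is not part of scikit-learn) and assumes integer ratings with at least two distinct categories; under those assumptions it should agree with `cohen_kappa_score(..., weights='quadratic')`. It also shows that rounding half-point scores goes to the nearest even integer, which matters if the labels contain values such as 2.5 (an assumption about this dataset).

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b):
    """Hypothetical helper: QWK for two equal-length sequences of integer ratings."""
    if len(rater_a) != len(rater_b):
        raise ValueError('Both raters must score the same number of samples.')

    # Map each observed category to its rank, so the penalty depends on rank distance
    categories = sorted(set(rater_a) | set(rater_b))
    if len(categories) < 2:
        raise ValueError('QWK is undefined with fewer than two categories.')
    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)

    # Observed joint counts and the marginal counts of each rater
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    hist_a = Counter(index[a] for a in rater_a)
    hist_b = Counter(index[b] for b in rater_b)

    # Disagreement is penalized by the squared distance between categories
    observed_cost = sum((i - j) ** 2 * count for (i, j), count in observed.items())
    expected_cost = sum((i - j) ** 2 * hist_a[i] * hist_b[j] / n
                        for i in hist_a for j in hist_b)

    return 1.0 - observed_cost / expected_cost


# Perfect agreement gives 1.0; fully reversed ratings give -1.0
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])
reversed_ratings = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1])

# Python's round (like np.round) rounds halves to the nearest even integer
rounded_halves = [round(score) for score in [2.5, 3.5]]
```

Because the penalty grows with the squared distance, a prediction that is off by two score points costs four times as much as one that is off by one point.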

In [76]:
# Model inference pipeline that uses our finetuned model
def predict(dataset, score_to_predict):
    
    pipe = pipeline('text-classification',
                    model=f'../bin/{score_to_predict.lower()}-model/',
                    truncation=True,
                    function_to_apply='none',
                   )
    
    # function_to_apply='none' returns the raw model output (no sigmoid or softmax) as the score
    predictions = [pipe(text)[0]['score'] for text in tqdm(dataset['text'])]
    
    return predictions

In [63]:
# Run model inference for the grammar prediction model
grammar_predictions = predict(ds, 'grammar')

# Evaluate model performance
evaluate_performance(ds['Grammar'], grammar_predictions)

  0%|          | 0/973 [00:00<?, ?it/s]

Mean squared error: 0.26103419570908415
Quadratic Weighted Kappa: 0.5627542019572562


In [77]:
# Do the same for the vocabulary model
vocabulary_predictions = predict(ds, 'vocabulary')
evaluate_performance(ds['Vocabulary'], vocabulary_predictions)

  0%|          | 0/973 [00:00<?, ?it/s]

Mean squared error: 0.19491337478093973
Quadratic Weighted Kappa: 0.552087599588894


## Drilling down on Truncation

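Truncation has a small but consistent effect on model performance: for both scores, samples longer than the 512-token limit have a slightly higher MSE and a QWK roughly 0.03 to 0.04 lower than samples that fit within it.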

In [78]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def would_be_truncated(sample):
    input_ids = tokenizer(sample['text'], truncation=False)['input_ids']
    return len(input_ids) > 512

# Boolean array that indicates whether each sample would be truncated
truncated = np.array([would_be_truncated(sample) for sample in ds])

Token indices sequence length is longer than the specified maximum sequence length for this model (685 > 512). Running this sequence through the model will result in indexing errors


### Grammar
The quadratic weighted kappa between human raters for grammar score was 0.593, so the model's QWK of 0.538 on truncated texts (versus 0.573 on untruncated texts) is likely to be sufficient. As mentioned elsewhere, methods exist to overcome the model's maximum input length if necessary.
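
One common way to work around a maximum input length is to split a long token sequence into overlapping windows, score each window, and average the results. The sketch below illustrates that idea only; it is not necessarily the method referred to above. The helpers `sliding_windows` and `aggregate_scores`, the stride of 256, and the choice of averaging are all assumptions made for illustration.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Hypothetical helper: split token ids into overlapping windows of at most max_len."""
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError('Require 0 < stride <= max_len.')

    # Short sequences fit in a single window and need no splitting
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


def aggregate_scores(window_scores):
    """Hypothetical helper: combine per-window predictions by simple averaging."""
    return sum(window_scores) / len(window_scores)


# A 685-token sequence (the length reported in the tokenizer warning) yields two windows
windows = sliding_windows(list(range(685)))
window_lengths = [len(window) for window in windows]
```

In practice the window size would need to leave room for any special tokens the tokenizer adds, and whether averaging is appropriate for essay-level scores would need to be checked empirically.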

In [71]:
print('Performance on samples that were truncated:')
evaluate_performance(np.array(ds['Grammar'])[truncated], np.array(grammar_predictions)[truncated])

Performance on samples that were truncated:
Mean squared error: 0.2680961383174671
Quadratic Weighted Kappa: 0.5381294964028777


In [72]:
print('Performance on samples that were NOT truncated:')
evaluate_performance(np.array(ds['Grammar'])[~truncated], np.array(grammar_predictions)[~truncated])

Performance on samples that were NOT truncated:
Mean squared error: 0.2563100831580218
Quadratic Weighted Kappa: 0.5731561102648518


### Vocabulary
The quadratic weighted kappa between human raters for vocabulary score was 0.518. The same pattern holds for the vocabulary score predictions: QWK is 0.500 on truncated samples and 0.530 on samples that were not truncated.

In [79]:
print('Performance on samples that were truncated:')
evaluate_performance(np.array(ds['Vocabulary'])[truncated], np.array(vocabulary_predictions)[truncated])

Performance on samples that were truncated:
Mean squared error: 0.22931245016904137
Quadratic Weighted Kappa: 0.4999765467423425


In [80]:
print('Performance on samples that were NOT truncated:')
evaluate_performance(np.array(ds['Vocabulary'])[~truncated], np.array(vocabulary_predictions)[~truncated])

Performance on samples that were NOT truncated:
Mean squared error: 0.17190198644241547
Quadratic Weighted Kappa: 0.5297352790203124
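
To put the truncation effect in one place, the short sketch below computes the QWK gaps from the values reported above, rounded to three decimals. The `results` dictionary and the `qwk_gaps` helper are introduced here for illustration only.

```python
# QWK values copied from the cell outputs and text above (rounded to three decimals)
results = {
    'grammar': {'truncated': 0.538, 'not_truncated': 0.573, 'human': 0.593},
    'vocabulary': {'truncated': 0.500, 'not_truncated': 0.530, 'human': 0.518},
}


def qwk_gaps(scores):
    """Hypothetical helper: QWK lost to truncation and distance to human agreement."""
    return {
        'truncation_gap': scores['not_truncated'] - scores['truncated'],
        'gap_to_human': scores['human'] - scores['truncated'],
    }


gaps = {name: qwk_gaps(scores) for name, scores in results.items()}
for name, gap in gaps.items():
    print(f"{name}: truncation gap {gap['truncation_gap']:.3f}, "
          f"gap to human raters {gap['gap_to_human']:.3f}")
```

These are point estimates on a single test split, so small differences between them should be read with caution.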
