In [None]:
#default_exp metrics

# Metrics

> Metrics, evaluations, and results for all models

In [None]:
from ought.starter import *
from ought.lstm import *
from ought.bart import *
from ought.gpt import *
import numpy as np
from tqdm import tqdm

## Metrics

One of the first orders of business is setting up a clear objective to optimize. Here, the goal is the highest possible accuracy on text classification on the test set, so the metric is accuracy.

In [None]:
#export
class Metrics:
    def __init__(self, json='data/valid.jsonl', samples=50):
        self.samples = uniform_samples(json, samples)
        print(f"loaded {len(self.samples)} samples")
        
    def accuracy(self, predict_func):
        "Fraction of valid responses from `predict_func` that match the sample label"
        hits = []
        for sample in self.samples:
            prompt = sample['text']
            response = predict_func(prompt)
            
            # this portion is specific to binary AI/NOT AI classification
            # it can be replaced with a callback
            if (response.upper() == 'NOT AI'):
                pred = 'False'
            elif (response.upper() == 'AI'):
                pred = 'True'
            else:
                # invalid responses are skipped, so they do not count as misses
                print(f"got invalid response: {response}")
                continue
                
            real = sample['label']
            hits.append(pred == real)
        
        # avoid dividing by zero when every response was invalid
        return np.mean(hits) if hits else float('nan')
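
The comment in `accuracy` notes that the AI/NOT AI mapping could be replaced with a callback. A minimal sketch of that idea follows. The names `parse_binary_label` and `accuracy_with_callback` are hypothetical and not part of the `ought` package, and stripping whitespace from the response is an added assumption.

```python
def parse_binary_label(response):
    """Map a raw model response to the dataset's label string, or None if invalid."""
    mapping = {'AI': 'True', 'NOT AI': 'False'}
    # .strip() is an assumption; the original compares the upper-cased response directly
    return mapping.get(response.strip().upper())

def accuracy_with_callback(samples, predict_func, parse=parse_binary_label):
    """Accuracy over the samples whose response `parse` accepts."""
    hits = []
    for sample in samples:
        pred = parse(predict_func(sample['text']))
        if pred is None:
            continue  # skip invalid responses, as the Metrics class does
        hits.append(pred == sample['label'])
    return sum(hits) / len(hits) if hits else float('nan')
```

A different task would only need a different `parse` function. The loop itself stays the same.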

In [None]:
metrics = Metrics('data/dev.jsonl', 10)

loaded 20 samples


The dataset is heavily imbalanced, but since the `Metrics` class uses a uniform sampler to draw the samples for checking accuracy, a function that always predicts a constant label should have 50% accuracy.

In [None]:
metrics.accuracy(lambda c: 'Not AI')

0.5

Perfect! Now we can test all our models. Please refer to the other pages/files for more details on what each model is/does.
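
To illustrate why a constant prediction lands at exactly 50% under balanced sampling, here is a small sketch on toy data. `balanced_sample` is a hypothetical stand-in for `uniform_samples`, whose actual implementation lives in `ought.starter` and is not shown here. The sketch assumes the sampler draws the same number of records per label, which would be consistent with 10 requested samples yielding 20 loaded ones.

```python
import random
from collections import defaultdict

def balanced_sample(records, per_class, seed=0):
    """Draw `per_class` records from each label (hypothetical stand-in for uniform_samples)."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record['label']].append(record)
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):
        picked.extend(rng.sample(by_label[label], per_class))
    return picked

# Toy imbalanced data: 10 'True' records and 90 'False' records.
records = [{'text': f'doc {i}', 'label': 'True' if i < 10 else 'False'} for i in range(100)]
subset = balanced_sample(records, per_class=10)

# A predictor that always answers 'False' is right on exactly half of the balanced subset.
constant_acc = sum(r['label'] == 'False' for r in subset) / len(subset)
print(len(subset), constant_acc)  # 20 0.5
```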

## Evaluating Individual Models

We'll now check the accuracy of each model individually. Note that this section may require a few restarts to clear GPU memory, as we are loading all models together. 

### GPT-2

In [None]:
%%time
%%capture
model = GPTLMClassifier(samples=2)
acc = metrics.accuracy(model.predict)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

CPU times: user 12.5 s, sys: 785 ms, total: 13.3 s
Wall time: 9.35 s


In [None]:
acc

0.5

In [None]:
%%time
%%capture
model = GPTMatmulClassifier(samples=6)
acc = metrics.accuracy(model.predict)

CPU times: user 11.5 s, sys: 825 ms, total: 12.3 s
Wall time: 8.52 s


In [None]:
acc

0.65

In [None]:
%%time
%%capture
model = GPTSimilarityClassifier(samples=6)
acc = metrics.accuracy(model.predict)

CPU times: user 11.7 s, sys: 784 ms, total: 12.5 s
Wall time: 8.63 s


In [None]:
acc

0.6

### LSTM

In [None]:
%%time
%%capture
model = LSTMClassifier(samples=500)
acc = metrics.accuracy(model.predict)

CPU times: user 19.2 s, sys: 6.53 s, total: 25.7 s
Wall time: 24.2 s


In [None]:
acc

0.65

### BART

In [None]:
%%time
%%capture
model = BARTClassifier(samples=5)
acc = metrics.accuracy(model.predict)

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m

CPU times: user 57.7 s, sys: 1.55 s, total: 59.3 s
Wall time: 33.2 s


In [None]:
acc

0.5

## Ensembling

Finally, we can put it all together and ensemble across all models. This *should* give the most accurate predictions.

In [None]:
#export
class EnsembleClassifier:
    def __init__(self):
        gpt_lm = GPTLMClassifier(samples=2)
        gpt_mm = GPTMatmulClassifier(samples=4)
        gpt_sm = GPTSimilarityClassifier(samples=4)
        lstm = LSTMClassifier(samples=10)
        bart = BARTClassifier(samples=4)
        self.models = [gpt_lm, gpt_mm, gpt_sm, lstm, bart]
        
    def predict(self, prompt):
        preds = [model.predict(prompt) for model in self.models]
        return max(set(preds), key=preds.count)        
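
The vote in `predict` picks the most frequent response. Since `set` iteration order is not fixed, a tie could resolve differently between runs, and responses that differ only in case (such as `'Not AI'` and `'NOT AI'`) are counted separately. A possible variant, sketched here under the assumption that responses are plain strings, normalizes them first and breaks ties by first occurrence. `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(preds):
    """Return the most common normalized label; ties go to the label seen first."""
    normalized = [p.strip().upper() for p in preds]
    # Counter.most_common orders equal counts by first occurrence (Python 3.7+)
    return Counter(normalized).most_common(1)[0][0]
```

With five binary classifiers a tie can only occur when some model returns an invalid response, so the tie-break mostly matters in that case.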

Unfortunately, there is not enough VRAM on this machine to run *all* models together. But running the following cells on a machine with sufficient memory will display the timings and results.

In [None]:
%%time
%%capture
model = EnsembleClassifier()

In [None]:
%%time
%%capture
acc = metrics.accuracy(model.predict)

In [None]:
acc
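
One possible way around the memory limit is to never hold all models at once: load a model, cache its predictions for every sample, release it, and vote afterwards. The sketch below assumes each classifier exposes `predict(text)` as above. `ensemble_accuracy_sequential` and the factory-list argument are hypothetical, and whether memory is actually released depends on how the classifiers hold their weights.

```python
import gc
from collections import Counter

def ensemble_accuracy_sequential(model_factories, samples):
    """Load one model at a time, cache its predictions, then vote per sample."""
    all_preds = []
    for make_model in model_factories:
        model = make_model()
        all_preds.append([model.predict(s['text']).strip().upper() for s in samples])
        del model     # drop the only reference so the model can be freed
        gc.collect()  # with PyTorch one might also call torch.cuda.empty_cache()

    hits = []
    for i, sample in enumerate(samples):
        votes = Counter(preds[i] for preds in all_preds)
        winner = votes.most_common(1)[0][0]
        pred = {'AI': 'True', 'NOT AI': 'False'}.get(winner)
        if pred is not None:  # skip samples whose winning response is invalid
            hits.append(pred == sample['label'])
    return sum(hits) / len(hits) if hits else float('nan')
```

It could be called with something like `[lambda: GPTLMClassifier(samples=2), lambda: BARTClassifier(samples=4)]` and `metrics.samples`. The trade-off is that the result is an accuracy figure only, not a reusable `predict` function.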

## Improvements

None of the above models seem to be doing particularly well, but `GPTSimilarityClassifier` (0.6) and `GPTMatmulClassifier` (0.65) seem to have a slight edge, with `LSTMClassifier` also reaching 0.65. Additionally, some prompt engineering *could* be applied to the GPT-2 and BART models, but a substantial improvement seems unlikely.
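
When comparing these numbers it helps to keep the sample size in mind. As a rough sketch, a normal-approximation interval for an accuracy of 0.65 measured on 20 samples can be computed as below. `accuracy_interval` is a hypothetical helper, the approximation is crude at this sample size, and it assumes all 20 responses were valid.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

low, high = accuracy_interval(0.65, 20)
print(f"{low:.2f} to {high:.2f}")  # 0.44 to 0.86
```

The interval still contains 0.5, so on this sample the edge over a constant prediction is suggestive rather than conclusive.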