This notebook can be run on Colab for free. We're just finetuning flan-t5-base (250M parameters, in between bert-base and bert-large in size) with LoRA.

## Setup



In [1]:
!pip install transformers datasets evaluate rouge_score peft openai langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting peft
  Downloading peft-0.3.0-py3-none-any.whl (56 kB)

## Data

The dataset was generated with [this script](https://github.com/hwchase17/langchain/blob/26aff89b955193ced981a97d79d97364146e72a9/langchain/experimental/finetune/retrieval_qa/dataset.py#L190).

For now, the dataset used here can be found in [this commit](https://github.com/hwchase17/langchain/commit/26aff89b955193ced981a97d79d97364146e72a9).

In [7]:
fp = "/content/t2t_qa_ds_2023_05_17.json"
import json
with open(fp) as f:
  dataset = json.load(f)

print(dataset[0]["question_with_context"])
print(dataset[0]["answer"])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

before this deep learning revolution all throughout the winters and the summers of AI? Sure. First, I would say as a side point, the winters and summers of AI are greatly exaggerated by Americans and in that, if you look at the publication record of the artificial intelligence community since say the 1950s, you would find a pretty steady growth in advance of ideas and papers. And what's thought of as an AI winter or summer was sort of how much money is the US military pumping into AI, which was meaningful. On the other hand, there was AI going on in Germany, UK and in Japan and in Russia, all

of what most people think of as AI, if you dream of the possibilities of AI, it's really expert systems. And those hit a few walls and there was challenges there. And I think, yes, they will reemerge again with some new breakthroughs a

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)


In [8]:
inputs = tokenizer(dataset[0]["question_with_context"], return_tensors="pt")
outputs = model.generate(**{k: v.to(model.device) for k, v in inputs.items()})
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['I think the cycles are inevitable, but I think each time we get higher, right?']


flan-t5 (even the base model) is fairly capable, though the maximum-likelihood behavior here is apparently to copy text from the context rather than synthesize an answer.

In [9]:
def tokenize_seq2seq_dataset(dataset_fp: str, tokenizer_name: str):
    from datasets import load_dataset
    from torch.nn import CrossEntropyLoss
    from transformers import AutoTokenizer, DataCollatorForSeq2Seq

    assert dataset_fp.endswith(".json")
    dataset_dict = load_dataset("json", data_files=dataset_fp)
    assert set(dataset_dict.keys()) == {"train"}
    dataset = dataset_dict["train"]

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # Sanity check.
    tokenized_inputs = dataset.map(lambda x: tokenizer(x["question_with_context"]), batched=True)
    max_input_length = max(len(x) for x in tokenized_inputs["input_ids"])
    tokenized_inputs = dataset.map(lambda x: tokenizer(x["answer"]), batched=True)
    max_output_length = max(len(x) for x in tokenized_inputs["input_ids"])

    print(
        f"context size: {tokenizer.model_max_length}, max source length: {max_input_length}, max target length: {max_output_length}"
    )

    def tokenize(example):
        """Tokenize inputs and labels without truncation.
        Over-length inputs are filtered out after tokenization (see the filter below).
        Padding is applied batch-wise by the collator, which pads labels with the loss ignore index.
        """
        tokenized = tokenizer(
            example["question_with_context"],
        )
        labels = tokenizer(text_target=example["answer"])
        tokenized["labels"] = labels["input_ids"]
        return tokenized

    tokenized_dataset = dataset.map(
        tokenize,
        batched=True,
        #remove_columns=["question", "question_with_context", "answer"]
    )
    # Drop these samples for now rather than confuse the completions.
    tokenized_dataset = tokenized_dataset.filter(
        lambda example: len(example["input_ids"]) <= tokenizer.model_max_length
    )
    print(f"{len(dataset)=} {len(tokenized_dataset)=}")
    
    train_test = tokenized_dataset.train_test_split(train_size=0.8)
    
    train = train_test["train"].remove_columns(["question", "question_with_context", "answer"])
    
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        label_pad_token_id=CrossEntropyLoss().ignore_index,
    )
    return train, train_test["test"], data_collator

In [10]:
from datasets import set_caching_enabled
set_caching_enabled(False)

# !rm -r /root/.cache/huggingface/datasets/json

  set_caching_enabled(False)


In [11]:
train, test, collator = tokenize_seq2seq_dataset(fp, model_id)

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-380346926a52a3de/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-380346926a52a3de/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

context size: 512, max source length: 600, max target length: 96


Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1800 [00:00<?, ? examples/s]

len(dataset)=1800 len(tokenized_dataset)=1356


Oof, for now I'm dropping quite a few examples (444 of 1800) that exceed the context size.
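As a rough illustration of the filtering step, the sketch below splits examples by token count and recomputes the dropped fraction from the counts printed above. The helper name `filter_by_length` and the toy `lengths` list are hypothetical; only the 512 context size and the 1800/1356 counts come from the run.

```python
def filter_by_length(lengths, max_length):
    """Split example indices into kept and dropped, based on token count."""
    kept = [i for i, n in enumerate(lengths) if n <= max_length]
    dropped = [i for i, n in enumerate(lengths) if n > max_length]
    return kept, dropped


# Hypothetical token counts for five examples; 512 is the context size printed above.
lengths = [120, 512, 536, 600, 300]
kept, dropped = filter_by_length(lengths, max_length=512)
print(kept, dropped)

# Fraction dropped in the run above: 1800 examples before the filter, 1356 after.
dropped_fraction = (1800 - 1356) / 1800
print(f"{dropped_fraction:.1%}")
```

Truncating the context instead of dropping the example would keep more data, at the risk of cutting off the passage that contains the answer.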

In [12]:
train, test

(Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 1084
 }),
 Dataset({
     features: ['answer', 'question', 'question_with_context', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 272
 }))

In [13]:
train_val = train.train_test_split(train_size=0.8)
train, val = train_val["train"], train_val["test"]

In [14]:
train, val, test

(Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 867
 }),
 Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 217
 }),
 Dataset({
     features: ['answer', 'question', 'question_with_context', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 272
 }))
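The row counts above can be reproduced with a little arithmetic. This is a minimal sketch that assumes the training share is rounded down and the remainder goes to the held-out split, which is consistent with the printed sizes; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, train_size):
    """Return (train rows, held-out rows), assuming the train share is floored."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train


# First split: 1356 filtered examples -> train + test.
n_train_full, n_test = split_sizes(1356, 0.8)
# Second split: the training portion -> train + validation.
n_train, n_val = split_sizes(n_train_full, 0.8)
print(n_train, n_val, n_test)
```

Note that neither `train_test_split` call in the notebook sets a seed, so the exact membership of each split will differ between runs even though the sizes stay the same.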

In [15]:
# Validate tokenization.
from torch.utils.data import DataLoader
next(iter(DataLoader(train, batch_size=2, collate_fn=collator)))

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[ 2048,     8,   826,  ...,     0,     0,     0],
        [ 2048,     8,   826,  ..., 11801,    10,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[    3, 16977,     7,   447,   343,     7,   320,    21,  1155,    12,
          5530,  1002,    24,  3098,   756,     8,  1030,     6,   298,  2647,
         17382,   992,    30,     8,  1030,    13,   149, 15651,   930,     5,
             1],
        [   37,   192,   614,   378,    33,   492,     8,  7567,  7951,   631,
            12,  6815,    16,     3,     9, 11743,   194,    28,     8,  1164,
            11,   578,   490,   296,  7833,     5,     1,  -100,  -100,  -100,
          -100]])}
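The `-100` entries at the end of the second label row are padding that the loss ignores. The pure-Python sketch below shows the idea under simplifying assumptions: `pad_labels` and `masked_mean_nll` are hypothetical helpers, and `token_log_probs` stands in for the log-probability the model assigns to each target token.

```python
IGNORE_INDEX = -100  # same value as torch.nn.CrossEntropyLoss().ignore_index


def pad_labels(label_seqs, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]


def masked_mean_nll(token_log_probs, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row_log_probs, row_labels in zip(token_log_probs, labels):
        for log_prob, label in zip(row_log_probs, row_labels):
            if label != ignore_index:
                total -= log_prob
                count += 1
    return total / count


# Toy batch: two target sequences of different lengths (token ids are made up).
labels = pad_labels([[5, 6, 1], [7, 1]])
# Made-up log-probabilities; the last value sits on a padded position and is skipped.
token_log_probs = [[-0.1, -0.2, -0.3], [-0.4, -0.5, -9.0]]
loss = masked_mean_nll(token_log_probs, labels)
print(labels, loss)
```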

## Finetune

In [16]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

lora_config = LoraConfig(
  r=16,
  lora_alpha=32,
  target_modules=["q", "v"],
  lora_dropout=0.05,
  bias="none",
  task_type=TaskType.SEQ_2_SEQ_LM
)
model = prepare_model_for_int8_training(model)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 1769472 || all params: 249347328 || trainable%: 0.7096414524241463
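The trainable parameter count can be checked by hand. The sketch below assumes the flan-t5-base shapes (hidden size 768, attention inner dimension 768, 12 encoder blocks with self-attention, 12 decoder blocks with self- and cross-attention); these figures are assumptions about the model config rather than something printed in this notebook. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, in_features, out_features, n_modules):
    """Each adapted linear layer adds A (r x in_features) and B (out_features x r)."""
    return n_modules * r * (in_features + out_features)


# Assumed architecture: 12 encoder self-attention layers,
# plus 12 decoder self-attention and 12 decoder cross-attention layers.
n_attention_layers = 12 + 12 * 2
# target_modules=["q", "v"] adapts two projections per attention layer.
n_modules = n_attention_layers * 2

trainable = lora_param_count(r=16, in_features=768, out_features=768, n_modules=n_modules)
total = 249_347_328  # "all params" as printed above
trainable_percent = 100 * trainable / total
print(trainable, trainable_percent)
```

Under those assumptions the result agrees with the `print_trainable_parameters()` output: 72 adapted projections, each contributing 16 × (768 + 768) weights.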


In [17]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import numpy as np
import evaluate

evaluation_metrics = evaluate.combine(["bleu", "rouge"])

def compute_metrics(eval_preds):
  # Set Seq2SeqTrainer arg predict_with_generate=True to return tokens, not just loss + logits.
  labels, preds = eval_preds.label_ids, eval_preds.predictions
  # TODO: ignore_index in preds?
  preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
  return evaluation_metrics.compute(predictions=decoded_preds, references=decoded_labels)

output_dir = "lora-flan-t5-base"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-5,
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=val,
    data_collator=collator,
    compute_metrics=compute_metrics
)

trainer.train(resume_from_checkpoint=False)
# trainer.save_model(output_dir)

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]



Epoch,Training Loss,Validation Loss,Bleu,Precisions,Brevity Penalty,Length Ratio,Translation Length,Reference Length,Rouge1,Rouge2,Rougel,Rougelsum
1,1.0994,0.961624,0.129806,"[0.6067697450898454, 0.3993566176470588, 0.32010178117048343, 0.26330690826727066]",0.343368,0.483337,2393,4951,0.414528,0.278491,0.387583,0.38908
2,1.0184,0.885696,0.219511,"[0.6422326832548756, 0.45701849836779107, 0.3724409448818898, 0.30335628227194494]",0.514396,0.600687,2974,4951,0.509619,0.375471,0.477256,0.477476
3,0.9861,0.870471,0.227136,"[0.6384590055976292, 0.45851063829787236, 0.3734152900499424, 0.3028906577293674]",0.532471,0.613411,3037,4951,0.51581,0.382176,0.483421,0.484121
4,0.9621,0.860097,0.230935,"[0.638228590035819, 0.4576033637000701, 0.372772089495639, 0.3023543990086741]",0.542168,0.620279,3071,4951,0.519719,0.383022,0.486764,0.487272
5,1.0006,0.852558,0.233605,"[0.6478827361563518, 0.46547493866105855, 0.37708649468892264, 0.3037190082644628]",0.541884,0.620077,3070,4951,0.526536,0.390304,0.493012,0.493368
6,1.1832,0.849606,0.235907,"[0.6473069435431538, 0.46701570680628274, 0.3787764350453172, 0.3059210526315789]",0.545297,0.622501,3082,4951,0.528278,0.391924,0.493224,0.493573
7,0.9275,0.847513,0.235041,"[0.6410587475790833, 0.4602568552585908, 0.3738738738738739, 0.30269607843137253]",0.54984,0.625732,3098,4951,0.526425,0.390363,0.48922,0.489813
8,1.0649,0.844452,0.235177,"[0.6436669906057662, 0.4627177700348432, 0.3765548435733132, 0.30529339351661877]",0.546718,0.62351,3087,4951,0.526249,0.390904,0.490958,0.49099
9,0.9318,0.842625,0.234384,"[0.6461788617886178, 0.4632610216934919, 0.37712987504733053, 0.3068041237113402]",0.543307,0.621087,3075,4951,0.526158,0.390254,0.490168,0.490209
10,0.9108,0.842144,0.235909,"[0.6465629053177692, 0.4649459365190094, 0.3781132075471698, 0.30690221857025474]",0.545866,0.622904,3084,4951,0.527053,0.390986,0.491561,0.491885


Trainer is attempting to log a value of "[0.6067697450898454, 0.3993566176470588, 0.32010178117048343, 0.26330690826727066]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

TrainOutput(global_step=1090, training_loss=1.012335191079236, metrics={'train_runtime': 1371.8724, 'train_samples_per_second': 6.32, 'train_steps_per_second': 0.795, 'total_flos': 5921159120934912.0, 'train_loss': 1.012335191079236, 'epoch': 10.0})
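The step count and throughput in `TrainOutput` are consistent with a batch size of 8 on a single device. That batch size is an inference (it is the `Trainer` default, and `auto_find_batch_size` would only lower it after an out-of-memory error), not something the log states. A minimal sketch of the arithmetic, with `total_optimizer_steps` as a hypothetical helper:

```python
import math


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Optimizer steps without gradient accumulation; the last batch may be partial."""
    return math.ceil(n_examples / batch_size) * epochs


n_train_examples = 867   # training rows after both splits
epochs = 10
train_runtime = 1371.8724  # seconds, from TrainOutput

steps = total_optimizer_steps(n_train_examples, batch_size=8, epochs=epochs)
samples_per_second = round(n_train_examples * epochs / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 3)
print(steps, samples_per_second, steps_per_second)
```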

In [19]:
max([len(example["labels"]) for example in test])

79

In [20]:
max_gen_length = 60

In [21]:
trainer.evaluate(test, max_length=max_gen_length)

Trainer is attempting to log a value of "[0.5882245383068203, 0.41774100442563017, 0.33908629441624366, 0.28506981740064447]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.8327173590660095,
 'eval_bleu': 0.333742608461613,
 'eval_precisions': [0.5882245383068203,
  0.41774100442563017,
  0.33908629441624366,
  0.28506981740064447],
 'eval_brevity_penalty': 0.850126837330369,
 'eval_length_ratio': 0.8603114676734309,
 'eval_translation_length': 5469,
 'eval_reference_length': 6357,
 'eval_rouge1': 0.5340473267369774,
 'eval_rouge2': 0.3946551400378029,
 'eval_rougeL': 0.4963179796182741,
 'eval_rougeLsum': 0.49672126997868793,
 'eval_runtime': 74.5363,
 'eval_samples_per_second': 3.649,
 'eval_steps_per_second': 0.456,
 'epoch': 10.0}

In [22]:
model = trainer.model

## Evaluation

Compare the pretrained and finetuned models on BLEU, ROUGE, etc.

In [23]:
pretrained_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pretrained_model.to("cuda")

trainer = Seq2SeqTrainer(
    model=pretrained_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=collator,
    compute_metrics=compute_metrics
)
trainer.evaluate(test, max_length=max_gen_length)

Trainer is attempting to log a value of "[0.46619217081850534, 0.3181214000886132, 0.25528775209050664, 0.210984230560087]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 1.2289284467697144,
 'eval_bleu': 0.06580262170056317,
 'eval_precisions': [0.46619217081850534,
  0.3181214000886132,
  0.25528775209050664,
  0.210984230560087],
 'eval_brevity_penalty': 0.2201069409222395,
 'eval_length_ratio': 0.39782916470033036,
 'eval_translation_length': 2529,
 'eval_reference_length': 6357,
 'eval_rouge1': 0.26189188457293755,
 'eval_rouge2': 0.1677202840600932,
 'eval_rougeL': 0.243097338372541,
 'eval_rougeLsum': 0.2439010660774994,
 'eval_runtime': 60.1857,
 'eval_samples_per_second': 4.519,
 'eval_steps_per_second': 0.565}
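To see where the BLEU gap comes from, the score can be rebuilt from the components in the two result dictionaries. The sketch below uses the standard BLEU definition (geometric mean of the n-gram precisions times a brevity penalty) with no smoothing; the helper names are hypothetical and the numbers are copied from the outputs above.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Penalize outputs that are shorter than the references."""
    ratio = translation_length / reference_length
    return 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)


def bleu_from_components(precisions, translation_length, reference_length):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return geo_mean * brevity_penalty(translation_length, reference_length)


finetuned_bleu = bleu_from_components(
    [0.5882245383068203, 0.41774100442563017, 0.33908629441624366, 0.28506981740064447],
    translation_length=5469,
    reference_length=6357,
)
pretrained_bleu = bleu_from_components(
    [0.46619217081850534, 0.3181214000886132, 0.25528775209050664, 0.210984230560087],
    translation_length=2529,
    reference_length=6357,
)
print(finetuned_bleu, pretrained_bleu)
print(brevity_penalty(5469, 6357), brevity_penalty(2529, 6357))
```

The brevity penalty accounts for a large part of the difference: the pretrained model produces much shorter answers than the references, so its BLEU is scaled down far more heavily than the finetuned model's.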

Apart from precision- and recall-style metrics computed against the reference texts (synthesized by gpt-3.5-turbo), we can study gpt-3.5-turbo's own evaluation of the answers, though this has some shortcomings out of the box.

In [25]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain
chat_llm = ChatOpenAI(temperature=0)
qa_eval_chain = QAEvalChain.from_llm(chat_llm)

In [26]:
test_trunc = list(iter(test))[:100]

In [27]:
from tqdm import tqdm

results = []

for example in tqdm(test_trunc):

  inputs = tokenizer(example["question_with_context"], return_tensors="pt")
  outputs = model.generate(**{k: v.to(model.device) for k, v in inputs.items()}, max_length=max_gen_length)
  answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

  ground_truth = {"question": example["question"], "answer": example["answer"]}
  prediction = {"result": answer}
  res = qa_eval_chain.evaluate([ground_truth], [prediction], question_key="question")[0]
  # grade = grade_model_answer([ground_truth], [prediction],   grade_prompt, logger)
  results.append(
      {**ground_truth, **prediction, "evaluation": res["text"]}
  )

100%|██████████| 100/100 [03:30<00:00,  2.11s/it]


In [29]:
sum([result["evaluation"].lower() == "correct" for result in results])

76

In [30]:
pretrained_results = []

for example in tqdm(test_trunc):

  inputs = tokenizer(example["question_with_context"], return_tensors="pt")
  outputs = pretrained_model.generate(**{k: v.to(pretrained_model.device) for k, v in inputs.items()}, max_length=max_gen_length)
  answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

  ground_truth = {"question": example["question"], "answer": example["answer"]}
  prediction = {"result": answer}
  res = qa_eval_chain.evaluate([ground_truth], [prediction], question_key="question")[0]
  pretrained_results.append(
      {**ground_truth, **prediction, "evaluation": res["text"]}
  )

100%|██████████| 100/100 [03:12<00:00,  1.92s/it]


In [43]:
sum([result["evaluation"].lower() == "correct" for result in pretrained_results])

69

gpt-3.5-turbo scored an accuracy of ~80% according to QAEvalChain, run offline. The finetuned accuracy (76/100) is only a few points higher than the pretrained accuracy (69/100), but the other metrics look a lot more in the finetuned model's favor. Let's look manually.
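The counts above rely on an exact string match against the grader's output. A slightly more defensive sketch is shown below; `is_correct` and `accuracy` are hypothetical helpers, and the `GRADE:` prefix is an assumed variation in the grader's output, not something observed in this run. Substring matching is deliberately avoided because "INCORRECT" contains "CORRECT".

```python
def is_correct(grade_text: str) -> bool:
    """Return True only when the grader's verdict is exactly CORRECT."""
    verdict = grade_text.strip().upper()
    # Assumption: tolerate an optional "GRADE:" prefix in the grader's reply.
    if verdict.startswith("GRADE:"):
        verdict = verdict[len("GRADE:"):].strip()
    return verdict == "CORRECT"


def accuracy(results):
    """Fraction of results whose evaluation parses as CORRECT."""
    return sum(is_correct(r["evaluation"]) for r in results) / len(results)


# Toy results in the same shape as the dictionaries built in the loops above.
toy_results = [
    {"evaluation": "CORRECT"},
    {"evaluation": " correct\n"},
    {"evaluation": "INCORRECT"},
    {"evaluation": "GRADE: INCORRECT"},
]
toy_accuracy = accuracy(toy_results)
print(toy_accuracy)
```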

And side by side.

In [44]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

rows = [
    {
        **res,
        "pretrained_result": pretrained_res["result"],
        "pretrained_evaluation": pretrained_res["evaluation"]
    }
    for res, pretrained_res in
    zip(results, pretrained_results)
    if res["question"] == pretrained_res["question"]
]

In [45]:
df = pd.DataFrame(rows)
df[df["evaluation"] == df["pretrained_evaluation"]]

Unnamed: 0,question,answer,result,evaluation,pretrained_result,pretrained_evaluation
0,What is the most important thing a teacher can do to be successful in the classroom?,Preparation,The most important thing a teacher can do to be successful in the classroom is to be exceptionally well prepared.,CORRECT,preparation,CORRECT
1,What happens when you take away some of the receptors in your cells?,You need more coffee to get the same effect as before.,"The cell starts to think, gee whiz, there's a lot of stimulation going on.",INCORRECT,"The cell starts to think, gee whiz, there's a lot of stimulation going on",INCORRECT
2,What does the speaker think about babies?,The speaker thinks that babies are cute but very stupid.,The speaker thinks that newborn babies come into the world with some degree of consciousness.,INCORRECT,Babies are very stupid,INCORRECT
3,What were some of the challenges faced by the Python project?,One of the challenges was there wasn't enough features and too many just changes without features. The empathy for the end user as to why they would switch wasn't there.,The challenge was that there wasn't enough features and too many just changes without features.,CORRECT,a bit of gratuitous change to the language,CORRECT
4,What does the speaker think about the idea of closing something?,The speaker thinks it's boring to think about recreating things that we already have when we could create something that's different.,The speaker thinks that closing something is boring and boring.,INCORRECT,It's boring,INCORRECT
5,What is the better way to think about the potential of self-driving cars?,The better way to think about it is that there's a whole continuum of how much driving and assisting the car can do.,The better way to think about the potential of self-driving cars is that there's a whole continuum of how much driving and assisting the car can do.,CORRECT,a whole continuum,CORRECT
7,"What is anthropomorphism and how is it related to animals, objects, and robots?","Anthropomorphism is the act of projecting human-like traits and behaviors onto nonhumans, such as animals, objects, and robots. It can lead to misinterpretation of their actual emotions and behaviors.","anthropomorphism is a tendency that we have to project human like traits and behaviors onto nonhumans. It is related to animals, objects, and robots.",CORRECT,"We have to project human like traits and behaviors onto nonhumans. And we often see it with animals, like we'll project emotions on animals that may or may not actually be there. We often see that we're trying to interpret things according to our own behavior when we get it",CORRECT
8,What did the speaker do during their postdoc to prepare for being a professor?,"The speaker added artificial large time consuming things into the middle of their day, such as exercising for two hours and doing productive meditation, to get good at putting artificial constraints on themselves and avoid getting flabby when their job became easy as a professor.",The speaker added artificial large time consuming things into the middle of their day.,CORRECT,Exercise for two hours in the middle of the day and do all this productive meditation,CORRECT
9,Why have they been trying to undermine these things without invasion?,Because it threatens their interests.,They've been trying to undermine these things without invasion because it threatens their interests.,CORRECT,Because it threatens their interests,CORRECT
10,What is the current level or granularity of tokenization?,The current level or granularity of tokenization generally means it's maybe two to five.,The current level or granularity of tokenization generally means it's maybe two to five.,CORRECT,maybe two to five,CORRECT


In [46]:
df[(df["evaluation"] == "INCORRECT") & (df["pretrained_evaluation"] == "CORRECT")]

Unnamed: 0,question,answer,result,evaluation,pretrained_result,pretrained_evaluation
6,What is the problem with MIT according to the text?,The problem with MIT is not being open and enforcing mediocrity and homogenization pressures.,The problem with MIT is that it's a complete stop on the ability to actually do work.,INCORRECT,It's like a complete stop on the ability to actually do work,CORRECT
15,What is one of the societal questions that we will grapple with for years to come?,Privacy is one of the societal questions that we will grapple with for years to come.,The reality is that giving over data to any AI system can be used to enrich our lives in profound ways.,INCORRECT,Privacy,CORRECT
20,What is the exclusion principle and how did Dirac use it to explain the behavior of electrons?,"The exclusion principle states that two electrons cannot be on the same orbit. Dirac explained that all negative energy states are filled orbits, so electrons can only go positive. Pauli suggested that electrons can change orbits, which would create a new particle.",Dirac used the exclusion principle to explain the behavior of electrons.,INCORRECT,two electrons cannot be on the same orbit,CORRECT
48,What is the author's opinion on whether a computer could get as good as humans at computing heuristic functions?,The author is unsure whether a computer could get as good as humans at computing heuristic functions.,The author believes that a computer could get as good as humans at computing heuristic functions.,INCORRECT,Maybe,CORRECT
50,What is the main reason why it is not advisable to put a superconducting system in a car?,It is questionable to put a cryostat in the trunk of everyone's car.,"The main reason is that the cooling is not being dissipated by the circuits themselves, not the cooling.",INCORRECT,Cooling errors,CORRECT
62,What are the benefits of working with a therapist for sleep compared to taking sleeping pills?,"The benefits of working with a therapist for sleep last for years later, while when you stop taking sleeping pills, you typically have rebound insomnia where your sleep is usually even worse than before.",The benefits of working with a therapist for sleep compared to taking sleeping pills are that they last for years later.,INCORRECT,Helpful,CORRECT
66,What was the author's background before learning Python?,"The author was a graduate student studying biomedical engineering and had experience in taking information from satellites and doing data processing in MATLAB, Perl, and scripting on a VMS.",The author was a graduate student studying biomedical engineering at the Mayo Clinic.,INCORRECT,a graduate student,CORRECT
74,What does the speaker suggest is missing from America's claim to be the greatest country in the world?,The speaker suggests that wisdom and beauty are missing from America's claim to be the greatest country in the world.,The speaker suggests that America's claim to be the greatest country in the world is missing the sense of gratitude.,INCORRECT,strength,CORRECT
89,What is the speaker's opinion on mortality?,The speaker thinks that mortality is beautiful.,The speaker thinks that all of us conscious beings in the grand scheme of basically every scale will be completely forgotten.,INCORRECT,I think it's beautiful,CORRECT
93,What is the difficulty in determining whether a driver is aggressive or defensive?,"It is difficult to determine whether a driver is aggressive or defensive because it is hard to tell if they will let you go in or not, and you may end up driving next to them without getting the observations you need.",The difficulty in determining whether a driver is aggressive or defensive is that it is difficult to just make up an answer.,INCORRECT,Helpful,CORRECT
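
The two filtered views above are cells of a 2×2 agreement table between the finetuned and pretrained grades. A minimal sketch of tallying all four cells at once, using toy rows in the same shape as `rows` (the data below is made up; `agreement_table` is a hypothetical helper):

```python
from collections import Counter


def agreement_table(rows):
    """Count (finetuned grade, pretrained grade) pairs across all examples."""
    return Counter((row["evaluation"], row["pretrained_evaluation"]) for row in rows)


# Toy rows; in the notebook these would come from the `rows` list built earlier.
toy_rows = [
    {"evaluation": "CORRECT", "pretrained_evaluation": "CORRECT"},
    {"evaluation": "CORRECT", "pretrained_evaluation": "INCORRECT"},
    {"evaluation": "INCORRECT", "pretrained_evaluation": "CORRECT"},
    {"evaluation": "CORRECT", "pretrained_evaluation": "CORRECT"},
]
table = agreement_table(toy_rows)
for (finetuned, pretrained), count in sorted(table.items()):
    print(f"finetuned={finetuned:<9} pretrained={pretrained:<9} n={count}")
```

The off-diagonal cells are the interesting ones: they show where finetuning changed the grade, in either direction.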


TODO: append gpt-3.5-turbo for comparison, which will require actually running the RetrievalQAChain.