<a href="https://colab.research.google.com/github/inwaves/sample-notebooks/blob/master/GPT2_classification_AndreiAlexandru.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before running, choose Runtime > Change runtime type and select GPU.


# Setup



In [1]:
!pip install transformers
!pip install torch



In [23]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import json
import numpy as np

np.random.seed(1337)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
model.eval().cuda()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

# GPT-2 Utility functions

Resources:

1.   https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel
2.   https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-generation/run_generation.py




In [24]:
def generate(prompt, max_length=5, stop_token=None):
  
  # The encoding of the prompt needs to be truncated to at most 
  # (max_sequence_length_of_GPT - max_length_of_text_to_generate).
  # Otherwise this fails, because some of the prompt encodings end up being more
  # than 1024, for example x_train[57].
  input_ids = tokenizer.encode(prompt, return_tensors="pt")
  generated_text_ids = model.generate(input_ids=input_ids.cuda(), max_length=max_length+len(input_ids[0]), do_sample=False)
  generated_text = tokenizer.decode(generated_text_ids[0], clean_up_tokenization_spaces=True)
  post_prompt_text = generated_text[len(tokenizer.decode(input_ids[0], clean_up_tokenization_spaces=True)):]
  return prompt + post_prompt_text[:post_prompt_text.find(stop_token) if stop_token else None]

In [4]:
# Note that the logits are shifted over 1 to the left, since HuggingFace doesn't give a logit for the first token
def get_logits_and_tokens(text):
    input_ids = tokenizer.encode(text, return_tensors="pt")
    tokens = [tokenizer.decode([input_id]) for input_id in input_ids[0]]
    output = model(input_ids.cuda())
    return output.logits[0][:-1], tokens

# Loading the data

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
def load_jsonl(filename):
    f = open(filename)
    return [json.loads(line) for line in f.read().splitlines()]

In [25]:
x_train = load_jsonl("/content/drive/MyDrive/train.jsonl")
del x_train[56] # This entry, when appended to a prompt and encoded, is over the 1024 maximum sequence length for the model. 
x_dev = load_jsonl("/content/drive/MyDrive/dev.jsonl")
x_test = load_jsonl("/content/drive/MyDrive/test_no_labels.jsonl")

x_train[-1]

{'label': 'False',
 'meta': {'id': '1310.8601', 'year': 2013},
 'text': 'non relativistic approach for cosmological scalar field dark matter. we derive non relativistic equations of motion for the formation of cosmological structure in a scalar field dark matter sfdm model corresponding to a complex scalar field endowed with a quadratic scalar potential. starting with the full equations of motion written in the newtonian gauge of scalar perturbations we separate out the fields involved into relativistic and non relativistic parts and find the equations of motion for the latter that can be used to build up the full solution. one important assumption will also be that the sfdm field is in the regime of fast oscillations under which its behavior is exactly that of cold dark matter. the resultant equations are quite similar to the schr odinger poisson system of newtonian boson stars plus relativistic leftovers. we exploit that similarity to show how to simulate with minimum numerical effor

# Basic prompt building

In [8]:
def render_example(example):
    title = example["text"].split(".")[0].strip()
    abstract = example["text"][len(title)+1:].strip()
    return f"""Title: {title}
Abstract: {abstract}
Label: {"AI" if example["label"] == "True" else "Not AI"}"""

In [9]:
def render_end_example(example):
    title = example["text"].split(".")[0].strip()
    abstract = example["text"][len(title)+1:].strip()
    return f"""Title: {title}
Abstract: {abstract}
Label:"""

In [10]:
def make_prompt(instructions, train_examples, end_example):
    rendered_train_examples = "\n\n--\n\n".join([render_example(example) for example in train_examples])
    return f"""{instructions}

{rendered_train_examples}

--

{render_end_example(end_example)}"""

In [11]:
INSTRUCTIONS = "Classify the following examples based on whether they are AI-relevant or not:"

prompt = make_prompt(INSTRUCTIONS, x_train[:4], x_train[4])
print(prompt)

Classify the following examples based on whether they are AI-relevant or not:

Title: thermodynamic analysis of quantum error correcting engines
Abstract: quantum error correcting codes can be cast in a way which is strikingly similar to a quantum heat engine undergoing an otto cycle. in this paper we strengthen this connection further by carrying out a complete assessment of the thermodynamic properties of strokes operator based error correcting codes. this includes an expression for the entropy production in the cycle which as we show contains clear contributions stemming from the different sources of irreversibility. to illustrate our results we study a classical qubit error correcting code well suited for incoherent states and the qubit shor code capable of handling fully quantum states. we show that the work cost associated with the correction gate is directly associated with the heat introduced by the error. moreover the work cost associated with encoding decoding quantum informa

In [14]:
generated_text = generate(prompt, stop_token="\n")
print(generated_text)

Classify the following examples based on whether they are AI-relevant or not:

Title: thermodynamic analysis of quantum error correcting engines
Abstract: quantum error correcting codes can be cast in a way which is strikingly similar to a quantum heat engine undergoing an otto cycle. in this paper we strengthen this connection further by carrying out a complete assessment of the thermodynamic properties of strokes operator based error correcting codes. this includes an expression for the entropy production in the cycle which as we show contains clear contributions stemming from the different sources of irreversibility. to illustrate our results we study a classical qubit error correcting code well suited for incoherent states and the qubit shor code capable of handling fully quantum states. we show that the work cost associated with the correction gate is directly associated with the heat introduced by the error. moreover the work cost associated with encoding decoding quantum informa

# 1. Finding GPT-2's baseline accuracy on this dataset.

## Some useful functions

In [15]:
def predicted_label_is_correct(x, y_pred):
  """Compare the label assigned by the model with the ground truth."""
  if ("Label: Not AI" in y_pred[-13:] and x["label"]=="False") or ("Label: AI" in y_pred[-13:] and x["label"]=="True"):
    return True
  return False

def predicted_label_is_correct_unittest():
  xs = [{"label": "True"}, {"label": "False"}]
  y_preds = ["Something or another\nLabel: AI", "Something or another\nLabel: Not AI"]
  assert predicted_label_is_correct(xs[0], y_preds[0]) == True
  assert predicted_label_is_correct(xs[0], y_preds[1]) == False
  assert predicted_label_is_correct(xs[1], y_preds[1]) == True
  assert predicted_label_is_correct(xs[1], y_preds[0]) == False

predicted_label_is_correct_unittest()

In [16]:
def eval_accuracy_on_set(examples, dataset):
  """Evaluate the accuracy of the model on a given dataset and set of examples to use in the prompt."""
  correct = 0

  for i, x in enumerate(dataset):
    prompt = make_prompt(INSTRUCTIONS, examples, x)
    generated_text = generate(prompt, stop_token="\n")
    # print(f"Example: {i}, label: {x['label']}, prediction: {generated_text}, correct: {predicted_label_is_correct(x, generated_text)}")
    if predicted_label_is_correct(x, generated_text):
      correct += 1

  accuracy = correct / len(dataset)

  return accuracy

Note on the `incorporate_examples` parameter: when choosing $k$, then $2k$ examples randomly (as in test accuracy with 5 random examples, then 10 random examples), there's no guarantee that the 10 examples include the first 5. I want that to be the case so the tests are more like-for-like. (Otherwise it's conceivable that a significant part of the performance difference between 5 vs. 10 examples is that the 10 examples were just different, and better suited for this particular task.)

In [17]:
def extract_examples_from_dataset(dataset, num_examples=5, incorporate_examples=[], randomise=False):
  """Extracts from a given dataset examples to use in prompts.
    :param dataset:               list of dict objects containing papers.
    :param num_examples:          total number of examples to extract, including incorporate_examples.
    :param incorporate_examples:  list of examples that *must* be extracted.
    :param randomise:             if True, random examples are picked, otherwise the first num_examples examples are used.
  """
  if randomise:
    example_indices = np.random.randint(0, len(x_train), num_examples-len(incorporate_examples))
    examples = [dataset[idx] for idx in example_indices]
    examples.extend(incorporate_examples)
    filtered_dataset = [dataset[idx] for idx in range(len(dataset)) if dataset[idx] not in examples]
    return examples, filtered_dataset
  else:
    # FIXME: if incorporate_examples is not empty, this does not return the correct result.
    return dataset[:num_examples], dataset[num_examples:]

## Accuracy results

Selecting $k$ examples from the training set, then evaluating the model on the remaining $500-k$ data points in the training set.

The examples (entries in the training set that go in the prompt) are selected in two ways:
- they're the first $k$ entries in the training set;
- they're randomly selected; (you should be able to replicate my results so long as you set the seed for `numpy` in the very first cell of this notebook.)

In [None]:
examples, dset = extract_examples_from_dataset(x_train, 5, False)
print(f"Accuracy using the first 5 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 10, False)
print(f"Accuracy using the first 10 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 20, False)
print(f"Accuracy using the first 20 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 5, True)
print(f"Accuracy using 5 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 10, examples, True)
print(f"Accuracy using 10 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 20, examples, True)
print(f"Accuracy using 20 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, dset):.4f}%")

Accuracy using the first 5 examples from the dataset in the prompt: 26.2626%
Accuracy using the first 10 examples from the dataset in the prompt: 0.0000%
Accuracy using the first 20 examples from the dataset in the prompt: 0.0000%
Accuracy using 5 randomly-selected examples in the prompt: 26.2626%
Accuracy using 10 randomly-selected examples in the prompt: 0.0000%


KeyboardInterrupt: ignored

Selecting $k$ examples from the training set, then evaluating the model on the entire dev set.

In [None]:
examples, dset = extract_examples_from_dataset(x_train, 5, False)
print(f"Accuracy using the first 5 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 10, False)
print(f"Accuracy using the first 10 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 20, False)
print(f"Accuracy using the first 20 examples from the dataset in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 5, True)
print(f"Accuracy using 5 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 10, examples, True)
print(f"Accuracy using 10 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

examples, dset = extract_examples_from_dataset(x_train, 20, examples, True)
print(f"Accuracy using 20 randomly-selected examples in the prompt: {100*eval_accuracy_on_set(examples, x_dev):.4f}%")

Accuracy using the first 5 examples from the dataset in the prompt: 32.0000%
Accuracy using the first 10 examples from the dataset in the prompt: 11.0000%
Accuracy using the first 20 examples from the dataset in the prompt: 11.0000%
Accuracy using 5 randomly-selected examples in the prompt: 32.0000%
Accuracy using 10 randomly-selected examples in the prompt: 11.0000%
Accuracy using 20 randomly-selected examples in the prompt: 11.0000%


Ways to improve this:
- evaluate each setup using more than one run, average the accuracies, look at standard deviations; (unset the `numpy` random seed, otherwise every run has the same "random" examples)

# 2. Tweaks

- Trying GPT-J-6B.

In [None]:
gptj_tokeniser = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gptj = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

Downloading:   0%|          | 0.00/619 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/779k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.94k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/357 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/930 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/22.5G [00:00<?, ?B/s]

# 3. Write-up