# OpenAI GPT2

The OpenAI GPT-2 model was proposed in "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from OpenAI. It is a causal (unidirectional) transformer pretrained with a language modeling objective on a very large corpus of about 40 GB of text.

GPT-2 is a large transformer-based language model with 1.5 billion parameters in its largest version, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10x the parameters and trained on more than 10x the amount of data. The "gpt2" checkpoint loaded in the cells below is a much smaller variant, with 12 transformer blocks and a hidden size of 768.
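
The next-word objective can be made concrete with a toy example. The sketch below is not GPT-2's training code: the four-word sentence and the hand-written probabilities are hypothetical, chosen only to show that each position's target is the token that follows it and that the loss is the average negative log-probability of those targets.

```python
import math

# Hypothetical toy sentence (illustration only).
tokens = ["the", "cat", "sat", "down"]

# Inputs are all tokens but the last; targets are the same sequence shifted by one.
inputs = tokens[:-1]
targets = tokens[1:]

# Hypothetical model outputs: for each input position, a probability
# distribution over the next token given everything seen so far.
predicted = [
    {"the": 0.1, "cat": 0.6, "sat": 0.2, "down": 0.1},  # after "the"
    {"the": 0.1, "cat": 0.1, "sat": 0.7, "down": 0.1},  # after "the cat"
    {"the": 0.2, "cat": 0.1, "sat": 0.2, "down": 0.5},  # after "the cat sat"
]


def next_token_loss(predicted, targets):
    """Average negative log-likelihood of the true next tokens."""
    nll = [-math.log(dist[target]) for dist, target in zip(predicted, targets)]
    return sum(nll) / len(nll)


loss = next_token_loss(predicted, targets)
print(list(zip(inputs, targets)))  # (context token, token to predict) pairs
print(round(loss, 4))
```

Raising the probability assigned to any of the true next tokens lowers the loss, which is all the training signal the objective provides.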

Reference: https://huggingface.co/docs/transformers/model_doc/gpt2

In [21]:
# Check the model architecture
from transformers import GPT2Model, GPT2Config

# Load the pretrained "gpt2" checkpoint and print its module tree
model = GPT2Model.from_pretrained("gpt2")
print(model)

# GPT2Config() builds the library's default configuration rather than reading
# it from the loaded checkpoint (that one is available as model.config)
config = GPT2Config()

print("Configuration:\n")
print(config)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
Configuration:

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 768,
  "n
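
The printed module tree shows the embedding and LayerNorm sizes but leaves the `Conv1D` shapes blank. As a rough cross-check, the sketch below counts parameters from the dimensions alone. It assumes the usual GPT-2 layout, which the output above does not show: `c_attn` projects 768 features to 3 x 768 (queries, keys and values together) and the MLP hidden width is 4 x 768. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT2Model-style stack from its dimensions."""
    embeddings = vocab * n_embd + n_positions * n_embd  # wte + wpe
    layer_norm = 2 * n_embd                             # weight + bias

    # Assumption: c_attn produces queries, keys and values at once (3 * n_embd),
    # followed by an n_embd -> n_embd output projection (c_proj).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # Assumption: the MLP hidden width is 4 * n_embd (c_fc up, c_proj down).
    n_inner = 4 * n_embd
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)

    block = 2 * layer_norm + attn + mlp                 # ln_1, attn, ln_2, mlp
    return embeddings + n_layer * block + layer_norm    # final term is ln_f


total = gpt2_param_count()
print(f"{total:,}")
```

Under these assumptions the total is on the order of 124 million parameters, which is why this checkpoint is far smaller than the 1.5 billion figure quoted for the largest GPT-2. The count can be compared against `sum(p.numel() for p in model.parameters())` on the loaded model.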

# GPT2 Vocabulary

In [23]:
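# Check the tokenizer

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode a short string into GPT-2 token ids
print(tokenizer("Hello world")["input_ids"])

# Write vocab.json and merges.txt into the existing directory "gpt2_vocab"
tokenizer.save_vocabulary("gpt2_vocab")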


[15496, 995]


('gpt2_vocab\\vocab.json', 'gpt2_vocab\\merges.txt')
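
The two saved files reflect how the tokenizer works: GPT-2 uses byte-pair encoding, where `merges.txt` lists merge rules in priority order and `vocab.json` maps each resulting token string to an id. The sketch below is a simplified illustration of applying merge rules to one word. The merge table is hypothetical (not taken from the real `merges.txt`), and the real tokenizer additionally works on bytes and pre-splits text, which is omitted here.

```python
def bpe(word, merges):
    """Repeatedly apply the highest-priority merge until none applies."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair whose merge rule appears earliest in the table.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair has a merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, earliest = highest priority.
toy_merges = [("l", "l"), ("h", "e"), ("he", "ll"), ("hell", "o")]

print(bpe("hello", toy_merges))
print(bpe("help", toy_merges))
```

A word covered by the merge rules collapses into a single token, while an uncovered word stays split into smaller pieces, each of which still has an id in the vocabulary.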

# Vectorization

In [9]:
from transformers import AutoTokenizer, GPT2Model
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Tokenize one sentence and return PyTorch tensors
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per input token: shape (batch, tokens, hidden size)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)

tensor([[[-9.0342e-06, -1.4021e-01, -2.0845e-01,  ..., -1.5329e-01,
          -6.7826e-02, -1.9630e-01],
         [ 4.1949e-01,  2.3525e-01,  3.4816e-01,  ...,  4.5322e-02,
           1.5447e-01,  1.9547e-02],
         [-7.0056e-02,  2.6082e-01, -2.9146e-01,  ...,  9.0979e-02,
           4.9659e-01, -4.1824e-01],
         [-1.9695e-01, -2.9247e-01, -1.4119e-01,  ..., -8.9255e-02,
          -2.2392e-01,  1.2212e-01],
         [-6.4193e-01, -1.0236e-01, -4.2129e-01,  ...,  6.8697e-02,
          -5.1117e-01,  5.0044e-01],
         [ 4.1286e-03, -3.1454e-02, -1.0823e+00,  ..., -5.0159e-02,
          -3.0878e-02,  4.3480e-01]]], grad_fn=<ViewBackward0>)
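
The output above holds one vector per token, not one vector per sentence. The notebook does not pool them, but one common way to get a single sentence vector is to average the token vectors while skipping padding positions. The sketch below shows that idea in plain Python on hypothetical numbers (3 tokens, 4 dimensions); `mean_pool` is an illustrative helper, not a library function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens into one sentence vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Hypothetical 3-token sequence with 4-dimensional hidden states
# (the real output above has 6 tokens and 768 dimensions).
hidden = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],  # pretend this position is padding
]
mask = [1, 1, 0]

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)
```

For a single unpadded sentence such as the one above, the same average can be taken on the tensor directly with `last_hidden_states.mean(dim=1)`.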


In [11]:
from transformers import AutoTokenizer, GPT2ForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2ForQuestionAnswering.from_pretrained("gpt2")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

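# Intended target is "nice puppet".
# Note: the GPT-2 tokenizer yields only 13 tokens here (valid positions 0-12),
# so indices 14 and 15 fall outside the sequence, the targets are ignored, and
# the loss printed below is nan. The span likely sits at positions 11 and 12.
target_start_index = torch.tensor([14])
target_end_index = torch.tensor([15])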

outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
loss = outputs.loss
print(outputs)

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2 and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


QuestionAnsweringModelOutput(loss=tensor(nan, grad_fn=<DivBackward0>), start_logits=tensor([[ 3.0386,  9.8525,  4.7776,  6.5073,  8.6920,  9.9764,  5.4821, 13.4301,
          8.4862,  9.6751,  9.9907,  9.7177,  9.0636]],
       grad_fn=<CloneBackward0>), end_logits=tensor([[-0.6750, -1.9020, -1.1939, -0.0285, -1.6016, -0.7465, -1.4319, -3.7202,
         -2.1298, -1.9746, -1.6999, -1.5966, -1.2244]],
       grad_fn=<CloneBackward0>), hidden_states=None, attentions=None)
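
Two things in this output are worth unpacking: the predicted span and the `nan` loss. The plain-Python sketch below reuses the logits printed above. It assumes, as an approximation of the library's behaviour rather than its actual code, that the loss is the mean of a start and an end cross-entropy and that target positions are clamped to the sequence length and then ignored. The helper names are hypothetical.

```python
import math

# Logits copied from the output above (13 token positions).
start_logits = [3.0386, 9.8525, 4.7776, 6.5073, 8.6920, 9.9764, 5.4821,
                13.4301, 8.4862, 9.6751, 9.9907, 9.7177, 9.0636]
end_logits = [-0.6750, -1.9020, -1.1939, -0.0285, -1.6016, -0.7465, -1.4319,
              -3.7202, -2.1298, -1.9746, -1.6999, -1.5966, -1.2244]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def cross_entropy(logits, target, ignore_index):
    """Negative log-softmax at `target`, or None when the target is ignored."""
    if target == ignore_index:
        return None
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Mean of start and end losses; out-of-range targets are ignored."""
    seq_len = len(start_logits)
    losses = []
    for logits, pos in ((start_logits, start_pos), (end_logits, end_pos)):
        clamped = min(max(pos, 0), seq_len)  # assumption: clamp to [0, seq_len]
        loss = cross_entropy(logits, clamped, ignore_index=seq_len)
        # With a batch of one, an ignored target leaves nothing to average.
        losses.append(float("nan") if loss is None else loss)
    return sum(losses) / 2


start, end = argmax(start_logits), argmax(end_logits)
positions = list(range(len(start_logits)))
print(start, end, positions[start:end + 1])         # predicted span positions
print(span_loss(start_logits, end_logits, 14, 15))  # targets outside the sequence
print(span_loss(start_logits, end_logits, 11, 12))  # targets inside the sequence
```

The predicted start lands after the predicted end, so the extracted span is empty. That is expected from a `qa_outputs` head that was just randomly initialized, as the warning above says. The out-of-range targets reproduce the `nan`, while in-range targets give a finite loss.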


In [14]:
print(model)

GPT2ForQuestionAnswering(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)
