In [None]:
!pip install transformers # Install transformers library



# What is the `transformers` library?

- This is a library developed by the company [Hugging Face](https://huggingface.co/).
- It is mainly used in NLP (Natural Language Processing).
- As its name suggests, it includes many [Transformer](https://en.wikipedia.org/wiki/Transformer)-based models.
- It works with PyTorch, TensorFlow, and other deep learning frameworks.
- These days it is used not only for NLP but also in other fields such as computer vision (CV) and audio.
- It offers a simple high-level interface, `pipeline`.

# GPT-2
![GPT-2 Logo](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fstorage.googleapis.com%2Fwandb-production.appspot.com%2Fwandb-public-images%2Ftk8kfl2kbl.png&f=1&nofb=1&ipt=193d2f581835833fffada67a17cef7bd3a2768c559bb57d5047a4418eecce938)

- GPT (Generative Pre-trained Transformer) is a pre-trained model with a **decoder-only** structure.
- GPT-2 is the second version of the GPT model.
- GPT-2 was first published in the [OpenAI](https://openai.com/) team's paper, [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
- GPT-2 was trained on WebText, a private dataset of about 40GB of text collected from web pages linked in highly rated [Reddit](https://www.reddit.com/) posts.
- The largest GPT-2 model has **1.5B** parameters, far more than GPT-1's 117M parameters.
- GPT-2's zero-shot learning was very successful: the pre-trained model can handle a variety of tasks without fine-tuning.

# GPT-1 vs GPT-2
- GPT-1 (2018) and GPT-2 (2019) are one generation apart and differ in several respects.
- The biggest difference is the number of parameters. Largely thanks to this increase, GPT-2 achieved a large performance gain.

The table below compares GPT-1 and GPT-2:

|  | GPT-1 | GPT-2 |
|---|---|---|
| **Goal** | Fine-tuning a pre-trained model for each task | Zero-shot learning performance |
| **Model Size** | 117M parameters | **1.5B** parameters |
| **Training Data** | BookCorpus (about 7,000 unpublished books, 1GB) | WebText (web pages linked from Reddit posts, **40GB**) |
| **Context Size** | 512 tokens | **1024** tokens |
| **Vocabulary Size** | about 40,000 | **50,257** |

Let's take a closer look at the differences between GPT-1 and GPT-2.

## Zero-shot Learning
- Zero-shot learning is an **approach** in which a model handles various tasks without task-specific fine-tuning.
- We can't expect strong zero-shot performance from GPT-1. With relatively few parameters, its ability to learn and generalize across tasks was limited.
- GPT-2 handles zero-shot learning much better because its 1.5B parameters are roughly 13 times GPT-1's 117M parameters.
- As a result, GPT-2 can handle various tasks without fine-tuning, with the task described in the input text itself.
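
As a rough sketch of what zero-shot use can look like, the task is written directly into the prompt instead of being learned through fine-tuning. The `TL;DR:` cue follows the summarization setup described in the GPT-2 paper. The helper names `build_summary_prompt` and `generate_zero_shot`, the example sentence, and the generation settings are illustrative assumptions, not part of the original notebook.

```python
def build_summary_prompt(article: str) -> str:
    """Append the 'TL;DR:' cue so the model continues with a summary."""
    return article.strip() + "\nTL;DR:"


def generate_zero_shot(prompt: str, max_new_tokens: int = 40) -> str:
    """Continue the prompt with pre-trained GPT-2 (no fine-tuning).

    Hypothetical helper: downloads the 'gpt2' checkpoint on first use.
    """
    # Imported lazily so the prompt helper works without transformers installed
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # By default, generated_text holds the prompt followed by the continuation
    return outputs[0]["generated_text"][len(prompt):]


prompt = build_summary_prompt("GPT-2 is a decoder-only Transformer trained on WebText.")
print(prompt)

# To actually run the model (requires a download):
# print(generate_zero_shot(prompt))
```

The smallest `gpt2` checkpoint is likely to give much weaker summaries than the 1.5B model, so treat the output as a demonstration of the prompting idea rather than a quality benchmark.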

## Slight Improvements to the Model Architecture
- The GPT-2 architecture is also based on the Transformer decoder, but it has a few changes from GPT-1.
  1. **Layer Normalization Placement**: The biggest change in the GPT-2 architecture is where layer normalization is applied. GPT-1 applied normalization after each sub-layer such as attention. This is called **Post-LN**, and it can make training unstable. GPT-2 instead adopted **Pre-LN**, which applies normalization before each sub-layer. This matters even more for GPT-2 because it stacks many Transformer blocks.
  2. **Larger Context Size**: The context size doubled, from GPT-1's 512 tokens to GPT-2's 1024 tokens, so the model can take more of the preceding text into account.
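
The difference between Post-LN and Pre-LN comes down to the order of three operations: the sub-layer, the residual addition, and the normalization. The following is a minimal pure-Python sketch of that ordering. `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and `layer_norm` here omits the learned scale and shift, so this is not how the library implements it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the MLP."""
    return [0.5 * v + 1.0 for v in x]


def post_ln_step(x, sublayer):
    """GPT-1 style: apply the sub-layer, add the residual, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_step(x, sublayer):
    """GPT-2 style: normalize first, apply the sub-layer, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_step(x, toy_sublayer)
pre_out = pre_ln_step(x, toy_sublayer)
print("Post-LN:", post_out)
print("Pre-LN: ", pre_out)
```

In the Pre-LN version the input `x` passes through the residual path untouched, which is commonly given as the reason deep stacks of blocks train more stably.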

# GPT-2 Structure
![GPT-1 and GPT-2 Structure Diagram](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fpub-fb664c455eca46a2ba762a065ac900f7.r2.dev%2FGPT1_vs_GPT2_architecture.webp&f=1&nofb=1&ipt=79bce78eed066822b0f107e8740f3e63fdb8f37455304b22f4587d328bd7930b)

To inspect GPT-2's structure, we will use the `transformers` library, which provides classes for this.

In [3]:
from transformers import GPT2Config

# GPT2Config is a Python class that holds the configuration of the GPT-2 model.
# For more details, see the class definition at this link: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/configuration_gpt2.py#L31
config = GPT2Config.from_pretrained('gpt2')

# Print the configuration
print(config)

GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 50257
}



# GPT2Config
- We can see a lot of information about the GPT-2 model.
- For example, `n_ctx` is the context size (1024 tokens), the same value we saw above. We can also see other settings such as `activation_function` (GPT-2 uses `gelu_new`, a GELU variant) and `n_embd` (the embedding vector dimension).
- For more information, see the [`transformers` library source code in the GitHub repository](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2).
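
A few useful quantities are not printed directly but can be derived from the configuration. The sketch below copies values from the output above and assumes the usual conventions: each attention head gets an equal slice of `n_embd`, a `null` `n_inner` falls back to `4 * n_embd`, and the vocabulary consists of 256 byte tokens, the learned BPE merges, and one `<|endoftext|>` token. The variable names are illustrative.

```python
# Values copied from the printed GPT2Config above
n_embd, n_head, n_inner, vocab_size = 768, 12, None, 50257

# Size of each attention head
head_dim = n_embd // n_head
# Hidden size of the MLP (assumed fallback when n_inner is null)
inner_dim = n_inner if n_inner is not None else 4 * n_embd
# Width of the fused query/key/value projection
qkv_dim = 3 * n_embd
# Vocabulary minus 256 byte tokens and the <|endoftext|> token
bpe_merges = vocab_size - 256 - 1

print(head_dim, inner_dim, qkv_dim, bpe_merges)
# 64 3072 2304 50000
```

The `<|endoftext|>` token is the last entry in the vocabulary, which is why both `bos_token_id` and `eos_token_id` are 50256.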

In [8]:
from transformers import GPT2Model

# GPT2Model is a Python class that builds the model described by GPT2Config; from_pretrained also loads the weights learned on the WebText dataset
model = GPT2Model.from_pretrained('gpt2')

# Print model layer information
print(model)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)


# GPT2Model
- We can see the actual architecture of the GPT-2 model.
- For example, `LayerNorm` (`ln_1`) comes before `GPT2Attention` in each `GPT2Block`, the same as what we saw above.
- GPT2Model stacks 12 `GPT2Block`s.
- We can also see several `Dropout` layers, which help prevent overfitting.
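
The layer shapes in the printout are enough to estimate the size of this checkpoint by hand. The sketch below adds up the parameters layer by layer, assuming every `Conv1D` projection has a bias vector and every `LayerNorm` has a weight and a bias. The helper `linear` and the variable names are illustrative.

```python
# Shapes read from the printed GPT2Model above
vocab, n_pos, d, n_layer = 50257, 1024, 768, 12


def linear(n_in, n_out):
    """Weights plus biases of one Conv1D projection."""
    return n_in * n_out + n_out


layer_norm = 2 * d                            # weight and bias
embeddings = vocab * d + n_pos * d            # wte + wpe
attention = linear(d, 3 * d) + linear(d, d)   # c_attn + c_proj
mlp = linear(d, 4 * d) + linear(4 * d, d)     # c_fc + c_proj
block = 2 * layer_norm + attention + mlp      # ln_1, attn, ln_2, mlp

total = embeddings + n_layer * block + layer_norm  # plus the final ln_f
print(f"{total:,}")
# 124,439,808
```

So the `gpt2` checkpoint loaded here is the smallest GPT-2 variant, at roughly 124M parameters, not the 1.5B model discussed earlier. With the model loaded, the figure can be compared against `sum(p.numel() for p in model.parameters())`.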