
# Diving further into the GPT-2 Model

In [None]:
# Import necessary libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Step 1: Load the GPT-2 tokenizer and explain it
print("Loading GPT-2 tokenizer...")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print("\n--- GPT-2 Tokenizer Details ---")
print("Tokenizer Type: Byte-Pair Encoding (BPE)")
print(
    "Explanation: Byte-Pair Encoding splits words into subword units to reduce vocabulary size."
    " For example, 'uncommon' might be split into 'un', 'common'."
    " This approach allows GPT-2 to handle a wide variety of text efficiently.\n"
)
print(f"Tokenizer Vocabulary Size: {tokenizer.vocab_size}")
print(f"Special Tokens: {tokenizer.all_special_tokens}")

Loading GPT-2 tokenizer...



--- GPT-2 Tokenizer Details ---
Tokenizer Type: Byte-Pair Encoding (BPE)
Explanation: Byte-Pair Encoding splits words into subword units to reduce vocabulary size. For example, 'uncommon' might be split into 'un', 'common'. This approach allows GPT-2 to handle a wide variety of text efficiently.

Tokenizer Vocabulary Size: 50257
Special Tokens: ['<|endoftext|>']
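
The merge procedure behind Byte-Pair Encoding can be illustrated with a toy trainer. This is a minimal sketch under simplifying assumptions: it works on characters rather than bytes, the corpus is made up, and the helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`) are hypothetical and not part of `transformers`. GPT-2's real tokenizer starts from byte-level symbols and applies a much longer list of learned merges, so its splits will differ from this toy.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent pair; return merges and segmented words."""
    # Each word starts out as a tuple of single characters.
    words = {tuple(word): freq for word, freq in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus (an assumption for illustration only).
merges, segmented = learn_bpe("common common uncommon uncommon commonly", num_merges=5)
print("Learned merges:", merges)
print("Segmented words:", segmented)
```

Frequent character sequences end up as single symbols, while rarer words remain split into several pieces, which is the behaviour the tokenizer output above relies on.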


In [None]:
print("Example: Encoding and Decoding a Sentence")

text = "Exploring the GPT-2 model!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(f"Original Text: {text}")
print(f"Encoded Tokens: {encoded}")

# Decode tokens to see the mapping
decoded_tokens = [tokenizer.decode(token_id) for token_id in encoded]
print(f"Decoded Tokens: {decoded_tokens}")
print(f"Decoded Back: {decoded}")

Example: Encoding and Decoding a Sentence
Original Text: Exploring the GPT-2 model!
Encoded Tokens: [18438, 3255, 262, 402, 11571, 12, 17, 2746, 0]
Decoded Tokens: ['Expl', 'oring', ' the', ' G', 'PT', '-', '2', ' model', '!']
Decoded Back: Exploring the GPT-2 model!


In [None]:
# Step 2: Load the GPT-2 model and explain it
print("\nLoading GPT-2 model...")

#model = GPT2LMHeadModel.from_pretrained("gpt2")  # Default (124M)
#model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # 355M
model = GPT2LMHeadModel.from_pretrained("gpt2-large")  # 774M


Loading GPT-2 model...



In [None]:
# Print model details
print("\n--- GPT-2 Model Details ---")
print("Architecture: Transformer-based model with self-attention.")
print(
    "GPT-2 uses a decoder-only architecture with masked self-attention to predict the next token in a sequence."
)
print(f"Number of Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Number of Layers: {model.config.n_layer}")
print(f"Number of Attention Heads per Layer: {model.config.n_head}")
print(f"Hidden Size: {model.config.n_embd}")



--- GPT-2 Model Details ---
Architecture: Transformer-based model with self-attention.
GPT-2 uses a decoder-only architecture with masked self-attention to predict the next token in a sequence.
Number of Parameters: 774,030,080
Number of Layers: 36
Number of Attention Heads per Layer: 20
Hidden Size: 1280
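
The parameter count printed above can be reproduced by hand from the configuration values. The sketch below is a back-of-the-envelope calculation, not code from `transformers`: the function name `gpt2_param_count` is hypothetical, and it assumes the standard GPT-2 layout (fused query/key/value projection, a 4x-wide MLP, two LayerNorms per block, a context length of 1024 positions, and an output head that shares its weights with the token embedding).

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Count trainable parameters of a GPT-2 style model from its config values."""
    d = n_embd
    token_embedding = vocab_size * d       # shared with the output head, so counted once
    position_embedding = n_positions * d   # learned absolute position embeddings

    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand to 4*d, then project back to d
    layer_norms = 2 * (2 * d)                      # two LayerNorms per block, weight + bias each
    per_block = attention + mlp + layer_norms

    final_layer_norm = 2 * d
    return token_embedding + position_embedding + n_layer * per_block + final_layer_norm


# gpt2-large values from the output above: 36 layers, hidden size 1280.
print(f"{gpt2_param_count(n_layer=36, n_embd=1280):,}")  # 774,030,080
```

Most of the parameters sit in the transformer blocks (roughly 12 x hidden size squared per layer), which is why the count grows so quickly from the 124M model to the 774M one.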


In [None]:
# Step 3: Generate outputs for multiple inputs
def generate_text(prompt, max_length=50):
    """Generate text using GPT-2 with proper handling of attention mask and padding."""
    # Tokenize the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Generate attention mask (all non-padding tokens get a value of 1)
    # An attention mask is a binary tensor used to specify which tokens in the input sequence the
    # model should focus on (attend to) and which tokens it should ignore.
    attention_mask = torch.ones_like(input_ids)

    # Ensure the model has a pad token set (use eos token as a fallback)
    # GPT-2 does not have a dedicated padding token by default, so we often set
    # the pad_token_id to the same value as the eos_token_id (end-of-sequence token).
    tokenizer.pad_token = tokenizer.eos_token
    pad_token_id = tokenizer.pad_token_id

    # Generate output using the model
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,  # Total length in tokens, prompt included
        num_return_sequences=1,
        pad_token_id=pad_token_id,
        repetition_penalty=2.0,  # Values above 1.0 discourage tokens that already appeared
    )

    # Decode and return the generated text
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Test with different prompts
prompts = [
    "Artificial intelligence lectures are",
    "The future of AI will be",
    "In a remote city of punjab, India, there is a far, far land ",
]

print("\n--- GPT-2 Generated Texts ---")
for i, prompt in enumerate(prompts, 1):
    print(f"\nPrompt {i}: {prompt}")
    print(f"Generated Text: {generate_text(prompt)}")
    print("*"*100)


--- GPT-2 Generated Texts ---

Prompt 1: Artificial intelligence lectures are
Generated Text: Artificial intelligence lectures are a great way to learn about the latest developments in AI.

The course is designed for people who want an introduction into artificial general Intelligence (AGI). It's not intended as any sort of formal training, but rather
****************************************************************************************************

Prompt 2: The future of AI will be
Generated Text: The future of AI will be a big deal.

"I think the biggest thing that's going to happen is we're not just talking about robots and computers," says David Levy, CEO at DeepMind Technologies in London who has worked on some
****************************************************************************************************

Prompt 3: In a remote city of punjab, India, there is a far, far land 
Generated Text: In a remote city of punjab, India, there is a far, far land  that has been called
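
To make the effect of `repetition_penalty=2.0` in `generate_text` more concrete, here is a minimal sketch of the rule commonly used for this setting: the score of every token that has already appeared is divided by the penalty if it is positive and multiplied by it if it is negative, so in both cases the token becomes less likely. The function names `apply_repetition_penalty` and `greedy_pick` and the four-token vocabulary are hypothetical; the real `generate` call works on tensors over the full vocabulary, and this sketch is not the library's code.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy scores over a 4-token vocabulary (made-up numbers).
logits = [4.0, 3.0, -1.0, 0.5]
print(greedy_pick(logits))  # 0

# Tokens 0 and 2 were already generated, so their scores are pushed down.
penalized = apply_repetition_penalty(logits, previous_token_ids=[0, 2], penalty=2.0)
print(penalized)               # [2.0, 3.0, -2.0, 0.5]
print(greedy_pick(penalized))  # 1
```

A penalty of 2.0 is fairly aggressive: it halves the score of any repeated token, which reduces loops but can also push the model away from words it legitimately needs to reuse.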

# Overview of GPT-2 training

In [None]:
print("\n--- Overview of GPT-2 Training ---")
print(
    "GPT-2 was trained using unsupervised learning on WebText, a large dataset of text scraped from web pages."
    " It uses a transformer architecture trained for causal language modeling, where the model predicts the next token in a sequence given the previous tokens."
    " Training on diverse text lets GPT-2 pick up grammar, factual associations, and some ability to reason and write creatively."
    " The dataset was filtered to remove low-quality pages and tokenized using byte-level Byte-Pair Encoding.\n"
)
print("Dataset Size: Approximately 40 GB of text from about 8 million web pages.")
print("Objective: Maximize the likelihood of predicting the correct next token in the sequence.")



--- Overview of GPT-2 Training ---
GPT-2 was trained using unsupervised learning on WebText, a large dataset of text scraped from web pages. It uses a transformer architecture trained for causal language modeling, where the model predicts the next token in a sequence given the previous tokens. Training on diverse text lets GPT-2 pick up grammar, factual associations, and some ability to reason and write creatively. The dataset was filtered to remove low-quality pages and tokenized using byte-level Byte-Pair Encoding.

Dataset Size: Approximately 40 GB of text from about 8 million web pages.
Objective: Maximize the likelihood of predicting the correct next token in the sequence.
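
The next-token objective can be written out in a few lines. The sketch below is an illustration in plain Python, not GPT-2's training code: the function names `softmax` and `next_token_loss`, the three-token vocabulary, and the hand-written scores are all assumptions. Maximizing the likelihood of the correct next token is the same as minimizing the average negative log-probability computed here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    logits_per_position[t] holds the scores over the vocabulary after reading
    token_ids[: t + 1]; the target for that position is token_ids[t + 1].
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


token_ids = [0, 2, 1]  # a 3-token sequence over a vocabulary of size 3

# A model that guesses uniformly pays log(3) per token.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(next_token_loss(uniform, token_ids))

# A model that scores the correct next token highly pays much less.
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
print(next_token_loss(confident, token_ids))
```

Training adjusts the model's weights so that this loss falls across the whole dataset; nothing in the objective tells the model to follow instructions, only to continue text plausibly.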


# Why do we need Alignment or Instruction Tuning?

In [None]:
# Test with different prompts
prompts = [
    "Explain Artificial intelligence to a 6 year old kid.",
    "Write a few lines on happiness."

]

print("\n--- GPT-2 Generated Texts ---")
for i, prompt in enumerate(prompts, 1):
    print(f"\nPrompt {i}: {prompt}")
    print(f"Generated Text: {generate_text(prompt,100)}")
    print("*"*100)


--- GPT-2 Generated Texts ---

Prompt 1: Explain Artificial intelligence to a 6 year old kid.
Generated Text: Explain Artificial intelligence to a 6 year old kid.
The first thing you'll notice is that the AI doesn't understand what it's doing, and can only guess at things like "what color are those flowers?" or how many people there were in this room when I was talking about my new book (which has been translated into several languages). It also seems unable even of understanding basic English words such as 'I' for example:


It does seem able though - if given enough time
****************************************************************************************************

Prompt 2: Write a few lines on happiness.
Generated Text: Write a few lines on happiness.
The first thing I did was to write down the words that came out of my mouth when someone said "I'm happy." It's not easy, but it can be done! Here are some examples:


"It feels good!" (when you're feeling sad) or even better…(

# This was the main problem addressed by ChatGPT (GPT-3.5)