# Pre-trained GPT-2 Notebook

<a target="_blank" href="https://colab.research.google.com/github/simonguest/CS-394/blob/main/src/01/notebooks/GPT-2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://github.com/simonguest/CS-394/raw/refs/heads/main/src/01/notebooks/GPT-2.ipynb">
  <img src="https://img.shields.io/badge/Download_.ipynb-blue" alt="Download .ipynb"/>
</a>

In [5]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no dedicated pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token


In [11]:
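import torch

def autocomplete(prompt, max_length=50, temperature=0.7, top_k=50, top_p=0.9, is_sampling=True):
    # Encode the prompt; the tokenizer also returns an attention mask
    inputs = tokenizer(prompt, return_tensors="pt")

    # Sampling parameters only take effect when do_sample=True, so pass them
    # only in that case (recent transformers versions warn otherwise)
    sampling_kwargs = {}
    if is_sampling:
        sampling_kwargs = {"temperature": temperature, "top_k": top_k, "top_p": top_p}

    # Generate a continuation (max_length counts the prompt tokens too)
    with torch.no_grad():
        output = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            do_sample=is_sampling,
            pad_token_id=tokenizer.eos_token_id,
            **sampling_kwargs,
        )

    # Decode and return the generated text (prompt plus continuation)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text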
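
To build intuition for what `temperature`, `top_k`, and `top_p` do inside `generate`, here is a minimal pure-Python sketch on a made-up five-token vocabulary. The logit values and the helper names `softmax` and `filter_top_k_top_p` are assumptions for illustration; this is a simplification of the idea, not the transformers implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    # Renormalize so the surviving tokens form a valid distribution
    return {index: p / kept_total for index, p in kept}

# Hypothetical logits for a five-token vocabulary (illustrative values only)
toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]

for t in (0.7, 1.0, 1.5):
    probs = softmax(toy_logits, temperature=t)
    kept = filter_top_k_top_p(probs, top_k=3, top_p=0.9)
    print(f"temperature={t}: top prob={max(probs):.3f}, kept tokens={sorted(kept)}")
```

Lower temperatures concentrate probability on the most likely token, while higher temperatures flatten the distribution, so more tokens survive the top-p cutoff and become candidates for sampling.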

Create three story starters in different genres or styles. For example:
*   Fantasy/Adventure: “In a land of dragons and magic…”
*   Sci-fi: “The year is 2157. Humanity has just…”
*   Mystery: “The detective examined the crime scene and noticed…”
*   (or choose your own three)

Then vary the following and compare the results:
*   Greedy decoding vs. sampling
*   Different temperature values
*   How the opening sentence shapes the continuation (e.g., short vs. long)
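
The greedy-versus-sampling contrast in the list above can be seen on a toy scale. The sketch below uses a hypothetical next-word table (`TOY_MODEL`, invented for illustration) that looks only at the previous word; GPT-2 conditions on the whole context, so treat this as an analogy rather than a model of its behavior.

```python
import random

# Hypothetical next-word probabilities, invented for illustration
TOY_MODEL = {
    "he": {"is": 0.6, "was": 0.4},
    "is": {"a": 0.7, "the": 0.3},
    "was": {"a": 0.5, "the": 0.5},
    "a": {"wizard": 0.6, "boy": 0.4},
    "the": {"wizard": 0.5, "boy": 0.5},
    "wizard": {"he": 0.55, "who": 0.45},
    "boy": {"he": 0.5, "who": 0.5},
    "who": {"is": 0.5, "was": 0.5},
}

def generate(start, steps, do_sample, rng=None):
    """Extend `start` by `steps` words, either greedily or by sampling."""
    rng = rng or random.Random()
    words = [start]
    for _ in range(steps):
        choices = TOY_MODEL[words[-1]]
        if do_sample:
            # Draw the next word in proportion to its probability
            next_word = rng.choices(list(choices), weights=list(choices.values()))[0]
        else:
            # Greedy decoding: always take the single most likely word
            next_word = max(choices, key=choices.get)
        words.append(next_word)
    return " ".join(words)

print("Greedy:  ", generate("he", 12, do_sample=False))
print("Sampling:", generate("he", 12, do_sample=True, rng=random.Random(0)))
```

Because greedy decoding always follows the single most likely path, it is deterministic and can fall into a loop once it revisits a word; sampling can leave such a loop, at the cost of occasionally picking unlikely words.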

Document your observations (using Markdown in the notebook):
*   What differences do you notice between the strategies?
*   What worked well (or surprised you)?
*   What didn’t work that well?
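
One optional way to back up observations with a number is to measure how repetitive each completion is. The helper below, `distinct_ngram_ratio`, is a hypothetical name and a rough heuristic (it is not part of the assignment or of transformers): it reports the fraction of word n-grams that are unique, so values near 1.0 mean little repetition.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams that are unique (1.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # text shorter than n words: nothing can repeat
    return len(set(ngrams)) / len(ngrams)

# Illustrative strings; in the notebook, pass the text returned by autocomplete()
looping = "He is a wizard. He is a wizard. He is a wizard. He is a wizard."
varied = "He packed his staff, left the Shire, and followed the river north."

print(f"looping: {distinct_ngram_ratio(looping):.2f}")
print(f"varied:  {distinct_ngram_ratio(varied):.2f}")
```

Comparing this ratio across greedy and sampled completions of the same prompt gives a simple, if crude, measure of how much each strategy repeats itself.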

In [19]:
prompts = [
    "Gandalf is a wizard that is about to go on a journey to Hogwarts",
    "Ju-ve built a time machine, and went back in time 27 years ago",
    "Professor Dimitar Volpermort is kidnapped at DigiPen Institute of Technology",
    "Willaim is in love with Juliet"
]

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 50)
    completion = autocomplete(prompt, max_length=80, is_sampling=False)
    print(f"Greedy Output: {completion}\n")
    print("-" * 50)
    completion = autocomplete(prompt, max_length=80)
    print(f"Sampling Output: {completion}\n")
    print("-" * 50)
    completion = autocomplete(prompt, max_length=80, temperature=1.0)
    print(f"Sampling w/ 1.0 temperature Output: {completion}\n")
    print("-" * 50)
    completion = autocomplete(prompt, max_length=80, temperature=1.5)
    print(f"Sampling w/ 1.5 temperature Output: {completion}\n")
    print("-" * 50)
    completion = autocomplete(prompt, max_length=160, temperature=1.5)
    print(f"Sampling w/ 1.5 temperature and 160 max Output: {completion}\n")


Prompt: Gandalf is a wizard that is about to go on a journey to Hogwarts
--------------------------------------------------
Greedy Output: Gandalf is a wizard that is about to go on a journey to Hogwarts. He is a wizard who has been trained by the wizarding world to be a wizard. He is a wizard who has been trained by the wizarding world to be a wizard. He is a wizard who has been trained by the wizarding world to be a wizard. He is a wizard who has been trained by

--------------------------------------------------
Sampling Output: Gandalf is a wizard that is about to go on a journey to Hogwarts to find his father. He is the last of the Death Eaters to be killed in the Battle of Hogwarts, and is the last of the Death Eaters to die in the battle for Hogwarts.

Contents show]

Biography

Early life

Harry Potter, a boy, was born in

--------------------------------------------------
Sampling w/ 1.0 temperature Output: Gandalf is a wizard that is about to go on a journey to Hogwarts to s

Document your observations (using Markdown in the notebook):

1.   What differences do you notice between the strategies?
2.   What worked well (or surprised you)?
3.   What didn’t work that well?