# Downloading OpenAI's GPT-2 (Generative Pre-trained Transformer 2)

### Install the transformers library

In [3]:
! pip install transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




### Load GPT-2

This loads the base version of the GPT-2 model and its tokenizer.
If you want, you can specify a larger variant instead:
"gpt2-medium", "gpt2-large", or "gpt2-xl".

- Tokenizer: Converts raw text into token IDs (integers), so the model can process it.

- LMHeadModel: The GPT-2 architecture with a language-modeling head on top. It generates text by predicting the next token based on the previous tokens.


In [4]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the model
model = GPT2LMHeadModel.from_pretrained("gpt2")

### Use the model (Example: Text Generation)

Here we're using a pre-trained language model (GPT-2) to generate text. 
Think of this model as a super advanced autocomplete that can write full sentences or paragraphs based on an input prompt (in this case, "Hello"). 

In [None]:
# Import PyTorch (needed because GPT-2 works with tensors, which are multi-dimensional arrays)
import torch

# Give the model a starting point
input_text = "Hello"
# Transformers don't read text, they read numbers!
# The tokenizer (loaded above) breaks your text into smaller pieces and maps each piece to a number (token ID)
# return_tensors='pt' tells the tokenizer to return a PyTorch tensor (not just a list), because that's what the model expects.
# After this line, input_ids holds a tensor like: tensor([[15496]]), where 15496 is the ID for "Hello".
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Attention mask setup (for padding)
# Tells the model which tokens it should pay attention to: 1 means "use this token", 0 means "ignore it" (padding).
# There is no padding here, so the mask is all ones.
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

# generate() needs a padding token ID for sequences that finish early or are shorter than others in a batch.
# GPT-2 doesn't have a padding token, so we use the end-of-sequence (eos) token instead.
pad_token_id = tokenizer.eos_token_id  # Set pad token ID to eos token ID

# Generate text with attention mask and pad token
output = model.generate(input_ids, # Input in number format
                        attention_mask=attention_mask, # Tells the model which tokens to pay attention to
                        pad_token_id=pad_token_id, # Just in case the model needs padding
                        max_length=100, # Upper limit on the total length (prompt + generated tokens); generation can stop earlier at an eos token
                        do_sample=False, # Greedy decoding: always pick the most likely next token. Set to True to sample randomly, which makes the output more creative
                        top_k=100) # Only used when do_sample=True: sample from the 100 most likely next tokens, which keeps results interesting but not too random
# After this line, output contains generated token IDs, like: tensor([[15496, 11, ...]])

# Convert numbers back to text
print(tokenizer.decode(output[0], skip_special_tokens=True))


Hello, I'm sorry, but I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if you're aware of this. I'm not sure if
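
The looping output above is typical of greedy decoding (`do_sample=False`): the model always takes the single most likely next token, so it can get stuck in a cycle. The sketch below contrasts greedy picking with top-k sampling on a made-up list of scores. It is a standard-library illustration only, not how `transformers` implements it; `toy_logits`, `greedy_pick`, and `top_k_sample` are hypothetical names.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng):
    """Top-k sampling: keep the k highest scores, then sample among them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

# Made-up scores for a 5-token vocabulary (assumption, for illustration only)
toy_logits = [2.0, 1.5, 0.3, -1.0, -3.0]
rng = random.Random(42)

greedy_choice = greedy_pick(toy_logits)  # same answer on every call
sampled = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(10)]
print(greedy_choice, sampled)
```

Greedy picking returns the same index every time, while top-k sampling varies among the three highest-scoring indices. In the cell above, the equivalent change would be `do_sample=True`, which is when `top_k` takes effect.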


⬆ ABOUT THE ABOVE CODE:
It is **low-level**:
```python
input_ids = tokenizer.encode(...)
output = model.generate(...)
```
**Characteristics:**
- **More manual control** — You specify attention masks, padding tokens, decoding, etc.
- You’re interacting **directly** with the model and tokenizer objects.
- **Better for customization** — e.g., adding constraints, working with batches, doing masked generation, etc.
- Good when you want to:
  - Understand how generation works
  - Build complex workflows
  - Fine-tune models
  - Add post-processing
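
Working with batches is where the attention mask and padding token really matter: prompts of different lengths have to be padded to a common length, and the mask marks which positions are real. Below is a minimal plain-Python sketch of that bookkeeping, assuming left padding (commonly recommended for decoder-only models like GPT-2 when generating). `pad_batch` is a hypothetical helper, and the token IDs other than 15496 ("Hello") are placeholders.

```python
PAD_ID = 50256  # GPT-2's eos_token_id, reused as the padding ID

def pad_batch(sequences, pad_id=PAD_ID):
    """Left-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 = padding (ignore), 1 = real token (attend)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# Two prompts of different lengths (placeholder IDs, for illustration only)
batch = [[15496], [15496, 11, 318]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

In practice the tokenizer can do this for you when called on a list of strings with `padding=True`, after a pad token has been set.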
 
⬇ ABOUT THE FOLLOWING CODE: 

```python
generator = pipeline('text-generation', model='gpt2')
generator("Hello, I'm a language model,", ...)
```

**Characteristics:**
- **High-level abstraction** — It wraps the tokenizer and model together.
- You don’t need to manually encode or decode anything.
- **Easier and faster** for most use cases like prototyping or demos.
- Automatically handles:
  - Tokenization
  - Model inference
  - Decoding
  - Setting padding/attention/etc. under the hood
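
To make the "wraps the tokenizer and model together" idea concrete, here is a toy sketch of the encode → generate → decode chain that a pipeline hides. Everything in it is a hypothetical stand-in: `ToyTokenizer` is word-level (GPT-2 really uses byte-level BPE), and `toy_generate` just counts upward instead of running a neural network. Only the shape of the workflow mirrors the real thing.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer, for illustration only."""

    def __init__(self, vocab):
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text):
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


def toy_generate(input_ids, max_length, vocab_size):
    """Stand-in for a model: keeps appending (last ID + 1) until the
    total length, prompt included, reaches max_length."""
    ids = list(input_ids)
    while len(ids) < max_length:
        ids.append((ids[-1] + 1) % vocab_size)
    return ids


def toy_pipeline(tokenizer, prompt, max_length):
    """Chain the three steps a text-generation pipeline performs."""
    ids = tokenizer.encode(prompt)                       # 1. tokenization
    out = toy_generate(ids, max_length,
                       len(tokenizer.id_to_token))       # 2. model inference
    return [{"generated_text": tokenizer.decode(out)}]   # 3. decoding


tok = ToyTokenizer(["hello", "I", "am", "a", "toy", "model"])
result = toy_pipeline(tok, "hello I", max_length=5)
print(result)  # [{'generated_text': 'hello I am a toy'}]
```

Note that `max_length=5` counts the two prompt tokens as well, which matches how `max_length` behaves in `generate`.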

In [None]:
#! pip install tf-keras

# Import pipeline: a shortcut provided by Hugging Face to quickly use pre-trained models
# set_seed fixes the random seeds so that sampled outputs are reproducible
from transformers import pipeline, set_seed

# Create a generator using PyTorch (avoid TensorFlow-related imports)
# You pass the task type and the model you want to use to pipeline()
# It automatically:
#   Loads the tokenizer
#   Loads the model
#   Decides whether to use PyTorch or TensorFlow (in this case, PyTorch)
generator = pipeline('text-generation', model='gpt2')

# Actually run the generator
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, and my project will get better with time, but I think there are a lot more things that can help you"},
 {'generated_text': "Hello, I'm a language model, not a language model, so if I don't have a problem, I can fix it by creating new words"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to learn some stuff. I'll try to do some basic programming and just learn better ways"},
 {'generated_text': "Hello, I'm a language model, but I don't believe in grammar. This will work for every language model. You can define it very quickly"},
 {'generated_text': 'Hello, I\'m a language model, a model of how things should be, and then we look at different things as well." I\'d like to'}]
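
Sampling draws random numbers, so calling `set_seed(42)` before generating is what lets the five completions above come out the same on a rerun with the same setup. The idea can be sketched with Python's standard `random` module. This is an analogy, not the `transformers` implementation; `sample_tokens` and its tiny vocabulary are hypothetical.

```python
import random

def sample_tokens(seed, n=5):
    """Draw n pseudo-random 'tokens' from a tiny made-up vocabulary."""
    rng = random.Random(seed)  # seeded generator, independent of global state
    vocab = ["and", "but", "not", "a", "so"]
    return [rng.choice(vocab) for _ in range(n)]

run_a = sample_tokens(42)
run_b = sample_tokens(42)  # same seed: identical draws
run_c = sample_tokens(7)   # different seed: usually different draws

print(run_a == run_b)
print(run_a, run_c)
```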