<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Architectures/gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative Pre-trained Transformer (GPT)

GPT is a family of large language models built on the Transformer architecture. Developed by OpenAI, GPT models are pre-trained on large text corpora and perform well at text generation, language understanding, and a range of related language tasks.

## Architectural Overview

GPT is built on the decoder portion of the Transformer architecture with some key characteristics:

### Key Components

1. **Transformer Decoder Blocks**: Multiple layers of transformer decoder blocks stacked together
2. **Self-Attention Mechanism**: Masked self-attention that prevents the model from attending to future tokens
3. **Feed-Forward Networks**: Position-wise fully connected layers for additional processing
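4. **Layer Normalization**: Applied after each sub-block in GPT-1 and before each sub-block (pre-norm) from GPT-2 onward, for training stability
5. **Positional Embeddings**: Learned position vectors added to the token embeddings to capture token order within sequences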

### Architecture Diagram

```
Input Sequence → Embedding + Positional Encoding →
                                                   ↓
                                      ┌─────────────────────┐
                                      │ Masked Self-Attention│
                                      └─────────────────────┘
                                                   ↓
                                      ┌─────────────────────┐  
                                      │   Feed-Forward NN   │ 
                                      └─────────────────────┘
                                                   ↓
                                             (Repeat N times)
                                                   ↓
                                              Output Layer
```

Unlike BERT, which attends to both left and right context (bidirectional), GPT uses a unidirectional (left-to-right) attention mechanism: each token can attend only to itself and to earlier tokens.
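
The masking rule can be illustrated with a minimal sketch in plain Python. The helper names `causal_mask` and `masked_softmax` are hypothetical and chosen for this example; a real implementation would operate on tensors and add learned query/key/value projections.

```python
import math

def causal_mask(n):
    """Return an n x n mask: mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over `scores` that gives zero weight to masked-out (future) positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Only visible positions contribute; hidden ones behave like a score of -infinity.
        visible = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy attention scores for a 3-token sequence (all equal, so weights are uniform over visible tokens).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attention = masked_softmax(scores, causal_mask(3))
for row in attention:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, the second splits its weight between the first two positions, and the third spreads its weight over all three.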

## Training Methodology

GPT models follow a two-phase training approach:

### 1. Pre-training (Generative)

The model is trained on a massive corpus of text using unsupervised learning to predict the next token in a sequence given all the previous tokens.

**Objective**: Maximize the likelihood of each token given the preceding context:

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

Where:
- $U$ is the unlabeled text corpus
- $u_i$ is the $i$-th token, the one being predicted
- $k$ is the context window size
- $\Theta$ represents the model parameters
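
As a rough illustration of this objective, the sketch below evaluates the sum of log-probabilities on one short sequence. `TOY_MODEL` is a hypothetical lookup table with made-up probabilities standing in for the network; it conditions only on the previous token, so it corresponds to $k = 1$.

```python
import math

# Hypothetical toy "model": probability of the next token given only the previous token.
# The numbers are invented for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.5, "cat": 0.25, "sat": 0.25},
    "the": {"the": 0.125, "cat": 0.75, "sat": 0.125},
    "cat": {"the": 0.25, "cat": 0.25, "sat": 0.5},
}

def log_likelihood(tokens, k=1):
    """Sum of log P(u_i | previous k tokens) over one sequence."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]
        # The toy model only looks at the last context token, so k > 1 changes nothing here.
        probs = TOY_MODEL[context[-1]]
        total += math.log(probs[tokens[i]])
    return total

sequence = ["<s>", "the", "cat", "sat"]
ll = log_likelihood(sequence)
print(f"log-likelihood = {ll:.4f}")
print(f"average negative log-likelihood = {-ll / (len(sequence) - 1):.4f}")
```

Training frameworks usually minimize the average negative log-likelihood (cross-entropy), which is equivalent to maximizing $L_1$.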

### 2. Fine-tuning

The pre-trained model is then fine-tuned on specific downstream tasks using supervised learning with labeled data.

**Objective**: Maximize the likelihood of the correct output given the input:

$$L_2(C) = \sum_{(x,y) \in C} \log P(y \mid x; \Theta)$$

Where:
- $C$ is the labeled dataset for the specific task
- $x$ is the input sequence
- $y$ is the target output
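
The supervised objective can be sketched the same way. In the example below, each labeled example is represented by hypothetical class logits (as a task head might produce for input $x$) together with the index of the correct label $y$; the function names and values are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]

def supervised_objective(batch):
    """Sum of log P(y | x) over (logits, label) pairs."""
    return sum(log_softmax(logits)[label] for logits, label in batch)

# Hypothetical classifier outputs: (class logits for input x, index of the correct label y).
labeled_batch = [
    ([2.0, 0.0], 0),  # fairly confident and correct
    ([0.0, 0.0], 1),  # undecided: P(y | x) = 0.5
]
l2 = supervised_objective(labeled_batch)
print(f"L2 = {l2:.4f}")
```

The GPT-1 paper (Radford et al., 2018) additionally reports that keeping language modeling as an auxiliary objective during fine-tuning, $L_3(C) = L_2(C) + \lambda L_1(C)$, helped generalization and convergence.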

## Evolution of GPT Models

The GPT architecture has evolved significantly across different versions:

### GPT-1 (2018)
- 117 million parameters
- 12 transformer layers
- Showed that generative pre-training of a Transformer followed by task-specific fine-tuning transfers well across NLP tasks
- Trained on BookCorpus (about 4.5GB of text)
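
The parameter count follows from the layer sizes. The sketch below is a back-of-the-envelope estimate, assuming hyperparameters not stated above: model width 768, a feed-forward width of four times the model width, a vocabulary of 40,478 tokens, a 512-token context, and output weights tied to the token embedding. The function name is hypothetical.

```python
def decoder_param_count(n_layers, d_model, vocab_size, n_positions):
    """Estimate the parameters of a GPT-style decoder with tied input/output embeddings."""
    attention = 4 * d_model * d_model + 4 * d_model      # Q, K, V and output projections, with biases
    d_ff = 4 * d_model                                    # feed-forward hidden width (assumed 4x)
    feed_forward = 2 * d_model * d_ff + d_ff + d_model    # two linear layers, with biases
    layer_norms = 2 * 2 * d_model                         # two layer norms per block (scale and shift)
    per_block = attention + feed_forward + layer_norms
    embeddings = vocab_size * d_model + n_positions * d_model  # token and learned position tables
    return n_layers * per_block + embeddings

# Assumed GPT-1 hyperparameters: 12 layers, width 768, vocabulary 40,478, context 512.
total = decoder_param_count(n_layers=12, d_model=768, vocab_size=40_478, n_positions=512)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at roughly 117 million, with about a quarter of the parameters in the embedding tables.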

### GPT-2 (2019)
- Up to 1.5 billion parameters
- 48 transformer layers in largest model
- Improved text generation capabilities
- Trained on WebText (40GB of text)
- Famous for its ability to generate coherent, lengthy passages

### GPT-3 (2020)
- 175 billion parameters
- 96 transformer layers
- Demonstrated few-shot and zero-shot learning capabilities
- Trained on a mix of filtered Common Crawl (about 570GB after filtering), WebText2, Books1, Books2, and Wikipedia
- Capable of performing tasks without explicit fine-tuning
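
Few-shot use means placing a handful of solved examples in the prompt and letting the model continue the pattern, with no weight updates. The sketch below shows one way to assemble such a prompt; the `Input:`/`Output:` layout and the function name are assumptions for illustration, not a format prescribed by the GPT-3 paper.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Format a task description, solved examples, and a new query into one prompt string."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # Leave the final output empty so the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "book",
)
print(prompt)
```

Passing an empty `examples` list yields the zero-shot variant of the same prompt.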

### GPT-4 (2023)
- Parameter count undisclosed; public estimates of 1 trillion or more are unconfirmed
- Multimodal capabilities (can process both text and images)
- Significantly improved reasoning and safety features
- Reduced hallucinations and better factual accuracy

Each iteration has demonstrated substantial improvements in performance, capabilities, and breadth of applications.

## Applications of GPT

GPT models excel in a wide range of natural language processing tasks:

### Text Generation
- Creative writing (stories, poetry, scripts)
- Content creation for articles and marketing
- Code generation and completion

### Language Understanding
- Question answering
- Summarization
- Translation
- Sentiment analysis

### Conversational AI
- Chatbots and virtual assistants
- Customer support automation
- Interactive storytelling

### Specialized Applications
- Legal document analysis
- Medical text processing
- Educational content generation
- Scientific research assistance

## Implementation Example

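The following cells use GPT-2 through the Hugging Face `transformers` library:

```python
# Install required packages
# (datasets and accelerate are needed by the fine-tuning example further down)
!pip install transformers torch datasets accelerate
```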

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2"  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

def generate_text(prompt, max_length=100):
    """Sample a continuation of `prompt`; `max_length` counts prompt plus generated tokens."""
    # Encode the prompt; calling the tokenizer also returns the attention mask
    inputs = tokenizer(prompt, return_tensors="pt")

    # Sample with temperature, top-k, and top-p (nucleus) filtering
    output = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse end-of-text
    )

    # Decode and return the generated text
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

```python
# Example usage
prompt = "Artificial intelligence will"
generated_text = generate_text(prompt)
print(generated_text)
```
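
To make the sampling parameters concrete, the sketch below applies temperature scaling, then top-k, then top-p filtering to a small list of logits in plain Python. It is a simplified illustration under stated assumptions (the function name is hypothetical, and `transformers` implements these steps on tensors with details that differ), not the library's code.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample a token index from `logits` using temperature, then top-k, then top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix whose renormalized cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so the survivors are renormalized implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [4.0, 3.0, 1.0, -2.0]
rng = random.Random(0)
draws = [sample_next_token(logits, top_k=2, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `top_k=2` only the two highest-scoring tokens can ever be drawn, and setting `top_k=1` reduces sampling to greedy decoding.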

## Fine-tuning GPT for Specific Tasks

The function below sketches causal language model fine-tuning with the `Trainer` API. It reuses the `model` and `tokenizer` loaded above and expects a dataset with a `text` column.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Example: Fine-tuning GPT-2 on a custom dataset
def finetune_gpt(dataset_path, output_dir="./finetuned-gpt"):
    # Load dataset (must contain a "text" column)
    dataset = load_dataset(dataset_path)

    # GPT-2 has no padding token by default, so reuse the end-of-text token
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenize without padding; the data collator pads each batch dynamically
    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=512)

    tokenized_dataset = dataset.map(tokenize_function, batched=True)

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Data collator for causal language modeling (mlm=False disables BERT-style masking)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )

    # Initialize trainer; evaluation data is used only if a "validation" split exists
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"] if "validation" in tokenized_dataset else None,
        data_collator=data_collator,
    )

    # Train the model
    trainer.train()

    # Save the model to output_dir
    trainer.save_model()

    return trainer
```
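
With `mlm=False`, the collator builds labels by copying the input ids and replacing padding positions with `-100`, a value the loss ignores; the model then shifts the labels internally so that each position is trained to predict the next token. The plain-Python sketch below mimics that bookkeeping on one padded sequence. The helper names and token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, marking padding positions so the loss ignores them."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """Pair each position's input token with the next position's label, skipping ignored labels."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[position], target))
    return pairs

# Hypothetical token ids; 0 stands in for the padding id.
input_ids = [11, 12, 13, 0, 0]
labels = make_causal_lm_labels(input_ids, pad_token_id=0)
print(labels)                            # [11, 12, 13, -100, -100]
print(shifted_pairs(input_ids, labels))  # [(11, 12), (12, 13)]
```

One consequence worth knowing: when the padding token is set to the end-of-text token, genuine end-of-text tokens are masked out of the loss as well.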

## Advantages and Limitations

### Advantages
- Powerful zero-shot and few-shot learning capabilities
- Excellent at generating coherent, contextually relevant text
- Versatile across multiple language tasks without specific architecture modifications
- Can understand nuanced prompts and follow complex instructions

### Limitations
- Unidirectional attention means a token's representation cannot use context to its right, which can hurt tasks that benefit from bidirectional context
- Can produce plausible-sounding but factually incorrect information ("hallucinations")
- Computationally expensive to train and run (especially larger models)
- Has no explicit knowledge base or dedicated reasoning module; knowledge is stored implicitly in the weights
- Training data cutoff creates temporal limitations
- Ethical concerns around bias, misuse for misinformation, etc.

## References

1. Radford, A., et al. (2018). [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf). OpenAI.

2. Radford, A., et al. (2019). [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). OpenAI.

3. Brown, T. B., et al. (2020). [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). arXiv preprint arXiv:2005.14165.

4. OpenAI (2023). [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774). arXiv preprint arXiv:2303.08774.

5. Vaswani, A., et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). NeurIPS.

6. Wolf, T., et al. (2020). [Transformers: State-of-the-Art Natural Language Processing](https://www.aclweb.org/anthology/2020.emnlp-demos.6/). EMNLP.