# Tutorial 01: Introduction and Setup

Welcome to the Transformer Tutorial series! This set of notebooks will guide you through the foundations, implementation, and applications of transformer-based models in deep learning. By the end of this series, you will understand how transformers work from the ground up and be able to implement them yourself.

## What Are Transformers?

The Transformer is a neural network architecture introduced in the paper **"Attention Is All You Need"** (Vaswani et al., 2017). It revolutionized Natural Language Processing and is increasingly used in computer vision, speech recognition, and other domains.

### Key Innovations:

1. **Eliminating recurrence**: Unlike RNNs and LSTMs, Transformers process entire sequences in parallel
2. **Using self-attention**: Allowing each position to attend to all positions in the previous layer
3. **Scaling efficiently**: Because all positions are processed in parallel, training makes much better use of modern hardware (GPUs/TPUs) than step-by-step recurrent models

### Core Attention Mechanism:

The fundamental attention operation is:

```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

Where:
- **Q** (Query): What we're looking for
- **K** (Key): What we're comparing against
- **V** (Value): The actual information we want to retrieve
- **d_k**: Dimension of the key vectors (used for scaling)
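
To make the formula concrete, here is a minimal, dependency-free sketch that evaluates it on two tokens using plain Python lists. The function name `attention` and the toy values are illustrative assumptions, not part of any library; Section 4.5 shows the PyTorch version.

```python
import math

def attention(Q, K, V):
    """Evaluate softmax(Q K^T / sqrt(d_k)) V on small nested-list matrices."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax: exponentiate, then normalize so the weights sum to 1
        exps = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Weighted average of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy values (hypothetical): two tokens, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

result = attention(Q, K, V)
for row in result:
    print([round(x, 3) for x in row])
```

Each query matches its own key best, so each output row leans toward the corresponding value vector while still mixing in part of the other one.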

---

## What You'll Learn in This Tutorial:
- Prerequisites and foundational concepts
- Hardware/software environment setup
- Installing core dependencies
- Verifying your installation
- Basic tensor operations for transformers
- Project structure overview

---

## 1. Prerequisites

Before starting, ensure you have a solid foundation in:

### Python Programming
- Object-oriented programming
- NumPy for array operations
- Basic PyTorch usage

### Mathematics
- **Linear algebra**: Matrix multiplication, vectors, dot products
- **Calculus basics**: Gradients, derivatives
- **Probability**: Softmax, probability distributions

### Deep Learning Concepts
- Neural networks and backpropagation
- Gradient descent optimization
- Embeddings and word representations

### System Requirements:
- **Python 3.8+** (Recommended: 3.10)
- An NVIDIA GPU with CUDA support (optional, but recommended for faster training)
- Familiarity with Jupyter notebooks

## 2. Environment Setup

We recommend setting up a virtual environment to avoid dependency conflicts. You can use Python's built-in **venv** module (or the third-party **virtualenv** package) or **conda** for this purpose.

### Option 1: Using venv (Linux/Mac/Windows)

In [None]:
# venv is part of the standard library (Python 3.3+).
# virtualenv is a third-party alternative: pip install virtualenv

# Create virtual environment
# !python -m venv transformer_env

# Activate it from a terminal (activation inside a notebook cell does not persist):
# Linux/Mac: source transformer_env/bin/activate
# Windows: transformer_env\Scripts\activate

print("Virtual environment setup commands (uncomment to run)")

### Option 2: Using conda (Anaconda/Miniconda)

In [None]:
# Create conda environment
# !conda create -n transformer-env python=3.10 -y

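# Activate it from a terminal before launching Jupyter
# (`!conda activate` in a notebook cell runs in a subshell and does not persist):
# conda activate transformer-env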

print("Conda environment setup commands (uncomment to run)")
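
Whichever option you choose, it helps to confirm that the notebook kernel is actually running inside the environment you created. The following is a minimal sketch using only the standard library; the helper name `in_virtual_environment` is an illustrative assumption, and the check relies on `sys.base_prefix`, which venv and recent virtualenv releases set.

```python
import os
import sys

def in_virtual_environment():
    """Return True when the interpreter runs inside a venv or virtualenv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"venv/virtualenv active: {in_virtual_environment()}")
# conda sets CONDA_DEFAULT_ENV when an environment is activated
print(f"conda environment: {os.environ.get('CONDA_DEFAULT_ENV', 'none detected')}")
```

If the interpreter path points outside your environment, the notebook is probably using a different kernel than the one you activated in the terminal.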

### Install Required Libraries

We'll use PyTorch as our main framework, along with NumPy, Matplotlib, and other utilities.

**Note**: If you're using CUDA (GPU acceleration), refer to [PyTorch's installation guide](https://pytorch.org/get-started/locally/) for the correct commands for your system.

In [None]:
# Install PyTorch (CPU version)
# For GPU version, check: https://pytorch.org/get-started/locally/
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install other dependencies
# !pip install numpy matplotlib tqdm

# Optional: For working with pre-trained models
# !pip install transformers datasets

print("Installation commands (uncomment to run if not already installed)")

## 3. Environment Verification

Let's verify that all required libraries are installed correctly and check your system configuration.

### 3.1 Check Python Version

In [None]:
import sys

def check_python_version():
    """Check if Python version is 3.8 or higher."""
    version = sys.version_info
    print(f"Python Version: {version.major}.{version.minor}.{version.micro}")
    if version >= (3, 8):
        print("✓ Python version is compatible")
        return True
    else:
        print("✗ Python version must be 3.8 or higher")
        return False

check_python_version()

### 3.2 Check PyTorch Installation

In [None]:
def check_torch():
    """Check if PyTorch is installed and working."""
    try:
        import torch
        print(f"✓ PyTorch version: {torch.__version__}")
        
        # Test basic tensor operations
        x = torch.tensor([1.0, 2.0, 3.0])
        y = torch.tensor([4.0, 5.0, 6.0])
        z = x + y
        print(f"  Basic operation test: {x.tolist()} + {y.tolist()} = {z.tolist()}")
        
        # Check CUDA availability
        if torch.cuda.is_available():
            print(f"  ✓ CUDA available: Yes (Device: {torch.cuda.get_device_name(0)})")
        else:
            print(f"  ℹ CUDA available: No (using CPU - this is fine for learning!)")
        
        return True
    except ImportError:
        print("✗ PyTorch is not installed")
        print("  Install it with: pip install torch")
        return False

check_torch()

### 3.3 Check NumPy Installation

In [None]:
def check_numpy():
    """Check if NumPy is installed and working."""
    try:
        import numpy as np
        print(f"✓ NumPy version: {np.__version__}")
        
        # Test basic array operations
        a = np.array([1, 2, 3])
        b = np.array([4, 5, 6])
        c = a + b
        print(f"  Basic operation test: {a.tolist()} + {b.tolist()} = {c.tolist()}")
        
        return True
    except ImportError:
        print("✗ NumPy is not installed")
        print("  Install it with: pip install numpy")
        return False

check_numpy()
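
The per-library checks above can be generalized into a single report. This is a minimal sketch using `importlib.metadata` from the standard library (Python 3.8+); the helper name `report_packages` is an illustrative assumption, and the package list simply mirrors the install commands from Section 2.

```python
from importlib import metadata

def report_packages(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names taken from the install commands earlier in this tutorial
packages = ["torch", "numpy", "matplotlib", "tqdm", "transformers", "datasets"]
report = report_packages(packages)
for name, version in report.items():
    status = f"✓ {version}" if version else "✗ not installed"
    print(f"{name:<14}{status}")
```

This reads installed package metadata without importing the packages themselves. `transformers` and `datasets` are optional, so a missing entry for them is not an error.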

## 4. Basic Concepts Demonstration

Let's explore the fundamental operations you'll encounter throughout these tutorials. Understanding these concepts is crucial for implementing transformers.

### 4.1 Tensor Shapes (Critical for Understanding Transformers)

In [None]:
import torch

# Common transformer dimensions
batch_size = 2      # Number of sequences processed together
seq_length = 4      # Number of tokens in each sequence
d_model = 8         # Embedding dimension (size of each token representation)

# Simulating a batch of sequences
x = torch.randn(batch_size, seq_length, d_model)

print("="*60)
print("TENSOR SHAPES")
print("="*60)
print(f"Input shape: {x.shape}")
print(f"  - Dimension 0 (batch_size={batch_size}): Number of sequences")
print(f"  - Dimension 1 (seq_length={seq_length}): Number of tokens per sequence")
print(f"  - Dimension 2 (d_model={d_model}): Embedding dimension per token")
print(f"\nThis represents {batch_size} sequences, each with {seq_length} tokens,")
print(f"where each token is represented as a {d_model}-dimensional vector.")

### 4.2 Matrix Multiplication (Core of Attention)

In [None]:
# Query and Key matrices for attention
Q = torch.randn(seq_length, d_model)
K = torch.randn(seq_length, d_model)

# Raw attention scores: Q @ K^T (not yet scaled or normalized)
# Entry (i, j) is the dot-product similarity between token i's query and token j's key;
# softmax later turns each row into attention weights
scores = torch.matmul(Q, K.transpose(-2, -1))

print("="*60)
print("MATRIX MULTIPLICATION IN ATTENTION")
print("="*60)
print(f"Q (Query) shape: {Q.shape}")
print(f"K (Key) shape: {K.shape}")
print(f"K^T (Key transposed) shape: {K.transpose(-2, -1).shape}")
print(f"\nAttention scores (Q @ K^T) shape: {scores.shape}")
print(f"\nThis creates a {seq_length}×{seq_length} attention matrix where:")
print(f"  - Each row represents one token's attention scores")
print(f"  - Each column represents how much that position is attended to")
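
The same call also works on the batched 3-D tensors from Section 4.1. The sketch below assumes PyTorch is installed; the variable names `Q_batch` and `K_batch` are illustrative and chosen so they do not overwrite `Q` and `K` from the cell above.

```python
import torch

batch_size, seq_length, d_model = 2, 4, 8

# One Q and K matrix per sequence in the batch
Q_batch = torch.randn(batch_size, seq_length, d_model)
K_batch = torch.randn(batch_size, seq_length, d_model)

# transpose(-2, -1) swaps only the last two dimensions, so the batch
# dimension is left alone and matmul runs once per sequence
batch_scores = torch.matmul(Q_batch, K_batch.transpose(-2, -1))
print(batch_scores.shape)  # torch.Size([2, 4, 4])

# Compare against multiplying the first sequence on its own
matches = torch.allclose(batch_scores[0], Q_batch[0] @ K_batch[0].T)
print(f"First sequence matches the unbatched result: {matches}")
```

This is why the tutorial code uses `transpose(-2, -1)` rather than `.T`: indexing from the end keeps the code valid whether or not a batch dimension is present.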

### 4.3 Softmax (Turning Scores into Probabilities)

In [None]:
# Example attention scores
sample_scores = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])

# Apply softmax to convert scores to probabilities
probabilities = torch.softmax(sample_scores, dim=-1)

print("="*60)
print("SOFTMAX OPERATION")
print("="*60)
print("Input scores (raw attention scores):")
print(sample_scores)
print("\nAfter softmax (attention probabilities):")
print(probabilities)
print(f"\nSum of each row: {probabilities.sum(dim=-1)}")
print("Note: Each row sums to 1.0 (valid probability distribution)")
print("\nHigher input scores become higher probabilities.")
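
Softmax depends only on the differences between scores, not on their absolute size. The dependency-free sketch below illustrates this with the same two rows as above, and uses the common trick of subtracting the row maximum before exponentiating to avoid overflow. The function here is a hand-written illustration, not PyTorch's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)                          # largest score maps to exp(0) = 1
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row_a = softmax([1.0, 2.0, 3.0])
row_b = softmax([4.0, 5.0, 6.0])
print([round(p, 4) for p in row_a])  # [0.09, 0.2447, 0.6652]
print([round(p, 4) for p in row_b])  # same values: only score differences matter

# Without the shift, math.exp(1000.0) would raise OverflowError
print(softmax([1000.0, 1001.0, 1002.0]))
```

This explains why both rows of the `torch.softmax` output above are identical even though the second row has larger scores.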

### 4.4 Broadcasting (Used in Positional Encoding)

In [None]:
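# Adding one per-dimension vector to every token embedding.
# (Real positional encodings differ per position; they are broadcast across the batch dimension in the same way.)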
embeddings = torch.ones(3, 4)  # 3 tokens, 4-dimensional embeddings
position_bias = torch.tensor([1.0, 2.0, 3.0, 4.0])

# Broadcasting automatically expands position_bias to match embeddings shape
result = embeddings + position_bias

print("="*60)
print("BROADCASTING")
print("="*60)
print(f"Embeddings shape: {embeddings.shape}")
print(f"Position bias shape: {position_bias.shape}")
print(f"Result shape: {result.shape}")
print("\nOriginal embeddings:")
print(embeddings)
print("\nPosition bias:")
print(position_bias)
print("\nResult (embeddings + position_bias):")
print(result)
print("\nThe bias is automatically broadcast to match embedding dimensions!")
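
The rule behind broadcasting is short enough to write out: shapes are compared from the last dimension backwards, and two sizes are compatible when they are equal or one of them is 1. Here is a minimal sketch of that rule in plain Python; `broadcast_shape` is a hypothetical helper for illustration, not a PyTorch function.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the shape produced by broadcasting shapes a and b together."""
    result = []
    # Compare dimensions right to left; missing leading dimensions count as 1
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast")
        result.append(y if x == 1 else x)
    return tuple(reversed(result))

print(broadcast_shape((3, 4), (4,)))       # (3, 4): the example above
print(broadcast_shape((2, 4, 8), (4, 8)))  # (2, 4, 8): one (seq, d_model) table shared by a batch
try:
    broadcast_shape((3, 4), (3,))
except ValueError as err:
    print(err)                             # trailing sizes 4 and 3 do not match
```

The second case is the one that matters for positional encoding: a single `(seq_length, d_model)` table is added to every sequence in a `(batch_size, seq_length, d_model)` batch.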

### 4.5 Complete Mini-Example: Scaled Dot-Product Attention

In [None]:
import math

def scaled_dot_product_attention(Q, K, V):
    """
    Simplified implementation of scaled dot-product attention.
    
    Args:
        Q: Query matrix (seq_len, d_k)
        K: Key matrix (seq_len, d_k)
        V: Value matrix (seq_len, d_v)
    
    Returns:
        Tuple of (output, attention_weights):
            output: (seq_len, d_v)
            attention_weights: (seq_len, seq_len)
    """
    d_k = Q.size(-1)
    
    # 1. Compute attention scores: Q @ K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # 2. Scale by sqrt(d_k) so large dot products don't saturate the softmax
    #    (a saturated softmax has extremely small gradients)
    scores = scores / math.sqrt(d_k)
    
    # 3. Apply softmax to get attention weights
    attention_weights = torch.softmax(scores, dim=-1)
    
    # 4. Apply weights to values: attention_weights @ V
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

# Example usage
seq_len = 3
d_k = 4
d_v = 4

Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_v)

output, attention_weights = scaled_dot_product_attention(Q, K, V)

print("="*60)
print("SCALED DOT-PRODUCT ATTENTION EXAMPLE")
print("="*60)
print(f"Input shapes:")
print(f"  Q: {Q.shape}")
print(f"  K: {K.shape}")
print(f"  V: {V.shape}")
print(f"\nAttention weights shape: {attention_weights.shape}")
print(f"Output shape: {output.shape}")
print(f"\nAttention weights (how much each token attends to others):")
print(attention_weights)
print(f"\nEach row sums to 1: {attention_weights.sum(dim=-1)}")
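
To see why the scaling step divides by `sqrt(d_k)` specifically, a small simulation helps. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance `d_k`, so its typical magnitude grows like `sqrt(d_k)`. The sketch below estimates this with the standard library only; the helper name, the trial count, and the chosen `d_k` values are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def dot_product_std(d_k, trials=2000):
    """Empirical standard deviation of q·k for random unit-variance vectors."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)
    print(f"d_k={d_k:>3}  std(q·k)≈{raw:6.2f}  after /sqrt(d_k)≈{scaled:.2f}")
```

The unscaled spread grows with `d_k`, while the scaled spread stays close to 1 regardless of dimension. That keeps the softmax inputs in a range where its gradients remain useful.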

## 5. Project Structure

Here's how this repository is organized:

```
transformer_tutorial/
├── README.md                    # Main documentation
├── requirements.txt             # Python dependencies
├── tutorials/
│   ├── 01_introduction/         # This tutorial
│   │   ├── 01_introduction.ipynb
│   │   ├── README.md
│   │   └── example.py
│   ├── 02_attention_mechanism/  # Deep dive into attention
│   ├── 03_multi_head_attention/ # Multi-head attention
│   ├── 04_positional_encoding/  # Adding position information
│   ├── 05_feed_forward/         # Feed-forward networks
│   ├── 06_encoder_layer/        # Complete encoder layer
│   ├── 07_decoder_layer/        # Complete decoder layer
│   └── 08_complete_transformer/ # Full transformer model
```

Each tutorial builds on the previous ones:
1. **Introduction** (this notebook): Setup and basic concepts
2. **Attention Mechanism**: Understanding and implementing attention
3. **Multi-Head Attention**: Parallel attention mechanisms
4. **Positional Encoding**: Adding sequence order information
5. **Feed-Forward Networks**: Position-wise transformations
6. **Encoder Layer**: Combining components into encoder
7. **Decoder Layer**: Building the decoder with masked attention
8. **Complete Transformer**: Putting it all together

## 6. Transformer Architecture Overview

Before we dive deep into the details, here's a high-level view of the complete architecture:

```
Input Sequence (e.g., "Hello world")
        ↓
[Input Embedding + Positional Encoding]
        ↓
┌─────────────────────────────────────┐
│   ENCODER (N=6 identical layers)   │
│                                     │
│   For each layer:                  │
│   ┌──────────────────────────────┐ │
│   │ 1. Multi-Head Self-Attention │ │
│   │    - Queries, Keys, Values   │ │
│   │    - 8 parallel heads        │ │
│   └──────────────────────────────┘ │
│            ↓                        │
│   [Add & Normalize]                │
│            ↓                        │
│   ┌──────────────────────────────┐ │
│   │ 2. Feed-Forward Network      │ │
│   │    - Two linear layers       │ │
│   │    - ReLU activation         │ │
│   └──────────────────────────────┘ │
│            ↓                        │
│   [Add & Normalize]                │
└─────────────────────────────────────┘
        ↓
[Encoder Output]
        ↓
┌─────────────────────────────────────┐
│   DECODER (N=6 identical layers)   │
│                                     │
│   For each layer:                  │
│   ┌──────────────────────────────┐ │
│   │ 1. Masked Self-Attention     │ │
│   │    - Prevents future peeking │ │
│   └──────────────────────────────┘ │
│            ↓                        │
│   [Add & Normalize]                │
│            ↓                        │
│   ┌──────────────────────────────┐ │
│   │ 2. Cross-Attention           │ │
│   │    - Attends to encoder      │ │
│   └──────────────────────────────┘ │
│            ↓                        │
│   [Add & Normalize]                │
│            ↓                        │
│   ┌──────────────────────────────┐ │
│   │ 3. Feed-Forward Network      │ │
│   └──────────────────────────────┘ │
│            ↓                        │
│   [Add & Normalize]                │
└─────────────────────────────────────┘
        ↓
[Linear Layer + Softmax]
        ↓
Output Probabilities
```
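
The "prevents future peeking" step in the decoder is usually implemented with a causal mask applied to the attention scores before softmax. The following is a minimal sketch of that idea, assuming PyTorch is installed; the random `scores` tensor is a stand-in for real `Q @ K^T / sqrt(d_k)` values, and later tutorials may structure the mask differently.

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # stand-in for Q @ K^T / sqrt(d_k)

# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Future positions get -inf, so softmax assigns them exactly zero weight
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(causal_mask.int())
print(weights)
```

Every row still sums to 1, but all entries above the diagonal are zero. The first token can only attend to itself, so its row is `[1, 0, 0, 0]`.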

## 7. Common Applications

Transformers are used in numerous real-world applications:

### Natural Language Processing
- **Machine Translation**: Translating text between languages (e.g., Google Translate)
- **Text Summarization**: Condensing long documents into summaries
- **Question Answering**: Finding answers in text (e.g., search engines)
- **Text Generation**: Creating coherent text (e.g., GPT models, ChatGPT)
- **Sentiment Analysis**: Understanding emotions in text
- **Named Entity Recognition**: Identifying people, places, organizations

### Computer Vision
- **Image Classification**: Vision Transformers (ViT)
- **Object Detection**: DETR (Detection Transformer)
- **Image Generation**: DALL-E, Stable Diffusion

### Multimodal Applications
- **Image Captioning**: Describing images with text
- **Visual Question Answering**: Answering questions about images
- **Text-to-Image Generation**: Creating images from descriptions

## 8. Running the Notebooks

You can run these notebooks in several ways:

### Local Installation
```bash
# Install Jupyter
pip install notebook
jupyter notebook

# Or Jupyter Lab
pip install jupyterlab
jupyter lab
```

### Cloud Platforms
- **Google Colab**: Free GPU access - [colab.research.google.com](https://colab.research.google.com/)
- **Kaggle Notebooks** (formerly Kernels): Free GPU/TPU access - [kaggle.com/kernels](https://www.kaggle.com/kernels)
- **AWS SageMaker**: Professional cloud environment

### Tips for Cloud Platforms
- Upload notebooks directly or clone the repository
- Make sure to enable GPU runtime for faster execution
- Save your work frequently

## 9. Tips for Success

### Learning Strategies
1. **Read and experiment**: Don't just run the cells—modify them and see what happens
2. **Understand the math**: Work through the mathematical operations step by step
3. **Visualize**: Draw diagrams of tensor shapes and data flow
4. **Code from scratch**: Try implementing components yourself before looking at solutions
5. **Ask questions**: Use GitHub Issues or Discussions for help

### Debugging Tips
- **Print tensor shapes**: Use `.shape` frequently to understand dimensions
- **Use small examples**: Test with tiny tensors first (e.g., seq_len=3, d_model=4)
- **Check intermediate results**: Print outputs at each step
- **Read error messages**: PyTorch errors often indicate dimension mismatches
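
One way to apply the shape-checking advice is a small assertion helper that fails early with a readable message. This is a sketch, not an established API: `check_shape` is a hypothetical name, and it works with any object that exposes a `.shape` attribute, including PyTorch tensors and NumPy arrays.

```python
from types import SimpleNamespace

def check_shape(tensor, expected, name="tensor"):
    """Raise a readable error when tensor.shape differs from expected.

    Use None in `expected` for dimensions that may take any size.
    """
    actual = tuple(tensor.shape)
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")
    return tensor

# Stand-in object so the sketch runs without PyTorch; a real tensor works the same way
x = SimpleNamespace(shape=(2, 4, 8))
check_shape(x, (None, 4, 8), name="x")   # passes: batch size is left open
try:
    check_shape(x, (2, 8, 4), name="x")
except ValueError as err:
    print(err)  # x: expected shape (2, 8, 4), got (2, 4, 8)
```

Placing a call like this at the top of a function documents the expected dimensions and catches a swapped `seq_len` and `d_model` before it produces a confusing matmul error further down.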

### Resources
- [Original Paper: "Attention Is All You Need"](https://arxiv.org/abs/1706.03762)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) - Excellent visual guide
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/) - Pre-trained models
- [Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/) - Line-by-line implementation

### Version Control
Consider using git to track your progress:
```bash
git init
git add .
git commit -m "Completed introduction tutorial"
```

## 10. Summary

In this tutorial, you:

✅ Learned what Transformers are and why they're important  
✅ Set up your development environment  
✅ Verified your installation  
✅ Explored fundamental tensor operations  
✅ Implemented a simple attention mechanism  
✅ Understood the project structure  
✅ Got an overview of the complete architecture  

### What's Next?

In **Tutorial 02: Attention Mechanism**, you'll:
- Understand the mathematics behind attention
- Implement scaled dot-product attention from scratch
- Visualize attention weights
- Explore different attention variants

---

**Ready to continue?** → [Tutorial 02: Attention Mechanism](../02_attention_mechanism/)

---

*If you found this helpful, consider ⭐ starring the repository on GitHub!*