
[v0.2.9] feat: General LLM Execution - Attention layer and E2E inference #78

@m96-chan

Description

Summary

Implement full LLM inference with an Attention layer, enabling end-to-end GPT-2 execution and compatibility with common LLM architectures.

Goals

1. Attention Layer Implementation

  • Multi-Head Self-Attention (MHSA)
  • Causal masking for autoregressive generation
  • KV-cache for efficient inference
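
The KV-cache item above comes down to accumulating keys and values across decode steps so each new token only computes its own projections. A minimal NumPy sketch of that behavior (class and method names are illustrative, not this project's API):

```python
import numpy as np

class KVCache:
    """Per-layer cache that grows along the sequence axis during generation."""

    def __init__(self, n_heads: int, head_dim: int):
        self.k = np.empty((n_heads, 0, head_dim), dtype=np.float32)
        self.v = np.empty((n_heads, 0, head_dim), dtype=np.float32)

    def append(self, k_new: np.ndarray, v_new: np.ndarray):
        # k_new / v_new: (n_heads, 1, head_dim) for one decode step.
        self.k = np.concatenate([self.k, k_new], axis=1)
        self.v = np.concatenate([self.v, v_new], axis=1)
        return self.k, self.v   # full history so far, reused by attention
```

Note that the causal mask is only needed for the prompt (prefill) pass; during single-token decode the cache holds only past positions, so the new query may attend to everything in it.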

2. GPT-2 E2E Inference

  • Current: MLP-only (no coherent output)
  • Target: Full transformer block (LayerNorm → Attention → LayerNorm → MLP)
  • Verify against HuggingFace reference implementation
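
For reference, the pre-LN block order GPT-2 uses, written as a small NumPy sketch (the `attn` and `mlp` callables are placeholders for the layers this issue adds):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def transformer_block(x, attn, mlp, ln1, ln2):
    # Pre-LN (GPT-2): normalize *before* each sub-layer, then add the residual.
    x = x + attn(layer_norm(x, *ln1))   # LayerNorm -> Attention -> residual
    x = x + mlp(layer_norm(x, *ln2))    # LayerNorm -> MLP -> residual
    return x
```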

3. Architecture Compatibility

Support common LLM architectures without model-specific code changes:

Architecture | Models                  | Key Differences
------------ | ----------------------- | --------------------------------------
GPT-2        | GPT-2, DistilGPT-2      | Pre-LN, learned positional embeddings
GPT-Neo      | GPT-Neo, GPT-J          | Local + global attention
LLaMA        | LLaMA, LLaMA-2, Mistral | RMSNorm, RoPE, SwiGLU
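
One way to express this table in code is a small per-architecture config, so the same transformer block can be parameterized rather than forked. Field names below are hypothetical, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchConfig:
    norm: str = "layernorm"         # "layernorm" | "rmsnorm"
    pos_embedding: str = "learned"  # "learned" | "rope"
    mlp_activation: str = "gelu"    # "gelu" | "swiglu"
    attention: str = "global"       # "global" | "local+global"

GPT2    = ArchConfig()
GPT_NEO = ArchConfig(attention="local+global")
LLAMA   = ArchConfig(norm="rmsnorm", pos_embedding="rope", mlp_activation="swiglu")
```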

Implementation Plan

Phase 1: Basic Attention

  • softmax operation (GPU kernel)
  • scaled_dot_product_attention function
  • Basic MHSA class
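
A NumPy reference for the two Phase 1 primitives, useful as the correctness baseline the GPU kernel has to match; the names mirror the bullets above but the signatures are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=True):
    # q, k, v: (n_heads, seq_len, head_dim)
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k.transpose(0, 2, 1) * scale              # (n_heads, T, T)
    if causal:
        t = scores.shape[-1]
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)         # mask out future tokens
    return softmax(scores) @ v
```

The basic MHSA class is then a thin wrapper: project to Q/K/V, split into heads, call the function above, merge heads, and project out.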

Phase 2: GPT-2 Full Model

  • Update TransformerBlock with attention
  • Add attention weight loading from SafeTensors
  • Verify correctness against HuggingFace
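
The fiddly part of weight loading is GPT-2's fused QKV projection stored in HF's Conv1D layout: weights are (in, out) and applied as x @ W + b, with no transpose. A sketch using the safetensors NumPy API; the key names follow the HF gpt2 checkpoint and should be verified against the actual file:

```python
import numpy as np
from safetensors.numpy import load_file

weights = load_file("model.safetensors")        # e.g. downloaded from the HF "gpt2" repo
n_heads = 12

w = weights["h.0.attn.c_attn.weight"]           # (d_model, 3*d_model), Conv1D layout
b = weights["h.0.attn.c_attn.bias"]             # (3*d_model,)
wq, wk, wv = np.split(w, 3, axis=1)
bq, bk, bv = np.split(b, 3)

def project(x, w_, b_):
    # x: (T, d_model) -> (n_heads, T, head_dim); Conv1D layout means plain x @ w_.
    y = x @ w_ + b_
    return y.reshape(x.shape[0], n_heads, -1).transpose(1, 0, 2)
```

Verification can then compare final logits token-by-token against `transformers.GPT2LMHeadModel` run in FP32; the exact tolerance is not fixed by this issue.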

Phase 3: Architecture Variants

  • RMSNorm (for LLaMA)
  • Rotary Position Embedding (RoPE)
  • SwiGLU activation (for LLaMA)
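
Reference formulas for the three variants, as small NumPy sketches. The RoPE pairing convention differs between implementations (interleaved pairs below vs. split halves in LLaMA's code), so this is illustrative only:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square; no mean subtraction, no bias.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return weight * x / rms

def rope(x, positions, base=10000.0):
    # Rotate consecutive (even, odd) channel pairs by a position-dependent angle.
    # x: (T, head_dim), positions: (T,)
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]        # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU(x @ Wg) * (x @ Wu), projected back down by Wd.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```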

Non-Goals (v0.2.9)

  • Training/backpropagation
  • Quantization (INT8/INT4)
  • Flash Attention optimization (future work)

Success Criteria

  • GPT-2 Small generates coherent text
  • Output matches HuggingFace within FP32 tolerance
  • LLaMA-7B architecture expressible (inference may be slow without Flash Attention)
