
[v0.2.9] feat: General LLM Execution - Attention layer and E2E inference #78

@m96-chan

Description

Summary

Implement full LLM inference with an Attention layer, enabling end-to-end GPT-2 execution and compatibility with common LLM architectures.

Goals

1. Attention Layer Implementation

  • Multi-Head Self-Attention (MHSA)
  • Causal masking for autoregressive generation
  • KV-cache for efficient inference
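
The KV-cache item above comes down to accumulating keys and values across decode steps so each new token only computes its own projections. A minimal NumPy sketch of that behavior (class and method names are illustrative, not this project's API):

```python
import numpy as np

class KVCache:
    """Per-layer cache that grows along the sequence axis during generation."""

    def __init__(self, n_heads: int, head_dim: int):
        self.k = np.empty((n_heads, 0, head_dim), dtype=np.float32)
        self.v = np.empty((n_heads, 0, head_dim), dtype=np.float32)

    def append(self, k_new: np.ndarray, v_new: np.ndarray):
        # k_new / v_new: (n_heads, 1, head_dim) for one decode step.
        self.k = np.concatenate([self.k, k_new], axis=1)
        self.v = np.concatenate([self.v, v_new], axis=1)
        return self.k, self.v   # full history so far, reused by attention
```

Note that the causal mask is only needed for the prompt (prefill) pass; during single-token decode the cache holds only past positions, so the new query may attend to everything in it.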

2. GPT-2 E2E Inference

  • Current: MLP-only (no coherent output)
  • Target: Full transformer block (LayerNorm → Attention → LayerNorm → MLP)
  • Verify against HuggingFace reference implementation
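
For reference, the pre-LN block order GPT-2 uses, written as a small NumPy sketch (the `attn` and `mlp` callables are placeholders for the layers this issue adds):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def transformer_block(x, attn, mlp, ln1, ln2):
    # Pre-LN (GPT-2): normalize *before* each sub-layer, then add the residual.
    x = x + attn(layer_norm(x, *ln1))   # LayerNorm -> Attention -> residual
    x = x + mlp(layer_norm(x, *ln2))    # LayerNorm -> MLP -> residual
    return x
```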

3. Architecture Compatibility

Support common LLM architectures without model-specific code changes:

Architecture | Models                  | Key Differences
------------ | ----------------------- | --------------------------------------
GPT-2        | GPT-2, DistilGPT-2      | Pre-LN, learned positional embeddings
GPT-Neo      | GPT-Neo, GPT-J          | Local + global attention
LLaMA        | LLaMA, LLaMA-2, Mistral | RMSNorm, RoPE, SwiGLU
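
One way to express this table in code is a small per-architecture config, so the same transformer block can be parameterized rather than forked. Field names below are hypothetical, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchConfig:
    norm: str = "layernorm"         # "layernorm" | "rmsnorm"
    pos_embedding: str = "learned"  # "learned" | "rope"
    mlp_activation: str = "gelu"    # "gelu" | "swiglu"
    attention: str = "global"       # "global" | "local+global"

GPT2    = ArchConfig()
GPT_NEO = ArchConfig(attention="local+global")
LLAMA   = ArchConfig(norm="rmsnorm", pos_embedding="rope", mlp_activation="swiglu")
```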

Implementation Plan

Phase 1: Basic Attention

  • softmax operation (GPU kernel)
  • scaled_dot_product_attention function
  • Basic MHSA class
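
A NumPy reference for the two Phase 1 primitives, useful as the correctness baseline the GPU kernel has to match; the names mirror the bullets above but the signatures are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=True):
    # q, k, v: (n_heads, seq_len, head_dim)
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k.transpose(0, 2, 1) * scale              # (n_heads, T, T)
    if causal:
        t = scores.shape[-1]
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)         # mask out future tokens
    return softmax(scores) @ v
```

The basic MHSA class is then a thin wrapper: project to Q/K/V, split into heads, call the function above, merge heads, and project out.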

Phase 2: GPT-2 Full Model

  • Update TransformerBlock with attention
  • Add attention weight loading from SafeTensors
  • Verify correctness against HuggingFace
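
The fiddly part of weight loading is GPT-2's fused QKV projection stored in HF's Conv1D layout: weights are (in, out) and applied as x @ W + b, with no transpose. A sketch using the safetensors NumPy API; the key names follow the HF gpt2 checkpoint and should be verified against the actual file:

```python
import numpy as np
from safetensors.numpy import load_file

weights = load_file("model.safetensors")        # e.g. downloaded from the HF "gpt2" repo
n_heads = 12

w = weights["h.0.attn.c_attn.weight"]           # (d_model, 3*d_model), Conv1D layout
b = weights["h.0.attn.c_attn.bias"]             # (3*d_model,)
wq, wk, wv = np.split(w, 3, axis=1)
bq, bk, bv = np.split(b, 3)

def project(x, w_, b_):
    # x: (T, d_model) -> (n_heads, T, head_dim); Conv1D layout means plain x @ w_.
    y = x @ w_ + b_
    return y.reshape(x.shape[0], n_heads, -1).transpose(1, 0, 2)
```

Verification can then compare final logits token-by-token against `transformers.GPT2LMHeadModel` run in FP32; the exact tolerance is not fixed by this issue.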

Phase 3: Architecture Variants

  • RMSNorm (for LLaMA)
  • Rotary Position Embedding (RoPE)
  • SwiGLU activation (for LLaMA)
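
Reference formulas for the three variants, as small NumPy sketches. The RoPE pairing convention differs between implementations (interleaved pairs below vs. split halves in LLaMA's code), so this is illustrative only:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square; no mean subtraction, no bias.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return weight * x / rms

def rope(x, positions, base=10000.0):
    # Rotate consecutive (even, odd) channel pairs by a position-dependent angle.
    # x: (T, head_dim), positions: (T,)
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]        # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU(x @ Wg) * (x @ Wu), projected back down by Wd.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```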

Non-Goals (v0.2.9)

  • Training/backpropagation
  • Quantization (INT8/INT4)
  • Flash Attention optimization (future work)

Success Criteria

  • GPT-2 Small generates coherent text
  • Output matches HuggingFace within FP32 tolerance
  • LLaMA-7B architecture expressible (inference may be slow without Flash Attention)
