LLM Systems From Scratch

This is an open, educational repository. It is not a software product. It is not a deployable system. It is a structured, hands-on learning project — open to everyone.

This repository exists to help students, engineers, and developers deeply understand how Large Language Models work — not by using them, but by building every core function and algorithm they depend on, from scratch, in C++ and Rust.

No cloud. No APIs. No black boxes. Just code and fundamentals.

Who This Is For

Students studying machine learning, systems programming, or AI
Engineers who want to understand what actually happens inside an LLM
Developers learning C++ or Rust through a meaningful, real-world problem domain
Anyone curious enough to go deeper than the surface

What This Is Not

This is not a language model
This is not production software
This is not a framework or library meant for deployment
This is not affiliated with any AI company or product

What This Is

A complete map of every core module an LLM is built on — implemented one phase at a time, from the lowest level up.

Every phase = one core module. Every module = one concept that a real language model cannot exist without.

By the end, you will have built from scratch:

Tensors → Math → Autograd → Neural Layers → Embeddings →
Tokenizer → Positional Encoding → Attention → Transformer →
Normalization → Feed-Forward → Sampling → Inference Engine →
Python Bindings → Rust Memory Layer

Language Stack

Layer	Language	Purpose
Core engine	C++	All fundamental algorithms
Memory safety layer	Rust (via FFI)	Specific subsystems plugged into C++
Bindings	Python / JavaScript	Expose C++ engine for scripting and web

Rust does not replace C++. It extends it. Rust is used for specific components — memory management, safe concurrent pipelines, inference serving — and plugged into the C++ engine via FFI (Foreign Function Interface).

Core Modules — Phase by Phase

Each phase introduces one core module. That module is a real, named concept inside a real language model architecture.

Phase 1 — Tensor

Module: Tensor Operations

How LLMs store and compute on data

Everything in a language model is a number in a multidimensional array. This phase builds the foundational data structure.

N-dimensional array (Tensor class)
Shape and stride tracking
Element-wise operations: add, subtract, multiply
Memory layout (row-major)

Phase 2 — Linear Algebra Engine

Module: Matrix Multiplication

The single most-used operation in all of deep learning

Every layer in a transformer — attention, projection, feed-forward — is a matrix multiplication. This phase builds and optimizes it.

Naive matrix multiplication (baseline)
Cache-optimized version (loop tiling)
Multithreaded version (std::thread / OpenMP)

Phase 3 — Autograd

Module: Automatic Differentiation

How models learn — the algorithm behind every weight update

Without autograd, there is no training. This phase builds the computation graph and the backward pass.

Computation graph (nodes and edges)
Forward pass execution
Backward pass — gradient computation via chain rule

Phase 4 — Neural Network Layers

Module: Layer Functions

The building blocks stacked inside every neural network

This phase builds composable, trainable layers on top of tensors and autograd.

Dense (fully connected) layer
Activation functions: ReLU, tanh, sigmoid, GELU
Loss functions: MSE, cross-entropy
Test: XOR problem, simple classification

Phase 5 — Embeddings

Module: Embedding Layer

How LLMs convert token IDs into meaning

Tokens are integers. Embeddings turn those integers into vectors that carry semantic meaning — this is the first thing a transformer does with its input.

Embedding table (vocabulary × dimension)
Lookup operation
Gradient flow through embeddings

Phase 6 — Tokenizer

Module: Tokenization

How LLMs read text — converting raw strings into token sequences

Language models do not read characters. They read tokens — integer IDs from a fixed vocabulary. This phase builds the algorithm that produces them.

Whitespace tokenizer (baseline)
Byte Pair Encoding (BPE) — the core algorithm used in most real tokenizers
Vocabulary building
Encode and decode functions

Phase 7 — Positional Encoding

Module: Positional Encoding

How LLMs know the order of tokens

Attention has no built-in sense of position. This phase implements the algorithms that inject order information into the input sequence.

Sinusoidal positional encoding (original transformer)
Learned positional embeddings
Rotary Positional Encoding (RoPE) — used in modern architectures

Phase 8 — Attention

Module: Attention Mechanism

The core algorithm of transformer-based architectures

This is the mechanism that allows a model to relate any token to any other token in a sequence, regardless of distance.

Q, K, V projections (linear transforms)
Scaled dot-product attention
Softmax
Causal (masked) attention — for autoregressive generation
Multi-head attention

Phase 9 — Normalization

Module: Layer Normalization

How LLMs stay stable during training and inference

Without normalization, deep networks become numerically unstable. This phase implements the normalization used inside every transformer block.

Layer normalization (LayerNorm)
RMS normalization (RMSNorm) — used in modern architectures
Learnable scale and shift parameters (gamma, beta)

Phase 10 — Feed-Forward Network

Module: Feed-Forward Block

The computation that follows attention in every transformer layer

Every transformer block has two parts: attention and a feed-forward network. This phase builds the FFN.

Two-layer MLP with activation
GELU activation function
Projection in → hidden → projection out
Role in transformer block (context: runs after attention)

Phase 11 — Transformer Block

Module: Transformer Block

One complete unit of a transformer — assembled from all prior modules

This phase assembles the full transformer block from the components built in previous phases.

Pre-norm architecture (LayerNorm → Attention → residual)
Feed-forward sublayer (LayerNorm → FFN → residual)
Residual connections
Stacking N blocks into a full transformer backbone

Phase 12 — Sampling & Output Head

Module: Sampling Algorithms

How LLMs decide what token to generate next

After the transformer runs, a probability distribution over the vocabulary is produced. This phase implements the algorithms that select the next token from it.

Linear projection to vocabulary (logits)
Softmax to probability distribution
Greedy decoding
Temperature scaling
Top-k sampling
Top-p (nucleus) sampling

Phase 13 — Inference Engine

Module: Inference Runtime

The execution engine — how a trained model runs efficiently

This phase builds the runtime that loads weights and executes the full forward pass, without any training overhead.

Weight serialization and loading from file
Full forward pass pipeline
KV cache — stores past key/value pairs to avoid recomputation
Token generation loop (autoregressive decoding)

Phase 14 — Language Bindings

Module: Python & JavaScript Bindings

Exposing the C++ engine to higher-level languages

This phase wraps the C++ engine so it can be used from Python scripts and web environments.

Python (pybind11):

// C++ side
Tensor matmul(Tensor a, Tensor b);

# Python side
import llmsys
llmsys.matmul(a, b)

JavaScript / Web:

Compile C++ to WebAssembly (WASM) for in-browser execution
Or expose a lightweight REST API

Phase 15 — Rust Memory Layer

Module: Rust FFI Components

Precision components in Rust, plugged into the C++ engine

Rust does not replace C++. This phase builds specific subsystems in Rust where its ownership model and memory safety give a concrete advantage — then connects them to the C++ engine via FFI.

Safe memory allocator (plugged into C++ tensor operations)
Concurrent data pipeline (safe multithreaded loading)
Inference server (async, no GC pauses, production-grade serving)
FFI interface layer between Rust and C++

Explore: burn, candle, tch-rs

Full Module Map

Phase	Module	Concept in LLM
1	Tensor	Data storage and computation
2	Matrix Multiplication	Core math operation
3	Autograd	How models learn
4	Neural Network Layers	Trainable building blocks
5	Embeddings	Token → vector conversion
6	Tokenizer	Text → token IDs
7	Positional Encoding	Token order awareness
8	Attention	Token-to-token relationships
9	Normalization	Training and inference stability
10	Feed-Forward Network	Per-token computation after attention
11	Transformer Block	Full assembled unit
12	Sampling & Output Head	Next-token selection
13	Inference Engine	Runtime execution
14	Language Bindings	Python / JS interface
15	Rust Memory Layer	Safe, high-performance components via FFI

Open Collaboration

This repository is open for collaboration. Everyone is welcome to contribute — whether you are a student, a researcher, a hobbyist, or a professional engineer.

Ways to contribute:

Implement a module in a phase
Write tests for an existing implementation
Improve documentation or add explanations
Add alternative implementations (different algorithms, different approaches)
Fix bugs or improve performance

Please read CONTRIBUTING.md before opening a pull request. Please read CODE_OF_CONDUCT.md for community standards.

Requirements

C++17 or later
Rust (stable toolchain) — for Phase 15
No cloud accounts required
No paid APIs required
No GPU required (CPU-only by design for learning purposes)

License

MIT — free to use, study, fork, and build upon.

A Note on Purpose

This project exists because understanding AI at the source level matters.

The goal is not to compete with existing frameworks. The goal is to make the internals legible — so that anyone who works through these phases comes away with a genuine understanding of what a language model is made of, one algorithm at a time.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github		.github
configs		configs
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
FILE_STRUCTURE.md		FILE_STRUCTURE.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Systems From Scratch

Who This Is For

What This Is Not

What This Is

Language Stack

Core Modules — Phase by Phase

Phase 1 — Tensor

Phase 2 — Linear Algebra Engine

Phase 3 — Autograd

Phase 4 — Neural Network Layers

Phase 5 — Embeddings

Phase 6 — Tokenizer

Phase 7 — Positional Encoding

Phase 8 — Attention

Phase 9 — Normalization

Phase 10 — Feed-Forward Network

Phase 11 — Transformer Block

Phase 12 — Sampling & Output Head

Phase 13 — Inference Engine

Phase 14 — Language Bindings

Phase 15 — Rust Memory Layer

Full Module Map

Open Collaboration

Requirements

License

A Note on Purpose

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLM Systems From Scratch

Who This Is For

What This Is Not

What This Is

Language Stack

Core Modules — Phase by Phase

Phase 1 — Tensor

Phase 2 — Linear Algebra Engine

Phase 3 — Autograd

Phase 4 — Neural Network Layers

Phase 5 — Embeddings

Phase 6 — Tokenizer

Phase 7 — Positional Encoding

Phase 8 — Attention

Phase 9 — Normalization

Phase 10 — Feed-Forward Network

Phase 11 — Transformer Block

Phase 12 — Sampling & Output Head

Phase 13 — Inference Engine

Phase 14 — Language Bindings

Phase 15 — Rust Memory Layer

Full Module Map

Open Collaboration

Requirements

License

A Note on Purpose

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages