This repository contains two single-file implementations of Andrej Karpathy's microgpt — a minimal GPT-style language model built entirely from scratch, with no ML framework dependencies:

| File | Language | Runtime |
|---|---|---|
| `MicroGPT.cs` | C# 14 | `dotnet run` |
| `MicroGPT.fsx` | F# 10 | `dotnet fsi` |

The C# version was the original translation from Python, written to explore the algorithm in a familiar .NET stack. The F# version was then derived from the C# version as a natural evolution toward a more expressive, functional style.
The original microgpt is a minimal implementation of a GPT-style language model from scratch, created by Andrej Karpathy. It serves as an educational tool to understand the core mechanics of transformer-based language models.
For detailed information about the algorithm and its implementation, please visit: https://karpathy.github.io/2026/02/12/microgpt/
microgpt implements a character-level language model using the Transformer architecture. The key components include:
- Character-level tokenization: The model operates directly on characters rather than subword tokens
- Multi-head self-attention: Enables the model to focus on different parts of the input sequence simultaneously
- Position embeddings: Provide the model with information about the position of tokens in the sequence
- Feed-forward layers: Add non-linear transformations to enhance the model's expressiveness
- RMS normalization: Stabilizes training by normalizing activations
- Residual connections: Help with gradient flow during training
The model is trained to predict the next character in a sequence, learning patterns and structure from the training text.
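
As a rough illustration of two of the components listed above, the following F# sketch shows character-level tokenization and RMS normalization on plain `float` arrays. The helper names (`buildVocab`, `encode`, `decode`) and the `1e-5` epsilon are assumptions made for this example; the repository's code works on its own `Value` type and may handle special tokens differently.

```fsharp
// Sketch only: plain floats instead of autograd values, no special tokens.

/// Collect the sorted set of distinct characters across all documents.
let buildVocab (docs: string seq) : char[] =
    docs
    |> Seq.collect (fun (d: string) -> d.ToCharArray())
    |> Seq.distinct
    |> Seq.sort
    |> Seq.toArray

/// Map each character to its index in the vocabulary.
let encode (vocab: char[]) (text: string) : int[] =
    text.ToCharArray()
    |> Array.map (fun c -> Array.findIndex (fun v -> v = c) vocab)

/// Map token ids back to characters.
let decode (vocab: char[]) (ids: int[]) : string =
    System.String(ids |> Array.map (fun i -> vocab.[i]))

/// RMS normalization: scale by 1 / sqrt(mean(x^2) + eps), with an assumed eps.
let rmsNorm (x: float[]) : float[] =
    let meanSquare = (x |> Array.sumBy (fun v -> v * v)) / float x.Length
    let scale = 1.0 / sqrt (meanSquare + 1e-5)
    x |> Array.map (fun v -> v * scale)

let vocab = buildVocab [ "emma"; "olivia" ]
let ids = encode vocab "mia"
printfn "%A -> %s" ids (decode vocab ids)
```

Encoding followed by decoding returns the original string, and the output of `rmsNorm` has a mean square close to 1, which is the property that keeps activations at a stable scale.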
- .NET 10 (or later) must be installed on your system

To run the C# version:

```bash
dotnet run MicroGPT.cs
```

On Unix systems you can also make the file executable and run it directly:

```bash
chmod +x MicroGPT.cs
./MicroGPT.cs
```

To run the F# version:

```bash
dotnet fsi MicroGPT.fsx
```

Both versions accept the same CLI arguments:

```
--n_embd 16 --n_layer 1 --block_size 8 --num_steps 10000
--n_head 4 --learning_rate 0.01 --seed 42
```

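
For readers curious how flags in this `--name value` form might be consumed, here is a minimal F# sketch that folds an argument array into a map. The `parseArgs` and `getOr` helpers are hypothetical, and the fallback values passed to `getOr` are placeholders, not the repository's actual defaults or parsing code.

```fsharp
// Hypothetical helpers for illustration; not taken from MicroGPT.fsx.

/// Turn [| "--n_embd"; "16"; ... |] into a map from "n_embd" to "16".
let parseArgs (argv: string[]) : Map<string, string> =
    argv
    |> Array.chunkBySize 2
    |> Array.choose (fun pair ->
        match pair with
        | [| name; value |] when name.StartsWith("--") -> Some(name.Substring(2), value)
        | _ -> None)
    |> Map.ofArray

/// Look up a flag, falling back to a caller-supplied value when it is absent.
let getOr (key: string) (fallback: string) (opts: Map<string, string>) : string =
    opts |> Map.tryFind key |> Option.defaultValue fallback

let opts = parseArgs [| "--n_embd"; "16"; "--learning_rate"; "0.01"; "--seed"; "42" |]
let nEmbd = opts |> getOr "n_embd" "8" |> int
let learningRate = opts |> getOr "learning_rate" "0.001" |> float
printfn "n_embd=%d learning_rate=%g" nEmbd learningRate
```

This sketch assumes every flag is followed by exactly one value; flags without a value would need different handling.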
F# is a functional-first language that runs on .NET, sharing the same runtime and standard library as C#. Translating this project to F# illustrates how the same algorithm can be expressed more concisely and with stronger compile-time guarantees.

| Aspect | C# | F# |
|---|---|---|
| Script execution | `#!/usr/bin/dotnet run` shebang in a `.cs` file | Standard `dotnet fsi` F# script (`.fsx`) |
| Entry point | Top-level statements (C# 9+) | Top-level `let` bindings, idiomatic F# |
| Mutable state | `var` everywhere | Explicit `mutable` keyword; immutability is the default |
| Value class | Primary constructor syntax | Type with explicit member definitions |
| Collections | `List<T>` | `ResizeArray<T>` (same BCL type, idiomatic F# alias) |
| Null safety | `null` guards | Empty arrays `[\|\|]` |
| Pipeline style | LINQ method chains | `\|>` pipe operator |
| Stack tuples | `(Value, int)` | `struct (Value * int)`, stack-allocated, zero allocation |

In F#, every binding is immutable unless you explicitly write mutable. This makes the data flow of the forward pass
and optimizer easier to reason about — mutation is visible and intentional, not accidental.

```fsharp
// Immutable by default
let x = rmsNorm x
// Mutation must be declared explicitly
let mutable loss = Value 0.0
loss <- loss + l
```

The `|>` pipe operator lets you read data transformations left-to-right, matching how you think about them:

```fsharp
let docs =
    File.ReadAllLines "input.txt"
    |> Array.map (fun l -> l.Trim())
    |> Array.filter (fun l -> not (String.IsNullOrEmpty l))
    |> (fun arr -> shuffle random (ResizeArray arr))
```

F# requires no semicolons, fewer braces, and no return statements. The signal-to-noise ratio is higher, which helps when studying an algorithm: you see the maths, not the ceremony.
F# infers types throughout, so you get full type safety without the verbosity of explicit annotations everywhere.
Both versions run on .NET and use the same System.Numerics.Vector<double> SIMD path inside Value.Dot. There is no
performance trade-off for choosing F#.
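
To make the SIMD path concrete, here is a minimal sketch of a vectorized dot product over plain `float[]` arrays (`float` is F#'s name for `double`). The function name `dotSimd` is hypothetical, and the repository's `Value.Dot` also tracks gradients, which this sketch leaves out. `Vector.Sum` assumes .NET 6 or later.

```fsharp
open System.Numerics

/// Dot product that processes Vector<float>.Count elements per step,
/// then finishes the remaining elements with a scalar loop.
let dotSimd (a: float[]) (b: float[]) : float =
    if a.Length <> b.Length then
        invalidArg "b" "arrays must have the same length"
    let width = Vector<float>.Count
    let mutable acc = Vector<float>.Zero
    let mutable i = 0
    // Vectorized body: multiply and accumulate one hardware-sized chunk at a time.
    while i <= a.Length - width do
        acc <- acc + Vector<float>(a, i) * Vector<float>(b, i)
        i <- i + width
    // Horizontal sum of the accumulator lanes.
    let mutable total = Vector.Sum(acc)
    // Scalar tail for lengths that are not a multiple of the vector width.
    while i < a.Length do
        total <- total + a.[i] * b.[i]
        i <- i + 1
    total

printfn "%g" (dotSimd [| 1.0; 2.0; 3.0; 4.0; 5.0 |] [| 5.0; 4.0; 3.0; 2.0; 1.0 |])
```

The vector width depends on the hardware, so the scalar tail loop is what keeps the result correct for arbitrary array lengths.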
Both implementations have no external dependencies beyond .NET itself. Everything — the autograd engine, the transformer, the Adam optimizer, and the tokenizer — is implemented from scratch in a single file.
The author of the C# version deliberately optimized for raw CPU throughput. Several departures from the Python version were made on purpose:
- SIMD vectorization: `Value.Dot` uses `System.Numerics.Vector<double>` to process multiple elements per CPU instruction, giving a significant speedup over a scalar loop.
- Iterative backward pass: The original Python `backward()` is recursive and can hit Python's stack limit on long sequences. The C# version replaces recursion with an explicit `Stack<T>`, making it both faster and safe for deep graphs.
- Zero-allocation hot paths: `Value.Dot` pre-allocates the `children` and `localGrads` arrays once per node instead of creating intermediate `Value` objects for each multiply-and-add. This keeps GC pressure low during the training loop.
- Backward loop unrolling: The `Backward` method special-cases nodes with 1 or 2 children (which covers ~99% of the graph: Add, Mul, ReLU, Pow) to avoid loop setup overhead.
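
The iterative backward pass described above can be sketched in F# as follows. This is an illustration under assumptions, not the repository's code: the `Node` type, the `leaf`, `add`, and `mul` helpers, and the `backward` function are hypothetical stand-ins for the real `Value` class, and only the general technique (an explicit `Stack<T>` of struct tuples replacing recursion) is taken from the text.

```fsharp
open System.Collections.Generic

/// Hypothetical minimal autograd node: a value, its inputs, and d(output)/d(input) for each.
type Node(data: float, children: Node[], localGrads: float[]) =
    member val Grad = 0.0 with get, set
    member _.Data = data
    member _.Children = children
    member _.LocalGrads = localGrads

let leaf v = Node(v, [||], [||])
let add (a: Node) (b: Node) = Node(a.Data + b.Data, [| a; b |], [| 1.0; 1.0 |])
let mul (a: Node) (b: Node) = Node(a.Data * b.Data, [| a; b |], [| b.Data; a.Data |])

/// Backpropagate from root without recursion.
let backward (root: Node) =
    let topo = List<Node>()
    let visited = HashSet<Node>()
    // Each entry is (node, index of the next child to visit).
    let stack = Stack<struct (Node * int)>()
    visited.Add root |> ignore
    stack.Push(struct (root, 0))
    // Iterative post-order traversal: a node is emitted after all of its children.
    while stack.Count > 0 do
        let struct (node, next) = stack.Pop()
        if next < node.Children.Length then
            stack.Push(struct (node, next + 1))
            let child = node.Children.[next]
            if visited.Add child then
                stack.Push(struct (child, 0))
        else
            topo.Add node
    // Apply the chain rule in reverse topological order.
    root.Grad <- 1.0
    for i in topo.Count - 1 .. -1 .. 0 do
        let node = topo.[i]
        for j in 0 .. node.Children.Length - 1 do
            let child = node.Children.[j]
            child.Grad <- child.Grad + node.LocalGrads.[j] * node.Grad

let a = leaf 2.0
let b = leaf 3.0
let c = add (mul a b) a   // c = a * b + a
backward c
printfn "dc/da=%g dc/db=%g" a.Grad b.Grad
```

Because the traversal state lives in a heap-allocated stack, graph depth is limited by available memory instead of the call stack.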
Original microgpt by Andrej Karpathy
C# implementation by @martinskuta
F# translation by @jonas1ara