Machine Learning Systems

Principles and Practices of Engineering Artificially Intelligent Systems

📘 Textbook • 📗 Vol I + 📘 Vol II • 🔥 TinyTorch • 🔬 Labs • 🔮 MLSys·im • 💼 StaffML

📚 Hardcopy edition coming 2026 with MIT Press.

Mission

The world is rushing to build AI systems. It is not engineering them.

Closing that gap is what we mean by AI engineering.

AI engineering is the discipline of building efficient, reliable, safe, and robust intelligent systems that operate in the real world, not just models in isolation. Our mission is to establish AI engineering as a foundational discipline alongside software engineering and computer engineering, by teaching how to design, build, and evaluate end-to-end intelligent systems.

Our goal: Help 100,000 learners master ML Systems this year, and reach 1 million by 2030.

Why One Repository

I designed this as a single integrated curriculum, not a collection of independent projects. The textbook teaches the theory. TinyTorch makes you build the internals. The hardware kits force you to confront real constraints. The simulator lets you reason about infrastructure you can't afford to rent. Each piece exists because I found that students who only read don't internalize, and students who only code don't generalize.

The repository is the curriculum.

A growing community of contributors helps improve every part of it: fixing errors, sharpening explanations, testing on new hardware. Their work makes this better for everyone, and I'm grateful for every pull request.

The Curriculum

Every component connects. The textbook gives you the mental models. The labs let you reason through trade-offs interactively, powered by MLSys·im — a modeling engine for infrastructure you can't physically access, and a standalone tool in its own right. TinyTorch makes you build the machinery yourself. The hardware kits put you face-to-face with real deployment constraints. StaffML tests whether you actually understand it. Socratiq adds AI-guided reading, contextual quizzes, and spaced repetition inside the learning experience. And the instructor hub, slides, and newsletter give educators everything they need to bring this into a classroom.

For Students

	Component	Role in the Curriculum	Link
📖	Textbook	Two-volume MIT Press textbook. The theory, the mental models, and the quantitative reasoning that everything else builds on.	Vol I · Vol II
🔬	Labs	Interactive Marimo notebooks where you explore trade-offs from the textbook: change a parameter, see what breaks, build intuition. Powered by MLSys·im under the hood.	Launch labs · Repo guide
🔥	Tiny🔥Torch	Build your own ML framework from scratch across 20 progressive modules. You don't understand a system until you've built one.	Get started
🛠️	Hardware Kits	Deploy ML to Arduino, Seeed, Grove, and Raspberry Pi devices. Real memory limits, real power budgets, real latency.	Browse labs
🔮	MLSys·im	Calculate memory bottlenecks, network saturation, and scheduling limits at infrastructure scales you can't physically access.	Use simulator · Repo guide
💼	StaffML	Physics-grounded interview questions for ML systems roles. Vault, practice drills, mock interviews, and progress tracking.	Practice · Repo guide

For Educators

	Component	What It Provides	Link
🎓	Instructor Hub	The AI Engineering Blueprint: two 16-week syllabi, pedagogy guide, assessment rubrics, and a TA handbook.	View hub · Repo guide
🎬	Lecture Slides	Beamer slide decks for every chapter, with four theme variants. Drop into your course and teach.	Browse decks · Repo guide
📬	Newsletter	Updates on the curriculum, new chapters, and what the community is building.	Subscribe

Choose Your Path

The pieces are designed to work together, but you do not need to adopt everything at once.

If you are...	Start here	Then go deeper
A student or self-learner	Read Volume I and try Lab 00	Build TinyTorch, use MLSys·im, and practice with StaffML
An instructor	Open The AI Engineering Blueprint	Use the course map, slides, rubrics, and TA guide
A contributor	Pick the component you use most	Improve chapters, labs, tests, examples, hardware notes, simulator models, or assessment content

The learning loop is: Read → Explore → Build → Model → Deploy → Practice → Teach.

Adjacent and Experimental Work

Some projects are intentionally earlier-stage than the main curriculum:

Socratiq explores AI-guided reading, contextual quizzes, and spaced repetition for static learning sites.
MLPerf EDU is an under-construction pedagogical benchmark suite aligned with MLCommons MLPerf.
ML Systems Design Grammar is an experimental framework for reasoning from stable primitives, constraints, and rewrite rules.

What You Will Learn

This textbook teaches you to think at the intersection of machine learning and systems engineering. Each chapter bridges algorithmic concepts with the infrastructure that makes them work in practice.

You know...		You will learn...
How to train a model	→	How training scales across GPU clusters
That quantization shrinks models	→	How INT8 math maps to silicon
What a transformer is	→	Why KV-cache dominates memory at inference
That models run on GPUs	→	How schedulers balance latency vs throughput
That edge devices have limits	→	How to co-design models and hardware
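
The KV-cache row can be made concrete with a back-of-the-envelope estimate. The sketch below is not taken from the textbook or from MLSys·im. The function name and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values, roughly 7 billion parameters) are illustrative assumptions, and the estimate ignores activations and allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Estimate KV-cache size for a decoder-only transformer.

    Each layer stores one key and one value vector (hence the factor of 2)
    per KV head, per token, per sequence in the batch.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=8)
weights = 7_000_000_000 * 2  # assumed 7B parameters stored in FP16 (2 bytes each)

GIB = 2 ** 30
print(f"KV cache: {cache / GIB:.1f} GiB")    # 16.0 GiB
print(f"Weights:  {weights / GIB:.1f} GiB")  # 13.0 GiB
```

Under these assumptions, the cache for eight 4,096-token sequences is already larger than the weights, and it grows linearly with both batch size and context length.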

Book Structure

The textbook follows the Hennessy & Patterson pedagogical model across two volumes:

	Volume	Theme	Scope
📗	Volume I	Build, Optimize, Deploy	Single-machine ML systems (1–8 GPUs). Foundations, optimization, and deployment on one node.
📘	Volume II	Scale, Distribute, Govern	Distributed systems at production scale. Multi-machine infrastructure, fault tolerance, and governance.

FAQ

Who is this for, and what should I know first?

This is for anyone who wants to engineer intelligent systems, not only train models: students, working engineers moving into ML infrastructure, and educators building a course. We assume you can program in Python and have met basic machine learning ideas, but the book builds the systems concepts from the ground up. You do not need a background in computer architecture, distributed systems, or datacenter operations. Volume I starts at the foundations, and the rest of the curriculum (TinyTorch, labs, hardware kits, and the simulator) lets you learn by building rather than only by reading.

Do I need Volume I before Volume II? What is the difference?

The two volumes differ in scope, not depth. Both are equally rigorous. Volume I is the single-machine world: how an ML system works on one node with a handful of accelerators, from data and a single neuron's computation up through training, optimization, and deployment. Volume II is the at-scale world: many machines across a network, distributed training, fault tolerance, fleet orchestration, inference at scale, and governance. Volume II does not assume you have read Volume I, so you can start there if you already have the foundations. The natural path, though, is Volume I to build the mental models and Volume II to apply them across the fleet. The analogy we follow is Hennessy and Patterson: Computer Organization and Design first, then Computer Architecture: A Quantitative Approach.

Do I need to use TinyTorch, the labs, and the kits, or can I just read the book?

You can just read the book. Each volume stands on its own. The rest of the curriculum (TinyTorch, the labs, the hardware kits, the simulator, and the interview practice) exists to deepen what the book teaches by making you build and measure it, but none of it is required to follow the text. Start with the book, and reach for the hands-on pieces when you want a concept to become muscle memory.

Isn't this just a deep learning book?

Deep learning books (Goodfellow et al.'s Deep Learning, Bishop, d2l.ai, fast.ai) teach you to design and train models: architectures, optimization, and the mathematics of learning. They mostly stop at the model. This book starts where they leave off. It treats the model as one component inside a system that has to ingest data, run on real silicon under power and latency budgets, serve predictions reliably, and keep working as the world drifts. You can finish a deep learning course knowing how a transformer learns and still not know why it stalls on a 4,000-accelerator training run, what the KV cache does to your serving memory, or why your accelerator sits idle. That gap is what we teach. Learn the model from a deep learning text, then learn the system here.

Isn't this MLOps, or the same as Designing Machine Learning Systems?

This is the most common mix-up, because "ML systems" and "MLOps" sound interchangeable and several good practitioner books share the words. MLOps books are operations guides: how to wire up a feature store, a pipeline, and a deployment with today's tools. They are valuable, and they age with the tooling. This book teaches the layer underneath: the physics and quantitative reasoning that explain why those tools exist and what they cost. We ask which questions matter, why a design is the way it is, and what it cannot escape (bandwidth, latency, power, failure rates).

Think of the difference between following a recipe and understanding how cooking works. A recipe gives you exact steps for one dish: this temperature, this pan, this many minutes. It works beautifully until the oven, the ingredients, or the kitchen changes. Understanding why heat, salt, acid, and time transform food is different. It lets you cook in any kitchen, rescue a dish that is going wrong, and invent one that no recipe covers.

An MLOps book hands you the recipe for the stack you have today. This book teaches the underlying science, so you can reason about any stack, debug the one that is failing, and design the one that does not exist yet.
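
Failure rates are one of the constraints named above that no tooling removes. As a minimal sketch of that kind of quantitative reasoning, assume devices fail independently at a constant rate. The 50,000-hour per-device MTBF is a made-up figure for illustration, not a measurement, and the function names are hypothetical.

```python
import math


def fleet_mtbf_hours(device_mtbf_hours, num_devices):
    """Mean time between failures anywhere in the fleet.

    Assumes independent failures with the same constant rate per device,
    so the fleet failure rate is the sum of the per-device rates.
    """
    return device_mtbf_hours / num_devices


def prob_no_failure(device_mtbf_hours, num_devices, run_hours):
    """Probability that no device fails during a run of the given length."""
    return math.exp(-run_hours / fleet_mtbf_hours(device_mtbf_hours, num_devices))


DEVICE_MTBF_HOURS = 50_000  # hypothetical per-device figure
NUM_DEVICES = 4_000

print(fleet_mtbf_hours(DEVICE_MTBF_HOURS, NUM_DEVICES))              # 12.5
print(f"{prob_no_failure(DEVICE_MTBF_HOURS, NUM_DEVICES, 24):.2f}")  # 0.15
```

With these assumed numbers, a device that fails about once in six years becomes a fleet that fails roughly twice a day, which is why fault tolerance is a design input at scale, whatever the stack.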

How is this different from a classic systems reference like The Datacenter as a Warehouse-Scale Computer?

References like Barroso, Hölzle, and Clidaras's The Datacenter as a Warehouse-Scale Computer are excellent. They distill how one organization engineered one canonical system, written by the people who built it. This project is a different kind of artifact, and the two are complementary rather than competing.

Curriculum, not reference. A synthesis lecture documents a finished design for practitioners who already know the field. This book teaches the discipline from the ground up, with learning objectives, worked quantitative examples, labs, and an AI tutor, and it carries the reader from a single neuron (Volume I) all the way to the warehouse-scale fleet (Volume II).
Vendor neutral, not a single stack. A vendor reference can say "here is how we do it" and quote real production numbers. We generalize across accelerators (GPU and TPU), across cloud and edge, and teach why a design is the way it is, so the reasoning survives the next hardware generation and transfers anywhere.
Living, not a snapshot. A printed edition is frozen until its next revision. This is open source, continuously updated, and surrounded by code you build yourself (TinyTorch), hardware kits, a simulator, and interview practice.

In short, a warehouse-scale reference tells you how one machine was built. This curriculum teaches you to reason about why, and builds the judgment to design the next one.

Why read a textbook in the age of LLMs?

Because a textbook gives you something an LLM does not: perspective. An LLM is excellent at retrieval, and we are not trying to compete on retrieval. If a paragraph only delivers a fact you could get faster by asking a model, it has not earned its place. What a book builds instead is a structured mental model in the right order, the judgment to know which questions matter, and the reasoning behind why a design is the way it is and what it costs. Much of that comes from what a textbook chooses to leave out, since deciding what is central and what is peripheral is itself a lesson that an encyclopedic pile of facts cannot teach.

Bruce Davie makes this case well in "Textbooks in Tokenland" (Systems Approach): an LLM generates text that is not grounded in communicative intent, while a textbook is written by people trying to convey a model of the world to a reader. We agree, and we go one step further by building an AI tutor (SocratiQ) into the reading experience. The goal is not textbook versus LLM. A good book gives you the perspective to ask meaningful questions, and the LLM helps you answer them. Use each for what it does best.

Is it free, and how do I read it?

Yes. Both volumes are free to read online at mlsysbook.ai, and the textbook is open source under a Creative Commons license (CC BY-NC-SA 4.0), so you can share and adapt it for non-commercial use with attribution. If you prefer print, a hardcopy edition is coming in 2026 with MIT Press. The surrounding tools are open source too, each under its own license.

Quick Start

①	Read the textbook. Start with Volume I or continue to Volume II. It's the foundation for everything else.
②	Pick a hands-on path. Build a framework (TinyTorch), explore trade-offs (Labs), model constraints (MLSys·im), or deploy to real hardware (Kits).
③	Test yourself. Drill StaffML: physics-grounded systems design questions across cloud, edge, mobile, and TinyML.
④	Teach it. Adopt the curriculum with the AI Engineering Blueprint and lecture slides.

Branch Guide

Note: You are on the dev branch. Active development happens here. For the last stable release, see the main branch.

	Branch	What's on it	Status
🟢	`main` mlsysbook.ai	Single-volume textbook (current edition)	Live — this is what readers see today.
🟡	`dev` ← you are here	Volume I: two-volume split (content complete, editorial polish); Volume II: At Scale (active development); Curriculum: TinyTorch, Kits, MLSys·im, Labs, StaffML	TinyTorch and Hardware Kits are live. MLSys·im, Labs, and StaffML are early-release and actively iterated.

The two-volume split replaces the single-volume edition at launch.

Support This Work

Star the repo
Stars signal to universities and foundations that this work matters. They directly fund workshops and hardware kits for underserved classrooms.

100 → 1,000 → 10,000 → 100,000 → 1M learners by 2030

Fund the mission
All contributions go to Open Collective, a transparent fund for educational outreach. Every dollar goes to reaching more students.

Contributing

	I want to...	Go here
📖	Fix a typo or improve a chapter	Textbook contributing guide
🔥	Add a TinyTorch module or fix a bug	TinyTorch contributing guide
🛠️	Improve hardware labs	Hardware kits guide
🔬	Improve interactive labs or simulator models	Labs guide · MLSys·im guide
💼	Improve assessment or career-readiness content	StaffML guide · quiz refresh guide
🧠	Improve AI learning tools	Socratiq guide
🐛	Report an issue	GitHub Issues
💬	Ask a question	GitHub Discussions

License

This is a multi-component repository, and each component is released under its own license to match its purpose. The LICENSE file inside each directory (e.g. tinytorch/LICENSE, interviews/staffml/LICENSE) is authoritative.

Component	License	What it means
Textbook (`book/`), Labs (`labs/`), Kits (`kits/`), Slides (`slides/`), Instructors (`instructors/`)	CC BY-NC-SA 4.0	Share and adapt for non-commercial use, with attribution and same-license sharing.
TinyTorch	MIT	Permissive — use, modify, redistribute, including commercially.
MLSys·im	Apache 2.0	Permissive with explicit patent grant.
StaffML	AGPL v3	Strong copyleft — modifications to deployed services must be published. Commercial licensing available; contact the authors.
StaffML question corpus	CC BY-NC 4.0	Research and educational use; commercial use requires permission.
TinyDigits dataset	BSD 3-Clause	Permissive (matches sklearn ancestry).
TinyTalks dataset	CC BY 4.0	Permissive with attribution; commercial use allowed.

A user-facing summary lives at mlsysbook.ai/about/license.

If you are an institution considering adoption, or a company interested in commercial terms for a copyleft component, please reach out to edu@tinyML.org.

Contributors

Thanks go to these wonderful people who have contributed to making this resource better for everyone!

Legend: 🪲 Bug Hunter · 🧑‍💻 Code Contributor · ✍️ Doc Wizard · 🎨 Design Artist · 🧠 Idea Spark · 🔎 Code Reviewer · 🧪 Test Tinkerer · 🛠️ Tool Builder

📖 Textbook Contributors

Vijay Janapa Reddi 🪲 🧑‍💻 🎨 ✍️ 🧠 🔎 🧪 🛠️	Marcelo Rovai 🧑‍💻 🎨 🧪	Gabriel Amazonas 🪲 ✍️ 🧠	Zeljko Hrcek 🧑‍💻 ✍️	Tess Watt 🪲 ✍️	Kai Kleinbard 🧑‍💻 🛠️	Didier Durand ✍️ 🪲
Rocky 🪲 🧑‍💻	Gustaf Hammarberg 🪲 🧑‍💻	Aayush Kumar 🪲 🧑‍💻	Yue Cheng 🪲 🧑‍💻	Jason Jabbour ✍️	Ikechukwu Uchendu ✍️	Naeem Khoshnevis ✍️
Sara Khosravi ✍️	Douwe den Blanken ✍️	Jeffrey Ma ✍️	shanzehbatool ✍️	Elias ✍️	Jared Ping ✍️	Itai Shapira ✍️
Maximilian Lam ✍️	Jayson Lin ✍️	Sophia Cho ✍️	Andrea ✍️	Alex Rodriguez ✍️	Korneel Van den Berghe ✍️	Nimo ✍️
Colby Banbury ✍️	Zishen Wan ✍️	Mark Mazumder ✍️	Abdulrahman Mahmoud ✍️	Divya Amirtharaj ✍️	Srivatsan Krishnan ✍️	marin-llobet ✍️
Aghyad Deeb ✍️	Haoran Qiu ✍️	Emil Njor ✍️	ELSuitorHarvard ✍️	kaiM0ves ✍️	oishib ✍️	Jared Ni ✍️
Aditi Raju ✍️	Michael Schnebly ✍️	Thuong Duong ✍️	Yu-Shun Hsiao ✍️	Henry Bae ✍️	Eimhin Laverty ✍️	Jae-Won Chung ✍️
Shvetank Prakash ✍️	Marco Zennaro ✍️	Arya Tschand ✍️	Andrew Bass ✍️	Pong Trairatvorakul ✍️	Eura Nofshin ✍️	Matthew Stewart ✍️
Emeka Ezike ✍️	jianqingdu ✍️	Jennifer Zhou ✍️	The Random DIY ✍️	Fatima Shah ✍️	Bruno Scaglione ✍️	Allen-Kuang ✍️
Tauno Erik ✍️	gnodipac886 ✍️	Sercan Aygün ✍️	TheHiddenLayer ✍️	Gauri Jain ✍️	Fin Amin ✍️	Alex Oesterling ✍️
Abenezer Angamo ✍️	Baldassarre Cesarano ✍️	Jahnic Beck ✍️	अरनव शुक्ला | Arnav Shukla ✍️	Rin ✍️	Bilge Acun ✍️	Andy Cheng ✍️
Aritra Ghosh ✍️	abigailswallow ✍️	Yang Zhou ✍️	JEON HYUNJUN(Luciano) ✍️	Emmanuel Rassou ✍️	Jason Yik ✍️	Jessica Quaye ✍️
Cursor Agent ✍️	happyappledog ✍️	Snuggs ✍️	Sam Wilcock ✍️	Shreya Johri ✍️	Sonia Murthy ✍️	Costin-Andrei Oncescu ✍️
formlsysbookissue ✍️	Annie Laurie Cook ✍️	Parampreet Singh ✍️	Vijay Edupuganti ✍️	Jothi Ramaswamy ✍️	Batur Arslan ✍️	Curren Iyer ✍️
Edward Jin ✍️	bluebaer7 ✍️	yanjingl ✍️	a-saraf ✍️	songhan ✍️	jvijay ✍️	Zishen ✍️
Kristian Radoš ✍️	Dang Truong 🧑‍💻	pipme ✍️	Salman Chishti ✍️	Paolo Estavillo ✍️	GronuJ ✍️	Pratham Chaudhary 🧑‍💻
Octopus ✍️

🔥 TinyTorch Contributors

Vijay Janapa Reddi 🪲 🧑‍💻 🎨 ✍️ 🧠 🔎 🧪 🛠️	kai 🪲 🧑‍💻 🎨 ✍️ 🧪	Dang Truong 🪲 🧑‍💻 ✍️ 🧪	Farhan Asghar 🪲 🧑‍💻 🎨 ✍️	Rocky 🪲 🧑‍💻 ✍️ 🧪	Didier Durand 🪲 🧑‍💻 ✍️	rnjema 🧑‍💻 ✍️ 🛠️
Pratham Chaudhary 🪲 🧑‍💻 ✍️	Karthik Dani 🪲 🧑‍💻	Avik De 🪲 🧪	Takosaga 🪲 ✍️	joeswagson 🧑‍💻 🛠️	AndreaMattiaGaravagno 🧑‍💻 ✍️	Rolds 🪲 🧑‍💻
asgalon 🧑‍💻 ✍️	bdub 🪲 🧑‍💻	Amir Alasady 🪲	jettythek 🧑‍💻	wzz 🪲	Ng Bo Lin ✍️	keo-dara 🪲
Wayne Norman 🪲	Ilham Rafiqin 🪲	Oscar Flores ✍️	harishb00a ✍️	Pastor Soto ✍️	Salman Chishti 🧑‍💻	Aditya Mulik ✍️
Ademola Arigbabuwo ✍️	Yaroslav Halchenko 🧑‍💻	Harish ✍️

🚀 MLSys·im Contributors

Vijay Janapa Reddi 🧑‍💻 🎨 ✍️ 🧠	Peter Koellner 🪲 ✍️	Rocky 🪲 🧑‍💻	Zeljko Hrcek 🧑‍💻

🤖 StaffML Contributors

Vijay Janapa Reddi 🎨 ✍️ 🧠	Rocky 🪲 🧑‍💻	Pelin Balcı 🎨 🧠	Farhan Asghar 🧑‍💻

🛠️ Hardware Kits Contributors

Vijay Janapa Reddi 🪲 🧑‍💻 🎨 ✍️ 🧪 🛠️	Marcelo Rovai ✍️ 🧑‍💻 🎨	Farhan Asghar 🪲 🧑‍💻	Salman Chishti 🧑‍💻	Pratham Chaudhary 🧑‍💻	Rocky 🪲

🧪 Labs Contributors

Vijay Janapa Reddi 🧑‍💻 🎨 ✍️	Rocky 🪲 🧑‍💻 🎨	Salman Chishti 🧑‍💻	Pratham Chaudhary 🧑‍💻	Peter Koellner 🪲

🎞️ Slides Contributors

Vijay Janapa Reddi 🧑‍💻 🎨 ✍️

🗺️ Instructor Site Contributors

Vijay Janapa Reddi 🧑‍💻 🎨 ✍️	Farhan Asghar 🪲 🧑‍💻 🎨	Rocky 🧑‍💻 ✍️ 🔎

⚗️ ML Systems Design Grammar Contributors

Coming soon!

✉️ Subscribe • 💬 Join discussions • 🌐 Visit mlsysbook.ai

Made with ❤️ for AI engineers
in the making, around the world 🌎
