A professional roadmap for mastering large language model internals, training, post-training, inference, retrieval, agents, evaluation, and production architecture.
This repository is designed for engineers who want to move beyond surface-level LLM usage and build production-grade LLM systems with measurable quality, latency, cost, and reliability.
It is not a collection of model news.
It is not a prompt-engineering cookbook.
It is a systems roadmap.
The central idea:
LLM competence = model internals + training logic + inference systems + retrieval architecture + agent control + evaluation discipline + production constraints
- Who this roadmap is for
- What this roadmap covers
- What this roadmap does not cover
- Core philosophy
- Competency map
- Roadmap overview
- Layer 1: LLM Foundations
- Layer 2: Training Pipeline
- Layer 3: Post-Training
- Layer 4: Reasoning Models
- Layer 5: Inference Fundamentals
- Layer 6: Serving Engines
- Layer 7: KV Cache and Long Context
- Layer 8: Quantization and Compression
- Layer 9: RAG Systems
- Layer 10: Agentic Systems
- Layer 11: Evaluation and Benchmarking
- Layer 12: Production Architecture
- Advanced tracks
- Master artifact portfolio
- Repository structure
- Definition of done
- Recommended source map
- Engineering checklists
- How to use this roadmap
This roadmap is for:
- AI engineers
- ML engineers
- NLP engineers
- backend engineers moving into LLM systems
- technical leads responsible for GenAI architecture
- researchers who want stronger production intuition
- product-minded engineers building applied LLM systems
- infrastructure engineers working with GPU serving stacks
You should use this roadmap if your target is to build systems like:
- enterprise RAG platforms
- on-prem LLM deployments
- multi-model inference gateways
- AI agents with tools and approvals
- evaluation harnesses for LLM products
- private domain assistants
- document intelligence systems
- multimodal knowledge systems
- production LLM observability pipelines
- cost-controlled LLM serving infrastructure
This roadmap covers the full technical stack behind modern LLM systems:
Text
→ Tokens
→ Transformer
→ Pretraining
→ Post-training
→ Reasoning
→ Inference runtime
→ Serving engine
→ KV cache
→ Quantization
→ Retrieval
→ Agents
→ Evaluation
→ Production architecture
It explains the mechanisms, not only the buzzwords.
Each layer includes:
- objective
- core concepts
- what to understand deeply
- implementation artifacts
- engineering decisions
- failure modes
- evaluation gates
- recommended resources
This roadmap does not focus on:
- daily model release news
- shallow prompt collections
- generic AI career advice
- vendor marketing claims
- no-code tool tutorials
- toy demos without evaluation
- "best model" lists without workload definition
The assumption is simple:
A model is not good or bad in isolation.
A model is good or bad for a workload, under constraints, measured by an eval.
Do not memorize model names. Learn what changed.
What changed in architecture?
What changed in data?
What changed in post-training?
What changed in inference?
What changed in memory layout?
What changed in evaluation?
Model names expire. Mechanisms compound.
A capable LLM system is not just a strong model.
It is a controlled pipeline:
model
+ tokenizer
+ chat template
+ retrieval
+ tools
+ serving engine
+ cache policy
+ eval set
+ observability
+ fallback logic
+ cost controls
+ safety boundaries
Most production failures happen outside the model weights.
For every layer, produce something measurable:
benchmark
eval set
notebook
dashboard
architecture diagram
serving comparison
failure analysis
cost model
red-team suite
If your knowledge cannot produce an artifact, it is not operational yet.
Do not trust:
- one prompt
- one demo
- one leaderboard
- one benchmark
- one model card
- one latency number
- one anecdotal answer
Use evals, traces, failure categories, and regression tests.
The final goal is not to know more terms.
The final goal is to make better technical decisions:
Should we fine-tune or use RAG?
Should we use vLLM or SGLang?
Should we quantize to INT4 or keep FP16?
Should we use long context or retrieval?
Should this be an agent or a deterministic workflow?
Should this run on-prem or through an API?
Should we add a reranker?
Should we use a reasoning model?
Can call hosted APIs.
Typical abilities:
- writes prompts
- uses chat interfaces
- calls model endpoints
- adjusts temperature
- knows model names
Limit:
Cannot explain or debug failures below the API layer.
Can build demos.
Typical abilities:
- builds simple RAG
- uses LangChain/LlamaIndex
- connects vector databases
- builds tool-calling examples
- creates chatbot demos
Limit:
Often lacks evaluation, observability, failure analysis, and production constraints.
Can build useful applications.
Typical abilities:
- designs retrieval pipelines
- builds structured prompts
- manages citations
- performs basic evals
- handles tool calling
- integrates with backend systems
Limit:
May not deeply understand inference, KV cache, serving engines, or GPU cost.
Can build production systems.
Typical abilities:
- understands prefill/decode
- benchmarks inference
- chooses serving engines
- estimates KV cache memory
- evaluates quantization
- builds RAG evals
- instruments traces
- designs fallback paths
- controls latency and cost
Minimum serious professional level.
Can optimize large-scale serving.
Typical abilities:
- operates vLLM/SGLang/TensorRT-LLM
- handles multi-GPU serving
- manages concurrency
- tunes batching
- handles prefix caching
- evaluates quantized kernels
- monitors GPU utilization
- designs model gateways
- handles autoscaling
Can understand and modify methods.
Typical abilities:
- reads papers for their mechanisms
- runs ablations
- modifies post-training recipes
- tests reasoning methods
- builds custom evals
- analyzes training/inference tradeoffs
- understands architecture deltas
Can design organization-scale platforms.
Typical abilities:
- defines platform architecture
- governs model usage
- builds eval infrastructure
- designs multi-tenant systems
- manages security and compliance
- controls cost at scale
- aligns model strategy with business constraints
Target level.
| Layer | Area | Core question |
|---|---|---|
| 1 | LLM Foundations | What happens during one token generation? |
| 2 | Training Pipeline | How are base models created? |
| 3 | Post-Training | How are models shaped into assistants? |
| 4 | Reasoning Models | How do models use extra compute to solve hard tasks? |
| 5 | Inference Fundamentals | Why is serving an LLM a systems problem? |
| 6 | Serving Engines | Which runtime fits which workload? |
| 7 | KV Cache and Long Context | What makes long context expensive and unreliable? |
| 8 | Quantization and Compression | How do we reduce cost without silent quality collapse? |
| 9 | RAG Systems | How do we ground outputs in external knowledge? |
| 10 | Agentic Systems | How do we safely connect models to tools and workflows? |
| 11 | Evaluation and Benchmarking | How do we measure quality, cost, latency, and safety? |
| 12 | Production Architecture | How do we deploy, monitor, scale, and govern LLM systems? |
Understand the core mechanics of a decoder-only LLM.
The minimum target:
Given a prompt, explain exactly how the model turns text into token probabilities.
The model does not read words. It reads tokens.
A tokenizer converts text into integer IDs:
"Large language models are useful"
→ [24513, 4221, 4981, 527, 5562]
Tokenization affects cost, context length, latency, multilingual quality, code handling, Arabic morphology, prompt compression, domain vocabulary, and retrieval chunk size.
A bad tokenizer can increase token count and damage quality, especially for morphologically rich languages.
Key rule:
Never estimate LLM cost from word count.
Always measure with the model tokenizer.
Token IDs are indices.
The model maps each token ID to a dense vector through an embedding table:
vocabulary_size × hidden_dimension
The prompt becomes:
sequence_length × hidden_dimension
A decoder-only LLM is a stack of Transformer blocks:
input
→ normalization
→ self-attention
→ residual connection
→ normalization
→ MLP
→ residual connection
→ output
Each block edits the representation. It does not rebuild everything from scratch.
Self-attention lets each token pull in information from itself and from earlier tokens.
Each token representation is projected into:
Q = query
K = key
V = value
Intuition:
Query = what this position is looking for
Key = what each position offers for matching
Value = information carried if selected
Scaled dot-product attention:
Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k))V
Decoder-only models cannot look into the future.
For tokens:
A B C D
position C can attend to:
A B C
but not:
D
This enables next-token training without information leakage.
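The attention formula and the causal rule above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not an efficient implementation; the toy vectors for positions A to D are made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V are lists of vectors, one per position, with dimension d_k."""
    d_k = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Position i only scores positions 0..i, never future positions.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        w = softmax(scores)
        out = [sum(w[j] * V[j][c] for j in range(i + 1))
               for c in range(len(V[0]))]
        # Pad with zeros so every row has one weight per position.
        weights.append(w + [0.0] * (len(Q) - i - 1))
        outputs.append(out)
    return outputs, weights

# Four positions (A, B, C, D) with toy 2-dimensional vectors (illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
outputs, weights = causal_attention(Q, K, V)
```

Row C of `weights` has a zero in the D column, and position A can only attend to itself, so its output equals its own value vector.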
Multiple attention heads let the model route different information patterns in parallel: syntax, long-range reference, formatting, code indentation, list structure, and mathematical dependencies.
Classic multi-head attention stores separate keys and values for every attention head. That is expensive during inference.
Modern models often use:
- MQA: many query heads share one key/value head
- GQA: groups of query heads share key/value heads
Why this matters:
Fewer KV heads → smaller KV cache → better serving scalability
Attention mixes information across token positions.
MLP layers transform each token representation independently.
Attention = token-to-token communication
MLP = token-wise feature transformation
Many LLM parameters live in MLP blocks.
Attention alone does not know order.
Modern LLMs commonly use RoPE. RoPE injects position by rotating query and key vectors in a position-dependent way.
Important implication:
Long-context behavior is partly constrained by positional encoding design.
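A small sketch can make the rotation idea concrete. It assumes the commonly used base of 10000 and the interleaved pairing of dimensions; real implementations differ in how they pair dimensions, so treat the layout as an assumption. The property to notice is that the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by
    position-dependent angles. len(x) must be even."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # i is the element index of the pair, so i / d equals 2 * pair_index / d.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Same relative offset (3), very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
```

`near` and `far` agree up to floating-point error, which is why RoPE is described as encoding relative position.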
The final hidden state is projected into vocabulary logits:
hidden_dim → vocab_size
Softmax turns logits into probabilities. Decoding chooses the next token.
Common decoding methods:
- greedy decoding
- temperature sampling
- top-k sampling
- top-p sampling
- repetition penalties
- constrained decoding
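The decoding methods listed above can be sketched on a toy logit vector. This is an illustrative pure-Python version; the logits are invented, and production engines apply these filters on tensors rather than lists. Temperature 0 is treated as the greedy limit, since dividing by zero is not defined.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]          # hypothetical next-token logits
greedy = max(range(len(logits)), key=lambda i: logits[i])
probs = softmax(logits, temperature=0.8)
filtered = top_p_filter(top_k_filter(probs, k=3), p=0.9)
token = sample(filtered, random.Random(0))
```

Lower temperatures concentrate probability on the top token, while top-k and top-p remove the low-probability tail before sampling.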
During generation, the model stores the key/value tensors it has already computed. Each new token attends to the cached keys and values instead of recomputing them for the whole prefix.
KV cache grows with:
batch_size × context_length × layers × KV_heads × head_dim × bytes_per_value
This is one of the most important memory bottlenecks in LLM serving.
Build a tiny decoder-only model.
Minimum components:
tokenizer
embedding layer
causal self-attention
MLP
residual connections
normalization
logits head
sampling loop
Compare token counts across English, Arabic, mixed Arabic-English, Python code, JSON, and legal text.
Record:
characters
words
tokens
tokens per word
strange splits
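A starting point for this comparison might look like the sketch below. The `utf8_byte_tokenizer` is a stand-in (one token per UTF-8 byte) so the example runs without dependencies; for real measurements, pass the `encode` function of the model's own tokenizer instead. The sample texts are illustrative.

```python
def utf8_byte_tokenizer(text):
    """Stand-in tokenizer: one token per UTF-8 byte.
    Replace with the real model tokenizer's encode function."""
    return list(text.encode("utf-8"))

def token_stats(text, encode):
    words = text.split()
    tokens = encode(text)
    return {
        "characters": len(text),
        "words": len(words),
        "tokens": len(tokens),
        "tokens_per_word": len(tokens) / len(words) if words else 0.0,
    }

samples = {
    "english": "Large language models are useful",
    "arabic": "النماذج اللغوية الكبيرة مفيدة",
    "json": '{"id": 42, "tags": ["a", "b"]}',
}
report = {name: token_stats(text, utf8_byte_tokenizer)
          for name, text in samples.items()}
```

Even with this crude stand-in, the Arabic sample produces more tokens than characters, which is the kind of gap the notebook should quantify with real tokenizers.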
Run one model with:
temperature = 0
temperature = 0.3
temperature = 0.8
top_p = 0.9
top_k = 50
Observe correctness, variation, repetition, hallucination, and formatting stability.
Measure memory and latency at:
1k context
4k context
16k context
32k context
Track time to first token, time per output token, GPU memory, and throughput.
You pass this layer if you can explain:
text → tokens → embeddings → Transformer blocks → logits → probabilities → next token
without hand-waving.
Understand how base LLM capability is created before instruction tuning.
A base model is not yet a helpful assistant. It is a statistical language model trained to predict the next token.
raw data
→ filtering
→ deduplication
→ classification
→ mixture design
→ tokenizer training
→ sequence packing
→ pretraining
→ checkpointing
→ validation
→ contamination checks
→ base model release/evaluation
Training data quality dominates model behavior.
Sources may include web pages, books, code, academic text, documentation, forums, math data, multilingual corpora, synthetic data, and domain-specific corpora.
Data quality issues include spam, boilerplate, duplicated pages, machine-generated junk, toxic content, benchmark contamination, stale facts, low-quality translations, formatting noise, and personally identifiable information.
Key principle:
Pretraining data is not just fuel.
It is the model's compressed world.
Deduplication reduces repeated content.
Why it matters:
- reduces verbatim memorization
- improves data diversity
- reduces overfitting
- reduces benchmark leakage
- improves compute efficiency
Types:
- exact deduplication
- near-duplicate detection
- document-level deduplication
- paragraph-level deduplication
- code clone detection
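Two of the deduplication types above, exact and near-duplicate, can be sketched as follows. This is a toy version: the normalization rule and the 3-word shingle size are assumptions chosen for illustration, and pairwise Jaccard comparison does not scale to large corpora, where approximate methods such as MinHash are typically used.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace before comparing documents."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping n-word windows."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Shingle overlap in [0, 1]; values near 1 suggest near-duplicates."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

docs = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",
    "A different document entirely.",
]
unique_docs = exact_dedup(docs)
similarity = jaccard("the quick brown fox jumps over the lazy dog",
                     "the quick brown fox jumps over the lazy cat")
```

A similarity threshold (for example 0.7 or higher) then decides which near-duplicates to drop; the right value depends on the corpus.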
Not all data should have equal weight.
A data mixture controls how much of each domain the model sees.
Examples:
web text
code
math
books
scientific papers
multilingual content
instruction-like content
Data mixture affects code ability, reasoning, multilingual quality, factual recall, style, toxicity, and domain competence.
Tokenizer decisions affect the whole model.
Consider vocabulary size, BPE vs Unigram vs WordPiece, byte fallback, multilingual coverage, code tokens, special tokens, whitespace behavior, and Arabic and dialect handling.
A tokenizer is hard to change after training.
Changing the tokenizer usually means retraining or adapting the model.
Most decoder-only LLMs use next-token prediction.
The model minimizes cross-entropy loss:
good prediction → low loss
bad prediction → high loss
Loss is useful but incomplete.
A lower loss does not automatically mean better instruction following, better reasoning, better safety, better RAG behavior, or better tool use.
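To make the cross-entropy objective concrete, the sketch below computes the loss for a single next-token prediction from raw logits. The logit values are invented; the loss is in nats, and `exp(loss)` is the corresponding perplexity.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-probability of the correct next token, in nats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]

# Same logits, different correct tokens.
confident_right = cross_entropy([5.0, 0.0, 0.0, 0.0], target_id=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], target_id=1)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_id=2)
perplexity = math.exp(uniform)
```

A uniform guess over four tokens gives a loss of ln 4 and a perplexity of 4; a confident correct prediction scores lower, and a confident wrong one scores much higher.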
Scaling laws relate model size, dataset size, compute budget, and loss.
They help answer:
Given fixed compute, should we train a larger model on fewer tokens or a smaller model on more tokens?
Important principle:
Compute-optimal training is a resource allocation problem.
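A rough way to explore this allocation question is sketched below. It relies on two rules of thumb from the scaling-law literature that are not stated in this roadmap: training FLOPs ≈ 6 × parameters × tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. Both are approximations, and the ratio in particular varies with data quality and with how much inference cost matters.

```python
import math

def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget into (params, tokens) at a fixed tokens/param ratio."""
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

budget = training_flops(params=1e9, tokens=20e9)   # a 1B model on 20B tokens
params, tokens = compute_optimal_split(budget)
```

Changing `tokens_per_param` shows the tradeoff directly: a higher ratio yields a smaller model trained on more tokens for the same budget.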
Important components:
- AdamW
- learning rate schedule
- warmup
- gradient clipping
- weight decay
- mixed precision
- gradient accumulation
- batch size
- checkpointing
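Two of these components, the learning rate schedule with warmup and gradient clipping, are easy to sketch in isolation. The linear-warmup plus cosine-decay shape and the global-norm clipping rule below are common choices rather than the only ones, and the hyperparameter values in the usage lines are placeholders.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

schedule = [lr_at(s, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5)
            for s in range(1001)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Plotting `schedule` next to the training loss is a quick way to check whether a loss spike lines up with the end of warmup or with the peak learning rate.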
Training instability can come from bad data, bad learning rate, optimizer settings, numerical overflow, distributed training bugs, tokenizer/data mismatch, or corrupted batches.
Large models require distributed training.
Common strategies:
- data parallelism
- tensor parallelism
- pipeline parallelism
- sequence parallelism
- ZeRO-style optimizer sharding
- activation checkpointing
Training is constrained by GPU memory, interconnect bandwidth, compute utilization, checkpoint I/O, failure recovery, and cluster scheduling.
Build a mini pretraining pipeline:
collect text
clean text
train tokenizer
pack sequences
train tiny model
track loss
sample generations
evaluate basic capability
You pass this layer if you can read a model technical report and identify:
data recipe
token budget
model size
architecture
training objective
compute estimate
evaluation setup
contamination risks
Understand how base models become useful assistants.
Base models complete text. Post-trained models follow instructions.
base model
→ supervised fine-tuning
→ preference optimization
→ reinforcement learning / direct optimization
→ safety tuning
→ refusal calibration
→ formatting alignment
→ evaluation
SFT trains the model on instruction-response pairs.
It teaches instruction following, chat behavior, formatting, domain response style, role behavior, and basic helpfulness.
But SFT alone teaches imitation of the demonstrations. It does not directly teach the model which of two plausible answers is better.
Preference data contains comparisons:
prompt
chosen answer
rejected answer
The model learns which answer is preferred.
Methods include:
- RLHF
- DPO
- IPO
- KTO
- ORPO
- RLAIF
RLHF often uses:
preference data
→ reward model
→ policy optimization
Benefits:
- improves helpfulness
- aligns with human preference
- improves conversational behavior
Risks:
- reward hacking
- over-optimization
- verbosity bias
- style over truth
- calibration damage
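Direct Preference Optimization removes the separately trained reward model and the RL policy-optimization loop.
It optimizes the model directly on preference pairs, using a frozen reference model to keep the policy from drifting too far.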
Benefits:
- simpler than RLHF
- easier to implement
- widely used for alignment experiments
Limit:
Quality depends heavily on preference data quality.
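The DPO objective for a single preference pair can be written in a few lines. The sketch below takes summed sequence log-probabilities as plain floats; `beta=0.1` is a commonly used value assumed here for illustration, and the log-probabilities in the usage lines are invented.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference learned yet.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen answer and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
# Policy moved in the wrong direction.
regressed = dpo_loss(-12.0, -10.0, -10.0, -12.0)
```

The loss starts at ln 2 when policy and reference agree, and falls only when the policy increases the chosen answer's likelihood relative to the rejected one, measured against the reference.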
RL with verifiable rewards is important for reasoning tasks.
A reward is verifiable when correctness can be checked automatically.
Examples:
- math answer correctness
- code tests passing
- exact symbolic results
- game outcomes
- tool-verified facts
This is more reliable than subjective reward for many reasoning tasks.
Safety tuning shapes refusal behavior, policy adherence, harmful request handling, uncertainty expression, tool permission behavior, and sensitive data handling.
Bad safety tuning can cause over-refusal, under-refusal, evasive answers, false confidence, and degraded utility.
Run a small post-training experiment:
base model
→ SFT
→ preference optimization
→ before/after eval
Track instruction following, factual accuracy, formatting, refusal behavior, hallucination, verbosity, and domain performance.
You pass this layer if you can decide between:
prompting
RAG
SFT
LoRA
DPO
continued pretraining
based on the actual failure mode.
Understand models that spend additional inference compute to solve harder tasks.
A reasoning model is not just a model that outputs long explanations.
Reasoning systems often involve longer internal deliberation, verifiable rewards, search, self-consistency, verifier models, test-time compute, and specialized post-training.
Chain-of-thought encourages intermediate reasoning.
It can help with math, logic, planning, code, and multi-step questions.
But it can fail when reasoning is ungrounded, the model fabricates steps, the task requires external knowledge, the chain is persuasive but wrong, or hidden assumptions go unchecked.
Test-time compute means spending more inference resources for better answers.
Examples:
- generate multiple candidates
- vote across answers
- use a verifier
- search over reasoning paths
- run code/tools
- critique and revise
Tradeoff:
better quality potential
vs
higher latency and cost
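The simplest form of test-time compute, sampling several candidates and voting on the final answer, can be sketched as below. The sampled answers are hard-coded stand-ins for model outputs, and the normalization (strip and lowercase) is an assumption that only suits short final answers.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning_answer, agreement_ratio) over sampled final answers."""
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Five hypothetical samples for the same question.
samples = ["42", "42", "41", " 42 ", "40"]
answer, agreement = majority_vote(samples)
```

Five samples cost roughly five times the decode budget of one, so the agreement ratio is worth logging: if it is already near 1.0 on a task, the extra samples are probably not buying accuracy.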
A verifier scores or checks candidate answers.
Types:
- outcome verifier
- process verifier
- unit test verifier
- symbolic verifier
- retrieval-grounded verifier
- human verifier
Verifiers work best when correctness is measurable.
Reasoning models can overthink.
Symptoms:
- unnecessary long reasoning
- changing correct answers
- unstable final answer
- higher cost without better quality
- worse performance on simple tasks
Decision rule:
Use reasoning models where extra compute changes accuracy.
Do not use them by default.
Create a reasoning eval harness:
question
baseline answer
reasoning answer
tool-verified answer
latency
cost
correctness
failure mode
You pass this layer if you can classify a task into:
direct answer
retrieval required
tool required
reasoning required
human approval required
Understand LLM inference as a systems problem.
request arrives
→ tokenize
→ build prompt/chat template
→ prefill
→ first token
→ decode loop
→ stream output
→ stop condition
→ log trace
Prefill processes the whole input prompt in one parallel pass.
It builds the KV cache for the prompt tokens and is usually compute-bound.
Main metric:
TTFT = time to first token
Long prompts increase TTFT.
Decode generates output one token at a time.
Main metric:
TPOT = time per output token
Decode is often constrained by memory bandwidth and KV cache reads.
Throughput:
tokens per second
Latency:
how long one user waits
They are not the same.
A system can have high throughput and poor user latency.
Measure both.
Batching improves GPU utilization.
But LLM requests have variable lengths.
Static batching wastes capacity.
Continuous batching dynamically adds and removes requests.
This improves utilization under live traffic.
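A toy simulation shows why this matters. The model below counts decode steps only, assumes every request is already queued, gives each request an output length of at least one token, and ignores prefill; real schedulers are far more involved, so read it as an illustration of the idea rather than a description of any engine.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[i:i + batch_size])
    return total

def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    queue = deque(output_lengths)
    active = []          # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

lengths = [8, 1, 1, 1]   # one long request and three short ones
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With static batching the short requests in the second batch wait behind the long one; with continuous batching they slip into the free slot while the long request keeps decoding.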
Track:
TTFT
TPOT
end-to-end latency
tokens/sec
requests/sec
p50 latency
p95 latency
p99 latency
GPU utilization
VRAM usage
queue time
error rate
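Several of these metrics can be derived from nothing more than timestamps. The sketch below assumes you have recorded the request arrival time and the arrival time of each streamed output token; the nearest-rank percentile is one of several percentile definitions, chosen here for simplicity.

```python
import math

def request_metrics(start, token_times):
    """start: request arrival time; token_times: arrival time of each
    output token, in seconds, in order."""
    n = len(token_times)
    ttft = token_times[0] - start
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot,
            "e2e": token_times[-1] - start, "tokens": n}

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

m = request_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
latencies = list(range(1, 101))          # illustrative e2e latencies
p50, p95, p99 = (percentile(latencies, q) for q in (50, 95, 99))
```

Report p95 and p99 alongside the mean: tail latency is usually what users notice under load.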
Build an inference benchmark suite.
Test:
single request
many concurrent requests
short prompt
long prompt
short output
long output
streaming
non-streaming
You pass this layer if you never report "tokens/sec" without a workload definition.
Choose the correct runtime for a workload.
Examples:
- Ollama
- llama.cpp
Best for local experiments, CPU/Mac workflows, edge deployments, and quick testing.
Not ideal for high-concurrency production serving, advanced GPU scheduling, or multi-tenant inference platforms.
Examples:
- vLLM
- SGLang
- Hugging Face TGI
- LMDeploy
Best for high-throughput serving, OpenAI-compatible APIs, batching, prefix caching, multi-GPU serving, and production model endpoints.
Example:
- TensorRT-LLM
Best for NVIDIA GPU optimization, maximum performance, controlled deployment environments, and latency-sensitive workloads.
Tradeoff:
higher complexity
more hardware-specific optimization
Choose a serving engine based on:
model architecture support
hardware
quantization format
latency target
throughput target
context length
concurrency
structured output needs
LoRA serving
multi-GPU support
observability
operational complexity
team skill
| Engine | Best use | Strength | Risk |
|---|---|---|---|
| vLLM | General production serving | Throughput, ecosystem | Model-specific edge cases |
| SGLang | Structured/high-performance workloads | Prefix reuse, structured generation | Operational learning curve |
| TensorRT-LLM | NVIDIA-optimized serving | Performance | Complexity |
| llama.cpp | Local/edge | Portability | Not ideal for high-concurrency serving |
| Ollama | Developer UX | Simplicity | Limited production control |
You pass this layer if you can justify engine choice using constraints, not preference.
Understand the real cost of context length.
The KV cache stores keys and values for every prompt token and every generated token of each active request.
Memory grows with:
batch_size
context_length
num_layers
num_kv_heads
head_dim
dtype_bytes
Simplified:
KV memory ≈ 2 × batch × seq_len × layers × kv_heads × head_dim × bytes
The factor 2 is for K and V.
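The simplified formula translates directly into a small calculator. The model shape below (32 layers, head dimension 128, 32 versus 8 KV heads) is a hypothetical configuration chosen for round numbers. The estimate ignores model weights, activations, fragmentation, and paging overhead, so treat it as a lower bound on serving memory.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    """2 (K and V) x batch x seq_len x layers x kv_heads x head_dim x bytes."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(budget_bytes, seq_len, layers, kv_heads,
                             head_dim, dtype_bytes=2):
    """How many full-length sequences fit in a given KV memory budget."""
    per_sequence = kv_cache_bytes(1, seq_len, layers, kv_heads,
                                  head_dim, dtype_bytes)
    return budget_bytes // per_sequence

GIB = 1024 ** 3
# Hypothetical 32-layer model at 4k context in FP16 (2 bytes per value).
mha = kv_cache_bytes(1, 4096, layers=32, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(1, 4096, layers=32, kv_heads=8, head_dim=128)
slots = max_concurrent_sequences(20 * GIB, 4096, layers=32,
                                 kv_heads=8, head_dim=128)
```

In this configuration, full multi-head attention needs 2 GiB of KV cache per 4k-token sequence, while the 8-KV-head variant needs 0.5 GiB, which quadruples the number of sequences that fit in the same budget.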
Long context causes:
- higher prefill cost
- larger KV cache
- higher memory pressure
- slower scheduling
- lost-in-the-middle behavior
- attention dilution
- more prompt injection surface
- more irrelevant information
- higher cost
Prefix caching reuses KV cache for shared prompt prefixes.
Useful for repeated system prompts, few-shot examples, static policy blocks, agent frameworks, repeated document prefixes, and multi-turn sessions.
A 128k context window means the model can accept 128k tokens.
It does not mean it can reliably reason over all 128k tokens.
Quality still depends on position sensitivity, retrieval quality, prompt structure, instruction hierarchy, distractor density, and model training.
Use long context when:
- all context is relevant
- order matters
- context changes per request
- retrieval misses critical details
Use RAG when:
- corpus is large
- only small slices are relevant
- freshness matters
- citations matter
- permission control matters
- cost matters
Build a KV cache calculator.
Inputs:
layers
kv_heads
head_dim
dtype
batch size
context length
Output:
estimated KV memory
max concurrency
memory risk
You pass this layer if you can estimate memory before deployment.
Reduce memory and cost without destroying quality.
Common formats:
- FP32
- FP16
- BF16
- FP8
- INT8
- INT4
Lower precision reduces memory.
But it introduces numerical error.
Quantization is controlled damage.
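The "controlled damage" can be seen in a round trip through 8-bit integers. The sketch uses symmetric absmax quantization with a single scale for the whole tensor, which is the simplest scheme and an assumption made for illustration; practical methods use per-channel or per-group scales. The weight values are invented.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.5, 0.31, 1.27, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The largest weight sets the scale, so the smallest weight (`-0.004`) rounds to zero. One outlier can erase information in many small values, which is why outlier handling matters so much in low-bit formats.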
Weight-only quantization: most common. Reduces model memory.
Weight-activation quantization: more complex. Can improve throughput if kernels support it.
KV cache quantization: reduces memory for long context and high concurrency. Can damage long-context quality.
GPTQ: post-training weight quantization. Often used for GPU inference.
AWQ: activation-aware weight quantization. Often strong for preserving quality in low-bit inference.
GGUF: common format for the llama.cpp ecosystem. Useful for local and edge deployment.
SmoothQuant: balances activation and weight quantization difficulty.
QLoRA: uses quantized base weights for memory-efficient fine-tuning.
Measure:
quality
latency
throughput
VRAM
TTFT
TPOT
format stability
code correctness
reasoning accuracy
RAG faithfulness
tool-call validity
Do not evaluate quantization only with perplexity.
Run:
FP16 baseline
INT8
INT4 GPTQ
INT4 AWQ
GGUF
KV INT8
Compare against domain evals.
You pass this layer if you can say exactly what was quantized, how, and what quality changed.
Build retrieval systems that ground LLM outputs in external knowledge.
documents
→ parsing
→ cleaning
→ chunking
→ embedding
→ indexing
→ retrieval
→ reranking
→ prompt construction
→ generation
→ citation validation
→ evaluation
Chunking controls what the retriever can find.
Bad chunking causes missing context, fragmented answers, irrelevant retrieval, citation mismatch, and hallucination.
Chunking strategies:
- fixed-size chunks
- semantic chunks
- section-based chunks
- parent-child chunks
- sliding windows
- page-level chunks
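As one example from the list, a sliding-window chunker over a token list might look like this. The window size and overlap are placeholders to tune per corpus, and the input is assumed to be tokenized already.

```python
def sliding_window_chunks(tokens, size, overlap):
    """Fixed-size windows where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break   # the last window already reached the end
    return chunks

tokens = list(range(10))                 # stand-in for real token IDs
chunks = sliding_window_chunks(tokens, size=4, overlap=2)
```

The overlap trades index size for robustness: a fact that straddles a chunk boundary still appears whole in at least one chunk.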
Lexical retrieval (BM25): good for exact terms, IDs, names, legal references, and rare words.
Dense retrieval: good for semantic similarity.
Hybrid retrieval: combines lexical and semantic retrieval.
Often stronger than either alone.
Reciprocal Rank Fusion combines ranked lists from multiple retrievers.
Simple and effective.
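RRF fits in a few lines, which is a large part of its appeal. In the sketch below each document scores the sum of `1 / (k + rank)` over the lists it appears in; `k=60` is the constant commonly used in practice and is an assumption here, as are the document IDs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25 = ["d1", "d2", "d3"]     # hypothetical lexical results
dense = ["d3", "d1", "d4"]    # hypothetical dense results
fused = reciprocal_rank_fusion([bm25, dense])
```

Because RRF uses only ranks, it needs no score normalization between retrievers whose raw scores live on different scales.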
A reranker scores query-document relevance more precisely.
Typical flow:
retrieve top 50
→ rerank
→ keep top 5-10
Reranking improves precision at the cost of latency.
Failures can happen at:
parsing
chunking
embedding
indexing
retrieval
reranking
prompt construction
generation
citation validation
Debug the stage.
Do not blame the model first.
Build a RAG system with:
BM25
dense retrieval
hybrid retrieval
RRF
reranker
citations
eval set
trace logging
You pass this layer if you can separate retrieval failure from generation failure.
Build controlled tool-using LLM systems.
An agent is a system where an LLM can choose actions.
Examples:
- call tools
- query databases
- browse documents
- write files
- send emails
- schedule actions
- run code
- ask for approval
The danger:
More autonomy means more failure surface.
Use deterministic workflows when the steps are known.
Use agents when the path must be chosen dynamically.
Rule:
Workflow first.
Agent only where decision flexibility is needed.
Router: chooses which path or model to use.
Tool caller: calls external functions with structured arguments.
Planner: breaks a task into steps.
Executor: performs actions.
Verifier: checks output.
Approval gate: stops risky actions before execution.
Agents need state.
State may include user goal, current plan, completed steps, tool outputs, constraints, errors, budget, and approval status.
Memory must be controlled.
Unbounded memory creates confusion and security risk.
- infinite loops
- tool misuse
- wrong tool arguments
- stale memory
- prompt injection
- unauthorized action
- hidden cost explosion
- hallucinated tool results
- invalid final answer
Build a bounded agent:
planner
tool registry
schema validation
executor
verifier
retry limit
cost limit
approval gate
trace log
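A skeleton of such a bounded agent is sketched below. The plan is passed in as a fixed list, standing in for planner output, and the tool names, costs, and registry layout are all hypothetical. Schema validation here only checks required argument names; a real system would validate types and values too.

```python
def run_bounded_agent(plan, tools, approve, max_steps=5, max_cost=1.0):
    """plan: list of (tool_name, args) steps, e.g. produced by a planner.
    tools: registry mapping name -> {"fn", "required", "cost", "risky"}.
    approve: callback deciding whether a risky call may run."""
    trace, spent = [], 0.0
    for step, (name, args) in enumerate(plan):
        if step >= max_steps:
            trace.append(("stopped", "step limit"))
            break
        spec = tools.get(name)
        if spec is None:
            trace.append(("rejected", f"unknown tool {name}"))
            break
        missing = [key for key in spec["required"] if key not in args]
        if missing:
            trace.append(("rejected", f"missing args {missing}"))
            break
        if spent + spec["cost"] > max_cost:
            trace.append(("stopped", "cost limit"))
            break
        if spec["risky"] and not approve(name, args):
            trace.append(("blocked", f"{name} not approved"))
            break
        spent += spec["cost"]
        trace.append(("ok", name, spec["fn"](**args)))
    return trace, spent

tools = {
    "search": {"fn": lambda query: f"results for {query}",
               "required": ["query"], "cost": 0.01, "risky": False},
    "send_email": {"fn": lambda to, body: f"sent to {to}",
                   "required": ["to", "body"], "cost": 0.02, "risky": True},
}
plan = [("search", {"query": "kv cache"}),
        ("send_email", {"to": "ops@example.com", "body": "summary"})]
trace, spent = run_bounded_agent(plan, tools, approve=lambda name, args: False)
```

Every exit path writes a trace entry before stopping, so a blocked or rejected step is visible afterwards instead of failing silently.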
You pass this layer if your agent can fail safely.
Measure LLM system quality before users discover failures.
Measure model behavior directly.
Examples: factuality, reasoning, coding, summarization, instruction following.
Measure retrieval and grounded generation.
Metrics:
- context precision
- context recall
- faithfulness
- answer relevance
- citation correctness
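The first two metrics have a simple set-based form when each eval case is labeled with the chunk IDs that actually contain the answer. The definitions below are that simple form, stated as an assumption: some evaluation frameworks use rank-aware or LLM-judged variants under the same names. The chunk IDs are invented.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c2", "c4", "c9"}
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
```

Low recall points at retrieval (chunking, embedding, indexing). High recall with a wrong answer points at prompt construction or generation.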
Measure action quality.
Metrics:
- task success
- tool correctness
- invalid tool calls
- loop rate
- approval violations
- cost per task
Measure real-world operation.
Metrics:
- latency
- error rate
- user correction rate
- fallback rate
- escalation rate
- cost
- safety incidents
A golden dataset is a curated set of cases representing expected behavior.
It should include easy cases, hard cases, edge cases, adversarial cases, outdated info cases, ambiguous cases, negative cases, and refusal cases.
LLM judges can help, but they must be controlled.
Use clear rubrics, pairwise comparisons, calibration examples, human-reviewed samples, and judge agreement checks.
Never blindly trust judge scores.
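One concrete judge agreement check is Cohen's kappa between judge labels and human labels on the same sample, which corrects raw agreement for what chance alone would produce. The labels below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
```

Here raw agreement is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance with balanced labels.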
Every production change should run evals.
Changes include model update, prompt change, retrieval change, reranker change, chunking change, tool change, quantization change, and serving engine change.
Build an eval harness:
dataset
input
expected behavior
retrieved context
model output
judge rubric
latency
cost
failure category
release decision
You pass this layer if every major system change has a measurable before/after result.
Design LLM systems that survive real users, real latency, real cost, and real failure.
client
→ API gateway
→ authentication
→ rate limiting
→ request logger
→ prompt builder
→ router
→ retrieval service
→ tool service
→ model gateway
→ serving engine
→ response validator
→ trace store
→ eval pipeline
→ monitoring dashboard
A model gateway abstracts access to multiple models.
It handles routing, fallbacks, retries, budget policies, provider abstraction, model versioning, logging, and safety checks.
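The retry and fallback part of a gateway can be sketched with plain callables. The backend names and the two stub functions are hypothetical stand-ins for real model clients; a production gateway would also add timeouts, backoff, budget checks, and per-error-type handling.

```python
def call_with_fallback(prompt, backends, max_retries=1):
    """backends: ordered list of (name, callable).
    Returns (backend_name, output, attempts_log)."""
    attempts = []
    for name, call in backends:
        for attempt in range(max_retries + 1):
            try:
                output = call(prompt)
                attempts.append((name, attempt, "ok"))
                return name, output, attempts
            except Exception as exc:
                # Log the failure, then retry or move to the next backend.
                attempts.append((name, attempt, f"error: {exc}"))
    raise RuntimeError(f"all backends failed: {attempts}")

def flaky_primary(prompt):
    raise TimeoutError("primary timed out")

def stable_fallback(prompt):
    return f"echo: {prompt}"

name, output, attempts = call_with_fallback(
    "hello", [("primary", flaky_primary), ("fallback", stable_fallback)])
```

The attempts log belongs in the trace: a rising fallback rate is often the first visible sign that a primary model or provider is degrading.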
Track:
prompt
model
version
latency
tokens in
tokens out
retrieved chunks
tool calls
errors
cost
user feedback
eval score
Without traces, debugging becomes guessing.
Production LLM systems must handle prompt injection, indirect prompt injection, tool abuse, data exfiltration, PII leakage, retrieval poisoning, unauthorized access, tenant isolation, and audit logging.
Cost comes from input tokens, output tokens, model size, inference engine, GPU utilization, concurrency, reranking, embeddings, tool calls, retries, logging, and evaluation.
Cost must be measured per task, not only per token.
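Per-task cost is the sum over every model call the task triggered, including retries, plus non-model overhead. The sketch below uses made-up prices per million tokens and a made-up overhead figure; substitute your provider's rates or your amortized GPU cost.

```python
def call_cost(tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Cost of one model call, with prices quoted per million tokens."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

def task_cost(calls, price_in_per_m, price_out_per_m, fixed_overhead=0.0):
    """Total cost of one task across all of its model calls."""
    total = sum(call_cost(c["tokens_in"], c["tokens_out"],
                          price_in_per_m, price_out_per_m) for c in calls)
    return total + fixed_overhead

# One RAG task: a first answer, then a retry after a failed validation.
calls = [
    {"tokens_in": 3000, "tokens_out": 500},
    {"tokens_in": 3500, "tokens_out": 200},
]
# Hypothetical prices: 1.0 per million input tokens, 4.0 per million output tokens.
cost = task_cost(calls, price_in_per_m=1.0, price_out_per_m=4.0,
                 fixed_overhead=0.0005)   # e.g. embeddings and reranking
```

In this example the retry nearly doubles the model cost of the task, which per-token pricing alone would never reveal.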
Design a complete production architecture document:
system diagram
data flow
model flow
failure modes
security controls
observability plan
cost model
eval gates
scaling plan
rollback plan
You pass this layer if you can review an LLM architecture and identify reliability, security, cost, and quality risks.
Learn:
- vision-language models
- audio-language models
- document understanding
- OCR pipelines
- image embeddings
- video frame sampling
- multimodal RAG
- visual grounding
- multimodal evals
Build:
PDF/image ingestion
OCR
layout extraction
visual chunking
text + image retrieval
grounded answer generation
citation to page/region
Learn:
- prompt adaptation
- RAG
- SFT
- LoRA
- QLoRA
- continued pretraining
- domain-specific tokenization
- ontology grounding
- terminology normalization
- legal/medical/financial evals
Decision hierarchy:
prompting
→ RAG
→ SFT/LoRA
→ continued pretraining
Continued pretraining is expensive and should not be the default.
Learn:
- prompt injection
- jailbreaks
- indirect prompt injection
- retrieval poisoning
- tool abuse
- sandboxing
- output validation
- permission boundaries
- audit trails
- secure agent design
Build:
red-team suite
prompt injection tests
tool misuse tests
retrieval poisoning tests
PII leakage tests
policy bypass tests
Learn:
- HBM bandwidth
- tensor cores
- CUDA kernels
- FlashAttention
- NCCL
- NVLink
- PCIe
- tensor parallelism
- pipeline parallelism
- expert parallelism
- GPU memory fragmentation
Build:
hardware fit calculator
model memory estimator
KV cache estimator
throughput benchmark
GPU utilization dashboard
Use this template for every paper:
Claim:
Mechanism:
What changed:
What stayed constant:
Dataset:
Compute:
Ablation:
Metric:
Weakness:
Reproducibility:
Production implication:
The goal is to identify the mechanism, not to memorize the title.
Build these artifacts to prove competence.
| ID | Artifact | Purpose |
|---|---|---|
| 01 | Tiny Transformer | Understand token generation mechanically |
| 02 | Tokenizer Comparison Notebook | Measure tokenizer impact across languages/domains |
| 03 | Mini Pretraining Pipeline | Understand data, tokenization, loss, and sampling |
| 04 | SFT Experiment | Learn instruction tuning |
| 05 | DPO/Preference Experiment | Learn preference optimization |
| 06 | Reasoning Eval Harness | Compare normal vs reasoning models |
| 07 | Inference Benchmark Suite | Measure TTFT, TPOT, latency, throughput |
| 08 | Serving Engine Matrix | Compare vLLM, SGLang, TensorRT-LLM, llama.cpp |
| 09 | KV Cache Calculator | Estimate serving memory |
| 10 | Quantization Benchmark | Measure quality/cost tradeoffs |
| 11 | Production RAG System | Ground answers with retrieval and citations |
| 12 | Agent Workflow | Build controlled tool use |
| 13 | Eval Dashboard | Track quality, latency, cost, safety |
| 14 | Production Architecture Diagram | Design deployable platform |
| 15 | Security Red-Team Suite | Test prompt injection and tool abuse |
| 16 | Cost Model | Estimate per-task and platform-level cost |
| 17 | Paper Review Database | Build research literacy |
Recommended structure:
llm-systems-engineering-roadmap/
│
├── README.md
├── LICENSE
├── roadmap/
│   ├── 01_llm_foundations.md
│   ├── 02_training_pipeline.md
│   ├── 03_post_training.md
│   ├── 04_reasoning_models.md
│   ├── 05_inference_fundamentals.md
│   ├── 06_serving_engines.md
│   ├── 07_kv_cache_long_context.md
│   ├── 08_quantization_compression.md
│   ├── 09_rag_systems.md
│   ├── 10_agentic_systems.md
│   ├── 11_evaluation_benchmarking.md
│   └── 12_production_architecture.md
│
├── artifacts/
│   ├── tiny_transformer/
│   ├── tokenizer_comparison/
│   ├── mini_pretraining/
│   ├── post_training/
│   ├── reasoning_eval/
│   ├── inference_benchmark/
│   ├── kv_cache_calculator/
│   ├── quantization_benchmark/
│   ├── rag_system/
│   ├── agent_workflow/
│   ├── eval_dashboard/
│   └── production_architecture/
│
├── templates/
│   ├── paper_review_template.md
│   ├── model_eval_template.md
│   ├── rag_eval_template.md
│   ├── agent_eval_template.md
│   ├── architecture_review_template.md
│   └── cost_model_template.md
│
├── resources/
│   ├── papers.md
│   ├── docs.md
│   ├── courses.md
│   ├── tools.md
│   └── benchmarks.md
│
└── checklists/
    ├── model_selection_checklist.md
    ├── rag_production_checklist.md
    ├── agent_safety_checklist.md
    ├── inference_benchmark_checklist.md
    ├── quantization_checklist.md
    └── production_readiness_checklist.md
You are not done when you read the chapters.
You are done when you can produce these outputs.
Can implement and explain a tiny Transformer.
Can trace one token generation.
Can explain tokenization, logits, sampling, and KV cache.
Can build a mini pretraining loop.
Can explain data mixture, loss, scaling, and contamination.
Can compare SFT, RLHF, DPO, GRPO, and RLAIF.
Can choose adaptation method based on failure mode.
Can evaluate when reasoning models help.
Can measure accuracy vs latency/cost.
Can benchmark TTFT, TPOT, throughput, p95 latency, and VRAM.
Can choose a serving engine based on workload and hardware.
Can estimate KV cache memory and explain long-context tradeoffs.
Can evaluate quantization quality against domain tasks.
Can build and debug hybrid retrieval with citations and evals.
Can build bounded tool-using workflows with safe failure behavior.
Can build regression evals and release gates.
Can design secure, observable, scalable LLM architecture.
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Hugging Face Tokenizer Summary: https://huggingface.co/docs/transformers/tokenizer_summary
- Hugging Face Transformers: https://huggingface.co/docs/transformers/index
- PyTorch Transformer Reference: https://docs.pytorch.org/docs/stable/generated/torch.nn.Transformer.html
- Hugging Face TRL: https://huggingface.co/docs/trl/index
- Hugging Face PEFT: https://huggingface.co/docs/peft/index
- LoRA: https://arxiv.org/abs/2106.09685
- QLoRA: https://arxiv.org/abs/2305.14314
- DPO: https://arxiv.org/abs/2305.18290
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- vLLM Docs: https://docs.vllm.ai/en/latest/
- SGLang Docs: https://docs.sglang.ai/
- TensorRT-LLM Docs: https://docs.nvidia.com/tensorrt-llm/index.html
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Hugging Face TGI: https://huggingface.co/docs/text-generation-inference/index
- Ragas Metrics: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- BEIR Benchmark: https://github.com/beir-cellar/beir
- MS MARCO: https://microsoft.github.io/msmarco/
- Sentence Transformers: https://www.sbert.net/
- OpenAI Function Calling / Tools: https://platform.openai.com/docs/guides/function-calling
- LangGraph Workflows and Agents: https://docs.langchain.com/oss/python/langgraph/workflows-agents
- LangGraph Memory: https://docs.langchain.com/oss/python/langgraph/memory
- OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
Model selection checklist:
task type
language/domain
context length
latency target
cost target
quality target
tool use needed
reasoning needed
deployment mode
data privacy constraints
fine-tuning need
serving engine compatibility
quantization support
eval result
RAG production checklist:
document parser tested
chunking strategy validated
metadata schema defined
hybrid retrieval implemented
reranker tested
citations validated
permission filters enforced
freshness handled
RAG eval set built
retrieval failures categorized
generation failures categorized
latency measured
cost measured
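Two of the items above, hybrid retrieval and a RAG eval set, can be prototyped in a few lines. The sketch below fuses a lexical ranking and a dense ranking with reciprocal rank fusion (RRF), then scores the result with recall@k. RRF is one common fusion choice and is not prescribed by this roadmap; the document IDs and relevance labels are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical rankings from a lexical retriever and a dense retriever.
lexical = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
print(recall_at_k(fused, relevant={"d3", "d4"}, k=2))
```

Measuring recall@k separately from answer quality is what makes it possible to categorize a failure as a retrieval failure rather than a generation failure.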
Agent safety checklist:
tools have schemas
arguments validated
permissions enforced
dangerous actions require approval
retry limits exist
budget limits exist
tool outputs are logged
state is inspectable
prompt injection tests exist
fallback path exists
human escalation exists
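Several of these items can be enforced in one place: the code path that executes tool calls. The sketch below is a minimal bounded executor that covers schemas, argument validation, approval for dangerous actions, a call budget, and logging. `Tool` and `BoundedExecutor` are hypothetical names and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    schema: Dict[str, type]   # required argument name -> expected type
    dangerous: bool = False   # dangerous tools need explicit approval

class BoundedExecutor:
    def __init__(self, tools, max_calls=5, approve=None):
        self.tools = {t.name: t for t in tools}
        self.max_calls = max_calls
        # Deny dangerous actions unless an approval callback says otherwise.
        self.approve = approve or (lambda name, args: False)
        self.calls = 0
        self.log = []

    def call(self, name, args):
        result = self._dispatch(name, args)
        self.log.append({"tool": name, "args": args, "result": result})
        return result

    def _dispatch(self, name, args):
        if self.calls >= self.max_calls:
            return {"ok": False, "error": "budget exhausted"}
        self.calls += 1
        tool = self.tools.get(name)
        if tool is None:
            return {"ok": False, "error": "unknown tool"}
        for key, expected in tool.schema.items():
            if not isinstance(args.get(key), expected):
                return {"ok": False, "error": f"invalid argument: {key}"}
        if tool.dangerous and not self.approve(name, args):
            return {"ok": False, "error": "approval required"}
        try:
            return {"ok": True, "value": tool.fn(**args)}
        except Exception as exc:  # fail closed and surface the error to the caller
            return {"ok": False, "error": str(exc)}

tools = [
    Tool("add", lambda a, b: a + b, {"a": int, "b": int}),
    Tool("delete_file", lambda path: f"deleted {path}", {"path": str}, dangerous=True),
]
executor = BoundedExecutor(tools, max_calls=3)
r1 = executor.call("add", {"a": 1, "b": 2})
r2 = executor.call("add", {"a": "1", "b": 2})
r3 = executor.call("delete_file", {"path": "/tmp/x"})
r4 = executor.call("add", {"a": 1, "b": 2})
```

Every call is logged whether it succeeds or is denied, so the state stays inspectable. Denials return structured errors rather than raising, which gives the surrounding workflow a clear fallback or escalation signal.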
Inference benchmark checklist:
model version
precision
serving engine
GPU type
batch size
concurrency
prompt length
output length
TTFT
TPOT
p50 latency
p95 latency
p99 latency
tokens/sec
VRAM usage
GPU utilization
error rate
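The latency fields above can be computed from raw timestamps. The sketch below assumes each request records a start time and one arrival time per output token, and it uses the nearest-rank method for percentiles. The function names and sample numbers are illustrative only.

```python
import math

def ttft(request_start, token_times):
    """Time to first token: delay between request start and the first output token."""
    return token_times[0] - request_start

def tpot(token_times):
    """Time per output token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

def tokens_per_second(request_start, token_times):
    """End-to-end output throughput for a single request."""
    return len(token_times) / (token_times[-1] - request_start)

def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative single-request trace (seconds) and per-request latencies (ms).
request_start = 0.0
token_times = [0.5, 0.6, 0.7, 0.8]
latencies_ms = [120, 130, 110, 400, 125, 135, 128, 122, 131, 900]

print("TTFT:", ttft(request_start, token_times))
print("TPOT:", round(tpot(token_times), 3))
print("p50 / p95 / p99:", [percentile(latencies_ms, p) for p in (50, 95, 99)])
```

With only ten samples, p95 and p99 both collapse onto the slowest request. Tail percentiles only become meaningful with enough requests at the target concurrency.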
Quantization checklist:
baseline measured
method identified
weights/activations/KV specified
calibration data documented
serving engine compatible
quality evaluated
latency evaluated
VRAM evaluated
hard cases tested
format stability tested
rollback available
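To make the precision-for-cost tradeoff behind this checklist concrete, the sketch below quantizes a short list of floats to int8 with a single symmetric absmax scale and measures the round-trip error. This is the simplest possible scheme and is chosen only for illustration; production methods are more sophisticated, and the sample values are made up.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map the largest magnitude to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]

def max_abs_error(original, restored):
    """Worst-case element-wise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))

weights = [0.0, 0.5, -1.0, 1.27]          # illustrative values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print("codes:", codes)
print("max abs error:", max_abs_error(weights, restored))
```

For values inside the clipping range, the error is bounded by half the scale. A single outlier that inflates the scale therefore degrades precision for every other value, which is one reason hard cases need their own tests.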
Production readiness checklist:
authentication
authorization
tenant isolation
rate limiting
prompt logging policy
PII policy
retrieval permissions
model fallback
eval gate
monitoring
alerts
cost dashboard
security tests
rollback plan
incident response
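The eval gate item can be implemented as a comparison between a candidate release and a measured baseline. The sketch below assumes hypothetical metric names and relative tolerances; which metrics to gate on and how much regression to allow are product decisions, not values given by this roadmap.

```python
# Metrics where a larger value is worse (hypothetical names).
LOWER_IS_BETTER = {"p95_latency_ms", "cost_per_task_usd", "error_rate"}

def release_gate(baseline, candidate, tolerances):
    """Return (passed, failed_metrics) given relative tolerances per metric."""
    failed = []
    for metric, tolerance in tolerances.items():
        base, cand = baseline[metric], candidate[metric]
        if metric in LOWER_IS_BETTER:
            regressed = cand > base * (1 + tolerance)
        else:
            regressed = cand < base * (1 - tolerance)
        if regressed:
            failed.append(metric)
    return (not failed, failed)

baseline = {"accuracy": 0.80, "p95_latency_ms": 1000}
candidate = {"accuracy": 0.79, "p95_latency_ms": 1300}
tolerances = {"accuracy": 0.02, "p95_latency_ms": 0.10}

passed, failed = release_gate(baseline, candidate, tolerances)
print(passed, failed)
```

Here the small accuracy drop stays within tolerance, but the latency regression blocks the release. Wiring a check like this into deployment is what connects the eval gate to the rollback plan.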
Do not read it passively.
Use this loop:
study one layer
→ implement one artifact
→ measure it
→ write failure notes
→ create decision rules
→ move to next layer
For every topic, produce:
1. mechanism explanation
2. code or architecture artifact
3. benchmark or eval
4. failure mode list
5. decision rule
The roadmap is complete only when it changes your engineering decisions.
LLM foundations teach how tokens become predictions.
Training teaches where base capability comes from.
Post-training teaches how behavior is shaped.
Reasoning teaches when extra inference compute helps.
Inference teaches why latency and memory dominate.
Serving engines teach how runtime choices affect production.
KV cache teaches why context is expensive.
Quantization teaches how to trade precision for cost.
RAG teaches how to ground outputs.
Agents teach how to connect models to actions.
Evaluation teaches how to know if anything works.
Production architecture teaches how to make it survive real usage.
The professional standard is not “I know LLMs.”
The professional standard is:
I can design, measure, debug, and operate LLM systems under real constraints.