The most comprehensive system design resource built exclusively for .NET developers.
Every concept explained with .NET context, real C# implementations, and production-ready patterns.
I built this because most system design resources use Java/Python examples and honestly that always bugged me as a .NET dev. Here, every concept maps directly to the libraries, tools, and patterns you'd use in real production C# apps.
If you're new to system design, start here: System Design was HARD until I Learned these 30 Concepts. Trust me, it makes everything click.
Fundamentals — the concepts every .NET developer must know
- Core Concepts — scalability, availability, CAP theorem, consistency
- Networking Fundamentals — DNS, HTTP, TCP/UDP, load balancing
- API Fundamentals — REST, GraphQL, versioning, rate limiting
- Database Fundamentals — SQL vs NoSQL, indexing, sharding, replication
- Caching Fundamentals — strategies, eviction, CDN, distributed caching
Architecture & Patterns — how to structure distributed .NET systems
- Asynchronous Communication — pub/sub, queues, CDC, backpressure
- Distributed Systems & Microservices — service discovery, consensus, tracing
- Architectural Patterns — microservices, CQRS, event sourcing, hexagonal
- Design Patterns for Distributed Systems — saga, outbox, bulkhead, sidecar
Production Engineering — building, shipping, and running .NET systems
- Security Fundamentals — auth, JWT, OAuth, encryption, zero trust
- Observability & Monitoring — logs, metrics, traces, golden signals
- DevOps & Deployment Patterns — blue-green, canary, feature flags, IaC
- Performance & Optimization — ASP.NET Core, EF Core, caching tips
- Testing Distributed Systems — integration testing, chaos engineering
- Migration Strategies — zero-downtime migrations, strangler fig, CDC
Interview Preparation — ace the system design interview
- System Design Tradeoffs — the tradeoffs interviewers want to hear
- Back-of-Envelope Estimation — latency numbers, storage math, quick tricks
- Interview Tips & Tricks — framework, tips, common mistakes
- Common Mistakes & Anti-Patterns — what NOT to do
- .NET Architecture Decision Guide — when to use what
- Data Structures You Should Know — the ones that come up in interviews
Implementations & Resources — code, case studies, and learning materials
- C# Implementations — 11 runnable algorithms with detailed docs
- Interview Problems — 45+ problems sorted by difficulty
- Real-World Case Studies — Netflix, Uber, Stripe, WhatsApp
- .NET Resources — books, libraries, YouTube, newsletters
- Engineering Articles — 12 must-read articles from top companies
- Research Papers — 13 foundational distributed systems papers
- Quick Reference Cheat Sheets — HTTP codes, capacity numbers, ports
The building blocks. Everything else in system design builds on top of these — skip them and you'll struggle with the rest.
Basically — can your system handle more users without falling over? There are two ways to scale:
- Vertical scaling: throw a bigger machine at it (more CPU, more RAM). Simple, but there's a ceiling.
- Horizontal scaling: add more machines and spread the load. More complex but basically unlimited.
In .NET land, this means keeping your ASP.NET Core services stateless so you can just spin up more instances behind a load balancer. If you need shared state, use IDistributedCache with Redis instead of sticking stuff in memory.
- Deep dive: Scalability
- In .NET: ASP.NET Core is stateless by default (which is great). Use `IDistributedCache` for shared state, Azure Service Bus for decoupling, Kubernetes HPA for auto-scaling your containers.
How much of the time is your system actually working? We measure this in "nines": 99.9% (three nines) sounds great until you realize that's still about 8.76 hours of downtime a year. For something like a payment system, even that might be too much.
- Deep dive: Availability
- In .NET: Health checks via `Microsoft.Extensions.Diagnostics.HealthChecks` (the `/health` endpoint pattern), Azure Traffic Manager for geo-redundant failover, Polly for retry and circuit breaker patterns. If you're building anything serious, you want all three.
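The "nines" arithmetic above is easy to check yourself. Here is a minimal sketch that converts an availability percentage into allowed downtime per year; the `Availability` class and `DowntimePerYear` method are names made up for this example, and it assumes a 365-day year.

```csharp
using System;

// Hypothetical helper for illustration; not part of any library.
public static class Availability
{
    // Allowed downtime per 365-day year for a given availability percentage.
    public static TimeSpan DowntimePerYear(double availabilityPercent)
    {
        if (availabilityPercent < 0 || availabilityPercent > 100)
            throw new ArgumentOutOfRangeException(nameof(availabilityPercent));

        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        return TimeSpan.FromHours(365 * 24 * unavailableFraction);
    }
}
```

`Availability.DowntimePerYear(99.9)` comes out at roughly 8.76 hours, and each extra nine divides that by ten, so four nines leaves you under an hour a year.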
Slightly different from availability. A system is reliable if it does the right thing even when stuff goes wrong: hardware fails, a deploy goes bad, someone fat-fingers a config. It's not just about being "up", it's about being correct under failure.
- Deep dive: Reliability
- In .NET: Polly retry + circuit-breaker policies, idempotent message handlers with MassTransit or NServiceBus, transactional outbox pattern in EF Core. The outbox pattern especially: if you're doing async messaging and not using it, you're probably losing messages and don't know it yet.
Any component where if it dies, your whole system goes down. Classic ones in .NET apps:
- Single SQL Server instance → fix with Always On Availability Groups
- Single Redis node → fix with Redis Sentinel or Azure Cache clustering
- Single API gateway → fix with multiple YARP instances behind Azure Load Balancer
The whole point of system design is basically eliminating these one by one.
People mix these up all the time so let's be clear:
- Latency: how long ONE request takes (e.g. 50ms for your API to respond)
- Throughput: how many requests per second your system can handle (e.g. 10k RPS per Kestrel instance)
- Bandwidth: the pipe size — maximum data you can push through the network (eg. 10 Gbps)
You can have great throughput but terrible latency (batch processing), or great latency but low throughput (single-threaded server).
- Deep dive: Latency vs Throughput
- In .NET: Fun fact: ASP.NET Core on Kestrel has posted 7M+ requests per second in the TechEmpower plaintext benchmarks. Use `Stopwatch` or `System.Diagnostics.Activity` for latency measurement, OpenTelemetry .NET SDK for distributed tracing across services.
Ok this one is really cool. Normal hashing (key % N) breaks horribly when you add or remove a server — basically ALL keys get remapped. Consistent hashing fixes this by putting servers on a virtual ring, so when you add/remove a server only ~K/N keys need to move (K = total keys, N = total servers).
This is the backbone of distributed caches and databases. If you've ever wondered how Redis Cluster knows which node owns which key, it's consistent hashing (well, hash slots technically, but same idea).
- Deep dive: Consistent Hashing
- In .NET: Microsoft Orleans uses this for grain placement. Redis Cluster uses 16384 hash slots. I've included a full C# implementation in this repo. Go check it out, it's actually pretty elegant.
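To make the ring idea concrete, here is a minimal sketch of a consistent hash ring with virtual nodes. It is not the implementation from this repo: the `ConsistentHashRing` class, the 100-virtual-node default, and the SHA-256-based hash are all assumptions picked for the example, and the lookup is a linear scan for readability (a real one would binary-search a sorted array).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative sketch only; names and defaults are hypothetical.
public sealed class ConsistentHashRing
{
    // Ring position -> owning node, kept sorted by position.
    private readonly SortedDictionary<uint, string> _ring = new();
    private readonly int _virtualNodes;

    public ConsistentHashRing(int virtualNodes = 100) => _virtualNodes = virtualNodes;

    // Each physical node is placed on the ring many times to even out the load.
    public void AddNode(string node)
    {
        for (int i = 0; i < _virtualNodes; i++)
            _ring[Hash($"{node}#{i}")] = node;
    }

    public void RemoveNode(string node)
    {
        for (int i = 0; i < _virtualNodes; i++)
            _ring.Remove(Hash($"{node}#{i}"));
    }

    // Walk clockwise from the key's position to the first virtual node.
    public string GetNode(string key)
    {
        if (_ring.Count == 0) throw new InvalidOperationException("No nodes on the ring.");

        uint position = Hash(key);
        foreach (var entry in _ring) // iterated in ascending position order
            if (entry.Key >= position) return entry.Value;

        return _ring.First().Value;  // wrapped past the end of the ring
    }

    private static uint Hash(string value)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
        return BitConverter.ToUInt32(digest, 0);
    }
}
```

The property to look for: after `AddNode("D")` on a ring that already has A, B and C, every key that changes owner moves to D, and keys never shuffle between the old nodes.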
The famous "pick 2 out of 3" theorem for distributed systems:
- Consistency: every read gets the most recent write
- Availability: every request gets a response (no errors)
- Partition Tolerance: system works even when network between nodes breaks
The thing people often miss: network partitions WILL happen, it's not optional. So really you're choosing between CP (consistent but might reject requests during partition) or AP (always responds but might give you stale data).
- Deep dive: CAP Theorem
- In .NET: SQL Server Always On with synchronous commit = CP. Azure Cosmos DB is interesting because it lets you choose on a sliding scale from Strong (CP) to Eventual (AP). Redis replication is AP by default — your read replicas might be slightly behind.
Most people stop at CAP but there's actually a more useful extension called PACELC. It says:
- If there's a Partition, choose between Availability and Consistency (same as CAP)
- Else (no partition, normal operation), choose between Latency and Consistency
This is actually more practical because most of the time your system is NOT partitioned. So the real question is: during normal operations, do you want low latency or strong consistency? Cosmos DB's consistency levels are basically a PACELC slider.
When the primary server dies, something needs to take over. Two flavors:
- Active-Passive: standby server is just sitting there waiting. When the primary dies, it kicks in. Simple, but there's a gap during switchover.
- Active-Active: both servers handle traffic simultaneously. No downtime on failure, but way more complex (you need to deal with data conflicts).
- In .NET: SQL Server Always On Failover Cluster, Azure App Service deployment slots (great for zero-downtime deploys), Azure Front Door for global failover.
Goes beyond failover — can your system keep working (maybe at reduced capacity) when things partially break? Like if one microservice goes down, does the whole app crash or does it gracefully degrade?
- Deep dive: Fault Tolerance
- In .NET: This is where Polly really shines. Chain together retry → circuit breaker → timeout → fallback in a `ResiliencePipeline`. Combine with `IHttpClientFactory` resilience handlers. Your middleware pipeline in ASP.NET Core should have graceful degradation baked in from day one.
I mentioned this in API fundamentals too but it's worth repeating here because it's one of the most underrated concepts in system design. An operation is idempotent if doing it multiple times has the same effect as doing it once.
Why does this matter? Because in distributed systems, messages get delivered more than once. Networks retry. Users double-click buttons. Queues redeliver. If your operations aren't idempotent, you get duplicate orders, double charges, and other fun stuff.
Tip for interviews: whenever you design a write operation in a system design interview, mention idempotency. Interviewers love it because it shows you think about real-world failure modes.
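Here is a rough in-process sketch of the idea: remember the result per idempotency key and replay it on retries. `IdempotentProcessor` is a made-up name, and the in-memory dictionary is an assumption for the example; in a real service the key store would be something shared like Redis or SQL.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative sketch; a real system would persist keys in a shared store.
public sealed class IdempotentProcessor<TResult>
{
    private readonly ConcurrentDictionary<string, Lazy<TResult>> _results = new();

    // Runs the operation once per key; later calls with the same key get the stored result.
    public TResult Execute(string idempotencyKey, Func<TResult> operation)
    {
        // Lazy<T> guarantees the operation body runs at most once, even if two
        // threads race on the same key. Note that a thrown exception is cached too.
        Lazy<TResult> entry = _results.GetOrAdd(idempotencyKey, _ => new Lazy<TResult>(operation));
        return entry.Value;
    }
}
```

Calling `Execute("order-42", ...)` twice charges the customer once; the second call just returns whatever the first one produced.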
Every request from your Blazor frontend to your ASP.NET Core API traverses these layers. When something breaks, knowing where to look saves hours of debugging.
- OSI Model — the 7-layer model. Your C# code lives at Layer 7 (Application). Kestrel handles layers 4-7. Cloud load balancers can operate at Layer 4 (TCP, faster) or Layer 7 (HTTP, smarter routing).
- IP Addresses — IPv4 (32-bit) and IPv6 (128-bit). In .NET use `System.Net.IPAddress` for parsing. Azure VNets use CIDR notation for subnets.
- DNS — translates domain names to IPs. Here's a gotcha: a long-lived `HttpClient` keeps reusing pooled connections, so it won't notice DNS changes. If your backend IPs change (common with cloud deployments), set `SocketsHttpHandler.PooledConnectionLifetime` to control refresh. Learned this one the hard way.
- Proxy vs Reverse Proxy — a forward proxy sits in front of clients (VPN, content filtering). A reverse proxy sits in front of servers (load balancing, SSL termination). YARP is Microsoft's reverse proxy for .NET: really performant and easy to configure.
- HTTP/HTTPS — Kestrel serves HTTP/1.1 and HTTP/2 by default over TLS. HTTP/3 (QUIC) is available in .NET 7+. TLS can be handled by Kestrel directly or offloaded to Azure Application Gateway.
- TCP vs UDP — TCP for reliable ordered delivery (HTTP, database connections). UDP for speed when you can tolerate some loss (DNS, video streaming, game servers). .NET has `TcpClient`/`TcpListener` and `UdpClient` in `System.Net.Sockets`.
- Load Balancing — distributing traffic across servers. I've implemented 5 different algorithms in C# in this repo with detailed explanations for each one.
- Checksums — verifying data integrity. .NET has `System.IO.Hashing` (XxHash, CRC32) and `System.Security.Cryptography` (SHA256, MD5).
Bad API design haunts you for years — once clients depend on it, changing it is painful. Get these fundamentals right from the start.
- What is an API — the contract between systems. In .NET you've got options: Minimal APIs (lightweight), Controllers (traditional), gRPC services (high performance internal).
- API Gateway — single entry point that handles routing, auth, rate limiting, aggregation. In .NET: YARP (my favorite, super flexible), Ocelot (popular but less maintained), Azure API Management (fully managed), or honestly just custom ASP.NET Core middleware for simpler cases.
- REST vs GraphQL — REST is resource-based with multiple endpoints. GraphQL is one endpoint where the client specifies exactly what data it wants. For .NET: ASP.NET Core Web API for REST, Hot Chocolate or GraphQL.NET for GraphQL. Most teams should just use REST unless they have a really good reason for GraphQL.
- WebSockets — full-duplex communication over a single TCP connection. In .NET, just use SignalR; it handles WebSockets with automatic fallbacks. For massive scale there's Azure SignalR Service. Only use raw `System.Net.WebSockets` if you really need a custom protocol.
- Webhooks — server-to-server push notifications via HTTP POST. Your ASP.NET Core API receives webhook payloads from Stripe, GitHub, etc. Always validate HMAC signatures; don't just blindly trust incoming payloads.
- Idempotency — making operations safe to retry. This is CRITICAL for payments and order creation. Pattern: accept an `IdempotencyKey` header, store it in Redis/SQL, check before processing. Sounds simple but most people skip it until they get bitten by duplicate charges.
- Rate Limiting — protecting your API from abuse and accidental DDoS from misbehaving clients. ASP.NET Core 7+ has built-in rate limiting (`Microsoft.AspNetCore.RateLimiting`). I've implemented all 5 major algorithms in C# in this repo.
- API Design — consistent naming, proper HTTP status codes, pagination, versioning. In .NET use the `[ApiController]` attribute, `ProblemDetails` for errors (please don't return custom error formats), and Swagger via Swashbuckle or NSwag.
- API Versioning — you WILL need to change your API. Version from day 1. In .NET: the `Asp.Versioning.Http` package supports URL path (`/v1/users`), query string (`?api-version=1.0`), and header-based versioning. URL path is the most common and easiest to understand.
- Pagination — never return unbounded lists. Three approaches: offset-based (`?page=2&size=20`), cursor-based (`?after=abc123`), keyset pagination. Cursor-based is best for large datasets because with offset the database still has to read and throw away every skipped row. In EF Core: `.OrderBy(x => x.Id).Skip(offset).Take(limit)` for offset, or `.Where(x => x.Id > lastId).OrderBy(x => x.Id).Take(limit)` for keyset.
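To see the two pagination styles from the last bullet side by side, here is a small sketch that runs the same LINQ shapes against an in-memory list. The `User` record and `Paging` class are hypothetical; with EF Core you would run the same operators on an `IQueryable` so they translate to SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
public sealed record User(int Id, string Name);

public static class Paging
{
    // Offset-based: skips (page - 1) * size rows, then takes one page.
    public static List<User> OffsetPage(IEnumerable<User> users, int page, int size) =>
        users.OrderBy(u => u.Id).Skip((page - 1) * size).Take(size).ToList();

    // Keyset: seeks straight past the last Id the client saw, no skipping.
    public static List<User> KeysetPage(IEnumerable<User> users, int lastSeenId, int size) =>
        users.Where(u => u.Id > lastSeenId).OrderBy(u => u.Id).Take(size).ToList();
}
```

Both calls return the same rows for page 2 here, but against a real database the keyset query can use the index on `Id` to jump to the right spot, while the offset query pays for every row it skips.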
Getting the database wrong is expensive — both in performance and in "oh god we need to migrate 500 million rows." These fundamentals prevent costly mistakes.
- ACID Transactions — Atomicity, Consistency, Isolation, Durability. EF Core supports this via `DbContext.Database.BeginTransactionAsync()`. For distributed transactions across microservices, look into the Saga pattern (MassTransit and NServiceBus both support this). Don't try to do distributed transactions with `TransactionScope` across services; it doesn't work the way you think it does.
- SQL vs NoSQL — SQL (SQL Server, PostgreSQL) for structured relational data with ACID guarantees. NoSQL (Cosmos DB, MongoDB, Redis) for flexible schemas and horizontal scaling. The honest answer is most apps should start with SQL and add NoSQL where needed. EF Core supports both SQL Server and Cosmos DB which is pretty nice.
- Database Indexes — B-tree structures that make queries fast but writes slower. In EF Core: `entity.HasIndex(e => e.Email).IsUnique()`. Rule of thumb: index columns in your WHERE, JOIN, and ORDER BY clauses. But don't over-index; every index slows down writes and takes storage.
- Database Sharding — splitting data across multiple databases by a shard key (user ID, tenant ID, etc). In .NET: Azure SQL Elastic pools, Cosmos DB does this automatically with partition keys, or you can roll your own with a `DbContext` factory that routes to the right shard. Only do this when you actually need it; premature sharding is a nightmare.
- Data Replication — copying data to multiple nodes for availability and read scaling. Synchronous replication = strong consistency but higher latency. Async replication = eventual consistency but lower latency. In .NET: SQL Server Always On, Cosmos DB multi-region, Redis replication.
- Database Scaling — vertical (bigger machine), horizontal (more machines via sharding + replication), read replicas (split reads and writes). In .NET you can use `DbContextOptionsBuilder` with connection string routing for read/write splitting.
- Database Types — Relational (SQL Server, PostgreSQL), Document (Cosmos DB, MongoDB), Key-Value (Redis), Column-Family (Cassandra), Graph (Neo4j), Time-Series (InfluxDB), Search (Elasticsearch). All have .NET client libraries. Pick based on your access patterns, not hype.
- Bloom Filters — probabilistic data structure. Tells you "definitely not in the set" or "probably in the set". Used to avoid expensive DB lookups. .NET doesn't have one built-in but there are NuGet packages like `BloomFilter.NetCore`.
- Database Architectures — Active-Active (both nodes accept writes, need conflict resolution) vs Active-Passive (one writer, replicas for reads). Cosmos DB supports both.
- N+1 Query Problem — the most common performance killer in .NET apps using EF Core. You load a list of orders, then for each order you load the customer. That's N+1 queries instead of 1. Fix: use `.Include(o => o.Customer)` for eager loading, or project with `.Select()`. If your app is slow, check for N+1 queries first. Seriously, it's almost always the answer.
- Connection Pooling — opening a DB connection is expensive (~20-50ms). Connection pools keep connections open and reuse them. In .NET, ADO.NET pools connections by default. But be careful with `DbContext` lifetime: use `AddDbContext` (scoped), not singleton, or you'll exhaust the pool under load.
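The Bloom filter bullet above is easier to trust once you see how little code is involved. This is a minimal sketch, not the `BloomFilter.NetCore` API: the `BloomFilter` class, its constructor parameters, and the SHA-256 double-hashing scheme are assumptions made for the example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative sketch; sizes and hashing scheme are arbitrary choices.
public sealed class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item)) _bits[index] = true;
    }

    // false => definitely not in the set; true => probably in the set.
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index]) return false;
        return true;
    }

    // Double hashing: index_i = (h1 + i * h2) mod bitCount.
    private IEnumerable<int> Indexes(string item)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(item));
        ulong h1 = BitConverter.ToUInt64(digest, 0);
        ulong h2 = BitConverter.ToUInt64(digest, 8) | 1; // keep the step non-zero
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((h1 + (ulong)i * h2) % (ulong)_bits.Length);
    }
}
```

The typical use is to call `MightContain` before hitting the database: a `false` lets you skip the lookup entirely, while a `true` still needs the real query because false positives are possible.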
The single biggest performance win in most .NET apps. But stale data bugs are incredibly hard to track down, so understand the tradeoffs.
- What is Caching — storing frequently accessed data in a faster layer. In .NET you've got: in-memory (`IMemoryCache`), distributed (`IDistributedCache` backed by Redis or SQL), HTTP response caching, and the newer output caching middleware.
- Caching Strategies — Cache-Aside (most common in .NET, check cache first, miss → load from DB → store in cache). Write-Through (write to cache AND DB at the same time). Write-Behind (write to cache, async flush to DB; risky but fast). Read-Through (cache itself fetches from DB on miss).
- Cache Eviction — LRU (least recently used, which is roughly how `MemoryCache` picks victims within a priority level when it compacts), LFU (least frequently used), TTL (time based expiry). In .NET: `MemoryCacheEntryOptions` gives you `AbsoluteExpiration`, `SlidingExpiration`, and `Size` limits. Always set a TTL; unbounded caches will eventually eat all your memory.
- Distributed Caching — when your app runs on multiple servers, in-memory cache on each one gets out of sync. That's when you need Redis (`StackExchange.Redis`) or SQL-backed cache (`Microsoft.Extensions.Caching.SqlServer`). Azure Cache for Redis gives you managed Redis with clustering.
- CDN — caches static content (images, JS, CSS) at edge locations worldwide. In .NET: Azure CDN or Azure Front Door. Set cache headers in ASP.NET Core with `[ResponseCache(Duration = 3600)]`.
- Cache Stampede / Thundering Herd — when a popular cache key expires, hundreds of requests simultaneously hit the database trying to repopulate it. Fixes: use lock-based refresh (only one request rebuilds the cache, others wait), or set a "soft TTL" where the cache is refreshed in the background before actual expiry. In .NET, use `SemaphoreSlim` or a Redis distributed lock for this.
- Cache Invalidation — Phil Karlton famously said "there are only two hard things in computer science: cache invalidation and naming things." He was right. Event-driven invalidation (publish a message when data changes) is better than TTL-only. In .NET: use MassTransit to publish a `CacheInvalidated` event when entities change.
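Cache-aside and the lock-based stampede fix from the list above fit together naturally. Below is a minimal single-process sketch: `StampedeSafeCache` and `GetOrLoadAsync` are hypothetical names, the dictionary stands in for `IMemoryCache`, and the per-key `SemaphoreSlim` only protects one app instance (across instances you would need a distributed lock).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch; not a replacement for IMemoryCache or a distributed cache.
public sealed class StampedeSafeCache<TValue>
{
    private sealed record Entry(TValue Value, DateTimeOffset ExpiresAt);

    private readonly ConcurrentDictionary<string, Entry> _entries = new();
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _locks = new();

    // Cache-aside: return a fresh entry if present, otherwise load and store it.
    public async Task<TValue> GetOrLoadAsync(string key, TimeSpan ttl, Func<Task<TValue>> load)
    {
        if (TryGetFresh(key, out var cached)) return cached;

        // One gate per key, so only one caller rebuilds a given entry.
        SemaphoreSlim gate = _locks.GetOrAdd(key, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try
        {
            // Double-check: another caller may have repopulated while we waited.
            if (TryGetFresh(key, out cached)) return cached;

            TValue value = await load();
            _entries[key] = new Entry(value, DateTimeOffset.UtcNow.Add(ttl));
            return value;
        }
        finally
        {
            gate.Release();
        }
    }

    private bool TryGetFresh(string key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > DateTimeOffset.UtcNow)
        {
            value = entry.Value;
            return true;
        }
        value = default!;
        return false;
    }
}
```

If fifty requests ask for the same cold key at once, the loader runs once and the other forty-nine pick up the cached value after the gate opens.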
If you're building microservices without a message bus, you're basically building a distributed monolith. Async patterns decouple services and make them resilient.
- Pub/Sub — publishers send messages to a topic, multiple subscribers receive them independently. In .NET: Azure Service Bus Topics, RabbitMQ with MassTransit, Kafka with Confluent's .NET client, or Redis Pub/Sub for simpler use cases.
- Message Queues — point-to-point, each message consumed by exactly one consumer. In .NET: Azure Service Bus Queues, RabbitMQ, Amazon SQS. Use MassTransit or NServiceBus as abstractions; they handle retry, dead-letter, serialization and a ton of other stuff you don't want to write yourself.
- Change Data Capture (CDC) — capture row-level DB changes and publish them as events. In .NET: Debezium (via Kafka Connect) for SQL Server CDC events, EF Core interceptors for publishing domain events on SaveChanges. SQL Server has built-in CDC support which is handy.
- Dead Letter Queues (DLQ) — where messages go when they fail processing after all retries are exhausted. ALWAYS set up dead letter queues. Without them, failed messages just vanish and you have no idea what went wrong. In .NET: MassTransit configures DLQs automatically. Azure Service Bus has built-in dead-letter support. Monitor your DLQ; if messages are piling up, something's broken.
- Exactly-Once vs At-Least-Once vs At-Most-Once — the three message delivery guarantees. At-most-once (fire and forget, might lose messages). At-least-once (guarantees delivery but might duplicate — most common, use idempotency to handle dupes). Exactly-once (theoretically impossible across network boundaries, but Kafka gets close with transactions). In practice, design for at-least-once + idempotency.
- Backpressure — what happens when your consumer can't keep up with the producer? Without backpressure, messages pile up, memory fills, things crash. Solutions: bounded queues (reject new messages when full), rate limiting on producers, auto-scaling consumers. In .NET: `System.Threading.Channels` has `BoundedChannelOptions` for in-process backpressure.
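For the in-process case mentioned in the Backpressure bullet, a bounded channel is about the smallest working example there is. In this sketch `BackpressureDemo` and `RunAsync` are made-up names, the capacity and message counts are arbitrary, and `BoundedChannelFullMode.Wait` is chosen so the producer is suspended (rather than messages being dropped) when the buffer is full.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative sketch of in-process backpressure with a bounded channel.
public static class BackpressureDemo
{
    public static async Task<int> RunAsync(int messageCount, int capacity)
    {
        var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait // producer waits instead of dropping
        });

        Task producer = Task.Run(async () =>
        {
            for (int i = 0; i < messageCount; i++)
                await channel.Writer.WriteAsync(i); // suspends while the buffer is full
            channel.Writer.Complete();
        });

        int processed = 0;
        await foreach (int message in channel.Reader.ReadAllAsync())
        {
            await Task.Delay(1); // simulate a slow consumer
            processed++;
        }

        await producer;
        return processed;
    }
}
```

The fast producer never gets more than `capacity` items ahead of the slow consumer, so memory stays flat no matter how many messages flow through.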
The genuinely hard part of software engineering. Make sure you actually need microservices before going down this path — the complexity is real.
- Heartbeats — periodic "I'm alive" signals between services. In .NET: ASP.NET Core health checks (`/health` endpoint), Kubernetes liveness/readiness probes, Azure App Service health monitoring.
- Service Discovery — how services find each other without hardcoded URLs. In .NET: Kubernetes DNS, Azure Service Fabric naming service, Consul via Steeltoe, or `Microsoft.Extensions.ServiceDiscovery` in .NET Aspire (this one is really nice for local dev).
- Consensus Algorithms — how distributed nodes agree on stuff (Paxos, Raft). You probably won't implement these yourself but understanding them helps you reason about why Cosmos DB consistency levels behave the way they do, or why etcd works the way it does.
- Distributed Locking — mutual exclusion across services. In .NET: Redis locks via `StackExchange.Redis` (Redlock algorithm), Azure Blob lease-based locks, or SQL Server `sp_getapplock`. Read the Martin Kleppmann article though; distributed locking is harder than most people think and Redlock has some known issues.
- Gossip Protocol — peer-to-peer info sharing where each node randomly gossips with neighbors. Used by Cassandra, Redis Cluster, and Orleans for membership. You probably won't implement this but it's good to know how your tools work internally.
- Circuit Breaker — stops cascading failures by "tripping" when a downstream service is unhealthy. States: Closed (normal) → Open (all calls fail fast) → Half-Open (testing if service recovered). In .NET: Polly's circuit breaker strategy (`CircuitBreakerStrategyOptions`) is THE answer. Plugs right into `IHttpClientFactory`.
- Disaster Recovery — RPO (how much data can you afford to lose?) and RTO (how fast do you need to recover?). Know these numbers for your system. In .NET: Azure Site Recovery, geo-redundant storage, SQL Server backup/restore, Cosmos DB automatic backups.
- Distributed Tracing — following a request as it hops across services. Without this, debugging microservices is basically impossible. In .NET: OpenTelemetry SDK (uses `System.Diagnostics.Activity` under the hood), Jaeger, Zipkin, Azure Application Insights, or .NET Aspire's dashboard for local dev.
- Leader Election — in many distributed systems, one node needs to be the "leader" that coordinates work. If it dies, a new leader is elected. Used by Kafka (partition leaders), Kubernetes (controller manager), SQL Server Always On (primary replica). In .NET: you can implement this with Redis or Azure Blob leases, or use a library like `DistributedLock`.
- Vector Clocks — a way to track causality in distributed systems. Each node maintains a counter, and events are tagged with these counters. Lets you detect conflicts and determine ordering without a central clock. Used by Amazon's original Dynamo design and Riak. You probably won't implement this but it's good to understand when someone mentions "conflict resolution in distributed databases".
- Split Brain — when a network partition makes two halves of a cluster think the other half is dead, and both halves try to be the leader. Classic problem in SQL Server Always On, Redis Sentinel, and Elasticsearch. Solutions: quorum-based decisions (need majority to elect leader), fencing tokens, witness nodes. This is why distributed systems are hard.
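The Closed → Open → Half-Open cycle from the Circuit Breaker bullet is simple enough to hand-roll for learning purposes. This sketch is not Polly's implementation: `SimpleCircuitBreaker`, its consecutive-failure threshold, and the injectable clock are assumptions for the example, and it is not thread-safe. Use Polly in production.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Illustrative, single-threaded sketch of the three-state breaker.
public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration,
                                Func<DateTimeOffset>? clock = null)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public T Execute<T>(Func<T> action)
    {
        if (State == CircuitState.Open)
        {
            // Fail fast until the break duration has elapsed.
            if (_clock() - _openedAt < _breakDuration)
                throw new InvalidOperationException("Circuit is open; failing fast.");
            State = CircuitState.HalfOpen; // let one trial call through
        }

        try
        {
            T result = action();
            _consecutiveFailures = 0;
            State = CircuitState.Closed;
            return result;
        }
        catch
        {
            _consecutiveFailures++;
            // A failed trial call, or too many failures in a row, opens the circuit.
            if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                State = CircuitState.Open;
                _openedAt = _clock();
            }
            throw;
        }
    }
}
```

While the circuit is open the wrapped action is never invoked, which is the whole point: the struggling downstream service gets breathing room instead of more traffic.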
No "best" pattern exists — it depends on your team size, scale requirements, and timeline. Choose deliberately.
- Client-Server — the foundation. Blazor/Angular/React frontend talks to ASP.NET Core backend which talks to SQL Server. Most .NET apps are this and that's totally fine.
- Microservices — each service owns its data and logic. .NET tooling is actually pretty great here: .NET Aspire for orchestration, Docker + K8s for deployment, MassTransit/NServiceBus for messaging, YARP for gateway. But seriously, start with a modular monolith. Extract microservices only when you have a real reason. I've seen too many teams go microservices-first and regret it.
- Serverless — code runs in response to events, no server management. In .NET: Azure Functions (C#), AWS Lambda (.NET runtime), Azure Durable Functions for stateful workflows. Great for event processing, scheduled jobs, and low-traffic APIs where you don't want to pay for idle servers.
- Event-Driven — services communicate by publishing and reacting to events. In .NET: MassTransit + RabbitMQ/Azure Service Bus, Wolverine (the new kid on the block, really good), EventStoreDB for event sourcing. This is the core pattern for CQRS implementations.
- Peer-to-Peer — nodes are both clients and servers. Used in file sharing, blockchain, real-time collab. In .NET: SignalR for real-time P2P, Orleans for virtual actor model.
- Modular Monolith — the architecture pattern I recommend most teams start with. It's a single deployable unit, but internally it's organized into well-defined modules with clear boundaries. When a module needs to become its own service, you extract it. Way easier than going microservices from day 1. In .NET: use separate class library projects per module, communicate via MediatR or in-process events, and enforce boundaries with `internal` access modifiers.
- CQRS (Command Query Responsibility Segregation) — separate your read model from your write model. Writes go through commands (which validate, apply business rules, persist). Reads go through queries (which can be optimized separately, maybe with denormalized views or a read-only replica). In .NET: MediatR for in-process CQRS, MassTransit for distributed. Pairs naturally with event sourcing. Don't use CQRS everywhere though; it's overengineering for simple CRUD.
- Event Sourcing — instead of storing current state, store the sequence of events that led to current state. Your "database" is an append-only event log. Current state is derived by replaying events. Powerful for audit trails, temporal queries, and debugging. In .NET: EventStoreDB, Marten (PostgreSQL-based), or roll your own with EF Core. Warning: adds significant complexity. Only use it when you actually need the event history.
- Hexagonal Architecture (Ports & Adapters) — your business logic is at the center, completely independent of infrastructure. Database, HTTP, messaging — all just "adapters" plugged into "ports". Makes testing easy and infrastructure swappable. In .NET: define interfaces (ports) in your Domain/Application layer, implement them (adapters) in Infrastructure. This is essentially what Clean Architecture popularized.
Learn these once and you'll have a toolbox for solving most distributed systems problems — both in interviews and production.
When you need a "transaction" that spans multiple services but can't use a distributed transaction (and you can't, they don't scale). Instead, each service does its local transaction and publishes an event. If something fails, you publish compensating events to undo previous steps.
Two types:
- Choreography: services listen for events and react. Simple but hard to track the overall flow.
- Orchestration: a central "saga orchestrator" tells each service what to do. Easier to understand and debug.
In .NET: MassTransit has built-in saga support with state machines. NServiceBus calls them "sagas" too. For orchestration, Azure Durable Functions is surprisingly good.
The problem: you need to save to the database AND publish a message, but doing both atomically is impossible (the database transaction doesn't cover the message broker). If the app crashes between the two, you lose the message.
Solution: save the message TO the database (in an "outbox" table) as part of the same transaction. A background process reads the outbox and publishes to the message broker.
In .NET: MassTransit has built-in outbox support with EF Core. Wolverine does too. If you're doing messaging without the outbox pattern, you're losing messages and probably don't realize it.
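The mechanics are easier to see without a real database in the way. In this in-memory sketch everything is hypothetical: `InMemoryDatabase` stands in for your DbContext, the `lock` stands in for a database transaction, and `DispatchPending` plays the role of the background publisher. It is not how MassTransit or Wolverine implement their outbox.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
public sealed record Order(Guid Id, decimal Total);

public sealed class OutboxMessage
{
    public Guid Id { get; init; } = Guid.NewGuid();
    public string Type { get; init; } = "";
    public string Payload { get; init; } = "";
    public bool Published { get; set; }
}

public sealed class InMemoryDatabase
{
    private readonly object _transaction = new();
    public List<Order> Orders { get; } = new();
    public List<OutboxMessage> Outbox { get; } = new();

    // The order row and the outbox row are written together or not at all.
    public void PlaceOrder(Order order)
    {
        lock (_transaction)
        {
            Orders.Add(order);
            Outbox.Add(new OutboxMessage { Type = "OrderPlaced", Payload = order.Id.ToString() });
        }
    }

    // Background dispatcher: publish anything not yet sent, then mark it sent.
    public int DispatchPending(Action<OutboxMessage> publish)
    {
        lock (_transaction)
        {
            int sent = 0;
            foreach (OutboxMessage message in Outbox.Where(m => !m.Published))
            {
                publish(message);         // if this throws, the row stays pending
                message.Published = true; // and gets retried on the next run
                sent++;
            }
            return sent;
        }
    }
}
```

Notice the failure mode this buys you: if the broker is down, the message is late, not lost. The flip side is that a crash after publishing but before marking the row means a duplicate, which is why consumers still need to be idempotent.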
For migrating from a legacy monolith to microservices. Instead of rewriting everything (which almost always fails), you gradually "strangle" the old system. New features go to the new system. Old features get migrated one by one. A routing layer (YARP, API gateway) directs traffic to old or new based on the feature.
Named after strangler fig trees that grow around existing trees. It's the safest migration strategy and I've used it successfully multiple times.
Deploy a helper process alongside your main service that handles cross-cutting concerns (logging, monitoring, networking, security). The main service doesn't need to know about these concerns.
In .NET: Dapr (Distributed Application Runtime) is basically the sidecar pattern as a product. It handles service invocation, state management, pub/sub, and more — all through a sidecar process. Also: Envoy proxy as a sidecar for service mesh (Istio, Linkerd).
Similar to sidecar but specifically for handling outbound connections. The ambassador acts as a proxy between your service and external services, handling retry logic, circuit breaking, and monitoring.
In .NET: You could argue that Polly + IHttpClientFactory IS the ambassador pattern, just built into your process instead of a sidecar.
Isolate components so that failure in one doesn't take down others. Named after the watertight compartments in ships: if one compartment floods, the ship doesn't sink.
In .NET: Use Polly's bulkhead (`BulkheadPolicy` in Polly v7, the concurrency limiter in v8) to limit concurrent calls to a downstream service. Separate thread pools for different dependencies. In Kubernetes: resource limits per container. In Azure: separate App Service plans for critical vs non-critical services.
Multiple consumers reading from the same queue, each processing different messages. This is how you scale message processing horizontally. The queue ensures each message goes to exactly one consumer.
In .NET: This is the default behavior when multiple instances of your MassTransit/NServiceBus consumer are running. Azure Service Bus and RabbitMQ both support this natively.
When a call fails, retry it — but wait progressively longer between retries (1s, 2s, 4s, 8s...) with some randomness (jitter) to prevent thundering herd. Simple but critical.
In .NET: Polly's retry strategy (`RetryStrategyOptions`) with `BackoffType = DelayBackoffType.Exponential` and `UseJitter = true`. Always add jitter. Without jitter, all your retrying clients hit the recovering server at the same time.
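If you want to see what the delay calculation looks like under the hood, here is a small sketch using the "full jitter" variant: pick a random delay between zero and the exponential ceiling. This is one common scheme, not necessarily the formula Polly uses, and the `Backoff` class and its parameters are made up for the example.

```csharp
using System;

// Illustrative sketch of exponential backoff with full jitter.
public static class Backoff
{
    // Returns a random delay in [0, min(maxDelay, baseDelay * 2^attempt)).
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        double ceilingMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));

        return TimeSpan.FromMilliseconds(random.NextDouble() * ceilingMs);
    }
}
```

With a 1 second base, attempts 0 through 3 get ceilings of 1s, 2s, 4s and 8s, and the cap stops the wait from growing forever. Because every client draws its own random delay, they spread out instead of retrying in lockstep.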
Not a full security course, but these are the concepts that come up in every system design conversation and code review.
- Authentication (AuthN): who are you? (login, JWT, OAuth)
- Authorization (AuthZ): what are you allowed to do? (roles, policies, permissions)
In .NET: ASP.NET Core Identity for user management, JWT bearer tokens for API auth, OAuth 2.0 / OpenID Connect with IdentityServer or Duende IdentityServer, Azure AD / Entra ID for enterprise. Use [Authorize] attribute + policy-based authorization.
OAuth 2.0 handles authorization (granting access to resources). OpenID Connect (OIDC) adds authentication on top (verifying identity). Together they power "Login with Google/GitHub/Microsoft".
In .NET: Microsoft.AspNetCore.Authentication.OpenIdConnect for server-side, Microsoft.Identity.Web for Azure AD integration. For building your own identity provider: Duende IdentityServer or OpenIddict.
Self-contained tokens that carry claims (user ID, roles, permissions). The server doesn't need to look anything up, it just verifies the signature. The catch: without extra server-side state (such as a denylist), they can't be revoked until they expire, so keep expiry times short (15 mins) and use refresh tokens for long sessions.
Gotcha: JWTs are NOT encrypted by default. Anyone can decode and read the claims. Don't put sensitive data in them. They're only signed (proving they haven't been tampered with).
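To see why the payload is readable by anyone, the sketch below builds a demo token by hand and decodes its middle segment with nothing but Base64URL decoding. No key is involved. The `JwtPeek` helper, the claim values, and the fake signature are all made up for illustration; this is not a validation routine and must not be used as one.

```csharp
using System;
using System.Text;

// Build a demo token by hand; the claims and signature are invented for this example.
string header  = JwtPeek.Base64UrlEncode("{\"alg\":\"HS256\",\"typ\":\"JWT\"}");
string payload = JwtPeek.Base64UrlEncode("{\"sub\":\"user-123\",\"role\":\"admin\"}");
string token   = $"{header}.{payload}.fake-signature";

Console.WriteLine(JwtPeek.ReadPayload(token));

public static class JwtPeek
{
    public static string Base64UrlEncode(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text))
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');

    // Decodes the middle segment WITHOUT verifying the signature.
    public static string ReadPayload(string jwt)
    {
        string segment = jwt.Split('.')[1].Replace('-', '+').Replace('_', '/');
        int padding = (4 - segment.Length % 4) % 4; // restore the stripped '=' padding
        segment = segment.PadRight(segment.Length + padding, '=');
        return Encoding.UTF8.GetString(Convert.FromBase64String(segment));
    }
}
```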
- Always use HTTPS (Kestrel handles this)
- Validate and sanitize ALL input (FluentValidation, `[ApiController]` auto-validation)
- Use parameterized queries (EF Core does this by default, but watch for raw SQL)
- Rate limit your APIs (built-in middleware in ASP.NET Core 7+)
- Implement CORS properly (don't just allow `*` in production)
- Use API keys for service-to-service, JWTs for user-facing
- Log authentication failures (Azure Application Insights)
- Never expose stack traces in production (`app.UseExceptionHandler()`)
- In transit: TLS/HTTPS for all communications (Kestrel, Azure App Gateway)
- At rest: SQL Server Transparent Data Encryption (TDE), Azure Storage encryption, Cosmos DB encryption. All enabled by default in Azure managed services.
- Application-level encryption: use `System.Security.Cryptography` for encrypting specific fields (PII, credit cards). Consider Azure Key Vault for key management.
"Never trust, always verify." Even internal service-to-service calls should be authenticated and authorized. No more "it's behind the firewall so it's safe." In .NET: mutual TLS (mTLS) between services, service mesh (Istio/Linkerd), Azure Private Endpoints, managed identities for Azure resource access.
You can't fix what you can't see. Three pillars: logs, metrics, and traces. Without them, you're flying blind.
Logs — what happened? Individual event records with timestamps.
- In .NET: `ILogger<T>` + Serilog or NLog. Structured logging is key: use `_logger.LogInformation("Order {OrderId} created by {UserId}", orderId, userId)`, not string concatenation. Ship to Azure Application Insights, Seq, Elasticsearch, or Datadog.
Metrics — how much? Numerical measurements over time. Request rate, error rate, latency percentiles, CPU usage, queue depth.
- In .NET: `System.Diagnostics.Metrics` (in the box since .NET 6; ASP.NET Core emits its own built-in metrics since .NET 8), OpenTelemetry .NET SDK, Prometheus via `prometheus-net`. Key metrics to track: request rate, error rate, p50/p95/p99 latency, saturation (CPU, memory, connections).
Traces — how did it flow? Following a single request across multiple services.
- In .NET: `System.Diagnostics.Activity` (the underlying primitive), OpenTelemetry SDK for instrumentation, Jaeger or Zipkin for visualization, Azure Application Insights for a managed solution, .NET Aspire dashboard for local dev.
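As a small illustration of the metrics pillar, the sketch below records a counter with `System.Diagnostics.Metrics` and reads it back in-process with a `MeterListener`, which is the same hook exporters such as OpenTelemetry build on. The meter name `Demo.Orders`, the instrument name `orders.created`, and the `channel` tag are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

long observed = 0;

// Subscribe to every instrument published by our (hypothetical) meter.
var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == OrderMetrics.MeterName) l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => observed += value);
listener.Start();

OrderMetrics.OrdersCreated.Add(1, new KeyValuePair<string, object?>("channel", "web"));
OrderMetrics.OrdersCreated.Add(2, new KeyValuePair<string, object?>("channel", "mobile"));
Console.WriteLine($"orders.created total = {observed}");
listener.Dispose();

public static class OrderMetrics
{
    public const string MeterName = "Demo.Orders"; // hypothetical meter name
    private static readonly Meter Meter = new(MeterName);
    public static readonly Counter<long> OrdersCreated = Meter.CreateCounter<long>("orders.created");
}
```

In a real service you would normally not write a listener yourself; you would register the meter name with your exporter and let it aggregate.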
- Latency — how long requests take (track p50, p95, p99, not averages)
- Traffic — how many requests per second
- Errors — what percentage of requests are failing (5xx rate)
- Saturation — how "full" your system is (CPU, memory, disk, connection pools)
If you only monitor four things, monitor these. Everything else is secondary.
// In your Program.cs
builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, tags: new[] { "ready" })   // checks DB connectivity
    .AddRedis(redisConnectionString, tags: new[] { "ready" })  // checks Redis
    .AddAzureServiceBusTopic(sbConn, topic)                    // checks message broker
    .AddUrlGroup(new Uri("https://dependency.api/health"));    // checks downstream API

app.MapHealthChecks("/health"); // runs every registered check
// Only checks tagged "ready" run here. Which checks you tag is up to you; without any tags this endpoint would run nothing.
app.MapHealthChecks("/health/ready", new() { Predicate = check => check.Tags.Contains("ready") });
// Runs no checks at all: it just confirms the app is running and can respond.
app.MapHealthChecks("/health/live", new() { Predicate = _ => false });
- Alert on symptoms (high error rate), not causes (high CPU). High CPU might be fine if error rate is normal.
- Use percentiles, not averages, for latency. A p99 of 5s means 1 in 100 requests takes 5 seconds or longer, even though the average might be 200ms.
- Set up alerts for your SLOs (service level objectives), not arbitrary thresholds.
- Don't alert on things that don't need human intervention. Alert fatigue is real and dangerous.
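The gap between average and tail latency is easy to demonstrate. The sketch below uses a made-up sample of 98 fast requests and 2 slow ones and a nearest-rank percentile; `LatencyStats.Percentile` is a hypothetical helper, and real monitoring systems typically estimate percentiles from histograms instead of sorting raw samples.

```csharp
using System;
using System.Linq;

// Made-up sample: 98 fast requests and 2 slow ones (milliseconds).
double[] latenciesMs = Enumerable.Repeat(100.0, 98).Concat(new[] { 5000.0, 5000.0 }).ToArray();

Console.WriteLine($"avg = {latenciesMs.Average():F0} ms");
Console.WriteLine($"p50 = {LatencyStats.Percentile(latenciesMs, 50):F0} ms");
Console.WriteLine($"p99 = {LatencyStats.Percentile(latenciesMs, 99):F0} ms");

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static double Percentile(double[] samples, double p)
    {
        double[] sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Clamp(rank - 1, 0, sorted.Length - 1)];
    }
}
```

Here the average comes out just under 200 ms and the median is 100 ms, yet the p99 is 5000 ms. A dashboard showing only the average would hide the slow requests entirely.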
Getting code to production safely is a system design problem in itself. These patterns minimize risk.
- Rolling Update: replace instances one at a time. Old and new versions run simultaneously during the rollout. Default in Kubernetes. Risk: API compatibility between old and new version.
- Blue-Green Deployment: run two identical environments (blue = current, green = new). Switch traffic from blue to green once verified. Instant rollback by switching back. In Azure: App Service deployment slots.
- Canary Deployment: route a small percentage of traffic (1-5%) to the new version. Monitor for errors. Gradually increase if healthy. In .NET: YARP with weighted routing, Azure Front Door traffic splitting.
- Feature Flags: deploy code to production but don't enable it for all users. Toggle features on/off without deploying. In .NET: `Microsoft.FeatureManagement` package, LaunchDarkly, Azure App Configuration.
- Build and test on every PR (GitHub Actions, Azure DevOps Pipelines)
- Run integration tests against a real database (not just unit tests with mocks)
- Automated deployment to staging → smoke tests → production
- Always be able to roll back. If you can't roll back quickly, your deployment process is broken.
Define your infrastructure in code so it's reproducible, versioned, and reviewable:
- Bicep / ARM Templates — Azure-native IaC. Bicep is much more readable than ARM.
- Terraform — cloud-agnostic, works with Azure, AWS, GCP
- Pulumi — IaC using real programming languages including C#. Nice for .NET teams because you can use the same language.
Docker + Kubernetes is the standard for deploying microservices:
- Multi-stage Dockerfiles keep images small
- Use `.dockerignore` to exclude unnecessary files
- Never run as root in containers
- Health check endpoints for K8s liveness/readiness probes
- Resource limits to prevent noisy neighbor problems
In .NET: the SDK has built-in container publishing (introduced with the .NET 7 SDK), for example `dotnet publish /t:PublishContainer`. No Dockerfile needed for simple cases.
No "right answer" exists in system design — only tradeoffs you can articulate. Interviewers want to hear WHY you'd choose one approach over another, not just what you chose.
- Top 15 Tradeoffs — great overview, worth bookmarking
- Vertical vs Horizontal Scaling: vertical = bigger Azure VM (easy but has a ceiling). Horizontal = more instances behind a load balancer (more complex, but scales much further). ASP.NET Core is built for horizontal scaling as long as you keep your services stateless.
- Concurrency vs Parallelism: concurrency = multiple tasks making progress (async/await, `Task.WhenAll`). Parallelism = multiple tasks running simultaneously (`Parallel.ForEachAsync`, PLINQ). In ASP.NET Core: concurrency for I/O-bound work, parallelism for CPU-bound work.
- Long Polling vs WebSockets: long polling holds the connection open waiting for data. WebSockets are full-duplex persistent connections. In .NET just use SignalR, which automatically negotiates the best transport.
- Batch vs Stream Processing: batch = process accumulated data periodically (Azure Data Factory, Hangfire). Stream = process as it arrives (Kafka, Azure Stream Analytics, `System.Threading.Channels`).
- Stateful vs Stateless: stateless = each request is independent (ASP.NET Core APIs, Azure Functions). Stateful = server remembers client state (SignalR connections, Orleans grains). Stateless is easier to scale, stateful has simpler logic.
- Strong vs Eventual Consistency: strong = reads always return latest write. Eventual = reads might be slightly stale. In .NET: SQL Server sync commit = strong. Cosmos DB lets you pick. Redis replicas = eventual.
- Read-Through vs Write-Through Cache: read-through: cache fetches from DB on miss. Write-through: writes update both cache and DB. Write-behind: cache first, async DB flush. Implement with `IDistributedCache` + custom middleware.
- Push vs Pull: push = server sends to client (SignalR, webhooks). Pull = client requests data (REST polling). SignalR is .NET's go-to push technology.
- REST vs RPC: REST is resource-oriented (HTTP verbs, JSON). gRPC is action-oriented (binary Protobuf, strongly typed). Use REST for public APIs, gRPC for internal service-to-service calls (Protobuf payloads are smaller and typically much faster to serialize than JSON).
- Sync vs Async Communication — sync = HTTP req/response, caller waits. Async = message queue, caller moves on. Use async for non-critical paths and cross-service communication.
- Latency vs Throughput — optimizing for one often hurts the other. Batching helps throughput but increases per-request latency. Caching reduces latency but adds complexity.
- Simplicity vs Flexibility: the one nobody talks about. Every abstraction layer, every configuration option, every "what if we need to change this later" adds complexity. Most of the time the simpler solution is better even if it's less flexible. You can always refactor later when you actually need it. YAGNI (You Ain't Gonna Need It) is the most underrated principle in software engineering.
- Consistency vs Performance — strong consistency requires coordination (locks, consensus, synchronous replication) which adds latency. Eventual consistency is faster but your code needs to handle stale reads. Pick based on your domain — financial transactions need strong consistency, social media feeds are fine with eventual.
"How much storage does this need?" "How many servers?" Interviewers love estimation questions. Memorize these numbers and you'll nail them every time.
┌──────────────────────────────────────────────────────────────┐
│ LATENCY NUMBERS (APPROXIMATE) │
├──────────────────────────────────────────────────────────────┤
│ L1 cache reference ..................... 0.5 ns │
│ L2 cache reference ..................... 7 ns │
│ Main memory (RAM) reference ........... 100 ns │
│ SSD random read ....................... 150 μs (~150,000 ns) │
│ HDD random read ....................... 10 ms │
│ Send 1KB over 1 Gbps network ......... 10 μs │
│ Read 1 MB sequentially from RAM ....... 250 μs │
│ Read 1 MB sequentially from SSD ....... 1 ms │
│ Read 1 MB sequentially from HDD ....... 20 ms │
│ Round trip within same datacenter ..... 0.5 ms │
│ Round trip US East → US West .......... 40 ms │
│ Round trip US → Europe ................ 80 ms │
│ Round trip US → Australia ............. 180 ms │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ STORAGE MATH │
├──────────────────────────────────────────────────────────────┤
│ 1 char (ASCII) = 1 byte │
│ 1 char (Unicode) = 1-4 bytes (UTF-8), 2 or 4 (UTF-16) │
│ Average tweet/message = ~200 bytes │
│ Average URL = ~100 bytes │
│ Average photo (compressed) = ~200 KB │
│ Average short video clip = ~5 MB │
│ Average user profile row = ~1 KB │
│ │
│ 1 KB = 1,000 bytes (use 10^3 for quick math) │
│ 1 MB = 1,000 KB = 10^6 bytes │
│ 1 GB = 1,000 MB = 10^9 bytes │
│ 1 TB = 1,000 GB = 10^12 bytes │
│ 1 PB = 1,000 TB = 10^15 bytes │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ TIME MATH │
├──────────────────────────────────────────────────────────────┤
│ Seconds in a day ........... 86,400 (~10^5 or ~100K) │
│ Seconds in a month ......... 2.6M (~2.5 × 10^6) │
│ Seconds in a year .......... 31.5M (~3 × 10^7) │
│ Requests per day → per sec . divide by 100K │
│ │
│ Quick trick: 1M requests/day ≈ 12 requests/second │
│ 10M req/day ≈ 120 req/sec │
│ 100M req/day ≈ 1,200 req/sec │
└──────────────────────────────────────────────────────────────┘
- Round aggressively. Use powers of 10. Nobody cares if it's 86,400 or 100,000 seconds in a day, so just use 100K for quick math.
- State your assumptions. "I'll assume each user makes 10 requests per day on average." Interviewers want to see your thinking, not exact numbers.
- Work from the top down. Start with total users → daily active users → requests per day → requests per second → storage per request → total storage.
- Account for peaks. Average traffic is nice but you need to handle 2-5x spikes. Black Friday, viral content, etc.
- Show your work. Write it down. The calculation itself is more important than the result.
- 100M new URLs per month
- Each URL: ~100 bytes (short URL) + ~500 bytes (long URL + metadata) = ~600 bytes
- Monthly storage: 100M × 600 bytes = 60 GB/month
- 5 year storage: 60 GB × 60 months = 3.6 TB
- With replication (3x): ~10.8 TB → round up to ~11 TB (or budget ~12 TB to leave some headroom)
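The same arithmetic can be checked in a few lines of C#. The inputs are the ones from the example above; the writes-per-second figure is an extra derived number (an addition, not from the original example) that uses the 2.6M seconds-per-month value from the time-math table. The `Estimate` helper is hypothetical.

```csharp
using System;

double monthlyGb  = Estimate.StorageGb(items: 100e6, bytesPerItem: 600); // 100M URLs x 600 bytes
double fiveYearTb = monthlyGb * 60 / 1000;                                // 60 months
double withReplicasTb = fiveYearTb * 3;                                   // 3x replication
double writesPerSec   = Estimate.PerSecond(perMonth: 100e6);

Console.WriteLine($"monthly storage : {monthlyGb:F0} GB");
Console.WriteLine($"5-year storage  : {fiveYearTb:F1} TB");
Console.WriteLine($"with 3 replicas : {withReplicasTb:F1} TB");
Console.WriteLine($"average writes  : {writesPerSec:F0} per second");

public static class Estimate
{
    // Decimal units (1 GB = 10^9 bytes), matching the storage-math table.
    public static double StorageGb(double items, double bytesPerItem) => items * bytesPerItem / 1e9;

    // Roughly 2.6 million seconds in a month.
    public static double PerSecond(double perMonth) => perMonth / 2.6e6;
}
```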
See? Not that scary. Practice a few of these and you'll get fast at it.
Having been on both sides of the interview table, here's what actually separates pass from fail.
Step 1: Clarify Requirements (3-5 mins) Don't jump into designing. Ask questions first:
- What are the core features? (don't design everything, just the important stuff)
- What scale are we designing for? (100 users or 100M users?)
- What are the non-functional requirements? (latency, availability, consistency)
- Any constraints? (budget, existing tech stack, timeline)
Step 2: Estimate Scale (3-5 mins) Back of envelope math. How many users, requests per second, storage needed. This informs whether you need sharding, caching, CDN, etc.
Step 3: High-Level Design (10-15 mins) Draw the big boxes: client, load balancer, API servers, database, cache, message queue. Explain the data flow. Keep it simple first.
Step 4: Deep Dive (10-15 mins) The interviewer will pick a component to dig into. Be ready to go deep on:
- Database schema design
- API endpoint design
- How caching works
- How you handle failure scenarios
- How you'd scale specific bottlenecks
Step 5: Wrap Up (3-5 mins) Discuss tradeoffs you made, what you'd improve with more time, monitoring strategy.
- Think out loud. The interview is about your thought process, not the final answer. Silence is the worst thing you can do.
- Start simple, add complexity. Design for a single server first. Then scale it. Interviewers want to see you iterate, not drop a perfect architecture from the sky.
- Always mention tradeoffs. "I chose SQL here because we need ACID for payments, but the tradeoff is it won't scale horizontally as easily as NoSQL." This is what separates senior from junior.
- Don't memorize architectures, learn patterns. Every URL shortener article shows the same design. Interviewers know this. What impresses them is when you explain WHY each component exists.
- Know your numbers. How long does a disk seek take? How much data can you store in Redis? How many connections can a SQL Server handle? See the estimation section above.
- Mention failure modes. "What happens if Redis goes down?" "What if there's a network partition?" Proactively addressing failures shows maturity.
- Talk about monitoring. "I'd add a dashboard tracking p99 latency, error rate, and queue depth." Shows you think about operations, not just building.
- As a .NET developer, it's totally fine to reference specific .NET tools and libraries. Saying "I'd use SignalR for real-time notifications" or "YARP for the API gateway" shows you can actually build what you're designing, not just draw boxes.
- Jumping straight to the solution without clarifying requirements
- Over-engineering (adding Kafka when a simple queue would do)
- Under-engineering (ignoring scale when the problem clearly needs it)
- Not discussing tradeoffs
- Only designing the happy path (what happens when things fail?)
- Getting stuck on one component for too long
- Not drawing diagrams (always draw, even if your drawing is terrible)
Mistakes I've seen (and made) in production. Learn from other people's pain: it's cheaper.
Distributed Monolith — you split into microservices but they're all tightly coupled, share a database, and must be deployed together. You got all the complexity of microservices with none of the benefits. Fix: each service owns its data, communicates via async messaging.
Premature Microservices: going microservices before you even have product-market fit. You're changing things so fast that the service boundaries keep shifting. Fix: start with a modular monolith, extract microservices when you have stable domain boundaries.
Shared Database — multiple services reading/writing to the same database tables. Changes to the schema require coordinating across all teams. This is the #1 killer of microservices architectures. Fix: database per service, sync via events.
God Service: one service that does everything. It's the new monolith, just hosted in a container. Usually called something vague like "CoreService" or "PlatformService". Fix: identify bounded contexts, split along business capability lines.
Chatty Services — a single user action requires 15 service-to-service calls. Latency adds up, reliability drops (each call can fail). Fix: redesign service boundaries so common operations are within a single service, or use the BFF (Backend For Frontend) pattern to aggregate.
N+1 Queries: already mentioned, but worth repeating because it's SO common. You load a list of parents, then run one more query per parent to fetch its children. In .NET: always use .Include() or .Select() projection. Enable EF Core logging to see generated SQL.
Missing Indexes — slow queries that could be fast with an index. Check your query plans. In EF Core: use entity.HasIndex() in your OnModelCreating. For existing databases: look at SQL Server's missing index DMVs.
Unbounded Queries — SELECT * FROM Orders with no pagination, filtering, or limit. One day your table has 50 million rows and the query takes 30 seconds. Fix: always paginate, always set a limit, project only needed columns.
Storing Files in Database — putting images, PDFs, videos in SQL Server BLOB columns. Your database backup is now 500 GB and queries are slow. Fix: store files in Azure Blob Storage / S3, store only the URL in the database.
Cache Everything: caching data that's rarely accessed or changes frequently. The cache hit rate is low and stale data causes bugs. Fix: cache hot data with short TTLs, measure your hit rate.
No Expiry — cached data lives forever. Works great until the underlying data changes and your users see stale data for hours. Fix: always set TTL. Use event-driven invalidation for critical data.
Cache Avalanche — all your cache keys expire at the same time (because they were all set with the same TTL at the same time). Massive DB spike. Fix: add random jitter to TTLs (baseExpiry + Random(0, 60 seconds)).
Fire and Forget — publishing messages without confirming delivery. Messages silently get lost. Fix: use publisher confirms (RabbitMQ) or durable messaging with the outbox pattern.
No Dead Letter Queue — failed messages just disappear. You have no idea why processing is failing. Fix: always configure DLQs and monitor them.
Huge Messages — putting entire file contents or large payloads in messages. Bloats the queue, slow to serialize/deserialize. Fix: put the data in blob storage, put a reference (URL/ID) in the message. This is called the "claim check" pattern.
Opinionated decision trees for the most common "what should I use?" questions. Every situation is different, but these are solid defaults.
Need ACID + complex queries + relationships?
→ SQL Server or PostgreSQL with EF Core
Need document storage with flexible schema?
→ Azure Cosmos DB (or MongoDB)
Need a cache / session store / simple key-value?
→ Redis (via StackExchange.Redis)
Need full-text search?
→ Elasticsearch (via NEST/Elastic.Clients.Elasticsearch)
Need time-series data (metrics, IoT)?
→ InfluxDB or Azure Data Explorer
Need graph relationships (social networks, recommendations)?
→ Neo4j (via Neo4j.Driver)
Not sure?
→ Start with PostgreSQL. It handles 95% of use cases.
Client → Server (request/response)?
→ REST (ASP.NET Core Web API) for external
→ gRPC for internal service-to-service (faster)
Real-time server → client push?
→ SignalR
Service → Service async (fire and forget)?
→ Message queue (RabbitMQ/Azure Service Bus via MassTransit)
Service → Multiple services (fan-out)?
→ Pub/Sub (Azure Service Bus Topics, Kafka)
Long-running workflows?
→ Azure Durable Functions or MassTransit Sagas
Scheduled jobs?
→ Hangfire or Azure Functions with timer trigger
Small team (1-5 devs), single product?
→ Modular monolith. Don't overthink it.
Growing team (5-15 devs), multiple product areas?
→ Modular monolith with well-defined module boundaries.
Extract the first microservice when a module needs
independent scaling or a different deployment cadence.
Large org (15+ devs), multiple teams?
→ Microservices with clear ownership boundaries.
Each team owns 1-3 services end-to-end.
Use .NET Aspire for orchestration.
Event processing / data pipeline?
→ Azure Functions or dedicated worker services.
Need massive scale with complex state?
→ Microsoft Orleans (virtual actor model).
Database queries taking >100ms that are read-heavy?
→ Add Redis cache with cache-aside pattern.
Same API response requested thousands of times per second?
→ Add output caching middleware in ASP.NET Core.
Static assets (images, CSS, JS)?
→ CDN (Azure Front Door / Azure CDN).
User sessions?
→ Redis via IDistributedCache.
Expensive computation results?
→ IMemoryCache for single-server, Redis for multi-server.
Not sure if you need caching?
→ You probably don't yet. Optimize queries first.
Add caching only when you have measurable evidence.
You don't need to implement these from scratch, but knowing what they are, when they're used, and their tradeoffs gives you a real edge in interviews.
- Hash Table: O(1) average lookup. The backbone of caches, database indexes, and... basically everything. In .NET: `Dictionary<TKey, TValue>`, `ConcurrentDictionary<TKey, TValue>` for thread safety.
- B-Tree / B+ Tree: balanced tree used for database indexes. Optimized for disk access (minimizes I/O by keeping many keys per node). SQL Server and PostgreSQL both use B-tree variants for their default indexes. Understanding this helps you understand why index order matters.
- LSM-Tree (Log-Structured Merge Tree): write-optimized data structure. Writes go to an in-memory table, then flush to sorted disk files, which periodically merge. Used by Cassandra, RocksDB, LevelDB. This is why many NoSQL databases have great write performance.
- Skip List: probabilistic sorted data structure with O(log n) operations. Used by Redis for sorted sets (ZSET). When you do `ZADD` and `ZRANGEBYSCORE` in Redis, you're using a skip list.
- Bloom Filter: already covered above. Probabilistic set membership. "Definitely not" or "maybe yes". Great for avoiding expensive lookups.
- Trie (Prefix Tree): tree where each node is a character. Used for autocomplete, spell checking, and IP routing tables. Comes up in "Design Autocomplete" interviews.
- Consistent Hash Ring: covered in core concepts. Used for distributed caching and database sharding.
- Merkle Tree: hash tree where each node is the hash of its children. Used by Git, blockchain, and Cassandra/DynamoDB for data synchronization (anti-entropy). Efficiently detects differences between two datasets.
- Quadtree / Geohash: spatial data structures for geographic queries. "Find all restaurants within 5km." Used by Uber, Google Maps, Yelp. Comes up in "Design Uber" and "Design Yelp" interviews. In .NET: `NetTopologySuite` for spatial operations, SQL Server spatial indexes.
- HyperLogLog: probabilistic data structure for counting unique elements. Uses tiny memory (~12KB) to estimate cardinality of billions of items with ~0.81% error. Redis has built-in HyperLogLog commands. Used for counting unique visitors, unique IPs, etc.
- Min-Heap / Priority Queue: for getting the minimum element efficiently. Used in job schedulers, rate limiters, and "top K" problems. In .NET: `PriorityQueue<TElement, TPriority>` (built-in since .NET 6).
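As one example of these structures in use, here is a minimal "top K" sketch built on the `PriorityQueue<TElement, TPriority>` mentioned above. The `TopK` helper and the sample scores are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

int[] scores = { 42, 7, 99, 15, 63, 88, 3 }; // made-up data
Console.WriteLine(string.Join(", ", TopK.Largest(scores, 3)));

public static class TopK
{
    // Keeps a min-heap of size k. The root is the smallest of the current top k,
    // so each new item costs O(log k) and memory stays O(k).
    public static List<int> Largest(IEnumerable<int> items, int k)
    {
        var heap = new PriorityQueue<int, int>(); // lowest priority value dequeues first
        foreach (int item in items)
        {
            heap.Enqueue(item, item);
            if (heap.Count > k) heap.Dequeue(); // evict the current minimum
        }

        var result = new List<int>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse(); // largest first
        return result;
    }
}
```

The point of the heap is that you never sort the whole input: for a stream of millions of items and a small k, you only ever hold k elements.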
Practical tips for making .NET apps fast in production. Remember: measure first, optimize second.
Measure first, optimize second. Don't guess where the bottleneck is. Use BenchmarkDotNet for micro-benchmarks, dotnet-trace and dotnet-counters for runtime analysis, and Application Insights for production profiling.
- Use async/await everywhere for I/O operations. A synchronous DB call blocks the thread pool thread. In ASP.NET Core, a blocked thread means one less request you can handle concurrently.
- Minimize allocations. Use `Span<T>`, `ReadOnlySpan<T>`, `ArrayPool<T>.Shared`, and `string.Create()` for hot paths. The GC is good but not free: every allocation creates work for the garbage collector.
- Use `System.Text.Json` instead of Newtonsoft.Json for serialization. It's generally faster and allocates less. Source generators (`[JsonSerializable]`) make it even faster by avoiding reflection.
- Connection pooling. ADO.NET pools database connections by default. `HttpClient` pools HTTP connections. Don't create new instances per request; use `IHttpClientFactory` and DI-injected `DbContext`.
- Response compression. Enable gzip/brotli compression middleware for API responses. Saves bandwidth, especially for JSON payloads. `builder.Services.AddResponseCompression()`.
- Output caching. New in .NET 7. Caches entire HTTP responses server-side. Way more efficient than re-executing the handler for every request. `[OutputCache(Duration = 60)]`.
- Use `.AsNoTracking()` for read-only EF Core queries. Change tracking adds overhead you don't need for queries that won't update data.
- Project only needed columns. `.Select(u => new { u.Id, u.Name })` instead of loading entire entities. Less data transferred, less memory, faster queries.
- Batch operations. EF Core 7+ supports `ExecuteUpdateAsync()` and `ExecuteDeleteAsync()` for bulk operations without loading entities into memory.
- Use compiled queries for hot paths that execute the same query repeatedly. `EF.CompileAsyncQuery(...)` eliminates query compilation overhead.
- Check your query plans. Just because EF Core generates SQL doesn't mean it's good SQL. Use SQL Server Management Studio or `EXPLAIN ANALYZE` in PostgreSQL to check.
- L1 + L2 cache pattern. `IMemoryCache` (L1, in-process, fastest) + Redis (L2, distributed, still fast). Check L1 first, then L2, then database. Reduces Redis roundtrips for hot data.
- Cache serialization matters. If you're caching complex objects in Redis, the serialization/deserialization cost adds up. Consider MessagePack or protobuf instead of JSON for cached values.
Unit tests alone won't cut it for distributed systems. Here's a practical testing strategy that actually catches production bugs.
Unit Tests — test individual components in isolation. Mock external dependencies. Fast, cheap, run thousands of them. Use xUnit + Moq/NSubstitute in .NET.
Integration Tests — test components talking to real dependencies (real database, real Redis). Use WebApplicationFactory<T> for in-memory ASP.NET Core server, Testcontainers for spinning up real Docker instances of SQL Server/Redis/RabbitMQ.
Contract Tests — verify that service A's expectations about service B's API match reality. Prevents "it works on my machine" across services. Use Pact in .NET.
End-to-End Tests — test the full flow across all services. Expensive, slow, flaky. Keep these to a minimum — only for critical user journeys (signup, checkout, payment).
Deliberately break things in production (or staging) to find weaknesses before they find you. Netflix pioneered this with Chaos Monkey.
In .NET:
- Kill random service instances and verify the system recovers
- Add artificial latency to downstream calls (Polly's chaos latency strategy, `AddChaosLatency`)
- Simulate network partitions
- Fill up disk space, exhaust connection pools
- Azure Chaos Studio is a managed chaos engineering service
- Testcontainers is amazing. Spin up real SQL Server, Redis, RabbitMQ in Docker for integration tests. No more "works on my machine" or maintaining shared test databases.
- Don't mock everything. Mocking the database gives you false confidence. Integration tests with a real DB catch schema mismatches, missing indexes, and incorrect SQL.
- Test failure scenarios specifically. What happens when the database is down? When Redis is unreachable? When the message broker rejects a publish? These are the scenarios that cause production incidents.
Changing a live system without breaking it — some of the hardest problems in engineering. These strategies minimize risk.
- EF Core Migrations: `dotnet ef migrations add` creates migration files. Apply with `dotnet ef database update`. For production: generate SQL scripts with `dotnet ef migrations script` and review before applying. Never run `dotnet ef database update` directly in production.
- Zero-downtime schema changes: add new column (nullable), deploy code that writes to both old and new, backfill old data, deploy code that reads from new, drop old column. Never rename or remove columns in a single step.
- Expand and Contract pattern — expand the schema (add new), migrate data, contract the schema (remove old). Three deployments minimum for breaking schema changes.
- Strangler Fig (covered in patterns) — gradually migrate from old to new behind a routing layer.
- Parallel Run — run old and new systems simultaneously, compare results. When the new system produces the same results, switch over. Good for critical systems (payments, financial calculations).
- Feature Flags: gate new functionality behind flags. Roll out to 1% → 10% → 50% → 100%. Roll back by flipping the flag. In .NET: `Microsoft.FeatureManagement`.
- ETL (Extract, Transform, Load) — for bulk data migration between systems. Azure Data Factory for cloud-scale ETL.
- CDC (Change Data Capture) — keep two systems in sync during migration. Old system writes, CDC captures changes, new system applies them. Debezium + Kafka is the standard approach.
- Dual Writes — write to both old and new system during migration. Simpler than CDC but risky (what if one write fails?). Use the outbox pattern if you go this route.
All code is written in C# targeting .NET 8+ with a proper solution file. Each file has detailed XML docs explaining the algorithm, tradeoffs, and .NET ecosystem usage. Every algorithm includes a runnable demo.
# Clone and run
git clone https://github.com/yourusername/System-Design-Overview.git
cd System-Design-Overview
dotnet run --project implementations/csharp
# Or open SystemDesign.sln in Visual Studio / Rider / VS Code
The interactive menu lets you run any algorithm demo individually or all at once.
| Algorithm | What it does | Code |
|---|---|---|
| Consistent Hash Ring | Virtual node-based ring with MD5. Minimal key remapping when servers join/leave. | ConsistentHashing.cs |
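For readers who want the idea before opening the file, here is a compact sketch of a virtual-node ring hashed with MD5. It is an independent illustration, not the repository's `ConsistentHashing.cs`; the `HashRing` class, the server names, and the choice of 100 virtual nodes are assumptions. Lookup is a linear scan to keep the code short, where a production version would binary-search a sorted array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

var ring = new HashRing(virtualNodes: 100);
ring.Add("cache-a"); ring.Add("cache-b"); ring.Add("cache-c"); // hypothetical server names
Console.WriteLine($"user:42 -> {ring.GetNode("user:42")}");

public sealed class HashRing
{
    private readonly SortedDictionary<uint, string> _ring = new();
    private readonly int _virtualNodes;

    public HashRing(int virtualNodes) => _virtualNodes = virtualNodes;

    // Each server is placed on the ring many times to even out the key distribution.
    public void Add(string node)
    {
        for (int i = 0; i < _virtualNodes; i++) _ring[Hash($"{node}#{i}")] = node;
    }

    public void Remove(string node)
    {
        for (int i = 0; i < _virtualNodes; i++) _ring.Remove(Hash($"{node}#{i}"));
    }

    // Walk clockwise: first virtual node at or after the key's hash, wrapping to the start.
    public string GetNode(string key)
    {
        if (_ring.Count == 0) throw new InvalidOperationException("Ring is empty.");
        uint h = Hash(key);
        foreach (var entry in _ring)
            if (entry.Key >= h) return entry.Value;
        return _ring.First().Value;
    }

    // MD5 is used only for spreading keys, not for security.
    private static uint Hash(string s) =>
        BitConverter.ToUInt32(MD5.HashData(Encoding.UTF8.GetBytes(s)), 0);
}
```

The property to look for: when a server is removed, only the keys that lived on that server move. Keys on the remaining servers stay where they were.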
Five different strategies. Which one you pick depends on your situation. There's no universally "best" one.
| Algorithm | What it does | Code |
|---|---|---|
| Round Robin | Simplest. Cycles through servers sequentially. | RoundRobin.cs |
| Weighted Round Robin | Same but servers with higher weights get more traffic. Good for mixed hardware. | WeightedRoundRobin.cs |
| IP Hash | Client IP determines server. Same client always hits same server (sticky sessions). | IpHash.cs |
| Least Connections | Routes to server with fewest active connections. Adapts to real-time load. | LeastConnections.cs |
| Least Response Time | Routes to fastest server. Best for heterogeneous environments. | LeastResponseTime.cs |
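The simplest row in the table fits in a few lines. This is a minimal, thread-safe round robin sketch, written independently of the repository's `RoundRobin.cs`; the `RoundRobinBalancer` class and the server names are hypothetical.

```csharp
using System;
using System.Threading;

var balancer = new RoundRobinBalancer(new[] { "api-1", "api-2", "api-3" }); // hypothetical servers
for (int i = 0; i < 5; i++) Console.WriteLine(balancer.Next());

public sealed class RoundRobinBalancer
{
    private readonly string[] _servers;
    private int _counter = -1;

    public RoundRobinBalancer(string[] servers) =>
        _servers = servers.Length > 0
            ? servers
            : throw new ArgumentException("Need at least one server.", nameof(servers));

    // Interlocked keeps the counter correct when many requests call Next() concurrently.
    // The cast to uint keeps the index non-negative after the int counter wraps around.
    public string Next()
    {
        uint ticket = unchecked((uint)Interlocked.Increment(ref _counter));
        return _servers[ticket % (uint)_servers.Length];
    }
}
```

Round robin ignores how busy each server is, which is exactly the gap that Least Connections and Least Response Time fill.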
Five approaches. ASP.NET Core 7+ has three of these built-in, which is pretty great.
| Algorithm | What it does | ASP.NET Core Built-in? | Code |
|---|---|---|---|
| Fixed Window Counter | Simple counter per time window. Has boundary spike issue. | AddFixedWindowLimiter | FixedWindowCounter.cs |
| Sliding Window Log | Tracks every timestamp. Most accurate but memory hungry. | Custom IRateLimiterPolicy | SlidingWindowLog.cs |
| Sliding Window Counter | Weighted estimate across windows. Best balance of accuracy vs memory. | AddSlidingWindowLimiter | SlidingWindowCounter.cs |
| Token Bucket | Tokens refill over time. Allows controlled bursts. | AddTokenBucketLimiter | TokenBucket.cs |
| Leaky Bucket | Queue based, constant output rate. Smooths bursty traffic. | Nope, roll your own | LeakyBucket.cs |
┌──────────────────┬──────────┬──────────┬───────────────┐
│ Algorithm │ Memory │ Accuracy │ Burst Control │
├──────────────────┼──────────┼──────────┼───────────────┤
│ Fixed Window │ O(1) │ Low │ Poor (edges) │
│ Sliding Log │ O(N) │ Exact │ Excellent │
│ Sliding Counter │ O(1) │ High │ Good │
│ Token Bucket │ O(1) │ High │ Controlled │
│ Leaky Bucket │ O(N) │ High │ Smoothed │
└──────────────────┴──────────┴──────────┴───────────────┘
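The O(1) memory and "controlled burst" entries for the token bucket are easier to see in code. This is a hand-rolled sketch for illustration only (in a real ASP.NET Core app you would normally reach for the built-in limiter listed above). The `TokenBucket` type and its injectable `now` clock are my assumptions, added so the demo is deterministic.

```csharp
using System;

// Demo with a fake clock: a burst of 3 is allowed, the 4th call is rejected,
// and waiting 2 seconds refills 2 tokens.
var clock = DateTimeOffset.UnixEpoch;
var bucket = new TokenBucket(capacity: 3, refillPerSecond: 1, now: () => clock);

for (int i = 0; i < 4; i++) Console.WriteLine(bucket.TryAcquire()); // True, True, True, False
clock = clock.AddSeconds(2);
Console.WriteLine(bucket.TryAcquire()); // True

public sealed class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _now;
    private readonly object _gate = new();
    private double _tokens;
    private DateTimeOffset _lastRefill;

    public TokenBucket(int capacity, double refillPerSecond, Func<DateTimeOffset>? now = null)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _now = now ?? (() => DateTimeOffset.UtcNow);
        _tokens = capacity;            // start full so an initial burst is allowed
        _lastRefill = _now();
    }

    public bool TryAcquire(int tokens = 1)
    {
        lock (_gate)
        {
            // Lazy refill: only two numbers of state per bucket, hence O(1) memory.
            DateTimeOffset now = _now();
            double elapsed = (now - _lastRefill).TotalSeconds;
            _lastRefill = now;
            _tokens = Math.Min(_capacity, _tokens + elapsed * _refillPerSecond);

            if (_tokens < tokens) return false;   // rejected; the caller would return 429
            _tokens -= tokens;
            return true;
        }
    }
}
```

The capacity bounds the burst size and the refill rate bounds the sustained rate, which is the tradeoff the comparison table summarizes as "Controlled".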
Practice these. Start with easy ones and work your way up. For each problem try to think about it with .NET in mind — what libraries would you use, what Azure services, how would you structure the ASP.NET Core solution.
- Design URL Shortener like TinyURL
- Design Autocomplete for Search Engines
- Design Load Balancer
- Design Content Delivery Network (CDN)
- Design Parking Garage
- Design Vending Machine
- Design Distributed Key-Value Store
- Design Distributed Cache
- Design Authentication System
- Design Unified Payments Interface (UPI)
- Design WhatsApp
- Design Spotify
- Design Instagram
- Design Notification Service
- Design Distributed Job Scheduler
- Design Tinder
- Design Facebook
- Design Twitter
- Design Reddit
- Design Netflix
- Design YouTube
- Design Google Search
- Design E-commerce Store like Amazon
- Design TikTok
- Design Shopify
- Design Airbnb
- Design Rate Limiter
- Design Distributed Message Queue like Kafka
- Design Flight Booking System
- Design Online Code Editor
- Design an Analytics Platform (Metrics & Logging)
- Design Payment System
- Design a Digital Wallet
- Design Location Based Service like Yelp
- Design Uber
- Design Food Delivery App like DoorDash
- Design Google Docs
- Design Google Maps
- Design Zoom
- Design File Sharing System like Dropbox
- Design Ticket Booking System like BookMyShow
- Design Distributed Web Crawler
- Design Code Deployment System
- Design Distributed Cloud Storage like S3
- Design Distributed Locking Service
Real architectures from companies operating at massive scale. Each case study maps to common interview problems.
- Client → CDN (Open Connect) for video streaming
- Client → API Gateway (Zuul) → microservices for browse, search, recommendations
- Microservices communicate via async messaging (Kafka)
- Data: Cassandra for user data (AP, eventual consistency), MySQL for billing (CP, strong consistency)
- Caching: EVCache (memcached-based) for hot data
- Chaos Monkey randomly kills instances to test resilience
- Lesson: different data stores for different needs, async communication everywhere, test failure constantly
- Riders and drivers send GPS updates every few seconds
- Location data goes to a geospatial index (modified Google S2 cells)
- Matching: find available drivers near the rider using geohash queries
- ETA calculation: precomputed routing graphs + real-time traffic data
- All communication is async — the "requesting a ride" flow goes through a state machine saga
- Lesson: geospatial indexing is crucial, real-time systems need smart data structures (not just SQL queries)
- Each user maintains a persistent connection (WebSocket/MQTT)
- Messages stored temporarily until delivered, then deleted (not stored forever)
- Erlang/BEAM VM for massive concurrent connections (~2M connections per server)
- Messages are end-to-end encrypted — server can't read them
- Lesson: connection management at scale is a real problem, temporary storage reduces costs, encryption adds complexity but is non-negotiable for messaging
- Idempotency keys on every request (prevents double charges)
- Request lifecycle: validate → authorize → capture → settle
- Every state change is an event stored in an event log (event sourcing-ish)
- Strong consistency for financial operations (can't have eventual consistency for money)
- API versioning from day 1 (old versions supported for years)
- Lesson: idempotency is not optional for payments, event sourcing makes audit trails easy, API versioning must be a first-class concern
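The idempotency-key idea from the list above can be sketched in a few lines. This is a simplified in-memory illustration, not Stripe's implementation: `PaymentService`, `ChargeResult`, and the key format are hypothetical, and a real system would persist keys in a durable store and also check that a retried request carries the same payload.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

var payments = new PaymentService();
var first = payments.Charge("key-123", 49.99m);
var retry = payments.Charge("key-123", 49.99m);        // client retried after a timeout

Console.WriteLine(first.ChargeId == retry.ChargeId);   // True
Console.WriteLine(payments.ChargesExecuted);           // 1

public sealed record ChargeResult(Guid ChargeId, decimal Amount);

public sealed class PaymentService
{
    // Lazy<T> makes sure the charge runs once per key even if two
    // retries with the same key race each other.
    private readonly ConcurrentDictionary<string, Lazy<ChargeResult>> _results = new();
    private int _chargesExecuted;

    public int ChargesExecuted => _chargesExecuted;

    public ChargeResult Charge(string idempotencyKey, decimal amount)
    {
        var lazy = _results.GetOrAdd(
            idempotencyKey,
            _ => new Lazy<ChargeResult>(() => Execute(amount)));
        return lazy.Value;   // replays the stored result on a retry
    }

    private ChargeResult Execute(decimal amount)
    {
        Interlocked.Increment(ref _chargesExecuted);   // stands in for the real card charge
        return new ChargeResult(Guid.NewGuid(), amount);
    }
}
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot produce a second charge.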
- System Design Fundamentals — good general overview
- System Design Interviews — interview focused
- .NET Aspire docs — Microsoft's opinionated stack for cloud-native .NET. Worth learning if you're doing microservices.
- Microsoft Learn: Architect Modern Apps — free ebooks on microservices, cloud-native, CQRS patterns. Surprisingly good for free content.
- Designing Data-Intensive Applications — the bible. Seriously just read this book. Every chapter is relevant.
- .NET Microservices: Architecture for Containerized .NET Applications — free ebook from Microsoft. The eShop reference architecture.
- Building Event-Driven Microservices — essential if you're doing event sourcing or CQRS
- Software Architecture: The Hard Parts — when to break apart a monolith, trade-off analysis. Really practical.
- Building Microservices (Sam Newman) — the classic. Second edition covers modern patterns.
- Release It! (Michael Nygard) — patterns for production systems. Stability patterns, capacity planning. Changed how I think about building software.
These are the libraries you'll actually use when building distributed systems in .NET:
| Library | What it's for |
|---|---|
| YARP | Reverse proxy / API gateway. Microsoft's own, stupidly fast. |
| Polly | Resilience — retry, circuit breaker, timeout, fallback. Non-negotiable for microservices. |
| MassTransit | Message bus abstraction over RabbitMQ, Azure Service Bus, Kafka. Saves you so much boilerplate. |
| StackExchange.Redis | Redis client. Used by like everyone in the .NET ecosystem. |
| OpenTelemetry .NET | Distributed tracing, metrics, logging. The standard going forward. |
| EF Core | ORM. Supports SQL Server, PostgreSQL, Cosmos DB, SQLite. You know this one. |
| SignalR | Real-time communication over WebSockets, with Server-Sent Events and long polling as fallbacks. Built into ASP.NET Core. |
| MediatR | In-process mediator, commonly used for CQRS. Lightweight and simple. |
| .NET Aspire | Cloud-native orchestration, service defaults, local dev dashboard. The new hotness. |
| Wolverine | Next-gen messaging + mediator. Think MassTransit meets MediatR. Worth watching. |
| Hangfire | Background job processing. Dashboard included. |
| FluentValidation | Request validation. Way better than data annotations for complex rules. |
| Testcontainers | Spin up real Docker containers (SQL, Redis, RabbitMQ) for integration tests. Game changer. |
| BenchmarkDotNet | Micro-benchmarking framework. For when you need to know exactly how fast something is. |
| Serilog | Structured logging. Sinks for everything (console, file, Seq, Elasticsearch, App Insights). |
| Mapster or AutoMapper | Object mapping. Mapster is newer and faster. AutoMapper is more established. |
| Refit | Type-safe REST client. Define an interface, Refit generates the implementation. Cleaner than raw HttpClient. |
- Nick Chapsas — .NET deep dives, performance stuff. Probably the best .NET YouTuber right now.
- Raw Coding — ASP.NET Core internals, authentication deep dives
- Milan Jovanovic — Clean Architecture, DDD, CQRS in .NET. Very practical.
- ByteByteGo — system design with great visuals. Not .NET specific but the concepts apply.
- CodeOpinion — distributed systems, messaging, architecture. Focuses on .NET examples.
- dotnet — official Microsoft channel. Standup recordings, .NET Conf, etc.
- Gaurav Sen — system design fundamentals. Great for interview prep.
- System Design Interview — exactly what it sounds like
- Alex Hyett — software architecture and system design, explains complex topics simply
- Hussein Nasser — backend engineering deep dives. Great for understanding protocols and networking.
- ArjanCodes — software design principles. Python-focused but concepts are universal.
- Milan Jovanovic — .NET architecture newsletter. Consistently good.
- The Morning Brew — daily .NET and software dev links. Been running forever.
- ASP.NET Community Standup — weekly updates from the ASP.NET team. Good to stay current.
- ByteByteGo Newsletter — system design concepts weekly. Really well written.
- The Pragmatic Engineer — big tech engineering culture and practices. Not .NET specific but invaluable.
Real production systems handling billions of requests. Not theoretical — these describe actual engineering decisions and their consequences.
- How Discord stores trillions of messages — they went MongoDB → Cassandra → ScyllaDB. Great lessons on when to migrate databases and data modeling at scale.
- Building In-Video Search at Netflix — ML pipeline for searching inside videos. Relevant for understanding async processing architectures.
- How Canva scaled Media uploads from Zero to 50 Million per Day — object storage, CDN, queue-based processing. The scaling journey is really well documented.
- How Airbnb avoids double payments — idempotency in distributed payment systems. If you're building anything involving money in .NET, read this.
- Stripe's payments APIs - The first 10 years — API design masterclass. Study this before designing your REST APIs.
- Real time messaging at Slack — WebSocket architecture at massive scale. Directly relevant if you're building SignalR-based systems.
- Scaling Memcache at Facebook — how Facebook uses memcached to handle billions of requests. Lessons on cache invalidation, thundering herd, and multi-region caching.
- How Shopify Manages API Versioning — practical lessons on API versioning at scale. Relevant for any API designer.
- Twitter's Architecture for Timelines — fan-out on write vs fan-out on read tradeoff. Classic system design interview topic.
- Uber's Real-Time Data Infrastructure — exactly-once processing, real-time data pipelines, deduplication at scale.
- How GitHub Manages MySQL at Scale — MySQL high availability, sharding, and zero-downtime schema migrations.
- Pinterest's Sharding Architecture — real-world database sharding. Great for understanding shard key selection and migration.
The foundational papers that invented modern distributed systems. Dense but rewarding — they explain why the tools you use daily work the way they do.
- Paxos: The Part-Time Parliament — the foundational consensus algorithm. Hard to read (Lamport wrote it as a story about a Greek parliament, which is either genius or annoying depending on your mood).
- Raft: In Search of an Understandable Consensus Algorithm — designed as an understandable alternative to Paxos. Used by etcd (Kubernetes), Consul, CockroachDB. Way easier to grok than Paxos.
- MapReduce — Google's parallel processing framework. The direct basis for Hadoop and a major influence on Spark and most later batch processing systems.
- The Google File System — distributed file system. Foundation for HDFS and cloud storage services.
- Dynamo — Amazon's key-value store. Eventual consistency, consistent hashing, vector clocks. Influenced DynamoDB, Cassandra, Riak.
- Kafka — the paper behind Kafka. Essential if you're doing event-driven .NET with MassTransit or Confluent.Kafka.
- Spanner — Google's globally distributed database with TrueTime. This paper influenced Cosmos DB's consistency model.
- Bigtable — column-family storage. Influenced HBase, Cassandra, Azure Table Storage.
- ZooKeeper — distributed coordination. Used by Kafka for cluster management until KRaft mode replaced it in newer versions.
- LSM-Tree — the data structure behind Cassandra, LevelDB, RocksDB. Explains why NoSQL databases have such good write performance.
- Chubby — Google's distributed lock service. Basically the predecessor to ZooKeeper and etcd.
- Amazon Aurora — how AWS rebuilt MySQL for the cloud. Separating storage from compute, quorum-based replication. Really interesting architecture.
- CRDTs: Conflict-free Replicated Data Types — data structures that can be merged without coordination. Used for real-time collaboration (Figma's multiplayer model is CRDT-inspired; Google Docs uses operational transformation, the main alternative). Comes up in "Design Google Docs" interviews.
Quick-reference tables for interview prep and design sessions. Bookmark this section.
200 OK — success
201 Created — resource created (POST)
204 No Content — success, nothing to return (DELETE)
301 Moved Permanently — permanent redirect (SEO)
304 Not Modified — use cached version
400 Bad Request — client sent invalid data
401 Unauthorized — not authenticated (need to login)
403 Forbidden — authenticated but not authorized
404 Not Found — resource doesn't exist
409 Conflict — request conflicts with current state
422 Unprocessable — validation error (use this for business rules)
429 Too Many Reqs — rate limited
500 Internal Error — server broke (never expose details!)
502 Bad Gateway — reverse proxy couldn't reach backend
503 Service Unavail — server overloaded or maintenance
504 Gateway Timeout — backend took too long to respond
1 web server (ASP.NET Core / Kestrel) .... ~1,000-10,000 RPS
1 SQL Server instance .................... ~5,000-50,000 QPS
1 Redis instance ......................... ~100,000 ops/sec
1 Kafka broker ........................... ~200,000 msg/sec
1 RabbitMQ instance ...................... ~20,000-50,000 msg/sec
Azure Cosmos DB (single partition) ....... ~10,000 RU/s
Azure Service Bus (Premium) .............. ~1,000 msg/sec per MU
Single SQL table comfortable limit ....... ~100M-500M rows
Redis max memory (practical) ............. ~25-100 GB
Max HTTP request body size (default) ..... ~28.6 MB (Kestrel, 30,000,000 bytes)
WebSocket connections per server ......... ~10,000-65,000
| Availability | Annual Downtime | Monthly Downtime |
|---|---|---|
| 99% | 3.65 days | 7.31 hours |
| 99.9% | 8.77 hours | 43.83 minutes |
| 99.95% | 4.38 hours | 21.92 minutes |
| 99.99% | 52.60 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.30 seconds |
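The downtime figures above follow from simple arithmetic: allowed downtime = period × (1 − availability). A small sketch, assuming a 365.25-day year and a month of one twelfth of that (the `Sla` helper is a hypothetical name, and those period lengths are my inference from the numbers in the table):

```csharp
using System;
using System.Globalization;

foreach (double nines in new[] { 99.0, 99.9, 99.95, 99.99, 99.999 })
{
    TimeSpan perYear = Sla.Downtime(nines, Sla.Year);
    TimeSpan perMonth = Sla.Downtime(nines, Sla.Month);
    Console.WriteLine(string.Format(
        CultureInfo.InvariantCulture,
        "{0}% -> {1:F2} h/year, {2:F2} min/month",
        nines, perYear.TotalHours, perMonth.TotalMinutes));
}

public static class Sla
{
    // Average year length including leap years; a month is one twelfth of it.
    public static readonly TimeSpan Year = TimeSpan.FromDays(365.25);
    public static readonly TimeSpan Month = TimeSpan.FromDays(365.25 / 12);

    // Allowed downtime in a period for a given availability percentage.
    public static TimeSpan Downtime(double availabilityPercent, TimeSpan period) =>
        TimeSpan.FromSeconds(period.TotalSeconds * (1 - availabilityPercent / 100.0));
}
```

For 99.9% this works out to about 8.77 hours per year and 43.83 minutes per month. Each extra nine divides the budget by ten, which is why going from 99.9% to 99.99% is usually an architectural change rather than a tuning exercise.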
80 — HTTP
443 — HTTPS
1433 — SQL Server
5432 — PostgreSQL
6379 — Redis
5672 — RabbitMQ (AMQP)
15672 — RabbitMQ Management UI
9092 — Apache Kafka
8080 — Common alternative HTTP
5000 — ASP.NET Core default (HTTP)
5001 — ASP.NET Core default (HTTPS)
GNU General Public License v3.0 — see the LICENSE file.
Found something wrong? Want to add a concept or improve an explanation? PRs are welcome. Just keep the .NET focus and the practical, no-nonsense tone.
Built for .NET developers who want to understand system design, not just memorize answers.
If this helped you, give it a ⭐ and share it with your team.
Every star helps other .NET developers discover this resource.

