mihaduldev/System-Design-Overview

System Design for .NET Developers

The most comprehensive system design resource built exclusively for .NET developers.
Every concept explained with .NET context, real C# implementations, and production-ready patterns.

.NET 8+ C# 12 License PRs Welcome

I built this because most system design resources use Java/Python examples and honestly that always bugged me as a .NET dev. Here, every concept maps directly to the libraries, tools, and patterns you'd use in real production C# apps.

If you're new to System Design, start here: System Design was HARD until I Learned these 30 Concepts. Trust me, it makes everything click.


What's Inside

Fundamentals — the concepts every .NET developer must know
Architecture & Patterns — how to structure distributed .NET systems
Production Engineering — building, shipping, and running .NET systems
Interview Preparation — ace the system design interview
Implementations & Resources — code, case studies, and learning materials

⚙️ Core Concepts

The building blocks. Everything else in system design builds on top of these — skip them and you'll struggle with the rest.

Scalability

Basically — can your system handle more users without falling over? There are two ways to scale:

  • Vertical scaling: throw a bigger machine at it (more CPU, more RAM). Simple, but there's a ceiling.
  • Horizontal scaling: add more machines and spread the load. More complex, but the ceiling is far higher.

In .NET land, this means keeping your ASP.NET Core services stateless so you can just spin up more instances behind a load balancer. If you need shared state, use IDistributedCache with Redis instead of sticking stuff in memory.

  • Deep dive: Scalability
  • In .NET: ASP.NET Core is stateless by default (which is great). Use IDistributedCache for shared state, Azure Service Bus for decoupling, Kubernetes HPA for auto-scaling your containers.

Availability

How much of the time is your system actually working? We measure this in "nines" — 99.9% (three nines) sounds great until you realize that's still about 8.8 hours of downtime a year. For something like a payment system, even that might be too much.

  • Deep dive: Availability
  • In .NET: Health checks via Microsoft.Extensions.Diagnostics.HealthChecks (the /health endpoint pattern), Azure Traffic Manager for geo-redundant failover, Polly for retry and circuit breaker patterns. If you're building anything serious, you want all three.
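
To make the "nines" concrete, here's a small sketch that turns an availability percentage into a yearly downtime budget. It assumes a 365-day year, and YearlyDowntime is just an illustrative helper name, not a library API.

```csharp
using System;

// Downtime budget per year for a given availability percentage.
// Assumes a 365-day year (8,760 hours).
static TimeSpan YearlyDowntime(double availabilityPercent) =>
    TimeSpan.FromHours(365 * 24 * (1 - availabilityPercent / 100.0));

foreach (var nines in new[] { 99.0, 99.9, 99.99, 99.999 })
{
    var downtime = YearlyDowntime(nines);
    Console.WriteLine($"{nines}% -> {downtime.TotalHours:F2} hours/year ({downtime.TotalMinutes:F1} minutes)");
}
```

Three nines works out to about 8.76 hours a year, four nines to roughly 52.6 minutes, and five nines to a little over 5 minutes.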

Reliability

Slightly different from availability. A system is reliable if it does the right thing even when stuff goes wrong — hardware fails, a deploy goes bad, someone fat-fingers a config. It's not just about being "up", it's about being correct under failure.

  • Deep dive: Reliability
  • In .NET: Polly retry + circuit-breaker policies, idempotent message handlers with MassTransit or NServiceBus, transactional outbox pattern in EF Core. The outbox pattern especially — if you're doing async messaging and not using it, you're probably losing messages and don't know it yet.

Single Point of Failure (SPOF)

Any component where if it dies, your whole system goes down. Classic ones in .NET apps:

  • Single SQL Server instance → fix with Always On Availability Groups
  • Single Redis node → fix with Redis Sentinel or Azure Cache clustering
  • Single API gateway → fix with multiple YARP instances behind Azure Load Balancer

The whole point of system design is basically eliminating these one by one.

Latency vs Throughput vs Bandwidth

People mix these up all the time, so let's be clear:

  • Latency: how long ONE request takes (e.g. 50ms for your API to respond)
  • Throughput: how many requests per second your system can handle (e.g. 10k RPS per Kestrel instance)
  • Bandwidth: the pipe size — maximum data you can push through the network (e.g. 10 Gbps)

You can have great throughput but terrible latency (batch processing), or great latency but low throughput (single-threaded server).

  • Deep dive: Latency vs Throughput
  • In .NET: Fun fact — ASP.NET Core on Kestrel has posted around 7M requests per second in the TechEmpower plaintext benchmark (a synthetic best case, not what a real API will do). Use Stopwatch or System.Diagnostics.Activity for latency measurement, OpenTelemetry .NET SDK for distributed tracing across services.

Consistent Hashing

Ok this one is really cool. Normal hashing (key % N) breaks horribly when you add or remove a server — basically ALL keys get remapped. Consistent hashing fixes this by putting servers on a virtual ring, so when you add/remove a server only ~K/N keys need to move (K = total keys, N = total servers).

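This is the backbone of distributed caches and databases. If you've ever wondered how Redis Cluster knows which node owns which key, it's a close cousin of consistent hashing: every key maps to one of 16,384 fixed hash slots (CRC16 of the key mod 16384), and slots are assigned to nodes, so only slots move when the cluster changes.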
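
Here's a minimal sketch of the ring idea, assuming SHA-256 truncated to 32 bits as the hash and 100 virtual nodes per server (both are my choices for illustration, not something any particular cache uses). The server names and the AddServer/RemoveServer/GetServer helpers are hypothetical. Lookup is a linear scan to keep it short; a real implementation would binary-search a sorted array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Ring position -> server name. Each server gets several virtual nodes for smoother balance.
var ring = new SortedDictionary<uint, string>();
const int VirtualNodes = 100;

static uint Hash(string value) =>
    BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(value)), 0);

void AddServer(string server)
{
    for (var i = 0; i < VirtualNodes; i++) ring[Hash($"{server}#{i}")] = server;
}

void RemoveServer(string server)
{
    for (var i = 0; i < VirtualNodes; i++) ring.Remove(Hash($"{server}#{i}"));
}

// Walk clockwise: the first virtual node at or after the key's hash owns the key.
// If we run off the end of the ring, wrap around to the first node.
string GetServer(string key)
{
    var h = Hash(key);
    foreach (var node in ring)
        if (node.Key >= h) return node.Value;
    return ring.First().Value;
}

foreach (var s in new[] { "server-a", "server-b", "server-c" }) AddServer(s);

var keys = Enumerable.Range(0, 10_000).Select(n => $"user:{n}").ToList();
var before = keys.ToDictionary(k => k, k => GetServer(k));

AddServer("server-d");
var moved = keys.Where(k => GetServer(k) != before[k]).ToList();

Console.WriteLine($"{moved.Count} of {keys.Count} keys moved after adding a fourth server");
```

With four servers you'd expect roughly a quarter of the keys to move, and every moved key lands on the new server. With key % N, nearly all of them would move.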

CAP Theorem

The famous "pick 2 out of 3" theorem for distributed systems:

  • Consistency: every read gets the most recent write
  • Availability: every request gets a response (no errors)
  • Partition Tolerance: system works even when network between nodes breaks

The thing people often miss — network partitions WILL happen, it's not optional. So really you're choosing between CP (consistent but might reject requests during partition) or AP (always responds but might give you stale data).

  • Deep dive: CAP Theorem
  • In .NET: SQL Server Always On with synchronous commit = CP. Azure Cosmos DB is interesting because it lets you choose on a sliding scale from Strong (CP) to Eventual (AP). Redis replication is AP by default — your read replicas might be slightly behind.

PACELC Theorem

Most people stop at CAP but there's actually a more useful extension called PACELC. It says:

  • If there's a Partition, choose between Availability and Consistency (same as CAP)
  • Else (no partition, normal operation), choose between Latency and Consistency

This is actually more practical because most of the time your system is NOT partitioned. So the real question is: during normal operations, do you want low latency or strong consistency? Cosmos DB's consistency levels are basically a PACELC slider.

Failover

When the primary server dies, something needs to take over. Two flavors:

  • Active-Passive: standby server is just sitting there waiting. When primary dies it kicks in. Simple, but there's a gap during switchover.

  • Active-Active: both servers handling traffic simultaneously. No downtime on failure but way more complex (you need to deal with data conflicts).

  • Deep dive: Failover

  • In .NET: SQL Server Always On Failover Cluster Instances, Azure App Service deployment slots (great for zero-downtime deploys), Azure Front Door for global failover.

Fault Tolerance

Goes beyond failover — can your system keep working (maybe at reduced capacity) when things partially break? Like if one microservice goes down, does the whole app crash or does it gracefully degrade?

  • Deep dive: Fault Tolerance
  • In .NET: This is where Polly really shines. Chain together retry → circuit breaker → timeout → fallback in a ResiliencePipeline. Combine with IHttpClientFactory resilience handlers. Your middleware pipeline in ASP.NET Core should have graceful degradation baked in from day one.

Idempotency (yes it's that important)

This comes up again under API Fundamentals, but it's worth calling out here because it's one of the most underrated concepts in system design. An operation is idempotent if doing it multiple times has the same effect as doing it once.

Why does this matter? Because in distributed systems, messages get delivered more than once. Networks retry. Users double-click buttons. Queues redeliver. If your operations aren't idempotent, you get duplicate orders, double charges, and other fun stuff.

Tip for interviews: whenever you design a write operation in a system design interview, mention idempotency. Interviewers love it because it shows you think about real-world failure modes.
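
Here's a minimal in-memory sketch of the idea, assuming the caller sends a unique key with each logical operation. ChargeOnce and the key format are made up for illustration, and the ConcurrentDictionary stands in for whatever durable store (Redis, a SQL table with a unique constraint) you'd use in production.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory store; a real system needs something durable and shared.
var processed = new ConcurrentDictionary<string, Lazy<string>>();
var chargesMade = 0;

// Runs the charge at most once per idempotency key; repeats get the stored result.
// Lazy<T> makes sure the side effect runs once even if two callers race on the same key.
string ChargeOnce(string idempotencyKey, decimal amount) =>
    processed.GetOrAdd(idempotencyKey, _ => new Lazy<string>(() =>
    {
        chargesMade++; // stands in for the real side effect (charging a card)
        return $"charged {amount:F2} (receipt {Guid.NewGuid():N})";
    })).Value;

var first = ChargeOnce("order-42", 19.99m);
var retry = ChargeOnce("order-42", 19.99m); // e.g. a network retry or a double-click

Console.WriteLine(first == retry); // True: the retry got the original receipt
Console.WriteLine(chargesMade);    // 1: the card was only charged once
```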


🌐 Networking Fundamentals

Every request from your Blazor frontend to your ASP.NET Core API traverses these layers. When something breaks, knowing where to look saves hours of debugging.

  • OSI Model — the 7-layer model. Your C# code lives at Layer 7 (Application). Kestrel is a Layer 7 HTTP server sitting on top of the operating system's TCP stack (Layer 4). Cloud load balancers can operate at Layer 4 (TCP, faster) or Layer 7 (HTTP, smarter routing).
  • IP Addresses — IPv4 (32-bit) and IPv6 (128-bit). In .NET use System.Net.IPAddress for parsing. Azure VNets use CIDR notation for subnets.
  • DNS — translates domain names to IPs. Here's a gotcha: a long-lived HttpClient keeps reusing its pooled connections, so it never notices when DNS records change. If your backend IPs change (common with cloud deployments), set SocketsHttpHandler.PooledConnectionLifetime so connections get recycled and DNS is re-resolved. Learned this one the hard way.
  • Proxy vs Reverse Proxy — forward proxy sits in front of clients (VPN, content filtering). Reverse proxy sits in front of servers (load balancing, SSL termination). YARP is Microsoft's reverse proxy for .NET — really performant and easy to configure.
  • HTTP/HTTPS — Kestrel serves HTTP/1.1 and HTTP/2 by default on HTTPS endpoints. HTTP/3 (QUIC) is fully supported from .NET 7. TLS can be handled by Kestrel directly or offloaded to Azure Application Gateway.
  • TCP vs UDP — TCP for reliable ordered delivery (HTTP, database connections). UDP for speed when you can tolerate some loss (DNS, video streaming, game servers). .NET has TcpClient/TcpListener and UdpClient in System.Net.Sockets.
  • Load Balancing — distributing traffic across servers. I've implemented 5 different algorithms in C# in this repo with detailed explanations for each one.
  • Checksums — verifying data integrity. .NET has System.IO.Hashing (XxHash, CRC32) and System.Security.Cryptography (SHA256, MD5).

🔌 API Fundamentals

Bad API design haunts you for years — once clients depend on it, changing it is painful. Get these fundamentals right from the start.

  • What is an API — the contract between systems. In .NET you've got options: Minimal APIs (lightweight), Controllers (traditional), gRPC services (high performance internal).
  • API Gateway — single entry point that handles routing, auth, rate limiting, aggregation. In .NET: YARP (my favorite, super flexible), Ocelot (popular but less maintained), Azure API Management (fully managed), or honestly just custom ASP.NET Core middleware for simpler cases.
  • REST vs GraphQL — REST is resource-based with multiple endpoints. GraphQL is one endpoint where the client specifies exactly what data it wants. For .NET: ASP.NET Core Web API for REST, Hot Chocolate or GraphQL.NET for GraphQL. Most teams should just use REST unless they have a really good reason for GraphQL.
  • WebSockets — full-duplex communication over a single TCP connection. In .NET, just use SignalR — it handles WebSockets with automatic fallbacks. For massive scale there's Azure SignalR Service. Only use raw System.Net.WebSockets if you really need a custom protocol.
  • Webhooks — server-to-server push notifications via HTTP POST. Your ASP.NET Core API receives webhook payloads from Stripe, GitHub, etc. Always validate HMAC signatures — don't just blindly trust incoming payloads.
  • Idempotency — making operations safe to retry. This is CRITICAL for payments and order creation. Pattern: accept an Idempotency-Key header, store it in Redis/SQL, check before processing. Sounds simple but most people skip it until they get bitten by duplicate charges.
  • Rate Limiting — protecting your API from abuse and accidental DDoS from misbehaving clients. ASP.NET Core 7+ has built-in rate limiting (Microsoft.AspNetCore.RateLimiting). I've implemented all 5 major algorithms in C# in this repo.
  • API Design — consistent naming, proper HTTP status codes, pagination, versioning. In .NET use [ApiController] attribute, ProblemDetails for errors (please don't return custom error formats), and Swagger via Swashbuckle or NSwag.
  • API Versioning — you WILL need to change your API. Version from day 1. In .NET: Asp.Versioning.Http package supports URL path (/v1/users), query string (?api-version=1.0), and header-based versioning. URL path is the most common and easiest to understand.
  • Pagination — never return unbounded lists. Three approaches: offset-based (?page=2&size=20), cursor-based (?after=abc123), keyset pagination. Cursor-based is best for large datasets because offset gets slow. In EF Core: .Skip(offset).Take(limit) for offset, or .Where(x => x.Id > lastId).Take(limit) for keyset.
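
Here's the keyset approach from the Pagination bullet as a runnable sketch. It uses LINQ to Objects over an in-memory list so it needs no database; the users list and NextPage helper are placeholders I made up. With EF Core the same Where/OrderBy/Take shape would run against a DbSet. Note the explicit OrderBy: keyset paging only works if the ordering is stable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a table of 95 rows.
var users = Enumerable.Range(1, 95).Select(id => (Id: id, Name: $"user{id}")).ToList();

// Keyset page: everything after the last Id the client saw, in Id order.
List<(int Id, string Name)> NextPage(int lastId, int limit) =>
    users.Where(u => u.Id > lastId).OrderBy(u => u.Id).Take(limit).ToList();

var cursor = 0;
var pages = 0;
while (true)
{
    var page = NextPage(cursor, 20);
    if (page.Count == 0) break;
    pages++;
    cursor = page[^1].Id; // the last Id on this page becomes the cursor for the next request
}

Console.WriteLine($"{pages} pages, last Id {cursor}"); // 5 pages, last Id 95
```

Unlike Skip(offset), the database never has to walk past the rows you've already seen, which is why this stays fast on page 10,000.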

🗄️ Database Fundamentals

Getting the database wrong is expensive — both in performance and in "oh god we need to migrate 500 million rows." These fundamentals prevent costly mistakes.

  • ACID Transactions — Atomicity, Consistency, Isolation, Durability. EF Core supports this via DbContext.Database.BeginTransactionAsync(). For distributed transactions across microservices, look into the Saga pattern (MassTransit and NServiceBus both support this). Don't try to do distributed transactions with TransactionScope across services — it doesn't work the way you think it does.
  • SQL vs NoSQL — SQL (SQL Server, PostgreSQL) for structured relational data with ACID guarantees. NoSQL (Cosmos DB, MongoDB, Redis) for flexible schemas and horizontal scaling. The honest answer is most apps should start with SQL and add NoSQL where needed. EF Core supports both SQL Server and Cosmos DB which is pretty nice.
  • Database Indexes — B-tree structures that make queries fast but writes slower. In EF Core: entity.HasIndex(e => e.Email).IsUnique(). Rule of thumb: index columns in your WHERE, JOIN, and ORDER BY clauses. But don't over-index — every index slows down writes and takes storage.
  • Database Sharding — splitting data across multiple databases by a shard key (user ID, tenant ID, etc). In .NET: Azure SQL Elastic Database tools, Cosmos DB (which does this automatically with partition keys), or you can roll your own with a DbContext factory that routes to the right shard. Only do this when you actually need it — premature sharding is a nightmare.
  • Data Replication — copying data to multiple nodes for availability and read scaling. Synchronous replication = strong consistency but higher latency. Async replication = eventual consistency but lower latency. In .NET: SQL Server Always On, Cosmos DB multi-region, Redis replication.
  • Database Scaling — vertical (bigger machine), horizontal (more machines via sharding + replication), read replicas (split reads and writes). In .NET you can use DbContextOptionsBuilder with connection string routing for read/write splitting.
  • Database Types — Relational (SQL Server, PostgreSQL), Document (Cosmos DB, MongoDB), Key-Value (Redis), Column-Family (Cassandra), Graph (Neo4j), Time-Series (InfluxDB), Search (Elasticsearch). All have .NET client libraries. Pick based on your access patterns, not hype.
  • Bloom Filters — probabilistic data structure. Tells you "definitely not in the set" or "probably in the set". Used to avoid expensive DB lookups. .NET doesn't have one built-in but there are NuGet packages like BloomFilter.NetCore.
  • Database Architectures — Active-Active (both nodes accept writes, need conflict resolution) vs Active-Passive (one writer, replicas for reads). Cosmos DB supports both.
  • N+1 Query Problem — the most common performance killer in .NET apps using EF Core. You load a list of orders, then for each order you load the customer — that's N+1 queries instead of 1. Fix: use .Include(o => o.Customer) for eager loading, or project with .Select(). If your app is slow, check for N+1 queries first. Seriously, it's almost always the answer.
  • Connection Pooling — opening a DB connection is expensive (~20-50ms). Connection pools keep connections open and reuse them. In .NET, ADO.NET pools connections by default. But be careful with DbContext lifetime — use AddDbContext (scoped) not singleton, or you'll exhaust the pool under load.
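
Going back to the Bloom Filters bullet: here's a hand-rolled sketch of how one works, in case you want to see the mechanics before reaching for a NuGet package. The sizes (65,536 bits, 4 hash functions) and the double-hashing trick on a SHA-256 digest are my assumptions for illustration; tune them for your expected item count and false-positive target.

```csharp
using System;
using System.Collections;
using System.Security.Cryptography;
using System.Text;

const int BitCount = 1 << 16; // size of the bit array (assumed value)
const int HashCount = 4;      // number of hash functions (assumed value)
var bits = new BitArray(BitCount);

// Derive k bit positions from one SHA-256 digest via double hashing: h1 + i * h2.
int[] Positions(string item)
{
    var digest = SHA256.HashData(Encoding.UTF8.GetBytes(item));
    var h1 = BitConverter.ToUInt32(digest, 0);
    var h2 = BitConverter.ToUInt32(digest, 4);
    var result = new int[HashCount];
    for (var i = 0; i < HashCount; i++)
        result[i] = (int)((h1 + (uint)i * h2) % BitCount);
    return result;
}

void Add(string item)
{
    foreach (var p in Positions(item)) bits[p] = true;
}

// false => definitely not in the set; true => probably in the set.
bool MightContain(string item)
{
    foreach (var p in Positions(item))
        if (!bits[p]) return false;
    return true;
}

Add("alice@example.com");
Console.WriteLine(MightContain("alice@example.com")); // True: added items are never missed
Console.WriteLine(MightContain("bob@example.com"));   // almost certainly False; a rare false positive is possible
```

The typical use is as a cheap pre-check: only go to the database when MightContain returns true.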

⚡ Caching Fundamentals

The single biggest performance win in most .NET apps. But stale data bugs are incredibly hard to track down, so understand the tradeoffs.

  • What is Caching — storing frequently accessed data in a faster layer. In .NET you've got: in-memory (IMemoryCache), distributed (IDistributedCache backed by Redis or SQL), HTTP response caching, and the newer output caching middleware.
  • Caching Strategies — Cache-Aside (most common in .NET, check cache first, miss → load from DB → store in cache). Write-Through (write to cache AND DB at the same time). Write-Behind (write to cache, async flush to DB — risky but fast). Read-Through (cache itself fetches from DB on miss).
  • Cache Eviction — LRU (least recently used), LFU (least frequently used), TTL (time based expiry). IMemoryCache only removes expired entries unless you set a SizeLimit; once that limit is hit, it compacts by priority and then by least recent use. In .NET: MemoryCacheEntryOptions gives you AbsoluteExpiration, SlidingExpiration, and Size limits. Always set a TTL — unbounded caches will eventually eat all your memory.
  • Distributed Caching — when your app runs on multiple servers, in-memory cache on each one gets out of sync. That's when you need Redis (StackExchange.Redis) or SQL-backed cache (Microsoft.Extensions.Caching.SqlServer). Azure Cache for Redis gives you managed Redis with clustering.
  • CDN — caches static content (images, JS, CSS) at edge locations worldwide. In .NET: Azure CDN or Azure Front Door. Set cache headers in ASP.NET Core with [ResponseCache(Duration = 3600)].
  • Cache Stampede / Thundering Herd — when a popular cache key expires, hundreds of requests simultaneously hit the database trying to repopulate it. Fixes: use lock-based refresh (only one request rebuilds the cache, others wait), or set a "soft TTL" where the cache is refreshed in the background before actual expiry. In .NET, use SemaphoreSlim or Redis distributed lock for this.
  • Cache Invalidation — Phil Karlton famously said "there are only two hard things in computer science: cache invalidation and naming things." He was right. Event-driven invalidation (publish a message when data changes) is better than TTL-only. In .NET: use MassTransit to publish a CacheInvalidated event when entities change.
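
Here's a sketch of the lock-based refresh from the Cache Stampede bullet, using SemaphoreSlim inside a single process. LoadFromDbAsync is a hypothetical stand-in for the real query, and the single global lock is a simplification: a production version would usually lock per key, and across multiple servers you'd need a distributed lock instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var cache = new ConcurrentDictionary<string, string>();
var rebuildLock = new SemaphoreSlim(1, 1);
var dbCalls = 0;

// Hypothetical slow query standing in for the real database call.
async Task<string> LoadFromDbAsync(string key)
{
    Interlocked.Increment(ref dbCalls);
    await Task.Delay(50);
    return $"value-for-{key}";
}

async Task<string> GetAsync(string key)
{
    if (cache.TryGetValue(key, out var hit)) return hit;

    await rebuildLock.WaitAsync();
    try
    {
        // Double-check: another request may have rebuilt the entry while we waited.
        if (cache.TryGetValue(key, out hit)) return hit;
        var value = await LoadFromDbAsync(key);
        cache[key] = value;
        return value;
    }
    finally
    {
        rebuildLock.Release();
    }
}

// 100 concurrent requests for the same cold key trigger a single database call.
var results = await Task.WhenAll(Enumerable.Range(0, 100).Select(_ => GetAsync("popular")));
Console.WriteLine($"requests: {results.Length}, db calls: {dbCalls}"); // requests: 100, db calls: 1
```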

🔄 Asynchronous Communication

If you're building microservices without a message bus, you're basically building a distributed monolith. Async patterns decouple services and make them resilient.

  • Pub/Sub — publishers send messages to a topic, multiple subscribers receive them independently. In .NET: Azure Service Bus Topics, RabbitMQ with MassTransit, Kafka with Confluent's .NET client, or Redis Pub/Sub for simpler use cases.
  • Message Queues — point-to-point, each message consumed by exactly one consumer. In .NET: Azure Service Bus Queues, RabbitMQ, Amazon SQS. Use MassTransit or NServiceBus as abstractions — they handle retry, dead-letter, serialization and a ton of other stuff you don't want to write yourself.
  • Change Data Capture (CDC) — capture row-level DB changes and publish them as events. In .NET: Debezium (via Kafka Connect) for SQL Server CDC events, EF Core interceptors for publishing domain events on SaveChanges. SQL Server has built-in CDC support which is handy.
  • Dead Letter Queues (DLQ) — where messages go when they fail processing after all retries are exhausted. ALWAYS set up dead letter queues. Without them, failed messages just vanish and you have no idea what went wrong. In .NET: MassTransit moves faulted messages to an _error queue automatically. Azure Service Bus has built-in dead-letter support. Monitor your DLQ — if messages are piling up, something's broken.
  • Exactly-Once vs At-Least-Once vs At-Most-Once — the three message delivery guarantees. At-most-once (fire and forget, might lose messages). At-least-once (guarantees delivery but might duplicate — most common, use idempotency to handle dupes). Exactly-once (theoretically impossible across network boundaries, but Kafka gets close with transactions). In practice, design for at-least-once + idempotency.
  • Backpressure — what happens when your consumer can't keep up with the producer? Without backpressure, messages pile up, memory fills, things crash. Solutions: bounded queues (reject new messages when full), rate limiting on producers, auto-scaling consumers. In .NET: System.Threading.Channels has BoundedChannelOptions for in-process backpressure.
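
Here's a small sketch of in-process backpressure with a bounded channel. The capacity of 3 and the fake 10ms of work are arbitrary values for the demo.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// A queue that holds at most 3 items. With FullMode.Wait, WriteAsync waits for room
// and TryWrite returns false when the queue is full.
var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(3)
{
    FullMode = BoundedChannelFullMode.Wait
});

var accepted = 0;
var rejected = 0;
for (var i = 0; i < 5; i++)
{
    if (channel.Writer.TryWrite(i)) accepted++;
    else rejected++; // the producer can shed load, retry later, or switch to WriteAsync
}
Console.WriteLine($"accepted: {accepted}, rejected: {rejected}"); // accepted: 3, rejected: 2

// A slow consumer drains the queue; each read frees a slot for the producer.
channel.Writer.Complete();
await foreach (var item in channel.Reader.ReadAllAsync())
{
    await Task.Delay(10); // simulated work
    Console.WriteLine($"processed {item}");
}
```

If you'd rather slow the producer down than drop work, await channel.Writer.WriteAsync(...) instead of calling TryWrite: it only completes once the consumer has made room.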

🧩 Distributed Systems and Microservices

The genuinely hard part of software engineering. Make sure you actually need microservices before going down this path — the complexity is real.

  • Heartbeats — periodic "I'm alive" signals between services. In .NET: ASP.NET Core health checks (/health endpoint), Kubernetes liveness/readiness probes, Azure App Service health monitoring.
  • Service Discovery — how services find each other without hardcoded URLs. In .NET: Kubernetes DNS, Azure Service Fabric naming service, Consul via Steeltoe, or Microsoft.Extensions.ServiceDiscovery in .NET Aspire (this one is really nice for local dev).
  • Consensus Algorithms — how distributed nodes agree on stuff (Paxos, Raft). You probably won't implement these yourself but understanding them helps you reason about why Cosmos DB consistency levels behave the way they do, or why etcd works the way it does.
  • Distributed Locking — mutual exclusion across services. In .NET: Redis locks via StackExchange.Redis (Redlock algorithm), Azure Blob lease-based locks, or SQL Server sp_getapplock. Read the Martin Kleppmann article though — distributed locking is harder than most people think and Redlock has some known issues.
  • Gossip Protocol — peer-to-peer info sharing where each node randomly gossips with neighbors. Used by Cassandra, Redis Cluster, and Orleans for membership. You probably won't implement this but it's good to know how your tools work internally.
  • Circuit Breaker — stops cascading failures by "tripping" when a downstream service is unhealthy. States: Closed (normal) → Open (all calls fail fast) → Half-Open (testing if service recovered). In .NET: Polly's circuit breaker strategy (CircuitBreakerStrategyOptions in v8) is THE answer. Plugs right into IHttpClientFactory.
  • Disaster Recovery — RPO (how much data can you afford to lose?) and RTO (how fast do you need to recover?). Know these numbers for your system. In .NET: Azure Site Recovery, geo-redundant storage, SQL Server backup/restore, Cosmos DB automatic backups.
  • Distributed Tracing — following a request as it hops across services. Without this, debugging microservices is basically impossible. In .NET: OpenTelemetry SDK (uses System.Diagnostics.Activity under the hood), Jaeger, Zipkin, Azure Application Insights, or .NET Aspire's dashboard for local dev.
  • Leader Election — in many distributed systems, one node needs to be the "leader" that coordinates work. If it dies, a new leader is elected. Used by Kafka (partition leaders), Kubernetes (controller manager), SQL Server Always On (primary replica). In .NET: you can implement this with Redis or Azure Blob leases, or use a library like DistributedLock.
  • Vector Clocks — a way to track causality in distributed systems. Each node maintains a counter, and events are tagged with these counters. Lets you detect conflicts and determine ordering without a central clock. Used by Amazon's original Dynamo design and by Riak. You probably won't implement this but it's good to understand when someone mentions "conflict resolution in distributed databases".
  • Split Brain — when a network partition makes two halves of a cluster think the other half is dead, and both halves try to be the leader. Classic problem in SQL Server Always On, Redis Sentinel, and Elasticsearch. Solutions: quorum-based decisions (need majority to elect leader), fencing tokens, witness nodes. This is why distributed systems are hard.
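
To make the Closed → Open → Half-Open cycle from the Circuit Breaker bullet concrete, here's a hand-rolled toy version. This is not how Polly implements it, just the state machine in its simplest form. The threshold of 3 failures, the 30-second break, and the manual clock variable (so the demo doesn't need to sleep) are all assumptions, and the sketch is not thread-safe.

```csharp
using System;

const int FailureThreshold = 3;
var breakDuration = TimeSpan.FromSeconds(30);

var failures = 0;
DateTime? openedAt = null;
var clock = DateTime.UtcNow; // hypothetical manual clock so the demo needs no sleeping

string State() =>
    openedAt is null ? "Closed"
    : clock - openedAt.Value < breakDuration ? "Open"
    : "HalfOpen";

T Call<T>(Func<T> action)
{
    if (State() == "Open")
        throw new InvalidOperationException("Circuit open: failing fast");

    try
    {
        var result = action();
        failures = 0;    // any success closes the circuit (including a Half-Open probe)
        openedAt = null;
        return result;
    }
    catch
    {
        // A failed Half-Open probe re-opens immediately; otherwise count toward the threshold.
        if (State() == "HalfOpen" || ++failures >= FailureThreshold)
            openedAt = clock;
        throw;
    }
}

for (var i = 0; i < 3; i++)
{
    try { Call<int>(() => throw new TimeoutException()); } catch (TimeoutException) { }
}
Console.WriteLine(State());          // Open
clock += TimeSpan.FromSeconds(31);   // pretend the break duration has passed
Console.WriteLine(State());          // HalfOpen
Console.WriteLine(Call(() => "ok")); // ok
Console.WriteLine(State());          // Closed
```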

🖇️ Architectural Patterns

No "best" pattern exists — it depends on your team size, scale requirements, and timeline. Choose deliberately.

  • Client-Server — the foundation. Blazor/Angular/React frontend talks to ASP.NET Core backend which talks to SQL Server. Most .NET apps are this and that's totally fine.
  • Microservices — each service owns its data and logic. .NET tooling is actually pretty great here: .NET Aspire for orchestration, Docker + K8s for deployment, MassTransit/NServiceBus for messaging, YARP for gateway. But seriously — start with a modular monolith. Extract microservices only when you have a real reason. I've seen too many teams go microservices-first and regret it.
  • Serverless — code runs in response to events, no server management. In .NET: Azure Functions (C#), AWS Lambda (.NET runtime), Azure Durable Functions for stateful workflows. Great for event processing, scheduled jobs, and low-traffic APIs where you don't want to pay for idle servers.
  • Event-Driven — services communicate by publishing and reacting to events. In .NET: MassTransit + RabbitMQ/Azure Service Bus, Wolverine (the new kid on the block, really good), EventStoreDB for event sourcing. This is the core pattern for CQRS implementations.
  • Peer-to-Peer — nodes are both clients and servers. Used in file sharing, blockchain, real-time collab. In .NET: SignalR for real-time P2P, Orleans for virtual actor model.
  • Modular Monolith — the architecture pattern I recommend most teams start with. It's a single deployable unit, but internally it's organized into well-defined modules with clear boundaries. When a module needs to become its own service, you extract it. Way easier than going microservices from day 1. In .NET: use separate class library projects per module, communicate via MediatR or in-process events, and enforce boundaries with internal access modifiers.
  • CQRS (Command Query Responsibility Segregation) — separate your read model from your write model. Writes go through commands (which validate, apply business rules, persist). Reads go through queries (which can be optimized separately, maybe with denormalized views or a read-only replica). In .NET: MediatR for in-process CQRS, MassTransit for distributed. Pairs naturally with event sourcing. Don't use CQRS everywhere though — it's overengineering for simple CRUD.
  • Event Sourcing — instead of storing current state, store the sequence of events that led to current state. Your "database" is an append-only event log. Current state is derived by replaying events. Powerful for audit trails, temporal queries, and debugging. In .NET: EventStoreDB, Marten (PostgreSQL-based), or roll your own with EF Core. Warning: adds significant complexity. Only use it when you actually need the event history.
  • Hexagonal Architecture (Ports & Adapters) — your business logic is at the center, completely independent of infrastructure. Database, HTTP, messaging — all just "adapters" plugged into "ports". Makes testing easy and infrastructure swappable. In .NET: define interfaces (ports) in your Domain/Application layer, implement them (adapters) in Infrastructure. This is essentially what Clean Architecture popularized.

🔮 Design Patterns for Distributed Systems

Learn these once and you'll have a toolbox for solving most distributed systems problems — both in interviews and production.

Saga Pattern

This is for when you need a "transaction" that spans multiple services but can't use a distributed transaction (and realistically you can't, they don't scale). Instead, each service does its local transaction and publishes an event. If something fails, you publish compensating events to undo previous steps.

Two types:

  • Choreography: services listen for events and react. Simple but hard to track the overall flow.
  • Orchestration: a central "saga orchestrator" tells each service what to do. Easier to understand and debug.

In .NET: MassTransit has built-in saga support with state machines. NServiceBus calls them "sagas" too. For orchestration, Azure Durable Functions is surprisingly good.
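
Here's the compensation idea as a tiny in-process orchestration sketch. A real saga coordinates separate services through messages and persists its state; this just shows the "undo completed steps in reverse" logic. The step names and the RunSaga helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

// Each step pairs a local transaction with the compensating action that undoes it.
var steps = new (string Name, Action Do, Action Undo)[]
{
    ("ReserveInventory", () => log.Add("inventory reserved"), () => log.Add("inventory released")),
    ("ChargePayment",    () => log.Add("payment charged"),    () => log.Add("payment refunded")),
    ("CreateShipment",   () => throw new InvalidOperationException("carrier unavailable"), () => { }),
};

bool RunSaga()
{
    var completed = new Stack<Action>();
    foreach (var step in steps)
    {
        try
        {
            step.Do();
            completed.Push(step.Undo);
        }
        catch (Exception ex)
        {
            log.Add($"{step.Name} failed: {ex.Message}");
            while (completed.Count > 0) completed.Pop()(); // compensate in reverse order
            return false;
        }
    }
    return true;
}

var succeeded = RunSaga();
Console.WriteLine($"saga succeeded: {succeeded}");
foreach (var line in log) Console.WriteLine(line);
```

The log ends with "payment refunded" followed by "inventory released": the payment is undone before the inventory reservation because compensation runs newest-first.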

Outbox Pattern

The problem: you need to save to the database AND publish a message, but you can't do both atomically (the database transaction doesn't cover the message broker). If the app crashes between the two, you lose the message.

Solution: save the message TO the database (in an "outbox" table) as part of the same transaction. A background process reads the outbox and publishes to the message broker.

In .NET: MassTransit has built-in outbox support with EF Core. Wolverine does too. If you're doing messaging without the outbox pattern, you're losing messages and probably don't realize it.

Strangler Fig Pattern

For migrating from a legacy monolith to microservices. Instead of rewriting everything (which almost always fails), you gradually "strangle" the old system. New features go to the new system. Old features get migrated one by one. A routing layer (YARP, API gateway) directs traffic to old or new based on the feature.

Named after strangler fig trees that grow around existing trees. It's the safest migration strategy and I've used it successfully multiple times.

Sidecar Pattern

Deploy a helper process alongside your main service that handles cross-cutting concerns (logging, monitoring, networking, security). The main service doesn't need to know about these concerns.

In .NET: Dapr (Distributed Application Runtime) is basically the sidecar pattern as a product. It handles service invocation, state management, pub/sub, and more — all through a sidecar process. Also: Envoy proxy as a sidecar for service mesh (Istio, Linkerd).

Ambassador Pattern

Similar to sidecar but specifically for handling outbound connections. The ambassador acts as a proxy between your service and external services, handling retry logic, circuit breaking, and monitoring.

In .NET: You could argue that Polly + IHttpClientFactory IS the ambassador pattern, just built into your process instead of a sidecar.

Bulkhead Pattern

Isolate components so that failure in one doesn't take down others. Named after the watertight compartments in ships — if one compartment floods, the ship doesn't sink.

In .NET: Use Polly's bulkhead (BulkheadPolicy in v7, the concurrency limiter in v8) to limit concurrent calls to a downstream service. Separate thread pools for different dependencies. In Kubernetes: resource limits per container. In Azure: separate App Service plans for critical vs non-critical services.
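
If you want to see what a bulkhead boils down to, here's a minimal sketch with SemaphoreSlim rather than Polly. The limit of 2 concurrent calls, the "rejected" return value, and the CallWithBulkheadAsync helper are assumptions for the demo; a real bulkhead would typically throw or queue a bounded number of callers.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var bulkhead = new SemaphoreSlim(2, 2); // at most 2 concurrent calls to this dependency

async Task<string> CallWithBulkheadAsync(Func<Task<string>> action)
{
    // A zero timeout means "don't queue": if both slots are busy, reject immediately.
    if (!await bulkhead.WaitAsync(0)) return "rejected";
    try { return await action(); }
    finally { bulkhead.Release(); }
}

// Five simultaneous calls to a slow dependency: two get through, three are shed.
var results = await Task.WhenAll(Enumerable.Range(0, 5).Select(_ =>
    CallWithBulkheadAsync(async () => { await Task.Delay(100); return "ok"; })));

Console.WriteLine(string.Join(", ", results)); // ok, ok, rejected, rejected, rejected
```

The point is that a slow dependency can only ever tie up two callers here, so the rest of your service keeps its threads and connections.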

Competing Consumers Pattern

Multiple consumers reading from the same queue, each processing different messages. This is how you scale message processing horizontally. The queue ensures each message goes to exactly one consumer.

In .NET: This is the default behavior when multiple instances of your MassTransit/NServiceBus consumer are running. Azure Service Bus and RabbitMQ both support this natively.

Retry with Exponential Backoff

When a call fails, retry it — but wait progressively longer between retries (1s, 2s, 4s, 8s...) with some randomness (jitter) to prevent thundering herd. Simple but critical.

In .NET: Polly's RetryStrategyOptions with BackoffType = DelayBackoffType.Exponential and UseJitter = true. Always add jitter. Without jitter, all your retrying clients hit the recovering server at the same time.
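
For a feel of what happens under the hood, here's a hand-written sketch (not Polly's implementation). It uses "full jitter", meaning each delay is a random value between zero and the exponential ceiling, which is one common flavor of jitter and my choice here. BackoffDelay and RetryAsync are hypothetical helper names, and the 10ms base delay is only there to keep the demo fast.

```csharp
using System;
using System.Threading.Tasks;

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelay,
// then randomized between 0 and that ceiling (full jitter).
static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
{
    var ceilingMs = Math.Min(maxDelay.TotalMilliseconds, baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
    return TimeSpan.FromMilliseconds(rng.NextDouble() * ceilingMs);
}

async Task<T> RetryAsync<T>(Func<Task<T>> action, int maxRetries)
{
    var rng = new Random();
    for (var attempt = 0; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception) when (attempt < maxRetries)
        {
            // Out of retries? The filter is false and the exception propagates to the caller.
            await Task.Delay(BackoffDelay(attempt, TimeSpan.FromMilliseconds(10), TimeSpan.FromSeconds(1), rng));
        }
    }
}

var calls = 0;
var outcome = await RetryAsync(async () =>
{
    calls++;
    await Task.Yield();
    if (calls < 3) throw new TimeoutException("transient");
    return "ok";
}, maxRetries: 5);

Console.WriteLine($"{outcome} after {calls} calls"); // ok after 3 calls
```

In real code you'd also only retry errors that are actually transient (timeouts, 503s), not things like validation failures.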


🔒 Security Fundamentals

Not a full security course, but these are the concepts that come up in every system design conversation and code review.

Authentication vs Authorization

  • Authentication (AuthN): who are you? (login, JWT, OAuth)
  • Authorization (AuthZ): what are you allowed to do? (roles, policies, permissions)

In .NET: ASP.NET Core Identity for user management, JWT bearer tokens for API auth, OAuth 2.0 / OpenID Connect with IdentityServer or Duende IdentityServer, Azure AD / Entra ID for enterprise. Use [Authorize] attribute + policy-based authorization.

OAuth 2.0 & OpenID Connect

OAuth 2.0 handles authorization (granting access to resources). OpenID Connect (OIDC) adds authentication on top (verifying identity). Together they power "Login with Google/GitHub/Microsoft".

In .NET: Microsoft.AspNetCore.Authentication.OpenIdConnect for server-side, Microsoft.Identity.Web for Azure AD integration. For building your own identity provider: Duende IdentityServer or OpenIddict.

JWT (JSON Web Tokens)

Self-contained tokens that carry claims (user ID, roles, permissions). The server doesn't need to look anything up — just verify the signature. But they can't be revoked until they expire, so keep expiry times short (15 mins) and use refresh tokens for long sessions.

Gotcha: JWTs are NOT encrypted by default. Anyone can decode and read the claims. Don't put sensitive data in them. They're only signed (proving they haven't been tampered with).
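A quick way to convince yourself of this: the sketch below builds a demo token by hand and then reads its claims back without any key. The claims, the fake signature, and the helper names are made up for illustration. A real token would come from your identity provider.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Base64url is Base64 with '-' and '_' instead of '+' and '/', and no '=' padding.
string Base64UrlEncode(byte[] bytes) =>
    Convert.ToBase64String(bytes).TrimEnd('=').Replace('+', '-').Replace('/', '_');

byte[] Base64UrlDecode(string text)
{
    string s = text.Replace('-', '+').Replace('_', '/');
    s = s.PadRight(s.Length + (4 - s.Length % 4) % 4, '=');
    return Convert.FromBase64String(s);
}

// Build a demo token. The claims and the signature are invented for this example.
string header  = Base64UrlEncode(Encoding.UTF8.GetBytes("{\"alg\":\"HS256\",\"typ\":\"JWT\"}"));
string payload = Base64UrlEncode(Encoding.UTF8.GetBytes("{\"sub\":\"user-123\",\"role\":\"admin\"}"));
string token   = $"{header}.{payload}.fake-signature";

// Reader side: no key needed, just split on '.' and decode the middle segment.
string decodedJson = Encoding.UTF8.GetString(Base64UrlDecode(token.Split('.')[1]));
using JsonDocument doc = JsonDocument.Parse(decodedJson);
string? role = doc.RootElement.GetProperty("role").GetString();

Console.WriteLine(decodedJson);      // the raw claims JSON
Console.WriteLine($"role = {role}"); // role = admin
```

Note that nothing here checks the signature. Reading claims needs no key; trusting them requires validating the signature.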

API Security Basics

  • Always use HTTPS (Kestrel handles this)
  • Validate and sanitize ALL input (FluentValidation, [ApiController] auto-validation)
  • Use parameterized queries (EF Core does this by default, but watch for raw SQL)
  • Rate limit your APIs (built-in middleware in ASP.NET Core 7+)
  • Implement CORS properly (don't just allow * in production)
  • Use API keys for service-to-service, JWTs for user-facing
  • Log authentication failures (Azure Application Insights)
  • Never expose stack traces in production (app.UseExceptionHandler())

Encryption at Rest and in Transit

  • In transit: TLS/HTTPS for all communications (Kestrel, Azure App Gateway)
  • At rest: SQL Server Transparent Data Encryption (TDE), Azure Storage encryption, Cosmos DB encryption. All enabled by default in Azure managed services.
  • Application-level encryption: use System.Security.Cryptography for encrypting specific fields (PII, credit cards). Consider Azure Key Vault for key management.

Zero Trust Architecture

"Never trust, always verify." Even internal service-to-service calls should be authenticated and authorized. No more "it's behind the firewall so it's safe." In .NET: mutual TLS (mTLS) between services, service mesh (Istio/Linkerd), Azure Private Endpoints, managed identities for Azure resource access.


📊 Observability & Monitoring

You can't fix what you can't see. Three pillars: logs, metrics, and traces. Without them, you're flying blind.

The Three Pillars

Logs — what happened? Individual event records with timestamps.

  • In .NET: ILogger<T> + Serilog or NLog. Structured logging is key — use _logger.LogInformation("Order {OrderId} created by {UserId}", orderId, userId) not string concatenation. Ship to Azure Application Insights, Seq, Elasticsearch, or Datadog.

Metrics — how much? Numerical measurements over time. Request rate, error rate, latency percentiles, CPU usage, queue depth.

  • In .NET: System.Diagnostics.Metrics (built in since .NET 6, with IMeterFactory and built-in ASP.NET Core metrics added in .NET 8), OpenTelemetry .NET SDK, Prometheus via prometheus-net. Key metrics to track: request rate, error rate, p50/p95/p99 latency, saturation (CPU, memory, connections).

Traces — how did it flow? Following a single request across multiple services.

  • In .NET: System.Diagnostics.Activity (the underlying primitive), OpenTelemetry SDK for instrumentation, Jaeger or Zipkin for visualization, Azure Application Insights for managed solution, .NET Aspire dashboard for local dev.
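As a rough illustration of the metrics pillar, the sketch below records a counter with System.Diagnostics.Metrics and reads it back with a MeterListener. The meter name Demo.Orders, the instrument name, and the tags are assumptions for the example. In a real service an exporter such as OpenTelemetry would do the listening.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, chosen for this example.
using var meter = new Meter("Demo.Orders");
Counter<long> ordersCreated = meter.CreateCounter<long>("orders.created");

// A MeterListener stands in for the exporter so the example runs on its own.
long total = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments from our own meter.
    if (instrument.Meter.Name == "Demo.Orders") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => total += value);
listener.Start();

// Record measurements with a dimension (tag) attached.
ordersCreated.Add(1, new KeyValuePair<string, object?>("region", "eu"));
ordersCreated.Add(2, new KeyValuePair<string, object?>("region", "us"));

Console.WriteLine($"orders.created total = {total}"); // orders.created total = 3
```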

The Four Golden Signals (from Google SRE)

  1. Latency — how long requests take (track p50, p95, p99, not averages)
  2. Traffic — how many requests per second
  3. Errors — what percentage of requests are failing (5xx rate)
  4. Saturation — how "full" your system is (CPU, memory, disk, connection pools)

If you only monitor four things, monitor these. Everything else is secondary.

Health Checks in .NET

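// In your Program.cs
// The Add* checks below come from the community AspNetCore.HealthChecks.* packages.
builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, tags: new[] { "ready" })          // checks DB connectivity
    .AddRedis(redisConnectionString, tags: new[] { "ready" })         // checks Redis
    .AddAzureServiceBusTopic(sbConn, topic, tags: new[] { "ready" })  // checks message broker
    .AddUrlGroup(new Uri("https://dependency.api/health"), tags: new[] { "ready" }); // checks downstream API

app.MapHealthChecks("/health"); // runs every registered check
// Readiness: only checks tagged "ready" (if nothing is tagged, this predicate matches no checks)
app.MapHealthChecks("/health/ready", new() { Predicate = check => check.Tags.Contains("ready") });
// Liveness: runs no checks, so it only proves the app can respond
app.MapHealthChecks("/health/live", new() { Predicate = _ => false });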

Alerting Rules of Thumb

  • Alert on symptoms (high error rate), not causes (high CPU). High CPU might be fine if error rate is normal.
  • Use percentiles not averages for latency. P99 at 5s means 1 in 100 users waits 5 seconds — the average might be 200ms.
  • Set up alerts for your SLOs (service level objectives), not arbitrary thresholds.
  • Don't alert on things that don't need human intervention. Alert fatigue is real and dangerous.
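The percentile point is easy to check with numbers. The sketch below uses a made-up sample of 100 requests and a hypothetical nearest-rank Percentile helper. Real systems usually compute percentiles from histograms rather than raw samples.

```csharp
using System;
using System.Linq;

// Nearest-rank percentile over an already sorted sample (hypothetical helper).
double Percentile(double[] sortedValues, double percentile)
{
    int rank = (int)Math.Ceiling(percentile * sortedValues.Length / 100.0);
    return sortedValues[Math.Max(rank - 1, 0)];
}

// Made-up sample: 98 fast requests at 100 ms and 2 slow ones at 5,000 ms.
double[] latenciesMs = Enumerable.Repeat(100.0, 98)
    .Concat(Enumerable.Repeat(5000.0, 2))
    .ToArray();
Array.Sort(latenciesMs);

double average = latenciesMs.Average();
double p50 = Percentile(latenciesMs, 50);
double p99 = Percentile(latenciesMs, 99);

Console.WriteLine($"avg = {average} ms, p50 = {p50} ms, p99 = {p99} ms");
// avg = 198 ms, p50 = 100 ms, p99 = 5000 ms
```

The average looks healthy at about 200 ms, while p99 shows that some users wait 5 seconds. That gap is why the alert should be on the percentile.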

🚀 DevOps & Deployment Patterns

Getting code to production safely is a system design problem in itself. These patterns minimize risk.

Deployment Strategies

  • Rolling Update: replace instances one at a time. Old and new versions run simultaneously during the rollout. Default in Kubernetes. Risk: API compatibility between old and new version.
  • Blue-Green Deployment: run two identical environments (blue = current, green = new). Switch traffic from blue to green once verified. Instant rollback by switching back. In Azure: App Service deployment slots.
  • Canary Deployment: route a small percentage of traffic (1-5%) to the new version. Monitor for errors. Gradually increase if healthy. In .NET: YARP with weighted routing, Azure Front Door traffic splitting.
  • Feature Flags: deploy code to production but don't enable it for all users. Toggle features on/off without deploying. In .NET: Microsoft.FeatureManagement package, LaunchDarkly, Azure App Configuration.

CI/CD Essentials

  • Build and test on every PR (GitHub Actions, Azure DevOps Pipelines)
  • Run integration tests against a real database (not just unit tests with mocks)
  • Automated deployment to staging → smoke tests → production
  • Always be able to roll back. If you can't roll back quickly, your deployment process is broken.

Infrastructure as Code

Define your infrastructure in code so it's reproducible, versioned, and reviewable:

  • Bicep / ARM Templates — Azure-native IaC. Bicep is much more readable than ARM.
  • Terraform — cloud-agnostic, works with Azure, AWS, GCP
  • Pulumi — IaC using real programming languages including C#. Nice for .NET teams because you can use the same language.

Containerization

Docker + Kubernetes is the standard for deploying microservices:

  • Multi-stage Dockerfiles keep images small
  • Use .dockerignore to exclude unnecessary files
  • Never run as root in containers
  • Health check endpoints for K8s liveness/readiness probes
  • Resource limits to prevent noisy neighbor problems

In .NET: dotnet publish has built-in container support since .NET 8. No Dockerfile needed for simple cases.


⚖️ System Design Tradeoffs

No "right answer" exists in system design — only tradeoffs you can articulate. Interviewers want to hear WHY you'd choose one approach over another, not just what you chose.

  • Top 15 Tradeoffs — great overview, worth bookmarking
  • Vertical vs Horizontal Scaling — vertical = bigger Azure VM (easy but has a ceiling). Horizontal = more instances behind a load balancer (complex but unlimited). ASP.NET Core is built for horizontal scaling — just keep your services stateless.
  • Concurrency vs Parallelism — concurrency = multiple tasks making progress (async/await, Task.WhenAll). Parallelism = multiple tasks running simultaneously (Parallel.ForEachAsync, PLINQ). In ASP.NET Core: concurrency for I/O-bound work, parallelism for CPU-bound work.
  • Long Polling vs WebSockets — long polling holds the connection open waiting for data. WebSockets are full-duplex persistent connections. In .NET just use SignalR — it automatically negotiates the best transport.
  • Batch vs Stream Processing — batch = process accumulated data periodically (Azure Data Factory, Hangfire). Stream = process as it arrives (Kafka, Azure Stream Analytics, System.Threading.Channels).
  • Stateful vs Stateless — stateless = each request is independent (ASP.NET Core APIs, Azure Functions). Stateful = server remembers client state (SignalR connections, Orleans grains). Stateless is easier to scale, stateful has simpler logic.
  • Strong vs Eventual Consistency — strong = reads always return latest write. Eventual = reads might be slightly stale. In .NET: SQL Server sync commit = strong. Cosmos DB lets you pick. Redis replicas = eventual.
  • Read-Through vs Write-Through Cache — read-through: cache fetches from DB on miss. Write-through: writes update both cache and DB. Write-behind: cache first, async DB flush. Implement with IDistributedCache + custom middleware.
  • Push vs Pull — push = server sends to client (SignalR, webhooks). Pull = client requests data (REST polling). SignalR is .NET's go-to push technology.
  • REST vs RPC — REST is resource-oriented (HTTP verbs, JSON). gRPC is action-oriented (binary Protobuf, strongly typed). Use REST for public APIs, gRPC for internal service-to-service calls (binary Protobuf is typically several times faster to serialize than JSON, and the payloads are smaller).
  • Sync vs Async Communication — sync = HTTP req/response, caller waits. Async = message queue, caller moves on. Use async for non-critical paths and cross-service communication.
  • Latency vs Throughput — optimizing for one often hurts the other. Batching helps throughput but increases per-request latency. Caching reduces latency but adds complexity.
  • Simplicity vs Flexibility — the one nobody talks about. Every abstraction layer, every configuration option, every "what if we need to change this later" adds complexity. Most of the time the simpler solution is better even if it's less flexible. You can always refactor later when you actually need it. YAGNI (You Ain't Gonna Need It) is the most underrated principle in software engineering.
  • Consistency vs Performance — strong consistency requires coordination (locks, consensus, synchronous replication) which adds latency. Eventual consistency is faster but your code needs to handle stale reads. Pick based on your domain — financial transactions need strong consistency, social media feeds are fine with eventual.
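The concurrency vs parallelism distinction from the list above can be sketched in a few lines. FetchAsync and IsPrime are hypothetical stand-ins: Task.Delay simulates an I/O call, and a naive prime check simulates CPU-bound work.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// I/O-bound: simulated with Task.Delay, which holds no thread while waiting.
async Task<int> FetchAsync(int id)
{
    await Task.Delay(50);
    return id * 2;
}

// Concurrency: start all three calls, then await them together.
// Total time is roughly one delay, not three, and results keep the input order.
int[] fetched = await Task.WhenAll(FetchAsync(1), FetchAsync(2), FetchAsync(3));

// CPU-bound: a deliberately naive prime check.
bool IsPrime(int n)
{
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

// Parallelism: spread the CPU work across cores.
int primeCount = 0;
await Parallel.ForEachAsync(Enumerable.Range(1, 100_000), (n, _) =>
{
    if (IsPrime(n)) Interlocked.Increment(ref primeCount); // thread-safe counter
    return ValueTask.CompletedTask;
});

Console.WriteLine($"fetched = [{string.Join(", ", fetched)}], primes up to 100,000 = {primeCount}");
// fetched = [2, 4, 6], primes up to 100,000 = 9592
```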

📐 Back-of-Envelope Estimation

"How much storage does this need?" "How many servers?" Interviewers love estimation questions. Memorize these numbers and you'll nail them every time.

Numbers Every Developer Should Know

┌──────────────────────────────────────────────────────────────┐
│              LATENCY NUMBERS (APPROXIMATE)                   │
├──────────────────────────────────────────────────────────────┤
│ L1 cache reference ..................... 0.5 ns               │
│ L2 cache reference ..................... 7 ns                 │
│ Main memory (RAM) reference ........... 100 ns               │
│ SSD random read ....................... 150 μs (~150,000 ns)  │
│ HDD random read ....................... 10 ms                 │
│ Send 1KB over 1 Gbps network ......... 10 μs                 │
│ Read 1 MB sequentially from RAM ....... 250 μs               │
│ Read 1 MB sequentially from SSD ....... 1 ms                 │
│ Read 1 MB sequentially from HDD ....... 20 ms                │
│ Round trip within same datacenter ..... 0.5 ms               │
│ Round trip US East → US West .......... 40 ms                │
│ Round trip US → Europe ................ 80 ms                 │
│ Round trip US → Australia ............. 180 ms               │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    STORAGE MATH                              │
├──────────────────────────────────────────────────────────────┤
│ 1 char (ASCII) = 1 byte                                     │
│ 1 char (Unicode) = 2-4 bytes                                │
│ Average tweet/message = ~200 bytes                           │
│ Average URL = ~100 bytes                                     │
│ Average photo (compressed) = ~200 KB                         │
│ Average short video clip = ~5 MB                             │
│ Average user profile row = ~1 KB                             │
│                                                              │
│ 1 KB = 1,000 bytes (use 10^3 for quick math)                │
│ 1 MB = 1,000 KB = 10^6 bytes                                │
│ 1 GB = 1,000 MB = 10^9 bytes                                │
│ 1 TB = 1,000 GB = 10^12 bytes                               │
│ 1 PB = 1,000 TB = 10^15 bytes                               │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    TIME MATH                                 │
├──────────────────────────────────────────────────────────────┤
│ Seconds in a day ........... 86,400  (~10^5 or ~100K)        │
│ Seconds in a month ......... 2.6M    (~2.5 × 10^6)          │
│ Seconds in a year .......... 31.5M   (~3 × 10^7)            │
│ Requests per day → per sec . divide by 100K                  │
│                                                              │
│ Quick trick: 1M requests/day ≈ 12 requests/second            │
│              10M req/day ≈ 120 req/sec                        │
│              100M req/day ≈ 1,200 req/sec                     │
└──────────────────────────────────────────────────────────────┘

Estimation Tips

  1. Round aggressively. Use powers of 10. Nobody cares if it's 86,400 or 100,000 seconds in a day — just use 100K for quick math.
  2. State your assumptions. "I'll assume each user makes 10 requests per day on average." Interviewers want to see your thinking, not exact numbers.
  3. Work from the top down. Start with total users → daily active users → requests per day → requests per second → storage per request → total storage.
  4. Account for peaks. Average traffic is nice but you need to handle 2-5x spikes. Black Friday, viral content, etc.
  5. Show your work. Write it down. The calculation itself is more important than the result.

Example: Estimate storage for a URL shortener

  • 100M new URLs per month
  • Each URL: ~100 bytes (short URL) + ~500 bytes (long URL + metadata) = ~600 bytes
  • Monthly storage: 100M × 600 bytes = 60 GB/month
  • 5 year storage: 60 GB × 60 months = 3.6 TB
  • With replication (3x): ~10.8 TB → budget ~12 TB to leave some headroom

See? Not that scary. Practice a few of these and you'll get fast at it.
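If you want to sanity-check the arithmetic, the same estimate fits in a few lines of C#. The inputs are the ones from the URL shortener example. The average write rate at the end is an extra derived figure, assuming a 30-day month.

```csharp
using System;

// Inputs from the URL shortener example; decimal units (1 GB = 10^9 bytes).
const long urlsPerMonth = 100_000_000;
const long bytesPerUrl = 100 + 500;   // short URL + long URL and metadata
const int months = 5 * 12;
const int replicationFactor = 3;

double monthlyGb = urlsPerMonth * bytesPerUrl / 1e9;
double fiveYearTb = monthlyGb * months / 1e3;
double replicatedTb = fiveYearTb * replicationFactor;

// Average write rate, assuming a 30-day month (86,400 seconds per day).
double writesPerSecond = urlsPerMonth / (30.0 * 86_400);

Console.WriteLine($"{monthlyGb:F0} GB/month, {fiveYearTb:F1} TB over 5 years, {replicatedTb:F1} TB replicated");
Console.WriteLine($"about {writesPerSecond:F0} new URLs per second on average");
```

Roughly 40 writes per second is a small number, which tells you the write path is unlikely to be the bottleneck. Storage growth and read traffic deserve the attention.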


🎯 System Design Interview Tips & Tricks

Having been on both sides of the interview table, here's what actually separates pass from fail.

The Framework (use this for every problem)

Step 1: Clarify Requirements (3-5 mins) Don't jump into designing. Ask questions first:

  • What are the core features? (don't design everything, just the important stuff)
  • What scale are we designing for? (100 users or 100M users?)
  • What are the non-functional requirements? (latency, availability, consistency)
  • Any constraints? (budget, existing tech stack, timeline)

Step 2: Estimate Scale (3-5 mins) Back of envelope math. How many users, requests per second, storage needed. This informs whether you need sharding, caching, CDN, etc.

Step 3: High-Level Design (10-15 mins) Draw the big boxes: client, load balancer, API servers, database, cache, message queue. Explain the data flow. Keep it simple first.

Step 4: Deep Dive (10-15 mins) The interviewer will pick a component to dig into. Be ready to go deep on:

  • Database schema design
  • API endpoint design
  • How caching works
  • How you handle failure scenarios
  • How you'd scale specific bottlenecks

Step 5: Wrap Up (3-5 mins) Discuss tradeoffs you made, what you'd improve with more time, monitoring strategy.

Tips That Actually Help

  1. Think out loud. The interview is about your thought process, not the final answer. Silence is the worst thing you can do.

  2. Start simple, add complexity. Design for a single server first. Then scale it. Interviewers want to see you iterate, not drop a perfect architecture from the sky.

  3. Always mention tradeoffs. "I chose SQL here because we need ACID for payments, but the tradeoff is it won't scale horizontally as easily as NoSQL." This is what separates senior from junior.

  4. Don't memorize architectures, learn patterns. Every URL shortener article shows the same design. Interviewers know this. What impresses them is when you explain WHY each component exists.

  5. Know your numbers. How long does a disk seek take? How much data can you store in Redis? How many connections can a SQL Server handle? See the estimation section above.

  6. Mention failure modes. "What happens if Redis goes down?" "What if there's a network partition?" Proactively addressing failures shows maturity.

  7. Talk about monitoring. "I'd add a dashboard tracking p99 latency, error rate, and queue depth." Shows you think about operations, not just building.

  8. As a .NET developer, it's totally fine to reference specific .NET tools and libraries. "I'd use SignalR for real-time notifications" or "YARP for the API gateway" — it shows you can actually build what you're designing, not just draw boxes.

Common Mistakes in Interviews

  • Jumping straight to the solution without clarifying requirements
  • Over-engineering (adding Kafka when a simple queue would do)
  • Under-engineering (ignoring scale when the problem clearly needs it)
  • Not discussing tradeoffs
  • Only designing the happy path (what happens when things fail?)
  • Getting stuck on one component for too long
  • Not drawing diagrams (always draw, even if your drawing is terrible)

⚠️ Common Mistakes & Anti-Patterns

Mistakes I've seen (and made) in production. Learn from other people's pain — it's cheaper.

Architecture Anti-Patterns

Distributed Monolith — you split into microservices but they're all tightly coupled, share a database, and must be deployed together. You got all the complexity of microservices with none of the benefits. Fix: each service owns its data, communicates via async messaging.

Premature Microservices — going microservices before you even have product-market fit. You're changing things so fast that the service boundaries keep shifting. Fix: start with a modular monolith, extract microservices when you have stable domain boundaries.

Shared Database — multiple services reading/writing to the same database tables. Changes to the schema require coordinating across all teams. This is the #1 killer of microservices architectures. Fix: database per service, sync via events.

God Service — one service that does everything. It's the new monolith, just hosted in a container. Usually called something vague like "CoreService" or "PlatformService". Fix: identify bounded contexts, split along business capability lines.

Chatty Services — a single user action requires 15 service-to-service calls. Latency adds up, reliability drops (each call can fail). Fix: redesign service boundaries so common operations are within a single service, or use the BFF (Backend For Frontend) pattern to aggregate.

Database Anti-Patterns

N+1 Queries — already mentioned but saying it again because it's SO common. Load list of parents, then for each parent query children. In .NET: always use .Include() or .Select() projection. Enable EF Core logging to see generated SQL.

Missing Indexes — slow queries that could be fast with an index. Check your query plans. In EF Core: use entity.HasIndex() in your OnModelCreating. For existing databases: look at SQL Server's missing index DMVs.

Unbounded Queries — SELECT * FROM Orders with no pagination, filtering, or limit. One day your table has 50 million rows and the query takes 30 seconds. Fix: always paginate, always set a limit, project only needed columns.
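One way to bound a query is keyset pagination: ask for the next page after the last key you saw. The sketch below uses an in-memory list as a stand-in for an Orders table, so the tuple shape, GetPage, and the page size are assumptions. With EF Core entities, the same Where + OrderBy + Take shape is what you would write against the DbSet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for an Orders table with 1,000 rows.
List<(int Id, decimal Total)> orders = Enumerable.Range(1, 1000)
    .Select(i => (Id: i, Total: i * 1.5m))
    .ToList();

// Keyset pagination: filter past the last seen Id, order by the key, cap the row count.
List<(int Id, decimal Total)> GetPage(int afterId, int pageSize) =>
    orders.Where(o => o.Id > afterId)
          .OrderBy(o => o.Id)
          .Take(pageSize)
          .ToList();

var page1 = GetPage(afterId: 0, pageSize: 50);
var page2 = GetPage(afterId: page1[^1].Id, pageSize: 50);

Console.WriteLine($"page 1: {page1[0].Id}..{page1[^1].Id}, page 2: {page2[0].Id}..{page2[^1].Id}");
// page 1: 1..50, page 2: 51..100
```

Keyset pagination stays fast on deep pages because the database can seek on the indexed key. Skip/Take offset paging is simpler but has to walk past every skipped row.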

Storing Files in Database — putting images, PDFs, videos in SQL Server BLOB columns. Your database backup is now 500 GB and queries are slow. Fix: store files in Azure Blob Storage / S3, store only the URL in the database.

Caching Anti-Patterns

Cache Everything — caching data that's rarely accessed or changes frequently. The cache hit rate is low and stale data causes bugs. Fix: cache hot data with short TTLs, measure your hit rate.

No Expiry — cached data lives forever. Works great until the underlying data changes and your users see stale data for hours. Fix: always set TTL. Use event-driven invalidation for critical data.

Cache Avalanche — all your cache keys expire at the same time (because they were all set with the same TTL at the same time). Massive DB spike. Fix: add random jitter to TTLs (baseExpiry + Random(0, 60 seconds)).
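The jitter fix is only a couple of lines. In this sketch WithJitter is a hypothetical helper, and the 10-minute base TTL and key names are illustrative assumptions. The 60-second jitter window matches the formula above.

```csharp
using System;

// Adds up to maxJitter of random extra lifetime so keys written together
// do not all expire together.
TimeSpan WithJitter(TimeSpan baseExpiry, TimeSpan maxJitter, Random rng) =>
    baseExpiry + TimeSpan.FromSeconds(rng.NextDouble() * maxJitter.TotalSeconds);

var random = new Random();
for (int i = 0; i < 3; i++)
{
    TimeSpan ttl = WithJitter(TimeSpan.FromMinutes(10), TimeSpan.FromSeconds(60), random);
    Console.WriteLine($"product:{i} expires in {ttl.TotalSeconds:F0} s"); // somewhere in 600..660
}
```

You would pass the resulting TimeSpan wherever you set the entry's expiry, for example as the absolute expiration relative to now on a cache entry.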

Messaging Anti-Patterns

Fire and Forget — publishing messages without confirming delivery. Messages silently get lost. Fix: use publisher confirms (RabbitMQ) or durable messaging with the outbox pattern.

No Dead Letter Queue — failed messages just disappear. You have no idea why processing is failing. Fix: always configure DLQs and monitor them.

Huge Messages — putting entire file contents or large payloads in messages. Bloats the queue, slow to serialize/deserialize. Fix: put the data in blob storage, put a reference (URL/ID) in the message. This is called the "claim check" pattern.


🗺️ .NET Architecture Decision Guide

Opinionated decision trees for the most common "what should I use?" questions. Every situation is different, but these are solid defaults.

When to Use What Database

Need ACID + complex queries + relationships?
  → SQL Server or PostgreSQL with EF Core

Need document storage with flexible schema?
  → Azure Cosmos DB (or MongoDB)

Need a cache / session store / simple key-value?
  → Redis (via StackExchange.Redis)

Need full-text search?
  → Elasticsearch (via NEST/Elastic.Clients.Elasticsearch)

Need time-series data (metrics, IoT)?
  → InfluxDB or Azure Data Explorer

Need graph relationships (social networks, recommendations)?
  → Neo4j (via Neo4j.Driver)

Not sure?
  → Start with PostgreSQL. It handles 95% of use cases.

When to Use What Communication Pattern

Client → Server (request/response)?
  → REST (ASP.NET Core Web API) for external
  → gRPC for internal service-to-service (faster)

Real-time server → client push?
  → SignalR

Service → Service async (fire and forget)?
  → Message queue (RabbitMQ/Azure Service Bus via MassTransit)

Service → Multiple services (fan-out)?
  → Pub/Sub (Azure Service Bus Topics, Kafka)

Long-running workflows?
  → Azure Durable Functions or MassTransit Sagas

Scheduled jobs?
  → Hangfire or Azure Functions with timer trigger

When to Use What Architecture

Small team (1-5 devs), single product?
  → Modular monolith. Don't overthink it.

Growing team (5-15 devs), multiple product areas?
  → Modular monolith with well-defined module boundaries.
    Extract the first microservice when a module needs
    independent scaling or a different deployment cadence.

Large org (15+ devs), multiple teams?
  → Microservices with clear ownership boundaries.
    Each team owns 1-3 services end-to-end.
    Use .NET Aspire for orchestration.

Event processing / data pipeline?
  → Azure Functions or dedicated worker services.

Need massive scale with complex state?
  → Microsoft Orleans (virtual actor model).

When to Add Caching

Database queries taking >100ms that are read-heavy?
  → Add Redis cache with cache-aside pattern.

Same API response requested thousands of times per second?
  → Add output caching middleware in ASP.NET Core.

Static assets (images, CSS, JS)?
  → CDN (Azure Front Door / Azure CDN).

User sessions?
  → Redis via IDistributedCache.

Expensive computation results?
  → IMemoryCache for single-server, Redis for multi-server.

Not sure if you need caching?
  → You probably don't yet. Optimize queries first.
    Add caching only when you have measurable evidence.

🏗️ Data Structures You Should Know

You don't need to implement these from scratch, but knowing what they are, when they're used, and their tradeoffs gives you a real edge in interviews.

  • Hash Table — O(1) average-case lookup. The backbone of caches, database indexes, and... basically everything. In .NET: Dictionary<TKey, TValue>, ConcurrentDictionary<TKey, TValue> for thread safety.
  • B-Tree / B+ Tree — balanced tree used for database indexes. Optimized for disk access (minimizes I/O by keeping many keys per node). SQL Server and PostgreSQL both use B+ trees for their indexes. Understanding this helps you understand why index order matters.
  • LSM-Tree (Log-Structured Merge Tree) — write-optimized data structure. Writes go to an in-memory table, then flush to sorted disk files, which periodically merge. Used by Cassandra, RocksDB, LevelDB. This is why NoSQL databases have great write performance.
  • Skip List — probabilistic sorted data structure with O(log n) operations. Used by Redis for sorted sets (ZSET). When you do ZADD and ZRANGEBYSCORE in Redis, you're using a skip list.
  • Bloom Filter — already covered above. Probabilistic set membership. "Definitely not" or "maybe yes". Great for avoiding expensive lookups.
  • Trie (Prefix Tree) — tree where each node is a character. Used for autocomplete, spell checking, and IP routing tables. Comes up in "Design Autocomplete" interviews.
  • Consistent Hash Ring — covered in core concepts. Used for distributed caching and database sharding.
  • Merkle Tree — hash tree where each node is the hash of its children. Used by Git, blockchain, and Cassandra/DynamoDB for data synchronization (anti-entropy). Efficiently detects differences between two datasets.
  • Quadtree / Geohash — spatial data structures for geographic queries. "Find all restaurants within 5km." Used by Uber, Google Maps, Yelp. Comes up in "Design Uber" and "Design Yelp" interviews. In .NET: NetTopologySuite for spatial operations, SQL Server spatial indexes.
  • HyperLogLog — probabilistic data structure for counting unique elements. Uses tiny memory (~12KB) to estimate cardinality of billions of items with ~0.81% error. Redis has built-in HyperLogLog commands. Used for counting unique visitors, unique IPs, etc.
  • Min-Heap / Priority Queue — for getting the minimum element efficiently. Used in job schedulers, rate limiters, and "top K" problems. In .NET: PriorityQueue<TElement, TPriority> (built-in since .NET 6).
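As one example from the list above, here is a minimal "top K" sketch using the built-in PriorityQueue as a min-heap. TopK is a hypothetical helper name, and the input numbers are arbitrary.

```csharp
using System;
using System.Collections.Generic;

// Keeps the k largest values using a min-heap of size k: O(n log k) time, O(k) memory.
List<int> TopK(IEnumerable<int> values, int k)
{
    var heap = new PriorityQueue<int, int>(); // element and priority are the same value
    foreach (int v in values)
    {
        heap.Enqueue(v, v);
        if (heap.Count > k) heap.Dequeue(); // evict the current smallest
    }

    var result = new List<int>();
    while (heap.Count > 0) result.Add(heap.Dequeue()); // comes out in ascending order
    result.Reverse();                                  // largest first
    return result;
}

List<int> top3 = TopK(new[] { 5, 1, 9, 3, 7, 8, 2 }, 3);
Console.WriteLine(string.Join(", ", top3)); // 9, 8, 7
```

The heap never grows past k items, which is why this works on streams too large to sort in memory.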

⚡ Performance & Optimization

Practical tips for making .NET apps fast in production.

The Golden Rule

Measure first, optimize second. Don't guess where the bottleneck is. Use BenchmarkDotNet for micro-benchmarks, dotnet-trace and dotnet-counters for runtime analysis, and Application Insights for production profiling.

ASP.NET Core Performance Tips

  • Use async/await everywhere for I/O operations. A synchronous DB call blocks the thread pool thread. In ASP.NET Core, a blocked thread means one less request you can handle concurrently.
  • Minimize allocations. Use Span<T>, ReadOnlySpan<T>, ArrayPool<T>.Shared, and string.Create() for hot paths. The GC is good but not free. Every allocation creates work for the garbage collector.
  • Use System.Text.Json instead of Newtonsoft.Json for serialization. It's faster and allocates less. Source generators ([JsonSerializable]) make it even faster by avoiding reflection.
  • Connection pooling. ADO.NET pools database connections by default. HttpClient pools HTTP connections. Don't create new instances per request — use IHttpClientFactory and DI-injected DbContext.
  • Response compression. Enable gzip/brotli compression middleware for API responses. Saves bandwidth, especially for JSON payloads. builder.Services.AddResponseCompression().
  • Output caching. New in .NET 7. Caches entire HTTP responses server-side. Way more efficient than re-executing the handler for every request. [OutputCache(Duration = 60)].
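To illustrate the "minimize allocations" tip from the list above, this sketch sums comma-separated integers by slicing a ReadOnlySpan<char> instead of calling string.Split. SumCsv is a hypothetical helper and assumes well-formed input with no spaces or empty fields.

```csharp
using System;

// Sums comma-separated integers without allocating substrings:
// each number is parsed directly from a slice of the original string.
int SumCsv(ReadOnlySpan<char> input)
{
    int sum = 0;
    while (!input.IsEmpty)
    {
        int comma = input.IndexOf(',');
        ReadOnlySpan<char> token = comma < 0 ? input : input[..comma];
        sum += int.Parse(token);
        // Move past the comma, or finish if this was the last token.
        input = comma < 0 ? ReadOnlySpan<char>.Empty : input[(comma + 1)..];
    }
    return sum;
}

Console.WriteLine(SumCsv("10,20,30,40")); // 100
```

The string.Split version allocates an array plus one string per field. Whether that matters depends on how hot the path is, so benchmark before rewriting code this way.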

Database Performance Tips

  • Use .AsNoTracking() for read-only EF Core queries. Change tracking adds overhead you don't need for queries that won't update data.
  • Project only needed columns. .Select(u => new { u.Id, u.Name }) instead of loading entire entities. Less data transferred, less memory, faster queries.
  • Batch operations. EF Core 7+ supports ExecuteUpdateAsync() and ExecuteDeleteAsync() for bulk operations without loading entities into memory.
  • Use compiled queries for hot paths that execute the same query repeatedly. EF.CompileAsyncQuery(...) eliminates query compilation overhead.
  • Check your query plans. Just because EF Core generates SQL doesn't mean it's good SQL. Use SQL Server Management Studio or EXPLAIN ANALYZE in PostgreSQL to check.

Caching Performance Tips

  • L1 + L2 cache pattern. IMemoryCache (L1, in-process, fastest) + Redis (L2, distributed, still fast). Check L1 first, then L2, then database. Reduces Redis roundtrips for hot data.
  • Cache serialization matters. If you're caching complex objects in Redis, the serialization/deserialization cost adds up. Consider MessagePack or protobuf instead of JSON for cached values.
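The L1 + L2 lookup order can be sketched with two dictionaries standing in for the real caches. Everything here is an assumption made so the example runs on its own: in a real app L1 would be IMemoryCache, L2 would be Redis via IDistributedCache, and both would have expiry. LoadFromDatabaseAsync is a hypothetical placeholder.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Stand-ins for the two cache levels (no expiry, for brevity).
var l1 = new ConcurrentDictionary<string, string>(); // in-process
var l2 = new ConcurrentDictionary<string, string>(); // pretend this is Redis
int databaseHits = 0;

Task<string> LoadFromDatabaseAsync(string key)
{
    databaseHits++;
    return Task.FromResult($"value-for-{key}");
}

async Task<string> GetAsync(string key)
{
    if (l1.TryGetValue(key, out string? fromL1)) return fromL1; // fastest path
    if (l2.TryGetValue(key, out string? fromL2))                // a network hop in real life
    {
        l1[key] = fromL2;                                       // promote to L1
        return fromL2;
    }
    string value = await LoadFromDatabaseAsync(key);            // slowest path
    l2[key] = value;
    l1[key] = value;
    return value;
}

await GetAsync("user:42"); // miss in both levels, loads from the database
l1.Clear();                // simulate an app restart: L1 is gone, L2 survives
await GetAsync("user:42"); // served from L2 and promoted back to L1
await GetAsync("user:42"); // served from L1

Console.WriteLine($"database hits: {databaseHits}"); // database hits: 1
```

The hard part this sketch skips is invalidation: when the data changes, every instance's L1 copy has to be evicted or allowed to expire quickly.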

🧪 Testing Distributed Systems

Unit tests alone won't cut it for distributed systems. Here's a practical testing strategy that actually catches production bugs.

Testing Pyramid for Distributed Systems

Unit Tests — test individual components in isolation. Mock external dependencies. Fast, cheap, run thousands of them. Use xUnit + Moq/NSubstitute in .NET.

Integration Tests — test components talking to real dependencies (real database, real Redis). Use WebApplicationFactory<T> for in-memory ASP.NET Core server, Testcontainers for spinning up real Docker instances of SQL Server/Redis/RabbitMQ.

Contract Tests — verify that service A's expectations about service B's API match reality. Prevents "it works on my machine" across services. Use Pact in .NET.

End-to-End Tests — test the full flow across all services. Expensive, slow, flaky. Keep these to a minimum — only for critical user journeys (signup, checkout, payment).

Chaos Engineering

Deliberately break things in production (or staging) to find weaknesses before they find you. Netflix pioneered this with Chaos Monkey.

In .NET:

  • Kill random service instances and verify the system recovers
  • Add artificial latency to downstream calls (Polly's chaos latency strategy, AddChaosLatency, which came from the Simmy project)
  • Simulate network partitions
  • Fill up disk space, exhaust connection pools
  • Azure Chaos Studio is a managed chaos engineering service

Testing Tips

  • Testcontainers is amazing. Spin up real SQL Server, Redis, RabbitMQ in Docker for integration tests. No more "works on my machine" or maintaining shared test databases.
  • Don't mock everything. Mocking the database gives you false confidence. Integration tests with a real DB catch schema mismatches, missing indexes, and incorrect SQL.
  • Test failure scenarios specifically. What happens when the database is down? When Redis is unreachable? When the message broker rejects a publish? These are the scenarios that cause production incidents.

🔄 Migration Strategies

Changing a live system without breaking it is one of the hardest problems in engineering. These strategies minimize risk.

Database Migrations

  • EF Core Migrations — dotnet ef migrations add creates migration files. Apply with dotnet ef database update. For production: generate SQL scripts with dotnet ef migrations script and review before applying. Never run dotnet ef database update directly in production.
  • Zero-downtime schema changes — add new column (nullable), deploy code that writes to both old and new, backfill old data, deploy code that reads from new, drop old column. Never rename or remove columns in a single step.
  • Expand and Contract pattern — expand the schema (add new), migrate data, contract the schema (remove old). Three deployments minimum for breaking schema changes.

Service Migrations

  • Strangler Fig (covered in patterns) — gradually migrate from old to new behind a routing layer.
  • Parallel Run — run old and new systems simultaneously, compare results. When the new system produces the same results, switch over. Good for critical systems (payments, financial calculations).
  • Feature Flags — gate new functionality behind flags. Roll out to 1% → 10% → 50% → 100%. Roll back by flipping the flag. In .NET: Microsoft.FeatureManagement.
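A percentage rollout needs each user to land in the same bucket on every request, otherwise the feature flickers on and off. The sketch below shows one way to do that with a stable hash. It is not the Microsoft.FeatureManagement API; Bucket, IsEnabled, the feature name, and the user IDs are all hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Maps a user to a stable bucket in [0, 100). string.GetHashCode() is randomized
// per process in .NET, so a cryptographic hash is used to keep buckets stable.
int Bucket(string featureName, string userId)
{
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{featureName}:{userId}"));
    return (int)(BitConverter.ToUInt32(hash, 0) % 100);
}

// A user is in the rollout if their bucket falls below the current percentage.
bool IsEnabled(string featureName, string userId, int rolloutPercent) =>
    Bucket(featureName, userId) < rolloutPercent;

int enabledAt10 = 0;
for (int i = 0; i < 10_000; i++)
    if (IsEnabled("new-checkout", $"user-{i}", 10)) enabledAt10++;

Console.WriteLine($"{enabledAt10} of 10000 users enabled at 10%");
```

Because the bucket is fixed per user, raising the percentage from 10 to 50 only adds users. Nobody who already had the feature loses it mid-rollout.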

Data Migrations

  • ETL (Extract, Transform, Load) — for bulk data migration between systems. Azure Data Factory for cloud-scale ETL.
  • CDC (Change Data Capture) — keep two systems in sync during migration. Old system writes, CDC captures changes, new system applies them. Debezium + Kafka is the standard approach.
  • Dual Writes — write to both old and new system during migration. Simpler than CDC but risky (what if one write fails?). Use the outbox pattern if you go this route.

💻 C# Implementations

All code is written in C# targeting .NET 8+ with a proper solution file. Each file has detailed XML docs explaining the algorithm, tradeoffs, and .NET ecosystem usage. Every algorithm includes a runnable demo.

Quick Start

# Clone and run
git clone https://github.com/mihaduldev/System-Design-Overview.git
cd System-Design-Overview
dotnet run --project implementations/csharp

# Or open SystemDesign.sln in Visual Studio / Rider / VS Code

The interactive menu lets you run any algorithm demo individually or all at once.

Consistent Hashing

| Algorithm | What it does | Code |
|---|---|---|
| Consistent Hash Ring | Virtual node-based ring with MD5. Minimal key remapping when servers join/leave. | ConsistentHashing.cs |
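
For a feel of how the ring works, here's a minimal sketch. It is not the repo's ConsistentHashing.cs. The HashRing class name, the default of 100 virtual nodes, and the server#i naming scheme are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical consistent hash ring with virtual nodes, using MD5 for placement.
public sealed class HashRing
{
    private readonly SortedList<uint, string> _ring = new();
    private readonly int _virtualNodes;

    public HashRing(int virtualNodes = 100) => _virtualNodes = virtualNodes;

    // Each server is placed on the ring many times to even out the distribution.
    public void AddServer(string server)
    {
        for (int i = 0; i < _virtualNodes; i++)
            _ring[Hash($"{server}#{i}")] = server;
    }

    public void RemoveServer(string server)
    {
        for (int i = 0; i < _virtualNodes; i++)
        {
            uint position = Hash($"{server}#{i}");
            // Only remove positions this server still owns (guards against hash collisions).
            if (_ring.TryGetValue(position, out var owner) && owner == server)
                _ring.Remove(position);
        }
    }

    // Walk clockwise: the first virtual node at or after the key's hash owns the key.
    public string GetServer(string key)
    {
        if (_ring.Count == 0) throw new InvalidOperationException("Ring is empty.");
        uint hash = Hash(key);
        IList<uint> positions = _ring.Keys;
        int lo = 0, hi = positions.Count;
        while (lo < hi) // binary search for the first position >= hash
        {
            int mid = lo + (hi - lo) / 2;
            if (positions[mid] < hash) lo = mid + 1; else hi = mid;
        }
        return _ring.Values[lo == positions.Count ? 0 : lo]; // wrap around past the end
    }

    // First 4 bytes of the MD5 digest as an unsigned ring position.
    private static uint Hash(string value)
    {
        byte[] digest = MD5.HashData(Encoding.UTF8.GetBytes(value));
        return BitConverter.ToUInt32(digest, 0);
    }
}
```

When a fourth server joins a three-server ring, the only keys that move are the ones the new server takes over. Everything else stays put, which is the whole reason to use this over hash % serverCount.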

Load Balancing Algorithms

Five different strategies. Which one you pick depends on your situation. There's no universally "best" one.

| Algorithm | What it does | Code |
|---|---|---|
| Round Robin | Simplest. Cycles through servers sequentially. | RoundRobin.cs |
| Weighted Round Robin | Same but servers with higher weights get more traffic. Good for mixed hardware. | WeightedRoundRobin.cs |
| IP Hash | Client IP determines server. Same client always hits same server (sticky sessions). | IpHash.cs |
| Least Connections | Routes to server with fewest active connections. Adapts to real-time load. | LeastConnections.cs |
| Least Response Time | Routes to fastest server. Best for heterogeneous environments. | LeastResponseTime.cs |
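
Two of these are small enough to sketch here. This is not the code from the repo files listed above. RoundRobinBalancer and LeastConnectionsBalancer are hypothetical names, and servers are plain strings to keep things short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical round robin: hands out servers in order, safe for concurrent callers.
public sealed class RoundRobinBalancer
{
    private readonly string[] _servers;
    private int _next = -1;

    public RoundRobinBalancer(IEnumerable<string> servers)
    {
        _servers = servers.ToArray();
        if (_servers.Length == 0) throw new ArgumentException("At least one server is required.");
    }

    public string Next()
    {
        // Interlocked keeps the counter correct under concurrency; the uint cast
        // keeps the index non-negative after the int counter overflows.
        uint ticket = unchecked((uint)Interlocked.Increment(ref _next));
        return _servers[ticket % (uint)_servers.Length];
    }
}

// Hypothetical least connections: tracks in-flight requests per server.
public sealed class LeastConnectionsBalancer
{
    private readonly Dictionary<string, int> _active;
    private readonly object _gate = new();

    public LeastConnectionsBalancer(IEnumerable<string> servers) =>
        _active = servers.ToDictionary(s => s, _ => 0);

    // Pick the server with the fewest active connections and count the new one.
    public string Acquire()
    {
        lock (_gate)
        {
            string server = _active.MinBy(kv => kv.Value).Key;
            _active[server]++;
            return server;
        }
    }

    // Call when the request finishes so the server becomes attractive again.
    public void Release(string server)
    {
        lock (_gate)
        {
            if (_active[server] > 0) _active[server]--;
        }
    }
}
```

Round robin needs no feedback from the servers, which is why it's the simplest. Least connections needs the Acquire/Release bookkeeping, but that's what lets it steer traffic away from a server stuck on slow requests.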

Rate Limiting Algorithms

Five approaches. ASP.NET Core 7+ has three of these built-in, which is pretty great.

| Algorithm | What it does | ASP.NET Core Built-in? | Code |
|---|---|---|---|
| Fixed Window Counter | Simple counter per time window. Has boundary spike issue. | AddFixedWindowLimiter | FixedWindowCounter.cs |
| Sliding Window Log | Tracks every timestamp. Most accurate but memory hungry. | Custom IRateLimiterPolicy | SlidingWindowLog.cs |
| Sliding Window Counter | Weighted estimate across windows. Best balance of accuracy vs memory. | AddSlidingWindowLimiter | SlidingWindowCounter.cs |
| Token Bucket | Tokens refill over time. Allows controlled bursts. | AddTokenBucketLimiter | TokenBucket.cs |
| Leaky Bucket | Queue based, constant output rate. Smooths bursty traffic. | Nope, roll your own | LeakyBucket.cs |

Quick Comparison

┌──────────────────┬──────────┬──────────┬───────────────┐
│ Algorithm        │ Memory   │ Accuracy │ Burst Control │
├──────────────────┼──────────┼──────────┼───────────────┤
│ Fixed Window     │ O(1)     │ Low      │ Poor (edges)  │
│ Sliding Log      │ O(N)     │ Exact    │ Excellent     │
│ Sliding Counter  │ O(1)     │ High     │ Good          │
│ Token Bucket     │ O(1)     │ High     │ Controlled    │
│ Leaky Bucket     │ O(N)     │ High     │ Smoothed      │
└──────────────────┴──────────┴──────────┴───────────────┘
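
To show why token bucket gets O(1) memory and "controlled" bursts, here's a minimal sketch. It is neither the repo's TokenBucket.cs nor the ASP.NET Core limiter. The TokenBucket class below and its injectable clock parameter are assumptions for illustration (the clock is there so tests can control time).

```csharp
using System;

// Hypothetical token bucket: two numbers of state, refilled lazily on each call.
public sealed class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new();
    private double _tokens;
    private DateTimeOffset _lastRefill;

    public TokenBucket(int capacity, double refillPerSecond, Func<DateTimeOffset>? clock = null)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
        _tokens = capacity;          // start full so an initial burst is allowed
        _lastRefill = _clock();
    }

    public bool TryAcquire(int permits = 1)
    {
        lock (_gate)
        {
            // Add the tokens earned since the last call, capped at capacity.
            DateTimeOffset now = _clock();
            double elapsedSeconds = (now - _lastRefill).TotalSeconds;
            if (elapsedSeconds > 0)
            {
                _tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _refillPerSecond);
                _lastRefill = now;
            }

            if (_tokens < permits) return false; // rate limited
            _tokens -= permits;
            return true;
        }
    }
}
```

The capacity sets the largest burst and the refill rate sets the sustained throughput. Those two knobs are the "controlled" part in the comparison table.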


💻 System Design Interview Problems

Practice these. Start with easy ones and work your way up. For each problem try to think about it with .NET in mind — what libraries would you use, what Azure services, how would you structure the ASP.NET Core solution.

Easy

Medium

Hard


🏢 Real-World Architecture Case Studies

Real architectures from companies operating at massive scale. Each case study maps to common interview problems.

How Netflix Works (simplified)

  • Client → CDN (Open Connect) for video streaming
  • Client → API Gateway (Zuul) → microservices for browse, search, recommendations
  • Microservices communicate via async messaging (Kafka)
  • Data: Cassandra for user data (AP, eventual consistency), MySQL for billing (CP, strong consistency)
  • Caching: EVCache (memcached-based) for hot data
  • Chaos Monkey randomly kills instances to test resilience
  • Lesson: different data stores for different needs, async communication everywhere, test failure constantly

How Uber Handles Location (simplified)

  • Riders and drivers send GPS updates every few seconds
  • Location data goes to a geospatial index (Google S2 cells in the early dispatch system; Uber later built its own hexagonal grid, H3)
  • Matching: find available drivers near the rider by querying the cells around the rider's location
  • ETA calculation: precomputed routing graphs + real-time traffic data
  • All communication is async — the "requesting a ride" flow goes through a state machine saga
  • Lesson: geospatial indexing is crucial, real-time systems need smart data structures (not just SQL queries)

How WhatsApp Scales Messaging

  • Each user maintains a persistent connection to the server (a customized XMPP-derived protocol on mobile clients)
  • Messages stored temporarily until delivered, then deleted (not stored forever)
  • Erlang/BEAM VM for massive concurrent connections (~2M connections per server)
  • Messages are end-to-end encrypted, so the server can't read them
  • Lesson: connection management at scale is a real problem, temporary storage reduces costs, encryption adds complexity but is non-negotiable for messaging

How Stripe Processes Payments

  • Idempotency keys on every request (prevents double charges)
  • Request lifecycle: validate → authorize → capture → settle
  • Every state change is an event stored in an event log (event sourcing-ish)
  • Strong consistency for financial operations (can't have eventual consistency for money)
  • API versioning from day 1 (old versions supported for years)
  • Lesson: idempotency is not optional for payments, event sourcing makes audit trails easy, API versioning must be a first-class concern
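
The idempotency key idea is easy to sketch. This is an illustration of the pattern only, not Stripe's implementation. IdempotentChargeHandler and ChargeResult are hypothetical, and the in-memory dictionary stands in for what would be a durable store in a real system.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed record ChargeResult(string ChargeId, decimal Amount);

// Hypothetical handler: the same idempotency key always yields the same result.
public sealed class IdempotentChargeHandler
{
    private readonly ConcurrentDictionary<string, Lazy<ChargeResult>> _results = new();
    private int _chargesExecuted;

    // How many times the underlying charge really ran.
    public int ChargesExecuted => _chargesExecuted;

    public ChargeResult Charge(string idempotencyKey, decimal amount)
    {
        // GetOrAdd stores one Lazy per key; Lazy makes sure the charge runs once
        // even when two retries with the same key arrive at the same moment.
        var entry = _results.GetOrAdd(
            idempotencyKey,
            _ => new Lazy<ChargeResult>(() => Execute(amount)));
        return entry.Value;
    }

    // Stand-in for the real call to a payment processor.
    private ChargeResult Execute(decimal amount)
    {
        Interlocked.Increment(ref _chargesExecuted);
        return new ChargeResult(Guid.NewGuid().ToString("N"), amount);
    }
}
```

A client that times out can retry with the same key and get the original result back instead of a second charge. A production version would also need to persist the keys, expire them, and decide what to do when a key is reused with different parameters.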

🧰 .NET-Specific Resources

Courses

Books

Essential .NET Libraries

These are the libraries you'll actually use when building distributed systems in .NET:

| Library | What it's for |
|---|---|
| YARP | Reverse proxy / API gateway. Microsoft's own, stupidly fast. |
| Polly | Resilience: retry, circuit breaker, timeout, fallback. Non-negotiable for microservices. |
| MassTransit | Message bus abstraction over RabbitMQ, Azure Service Bus, Kafka. Saves you so much boilerplate. |
| StackExchange.Redis | Redis client. Used by like everyone in the .NET ecosystem. |
| OpenTelemetry .NET | Distributed tracing, metrics, logging. The standard going forward. |
| EF Core | ORM. Supports SQL Server, PostgreSQL, Cosmos DB, SQLite. You know this one. |
| SignalR | Real-time WebSocket communication. Built into ASP.NET Core. |
| MediatR | In-process mediator, commonly used for CQRS. Lightweight and simple. |
| .NET Aspire | Cloud-native orchestration, service defaults, local dev dashboard. The new hotness. |
| Wolverine | Next-gen messaging + mediator. Think MassTransit meets MediatR. Worth watching. |
| Hangfire | Background job processing. Dashboard included. |
| FluentValidation | Request validation. Way better than data annotations for complex rules. |
| Testcontainers | Spin up real Docker containers (SQL, Redis, RabbitMQ) for integration tests. Game changer. |
| BenchmarkDotNet | Micro-benchmarking framework. For when you need to know exactly how fast something is. |
| Serilog | Structured logging. Sinks for everything (console, file, Seq, Elasticsearch, App Insights). |
| Mapster or AutoMapper | Object mapping. Mapster is newer and faster. AutoMapper is more established. |
| Refit | Type-safe REST client. Define an interface, Refit generates the implementation. Cleaner than raw HttpClient. |

YouTube Channels

  • Nick Chapsas — .NET deep dives, performance stuff. Probably the best .NET YouTuber right now.
  • Raw Coding — ASP.NET Core internals, authentication deep dives
  • Milan Jovanovic — Clean Architecture, DDD, CQRS in .NET. Very practical.
  • ByteByteGo — system design with great visuals. Not .NET specific but the concepts apply.
  • CodeOpinion — distributed systems, messaging, architecture. Focuses on .NET examples.
  • dotnet — official Microsoft channel. Standup recordings, .NET Conf, etc.
  • Gaurav Sen — system design fundamentals. Great for interview prep.
  • System Design Interview — exactly what it sounds like
  • Alex Hyett — software architecture and system design, explains complex topics simply
  • Hussein Nasser — backend engineering deep dives. Great for understanding protocols and networking.
  • ArjanCodes — software design principles. Python-focused but concepts are universal.

Newsletters


📜 Must-Read Engineering Articles

Real production systems handling billions of requests. Not theoretical — these describe actual engineering decisions and their consequences.


🗞️ Must-Read Distributed Systems Papers

The foundational papers that invented modern distributed systems. Dense but rewarding — they explain why the tools you use daily work the way they do.

  • Paxos: The Part-Time Parliament — the foundational consensus algorithm. Hard to read (Lamport wrote it as a story about a Greek parliament, which is either genius or annoying depending on your mood).
  • Raft: In Search of an Understandable Consensus Algorithm — designed as an understandable alternative to Paxos. Used by etcd (Kubernetes), Consul, CockroachDB. Way easier to grok than Paxos.
  • MapReduce — Google's parallel processing framework. The basis for Hadoop, Spark, and basically all batch processing.
  • The Google File System — distributed file system. Foundation for HDFS and cloud storage services.
  • Dynamo — Amazon's key-value store. Eventual consistency, consistent hashing, vector clocks. Influenced DynamoDB, Cassandra, Riak.
  • Kafka — the paper behind Kafka. Essential if you're doing event-driven .NET with MassTransit or Confluent.Kafka.
  • Spanner — Google's globally distributed database with TrueTime. This paper influenced Cosmos DB's consistency model.
  • Bigtable — column-family storage. Influenced HBase, Cassandra, Azure Table Storage.
  • ZooKeeper — distributed coordination. Used by Kafka for cluster management until KRaft mode replaced it.
  • LSM-Tree — the data structure behind Cassandra, LevelDB, RocksDB. Explains why NoSQL databases have such good write performance.
  • Chubby — Google's distributed lock service. Basically the predecessor to ZooKeeper and etcd.
  • Amazon Aurora — how AWS rebuilt MySQL for the cloud. Separating storage from compute, quorum-based replication. Really interesting architecture.
  • CRDTs: Conflict-free Replicated Data Types — data structures that can be merged without coordination. Used for real-time collaboration (Google Docs, Figma). Comes up in "Design Google Docs" interviews.
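
If "merged without coordination" sounds abstract, the simplest CRDT makes it click. Below is a sketch of a grow-only counter (G-Counter). The GCounter class is a hypothetical illustration and assumes each replica only ever increments its own slot.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical G-Counter: one slot per replica, merged by taking the max per slot.
public sealed class GCounter
{
    private readonly Dictionary<string, long> _counts = new();

    // A replica only ever increases its own slot.
    public void Increment(string replicaId, long amount = 1)
    {
        if (amount < 0) throw new ArgumentOutOfRangeException(nameof(amount));
        _counts[replicaId] = _counts.GetValueOrDefault(replicaId) + amount;
    }

    // The counter's value is the sum of all slots.
    public long Value => _counts.Values.Sum();

    // Max per slot is commutative, associative, and idempotent, so replicas
    // can merge in any order, any number of times, and still converge.
    public void Merge(GCounter other)
    {
        foreach (var (replica, count) in other._counts)
            _counts[replica] = Math.Max(_counts.GetValueOrDefault(replica), count);
    }
}
```

Two replicas can count likes offline, sync later in either direction, and end up with the same total. No locks, no leader.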

📋 Quick Reference Cheat Sheets

Quick-reference tables for interview prep and design sessions. Bookmark this section.

HTTP Status Codes You Should Know

200 OK              — success
201 Created         — resource created (POST)
204 No Content      — success, nothing to return (DELETE)
301 Moved Permanently — permanent redirect (SEO)
304 Not Modified    — use cached version
400 Bad Request     — client sent invalid data
401 Unauthorized    — not authenticated (need to login)
403 Forbidden       — authenticated but not authorized
404 Not Found       — resource doesn't exist
409 Conflict        — request conflicts with current state
422 Unprocessable   — validation error (use this for business rules)
429 Too Many Reqs   — rate limited
500 Internal Error  — server broke (never expose details!)
502 Bad Gateway     — reverse proxy couldn't reach backend
503 Service Unavail — server overloaded or maintenance
504 Gateway Timeout — backend took too long to respond

Quick Capacity Numbers

1 web server (ASP.NET Core / Kestrel) .... ~1,000-10,000 RPS
1 SQL Server instance .................... ~5,000-50,000 QPS
1 Redis instance ......................... ~100,000 ops/sec
1 Kafka broker ........................... ~200,000 msg/sec
1 RabbitMQ instance ...................... ~20,000-50,000 msg/sec
Azure Cosmos DB (single partition) ....... ~10,000 RU/s
Azure Service Bus (Premium) .............. ~1,000 msg/sec per MU

Single SQL table comfortable limit ....... ~100M-500M rows
Redis max memory (practical) ............. ~25-100 GB
Max request body size (default) .......... ~28.6 MB (Kestrel)
WebSocket connections per server ......... ~10,000-65,000

Availability Nines Table

Availability    Annual Downtime     Monthly Downtime
99%             3.65 days           7.31 hours
99.9%           8.77 hours          43.83 minutes
99.95%          4.38 hours          21.92 minutes
99.99%          52.60 minutes       4.38 minutes
99.999%         5.26 minutes        26.30 seconds
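
The numbers in this table are simple arithmetic: allowed downtime = period × (1 − availability). Here's a small sketch if you want to compute other targets. The Downtime helper is hypothetical, and it assumes a 365.25-day year, which is what reproduces the 8.77 hours shown for three nines.

```csharp
using System;

// Hypothetical helper: allowed downtime for a given availability target.
public static class Downtime
{
    // Average year length in hours (365.25 days), matching the table above.
    private const double HoursPerYear = 365.25 * 24;

    public static TimeSpan PerYear(double availabilityPercent) =>
        TimeSpan.FromHours(HoursPerYear * (1 - availabilityPercent / 100));

    // A month is treated as one twelfth of a year.
    public static TimeSpan PerMonth(double availabilityPercent) =>
        TimeSpan.FromHours(HoursPerYear / 12 * (1 - availabilityPercent / 100));
}
```

For example, Downtime.PerYear(99.9).TotalHours gives the three nines row, and Downtime.PerMonth(99.99).TotalMinutes gives the monthly budget for four nines.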

Common .NET Port Numbers

80   — HTTP
443  — HTTPS
1433 — SQL Server
5432 — PostgreSQL
6379 — Redis
5672 — RabbitMQ (AMQP)
15672 — RabbitMQ Management UI
9092 — Apache Kafka
8080 — Common alternative HTTP
5000 — ASP.NET Core default (HTTP)
5001 — ASP.NET Core default (HTTPS)

License

GNU General Public License v3.0 — see the LICENSE file.


Contributing

Found something wrong? Want to add a concept or improve an explanation? PRs are welcome. Just keep the .NET focus and the practical, no-nonsense tone.


Built for .NET developers who want to understand system design, not just memorize answers.

If this helped you, give it a ⭐ and share it with your team.
Every star helps other .NET developers discover this resource.
