- Beware this necessary evil. All integration points fail.
- Prepare for the many forms of failure. Failures are rarely clean error responses; expect odd errors, slow responses, and outright hangs.
- Know when to open up abstractions. Debugging may mean diving in.
- Failures propagate quickly.
- Apply patterns to avert Integration Points problems. Circuit Breaker, Timeout, Decoupling Middleware, Handshaking.
- One server down jeopardizes the rest. The survivors absorb its load and become more likely to fail themselves.
- Hunt for resource leaks. Increased traffic makes memory leaks surface faster.
- Hunt for obscure timing bugs. Race conditions.
- Defend with Bulkheads. Partitioning on server side, Circuit Breaker on calling side.
- Stop cracks from jumping the gap. Stay up when they go down.
- Scrutinize resource pools. Safe resource pools always time out threads waiting for a resource.
- Defend with Timeouts and Circuit Breaker. The former ensures you eventually get control back; the latter keeps you from hammering a troubled Integration Point.
- Users consume memory. Minimize memory occupancy per user, and use the session only for caching so purging it is always an option.
- Users do weird, random things. Test with unpredictable, randomised ("crazy") input.
- Malicious users are out there. Patch, stay frosty.
- Users will gang up on you. Do stress testing on all points.
- The Blocked Threads antipattern is the proximate cause of most failures. It leads to Chain Reactions and Cascading Failures.
- Scrutinize resource pools. Deadlocks and incorrect exception handling can cause connections to be lost from the pool.
- Use proven primitives. e.g. queues.
- Defend with Timeouts.
- Beware the code you cannot see. i.e. third party code.
- Keep the lines of communication open. Use static landing pages as the destinations for special offers, and never embed session IDs in promotional links.
- Protect shared resources. Watch for "Fight Club" bugs, where front-end load drives exponentially increasing load on a shared back-end resource.
- Expect rapid redistribution of any cool or valuable offer.
- Examine production versus QA environments to spot Scaling Effects. Compare the respective sizes of each tier; ratios that hold in QA rarely hold in production.
- Watch out for point-to-point communication. A full mesh among n servers requires n(n-1)/2 connections, i.e. O(n^2), which scales badly.
- Watch out for shared resources. They become the bottleneck or capacity constraint. Stress test them, and test how clients behave when the shared resource is slow or hung.
- Examine server and thread counts. Check the ratio of front-end to back-end servers, and compare the total request-handling threads on each side.
- Observe near Scaling Effects and users. Watch for changes in patterns of load.
- Stress both sides of the interface. Flood the back end with ten times the maximum expected load; mimic a slow or dead back end and see what happens to the front end.
- Slow Responses trigger Cascading Failures.
- For websites, Slow Responses cause even more traffic as users hit Reload.
- Consider Fail Fast. Track your own responsiveness, and send an immediate failure response when average response time climbs too high.
- Hunt for memory leaks or resource contention.
- Don't make empty promises. Your SLA can be no better than the worst SLA among your dependencies.
- Examine every dependency. DNS? SMTP? Enterprise SAN? Message queues? Brokers?
- Decouple your SLAs. Maintain service in the face of failure.
- Use realistic data volumes. Test production sizes.
- Don't rely on the data producers. The only sizes you should care about are "zero", "one", and "lots".
- Put limits into other application-level protocols too. RMI, DCOM, and XML-RPC can all return massive result sets.
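A minimal sketch of limiting result sets at the application level, assuming a DB-API driver and a hypothetical `orders` table; fetching one extra row lets the caller tell a full page from a truncated one:

```python
import sqlite3  # stands in for any DB-API driver

LIMIT = 1000  # assumed application-level cap on result size

def fetch_recent_orders(conn: sqlite3.Connection, customer_id: int):
    """Return at most LIMIT rows, plus a flag saying whether results were truncated."""
    cur = conn.execute(
        "SELECT id, total FROM orders WHERE customer_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (customer_id, LIMIT + 1),   # ask for one extra row to detect overflow
    )
    rows = cur.fetchmany(LIMIT + 1)
    return rows[:LIMIT], len(rows) > LIMIT
```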
- Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses. They prevent Blocked Threads and avert Cascading Failures.
- Apply Timeouts to recover from unexpected failures.
- Consider delayed retries. Most problems still exist immediately afterwards, so wait before retrying (see the sketch below).
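A sketch of combining a hard timeout with a delayed retry, using only the standard library; the URL, timeout, and retry parameters are placeholders:

```python
import time
import urllib.error
import urllib.request

def call_with_timeout(url: str, timeout_s: float = 2.0, retries: int = 3,
                      retry_delay_s: float = 30.0) -> bytes:
    """Call an integration point with a bounded wait; pause before retrying,
    because a problem that exists now usually still exists a moment later."""
    last_error: Exception = TimeoutError("no attempts made")
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(retry_delay_s)   # delayed retry, not an immediate hammer
    raise last_error
```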
- "Closed" -> fine. X problems in Y time -> "Open". When "Open" all calls fail. After Z time becomes "Half-Open"; even one failure makes it open again. Else "Closed".
- Don't do it if it hurts. If an Integration Point hits many problems, stop calling it!
- Use together with Timeouts. Timeouts offer the indication of a problem.
- Expose, track, and report state changes. Popping a Circuit Breaker always indicates a serious problem.
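A minimal sketch of that state machine (the thresholds and the `print`-based reporting are placeholders; a real implementation would also need thread safety):

```python
import time

class CircuitBreaker:
    """Minimal sketch of the Closed -> Open -> Half-Open state machine."""

    def __init__(self, failure_threshold=5, window_s=60.0, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold   # X problems ...
        self.window_s = window_s                     # ... in Y time
        self.reset_timeout_s = reset_timeout_s       # Z time before Half-Open
        self.failures = []                           # timestamps of recent failures
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at >= self.reset_timeout_s:
                self.state = "half-open"             # allow one trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "half-open":
            self.state = "closed"                    # trial call succeeded
            self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == "half-open":
            self._trip(now)                          # one failure re-opens it
            return
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        print("circuit breaker OPEN")                # expose/report the state change
```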
- Save part of the ship. Partition for partial functionality.
- Decide whether to accept less efficient use of resources. Partitioning means keeping some capacity in reserve.
- Pick a useful granularity. Thread pools, CPUs, or servers in a cluster.
- Bulkheads are especially important in shared-services models.
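A sketch of bulkheads at thread-pool granularity, assuming two hypothetical downstream dependencies ("pricing" and "inventory"): each gets its own bounded pool, so a hang in one cannot consume the threads that serve the other.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: a hang in "pricing" cannot
# exhaust the threads reserved for "inventory".
BULKHEADS = {
    "pricing":   ThreadPoolExecutor(max_workers=8, thread_name_prefix="pricing"),
    "inventory": ThreadPoolExecutor(max_workers=8, thread_name_prefix="inventory"),
}

def call_via_bulkhead(dependency: str, func, *args):
    """Submit work through the dependency's partition and bound the wait for a result."""
    future = BULKHEADS[dependency].submit(func, *args)
    return future.result(timeout=2.0)   # a Timeout keeps the caller from blocking forever
```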
- Avoid fiddling. Eliminate need for recurring human intervention.
- Purge data with application logic.
- Limit caching. Bound memory usage.
- Roll the logs. Cap their size.
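As a sketch of rolling logs with a size cap (the file name and limits are placeholders), the standard library already provides the mechanism:

```python
import logging
from logging.handlers import RotatingFileHandler

# Cap log growth: at most 10 MB per file, 5 old files kept, older ones deleted
# automatically -- no recurring human intervention required.
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
```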
- Avoid Slow Responses and Fail Fast. If you can't meet your SLA, tell callers quickly instead of making them wait.
- Reserve resources and verify Integration Points early. For example, if the Circuit Breaker on a required call has popped, don't waste time by starting the work at all.
- Use Fail Fast for input validation. Checking input up front avoids wasting resources on requests that are doomed to fail.
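A sketch of failing fast before any expensive work, using a hypothetical quote-handling request and the CircuitBreaker sketch above:

```python
def handle_quote_request(request: dict, pricing_breaker) -> dict:
    """Fail fast: reject bad input and tripped dependencies before reserving resources."""
    # 1. Cheap input validation first -- don't acquire connections for a doomed request.
    if not request.get("sku") or request.get("quantity", 0) <= 0:
        raise ValueError("invalid quote request")

    # 2. If the Circuit Breaker on a required Integration Point has popped, don't start.
    if pricing_breaker.state == "open":
        raise RuntimeError("pricing service unavailable: failing fast")

    # 3. Only now pay for the expensive part (connections, remote calls, rendering).
    return compute_quote(request)

def compute_quote(request: dict) -> dict:
    return {"sku": request["sku"], "price": 0.0}   # placeholder for the real work
```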
- Create cooperative demand control. Both client and server must be built to perform Handshaking.
- Consider health checks. Application-level workaround for lack of Handshaking.
- Build Handshaking into your own low-level protocols. Endpoints inform the other when they are not ready to accept work.
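A sketch of cooperative demand control via a health check, assuming the server exposes hypothetical `/health` and `/work` endpoints: the caller asks cheaply whether the server is ready before sending real work.

```python
import urllib.error
import urllib.request

def server_is_accepting_work(base_url: str) -> bool:
    """Cheap handshake: ask the server whether it is healthy before sending real work."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=0.5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def send_work(base_url: str, payload: bytes) -> bytes:
    if not server_is_accepting_work(base_url):
        raise RuntimeError("server signalled it cannot accept work right now")
    req = urllib.request.Request(base_url + "/work", data=payload, method="POST")
    with urllib.request.urlopen(req, timeout=5.0) as resp:
        return resp.read()
```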
- Emulate out-of-spec failures.
- Stress the caller. Slow responses, no responses, garbage responses.
- Leverage shared harnesses for common failures.
- Supplement, don't replace, other test methods. A test harness is no substitute for unit tests, acceptance tests, etc., which cover functional behaviour; it exists to exercise "non-functional" behaviour.
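A sketch of one out-of-spec behaviour a test harness can offer: a port that accepts TCP connections and then never responds, which flushes out callers that lack Timeouts. The port number is arbitrary.

```python
import socket

def run_hanging_server(port: int = 10200) -> None:
    """Test-harness endpoint: accepts connections, then never sends a single byte."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen()
    held = []                         # keep sockets open so callers stay blocked
    while True:
        conn, _addr = listener.accept()
        held.append(conn)             # never read, never write, never close

if __name__ == "__main__":
    run_hanging_server()
```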
- Decide at the last responsible moment. Decoupling Middleware is a massive, nearly irreversible architecture decision, so its last responsible moment comes early.
- Avoid many failure modes through total decoupling. More adaptable too.
- Learn many architectures, and choose among them.
- Eliminate contention under normal loads.
- If possible, size resource pools to the request thread pool. Watch out for failover scenarios.
- Prevent vicious cycles. Resource contention -> slow responses -> resource contention.
- Watch for the Blocked Threads pattern.
- Avoid needless requests. Don't poll on a timer for autocompletion; if you need it, send a request only when the input actually changes.
- Respect your session architecture. Include the session ID on AJAX requests; otherwise each request can create a new, wasted session.
- Minimize the size of replies. Use JSON, not HTML.
- Increase the size of your web tier.
- Curtail session retention. Short as possible.
- Remember that users don't understand sessions. To them, an automatic logout just means lost work. Use the session as a cache, not as the only store of user data.
- Keep keys, not whole objects.
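A sketch of "keys, not whole objects", assuming a dict-like session and a hypothetical `load_user` lookup: the session stays tiny and its contents can be purged without losing anything.

```python
def remember_user(session: dict, user) -> None:
    # Keep keys, not whole objects: the session holds an ID, not the user graph.
    session["user_id"] = user.id

def current_user(session: dict, load_user):
    """Re-fetch on demand; if the cached copy was purged, nothing is lost."""
    user_id = session.get("user_id")
    return load_user(user_id) if user_id is not None else None
```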
- Make the Reload button irrelevant. Serve pages fast enough that users never hit Reload; otherwise you pay to re-serve resources at the worst possible time.
- Pool connections. Just do it.
- Protect request-handling threads. Make infinite blocks impossible. Use timeouts.
- Size the pools for maximum throughput. Monitor callers for wait times.
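A sketch of a pool whose checkout always times out, built on a bounded queue; the `make_connection` factory and the limits are placeholders.

```python
import queue

class ConnectionPool:
    """Bounded pool: checkout blocks for at most `checkout_timeout_s`, never forever."""

    def __init__(self, make_connection, size: int = 10, checkout_timeout_s: float = 1.0):
        self._available = queue.Queue(maxsize=size)
        self._timeout = checkout_timeout_s
        for _ in range(size):
            self._available.put(make_connection())

    def acquire(self):
        try:
            return self._available.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no connection available: pool exhausted")  # fail fast

    def release(self, conn) -> None:
        self._available.put(conn)
```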
- Limit cache sizes.
- Build a flush mechanism. Whether clock-, calendar-, or event-based, every cache needs flushing eventually. Rate-limit flushes so they can't happen too often.
- Don't cache trivial objects.
- Compare access and change frequency. Don't cache write-heavy objects.
- Precompute content that changes infrequently. Weigh the cost of generating it against the probability of change and the frequency of requests.
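A sketch combining "limit cache sizes" and "build a flush mechanism": an LRU cache with a hard entry limit and an explicit flush (the entry limit is a placeholder; `functools.lru_cache` covers the simple per-function case).

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache with a hard size limit and an explicit flush."""

    def __init__(self, max_entries: int = 1024):
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)          # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)       # evict the least recently used entry

    def flush(self) -> None:
        self._data.clear()                       # clock-, calendar-, or event-driven
```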
- Tune the garbage collector in production. Need actual usage pattern to tune against.
- Keep it up. Tune every cycle.
- Don't pool ordinary objects. Rely on the garbage collector instead.