## 46. Technical debt trade‑offs & when to pay it down

**Technical debt** is the gap between the quick solution you shipped and the cleaner design you skipped.

* *Good debt*: a conscious, documented shortcut with a pay‑back plan (e.g., `TODO #123: fix SQL injection`).
* *Bad debt*: undocumented hacks that nobody tracks and that rot silently.

**Decision factors** for paying debt:
1. **Interest rate** – How much does it slow development each sprint?  
2. **Compound effect** – Does it block new features, and does the cost grow as more code builds on it?  
3. **Risk** – Could it cause a security breach or data loss?

Rule: budget regular “debt sprints” or fold fixes into each story’s definition of done.

```text
Shortcut: use plain string concatenation for SQL query.
Risk: SQL injection. Interest: every new query needs custom escaping.
Plan: migrate to parameterised queries next sprint (ticket DEV‑98).
```
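
Paying down the debt in the plan above could look like the following minimal sketch. It uses Python's standard `sqlite3` module; the `users` table and the two `find_user_*` helpers are hypothetical names chosen for illustration.

```python
import sqlite3

def find_user_unsafe(conn, name):
    # The shortcut: string concatenation lets user input rewrite the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, name):
    # The pay-down: the driver binds `name` as data, never as SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "x' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, malicious)  # the injected OR matches every row
safe_rows = find_user_safe(conn, malicious)      # treated as a literal name: no match
```

The parameterised version also removes the "interest": new queries no longer need custom escaping.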

### Quick check

1. True / False All technical debt must be paid immediately.

2. Debt “interest” refers to:
  a. cost of leaving debt   b. money owed to finance

<details><summary>Answer key</summary>

1. **False** – weigh cost vs. benefit.
2. **a**.

</details>

## 47. Designing for observability: structured logs, metrics, traces

Observable systems expose **what** they’re doing and **why**.

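* **Structured logs** – JSON or key‑value records that are easy to search and filter.  
* **Metrics** – numeric time series (latency, QPS) that are cheap to aggregate and alert on.  
* **Distributed traces** – the end‑to‑end path of one request across services, with per‑hop timing.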

Design tips:
• Include correlation IDs across services.  
• Emit RED metrics (Rate, Errors, Duration).  
• Log at decision points (retry, fallback) with context.

```python
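import json, logging, time, uuid

logging.basicConfig(level=logging.INFO)  # root logger defaults to WARNING, so info() would be dropped without this

def do_work():  # placeholder for the real operation
    return 42

req_id = str(uuid.uuid4())   # correlation ID for this request
start = time.perf_counter()  # monotonic clock, suited to measuring durations
try:
    result = do_work()
    logging.info(json.dumps({'req': req_id, 'status': 'ok', 'ms': int(1000 * (time.perf_counter() - start))}))
except Exception as e:
    logging.error(json.dumps({'req': req_id, 'status': 'error', 'error': str(e)}))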
```
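
The RED tip can be sketched in the same spirit. The `RedMetrics` class below is a hypothetical in-memory recorder, not a real metrics library; a production system would normally export these counters to a metrics backend, which derives the rate by dividing the request count by the collection interval.

```python
import time
from collections import defaultdict

class RedMetrics:
    """Hypothetical in-memory recorder for Rate, Errors, Duration per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)       # request count (basis for Rate)
        self.errors = defaultdict(int)         # failed requests (Errors)
        self.durations_ms = defaultdict(list)  # per-request latency (Duration)

    def observe(self, endpoint, func, *args, **kwargs):
        """Run func, recording one request, its duration, and any error."""
        start = time.perf_counter()
        self.requests[endpoint] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors[endpoint] += 1
            raise  # record the failure, but let the caller handle it
        finally:
            elapsed_ms = 1000 * (time.perf_counter() - start)
            self.durations_ms[endpoint].append(elapsed_ms)

    def error_ratio(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

metrics = RedMetrics()
metrics.observe("/rates", lambda: "ok")
try:
    metrics.observe("/rates", lambda: 1 / 0)  # simulated failing request
except ZeroDivisionError:
    pass
```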

### Quick check

1. Which shows the *duration* of each step in a single request's chain of calls?
  a. metric   b. trace

2. True / False Free‑text logs are easier to query than structured logs.

<details><summary>Answer key</summary>

1. **b**.
2. **False** – structured logs are machine‑friendly.

</details>

## 48. Graceful degradation & fallback strategies

When parts of your system fail, aim for a **degraded but functional** user experience.

Patterns:
• **Timeout + cached data** when the live call is slow.  
• **Circuit breaker** to stop hammering a failing dependency.  
• **Feature flags** to disable non‑critical widgets.

Always surface clear, polite error messages.

```python
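def get_exchange_rate():
    """Return the live rate, falling back to the last cached rate on timeout."""
    # live_rate_service and stale_rate_cache are assumed to be defined elsewhere;
    # fetch() is expected to raise TimeoutError when its deadline passes.
    try:
        return live_rate_service.fetch()
    except TimeoutError:
        return stale_rate_cache.get()  # fallback: stale data beats no data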
```
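
The circuit breaker pattern from the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are all hypothetical, and real libraries usually add thread safety and a richer half‑open state.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # any success closes the circuit fully
        return result

# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])
calls = []

def flaky():
    calls.append(1)
    raise TimeoutError("dependency down")

results = [breaker.call(flaky, lambda: "cached") for _ in range(5)]
```

In this run only the first two calls reach the failing dependency; the remaining three are answered from the fallback while the circuit is open.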

### Quick check

1. Circuit breaker trips after:
  a. success   b. repeated failures

2. True / False Graceful degradation is less important in internal tools.

<details><summary>Answer key</summary>

1. **b**.
2. **False** – operators are users too.

</details>

## 49. Security‑first thinking: threat modelling lite

Before coding, ask: **What could go wrong?**  Steps:
1. Draw a data flow diagram.  
2. Identify trust boundaries (internet ↔ backend).  
3. Apply the STRIDE mnemonic (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege).

Each mitigation then maps to code: authentication, validation, encryption.

```text
Web form → API → DB
Threat: Tampering – user modifies price field.
Mitigation: server‑side price lookup; ignore client value.
```
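
A minimal sketch of the mitigation above, assuming a hypothetical `PRICES_CENTS` catalogue held on the server and a hypothetical `build_order` handler. The point is that the client's `price` field is never read.

```python
PRICES_CENTS = {"sku-1": 1999, "sku-2": 500}  # hypothetical server-side catalogue

def build_order(payload):
    """Price an order from trusted server data, ignoring any client-sent price."""
    sku = payload.get("sku")
    quantity = payload.get("quantity")
    if not isinstance(sku, str) or sku not in PRICES_CENTS:
        raise ValueError("unknown sku")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValueError("quantity must be an integer")
    if not 1 <= quantity <= 100:
        raise ValueError("quantity must be between 1 and 100")
    unit_price = PRICES_CENTS[sku]  # payload["price"], if present, is never used
    return {"sku": sku, "quantity": quantity, "total_cents": unit_price * quantity}

# A tampered request claims a price of 1 cent; the server ignores it.
tampered = {"sku": "sku-1", "quantity": 2, "price": 1}
order = build_order(tampered)
```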

### Quick check

1. STRIDE letter for data leak:
  a. I   b. D

2. True / False Client‑side validation alone prevents tampering.

<details><summary>Answer key</summary>

1. **a** – Information disclosure.
2. **False** – must validate server‑side.

</details>

## 50. Designing for scalability: horizontal vs. vertical, stateless vs. sticky

* **Vertical scaling**: a bigger box; simple, but capped by the largest machine you can get.  
* **Horizontal scaling**: more boxes; needs a load balancer.

Design services to be **stateless** so any instance can handle any request, which makes horizontal scaling easy.  
Avoid *sticky* sessions, which pin a user to the one instance holding their state in memory; keep that state in a shared DB or cache instead.

```text
Stateless web → AWS ALB distributes users; session stored in Redis.
Vertical scale fallback: temporarily raise CPU/RAM until traffic subsides.
```
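
The stateless setup above can be simulated in a single process. In this sketch `SessionStore` is a plain dictionary standing in for a shared store such as Redis, and `itertools.cycle` stands in for the load balancer; both classes and all names are hypothetical.

```python
import itertools

class SessionStore:
    """Stand-in for a shared store such as Redis; here just a dict in one process."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def set(self, session_id, session):
        self._data[session_id] = session

class WebInstance:
    """Keeps no per-user state of its own, so any instance can serve any request."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.set(session_id, {**session, "cart": cart})
        return {"served_by": self.name, "cart": cart}

store = SessionStore()
instances = [WebInstance("web-1", store), WebInstance("web-2", store)]
round_robin = itertools.cycle(instances)  # stand-in for the load balancer

# The same session is served by two different instances without losing state.
first = next(round_robin).add_to_cart("sess-42", "book")
second = next(round_robin).add_to_cart("sess-42", "pen")
```

Because the cart lives in the shared store, adding or removing instances does not strand any user's session.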

### Quick check

1. Adding CPUs to the same server is:
  a. vertical   b. horizontal

2. True / False Stateless design simplifies auto‑scaling.

<details><summary>Answer key</summary>

1. **a**.
2. **True**.

</details>