### Chapter 17: **Making Complex Decisions**

### **17.1 Sequential Decision Problems**
- **Focus**: Computational decision-making in **stochastic environments**.
- **Sequential Problems**:
  - The agent's utility depends on a sequence of decisions, unlike episodic problems, where each decision's outcome is independent of the others.
  - Incorporates **uncertainty, sensing**, and **utilities**.
  - Formalized as Markov Decision Processes (MDPs) when the environment is fully observable.
  
#### **Markov Decision Processes (MDPs)**:
- Defined for **fully observable, stochastic environments**:
  1. **States (S)**: The environment's possible configurations.
  2. **Actions (A)**: Choices available in each state.
  3. **Transition Model (P(s'|s, a))**: Probability of reaching state `s'` from `s` after action `a`.
  4. **Reward Function (R(s))**: Reward for being in state `s`.

- **Policy (π)**:
  - A mapping that specifies an action `π(s)` for every state, so the agent always knows what to do wherever it ends up.
  - **Optimal Policy (π*)** is a policy that maximizes the **expected utility** of the resulting state sequences.
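
The four MDP components and a policy can be written down as plain data. The following is a minimal sketch, assuming a hypothetical three-state problem; the state names, action names, rewards, and probabilities are all invented for illustration and do not come from the chapter.

```python
# Hypothetical three-state MDP; every name and number here is illustrative.
STATES = ["a", "b", "goal"]

# Actions available in each state; "goal" is treated as terminal (no actions).
ACTIONS = {"a": ["stay", "go"], "b": ["stay", "go"], "goal": []}

# Reward function R(s): reward for being in state s.
REWARD = {"a": -0.04, "b": -0.04, "goal": 1.0}

# Transition model P(s' | s, a), stored as {(s, a): {s': probability}}.
TRANSITIONS = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"): {"goal": 0.8, "a": 0.2},
}


def check_mdp(states, actions, transitions):
    """Check that every (s, a) pair has a valid distribution over known states."""
    for s in states:
        for a in actions[s]:
            dist = transitions[(s, a)]
            if not set(dist) <= set(states):
                raise ValueError(f"unknown successor state for {(s, a)}")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"probabilities for {(s, a)} do not sum to 1")
    return True


# A policy maps each non-terminal state to one of its available actions.
policy = {"a": "go", "b": "go"}

check_mdp(STATES, ACTIONS, TRANSITIONS)
```

Storing the transition model as a dictionary keyed by `(s, a)` keeps the sketch short; a larger problem would more likely use arrays.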

#### **Key Features of MDPs**:
1. **Stationary Policies**:
   - With an infinite horizon, the optimal action depends only on the current state, not on the time step.
2. **Utility of a State**:
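   - Expected sum of **discounted rewards** collected when acting optimally from `s`:
     $$
     U(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t)\right], \quad S_0 = s
     $$
   - This utility satisfies the recursive form:
     $$
     U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s, a) U(s')
     $$
   - `γ`: Discount factor (0 ≤ γ ≤ 1). With γ < 1, immediate rewards count more than future ones; γ = 1 gives purely additive rewards.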
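
To make the effect of the discount factor concrete, the sketch below computes the discounted sum for a single, already observed reward sequence. The utility of a state is the expectation of this quantity over all sequences the policy can generate; the function name and the sample rewards are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical sequence: two small step costs, then a terminal reward.
rewards = [-0.04, -0.04, 1.0]

undiscounted = discounted_return(rewards, gamma=1.0)  # plain additive sum
discounted = discounted_return(rewards, gamma=0.5)    # late reward counts less
print(undiscounted, discounted)
```

With `gamma=0.5` the final reward is multiplied by 0.25, so the same sequence is worth much less than under additive rewards.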

#### **Decision Strategies**:
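- **Finite Horizon**: Decisions are optimized over a fixed number of steps; the best action can depend on how much time remains, so the optimal policy is nonstationary.
- **Infinite Horizon**: No fixed deadline; the optimal policy is stationary.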

---

### **17.2 Value Iteration**
- **Goal**: Compute state utilities and derive optimal policies.

#### **Bellman Equation**:
- Core formula linking a state's utility to the utilities of its successor states:
  $$
  U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s, a) U(s')
  $$
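
A single evaluation of the right-hand side of this equation can be sketched as follows. The tiny MDP, the function name `bellman_backup`, and the convention that a terminal state's utility equals its reward are assumptions made for illustration.

```python
# Hypothetical three-state MDP (all names and numbers are illustrative).
ACTIONS = {"a": ["stay", "go"], "b": ["stay", "go"], "goal": []}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
P = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"): {"goal": 0.8, "a": 0.2},
}


def bellman_backup(s, U, actions, P, R, gamma):
    """Evaluate R(s) + gamma * max_a sum_s' P(s'|s,a) U(s') for one state."""
    if not actions[s]:
        # Assumed convention: a terminal state's utility is just its reward.
        return R[s]
    best = max(
        sum(p * U[s2] for s2, p in P[(s, a)].items())
        for a in actions[s]
    )
    return R[s] + gamma * best


# Current utility estimates: only the goal is known to be valuable.
U = {"a": 0.0, "b": 0.0, "goal": 1.0}
backed_up_b = bellman_backup("b", U, ACTIONS, P, R, gamma=1.0)
print(backed_up_b)
```

For state `b`, the action `go` has expected utility 0.8 and `stay` has 0, so the backup returns the step reward plus 0.8.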

#### **Value Iteration Algorithm**:
1. Initialize utilities arbitrarily (e.g., U(s) = 0).
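2. Apply the **Bellman update** to every state, using the previous iteration's utilities on the right-hand side of the Bellman equation.
3. Stop when the largest change in any state's utility falls below a given threshold.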

- **Convergence**:
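  - Guaranteed for γ < 1 because the Bellman update is a **contraction**: each update shrinks the maximum error in the utilities by a factor of at least γ.
  - Error bounds relate the size of the last update to the distance from the true utilities, so the greedy policy is nearly optimal after finitely many iterations.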
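
The three steps above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-state MDP, a simple max-change stopping rule with an arbitrary tolerance, and the convention that a terminal state's utility equals its reward; none of these specifics come from the chapter.

```python
# Hypothetical three-state MDP (all names and numbers are illustrative).
STATES = ["a", "b", "goal"]
ACTIONS = {"a": ["stay", "go"], "b": ["stay", "go"], "goal": []}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
P = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"): {"goal": 0.8, "a": 0.2},
}


def expected_utility(s, a, U, P):
    """Expected utility of the successor state when doing a in s."""
    return sum(p * U[s2] for s2, p in P[(s, a)].items())


def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Repeat Bellman updates until no utility changes by more than tol."""
    U = {s: 0.0 for s in states}  # step 1: arbitrary initialization
    while True:
        U_new = {}
        for s in states:  # step 2: Bellman update from the old utilities
            if actions[s]:
                best = max(expected_utility(s, a, U, P) for a in actions[s])
                U_new[s] = R[s] + gamma * best
            else:
                U_new[s] = R[s]  # assumed terminal-state convention
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta < tol:  # step 3: stop once the updates are small
            return U


def greedy_policy(states, actions, P, U):
    """Pick, in each non-terminal state, the action with the best look-ahead."""
    return {
        s: max(actions[s], key=lambda a: expected_utility(s, a, U, P))
        for s in states
        if actions[s]
    }


U = value_iteration(STATES, ACTIONS, P, R, gamma=0.9)
policy = greedy_policy(STATES, ACTIONS, P, U)
print(policy)  # {'a': 'go', 'b': 'go'}
```

The update is synchronous: `U_new` is built entirely from the previous `U`, which matches the form of the update for which the contraction argument is stated.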

---

### **17.3 Policy Iteration**
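- **Goal**: Find an optimal policy by alternating two steps, starting from an arbitrary policy. The policy can become optimal before the utility estimates are exact.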
1. **Policy Evaluation**:
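   - Compute the utility of every state under the current policy.
   - Because the policy fixes the action, the `max` disappears and the Bellman equations become `n` linear equations in `n` unknowns, which can be solved exactly.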
2. **Policy Improvement**:
   - Update the policy using one-step look-ahead based on current utilities.
3. Repeat until the improvement step changes no action; the policy has then converged.
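
The evaluate-and-improve loop can be sketched as below. This is an illustrative version, assuming a hypothetical three-state MDP and a small hand-written Gauss-Jordan solver so that the example has no dependencies; a numerical library would normally handle the linear solve.

```python
# Hypothetical three-state MDP (all names and numbers are illustrative).
STATES = ["a", "b", "goal"]
ACTIONS = {"a": ["stay", "go"], "b": ["stay", "go"], "goal": []}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
P = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"): {"goal": 0.8, "a": 0.2},
}


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(states, policy, P, R, gamma):
    """Solve U(s) = R(s) + gamma * sum_s' P(s'|s, pi(s)) U(s') exactly."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [R[s] for s in states]
    for s in states:
        if s in policy:  # terminal states keep the row U(s) = R(s)
            for s2, p in P[(s, policy[s])].items():
                A[idx[s]][idx[s2]] -= gamma * p
    x = solve_linear(A, b)
    return {s: x[idx[s]] for s in states}


def policy_iteration(states, actions, P, R, gamma):
    policy = {s: actions[s][0] for s in states if actions[s]}  # arbitrary start
    while True:
        U = evaluate_policy(states, policy, P, R, gamma)  # policy evaluation
        changed = False
        for s in policy:  # policy improvement by one-step look-ahead
            def q(a):
                return sum(p * U[s2] for s2, p in P[(s, a)].items())
            best = max(actions[s], key=q)
            if q(best) > q(policy[s]) + 1e-12:  # switch only on strict gain
                policy[s] = best
                changed = True
        if not changed:
            return policy, U


policy, U = policy_iteration(STATES, ACTIONS, P, R, gamma=0.9)
print(policy)  # {'a': 'go', 'b': 'go'}
```

Switching only on a strict improvement (with a small tolerance) avoids cycling between actions that are equally good.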

- **Modified Policy Iteration**:
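  - Replaces exact policy evaluation with a few simplified Bellman updates (the action is fixed by the policy, so there is no `max`), which is cheaper than solving the linear system when the state space is large.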
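
A sketch of this variant follows, assuming a hypothetical three-state MDP and an arbitrary choice of `k = 20` evaluation sweeps per round. The function names and the tolerance in the improvement test are likewise illustrative.

```python
# Hypothetical three-state MDP (all names and numbers are illustrative).
STATES = ["a", "b", "goal"]
ACTIONS = {"a": ["stay", "go"], "b": ["stay", "go"], "goal": []}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
P = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "stay"): {"b": 1.0},
    ("b", "go"): {"goal": 0.8, "a": 0.2},
}


def approx_evaluate(states, policy, P, R, gamma, U, k):
    """Apply k simplified Bellman updates (no max) for a fixed policy."""
    for _ in range(k):
        U = {
            s: (
                R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, policy[s])].items())
                if s in policy
                else R[s]  # assumed convention: terminal utility is its reward
            )
            for s in states
        }
    return U


def modified_policy_iteration(states, actions, P, R, gamma, k=20):
    policy = {s: actions[s][0] for s in states if actions[s]}  # arbitrary start
    U = {s: 0.0 for s in states}
    while True:
        # Approximate evaluation, warm-started from the previous estimate.
        U = approx_evaluate(states, policy, P, R, gamma, U, k)
        changed = False
        for s in policy:  # same one-step look-ahead improvement as before
            def q(a):
                return sum(p * U[s2] for s2, p in P[(s, a)].items())
            best = max(actions[s], key=q)
            if q(best) > q(policy[s]) + 1e-9:
                policy[s] = best
                changed = True
        if not changed:
            return policy, U


policy, U = modified_policy_iteration(STATES, ACTIONS, P, R, gamma=0.9)
print(policy)  # {'a': 'go', 'b': 'go'}
```

The returned utilities are only estimates, since evaluation stops after `k` sweeps, but on this small problem they are accurate enough for the improvement step to settle on the same policy.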

---

### **Summary**
- **MDPs**: Framework for sequential decisions in fully observable stochastic settings.
- **POMDPs**: Extensions for partially observable environments.
- **Key Algorithms**:
  - **Value Iteration** and **Policy Iteration** for MDPs.
  - **Belief State-based Approaches** for POMDPs.
- Decision-making balances **risk, reward**, and **information gathering**.
