# Benchmark Environments and Applications

## Overview
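This notebook explores three classic benchmark environments for testing MDP algorithms: GridWorld, FrozenLake, and CliffWalking. Each one is solved with Value Iteration, and the resulting policies and convergence behavior are compared.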

## Learning Objectives
1. Understand how the three environments differ (deterministic vs. stochastic transitions, reward structure)
2. Solve each environment with Value Iteration
3. Analyze the optimal policies and value functions
4. Compare convergence across environments

## Part 1: Setup

In [None]:
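import sys
# Machine-specific project path; change it to your local checkout.
# The final directory name is Chinese for "Markov decision process".
sys.path.insert(0, '/home/dingziming/PycharmProjects/AI-Practices/07-reinforcement-learning/马尔科夫决策过程')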

import numpy as np
import matplotlib.pyplot as plt
from src.environments import GridWorld, FrozenLake, CliffWalking
from src.solvers import ValueIterationSolver, PolicyIterationSolver

print("Imports successful!")

## Part 2: GridWorld Environment

**Characteristics**:
- Deterministic transitions
- Grid-based navigation
- Obstacles and goals
- Intuitive visualization
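
To make "deterministic transitions" concrete, here is a minimal sketch of a grid step function. It does not use `src.environments.GridWorld`; the action encoding, the `step` helper, and the rule that blocked moves leave the agent in place are all assumptions made for illustration. The obstacle positions mirror the ones configured in this notebook.

```python
# Hypothetical action encoding: (row delta, column delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, height=5, width=5, obstacles=frozenset({(2, 2), (2, 3)})):
    """Return the single successor of `state` under `action`.

    Assumption: a move that would leave the grid or enter an obstacle
    leaves the agent where it is.
    """
    row, col = state
    d_row, d_col = ACTIONS[action]
    nxt = (row + d_row, col + d_col)
    in_bounds = 0 <= nxt[0] < height and 0 <= nxt[1] < width
    if not in_bounds or nxt in obstacles:
        return state
    return nxt

print(step((0, 0), "right"))  # free cell: the agent moves
print(step((0, 0), "up"))     # wall: the agent stays put
print(step((2, 1), "right"))  # obstacle at (2, 2): the agent stays put
```

Because every (state, action) pair has exactly one successor, the transition model is a lookup table rather than a probability distribution.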

In [None]:
# Create GridWorld
gridworld = GridWorld(height=5, width=5, discount_factor=0.99)

# Configure environment
gridworld.set_start(0, 0)
gridworld.set_goal(4, 4)
gridworld.set_obstacle(2, 2)
gridworld.set_obstacle(2, 3)

# Build transitions
gridworld.build_transitions()

print("GridWorld created")
print(f"States: {gridworld.get_state_space_size()}")
print(f"Actions: {gridworld.get_action_space_size()}")

In [None]:
# Solve GridWorld
gw_solver = ValueIterationSolver(gridworld, theta=1e-6)
gw_value_fn, gw_policy = gw_solver.solve(verbose=False)

print(f"GridWorld solved in {gw_solver.get_iteration_count()} iterations")
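
For readers who want to see what a solver of this kind does, the following is a minimal NumPy sketch of value iteration with a `theta` stopping threshold. It is not the code in `src.solvers`: the array layout (`P[s, a, s']`, `R[s, a]`), the function name `value_iteration`, and the two-state toy MDP are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-6, max_iter=10_000):
    """Sketch of value iteration.

    P: array of shape (S, A, S) with transition probabilities.
    R: array of shape (S, A) with expected immediate rewards.
    Returns the value estimate, a greedy policy, and the per-sweep max change.
    """
    V = np.zeros(P.shape[0])
    history = []
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)            # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))  # largest change in this sweep
        history.append(delta)
        V = V_new
        if delta < theta:
            break
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy, history

# Hypothetical toy MDP: in state 0, action 0 stays put (reward 0) and
# action 1 moves to the absorbing state 1 (reward 1).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])

V, policy, history = value_iteration(P, R, gamma=0.9)
print(V, policy, len(history))
```

The `history` list plays the same role as the solver's convergence history plotted later: one "max value change" entry per sweep.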

In [None]:
# Visualize GridWorld policy
print("\nGridWorld Optimal Policy:")
print(gridworld.render(policy=gw_policy))

In [None]:
# Display value function
print("\nGridWorld Value Function:")
values = gw_value_fn.to_array().reshape((gridworld.height, gridworld.width))
for row in range(gridworld.height):
    for col in range(gridworld.width):
        print(f"{values[row, col]:7.2f}", end=" ")
    print()

## Part 3: FrozenLake Environment

**Characteristics**:
- Stochastic transitions (slippery ice)
- 4x4 grid
- Holes (terminal states with 0 reward)
- Goal (terminal state with +1 reward)
- Demonstrates importance of handling uncertainty
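
The slip probabilities used by this project's `FrozenLake` class are not shown here. As a sketch, the code below assumes a common convention (the one used by Gymnasium's slippery FrozenLake): the agent moves in the intended direction with probability 1/3 and in each perpendicular direction with probability 1/3. The action encoding and the helper name `slippery_transitions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical action encoding, ordered so that the neighbours of each
# entry in the list are the two perpendicular directions.
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up

def slippery_transitions(state, action, size=4):
    """Return {next_state: probability} for one (state, action) pair."""
    row, col = state
    probs = defaultdict(float)
    for a in ((action - 1) % 4, action, (action + 1) % 4):
        d_row, d_col = MOVES[a]
        # Clamp to the grid so a move into a wall leaves the agent in place.
        nxt = (min(max(row + d_row, 0), size - 1),
               min(max(col + d_col, 0), size - 1))
        probs[nxt] += 1 / 3
    return dict(probs)

# Intended action "right" (index 2) from the top-left corner.
print(slippery_transitions((0, 0), 2))
```

Under this assumption, an action only succeeds a third of the time, so a good policy picks actions whose possible slips all avoid holes, even when that means not heading directly for the goal.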

In [None]:
# Create FrozenLake
frozenlake = FrozenLake(discount_factor=0.99)

# Build transitions
frozenlake.build_transitions()

print("FrozenLake created")
print(f"States: {frozenlake.get_state_space_size()}")
print(f"Actions: {frozenlake.get_action_space_size()}")
print("\nLayout:")
print(frozenlake.get_layout_description())

In [None]:
# Solve FrozenLake
fl_solver = ValueIterationSolver(frozenlake, theta=1e-6)
fl_value_fn, fl_policy = fl_solver.solve(verbose=False)

print(f"FrozenLake solved in {fl_solver.get_iteration_count()} iterations")

In [None]:
# Visualize FrozenLake policy
print("\nFrozenLake Optimal Policy:")
print(frozenlake.render(policy=fl_policy))

In [None]:
# Display value function
print("\nFrozenLake Value Function:")
values = fl_value_fn.to_array().reshape((frozenlake.height, frozenlake.width))
for row in range(frozenlake.height):
    for col in range(frozenlake.width):
        print(f"{values[row, col]:7.3f}", end=" ")
    print()

## Part 4: CliffWalking Environment

**Characteristics**:
- Risk-reward trade-off
- 4x12 grid
- Cliff (large negative reward)
- Optimal policy hugs the cliff: the shortest route, but a single misstep incurs the cliff penalty
- Demonstrates importance of robust policies
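
The trade-off can be quantified with a short sketch. The reward values used by this project's `CliffWalking` class are not shown here, so the code assumes the textbook setup from Sutton and Barto: a reward of -1 per step, -100 for stepping into the cliff (which sends the agent back to the start), start at the bottom-left cell, and goal at the bottom-right cell. The helper `path_return` is hypothetical.

```python
HEIGHT, WIDTH = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, col) for col in range(1, 11)}  # assumed cliff cells
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def path_return(actions, gamma=1.0):
    """Discounted return of following `actions` from START (deterministic moves)."""
    state, total, discount = START, 0.0, 1.0
    for a in actions:
        d_row, d_col = MOVES[a]
        row = min(max(state[0] + d_row, 0), HEIGHT - 1)
        col = min(max(state[1] + d_col, 0), WIDTH - 1)
        state = (row, col)
        if state in CLIFF:
            total += discount * -100  # assumed cliff penalty
            state = START             # assumed reset to the start cell
        else:
            total += discount * -1    # assumed step cost
        discount *= gamma
        if state == GOAL:
            break
    return total

cliff_edge = "U" + "R" * 11 + "D"       # one row above the cliff
safe_route = "UUU" + "R" * 11 + "DDD"   # along the top row

print(path_return(cliff_edge), path_return(safe_route), path_return("R"))
```

With deterministic moves the cliff-edge route costs fewer steps than the detour, which is why a planner that sees no slip risk prefers it; the third call shows how expensive a single wrong move next to the cliff is.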

In [None]:
# Create CliffWalking
cliffwalking = CliffWalking(height=4, width=12, discount_factor=0.99)

# Build transitions
cliffwalking.build_transitions()

print("CliffWalking created")
print(f"States: {cliffwalking.get_state_space_size()}")
print(f"Actions: {cliffwalking.get_action_space_size()}")
print(f"\n{cliffwalking.get_grid_info()}")

In [None]:
# Solve CliffWalking
cw_solver = ValueIterationSolver(cliffwalking, theta=1e-6)
cw_value_fn, cw_policy = cw_solver.solve(verbose=False)

print(f"CliffWalking solved in {cw_solver.get_iteration_count()} iterations")

In [None]:
# Visualize CliffWalking policy
print("\nCliffWalking Optimal Policy:")
print(cliffwalking.render(policy=cw_policy))

In [None]:
# Display value function (first two rows only)
print("\nCliffWalking Value Function (first 2 rows):")
values = cw_value_fn.to_array().reshape((cliffwalking.height, cliffwalking.width))
for row in range(2):
    for col in range(cliffwalking.width):
        print(f"{values[row, col]:7.2f}", end=" ")
    print()

## Part 5: Environment Comparison

In [None]:
# Compare environments
print("Environment Comparison:")
print("="*70)
print(f"{'Environment':<15} {'States':<10} {'Actions':<10} {'Iterations':<15}")
print("="*70)
print(f"{'GridWorld':<15} {gridworld.get_state_space_size():<10} {gridworld.get_action_space_size():<10} {gw_solver.get_iteration_count():<15}")
print(f"{'FrozenLake':<15} {frozenlake.get_state_space_size():<10} {frozenlake.get_action_space_size():<10} {fl_solver.get_iteration_count():<15}")
print(f"{'CliffWalking':<15} {cliffwalking.get_state_space_size():<10} {cliffwalking.get_action_space_size():<10} {cw_solver.get_iteration_count():<15}")
print("="*70)

In [None]:
# Compare convergence
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# (title prefix, solver, line style) for each panel
panels = [
    ('GridWorld', gw_solver, 'b-o'),
    ('FrozenLake', fl_solver, 'g-o'),
    ('CliffWalking', cw_solver, 'r-o'),
]

for ax, (name, solver, style) in zip(axes, panels):
    history = solver.get_convergence_history()
    ax.semilogy(range(1, len(history) + 1), history, style, linewidth=2, markersize=4)
    ax.set_title(f'{name} Convergence')
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Max Value Change')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
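
When reading these log-scale curves, it helps to know the worst case. The Bellman backup is a contraction with factor `gamma`, so the max value change shrinks by at least that factor on every sweep. The sketch below turns that into an upper bound on the number of sweeps needed to fall to `theta`. The helper name `iteration_bound` and the assumed first-sweep change of 1.0 are illustrative, not taken from the solvers.

```python
import math

def iteration_bound(gamma, theta, first_delta):
    """Smallest k such that gamma**k * first_delta is at or below theta.

    This is a worst-case estimate for 0 < gamma < 1; actual runs often
    stop earlier.
    """
    if first_delta <= theta:
        return 0
    return math.ceil(math.log(theta / first_delta) / math.log(gamma))

# Settings used in this notebook, with a hypothetical first-sweep change of 1.0.
print(iteration_bound(gamma=0.99, theta=1e-6, first_delta=1.0))
```

With `gamma=0.99` the bound is on the order of a thousand sweeps, so iteration counts well below that indicate the environment's structure lets values settle faster than the worst case.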

## Part 6: Policy Analysis

In [None]:
# Analyze GridWorld policy
print("GridWorld Policy Analysis:")
print(f"Max value: {gw_value_fn.get_max_value():.3f}")
print(f"Min value: {gw_value_fn.get_min_value():.3f}")
print(f"Mean value: {gw_value_fn.get_mean_value():.3f}")
print(f"Policy is complete: {gw_policy.is_complete()}")

In [None]:
# Analyze FrozenLake policy
print("\nFrozenLake Policy Analysis:")
print(f"Max value: {fl_value_fn.get_max_value():.3f}")
print(f"Min value: {fl_value_fn.get_min_value():.3f}")
print(f"Mean value: {fl_value_fn.get_mean_value():.3f}")
print(f"Policy is complete: {fl_policy.is_complete()}")

In [None]:
# Analyze CliffWalking policy
print("\nCliffWalking Policy Analysis:")
print(f"Max value: {cw_value_fn.get_max_value():.3f}")
print(f"Min value: {cw_value_fn.get_min_value():.3f}")
print(f"Mean value: {cw_value_fn.get_mean_value():.3f}")
print(f"Policy is complete: {cw_policy.is_complete()}")

## Summary

Key insights:
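1. **GridWorld**: Deterministic transitions make the optimal path straightforward to read off the policy
2. **FrozenLake**: Stochastic transitions mean the policy must account for slips, not just distance to the goal
3. **CliffWalking**: Risk-reward trade-off; the optimal policy takes the short route along the cliff edge
4. **All environments solved successfully** with Value Iteration
5. **Convergence speed varies** between environments; compare the iteration counts and convergence curves in Part 5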