# bettermdptools experiments demo

This notebook demonstrates the **optional** `bettermdptools.experiments.run(...)` entrypoint on a few environments.

Notes:
- The runs below use **small** iteration and episode counts to keep runtime short.
- The purpose is to show the API shape and configuration patterns, not to produce strong policies.


In [1]:
# If you are running this in a fresh environment, you may need:
# !pip install bettermdptools

from bettermdptools.experiments import run


## 1) FrozenLake - value iteration

This runs value iteration for a small number of iterations on a deterministic FrozenLake (`is_slippery=False`).
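
For readers who want to see what `gamma`, `n_iters`, and `theta` control, here is a minimal value iteration sketch on a hand-built three-state chain. It is not the library's implementation. The function name, the toy MDP, and the stopping rule are assumptions for illustration; the transition table uses the `(prob, next_state, reward, done)` tuple layout that Gym's toy-text environments expose.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, n_iters=200, theta=1e-10):
    """P[s][a] is a list of (prob, next_state, reward, done) tuples."""
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    used = 0
    for i in range(n_iters):
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    # Terminal transitions do not bootstrap from V[s_next].
                    Q[s, a] += prob * (reward + gamma * V[s_next] * (not done))
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        used = i + 1
        if delta < theta:  # stop early once the largest update is below theta
            break
    return V, Q.argmax(axis=1), used

# Hypothetical toy chain: 0 <-> 1 -> 2 (goal). Action 0 = left, 1 = right.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

V, pi, n_used = value_iteration(P, n_states=3, n_actions=2)
print("V:", V, "pi:", pi, "iterations used:", n_used)
```

On a small deterministic problem the `theta` check usually ends the loop well before `n_iters` is reached, so `n_iters` acts as an upper bound.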


In [2]:
out_vi = run(
    algo="vi",
    env_id="FrozenLake-v1",
    seed=0,
    env_kwargs={"is_slippery": False},  # deterministic transitions
    algo_kwargs={
        "gamma": 0.99,
        "n_iters": 200,   # small for demo
        "theta": 1e-10,
    },
    eval_kwargs={
        "n_iters": 50,    # evaluate for a few episodes
        "render": False,
    },
)

print("algo:", out_vi.algo)
print("env:", out_vi.env_id)
print("train keys:", list(out_vi.train.keys()))
print("eval keys:", None if out_vi.eval is None else list(out_vi.eval.keys()))
if out_vi.eval and "scores" in out_vi.eval:
    scores = out_vi.eval["scores"]
    print("eval scores head:", scores[:10])


algo: vi
env: FrozenLake-v1
train keys: ['V', 'V_track', 'pi']
eval keys: ['scores']
eval scores head: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]




## 2) FrozenLake - q-learning

This uses a small number of episodes, so the learned policy may be weak.
The goal is to show how `algo_kwargs` and `eval_kwargs` fit into the call.
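
The `init_epsilon` and `min_epsilon` keys bound the exploration rate during training. The sketch below shows one common way such bounds are used: a linear decay followed by epsilon-greedy action selection. The schedule shape, the `decay_fraction` parameter, and both function names are assumptions for illustration; the library's actual schedule may differ.

```python
import numpy as np

def epsilon_schedule(init_epsilon, min_epsilon, n_episodes, decay_fraction=0.9):
    """Decay linearly over the first decay_fraction of episodes, then hold at min_epsilon."""
    decay_steps = max(1, int(n_episodes * decay_fraction))
    eps = np.full(n_episodes, min_epsilon, dtype=float)
    eps[:decay_steps] = np.linspace(init_epsilon, min_epsilon, decay_steps)
    return eps

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

eps = epsilon_schedule(init_epsilon=1.0, min_epsilon=0.05, n_episodes=2000)
rng = np.random.default_rng(1)
q_row = np.array([0.1, 0.7, 0.2, 0.0])  # made-up Q-values for one state

greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)
print("first/last epsilon:", eps[0], eps[-1], "greedy action:", greedy_action)
```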


In [3]:
out_ql = run(
    algo="q_learning",   # alias "q" is also supported
    env_id="FrozenLake-v1",
    seed=1,
    env_kwargs={"is_slippery": False},
    algo_kwargs={
        "gamma": 0.99,
        "n_episodes": 2000,    # small for demo
        "init_epsilon": 1.0,
        "min_epsilon": 0.05,
    },
    eval_kwargs={
        "n_iters": 50,
        "render": False,
    },
)

print("train keys:", list(out_ql.train.keys()))
if "rewards" in out_ql.train:
    r = out_ql.train["rewards"]
    print("rewards len:", len(r), "last10:", r[-10:])
if out_ql.eval and "scores" in out_ql.eval:
    scores = out_ql.eval["scores"]
    print("eval mean:", sum(scores) / len(scores))



train keys: ['Q', 'V', 'pi', 'Q_track', 'pi_track', 'rewards']
rewards len: 2000 last10: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
eval mean: 1.0




## 3) Blackjack - q-learning (very small demo)

Blackjack is stochastic and can require many episodes for stable learning.
This demo uses only 5,000 training episodes to keep runtime short.


In [4]:
out_bj = run(
    algo="q_learning",
    env_id="Blackjack-v1",
    seed=2,
    algo_kwargs={
        "gamma": 0.99,
        "n_episodes": 5000,   # small for demo, increase for better results
        "min_epsilon": 0.05,
    },
    eval_kwargs={
        "n_iters": 200,
        "render": False,
    },
)

print("train keys:", list(out_bj.train.keys()))
if out_bj.eval and "scores" in out_bj.eval:
    scores = out_bj.eval["scores"]
    print("eval scores head:", scores[:10])


                                                     

train keys: ['Q', 'V', 'pi', 'Q_track', 'pi_track', 'rewards']
eval scores head: [ 1. -1.  1. -1.  1. -1. -1. -1. -1.  1.]
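
Because each Blackjack episode returns +1, -1, or 0, a handful of scores says little on its own. A rough way to read them is the mean together with its standard error. The sketch below applies this to the ten scores printed above; the helper name is an assumption and is not part of `bettermdptools`.

```python
import math

def summarize_scores(scores):
    """Return the sample mean and the standard error of the mean."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

# The ten evaluation scores shown in the output above.
head = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0]
mean, sem = summarize_scores(head)
print(f"mean={mean:.2f} +/- {sem:.2f} (n={len(head)})")
```

With only ten episodes the standard error is larger than the magnitude of the mean, which is why larger `n_iters` values in `eval_kwargs` are worth the extra runtime when comparing policies.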


## 4) CartPole - discretized q-learning via wrapper registry

CartPole has a continuous observation space, so planning and tabular RL require discretization.
The experiments layer can apply a wrapper (via an internal registry) and pass
discretization parameters using `wrapper_kwargs`.

This is a short demo run and is not expected to learn a strong policy.
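
To make the discretization idea concrete, here is a minimal sketch that maps a four-dimensional continuous observation to a single tabular state id using uniform bins. It is not the library's wrapper: the helper names, the velocity ranges, and the uniform angle bins are assumptions, and the wrapper's `angular_*_resolution` parameters suggest it bins the angle non-uniformly, which this sketch does not reproduce.

```python
import numpy as np

def make_bins(low, high, n_bins):
    """Interior edges that split [low, high] into n_bins equal-width intervals."""
    return np.linspace(low, high, n_bins + 1)[1:-1]

def discretize(obs, bin_edges):
    """Map each component to a bin index, then flatten the indices to one state id."""
    idx = tuple(int(np.digitize(x, edges)) for x, edges in zip(obs, bin_edges))
    sizes = tuple(len(edges) + 1 for edges in bin_edges)
    return int(np.ravel_multi_index(idx, sizes))

bin_edges = [
    make_bins(-2.4, 2.4, 6),    # cart position (episode ends beyond +/-2.4)
    make_bins(-3.0, 3.0, 6),    # cart velocity (range is an assumption)
    make_bins(-0.21, 0.21, 6),  # pole angle in radians (about +/-12 degrees)
    make_bins(-3.0, 3.0, 6),    # pole angular velocity (range is an assumption)
]

state = discretize([0.1, 0.1, 0.01, 0.1], bin_edges)
lowest = discretize([-10.0, -10.0, -10.0, -10.0], bin_edges)
highest = discretize([10.0, 10.0, 10.0, 10.0], bin_edges)
print("state ids:", state, lowest, highest, "of", 6 ** 4)
```

Values outside a range fall into the outermost bin, so the state count stays fixed at the product of the bin counts. Finer bins give a larger Q-table that needs more episodes to fill.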


In [5]:
out_cp = run(
    algo="q_learning",
    env_id="CartPole-v1",
    seed=0,
    # wrapper is omitted here to rely on the internal registry for CartPole
    wrapper_kwargs={
        "position_bins": 6,
        "velocity_bins": 6,
        "angular_velocity_bins": 6,
        "threshold_bins": 0.5,
        "angular_center_resolution": 0.1,
        "angular_outer_resolution": 0.5,
    },
    algo_kwargs={
        "gamma": 0.99,
        "n_episodes": 1500,   # small for demo
        "min_epsilon": 0.05,
    },
    eval_kwargs={
        "n_iters": 20,
        "render": False,
    },
)

print("meta:", out_cp.meta)
print("train keys:", list(out_cp.train.keys()))
if "rewards" in out_cp.train:
    r = out_cp.train["rewards"]
    print("rewards len:", len(r), "last10:", r[-10:])
if out_cp.eval and "scores" in out_cp.eval:
    scores = out_cp.eval["scores"]
    print("eval scores:", scores)


                                                   

meta: {'env': {'source': 'wrapped', 'wrapped': True, 'wrapper': 'CartpoleWrapper'}}
train keys: ['Q', 'V', 'pi', 'Q_track', 'pi_track', 'rewards']
rewards len: 1500 last10: [194. 159. 273. 268. 278. 111. 195. 206. 187. 190.]
eval scores: [269. 166. 206. 264. 222. 153. 162. 179. 220. 238. 230. 236. 234. 241.
 224. 184. 222. 228. 144. 231.]
