## RLDMUU 2025
#### UCRL
jakub.tluczek@unine.ch

We continue exploring more advanced approaches to reinforcement learning, this time looking at [UCRL](https://papers.nips.cc/paper_files/paper/2006/file/c1b70d965ca504aa751ddb62ad69c63f-Paper.pdf). The main idea is that, while estimating rewards and transition probabilities, we maintain a set of plausible MDPs consistent with the data observed so far, defined by confidence bounds around those estimates. We then optimistically assume that the MDP in this set with the highest achievable reward is the true one, and compute its optimal policy, for example with value iteration.
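
To make the optimistic planning step concrete, here is a minimal sketch, assuming the estimates and confidence radii introduced below are already available as NumPy arrays. The function names `optimistic_transitions` and `optimistic_value_iteration`, the discount factor `gamma` and the sweep count `n_sweeps` are our own illustrative choices, not taken from the paper. We also assume rewards in $[0, 1]$, treat the transition bound as a per-entry interval, and use discounted value iteration for simplicity.

```python
import numpy as np


def optimistic_transitions(p_hat, p_bound, values):
    """For one (s, a) pair, pick the distribution inside the per-entry
    intervals [p_hat - p_bound, p_hat + p_bound] that maximizes p @ values."""
    lower = np.clip(p_hat - p_bound, 0.0, 1.0)
    upper = np.clip(p_hat + p_bound, 0.0, 1.0)
    p = lower.copy()
    budget = 1.0 - p.sum()  # probability mass still to be assigned
    # Give the remaining mass to the most valuable next states first.
    for s_next in np.argsort(-values):
        if budget <= 0:
            break
        add = min(upper[s_next] - p[s_next], budget)
        p[s_next] += add
        budget -= add
    return p


def optimistic_value_iteration(r_hat, r_bound, p_hat, p_bound,
                               gamma=0.95, n_sweeps=200):
    """Return a greedy policy and values for the optimistic MDP.
    gamma and n_sweeps are assumptions made for this sketch."""
    num_states, num_actions = r_hat.shape
    # Optimistic rewards: upper end of the interval (rewards assumed in [0, 1]).
    r_opt = np.minimum(r_hat + r_bound, 1.0)
    values = np.zeros(num_states)
    q = np.zeros((num_states, num_actions))
    for _ in range(n_sweeps):
        for s in range(num_states):
            for a in range(num_actions):
                p = optimistic_transitions(p_hat[s, a], p_bound[s, a], values)
                q[s, a] = r_opt[s, a] + gamma * p @ values
        values = q.max(axis=1)
    return q.argmax(axis=1), values
```

The argument order matches the `update_policy(r_estimate, r_bound, p_estimate, p_bound)` signature used in the class skeleton further down, so the returned policy could be stored there and used by `act`.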

In today's task we can use the original bounds for rewards and transitions as presented in the paper, that is, respectively:

$$ \text{conf}_r (t,s,a) = \min \left\{ 1, \sqrt{\frac{\log(2 t^{\alpha} |S| |A|)}{2 N_t (s,a)}} \right\} $$

$$ \text{conf}_p (t,s,a) = \min \left\{ 1, \sqrt{\frac{\log(4 t^{\alpha} |S|^2 |A|)}{2 N_t (s,a)}} \right\}$$
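
As a sanity check, the two bounds above can be evaluated for all state-action pairs at once. The sketch below is one possible vectorized version; the function name `confidence_radii` and the choice to return the maximal radius of 1 for unvisited pairs (where $N_t(s,a) = 0$) are our assumptions.

```python
import numpy as np


def confidence_radii(t, counts, alpha):
    """Return (conf_r, conf_p), each of shape (S, A), for time step t >= 1.

    counts[s, a] holds N_t(s, a).
    """
    num_states, num_actions = counts.shape
    # Guard against division by zero; unvisited pairs are overwritten below.
    n = np.maximum(counts, 1)
    conf_r = np.sqrt(np.log(2 * t**alpha * num_states * num_actions) / (2 * n))
    conf_p = np.sqrt(np.log(4 * t**alpha * num_states**2 * num_actions) / (2 * n))
    # Apply the min{1, .} cap; unvisited pairs get the maximal radius 1 (assumption).
    conf_r = np.where(counts > 0, np.minimum(1.0, conf_r), 1.0)
    conf_p = np.where(counts > 0, np.minimum(1.0, conf_p), 1.0)
    return conf_r, conf_p
```

Both radii shrink as $N_t(s,a)$ grows, so frequently visited pairs get tight intervals while rarely visited ones stay optimistic.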

The estimates $\hat{r}_t (s,a)$ and $\hat{p}_t (s, a, s')$ are simply the empirical averages:

$$ \hat{r}_t (s,a) = \frac{R_t (s,a)}{N_t (s,a)} $$

$$ \hat{p}_t (s, a, s') = \frac{P_t (s,a,s')}{N_t (s,a)} $$

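where $R_t (s,a)$ is the sum of rewards received after taking action $a$ in state $s$, $P_t(s,a,s')$ is the number of observed transitions from $(s,a)$ to $s'$, and $N_t (s,a)$ is the number of times $(s,a)$ has been visited, all up to time $t$. In the bounds above, $|S|$ and $|A|$ are the numbers of states and actions, and $\alpha$ is a parameter of the algorithm.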
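
The estimates translate directly into array operations on the counters. Below is a minimal sketch, assuming the counters are stored as NumPy arrays of shapes `(S, A)`, `(S, A, S)` and `(S, A)`; the function and argument names are hypothetical. For unvisited pairs we return zeros instead of dividing by zero, which is our own convention.

```python
import numpy as np


def empirical_estimates(reward_sums, transition_counts, visit_counts):
    """Compute r_hat (S, A) and p_hat (S, A, S) from R_t, P_t and N_t."""
    # Replace zero counts by 1 so that unvisited pairs yield 0 rather than NaN.
    n = np.maximum(visit_counts, 1)
    r_hat = reward_sums / n
    # Broadcast N_t(s, a) over the next-state axis.
    p_hat = transition_counts / n[:, :, None]
    return r_hat, p_hat
```

Note that for an unvisited pair the row `p_hat[s, a]` is all zeros and therefore not a probability distribution; the confidence bound of 1 for such pairs is what keeps every transition plausible there.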

In [1]:
import gymnasium as gym 
import numpy as np

In [None]:
class UCRL:
    def __init__(self, states, actions, alpha):
        self.num_states = states 
        self.num_actions = actions 
        self.alpha = alpha 
        # TODO: initialize R_t, P_t and N_t

        # TODO: initialize a policy

    def act(self, state):
        # TODO: act greedily
        pass

    def get_confidence_bounds(self):
        # TODO: get P and R estimates 
        # TODO: get confidence bounds 
        pass

    def update_policy(self, r_estimate, r_bound, p_estimate, p_bound):
        # TODO: get the most optimistic rewards and transitions within the confidence intervals

        # TODO: perform value iteration and update greedy policy
        pass

    def update_counters(self, state, action, reward, next_state):
        # TODO: Update R_t, P_t and N_t (the reward is needed for R_t)
        pass

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False)

N_EPISODES = 10000
N_ITER = 1000

ALPHA = 0.1

state, info = env.reset()

algo = UCRL(states=env.observation_space.n, actions=env.action_space.n, alpha=ALPHA)

# Steps needed to reach the goal; stays at N_ITER for episodes that never reach it
nsteps = np.ones(N_EPISODES) * N_ITER
mean_episode_rewards = np.zeros(N_EPISODES)

for e in range(N_EPISODES):
    # Assumes get_confidence_bounds() returns the tuple
    # (r_estimate, r_bound, p_estimate, p_bound), unpacked into the four arguments
    algo.update_policy(*algo.get_confidence_bounds())

    for i in range(N_ITER):
        action = algo.act(state)

        next_state, reward, terminated, truncated, info = env.step(action)

        algo.update_counters(state, action, reward, next_state)

        if terminated or truncated:
            state, info = env.reset()
            if reward == 1:
                # i is zero-based, so the goal was reached after i + 1 steps
                nsteps[e] = i + 1
                mean_episode_rewards[e] = 1 / (i + 1)
            break

        state = next_state