# Reinforcement Learning
## Returns, policy, and value functions

The **return** $G_t$ is the total discounted reward the agent accumulates from time step $t$ onward. It is defined as:
<br> $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$
<br>where $\gamma$ is the discount rate, with $0\le \gamma \le 1$.
<br> Factoring $\gamma$ out of every term after the first gives a recursive relation between successive returns:
<br>$G_t=R_{t+1}+\gamma G_{t+1}$
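
As a minimal sketch of the two formulas above, the snippet below computes the return of a short, finite episode both by the direct sum and by the backward recursion. The reward list and the helper names `discounted_return` and `returns_by_recursion` are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_0 = sum over k of gamma^k * R_{k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def returns_by_recursion(rewards, gamma):
    """Compute G_t for every t using G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0  # the return after the final step is zero
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


rewards = [-1, -1, -1, 10]  # hypothetical rewards R_1, ..., R_4
gamma = 0.9

print(discounted_return(rewards, gamma))     # return from t = 0, direct sum
print(returns_by_recursion(rewards, gamma))  # returns G_0, ..., G_3
```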
<hr>
A **policy** is a strategy that the agent uses to decide which actions to take in each state. It can be *deterministic* or *stochastic*.
- **Deterministic policy** maps each state to a specific action: $\pi(s)=a$
- **Stochastic policy** maps each state to a probability distribution over actions: $\pi(a|s)=\Pr(A_t=a|S_t=s)$
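
The difference between the two kinds of policy can be sketched with plain lookup tables. The states, probabilities, and function names below are made up for illustration; only the standard-library `random` module is used.

```python
import random

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set

# Deterministic policy: each state maps to exactly one action, pi(s) = a
deterministic_policy = {(0, 0): "right", (0, 1): "down"}

# Stochastic policy: each state maps to a probability for every action, pi(a|s)
stochastic_policy = {
    (0, 0): [0.1, 0.2, 0.0, 0.7],
    (0, 1): [0.25, 0.25, 0.25, 0.25],
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the probabilities pi(.|state)."""
    return rng.choices(ACTIONS, weights=stochastic_policy[state], k=1)[0]


rng = random.Random(0)
print(act_deterministic((0, 0)))    # the same action on every call
print(act_stochastic((0, 0), rng))  # may differ from call to call
```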
---
**Reward hypothesis:** The reward hypothesis assumes that any goal or objective an agent might have can be expressed as the maximization of the expected return $E[G_t]$.
 - This hypothesis simplifies the problem of defining goals to the problem of designing a reward function that aligns with the desired behavior.

---
**Value functions** are used to evaluate how good a state or action is under a given policy. There are two main types of value functions:
<br>The **state-value** function $v_\pi (s)$ represents the expected return (cumulative discounted reward) when starting in state $s$ and following policy $\pi$ thereafter:
<br>$v_\pi (s)=E_\pi[G_t|S_t=s]$
<br>The **action-value** function $q_\pi(s,a)$ represents the expected return when starting in state $s$, taking action $a$, and following policy $\pi$ thereafter:
<br>$q_\pi (s,a)=E_\pi[G_t|S_t=s,A_t=a]$
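
The two functions are linked: weighting each action value by the probability that the policy picks that action gives the state value, $v_\pi(s)=\sum_a \pi(a|s)\,q_\pi(s,a)$. The sketch below checks this for a single state; the action values and the uniform policy are made-up numbers.

```python
# Hypothetical action values q_pi(s, a) for one state s
q_values = {"up": -4.0, "down": 2.0, "left": -4.0, "right": 2.0}

# Uniform random policy: pi(a|s) = 1/4 for every action
policy_probs = {action: 0.25 for action in q_values}

# v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)
v = sum(policy_probs[a] * q_values[a] for a in q_values)
print(v)  # -1.0
```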

---
In the following, a 3-by-3 grid world is defined in which a robot can select one of four possible actions in each cell of the grid. The rewards and the discount rate $\gamma$ are also defined. The robot follows a random policy: it explores the grid world and collects a return for every cell it visits. The returns collected for each cell are averaged to estimate that cell's value under the random policy. Finally, the estimated values are used to choose, in each cell, the action that leads to the most promising next cell.
<br> **Note:** This example is a bit advanced at this stage, but it ties together the topics we have covered so far.
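
Averaging returns does not require keeping every sample. As a rough sketch, the mean can be updated one return at a time with $V \leftarrow V + (G - V)/N$, where $N$ is the number of returns seen so far. The helper `update_mean` and the sample returns are hypothetical; the notebook code stores the full list of returns instead.

```python
def update_mean(mean, count, new_return):
    """Fold one more sampled return into a running average."""
    count += 1
    mean += (new_return - mean) / count
    return mean, count


sampled_returns = [4.0, -2.0, 7.0]  # hypothetical returns observed for one cell

mean, count = 0.0, 0
for G in sampled_returns:
    mean, count = update_mean(mean, count, G)

batch_mean = sum(sampled_returns) / len(sampled_returns)
print(mean, batch_mean)  # the two averages agree up to rounding
```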
<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Import the required module
import numpy as np

In [2]:
# Define the GridWorld environment
class GridWorld:
    def __init__(self):
        self.grid_size = (3, 3)
        self.start_state = (0, 0)
        self.goal_state = (2, 2)
        self.current_state = self.start_state
        self.actions = ["up", "down", "left", "right"]
        self.rewards = {
            self.goal_state: 10,  # Reward for reaching the goal
            "default": -1         # Reward for all other steps
        }
        self.gamma = 0.9  # Discount factor

    def reset(self):
        """Reset the environment to the start state."""
        self.current_state = self.start_state
        return self.current_state

    def step(self, action):
        """Take a step in the environment based on the action."""
        x, y = self.current_state

        # Perform the action
        if action == "up":
            x = max(x - 1, 0)
        elif action == "down":
            x = min(x + 1, self.grid_size[0] - 1)
        elif action == "left":
            y = max(y - 1, 0)
        elif action == "right":
            y = min(y + 1, self.grid_size[1] - 1)

        self.current_state = (x, y)

        # Check if the goal is reached
        if self.current_state == self.goal_state:
            reward = self.rewards[self.goal_state]
            done = True
        else:
            reward = self.rewards["default"]
            done = False

        return self.current_state, reward, done

# Define a random policy
def random_policy():
    return np.random.choice(["up", "down", "left", "right"])

# Simulate episodes and compute returns, value functions, and policy
def simulate_episodes(env, num_episodes=500):
    returns = {}  # Maps each state to the list of returns sampled from it
    value_function = np.zeros(env.grid_size)  # Stores the value function
    state_counts = np.zeros(env.grid_size)   # Counts visits to each state

    for _ in range(num_episodes):
        episode = []
        state = env.reset()
        done = False

        # Generate an episode
        while not done:
            action = random_policy()
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk the episode backwards so that G = reward + gamma * G gives the return from each step
        G = 0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = reward + env.gamma * G

            # Record the return on every visit to the state and refresh its average (every-visit Monte Carlo)
            if state not in returns:
                returns[state] = []
            returns[state].append(G)
            value_function[state] = np.mean(returns[state])
            state_counts[state] += 1

    # Derive a greedy policy: in each cell, pick the action with the highest one-step lookahead value
    policy = np.empty(env.grid_size, dtype=object)
    for x in range(env.grid_size[0]):
        for y in range(env.grid_size[1]):
            if (x, y) == env.goal_state:
                policy[x, y] = "goal"
                continue

            # Find the action that maximizes the value of the next state
            best_action = None
            best_value = -np.inf
            for action in env.actions:
                next_x, next_y = x, y
                if action == "up":
                    next_x = max(x - 1, 0)
                elif action == "down":
                    next_x = min(x + 1, env.grid_size[0] - 1)
                elif action == "left":
                    next_y = max(y - 1, 0)
                elif action == "right":
                    next_y = min(y + 1, env.grid_size[1] - 1)

                # Get the reward for taking this action
                if (next_x, next_y) == env.goal_state:
                    reward = env.rewards[env.goal_state]
                else:
                    reward = env.rewards["default"]

                # Get the value of the next state
                next_state_value = value_function[next_x, next_y]

                # Compute the total value: reward + discounted value of the next state
                total_value = reward + env.gamma * next_state_value

                # Update the best action if this action leads to a higher total value
                if total_value > best_value:
                    best_value = total_value
                    best_action = action

            policy[x, y] = best_action

    return value_function, policy, state_counts

In [3]:
# Run the simulation
env = GridWorld()
value_function, policy, state_counts = simulate_episodes(env, num_episodes=1000)

# Print results
print("Value Function:")
print(value_function)
print("\nPolicy:")
print(policy)
print("\nState Counts (Visits):")
print(state_counts)

Value Function:
[[-5.9999017  -5.12966767 -3.86475725]
 [-5.02454069 -3.12148514  0.32475494]
 [-3.6617277   0.35562481  0.        ]]

Policy:
[['down' 'down' 'down']
 ['right' 'down' 'down']
 ['right' 'right' 'goal']]

State Counts (Visits):
[[6135. 3993. 3181.]
 [4011. 2946. 1950.]
 [2953. 1928.    0.]]
