# Reinforcement Learning
### Monte Carlo methods: On-policy Monte Carlo control

A **Monte Carlo** (**MC**) method is a model-free method used in **Reinforcement Learning** (RL). Unlike **temporal difference** (TD) methods, MC methods do not use *bootstrapping* and are **episode-driven**: they update value functions and policies only after an episode terminates. 
<br> MC control is the MC method in which both **policy evaluation** and **policy improvement** are performed. **On-policy** MC control means that the **behavior policy** and the **target policy** are the same.
<br>**Hint:** *Policy evaluation* estimates the value function of the current policy from the returns of sampled episodes. *Policy improvement* then updates the policy greedily, so that it prefers the actions with the highest estimated values.
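
As a minimal sketch of what policy evaluation by MC sampling involves, the snippet below computes the discounted return of every step of one finished episode and then averages sampled returns incrementally. The helper names `discounted_returns` and `incremental_mean`, the reward sequence, and the discount factor 0.9 are illustrative assumptions, not part of the notebook's code.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    # Walk backward so that each return reuses the one that follows it
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns


def incremental_mean(samples):
    """Running average: mean += (x - mean) / n, without storing the samples."""
    mean, n = 0.0, 0
    for x in samples:
        n += 1
        mean += (x - mean) / n
    return mean


# Hypothetical episode: two step penalties followed by the goal reward
rewards = [-1, -1, 10]
returns = discounted_returns(rewards, gamma=0.9)
print(returns)  # approximately [6.2, 8.0, 10.0]
print(incremental_mean([6.0, 8.0, 10.0]))  # 8.0
```

Because the returns are only known once the episode has ended, no estimate of a successor state is needed, which is what "no bootstrapping" means here.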

<hr>

The example in this notebook uses a **Grid World** environment similar to the one used before:
 - **States:** A size×size grid (size² states), labeled (0,0) to (size-1,size-1).
 - **Actions:** Up, Down, Left, Right.
 - **Rewards:**
    - Reaching the goal state (size-1,size-1) gives a reward of +10.
    - Reaching a "pit" state (size//2,size//2) gives a reward of −10.
    - All other transitions give a reward of −1.
 - **Terminal States:** (size-1,size-1) (goal) and (size//2,size//2) (pit).
 - **Transition Probabilities:**
    - With probability 0.8, the agent moves in the intended direction.
    - With probability 0.2, the agent instead moves in a direction drawn uniformly at random from all four actions, which may happen to be the intended one.
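
Since the random branch of the GridWorld code draws uniformly from all four actions, the intended action is actually executed with probability 0.8 + 0.2/4 = 0.85, and each other action with probability 0.05. The sketch below works this out; the function name `executed_action_probs` and its parameters are hypothetical and only serve the illustration.

```python
def executed_action_probs(intended, actions, p_intended=0.8):
    """Probability that each action is the one actually executed."""
    p_random = 1.0 - p_intended
    probs = {}
    for a in actions:
        # Random branch: uniform over all actions, the intended one included
        probs[a] = p_random / len(actions)
    # Deliberate branch: the intended action is executed as requested
    probs[intended] += p_intended
    return probs


actions = ['up', 'down', 'left', 'right']
probs = executed_action_probs('right', actions)
for a in actions:
    print(a, round(probs[a], 2))
```

Note that these are probabilities over executed actions; at the border of the grid an executed action can still leave the agent in the same cell.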

<hr>

In the following, we first implement the GridWorld environment. Then, we implement on-policy MC control in both its **first-visit** and **every-visit** variants.
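
The difference between the two variants only shows up when a (state, action) pair occurs more than once in an episode. The following sketch, with a hypothetical three-step episode and a hypothetical helper `mc_returns`, lists which returns each variant would credit to each pair; the discount factor is set to 1.0 to keep the arithmetic easy to follow.

```python
from collections import defaultdict


def mc_returns(episode, gamma, first_visit):
    """Collect the returns credited to each (state, action) pair."""
    # Time step of the first occurrence of every pair
    first_seen = {}
    for t, (s, a, _) in enumerate(episode):
        first_seen.setdefault((s, a), t)

    credited = defaultdict(list)
    G = 0.0
    # Walk backward so that G is the return from step t onward
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G
        if not first_visit or first_seen[(s, a)] == t:
            credited[(s, a)].append(G)
    return dict(credited)


# Hypothetical episode: the pair ('A', 'right') is visited twice
episode = [('A', 'right', -1), ('B', 'left', -1), ('A', 'right', 10)]

print(mc_returns(episode, gamma=1.0, first_visit=True))
print(mc_returns(episode, gamma=1.0, first_visit=False))
```

With first-visit, `('A', 'right')` is credited only the return 8.0 from its first occurrence; with every-visit it is credited both 10.0 and 8.0.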
<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Import required modules
import numpy as np
from collections import defaultdict
import random

In [2]:
# Implement the GridWorld environment
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.actions = ['up', 'down', 'left', 'right']
        self.terminal = {(size-1, size-1):10}  # Goal at bottom-right
        self.pits = {(size//2, size//2):-10}   # Pit at center
        self.current_state = None

    def reset(self):
        self.current_state = (0, 0)  # Start at top-left
        return self.current_state

    def _move(self, action_idx, i, j):
        action=self.actions[action_idx]
        if action == 'up': return (max(i-1, 0), j)
        elif action == 'down': return (min(i+1, self.size-1), j)
        elif action == 'left': return (i, max(j-1, 0))
        elif action == 'right': return (i, min(j+1, self.size-1))
        
    def step(self, action_idx):
        i, j = self.current_state
        if self.current_state in self.terminal or self.current_state in self.pits:
            return self.current_state, 0, True
        
        # Movement with stochasticity
        if random.random() < 0.8:
            next_state = self._move(action_idx, i, j)
        else:
            next_state = self._move(random.choice(range(len(self.actions))), i, j)
        
        self.current_state = next_state

        # Goal gives +10, pit gives -10, any other transition costs -1
        reward = self.terminal.get(next_state, self.pits.get(next_state, -1))
        done = next_state in self.terminal or next_state in self.pits
        return self.current_state, reward, done

In [3]:
# Every-visit or first-visit on-policy Monte Carlo control for the GridWorld
# Pass FIRST_VISIT=True for first-visit MC control;
# otherwise, every-visit MC control is performed
def monte_carlo_q_learning(env, episodes=10000, gamma=0.9, epsilon=0.3, FIRST_VISIT=False):
    # Initialize Q-table: maps (i,j) → [Q(s,a₁), Q(s,a₂), ...]
    Q = defaultdict(lambda: np.zeros(len(env.actions)))
    returns_count = defaultdict(lambda: np.zeros(len(env.actions)))
    
    for _ in range(episodes):
        # Generate episode
        episode = []
        state = env.reset()
        done = False
        
        while not done:
            # ε-greedy policy
            if random.random() < epsilon:
                action = random.choice(range(len(env.actions)))
            else:
                action = np.argmax(Q[state])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        
        # For first-visit MC, record the first time step of each (state, action) pair
        if FIRST_VISIT:
            first_occurrences = {}
            for t, (s, a, _) in enumerate(episode):
                if (s, a) not in first_occurrences:
                    first_occurrences[(s, a)] = t

        # Update Q-values, walking the episode backward to accumulate returns
        G = 0
        
        for t in reversed(range(len(episode))):
            state_t, action_t, reward_t = episode[t]
            G = reward_t + gamma * G
            
            # Every-visit: update at every step; first-visit: only at the first occurrence
            if not FIRST_VISIT or t == first_occurrences[(state_t, action_t)]:
                returns_count[state_t][action_t] += 1
                # Incremental average of the sampled returns
                Q[state_t][action_t] += (G - Q[state_t][action_t]) / returns_count[state_t][action_t]

    return Q

In [4]:
# Run Monte Carlo for the GridWorld
env = GridWorld(size=5)
Q = monte_carlo_q_learning(env, episodes=10000)  # add FIRST_VISIT=True for first-visit MC

In [5]:
# Print the greedy (highest-Q) action for each state
for i in range(env.size):
    for j in range(env.size):
        state=(i,j)
        action = np.argmax(Q[state])
        print(f'state({i},{j}): {env.actions[action]}',end=';  ')
    print()

state(0,0): right;  state(0,1): right;  state(0,2): right;  state(0,3): right;  state(0,4): down;  
state(1,0): right;  state(1,1): right;  state(1,2): right;  state(1,3): right;  state(1,4): down;  
state(2,0): down;  state(2,1): down;  state(2,2): up;  state(2,3): down;  state(2,4): down;  
state(3,0): down;  state(3,1): down;  state(3,2): right;  state(3,3): right;  state(3,4): down;  
state(4,0): right;  state(4,1): right;  state(4,2): right;  state(4,3): right;  state(4,4): up;  
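
The listing above also prints an action for the terminal states (2,2) and (4,4). No action is ever taken from a terminal state, so their Q-values stay at zero and `np.argmax` returns index 0, which is 'up'. A possible way to avoid this confusion is to render the policy as a grid and label terminal cells explicitly. The sketch below is one such rendering; the function `render_policy`, the arrow symbols, the labels 'G' and 'P', and the tiny demo Q-table are all assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict


def render_policy(Q, size, terminals):
    """Return one string per grid row, with arrows for greedy actions."""
    arrows = ['↑', '↓', '←', '→']  # same order as ['up', 'down', 'left', 'right']
    rows = []
    for i in range(size):
        cells = []
        for j in range(size):
            if (i, j) in terminals:
                cells.append(terminals[(i, j)])  # label instead of an action
            else:
                cells.append(arrows[int(np.argmax(Q[(i, j)]))])
        rows.append(' '.join(cells))
    return rows


# Tiny hypothetical Q-table for a 2x2 grid with the goal at (1, 1)
Q_demo = defaultdict(lambda: np.zeros(4))
Q_demo[(0, 0)][3] = 1.0  # right
Q_demo[(0, 1)][1] = 1.0  # down
Q_demo[(1, 0)][3] = 1.0  # right
for row in render_policy(Q_demo, size=2, terminals={(1, 1): 'G'}):
    print(row)
```

With the objects from this notebook, the call would look like `render_policy(Q, env.size, {(4, 4): 'G', (2, 2): 'P'})`.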
