# Reinforcement Learning
### The REINFORCE method: A policy gradient method

**REINFORCE** is a **policy gradient** algorithm that:  
1. **Directly optimizes** a stochastic policy by estimating gradients from **complete episodes**.  
2. Uses **Monte Carlo returns** $G_t$ to weight action probabilities, favoring high-reward trajectories.  
3. Updates policy parameters via gradient *ascent*, stepping along $G_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$.  
4. Requires **no value function** (vanilla version).  
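
The Monte Carlo return in step 2 is the discounted sum of the rewards from time $t$ to the end of the episode, which can be computed backwards with $G_t = r_t + \gamma G_{t+1}$. A minimal sketch of that recursion follows; the function name `discounted_returns` and the three-step episode are illustrative assumptions, not part of the notebook.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Hypothetical episode: two ordinary moves (-1 each), then the goal (+1000).
print(discounted_returns([-1.0, -1.0, 1000.0], gamma=0.5))
# prints [248.5, 499.0, 1000.0]
```

Every action in the episode is credited with the goal reward, but earlier actions receive a more heavily discounted share.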

Here, we use an MLP to approximate the policy; we call it the policy network. The policy network is trained by gradient ascent on the GridWorld environment we have used so far. 
<hr>

We use a **Grid World** environment similar to the one we used before, but with larger-magnitude rewards to help REINFORCE:
 - **States:** A size×size grid (size² states), labeled (0,0) to (size-1,size-1).
 - **Actions:** Up, Down, Left, Right.
 - **Rewards:**
    - Reaching the goal state (size-1,size-1) gives a reward of +1000.
    - Reaching the "pit" state (size//2,size//2) gives a reward of −1000.
    - All other transitions give a reward of −1.
 - **Terminal States:** (size-1,size-1) (goal) and (size//2,size//2) (pit).
 - **Transition Probabilities:**
    - Moving in the intended direction succeeds with probability 0.8.
    - With probability 0.2, the agent moves in a direction drawn uniformly at random from all four actions (which may coincide with the intended one).
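
Because the random direction can coincide with the intended one, the effective chance of moving as intended is slightly above 0.8. The sketch below works this out under the assumption that the random direction is drawn uniformly from all four actions, which is what `random.choice(self.actions)` in the environment code does. The helper name `effective_action_probs` is hypothetical.

```python
ACTIONS = ['up', 'down', 'left', 'right']


def effective_action_probs(intended, p_intended=0.8):
    """Probability that each direction is actually executed."""
    # The slip probability is shared equally by all actions,
    # including the intended one.
    slip = (1.0 - p_intended) / len(ACTIONS)
    return {a: slip + (p_intended if a == intended else 0.0) for a in ACTIONS}


probs = effective_action_probs('right')
for action, p in probs.items():
    print(f"{action}: {p:.2f}")
```

Under this assumption the intended direction is executed with probability of about 0.85, and each of the other three directions with about 0.05.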

<hr>

In the following, we first implement the GridWorld environment. Then, we train the policy network with REINFORCE.
<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Import required modules
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import random

In [12]:
# GridWorld environment with increased rewards
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.actions = ['up', 'down', 'left', 'right']
        self.terminal = {(size-1, size-1): 1000.}  # Goal at bottom-right
        self.pits = {(size//2, size//2): -1000.}   # Pit at center
        self.current_state = None

    def reset(self):
        self.current_state = (0, 0)
        return self._state_to_features(self.current_state)

    def step(self, action_idx):
        action = self.actions[action_idx]
        i, j = self.current_state

        # No further movement once the episode has ended (goal or pit)
        if self.current_state in self.terminal or self.current_state in self.pits:
            return self._state_to_features(self.current_state), 0, True

        # Movement with stochasticity
        if random.random() < 0.8:
            next_state = self._move(action, i, j)
        else:
            next_state = self._move(random.choice(self.actions), i, j)

        self.current_state = next_state
        reward = self.terminal.get(next_state, self.pits.get(next_state, -1.))
        done = next_state in self.terminal or next_state in self.pits
        return self._state_to_features(next_state), reward, done

    def _move(self, action, i, j):
        if action == 'up': return (max(i-1, 0), j)
        elif action == 'down': return (min(i+1, self.size-1), j)
        elif action == 'left': return (i, max(j-1, 0))
        elif action == 'right': return (i, min(j+1, self.size-1))

    def _state_to_features(self, state):
        """Enhanced features for larger grids"""
        i, j = state
        features = [
            i / (self.size-1),  # Normalized row index
            j / (self.size-1),  # Normalized column index
            i / (self.size-1)*j / (self.size-1),  # Row-column interaction term
            (self.size-1 - i) / (self.size-1),  # Remaining rows down to the goal
            (self.size-1 - j) / (self.size-1),  # Remaining columns right to the goal
            abs(i - self.size//2) / self.size,  # Row distance to the pit
            abs(j - self.size//2) / self.size,  # Column distance to the pit
            float(i in [0, self.size-1]),  # On the top or bottom edge
            float(j in [0, self.size-1])   # On the left or right edge
        ]
        return torch.FloatTensor(features)
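
To make the nine features concrete, the sketch below recomputes them with plain Python lists for two states of a 5x5 grid. It mirrors `_state_to_features` but is a standalone illustration; the function name `state_features` is an assumption and is not used elsewhere in the notebook.

```python
def state_features(state, size=5):
    """Torch-free copy of the feature computation, for inspection only."""
    i, j = state          # i is the row, j is the column
    last = size - 1
    return [
        i / last,                   # normalized row
        j / last,                   # normalized column
        i / last * j / last,        # row-column interaction
        (last - i) / last,          # remaining rows to the goal
        (last - j) / last,          # remaining columns to the goal
        abs(i - size // 2) / size,  # row distance to the pit
        abs(j - size // 2) / size,  # column distance to the pit
        float(i in [0, last]),      # on top/bottom edge
        float(j in [0, last]),      # on left/right edge
    ]


print(state_features((0, 0)))  # start state
# prints [0.0, 0.0, 0.0, 1.0, 1.0, 0.4, 0.4, 1.0, 1.0]
print(state_features((2, 2)))  # pit state
# prints [0.5, 0.5, 0.25, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
```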

In [13]:
# The policy network (last layer uses softmax)
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, output_size)
        )
        
    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)
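
For a softmax policy, the gradient of $\log \pi(a)$ with respect to the logits has a simple closed form: the one-hot vector of the sampled action minus the probability vector. PyTorch's autograd computes this for us during training; the pure-Python sketch below only illustrates the idea, and the helper names `softmax` and `grad_log_prob_wrt_logits` are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob_wrt_logits(logits, action):
    """d log pi(action) / d logits = one_hot(action) - softmax(logits)."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


logits = [0.0, 0.0, 0.0, 0.0]   # an untrained, uniform policy over 4 actions
print(softmax(logits))
# prints [0.25, 0.25, 0.25, 0.25]
print(grad_log_prob_wrt_logits(logits, action=3))
# prints [-0.25, -0.25, -0.25, 0.75]
```

Scaling this vector by a positive return raises the logit of the sampled action and lowers the others; a negative return does the opposite.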

In [16]:
# The main training function
def train_reinforce(env, policy, optimizer, episodes=1000, gamma=0.99):
    rewards_history = []
    
    for episode in range(episodes):
        state = env.reset()
        log_probs = []
        rewards = []

        # Generate an episode
        while True:
            probs = policy(state)
            m = Categorical(probs)
            action = m.sample()
            log_probs.append(m.log_prob(action))
            next_state, reward, done = env.step(action.item())
            rewards.append(reward)
            state = next_state.clone()
            
            if done:
                break
        # Calculate discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        
        # Normalize returns (skipped for one-step episodes, where std() is NaN)
        returns = torch.FloatTensor(returns)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        
        # Calculate loss
        policy_loss = []
        for log_prob, G in zip(log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        # Update policy
        optimizer.zero_grad()
        loss = torch.stack(policy_loss).sum()
        loss.backward()
        optimizer.step()
        
        # Track rewards
        total_reward = sum(rewards)
        rewards_history.append(total_reward)
        
        if (episode+1) % 200 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}, Avg Reward: {avg_reward:.1f}")
    
    return rewards_history

# Initialize
env = GridWorld(size=5)
policy = PolicyNetwork(input_size=9, output_size=4)  # feature vector as input
optimizer = optim.Adam(policy.parameters(), lr=0.00002)

# Train
history = train_reinforce(env, policy, optimizer, episodes=5000)

Episode 200, Avg Reward: -607.5
Episode 400, Avg Reward: -344.6
Episode 600, Avg Reward: -382.1
Episode 800, Avg Reward: -281.7
Episode 1000, Avg Reward: -441.1
Episode 1200, Avg Reward: 37.6
Episode 1400, Avg Reward: -78.7
Episode 1600, Avg Reward: 78.2
Episode 1800, Avg Reward: 81.1
Episode 2000, Avg Reward: 184.4
Episode 2200, Avg Reward: 86.0
Episode 2400, Avg Reward: 223.2
Episode 2600, Avg Reward: 165.9
Episode 2800, Avg Reward: 105.7
Episode 3000, Avg Reward: 286.7
Episode 3200, Avg Reward: 148.3
Episode 3400, Avg Reward: 246.4
Episode 3600, Avg Reward: 328.5
Episode 3800, Avg Reward: 307.8
Episode 4000, Avg Reward: 267.6
Episode 4200, Avg Reward: 368.7
Episode 4400, Avg Reward: 428.7
Episode 4600, Avg Reward: 368.6
Episode 4800, Avg Reward: 449.5
Episode 5000, Avg Reward: 249.6
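
The log above averages the last 100 episodes every 200 episodes, so half of the episodes never show up in the printed numbers. To inspect the whole curve, one option is a trailing moving average over the returned history. This is a minimal sketch; the function `moving_average` is a hypothetical helper, not part of the notebook.

```python
def moving_average(values, window=100):
    """Trailing mean; the first few entries average over fewer samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    running = 0.0
    for idx, v in enumerate(values):
        running += v
        if idx >= window:
            # Drop the value that just left the window.
            running -= values[idx - window]
        out.append(running / min(idx + 1, window))
    return out


print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))
# prints [1.0, 1.5, 2.5, 3.5]
```

With the training run above, it could be applied as `smoothed = moving_average(history, window=100)`.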


In [17]:
# Greedy action (argmax of the learned policy) for each state of the environment.
for i in range(env.size):
    for j in range(env.size):
        state = (i, j)
        state_tensor = env._state_to_features(state)
        with torch.no_grad():
            action_probs = policy(state_tensor)
        action = torch.argmax(action_probs).item()
        action_str = env.actions[action]
        print(f'state({i},{j}): {action_str}',end=',')
    print()

state(0,0): right,state(0,1): right,state(0,2): right,state(0,3): right,state(0,4): down,
state(1,0): down,state(1,1): right,state(1,2): right,state(1,3): down,state(1,4): down,
state(2,0): down,state(2,1): right,state(2,2): right,state(2,3): right,state(2,4): down,
state(3,0): down,state(3,1): right,state(3,2): right,state(3,3): right,state(3,4): down,
state(4,0): right,state(4,1): right,state(4,2): right,state(4,3): right,state(4,4): right,
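
The printed policy is easier to read as a grid of arrows. The sketch below re-enters the greedy actions printed above by hand and marks the goal and pit cells, where the printed action is irrelevant because the episode ends there. The helper `render_policy` and the arrow symbols are illustrative assumptions.

```python
ARROWS = {'up': '^', 'down': 'v', 'left': '<', 'right': '>'}


def render_policy(grid, goal, pit):
    """Turn a 2D list of action names into a text grid of arrows."""
    lines = []
    for i, row in enumerate(grid):
        cells = []
        for j, action in enumerate(row):
            if (i, j) == goal:
                cells.append('G')      # terminal: goal
            elif (i, j) == pit:
                cells.append('P')      # terminal: pit
            else:
                cells.append(ARROWS[action])
        lines.append(' '.join(cells))
    return '\n'.join(lines)


# Greedy actions copied from the printed output of the run above.
greedy = [
    ['right', 'right', 'right', 'right', 'down'],
    ['down', 'right', 'right', 'down', 'down'],
    ['down', 'right', 'right', 'right', 'down'],
    ['down', 'right', 'right', 'right', 'down'],
    ['right', 'right', 'right', 'right', 'right'],
]
print(render_policy(greedy, goal=(4, 4), pit=(2, 2)))
# prints:
# > > > > v
# v > > v v
# v > P > v
# v > > > v
# > > > > G
```

In this particular run, the greedy action at state (2,1) points into the pit, so the learned policy is not safe in every state even though the paths along the top row and right column reach the goal.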
