# Reinforcement Learning
### SARSA With an RBF network

We talked about **SARSA**, a **model-free** **temporal difference** (TD) method used in **Reinforcement Learning** (RL) to obtain the optimal policy for a **Markov Decision Process** (MDP). SARSA updates the action-value function $q(s,a)$ with the following iteration:
<br> $\large q(s,a)\leftarrow q(s,a)+\alpha (r+\gamma q(s',a')-q(s,a))$
<br> where $s'$ is the next state, and $a'$ is the action chosen at state $s'$. Also, $r$ is the reward received after taking action $a$ in state $s$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
<br> **Hint** Along with the SARSA algorithm, we use ϵ-greedy action selection, which we discussed in the previous post.
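
To make the update rule concrete, here is a minimal sketch of a single SARSA step on made-up numbers. The function name `sarsa_update` and all numeric values are chosen for illustration only; they do not come from the Grid World example.

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma, done=False):
    """Return the new q(s,a) after one SARSA step.

    q_next is q(s',a'); it is ignored when s' is terminal (done=True).
    """
    target = reward if done else reward + gamma * q_next
    return q_sa + alpha * (target - q_sa)

# Illustrative values: q(s,a)=2.0, r=-1, q(s',a')=3.0, alpha=0.5, gamma=0.9
new_q = sarsa_update(q_sa=2.0, reward=-1.0, q_next=3.0, alpha=0.5, gamma=0.9)
print(round(new_q, 4))

# On a terminal transition the bootstrap term is dropped
terminal_q = sarsa_update(q_sa=2.0, reward=10.0, q_next=3.0,
                          alpha=0.5, gamma=0.9, done=True)
print(round(terminal_q, 4))
```
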
So far, we have used tables for $q$-values. This time, we employ an RBF network to approximate $q(s,a)$ such that:
<br> $\large q(s,a)=F_a(\boldsymbol{x}(s))$
<br>where $\boldsymbol{x}(s)$ is the feature vector extracted from state $s$.
We use SGD (stochastic gradient descent) to adjust the weights of the RBF network.
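<br> **Note** The `RBFNetwork` class below is linear in its hidden-layer activations, so one SGD step on the squared TD error can be written out directly. As a sketch, with notation introduced here for illustration ($\boldsymbol{\phi}(s)$ for the RBF activations at $\boldsymbol{x}(s)$, and $\boldsymbol{w}_a$, $b_a$ for the output weights and bias of action $a$), and treating the target $r+\gamma q(s',a')$ as a constant:
<br> $\large q(s,a)=\boldsymbol{w}_a^\top\boldsymbol{\phi}(s)+b_a$
<br> $\large \delta=r+\gamma q(s',a')-q(s,a)$
<br> $\large \boldsymbol{w}_a\leftarrow \boldsymbol{w}_a+\alpha\,\delta\,\boldsymbol{\phi}(s),\qquad b_a\leftarrow b_a+\alpha\,\delta$
<br> where $q(s',a')$ is taken as $0$ when $s'$ is terminal.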
<hr>

The example in this Notebook is almost the same **Grid World** we introduced earlier. Generally, the grid can have any size:
- **States:** A size×size grid (size*size states), labeled (0,0) to (size-1,size-1).
- **Actions:** Up, Down, Left, Right.
- **Rewards:**
    - Reaching the goal state (size-1,size-1) gives a reward of +10.
    - Reaching the "pit" state (size//2,size//2), the center cell, gives a reward of −10.
    - All other transitions give a reward of −1.
- **Terminal States:** (size-1,size-1) (goal) and (size//2,size//2) (pit).
- **Transition Probabilities:**
    - Moving in the intended direction succeeds with probability 0.8.
    - With probability 0.2, the agent moves in a direction drawn uniformly at random from the four actions (which may again be the intended one). A move that would leave the grid keeps the agent in place.
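
Because the random direction in the `step` method below is drawn from all four actions, the intended move is executed with probability 0.8 + 0.2/4 = 0.85, and each other direction with probability 0.05. The sketch below computes these numbers and checks them by sampling; the helper names `effective_probs` and `sample_direction` are hypothetical and are not part of the notebook.

```python
import random
from collections import Counter

ACTIONS = ['up', 'down', 'left', 'right']

def effective_probs(intended, p_intended=0.8):
    """Exact probability of each executed direction for one intended action."""
    slip = (1.0 - p_intended) / len(ACTIONS)  # share of the random branch per action
    return {a: slip + (p_intended if a == intended else 0.0) for a in ACTIONS}

def sample_direction(intended, rng, p_intended=0.8):
    """Sample the executed direction: intended with p_intended, else uniform."""
    if rng.random() < p_intended:
        return intended
    return rng.choice(ACTIONS)

probs = effective_probs('right')
rng = random.Random(0)
n_samples = 20000
counts = Counter(sample_direction('right', rng) for _ in range(n_samples))
empirical = {a: counts[a] / n_samples for a in ACTIONS}

print({a: round(p, 2) for a, p in probs.items()})
print({a: round(p, 3) for a, p in empirical.items()})
```
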

<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Import required modules
import numpy as np
import random
from collections import deque
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

In [2]:
# Define the RBF network class, trained with SGD (one linear output per action)
# num_centers is also the number of neurons in the hidden layer
class RBFNetwork:
    def __init__(self, state_dim, action_dim, num_centers=20, sigma=1.):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.num_centers = num_centers
        self.sigma = sigma
        
        # Initialize RBF centers, weights, and biases
        self.centers = np.random.rand(num_centers, state_dim)
        self.weights = np.random.rand(num_centers, action_dim) * 0.001
        self.biases = np.zeros(action_dim)  # Bias term for each action
        
    def rbf(self, x):
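        # Calculate Gaussian RBF activations, one per center.
        # Note: sigma multiplies the squared distance here, so a larger
        # sigma gives narrower basis functions.
        x = np.array(x).reshape(1, -1)
        distances = np.linalg.norm(self.centers - x, axis=1)
        return np.exp(-self.sigma * distances**2)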
    
    def predict(self, x):
        # Predict Q-values for all actions (including bias terms)
        phi = self.rbf(x)
        # Add biases to each output
        return np.dot(phi, self.weights) + self.biases  
    
    def update(self, x, action, target, learning_rate):
        # Update weights and biases using gradient descent
        phi = self.rbf(x)
        q_values = np.dot(phi, self.weights) + self.biases
        error = target - q_values[action]
        
        # Update weights for the chosen action
        self.weights[:, action] += learning_rate * error * phi
        
        # Update bias for the chosen action
        self.biases[action] += learning_rate * error
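
As a quick illustration of what `update` does, the following sketch repeats the same gradient step on one fixed input and target and watches the error shrink. It is a standalone toy, not the notebook's training loop: the five centers, three actions, input, target, and learning rate are all arbitrary values chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.random((5, 2))   # 5 hypothetical centers in a 2-D feature space
weights = np.zeros((5, 3))     # output weights for 3 hypothetical actions
biases = np.zeros(3)
sigma = 1.0                    # multiplies the squared distance, as in rbf() above

def rbf(x):
    """Gaussian activations exp(-sigma * ||x - c||^2), one per center."""
    d = np.linalg.norm(centers - np.asarray(x).reshape(1, -1), axis=1)
    return np.exp(-sigma * d**2)

x, action, target, lr = np.array([0.2, 0.7]), 1, 5.0, 0.1
errors = []
for _ in range(50):
    phi = rbf(x)
    error = target - (phi @ weights + biases)[action]
    errors.append(abs(error))
    weights[:, action] += lr * error * phi   # only the chosen action's column moves
    biases[action] += lr * error

print(f"first |error|: {errors[0]:.3f}, last |error|: {errors[-1]:.5f}")
```

Only the column of `weights` that belongs to the chosen action changes, which is why the other actions' predictions for this input stay at their initial values.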

In [3]:
# Define the environment as a grid world
# The agent has four possible actions
# The grid contains size*size cells
# We also define the pit cell and the goal cell,
# and we specify the rewards
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.actions = ['up', 'down', 'left', 'right']
        self.action_map = {a:i for i,a in enumerate(self.actions)}
        self.terminal = {(size-1, size-1): 10}  # Goal at bottom-right
        self.pits = {(size//2, size//2): -10}   # Pit at center
        self.current_state = None

    def reset(self):
        self.current_state = (0, 0)
        return self._state_to_features(self.current_state)

    def step(self, action_idx):
        action = self.actions[action_idx]
        i, j = self.current_state

        # No further moves once the episode has ended (goal or pit)
        if self.current_state in self.terminal or self.current_state in self.pits:
            return self._state_to_features(self.current_state), 0, True

        # Movement with stochasticity
        if random.random() < 0.8:
            next_state = self._move(action, i, j)
        else:
            next_state = self._move(random.choice(self.actions), i, j)

        self.current_state = next_state
        reward = self.terminal.get(next_state, self.pits.get(next_state, -1))
        done = next_state in self.terminal or next_state in self.pits
        return self._state_to_features(next_state), reward, done

    def _move(self, action, i, j):
        if action == 'up': return (max(i-1, 0), j)
        elif action == 'down': return (min(i+1, self.size-1), j)
        elif action == 'left': return (i, max(j-1, 0))
        elif action == 'right': return (i, min(j+1, self.size-1))

    def _state_to_features(self, state):
        i, j = state
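        features = [
            i / (self.size-1),                    # normalized row
            j / (self.size-1),                    # normalized column
            i / (self.size-1)*j / (self.size-1),  # product of normalized row and column
            (self.size-1 - i) / (self.size-1),    # normalized row distance to the goal
            (self.size-1 - j) / (self.size-1),    # normalized column distance to the goal
            abs(i - self.size//2) / self.size,    # row distance to the pit
            abs(j - self.size//2) / self.size,    # column distance to the pit
            float(i in [0, self.size-1]),         # 1.0 if on the top or bottom edge
            float(j in [0, self.size-1])]         # 1.0 if on the left or right edge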
        return np.array(features)
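
As a quick check of the feature map, the sketch below re-implements `_state_to_features` as a standalone function and prints the vectors for the start cell, the pit, and the goal of a 5×5 grid. The name `state_to_features` and its `size` argument are introduced here for illustration.

```python
import numpy as np

def state_to_features(state, size=5):
    """Standalone copy of the nine features used by GridWorld above."""
    i, j = state
    n = size - 1
    return np.array([
        i / n, j / n,                  # normalized position
        i / n * j / n,                 # interaction term
        (n - i) / n, (n - j) / n,      # normalized distance to the goal
        abs(i - size // 2) / size,     # row distance to the pit
        abs(j - size // 2) / size,     # column distance to the pit
        float(i in [0, n]),            # on the top or bottom edge
        float(j in [0, n]),            # on the left or right edge
    ])

start = state_to_features((0, 0))
pit = state_to_features((2, 2))
goal = state_to_features((4, 4))
for name, vec in [("start", start), ("pit", pit), ("goal", goal)]:
    print(name, vec)
```
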

In [4]:
# The function to initialize the RBF centers
def initialize_rbf_centers(env, rbf_net, num_samples=1000):
    """Initialize RBF centers using random states from the environment"""
    states = []
    for _ in range(num_samples):
        env.reset()
        for _ in range(10):
            action = random.randint(0, len(env.actions)-1)
            state, _, done = env.step(action)
            states.append(state)
            if done:
                break
    
    states = np.unique(np.array(states), axis=0)  # Remove duplicate states
    actual_centers = min(rbf_net.num_centers, len(states))  # Adjust centers based on unique states
    
    kmeans = KMeans(n_clusters=actual_centers, n_init=10)  # Explicitly set n_init
    kmeans.fit(states)
    
    # If we got fewer centers than requested, fill the rest with random points in [0, 1)
    if actual_centers < rbf_net.num_centers:
        additional_centers = np.random.rand(rbf_net.num_centers - actual_centers, rbf_net.state_dim)
        rbf_net.centers = np.vstack([kmeans.cluster_centers_, additional_centers])
    else:
        rbf_net.centers = kmeans.cluster_centers_
    D = cdist(rbf_net.centers, rbf_net.centers)
    np.fill_diagonal(D, np.inf)  # Ignore self-distance
    rbf_net.sigma = np.mean(np.min(D, axis=1))  # Avg dist to nearest center  
    
# The epsilon-greedy action-selection policy
def epsilon_greedy(rbf_net, state, epsilon, action_dim):
    if random.random() < epsilon:
        return random.randint(0, action_dim-1)
    else:
        q_values = rbf_net.predict(state)
        return np.argmax(q_values)
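
The sketch below isolates the two ingredients of the exploration strategy: ϵ-greedy selection on a fixed vector of q-values, and the linear ϵ schedule that the training loop in this notebook uses (`max(.01, 1-episode/episodes)`). The function names `epsilon_greedy_from_q` and `linear_epsilon` and the q-values are made up for the illustration.

```python
import random
import numpy as np

def epsilon_greedy_from_q(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))

def linear_epsilon(episode, episodes, floor=0.01):
    """Linear decay from 1.0 towards 0, clipped at `floor`."""
    return max(floor, 1 - episode / episodes)

rng = random.Random(0)
q = np.array([0.1, 0.5, -0.2, 0.3])   # illustrative q-values for four actions

greedy_picks = [epsilon_greedy_from_q(q, 0.0, rng) for _ in range(100)]
random_picks = [epsilon_greedy_from_q(q, 1.0, rng) for _ in range(1000)]
schedule = [round(linear_epsilon(e, 3000), 2) for e in (0, 1500, 2999)]

print(set(greedy_picks))          # only the greedy action when epsilon is 0
print(sorted(set(random_picks)))  # several actions appear when epsilon is 1
print(schedule)
```
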

In [8]:
# Train SARSA with the RBF network and SGD
def train_sarsa_rbf(env, episodes=2000, batch_size=1, gamma=0.99, learning_rate=0.06):
    state_dim = len(env._state_to_features((0,0)))
    action_dim = len(env.actions)
    agent = RBFNetwork(state_dim, action_dim, num_centers=25)  # 25 RBF centers (hidden neurons)
    # Initialize RBF centers
    initialize_rbf_centers(env, agent)    
    buffer = deque(maxlen=10000)
    epsilon = 1.0    
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy(agent, state, epsilon, action_dim)
        episode_reward = 0        
        while True:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(agent, next_state, epsilon, action_dim)            
            # Store experience in buffer
            buffer.append((state, action, reward, next_state, next_action, done))
            episode_reward += reward
            # Update from transitions sampled uniformly from the buffer
            # (with batch_size=1, one stored transition per step)
            if len(buffer) >= batch_size:
                batch_samples = random.sample(buffer, batch_size)
                for sample in batch_samples:
                    s, a, r, s_next, a_next, d = sample
                    # SARSA target: bootstrap from q(s', a') unless s' is terminal
                    next_q = agent.predict(s_next)[a_next] if not d else 0
                    target = r + gamma * next_q
                    agent.update(s, a, target, learning_rate)
            if done: break                
            state, action = next_state, next_action        
        # Decay epsilon linearly, with a floor of 0.01
        epsilon = max(0.01, 1 - episode / episodes)
        if episode % 100 == 0:
            print(f"Episode {episode}, Reward: {episode_reward}, Epsilon: {epsilon:.2f}",end='; ')    
    return agent
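
The loop above draws each update from a replay buffer, so the transition being learned from is usually not the one just experienced. For comparison, here is a minimal sketch of a strictly online SARSA loop that updates on the current transition only. To keep it short and self-contained it uses a hypothetical 1-D corridor with one-hot features instead of the Grid World and the RBF network; the environment, names, and hyperparameters are assumptions made for this illustration.

```python
import random
import numpy as np

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (-1, +1)         # action 0 moves left, action 1 moves right

def step(s, a):
    """Deterministic move; -1 per step, +10 on reaching the goal."""
    s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    done = s_next == GOAL
    return s_next, (10.0 if done else -1.0), done

def features(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def choose(w, s, eps, rng):
    """Epsilon-greedy action from the linear q-estimates."""
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    return int(np.argmax(features(s) @ w))

def train(episodes=500, alpha=0.1, gamma=0.99, seed=0):
    rng = random.Random(seed)
    w = np.zeros((N_STATES, len(ACTIONS)))
    for ep in range(episodes):
        eps = max(0.01, 1 - ep / episodes)
        s, done = 0, False
        a = choose(w, s, eps, rng)
        while not done:
            s_next, r, done = step(s, a)
            a_next = choose(w, s_next, eps, rng)
            q_next = 0.0 if done else features(s_next) @ w[:, a_next]
            delta = r + gamma * q_next - features(s) @ w[:, a]
            w[:, a] += alpha * delta * features(s)  # update on the current transition
            s, a = s_next, a_next
    return w

w = train()
greedy = [int(np.argmax(w[s])) for s in range(GOAL)]
print(greedy)  # greedy action index for each non-terminal state
```
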

In [9]:
# The function to test the learned policy
def test_policy(env, agent):
    state = env.reset()
    done = False
    total_reward = 0
    steps = 0
    
    print("\n---------Testing trained policy:")
    while not done:
        action = np.argmax(agent.predict(state))
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        print(f"Step {steps}: At {env.current_state}, took {env.actions[action]}, Reward: {reward}")
    
    print(f"\nTotal reward: {total_reward}")
    print(f"Episode ended after {steps} steps")

# The main function to call
if __name__ == "__main__":
    env = GridWorld(size=5)
    trained_agent = train_sarsa_rbf(env, episodes=3000)
    test_policy(env, trained_agent)

Episode 0, Reward: -60, Epsilon: 1.00; Episode 100, Reward: -28, Epsilon: 0.97; Episode 200, Reward: -48, Epsilon: 0.93; Episode 300, Reward: -13, Epsilon: 0.90; Episode 400, Reward: -46, Epsilon: 0.87; Episode 500, Reward: -24, Epsilon: 0.83; Episode 600, Reward: -31, Epsilon: 0.80; Episode 700, Reward: -15, Epsilon: 0.77; Episode 800, Reward: -25, Epsilon: 0.73; Episode 900, Reward: -19, Epsilon: 0.70; Episode 1000, Reward: -13, Epsilon: 0.67; Episode 1100, Reward: -9, Epsilon: 0.63; Episode 1200, Reward: -8, Epsilon: 0.60; Episode 1300, Reward: -13, Epsilon: 0.57; Episode 1400, Reward: -15, Epsilon: 0.53; Episode 1500, Reward: -15, Epsilon: 0.50; Episode 1600, Reward: 0, Epsilon: 0.47; Episode 1700, Reward: 2, Epsilon: 0.43; Episode 1800, Reward: 1, Epsilon: 0.40; Episode 1900, Reward: -15, Epsilon: 0.37; Episode 2000, Reward: -17, Epsilon: 0.33; Episode 2100, Reward: 0, Epsilon: 0.30; Episode 2200, Reward: 1, Epsilon: 0.27; Episode 2300, Reward: -11, Epsilon: 0.23; Episode 2400, Re

In [10]:
# Print the greedy action for each state
# (entries for the pit (2,2) and the goal (4,4) carry no meaning, since no action is taken there)
for i in range(env.size):
    for j in range(env.size):
        state=env._state_to_features((i,j))
        action = np.argmax(trained_agent.predict(state))
        print(f'state({i},{j}): {env.actions[action]}',end=',')
    print()

state(0,0): right,state(0,1): right,state(0,2): right,state(0,3): right,state(0,4): down,
state(1,0): down,state(1,1): right,state(1,2): right,state(1,3): right,state(1,4): down,
state(2,0): down,state(2,1): right,state(2,2): right,state(2,3): right,state(2,4): down,
state(3,0): right,state(3,1): right,state(3,2): right,state(3,3): right,state(3,4): down,
state(4,0): right,state(4,1): right,state(4,2): right,state(4,3): right,state(4,4): left,
