# Reinforcement learning with GridWorld

This code defines a GridWorld environment, where an agent navigates a grid to reach a goal while avoiding a pit. It is useful for reinforcement learning (RL) experiments.

We first import the required packages and define the GridWorld class, which specifies how the world works. The agent starts at the top-left corner (0, 0). The goal is at the bottom-right corner, (size - 1, size - 1), and the pit sits diagonally next to it at (size - 2, size - 2). The agent moves up, down, left, or right; a move that would leave the grid is clamped, so the agent stays on the boundary cell. Reaching the goal gives a reward of +10, falling into the pit gives -10, and either event ends the episode. Every other step costs -1, which pushes the agent toward short paths.

The reset() method returns the agent to the starting position. step(action) updates the position and returns the new state, the reward, and a done flag. render() prints the grid with the agent (A), goal (G), and pit (P). This setup is suitable for tabular methods such as Q-learning.

In [1]:
#Import the required packages
import numpy as np
import random

#Define the GridWorld class
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.state = (0, 0)  # Agent starts at top-left corner
        self.goal = (size - 1, size - 1)  # Goal at bottom-right corner
        self.pit = (size - 2, size - 2)  # Pit diagonally adjacent to the goal

    def reset(self):
        self.state = (0, 0)  # Reset agent to the starting position
        return self.state

    def step(self, action):
        x, y = self.state
        # Define movement directions: [Up, Down, Left, Right]
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        dx, dy = moves[action]
        new_state = (x + dx, y + dy)

        # Stay within grid boundaries
        new_state = (max(0, min(self.size - 1, new_state[0])),
                     max(0, min(self.size - 1, new_state[1])))

        self.state = new_state

        # Define rewards
        if self.state == self.goal:
            return self.state, 10, True  # Goal reached
        elif self.state == self.pit:
            return self.state, -10, True  # Fell into the pit
        else:
            return self.state, -1, False  # Step penalty

    def render(self):
        grid = np.zeros((self.size, self.size), dtype=str)
        grid[:] = '.'
        grid[self.goal] = 'G'
        grid[self.pit] = 'P'
        x, y = self.state
        grid[x, y] = 'A'
        print("\n".join(" ".join(row) for row in grid))
        print()
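
The movement and reward rules can be checked in isolation. The sketch below re-implements them as two standalone helpers, `next_position` and `reward_for`; both names are hypothetical and not part of the notebook, and the logic is copied from `step()` under the assumption of the default 5x5 grid.

```python
# Same order as in GridWorld.step(): Up, Down, Left, Right
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def next_position(state, action, size=5):
    """Apply a move and clamp the result to the grid (hypothetical helper)."""
    row, col = state
    d_row, d_col = MOVES[action]
    return (max(0, min(size - 1, row + d_row)),
            max(0, min(size - 1, col + d_col)))

def reward_for(state, size=5):
    """Return (reward, done) for landing on `state` (hypothetical helper)."""
    goal = (size - 1, size - 1)
    pit = (size - 2, size - 2)
    if state == goal:
        return 10, True
    if state == pit:
        return -10, True
    return -1, False

# Moving Up from the top-left corner is clamped: the agent stays put
# and still pays the -1 step penalty.
blocked = next_position((0, 0), action=0)
print(blocked, reward_for(blocked))

# Moving Right from (4, 3) lands on the goal and ends the episode.
finished = next_position((4, 3), action=3)
print(finished, reward_for(finished))
```

Note that a clamped move is not free: the agent stays in place but is still charged for the step, so bumping into walls is discouraged.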

The QLearningAgent class implements tabular Q-learning for the GridWorld environment. It maintains a Q-table, an array of shape (size, size, 4) in which each entry estimates the expected discounted return for taking one of the four actions in a given cell. The agent balances exploration (choosing random actions) and exploitation (choosing the best-known action) with an epsilon-greedy strategy, where the exploration rate (epsilon) is multiplied by exploration_decay after every episode. The update_q_value method applies the Q-learning update, which is derived from the Bellman optimality equation and uses the learning rate (lr) and discount factor (gamma).

The train method runs many episodes so the agent can learn by interacting with the environment, receiving rewards, and refining its Q-values. After training, the test method follows the greedy policy, always taking the action with the highest learned Q-value, and renders the environment after each move. The greedy policy is only as good as the learned Q-table, so it improves as training finds better paths to the goal that avoid the pit.
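
For reference, the rule applied by update_q_value can be written in the usual notation. The symbols are the conventional ones and are an assumption of this note, not names used in the notebook: $\alpha$ corresponds to `lr`, $\gamma$ to `gamma`, $s$ and $s'$ to `state` and `next_state`, $a$ to `action`, and $r$ to `reward`.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

When $s'$ is the goal or the pit, the episode ends before any update is made from that cell, so its Q-values stay at their initial value of zero and the target reduces to $r$.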

In [2]:
# Define the Q-learning agent
class QLearningAgent:
    def __init__(self, env, learning_rate=0.1, discount_factor=0.9, exploration_rate=1.0, exploration_decay=0.99):
        self.env = env
        self.q_table = np.zeros((env.size, env.size, 4))  # State-action table
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = exploration_rate
        self.epsilon_decay = exploration_decay

    def choose_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, 3)  # Random action (exploration)
        else:
            x, y = state
            return np.argmax(self.q_table[x, y])  # Best action (exploitation)

    def update_q_value(self, state, action, reward, next_state):
        x, y = state
        next_x, next_y = next_state
        old_value = self.q_table[x, y, action]
        next_max = np.max(self.q_table[next_x, next_y])
        # Q-learning update: move the old value toward reward + gamma * max Q(next_state)
        self.q_table[x, y, action] = old_value + self.lr * (reward + self.gamma * next_max - old_value)

    def train(self, episodes=1000):
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.update_q_value(state, action, reward, next_state)
                state = next_state
            # Decay exploration rate
            self.epsilon *= self.epsilon_decay
            if episode % 100 == 0:
                print(f"Episode {episode}, Epsilon: {self.epsilon:.2f}")

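    def test(self, max_steps=50):
        state = self.env.reset()
        done = False
        steps = 0
        self.env.render()
        # Cap the rollout so a greedy policy that cycles cannot loop forever
        while not done and steps < max_steps:
            action = int(np.argmax(self.q_table[state[0], state[1]]))
            state, reward, done = self.env.step(action)
            self.env.render()
            steps += 1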
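
To make the update concrete, here is a small worked example using the default learning rate (0.1) and discount factor (0.9) from the constructor above. The helper `q_update` is a hypothetical standalone copy of the arithmetic in update_q_value, written only for illustration.

```python
def q_update(old_value, reward, next_max, lr=0.1, gamma=0.9):
    """One tabular Q-learning update (hypothetical standalone helper)."""
    return old_value + lr * (reward + gamma * next_max - old_value)

# Ordinary step from an all-zero table: only the -1 step penalty contributes.
q_step = q_update(old_value=0.0, reward=-1, next_max=0.0)

# Move that enters the goal: the goal cell's Q-values stay 0,
# so the target is just the +10 reward.
q_goal_first = q_update(old_value=0.0, reward=10, next_max=0.0)
q_goal_second = q_update(old_value=q_goal_first, reward=10, next_max=0.0)

print(round(q_step, 3), round(q_goal_first, 3), round(q_goal_second, 3))
```

The three values come out as -0.1, 1.0, and 1.9. Each visit closes one tenth of the remaining gap to the target, which is why the agent needs many episodes before the +10 at the goal shows up in the Q-values of cells near the start.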


Calling agent.train(episodes=1000) runs 1,000 training episodes. In each one the agent explores the environment and updates its Q-values with the Q-learning rule. Because epsilon is multiplied by 0.99 after every episode, the agent acts almost entirely at random at first and almost entirely greedily after a few hundred episodes, as the log shows. With these settings the agent should learn a short path to the goal that avoids the pit, although tabular Q-learning with a fixed episode budget does not guarantee an optimal policy.

In [3]:
# Initialize the environment and agent
env = GridWorld(size=5)
agent = QLearningAgent(env)

# Train the agent
print("Training the agent...")
agent.train(episodes=1000)

Training the agent...
Episode 0, Epsilon: 0.99
Episode 100, Epsilon: 0.36
Episode 200, Epsilon: 0.13
Episode 300, Epsilon: 0.05
Episode 400, Epsilon: 0.02
Episode 500, Epsilon: 0.01
Episode 600, Epsilon: 0.00
Episode 700, Epsilon: 0.00
Episode 800, Epsilon: 0.00
Episode 900, Epsilon: 0.00
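
The epsilon values in this log follow directly from the decay rule. As a quick check, the sketch below recomputes them in closed form; `epsilon_after` is a hypothetical helper, and it assumes the defaults used above (a starting epsilon of 1.0 and a decay of 0.99).

```python
def epsilon_after(episode, start=1.0, decay=0.99):
    """Epsilon printed for `episode` (hypothetical helper).

    train() decays epsilon before printing, so by the time episode n is
    logged the decay has been applied n + 1 times.
    """
    return start * decay ** (episode + 1)

for episode in (0, 100, 200, 300, 400, 500, 600):
    print(f"Episode {episode}, Epsilon: {epsilon_after(episode):.2f}")
```

The printed lines match the first seven lines of the training log. Since epsilon rounds to 0.00 from episode 600 onward, the second half of training is effectively greedy.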


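This cell runs the trained agent's greedy policy from the start state and renders the grid after every move, to check whether it has learned to navigate GridWorld. In the run shown, the agent reaches the goal in eight moves and steps around the pit.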

In [4]:
# Let's test the agent and see if it can make it to the end
print("Testing the agent...")
agent.test()

Testing the agent...
A . . . .
. . . . .
. . . . .
. . . P .
. . . . G

. . . . .
A . . . .
. . . . .
. . . P .
. . . . G

. . . . .
. A . . .
. . . . .
. . . P .
. . . . G

. . . . .
. . A . .
. . . . .
. . . P .
. . . . G

. . . . .
. . . . .
. . A . .
. . . P .
. . . . G

. . . . .
. . . . .
. . . A .
. . . P .
. . . . G

. . . . .
. . . . .
. . . . A
. . . P .
. . . . G

. . . . .
. . . . .
. . . . .
. . . P A
. . . . G

. . . . .
. . . . .
. . . . .
. . . P .
. . . . A
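
The rendered frames can be turned into a quick sanity check. The sketch below is an illustration, not part of the original notebook: it transcribes the agent's (row, column) position from each of the nine frames above, confirms that every move is a single step, and adds up the rewards using the rules of this environment.

```python
# Agent positions (row, col) read off the nine rendered frames, in order
path = [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2),
        (2, 3), (2, 4), (3, 4), (4, 4)]
goal, pit = (4, 4), (3, 3)

moves = len(path) - 1

# Every consecutive pair of cells must differ by exactly one step
assert all(abs(r1 - r0) + abs(c1 - c0) == 1
           for (r0, c0), (r1, c1) in zip(path, path[1:]))

# -1 for each ordinary move, +10 for the move that enters the goal
total_reward = sum(10 if cell == goal else -1 for cell in path[1:])

# Fewest possible moves from start to goal on an open grid
shortest = abs(goal[0] - path[0][0]) + abs(goal[1] - path[0][1])

print(moves, total_reward, shortest)
```

This prints `8 3 8`: the agent takes eight moves, which is the minimum possible from (0, 0) to (4, 4), and collects seven step penalties plus the goal reward for a total of 3. The pit at (3, 3) does not appear in the path.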

