### How the Agent Works

The drone is controlled using **Q-learning**, a type of reinforcement learning. This method lets the agent learn behaviors by interacting with the environment and receiving rewards. At the start, the agent knows nothing about the environment, but it learns over time by trying actions, getting feedback, and updating its knowledge.

#### Decision-Making Through the Q-Table

The agent uses a **Q-table** to guide its decisions. The Q-table is like a reference sheet where the agent stores information about the best actions to take in different situations (states). Here’s how it works:

- **Exploration and Exploitation**: At first, the drone explores by trying random actions. Over time, it uses its Q-table to choose actions that are known to give the best rewards.
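- **State-Action Match**: The Q-table links each state to the possible actions (up, down, left, right). In this notebook the state key is the drone's energy level plus the contents of its four neighboring cells; the environment also reports the drone's position, but the position is not used as part of the key. The agent checks the Q-values for each action and picks the one with the highest value.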
- **Learning the Best Behavior**: After every move, the agent updates the Q-table based on the rewards it gets. The more the agent trains, the better it becomes at choosing actions.
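
The lookup described in the list above can be sketched with a toy Q-table. The state keys and Q-values here are illustrative numbers, and `greedy_action` is a hypothetical helper; the notebook itself performs the same selection with `np.argmax`.

```python
# Toy Q-table: each state key maps to one Q-value per action.
# Key layout assumed here: (energy, (up, down, left, right)).
ACTION_NAMES = {0: "up", 1: "down", 2: "left", 3: "right"}

toy_q_table = {
    (20, (-1, 0, -1, 0)): [-1.4, 12.2, -2.0, 8.6],
    (19, (0, 0, -1, 1)): [9.5, 8.9, -2.7, -2.5],
}

def greedy_action(q_table, state_key):
    """Return the index of the action with the highest Q-value for this state."""
    q_values = q_table[state_key]
    return max(range(len(q_values)), key=lambda a: q_values[a])

for key in toy_q_table:
    print(key, "->", ACTION_NAMES[greedy_action(toy_q_table, key)])
```

The first state selects `down` and the second selects `up`, because those entries hold the largest value in their rows.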

#### Why This Algorithm is Realistic

Unlike classical path-planning algorithms like **A*** or **Dijkstra**, which assume the full map is known in advance, the Q-learning agent doesn't rely on having complete information. Instead, it explores and adapts. In a **dynamic environment**, where survivors move, obstacles appear, or resources are replenished, a path planned once on the initial map can become invalid, so these algorithms have to replan whenever the map changes.

This is why Q-learning is a more realistic fit here. It learns a policy from local observations rather than from a fixed map, so it can keep acting when the environment changes unpredictably, as in a real disaster zone. However, because the Q-learning agent doesn't know the full map and relies on exploration, it will be slower and less efficient on large static maps than algorithms like A* or Dijkstra. In dynamic environments, the agent's ability to adapt makes it the better choice.


In [1]:
import numpy as np
import random
from collections import defaultdict
import pickle
import matplotlib.pyplot as plt

class DisasterZoneEnv:
    """
    A simplified 2D grid environment for a drone exploring a disaster zone.

    Legend:
      0 -> Empty cell
      1 -> Obstacle
      2 -> Survivor
      3 -> Resource
      D -> Drone (tracked separately, but displayed in render)
    """

    def __init__(self, width=8, height=8, num_obstacles=5, num_survivors=3, num_resources=2, initial_energy=20, dynamic=False):
        """
        Initialize the environment with configurable dimensions and grid contents.

        :param width: Width of the grid.
        :param height: Height of the grid.
        :param num_obstacles: Number of obstacles to place in the grid.
        :param num_survivors: Number of survivors to place in the grid.
        :param num_resources: Number of resources to place in the grid.
        :param initial_energy: Initial energy level of the drone.
        :param dynamic: Whether the environment is dynamic (changes during simulation).
        """
        self.width = width
        self.height = height
        self.num_obstacles = num_obstacles
        self.num_survivors = num_survivors
        self.num_resources = num_resources
        self.initial_energy = initial_energy
        self.dynamic = dynamic  # Enable or disable dynamic changes

        # Define possible actions - up, down, left, right
        self.action_space = {
            0: (-1, 0),  # up
            1: (1, 0),   # down
            2: (0, -1),  # left
            3: (0, 1)    # right
        }

        self.reset()

    def reset(self):
        """
        Resets the environment to a starting state.
        """
        self.grid = np.zeros((self.height, self.width), dtype=int)

        # Place obstacles randomly
        for _ in range(self.num_obstacles):
            x, y = self._get_random_empty_cell()
            self.grid[x, y] = 1

        # Place survivors randomly
        for _ in range(self.num_survivors):
            x, y = self._get_random_empty_cell()
            self.grid[x, y] = 2

        # Place resources randomly
        for _ in range(self.num_resources):
            x, y = self._get_random_empty_cell()
            self.grid[x, y] = 3

        # Drone's starting position
        self.drone_x, self.drone_y = self._get_random_empty_cell()
        self.energy = self.initial_energy

        return self._get_state()

    def _get_random_empty_cell(self):
        """
        Finds a random empty cell (not an obstacle, survivor, or resource).
        """
        while True:
            x = random.randint(0, self.height - 1)
            y = random.randint(0, self.width - 1)
            if self.grid[x, y] == 0:
                return x, y

    def _get_state(self):
        """
        Returns the current state, which includes:
        - Drone's position (x, y)
        - Drone's energy level
        - Information about the cells up, down, left, and right of the drone

        A* and Dijkstra can ignore self.around, but it is essential for Q-Learning
        """
        up = self.grid[self.drone_x - 1, self.drone_y] if self.drone_x > 0 else -1
        down = self.grid[self.drone_x + 1, self.drone_y] if self.drone_x < self.height - 1 else -1
        left = self.grid[self.drone_x, self.drone_y - 1] if self.drone_y > 0 else -1
        right = self.grid[self.drone_x, self.drone_y + 1] if self.drone_y < self.width - 1 else -1

    
        # Store surrounding information in a tuple
        self.around = (up, down, left, right)
    
        # Return the full state
        return (self.drone_x, self.drone_y, self.energy, self.around)


    def step(self, action):
        """
        Executes a step in the environment.
        """
        dx, dy = self.action_space[action]
        new_x = self.drone_x + dx
        new_y = self.drone_y + dy

        reward = 0
        done = False

        if not self._in_bounds(new_x, new_y):
            reward -= 10  # Penalty for trying to move out of bounds
        elif self.grid[new_x, new_y] == 1:  # Obstacle collision
            reward -= 10
        else:
            # Valid move
            self.drone_x, self.drone_y = new_x, new_y

            if self.grid[new_x, new_y] == 0:
                reward += 1  # Reward for moving to an empty cell
            elif self.grid[new_x, new_y] == 2:
                reward += 10  # Reward for rescuing a survivor
                self.grid[new_x, new_y] = 0  # Remove survivor
            elif self.grid[new_x, new_y] == 3:
                reward += 5  # Reward for collecting a resource
                self.energy += 5  # Add 5 energy when collecting a resource
                self.grid[new_x, new_y] = 0  # Remove resource

        # Energy cost per move
        self.energy -= 1
        reward -= 1  # Decrease reward for the energy spent moving

        if self.energy <= 0:
            done = True

        return self._get_state(), reward, done


    def _in_bounds(self, x, y):
        """Check if the position is within the grid boundaries."""
        return 0 <= x < self.height and 0 <= y < self.width

    def apply_dynamic_changes(self, step_count):
        """
        Applies dynamic changes to the grid, such as adding obstacles, moving survivors,
        and placing new resources, based on the current step count.

        :param step_count: The current simulation step.
        """
        if self.dynamic:
            # Add a new obstacle every 5 steps
            if step_count % 5 == 0:
                x, y = self._get_random_empty_cell()
                self.grid[x, y] = 1  # Add an obstacle
                #print(f"Dynamic Change: Added obstacle at ({x}, {y})")

            # Move survivors every 3 steps
            if step_count % 3 == 0:
                survivor_positions = [(x, y) for x in range(self.height)
                                      for y in range(self.width) if self.grid[x, y] == 2]
                for x, y in survivor_positions:
                    self.grid[x, y] = 0  # Remove survivor from the current position
                    new_x, new_y = self._get_random_empty_cell()
                    self.grid[new_x, new_y] = 2  # Place survivor in a new position
                    #print(f"Dynamic Change: Moved survivor from ({x}, {y}) to ({new_x}, {new_y})")

            # Add a new resource every 7 steps
            if step_count % 7 == 0:
                x, y = self._get_random_empty_cell()
                self.grid[x, y] = 3  # Add a resource
                #print(f"Dynamic Change: Added resource at ({x}, {y})")      
    
    def render(self):
        """
        Display the current environment state.
        """
        grid_copy = self.grid.astype(str)
        grid_copy[grid_copy == '0'] = '.'
        grid_copy[grid_copy == '1'] = '#'
        grid_copy[grid_copy == '2'] = 'S'
        grid_copy[grid_copy == '3'] = 'R'
        grid_copy[self.drone_x, self.drone_y] = 'D'

        for row in grid_copy:
            print(" ".join(row))
        print(f"Energy: {self.energy}\n")


### Explanation of Functions

#### `initialize_q_table_dict(action_space)`
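Initializes the Q-table as a `defaultdict` that maps each state key to an array of action values, filled with zeros the first time a state is looked up. Storing only the states that are actually visited avoids allocating a table for the entire state space.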

#### `compute_state_key(energy, around)`
Builds a hashable key from the drone's energy level and the contents of its four neighboring cells. The drone's position is not part of the key, so different positions with the same energy and surroundings share one Q-table entry.
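
As a rough illustration of what this key captures, the sketch below repeats the one-line `compute_state_key` and applies it to two made-up full states of the form `(x, y, energy, around)`. It also counts the neighbor patterns that are possible in principle, assuming each of the four cells takes one of the five codes used by the environment (-1 for out of bounds, 0 to 3 for cell contents). The count is an upper bound, since some patterns cannot occur on a real grid.

```python
from itertools import product

def compute_state_key(energy, around):
    # Same definition as in the notebook: position is not part of the key.
    return (energy, tuple(around))

# Two hypothetical states that differ only in the drone's position.
state_a = (2, 3, 15, (0, 0, 1, 0))
state_b = (5, 6, 15, (0, 0, 1, 0))

key_a = compute_state_key(state_a[2], state_a[3])
key_b = compute_state_key(state_b[2], state_b[3])
print(key_a == key_b)  # True: both positions share one Q-table row

# Upper bound on distinct neighbor patterns: 5 codes for each of 4 cells.
CELL_CODES = (-1, 0, 1, 2, 3)
patterns = list(product(CELL_CODES, repeat=4))
print(len(patterns))  # 625
```

The total number of keys is this pattern count multiplied by the number of energy levels the drone actually reaches, which stays far smaller than a table indexed by position as well.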

#### `q_learning_train_dict(env, q_table, ...)`
This function implements the Q-learning training loop. Actions are selected with an **epsilon-greedy** policy, and the Q-values are updated after every step:

- **Epsilon-Greedy**: The agent chooses an action based on exploration or exploitation. With probability `epsilon`, the agent will explore and pick a random action; otherwise, it will exploit the Q-table by choosing the action with the highest value.
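- **Epsilon Decay**: After each episode, `epsilon` is multiplied by `epsilon_decay` and never drops below 0.1, so the agent shifts from exploration toward exploitation (using the knowledge it has gained in its Q-table) while keeping a small amount of random exploration.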
  
The exploration-exploitation balance is crucial for learning in unknown environments. The decay allows the agent to explore less as it becomes more confident in its actions.
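
To get a feel for how quickly exploration fades, the sketch below reproduces the decay rule with the default values of `q_learning_train_dict` (`epsilon=1.0`, `epsilon_decay=0.996`, floor of 0.1). The helper name `epsilon_schedule` is hypothetical and not part of the notebook.

```python
def epsilon_schedule(episodes, epsilon=1.0, epsilon_decay=0.996, floor=0.1):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        # Same rule as the training loop: decay once per episode, clamp at the floor.
        epsilon = max(epsilon * epsilon_decay, floor)
    return values

schedule = epsilon_schedule(1000)

# First episode (0-indexed) in which epsilon has reached the floor.
first_floor_episode = next(i for i, e in enumerate(schedule) if e <= 0.1)
print(first_floor_episode)  # 575
print(schedule[0], schedule[-1])  # 1.0 0.1
```

With these defaults the floor is reached after a few hundred episodes, so in a run of 300,000 episodes almost all training happens at a fixed 10% exploration rate.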

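The rest of the training loop updates the Q-values with the Q-learning rule derived from the Bellman equation, `Q(s, a) ← Q(s, a) + alpha * (reward + gamma * max Q(s', ·) - Q(s, a))`, and records the total reward of each episode so that progress can be monitored.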
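
A single update can be worked through by hand. The sketch below uses the default `alpha=0.1` and `gamma=0.9` from `q_learning_train_dict`; the helper `q_update` and the input numbers are made up for illustration. The reward of 9 corresponds to rescuing a survivor (+10) minus the per-move energy penalty (-1).

```python
def q_update(current_q, reward, max_future_q, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move current_q a fraction alpha toward the target."""
    td_target = reward + gamma * max_future_q
    return current_q + alpha * (td_target - current_q)

# Hypothetical situation: the current estimate is 2.0, the move rescues a survivor
# (reward 9), and the best Q-value in the next state is 5.0.
new_q = q_update(current_q=2.0, reward=9, max_future_q=5.0)
print(round(new_q, 2))  # 3.15
```

The target is `9 + 0.9 * 5.0 = 13.5`, and the estimate moves one tenth of the way from 2.0 toward it.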


In [3]:
from collections import defaultdict

def initialize_q_table_dict(action_space):
    """
    Initializes a Q-table using a dictionary to handle large state spaces efficiently.
    :param action_space: The action space of the environment to determine the action space size.
    :return: A defaultdict for the Q-table.
    """
    return defaultdict(lambda: np.zeros(len(action_space)))
    
def compute_state_key(energy, around):
    """
    Computes a unique state key based on the agent's energy and surrounding grid.
    :param energy: Drone's current energy level.
    :param around: Tuple containing information about up, down, left, and right cells.
    :return: A hashable state key.
    """
    return (energy, tuple(around))

def q_learning_train_dict(env, q_table, episodes=10000, max_steps=100, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.996):
    """
    Q-learning training loop for an agent with a simplified state structure (energy + surroundings).
    Prints the average reward every 10,000 episodes and returns the list of per-episode rewards.
    """
    total_rewards = []  # Store rewards per episode for analysis

    for episode in range(episodes):
        state = env.reset()  # Reset the environment for a new episode
        total_reward = 0
        step_count = 0  # Step counter for dynamic changes

        for step in range(max_steps):
            # Apply dynamic environment changes if applicable, then re-read the state so the
            # agent acts on the grid as it is now rather than on a stale observation
            if env.dynamic:
                env.apply_dynamic_changes(step_count)
                state = env._get_state()

            # Unpack the (possibly refreshed) state
            energy, around = state[2], state[3]
            state_key = compute_state_key(energy, around)  # State as a tuple for Q-table indexing

            # Choose action (epsilon-greedy)
            if np.random.random() < epsilon:
                action = np.random.choice(list(env.action_space.keys()))  # Exploration
            else:
                action = np.argmax(q_table[state_key])  # Exploitation

            # Take the action and observe the next state, reward, and whether the episode is done
            next_state, reward, done = env.step(action)

            # Unpack the next state
            next_energy, next_around = next_state[2], next_state[3]
            next_state_key = compute_state_key(next_energy, next_around)

            # Compute Q-values
            current_q = q_table[state_key][action]
            max_future_q = np.max(q_table[next_state_key])

            # Update Q-value using the Bellman equation
            q_table[state_key][action] = current_q + alpha * (reward + gamma * max_future_q - current_q)

            # Update the state and accumulate the total reward
            state = next_state
            total_reward += reward

            # Increment step counter for dynamic changes
            step_count += 1

            # Break if the episode is done
            if done:
                break

        # Decay epsilon
        epsilon = max(epsilon * epsilon_decay, 0.1)

        # Track rewards
        total_rewards.append(total_reward)

        # Print average reward every 10,000 episodes
        if (episode + 1) % 10000 == 0:
            avg_reward = np.mean(total_rewards[-10000:])
            print(f"Episode {episode + 1}/{episodes}: Average Reward (Last 10000 Episodes): {avg_reward:.2f}")

    return total_rewards


### Testing the Pre-Trained Agent

The `test_pretrained_agent` function is used to run a single experiment where the agent follows a pre-trained Q-table. It executes one episode, making decisions based on the learned values, and takes into account dynamic changes in the environment during the process. The function tracks the agent's steps, rewards, and prints useful information for debugging.


In [5]:
def test_pretrained_agent(env, q_table, max_steps=100):
    """
    Test the agent using a pre-trained Q-table for one episode, including dynamic changes.
    """
    state = env.reset()
    total_reward = 0
    step_count = 0  # Track steps for dynamic changes

    print("\nTesting the agent:")
    env.render()

    for step in range(max_steps):
        # Unpack the state
        energy, around = state[2], state[3]
        state_key = (energy, tuple(around))  # Adjusted to match training state key

        # Exploit the pre-trained Q-table
        if state_key in q_table:
            action = np.argmax(q_table[state_key])
            print(f"Q-Values: {q_table[state_key]}")
        else:
            # Fallback in case state is not in Q-table (shouldn't happen if trained well)
            action = np.random.choice(list(env.action_space.keys()))
            print("State not found in Q-table. Taking random action.")

        # Print debug information
        print(f"\nStep {step + 1}:")
        print(f"Energy: {energy}")
        print(f"Surrounding Information (around): {around}")
        print(f"State: {state_key}")
        print(f"Action Taken: {action}")

        # Take action
        next_state, reward, done = env.step(action)
        env.apply_dynamic_changes(step_count)  # Apply dynamic environment changes
        if env.dynamic:
            next_state = env._get_state()  # Re-read surroundings after the grid changed

        print(f"Reward: {reward}")
        env.render()

        # Update state and accumulate reward
        state = next_state
        total_reward += reward

        # Increment step count for dynamic changes
        step_count += 1

        if done:
            print(f"\nEpisode finished after {step + 1} steps with total reward {total_reward}")
            break

    if not done:
        print(f"\nEpisode ended after {max_steps} steps with total reward {total_reward}")


### Training the Agent

In this section, the agent is trained in two environments: a static one and a dynamic one. We begin with the **static environment**, where the map does not change during an episode (each call to `reset()` still generates a new random layout). This lets the Q-table be updated against consistent surroundings within each episode and helps the agent optimize its actions over time.

After completing the training in the static environment, we move on to the **dynamic environment**, where the map changes randomly. For example, obstacles may appear, and survivors or resources may be relocated, making it more challenging for the agent to adapt. The Q-table continues to be updated during this phase, enabling the agent to learn how to navigate the dynamic changes more effectively.

As the agent trains, the **reward typically improves** with each iteration, reflecting better decision-making. After many episodes, the improvement in reward becomes **smaller**, indicating that the agent's strategy is stabilizing and the updates to the Q-table are becoming less significant. A plateau does not by itself show that the behavior is optimal: because `epsilon` never drops below 0.1, about one action in ten is still chosen at random during training, which limits the average training reward.
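
The flattening of the learning curve can be checked with a few lines of arithmetic. The sketch below takes the first six block averages from the static-environment training log in this notebook and computes the gain from one 10,000-episode block to the next; `successive_gains` is a hypothetical helper.

```python
# Block averages copied from the static-environment training log (first six entries).
logged_averages = [6.42, 10.94, 11.72, 12.14, 12.10, 12.28]

def successive_gains(averages):
    """Difference between each block average and the previous one, rounded to 2 decimals."""
    return [round(later - earlier, 2) for earlier, later in zip(averages, averages[1:])]

print(successive_gains(logged_averages))  # [4.52, 0.78, 0.42, -0.04, 0.18]
```

The first block-to-block gain is several reward points, while later gains are a fraction of a point and occasionally negative, which is the plateau described above.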


In [7]:
# In your main function:
if __name__ == "__main__":
    # Global variable to hold the Q-table
    q_table_global = None
    n = 300000

    # Initialize the environment and Q-table for training
    env_dynamic = DisasterZoneEnv(width=8, height=8, num_obstacles=5, num_survivors=4, num_resources=3, initial_energy=20, dynamic=True)
    env_static = DisasterZoneEnv(width=8, height=8, num_obstacles=5, num_survivors=4, num_resources=2, initial_energy=20, dynamic=False)
    
    # Initialize the Q-table (pass the action space size from the environment)
    q_table = initialize_q_table_dict(env_static.action_space)  # Use static env's action space

    # Train the agent in the static environment
    print("Training the Q-learning agent in the static environment...")
    static_rewards = q_learning_train_dict(env_static, q_table, episodes=n, max_steps=100)

    # Train the agent in the dynamic environment
    print("Training the Q-learning agent in the dynamic environment...")
    dynamic_rewards = q_learning_train_dict(env_dynamic, q_table, episodes=n, max_steps=100)

    # Store the trained Q-table in a global variable
    q_table_global = q_table

Training the Q-learning agent in the static environment...
Episode 10000/300000: Average Reward (Last 10000 Episodes): 6.42
Episode 20000/300000: Average Reward (Last 10000 Episodes): 10.94
Episode 30000/300000: Average Reward (Last 10000 Episodes): 11.72
Episode 40000/300000: Average Reward (Last 10000 Episodes): 12.14
Episode 50000/300000: Average Reward (Last 10000 Episodes): 12.10
Episode 60000/300000: Average Reward (Last 10000 Episodes): 12.28
Episode 70000/300000: Average Reward (Last 10000 Episodes): 12.67
Episode 80000/300000: Average Reward (Last 10000 Episodes): 12.80
Episode 90000/300000: Average Reward (Last 10000 Episodes): 12.57
Episode 100000/300000: Average Reward (Last 10000 Episodes): 12.84
Episode 110000/300000: Average Reward (Last 10000 Episodes): 12.83
Episode 120000/300000: Average Reward (Last 10000 Episodes): 12.61
Episode 130000/300000: Average Reward (Last 10000 Episodes): 13.00
Episode 140000/300000: Average Reward (Last 10000 Episodes): 13.08
Episode 15000

### Testing the Agent in Different Environments

To better understand the agent's behavior, we will test it in both a **dynamic** and a **static** environment, using the **pre-trained Q-table**. This will help visualize how the agent moves and makes decisions based on the learned Q-values.

For every step, the test prints the state key, the Q-values stored for that state, the chosen action, the reward, and the resulting grid. This makes it possible to follow how the agent navigates and how it reacts when the map changes during an episode.


In [20]:
env_dynamic2 = DisasterZoneEnv(width=8, height=8, num_obstacles=5, num_survivors=4, num_resources=3, initial_energy=20, dynamic=True)
env_static2 = DisasterZoneEnv(width=8, height=8, num_obstacles=5, num_survivors=4, num_resources=2, initial_energy=20, dynamic=False)

# Test the pre-trained agent in the dynamic environment
print("\nTesting the agent in the dynamic environment:")
test_pretrained_agent(env_dynamic2, q_table_global, max_steps=100)

# Test the pre-trained agent in the static environment
print("\nTesting the agent in the static environment:")
test_pretrained_agent(env_static2, q_table_global, max_steps=100)


Testing the agent in the dynamic environment:

Testing the agent:
D . . . . . . .
. # . S . . . .
. # S . . . . #
# . . . . S . S
. . . . R . . .
. . . . . . . .
. R . . . . . .
# R . . . . . .
Energy: 20

Q-Values: [-1.42583566 12.21212334 -2.01170586  8.64395492]

Step 1:
Energy: 20
Surrounding Information (around): (-1, 0, -1, 0)
State: (20, (-1, 0, -1, 0))
Action Taken: 1
Reward: 0
. S . . . . . R
D # . . . . . .
. # S . . . . #
# . . . . . # .
. . . . R . . .
. . . . . . . .
. R . . S . . .
# R . . . . . S
Energy: 19

Q-Values: [ 9.49186316  8.91595046 -2.68002658 -2.5357945 ]

Step 2:
Energy: 19
Surrounding Information (around): (0, 0, -1, 1)
State: (19, (0, 0, -1, 1))
Action Taken: 0
Reward: 0
D S . . . . . R
. # . . . . . .
. # S . . . . #
# . . . . . # .
. . . . R . . .
. . . . . . . .
. R . . S . . .
# R . . . . . S
Energy: 18

Q-Values: [-0.38220488  8.79261418  0.07957802  3.62517259]

Step 3:
Energy: 18
Surrounding Information (around): (-1, 0, -1, 2)
State: (18, (-1, 0, 

### Running the Pre-trained Q-learning Agent

Since the model was trained and the Q-table computed in the previous steps, all that remains is to use the resulting `q_table_global`.

The `q_learning_agent` function takes a randomly generated environment and the pre-trained Q-table and runs a single episode. At each step it picks the action with the highest Q-value for the current state, falling back to a random action if the state was never seen during training, and it returns the total reward accumulated over the episode.

#### How to Use It:
To run the agent, simply provide a random environment and the `q_table_global` like so:

```python
total_reward = q_learning_agent(env, q_table_global, max_steps=100)
```


In [68]:
def q_learning_agent(env, q_table, max_steps=100):
    """
    Runs a single episode using the pre-trained Q-table and returns the total reward.
    """
    state = env.reset()
    total_reward = 0

    for step in range(max_steps):
        # Apply dynamic environment changes if applicable, then re-read the state so the
        # agent acts on the current grid rather than on a stale observation
        if env.dynamic:
            env.apply_dynamic_changes(step)
            state = env._get_state()

        # Unpack the (possibly refreshed) state
        energy, around = state[2], state[3]  # Extract energy and surrounding info from the state
        state_key = compute_state_key(energy, around)  # Generate the state key

        # Exploit the Q-table
        if state_key in q_table:
            action = np.argmax(q_table[state_key])
        else:
            # Handle unseen states (fallback to random action)
            action = np.random.choice(list(env.action_space.keys()))

        # Take action in the environment
        next_state, reward, done = env.step(action)

        # Update state and accumulate reward
        state = next_state
        total_reward += reward

        # End the episode if done
        if done:
            break

    return total_reward

if __name__ == "__main__":
    total_rewards = []  # List to store total rewards for each test

    for i in range(100):  # Loop for 100 random environments
        # Create a new random environment for each iteration
        env = DisasterZoneEnv(width=8, height=8, num_obstacles=5, num_survivors=4, num_resources=2, initial_energy=20, dynamic=True)

        # Get the total reward for the current environment
        total_reward = q_learning_agent(env, q_table_global, max_steps=100)
        total_rewards.append(total_reward)  # Store the total reward

    # Compute the mean total reward over all tests
    mean_reward = sum(total_rewards) / len(total_rewards)
    print(f"\nMean Total Reward over 100 tests: {mean_reward}")



Mean Total Reward over 100 tests: 33.57
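
A single mean can hide how much individual episodes vary. As a hedged sketch, the hypothetical helper `summarize_rewards` below (not part of the notebook) reports the mean together with the sample standard deviation and the extremes. The sample list is made up; in the notebook one would pass the `total_rewards` list built in the loop above.

```python
import statistics

def summarize_rewards(rewards):
    """Summarize per-episode total rewards with mean, sample stdev, min, and max."""
    return {
        "mean": statistics.mean(rewards),
        # stdev needs at least two values; report 0.0 for a single episode.
        "stdev": statistics.stdev(rewards) if len(rewards) > 1 else 0.0,
        "min": min(rewards),
        "max": max(rewards),
    }

# Illustrative values only, not results from the trained agent.
sample_rewards = [12, 40, 25, 33, 30]
summary = summarize_rewards(sample_rewards)
print(summary["mean"], summary["min"], summary["max"])  # 28 12 40
```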
