

The **HeatBug native model** typically runs on simple rules (e.g., approach heat sources, avoid obstacles), and its state space is limited to the bug's own position and the local temperature. **MARL (Multi-Agent Reinforcement Learning)**, in contrast, requires a larger and more complex state space. The key differences between the two are as follows:  

1. **Rule-Driven vs. Learning-Driven**  
    - The native model relies on predefined rules, eliminating the need for users to design a state space.  
    - MARL demands active definition of state variables (e.g., collaboration signals, adversary intentions) and handling of high-dimensional data.  

2. **Local Perception vs. Global Coordination**  
    - HeatBug's decisions depend solely on its own perception, while MARL requires encoding multi-agent coordination logic (e.g., shared goals, role allocation) into the state space.  

3. **Static Environment vs. Dynamic Environment**  
    - The native model operates in a fixed environment, whereas MARL must handle non-stationarity: as other agents update their strategies, the same state and action can lead to different outcomes.



# HeatBug-MARL Integration Framework  

## **Core Design Principles**  
1. **Decentralized Autonomy**: Each HeatBug acts as an independent RL agent  
2. **Local Observations**: Partial observability (3x3 thermal grid + neighbor count)  
3. **Dynamic Coupling**: Collective heat generation affects individual rewards  

---

## **MARL Components**  

### 1. **State Representation**  
- **Local Thermal Map**: 3x3 temperature matrix centered at agent  
- **Self Status**:  
  - Current energy level (normalized)  
  - Distance to ideal temperature: $|T_{current} - T_{ideal}|$  
- **Neighbor Context**:  
  - Number of bugs in perception range (Moore neighborhood)  
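
The three components above can be assembled into a single observation vector. The following is a minimal sketch, not the model's actual encoder: the function name `build_observation`, its argument names, and the wrap-around (toroidal) grid are all assumptions made for illustration.

```python
import numpy as np

def build_observation(temp_field, pos, energy, t_ideal, bug_positions):
    """Assemble one bug's local observation as a flat vector of length 12.

    All names are hypothetical; the grid is assumed to wrap around (torus).
    """
    rows, cols = temp_field.shape
    r, c = pos
    # 3x3 thermal patch centered on the bug, using wrap-around indexing
    rr = [(r + dr) % rows for dr in (-1, 0, 1)]
    cc = [(c + dc) % cols for dc in (-1, 0, 1)]
    patch = temp_field[np.ix_(rr, cc)]
    # Distance to ideal temperature: |T_current - T_ideal|
    temp_gap = abs(temp_field[r, c] - t_ideal)
    # Neighbor count: bugs on other cells inside the Moore neighborhood
    neighbors = sum(
        1 for (br, bc) in bug_positions
        if (br, bc) != (r, c)
        and min(abs(br - r), rows - abs(br - r)) <= 1
        and min(abs(bc - c), cols - abs(bc - c)) <= 1
    )
    return np.concatenate([patch.ravel(), [energy, temp_gap, neighbors]])
```

The result is a 12-element float vector (9 patch cells, energy, temperature gap, neighbor count). A tabular learner would still need to discretize it before using it as a Q-table key.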

### 2. **Action Space**  
- Maintain original movement options: `{N, S, E, W, Stand}`  
- Add communication primitive: `Broadcast_Thermal_Signal` (optional)  
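
One way to represent this action set is an integer enum with a lookup table of grid offsets. This is an illustrative sketch only: the names `Action`, `DELTAS`, and `apply_action`, the row/column orientation, and the wrapping grid are assumptions, not part of the original HeatBug code.

```python
from enum import IntEnum

class Action(IntEnum):
    N = 0
    S = 1
    E = 2
    W = 3
    STAND = 4
    BROADCAST = 5  # optional communication primitive

# (d_row, d_col) per movement action; row index is assumed to grow southward
DELTAS = {
    Action.N: (-1, 0),
    Action.S: (1, 0),
    Action.E: (0, 1),
    Action.W: (0, -1),
    Action.STAND: (0, 0),
}

def apply_action(pos, action, shape):
    """Return the new (row, col) on a wrapping grid; BROADCAST leaves the position unchanged."""
    d_row, d_col = DELTAS.get(Action(action), (0, 0))
    return ((pos[0] + d_row) % shape[0], (pos[1] + d_col) % shape[1])
```

If the broadcast action is enabled, each Q-table row needs six entries rather than five.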

### 3. **Reward Mechanism**  
$$ R_t = \underbrace{w_1 e^{-|T_t - T_{ideal}|}}_{\text{Thermal Fit}} - \underbrace{w_2 \Delta E}_{\text{Energy Cost}} + \underbrace{w_3 \sum_{n \in \text{Neighbors}} \frac{1}{d_n^2}}_{\text{Social Term}} $$  
- Here $\Delta E$ is the energy spent on the step and $d_n$ is the distance to neighbor $n$.  
- Adaptive weights: $w_1 + w_2 + w_3 = 1$  
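
The reward formula translates directly into a small function. The sketch below is an assumption-laden illustration: the function and argument names are hypothetical, and the default weights (0.6, 0.3, 0.1) are borrowed from the `RewardShaper` used later in the roadmap rather than tuned values.

```python
import math

def reward(temp, t_ideal, delta_energy, neighbor_dists, w=(0.6, 0.3, 0.1)):
    """R_t = w1 * exp(-|T - T_ideal|) - w2 * dE + w3 * sum(1 / d_n^2)."""
    w1, w2, w3 = w
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    thermal_fit = math.exp(-abs(temp - t_ideal))
    # Skip zero distances (bugs sharing a cell) to avoid division by zero
    social = sum(1.0 / d**2 for d in neighbor_dists if d > 0)
    return w1 * thermal_fit - w2 * delta_energy + w3 * social
```

For example, a bug exactly at its ideal temperature that spends one unit of energy, with neighbors at distances 1 and 2, receives `0.6 - 0.3 + 0.1 * 1.25 = 0.425`. Note that the social term is unbounded as distances shrink, so it may need clipping in practice.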

### 4. **Learning Architecture**  
- **Independent Q-Learning**:  
  ```python
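  import numpy as np

  ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount factor (illustrative values)

  class HeatBugRL(HeatBug):
      def update_q(self, state, action, reward, next_state):
          # Tabular Q-learning: move Q(s, a) toward the one-step TD target
          q_val = self.q_table[state][action]
          max_next = np.max(self.q_table[next_state])
          new_q = (1 - ALPHA) * q_val + ALPHA * (reward + GAMMA * max_next)
          self.q_table[state][action] = new_q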
  ```
- **Centralized Critic** (Optional): Shared value function for thermal stability  

---

## **Emergent Behavior Targets**  
1. **Phase Transition**:  
   - From random exploration → coordinated thermal regulation  
2. **Self-Organized Patterns**:  
   - Dynamic cluster formation/dispersion responding to global temperature  
3. **Energy-Efficient Migration**:  
   - Learned trade-off between movement cost and thermal optimization  


> **Key Insight**: The framework combines the environmental dynamics of agent-based modeling (ABM) with MARL's adaptive decision-making, creating a testbed for studying emergent coordination in physically coupled multi-agent systems.

---

# **Implementation Roadmap**  




## Phase 1: Baseline Integration  
1. Augment `HeatBug` class with Q-tables  
2. Modify `decide_movement()` to use ε-greedy policy  
3. Add reward calculation to `move()` method  


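```python
# Phase 1: Baseline Integration
import random
from collections import defaultdict

import numpy as np

N_ACTIONS = 5  # N, S, E, W, Stand

class HeatBugMARL(HeatBug):  # HeatBug: the native model's agent class
    def __init__(self, *args, epsilon=0.1, **kwargs):
        super().__init__(*args, **kwargs)
        self.q_table = defaultdict(lambda: np.zeros(N_ACTIONS))
        self.epsilon = epsilon

    def decide_movement(self, state):
        # ε-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < self.epsilon:
            return random.randrange(N_ACTIONS)
        return int(np.argmax(self.q_table[state]))

    def move(self, action):
        # calculate_energy and apply_movement are helpers assumed to be defined elsewhere
        energy_cost = calculate_energy(action)
        new_pos = apply_movement(action)
        reward = self.calculate_reward(new_pos, energy_cost)
        return new_pos, reward
```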
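
To show how action selection, movement, reward, and the Q-update fit together, here is a self-contained toy loop for independent Q-learning. It is a sketch under strong simplifying assumptions: `ToyBug` and `train` are hypothetical stand-ins for the real classes, the temperature field is static, the state is just the bug's position rather than an encoded thermal patch, and the reward keeps only the thermal-fit term.

```python
import random
from collections import defaultdict

import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]  # N, S, E, W, Stand

class ToyBug:
    """Independent learner on a static temperature field (toy stand-in for HeatBug)."""

    def __init__(self, pos, t_ideal):
        self.pos, self.t_ideal = pos, t_ideal
        self.q_table = defaultdict(lambda: np.zeros(len(MOVES)))

    def decide_movement(self, state, rng):
        # ε-greedy action selection
        if rng.random() < EPSILON:
            return rng.randrange(len(MOVES))
        return int(np.argmax(self.q_table[state]))

    def update_q(self, state, action, reward, next_state):
        # Same update as (1 - ALPHA) * q + ALPHA * target, written incrementally
        target = reward + GAMMA * np.max(self.q_table[next_state])
        self.q_table[state][action] += ALPHA * (target - self.q_table[state][action])

def train(field, bugs, steps, seed=0):
    rng = random.Random(seed)
    rows, cols = field.shape
    for _ in range(steps):
        for bug in bugs:  # each bug reads and writes only its own Q-table
            state = bug.pos
            action = bug.decide_movement(state, rng)
            d_row, d_col = MOVES[action]
            bug.pos = ((state[0] + d_row) % rows, (state[1] + d_col) % cols)
            reward = float(np.exp(-abs(field[bug.pos] - bug.t_ideal)))
            bug.update_q(state, action, reward, bug.pos)
    return bugs
```

A call such as `train(np.array([[10., 20.], [30., 40.]]), [ToyBug((0, 0), 20.0)], steps=200)` runs the loop on a 2x2 field. Because rewards lie in (0, 1], every Q-value stays between 0 and `1 / (1 - GAMMA)`.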


## Phase 2: Thermal-Aware Learning  
1. Implement state encoder: 3x3 temp grid → discrete state ID  
2. Design reward shaping functions  
3. Add experience replay buffer (prioritized by temp deviation)  



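```python
# Phase 2: Thermal-Aware Learning
import math
import numpy as np

class StateEncoder:
    def __init__(self, bins):
        self.bins = np.asarray(bins)  # temperature bin edges (caller-supplied assumption)

    def __call__(self, temp_grid):  # 3x3 grid → int
        # Bin each cell and read the 9 bin indices as base-n digits.
        # Unlike hash(str(...)), the ID is stable across runs and the state space stays small.
        digits = np.digitize(np.asarray(temp_grid).ravel(), self.bins)
        base = len(self.bins) + 1
        return sum(int(d) * base**i for i, d in enumerate(digits))

class RewardShaper:
    def __call__(self, temp_diff, energy, social=0):
        return 0.6 * math.exp(-abs(temp_diff)) - 0.3 * energy + 0.1 * social

class ReplayBuffer:
    def __init__(self):
        self.buffer, self.priorities = [], []

    def add(self, state, action, reward, next_state, priority):
        self.buffer.append((state, action, reward, next_state))
        self.priorities.append(abs(priority))  # Temp deviation
```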
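
The buffer stores priorities but needs a matching sampling step to make use of them. A minimal sketch follows, assuming a hypothetical helper `sample_prioritized` that takes the buffer's two lists directly. It implements proportional sampling only, without the importance-sampling weights used in full prioritized experience replay.

```python
import numpy as np

def sample_prioritized(buffer, priorities, batch_size, rng=None, eps=1e-6):
    """Draw transitions with probability proportional to priority (temperature deviation)."""
    rng = rng or np.random.default_rng()
    # eps keeps zero-priority transitions reachable and avoids a zero total
    p = np.asarray(priorities, dtype=float) + eps
    p /= p.sum()
    idx = rng.choice(len(buffer), size=batch_size, replace=True, p=p)
    return [buffer[i] for i in idx]
```

Transitions recorded far from the ideal temperature are replayed more often, which concentrates updates on the states where the bug performed worst.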

## Phase 3: Collective Adaptation  
1. Introduce social reward component  
2. Test with varying agent densities (5-50 bugs)  
3. Analyze emergent heat distribution patterns  




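```python
# Phase 3: Collective Adaptation
def social_reward(neighbors):
    # Small bonus for 1-3 neighbors; penalty for isolation (0) or crowding (4+)
    n = len(neighbors)
    return n * 0.05 if 0 < n < 4 else -0.1

def test_density(density):
    # HeatGrid is the simulation environment, assumed to be defined elsewhere
    env = HeatGrid()
    env.add_agents(density)  # number of bugs to place: 5-50
    return env.run_simulation()
```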

## **Evaluation Metrics**  
1. **Individual Level**:  
   - Energy efficiency: (ΔT achieved)/(Energy spent)  
   - Thermal satisfaction rate: % time in ±2°C of $T_{ideal}$  
2. **Collective Level**:  
   - Global temp variance across grid  
   - Cluster stability index (Lifespan of bug groups)

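```python
# Evaluation Metrics
import numpy as np

def energy_efficiency(bug):
    # ΔT achieved per unit of energy; the small constant avoids division by zero
    return bug.total_temp_change / (bug.energy_used + 1e-6)

def cluster_stability(clusters):
    # Mean cluster lifespan in simulation steps; 0.0 when no clusters were observed
    lifetimes = [c.end_step - c.start_step for c in clusters]
    return float(np.mean(lifetimes)) if lifetimes else 0.0
```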
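
The two remaining metrics from the list, thermal satisfaction rate and global temperature variance, can be sketched in the same style. The function names and the per-step temperature log `temps` are assumptions; the ±2 °C tolerance comes from the metric definition above.

```python
import numpy as np

def thermal_satisfaction_rate(temps, t_ideal, tol=2.0):
    """Fraction of time steps with |T - T_ideal| <= tol."""
    temps = np.asarray(temps, dtype=float)
    if temps.size == 0:
        return 0.0  # no observations recorded
    return float(np.mean(np.abs(temps - t_ideal) <= tol))

def global_temp_variance(temp_field):
    """Population variance of temperature over all grid cells."""
    return float(np.var(temp_field))
```

For a bug that logged temperatures `[18, 20, 23, 25]` with an ideal of 20, two of the four steps fall within tolerance, giving a rate of 0.5.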