# Reinforcement Learning Basics for our project

This Jupyter notebook is a proof of concept (PoC) for reinforcement learning applied to our use case. It includes the basic elements: the environment, the policy, the agent (pending), and the training loop.

In this example:

- The nodes represent random (x, y) coordinates.

- The starting node is selected at random.

- The goal is to reach all remaining nodes while minimizing the total distance traveled.
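
The objective described in the list above can be made concrete with a small sketch. The helper `tour_length` and the four example coordinates are hypothetical and are not part of the notebook. The sketch assumes an open path that does not return to the starting node, which matches how the environment below accumulates distance.

```python
import torch

def tour_length(nodes: torch.Tensor, tour: list[int]) -> torch.Tensor:
    """Sum of Euclidean distances between consecutive nodes in `tour`.

    The path is open: it does not return to the starting node.
    """
    ordered = nodes[tour]                 # [len(tour), 2], nodes in visiting order
    steps = ordered[1:] - ordered[:-1]    # [len(tour) - 1, 2], displacement per step
    return steps.norm(dim=1).sum()

# Hypothetical illustration data: the four corners of the unit square
nodes = torch.tensor([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

perimeter = tour_length(nodes, [0, 1, 2, 3])   # three unit-length steps
crossing = tour_length(nodes, [0, 2, 1, 3])    # two diagonals plus one unit step
print(perimeter.item(), crossing.item())
```

The same nodes give different totals depending on the visiting order, and choosing that order is what the policy has to learn.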




## Configurations

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


## The environment

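The environment holds the state of the problem at a single moment:

- the node coordinates
- the current node
- a mask of the nodes already visited

Each call to `step` moves to the chosen node and returns the negative distance traveled as the reward.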

In [3]:
class TSPEnvironment:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.reset()

    def reset(self):
        # Node coordinates (x, y): a [num_nodes, 2] tensor sampled uniformly from [0, 1)
        self.nodes = torch.rand(self.num_nodes, 2)
        #self.current = torch.randint(0, self.num_nodes, (1,)).item()
        self.current = random.randint(0, self.num_nodes - 1)
        self.visited = torch.zeros(self.num_nodes, dtype=torch.bool)
        self.visited[self.current] = True
        self.tour = [self.current]
        return self.get_state()

    def get_state(self):
        return self.nodes, self.current, self.visited

    def step(self, action):
        prev = self.current
        self.current = action
        self.visited[action] = True
        self.tour.append(action)

        dist = torch.norm(self.nodes[prev] - self.nodes[action])
        reward = -dist

        # Convert the 0-dim bool tensor to a plain Python bool
        done = bool(self.visited.all())
        return self.get_state(), reward, done

## The policy

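The policy is a neural network whose learnable parameters are optimized during the training loop.

In this example, it is an attention-based policy. The embedding of the current node acts as the query, the embeddings of all nodes act as the keys, and their dot products score each candidate for the next node to visit. Visited nodes are masked out before the softmax, so they receive zero probability.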
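
As a minimal sketch of the masking step used by the policy below, the following snippet applies `masked_fill` and `softmax` to a hand-written score vector. The score values and the mask are hypothetical illustration data, not output from the model.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores for 4 nodes
scores = torch.tensor([2.0, 1.0, 0.5, 1.0])
# True = already visited, so nodes 0 and 2 must not be chosen again
visited = torch.tensor([True, False, True, False])

# Visited nodes get a score of -inf, which softmax maps to probability 0
masked = scores.masked_fill(visited, float("-inf"))
probs = F.softmax(masked, dim=0)
print(probs)
```

Node 0 has the highest raw score but receives zero probability because it is masked. The remaining probability mass is split between the unvisited nodes. If every node were masked, the softmax would produce NaN values, so this sketch assumes at least one unvisited node remains.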

In [4]:
class AttentionPolicy(nn.Module):
    def __init__(self, node_dim=2, embed_dim=128):
        super().__init__()
        self.node_embed = nn.Linear(node_dim, embed_dim)

        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)

    def forward(self, nodes, current_node_idx, visited_mask):
        """
        nodes: [N, node_dim]             node features (e.g. coordinates)
        current_node_idx: int            current position
        visited_mask: [N] (bool)         True = already visited
        """

        # Embed nodes
        h = self.node_embed(nodes)       # [N, embed_dim]

        # Query = embedding of current node
        q = self.query(h[current_node_idx])   # [embed_dim]

        # Keys = all nodes
        k = self.key(h)                       # [N, embed_dim]

        # Attention scores
        scores = torch.matmul(k, q)           # [N]

        # Mask visited nodes, the agent cannot revisit them
        scores = scores.masked_fill(visited_mask, float("-inf"))

        # Stochastic policy: probability of choosing each node as the next one
        probs = F.softmax(scores, dim=0)

        return probs
    

## Training Loop

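The training loop uses REINFORCE: each episode samples a full tour from the policy, and the sum of the log-probabilities of the chosen actions is weighted by the episode's total return. Every episode draws a new random set of nodes, so individual rewards are noisy, but the logged total reward improves from -5.81 at episode 0 to -2.48 at episode 450. The `visited` tensor is cloned before it is passed to the policy because `env.step` later modifies it in place, which otherwise causes a PyTorch autograd error during `backward()`.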
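
The in-place issue mentioned above can be reproduced in isolation. The sketch below is not taken from the notebook, and the helper `backward_after_mutation` is a hypothetical name. It assumes that autograd keeps the mask passed to `masked_fill` for the backward pass and checks that the mask has not been modified since.

```python
import torch

def backward_after_mutation(clone_mask: bool) -> bool:
    """Return True if backward() succeeds after `visited` is changed in place."""
    scores = torch.randn(4, requires_grad=True)
    visited = torch.tensor([True, False, False, False])

    # Either hand autograd a private copy of the mask or the shared tensor
    mask = visited.clone() if clone_mask else visited
    probs = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=0)
    log_prob = torch.log(probs[1])

    # Same kind of in-place write that env.step performs on its visited tensor
    visited[1] = True

    try:
        log_prob.backward()
        return True
    except RuntimeError:
        return False

print("without clone:", backward_after_mutation(clone_mask=False))
print("with clone:   ", backward_after_mutation(clone_mask=True))
```

Without the clone, the tensor saved for the backward pass is the same object that is later written to, so the backward pass is expected to fail. With the clone, the environment can update its own copy freely.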

In [5]:

def train():
    env = TSPEnvironment(num_nodes=10)
    policy = AttentionPolicy()
    optimizer = optim.Adam(policy.parameters(), lr=1e-3)

    for episode in range(500):
        log_probs = []
        rewards = []

        state = env.reset()
        done = False

        while not done:
            nodes, current, visited = state
            probs = policy(nodes, current, visited.clone())

            dist = torch.distributions.Categorical(probs)
            action = dist.sample()

            log_probs.append(dist.log_prob(action))

            state, reward, done = env.step(action.item())
            rewards.append(reward)

        # Total return
        R = sum(rewards)

        # REINFORCE loss
        loss = -torch.stack(log_probs).sum() * R

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if episode % 50 == 0:
            print(f"Episode {episode}, total reward: {R:.2f}")



train()

Episode 0, total reward: -5.81
Episode 50, total reward: -2.94
Episode 100, total reward: -4.73
Episode 150, total reward: -3.31
Episode 200, total reward: -3.18
Episode 250, total reward: -3.10
Episode 300, total reward: -3.06
Episode 350, total reward: -3.27
Episode 400, total reward: -2.75
Episode 450, total reward: -2.48
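
Single-episode rewards are noisy, so one rough way to read the log above is to compare averages over its first and second halves. The sketch below uses only the ten values printed above. The helper `mean` is a hypothetical name, and this comparison is an informal check, not a proper evaluation on held-out instances.

```python
# Rewards copied from the training log above (one value every 50 episodes)
logged = [-5.81, -2.94, -4.73, -3.31, -3.18, -3.10, -3.06, -3.27, -2.75, -2.48]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

first_half = mean(logged[:5])    # episodes 0 to 200
second_half = mean(logged[5:])   # episodes 250 to 450
print(f"first half: {first_half:.2f}, second half: {second_half:.2f}")
# prints "first half: -3.99, second half: -2.93"
```

The second-half average is closer to zero, which is consistent with shorter tours. With only ten logged samples, this should be treated as a hint and not as a measurement.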
