# Automated Graph of Thoughts - Simple PPO Approach
As a first approach with Deep Reinforcement Learning (DRL), a simple PPO agent is trained on lists of fixed cardinality.
The goal of this first DRL approach is to verify that a complex Reinforcement Learning agent is able to learn a task for a given cardinality.

## Ensure Reproducibility
The seed for the PRNG is set to $0$.

In [1]:
from stable_baselines3.common.utils import set_random_seed

seed = 0
set_random_seed(seed)

## Provide Required Components with Parameters
Factory functions for the required components are provided.
The experiment is run with the following parameters:
- maximum graph depth: $8$
- maximum graph breadth: $4$
- divergence cutoff factor: $0.5$

The model is trained solely on lists of cardinality $16$.
The complexity equals the list cardinality.

In [2]:
from auto_graph_of_thoughts.language_model import create_simulated_chat_gpt_sum_list
from auto_graph_of_thoughts.controller import ContinuousGraphController
import random
from typing import Tuple

from pure_graph_of_thoughts.api.state import State

MAX_DEPTH = 8
MAX_BREADTH = 4
DIVERGENCE_CUTOFF_FACTOR = 0.5

CARDINALITIES = [16]
COMPLEXITIES = CARDINALITIES
MAX_COMPLEXITY = max(CARDINALITIES)

_random = random.Random(seed)


def generate_init_state() -> Tuple[int, State]:
    complexity = _random.choice(COMPLEXITIES)
    list_cardinality = complexity
    init_state: State = {
        'list': [_random.randint(0, 9) for _ in range(list_cardinality)]
    }
    return complexity, init_state


def create_controller() -> ContinuousGraphController:
    return ContinuousGraphController(
            language_model=create_simulated_chat_gpt_sum_list(seed),
            generate_init_state=generate_init_state,
            max_depth=MAX_DEPTH,
            max_breadth=MAX_BREADTH,
            divergence_cutoff_factor=DIVERGENCE_CUTOFF_FACTOR,
            max_complexity=MAX_COMPLEXITY
    )
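
The sampling logic in `generate_init_state` can be checked in isolation. The following is a minimal sketch that uses only the standard library: it replaces `State` with a plain `dict` and uses a hypothetical helper `sample_list_state` that receives the generator as an argument, so it is not the notebook's own function. It illustrates that two generators created with the same seed yield identical lists of cardinality $16$ with digits between $0$ and $9$.

```python
import random
from typing import Dict, List, Tuple

CARDINALITIES = [16]


def sample_list_state(rng: random.Random) -> Tuple[int, Dict[str, List[int]]]:
    """Hypothetical stand-in for generate_init_state; a plain dict replaces State."""
    complexity = rng.choice(CARDINALITIES)
    # The complexity equals the list cardinality.
    state = {'list': [rng.randint(0, 9) for _ in range(complexity)]}
    return complexity, state


# Two generators with the same seed produce the same initial state.
first = sample_list_state(random.Random(0))
second = sample_list_state(random.Random(0))
print(first == second, len(first[1]['list']))
```

In the notebook itself, a single module-level generator `_random` is shared across calls, so consecutive episodes receive different lists while the overall sequence stays reproducible.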

## Instantiate Environment
The `GraphOfThoughtsEnv` environment is instantiated.
The maximum number of time steps per episode is set to $100$.

In [3]:
from auto_graph_of_thoughts.env import GraphOfThoughtsEnv
from gymnasium.utils.env_checker import check_env
from auto_graph_of_thoughts.tasks.sum_list import sum_list_task


def create_env() -> GraphOfThoughtsEnv:
    return GraphOfThoughtsEnv(
            sum_list_task,
            create_controller(),
            seed=seed,
            max_steps=100
    )


env = create_env()
check_env(env)



## Train PPO Model
The default PPO model is trained with a vectorized environment (number of environments: `8`).
The total number of time steps is set to $2^{18}$ ($262{,}144$).
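
The time-step budget can be related to the number of PPO iterations with a short calculation. The sketch below assumes the Stable-Baselines3 default rollout length of `n_steps = 2048` per environment, which is not set explicitly in this notebook; the constant names are illustrative.

```python
N_ENVS = 8                 # number of vectorized environments
N_STEPS = 2048             # assumed: Stable-Baselines3 PPO default rollout length per environment
TOTAL_TIMESTEPS = 2 ** 18  # training budget

# Each PPO iteration collects one rollout of N_STEPS from every environment.
timesteps_per_iteration = N_ENVS * N_STEPS
iterations = TOTAL_TIMESTEPS // timesteps_per_iteration

print(TOTAL_TIMESTEPS)          # 262144
print(timesteps_per_iteration)  # 16384
print(iterations)               # 16
```

Under this assumption, the value of 16384 agrees with the `total_timesteps` reported after the first iteration in the training log.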

In [4]:
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env(create_env, n_envs=8)
model_ppo = PPO('MultiInputPolicy', vec_env, verbose=1)
model_ppo.learn(total_timesteps=2 ** 18)
mean_reward, std_reward = evaluate_policy(model_ppo, model_ppo.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward}")

Using cuda device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 5.34     |
|    ep_rew_mean     | -0.581   |
| time/              |          |
|    fps             | 1231     |
|    iterations      | 1        |
|    time_elapsed    | 13       |
|    total_timesteps | 16384    |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 4.03       |
|    ep_rew_mean          | -0.351     |
| time/                   |            |
|    fps                  | 701        |
|    iterations           | 2          |
|    time_elapsed         | 46         |
|    total_timesteps      | 32768      |
| train/                  |            |
|    approx_kl            | 0.03127069 |
|    clip_fraction        | 0.542      |
|    clip_range           | 0.2        |
|    entropy_loss         | -1.58      |
|    explained_variance   | -0.173     |
|    learning_rate        |

In [5]:
model_ppo.save('./models/simple_ppo')

## Evaluate PPO Model
The trained PPO model is evaluated for $100$ time steps; the environment is reset whenever an episode ends.

In [7]:
from auto_graph_of_thoughts.env import ObservationComponent

env = create_env()
model_ppo = PPO.load('./models/simple_ppo')
obs, info = env.reset()
for _ in range(100):
    # Query the policy for the next action given the current observation.
    action, _states = model_ppo.predict(obs)
    depth = obs[ObservationComponent.depth.value].item()
    decoded_action = env.decode_action(action)
    obs, reward, terminated, truncated, info = env.step(action)
    operation = decoded_action.operation
    operation_name = operation.name if operation is not None else None
    print(f'depth: {depth} - action: {decoded_action.type.name}-{operation_name} = {float(reward)}')
    if terminated or truncated:
        # Start a new episode once the current one has ended.
        obs, info = env.reset()
        print('Episode end')

depth: 0 - action: AppendOperation-split = 0.0875
depth: 1 - action: AppendOperation-generate_single = 0.075
depth: 2 - action: AppendOperation-merge = 0.0625
depth: 3 - action: AppendOperation-generate_single = 0.05
depth: 4 - action: AppendOperation-generate_single = 0.0375
depth: 5 - action: AppendOperation-generate_single = 0.025
depth: 6 - action: Stop-None = 1.0
Episode end
depth: 0 - action: AppendOperation-split = 0.0875
depth: 1 - action: AppendOperation-generate_single = 0.075
depth: 2 - action: AppendOperation-merge = 0.0625
depth: 3 - action: AppendOperation-split = 0.05
depth: 4 - action: AppendOperation-merge = 0.0375
depth: 5 - action: AppendOperation-generate_single = 0.025
depth: 6 - action: Stop-None = 1.0
Episode end
depth: 0 - action: AppendOperation-split = 0.0875
depth: 1 - action: AppendOperation-generate_single = 0.075
depth: 2 - action: AppendOperation-merge = 0.0625
depth: 3 - action: AppendOperation-split = 0.05
depth: 4 - action: AppendOperation-merge = 0.03

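The agent is able to solve the task: in each completed episode shown, it appends operations up to depth $6$ and then receives a reward of $1.0$ for the `Stop` action.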
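
The episode return can be read off the evaluation log. As a small sketch, the following sums the rewards of the first printed episode; the values are copied from the output above, and the code does not query the environment.

```python
# Rewards of the first printed episode, copied from the evaluation output.
step_rewards = [0.0875, 0.075, 0.0625, 0.05, 0.0375, 0.025]
stop_reward = 1.0

episode_return = sum(step_rewards) + stop_reward

# Differences between consecutive intermediate rewards.
decrements = [round(a - b, 4) for a, b in zip(step_rewards, step_rewards[1:])]

print(round(episode_return, 4))  # 1.3375
print(decrements)                # [0.0125, 0.0125, 0.0125, 0.0125, 0.0125]
```

In the printed episodes, the intermediate reward decreases by a constant $0.0125$ per depth level, so the final `Stop` reward accounts for most of the return.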