[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/2_train_dagger.ipynb)
# Train an Agent using the DAgger Algorithm

The DAgger algorithm is an extension of behavior cloning. 
In behavior cloning, the training trajectories are recorded directly from an expert.
In DAgger, the learner generates the trajectories, and an expert relabels every visited state with the action it would have taken there.
This keeps the state distribution of the training data close to that of the learner's current policy.
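
The loop described above can be sketched on a toy problem without any library support. This is only an illustration under stated assumptions, not how `imitation` implements DAgger: the 1-D chain environment, the rule-based `expert_action`, the majority-vote "learner", and the schedule that lets the expert drive only in the first round are all hypothetical stand-ins.

```python
import numpy as np

N_STATES = 7
GOAL = 3  # hypothetical toy task: reach the middle cell of a 1-D chain


def expert_action(state: int) -> int:
    """Toy expert: step right (1) below the goal, otherwise step left (0)."""
    return 1 if state < GOAL else 0


def step(state: int, action: int) -> int:
    """Move one cell left or right, clipped to the ends of the chain."""
    return int(np.clip(state + (1 if action == 1 else -1), 0, N_STATES - 1))


def fit_tabular_policy(dataset):
    """Tabular stand-in for behavior cloning: majority vote over labels per state."""
    votes = np.zeros((N_STATES, 2), dtype=int)
    for state, action in dataset:
        votes[state, action] += 1
    return votes.argmax(axis=1)  # states never visited default to action 0


def run_dagger(n_rounds=5, episodes_per_round=10, horizon=10, seed=0):
    rng = np.random.default_rng(seed)
    dataset = []  # aggregated (state, expert label) pairs across all rounds
    policy = np.zeros(N_STATES, dtype=int)  # untrained learner: always go left
    for round_idx in range(n_rounds):
        # Assumed schedule: the expert acts in round 0, the learner afterwards.
        beta = 1.0 if round_idx == 0 else 0.0
        for _ in range(episodes_per_round):
            state = int(rng.integers(N_STATES))
            for _ in range(horizon):
                label = expert_action(state)    # expert labels the visited state
                dataset.append((state, label))  # aggregate, never discard
                action = label if rng.random() < beta else int(policy[state])
                state = step(state, action)     # the chosen action drives the rollout
        policy = fit_tabular_policy(dataset)    # retrain on all data so far
    return policy, dataset


policy, dataset = run_dagger()
print(policy)
```

The key point is that the action executed in the environment comes from the learner (after the first round), while the label stored in the dataset always comes from the expert.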

First, we need an expert to learn from. For convenience, we download one from the Hugging Face model hub.

In [2]:
import gymnasium as gym
import numpy as np
from imitation.policies.serialize import load_policy
from imitation.util.util import make_vec_env

env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=np.random.default_rng(),
    n_envs=1,
)
expert = load_policy(
    "ppo-huggingface",
    organization="HumanCompatibleAI",
    env_name="seals/CartPole-v0",
    venv=env,
)

Exception: code() argument 13 must be str, not int
Exception: code() argument 13 must be str, not int
Exception: code() argument 13 must be str, not int


In [3]:
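class PolicyWrapper:
    """Follows the original policy with probability `acc`, else acts randomly."""

    def __init__(self, original_policy, env, acc):
        self.original_policy = original_policy
        self.env = env
        self.acc = acc

    def predict(self, *args, **kwargs):
        # Query the expert first so the (actions, states) return format is kept.
        actions, states = self.original_policy.predict(*args, **kwargs)
        if np.random.rand() < self.acc:
            return actions, states
        # Otherwise replace the whole batch with random actions
        # (assumes batched observations, as produced by a vectorized env).
        random_actions = np.array(
            [self.env.action_space.sample() for _ in range(len(actions))]
        )
        return random_actions, states

    def __getattr__(self, name):
        # Delegate everything else (e.g. observation_space) to the wrapped policy.
        return getattr(self.original_policy, name)


wrapped_policy = PolicyWrapper(expert, env, 0.1)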
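
Note that the wrapper draws a single random number per `predict` call, so with a vectorized environment all sub-environments are corrupted together. As a rough sketch of a per-environment alternative for discrete action spaces, the helper below decides independently for each action whether to keep it. `corrupt_actions` and its parameters are hypothetical names introduced here for illustration; they are not part of `imitation` or `stable_baselines3`.

```python
import numpy as np


def corrupt_actions(actions, accuracy, n_actions, rng):
    """Keep each action with probability `accuracy`; otherwise sample uniformly."""
    actions = np.asarray(actions)
    keep = rng.random(actions.shape) < accuracy  # one independent draw per action
    random_actions = rng.integers(n_actions, size=actions.shape)
    return np.where(keep, actions, random_actions)


rng = np.random.default_rng(0)
expert_actions = np.array([0, 1, 1, 0])
print(corrupt_actions(expert_actions, accuracy=0.1, n_actions=2, rng=rng))
```

A uniformly sampled replacement can coincide with the expert's action, so the effective agreement rate with the expert is somewhat higher than `accuracy`.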

Then we can construct a DAgger trainer and use it to train the policy on the CartPole environment.

In [8]:
import tempfile

from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    rng=np.random.default_rng(),
)

with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
    print(tmpdir)
    dagger_trainer = SimpleDAggerTrainer(
        venv=env,
        scratch_dir=tmpdir,
        expert_policy=wrapped_policy,
        bc_trainer=bc_trainer,
        rng=np.random.default_rng(),
    )

    dagger_trainer.train(2000)

C:\Users\kobbi\AppData\Local\Temp\dagger_example_ohfh7mon


TypeError: Policy must be None, a stable-baselines policy or algorithm, or a Callable, got <class '__main__.PolicyWrapper'> instead

Finally, the evaluation shows that we actually trained a policy that solves the environment (500 is the maximum reward).

In [3]:
from stable_baselines3.common.evaluation import evaluate_policy

# evaluate_policy returns (mean reward, standard deviation) over the episodes.
reward, _ = evaluate_policy(dagger_trainer.policy, env, n_eval_episodes=20)
print(reward)

500.0
