# Preference-based Reinforcement Learning
## Configuring an agent
We first define an agent using our PbRL framework. An agent needs functionality to:
- generate query candidates
- select queries
- present queries and collect preferences
- train the reward model

For this, we mix concrete implementations into an abstract agent shell via multiple
inheritance and specify the constructor:

In [9]:
from agent.preference_based.sequential.sequential_pbrl_agent import AbstractSequentialPbRLAgent

from preference_data.query_generation.segment.segment_query_generator import RandomSegmentQueryGenerator
from preference_data.query_selection.query_selector import IndexQuerySelector
from preference_data.querent.synchronous.oracle.oracle import RewardMaximizingOracle
from reward_modeling.reward_trainer import RewardTrainer
from wrappers.utils import create_env

class SequentialPbRLAgent(AbstractSequentialPbRLAgent, RandomSegmentQueryGenerator, IndexQuerySelector,
                          RewardMaximizingOracle, RewardTrainer):
    """Agent shell combined with one concrete component per required capability."""

    def __init__(self, env, num_pretraining_epochs=10, num_training_epochs_per_iteration=10,
                 preferences_per_iteration=500):
        # Initialize the shell first: the components below use self.policy_model
        # and self.reward_model.
        AbstractSequentialPbRLAgent.__init__(self, env,
                                             num_pretraining_epochs=num_pretraining_epochs,
                                             num_training_epochs_per_iteration=num_training_epochs_per_iteration,
                                             preferences_per_iteration=preferences_per_iteration)
        # Query candidates are segments generated with the agent's policy model.
        RandomSegmentQueryGenerator.__init__(self, self.policy_model, segment_sampling_interval=50)
        # The trainer fits the agent's reward model.
        RewardTrainer.__init__(self, self.reward_model)
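
The mixin pattern used above can be illustrated in isolation. The following is a minimal sketch, not the framework's implementation: `AgentShell`, `RandomGeneratorMixin`, `IndexSelectorMixin`, and `ToyAgent` are hypothetical stand-ins, and the selection rule (take the first candidates) is an assumption made only for this example. It shows how a shell class can call methods that the mixed-in classes provide, and why the parent constructors are called explicitly and in order.

```python
class AgentShell:
    """Hypothetical stand-in for the abstract agent shell."""

    def __init__(self, env):
        self.env = env
        self.policy_model = f"policy({env})"

    def run_iteration(self):
        # The shell only orchestrates; the mixins supply these two methods.
        candidates = self.generate_candidates()
        return self.select_queries(candidates, num_queries=2)


class RandomGeneratorMixin:
    """Hypothetical query generator that needs constructor arguments."""

    def __init__(self, policy_model, interval):
        self.generator_source = policy_model
        self.interval = interval

    def generate_candidates(self):
        return [f"segment_{i}" for i in range(0, 4 * self.interval, self.interval)]


class IndexSelectorMixin:
    """Hypothetical selector (assumption: it keeps the first candidates)."""

    def select_queries(self, candidates, num_queries):
        return candidates[:num_queries]


class ToyAgent(AgentShell, RandomGeneratorMixin, IndexSelectorMixin):
    def __init__(self, env):
        # Explicit parent calls: the shell must run first because the
        # generator is configured with self.policy_model.
        AgentShell.__init__(self, env)
        RandomGeneratorMixin.__init__(self, self.policy_model, interval=50)


agent = ToyAgent("toy-env")
selected = agent.run_iteration()
print(selected)  # ['segment_0', 'segment_50']
```

Swapping a component then only requires changing the list of base classes, which is the main appeal of this design.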

## Training the agent
We are now ready to train the agent. The agent will:
 - conduct reinforcement learning in the given environment,
 - thereby generate trajectories,
 - combine segments of these trajectories into queries and pose them to an oracle,
 - use the oracle's preferences to train the reward model.

The whole process ends once the agent has run 200,000 timesteps in the environment.
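
To make the query and oracle steps concrete, here is a minimal sketch under stated assumptions. It assumes that a query is a pair of fixed-length segments and that a reward-maximizing oracle prefers the segment with the higher summed environment reward. The helper names `sample_segment` and `oracle_preference` and the reward values are hypothetical and are not part of the framework.

```python
import random


def sample_segment(rewards, segment_length, rng):
    """Cut a random fixed-length window out of one trajectory's reward list."""
    start = rng.randrange(len(rewards) - segment_length + 1)
    return rewards[start:start + segment_length]


def oracle_preference(segment_a, segment_b):
    """Return 0 if segment_a has the higher summed reward, 1 if segment_b does, 0.5 on a tie."""
    return_a, return_b = sum(segment_a), sum(segment_b)
    if return_a == return_b:
        return 0.5
    return 0 if return_a > return_b else 1


rng = random.Random(0)

# Hypothetical per-step rewards of two trajectories.
trajectory_1 = [-1.0] * 20
trajectory_2 = [-1.0] * 10 + [0.0] * 10

# A query pairs two segments; the oracle's answer becomes a training label.
query = (sample_segment(trajectory_1, 5, rng), sample_segment(trajectory_2[10:], 5, rng))
label = oracle_preference(*query)
preferences = [(query, label)]
print(label)  # 1
```

The reward model would then be trained on `preferences`; how that training works depends on the trainer and is not shown here.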

In [None]:
env = create_env("MountainCar-v0", termination_penalty=10.)

agent = SequentialPbRLAgent(env=env, num_pretraining_epochs=8,
                            num_training_epochs_per_iteration=16,
                            preferences_per_iteration=32)

agent.learn_reward_model(num_training_timesteps=200000, num_pretraining_preferences=512)

env.close()



Start reward model pretraining
Pretraining: Start data collection
Pretraining: Start reward model training
Start reward model training
Training: Start new training iteration. 0/200000 (0.00%) RL training steps completed.
Training: Start new training iteration. 1950/200000 (0.97%) RL training steps completed.
Training: Start new training iteration. 3900/200000 (1.95%) RL training steps completed.
Training: Start new training iteration. 5850/200000 (2.93%) RL training steps completed.
Training: Start new training iteration. 7800/200000 (3.90%) RL training steps completed.
Training: Start new training iteration. 9750/200000 (4.88%) RL training steps completed.
Training: Start new training iteration. 11700/200000 (5.85%) RL training steps completed.
Training: Start new training iteration. 13650/200000 (6.83%) RL training steps completed.
Training: Start new training iteration. 15600/200000 (7.80%) RL training steps completed.
Training: Start new training iteration. 17550/200000 (8.77%) RL 
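
The log shows a new training iteration roughly every 1,950 RL steps. As a rough back-of-the-envelope sketch, the snippet below estimates the number of iterations and the total number of preferences for this run. It assumes that the 1,950-step spacing stays constant and that every iteration requests exactly `preferences_per_iteration` preferences; neither is confirmed by the output above.

```python
import math

num_training_timesteps = 200_000    # passed to learn_reward_model
steps_per_iteration = 1_950         # spacing observed in the log (assumed constant)
preferences_per_iteration = 32      # passed to the agent constructor
num_pretraining_preferences = 512   # passed to learn_reward_model

# Iterations start at steps 0, 1950, 3900, ... below the timestep budget.
num_iterations = math.ceil(num_training_timesteps / steps_per_iteration)
total_preferences = num_pretraining_preferences + num_iterations * preferences_per_iteration

print(num_iterations)     # 103
print(total_preferences)  # 3808
```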

We can monitor the learning process via TensorBoard. Start TensorBoard in a terminal:

```
tensorboard --logdir=runs
```