[![Binder](https://mybinder.org/badge_logo.svg)](https://lab.mlpack.org/v2/gh/mlpack/examples/master?urlpath=lab%2Ftree%2Fq_learning%2Fpendulum_dqn.ipynb)

You can easily run this notebook at https://lab.mlpack.org/

Here, we train a [Simple DQN](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) agent to achieve high scores in the [Pendulum](https://gym.openai.com/envs/Pendulum-v0/) environment.

The agent trains and tests against the OpenAI Gym toolkit's GUI, which is served through a distributed infrastructure (a TCP API). More details can be found in the [gym_tcp_api repository](https://github.com/zoq/gym_tcp_api).

A video of the trained agent can be seen at the end.

## Including necessary libraries and namespaces

In [1]:
#include <mlpack/core.hpp>
// For std::accumulate, used in train() below.
#include <numeric>

In [2]:
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/reinforcement_learning/q_learning.hpp>
#include <mlpack/methods/reinforcement_learning/q_networks/simple_dqn.hpp>
#include <mlpack/methods/reinforcement_learning/environment/env_type.hpp>
#include <mlpack/methods/reinforcement_learning/policy/greedy_policy.hpp>
#include <mlpack/methods/reinforcement_learning/training_config.hpp>

In [3]:
// Used to run the agent on gym's environment (provided externally) for testing.
#include <gym/environment.hpp>

In [4]:
// Used to generate and display a video of the trained agent.
#include "xwidgets/ximage.hpp"
#include "xwidgets/xvideo.hpp"
#include "xwidgets/xaudio.hpp"

In [5]:
using namespace mlpack;

In [6]:
using namespace mlpack::ann;

In [7]:
using namespace ens;

In [8]:
using namespace mlpack::rl;

## Initializing the agent

In [9]:
// Set up the state and action space.
DiscreteActionEnv::State::dimension = 3;
DiscreteActionEnv::Action::size = 3;

In [10]:
// Set up the network.
FFN<MeanSquaredError<>, GaussianInitialization> network(
    MeanSquaredError<>(), GaussianInitialization(0, 1));
network.Add<Linear<>>(DiscreteActionEnv::State::dimension, 128);
network.Add<ReLULayer<>>();
network.Add<Linear<>>(128, DiscreteActionEnv::Action::size);
SimpleDQN<> model(network);

In [11]:
// Set up the policy and replay method.
GreedyPolicy<DiscreteActionEnv> policy(1.0, 1000, 0.1, 0.99);
RandomReplay<DiscreteActionEnv> replayMethod(32, 10000);
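
To make the exploration schedule more concrete, here is a minimal, standalone sketch of an epsilon-greedy policy with linear annealing. It assumes that the first three `GreedyPolicy` constructor arguments above are the initial epsilon, the anneal interval in steps, and the minimum epsilon; the fourth argument (`0.99`) is left out. The class `EpsilonGreedySketch` and its members are hypothetical names used only for illustration; this is not mlpack's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Illustrative epsilon-greedy policy with linear annealing.
// Hypothetical stand-in; NOT mlpack's GreedyPolicy implementation.
class EpsilonGreedySketch
{
 public:
  EpsilonGreedySketch(const double initialEpsilon,
                      const std::size_t annealInterval,
                      const double minEpsilon) :
      epsilon(initialEpsilon),
      minEpsilon(minEpsilon),
      delta((initialEpsilon - minEpsilon) / double(annealInterval))
  {
    assert(annealInterval > 0 && initialEpsilon >= minEpsilon);
  }

  // With probability epsilon, return a uniformly random action index;
  // otherwise return the index of the largest Q-value. In deterministic
  // mode the greedy action is always returned.
  template<typename RNG>
  std::size_t Sample(const std::vector<double>& qValues,
                     RNG& rng,
                     const bool deterministic = false) const
  {
    assert(!qValues.empty());
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (!deterministic && coin(rng) < epsilon)
    {
      std::uniform_int_distribution<std::size_t> pick(0, qValues.size() - 1);
      return pick(rng);
    }
    return std::size_t(std::distance(qValues.begin(),
        std::max_element(qValues.begin(), qValues.end())));
  }

  // Lower epsilon by one linear step, never going below minEpsilon.
  void Anneal() { epsilon = std::max(minEpsilon, epsilon - delta); }

  double Epsilon() const { return epsilon; }

 private:
  double epsilon;
  double minEpsilon;
  double delta;
};
```

Under these assumptions, `EpsilonGreedySketch policy(1.0, 1000, 0.1);` starts out fully random and, if `Anneal()` is called once per step, settles at 10% random actions after 1000 steps. Passing `deterministic = true` plays the same role as `agent.Deterministic() = true` in the testing cells.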

In [12]:
// Set up training configurations.
TrainingConfig config;
config.ExplorationSteps() = 100;

In [13]:
// Set up DQN agent.
QLearning<DiscreteActionEnv, decltype(model), AdamUpdate, decltype(policy)>
    agent(config, model, policy, replayMethod);

## Preparation for training the agent

In [14]:
// Set up the gym training environment.
gym::Environment env("gym.kurg.org", "4040", "Pendulum-v0");

// Initializing training variables.
std::vector<double> returnList;
size_t episodes = 0;
bool converged = true;

// The number of episode returns to keep track of.
size_t consecutiveEpisodes = 50;

Since the Pendulum environment has a continuous action space, we need to discretize the action space.

To do so, we let the Q-learning agent output 3 action values, one for each action in {0, 1, 2}. The action chosen by the agent is therefore always `0`, `1`, or `2`.

We then subtract `1.0` from the chosen action before passing it to the environment. This maps the agent's actions `0`, `1`, and `2` to the torque values `-1.0`, `0`, and `1.0`, respectively.

This simple trick allows us to train a DQN agent on an environment with a continuous action space.

Note that we have divided the action space into 3 divisions here, but you may use any number of divisions. More divisions give the agent finer control, which can improve results, although the network then has more action values to learn.
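
As a rough sketch of how the mapping could be generalized to any number of divisions, the helper below spreads `numActions` discrete actions evenly over the torque range `[-maxTorque, maxTorque]`. The function `ActionToTorque` and its parameters are hypothetical names introduced for illustration; they are not part of mlpack or the gym client. With `numActions = 3` and `maxTorque = 1.0` it reproduces the "subtract `1.0`" mapping described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of mlpack or the gym client): maps a discrete
// action index in [0, numActions - 1] to an evenly spaced torque value in
// [-maxTorque, maxTorque].
double ActionToTorque(const std::size_t action,
                      const std::size_t numActions,
                      const double maxTorque)
{
  // At least two actions are needed to span the range.
  assert(numActions >= 2 && action < numActions);

  // Distance between two neighbouring torque values.
  const double stepSize = 2.0 * maxTorque / double(numActions - 1);
  return -maxTorque + stepSize * double(action);
}
```

To use something like this in the notebook, `DiscreteActionEnv::Action::size` would have to be set to the chosen number of divisions, and the line that builds `action` could become `arma::mat action = {ActionToTorque(agent.Action().action, DiscreteActionEnv::Action::size, 1.0)};`.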

In [15]:
// Function to train the agent on the Pendulum gym environment.
void train(const size_t numSteps)
{
  agent.Deterministic() = false;
  std::cout << "Training for " << numSteps << " steps." << std::endl;
  while (agent.TotalSteps() < numSteps)
  {
    double episodeReturn = 0;
    env.reset();
    do
    {
      agent.State().Data() = env.observation;
      agent.SelectAction();
      arma::mat action = {double(agent.Action().action) - 1.0};

      env.step(action);
      DiscreteActionEnv::State nextState;
      nextState.Data() = env.observation;

      replayMethod.Store(agent.State(), agent.Action(), env.reward, nextState,
          env.done, 0.99);
      episodeReturn += env.reward;
      agent.TotalSteps()++;
      if (agent.Deterministic() || agent.TotalSteps() < config.ExplorationSteps())
        continue;
      agent.TrainAgent();
    } while (!env.done);
    returnList.push_back(episodeReturn);
    episodes += 1;

    if (returnList.size() > consecutiveEpisodes)
      returnList.erase(returnList.begin());

    double averageReturn = std::accumulate(returnList.begin(),
                                           returnList.end(), 0.0) /
                           returnList.size();
    if (episodes % 4 == 0)
    {
      std::cout << "Avg return in last " << returnList.size()
          << " episodes: " << averageReturn
          << "\t Episode return: " << episodeReturn
          << "\t Total steps: " << agent.TotalSteps() << std::endl;
    }
  }
}
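
The running average that `train()` reports can also be isolated into a small helper. The sketch below is one possible way to do this, using a `std::deque` so that dropping the oldest return does not shift the remaining elements. The class `ReturnWindow` is a hypothetical name introduced here and is not used elsewhere in the notebook.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical helper: keeps the most recent `capacity` episode returns and
// reports their mean, mirroring what train() does with returnList.
class ReturnWindow
{
 public:
  explicit ReturnWindow(const std::size_t capacity) : capacity(capacity)
  {
    assert(capacity > 0);
  }

  // Store a new episode return, dropping the oldest one if the window is full.
  void Push(const double episodeReturn)
  {
    returns.push_back(episodeReturn);
    if (returns.size() > capacity)
      returns.pop_front();
  }

  // Mean of the stored returns (0 if nothing has been stored yet).
  double Average() const
  {
    if (returns.empty())
      return 0.0;
    return std::accumulate(returns.begin(), returns.end(), 0.0) /
        double(returns.size());
  }

  std::size_t Size() const { return returns.size(); }

 private:
  std::size_t capacity;
  std::deque<double> returns;
};
```

With `ReturnWindow window(consecutiveEpisodes);`, each episode would call `window.Push(episodeReturn)` and print `window.Average()` in place of the manual `erase` and `std::accumulate` calls.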

## Let the training begin

In [16]:
// Training the agent for a total of at least 5000 steps.
train(5000)

Training for 5000 steps.
Avg return in last 4 episodes: -1296.68	 Episode return: -1034.58	 Total steps: 800
Avg return in last 8 episodes: -1201.89	 Episode return: -1023.34	 Total steps: 1600
Avg return in last 12 episodes: -1137.23	 Episode return: -1229.57	 Total steps: 2400
Avg return in last 16 episodes: -1089.89	 Episode return: -795.063	 Total steps: 3200
Avg return in last 20 episodes: -1045.96	 Episode return: -740.987	 Total steps: 4000
Avg return in last 24 episodes: -1018.07	 Episode return: -985.838	 Total steps: 4800


## Testing the trained agent

In [17]:
agent.Deterministic() = true;

// Creating and setting up the gym environment for testing.
gym::Environment envTest("gym.kurg.org", "4040", "Pendulum-v0");
envTest.monitor.start("./dummy/", true, true);

// Resets the environment.
envTest.reset();
envTest.render();

double totalReward = 0;
size_t totalSteps = 0;

// Testing the agent on gym's environment.
while (1)
{
  // State from the environment is passed to the agent's internal representation.
  agent.State().Data() = envTest.observation;

  // With the given state, the agent selects an action according to its defined policy.
  agent.SelectAction();

  // Action to take, decided by the policy.
  arma::mat action = {double(agent.Action().action) - 1.0};

  envTest.step(action);
  totalReward += envTest.reward;
  totalSteps += 1;

  if (envTest.done)
  {
    std::cout << " Total steps: " << totalSteps << "\t Total reward: "
        << totalReward << std::endl;
    break;
  }

  // Uncomment the following lines to see the reward and action in each step.
  // std::cout << " Current step: " << totalSteps << "\t current reward: "
  //   << envTest.reward << "\t Action taken: " << action;
}

envTest.close();
std::string url = envTest.url();

auto video = xw::video_from_url(url).finalize();
video

 Total steps: 200	 Total reward: -633.678


A Jupyter widget

## A little more training...

In [18]:
// Training the same agent for a total of at least 20000 steps.
train(20000)

Training for 20000 steps.
Avg return in last 28 episodes: -1014.36	 Episode return: -1070.43	 Total steps: 5600
Avg return in last 32 episodes: -1013.11	 Episode return: -818.925	 Total steps: 6400
Avg return in last 36 episodes: -991.138	 Episode return: -1008.86	 Total steps: 7200
Avg return in last 40 episodes: -970.334	 Episode return: -745.247	 Total steps: 8000
Avg return in last 44 episodes: -953.705	 Episode return: -757.426	 Total steps: 8800
Avg return in last 48 episodes: -967.958	 Episode return: -1106.08	 Total steps: 9600
Avg return in last 50 episodes: -945.132	 Episode return: -1155.99	 Total steps: 10400
Avg return in last 50 episodes: -921.76	 Episode return: -1022.33	 Total steps: 11200
Avg return in last 50 episodes: -911.658	 Episode return: -928.998	 Total steps: 12000
Avg return in last 50 episodes: -904.935	 Episode return: -881.218	 Total steps: 12800
Avg return in last 50 episodes: -902.272	 Episode return: -967.976	 Total steps: 13600
Avg return in last 50 ep

## Final agent testing!

In [19]:
agent.Deterministic() = true;

// Creating and setting up the gym environment for testing.
gym::Environment envTest("gym.kurg.org", "4040", "Pendulum-v0");
envTest.monitor.start("./dummy/", true, true);

// Resets the environment.
envTest.reset();
envTest.render();

double totalReward = 0;
size_t totalSteps = 0;

// Testing the agent on gym's environment.
while (1)
{
  // State from the environment is passed to the agent's internal representation.
  agent.State().Data() = envTest.observation;

  // With the given state, the agent selects an action according to its defined policy.
  agent.SelectAction();

  // Action to take, decided by the policy.
  arma::mat action = {double(agent.Action().action) - 1.0};

  envTest.step(action);
  totalReward += envTest.reward;
  totalSteps += 1;

  if (envTest.done)
  {
    std::cout << " Total steps: " << totalSteps << "\t Total reward: "
        << totalReward << std::endl;
    break;
  }

  // Uncomment the following lines to see the reward and action in each step.
  // std::cout << " Current step: " << totalSteps << "\t current reward: "
  //   << envTest.reward << "\t Action taken: " << action;
}

envTest.close();
std::string url = envTest.url();

auto video = xw::video_from_url(url).finalize();
video

 Total steps: 200	 Total reward: -119.403


A Jupyter widget