[![Binder](https://mybinder.org/badge_logo.svg)](https://lab.mlpack.org/v2/gh/mlpack/examples/master?urlpath=lab%2Ftree%2Fq_learning%2Fmountain_car_dqn.ipynb)

You can easily run this notebook at https://lab.mlpack.org/

In this notebook, we show how to use a simple DQN to train an agent to solve the [MountainCar](https://gym.openai.com/envs/MountainCar-v0) environment. 

The agent is trained and tested against the OpenAI Gym toolkit, which is accessed through a distributed infrastructure (TCP API). More details can be found [here](https://github.com/zoq/gym_tcp_api).

A video of the trained agent can be seen at the end.

## Including necessary libraries and namespaces

In [1]:
#include <mlpack/core.hpp>
// Needed for std::accumulate, used when averaging episode returns.
#include <numeric>

In [2]:
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/reinforcement_learning/q_learning.hpp>
#include <mlpack/methods/reinforcement_learning/q_networks/simple_dqn.hpp>
#include <mlpack/methods/reinforcement_learning/environment/env_type.hpp>
#include <mlpack/methods/reinforcement_learning/policy/greedy_policy.hpp>
#include <mlpack/methods/reinforcement_learning/training_config.hpp>

In [3]:
// Used to run the agent on gym's environment (provided externally) for testing.
#include <gym/environment.hpp>

In [4]:
// Used to generate and display a video of the trained agent.
#include "xwidgets/ximage.hpp"
#include "xwidgets/xvideo.hpp"
#include "xwidgets/xaudio.hpp"

In [5]:
using namespace mlpack;

In [6]:
using namespace mlpack::ann;

In [7]:
using namespace ens;

In [8]:
using namespace mlpack::rl;

## Initializing the agent

In [9]:
// Set up the state and action space.
DiscreteActionEnv::State::dimension = 2;
DiscreteActionEnv::Action::size = 3;

In [10]:
// Set up the network.
FFN<MeanSquaredError<>, GaussianInitialization> network(
    MeanSquaredError<>(), GaussianInitialization(0, 1));
network.Add<Linear<>>(DiscreteActionEnv::State::dimension, 128);
network.Add<ReLULayer<>>();
network.Add<Linear<>>(128, DiscreteActionEnv::Action::size);
// Wrap the network in a SimpleDQN model.
SimpleDQN<> model(network);
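
To make the architecture concrete, here is a minimal plain C++ sketch of the computation a Linear -> ReLU -> Linear network performs: it maps a state vector to one Q-value estimate per action. The helper names (`LinearForward`, `ReLU`, `QValues`) and the explicit weight matrices are hypothetical and are not part of mlpack, which stores and initializes the weights internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // One inner vector per output unit.

// Hypothetical helper: dense layer, out[i] = bias[i] + sum_j weights[i][j] * in[j].
Vector LinearForward(const Matrix& weights, const Vector& bias, const Vector& in)
{
  Vector out(bias);
  for (std::size_t i = 0; i < weights.size(); ++i)
    for (std::size_t j = 0; j < in.size(); ++j)
      out[i] += weights[i][j] * in[j];
  return out;
}

// Hypothetical helper: element-wise max(0, x).
Vector ReLU(Vector v)
{
  for (double& x : v)
    x = std::max(0.0, x);
  return v;
}

// Linear -> ReLU -> Linear, returning one Q-value estimate per action.
Vector QValues(const Matrix& w1, const Vector& b1,
               const Matrix& w2, const Vector& b2, const Vector& state)
{
  return LinearForward(w2, b2, ReLU(LinearForward(w1, b1, state)));
}
```

In the notebook's network the state has 2 entries, the hidden layer has 128 units, and the output has 3 entries, one for each action.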

In [11]:
// Set up the policy and the replay method.
GreedyPolicy<DiscreteActionEnv> policy(1.0, 1000, 0.1, 0.99);
RandomReplay<DiscreteActionEnv> replayMethod(32, 10000);
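
The idea behind an annealed epsilon-greedy policy can be sketched in plain C++ as follows. This is an illustration only, not mlpack's `GreedyPolicy`: the class `EpsilonGreedy` is hypothetical, it assumes that the first three constructor arguments above are the initial epsilon, the annealing interval, and the minimum epsilon, and it ignores the fourth argument entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical sketch of epsilon-greedy action selection with linear annealing.
class EpsilonGreedy
{
 public:
  EpsilonGreedy(double initialEpsilon, std::size_t annealInterval,
                double minEpsilon) :
      epsilon(initialEpsilon),
      minEpsilon(minEpsilon),
      delta((initialEpsilon - minEpsilon) / static_cast<double>(annealInterval))
  { }

  // With probability epsilon pick a uniformly random action (exploration);
  // otherwise pick the action with the highest Q-value (exploitation).
  std::size_t Sample(const std::vector<double>& qValues, std::mt19937& rng,
                     bool deterministic = false) const
  {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (!deterministic && coin(rng) < epsilon)
    {
      std::uniform_int_distribution<std::size_t> pick(0, qValues.size() - 1);
      return pick(rng);
    }
    return static_cast<std::size_t>(std::distance(qValues.begin(),
        std::max_element(qValues.begin(), qValues.end())));
  }

  // Lower epsilon by one step, never going below the minimum.
  void Anneal() { epsilon = std::max(minEpsilon, epsilon - delta); }

  double Epsilon() const { return epsilon; }

 private:
  double epsilon;
  double minEpsilon;
  double delta;
};
```

Under these assumptions, `EpsilonGreedy(1.0, 1000, 0.1)` starts out fully random and settles at 10% random actions after 1000 calls to `Anneal()`.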

In [12]:
// Set up training configurations.
TrainingConfig config;
config.TargetNetworkSyncInterval() = 100;
config.ExplorationSteps() = 400;
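
How these two settings interact with the step counter can be sketched as below. The warm-up rule mirrors the training loop later in this notebook, which skips training while the total step count is below `ExplorationSteps()`. The sync rule is an assumption based on the setting's name (copy the learning network into the target network on every `TargetNetworkSyncInterval()`-th step); the exact bookkeeping inside mlpack's `QLearning` may differ. `UpdateCounts` and `CountUpdates` are hypothetical names.

```cpp
#include <cstddef>

struct UpdateCounts
{
  std::size_t gradientUpdates;  // Steps on which the agent is trained.
  std::size_t targetSyncs;      // Steps on which the target network is refreshed.
};

// Hypothetical helper: walk through `totalSteps` environment steps and count
// how often training and target-network syncs would happen.
UpdateCounts CountUpdates(std::size_t totalSteps, std::size_t explorationSteps,
                          std::size_t syncInterval)
{
  UpdateCounts counts{0, 0};
  for (std::size_t step = 1; step <= totalSteps; ++step)
  {
    // Warm-up phase: only collect experience, do not train yet.
    if (step < explorationSteps)
      continue;
    ++counts.gradientUpdates;
    // Assumption: the target network is synced on every syncInterval-th step.
    if (step % syncInterval == 0)
      ++counts.targetSyncs;
  }
  return counts;
}
```

With the values used here, `CountUpdates(1000, 400, 100)` reports 601 training steps and 7 syncs under the stated assumptions.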

In [13]:
// Set up DQN agent.
QLearning<DiscreteActionEnv, decltype(model), AdamUpdate, decltype(policy), decltype(replayMethod)>
    agent(config, model, policy, replayMethod);

## Preparation for training the agent

In [14]:
// Set up the gym training environment.
gym::Environment env("gym.kurg.org", "4040", "MountainCar-v0");

// Initializing training variables.
std::vector<double> returnList;
size_t episodes = 0;
bool converged = true;

// The number of episode returns to keep track of.
size_t consecutiveEpisodes = 50;
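
The training function in this notebook keeps only the most recent `consecutiveEpisodes` returns and reports their average. A minimal standalone sketch of that sliding-window average is shown below. It uses a `std::deque` rather than the notebook's `std::vector` with `erase(begin())`, so dropping the oldest entry does not shift the remaining elements; the class name `ReturnWindow` is hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical helper: average over the most recent `capacity` episode returns.
class ReturnWindow
{
 public:
  explicit ReturnWindow(std::size_t capacity) : capacity(capacity) { }

  // Add a new episode return and drop the oldest one if the window is full.
  void Push(double episodeReturn)
  {
    returns.push_back(episodeReturn);
    if (returns.size() > capacity)
      returns.pop_front();
  }

  // Mean of the stored returns; 0 if nothing has been stored yet.
  double Average() const
  {
    if (returns.empty())
      return 0.0;
    return std::accumulate(returns.begin(), returns.end(), 0.0) /
        static_cast<double>(returns.size());
  }

 private:
  std::size_t capacity;
  std::deque<double> returns;
};
```

For a window this small (50 entries) the difference in cost is negligible, so the choice of container is mostly a matter of taste.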

An important point to note about the Mountain Car setup is that the environment returns a reward of `-1` for every step in which the car has not reached the goal located at position `0.5`. Since the reward stays the same until the episode is completed, it is difficult for our algorithm to improve until the agent reaches the top of the hill by chance.

That is, unless we modify the reward by giving an additional `0.5` for every step in which the agent manages to drive the car far enough in the backward direction (i.e. position < `-0.8`). Moving backward is important for gaining the momentum needed to climb the hill.

This minor tweak can greatly increase sample efficiency.
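
The adjusted reward described above can be written as a small standalone function. This is a sketch only: the name `AdjustedReward` and its default parameters are hypothetical, chosen to match the threshold (`-0.8`) and bonus (`0.5`) mentioned in the text.

```cpp
// Hypothetical helper: add a bonus to the environment's reward whenever the
// car is further back than `threshold` (strict comparison).
double AdjustedReward(double envReward, double position,
                      double threshold = -0.8, double bonus = 0.5)
{
  return position < threshold ? envReward + bonus : envReward;
}
```

For example, a step that ends at position `-0.9` turns the environment's reward of `-1` into `-0.5`, while a step that ends at `-0.5` leaves it at `-1`.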

In [15]:
// Function to train the agent on the MountainCar-v0 gym environment.
void train(const size_t numSteps)
{
  agent.Deterministic() = false;
  std::cout << "Training for " << numSteps << " steps." << std::endl;
  while (agent.TotalSteps() < numSteps)
  {
    double episodeReturn = 0;
    double adjustedEpisodeReturn = 0;
    env.reset();
    do
    {
      agent.State().Data() = env.observation;
      agent.SelectAction();
      arma::mat action = {double(agent.Action().action)};

      env.step(action);
      DiscreteActionEnv::State nextState;
      nextState.Data() = env.observation;
      
      // Add a bonus to the reward when the car is far enough backward.
      double adjustedReward = env.reward;
      if (nextState.Data()[0] < -0.8)
        adjustedReward += 0.5;

      replayMethod.Store(agent.State(), agent.Action(), adjustedReward, nextState,
          env.done, 0.99);
      episodeReturn += env.reward;
      adjustedEpisodeReturn += adjustedReward;
      agent.TotalSteps()++;
      if (agent.Deterministic() || agent.TotalSteps() < config.ExplorationSteps())
        continue;
      agent.TrainAgent();
    } while (!env.done);
    returnList.push_back(episodeReturn);
    episodes += 1;

    if (returnList.size() > consecutiveEpisodes)
      returnList.erase(returnList.begin());
        
    double averageReturn = std::accumulate(returnList.begin(),
                                           returnList.end(), 0.0) /
                           returnList.size();
    if (episodes % 5 == 0)
    {
      std::cout << "Avg return in last " << consecutiveEpisodes
          << " episodes: " << averageReturn
          << "\t Episode return: " << episodeReturn
          << "\t Adjusted return: " << adjustedEpisodeReturn
          << "\t Total steps: " << agent.TotalSteps() << std::endl;
    }
  }
}

Note here that `Episode return:` is the actual (environment's) return, whereas `Adjusted return:` is the return calculated from the adjusted reward function we described earlier.
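
The relationship between the two values can be checked with a short standalone sketch. It assumes a constant environment reward of `-1` per step, as described earlier, and the struct and function names are hypothetical. For a 200-step episode in which 19 steps end below position `-0.8`, the actual return is `-200` and the adjusted return is `-200 + 19 * 0.5 = -190.5`.

```cpp
#include <vector>

struct EpisodeReturns
{
  double actual;    // Sum of the environment's rewards.
  double adjusted;  // Sum of the adjusted rewards used for training.
};

// Hypothetical helper: positions[i] is the car position after step i.
EpisodeReturns ComputeReturns(const std::vector<double>& positions)
{
  EpisodeReturns returns{0.0, 0.0};
  for (const double position : positions)
  {
    const double reward = -1.0;  // Assumed per-step environment reward.
    returns.actual += reward;
    // Same bonus rule as in the training function: +0.5 below -0.8.
    returns.adjusted += reward + (position < -0.8 ? 0.5 : 0.0);
  }
  return returns;
}
```

The gap between the two returns, divided by `0.5`, therefore tells you how many steps of an episode the car spent far back on the left slope.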

## Let the training begin

In [16]:
// Training the agent for a total of at least 75 episodes.
train(200*75)

Training for 15000 steps.
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -200	 Total steps: 1000
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -200	 Total steps: 2000
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -200	 Total steps: 3000
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -190.5	 Total steps: 4000
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -200	 Total steps: 5000
Avg return in last 50 episodes: -200	 Episode return: -200	 Adjusted return: -181.5	 Total steps: 6000
Avg return in last 50 episodes: -198.486	 Episode return: -200	 Adjusted return: -190.5	 Total steps: 6947
Avg return in last 50 episodes: -198.2	 Episode return: -200	 Adjusted return: -200	 Total steps: 7928
Avg return in last 50 episodes: -198.4	 Episode return: -200	 Adjusted return: -200	 Total steps: 8928
Avg return in last 50 episodes: -197.3	 Episode ret

## Testing the trained agent

In [17]:
agent.Deterministic() = true;

// Creating and setting up the gym environment for testing.
gym::Environment envTest("gym.kurg.org", "4040", "MountainCar-v0");
envTest.monitor.start("./dummy/", true, true);

// Resets the environment.
envTest.reset();
envTest.render();

double totalReward = 0;
size_t totalSteps = 0;

// Testing the agent on gym's environment.
while (1)
{
  // State from the environment is passed to the agent's internal representation.
  agent.State().Data() = envTest.observation;

  // With the given state, the agent selects an action according to its defined policy.
  agent.SelectAction();

  // Action to take, decided by the policy.
  arma::mat action = {double(agent.Action().action)};

  envTest.step(action);
  totalReward += envTest.reward;
  totalSteps += 1;

  if (envTest.done)
  {
    std::cout << " Total steps: " << totalSteps << "\t Total reward: "
        << totalReward << std::endl;
    break;
  }

  // Uncomment the following lines to see the reward and action in each step.
  // std::cout << " Current step: " << totalSteps << "\t current reward: "
  //   << envTest.reward << "\t Action taken: " << action;
}

envTest.close();
std::string url = envTest.url();

auto video = xw::video_from_url(url).finalize();
video

 Total steps: 162	 Total reward: -162


A Jupyter widget

## A little more training...

In [18]:
// Training the same agent for a total of at least 300 episodes.
train(200*300)

Training for 60000 steps.
Avg return in last 50 episodes: -188.62	 Episode return: -200	 Adjusted return: -184.5	 Total steps: 15431
Avg return in last 50 episodes: -186.08	 Episode return: -142	 Adjusted return: -127.5	 Total steps: 16251
Avg return in last 50 episodes: -183.98	 Episode return: -159	 Adjusted return: -144.5	 Total steps: 17127
Avg return in last 50 episodes: -182.22	 Episode return: -173	 Adjusted return: -158.5	 Total steps: 18039
Avg return in last 50 episodes: -180.28	 Episode return: -158	 Adjusted return: -142.5	 Total steps: 18879
Avg return in last 50 episodes: -178.6	 Episode return: -162	 Adjusted return: -148	 Total steps: 19685
Avg return in last 50 episodes: -177.32	 Episode return: -196	 Adjusted return: -177.5	 Total steps: 20572
Avg return in last 50 episodes: -173.96	 Episode return: -157	 Adjusted return: -142	 Total steps: 21323
Avg return in last 50 episodes: -172.9	 Episode return: -177	 Adjusted return: -161.5	 Total steps: 22169
Avg return in las

## Final agent testing!

In [19]:
agent.Deterministic() = true;

// Creating and setting up the gym environment for testing.
gym::Environment envTest("gym.kurg.org", "4040", "MountainCar-v0");
envTest.monitor.start("./dummy/", true, true);

// Resets the environment.
envTest.reset();
envTest.render();

double totalReward = 0;
size_t totalSteps = 0;

// Testing the agent on gym's environment.
while (1)
{
  // State from the environment is passed to the agent's internal representation.
  agent.State().Data() = envTest.observation;

  // With the given state, the agent selects an action according to its defined policy.
  agent.SelectAction();

  // Action to take, decided by the policy.
  arma::mat action = {double(agent.Action().action)};

  envTest.step(action);
  totalReward += envTest.reward;
  totalSteps += 1;

  if (envTest.done)
  {
    std::cout << " Total steps: " << totalSteps << "\t Total reward: "
        << totalReward << std::endl;
    break;
  }

  // Uncomment the following lines to see the reward and action in each step.
  // std::cout << " Current step: " << totalSteps << "\t current reward: "
  //   << envTest.reward << "\t Action taken: " << action;
}

envTest.close();
std::string url = envTest.url();
std::cout << url;
auto video = xw::video_from_url(url).finalize();
video

 Total steps: 104	 Total reward: -104
https://gym.kurg.org/b6827842a41c4/output.webm

A Jupyter widget