
Binary String Reinforcement Learning

This is a from-scratch implementation of deep Q-networks (DQN) inspired by OpenAI's paper Hindsight Experience Replay. In this task, an agent must perform bit-flipping actions to transform a given binary string into a given target string with only sparse rewards.

This repo replicates the results found in the paper and allows experimentation with virtual-goal parameters for improved performance.

Visualization

Premise

The environment consists of a binary string of 1s and 0s together with a binary goal string that the agent must transform the initial string into. For example, the optimal actions for a string of length 3 would look like this:

GOAL: [0 1 1]
STRING: [1 0 1] -> [0 0 1] -> [0 1 1]
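A minimal sketch of such an environment is shown below. The class and method names are illustrative, not necessarily those used in this repo; each action flips one bit, and the reward signal is discussed in the next section.

```python
import numpy as np

class BitFlipEnv:
    """Toy bit-flipping environment: each action flips one bit of the current string."""

    def __init__(self, n_bits: int = 15, seed: int | None = None):
        self.n_bits = n_bits
        self.rng = np.random.default_rng(seed)

    def reset(self) -> tuple[np.ndarray, np.ndarray]:
        # Sample a random starting string and a random goal string
        self.state = self.rng.integers(0, 2, size=self.n_bits)
        self.goal = self.rng.integers(0, 2, size=self.n_bits)
        return self.state.copy(), self.goal.copy()

    def step(self, action: int) -> tuple[np.ndarray, bool]:
        # Flip the bit at index `action`
        self.state[action] ^= 1
        done = bool(np.array_equal(self.state, self.goal))
        return self.state.copy(), done
```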

With some domain knowledge, an implementer could create a shaped reward function that rewards an agent for taking actions that move the string closer to its goal (such as a norm function). However, for many high-level tasks, the implementer does not know ahead of time the precise reward function capable of smoothly guiding an agent to the goal. In HER, the OpenAI researchers consider the case in which the rewards are extremely sparse: +1 reward if the strings match exactly, -1 reward otherwise.
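For illustration, the two reward styles contrasted above might look like this (a sketch, not code from this repo):

```python
import numpy as np

def shaped_reward(state: np.ndarray, goal: np.ndarray) -> float:
    # Dense, shaped reward: negative Hamming distance between string and goal
    return -float(np.sum(state != goal))

def sparse_reward(state: np.ndarray, goal: np.ndarray) -> float:
    # Sparse reward considered in HER: +1 only on an exact match, -1 otherwise
    return 1.0 if np.array_equal(state, goal) else -1.0
```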

A typical agent is not capable of solving strings longer than 15 bits, since it is statistically improbable that the agent would randomly encounter the goal often enough to learn from positive rewards. The agent would therefore only receive negative rewards and be incapable of learning how to reach the goal.

Hindsight Experience Replay

In the HER paper, the authors propose a method of artificially injecting positive-reward experiences into the replay buffer. For each action, in addition to the experience sample generated by the action, they store a copy in which the replay's goal string is replaced with the replay's next state, resulting in a replay with a guaranteed positive reward. These "virtual goals" allow the agent to learn from immediate positive rewards without relying on random sampling from the environment.
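A sketch of this goal-relabeling step, assuming transitions are stored as (state, action, next_state, goal, reward, done) tuples; the function name and tuple layout are assumptions, not this repo's API:

```python
import numpy as np

def store_with_virtual_goal(replay_buffer: list, state, action, next_state, goal):
    """Append the real transition plus a hindsight copy whose goal is the achieved state."""
    # Real transition, scored with the sparse reward
    reached = bool(np.array_equal(next_state, goal))
    replay_buffer.append((state, action, next_state, goal, 1.0 if reached else -1.0, reached))

    # Virtual transition: relabel the goal as the state that was actually reached,
    # which guarantees this replay carries a +1 reward
    virtual_goal = next_state.copy()
    replay_buffer.append((state, action, next_state, virtual_goal, 1.0, True))
```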

Results

My results replicate the authors' findings for strings of up to 30 bits. The authors did not include many hyperparameter details in the paper, so longer strings may be solvable with different settings, such as a larger replay buffer and more training episodes.
