[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/microsoft/TextWorld/blob/main/notebooks/Building%20a%20simple%20agent.ipynb)

# Building a simple agent with TextWorld

This tutorial outlines the steps to build an agent that learns how to play __choice-based__ text-based games generated with TextWorld.

### Prerequisite
Install TextWorld as described in the [README.md](https://github.com/microsoft/TextWorld#readme). Most of the time, a simple `pip install` should work.

In [None]:
!pip install textworld

You will also need [PyTorch](https://pytorch.org/) (tested with both v1.8.2 and v1.9.0).

In [None]:
!pip install torch torchvision torchaudio

**[Optional]** Download all the data beforehand. Otherwise, it will be generated as needed (slower).

In [None]:
!wget https://aka.ms/textworld/notebooks/data.zip
!unzip -nq data.zip && rm -f data.zip

## Learning challenges
Training an agent to play text-based games is not trivial. Among other challenges, we have to deal with

1. a combinatorial action space (that grows w.r.t. vocabulary)
2. a really sparse reward signal.
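
To get a feel for the first challenge, the sketch below counts how many word sequences a parser-based agent could emit. The vocabulary sizes and the maximum command length are arbitrary illustrative values, not TextWorld statistics, and `count_possible_commands` is a hypothetical helper.

```python
def count_possible_commands(vocab_size: int, max_length: int) -> int:
    """Count all word sequences of length 1..max_length over a vocabulary."""
    return sum(vocab_size ** length for length in range(1, max_length + 1))


# Illustrative numbers only: commands of up to 4 words.
for vocab_size in (10, 100, 1000):
    print(vocab_size, count_possible_commands(vocab_size, max_length=4))
```

Restricting the agent to a list of admissible commands, as this tutorial does, replaces that space with a short list of choices per game state.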

To ease the learning process, we will be requesting additional information alongside the game's narrative (as covered in [Playing TextWorld generated games with OpenAI Gym](Playing%20TextWorld%20generated%20games%20with%20OpenAI%20Gym.ipynb#Interact-with-the-game)). More specifically, we will request the following information:

- __Description__:
For every game state, we will get the output of the `look` command which describes the current location;

- __Inventory__:
For every game state, we will get the output of the `inventory` command which describes the player's inventory;

- __Admissible commands__:
For every game state, we will get the list of commands guaranteed to be understood by the game interpreter;

- __Intermediate reward__:
For every game state, we will get an intermediate reward which can either be:
  - __-1__: last action needs to be undone before resuming the quest
  -  __0__: last action didn't affect the quest
  -  __1__: last action brought us closer to completing the quest

- __Entities__:
For every game, we will get a list of entity names that the agent can interact with.
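
A minimal sketch of how this information could be requested is shown below. The flag names are fields of `textworld.EnvInfos`; the `REQUESTED_INFOS` dictionary and the `make_env_infos` helper are hypothetical names introduced for illustration. The agents later in this tutorial request only the subset they need.

```python
# One flag per kind of information listed above.
REQUESTED_INFOS = {
    "description": True,
    "inventory": True,
    "admissible_commands": True,
    "intermediate_reward": True,
    "entities": True,
}


def make_env_infos():
    """Build the request object (requires TextWorld to be installed)."""
    # Imported lazily so the dictionary above can be inspected without TextWorld.
    import textworld
    return textworld.EnvInfos(**REQUESTED_INFOS)
```

The resulting object would then be passed as `request_infos` when registering the games, and the requested values would appear in the `infos` dictionary returned by the environment.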


## Simple test games
We can use TextWorld to generate a few simple games with the following handcrafted world
```
                     Bathroom
                        +
                        |
                        +
    Bedroom +-(d1)-+ Kitchen +--(d2)--+ Backyard
      (P)               +                  +
                        |                  |
                        +                  +
                   Living Room           Garden
```
where the goal is always to retrieve a hidden food item and put it on the stove, which is located in the kitchen. The player loses the game by eating the food item instead of putting it on the stove!

Using `tw-make tw-simple ...`, we are going to generate the following 7 games:

| gamefile | description |
| :------- | :---------- |
| `games/tw-rewardsDense_goalDetailed.z8` | dense rewards + detailed instructions |
| `games/tw-rewardsBalanced_goalDetailed.z8` | balanced rewards + detailed instructions |
| `games/tw-rewardsSparse_goalDetailed.z8` | sparse rewards + detailed instructions |
| | |
| `games/tw-rewardsDense_goalBrief.z8` | dense rewards + no instructions but the goal is mentioned |
| `games/tw-rewardsBalanced_goalBrief.z8` | balanced rewards + no instructions but the goal is mentioned |
| `games/tw-rewardsSparse_goalBrief.z8` | sparse rewards + no instructions but the goal is mentioned |
| | |
| `games/tw-rewardsSparse_goalNone.z8` | sparse rewards + no instructions/goal<br>_Hint: there's a hidden note in the game that describes the goal!_ |

In [None]:
# You can skip this if you already downloaded the games in the prerequisite section.

# Same as !make_games.sh
!tw-make tw-simple --rewards dense    --goal detailed --seed 18 --test --silent -f --output games/tw-rewardsDense_goalDetailed.z8
!tw-make tw-simple --rewards balanced --goal detailed --seed 18 --test --silent -f --output games/tw-rewardsBalanced_goalDetailed.z8
!tw-make tw-simple --rewards sparse   --goal detailed --seed 18 --test --silent -f --output games/tw-rewardsSparse_goalDetailed.z8
!tw-make tw-simple --rewards dense    --goal brief    --seed 18 --test --silent -f --output games/tw-rewardsDense_goalBrief.z8
!tw-make tw-simple --rewards balanced --goal brief    --seed 18 --test --silent -f --output games/tw-rewardsBalanced_goalBrief.z8
!tw-make tw-simple --rewards sparse   --goal brief    --seed 18 --test --silent -f --output games/tw-rewardsSparse_goalBrief.z8
!tw-make tw-simple --rewards sparse   --goal none     --seed 18 --test --silent -f --output games/tw-rewardsSparse_goalNone.z8

## Building the random baseline
Let's start with building an agent that simply selects an admissible command at random.

In [1]:
from typing import Mapping, Any

import numpy as np

import textworld.gym


class RandomAgent(textworld.gym.Agent):
    """ Agent that randomly selects a command from the admissible ones. """
    def __init__(self, seed=1234):
        self.seed = seed
        self.rng = np.random.RandomState(self.seed)

    @property
    def infos_to_request(self) -> textworld.EnvInfos:
        return textworld.EnvInfos(admissible_commands=True)
    
    def act(self, obs: str, score: int, done: bool, infos: Mapping[str, Any]) -> str:
        return self.rng.choice(infos["admissible_commands"])


## Play function
Let's write a simple play function that we will use to evaluate our agent on a given game.

In [2]:
import os
from glob import glob

import gym
import textworld.gym

import torch


def play(agent, path, max_step=100, nb_episodes=10, verbose=True):
    torch.manual_seed(20211021)  # For reproducibility when using action sampling.

    infos_to_request = agent.infos_to_request
    infos_to_request.max_score = True  # Needed to normalize the scores.
    
    gamefiles = [path]
    if os.path.isdir(path):
        gamefiles = glob(os.path.join(path, "*.z8"))
        
    env_id = textworld.gym.register_games(gamefiles,
                                          request_infos=infos_to_request,
                                          max_episode_steps=max_step)
    env = gym.make(env_id)  # Create a Gym environment to play the text game.
    if verbose:
        if os.path.isdir(path):
            print(os.path.dirname(path), end="")
        else:
            print(os.path.basename(path), end="")
        
    # Collect some statistics: nb_steps, final reward.
    avg_moves, avg_scores, avg_norm_scores = [], [], []
    for no_episode in range(nb_episodes):
        obs, infos = env.reset()  # Start new episode.

        score = 0
        done = False
        nb_moves = 0
        while not done:
            command = agent.act(obs, score, done, infos)
            obs, score, done, infos = env.step(command)
            nb_moves += 1
        
        agent.act(obs, score, done, infos)  # Let the agent know the game is done.
                
        if verbose:
            print(".", end="")
        avg_moves.append(nb_moves)
        avg_scores.append(score)
        avg_norm_scores.append(score / infos["max_score"])

    env.close()
    if verbose:
        if os.path.isdir(path):
            msg = "  \tavg. steps: {:5.1f}; avg. normalized score: {:4.1f} / {}."
            print(msg.format(np.mean(avg_moves), np.mean(avg_norm_scores), 1))
        else:
            msg = "  \tavg. steps: {:5.1f}; avg. score: {:4.1f} / {}."
            print(msg.format(np.mean(avg_moves), np.mean(avg_scores), infos["max_score"]))
    

#### Evaluate the random agent

In [3]:
# We report the score and steps averaged over 10 playthroughs.
play(RandomAgent(), "./games/tw-rewardsDense_goalDetailed.z8")    # Dense rewards
play(RandomAgent(), "./games/tw-rewardsBalanced_goalDetailed.z8") # Balanced rewards
play(RandomAgent(), "./games/tw-rewardsSparse_goalDetailed.z8")   # Sparse rewards

tw-rewardsDense_goalDetailed.z8..........  	avg. steps: 100.0; avg. score:  4.2 / 10.
tw-rewardsBalanced_goalDetailed.z8..........  	avg. steps: 100.0; avg. score:  0.7 / 4.
tw-rewardsSparse_goalDetailed.z8..........  	avg. steps: 100.0; avg. score:  0.0 / 1.
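
Because the three games have different maximum scores, the raw averages are easier to compare once normalized. The short calculation below uses the numbers printed above; the dictionary layout is only for illustration.

```python
# (average score, maximum score) copied from the output above.
results = {
    "dense": (4.2, 10),
    "balanced": (0.7, 4),
    "sparse": (0.0, 1),
}

# Divide each average score by the maximum score of its game.
normalized = {name: score / max_score for name, (score, max_score) in results.items()}
for name, value in normalized.items():
    print(f"{name:>8}: {value:.3f}")
```

The sparser the reward, the less often a random policy stumbles on any signal at all, which is the second learning challenge mentioned earlier.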


## Neural agent

Now, let's create an agent that can learn to play text-based games. The agent will be trained to select a command from the list of admissible commands given the current game's narrative, inventory, and room description. Here is an overview of the architecture used for the agent: 

<div>
  <img src="https://raw.githubusercontent.com/MarcCote/TextWorld/msr_summit_2021/notebooks/figs/neural_agent.png" width="500"/>
</div>
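
The final step of the architecture, turning one score per admissible command into a choice, can be sketched without PyTorch. The scores below are made-up values standing in for the network's output, and `softmax` and `sample_command` are hypothetical helpers written for this illustration.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # Subtract the max for numerical stability.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_command(commands: Sequence[str], scores: Sequence[float],
                   rng: random.Random) -> str:
    """Sample one command with probability given by the softmax of its score."""
    probs = softmax(scores)
    return rng.choices(list(commands), weights=probs, k=1)[0]


commands = ["go east", "open door", "take apple"]
scores = [0.1, 0.1, 2.0]  # Made-up scores for illustration.
rng = random.Random(0)
print(softmax(scores))
print(sample_command(commands, scores, rng))
```

Sampling, rather than always taking the highest-scoring command, keeps some randomness in the agent's behavior, which is why the `play` function fixes the PyTorch seed for reproducibility.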



### Code
Here's the implementation of that learning agent built with [PyTorch](https://pytorch.org/).

In [4]:
import re
from typing import List, Mapping, Any, Optional
from collections import defaultdict

import numpy as np

import textworld
import textworld.gym
from textworld import EnvInfos

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class CommandScorer(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(CommandScorer, self).__init__()
        torch.manual_seed(42)  # For reproducibility
        self.embedding    = nn.Embedding(input_size, hidden_size)
        self.encoder_gru  = nn.GRU(hidden_size, hidden_size)
        self.cmd_encoder_gru  = nn.GRU(hidden_size, hidden_size)
        self.state_gru    = nn.GRU(hidden_size, hidden_size)
        self.hidden_size  = hidden_size
        self.state_hidden = torch.zeros(1, 1, hidden_size, device=device)
        self.critic       = nn.Linear(hidden_size, 1)
        self.att_cmd      = nn.Linear(hidden_size * 2, 1)

    def forward(self, obs, commands, **kwargs):
        input_length = obs.size(0)
        batch_size = obs.size(1)
        nb_cmds = commands.size(1)

        embedded = self.embedding(obs)
        encoder_output, encoder_hidden = self.encoder_gru(embedded)
        state_output, state_hidden = self.state_gru(encoder_hidden, self.state_hidden)
        self.state_hidden = state_hidden
        value = self.critic(state_output)

        # Attention network over the commands.
        cmds_embedding = self.embedding.forward(commands)
        _, cmds_encoding_last_states = self.cmd_encoder_gru.forward(cmds_embedding)  # 1 x cmds x hidden

        # Same observed state for all commands.
        cmd_selector_input = torch.stack([state_hidden] * nb_cmds, 2)  # 1 x batch x cmds x hidden

        # Same command choices for the whole batch.
        cmds_encoding_last_states = torch.stack([cmds_encoding_last_states] * batch_size, 1)  # 1 x batch x cmds x hidden

        # Concatenate the observed state and command encodings.
        cmd_selector_input = torch.cat([cmd_selector_input, cmds_encoding_last_states], dim=-1)

        # Compute one score per command.
        scores = F.relu(self.att_cmd(cmd_selector_input)).squeeze(-1)  # 1 x Batch x cmds

        probs = F.softmax(scores, dim=2)  # 1 x Batch x cmds
        index = probs[0].multinomial(num_samples=1).unsqueeze(0) # 1 x batch x indx
        return scores, index, value

    def reset_hidden(self, batch_size):
        self.state_hidden = torch.zeros(1, batch_size, self.hidden_size, device=device)


class NeuralAgent:
    """ Simple Neural Agent for playing TextWorld games. """
    MAX_VOCAB_SIZE = 1000
    UPDATE_FREQUENCY = 10
    LOG_FREQUENCY = 1000
    GAMMA = 0.9
    
    def __init__(self) -> None:
        self._initialized = False
        self._episode_has_started = False
        self.id2word = ["<PAD>", "<UNK>"]
        self.word2id = {w: i for i, w in enumerate(self.id2word)}
        
        self.model = CommandScorer(input_size=self.MAX_VOCAB_SIZE, hidden_size=128)
        self.optimizer = optim.Adam(self.model.parameters(), 0.00003)
        
        self.mode = "test"
    
    def train(self):
        self.mode = "train"
        self.stats = {"max": defaultdict(list), "mean": defaultdict(list)}
        self.transitions = []
        self.model.reset_hidden(1)
        self.last_score = 0
        self.no_train_step = 0
    
    def test(self):
        self.mode = "test"
        self.model.reset_hidden(1)
        
    @property
    def infos_to_request(self) -> EnvInfos:
        return EnvInfos(description=True, inventory=True, admissible_commands=True,
                        won=True, lost=True)
    
    def _get_word_id(self, word):
        if word not in self.word2id:
            if len(self.word2id) >= self.MAX_VOCAB_SIZE:
                return self.word2id["<UNK>"]
            
            self.id2word.append(word)
            self.word2id[word] = len(self.word2id)
            
        return self.word2id[word]
            
    def _tokenize(self, text):
        # Simple tokenizer: replace every character that is not a letter, digit, hyphen or space.
        text = re.sub(r"[^a-zA-Z0-9\- ]", " ", text)
        word_ids = list(map(self._get_word_id, text.split()))
        return word_ids

    def _process(self, texts):
        texts = list(map(self._tokenize, texts))
        max_len = max(len(l) for l in texts)
        padded = np.ones((len(texts), max_len)) * self.word2id["<PAD>"]

        for i, text in enumerate(texts):
            padded[i, :len(text)] = text

        padded_tensor = torch.from_numpy(padded).type(torch.long).to(device)
        padded_tensor = padded_tensor.permute(1, 0) # Batch x Seq => Seq x Batch
        return padded_tensor
      
    def _discount_rewards(self, last_values):
        returns, advantages = [], []
        R = last_values.data
        for t in reversed(range(len(self.transitions))):
            rewards, _, _, values = self.transitions[t]
            R = rewards + self.GAMMA * R
            adv = R - values
            returns.append(R)
            advantages.append(adv)
            
        return returns[::-1], advantages[::-1]

    def act(self, obs: str, score: int, done: bool, infos: Mapping[str, Any]) -> Optional[str]:
        
        # Build agent's observation: feedback + look + inventory.
        input_ = "{}\n{}\n{}".format(obs, infos["description"], infos["inventory"])
        
        # Tokenize and pad the input and the commands to choose from.
        input_tensor = self._process([input_])
        commands_tensor = self._process(infos["admissible_commands"])
        
        # Get our next action and value prediction.
        outputs, indexes, values = self.model(input_tensor, commands_tensor)
        action = infos["admissible_commands"][indexes[0]]
        
        if self.mode == "test":
            if done:
                self.model.reset_hidden(1)
            return action
        
        self.no_train_step += 1
        
        if self.transitions:
            reward = score - self.last_score  # Reward is the gain/loss in score.
            self.last_score = score
            if infos["won"]:
                reward += 100
            if infos["lost"]:
                reward -= 100
                
            self.transitions[-1][0] = reward  # Update reward information.
        
        self.stats["max"]["score"].append(score)
        if self.no_train_step % self.UPDATE_FREQUENCY == 0:
            # Update model
            returns, advantages = self._discount_rewards(values)
            
            loss = 0
            for transition, ret, advantage in zip(self.transitions, returns, advantages):
                reward, indexes_, outputs_, values_ = transition
                
                advantage        = advantage.detach() # Block gradients flow here.
                probs            = F.softmax(outputs_, dim=2)
                log_probs        = torch.log(probs)
                log_action_probs = log_probs.gather(2, indexes_)
                policy_loss      = (-log_action_probs * advantage).sum()
                value_loss       = (.5 * (values_ - ret) ** 2.).sum()
                entropy     = (-probs * log_probs).sum()
                loss += policy_loss + 0.5 * value_loss - 0.1 * entropy
                
                self.stats["mean"]["reward"].append(reward)
                self.stats["mean"]["policy"].append(policy_loss.item())
                self.stats["mean"]["value"].append(value_loss.item())
                self.stats["mean"]["entropy"].append(entropy.item())
                self.stats["mean"]["confidence"].append(torch.exp(log_action_probs).item())
            
            if self.no_train_step % self.LOG_FREQUENCY == 0:
                msg = "{:6d}. ".format(self.no_train_step)
                msg += "  ".join("{}: {: 3.3f}".format(k, np.mean(v)) for k, v in self.stats["mean"].items())
                msg += "  " + "  ".join("{}: {:2d}".format(k, np.max(v)) for k, v in self.stats["max"].items())
                msg += "  vocab: {:3d}".format(len(self.id2word))
                print(msg)
                self.stats = {"max": defaultdict(list), "mean": defaultdict(list)}
            
            loss.backward()
            nn.utils.clip_grad_norm_(self.model.parameters(), 40)
            self.optimizer.step()
            self.optimizer.zero_grad()
        
            self.transitions = []
            self.model.reset_hidden(1)
        else:
            # Keep information about transitions for Truncated Backpropagation Through Time.
            self.transitions.append([None, indexes, outputs, values])  # Reward will be set on the next call
        
        if done:
            self.last_score = 0  # Will be starting a new episode. Reset the last score.
        
        return action
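
The text preprocessing done by `_tokenize` and `_process` can be tried in isolation. The sketch below re-implements it in plain Python (lists instead of tensors, no vocabulary cap), so the `TinyVocab` class and its methods are illustrative stand-ins rather than part of the agent.

```python
import re
from typing import Dict, List


class TinyVocab:
    """Grow a word-to-id mapping on the fly, as the agent does."""

    def __init__(self) -> None:
        self.id2word: List[str] = ["<PAD>", "<UNK>"]
        self.word2id: Dict[str, int] = {w: i for i, w in enumerate(self.id2word)}

    def tokenize(self, text: str) -> List[int]:
        # Keep letters, digits, hyphens and spaces; everything else becomes a space.
        text = re.sub(r"[^a-zA-Z0-9\- ]", " ", text)
        ids = []
        for word in text.split():
            if word not in self.word2id:
                self.word2id[word] = len(self.id2word)
                self.id2word.append(word)
            ids.append(self.word2id[word])
        return ids

    def pad(self, texts: List[str]) -> List[List[int]]:
        """Tokenize every text and right-pad with the <PAD> id."""
        rows = [self.tokenize(t) for t in texts]
        max_len = max(len(r) for r in rows)
        pad_id = self.word2id["<PAD>"]
        return [r + [pad_id] * (max_len - len(r)) for r in rows]


vocab = TinyVocab()
print(vocab.pad(["open door", "take the apple!"]))
```

Unlike this sketch, the agent maps unseen words to `<UNK>` once `MAX_VOCAB_SIZE` entries exist, and it transposes the padded batch to sequence-first order before feeding it to the GRU.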

### Training the neural agent
Let's first evaluate the agent before training to get a sense of its initial performance.

In [5]:
agent = NeuralAgent()
play(agent, "./games/tw-rewardsDense_goalDetailed.z8")

tw-rewardsDense_goalDetailed.z8..........  	avg. steps: 100.0; avg. score:  4.3 / 10.


Unsurprisingly, the result is not much different from what the random agent can achieve since our neural agent is initialized to a random policy.

Let's train the agent for a few episodes.
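
During training, the agent accumulates transitions and, every `UPDATE_FREQUENCY` steps, computes discounted returns and advantages before taking a gradient step. The sketch below mirrors `_discount_rewards` with plain floats; the reward and value numbers are invented for illustration.

```python
from typing import List, Tuple

GAMMA = 0.9  # Same discount factor as NeuralAgent.GAMMA.


def discount_rewards(rewards: List[float], values: List[float],
                     last_value: float) -> Tuple[List[float], List[float]]:
    """Return (returns, advantages), bootstrapping from last_value."""
    returns, advantages = [], []
    running = last_value
    for reward, value in zip(reversed(rewards), reversed(values)):
        running = reward + GAMMA * running   # R_t = r_t + gamma * R_{t+1}
        returns.append(running)
        advantages.append(running - value)   # A_t = R_t - V(s_t)
    return returns[::-1], advantages[::-1]


# Invented example: a single reward of 1 on the last of three steps.
returns, advantages = discount_rewards(
    rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], last_value=0.0)
print(returns)
print(advantages)
```

In the agent's loss, a positive advantage pushes up the probability of the command that was sampled, while a negative advantage pushes it down.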

In [6]:
# You can skip this if you already downloaded the data in the prerequisite section.

from time import time
agent = NeuralAgent()

print("Training")
agent.train()  # Tell the agent it should update its parameters.
starttime = time()
play(agent, "./games/tw-rewardsDense_goalDetailed.z8", nb_episodes=500, verbose=False)  # Dense rewards game.

print("Trained in {:.2f} secs".format(time() - starttime))

# Save the trained agent.
import os
os.makedirs('checkpoints', exist_ok=True)
torch.save(agent, 'checkpoints/agent_trained_on_single_game.pt')

Training
  1000. reward:  0.042  policy:  0.258  value:  0.074  entropy:  2.337  confidence:  0.099  score:  8  vocab: 258
  2000. reward: -0.058  policy: -1.414  value:  24.997  entropy:  2.385  confidence:  0.095  score:  9  vocab: 316
  3000. reward:  0.052  policy:  0.146  value:  0.097  entropy:  2.429  confidence:  0.092  score:  9  vocab: 318
  4000. reward:  0.042  policy: -0.040  value:  0.081  entropy:  2.397  confidence:  0.095  score:  5  vocab: 318
  5000. reward:  0.053  policy:  0.084  value:  0.104  entropy:  2.480  confidence:  0.089  score:  7  vocab: 319
  6000. reward:  0.048  policy: -0.015  value:  0.083  entropy:  2.402  confidence:  0.095  score:  5  vocab: 319
  7000. reward:  0.053  policy:  0.030  value:  0.086  entropy:  2.426  confidence:  0.096  score:  8  vocab: 319
  8000. reward: -0.049  policy: -0.944  value:  18.233  entropy:  2.421  confidence:  0.098  score:  9  vocab: 320
  9000. reward:  0.054  policy:  0.062  value:  0.093  entropy:  2.477  confi
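
One way to read the `entropy` and `confidence` columns: a uniform policy over n commands has entropy ln(n) and assigns probability 1/n to each command. The calculation below applies that relation to the first log line; treating the logged averages as if they came from a single uniform policy is a simplifying assumption.

```python
import math

entropy = 2.337      # From the first log line above.
confidence = 0.099   # Probability of the sampled command, same line.

# A uniform policy with this entropy would spread over about exp(entropy) commands.
effective_nb_commands = math.exp(entropy)
uniform_probability = 1.0 / effective_nb_commands

print(f"effective number of commands: {effective_nb_commands:.1f}")
print(f"uniform probability: {uniform_probability:.3f} (logged confidence: {confidence})")
```

Under that assumption, the two columns are roughly consistent with each other, which suggests the sampling distribution stays fairly spread out across the admissible commands during training.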

#### Testing the agent trained on a single game

In [7]:
# We report the score and steps averaged over 10 playthroughs.
agent = torch.load('checkpoints/agent_trained_on_single_game.pt')
agent.test()
play(agent, "./games/tw-rewardsDense_goalDetailed.z8")  # Dense rewards game.

tw-rewardsDense_goalDetailed.z8..........  	avg. steps:  85.0; avg. score:  8.8 / 10.


Of course, since we trained on that single simple game, it's not surprising the agent can achieve a high score on it. It would be more interesting to evaluate the generalization capability of the agent.

To do so, we are going to test the agent on another game drawn from the same game distribution (i.e. same world but the goal is to pick another food item). Let's generate `games/tw-another_game.z8` with the same reward density (`--rewards dense`) and the same goal description (`--goal detailed`), but using `--seed 1` and without the `--test` flag (to make sure the game is not part of the test set, since `games/tw-rewardsDense_goalDetailed.z8` is).

In [None]:
!tw-make tw-simple --rewards dense --goal detailed --seed 1 --output games/tw-another_game.z8 -v -f

In [8]:
# We report the score and steps averaged over 10 playthroughs.
play(RandomAgent(), "./games/tw-another_game.z8")
play(agent, "./games/tw-another_game.z8")

tw-another_game.z8..........  	avg. steps: 100.0; avg. score:  3.9 / 8.
tw-another_game.z8..........  	avg. steps: 100.0; avg. score:  5.6 / 8.


As we can see, the trained agent does a bit better than the random agent. To improve the agent's generalization capability, we should train it on many different games drawn from the game distribution.

One could use the following command to easily generate 100 training games:

In [None]:
# You can skip this if you already downloaded the data in the prerequisite section.

! seq 1 100 | xargs -n1 -P4 tw-make tw-simple --rewards dense --goal detailed --format z8 --output training_games/ --seed

Then, we train our agent on that set of training games.

In [9]:
# You can skip this if you already downloaded the data in the prerequisite section.

from time import time
agent = NeuralAgent()

print("Training on 100 games")
agent.train()  # Tell the agent it should update its parameters.
starttime = time()
play(agent, "./training_games/", nb_episodes=100 * 5, verbose=False)  # Each game will be seen 5 times.
print("Trained in {:.2f} secs".format(time() - starttime))

# Save the trained agent.
import os
os.makedirs('checkpoints', exist_ok=True)
torch.save(agent, 'checkpoints/agent_trained_on_multiple_games.pt')

Training on 100 games
  1000. reward: -0.067  policy: -0.285  value:  10.002  entropy:  2.348  confidence:  0.097  score:  9  vocab: 513
  2000. reward: -0.059  policy: -1.836  value:  23.834  entropy:  2.400  confidence:  0.095  score:  9  vocab: 595
  3000. reward:  0.043  policy:  0.107  value:  0.073  entropy:  2.357  confidence:  0.097  score:  6  vocab: 616
  4000. reward:  0.048  policy:  0.110  value:  0.094  entropy:  2.457  confidence:  0.090  score:  6  vocab: 644
  5000. reward:  0.048  policy:  0.008  value:  0.084  entropy:  2.382  confidence:  0.096  score:  9  vocab: 655
  6000. reward: -0.048  policy: -2.254  value:  42.841  entropy:  2.441  confidence:  0.093  score:  9  vocab: 670
  7000. reward:  0.041  policy: -0.017  value:  0.062  entropy:  2.378  confidence:  0.095  score:  5  vocab: 686
  8000. reward:  0.047  policy:  0.051  value:  0.073  entropy:  2.412  confidence:  0.095  score:  6  vocab: 689
  9000. reward: -0.051  policy: -0.222  value:  5.534  entropy:

#### Testing the agent trained on 100 games

In [10]:
agent = torch.load('checkpoints/agent_trained_on_multiple_games.pt')
agent.test()
play(agent, "./games/tw-rewardsDense_goalDetailed.z8")  # Averaged over 10 playthroughs.
play(agent, "./games/tw-another_game.z8")               # Averaged over 10 playthroughs.

tw-rewardsDense_goalDetailed.z8..........  	avg. steps:  91.3; avg. score:  7.9 / 10.
tw-another_game.z8..........  	avg. steps:  97.8; avg. score:  6.0 / 8.


Compare it to the agent trained on a single game.

In [11]:
agent = torch.load('checkpoints/agent_trained_on_single_game.pt')
agent.test()
play(agent, "./games/tw-rewardsDense_goalDetailed.z8")  # Averaged over 10 playthroughs.
play(agent, "./games/tw-another_game.z8")               # Averaged over 10 playthroughs.

tw-rewardsDense_goalDetailed.z8..........  	avg. steps:  85.0; avg. score:  8.8 / 10.
tw-another_game.z8..........  	avg. steps: 100.0; avg. score:  5.6 / 8.


#### Evaluating the agent on a test distribution
We will generate 20 test games and evaluate the agent on them.

In [None]:
# You can skip this if you already downloaded the games in the prerequisite section.

! seq 1 20 | xargs -n1 -P4 tw-make tw-simple --rewards dense --goal detailed --test --format z8 --output testing_games/ --seed

In [12]:
agent = torch.load('checkpoints/agent_trained_on_multiple_games.pt')
agent.test()
play(RandomAgent(), "./testing_games/", nb_episodes=20 * 10)
play(agent, "./testing_games/", nb_episodes=20 * 10)  # Averaged over 10 playthroughs for each test game.

./testing_games........................................................................................................................................................................................................  	avg. steps:  99.6; avg. normalized score:  0.5 / 1.
./testing_games........................................................................................................................................................................................................  	avg. steps:  92.5; avg. normalized score:  0.8 / 1.


While not perfect, the agent manages to score more points on average than the random agent.

## Next steps

Here are a few possible directions one can take to improve the agent's performance.
- Adding more training games
- Changing the agent architecture
- Leveraging already trained word embeddings
- Playing more games at once (see [`textworld.gym.make_batch`](https://textworld.readthedocs.io/en/latest/textworld.gym.html#textworld.gym.utils.make_batch))


## Papers about RL applied to text-based games
* [Language Understanding for Text-based games using Deep Reinforcement Learning][narasimhan_et_al_2015]
* [Learning How Not to Act in Text-based Games][haroush_et_al_2017]
* [Deep Reinforcement Learning with a Natural Language Action Space][he_et_al_2015]
* [What can you do with a rock? Affordance extraction via word embeddings][fulda_et_al_2017]
* [Text-based adventures of the Golovin AI Agent][kostka_et_al_2017]
* [Using reinforcement learning to learn how to play text-based games][zelinka_2018]

[narasimhan_et_al_2015]: https://arxiv.org/abs/1506.08941
[haroush_et_al_2017]: https://openreview.net/pdf?id=B1-tVX1Pz
[he_et_al_2015]: https://arxiv.org/abs/1511.04636
[fulda_et_al_2017]: https://arxiv.org/abs/1703.03429
[kostka_et_al_2017]: https://arxiv.org/abs/1705.05637
[zelinka_2018]: https://arxiv.org/abs/1801.01999