# Textworld starting kit notebook

Model: *Bert-DQN with replay memory*

When running first: 
 1. Run the first 2 code cells(with pip installations)
 2. Restart runtime
 3. Countinue with the next cells

This is done, because there is a problem with dependencies of **textworld** and **colab**, requiring different versions of **prompt-toolkit**

In [0]:
!pip install textworld

Collecting prompt-toolkit<2.1.0,>=2.0.0 (from textworld)
  Using cached https://files.pythonhosted.org/packages/f7/a7/9b1dd14ef45345f186ef69d175bdd2491c40ab1dfa4b2b3e4352df719ed7/prompt_toolkit-2.0.9-py3-none-any.whl
[31mERROR: ipython 5.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.4, but you'll have prompt-toolkit 2.0.9 which is incompatible.[0m
Installing collected packages: prompt-toolkit
  Found existing installation: prompt-toolkit 1.0.16
    Uninstalling prompt-toolkit-1.0.16:
      Successfully uninstalled prompt-toolkit-1.0.16
Successfully installed prompt-toolkit-2.0.9


In [0]:
!pip install prompt-toolkit==1.0.16

Collecting prompt-toolkit==1.0.16
  Using cached https://files.pythonhosted.org/packages/57/a8/a151b6c61718eabe6b4672b6aa760b734989316d62ec1ba4996765e602d4/prompt_toolkit-1.0.16-py3-none-any.whl
[31mERROR: textworld 1.1.1 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.16 which is incompatible.[0m
[31mERROR: jupyter-console 6.0.0 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.16 which is incompatible.[0m
Installing collected packages: prompt-toolkit
  Found existing installation: prompt-toolkit 2.0.9
    Uninstalling prompt-toolkit-2.0.9:
      Successfully uninstalled prompt-toolkit-2.0.9
Successfully installed prompt-toolkit-1.0.16


In [0]:
!pip install pytorch_pretrained_bert



In [0]:
import os
import random
import logging
import yaml
import copy
import spacy
import numpy as np
import glob

from tqdm import tqdm
from typing import List, Dict, Any
from collections import namedtuple
import pandas as pd

import torch
import torch.nn.functional as F
import math

import gym
import textworld.gym
from textworld import EnvInfos
from pytorch_pretrained_bert import BertTokenizer, BertModel


torch.cuda.is_available()

True

In [0]:
torch.cuda.empty_cache()

## Generic functions

In [0]:
def to_np(x):
    if isinstance(x, np.ndarray):
        return x
    return x.data.cpu().numpy()


def to_pt(np_matrix, enable_cuda=False, type='long'):
    if type == 'long':
        if enable_cuda:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.LongTensor).cuda())
        else:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.LongTensor))
    elif type == 'float':
        if enable_cuda:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.FloatTensor).cuda())
        else:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.FloatTensor))


def _words_to_ids(words, word2id):
    ids = []
    for word in words:
        try:
            ids.append(word2id[word])
        except KeyError:
            ids.append(1)
    return ids


def preproc(s, str_type='None', tokenizer=None, lower_case=True):
    if s is None:
        return ["nothing"]
    s = s.replace("\n", ' ')
    if s.strip() == "":
        return ["nothing"]
    if str_type == 'feedback':
        if "$$$$$$$" in s:
            s = ""
        if "-=" in s:
            s = s.split("-=")[0]
    s = s.strip()
    if len(s) == 0:
        return ["nothing"]
    tokens = [t.text for t in tokenizer(s)]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    return tokens


def max_len(list_of_list):
    return max(map(len, list_of_list))


def pad_sequences(sequences, maxlen=None, dtype='int32', value=0.):
    '''
    Partially borrowed from Keras
    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        value: float, value to pad the sequences to the desired value.
    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen)
    '''
    lengths = [len(s) for s in sequences]
    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)
    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break
    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        # pre truncating
        trunc = s[-maxlen:]
        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))
        # post padding
        x[idx, :len(trunc)] = trunc
    return x

## Layers

In [0]:
def masked_mean(x, m=None, dim=-1):
    """
        mean pooling when there're paddings
        input:  tensor: batch x time x h
                mask:   batch x time
        output: tensor: batch x h
    """
    if m is None:
        return torch.mean(x, dim=dim)
    mask_sum = torch.sum(m, dim=-1)  # batch
    res = torch.sum(x, dim=1)  # batch x h
    mean = res / (mask_sum.unsqueeze(-1) + 1e-6)
    
    del mask_sum
    del res
    
    return mean


class Embedding(torch.nn.Module):
    '''
    inputs: x:          batch x seq (x is post-padded by 0s)
    outputs:embedding:  batch x seq x emb
            mask:       batch x seq
    '''

    def __init__(self, embedding_size, vocab_size, enable_cuda=False):
        super(Embedding, self).__init__()
        self.embedding_size = embedding_size
        self.vocab_size = vocab_size
        self.enable_cuda = enable_cuda
        self.embedding_layer = torch.nn.Embedding(self.vocab_size, self.embedding_size, padding_idx=0)

    def compute_mask(self, x):
        mask = torch.ne(x, 0).float()
        if self.enable_cuda:
            mask = mask.cuda()
        return mask

    def forward(self, x):
        embeddings = self.embedding_layer(x)  # batch x time x emb
        mask = self.compute_mask(x)  # batch x time
        return embeddings, mask


class FastUniLSTM(torch.nn.Module):
    """
    Adapted from https://github.com/facebookresearch/DrQA/
    now supports:   different rnn size for each layer
                    all zero rows in batch (from time distributed layer, by reshaping certain dimension)
    """

    def __init__(self, ninp, nhids, dropout_between_rnn_layers=0.):
        super(FastUniLSTM, self).__init__()
        self.ninp = ninp
        self.nhids = nhids
        self.nlayers = len(self.nhids)
        self.dropout_between_rnn_layers = dropout_between_rnn_layers
        self.stack_rnns()

    def stack_rnns(self):
        rnns = [torch.nn.LSTM(self.ninp if i == 0 else self.nhids[i - 1],
                              self.nhids[i],
                              num_layers=1,
                              bidirectional=False) for i in range(self.nlayers)]
        self.rnns = torch.nn.ModuleList(rnns)

    def forward(self, x, mask):

        def pad_(tensor, n):
            if n > 0:
                zero_pad = torch.autograd.Variable(torch.zeros((n,) + tensor.size()[1:]))
                if x.is_cuda:
                    zero_pad = zero_pad.cuda()
                tensor = torch.cat([tensor, zero_pad])
            return tensor

        """
        inputs: x:          batch x time x inp
                mask:       batch x time
        output: encoding:   batch x time x hidden[-1]
        """
        # Compute sorted sequence lengths
        batch_size = x.size(0)
        lengths = mask.data.eq(1).long().sum(1)  # .squeeze()
        _, idx_sort = torch.sort(lengths, dim=0, descending=True)
        _, idx_unsort = torch.sort(idx_sort, dim=0)

        lengths = list(lengths[idx_sort])
        idx_sort = torch.autograd.Variable(idx_sort)
        idx_unsort = torch.autograd.Variable(idx_unsort)

        # Sort x
        x = x.index_select(0, idx_sort)

        # remove non-zero rows, and remember how many zeros
        n_nonzero = np.count_nonzero(lengths)
        n_zero = batch_size - n_nonzero
        if n_zero != 0:
            lengths = lengths[:n_nonzero]
            x = x[:n_nonzero]

        # Transpose batch and sequence dims
        x = x.transpose(0, 1)

        # Pack it up
        rnn_input = torch.nn.utils.rnn.pack_padded_sequence(x, lengths)

        # Encode all layers
        outputs = [rnn_input]
        for i in range(self.nlayers):
            rnn_input = outputs[-1]

            # dropout between rnn layers
            if self.dropout_between_rnn_layers > 0:
                dropout_input = F.dropout(rnn_input.data,
                                          p=self.dropout_between_rnn_layers,
                                          training=self.training)
                rnn_input = torch.nn.utils.rnn.PackedSequence(dropout_input,
                                                              rnn_input.batch_sizes)
            seq, last = self.rnns[i](rnn_input)
            outputs.append(seq)
            if i == self.nlayers - 1:
                # last layer
                last_state = last[0]  # (num_layers * num_directions, batch, hidden_size)
                last_state = last_state[0]  # batch x hidden_size

        # Unpack everything
        for i, o in enumerate(outputs[1:], 1):
            outputs[i] = torch.nn.utils.rnn.pad_packed_sequence(o)[0]
        output = outputs[-1]

        # Transpose and unsort
        output = output.transpose(0, 1)  # batch x time x enc

        # re-padding
        output = pad_(output, n_zero)
        last_state = pad_(last_state, n_zero)

        output = output.index_select(0, idx_unsort)
        last_state = last_state.index_select(0, idx_unsort)

        # Pad up to original batch sequence length
        if output.size(1) != mask.size(1):
            padding = torch.zeros(output.size(0),
                                  mask.size(1) - output.size(1),
                                  output.size(2)).type(output.data.type())
            output = torch.cat([output, torch.autograd.Variable(padding)], 1)

        output = output.contiguous() * mask.unsqueeze(-1)
        return output, last_state, mask

## Noisy nets

In [0]:
class NoisyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, std_init=0.4):
        super(NoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.std_init = std_init
        self.weight_mu = torch.nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = torch.nn.Parameter(torch.empty(out_features, in_features))
        self.register_buffer('weight_epsilon', torch.empty(out_features, in_features))
        self.bias_mu = torch.nn.Parameter(torch.empty(out_features))
        self.bias_sigma = torch.nn.Parameter(torch.empty(out_features))
        self.register_buffer('bias_epsilon', torch.empty(out_features))
        self.reset_parameters()
        self.sample_noise()

    def reset_parameters(self):
        mu_range = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.out_features))

    def _scale_noise(self, size):
        x = torch.randn(size)
        return x.sign().mul_(x.abs().sqrt_())

    def sample_noise(self):
        epsilon_in = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        self.weight_epsilon.copy_(epsilon_out.ger(epsilon_in))
        self.bias_epsilon.copy_(epsilon_out)

    def forward(self, inp):
        if self.training:
            return F.linear(inp, self.weight_mu + self.weight_sigma * self.weight_epsilon, self.bias_mu + self.bias_sigma * self.bias_epsilon)
        else:
            return F.linear(inp, self.weight_mu, self.bias_mu)

## Model

In [0]:
example_input  = '-= Pantry =- \
You\'ve just sauntered into a pantry. You try to gain information on your surroundings by using a technique you call "looking."\
You see a shelf. The shelf is wooden. On the shelf you can make out a black pepper and an orange bell pepper. I mean, just wow! Isn\'t TextWorld just the best?\
There is an open frosted-glass door leading south.\
 You are carrying nothing.\
Recipe #1\
---------\
Gather all following ingredients and follow the directions to prepare this tasty meal.\
Ingredients:\
  black pepper\
  red apple\
  water\
Directions:\
  prepare meal\
                    ________  ________  __    __  ________        \
                   |        \|        \|  \  |  \|        \       \
                    \$$$$$$$$| $$$$$$$$| $$  | $$ \$$$$$$$$       \
                      | $$   | $$__     \$$\/  $$   | $$          \
                      | $$   | $$  \     >$$  $$    | $$          \
                      | $$   | $$$$$    /  $$$$\    | $$          \
                      | $$   | $$_____ |  $$ \$$\   | $$          \
                      | $$   | $$     \| $$  | $$   | $$          \
                       \$$    \$$$$$$$$ \$$   \$$    \$$          \
              __       __   ______   _______   __        _______  \
             |  \  _  |  \ /      \ |       \ |  \      |       \ \
             | $$ / \ | $$|  $$$$$$\| $$$$$$$\| $$      | $$$$$$$\ \
             | $$/  $\| $$| $$  | $$| $$__| $$| $$      | $$  | $$\
             | $$  $$$\ $$| $$  | $$| $$    $$| $$      | $$  | $$\
             | $$ $$\$$\$$| $$  | $$| $$$$$$$\| $$      | $$  | $$\
             | $$$$  \$$$$| $$__/ $$| $$  | $$| $$_____ | $$__/ $$\
             | $$$    \$$$ \$$    $$| $$  | $$| $$     \| $$    $$\
              \$$      \$$  \$$$$$$  \$$   \$$ \$$$$$$$$ \$$$$$$$ \
You are hungry! Let\'s cook a delicious meal. Check the cookbook in the kitchen for the recipe. Once done, enjoy your meal!'


In [0]:
def preproc_example(s):
  s = s.replace('$', '')
  s = s.replace('#', '')
  s = s.replace('\n', ' ')
  s = s.replace('  ', ' ')
  s = s.replace('_', '')
  s = s.replace('|', '')
  s = s.replace('\\', '')
  s = s.replace('/', '')
  s = s.replace('-', '')
  s = s.replace('=', '')
  return s

In [0]:
preproc_example(example_input)
# example_input.replace('$', '')

' Pantry  You\'ve just sauntered into a pantry. You try to gain information on your surroundings by using a technique you call "looking."You see a shelf. The shelf is wooden. On the shelf you can make out a black pepper and an orange bell pepper. I mean, just wow! Isn\'t TextWorld just the best?There is an open frostedglass door leading south. You are carrying nothing.Recipe 1Gather all following ingredients and follow the directions to prepare this tasty meal.Ingredients: black pepper red apple waterDirections: prepare meal                                                                                                              >                                                                                                                                                                                                                                                                                                      You are hungry! Let\'s cook a delicious meal. Check the cookbook 

In [0]:

def convert_examples_to_features(sequences, tokenizer):
  """Loads a data file into a list of `InputFeature`s."""
  batch_tokens = []
  batch_input_ids = []
  batch_input_masks = []
  for example in sequences:
      _example = preproc_example(example)      
#       print(_example)
      tokens = tokenizer.tokenize(_example)
      if len(tokens) > 512:
        tokens = tokens[:512]
      batch_tokens.append(tokens)
      del _example
      del tokens

  max_length = max([len(x) for x in batch_tokens])
#   print('bert_max_seqence', max_length)
  for tokens in batch_tokens:
      input_ids = tokenizer.convert_tokens_to_ids(tokens)
      # The mask has 1 for real tokens and 0 for padding tokens. Only real
      # tokens are attended to.
      input_mask = [1] * len(input_ids)

      # Zero-pad up to the sequence length.
      while len(input_ids) < max_length:
          input_ids.append(0)
          input_mask.append(0)
           
      batch_input_ids.append(input_ids)
      batch_input_masks.append(input_mask)
      del input_ids
      del input_mask
  
  return batch_tokens, batch_input_ids, batch_input_masks

def freeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = False

In [0]:
mlogger = logging.getLogger(__name__)

class Bert_DQN(torch.nn.Module):
    model_name = 'bert_dqn'

    def __init__(self, model_config, word_vocab, generate_length=5, enable_cuda=False):
        super(Bert_DQN, self).__init__()
        self.model_config = model_config
        self.enable_cuda = enable_cuda
        self.word_vocab_size = len(word_vocab)
        self.id2word = word_vocab
        self.generate_length = generate_length
        self.read_config()
#         print(enable_cuda)
        self.device = torch.device("cuda" if enable_cuda else "cpu")
        self.tokenizer = BertTokenizer.from_pretrained(self.bert_model, do_lower_case=True)
        self._def_layers()
        self.init_weights()
        self.print_parameters()

    def print_parameters(self):
      amount = 0
      for p in self.parameters():
          amount += np.prod(p.size())
      print("total number of parameters: %s" % (amount))
      parameters = filter(lambda p: p.requires_grad, self.parameters())
      amount = 0
      for p in parameters:
          amount += np.prod(p.size())
      print("number of trainable parameters: %s" % (amount))

    def read_config(self):
        # model config
#         self.embedding_size = self.model_config['embedding_size']
#         self.encoder_rnn_hidden_size = self.model_config['encoder_rnn_hidden_size']
#         self.action_scorer_hidden_dim = self.model_config['action_scorer_hidden_dim']
#         self.dropout_between_rnn_layers = self.model_config['dropout_between_rnn_layers']
        self.bert_model = self.model_config['bert_model']
        self.layer_index = self.model_config['layer_index']
        self.action_scorer_hidden_dim = self.model_config['action_scorer_hidden_dim']
        self.train_bert = self.model_config['train_bert']
        
    def _def_layers(self):

        # word embeddings
#         self.word_embedding = Embedding(embedding_size=self.embedding_size,
#                                         vocab_size=self.word_vocab_size,
#                                         enable_cuda=self.enable_cuda)

#         # lstm encoder
#         self.encoder = FastUniLSTM(ninp=self.embedding_size,
#                                    nhids=self.encoder_rnn_hidden_size,
#                                    dropout_between_rnn_layers=self.dropout_
        self.encoder = BertModel.from_pretrained(self.bert_model).to(self.device)
        if not self.train_bert:
          freeze_layer(self.encoder)
        # only for base models
        # for large models is 
        bert_embeddings = 768

        self.action_scorer_shared = torch.nn.Linear(bert_embeddings, self.action_scorer_hidden_dim)
        action_scorers = []
        for _ in range(self.generate_length):
            action_scorers.append( NoisyLinear(self.action_scorer_hidden_dim, 
                            self.word_vocab_size, 
                            std_init=self.model_config['noisy_std']))
        self.action_scorers = torch.nn.ModuleList(action_scorers)
        self.fake_recurrent_mask = None

    def init_weights(self):
        torch.nn.init.xavier_uniform_(self.action_scorer_shared.weight.data)
        for i in range(len(self.action_scorers)):
            self.action_scorers[i].sample_noise()
        self.action_scorer_shared.bias.data.fill_(0)

    def representation_generator(self, ids, mask):
        ids = ids.to(self.device)
        mask = mask.to(self.device)
        
        layers, _ = self.encoder(ids, attention_mask=mask)
#         encoding_sequence = layers[self.layer_index]
#         print('layer length: ', len(layers))
        encoding_sequence = layers[-2].type(torch.FloatTensor)
        encoding_sequence = encoding_sequence.to(self.device)
    
#         print('encoding_sequence: ', type(encoding_sequence))
#         print('encoding_sequence: ', encoding_sequence)
        mask = mask.type(torch.FloatTensor).to(self.device)
#         print('mask: ', type(mask))
#         print('mask: ', mask)
        
#         embeddings, mask = self.word_embedding.forward(_input_words)  # batch x time x emb
#         encoding_sequence, _, _ = self.encoder.forward(embeddings, mask)  # batch x time x h
        res_mean = masked_mean(encoding_sequence, mask)  # batch x h
        del layers
        del encoding_sequence
        del mask
        
        return res_mean


    def action_scorer(self, state_representation):
        hidden = self.action_scorer_shared.forward(state_representation)  # batch x hid
        hidden = F.relu(hidden)  # batch x hid
        action_ranks = []
        for i in range(len(self.action_scorers)):
            action_ranks.append(self.action_scorers[i].forward(hidden))  # batch x n_vocab
        del hidden
        return action_ranks

## Agent

In [0]:
# a snapshot of state to be stored in replay memory
Transition = namedtuple('Transition', ('bert_ids', 'bert_masks',
                                       'word_indices',
                                       'reward', 'mask', 'done',
                                       'next_bert_ids', 'next_bert_masks',
                                       'next_word_masks'))


class HistoryScoreCache(object):

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.reset()

    def push(self, stuff):
        """stuff is float."""
        if len(self.memory) < self.capacity:
            self.memory.append(stuff)
        else:
            self.memory = self.memory[1:] + [stuff]

    def get_avg(self):
        return np.mean(np.array(self.memory))

    def reset(self):
        self.memory = []

    def __len__(self):
        return len(self.memory)


class PrioritizedReplayMemory(object):

    def __init__(self, capacity=100000, priority_fraction=0.0):
        # prioritized replay memory
        self.priority_fraction = priority_fraction
        self.alpha_capacity = int(capacity * priority_fraction)
        self.beta_capacity = capacity - self.alpha_capacity
        self.alpha_memory, self.beta_memory = [], []
        self.alpha_position, self.beta_position = 0, 0

    def push(self, is_prior=False, *args):
        """Saves a transition."""
        if self.priority_fraction == 0.0:
            is_prior = False
        if is_prior:
            if len(self.alpha_memory) < self.alpha_capacity:
                self.alpha_memory.append(None)
            self.alpha_memory[self.alpha_position] = Transition(*args)
            self.alpha_position = (self.alpha_position + 1) % self.alpha_capacity
        else:
            if len(self.beta_memory) < self.beta_capacity:
                self.beta_memory.append(None)
            self.beta_memory[self.beta_position] = Transition(*args)
            self.beta_position = (self.beta_position + 1) % self.beta_capacity

    def sample(self, batch_size):
        if self.priority_fraction == 0.0:
            from_beta = min(batch_size, len(self.beta_memory))
            res = random.sample(self.beta_memory, from_beta)
        else:
            from_alpha = min(int(self.priority_fraction * batch_size), len(self.alpha_memory))
            from_beta = min(batch_size - int(self.priority_fraction * batch_size), len(self.beta_memory))
            res = random.sample(self.alpha_memory, from_alpha) + random.sample(self.beta_memory, from_beta)
        random.shuffle(res)
        return res

    def __len__(self):
        return len(self.alpha_memory) + len(self.beta_memory)


class CustomAgent:
    def __init__(self):
        """
        Arguments:
            word_vocab: List of words supported.
        """
        self.mode = "train"
        with open("./vocab.txt") as f:
            self.word_vocab = f.read().split("\n")
        with open("config.yaml") as reader:
            self.config = yaml.safe_load(reader)
        self.word2id = {}
        for i, w in enumerate(self.word_vocab):
            self.word2id[w] = i
        self.EOS_id = self.word2id["</S>"]

        self.batch_size = self.config['training']['batch_size']
        self.max_nb_steps_per_episode = self.config['training']['max_nb_steps_per_episode']
        self.nb_epochs = self.config['training']['nb_epochs']

        # Set the random seed manually for reproducibility.
        np.random.seed(self.config['general']['random_seed'])
        torch.manual_seed(self.config['general']['random_seed'])
        if torch.cuda.is_available():
            if not self.config['general']['use_cuda']:
                print("WARNING: CUDA device detected but 'use_cuda: false' found in config.yaml")
                self.use_cuda = False
            else:
                torch.backends.cudnn.deterministic = True
                torch.cuda.manual_seed(self.config['general']['random_seed'])
                self.use_cuda = True
        else:
            self.use_cuda = False

        self.model = Bert_DQN(model_config=self.config["model"],
                              word_vocab=self.word_vocab,
                              enable_cuda=self.use_cuda)

        self.experiment_tag = self.config['checkpoint']['experiment_tag']
        self.model_checkpoint_path = self.config['checkpoint']['model_checkpoint_path']
        self.save_frequency = self.config['checkpoint']['save_frequency']

        if self.config['checkpoint']['load_pretrained']:
            self.load_pretrained_model(self.model_checkpoint_path + '/' + self.config['checkpoint']['pretrained_experiment_tag'] + '.pt')
        if self.use_cuda:
            self.model.cuda()

        self.replay_batch_size = self.config['general']['replay_batch_size']
        self.replay_memory = PrioritizedReplayMemory(self.config['general']['replay_memory_capacity'],
                                                     priority_fraction=self.config['general']['replay_memory_priority_fraction'])

        # optimizer
        parameters = filter(lambda p: p.requires_grad, self.model.parameters())
        self.optimizer = torch.optim.Adam(parameters, lr=self.config['training']['optimizer']['learning_rate'])

        # epsilon greedy
        self.epsilon_anneal_episodes = self.config['general']['epsilon_anneal_episodes']
        self.epsilon_anneal_from = self.config['general']['epsilon_anneal_from']
        self.epsilon_anneal_to = self.config['general']['epsilon_anneal_to']
        self.epsilon = self.epsilon_anneal_from
        self.update_per_k_game_steps = self.config['general']['update_per_k_game_steps']
        self.clip_grad_norm = self.config['training']['optimizer']['clip_grad_norm']

        self.nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
        self.preposition_map = {"take": "from",
                                "chop": "with",
                                "slice": "with",
                                "dice": "with",
                                "cook": "with",
                                "insert": "into",
                                "put": "on"}
        self.single_word_verbs = set(["inventory", "look"])
        self.discount_gamma = self.config['general']['discount_gamma']
        self.current_episode = 0
        self.current_step = 0
        self._epsiode_has_started = False
        self.history_avg_scores = HistoryScoreCache(capacity=1000)
        self.best_avg_score_so_far = 0.0
        self.loss = []
        self.state = ''

    def train(self, imitate=False):
        """
        Tell the agent that it's training phase.
        """
        self.mode = "train"
        self.imitate = imitate
        self.wt_index = 0
#         print(self.wt_index)
        self.model.train()

    def eval(self):
        """
        Tell the agent that it's evaluation phase.
        """
        self.mode = "eval"
        self.model.eval()

    def _start_episode(self, obs: List[str], infos: Dict[str, List[Any]]) -> None:
        """
        Prepare the agent for the upcoming episode.

        Arguments:
            obs: Initial feedback for each game.
            infos: Additional information for each game.
        """
        self.init(obs, infos)
        self._epsiode_has_started = True

    def _end_episode(self, obs: List[str], scores: List[int], infos: Dict[str, List[Any]]) -> None:
        """
        Tell the agent the episode has terminated.

        Arguments:
            obs: Previous command's feedback for each game.
            score: The score obtained so far for each game.
            infos: Additional information for each game.
        """
        self.finish()
        self._epsiode_has_started = False

    def load_pretrained_model(self, load_from):
        """
        Load pretrained checkpoint from file.

        Arguments:
            load_from: File name of the pretrained model checkpoint.
        """
        print("loading model from %s\n" % (load_from))
        try:
            if self.use_cuda:
                state_dict = torch.load(load_from)
            else:
                state_dict = torch.load(load_from, map_location='cpu')
            self.model.load_state_dict(state_dict)
        except:
            print("Failed to load checkpoint...")

    def select_additional_infos(self) -> EnvInfos:
        """
        Returns what additional information should be made available at each game step.

        Requested information will be included within the `infos` dictionary
        passed to `CustomAgent.act()`. To request specific information, create a
        :py:class:`textworld.EnvInfos <textworld.envs.wrappers.filter.EnvInfos>`
        and set the appropriate attributes to `True`. The possible choices are:

        * `description`: text description of the current room, i.e. output of the `look` command;
        * `inventory`: text listing of the player's inventory, i.e. output of the `inventory` command;
        * `max_score`: maximum reachable score of the game;
        * `objective`: objective of the game described in text;
        * `entities`: names of all entities in the game;
        * `verbs`: verbs understood by the the game;
        * `command_templates`: templates for commands understood by the the game;
        * `admissible_commands`: all commands relevant to the current state;

        In addition to the standard information, game specific information
        can be requested by appending corresponding strings to the `extras`
        attribute. For this competition, the possible extras are:

        * `'recipe'`: description of the cookbook;
        * `'walkthrough'`: one possible solution to the game (not guaranteed to be optimal);

        Example:
            Here is an example of how to request information and retrieve it.

            >>> from textworld import EnvInfos
            >>> request_infos = EnvInfos(description=True, inventory=True, extras=["recipe"])
            ...
            >>> env = gym.make(env_id)
            >>> ob, infos = env.reset()
            >>> print(infos["description"])
            >>> print(infos["inventory"])
            >>> print(infos["extra.recipe"])

        Notes:
            The following information *won't* be available at test time:

            * 'walkthrough'
        """
        request_infos = EnvInfos()
        request_infos.description = True
        request_infos.inventory = True
        request_infos.entities = True
        request_infos.verbs = True
        request_infos.max_score = True
        request_infos.has_won = True
        request_infos.has_lost = True
        request_infos.extras = ["recipe", "walkthrough"]
        return request_infos

    def init(self, obs: List[str], infos: Dict[str, List[Any]]):
        """
        Prepare the agent for the upcoming games.

        Arguments:
            obs: Previous command's feedback for each game.
            infos: Additional information for each game.
        """
        # reset agent, get vocabulary masks for verbs / adjectives / nouns
        self.scores = []
        self.dones = []
        self.prev_actions = ["" for _ in range(len(obs))]
        # get word masks
        batch_size = len(infos["verbs"])
        verbs_word_list = infos["verbs"]
        noun_word_list, adj_word_list = [], []
        for entities in infos["entities"]:
            tmp_nouns, tmp_adjs = [], []
            for name in entities:
                split = name.split()
                tmp_nouns.append(split[-1])
                if len(split) > 1:
                    tmp_adjs += split[:-1]
            noun_word_list.append(list(set(tmp_nouns)))
            adj_word_list.append(list(set(tmp_adjs)))

        verb_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        noun_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        adj_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        for i in range(batch_size):
            for w in verbs_word_list[i]:
                if w in self.word2id:
                    verb_mask[i][self.word2id[w]] = 1.0
            for w in noun_word_list[i]:
                if w in self.word2id:
                    noun_mask[i][self.word2id[w]] = 1.0
            for w in adj_word_list[i]:
                if w in self.word2id:
                    adj_mask[i][self.word2id[w]] = 1.0
        second_noun_mask = copy.copy(noun_mask)
        second_adj_mask = copy.copy(adj_mask)
        second_noun_mask[:, self.EOS_id] = 1.0
        adj_mask[:, self.EOS_id] = 1.0
        second_adj_mask[:, self.EOS_id] = 1.0
        self.word_masks_np = [verb_mask, adj_mask, noun_mask, second_adj_mask, second_noun_mask]

        self.cache_chosen_indices = None
        self.current_step = 0

    def get_game_step_info(self, obs: List[str], infos: Dict[str, List[Any]]):
        """
        Get all the available information, and concat them together to be tensor for
        a neural model. we use post padding here, all information are tokenized here.

        Arguments:
            obs: Previous command's feedback for each game.
            infos: Additional information for each game.
        """
#         print('inventory: ', len(infos['inventory']))
#         print('obs: ', len(obs))
#         print('recipees: ', len(infos['extra.recipe']))
#         print('descriptions: ', len(infos['description']))
#-----------------------------------------------------------------------------------------------
#         inventory_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["inventory"]]
#         inventory_id_list = [_words_to_ids(tokens, self.word2id) for tokens in inventory_token_list]

#         feedback_token_list = [preproc(item, str_type='feedback', tokenizer=self.nlp) for item in obs]
#         feedback_id_list = [_words_to_ids(tokens, self.word2id) for tokens in feedback_token_list]

#         quest_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["extra.recipe"]]
#         quest_id_list = [_words_to_ids(tokens, self.word2id) for tokens in quest_token_list]

#         prev_action_token_list = [preproc(item, tokenizer=self.nlp) for item in self.prev_actions]
#         prev_action_id_list = [_words_to_ids(tokens, self.word2id) for tokens in prev_action_token_list]

#         description_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["description"]]
#         for i, d in enumerate(description_token_list):
#             if len(d) == 0:
#                 description_token_list[i] = ["end"]  # if empty description, insert word "end"
#         description_id_list = [_words_to_ids(tokens, self.word2id) for tokens in description_token_list]
#         description_id_list = [_d + _i + _q + _f + _pa for (_d, _i, _q, _f, _pa) in zip(description_id_list, inventory_id_list, quest_id_list, feedback_id_list, prev_action_id_list)]

#         input_description = pad_sequences(description_id_list, maxlen=max_len(description_id_list)).astype('int32')
#         input_description = to_pt(input_description, self.use_cuda)
#-----------------------------------------------------------------------------------------------    
        sep = ' [SEP] '
        description_text_list = [_d + sep + _i + sep + _q + sep + _f + sep + _pa for (_d, _i, _q, _f, _pa) 
                                  in zip(infos['description'], infos['inventory'], infos['extra.recipe'], obs, self.prev_actions)]

        _, bert_ids, bert_mask  = convert_examples_to_features(description_text_list, self.model.tokenizer)
#         del inventory_token_list
#         del inventory_id_list
#         del feedback_token_list
#         del feedback_id_list
#         del quest_token_list
#         del quest_id_list
#         del prev_action_token_list
#         del prev_action_id_list
#         del description_token_list
#         del description_id_list
        del description_text_list
        
        return bert_ids, bert_mask

    def word_ids_to_commands(self, verb, adj, noun, adj_2, noun_2):
        """
        Turn the 5 indices into actual command strings.

        Arguments:
            verb: Index of the guessing verb in vocabulary
            adj: Index of the guessing adjective in vocabulary
            noun: Index of the guessing noun in vocabulary
            adj_2: Index of the second guessing adjective in vocabulary
            noun_2: Index of the second guessing noun in vocabulary
        """
        # turns 5 indices into actual command strings
        if self.word_vocab[verb] in self.single_word_verbs:
            return self.word_vocab[verb]
        if adj == self.EOS_id:
            res = self.word_vocab[verb] + " " + self.word_vocab[noun]
        else:
            res = self.word_vocab[verb] + " " + self.word_vocab[adj] + " " + self.word_vocab[noun]
        if self.word_vocab[verb] not in self.preposition_map:
            return res
        if noun_2 == self.EOS_id:
            return res
        prep = self.preposition_map[self.word_vocab[verb]]
        if adj_2 == self.EOS_id:
            res = res + " " + prep + " " + self.word_vocab[noun_2]
        else:
            res =  res + " " + prep + " " + self.word_vocab[adj_2] + " " + self.word_vocab[noun_2]
        return res

    def get_wordid_from_vocab(self, word):
      if word in self.word2id.keys():
        return self.word2id[word]
      else:
        return self.EOS_id
    
    def command_to_word_ids(self, cmd, batch_size):
      verb_id=self.EOS_id
      first_adj=self.EOS_id
      first_noun=self.EOS_id
      second_adj=self.EOS_id
      second_noun=self.EOS_id
      
#       print('cmd_to_ids')
#       print(cmd.split())
      ids = _words_to_ids(cmd.split(), self.word2id)
#       print(ids)
      for ind, i in enumerate(ids):
        if self.word_masks_np[0][0][i]==1.0:
          verb = ind
          verb_id = i
      nouns=[]
      for ind, i in enumerate(ids):
        if self.word_masks_np[2][0][i]==1.0:
          nouns.append((ind,i))
      if len(nouns) > 0:
        if nouns[0][0] != verb - 1:
          adj_ids = ids[verb + 1: nouns[0][0]]
          adj=''
          adj= ' '.join([self.word_vocab[x] for x in adj_ids]) 
#           print(adj)
          first_adj=self.get_wordid_from_vocab(adj)
#         print(nouns)
        first_noun=nouns[0][1]
      
      if len(nouns) > 1:
        if nouns[1][0] != nouns[0][0] - 1:
          adj_ids = ids[nouns[0][0]: nouns[1][0]]
          adj= ' '.join([self.word_vocab[x] for x in adj_ids]) 
          second_adj=self.get_wordid_from_vocab(adj)
        second_noun=nouns[1][1]
        
       
      list_ids = [verb_id, first_adj, first_noun, second_adj, second_noun]
      return [to_pt(np.array([[x]]*batch_size), self.use_cuda) for x in list_ids]  
      
    def get_chosen_strings(self, chosen_indices):
        """
        Turns list of word indices into actual command strings.

        Arguments:
            chosen_indices: Word indices chosen by model.
        """
        chosen_indices_np = [to_np(item)[:, 0] for item in chosen_indices]
        res_str = []
        batch_size = chosen_indices_np[0].shape[0]
        for i in range(batch_size):
          verb, adj, noun, adj_2, noun_2 = chosen_indices_np[0][i],\
                                           chosen_indices_np[1][i],\
                                           chosen_indices_np[2][i],\
                                           chosen_indices_np[3][i],\
                                           chosen_indices_np[4][i]
          res_str.append(self.word_ids_to_commands(verb, adj, noun, adj_2, noun_2))
          del verb
          del adj
          del noun
          del adj_2
          del noun_2
            
        del chosen_indices_np
        return res_str

    def choose_random_command(self, word_ranks, word_masks_np):
        """
        Generate a command randomly, for epsilon greedy.

        Arguments:
            word_ranks: Q values for each word by model.action_scorer.
            word_masks_np: Vocabulary masks for words depending on their type (verb, adj, noun).
        """
        batch_size = word_ranks[0].size(0)
        word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
        word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
        word_indices = []
        for i in range(len(word_ranks_np)):
          indices = []
          for j in range(batch_size):
              msk = word_masks_np[i][j]  # vocab
              indices.append(np.random.choice(len(msk), p=msk / np.sum(msk, -1)))
              del msk
          word_indices.append(np.array(indices))
          del indices
        # word_indices: list of batch
        word_qvalues = [[] for _ in word_masks_np]
        for i in range(batch_size):
            for j in range(len(word_qvalues)):
                word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])
        word_qvalues = [torch.stack(item) for item in word_qvalues]
        word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
        word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
        
        del word_ranks_np
        
        return word_qvalues, word_indices

#     def choose_random_command(self, word_ranks, word_masks_np):
#         batch_size = word_ranks[0].size(0)
#         word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
#         word_ranks_np = [r - np.min(r) for r in word_ranks_np] # minus the min value, so that all values are non-negative
#         kinda_epsilon = 0.1
#         random_ranks = np.random.normal(0, kinda_epsilon, word_ranks_np[0].shape) 
#         word_ranks_np = [r + random_ranks for r in word_ranks_np] # add noise      
#         word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
#         word_indices = [np.argmax(item, -1) for item in word_ranks_np]  # list of batch
#         word_qvalues = [[] for _ in word_masks_np]

#         for i in range(batch_size):
#             for j in range(len(word_qvalues)):
#                 word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])

#         word_qvalues = [torch.stack(item) for item in word_qvalues]
#         word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
#         word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
#         return word_qvalues, word_indices

    def choose_maxQ_command(self, word_ranks, word_masks_np):
        """
        Generate a command by maximum q values, for epsilon greedy.

        Arguments:
            word_ranks: Q values for each word by model.action_scorer.
            word_masks_np: Vocabulary masks for words depending on their type (verb, adj, noun).
        """
        batch_size = word_ranks[0].size(0)
        word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
        word_ranks_np = [r - np.min(r) for r in word_ranks_np] # minus the min value, so that all values are non-negative
        word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
        word_indices = [np.argmax(item, -1) for item in word_ranks_np]  # list of batch
        word_qvalues = [[] for _ in word_masks_np]
        for i in range(batch_size):
            for j in range(len(word_qvalues)):
                word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])
        word_qvalues = [torch.stack(item) for item in word_qvalues]
        word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
        word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
        
        del word_ranks_np
        
        return word_qvalues, word_indices

      
#     torch.Size([16, 227])
#     torch.Size([16, 227])
#     torch.Size([16, 768])
#     5
#     torch.Size([16, 20200])

# 16
# 5
# torch.Size([1, 20200])
    def get_ranks(self, bert_ids, bert_masks):
        """
        Given input description tensor, call model forward, to get Q values of words.

        Arguments:
            input_description: Input tensors, which include all the information chosen in
            select_additional_infos() concatenated together.
        """
#         word_ranks_arr = []
#         for x in range(len(bert_ids)):
#           bert_ids_single =  torch.tensor([bert_ids[x]], dtype=torch.long)
#           bert_masks_single = torch.tensor([bert_masks[x]], dtype=torch.long)
#           state_representation_single = self.model.representation_generator(bert_ids_single, bert_masks_single)
#           del bert_ids_single
#           del bert_masks_single
#           word_ranks_arr.append(self.model.action_scorer(state_representation_single))
# #           print(len(word_ranks_arr))
# #           print(len(word_ranks_arr[0]))
# #           print(word_ranks_arr[0][0].shape)
#           del state_representation_single
#         word_ranks = word_ranks_arr[0]
#         for x in range(len(word_ranks_arr) - 1):
#           for y in range(len(word_ranks_arr[x + 1])):
#             word_ranks[y] = torch.cat((word_ranks[y], word_ranks_arr[x + 1][y]), dim=0)
#         del word_ranks_arr
          
        bert_ids = torch.tensor([x for x in bert_ids], dtype=torch.long)
        bert_masks = torch.tensor([x for x in bert_masks], dtype=torch.long)
#         print(bert_ids.shape)
#         print(bert_masks.shape)
        state_representation = self.model.representation_generator(bert_ids, bert_masks)
#         print(state_representation.shape)
        del bert_ids
        del bert_masks
        
        word_ranks = self.model.action_scorer(state_representation)  # each element in list has batch x n_vocab size
#         print(len(word_ranks))
#         print(word_ranks[0].shape)
        del state_representation
        return word_ranks

    def act_eval(self, obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) -> List[str]:
        """
        Acts upon the current list of observations, during evaluation.

        One text command must be returned for each observation.

        Arguments:
            obs: Previous command's feedback for each game.
            score: The score obtained so far for each game (at previous step).
            done: Whether a game is finished (at previous step).
            infos: Additional information for each game.

        Returns:
            Text commands to be performed (one per observation).

        Notes:
            Commands returned for games marked as `done` have no effect.
            The states for finished games are simply copy over until all
            games are done, in which case `CustomAgent.finish()` is called
            instead.
        """

        if self.current_step > 0:
            # append scores / dones from previous step into memory
            self.scores.append(scores)
            self.dones.append(dones)

        if all(dones):
            self._end_episode(obs, scores, infos)
            return  # Nothing to return.

        bert_ids, bert_masks = self.get_game_step_info(obs, infos)
        word_ranks = self.get_ranks(bert_ids, bert_masks)  # list of batch x vocab
        
        del bert_ids
        del bert_masks
        
        _, word_indices_maxq = self.choose_maxQ_command(word_ranks, self.word_masks_np)

        chosen_indices = word_indices_maxq
        chosen_indices = [item.detach() for item in chosen_indices]
        chosen_strings = self.get_chosen_strings(chosen_indices)
        self.prev_actions = chosen_strings
        self.current_step += 1
        
        return chosen_strings

    def act(self, obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) -> List[str]:
        """
        Acts upon the current list of observations.

        One text command must be returned for each observation.

        Arguments:
            obs: Previous command's feedback for each game.
            score: The score obtained so far for each game (at previous step).
            done: Whether a game is finished (at previous step).
            infos: Additional information for each game.

        Returns:
            Text commands to be performed (one per observation).

        Notes:
            Commands returned for games marked as `done` have no effect.
            The states for finished games are simply copy over until all
            games are done, in which case `CustomAgent.finish()` is called
            instead.
        """
        if not self._epsiode_has_started:
            self._start_episode(obs, infos)

        if self.mode == "eval":
            return self.act_eval(obs, scores, dones, infos)

        if self.current_step > 0:
            # append scores / dones from previous step into memory
            self.scores.append(scores)
            self.dones.append(dones)
            # compute previous step's rewards and masks
            rewards_np, rewards, mask_np, mask = self.compute_reward()

        # Sample for noisy nets
        for i in range(len(self.model.action_scorers)):
            self.model.action_scorers[i].sample_noise()
            
        bert_ids, bert_masks = self.get_game_step_info(obs, infos)
        # generate commands for one game step, epsilon greedy is applied, i.e.,
        # there is epsilon of chance to generate random commands
        if self.imitate:
#           print('imitate')
          correct_cmd=infos['extra.walkthrough'][0][self.wt_index]
#           print(correct_cmd)
          if self.wt_index != len(infos['extra.walkthrough'][0]) - 1:
            self.wt_index+=1
          chosen_indices = self.command_to_word_ids(correct_cmd, len(bert_ids))
        else:
          word_ranks = self.get_ranks(bert_ids, bert_masks)  # list of batch x vocab

          _, word_indices_maxq = self.choose_maxQ_command(word_ranks, self.word_masks_np)
          _, word_indices_random = self.choose_random_command(word_ranks, self.word_masks_np)
          # random number for epsilon greedyupdate
          rand_num = np.random.uniform(low=0.0, high=1.0, size=(len(bert_ids), 1))
          less_than_epsilon = (rand_num < self.epsilon).astype("float32")  # batch
          greater_than_epsilon = 1.0 - less_than_epsilon
          less_than_epsilon = to_pt(less_than_epsilon, self.use_cuda, type='float')
          greater_than_epsilon = to_pt(greater_than_epsilon, self.use_cuda, type='float')
          less_than_epsilon, greater_than_epsilon = less_than_epsilon.long(), greater_than_epsilon.long()
#           print('Random_step: ',less_than_epsilon.tolist())
          chosen_indices = [
              less_than_epsilon * idx_random + greater_than_epsilon * idx_maxq
              for idx_random, idx_maxq in zip(word_indices_random, word_indices_maxq)
          ]
          chosen_indices = [item.detach() for item in chosen_indices]
        
        chosen_strings = self.get_chosen_strings(chosen_indices)
        self.prev_actions = chosen_strings
        

        # push info from previous game step into replay memory
        if self.current_step > 0:
            for b in range(len(obs)):
                if mask_np[b] == 0:
                    continue
                is_prior = rewards_np[b] > 0.0
                self.replay_memory.push(is_prior,*(self.cache_bert_ids[b],
                                        self.cache_bert_masks[b],
                                        [item[b] for item in self.cache_chosen_indices], 
                                        rewards[b], 
                                        mask[b], 
                                        dones[b],
                                        bert_ids[b],
                                        bert_masks[b],
                                        [item[b] for item in self.word_masks_np]))

        # cache new info in current game step into caches
        self.cache_chosen_indices = chosen_indices
        self.cache_bert_ids = bert_ids
        self.cache_bert_masks = bert_masks

        # update neural model by replaying snapshots in replay memory
        #fix update
        if self.current_step > 0 and self.current_step % self.update_per_k_game_steps == 0:
          loss = self.update()
          if loss is not None:
              # Backpropagate
              self.loss.append(to_np(loss).mean())
              self.optimizer.zero_grad()
              loss.backward(retain_graph=True)
              # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
              torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_grad_norm)
              self.optimizer.step()  # apply gradients
          
           

        self.current_step += 1

        if all(dones):
            self._end_episode(obs, scores, infos)
            return  # Nothing to return.
        return chosen_strings

    def compute_reward(self):
        """
        Compute rewards by agent. Note this is different from what the training/evaluation
        scripts do. Agent keeps track of scores and other game information for training purpose.

        """
        # mask = 1 if game is not finished or just finished at current step
        if len(self.dones) == 1:
            # it's not possible to finish a game at 0th step
            mask = [1.0 for _ in self.dones[-1]]
        else:
            assert len(self.dones) > 1
            mask = [1.0 if not self.dones[-2][i] else 0.0 for i in range(len(self.dones[-1]))]
        mask = np.array(mask, dtype='float32')
        mask_pt = to_pt(mask, self.use_cuda, type='float')
        # rewards returned by game engine are always accumulated value the
        # agent have recieved. so the reward it gets in the current game step
        # is the new value minus values at previous step.
        rewards = np.array(self.scores[-1], dtype='float32')  # batch
        if len(self.scores) > 1:
            prev_rewards = np.array(self.scores[-2], dtype='float32')
            rewards = rewards - prev_rewards
        rewards_pt = to_pt(rewards, self.use_cuda, type='float')

        return rewards, rewards_pt, mask, mask_pt

    def update(self):
        """
        Update neural model in agent. In this example we follow algorithm
        of updating model in dqn with replay memory.

        """
        if len(self.replay_memory) < self.replay_batch_size:
            return None
        transitions = self.replay_memory.sample(self.replay_batch_size)
        batch = Transition(*zip(*transitions))
        
        del transitions

        #pyt bert_ids and bert_masks
#         observation_id_list = pad_sequences(batch.observation_id_list, maxlen=max_len(batch.observation_id_list)).astype('int32')
#         print(observation_id_list)
#         input_observation = to_pt(observation_id_list, self.use_cuda)

#         bert_ids = torch.tensor(batch.bert_ids, dtype=torch.long).to(self.model.device)
#         bert_masks = torch.tensor(batch.bert_masks, dtype=torch.long).to(self.model.device)
        bert_ids = pad_sequences(batch.bert_ids, maxlen=max_len(batch.bert_ids)).astype('int32')
        bert_masks = pad_sequences(batch.bert_masks, maxlen=max_len(batch.bert_masks)).astype('int32')

#         next_observation_id_list = pad_sequences(batch.next_observation_id_list, maxlen=max_len(batch.next_observation_id_list)).astype('int32')
#         next_input_observation = to_pt(next_observation_id_list, self.use_cuda)

#         next_bert_ids = torch.tensor(batch.next_bert_ids, dtype=torch.long).to(self.model.device)
#         next_bert_masks = torch.tensor(batch.next_bert_masks, dtype=torch.long).to(self.model.device)

        next_bert_ids = pad_sequences(batch.next_bert_ids, maxlen=max_len(batch.next_bert_ids)).astype('int32')
        next_bert_masks = pad_sequences(batch.next_bert_masks, maxlen=max_len(batch.next_bert_masks)).astype('int32')

        chosen_indices = list(list(zip(*batch.word_indices)))
        chosen_indices = [torch.stack(item, 0) for item in chosen_indices]  # list of batch x 1

        word_ranks = self.get_ranks(bert_ids, bert_masks)  # list of batch x vocab
        
        del bert_ids
        del bert_masks
        
        word_qvalues = [w_rank.gather(1, idx).squeeze(-1) for w_rank, idx in zip(word_ranks, chosen_indices)]  # list of batch
        
        del chosen_indices
        del word_ranks
        
        q_value = torch.mean(torch.stack(word_qvalues, -1), -1)  # batch
        del word_qvalues

        next_word_ranks = self.get_ranks(next_bert_ids, next_bert_masks)  # batch x n_verb, batch x n_noun, batchx n_second_noun
        del next_bert_ids
        del next_bert_masks
        
        next_word_masks = list(list(zip(*batch.next_word_masks)))
        next_word_masks = [np.stack(item, 0) for item in next_word_masks]
        next_word_qvalues, _ = self.choose_maxQ_command(next_word_ranks, next_word_masks)
        del next_word_masks
        del next_word_ranks
        
        next_q_value = torch.mean(torch.stack(next_word_qvalues, -1), -1)  # batch
        next_q_value = next_q_value.detach()

        rewards = torch.stack(batch.reward)  # batch
        not_done = 1.0 - np.array(batch.done, dtype='float32')  # batch
        not_done = to_pt(not_done, self.use_cuda, type='float')
        rewards = rewards + not_done * next_q_value * self.discount_gamma  # batch
        del not_done
        
        mask = torch.stack(batch.mask)  # batch
        loss = F.smooth_l1_loss(q_value * mask, rewards * mask)
        
        del q_value
        del mask
        del rewards
        del batch
        
        return loss

    def finish(self) -> None:
        """
        All games in the batch are finished. One can choose to save checkpoints,
        evaluate on validation set, or do parameter annealing here.
        """
        # Game has finished (either win, lose, or exhausted all the given steps).
        self.final_rewards = np.array(self.scores[-1], dtype='float32')  # batch
        dones = []
        for d in self.dones:
            d = np.array([float(dd) for dd in d], dtype='float32')
            dones.append(d)
        dones = np.array(dones)
        step_used = 1.0 - dones
        self.step_used_before_done = np.sum(step_used, 0)  # batch

        self.history_avg_scores.push(np.mean(self.final_rewards))
        # save checkpoint
#         print(self.mode)
#         print(self.current_episode)
#         print(self.save_frequency)
        if self.mode == "train" and self.current_episode % self.save_frequency == 0:
            avg_score = self.history_avg_scores.get_avg()
            if avg_score > self.best_avg_score_so_far:
              self.best_avg_score_so_far = avg_score

              save_to = self.model_checkpoint_path + '/' + self.experiment_tag + "_episode_" + str(self.current_episode) + ".pt"
              if not os.path.isdir(self.model_checkpoint_path):
                  os.mkdir(self.model_checkpoint_path)
              torch.save(self.model.state_dict(), save_to)
              print("\n========= saved checkpoint =========")

        self.current_episode += 1
        # annealing
        if self.current_episode < self.epsilon_anneal_episodes:
            self.epsilon -= (self.epsilon_anneal_from - self.epsilon_anneal_to) / float(self.epsilon_anneal_episodes)
      
    def get_mean_loss(self):
      mean_loss = 0.
      if len(self.loss) != 0:   
          mean_loss = sum(self.loss) / len(self.loss)
      self.loss = []
      return mean_loss

## Setup configs

### Vocab
Upload vocab.txt file

In [0]:
from google.colab import files

if not os.path.isfile('./vocab.txt'):
    uploaded = files.upload()
    # Upload vocab.txt
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
else:
    print("Vocab already uploaded!")

Vocab already uploaded!


In [0]:
!head vocab.txt

!
"
#
$
%
&
'
'a
'd
'll


### Config

In [0]:
#enough memory first try 3 games : epochs > 8
# with open('./config.yaml', 'w') as config:
#     config.write("""general:
#   discount_gamma: 0.5
#   random_seed: 42
#   use_cuda: True  # disable this when running on machine without cuda

#   # replay memory
#   replay_memory_capacity: 100000  # adjust this depending on your RAM size
#   replay_memory_priority_fraction: 0.25
#   update_per_k_game_steps: 20
#   replay_batch_size: 4

#   # epsilon greedy
#   epsilon_anneal_episodes: 300  # -1 if not annealing
#   epsilon_anneal_from: 1.0
#   epsilon_anneal_to: 0.2

# checkpoint:
#   experiment_tag: 'starting-kit'
#   model_checkpoint_path: './saved_models'
#   load_pretrained: False  # during test, enable this so that the agent load your pretrained model
#   pretrained_experiment_tag: 'starting-kit'
#   save_frequency: 100

# training:
#   batch_size: 1
#   nb_epochs: 100
#   max_nb_steps_per_episode: 100  # after this many steps, a game is terminated
#   optimizer:
#     step_rule: 'adam'  # adam
#     learning_rate: 0.001
#     clip_grad_norm: 5

# model:
#   embedding_size: 32
#   encoder_rnn_hidden_size: [32]
#   action_scorer_hidden_dim: 16
#   dropout_between_rnn_layers: 0.
#   bert_model: 'bert-base-uncased'
#   layer_index: 11
# """)

In [0]:
# second try 3 games 7-8 epochs max
# with open('./config.yaml', 'w') as config:
#     config.write("""general:
#   discount_gamma: 0.7
#   random_seed: 42
#   use_cuda: True  # disable this when running on machine without cuda

#   # replay memory
#   replay_memory_capacity: 100000  # adjust this depending on your RAM size
#   replay_memory_priority_fraction: 0.25
#   update_per_k_game_steps: 4
#   replay_batch_size: 4

#   # epsilon greedy
#   epsilon_anneal_episodes: 60  # -1 if not annealing
#   epsilon_anneal_from: 1.0
#   epsilon_anneal_to: 0.2

# checkpoint:
#   experiment_tag: 'starting-kit'
#   model_checkpoint_path: '/gdrive/My Drive/saved_models'
#   load_pretrained: False  # during test, enable this so that the agent load your pretrained model
#   pretrained_experiment_tag: 'starting-kit'
#   save_frequency: 200

# training:
#   batch_size: 1
#   nb_epochs: 100
#   max_nb_steps_per_episode: 100  # after this many steps, a game is terminated
#   optimizer:
#     step_rule: 'adam'  # adam
#     learning_rate: 0.001
#     clip_grad_norm: 5

# model:
#   embedding_size: 64
#   encoder_rnn_hidden_size: [64]
#   action_scorer_hidden_dim: 32
#   dropout_between_rnn_layers: 0.
#   bert_model: 'bert-base-uncased'
#   layer_index: 11
# """)

In [0]:
# 6 games more than 20 epochs
# with open('./config.yaml', 'w') as config:
#     config.write("""general:
#   discount_gamma: 0.7
#   random_seed: 42
#   use_cuda: True  # disable this when running on machine without cuda

#   # replay memory
#   replay_memory_capacity: 100000  # adjust this depending on your RAM size
#   replay_memory_priority_fraction: 0.25
#   update_per_k_game_steps: 4
#   replay_batch_size: 4

#   # epsilon greedy
#   epsilon_anneal_episodes: 60  # -1 if not annealing
#   epsilon_anneal_from: 1.0
#   epsilon_anneal_to: 0.2

# checkpoint:
#   experiment_tag: 'starting-kit'
#   model_checkpoint_path: '/gdrive/My Drive/saved_models'
#   load_pretrained: False  # during test, enable this so that the agent load your pretrained model
#   pretrained_experiment_tag: 'starting-kit'
#   save_frequency: 200

# training:
#   batch_size: 1
#   nb_epochs: 100
#   max_nb_steps_per_episode: 100  # after this many steps, a game is terminated
#   optimizer:
#     step_rule: 'adam'  # adam
#     learning_rate: 0.001
#     clip_grad_norm: 5

# model:
#   embedding_size: 64
#   encoder_rnn_hidden_size: [64]
#   action_scorer_hidden_dim: 32
#   dropout_between_rnn_layers: 0.
#   bert_model: 'bert-base-uncased'
#   layer_index: 11
# """)

In [0]:
with open('./config.yaml', 'w') as config:
    config.write("""general:
  discount_gamma: 0.7
  random_seed: 42
  use_cuda: True  # disable this when running on machine without cuda

  # replay memory
  replay_memory_capacity: 100000  # adjust this depending on your RAM size
  replay_memory_priority_fraction: 0.5
  update_per_k_game_steps: 8
  replay_batch_size: 24

  # epsilon greedy
  epsilon_anneal_episodes: 200  # -1 if not annealing
  epsilon_anneal_from: 1
  epsilon_anneal_to: 0.2

checkpoint:
  experiment_tag: 'bert-dqn-noisy-networks-large'
  model_checkpoint_path: '/gdrive/My Drive/TextWorld/trained_models'
  load_pretrained: True  # during test, enable this so that the agent load your pretrained model
  pretrained_experiment_tag: 'bert-dqn-noisy-networks-large_episode_500'
  save_frequency: 100

training:
  batch_size: 10
  nb_epochs: 100
  max_nb_steps_per_episode: 100  # after this many steps, a game is terminated
  optimizer:
    step_rule: 'adam'  # adam
    learning_rate: 0.001
    clip_grad_norm: 5

model:
  noisy_std: 0.3
  embedding_size: 192
  encoder_rnn_hidden_size: [384]
  action_scorer_hidden_dim: 128
  dropout_between_rnn_layers: 0.
  bert_model: 'bert-base-uncased'
  train_bert: False
  layer_index: 11
""")

### Mount drive to load games

Notebook takes sample games from google drive(requires authentication).

To train the agent with games, upload archive with them in google drive and fix the path to the archive inside drive below.



In [0]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [0]:
# home_dir = '/gdrive/My Drive/Masters/TextWorld/'

In [0]:
# path_to_sample_games = home_dir + 'sample_games'

In [0]:
path_to_sample_games = '/gdrive/My Drive/TextWorld/sample_games'

In [0]:
def select_additional_infos() -> EnvInfos:
    """
    Returns what additional information should be made available at each game step.

    Requested information will be included within the `infos` dictionary
    passed to `CustomAgent.act()`. To request specific information, create a
    :py:class:`textworld.EnvInfos <textworld.envs.wrappers.filter.EnvInfos>`
    and set the appropriate attributes to `True`. The possible choices are:

    * `description`: text description of the current room, i.e. output of the `look` command;
    * `inventory`: text listing of the player's inventory, i.e. output of the `inventory` command;
    * `max_score`: maximum reachable score of the game;
    * `objective`: objective of the game described in text;
    * `entities`: names of all entities in the game;
    * `verbs`: verbs understood by the the game;
    * `command_templates`: templates for commands understood by the the game;
    * `admissible_commands`: all commands relevant to the current state;

    In addition to the standard information, game specific information
    can be requested by appending corresponding strings to the `extras`
    attribute. For this competition, the possible extras are:

    * `'recipe'`: description of the cookbook;
    * `'walkt

    hrough'`: one possible solution to the game (not guaranteed to be optimal);

    Example:
        Here is an example of how to request information and retrieve it.

        >>> from textworld import EnvInfos
        >>> request_infos = EnvInfos(description=True, inventory=True, extras=["recipe"])
        ...
        >>> env = gym.make(env_id)
        >>> ob, infos = env.reset()
        >>> print(infos["description"])
        >>> print(infos["inventory"])
        >>> print(infos["extra.recipe"])

    Notes:
        The following information *won't* be available at test time:

        * 'walkthrough'
    """
    request_infos = EnvInfos()
    request_infos.description = True
    request_infos.inventory = True
    request_infos.entities = True
    request_infos.verbs = True
    request_infos.max_score = True
    request_infos.has_won = True
    request_infos.has_lost = True
    request_infos.extras = ["recipe", "walkthrough"]
    return request_infos
  
def gather_entites(game_files):
    requested_infos = select_additional_infos()
    _validate_requested_infos(requested_infos)

    env_id = textworld.gym.register_games(game_files, requested_infos,
                                          max_episode_steps=1,
                                          name="training")

    env_id = textworld.gym.make_batch(env_id, batch_size=1, parallel=True)
    env = gym.make(env_id)
    game_range = range(len(game_files))
    entities = set()
    verbs = set()
    for game_no in game_range:
        obs, infos = env.reset()
        env.skip()
        entities |= { a for i in infos['entities'] for a in i }
        verbs |= { a for i in infos['verbs'] for a in i }
    return list(entities), list(verbs)
    
    
def make_vocab(games):
    entities, verbs = gather_entites(games)
    
    with open("./vocab.txt") as f:
        word_vocab = f.read().split("\n")
        
    #########################
    batch_size = len(verbs)
    verbs_word_list = verbs
    noun_word_list, adj_word_list = [], []
    tmp_nouns, tmp_adjs = [], []
    for name in entities:
        split = name.split()
        tmp_nouns.append(split[-1])
        if len(split) > 1:
            tmp_adjs.append(" ".join(split[:-1]))
    noun_word_list = list(set(tmp_nouns))
    adj_word_list = list(set(tmp_adjs))

    word2id = { word: idx for idx, word in enumerate(word_vocab) }
    
    verb_mask = np.zeros((len(word_vocab),), dtype="float32")
    noun_mask = np.zeros((len(word_vocab),), dtype="float32")
    adj_mask = np.zeros((len(word_vocab),), dtype="float32")
  
    for w in verbs_word_list:
        if w in word2id:
            verb_mask[word2id[w]] = 1.0
    for w in noun_word_list:
        if w in word2id:
            noun_mask[word2id[w]] = 1.0
    for w in adj_word_list:
        if w in word2id:
            adj_mask[word2id[w]] = 1.0

    second_noun_mask = copy.copy(noun_mask)
    second_adj_mask = copy.copy(adj_mask)
    second_noun_mask[:, self.EOS_id] = 1.0
    adj_mask[:, self.EOS_id] = 1.0
    second_adj_mask[:, self.EOS_id] = 1.0
    word_masks_np = [verb_mask, adj_mask, noun_mask, second_adj_mask, second_noun_mask]
    
    return word_vocab, word2id, word_masks_np

In [0]:
# List of additional information available during evaluation.
AVAILABLE_INFORMATION = EnvInfos(
    description=True, inventory=True,
    max_score=True, objective=True, entities=True, verbs=True,
    command_templates=True, admissible_commands=True,
    has_won=True, has_lost=True,
    extras=["recipe"]
)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def _validate_requested_infos(infos: EnvInfos):
    msg = "The following information cannot be requested: {}"
    for key in infos.basics:
        if not getattr(AVAILABLE_INFORMATION, key):
            raise ValueError(msg.format(key))

    for key in infos.extras:
        if key not in AVAILABLE_INFORMATION.extras:
            raise ValueError(msg.format(key))
            
def get_index(game_no, stats):
    return "G{}_{}".format(game_no, stats)
  
def get_game_id(game_info):
    return hash((tuple(game_info['entities'][0]), game_info['extra.recipe'][0]))
  
def make_stats(count_games):
    stats_cols = [ "scores", "steps", "loss", "max_score", "outcomes", "eps", "state"]
    stats = {}
    for col in stats_cols:
        stats[col] = [0] * count_games
    return stats  

def save_to_csv(epoch_no, stats_df, score_mean, states, loss, eps, col):
  filename = '/gdrive/My Drive/stats/TextWorld/game_' + str(col) + '.csv'
  log_df = pd.DataFrame(columns=['epoch','score', 'steps', 'loss', 'eps', 'state'])
  log_df.loc[0,'epoch'] = epoch_no
  log_df.loc[0,'score'] = score_mean
  log_df.loc[0,'steps'] = stats_df[get_index(col, 'st')]['avr']
  log_df.loc[0,'loss'] = loss[col]
  log_df.loc[0,'eps'] = eps[col]
  log_df.loc[0, 'state'] = states[col]
  if not os.path.isfile(filename):
     log_df.to_csv(filename, header=['epoch','score', 'steps', 'loss', 'eps', 'state'])
  else: # else it exists so append without writing the header
     log_df.to_csv(filename, mode='a', header=False)
            
def print_epoch_stats(epoch_no, stats):
    print("\n\nEpoch: {:3d}".format(epoch_no))
    steps, scores, loss, states = stats["steps"], stats["scores"], stats["loss"], stats["state"]
    max_scores, outcomes = stats["max_score"], stats["outcomes"]
    games_cnt, parallel_cnt = len(steps), len(steps[0])
    columns = [ get_index(col, st) for col in range(games_cnt) for st in ['st', 'sc']]
    stats_df = pd.DataFrame(index=list(range(parallel_cnt)) + ["avr", "loss"], columns=columns)
        
    for col in range(games_cnt):
        for row in range(parallel_cnt):
          outcome = outcomes[col][row]
          outcome = outcome > 0 and "W" or outcome < 0 and "L" or ""
          stats_df[get_index(col, 'st')][row] = steps[col][row]
          stats_df[get_index(col, 'sc')][row] = outcome + " " + str(scores[col][row])
        score_mean = np.mean(scores[col])
        stats_df[get_index(col, 'sc')]['avr'] = "{}/{}".format(score_mean, max_scores[col])
        stats_df[get_index(col, 'st')]['avr'] = stats_df[get_index(col, 'st')].mean()
        stats_df[get_index(col, 'sc')]['loss'] = "{:.5f}".format(loss[col])
        stats_df[get_index(col, 'st')]['loss'] = states[col]
        
        save_to_csv(epoch_no, stats_df, score_mean, states, loss, stats['eps'], col)
    print(stats_df)

def train(game_files):
    requested_infos = select_additional_infos()
    
    agent = CustomAgent()
    env_id = textworld.gym.register_games(game_files, requested_infos,
                                          max_episode_steps=agent.max_nb_steps_per_episode,
                                          name="training")

    env_id = textworld.gym.make_batch(env_id, batch_size=agent.batch_size, parallel=True)
    print("ENVID: {}".format(env_id))

    print("Making {} parallel environments to train on them\n".format(agent.batch_size))
    env = gym.make(env_id)
    count_games = len(game_files)
    games_ids = {}
    for epoch_no in range(1, agent.nb_epochs + 1):
        stats = make_stats(count_games)
        idx = 0
        for game_no in tqdm(range(count_games)):
            obs, infos = env.reset()
            game_id = get_game_id(infos)
            if epoch_no == 1:
                games_ids[game_id] = idx
                idx += 1
            real_id = games_ids[game_id]
            stats["max_score"][real_id] = infos['max_score'][0]
            
            imitate = random.random() > 1.0
            agent.train(imitate)

            scores = [0] * len(obs) 
            dones = [False] * len(obs)
            steps = [0] * len(obs)
            while not all(dones):
                # Increase step counts.
                steps = [step + int(not done) for step, done in zip(steps, dones)]
                commands = agent.act(obs, scores, dones, infos)
                obs, scores, dones, infos = env.step(commands)

            # Let the agent knows the game is done.
            agent.act(obs, scores, dones, infos)

            stats["scores"][real_id] = scores
            stats["steps"][real_id] = steps
            stats["eps"][real_id] = agent.epsilon
            stats["loss"][real_id] = agent.get_mean_loss()
            stats["state"][real_id] = ["imitate" if imitate else "agent"]
            stats["outcomes"][real_id] = [ w-l for w, l in zip(infos['has_won'], infos['has_lost'])]
        
        print_epoch_stats(epoch_no, stats)
    torch.save(agent.model, './agent_model.pt')
    return
          

# game_dir = path_to_sample_games
# games = []
# if os.path.isdir(game_dir):
#     games += glob.glob(os.path.join(game_dir, "*.ulx"))
# print("{} games found for training.".format(len(games)))

# if len(games) != 0:
#     train(games)

## Eval

In [0]:
# COMMENT TRAIN
# FIX LOADING MODEL PATH
# FIX BATCH_SIZE TO BE 10


def eval_games(game_files):
  requested_infos = select_additional_infos()

  agent = CustomAgent()
  
  env_id = textworld.gym.register_games(game_files, requested_infos,
                                        max_episode_steps=agent.max_nb_steps_per_episode,
                                        name="eval")

  env_id = textworld.gym.make_batch(env_id, batch_size=10, parallel=True)
  print("ENVID: {}".format(env_id))

  print("Making {} parallel environments to eval on them\n".format(agent.batch_size))
  env = gym.make(env_id)
  count_games = len(game_files)
  games_ids = {}

  stats = make_stats(count_games)
  score_sum = 0
  steps_sum = 0
  steps_length = count_games*10
  for game_no in tqdm(range(count_games)):
      obs, infos = env.reset()

      agent.eval()

      scores = [0] * len(obs) 
      dones = [False] * len(obs)
      steps = [0] * len(obs)
      while not all(dones):
          # Increase step counts.
          steps = [step + int(not done) for step, done in zip(steps, dones)]
          commands = agent.act(obs, scores, dones, infos)
          obs, scores, dones, infos = env.step(commands)

      # Let the agent knows the game is done.
      agent.act(obs, scores, dones, infos)
      score_sum += sum(scores)
      steps_sum += sum(steps)
      
  print('Max score: ', score_sum)
  print('Mean steps: ', steps_sum / steps_length)

game_dir = path_to_sample_games
games = []
if os.path.isdir(game_dir):
    games += glob.glob(os.path.join(game_dir, "*.ulx"))
print("{} games found for training.".format(len(games)))

if len(games) != 0:
  eval_games(games)

10 games found for training.
total number of parameters: 135648992
number of trainable parameters: 26166752
loading model from /gdrive/My Drive/TextWorld/trained_models/bert-dqn-noisy-networks-large_episode_500.pt

ENVID: batch10-tw-eval-v0
Making 10 parallel environments to eval on them



  result = entry_point.load(False)
100%|██████████| 10/10 [04:02<00:00, 24.63s/it]

Max score:  90
Mean steps:  100.0





tw-eval-v0 closed


Process Process-10:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/textworld/gym/envs/batch_env.py", line 20, in _child
    command = pipe.recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError


tw-eval-v0 closed


Process Process-9:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/textworld/gym/envs/batch_env.py", line 20, in _child
    command = pipe.recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError


tw-eval-v0 closed


Process Process-8:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
