# TextWorld starting kit notebook

Model: *LSTM-DQN*

On the first run:
 1. Run the first two code cells (the `pip` installations).
 2. Restart the runtime.
 3. Continue with the remaining cells.

The restart is needed because **textworld** and **colab** depend on different, incompatible versions of **prompt-toolkit**.
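
After the restart, it can help to confirm which **prompt-toolkit** version the runtime actually picked up. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper names `installed_version` and `major_version` are illustrative and not part of the notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Return the leading integer of a dotted version string, e.g. '2.0.9' -> 2."""
    return int(version.split(".")[0])


version = installed_version("prompt-toolkit")
if version is None:
    print("prompt-toolkit is not installed")
else:
    print("prompt-toolkit {} (major version {})".format(version, major_version(version)))
```

If the printed version is not the one pinned by the second install cell, the runtime has most likely not been restarted yet.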

## Todo
### RL:
* [x] Prioritized Replay Memory
* [x] [N-step DQN](https://www.groundai.com/project/understanding-multi-step-deep-reinforcement-learning-a-systematic-study-of-the-dqn-target/)
* [x] [Fixed Q-targets](https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/)
* [x] [Double DQN](https://towardsdatascience.com/deep-double-q-learning-7fca410b193a)
* [ ] Dueling DQN
* [ ] Multiple inputs (description, inventory, quests, etc)
* [x] Replay memory sampling: when either the alpha or the beta priority pool is empty, draw the whole batch from the other pool.
* [ ] Noisy nets
* [ ] DRQN ?
* [ ] [Rainbow Paper](https://arxiv.org/pdf/1710.02298.pdf) ?
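
Several of the items above (N-step DQN, fixed Q-targets, Double DQN) meet in how the learning target is computed. The following is a minimal pure-Python sketch of that target for a single transition, not the notebook's implementation; the function names and the plain-list representation of Q-values are assumptions made for illustration.

```python
def n_step_return(rewards, gamma):
    """Discounted sum r_0 + gamma * r_1 + ... over the buffered rewards."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    """N-step Double DQN target for one transition.

    rewards:       the n rewards collected since the stored state
    q_online_next: Q-values of the online network in the state reached after n steps
    q_target_next: Q-values of the (fixed) target network in that same state
    done:          True if the episode ended within the n steps
    """
    target = n_step_return(rewards, gamma)
    if done:
        return target  # no bootstrapping past a terminal state
    # Double DQN: the online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates it.
    return target + (gamma ** len(rewards)) * q_target_next[best_action]


print(double_dqn_target([1.0, 0.0, 2.0], 0.5, [0.1, 0.9, 0.3], [5.0, 4.0, 6.0], done=False))
```

Selecting the action with one network and evaluating it with the other is what distinguishes Double DQN from taking the maximum of the target network's Q-values directly.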

### NLP:
* [x] Normalize tokens
* [x] Add l2 regularization
* [ ] Remove apostrophes from descriptions
* [x] Two word adjectives

### Game Env:
* [x] Train with simple games in starting kit
* [ ] Check game generation and complexity
* [ ] Make many simple (easy) games, each of which teaches a different skill.
* [ ] Train first on simple games, then progressively on more complex ones.

### Debugging:
* [x] Extended log for checking steps and scores on every epoch
* [ ] Graphs with speed, reward, epsilon movement, loss function, episode length (use Tensorboard or similar)
* [ ] Model comparison

### Colab:
* [x] Export model parameters to Drive
* [x] Export replay memory to Drive


## Setup

In [0]:
!pip install textworld

Collecting prompt-toolkit<2.1.0,>=2.0.0 (from textworld)
  Using cached https://files.pythonhosted.org/packages/f7/a7/9b1dd14ef45345f186ef69d175bdd2491c40ab1dfa4b2b3e4352df719ed7/prompt_toolkit-2.0.9-py3-none-any.whl
ERROR: ipython 5.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.4, but you'll have prompt-toolkit 2.0.9 which is incompatible.
Installing collected packages: prompt-toolkit
  Found existing installation: prompt-toolkit 1.0.16
    Uninstalling prompt-toolkit-1.0.16:
      Successfully uninstalled prompt-toolkit-1.0.16
Successfully installed prompt-toolkit-2.0.9


In [0]:
!pip install prompt-toolkit==1.0.16

Collecting prompt-toolkit==1.0.16
  Using cached https://files.pythonhosted.org/packages/57/a8/a151b6c61718eabe6b4672b6aa760b734989316d62ec1ba4996765e602d4/prompt_toolkit-1.0.16-py3-none-any.whl
ERROR: textworld 1.1.1 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.16 which is incompatible.
ERROR: jupyter-console 6.0.0 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.16 which is incompatible.
Installing collected packages: prompt-toolkit
  Found existing installation: prompt-toolkit 2.0.9
    Uninstalling prompt-toolkit-2.0.9:
      Successfully uninstalled prompt-toolkit-2.0.9
Successfully installed prompt-toolkit-1.0.16


In [0]:
!pip install -U -q PyDrive

In [0]:
![[ ! -f './glove.6B.zip' ]] && wget 'http://nlp.stanford.edu/data/glove.6B.zip' && unzip './glove.6B.zip'

In [0]:
!head glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353
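
Each line of the GloVe file shown above is a token followed by its space-separated vector components. As a small illustration of that layout (the notebook itself loads the file with pandas), a single line could be parsed as follows; `parse_glove_line` is a hypothetical helper name.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (word, list of float components)."""
    parts = line.rstrip("\n").split(" ")
    word = parts[0]
    vector = [float(value) for value in parts[1:]]
    return word, vector


word, vector = parse_glove_line("the 0.418 0.24968 -0.41242\n")
print(word, len(vector))
```

Splitting on a single space rather than on arbitrary whitespace keeps tokens such as `,` and `.` intact as the first field.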

In [0]:
import os
import random
import logging
import yaml
import math
import copy
import spacy
import numpy as np
import glob
import pickle
import re
import csv

from tqdm import tqdm
from typing import List, Dict, Any
from collections import namedtuple
import pandas as pd

import torch
import torch.nn.functional as F

import gym
import textworld.gym
from textworld import EnvInfos

torch.cuda.is_available()

True

## Generic functions

In [0]:
def to_np(x):
    if isinstance(x, np.ndarray):
        return x
    return x.data.cpu().numpy()


def to_pt(np_matrix, enable_cuda=False, type='long'):
    if type == 'long':
        if enable_cuda:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.LongTensor).cuda())
        else:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.LongTensor))
    elif type == 'float':
        if enable_cuda:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.FloatTensor).cuda())
        else:
            return torch.autograd.Variable(torch.from_numpy(np_matrix).type(torch.FloatTensor))


def _words_to_ids(words, word2id):
    ids = []
    for word in words:
        try:
            ids.append(word2id[word])
        except KeyError:
            ids.append(1)
    return ids


def preproc(s, str_type='None', tokenizer=None, lower_case=True):
    if s is None:
        return ["nothing"]
    s = s.replace("\n", ' ')
    if s.strip() == "":
        return ["nothing"]
    if str_type == 'feedback':
        if "$$$$$$$" in s:
            s = ""
        if "-=" in s:
            s = s.split("-=")[0]
    s = s.strip()
    if len(s) == 0:
        return ["nothing"]
    tokens = [t.text.replace("'", "") for t in tokenizer(s)]
    # NORMALIZE WORDS
    #tokens = [t.norm_ for t in tokenizer(s)]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    return tokens + ["<|>"]


def max_len(list_of_list):
    return max(map(len, list_of_list))


def pad_sequences(sequences, maxlen=None, dtype='int32', value=0.):
    '''
    Partially borrowed from Keras
    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        value: float, value to pad the sequences to the desired value.
    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen)
    '''
    lengths = [len(s) for s in sequences]
    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)
    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break
    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        # pre truncating
        trunc = s[-maxlen:]
        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))
        # post padding
        x[idx, :len(trunc)] = trunc
    return x

def load_glove_embeddings(path, word2idx, embedding_dim, enable_cuda=False):
    emb_df = pd.read_csv(path, sep=" ", index_col=0, 
                         header=None, quoting=csv.QUOTE_NONE)
    
    embeddings = np.zeros((len(word2idx), embedding_dim))
    for word, weights in emb_df.iterrows():
        index = word2idx.get(word, None)
        if index is not None:
            embeddings[index] = weights.values
    
    if enable_cuda:
        return torch.from_numpy(embeddings).float().cuda()
    return torch.from_numpy(embeddings).float()
    
def freeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = False
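
The helpers above turn tokens into ids (unknown words map to id 1) and pad the resulting sequences with 0 at the end. The following is a simplified, dependency-free sketch of that pipeline; `words_to_ids`, `pad_batch` and the toy vocabulary are illustrative assumptions, and truncation and dtype handling from `pad_sequences` are left out.

```python
PAD_ID = 0  # id reserved for padding (the Embedding layer uses padding_idx=0)
UNK_ID = 1  # id used for out-of-vocabulary words


def words_to_ids(words, word2id):
    """Map each token to its id, falling back to UNK_ID for unknown tokens."""
    return [word2id.get(word, UNK_ID) for word in words]


def pad_batch(batch, pad_id=PAD_ID):
    """Post-pad every sequence with pad_id up to the longest sequence in the batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


# Toy vocabulary, for illustration only.
word2id = {"<pad>": 0, "<unk>": 1, "open": 2, "door": 3}
batch = [words_to_ids(["open", "the", "door"], word2id),
         words_to_ids(["open"], word2id)]
print(pad_batch(batch))
```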

## Layers

In [0]:
def masked_mean(x, m=None, dim=-1):
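    """
        Mean pooling over time that ignores padded positions.
        input:  tensor: batch x time x h
                mask:   batch x time (1 for real tokens, 0 for padding)
        output: tensor: batch x h
    """
    if m is None:
        return torch.mean(x, dim=dim)
    mask_sum = torch.sum(m, dim=-1)  # batch
    # zero out padded steps before summing, so the result does not rely on x being pre-masked
    res = torch.sum(x * m.unsqueeze(-1), dim=1)  # batch x h
    res = res / (mask_sum.unsqueeze(-1) + 1e-6)
    return res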


class Embedding(torch.nn.Module):
    '''
    inputs: x:          batch x seq (x is post-padded by 0s)
    outputs:embedding:  batch x seq x emb
            mask:       batch x seq
    '''

    def __init__(self, embedding_size, vocab_size, enable_cuda=False):
        super(Embedding, self).__init__()
        self.embedding_size = embedding_size
        self.vocab_size = vocab_size
        self.enable_cuda = enable_cuda
        self.embedding_layer = torch.nn.Embedding(self.vocab_size, self.embedding_size, padding_idx=0)
        
    def set_weights(self, weights):
        self.embedding_layer.weight = torch.nn.Parameter(weights)

    def compute_mask(self, x):
        mask = torch.ne(x, 0).float()
        if self.enable_cuda:
            mask = mask.cuda()
        return mask

    def forward(self, x):
        embeddings = self.embedding_layer(x)  # batch x time x emb
        mask = self.compute_mask(x)  # batch x time
        return embeddings, mask
        

class FastUniLSTM(torch.nn.Module):
    """
    Adapted from https://github.com/facebookresearch/DrQA/
    now supports:   different rnn size for each layer
                    all zero rows in batch (from time distributed layer, by reshaping certain dimension)
    """

    def __init__(self, ninp, nhids, bidir=False, dropout_between_rnn_layers=0.):
        super(FastUniLSTM, self).__init__()
        self.ninp = ninp
        self.nhids = nhids
        self.nlayers = len(self.nhids)
        self.dropout_between_rnn_layers = dropout_between_rnn_layers
        self.stack_rnns(bidir)

    def stack_rnns(self, bidir):
        rnns = [torch.nn.LSTM(self.ninp if i == 0 else self.nhids[i - 1],
                              self.nhids[i],
                              num_layers=1,
                              bidirectional=bidir) for i in range(self.nlayers)]
            
        self.rnns = torch.nn.ModuleList(rnns)

    def forward(self, x, mask):

        def pad_(tensor, n):
            if n > 0:
                zero_pad = torch.autograd.Variable(torch.zeros((n,) + tensor.size()[1:]))
                if x.is_cuda:
                    zero_pad = zero_pad.cuda()
                tensor = torch.cat([tensor, zero_pad])
            return tensor

        """
        inputs: x:          batch x time x inp
                mask:       batch x time
        output: encoding:   batch x time x hidden[-1]
        """
        # Compute sorted sequence lengths
        batch_size = x.size(0)
        lengths = mask.data.eq(1).long().sum(1)  # .squeeze()
        _, idx_sort = torch.sort(lengths, dim=0, descending=True)
        _, idx_unsort = torch.sort(idx_sort, dim=0)

        lengths = list(lengths[idx_sort])
        idx_sort = torch.autograd.Variable(idx_sort)
        idx_unsort = torch.autograd.Variable(idx_unsort)

        # Sort x
        x = x.index_select(0, idx_sort)

        # remove zero-length rows, and remember how many there were
        n_nonzero = np.count_nonzero(lengths)
        n_zero = batch_size - n_nonzero
        if n_zero != 0:
            lengths = lengths[:n_nonzero]
            x = x[:n_nonzero]

        # Transpose batch and sequence dims
        x = x.transpose(0, 1)

        # Pack it up
        rnn_input = torch.nn.utils.rnn.pack_padded_sequence(x, lengths)

        # Encode all layers
        outputs = [rnn_input]
        for i in range(self.nlayers):
            rnn_input = outputs[-1]

            # dropout between rnn layers
            if self.dropout_between_rnn_layers > 0:
                dropout_input = F.dropout(rnn_input.data,
                                          p=self.dropout_between_rnn_layers,
                                          training=self.training)
                rnn_input = torch.nn.utils.rnn.PackedSequence(dropout_input,
                                                              rnn_input.batch_sizes)
            seq, last = self.rnns[i](rnn_input)
            outputs.append(seq)
            if i == self.nlayers - 1:
                # last layer
                last_state = last[0]  # (num_layers * num_directions, batch, hidden_size)
                last_state = last_state[0]  # batch x hidden_size

        # Unpack everything
        for i, o in enumerate(outputs[1:], 1):
            outputs[i] = torch.nn.utils.rnn.pad_packed_sequence(o)[0]
        output = outputs[-1]

        # Transpose and unsort
        output = output.transpose(0, 1)  # batch x time x enc

        # re-padding
        output = pad_(output, n_zero)
        last_state = pad_(last_state, n_zero)

        output = output.index_select(0, idx_unsort)
        last_state = last_state.index_select(0, idx_unsort)

        # Pad up to original batch sequence length
        if output.size(1) != mask.size(1):
            padding = torch.zeros(output.size(0),
                                  mask.size(1) - output.size(1),
                                  output.size(2)).type(output.data.type())
            output = torch.cat([output, torch.autograd.Variable(padding)], 1)

        output = output.contiguous() * mask.unsqueeze(-1)
        return output, last_state, mask

## Noisy nets

In [0]:
class NoisyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, std_init=0.4):
        super(NoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.std_init = std_init
        self.weight_mu = torch.nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = torch.nn.Parameter(torch.empty(out_features, in_features))
        self.register_buffer('weight_epsilon', torch.empty(out_features, in_features))
        self.bias_mu = torch.nn.Parameter(torch.empty(out_features))
        self.bias_sigma = torch.nn.Parameter(torch.empty(out_features))
        self.register_buffer('bias_epsilon', torch.empty(out_features))
        self.reset_parameters()
        self.sample_noise()

    def reset_parameters(self):
        mu_range = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.out_features))

    def _scale_noise(self, size):
        x = torch.randn(size)
        return x.sign().mul_(x.abs().sqrt_())

    def sample_noise(self):
        epsilon_in = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        self.weight_epsilon.copy_(epsilon_out.ger(epsilon_in))
        self.bias_epsilon.copy_(epsilon_out)

    def forward(self, inp):
        if self.training:
            return F.linear(inp, self.weight_mu + self.weight_sigma * self.weight_epsilon, self.bias_mu + self.bias_sigma * self.bias_epsilon)
        else:
            return F.linear(inp, self.weight_mu, self.bias_mu)
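
The `_scale_noise` and `sample_noise` methods above follow the factorized Gaussian noise scheme: each Gaussian sample is passed through f(x) = sign(x) * sqrt(|x|), and the weight noise is the outer product of the output and input noise vectors. A minimal pure-Python sketch of just that sampling step, with illustrative function names that are not part of the notebook:

```python
import math
import random


def scale_noise(x):
    """f(x) = sign(x) * sqrt(|x|), applied to one Gaussian sample."""
    return math.copysign(math.sqrt(abs(x)), x)


def factorized_noise(n_in, n_out, rng):
    """Return (weight_noise, bias_noise) for a layer with n_in inputs and n_out outputs.

    weight_noise has shape n_out x n_in and is the outer product of the two
    scaled noise vectors; bias_noise reuses the output noise vector.
    """
    eps_in = [scale_noise(rng.gauss(0.0, 1.0)) for _ in range(n_in)]
    eps_out = [scale_noise(rng.gauss(0.0, 1.0)) for _ in range(n_out)]
    weight_noise = [[o * i for i in eps_in] for o in eps_out]
    return weight_noise, eps_out


weight_noise, bias_noise = factorized_noise(n_in=3, n_out=2, rng=random.Random(0))
print(len(weight_noise), len(weight_noise[0]), len(bias_noise))
```

Only `n_in + n_out` random numbers are drawn per layer instead of `n_in * n_out`, which is the point of the factorized variant.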

## Model

In [0]:
class LSTM_DQN(torch.nn.Module):
    model_name = 'lstm_dqn'

    def __init__(self, model_config, embedding_weights, word_vocab, generate_length=5, enable_cuda=False):
        super(LSTM_DQN, self).__init__()
        self.model_config = model_config
        self.enable_cuda = enable_cuda
        self.word_vocab_size = len(word_vocab)
        self.id2word = word_vocab
        self.generate_length = generate_length
        self.read_config()
        self._def_layers(embedding_weights)
        self.init_weights()
        #self.print_parameters()

    def print_parameters(self):
        print(self)
        amount = 0
        for p in self.parameters():
            amount += np.prod(p.size())
        print("Total number of parameters: {}".format(amount))
        parameters = filter(lambda p: p.requires_grad, self.parameters())
        amount = 0
        for p in parameters:
            amount += np.prod(p.size())
        print("Number of trainable parameters: {}".format(amount))

    def read_config(self):
        # model config
        self.freeze_embedding = self.model_config['freeze_embedding']
        self.embedding_size = self.model_config['embedding_size']
        self.encoder_rnn_hidden_size = self.model_config['encoder_rnn_hidden_size']
        self.action_scorer_hidden_dim = self.model_config['action_scorer_hidden_dim']
        self.dropout_between_rnn_layers = self.model_config['dropout_between_rnn_layers']
        self.bidirectional_lstm = self.model_config['bidirectional_lstm']

    def _def_layers(self, embedding_weights=None):
        # word embeddings
        self.word_embedding = Embedding(embedding_size=self.embedding_size,
                                        vocab_size=self.word_vocab_size,
                                        enable_cuda=self.enable_cuda)
        if embedding_weights is not None:
            self.word_embedding.set_weights(embedding_weights)
            print("Embedding imported!")
            if self.freeze_embedding:
                freeze_layer(self.word_embedding.embedding_layer)
                print("Embedding frozen!")
            
        # lstm encoder
        self.encoder = FastUniLSTM(ninp=self.embedding_size,
                                   nhids=self.encoder_rnn_hidden_size,
                                   bidir=self.bidirectional_lstm,
                                   dropout_between_rnn_layers=self.dropout_between_rnn_layers)
        
        shared_input_size = self.encoder_rnn_hidden_size[-1]
        shared_input_size *= 2 if self.bidirectional_lstm else 1
        self.action_scorer_shared = torch.nn.Linear(shared_input_size, self.action_scorer_hidden_dim)

        action_scorers = []
        for _ in range(self.generate_length):
            action_scorers.append(
                NoisyLinear(self.action_scorer_hidden_dim, 
                            self.word_vocab_size, 
                            std_init=self.model_config['noisy_std']))
        self.action_scorers = torch.nn.ModuleList(action_scorers)
        self.fake_recurrent_mask = None

    def init_weights(self):
        torch.nn.init.xavier_uniform_(self.action_scorer_shared.weight.data)
        
        for i in range(len(self.action_scorers)):
            self.action_scorers[i].sample_noise()
        #    torch.nn.init.xavier_uniform_(self.action_scorers[i].weight.data)
        self.action_scorer_shared.bias.data.fill_(0)

    def representation_generator(self, _input_words):
        embeddings, mask = self.word_embedding.forward(_input_words)  # batch x time x emb
        encoding_sequence, _, _ = self.encoder.forward(embeddings, mask)  # batch x time x h
        mean_encoding = masked_mean(encoding_sequence, mask)  # batch x h
        return mean_encoding

    def action_scorer(self, state_representation):
        hidden = self.action_scorer_shared.forward(state_representation)  # batch x hid
        hidden = F.relu(hidden)  # batch x hid
        action_ranks = []
        for i in range(len(self.action_scorers)):
            action_ranks.append(self.action_scorers[i].forward(hidden))  # batch x n_vocab
        return action_ranks

In [0]:
# Declare global embedding_weights
embedding_weights = None

## Score cache

In [0]:
class HistoryScoreCache(object):

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.reset()

    def push(self, stuff):
        """stuff is float."""
        if len(self.memory) < self.capacity:
            self.memory.append(stuff)
        else:
            self.memory = self.memory[1:] + [stuff]

    def get_avg(self):
        return np.mean(np.array(self.memory))

    def reset(self):
        self.memory = []

    def __len__(self):
        return len(self.memory)
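
The cache above keeps only the most recent `capacity` scores. The same rolling-window behaviour can be sketched with `collections.deque`, which drops the oldest entry automatically; `RollingAverage` is a hypothetical name, and returning NaN for an empty window is a choice made for this sketch.

```python
from collections import deque


class RollingAverage:
    """Keep the last `capacity` values and report their mean."""

    def __init__(self, capacity=1):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self.memory = deque(maxlen=capacity)

    def push(self, value):
        self.memory.append(value)

    def get_avg(self):
        if not self.memory:
            return float("nan")  # no scores recorded yet
        return sum(self.memory) / len(self.memory)

    def __len__(self):
        return len(self.memory)


scores = RollingAverage(capacity=2)
for score in (1.0, 2.0, 3.0):
    scores.push(score)
print(scores.get_avg())
```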

## Memory

In [0]:
# a snapshot of state to be stored in replay memory
Transition = namedtuple('Transition', ('observation_id_list', 'word_indices',
                                       'reward', 'mask', 'done',
                                       'next_observation_id_list',
                                       'next_word_masks'))


In [0]:
class PrioritizedReplayMemory(object):

    def __init__(self, capacity=100000, priority_fraction=0.0):
        # prioritized replay memory
        self.priority_fraction = priority_fraction
        self.alpha_capacity = int(capacity * priority_fraction)
        self.beta_capacity = capacity - self.alpha_capacity
        self.alpha_memory, self.beta_memory = [], []
        self.alpha_position, self.beta_position = 0, 0

    def push(self, is_prior, transition):
        """Saves a transition."""
        if self.priority_fraction == 0.0:
            is_prior = False
        if is_prior:
            if len(self.alpha_memory) < self.alpha_capacity:
                self.alpha_memory.append(None)
            self.alpha_memory[self.alpha_position] = transition
            self.alpha_position = (self.alpha_position + 1) % self.alpha_capacity
        else:
            if len(self.beta_memory) < self.beta_capacity:
                self.beta_memory.append(None)
            self.beta_memory[self.beta_position] = transition
            self.beta_position = (self.beta_position + 1) % self.beta_capacity

    def sample(self, batch_size):
        if self.priority_fraction == 0.0 or len(self.alpha_memory) == 0:
            from_beta = min(batch_size, len(self.beta_memory))
            res = random.sample(self.beta_memory, from_beta)
        elif len(self.beta_memory) == 0:
            from_alpha = min(batch_size, len(self.alpha_memory))
            res = random.sample(self.alpha_memory, from_alpha)
        else:
            priority_batch = int(self.priority_fraction * batch_size)
            from_alpha = min(priority_batch, len(self.alpha_memory))
            from_beta = min(batch_size - priority_batch, len(self.beta_memory))
            res = random.sample(self.alpha_memory, from_alpha) + random.sample(self.beta_memory, from_beta)
        random.shuffle(res)
        return res

    def __len__(self):
        return len(self.alpha_memory) + len(self.beta_memory)
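
To make the sampling rule in `sample` easier to follow, the sketch below computes only how many transitions would be drawn from each pool, mirroring the three branches above. `split_batch` is an illustrative helper, not part of the notebook.

```python
def split_batch(batch_size, priority_fraction, n_alpha, n_beta):
    """Return (from_alpha, from_beta): how many samples each pool contributes.

    n_alpha / n_beta are the current sizes of the priority and regular pools.
    """
    if priority_fraction == 0.0 or n_alpha == 0:
        # No priority samples available: take everything from the regular pool.
        return 0, min(batch_size, n_beta)
    if n_beta == 0:
        # No regular samples available: take everything from the priority pool.
        return min(batch_size, n_alpha), 0
    priority_batch = int(priority_fraction * batch_size)
    return min(priority_batch, n_alpha), min(batch_size - priority_batch, n_beta)


print(split_batch(32, 0.25, n_alpha=100, n_beta=100))
print(split_batch(32, 0.25, n_alpha=3, n_beta=100))
```

In the second call the priority pool holds fewer transitions than its share of the batch, so the returned batch is smaller than `batch_size`; the shortfall is not topped up from the other pool.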


## Agent

In [0]:
class CustomAgent:
    def __init__(self):
        """
        Build the agent. The supported word vocabulary is read from
        `./vocab.txt` and the settings from `config.yaml`.
        """
        global embedding_weights
        self.mode = "train"
        with open("./vocab.txt") as f:
            self.word_vocab = f.read().split("\n")
        with open("config.yaml") as reader:
            self.config = yaml.safe_load(reader)
        self.word2id = {}
        for i, w in enumerate(self.word_vocab):
            self.word2id[w] = i
        self.EOS_id = self.word2id["</S>"]

        self.batch_size = self.config['training']['batch_size']
        self.max_nb_steps_per_episode = self.config['training']['max_nb_steps_per_episode']
        self.nb_epochs = self.config['training']['nb_epochs']
        self.qval_noise_std = self.config['training']['qval_noise_std']

        # Set the random seed manually for reproducibility.
        np.random.seed(self.config['general']['random_seed'])
        torch.manual_seed(self.config['general']['random_seed'])
        if torch.cuda.is_available():
            if not self.config['general']['use_cuda']:
                logging.warning("WARNING: CUDA device detected but 'use_cuda: false' found in config.yaml")
                self.use_cuda = False
            else:
                torch.backends.cudnn.deterministic = True
                torch.cuda.manual_seed(self.config['general']['random_seed'])
                self.use_cuda = True
        else:
            self.use_cuda = False
        
        if embedding_weights is None:
            print("Start loading glove")
            embedding_weights = load_glove_embeddings(
                self.config["model"]['embedding_path'],
                self.word2id,
                embedding_dim=self.config["model"]['embedding_size'],
                enable_cuda=self.use_cuda
            )
        print("Creating Q-Network")
        embedding_weights1 = embedding_weights.clone().detach() if not(embedding_weights is None) else None
        self.model = LSTM_DQN(model_config=self.config["model"],
                              embedding_weights=embedding_weights1,
                              word_vocab=self.word_vocab,
                              enable_cuda=self.use_cuda)
        print("Creating Target Network")
        embedding_weights2 = embedding_weights.clone().detach() if not(embedding_weights is None) else None
        self.target_model = LSTM_DQN(model_config=self.config["model"],
                                     embedding_weights=embedding_weights2,
                                     word_vocab=self.word_vocab,
                                     enable_cuda=self.use_cuda)
        
        self.target_model.print_parameters()
        
        self.update_target_model_count = 0
        self.target_model_update_frequency = self.config['training']['target_model_update_frequency']

        self.experiment_tag = self.config['checkpoint']['experiment_tag']
        self.model_checkpoint_dir = self.config['checkpoint']['model_checkpoint_dir']
        self.save_frequency = self.config['checkpoint']['save_frequency']

        if self.config['checkpoint']['load_pretrained']:
            self.load_pretrained_model(self.model_checkpoint_dir)
        if self.use_cuda:
            self.model.cuda()
            self.target_model.cuda()

        self.replay_batch_size = self.config['general']['replay_batch_size']
        self.replay_memory = PrioritizedReplayMemory(self.config['general']['replay_memory_capacity'],
                                                     priority_fraction=self.config['general']['replay_memory_priority_fraction'])

        # optimizer
        parameters = filter(lambda p: p.requires_grad, self.model.parameters())
        self.optimizer = torch.optim.Adam(parameters, lr=self.config['training']['optimizer']['learning_rate'])
        
        # n-step
        self.nsteps = self.config['general']['nsteps']
        self.nstep_buffer = []

        # epsilon greedy
        self.epsilon_anneal_episodes = self.config['general']['epsilon_anneal_episodes']
        self.epsilon_anneal_from = self.config['general']['epsilon_anneal_from']
        self.epsilon_anneal_to = self.config['general']['epsilon_anneal_to']
        self.epsilon = self.epsilon_anneal_from
        self.update_per_k_game_steps = self.config['general']['update_per_k_game_steps']
        self.clip_grad_norm = self.config['training']['optimizer']['clip_grad_norm']

        self.nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
        self.preposition_map = {"take": "from",
                                "chop": "with",
                                "slice": "with",
                                "dice": "with",
                                "cook": "with",
                                "insert": "into",
                                "put": "on"}
        self.single_word_verbs = set(["inventory", "look"])
        self.discount_gamma = self.config['general']['discount_gamma']
        self.current_episode = 0
        self.current_step = 0
        self._epsiode_has_started = False
        self.history_avg_scores = HistoryScoreCache(capacity=1000)
        self.best_avg_score_so_far = 0.0
        self.loss = []

    def train(self):
        """
        Tell the agent that it is in the training phase.
        """
        self.mode = "train"
        self.model.train()

    def eval(self):
        """
        Tell the agent that it is in the evaluation phase.
        """
        self.mode = "eval"
        self.model.eval()

    def _start_episode(self, obs: List[str], infos: Dict[str, List[Any]]) -> None:
        """
        Prepare the agent for the upcoming episode.

        Arguments:
            obs: Initial feedback for each game.
            infos: Additional information for each game.
        """
        self.init(obs, infos)
        self._epsiode_has_started = True

    def _end_episode(self, obs: List[str], scores: List[int], infos: Dict[str, List[Any]]) -> None:
        """
        Tell the agent the episode has terminated.

        Arguments:
            obs: Previous command's feedback for each game.
            scores: The score obtained so far for each game.
            infos: Additional information for each game.
        """
        self.finish()
        self._epsiode_has_started = False

    def load_pretrained_model(self, load_from_dir):
        """
        Load a pretrained model checkpoint and, if present, a saved replay
        memory from a directory.

        Arguments:
            load_from_dir: Directory with saved model parameters
        """
        
        checkpoints_glob = glob.glob(os.path.join(load_from_dir, '*.pt'))
#         reg = re.compile('.*{}_episode_(\d+)\.pt'.format(self.experiment_tag))
#         print("\nREG: {}".format(reg))
#         arg_max = np.argmax([ int(re.sub(reg, r'\1', i)) for i in checkpoints_glob ])
#         load_from = checkpoints_glob[arg_max]
#         if len(checkpoints_glob) == 0:
#             print("No model checkpoints to load from: " + load_from_dir)
#             return
        
#         arg_max = np.argmax([ int(re.sub(r'.*_(\d+)\.pt', r'\1', i)) for i in checkpoints_glob ])
#         load_from = checkpoints_glob[arg_max]
#         print("loading model from {}\n".format(load_from))
        load_from = os.path.join(load_from_dir, 'lstm-ddqn-noisy_episode_350.pt')
        print("LOADING FROM: {}".format(load_from) )
        
        try:
            if self.use_cuda:
                state_dict = torch.load(load_from)
            else:
                state_dict = torch.load(load_from, map_location='cpu')
            self.model.load_state_dict(state_dict)
        except Exception as err:
            print("Failed to load model checkpoint: {}".format(err))
        
        memory_files = glob.glob(os.path.join(load_from_dir, '*.pickle'))
        if len(memory_files) == 0:
            print("No replay memory saves to load from: " + load_from_dir)
            return
        print("loading memory from: {}".format(memory_files[0]))
        
        try:
            with open(memory_files[0], 'rb') as f:
                self.replay_memory = pickle.load(f)
        except Exception as err:
            print("Failed to load replay memory checkpoint: {}".format(err))
        
        

    def select_additional_infos(self) -> EnvInfos:
        """
        Returns what additional information should be made available at each game step.

        Requested information will be included within the `infos` dictionary
        passed to `CustomAgent.act()`. To request specific information, create a
        :py:class:`textworld.EnvInfos <textworld.envs.wrappers.filter.EnvInfos>`
        and set the appropriate attributes to `True`. The possible choices are:

        * `description`: text description of the current room, i.e. output of the `look` command;
        * `inventory`: text listing of the player's inventory, i.e. output of the `inventory` command;
        * `max_score`: maximum reachable score of the game;
        * `objective`: objective of the game described in text;
        * `entities`: names of all entities in the game;
        * `verbs`: verbs understood by the game;
        * `command_templates`: templates for commands understood by the game;
        * `admissible_commands`: all commands relevant to the current state;

        In addition to the standard information, game specific information
        can be requested by appending corresponding strings to the `extras`
        attribute. For this competition, the possible extras are:

        * `'recipe'`: description of the cookbook;
        * `'walkthrough'`: one possible solution to the game (not guaranteed to be optimal);

        Example:
            Here is an example of how to request information and retrieve it.

            >>> from textworld import EnvInfos
            >>> request_infos = EnvInfos(description=True, inventory=True, extras=["recipe"])
            ...
            >>> env = gym.make(env_id)
            >>> ob, infos = env.reset()
            >>> print(infos["description"])
            >>> print(infos["inventory"])
            >>> print(infos["extra.recipe"])

        Notes:
            The following information *won't* be available at test time:

            * 'walkthrough'
        """
        request_infos = EnvInfos()
        request_infos.description = True
        request_infos.inventory = True
        request_infos.entities = True
        request_infos.verbs = True
        request_infos.max_score = True
        request_infos.has_won = True
        request_infos.has_lost = True
        request_infos.extras = ["recipe"]
        return request_infos

    def init(self, obs: List[str], infos: Dict[str, List[Any]]):
        """
        Prepare the agent for the upcoming games.

        Arguments:
            obs: Previous command's feedback for each game.
            infos: Additional information for each game.
        """
        # reset agent, get vocabulary masks for verbs / adjectives / nouns
        self.scores = []
        self.dones = []
        self.prev_actions = ["" for _ in range(len(obs))]
        # get word masks
        batch_size = len(infos["verbs"])
        verbs_word_list = infos["verbs"]
        noun_word_list, adj_word_list = [], []
        for entities in infos["entities"]:
            tmp_nouns, tmp_adjs = [], []
            for name in entities:
                split = name.split()
                tmp_nouns.append(split[-1])
                if len(split) > 1:
                    tmp_adjs += split[:-1]
            noun_word_list.append(list(set(tmp_nouns)))
            adj_word_list.append(list(set(tmp_adjs)))

        verb_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        noun_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        adj_mask = np.zeros((batch_size, len(self.word_vocab)), dtype="float32")
        for i in range(batch_size):
            for w in verbs_word_list[i]:
                if w in self.word2id:
                    verb_mask[i][self.word2id[w]] = 1.0
            for w in noun_word_list[i]:
                if w in self.word2id:
                    noun_mask[i][self.word2id[w]] = 1.0
            for w in adj_word_list[i]:
                if w in self.word2id:
                    adj_mask[i][self.word2id[w]] = 1.0
        second_noun_mask = copy.copy(noun_mask)
        second_adj_mask = copy.copy(adj_mask)
        second_noun_mask[:, self.EOS_id] = 1.0
        adj_mask[:, self.EOS_id] = 1.0
        second_adj_mask[:, self.EOS_id] = 1.0
        self.word_masks_np = [verb_mask, adj_mask, noun_mask, second_adj_mask, second_noun_mask]

        self.cache_description_id_list = None
        self.cache_chosen_indices = None
        self.current_step = 0
        
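    def append_to_replay(self, is_prior, transition):
        """
        Buffer a transition and, once `nsteps` transitions are buffered, push the
        oldest one to replay memory with its reward replaced by the n-step
        discounted return.
        """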
        self.nstep_buffer.append((is_prior, transition))

        if len(self.nstep_buffer) < self.nsteps:
            return
        
        R = sum([self.nstep_buffer[i][1].reward * (self.discount_gamma**i) for i in range(self.nsteps)])
        prior, transition = self.nstep_buffer.pop(0)

        self.replay_memory.push(prior, transition._replace(reward=R))


    def get_game_step_info(self, obs: List[str], infos: Dict[str, List[Any]]):
        """
        Gather the available information (description, inventory, recipe, feedback,
        previous action), tokenize each piece, and concatenate them into a single
        tensor for the neural model. Sequences are post-padded.

        Arguments:
            obs: Previous command's feedback for each game.
            infos: Additional information for each game.
        """

        inventory_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["inventory"]]
        inventory_id_list = [_words_to_ids(tokens, self.word2id) for tokens in inventory_token_list]
        #print("Inventory: \n{}\n".format(inventory_token_list))

        feedback_token_list = [preproc(item, str_type='feedback', tokenizer=self.nlp) for item in obs]
        feedback_id_list = [_words_to_ids(tokens, self.word2id) for tokens in feedback_token_list]
        #print("Feedback: \n{}\n".format(feedback_token_list))

        quest_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["extra.recipe"]]
        quest_id_list = [_words_to_ids(tokens, self.word2id) for tokens in quest_token_list]
        #print("Quest:\n{}\n".format(quest_token_list))

        prev_action_token_list = [preproc(item, tokenizer=self.nlp) for item in self.prev_actions]
        prev_action_id_list = [_words_to_ids(tokens, self.word2id) for tokens in prev_action_token_list]
        #print("Prev action:\n{}\n".format(prev_action_token_list))
        
        description_token_list = [preproc(item, tokenizer=self.nlp) for item in infos["description"]]
        for i, d in enumerate(description_token_list):
            if len(d) == 0:
                description_token_list[i] = ["end"]  # if empty description, insert word "end"
        
        description_id_list = [_words_to_ids(tokens, self.word2id) for tokens in description_token_list]
        description_id_list = [
            _d +  _i + _q + _f + _pa 
            for (_d, _i, _q, _f, _pa) 
            in zip(description_id_list, inventory_id_list, quest_id_list, feedback_id_list, prev_action_id_list)
        ]
        
        input_description = pad_sequences(description_id_list, maxlen=max_len(description_id_list)).astype('int32')
        input_description = to_pt(input_description, self.use_cuda)

        return input_description, description_id_list

    def word_ids_to_commands(self, verb, adj, noun, adj_2, noun_2):
        """
        Turn the 5 indices into actual command strings.

        Arguments:
            verb: Index of the predicted verb in the vocabulary
            adj: Index of the predicted adjective in the vocabulary
            noun: Index of the predicted noun in the vocabulary
            adj_2: Index of the second predicted adjective in the vocabulary
            noun_2: Index of the second predicted noun in the vocabulary
        """
        # turns 5 indices into actual command strings
        if self.word_vocab[verb] in self.single_word_verbs:
            return self.word_vocab[verb]
        if adj == self.EOS_id:
            res = self.word_vocab[verb] + " " + self.word_vocab[noun]
        else:
            res = self.word_vocab[verb] + " " + self.word_vocab[adj] + " " + self.word_vocab[noun]
        if self.word_vocab[verb] not in self.preposition_map:
            return res
        if noun_2 == self.EOS_id:
            return res
        prep = self.preposition_map[self.word_vocab[verb]]
        if adj_2 == self.EOS_id:
            res = res + " " + prep + " " + self.word_vocab[noun_2]
        else:
            res =  res + " " + prep + " " + self.word_vocab[adj_2] + " " + self.word_vocab[noun_2]
        return res

    def get_chosen_strings(self, chosen_indices):
        """
        Turns list of word indices into actual command strings.

        Arguments:
            chosen_indices: Word indices chosen by model.
        """
        chosen_indices_np = [to_np(item)[:, 0] for item in chosen_indices]
        res_str = []
        batch_size = chosen_indices_np[0].shape[0]
        for i in range(batch_size):
            verb, adj, noun, adj_2, noun_2 = chosen_indices_np[0][i],\
                                             chosen_indices_np[1][i],\
                                             chosen_indices_np[2][i],\
                                             chosen_indices_np[3][i],\
                                             chosen_indices_np[4][i]
            res_str.append(self.word_ids_to_commands(verb, adj, noun, adj_2, noun_2))
        return res_str

#     def choose_random_command(self, word_ranks, word_masks_np):
#         """
#         Generate a command randomly, for epsilon greedy.

#         Arguments:
#             word_ranks: Q values for each word by model.action_scorer.
#             word_masks_np: Vocabulary masks for words depending on their type (verb, adj, noun).
#         """
#         batch_size = word_ranks[0].size(0)
#         word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
#         word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
#         word_indices = []
#         for i in range(len(word_ranks_np)):
#             indices = []
#             for j in range(batch_size):
#                 msk = word_masks_np[i][j]  # vocab
#                 indices.append(np.random.choice(len(msk), p=msk / np.sum(msk, -1)))
#             word_indices.append(np.array(indices))
#         #print("RANDOM: {}".format(np.array(word_indices)))
#         # word_indices: list of batch
#         word_qvalues = [[] for _ in word_masks_np]
#         for i in range(batch_size):
#             for j in range(len(word_qvalues)):
#                 word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])
#         word_qvalues = [torch.stack(item) for item in word_qvalues]
#         word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
#         word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
#         return word_qvalues, word_indices

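    def choose_random_command(self, word_ranks, word_masks_np):
        """
        Generate an exploratory command by adding Gaussian noise to the Q values
        before taking the masked argmax.

        Arguments:
            word_ranks: Q values for each word by model.action_scorer.
            word_masks_np: Vocabulary masks for words depending on their type (verb, adj, noun).
        """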
        batch_size = word_ranks[0].size(0)
        word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
        word_ranks_np = [r - np.min(r) for r in word_ranks_np] # minus the min value, so that all values are non-negative
        random_ranks = np.random.normal(0, self.qval_noise_std, word_ranks_np[0].shape) 
        word_ranks_np = [r + random_ranks for r in word_ranks_np] # add noise      
        word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
        word_indices = [np.argmax(item, -1) for item in word_ranks_np]  # list of batch
        word_qvalues = [[] for _ in word_masks_np]

        for i in range(batch_size):
            for j in range(len(word_qvalues)):
                word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])

        word_qvalues = [torch.stack(item) for item in word_qvalues]
        word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
        word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
        return word_qvalues, word_indices

    def choose_maxQ_command(self, word_ranks, word_masks_np):
        """
        Generate a command by maximum q values, for epsilon greedy.

        Arguments:
            word_ranks: Q values for each word by model.action_scorer.
            word_masks_np: Vocabulary masks for words depending on their type (verb, adj, noun).
        """
        batch_size = word_ranks[0].size(0)
        word_ranks_np = [to_np(item) for item in word_ranks]  # list of batch x n_vocab
        word_ranks_np = [r - np.min(r) for r in word_ranks_np] # minus the min value, so that all values are non-negative
        word_ranks_np = [r * m for r, m in zip(word_ranks_np, word_masks_np)]  # list of batch x n_vocab
        word_indices = [np.argmax(item, -1) for item in word_ranks_np]  # list of batch
        word_qvalues = [[] for _ in word_masks_np]

        for i in range(batch_size):
            for j in range(len(word_qvalues)):
                word_qvalues[j].append(word_ranks[j][i][word_indices[j][i]])

        word_qvalues = [torch.stack(item) for item in word_qvalues]
        word_indices = [to_pt(item, self.use_cuda) for item in word_indices]
        word_indices = [item.unsqueeze(-1) for item in word_indices]  # list of batch x 1
        return word_qvalues, word_indices

    def get_ranks(self, model, input_description):
        """
        Given input description tensor, call model forward, to get Q values of words.

        Arguments:
            input_description: Input tensors, which include all the information chosen in
            select_additional_infos() concatenated together.
        """
        
        state_representation = model.representation_generator(input_description)
        #print("Size: {}".format(state_representation.size()))
        word_ranks = model.action_scorer(state_representation)  # each element in list has batch x n_vocab size
        return word_ranks
    
    def act_eval(self, obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) -> List[str]:
        """
        Acts upon the current list of observations, during evaluation.

        One text command must be returned for each observation.

        Arguments:
            obs: Previous command's feedback for each game.
            scores: The score obtained so far for each game (at previous step).
            dones: Whether a game is finished (at previous step).
            infos: Additional information for each game.

        Returns:
            Text commands to be performed (one per observation).

        Notes:
            Commands returned for games marked as `done` have no effect.
            The states for finished games are simply copied over until all
            games are done, in which case `CustomAgent.finish()` is called
            instead.
        """

        if self.current_step > 0:
            # append scores / dones from previous step into memory
            self.scores.append(scores)
            self.dones.append(dones)

        if all(dones):
            self._end_episode(obs, scores, infos)
            return  # Nothing to return.

        input_description, _ = self.get_game_step_info(obs, infos)
        word_ranks = self.get_ranks(self.model, input_description)  # list of batch x vocab
        _, word_indices_maxq = self.choose_maxQ_command(word_ranks, self.word_masks_np)

        chosen_indices = word_indices_maxq
        chosen_indices = [item.detach() for item in chosen_indices]
        chosen_strings = self.get_chosen_strings(chosen_indices)
        self.prev_actions = chosen_strings
        self.current_step += 1

        return chosen_strings

    def act(self, obs: List[str], scores: List[int], dones: List[bool], infos: Dict[str, List[Any]]) -> List[str]:
        """
        Acts upon the current list of observations.

        One text command must be returned for each observation.

        Arguments:
            obs: Previous command's feedback for each game.
            scores: The score obtained so far for each game (at previous step).
            dones: Whether a game is finished (at previous step).
            infos: Additional information for each game.

        Returns:
            Text commands to be performed (one per observation).

        Notes:
            Commands returned for games marked as `done` have no effect.
            The states for finished games are simply copied over until all
            games are done, in which case `CustomAgent.finish()` is called
            instead.
        """
        if not self._epsiode_has_started:
            self._start_episode(obs, infos)

        if self.mode == "eval":
            return self.act_eval(obs, scores, dones, infos)

        if self.current_step > 0:
            # append scores / dones from previous step into memory
            self.scores.append(scores)
            self.dones.append(dones)
            # compute previous step's rewards and masks
            rewards_np, rewards, mask_np, mask = self.compute_reward()
        
        # Sample for noisy nets
        for i in range(len(self.model.action_scorers)):
            self.model.action_scorers[i].sample_noise()

        input_description, description_id_list = self.get_game_step_info(obs, infos)
        # generate commands for one game step with epsilon greedy: with probability
        # epsilon, the exploratory (noise-perturbed) command replaces the greedy one
        
        word_ranks = self.get_ranks(self.model, input_description)  # list of batch x vocab
        
        _, word_indices_maxq = self.choose_maxQ_command(word_ranks, self.word_masks_np)
        _, word_indices_random = self.choose_random_command(word_ranks, self.word_masks_np)
        
        # random number for epsilon greedy
        rand_num = np.random.uniform(low=0.0, high=1.0, size=(input_description.size(0), 1))
        less_than_epsilon = (rand_num < self.epsilon).astype("float32")  # batch
        greater_than_epsilon = 1.0 - less_than_epsilon
        less_than_epsilon = to_pt(less_than_epsilon, self.use_cuda, type='float')
        greater_than_epsilon = to_pt(greater_than_epsilon, self.use_cuda, type='float')
        less_than_epsilon, greater_than_epsilon = less_than_epsilon.long(), greater_than_epsilon.long()

        chosen_indices = [
            less_than_epsilon * idx_random + greater_than_epsilon * idx_maxq 
            for idx_random, idx_maxq in zip(word_indices_random, word_indices_maxq)
        ]
        chosen_indices = [item.detach() for item in chosen_indices]
        chosen_strings = self.get_chosen_strings(chosen_indices)
        random_idx = to_np(less_than_epsilon).astype('bool')
        random_idx = random_idx.reshape((random_idx.shape[0]))
        #print("\nRAND IDX: {}".format(random_idx))
        #print("\nMax commands: {}".format(np.array(chosen_strings)[~random_idx]))
        random_commands = np.array(chosen_strings)[random_idx]
        #print("\nRandom commands: {}".format(random_commands))
        self.prev_actions = chosen_strings

        # push info from previous game step into replay memory
        if self.current_step > 0:
            for b in range(len(obs)):
                if mask_np[b] == 0:
                    continue
                is_prior = rewards_np[b] > 0.0
                t = Transition(self.cache_description_id_list[b], 
                               [ item[b] for item in self.cache_chosen_indices], 
                               rewards[b], 
                               mask[b], 
                               dones[b], 
                               description_id_list[b], 
                               [item[b] for item in self.word_masks_np])
                # print("ACT: {}".format(t.observation_id_list))
                self.append_to_replay(is_prior, t)

        # cache new info in current game step into caches
        self.cache_description_id_list = description_id_list
        self.cache_chosen_indices = chosen_indices

        # update neural model by replaying snapshots in replay memory
        if self.current_step > 0 and self.current_step % self.update_per_k_game_steps == 0:
            loss = self.update()
            #print(loss)
            if loss is not None:
                self.loss.append(to_np(loss).mean())
                # Backpropagate
                self.optimizer.zero_grad()
                loss.backward(retain_graph=True)
                # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
                torch.nn.utils.clip_grad_norm_(self.model.encoder.parameters(), self.clip_grad_norm)
                self.optimizer.step()  # apply gradients

        self.current_step += 1

        if all(dones):
            self._end_episode(obs, scores, infos)
            return  # Nothing to return.
        return chosen_strings

    def compute_reward(self):
        """
        Compute per-step rewards and masks inside the agent. Note this is different from what the
        training/evaluation scripts do: the agent keeps track of scores and other game information
        for training purposes.
        """
        # mask = 1 if game is not finished or just finished at current step
        if len(self.dones) == 1:
            # it's not possible to finish a game at 0th step
            mask = [1.0 for _ in self.dones[-1]]
        else:
            assert len(self.dones) > 1
            mask = [1.0 if not self.dones[-2][i] else 0.0 for i in range(len(self.dones[-1]))]
        mask = np.array(mask, dtype='float32')
        mask_pt = to_pt(mask, self.use_cuda, type='float')
        # rewards returned by the game engine are always the accumulated score the
        # agent has received, so the reward for the current game step
        # is the new value minus the value at the previous step.
        rewards = np.array(self.scores[-1], dtype='float32')  # batch
        if len(self.scores) > 1:
            prev_rewards = np.array(self.scores[-2], dtype='float32')
            rewards = rewards - prev_rewards
        rewards_pt = to_pt(rewards, self.use_cuda, type='float')

        return rewards, rewards_pt, mask, mask_pt
    
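    def update_target_model(self):
        """Copy the online model's weights into the target model every `target_model_update_frequency` calls."""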
        self.update_target_model_count = (self.update_target_model_count + 1) % self.target_model_update_frequency
        if self.update_target_model_count == 0:
            self.target_model.load_state_dict(self.model.state_dict())

    def update(self):
        """
        Update the agent's neural model with one DQN step on a batch sampled from
        replay memory. Returns the loss, or None if replay memory holds fewer than
        `replay_batch_size` transitions.
        """
        if len(self.replay_memory) < self.replay_batch_size:
            return None
        
        self.update_target_model()
        
        #print("UPDATE! \n Memory alpha size: {} | beta size: {}\n".format(len(self.replay_memory.alpha_memory), len(self.replay_memory.beta_memory)))
        transitions = self.replay_memory.sample(self.replay_batch_size)
        batch = Transition(*zip(*transitions))

        observation_id_list = pad_sequences(batch.observation_id_list, maxlen=max_len(batch.observation_id_list)).astype('int32')
        input_observation = to_pt(observation_id_list, self.use_cuda)
        next_observation_id_list = pad_sequences(batch.next_observation_id_list, maxlen=max_len(batch.next_observation_id_list)).astype('int32')
        next_input_observation = to_pt(next_observation_id_list, self.use_cuda)
        chosen_indices = list(list(zip(*batch.word_indices)))
        chosen_indices = [torch.stack(item, 0) for item in chosen_indices]  # list of batch x 1
        
        word_ranks = self.get_ranks(self.model, input_observation)  # list of batch x vocab
        word_qvalues = [w_rank.gather(1, idx).squeeze(-1) for w_rank, idx in zip(word_ranks, chosen_indices)]  # list of batch
        q_value = torch.mean(torch.stack(word_qvalues, -1), -1)  # batch

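        # Next-state Q values come from the target network. The Double DQN split
        # (select with the online network, evaluate with the target network) is commented out below.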
        next_word_ranks = self.get_ranks(self.target_model, next_input_observation) # batch x n_verb, batch x n_noun, batch x n_second_noun
        next_word_masks = list(list(zip(*batch.next_word_masks)))
        next_word_masks = [np.stack(item, 0) for item in next_word_masks]

       # _, next_word_indexes = self.choose_maxQ_command(next_word_ranks, next_word_masks)
        
        # Action evaluation, using target network
       # eval_next_word_ranks = self.get_ranks(self.target_model, next_input_observation)
       # next_word_qvalues = [rank.gather(1, idx.detach()).squeeze(-1) for rank, idx in zip(eval_next_word_ranks, next_word_indexes)]
        next_word_qvalues, _ = self.choose_maxQ_command(next_word_ranks, next_word_masks)
        next_q_value = torch.mean(torch.stack(next_word_qvalues, -1), -1)  # batch
        next_q_value = next_q_value.detach()

        rewards = torch.stack(batch.reward)  # batch
        not_done = 1.0 - np.array(batch.done, dtype='float32')  # batch
        not_done = to_pt(not_done, self.use_cuda, type='float')
        # NB: Should not_done be used?
        rewards = rewards + not_done * next_q_value * (self.discount_gamma**self.nsteps)  # batch
        #rewards = rewards + next_q_value * (self.discount_gamma**self.nsteps)  # batch
        mask = torch.stack(batch.mask)  # batch
        loss = F.smooth_l1_loss(q_value * mask, rewards * mask)
        return loss

    def save_agent(self):
        model_save = os.path.join(self.model_checkpoint_dir, self.experiment_tag + "_episode_" + str(self.current_episode) + ".pt")
        memory_save = os.path.join(self.model_checkpoint_dir, self.experiment_tag + ".pickle")
        # create the checkpoint directory (and any missing parents) if needed
        os.makedirs(self.model_checkpoint_dir, exist_ok=True)
        torch.save(self.model.state_dict(), model_save)
        with open(memory_save, 'wb') as f:
            pickle.dump(self.replay_memory, f)
        print("\n========= saved checkpoint =========")
        
        
    def finish(self) -> None:
        """
        All games in the batch are finished. One can choose to save checkpoints,
        evaluate on validation set, or do parameter annealing here.
        """
        # Game has finished (either win, lose, or exhausted all the given steps).
        self.final_rewards = np.array(self.scores[-1], dtype='float32')  # batch
        dones = []
        for d in self.dones:
            d = np.array([float(dd) for dd in d], dtype='float32')
            dones.append(d)
        dones = np.array(dones)
        step_used = 1.0 - dones
        self.step_used_before_done = np.sum(step_used, 0)  # batch
        
        self.history_avg_scores.push(np.mean(self.final_rewards))
            
        # save checkpoint
        if self.mode == "train" and self.current_episode % self.save_frequency == 0:
            avg_score = self.history_avg_scores.get_avg()
            if avg_score > self.best_avg_score_so_far:
                self.best_avg_score_so_far = avg_score
                self.save_agent()


        self.current_episode += 1
        # annealing
        if self.current_episode < self.epsilon_anneal_episodes:
            self.epsilon -= (self.epsilon_anneal_from - self.epsilon_anneal_to) / float(self.epsilon_anneal_episodes)
            
    def get_mean_loss(self):
        mean_loss = 0.
        if len(self.loss) != 0:   
            mean_loss = sum(self.loss) / len(self.loss)
        self.loss = []
        return mean_loss
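
The n-step accumulation done in `append_to_replay` can be tried out in isolation. The following is a minimal sketch, not the agent's code: it works on bare reward values instead of `Transition` tuples, and the names `nstep_return` and `push_with_nsteps` are made up for illustration.

```python
def nstep_return(rewards, gamma):
    """Discounted sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... over a reward window."""
    return sum(r * gamma ** i for i, r in enumerate(rewards))


def push_with_nsteps(buffer, reward, nsteps, gamma):
    """Append a reward; once the window holds `nsteps` entries, pop the oldest
    one and return its n-step return. Returns None while the window fills."""
    buffer.append(reward)
    if len(buffer) < nsteps:
        return None
    ret = nstep_return(buffer[:nsteps], gamma)
    buffer.pop(0)
    return ret


# Hypothetical reward stream for a single game.
buffer = []
emitted = [push_with_nsteps(buffer, r, nsteps=3, gamma=0.5)
           for r in [1.0, 0.0, 2.0, 4.0]]
print(emitted)  # the first two pushes only fill the window
```

Entries still in the window when the stream ends are never emitted by this sketch. Because each stored reward already covers `nsteps` steps, the bootstrap term in `update()` is discounted by `discount_gamma**nsteps`.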

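Word selection in `choose_maxQ_command` and the epsilon-greedy mix in `act` can be illustrated with plain NumPy. This is a rough sketch under simplifying assumptions: a single word slot instead of five, hand-written exploratory indices, and `np.where` in place of the agent's multiply-and-add on tensors. The helper names `masked_argmax` and `epsilon_mix` are hypothetical.

```python
import numpy as np


def masked_argmax(q_values, mask):
    """Pick the highest-scoring allowed index per row.

    q_values: (batch, vocab) float array of Q values.
    mask:     (batch, vocab) array, 1.0 for allowed words and 0.0 otherwise.
    """
    shifted = q_values - np.min(q_values)      # global shift, every entry >= 0
    return np.argmax(shifted * mask, axis=-1)  # masked entries collapse to 0


def epsilon_mix(greedy_idx, explore_idx, epsilon, rng):
    """Per game, keep the greedy index or swap in the exploratory one."""
    explore = rng.uniform(size=greedy_idx.shape) < epsilon
    return np.where(explore, explore_idx, greedy_idx)


q = np.array([[0.2, -1.0, 3.0, 0.5],
              [1.5, 0.1, -0.3, 0.9]])
mask = np.array([[1.0, 1.0, 0.0, 1.0],   # word 2 is not allowed in game 0
                 [0.0, 1.0, 1.0, 1.0]])  # word 0 is not allowed in game 1

greedy = masked_argmax(q, mask)
rng = np.random.default_rng(0)
explore_idx = np.array([1, 2])  # made-up exploratory choices
always_greedy = epsilon_mix(greedy, explore_idx, epsilon=0.0, rng=rng)
always_explore = epsilon_mix(greedy, explore_idx, epsilon=1.0, rng=rng)
print(greedy, always_greedy, always_explore)
```

One consequence of shifting by the global minimum: an allowed word that holds that minimum scores 0 after the shift and ties with the masked-out words.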
## Configs and environments

### Vocab
Upload the `vocab.txt` file.

In [0]:
from google.colab import files

if not os.path.isfile('./vocab.txt'):
    uploaded = files.upload()
    # Upload vocab.txt
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
else:
    print("Vocab already uploaded!")

Saving vocab_new.txt to vocab_new.txt
User uploaded file "vocab_new.txt" with length 158830 bytes


In [0]:
!mv vocab_new.txt vocab.txt

### Configuration

In [0]:
with open('./config.yaml', 'w') as config:
    config.write("""
general:
  discount_gamma: 0.7
  random_seed: 42
  use_cuda: True  # disable this when running on machine without cuda

  # replay memory
  replay_memory_capacity: 10000  # adjust this depending on your RAM size
  replay_memory_priority_fraction: 0.5
  update_per_k_game_steps: 5
  replay_batch_size: 32
  nsteps: 3

  # epsilon greedy
  epsilon_anneal_episodes: 200  # -1 if not annealing
  epsilon_anneal_from: 1
  epsilon_anneal_to: 0.1

checkpoint:
  experiment_tag: 'lstm-ddqn-noisy'
  model_checkpoint_dir: '/gdrive/My Drive/saved_models'
  load_pretrained: True  # during test, enable this so that the agent loads your pretrained model
  # pretrained_experiment_dir: 'starting-kit'
  save_frequency: 50

training:
  batch_size: 10   # Parallel games played at once
  nb_epochs: 100
  max_nb_steps_per_episode: 100  # after this many steps, a game is terminated
  target_model_update_frequency: 8 # update target model after this many backprops
  qval_noise_std: 0.1
  optimizer:
    step_rule: 'adam'  # adam
    learning_rate: 0.001
    clip_grad_norm: 5

model:
  embedding_path: ./glove.6B.50d.txt
  embedding_size: 50
  noisy_std: 0.3
  freeze_embedding: False
  encoder_rnn_hidden_size: [128]
  bidirectional_lstm: False
  action_scorer_hidden_dim: 64
  dropout_between_rnn_layers: 0.
""")
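
To see what the epsilon settings above imply, the linear annealing applied in `finish()` can be replayed offline. This is a sketch under two assumptions that are not shown in this section: epsilon starts at `epsilon_anneal_from`, and the episode counter starts at 0. The function name `epsilon_schedule` is made up.

```python
def epsilon_schedule(eps_from, eps_to, anneal_episodes, total_episodes):
    """Return the epsilon in effect at the start of each episode."""
    step = (eps_from - eps_to) / float(anneal_episodes)
    epsilon, episode, history = eps_from, 0, []
    for _ in range(total_episodes):
        history.append(epsilon)
        episode += 1
        # same rule as finish(): decrement only while below the anneal horizon
        if episode < anneal_episodes:
            epsilon -= step
    return history


# Values taken from the config above: 1 -> 0.1 over 200 episodes.
schedule = epsilon_schedule(1.0, 0.1, 200, 250)
print(schedule[0], schedule[1], schedule[-1])
```

Under these assumptions the schedule stops one decrement short of `epsilon_anneal_to`, because the counter is incremented before it is compared with `epsilon_anneal_episodes`. With `epsilon_anneal_episodes: -1` the condition is never true, so epsilon stays at its initial value.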

### Mount drive to load games

The notebook loads sample games from Google Drive (requires authentication).

To train the agent, upload an archive of games to Google Drive and set the path to that archive in the cell below.



In [0]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [0]:
!rm -rf starting_kit

In [0]:
# Set the path to your games archive below.
!ls -1 | grep -q '^.*\.ulx$' || tar -xzvf '/gdrive/My Drive/starting_kit_games.tgz'

In [0]:
path_to_sample_games = './train'

## Train

In [0]:
# List of additional information available during evaluation.
AVAILABLE_INFORMATION = EnvInfos(
    description=True, inventory=True,
    max_score=True, objective=True, entities=True, verbs=True,
    command_templates=True, admissible_commands=True,
    has_won=True, has_lost=True,
    extras=["recipe"]
)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def _validate_requested_infos(infos: EnvInfos):
    msg = "The following information cannot be requested: {}"
    for key in infos.basics:
        if not getattr(AVAILABLE_INFORMATION, key):
            raise ValueError(msg.format(key))

    for key in infos.extras:
        if key not in AVAILABLE_INFORMATION.extras:
            raise ValueError(msg.format(key))
            
def get_index(game_no, stats):
    return "{}_{}".format(game_no, stats)
            
def print_epoch_stats(epoch_no, stats):
    print("\n\nEpoch: {:3d}".format(epoch_no))
    steps, scores, loss = stats["steps"], stats["scores"], stats["loss"]
    max_scores, outcomes = stats["max_score"], stats["outcomes"]
    games_cnt, parallel_cnt = len(steps), len(steps[0])
    columns = [ get_index(col, st) for col in range(games_cnt) for st in ['st', 'sc']]
    stats_df = pd.DataFrame(index=list(range(parallel_cnt)) + ["avr", "loss"], columns=columns)
        
    for col in range(games_cnt):
        st_col, sc_col = get_index(col, 'st'), get_index(col, 'sc')
        for row in range(parallel_cnt):
            outcome = outcomes[col][row]
            outcome = "W" if outcome > 0 else "L" if outcome < 0 else ""
            # .loc avoids chained indexing, which pandas may apply to a copy
            stats_df.loc[row, st_col] = steps[col][row]
            stats_df.loc[row, sc_col] = outcome + " " + str(scores[col][row])
        stats_df.loc['avr', sc_col] = "{}/{}".format(np.mean(scores[col]), max_scores[col])
        stats_df.loc['avr', st_col] = np.mean(steps[col])
        # the "loss" row holds epsilon in the steps column and mean loss in the score column
        stats_df.loc['loss', st_col] = stats["eps"][col]
        stats_df.loc['loss', sc_col] = "{:.5f}".format(loss[col])
    print(stats_df)
    
def get_game_id(game_info):
    return hash((tuple(game_info['entities'][0]), game_info['extra.recipe'][0]))
    #return hash((game_info['entities'], game_info['extra.recipe']))

def make_stats(count_games):
    stats_cols = [ "scores", "steps", "loss", "max_score", "outcomes", "eps" ]
    stats = {}
    for col in stats_cols:
        stats[col] = [0] * count_games
    return stats
    
def train(game_files):
    print("Agent starting...")
    agent = CustomAgent()
    print("Agent started")
    requested_infos = agent.select_additional_infos()
    _validate_requested_infos(requested_infos)

    env_id = textworld.gym.register_games(game_files, requested_infos,
                                          max_episode_steps=agent.max_nb_steps_per_episode,
                                          name="training")
    env_id = textworld.gym.make_batch(env_id, batch_size=agent.batch_size, parallel=True)
    print("Making {} parallel environments to train on them\n".format(agent.batch_size))
    env = gym.make(env_id)
    count_games = len(game_files)
    games_ids = {}
    for epoch_no in range(1, agent.nb_epochs + 1):
        stats = make_stats(count_games)
        idx = 0
        for game_no in tqdm(range(count_games)):
            obs, infos = env.reset()
            game_id = get_game_id(infos)
            if epoch_no == 1:
                games_ids[game_id] = idx
                idx += 1
            real_id = games_ids[game_id]
            stats["max_score"][real_id] = infos['max_score'][0]
            
            agent.train()

            scores = [0] * len(obs) 
            dones = [False] * len(obs)
            steps = [0] * len(obs)
            while not all(dones):
                # Increase step counts.
                steps = [step + int(not done) for step, done in zip(steps, dones)]
                commands = agent.act(obs, scores, dones, infos)
                obs, scores, dones, infos = env.step(commands)

            # Let the agent know that the game is done.
            agent.act(obs, scores, dones, infos)

            stats["scores"][real_id] = scores
            stats["steps"][real_id] = steps
            stats["eps"][real_id] = agent.epsilon
            stats["loss"][real_id] = agent.get_mean_loss()
            stats["outcomes"][real_id] = [ w-l for w, l in zip(infos['has_won'], infos['has_lost'])]
        
        print_epoch_stats(epoch_no, stats)
        
    #torch.save(agent.model, './agent_model.pt')
    return
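
The step counter in the training loop only advances for environments that were still running when the action was chosen, so each entry ends up holding the length of that environment's episode. The following is a minimal, self-contained sketch of that bookkeeping. `count_steps` and `done_schedule` are hypothetical names that stand in for the batched results of `env.step`; they are not part of TextWorld.

```python
def count_steps(done_schedule):
    """Count per-environment steps for a batch of parallel episodes.

    done_schedule is a non-empty list; element t is the `dones` list that a
    batched environment would return after step t (an assumption made for
    this illustration).
    """
    batch_size = len(done_schedule[0])
    dones = [False] * batch_size
    steps = [0] * batch_size
    for next_dones in done_schedule:
        if all(dones):
            break
        # Only environments that are not done yet take another step.
        steps = [step + int(not done) for step, done in zip(steps, dones)]
        dones = next_dones
    return steps


# Environment 0 finishes on its 2nd step, environment 1 on its 3rd.
schedule = [[False, False], [True, False], [True, True]]
print(count_steps(schedule))
```

The update happens before the new `dones` are known, so the step on which an environment finishes is still counted.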

In [0]:
# %%time

# game_dir = path_to_sample_games
# games = []
# if os.path.isdir(game_dir):
#     games += glob.glob(os.path.join(game_dir, "*.ulx"))
# print("{} games found for training.".format(len(games)))

# if len(games) != 0:
#     train(games)

In [0]:
path_to_sample_games = 'train'
def take_games(game_dir, n=10):
  games = glob.glob(os.path.join(game_dir, "*.ulx"))
  games.sort(key=len)
  return games[:n]
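
`take_games` keeps the `n` games with the shortest paths. Because `glob.glob` returns files in arbitrary order and the sort is stable, games with equally long names may be selected differently between runs. Below is a sketch of a deterministic variant, assuming reproducible selection is wanted here; `take_games_sorted` and `demo` are hypothetical names, and the files created are empty placeholders rather than real games.

```python
import glob
import os
import tempfile


def take_games_sorted(game_dir, n=10):
    """Return the n shortest '*.ulx' paths, breaking ties alphabetically."""
    games = glob.glob(os.path.join(game_dir, "*.ulx"))
    games.sort(key=lambda path: (len(path), path))
    return games[:n]


def demo():
    # Build a throwaway directory with placeholder files.
    with tempfile.TemporaryDirectory() as game_dir:
        for name in ["bbb.ulx", "a.ulx", "cc.ulx", "notes.txt"]:
            open(os.path.join(game_dir, name), "w").close()
        picked = take_games_sorted(game_dir, n=2)
        return [os.path.basename(path) for path in picked]


print(demo())
```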

In [0]:
# Before running this cell: comment out the train() call, fix the model loading path,
# and set the agent's batch size to 10.


def eval_games(game_files):
  agent = CustomAgent()
  requested_infos = agent.select_additional_infos()
  
  env_id = textworld.gym.register_games(game_files, requested_infos,
                                        max_episode_steps=agent.max_nb_steps_per_episode,
                                        name="eval")

  # Number of parallel environments used for evaluation.
  batch_size = 10
  env_id = textworld.gym.make_batch(env_id, batch_size=batch_size, parallel=True)
  print("ENVID: {}".format(env_id))

  print("Making {} parallel environments to eval on them\n".format(batch_size))
  env = gym.make(env_id)
  count_games = len(game_files)

  score_sum = 0
  steps_sum = 0
  steps_length = count_games * batch_size
  for game_no in tqdm(range(count_games)):
      obs, infos = env.reset()

      agent.eval()

      scores = [0] * len(obs) 
      dones = [False] * len(obs)
      steps = [0] * len(obs)
      while not all(dones):
          # Increase step counts.
          steps = [step + int(not done) for step, done in zip(steps, dones)]
          commands = agent.act(obs, scores, dones, infos)
          obs, scores, dones, infos = env.step(commands)

      # Let the agent know that the game is done.
      agent.act(obs, scores, dones, infos)
      score_sum += sum(scores)
      steps_sum += sum(steps)
      
  # Despite the label, this is the sum of final scores over all games and environments.
  print('Max score: ', score_sum)
  print('Mean steps: ', steps_sum / steps_length)

game_dir = path_to_sample_games
games = []
if os.path.isdir(game_dir):
    games += take_games(game_dir)
    print(games)
    
print("{} games found for training.".format(len(games)))

if len(games) != 0:
  eval_games(games)

['train/tw-cooking-recipe2-BB6F8a7uv1cq69.ulx', 'train/tw-cooking-recipe1-x5RZIbKs202tkPK.ulx', 'train/tw-cooking-recipe2-0251hW0Lu3Q3tKP.ulx', 'train/tw-cooking-recipe2-WBlZid3TEB3Cebp.ulx', 'train/tw-cooking-recipe3-dnX0IGYHJ0yCdjl.ulx', 'train/tw-cooking-recipe3-25mrSE2hL9WikrR.ulx', 'train/tw-cooking-recipe2-qx0Mfy3ue3xCxld.ulx', 'train/tw-cooking-recipe3-WDRfeP7sLn6uNYd.ulx', 'train/tw-cooking-recipe3-630RT7LOTZeC07V.ulx', 'train/tw-cooking-recipe2-ab2pHlDgC1okuEMb.ulx']
10 games found for training.
Creating Q-Network
Embedding imported!
Creating Target Network
Embedding imported!
LSTM_DQN(
  (word_embedding): Embedding(
    (embedding_layer): Embedding(20208, 50, padding_idx=0)
  )
  (encoder): FastUniLSTM(
    (rnns): ModuleList(
      (0): LSTM(50, 128)
    )
  )
  (action_scorer_shared): Linear(in_features=128, out_features=64, bias=True)
  (action_scorers): ModuleList(
    (0): NoisyLinear()
    (1): NoisyLinear()
    (2): NoisyLinear()
    (3): NoisyLinear()
    (4): NoisyLi

  result = entry_point.load(False)




  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:08<01:15,  8.41s/it]
 20%|██        | 2/10 [00:14<01:02,  7.82s/it]
 30%|███       | 3/10 [00:21<00:51,  7.41s/it]
 40%|████      | 4/10 [00:27<00:41,  6.98s/it]
 50%|█████     | 5/10 [00:33<00:33,  6.74s/it]
 60%|██████    | 6/10 [00:40<00:27,  6.95s/it]
 70%|███████   | 7/10 [00:47<00:20,  6.74s/it]
 80%|████████  | 8/10 [00:53<00:13,  6.66s/it]
 90%|█████████ | 9/10 [00:59<00:06,  6.50s/it]
100%|██████████| 10/10 [01:10<00:00,  7.88s/it]

Max score:  0
Mean steps:  100.0
tw-eval-v12 closed


Process Process-130:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/textworld/gym/envs/batch_env.py", line 20, in _child
    command = pipe.recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
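
The `EOFError` in the traceback above is what `recv()` on a `multiprocessing` connection raises when the other end of the pipe has been closed and nothing is left to read. A plausible reading, not confirmed by the log, is that the worker processes were still waiting in `pipe.recv()` when the parent side of the batch environment was closed. The sketch below reproduces only that generic behaviour inside a single process; `recv_after_close` is a hypothetical helper and does not use TextWorld.

```python
from multiprocessing import Pipe


def recv_after_close():
    """Show what a reader sees once the writing end of a pipe is closed."""
    parent_end, child_end = Pipe()
    parent_end.send("step")   # one pending message, like a queued command
    parent_end.close()        # the parent goes away

    events = [child_end.recv()]  # the pending message is still delivered
    try:
        child_end.recv()         # nothing left and the peer is closed
    except EOFError:
        events.append("EOFError")
    finally:
        child_end.close()
    return events


print(recv_after_close())
```

If that is indeed the cause here, the traceback is noise from worker shutdown rather than a failure of the evaluation itself.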



In [0]:
!ls '/gdrive/My Drive/saved_models/lstm-ddqn-noisy_episode_350.pt'

'/gdrive/My Drive/saved_models/lstm-ddqn-noisy_episode_350.pt'


## Save models

In [0]:
!ls -lh '/gdrive/My Drive/saved_models'

total 628M
-rw------- 1 root root  80M Jul  2 15:19 imitation-textworld-lstm-ddqn-noisy-nets_episode_50.pt
-rw------- 1 root root 2.0M Jul  3 14:09 lstm-ddqn-noisy-action-space_episode_0.pt
-rw------- 1 root root 643K Jul  3 16:24 lstm-ddqn-noisy-action-space_episode_100.pt
-rw------- 1 root root 643K Jul  3 16:37 lstm-ddqn-noisy-action-space_episode_150.pt
-rw------- 1 root root 643K Jul  3 16:48 lstm-ddqn-noisy-action-space_episode_200.pt
-rw------- 1 root root 643K Jul  3 16:58 lstm-ddqn-noisy-action-space_episode_250.pt
-rw------- 1 root root 643K Jul  3 17:08 lstm-ddqn-noisy-action-space_episode_300.pt
-rw------- 1 root root 643K Jul  3 17:16 lstm-ddqn-noisy-action-space_episode_350.pt
-rw------- 1 root root 643K Jul  3 17:23 lstm-ddqn-noisy-action-space_episode_400.pt
-rw------- 1 root root 643K Jul  3 17:29 lstm-ddqn-noisy-action-space_episode_450.pt
-rw------- 1 root root 643K Jul  3 17:36 lstm-ddqn-noisy-action-space_episode_500.pt
-rw------- 1 root root 643K Jul  3 16:10 lstm