# Predicting with RNNs in FastAI - Avoiding Infinite Loops

Some people, myself included, have run into RNNs made with FastAI spitting out the same sentence over and over again in a loop. This can be solved with a small tweak to the way the prediction vector is evaluated. The fix applies to any `Learner` created from a `LanguageModelData` object and works for character-level or word-level RNNs.

## Background - Character Level RNN predictions

I ran into this issue trying to create a `Learner` version of the Nietzsche RNN. Compare the output predictions from the models below:

### Model 1: LSTM Model From Lesson 6

This is taken directly from Lesson 6.

In [1]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

os.listdir(f'{PATH}')

['.ipynb_checkpoints', 'models', 'nietzsche.txt', 'trn', 'val']

In [2]:
TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=16; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(470, 55, 1, 482972)

In [3]:
from fastai import sgdr

n_hidden=512

In [4]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

In [5]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

In [6]:
fit(m, md, 2, lo.opt, F.nll_loss)

epoch      trn_loss   val_loss                                                                                         
    0      1.80958    1.707095  
    1      1.687597   1.619923                                                                                         



[array([1.61992])]

In [7]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 7, lo.opt, F.nll_loss, callbacks=cb)

epoch      trn_loss   val_loss                                                                                         
    0      1.536625   1.478161  
    1      1.55555    1.482114                                                                                         
    2      1.445336   1.420886                                                                                         
    3      1.582081   1.521532                                                                                         
    4      1.496563   1.441881                                                                                         
    5      1.415837   1.391549                                                                                         
    6      1.360574   1.372136                                                                                         



[array([1.37214])]

### Testing the Model

In [8]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [9]:
get_next('for thos')

'e'

In [10]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [11]:
print(get_next_n('for thos', 1000))

for those aristocracy of whichdo seconsthat as even a culture "sees?     heastand and to say for there may according required shappensary reputients.[19] perhaps almost conceptions", the restraction, theprecessonizes--obversions, are synthesistic worth:--the present to dices in the world. the satisfacts, such no nastingly; and short. once from anlights of presumptice from attricknings(his will in a my dis repured from the uniteral degree--platory: no doubt, it is that the how is to the cultured craffined; then well only anything to decided, askmater, only present to dream to the drscerned to has reverence is something physics with some usual wanting whom hesit." but in them of causethe namely manage thewhatever) in fauth: there good too egoists, my devilposed 'years to theeperhams, however, preimitable commanding,womanythingfully, and sumplantly suffering, and feels how first, and kinds. the eetern: howlustom, and putfaith and actuallybecome to art of their soul? but it late hout victo

The model is clearly working. Compare this output to what happens when we instead create a `Learner` from the `LanguageModelData`.

### Model 2: Model Created from LanguageModelData.get_model

In [12]:
em_sz = n_fac  
nh = n_hidden     
nl = 2       

In [13]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [14]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [15]:
learner.fit(1e-2, 2, wds=1e-5)

epoch      trn_loss   val_loss                                                                                         
    0      1.762447   1.660329  
    1      1.647266   1.559843                                                                                         



[array([1.55984])]

In [16]:
learner.fit(1e-2, 3, wds=1e-5, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                                                                         
    0      1.513945   1.437034  
    1      1.548548   1.463759                                                                                         
    2      1.446684   1.400717                                                                                         
    3      1.576475   1.487075                                                                                         
    4      1.515214   1.437371                                                                                         
    5      1.436331   1.388373                                                                                         
    6      1.374034   1.368162                                                                                         



[array([1.36816])]

Now we try to predict:

In [17]:
m = learner.model
print(get_next_n('for thos', 1000))

AttributeError: 'list' object has no attribute 'exp'

We get a prediction error because the outputs from the `Learner` model aren't in the same form as those of `CharSeqStatefulLSTM`. To get around this, we use the prediction method from Lesson 4, which was used with the IMDB language model.
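The `'list' object has no attribute 'exp'` message is consistent with the model returning more than one value, so that indexing the result with `[-1]` picks the last returned item rather than the last row of predictions. Here is a minimal plain-Python sketch of that situation. `fake_learner_model` and its three return values are hypothetical stand-ins, not the fastai implementation.

```python
# Hypothetical stand-in for a model whose forward pass returns several values.
# The real return structure of the fastai model is an assumption here.
def fake_learner_model(tokens):
    predictions = [[0.1, 0.7, 0.2] for _ in tokens]  # one row of scores per input token
    raw_outputs = ["raw layer outputs"]               # extra values, returned as lists
    outputs = ["regularized layer outputs"]
    return predictions, raw_outputs, outputs

out = fake_learner_model("for thos")

# Indexing the whole return value gives the last element of the tuple: a list
last_item = out[-1]

# Star-unpacking keeps the first element and discards the rest
res, *_ = out
last_row = res[-1]  # scores for the final input token

print(type(last_item).__name__, hasattr(last_item, "exp"))
print(last_row)
```

This is why the cells that follow unpack the result with `res,*_ = m(t)` before indexing into it.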

In [21]:
m=learner.model
ss= """ for thos"""
t=TEXT.numericalize(ss)

In [22]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

In [23]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

[' ', 'e', 't', ',', 'o', 'i', 'u', '.', '-', 'h']

In [27]:
print(ss,"\n")
for i in range(1000):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end='')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

 for thos 

 of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense

### The Loop

So what happened here? The LSTM model, which seemed to function fine when we made it from scratch, is now suddenly predicting the same thing over and over. As far as I have been able to determine, the difference is that in the first model we choose the next character using `torch.multinomial`, while for the second model we use `torch.topk`. I'm not too certain about the underlying cause, but `torch.topk` always picks the highest-scoring character, so nothing pushes the model out of a cycle once it settles into one. That makes it a poor choice in this scenario.
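To see how always taking the top choice can lock into a cycle, here is a toy sketch in plain Python. The `NEXT` table is a made-up set of next-word probabilities, not anything learned from the Nietzsche text. It also only looks at the previous word, whereas the real RNN carries a hidden state, so treat it as an illustration of the mechanism rather than an explanation of the model.

```python
import random

# Hypothetical next-word probabilities, standing in for a trained model
NEXT = {
    "of":    {"the": 0.6, "a": 0.4},
    "the":   {"sense": 0.5, "world": 0.3, "soul": 0.2},
    "sense": {"of": 0.7, "and": 0.3},
    "a":     {"sense": 0.6, "soul": 0.4},
    "world": {"of": 0.5, "and": 0.5},
    "soul":  {"of": 0.6, "and": 0.4},
    "and":   {"the": 0.5, "a": 0.5},
}

def greedy(word):
    # Always take the single most likely next word (the top-1 choice)
    probs = NEXT[word]
    return max(probs, key=probs.get)

def sample(word, rng):
    # Draw the next word in proportion to its probability
    probs = NEXT[word]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

def generate(step, start, n):
    # Repeatedly apply `step` to the last word to extend the text
    words = [start]
    for _ in range(n):
        words.append(step(words[-1]))
    return " ".join(words)

rng = random.Random(0)
greedy_text = generate(greedy, "of", 12)
sampled_text = generate(lambda w: sample(w, rng), "of", 12)
print(greedy_text)
print(sampled_text)
```

The greedy run cycles through "of the sense" indefinitely, because the same input always produces the same top choice. The sampled run can leave the cycle whenever a less likely word is drawn.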

We can fix the looping problem by applying `torch.multinomial` instead. 

In [28]:
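def get_next(inp):
    idxs = TEXT.numericalize(inp)
    # The Learner model returns several values; keep only the predictions
    res, *_ = m(VV(idxs.transpose(0,1)))
    # Sample two distinct candidates, weighted by the exponentiated scores
    r = torch.multinomial(res[-1].exp(), 2)
    # Index 0 is normally <unk> in a torchtext vocab; if it comes up first, use the second draw
    if r.data[0] == 0:
        r = r[1]
    else:
        r = r[0]
    return TEXT.vocab.itos[to_np(r)[0]]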

In [29]:
m = learner.model
print(get_next_n('for thos', 1000))

for those and man seem power, interroganism we repro[uss, that equality, system in mankind us. as an awory"--of it stacgorificance,too love oftheir age, we have also, andweregour of every tly, and have abyss of th--the degree at errordious things? to assimilar of the belongs?--for the retencertain "develops.60. e: are long itbeing ill thinking an advantage ofks, very feel usual to nature whether unfavour that pathosising nature upon his thinking out of pro5ulimatic lase for its past the<pad>part with principle, every cultivity, and out of people rights of dinestates are been, always it finners tas matter, of the nature), would not also to the morality, whichin daring modesty, passian victage beinbe mlm, in    any 13gormands"at things, that is to a [intdo youngstill have (the earing of life to character the ideas what say there is criticism must ne suchindifference to them wnehose tale[able, and cover messed. i let us real completer that they schoadiss and overcontemplations--in thesymp

And now everything is working again.
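One detail in the fixed `get_next`: it draws two candidates and falls back to the second when the first has index 0. The likely reason, which is my assumption since the code doesn't say, is that index 0 is the unknown-word token in a default torchtext vocabulary, and we don't want to print it. The same idea in plain Python, with a hypothetical four-word vocabulary and a hypothetical helper `pick_skipping_unk`:

```python
import random

def pick_skipping_unk(weights, rng, unk_idx=0):
    # Draw two distinct indices, each in proportion to its weight
    idxs = list(range(len(weights)))
    first = rng.choices(idxs, weights=weights, k=1)[0]
    rest = [i for i in idxs if i != first]
    second = rng.choices(rest, weights=[weights[i] for i in rest], k=1)[0]
    # Fall back to the second draw when the first is the unknown token
    return second if first == unk_idx else first

itos = ["<unk>", "the", "of", "sense"]  # hypothetical vocabulary, <unk> at index 0
weights = [0.4, 0.3, 0.2, 0.1]          # hypothetical model weights

rng = random.Random(1)
picks = [pick_skipping_unk(weights, rng) for _ in range(500)]
print(sorted({itos[i] for i in picks}))
```

Because the two draws are distinct, at most one of them can be the unknown token, so the fallback is always a real word.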

## Word Level Models

The same problem and solution exist in word-level models. Using the IMDB model:

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

In [2]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

os.listdir(PATH)

['.ipynb_checkpoints',
 'imdb.vocab',
 'imdbEr.txt',
 'models',
 'README',
 'test',
 'tmp',
 'train']

In [3]:
spacy_tok = spacy.load('en')

In [4]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

In [5]:
bs=48; bptt=70

In [6]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

In [7]:
em_sz = 200  
nh = 500     
nl = 3       

In [8]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [9]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [10]:
learner.load('adam3_10_enc_py03_full_2')

In [11]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.tokenize(ss)]
t=TEXT.numericalize(s)

In [12]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

In [13]:
print(ss,"\n")
for i in range(500):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

of the film 's main character . the film is a bit of a mess , but it 's a good film . <eos> i saw this movie at the toronto film festival and it was a great movie . i was very impressed with the acting . i was very impressed with the acting . i was very impressed with the acting . i was very impressed with the acting . i think the actors were very good . i think the director did a great job of making the movie . i would recommend this movie to anyone who wants to see a good movie . <eos> i saw this movie at the toronto film festival and it was a great movie . i was very impressed with the acting . i was very impressed with the acting . i was very impressed with the acting . i was very impressed with the acting . i think the actors were very good . i think the director did a great job of making the movie . i would recommend this movie to anyone who wants to see a good movie . <eos> i saw this movie at th

Here the repeats are less obvious, but we can see that most of the predictions are variants of the same few sentences repeated over and over again. We can apply the same fix as before:

In [15]:
def proc_str(s): return TEXT.preprocess(TEXT.tokenize(s))
def num_str(s): return TEXT.numericalize([proc_str(s)])

In [16]:
m=learner.model

In [19]:
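def sample_model(m, s, l=50):
    t = num_str(s)
    # Batch size 1, dropout off, fresh hidden state
    m[0].bs=1
    m.eval()
    m.reset()
    res,*_ = m(t)
    print('...', end='')

    for i in range(l):
        # Sample the next word instead of always taking the top one
        r = torch.multinomial(res[-1].exp(), 2)
        if r.data[0] == 0:
            r = r[1]
        else:
            r = r[0]
        word = TEXT.vocab.itos[to_np(r)[0]]
        # Feed only the sampled word back in; the hidden state carries the context
        res, *_ = m(r[0].unsqueeze(0))
        print(word, end=' ')
    # Put the batch size back to what it was
    m[0].bs=bs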

In [20]:
sample_model(m, ss, l=500)

...part of the film was the final scene when the son wants owen to run away from others.<br /><br cronies did work with him on this cartoon . how this man can immersed guy and his family on a roses etc . , is the real reason i 'm still saying hurt a masterson 's must do it because he keeps from going ' through the eye . which was what i especially believe it is . and anymore : it was a wonderful mix where a person on earth could watch something so difficult to make . every reasoning about 9.5/10 there was never the emotion munez after . it was like the pollution that the people died in making the world and it is dutch movie you - guess you never want to see , anyone says or does it . i 'm shoving anyone back with this show ! . ! ! ! ! ! only diy and hoo ... showcase them for that great time joke and the show .... try to please 2 bucks and down out to get back to my favorite game now ! <eos> well thought i 'll give it a 9 , at least on it 's way . take me the time . the person who wrote

Now we're seeing much more variation in the output. On the subject of predictions: the word-level sampling above feeds the model a single word at a time and relies on the hidden state for context, while the character-level functions keep a rolling window of the most recent text and pass the whole window in for each prediction. We can adapt the character-level prediction functions to word models to do the same thing. I haven't played around with this enough to understand its effect, but here it is.

In [25]:
def get_next(inp):
    idxs = num_str(inp)
    res, *_ = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(res[-1].exp(), 2)
    if r.data[0] == 0:
        r = r[1]
    else:
        r = r[0]
    return TEXT.vocab.itos[to_np(r)[0]]

In [26]:
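def get_next_n(inp, n):
    res_lst = inp.split(' ')
    c = None  # last sampled word, kept for the debug output below
    try:
        for i in range(n):
            c = get_next(inp)
            res_lst += [c]
            # Slide the window: drop the oldest word, append the new one
            inp_lst = inp.split(' ')
            inp_lst += [c]
            inp = ' '.join(inp_lst[1:])
        return ' '.join(res_lst)
    except Exception:
        # Dump the state for debugging, then re-raise instead of silently returning None
        print(res_lst)
        print(c)
        print(inp)
        raise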

In [27]:
m = learner.model
m.eval();

In [28]:
print(get_next_n(ss, 500))

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best to feel - good heartedness . the untouchables : * * ( # ) unborn minions with the leather legs for some way ( popularity ! ) ! <eos> networks might be too free to do.<br /><br />there were so many things to be found in our world but the only reason why i liked it was because i took it first to fireball his guru . that 's the promote . i think joel fail to explain why he soured the tag line at the end of the film but it 's all set up . just ran out in 1948 , only disorientation . this film is so good its obvious what would helping the writers live their shocking.<br /><br />it 's been a long time since i 'd out to see this movie . it 's brilliant . aside from being stupid , it 's a refreshing exercise than my head belgian . the casting is bad . the story does r auspicious , out.<br /><br />too many different imagery . i feel that this will be seeing rather for correct words for someone who rises above its 

### So what does this mean?

From what I understand, `torch.topk` returns the values and indices of the n largest elements in the output vector (the code above uses only the indices), while `torch.multinomial` draws n indices at random, using the elements of the vector as weights in a probability distribution. I think the model learns that certain words come up all the time, so always taking the top-valued output leads to loops. The randomness added by `torch.multinomial` leads to a much more varied output from the RNN.
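The difference between the two selection rules can be imitated in plain Python. This is a sketch of the behaviour as I understand it, not the PyTorch implementation. `topk_indices` and `multinomial_indices` are hypothetical helper names, and the four probabilities are made up.

```python
import math
import random

def topk_indices(scores, k):
    # Indices of the k largest scores, largest first (fully deterministic)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def multinomial_indices(weights, n, rng):
    # Draw n distinct indices, each chosen in proportion to its weight
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(n):
        i = rng.choices(remaining, weights=[weights[j] for j in remaining], k=1)[0]
        picked.append(i)
        remaining.remove(i)
    return picked

# Made-up log-probabilities for a four-token vocabulary
log_probs = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
weights = [math.exp(lp) for lp in log_probs]  # same role as res[-1].exp()

rng = random.Random(42)
top = [tuple(topk_indices(log_probs, 2)) for _ in range(1000)]
draws = [multinomial_indices(weights, 2, rng)[0] for _ in range(1000)]

print(set(top))                             # the same pair on every call
print([draws.count(i) for i in range(4)])   # counts roughly follow the weights
```

Exponentiating before sampling also works when the scores are unnormalized, since only the relative sizes of the weights matter for the draw.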