# Vanilla wikitext generation with pre-trained model

In [2]:
from fastai_old.text import *
import html
import spacy 

spacy.load('en')

<spacy.lang.en.English at 0x1c24a75c88>

In [3]:
DATA_PATH=Path('data/')
PATH = DATA_PATH/'aclImdb'

BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag


We first load a pre-trained language model. Our source LM is the WikiText-103 LM created by Stephen Merity at Salesforce Research. [Link to dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/).
The WikiText-103 language model (an AWD-LSTM) has already been pre-trained, and the weights can be downloaded here: [Link](http://files.fast.ai/models/wt103/). Our target LM is the style LM.

In [4]:
# ! wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/

The pre-trained LM weights have an embedding size of 400, 1150 hidden units, and just 3 layers. The target LM must use the same values, otherwise the weight tensors will not have the shapes the model expects and cannot be loaded.

In [5]:
em_sz,nh,nl = 400,1150,3
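To make the shape-matching requirement concrete, here is a minimal sketch that compares parameter shapes between a model and a checkpoint. The parameter names (`encoder.weight`, `rnn0.weight_ih`), the vocabulary size, and the helper `find_shape_mismatches` are all hypothetical and chosen for illustration; they are not the names used inside the actual weights file.

```python
# Hypothetical parameter names and a made-up vocabulary size, for illustration only.
def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return the parameter names whose shapes differ or that are missing on either side."""
    names = sorted(set(model_shapes) | set(checkpoint_shapes))
    return [n for n in names if model_shapes.get(n) != checkpoint_shapes.get(n)]

em_sz, nh, vocab = 400, 1150, 1000  # vocab is a placeholder, not the real vocabulary size

# An LSTM layer stacks four gates, so its input-to-hidden matrix has 4 * nh rows.
checkpoint = {"encoder.weight": (vocab, em_sz), "rnn0.weight_ih": (4 * nh, em_sz)}
matching = {"encoder.weight": (vocab, em_sz), "rnn0.weight_ih": (4 * nh, em_sz)}
too_small = {"encoder.weight": (vocab, 300), "rnn0.weight_ih": (4 * nh, 300)}

print(find_shape_mismatches(matching, checkpoint))   # []
print(find_shape_mismatches(too_small, checkpoint))  # ['encoder.weight', 'rnn0.weight_ih']
```

A model built with an embedding size of 300 disagrees with the checkpoint on every tensor that touches the embedding dimension, which is the kind of mismatch that makes loading fail.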

In [6]:
PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'

In [7]:
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)

In [8]:
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})

Let's make a language model with no data so we can just play with the generic model.

In [9]:
wd=1e-7
bptt=70
bs=52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

In [10]:
trn_lm = np.array([[0]])
val_lm = np.array([[0]])

print(val_lm.shape)

vs=len(itos2)

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

(1, 1)


In [11]:
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

In [12]:
learner= md.get_model(opt_fn, em_sz, nh, nl, 
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.model.load_state_dict(wgts)

In [16]:
m=learner.model

Here's a function for generating text.

I should explore other ways of decoding, like:

In [29]:
#nexts = torch.topk(res[-1], 10)[1]
#nexts = torch.multinomial(res[-1].exp(), 10)
#nexts = res[-1].topk(10)[1]
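The commented-out lines above point at two alternatives to always taking the single best token: restricting the choice to the top k candidates, and sampling from the whole distribution. The sketch below shows the three strategies on a toy distribution in plain Python, treating the scores as log-probabilities (as the `.exp()` call above suggests). The function names are made up for this illustration and are not part of the model's API.

```python
import math
import random

def greedy(log_probs):
    """Index of the single most likely token."""
    return max(range(len(log_probs)), key=lambda i: log_probs[i])

def sample_multinomial(log_probs, rng):
    """Draw one index with probability proportional to exp(log_prob)."""
    weights = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(log_probs)), weights=weights, k=1)[0]

def sample_top_k(log_probs, k, rng):
    """Keep the k most likely tokens, then sample among them by their probabilities."""
    top = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)[:k]
    weights = [math.exp(log_probs[i]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so repeated runs draw the same samples
log_probs = [math.log(p) for p in (0.05, 0.5, 0.3, 0.15)]  # toy next-token distribution

print(greedy(log_probs))                   # 1
print(sample_top_k(log_probs, 2, rng))     # either 1 or 2
print(sample_multinomial(log_probs, rng))  # any index from 0 to 3
```

Greedy decoding is deterministic and tends to loop on high-frequency phrases; the two sampling variants trade some coherence for variety.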

In [32]:
def generate_text(m, s, l=20):
    m[0].bs=1  # Set batch size to 1
    m.eval()  # Turn off dropout
    m.reset()  # Reset hidden state
    m[0].bs=bs  # Put the batch size back to what it was

    ss = s.lower().split()
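    # stoi2 returns -1 for out-of-vocabulary words, which is not a valid embedding index.
    # Map those to 0, assumed to be the unknown token (it is the index skipped below).
    si = [max(stoi2[w], 0) for w in ss]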
    t = torch.autograd.Variable(torch.LongTensor(np.array([si])))
    
    res,*_ = m(t)

    print(s,"\n")
    for i in range(l):
        n = res[-1].topk(2)[1]  # indices of the two highest-scoring next tokens
        n = n[1] if n.data[0]==0 else n[0]  # if the best token is index 0, use the runner-up
        print(itos2[n], end=' ')
        res,*_ = m(n.unsqueeze(0).unsqueeze(0))
    print('...')
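The selection step inside the loop takes the two highest-scoring tokens and falls back to the runner-up whenever the winner is index 0. A minimal standalone sketch of that rule, assuming index 0 is the unknown-word token (the helper name `best_known_token` is hypothetical):

```python
def best_known_token(scores, unk_idx=0):
    """Return the highest-scoring index, or the second highest if the top one is unk_idx."""
    top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    return top2[1] if top2[0] == unk_idx else top2[0]

print(best_known_token([0.9, 0.1, 0.3]))  # 2, because index 0 wins but is skipped
print(best_known_token([0.1, 0.7, 0.2]))  # 1, the plain argmax
```

Because the choice is always the argmax (or its runner-up), generation is fully deterministic, which is why running the function twice on the same prompt prints the same text.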

In [34]:
generate_text(m, "the movie", l=50)

the movie 

, and the other two , the " u_n " and " u_n " , were the only two of the three to be able to use the same name . 
  the first two of the three were the first to be released . the first was the " ...


Let's try saving the model so we can load it quickly later. It is saved in **aclImdb/models** because I set PATH to aclImdb, and I think the learner just puts things in the models directory under PATH.

In [37]:
learner.save('pretrained')

In [38]:
generate_text(m, "the movie", l=50)

the movie 

, and the other two , the " u_n " and " u_n " , were the only two of the three to be able to use the same name . 
  the first two of the three were the first to be released . the first was the " ...
