# Notebook 3: Generation

In this notebook, I generate text using the saved model from the notebook `2-Training`.

## Setup

First I import the libraries I need:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import re
import math

I load the saved model from the previous notebook:

In [2]:
model = torch.load('saves/model')
model

LangModel(
  (emb): Embedding(3001, 200)
  (lstm1): LSTM(200, 300, batch_first=True)
  (lstm2): LSTM(300, 200, batch_first=True)
  (lin): Linear(in_features=200, out_features=3001, bias=True)
)

I also need the `TokTransform` object which translates between text tokens and tensors of numbers, and which stores the vocab:

In [3]:
import pickle
with open('saves/tok_tfm.p', 'rb') as f:
    tok_tfm = pickle.load(f)

## Generation Methods

The first function used for text generation takes an input tensor of token ids and appends `gen_len` newly sampled ids to the end of it, one at a time:

In [4]:
def generate(sm_base, gen_len, vocab_sz, inp):
    with torch.no_grad():
        for _ in range(gen_len):
            batched = inp[None,:] # add batch dimension
            preds = model(batched)
            logits = preds[0,-1,:] # get only the last predicted token
            logits = logits.numpy() # convert to numpy for weighted random choice functionality
            logits = logits[1:] # don't predict xxunk (position 0)
            exped = sm_base**logits # like softmax, but with an adjustable base instead of e
            probs = exped / exped.sum() 
            new = np.random.choice(np.arange(1, vocab_sz), size=1, p=probs) # don't predict xxunk
            new_t = torch.tensor(new)
            inp = torch.cat([inp, new_t])
        return inp

`sm_base` sets the base of the softmax-like function that turns the logits into the probability distribution from which each token is randomly selected. The standard softmax function uses e as the base. Lowering the base flattens the distribution, which makes the selection more random, while increasing the base exaggerates the differences between probabilities.
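Since `sm_base**logits` equals `exp(logits * ln(sm_base))`, changing the base should be equivalent to the usual softmax with a temperature of `1 / ln(sm_base)`. The following is a minimal sketch of that equivalence, not part of the original notebook: the helper names `base_softmax` and `temperature_softmax` and the logits are made up for illustration.

```python
import numpy as np

def base_softmax(logits, sm_base):
    # Same computation as in `generate`: raise the base to each logit, then normalize
    exped = sm_base ** logits
    return exped / exped.sum()

def temperature_softmax(logits, temperature):
    # Standard softmax (base e) over logits divided by a temperature
    scaled = logits / temperature
    exped = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exped / exped.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical logits, for illustration only

for sm_base in (3, 5, 10, 100):
    temperature = 1 / np.log(sm_base)
    p_base = base_softmax(logits, sm_base)
    p_temp = temperature_softmax(logits, temperature)
    assert np.allclose(p_base, p_temp)  # both formulations give the same distribution
    print(sm_base, round(temperature, 3), p_base.round(3))
```

Under this reading, a base of e corresponds to a temperature of 1, and every base larger than e gives a sharper distribution than the standard softmax.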

In [5]:
def make_text(sm_base,gen_len,inp):
    inp = inp.split(' ')
    inp = tok_tfm.encode(inp)
    output = generate(sm_base, gen_len, tok_tfm.count, inp)
    joined = ' '.join(tok_tfm.decode(output))
    fixed = re.sub(r' ([.,?:;’”])', '\\1', joined)
    fixed = re.sub(r'([“‘]) ', '\\1', fixed)
    fixed = re.sub(r'’ s', '’s', fixed)
    return fixed
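To see what the three substitutions in `make_text` do, here is a small sketch that applies them to a hand-written string. The helper name `tidy` and the sample sentence are assumptions for illustration. The sample imitates space-joined tokens and is not real model output.

```python
import re

def tidy(joined):
    # Same three substitutions as in `make_text`, applied in the same order
    fixed = re.sub(r' ([.,?:;’”])', '\\1', joined)  # drop the space before closing punctuation
    fixed = re.sub(r'([“‘]) ', '\\1', fixed)        # drop the space after an opening quote
    fixed = re.sub(r'’ s', '’s', fixed)             # rejoin possessives left as "’ s"
    return fixed

# Hypothetical token stream joined with single spaces
sample = 'the soul ’ s unity , he says , is “ simple ” .'
print(tidy(sample))
```

One caveat of the last pattern: it also matches a closing single quote followed by any word that starts with "s", so such a word would be glued onto the quote.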

Next I seed NumPy's random number generator, which is the only source of randomness in this notebook, so that the results will be the same each time it is run:

In [6]:
np.random.seed(1)

## Examples

A low softmax base will result in more randomness in the generated text:

In [7]:
make_text(sm_base=3,gen_len=50,inp="the")

'the necessity of, and in the world of sense the whole? is properly, on the internal unity of simple s in relation to each other, and criticism, as primal being does not presuppose use, à posteriori as a explanation of time must be subject,'

There are a lot of random words, but not much repetition. Even with the high randomness, the generated text shows some sense of intelligence rather than complete randomness.

With slightly less randomness, we get a more coherent-sounding text:

In [8]:
make_text(sm_base=5,gen_len=50,inp="the")

'the conception of the understanding, in which the form of the understanding is not the cause of the soul in space, is only in the subjective conditions of time. for if we do not possess any relation to the most whole. it is evident that, as'

The first sentence seems like it could have been written by a philosopher. There is a little more repetition now, for example of the word "understanding". The second sentence ends abruptly, but that could just be the fault of a randomly selected period token.

With even less randomness, we get a lot more repetition of words:

In [9]:
make_text(sm_base=10,gen_len=50,inp="the")

'the conception of a thing which is the synthesis of perception, and consequently the conception of a supreme being, in the same time, which is not the empirical condition of the conception. the former is an empirical conception, in which the conception of a thing in'

The consistent return to the word "conception" seems to stop the generated text from getting anywhere new. 

Finally, with very little randomness, we get something more coherent, but that might be due to memorization of pieces of the original text:

In [10]:
make_text(sm_base=100,gen_len=50,inp="the")

'the conception of a thing in general, which is not a thing in itself, and which is not a necessary being. but this is not a transcendental idea, which is not a thing in itself, but only in the sphere of experience, and not as'

But even if there is some memorization of phrases, we still get a very different result when using the same settings again:

In [11]:
make_text(sm_base=100,gen_len=50,inp="the")

'the same time. but this is not an object of the subject, and is not the objective validity of the possibility of a thing which is not an object, but only in the sphere of experience, and the same with the conception of a thing in general'

## Final Thoughts

I didn't expect to get amazing results by training from scratch on a single book for only a few epochs, but I'm still amazed by how pseudo-intelligent the generated text can seem at times. I limited the model's vocabulary to 3000 tokens in order to get a little less randomness in word selection during text generation, and this seems to have kept the model from picking even more random-seeming words.

I wonder how much information about the workings of language could be learned from this one book, if I used a larger model with some regularization and trained it for much longer. 

I could also try other forms of token selection for the generation process, such as only selecting from the top n most likely next tokens. I think my method of smoothing or exaggerating the softmax likelihoods results in a good amount of randomness while also keeping some coherence.
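As a rough idea of what top-n selection might look like, here is a minimal sketch that keeps only the n largest logits and samples among them. It is not taken from the notebook: the function name `sample_top_n`, the use of `np.random.default_rng`, and the logits are all assumptions, and it ignores the index shift that `generate` applies to skip xxunk.

```python
import numpy as np

def sample_top_n(logits, n, rng):
    """Sample one index from the n highest-scoring entries of `logits`."""
    top_idx = np.argsort(logits)[-n:]              # indices of the n largest logits
    top_logits = logits[top_idx]
    exped = np.exp(top_logits - top_logits.max())  # stable softmax over the survivors
    probs = exped / exped.sum()
    return int(rng.choice(top_idx, p=probs))       # every other token has probability 0

rng = np.random.default_rng(1)
logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3])      # hypothetical logits, for illustration only

draws = [sample_top_n(logits, n=2, rng=rng) for _ in range(20)]
print(draws)  # only indices 1 and 3 can appear
```

This could in principle be combined with the adjustable base by replacing `np.exp(...)` with `sm_base ** top_logits`, so that the cutoff removes unlikely tokens while the base controls randomness among the rest.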