# Jane Austen Generator
In honor of Valentine's Day, a Jane Austen generator.

This uses PyTorch, torchtext, and the fastai library: https://github.com/fastai/fastai

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Language modeling

### Data

In [6]:
# Text files of several Jane Austen novels, split here into training and validation sets:
JANE_PATH='./language_model/austen/'

TRN_PATH = 'train/'
VAL_PATH = 'test/'

JANE_TRN = f'{JANE_PATH}{TRN_PATH}'
JANE_VAL = f'{JANE_PATH}{VAL_PATH}'

In [15]:
trn_files = !ls {JANE_TRN} 
trn_files   

['friendship.txt',
 'input.txt',
 'letters.txt',
 'mansfield.txt',
 'northanger.txt',
 'persuasion.txt',
 'pride.txt',
 'sense.txt',
 'susan.txt']

An example passage from one of the training texts:

In [30]:
line = !cat {JANE_TRN}{trn_files[5]}
line[7000:7005]

['thought of his cruel conduct towards Mrs Smith, she could hardly bear',
 'the sight of his present smiles and mildness, or the sound of his',
 'artificial good sentiments.',
 '',
 'She meant to avoid any such alteration of manners as might provoke a']

In [32]:
!find {JANE_TRN} -name '*.txt' | xargs cat | wc -w

724695


In [33]:
!find {JANE_VAL} -name '*.txt' | xargs cat | wc -w

160356


In [42]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)

Parameters: `bs` (batch size) and `bptt` (backpropagation through time: how many tokens are processed at a time in each row of the mini-batch)

Making bptt higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.
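To make `bs` and `bptt` concrete, here is a minimal sketch of how a token stream could be cut into `bs` rows and then into chunks of at most `bptt` tokens. The helper names `batchify` and `bptt_chunks` are hypothetical, and this is a simplification for illustration, not the loader that fastai actually uses (which may differ in layout and in how it picks each chunk length).

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` equal-length rows (leftover tokens are dropped)."""
    n = len(tokens) // bs
    return [tokens[r * n:(r + 1) * n] for r in range(bs)]


def bptt_chunks(rows, bptt):
    """Yield (input, target) pairs; each target is the input shifted by one token."""
    n = len(rows[0])
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y


# Toy stream of 20 token ids, 2 rows, chunks of up to 4 tokens
rows = batchify(list(range(20)), bs=2)
chunks = list(bptt_chunks(rows, bptt=4))
first_x, first_y = chunks[0]
```

A larger `bptt` means each chunk carries more context through the recurrent layers, which is where the extra time and memory cost comes from.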

In [35]:
bs=32; bptt=120

Parameters: `embedding_size` (embedding vector size), `nhidden` (number of hidden activations per layer), `nlayers` (number of layers)

In [55]:
embedding_size = 300; nhidden = 500; nlayers = 3       

In [56]:
optimization_function = partial(optim.Adam, betas=(0.7, 0.99))

Creating the language model data object. The model built from it starts with random word embedding matrices; training the model gives them meaningful values.

In [45]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(JANE_PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=8)
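The `min_freq=8` argument controls which words get their own entry in the vocabulary. The sketch below shows the general idea with a hypothetical `build_vocab` helper: tokens seen fewer than `min_freq` times are mapped to a shared unknown token at index 0. It is a simplified stand-in, not torchtext's own vocabulary code, and the special-token name is an assumption.

```python
from collections import Counter


def build_vocab(tokens, min_freq, unk="<unk>"):
    """Keep tokens that occur at least `min_freq` times; everything else maps to `unk`."""
    counts = Counter(tokens)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))   # most frequent first, ties alphabetical
    itos = [unk] + kept                            # index -> string
    stoi = {t: i for i, t in enumerate(itos)}      # string -> index
    return itos, stoi


def numericalize(tokens, stoi):
    """Turn tokens into integer ids, using 0 for anything outside the vocabulary."""
    return [stoi.get(t, 0) for t in tokens]


corpus = "the cat and the dog and the bird".split()
itos, stoi = build_vocab(corpus, min_freq=2)
ids = numericalize(corpus, stoi)
```

Raising `min_freq` shrinks the vocabulary (and the embedding matrix) at the cost of sending more rare words to the unknown token.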

Pickle `TEXT` so that the same tokenization and vocabulary can be reused later on.

In [47]:
pickle.dump(TEXT, open(f'{JANE_PATH}models/TEXT.pkl','wb'))
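A slightly more defensive way to save and restore an object is sketched below. It uses the standard-library `pickle` as a stand-in for `dill`, a plain dictionary as a stand-in for `TEXT`, and two hypothetical helpers, `save_object` and `load_object`. The helpers create the target directory if it is missing and close the file handle when done.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle `obj` to `path`, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the TEXT field: any picklable object works the same way
vocab = {"itos": ["<unk>", "the", "and"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "TEXT.pkl")
    save_object(vocab, path)
    restored = load_object(path)
```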

In [54]:
print(f'Batches: {len(md.trn_dl)}\nUnique Word Tokens: {md.nt}\nTokens in Training Set: {len(md.trn_ds[0].text)}')

Batches: 223
Unique Word Tokens: 4932
Tokens in Training Set: 860595


### Train

The dropout values may need tuning. The other hyperparameters should be fine as is.

In [119]:
learner = md.get_model(optimization_function, embedding_size, nhidden, nlayers,
               dropouti=0.07, dropout=0.07, wdrop=0.15, dropoute=0.025, dropouth=0.055)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [120]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                              
    0      5.645441   5.478835  
    1      5.041722   4.861283                              
    2      4.778134   4.709964                              
    3      4.55439    4.475187                              
    4      4.326685   4.326903                              
    5      4.198656   4.264                                 
    6      4.142878   4.243988                              
    7      4.150627   4.20584                               
    8      4.030343   4.135465                              
    9      3.930853   4.083008                              
    10     3.843416   4.049754                              
    11     3.77174    4.027615                              
    12     3.732222   4.015193                              
    13     3.702667   4.004896                              
    14     3.675034   4.005867                              



[4.0058665]
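The 15 epochs in the table above follow from the `fit` arguments: 4 cycles, starting at `cycle_len=1` and doubling each time with `cycle_mult=2`. The sketch below works through that arithmetic and adds a cosine-shaped learning-rate curve for a single cycle. The helper names are hypothetical, and the cosine form is an assumption about how the rate is annealed within a cycle, not a copy of fastai's scheduler.

```python
import math


def cycle_lengths(n_cycles, cycle_len, cycle_mult):
    """Length in epochs of each cycle when every cycle is `cycle_mult` times the previous one."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


def cosine_lr(base_lr, progress):
    """Learning rate at `progress` (0.0 = cycle start, 1.0 = cycle end) under cosine annealing."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))


lengths = cycle_lengths(n_cycles=4, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)               # matches epochs 0..14 in the table
lr_mid_cycle = cosine_lr(3e-3, 0.5)       # halfway through a cycle
```

Under this reading, new cycles start at epochs 1, 3, and 7, which would be consistent with the small rise in training loss at epoch 7.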

In [121]:
learner.save_encoder('jane_0_enc')       # Save the newly determined word embeddings

In [122]:
learner.load_encoder('jane_0_enc')

In [123]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20)

epoch      trn_loss   val_loss                              
    0      3.683675   4.006124  
    1      3.777827   4.024555                              
    2      3.726166   4.011042                              
    3      3.664113   4.004571                              
    4      3.60846    3.98568                               
    5      3.546603   3.987119                              
    6      3.519214   3.979949                              
    7      3.461685   3.983063                              
    8      3.419532   3.980531                              
    9      3.381018   3.986598                              
    10     3.34568    3.99754                               
    11     3.296165   4.00575                               
    12     3.278288   4.007909                              
    13     3.247021   4.00965                               
    14     3.221758   4.017701                              
    15     3.190748   4.019388                      

[4.022858]
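The losses are easier to interpret as perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats (an assumption, since the notebook does not state it), perplexity is simply `exp(loss)`. The sketch below applies that to the validation losses listed in the table above and finds the epoch with the lowest value.

```python
import math


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)


# Validation losses for epochs 0..15, copied from the table above
val_losses = [4.006124, 4.024555, 4.011042, 4.004571, 3.98568, 3.987119,
              3.979949, 3.983063, 3.980531, 3.986598, 3.99754, 4.00575,
              4.007909, 4.00965, 4.017701, 4.019388]

best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
best_ppl = perplexity(val_losses[best_epoch])
last_ppl = perplexity(val_losses[-1])
print(f"best epoch: {best_epoch}, perplexity {best_ppl:.1f}; last shown epoch: {last_ppl:.1f}")
```

Validation loss bottoms out around epoch 6 while training loss keeps falling, which suggests the model starts to overfit during this second run.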

In [124]:
learner.save_encoder('jane_1_enc')

In [125]:
learner.load_encoder('jane_1_enc')

In [17]:
pickle.dump(TEXT, open(f'{JANE_PATH}models/TEXT.pkl','wb'))   # JANE_PATH, since PATH is never defined in this notebook

## Story Generator

In [100]:
FAKE_JANE = f'{JANE_PATH}fake/'   # JANE_PATH already ends with a slash

In [137]:
m=learner.model
ss= '"Why, good Morning Mr. Darcy!" she exclaimed. \n"Hello my dearest Elinor", he said.\n'
num_words = 250
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

'" Why , good Morning Mr. Darcy ! " she exclaimed . \n " Hello my dearest Elinor " , he said . \n'

In [149]:
beam = 2
more_random = True

print_lead = ""
cap = True
skip_space = True

print(print_lead, end = '')
    
m[0].bs=1
m.eval()
m.reset()
res,*_ = m(t)
m[0].bs=bs

out = ""
for i in range(num_words):
    [ps, n] =res[-1].topk(beam)
    if more_random or i<beam or i%4 == 0:        
        w = n[np.random.randint(0, beam)]      # Choose a word out of the most likely 'beam' options (to add variety)
    else:
        w = n[0]                               # Choose the single most likely word
    while w.data[0] == 0:
        w = n[np.random.randint(0, beam)]
    wstr = TEXT.vocab.itos[w.data[0]]
    out = out + wstr + " "
    # Here I add some minor text formatting. The following doesn't change anything substantive from the model, but reconnects
    # things like ca n't, which the tokenizer pulled apart. It also notes to capitalize letters after . ? ! "
    if wstr=='i': wstr='I'
    if wstr=='nt': wstr = 'not'
    if cap:
        wstr = wstr.capitalize()
    if wstr in ['.', '?', '!', '“']: 
        cap=True
    elif wstr not in ['”', '"']:
        cap=False
    if skip_space or wstr in ['.', ',', ';', "'", '”', "n't", "n’t", "’ll", "'ll", "’s", "’ve", "'ve", "’d", "’re", "’m", "'s", "'d", "?", "!", "'re", "'m"]:
        print(wstr, end='')
        skip_space = False
    elif wstr=='“':
        print(f'\n      {wstr}', end='')
        skip_space = True
    elif wstr in ['to-', '-', 'good-']:
        print(wstr, end='')
        skip_space = True
    elif wstr=='"':
        pass
    else:
        print(f' {wstr}', end='')
              
    res,*_ = m(w[0].unsqueeze(0))
print('...')
i=0
with open(f'{FAKE_JANE}{i}.txt', "a") as text_file:   # the file is closed automatically
    text_file.write(out)
    text_file.write('\n\n')

" With the first moment of hearing,
      “That is a very great thing to you, as I am now, to be sure, and that I should not be able in any way that way.”
      “Oh! My dear,” replied her ladyship,
      “You have not a doubt, fanny, that I should be so glad to have you go, and that I am sure you would not have gone so far.” And with a look of voice, she added, in an air,
      “That you should be able to go to town.”
      “Yes; but you must not think me so ill-used as you can do, and that I can not bear it; and I am sure you have no reason to suppose it possible that you should not have the smallest objection to your own. But I am sure you would not be able to think it worth your doing very much, I assure.”
      “Yes; I have not a doubt of your being so very agreeable.” Fanny was too much oppressed by the idea of being so much more, and that it would be so. She could only say that it had not occurred to her that the subject had been so long delayed. The first time which had passed o
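The formatting step in the generation loop can be pulled out into a small function, which makes it easier to test on its own. The sketch below is a reduced version of those rules under a hypothetical name, `detokenize`: it handles only straight apostrophes, a short list of contractions, and sentence-initial capitalization, and it does not try to reproduce the quotation-mark handling of the loop above.

```python
# Tokens that attach to the previous word without a space (a reduced list)
NO_SPACE_BEFORE = {".", ",", ";", "?", "!", "'", "n't", "'ll", "'s", "'ve", "'d", "'re", "'m"}
SENTENCE_END = {".", "?", "!"}


def detokenize(tokens):
    """Join lowercase tokens into readable text with basic spacing and capitalization."""
    pieces = []
    capitalize_next = True
    for tok in tokens:
        word = "I" if tok == "i" else tok
        if capitalize_next and word[:1].isalpha():
            word = word.capitalize()
            capitalize_next = False
        if tok in SENTENCE_END:
            capitalize_next = True          # the next word starts a new sentence
        if pieces and tok not in NO_SPACE_BEFORE:
            pieces.append(" ")
        pieces.append(word)
    return "".join(pieces)


text = detokenize(["she", "could", "n't", "go", ".", "i", "am", "sure", "."])
print(text)
```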

***
# Favorite results:

Is a good-natured, good-humoured young man, and a great deal more to the purpose than the rest. I have been thinking of him, but he has not been so much pleased with his character, as to be a man of good sense. I am not sorry for the idea. I am not sorry to see you again. I am sure he has been so much more attached to me, and that I have no doubt of the very circumstance that is ever to be made of. I have had the pleasure of receiving a letter from me. I have had no doubt that I should not have been in my life. I have been very kind and very kind in the world to have you so much in love with you. You have been a very good kind of woman, but as I am not in the least fatigued by the world. I am sure you have been so much more attached to her, and that I can hardly help thinking it all. I am not afraid that the whole party are in town, and that we are all very much in the habit. We have been very much in town; and I have no reason for writing, and that I am very glad you will be able in the same time to see you...

***

Is a good-humoured, handsome, handsome girl, and not like a man of fortune. I have no idea that I should be able to do any thing in the world.”<br>
     &nbsp;&nbsp;&nbsp; “I am afraid you will be very happy.”<br>
     &nbsp;&nbsp;&nbsp; “Yes; but you know, that I have no reason to fear that I should be so happy as to make the most of it, and that you will be able to make the greatest part.” Fanny could not help smiling. She could only say, that she had no longer a thought of her.<br>
     &nbsp;&nbsp;&nbsp; “You have no idea of your being so very much in your power, I assure you.” She then added :<br>
     &nbsp;&nbsp;&nbsp; “I have not a notion I can have a great idea to be in the country.” Fanny could not help smiling. She was sure she had not seen it; and, after a moment's recollection, said to elizabeth :<br>
     &nbsp;&nbsp;&nbsp; “You are mistaken. I am not afraid of seeing him. I am sure you will not have the least idea of the matter. I have no doubt that I can not be so happy as you can do.” She then went away, but she could only say that he had been in the room; but he had not seen her before,...<br>

***

With the first moment of hearing,<br>
      &nbsp;&nbsp;&nbsp;“That is a very great thing to you, as I am now, to be sure, and that I should not be able in any way that way.”<br>
      &nbsp;&nbsp;&nbsp;“Oh! My dear,” replied her ladyship,<br>
      &nbsp;&nbsp;&nbsp;“You have not a doubt, fanny, that I should be so glad to have you go, and that I am sure you would not have gone so far.” And with a look of voice, she added, in an air,<br>
      &nbsp;&nbsp;&nbsp;“That you should be able to go to town.”<br>
      &nbsp;&nbsp;&nbsp;“Yes; but you must not think me so ill-used as you can do, and that I can not bear it; and I am sure you have no reason to suppose it possible that you should not have the smallest objection to your own. But I am sure you would not be able to think it worth your doing very much, I assure.”<br>
      &nbsp;&nbsp;&nbsp;“Yes; I have not a doubt of your being so very agreeable.” Fanny was too much oppressed by the idea of being so much more, and that it would be so. She could only say that it had not occurred to her that the subject had been so long delayed. The first time which had passed on her side...

***
# Is it Jane?

First, create text files of fake Jane Austen. Next, build a model that classifies input text as fake or genuine. Finally, loop through 100 generated passages and output the one the classifier scores highest. (The text generator sometimes produces a plausible paragraph and sometimes a mess, so the goal is to output the best of 100 tries.)
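The best-of-N idea can be sketched independently of the model. In the example below, `best_of_n`, `toy_generate`, and `toy_score` are hypothetical stand-ins: the real version would call the language model to generate each passage and the (not yet built) classifier to score it.

```python
def best_of_n(generate, score, n):
    """Generate `n` candidates and return the one with the highest score."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: a fixed list of "generated" passages and a crude score
samples = iter(["a mess of words",
                "it is a truth universally acknowledged",
                "so so so so"])


def toy_generate():
    return next(samples)


def toy_score(text):
    return len(set(text.split()))   # number of distinct words, purely illustrative


best = best_of_n(toy_generate, toy_score, n=3)
print(best)
```

Keeping generation and scoring as separate callables means the classifier can be swapped in later without touching the selection logic.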

In [128]:
filenum = 0
num_outputs = 20
beam = 2
more_random = True
    
m[0].bs=1
m.eval()
m.reset()
res,*_ = m(t)
m[0].bs=bs
for j in range(num_outputs):
    out = ""
    for i in range(num_words):
        [ps, n] =res[-1].topk(beam)
        if more_random or i<beam or i%4 == 0:        
            w = n[np.random.randint(0, beam)]      # Choose a word out of the most likely 'beam' options (to add variety)
        else:
            w = n[0]                               # Choose the single most likely word
        while w.data[0] == 0:
            w = n[np.random.randint(0, beam)]
        wstr = TEXT.vocab.itos[w.data[0]]
        out = out + wstr + " "
              
        res,*_ = m(w[0].unsqueeze(0))

    with open(f'{FAKE_JANE}{filenum}.txt', "a") as text_file:   # the file is closed automatically
        text_file.write(out)
        text_file.write('\n\n')
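Despite its name, `beam` here is not a beam search: the loop picks uniformly at random among the `beam` highest-scoring next words and re-draws when it lands on index 0, which is treated as the unknown token. A minimal sketch of that sampling step, using plain Python lists in place of tensors, is shown below. The helper `sample_top_k` is hypothetical, and it filters out the banned index up front rather than re-drawing, which is a small variation on the loop above.

```python
import random


def sample_top_k(scores, k, rng, banned=(0,)):
    """Pick uniformly among the `k` highest-scoring indices, skipping any banned ones."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    allowed = [i for i in ranked[:k] if i not in banned]
    if not allowed:
        raise ValueError("all top-k candidates are banned")
    return rng.choice(allowed)


rng = random.Random(0)
scores = [5.0, 1.0, 4.0, 3.0]                 # index 0 plays the role of the unknown token
only_choice = sample_top_k(scores, k=2, rng=rng)
wider_choice = sample_top_k(scores, k=3, rng=rng)
```

A larger `k` adds variety at the risk of less coherent text, while `k=1` reduces to always taking the most likely word.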

In [115]:
TEXT = pickle.load(open(f'{JANE_PATH}models/TEXT.pkl','rb'))

To be continued...