# Identify the Author

An example of classifying the speaker based on text patterns.  I first train a language model from scratch, 
and then use this to create a text classifier.  I train this classifier on novels from 5 different authors. I then present the
model with 6-line samples of new novels (by the same authors, but unseen by the model).  From that short sample, the model is able to guess the correct author 83% of the time, and the correct author is one of the top two guesses 95% of the time.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

## Train a Language Model

Here we could use a pretrained set such as word2vec, but since I have a large data set, I've found using domain-specific training as shown here yields better results.

### Data

In [2]:
PATH = './language_model/novels/'
TRAIN_LANGUAGE_MODEL = 'train_language_model/'
VAL_FOLDER = 'test/'

TRAIN = f'{PATH}{TRAIN_LANGUAGE_MODEL}'
VALIDATION = f'{PATH}{VAL_FOLDER}'

The language model training folder contains text files of several books from Austen, Baum, Dickens, Fitzgerald, and Wodehouse. I use this to build a language model.  These files are taken directly from www.gutenberg.org (I removed the boiler plate Gutenberg intro and the license agreement at the end, but otherwise the text files are untouched.)

In [3]:
all_files = !ls {TRAIN}
all_files[:5]

['beautiful-damned.txt',
 'bleak-house.txt',
 'christmas-carol.txt',
 'dorothy.txt',
 'emma.txt']

An example excerpt from a novel:

In [61]:
example_lines = !cat {TRAIN}beautiful-damned.txt
example_lines[20:25]

['immortality. Until the time came for this effort he would be Anthony',
 'Patch--not a portrait of a man but a distinct and dynamic personality,',
 'opinionated, contemptuous, functioning from within outward--a man who',
 'was aware that there could be no honor and yet had honor, who knew the',
 'sophistry of courage and yet was brave.']

I use the Torchtext dataloader library, and set TEXT as a data.Field, using a spacy tokenizer.

In [5]:
def substitute_break(x): return re.compile('<br />').sub("\n", x)

In [6]:
spacy_english = spacy.load('en')

In [7]:
def spacy_tokenizer(x): return [tok.text for tok in spacy_english.tokenizer(substitute_break(x))]

In [8]:
TEXT = data.Field(lower=True, tokenize=spacy_tokenizer)

Setting some variables

In [9]:
bs=32        # batch size
bptt=120     # number of words in each row of mini-batch (# words included in prediction of next word)

In [10]:
em_sz = 300  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [11]:
optimizer_function = partial(optim.Adam, betas=(0.7, 0.99))  # Optimizer will be Adam with slightly lowered momentum 
                                                             # (Default Adam momentum doesn't work well on RNNs)

In [12]:
FILES = dict(train=TRAIN_LANGUAGE_MODEL, validation=VAL_FOLDER, test=VAL_FOLDER)

LanguageModelData comes from the fastai library. It loads the training and validation text files and creates a suitable RNN model.  Here we ignore words that don't appear at least 8 times in the text corpus.

In [13]:
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=8)

Save TEXT so that we'll make sure words map to the same IDs when we later create the classifier model.

In [14]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Exploring the language model data. Here are the: # training batches; # unique tokens in the vocab; # training sentences

In [15]:
len(md.trn_dl), md.nt, len(md.trn_ds[0].text)

(738, 11858, 2838340)

### Train the Language Model

 Here I use the fastai library to create a model based on the language model data. 

I'm feeding the model sequences of words from the training text, and then asking it to predict the next word. I don't expect it to get very good at this task (it's not surprising that the loss stays relatively high), but in the process of getting better at this, it has to develop good embeddings for the words. I'll later use these embeddings for my main goal (the author classification problem).

In [17]:
learner = md.get_model(optimizer_function, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [18]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                              
    0      5.186721   5.202713  
    1      4.55629    4.625137                              
    2      4.376035   4.504276                              
    3      4.269179   4.362236                              
    4      4.088331   4.226201                              
    5      3.972325   4.17029                               
    6      3.916161   4.160203                              
    7      4.0351     4.174123                              
    8      3.950396   4.131284                              
    9      3.878697   4.113047                              
    10     3.810488   4.096276                              
    11     3.730453   4.080369                              
    12     3.671942   4.086648                              
    13     3.647436   4.087518                              
    14     3.612968   4.083753                              



[4.0837526]

In [19]:
learner.fit(3e-5, 1, wds=1e-6, cycle_len=5)

epoch      trn_loss   val_loss                              
    0      3.617367   4.084094  
    1      3.644177   4.081234                              
    2      3.616542   4.086579                              
    3      3.609883   4.083892                              
    4      3.617337   4.089175                              



[4.089175]

In [20]:
learner.save_encoder('novels_enc')

In [21]:
learner.load_encoder('novels_enc')

# Test with Text Generation

Here I take a look at the language model & check that it seems to have learned a decent sense of grammar and a reasonable word embedding.  I feed a "." as a primer, so that it'll likely start a new sentence.

In [23]:
m=learner.model
starter_text=""". """
num_words = 200
s = [spacy_tokenizer(starter_text)]
t=TEXT.numericalize(s)

Now I feed this input to the model and ask it to generate 200 words (at each step, I feed in the text it has already generated).  I introduce "beam" and "more_random" to encourage a bit of variety in the generated text.  (Instead of always picking the most likely word, it can choose from the "beam" most likely words.)

The rest is just for formatting.  I undo some of the tokenization (for example turning can 't back into can't).

In [24]:
beam = 2
more_random = True
print_lead = ""
cap = True
skip_space = True

print(print_lead, end = '')
    
m[0].bs=1
m.eval()
m.reset()
res,*_ = m(t)
m[0].bs=bs

for i in range(num_words):
    [ps, n] =res[-1].topk(beam)
    if more_random or i<beam or i%4 == 0:
        w = n[np.random.randint(0, beam)]
    else:
        w = n[0]
    while w.data[0] == 0:
        w = n[np.random.randint(0, beam)]
    wstr = TEXT.vocab.itos[w.data[0]]
    
    # From here on, I could just print wstr, but instead I've added a bit of formatting to make the output look better.
    # Mainly, I'm removing spaces, or capitalizing words after punctuation.
    
    if wstr=='i': wstr='I'
    if wstr=='nt': wstr = 'not'
    if cap:
        wstr = wstr.capitalize()
    if wstr in ['.', '?', '!', '“']: 
        cap=True
    elif wstr not in ['”', '"']:
        cap=False
    if skip_space or wstr in ['.', ',', ';', "'", '”', "n't", "n’t", "’ll", "'ll", "’s", "’ve", "'ve", "’d", "’re", "’m", "'s", "'d", "?", "!", "'re", "'m"]:
        print(wstr, end='')
        skip_space = False
    elif wstr=='“':
        print(f'\n      {wstr}', end='')
        skip_space = True
    elif wstr in ['to-', '-', 'good-']:
        print(wstr, end='')
        skip_space = True
    elif wstr=='"':
        pass
    else:
        print(f' {wstr}', end='')
              
    res,*_ = m(w[0].unsqueeze(0))

print('...')


The same, the way to the end of the night. I was not aware that I was going to make the thing a little more. I had been thinking that the thing would come out, but it seemed as if I had not been able. You're right, said mr. wooster, and I'll tell it, for I am sure I shall never have any idea of that. You are right. I said. I was not quite afraid. You are right. You are a man, my love. I do not think you will. I am sure I don't want to know what you mean. I'm not a-going on. I don't want to be a man, I suppose. But I am going to tell you, he said. I am going to the castle. I hadn't a word about him, and I had no doubt he would be so good as...


# Identify the Author

Now I use the word embeddings I've created to train a classifier.

I load the saved vocab from the language model, to ensure the same words map to the same IDs, if I've stepped away from this notebook.

In [25]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

First, I build a dataset that will feed the model a line from a novel, with the author's name as the label.  Splits is used to create a training/validation split.

In [26]:
class ShortAuthorsDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        for label in ['Austen', 'Baum', 'Dickens', 'Fitzgerald', 'Wodehouse']:
            for fname in glob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r') as f:
                    for line in f:
                        text = f.readline()
                        if text!='\n':
                            examples.append(data.Example.fromlist([text, label], fields))

        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

In [27]:
AUTHOR_LABEL = data.Field(sequential=False)
short_splits = ShortAuthorsDataset.splits(TEXT, AUTHOR_LABEL, PATH, train='train', test='test')

In [28]:
len(short_splits[0].examples), len(short_splits[1].examples)

(69186, 16466)

I look at an example from the training set:

In [29]:
t = short_splits[0].examples[1]
t.label, ' '.join(t.text)

('Austen',
 'to the rank of a baronet ’s lady , with all the comforts and consequences')

In [30]:
md2 = TextData.from_splits(PATH, short_splits, bs)

In [31]:
m2 = md2.get_model(optimizer_function, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m2.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)      # add some regularization (to avoid overfitting)
m2.load_encoder('novels_enc')                        # load the word vector model we trained earlier

In [32]:
m2.clip=25.
lrs=np.array([1e-5, 1e-5, 1e-4, 1e-4, 1e-3])

In [33]:
m2.freeze_to(-1)    # I expect my model is already decent, so I freeze most of the layers and finetune only the last one

In [34]:
m2.fit(lrs, 1, metrics=[accuracy])

epoch      trn_loss   val_loss   accuracy                     
    0      1.165594   1.560178   0.364947  



[1.5601777, 0.36494741101288103]

In [35]:
m2.unfreeze()

I train all the layers with a very small learning rate.  cycle_len is a great feature from the fastai library - it lowers the learning rate over the training epochs (with restarts at the start of each cycle) to achieve higher accuracy, without my needing to standby and babysit the learning rate manually

In [36]:
m2.fit(lrs/10, 3, metrics=[accuracy], cycle_len=2)

epoch      trn_loss   val_loss   accuracy                     
    0      1.119698   1.554734   0.360444  
    1      1.137379   1.551771   0.366741                     
    2      1.108979   1.544455   0.371595                     
    3      1.108624   1.54133    0.368487                     
    4      1.088771   1.550971   0.365831                     
    5      1.08967    1.554331   0.379834                     



[1.5543311, 0.3798341424141115]

38% accuracy is better than 1 in 5 chance, but it's not great. Next I'll try giving the model more than 1 line to work with.

# Same but with longer text samples

Presumably the model will have an easier time if I give it slightly longer text samples from the novels. Now I feed it 6 lines at a time.

In [37]:
class LongAuthorsDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        for label in ['Austen', 'Baum', 'Dickens', 'Fitzgerald', 'Wodehouse']:
            for fname in glob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r') as f:
                    for line in f:
                        text = ""
                        i=0
                        while i<6:              # Add 6 lines to text
                            t = f.readline()
                            if t=='EOF':
                                break
                            if t!='\n':         # Skip the line if it's blank
                                text += t
                                i+=1
                        examples.append(data.Example.fromlist([text, label], fields))

        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

In [38]:
AUTHOR_LABEL = data.Field(sequential=False)
long_splits = LongAuthorsDataset.splits(TEXT, AUTHOR_LABEL, PATH, train='train', test='test')

In [39]:
len(long_splits[0].examples), len(long_splits[1].examples)

(20504, 4919)

In [40]:
md3 = TextData.from_splits(PATH, long_splits, bs)

In [41]:
t = long_splits[0].examples[1]
t.label, ' '.join(t.text)

('Austen',
 'acquaintance as thought miss ward and miss frances quite as handsome as \n miss maria , did not scruple to predict their marrying with almost equal \n advantage . but there certainly are not so many men of large fortune in \n the world as there are pretty women to deserve them . miss ward , at the \n end of half a dozen years , found herself obliged to be attached to \n the rev. mr. norris , a friend of her brother - in - law , with scarcely any')

Here I've tried some different values of dropout, and I kept the most successful set. It's worth exploring more 
hyperparameters here to get even better results

In [42]:
m3 = md3.get_model(optimizer_function, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,     
            dropout=.01, dropouti=.1, wdrop=.1, dropoute=.005, dropouth=.1)
#            dropout=.2, dropouti=.5, wdrop=.6, dropoute=.1, dropouth=.5)
#            dropout=.05, dropouti=.3, wdrop=.3, dropoute=.01, dropouth=.15)      
#            dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3) 

m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)      # add some regularization (to avoid overfitting)
m3.load_encoder('novels_enc')                        # load the word vector model we trained earlier

In [43]:
m3.clip=25.
lrs=np.array([1e-4, 1e-4, 1e-3, 1e-3, 1e-2])

In [44]:
m3.freeze_to(-1)

In [45]:
m3.fit(lrs, 1, metrics=[accuracy])

epoch      trn_loss   val_loss   accuracy                    
    0      0.561768   0.88779    0.693235  



[0.88778967, 0.6932347544601986]

In [46]:
m3.unfreeze()

In [47]:
m3.fit(lrs/10, 1, metrics=[accuracy], cycle_len=5)

epoch      trn_loss   val_loss   accuracy                    
    0      0.335311   0.781723   0.729152  
    1      0.266786   0.638505   0.777403                    
    2      0.230577   0.638205   0.785803                    
    3      0.207183   0.613414   0.792014                    
    4      0.210763   0.62506    0.787144                    



[0.6250605, 0.7871435629082965]

In [65]:
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=4)

epoch      trn_loss   val_loss   accuracy                    
    0      0.220866   1.142      0.7186    
    1      0.118115   0.701368   0.823546                     
    2      0.056187   0.7014     0.832069                     
    3      0.035933   0.716181   0.830242                     



[0.7161807, 0.830242447651826]

83% of the time, the model is successfully predicting the correct author, based on 6 lines of text of a novel it has never seen before.

I look more at the examples the model gets wrong.  Is the model's 2nd or 3rd guess correct?

In [66]:
output = m3.predict_with_targs()   # Make predictions on validation set

In [67]:
preds, targs = output

In [68]:
correct = np.argmax(preds, axis=1) == targs

Her are the first 20 incorrect guesses. For each row, I list the model's top 3 guesses (in order), and then the correct answer.  Usually the correct answer is the 2nd guess, and occasionally the 3rd.

In [69]:
list(zip(np.argsort(-preds[~correct], axis=1)[:20, :3].tolist(), targs[~correct][:20]))

[([3, 1, 4], 4),
 ([3, 1, 4], 4),
 ([1, 3, 5], 5),
 ([1, 3, 2], 3),
 ([3, 4, 1], 4),
 ([3, 1, 4], 1),
 ([3, 4, 2], 4),
 ([1, 3, 4], 3),
 ([3, 2, 1], 1),
 ([3, 2, 1], 2),
 ([1, 3, 2], 3),
 ([1, 3, 2], 3),
 ([1, 3, 2], 3),
 ([1, 3, 2], 3),
 ([1, 3, 5], 3),
 ([4, 3, 1], 3),
 ([1, 3, 2], 3),
 ([1, 3, 4], 3),
 ([1, 3, 5], 5),
 ([1, 3, 4], 4)]

Given that, I wonder how often is the correct author one of the model's top two guessess?

In [70]:
ordered_guesses = np.argsort(preds, axis=1)

In [71]:
top_two = [ordered_guesses[i, 5] == targs[i] or ordered_guesses[i, 4] == targs[i] for i in range(len(targs))]

In [72]:
corr = np.sum(top_two)     # Number of times model was in the top two guesses
total = len(top_two)       # Total number of validation set predictions

In [73]:
corr, total, corr/total

(4686, 4919, 0.9526326489123805)

The model predicts the correct answer 83% of the time. The correct answer is one of the top two guesses 95.3% of the time.