# Blogging with RNNs

By Karl Heyer

This project is a follow-up to my last [RNN project](https://github.com/kheyer/ML-DL-Projects/tree/master/Shakespeare%20RNN) where I trained a character-level RNN to write Shakespeare.

This time around we're going to make a word-level RNN using the [Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm), a collection of blogs. The blogs are categorized by the author's gender, age, profession and astrology sign. First we will train a language model on a section of blogs. Then we will try to fine-tune the model to the task of classifying blog authors by gender.

There were a lot of things I wanted to try with this project, but unfortunately I ran into the limitations of my hardware (namely my 8 gigs of GPU memory). But more on that later. For now, let's look at the model we're going to train.

## Contents

1. Model Architecture
2. Project Setup
    * 2.1 Load Libraries
    * 2.2 Data Overview
3. Model Training
4. Text Classification

# 1. Model Architecture

This project will use a recurrent neural network (RNN) to build a language model. Let's break down what these mean.

## Language Model

This model will try to predict the next word in a string of words. For example, if the last sequence of words the RNN has seen were "The sky is", it should predict "blue" as the next word.

This is different from a character-level model that tries to predict one character at a time.

The fundamental way we will represent language in this model is through a word embedding matrix representing the vocabulary of our corpus. When we process text for feeding into the model, each word is mapped to a number value. Each word/number in the corpus has a corresponding row of weights in the embedding matrix. This creates a higher dimensional representation of the word. When the RNN processes sentences of text, it is really processing vectors of weight values for each word.
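As a minimal sketch of that lookup, here is the same idea in plain Python. The three-word vocabulary, the 4-dimensional vectors, and the names `stoi`, `embedding` and `embed` are made up for illustration. The real model uses PyTorch's embedding layer and learns the weights during training.

```python
import random

random.seed(0)

# Toy vocabulary: each word maps to an integer index.
stoi = {'the': 0, 'sky': 1, 'is': 2}
em_sz = 4  # toy embedding size (the project uses a much larger one)

# Embedding matrix: one row of weights per word in the vocabulary.
embedding = [[random.uniform(-0.1, 0.1) for _ in range(em_sz)]
             for _ in range(len(stoi))]

def embed(sentence):
    """Map each word to its index, then to its row of the embedding matrix."""
    return [embedding[stoi[word]] for word in sentence.split()]

vectors = embed('the sky is')  # three vectors, one per word
```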


## Recurrent Neural Network

Most deep learning architectures have a series of layers that process an input into an output. Each layer has a set of weights that perform some functional operation to convert inputs to outputs. RNNs, rather than having a deep stack of layers, have a single hidden layer that is updated every time an input comes in. This allows the activations of a previous input to affect the activations of the next input. This allows RNNs to learn complex relationships along a series of inputs.

This has made RNNs very effective architectures for NLP problems, where understanding a word in a sentence requires understanding what came before it.

For example, let's say the RNN is processing the sentence "The sky is blue". Since this is a word-level language model, each word (represented by a vector of weights from the embedding matrix) is fed into the network one word/vector at a time.

First the network processes "The" and updates the hidden state accordingly.
Then the network processes "sky", using the activations generated by the previous word "The" to process "sky" and generate a new set of activations.
Then the network processes "is" using the activations generated by processing "The" and "sky". And so on.


In PyTorch, the standard RNN module processes each input into a new set of activations using the following equation:

$h_t = \tanh(w_{ih} x_t + b_{ih} + w_{hh} h_{(t-1)} + b_{hh})$

The network starts with an existing hidden state $h_{(t-1)}$. When a new input $x_t$ enters the network, it is processed by a linear layer with weights $w_{ih}$ and bias $b_{ih}$. The hidden state $h_{(t-1)}$ is also processed by a linear layer with weights $w_{hh}$ and bias $b_{hh}$. The outputs of these two linear layers are summed together and processed by the $\tanh$ activation function to create the new hidden state $h_t$.
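The update can be written out by hand. This is a plain-Python sketch of the equation, not PyTorch's implementation. The 2x2 weights and the helper names `matvec` and `rnn_step` are invented for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, w_ih, b_ih, w_hh, b_hh):
    """h_t = tanh(w_ih x_t + b_ih + w_hh h_prev + b_hh)"""
    from_input = matvec(w_ih, x_t)
    from_hidden = matvec(w_hh, h_prev)
    return [math.tanh(i + bi + h + bh)
            for i, bi, h, bh in zip(from_input, b_ih, from_hidden, b_hh)]

# Toy parameters: input size 2, hidden size 2.
w_ih = [[0.5, -0.5], [0.1, 0.2]]
w_hh = [[0.0, 1.0], [1.0, 0.0]]
b_ih = [0.0, 0.0]
b_hh = [0.0, 0.0]

h = [0.0, 0.0]                       # initial hidden state
for x_t in ([1.0, 0.0], [0.0, 1.0]): # two inputs, fed one at a time
    h = rnn_step(x_t, h, w_ih, b_ih, w_hh, b_hh)
```

The second call sees the hidden state produced by the first, which is how earlier inputs influence later ones.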

For this project, we'll use a more advanced version of an RNN called an LSTM. The key improvement of LSTMs over regular RNNs is that LSTMs introduce a set of gates that control how much information from the old hidden state and the current input moves into the new hidden state. The parameters that control these gates are learned by the network. This allows LSTMs to learn and retain information over longer sequences.

The equations governing LSTMs:

$\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t c_{(t-1)} + i_t g_t \\
h_t = o_t \tanh(c_t)
\end{array}$

$i_{t}$ is the input gate. This controls how much of the new input is taken into the cell state.

$f_{t}$ is the forget gate. This gate controls how much of the old cell state passes on to the new cell state.

$g_{t}$ is the cell gate, which is the same equation we used to define our new hidden state in the standard RNN.

$o_{t}$ is the output gate which controls how much of the cell state passes on to the hidden state.

$c_{t}$ is the current cell state. You will notice $c_{t}$ is composed of the old cell state, the current forget gate, the current input gate, and the current cell gate. This shows how the various gates are used to filter how much of the input and previous cell state pass on to the new cell state.

$h_{t}$ is the new hidden state. It is composed of the current cell state activated by $\tanh$ and filtered by the output gate.
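The six equations translate almost line for line into code. The sketch below is plain Python for a single time step, assuming the weights are passed in as a dictionary keyed by gate name. That layout and the names `affine` and `lstm_step` are choices made for this example, not how PyTorch stores its parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, b_x, W_h, b_h, x, h):
    """W_x x + b_x + W_h h + b_h, the pattern shared by all four gates."""
    return [sum(w * v for w, v in zip(rx, x)) + bx +
            sum(w * v for w, v in zip(rh, h)) + bh
            for rx, bx, rh, bh in zip(W_x, b_x, W_h, b_h)]

def lstm_step(x, h_prev, c_prev, params):
    """params maps each gate name ('i', 'f', 'g', 'o') to (W_x, b_x, W_h, b_h)."""
    i = [sigmoid(z) for z in affine(*params['i'], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params['f'], x, h_prev)]    # forget gate
    g = [math.tanh(z) for z in affine(*params['g'], x, h_prev)]  # cell gate
    o = [sigmoid(z) for z in affine(*params['o'], x, h_prev)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy check with all-zero weights: every sigmoid gate is 0.5 and g is 0,
# so half of the old cell state is kept and nothing new is written.
zero = ([[0.0]], [0.0], [[0.0]], [0.0])
params = {name: zero for name in 'ifgo'}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```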

Visually:

<img src="LSTM.png">


When the weights are updated, the gradient of the loss is backpropagated through all the hidden activations, essentially "unrolling" the hidden state into each set of activations created over the course of generating the final activations.

Visually:

<img src="RNN.png">

[Image source](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)



This leads to one of the most important parameters in training RNNs: backprop through time (bptt). The bptt value sets how many hidden activations are saved and propagated through during backpropagation. Bptt is a double-edged sword. On the one hand, a larger bptt value lets the network train on longer sequences, and so learn dependencies that span more words. On the other hand, the larger the bptt value, the deeper the unrolled network becomes and the harder it is to train.
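One way to picture bptt is as the window length used to cut a long token stream into training pieces. The sketch below is a simplified illustration, with `bptt_chunks` as a made-up helper. It ignores batching and is not the loader used later in this project.

```python
def bptt_chunks(tokens, bptt):
    """Yield consecutive (inputs, targets) windows of at most bptt tokens.

    The targets are the inputs shifted one position to the right, since the
    model is trained to predict the next word at every step.
    """
    for start in range(0, len(tokens) - 1, bptt):
        end = min(start + bptt, len(tokens) - 1)
        yield tokens[start:end], tokens[start + 1:end + 1]

tokens = 'the sky is blue and the grass is green'.split()
chunks = list(bptt_chunks(tokens, bptt=3))
# Gradients only flow back within one window, so with bptt=3 the model is
# never asked to backpropagate through more than three time steps.
```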



# 2. Project Setup

## 2.1 Load Libraries

This project uses FastAI, PyTorch, and their associated dependencies.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

## 2.2 Data Overview

The files for this corpus are all in xml format. The first thing to do is to convert them to txt files.

In [2]:
PATH='data/blogs/blogs/'

files = os.listdir(PATH)
files[:10]

['1000331.female.37.indUnk.Leo.xml',
 '1000866.female.17.Student.Libra.xml',
 '1004904.male.23.Arts.Capricorn.xml',
 '1005076.female.25.Arts.Cancer.xml',
 '1005545.male.25.Engineering.Sagittarius.xml',
 '1007188.male.48.Religion.Libra.xml',
 '100812.female.26.Architecture.Aries.xml',
 '1008329.female.16.Student.Pisces.xml',
 '1009572.male.25.indUnk.Cancer.xml',
 '1011153.female.27.Technology.Virgo.xml']
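The author metadata is encoded in each filename, separated by dots. A minimal sketch of pulling it apart follows. `BlogMeta` and `parse_blog_filename` are hypothetical names, not part of the original notebook, and the field order is inferred from the listing above and the corpus description.

```python
from collections import namedtuple

# Hypothetical container for the fields encoded in a corpus filename.
BlogMeta = namedtuple('BlogMeta', 'blog_id gender age profession sign')

def parse_blog_filename(fn):
    """Split '<id>.<gender>.<age>.<profession>.<sign>.xml' into its parts."""
    blog_id, gender, age, profession, sign, _ext = fn.split('.')
    return BlogMeta(blog_id, gender, int(age), profession, sign)

meta = parse_blog_filename('1005545.male.25.Engineering.Sagittarius.xml')
```

The selection code later in this section relies on the same positions: index 1 for gender and index 2 for age.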

In [3]:
from bs4 import BeautifulSoup
from shutil import copy2

In [4]:
PATH2 = 'data/blogs/blogs_txt/'

This loop extracts the text of every post and writes one txt file per blog:

In [5]:
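os.makedirs(PATH2, exist_ok=True)  # make sure the output directory exists

for fn in files:
    out_fn = f'{PATH2}{fn[:-4]}.txt'
    try:
        # Read the raw file and pull out every <post> element
        with open(f'{PATH}{fn}', encoding="ISO-8859-1") as infile:
            soup = BeautifulSoup(infile.read(), 'html5lib')
        posts = soup.find_all('post')

        with open(out_fn, 'w', encoding='utf-8') as out:
            for post in posts:
                out.write(post.get_text())

        # Flag conversions that produced an empty file
        if os.path.getsize(out_fn) == 0:
            print(out_fn)

    except Exception as e:
        # Skip files that fail to convert, but report which ones
        print(f'skipped {fn}: {e}')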

In [5]:
txt_files = os.listdir(PATH2)

Number of files in the corpus:

In [6]:
len(txt_files)

19320

In [9]:
file_path = 'data/blogs/'

In experimenting with the model, I found I had to severely limit the selection of the corpus to avoid CUDA memory errors. Our training data set will be 400 blogs from men aged 23-27 and 400 blogs from women aged 23-27. Our validation set will be 200 total blogs from the same subset of writers.

In [10]:
os.makedirs(f'{file_path}trn/male', exist_ok=True)
os.makedirs(f'{file_path}trn/female', exist_ok=True)
os.makedirs(f'{file_path}trn/all', exist_ok=True)
os.makedirs(f'{file_path}val/male', exist_ok=True)
os.makedirs(f'{file_path}val/female', exist_ok=True)
os.makedirs(f'{file_path}val/all', exist_ok=True)
os.makedirs(f'{file_path}models', exist_ok=True)

In [11]:
tm = 0
tf = 0
vm = 0
vf = 0

for fn in txt_files:
    copy_file = False
    gender = fn.split('.')[1]
    age = int(fn.split('.')[2])
    
    if age in range(23,28):
        if random.random() > 0.1:
            dset = 'trn/'
            
            if gender == 'male':
                if tm < 400:
                    tm += 1
                    copy_file = True
                    
            else:
                if tf < 400:
                    tf += 1
                    copy_file = True          
            
        else:
            dset = 'val/'
            
            if gender == 'male':
                if vm < 100:
                    vm += 1
                    copy_file = True
            else:
                if vf < 100:
                    vf += 1
                    copy_file = True
        
        if copy_file:
            copy2(f'{PATH2}{fn}', f'{file_path}{dset}{gender}')
            copy2(f'{PATH2}{fn}', f'{file_path}{dset}all')
        
    if all((tm>=400, tf>=400, vm>=100, vf>=100)):
        break
    

## Text Tokenization 

An important step in creating language models is to process the text into tokens. Tokenization breaks text into chunks that will be fed into the network. For character-level models, we would tokenize by making each character a single input. For word-level language models, we will use the spaCy tokenizer to process the text.

In [10]:
spacy_tok = spacy.load('en')

In [11]:
TRN_PATH = 'trn/all'
VAL_PATH = 'val/all'
TRN = f'{file_path}trn/all'
VAL = f'{file_path}val/all'

In [12]:
file_path

'data/blogs/'

In [14]:
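trn_files = os.listdir(TRN)  # assumption: trn_files (not defined above) is the training directory listing
f = open(f'{TRN}/{trn_files[0]}', encoding='utf-8')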

In [15]:
blog = f.read()

An example of a blog in the corpus:

In [17]:
blog[:1000]

'\n\n\t \n      cupid,please hear my cry, cupid, please let your arrow fly  straight into my fucking head.\n     \n\n    \n\n\n\t \n      Ijust got back from LA. I needed something to do so I looked up shows and ended up getting to see Deathray Davies that same night. I brought back shirts and stickers.my brother in law was impressed enough to purchase one of there albums.they were followed by this stoner/boogie band called Honky which allegedly contains and ex-Butthole Surfer.I didnt dig them too well.they were like the melvins and ZZ top combined. they even had redneck cowboy hats and silly 3 inch long goatees. their antics consisted of tipping there hats and torturing the audiance with between song chatter. they have a small lesbian following.  I didnt know the davies had like 6 members. I met the singer, his brother, and the second keyboardist.he didnt much care for honkey either.  thanx to colt from poulain for the correspondence. I hope we meet again except a show where WE are pl

This is the same blog tokenized. The difference isn't drastic, but you can see the effects of tokenization. Punctuation is split off into its own tokens, and contractions like "didnt" (written without the apostrophe in this blog) become two tokens: "did" and "nt".

In [18]:
' '.join([sent.string.strip() for sent in spacy_tok(blog)])[:1000]

' cupid , please hear my cry , cupid , please let your arrow fly  straight into my fucking head .  Ijust got back from LA . I needed something to do so I looked up shows and ended up getting to see Deathray Davies that same night . I brought back shirts and stickers.my brother in law was impressed enough to purchase one of there albums.they were followed by this stoner / boogie band called Honky which allegedly contains and ex - Butthole Surfer . I did nt dig them too well.they were like the melvins and ZZ top combined . they even had redneck cowboy hats and silly 3 inch long goatees . their antics consisted of tipping there hats and torturing the audiance with between song chatter . they have a small lesbian following .  I did nt know the davies had like 6 members . I met the singer , his brother , and the second keyboardist.he did nt much care for honkey either .  thanx to colt from poulain for the correspondence . I hope we meet again except a show where WE are playing too .  just g

We create a torchtext field to process the text and work with our dataloader. For this model we will use a batch size of 70 and a bptt value of 55. This runs really slowly due to GPU memory limitations. During training I never saw GPU utilization go above 30%, but memory utilization was around 94%, really riding the CUDA memory error line.

In [19]:
TEXT = data.Field(lower=True, tokenize="spacy")
#TEXT = pickle.load(open(f'{file_path}models/TEXT_2327_partial_2.pkl','rb'))

In [20]:
bs=70; bptt=55

When we create our dataloader, we set min_freq to 20. This cuts out any word that appears fewer than 20 times (such words are mapped to `<unk>`). I don't like doing this. I'd much prefer to have the cutoff around 5 or 10, since less frequently used words likely hold a lot of meaning. However, this keeps the vocabulary, and therefore the embedding matrix, smaller. It's a compromise to make this thing train in a reasonable amount of time.
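To make the effect of the cutoff concrete, here is a small stand-alone sketch of frequency-based vocabulary building. It imitates the idea rather than torchtext's actual code. `build_vocab` is a made-up helper, and the toy corpus uses a cutoff of 2 instead of 20.

```python
from collections import Counter

def build_vocab(tokens, min_freq):
    """Keep tokens seen at least min_freq times, most frequent first."""
    counts = Counter(tokens)
    itos = ['<unk>', '<pad>'] + [tok for tok, n in counts.most_common()
                                 if n >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

tokens = 'a a a b b c'.split()
itos, stoi = build_vocab(tokens, min_freq=2)

# Anything below the cutoff ('c' here) falls back to the <unk> index.
ids = [stoi.get(tok, stoi['<unk>']) for tok in tokens]
```

Every token kept adds one row to the embedding matrix and one output to the decoder, which is why a higher cutoff reduces memory use.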

In [21]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(file_path, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=20)

In [24]:
pickle.dump(TEXT, open(f'{file_path}models/TEXT_2327_partial_2.pkl','wb'))

So after processing, our corpus has a vocabulary of 25,378 words and a total length of 22,593,561 tokens.

In [109]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(5378, 25378, 1, 22593561)

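These are the first 10 entries in the vocabulary: the two special tokens `<unk>` and `<pad>`, followed by the most frequently used tokens in the corpus.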

In [110]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', '.', ',', 'the', 'i', ' ', 'to', 'and', 'a']

Each word maps to a number, like so:

In [111]:
TEXT.vocab.stoi['king']

1123

This is an example of numericalizing text input. The tensor of numbers is what is actually processed by the network.

In [27]:
md.trn_ds[0].text[:12]

['\t ',
 '      ',
 'cupid',
 ',',
 'please',
 'hear',
 'my',
 'cry',
 ',',
 'cupid',
 ',',
 'please']

In [28]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
    93
    56
 16939
     3
   423
   412
    15
   963
     3
 16939
     3
   423
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

In [29]:
x,y = next(iter(md.trn_dl))

This is what an input training minibatch looks like. We have 70 columns, corresponding to our batch size of 70. Each column has 55 rows representing 55 word tokens in sequential order.

The y values we want to predict are also word token values. For each value in the x matrix, there is a correct y value corresponding to the next word in the sequence. Due to how the data loader processes things, the y tensor is flattened. The 3850 (55x70) values in the y tensor are the targets for each word in the x matrix.

In [30]:
x, x.shape

(Variable containing:
     93    248      4  ...     128     51    153
     56      2    113  ...    1763     37    623
  16939      6    151  ...       3     25      7
         ...            ⋱           ...         
    218    139     30  ...       3      9     20
    135    117      2  ...      39   1573     47
      2  10364      6  ...       0      2      7
 [torch.cuda.LongTensor of size 55x70 (GPU 0)], torch.Size([55, 70]))

In [31]:
y, y.shape

(Variable containing:
     56
      2
    113
   ⋮   
      2
      6
  10229
 [torch.cuda.LongTensor of size 3850 (GPU 0)], torch.Size([3850]))
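In the printouts above, the first values of y (56, 2, 113) are the second row of x, which suggests y is x shifted down by one row and then flattened row by row. A toy reconstruction of that layout in plain Python follows. `batchify` and `get_batch` are illustrative names, and the real loader works on tensors rather than lists.

```python
def batchify(ids, bs):
    """Split a token-id stream into bs columns and return it as a list of rows."""
    n = len(ids) // bs
    cols = [ids[c * n:(c + 1) * n] for c in range(bs)]
    return [list(row) for row in zip(*cols)]

def get_batch(rows, start, bptt):
    """x is bptt rows; y is the same block shifted down one row, flattened."""
    x = rows[start:start + bptt]
    y_rows = rows[start + 1:start + 1 + bptt]
    y = [tok for row in y_rows for tok in row]
    return x, y

ids = list(range(12))        # stand-in for a numericalized corpus
rows = batchify(ids, bs=3)   # columns hold ids 0-3, 4-7 and 8-11
x, y = get_batch(rows, 0, bptt=2)
```

Each column is an independent, contiguous stretch of the corpus, so reading down a column reads the text in order.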

# 3. Model Training

Here we create the model. em_sz sets the size (columns) of the embedding matrix. The embedding matrix will have size vocabulary x em_sz.

nh is the number of hidden activations per layer.

nl is the number of stacked LSTM modules in the model.

The model will be trained with the Adam optimizer.

The model itself has 5 dropout parameters.

dropouth is applied to the activations going from one LSTM block to another.

dropouti is applied to the input layer.

dropoute is applied to the embedding layer.

wdrop is applied to the LSTM's hidden weights.

dropout is applied to the linear decoder that determines the final prediction.

In [22]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [23]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [24]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

This is the model structure. We start with an embedding matrix encoding our vocabulary. Then we have three LSTM modules, followed by a linear decoder.

In [36]:
learner

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(25378, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(25378, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=200, out_features=25378, bias=False)
    (dropout): LockedDropout(
    )
  )
)

Now we begin training.

In [38]:
learner.fit(3e-3, 1, wds=1e-6)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      4.522975   4.572558  



[array([4.57256])]

In [39]:
learner.fit(3e-3, 3, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=7), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      4.436817   4.419866  
    1      4.38615    4.405467                                                                                         
    2      4.343145   4.338641                                                                                         
    3      4.366256   4.404881                                                                                         
    4      4.310186   4.345111                                                                                         
    5      4.270685   4.294953                                                                                         
    6      4.238749   4.275535                                                                                         



[array([4.27554])]
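The `max=7` in the progress bar is consistent with reading the second argument to `fit` as a number of cycles, with `cycle_len` as the length of the first cycle in epochs and `cycle_mult` as the factor by which each later cycle grows. A small sketch of that arithmetic, under that reading of the fastai version used here (`total_epochs` is a helper written for this example):

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Sum of cycle lengths when each cycle is cycle_mult times the previous one."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

epochs = total_epochs(3, cycle_len=1, cycle_mult=2)  # cycles of 1, 2 and 4 epochs
```

The same formula matches the other progress bars in this notebook, for example 4 cycles of length 10 giving 40 epochs.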

In [40]:
learner.save('cyc1')
learner.save_encoder('cyc1_enc')

In [110]:
learner.fit(1e-3, 3, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=7), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      4.270131   4.278947  
    1      4.267469   4.28397                                                                                          
    2      4.237581   4.26392                                                                                          
    3      4.26826    4.291307                                                                                         
    4      4.237706   4.270288                                                                                         
    5      4.233086   4.255018                                                                                         
    6      4.218849   4.248353                                                                                         



[array([4.24835])]

In [111]:
learner.save('cyc2')
learner.save_encoder('cyc2_enc')

In [25]:
learner.load('cyc2')

In [39]:
learner.fit(1e-3, 4, wds=1e-6, cycle_len=10,cycle_save_name='cyc3_iter')

HBox(children=(IntProgress(value=0, description='Epoch', max=40), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      4.257507   4.289395  
    1      4.249072   4.280441                                                                                         
    2      4.214001   4.271849                                                                                         
    3      4.224365   4.265314                                                                                         
    4      4.200969   4.254217                                                                                         
    5      4.189088   4.246558                                                                                         
    6      4.181067   4.238831                                                                                         
    7      4.166237   4.233099                                                                                         
    8  

[array([4.20655])]

In [40]:
learner.save('cyc3')
learner.save_encoder('cyc3_enc')

In [86]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='cyc4_iter')

HBox(children=(IntProgress(value=0, description='Epoch', max=20), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      4.452277   4.280445  
    1      4.567751   4.463188                                                                                         
    2      4.558173   4.455255                                                                                         
    3      4.551173   4.447328                                                                                         
    4      4.548191   4.438882                                                                                         
    5      4.539622   4.428327                                                                                         
    6      4.524216   4.413176                                                                                         
    7      4.512895   4.404466                                                                                         
    8  

[array([4.25466])]

In [87]:
learner.save('cyc4')
learner.save_encoder('cyc4_enc')

## Model Training Review

This thing was a beast to train. Even with the major reductions we made to corpus size, minimum word frequency and parameter values, the training epochs shown took several days.

In that time, the model basically went nowhere. Validation loss dropped from 4.573 to 4.255, not a huge improvement given the number of epochs and training time involved.
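Loss values for language models are often easier to interpret as perplexity, the exponential of the loss. Assuming the reported numbers are mean per-token cross-entropy in nats, the first and last validation losses logged above correspond to a perplexity of roughly 97 falling to roughly 70:

```python
import math

def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

start = perplexity(4.572558)  # first validation loss logged above
end = perplexity(4.25466)     # last validation loss logged above
```

Loosely, the model is still about as uncertain as if it were choosing uniformly among around 70 words at each step.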

Testing the model:

In [133]:
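def get_next(inp):
    # num_str turns the input string into a tensor of token indices (its definition is not shown here)
    idxs = num_str(inp)
    res, *_ = m(VV(idxs.transpose(0,1)))
    # Sample the next token in proportion to the exponentiated scores for the last position
    r = torch.multinomial(res[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]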
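The sampling step in `get_next` draws the next word at random, weighted by the exponentiated model scores, rather than always taking the top prediction. A stand-alone sketch of that idea using only the standard library follows. `sample_next` is a hypothetical helper, and subtracting the maximum score is an addition for numerical safety that does not change the probabilities.

```python
import math
import random

def sample_next(scores, itos, rng=random):
    """Draw one token with probability proportional to exp(score)."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shift avoids overflow
    return rng.choices(itos, weights=weights, k=1)[0]

itos = ['blue', 'green', 'falling']
word = sample_next([2.0, 1.0, 0.5], itos, rng=random.Random(0))
```

Sampling keeps the generated text varied. Always picking the highest-scoring word tends to produce repetitive loops.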

In [240]:
def get_next_n(inp, n):
    res_lst = inp.split(' ')
    try:
        for i in range(n):
            c = get_next(inp)
            res_lst += [c]
            inp_lst = inp.split(' ')
            inp_lst += [c]
            inp = """ """.join(inp_lst[1:])
        return """ """.join(res_lst)
    except:
        print(res_lst)
        print(c)
        print(inp)

In [241]:
m = learner.model
m.eval();

In [244]:
print(get_next_n('Usually I saw never say never, but today changed ', 1000))

Usually I saw never say never, but today changed  town in cousin magnets .    i 'm mouthful .    i am going to fight union .    derek caller optimism , sweet drum roll , and the songs are <unk> goal ... but the way it sounds .   diet coke can be blames the beauty is n't a sphere . what happening to a   urllink <unk> today also . asked by firewall is a great review . she 's asked me and wretched ,   urllink tucker   harris is thinking about how i feel the first time i come to their endangered affections what happened this jaime . cramp - <unk>   the meat tool , in curtis <unk> responsible for virgin <unk> frantic and often , shady being gentle with the whole system is orbit women .   time at the house , but no words will be higher goodwill .   broad , well american whitney braun at least one totally levels <unk> returns to iraq at interestingly very difficult in one . messaging is gives me perfume going on timer !   cheers , your goal of getting a david disposable chance and if you will

## Testing Results

These results aren't that great. Some chunks sound like sentences, but most are nonsensical. The main problem seems to be that the model loss is still far too high, and the model will struggle to train with such limited data.

This whole project has felt restricted by hardware issues. Makes me want to go through the effort of setting up AWS.

I'm going to try to fine-tune the model for gender classification. I don't have a lot of hope for this to perform given how bad it still is - crap in, crap out - but I want to get some practice setting up this kind of model.

# 4. Text Classification

Issues about model performance aside, here we go.

First we create a torchtext dataset. Then we feed the dataset to a new model. The new model is then loaded with the encoder saved from the language model (the word embedding matrix and the LSTM layers) and trained to classify author gender from text. This time the dataset has a second torchtext field containing the label ('male' or 'female') that the model will try to predict.

In [23]:
class BlogDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        for label in ['male', 'female']:
            for fname in glob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r', encoding='utf-8') as f:
                    text = ''
                    # Ideally we would just grab the first line, but a lot of the blogs start
                    # with whitespace. This ensures we get a good chunk of text to go on.
                    while len(text) < 1000:
                        next_line = f.readline()
                        if next_line == '':  # end of file: stop even if the blog is shorter than 1000 characters
                            break
                        if next_line not in ('\n', ' '):
                            text += ' ' + next_line
                    
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, root='.data/',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

In [24]:
BLOG_LABEL = data.Field(sequential=False)
splits = BlogDataset.splits(TEXT, BLOG_LABEL, file_path, train='trn', test='val')

In [25]:
file_path

'data/blogs/'

In [26]:
bs=48

In [27]:
md2 = TextData.from_splits(file_path, splits, bs)

Some examples of male and female text:

In [28]:
t = splits[0].examples[5]

In [29]:
t.label

'male'

In [30]:
' '.join(t.text)

'        \n        wednesday of this week , my wife elizabeth and i found out that we will soon become parents ! soon , if march is soon for you .    funny because my wife has always been the one saying " i want to have babies " , while making cutesy hands , but since the news has come , really does n\'t act all that excited , but nausea can do that to you . really , the depth of the revelation has not fully hit us yet , but we are very happy .    one of the main purposes of this blog will be to document the belly growth visually as the next 6 1/2 months pass . oh , and i guess i will put up some photos when the little one appears .      also in other news ...            i am leaving work soon   ( do n\'t worry , i \'ve earned this break )   to head down to noblesville near indy to see sting ! this will be my 3rd time seeing him live , and a delayed birthday present to myself . i \'ll be meeting jenn swift and a friend of hers . i \'ll give an update on this event next posting .      \

In [31]:
t2 = splits[0].examples[-1]

In [32]:
t2.label

'female'

In [33]:
' '.join(t2.text)

" \t \n        i wrote these yesterday for work but thought you lot might like them too . they 're short and sweet as the cards we have to write on are not very big .   -----------------------------------------------------------------------       urllink lirael by garth nix    lirael returns us to the old kingdom fourteen years after the events told in sabriel . lirael is a daughter of the clayr , a race of seers who live within the glacier . but lirael has never received the gift of sight and so she spends her time working and hiding in the great library where she can learn more charter magic . when she is sent on a quest to find a boy named nicholas , she meets up with prince sameth and her whole world changes overnight . a fantastic return to the world first introduced in sabriel .   -----------------------------------------------------------------------       urllink abhorsen by garth nix    in abhorsen we return to the story of prince sameth and lirael . they are both still reelin

In [35]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.clip=25.

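So here we have the same encoder structure as before. The main difference is the final layer: the linear decoder over the vocabulary is replaced by a pooling linear classifier, which has 3 outputs in the printout below.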

In [36]:
m3

SequentialRNN(
  (0): MultiBatchRNN(
    (encoder): Embedding(25378, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(25378, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): ModuleList(
      (0): LinearBlock(
        (lin): Linear(in_features=600, out_features=3, bias=True)
        (drop): Dropout(p=0.1)
        (bn): BatchNorm1d(600, eps=1e-05, momentum=0.1, affine=True)
      )
    )
  )
)
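The classifier's `in_features=600` is three times the encoder's 200-wide output. One reading consistent with that, and with how fastai's pooling classifier is generally described, is that it concatenates the last time step's output with a max pool and an average pool over all time steps. The sketch below illustrates that idea for a single example in plain Python. `concat_pool` is a made-up name, and this is an assumption about the layer rather than its actual code.

```python
def concat_pool(outputs):
    """outputs: one feature vector per time step, for a single example.

    Returns [last step] + [max over time] + [mean over time], so the result
    is three times as wide as a single output vector.
    """
    last = outputs[-1]
    max_pool = [max(col) for col in zip(*outputs)]
    avg_pool = [sum(col) / len(outputs) for col in zip(*outputs)]
    return last + max_pool + avg_pool

# Three time steps of a toy 2-wide encoder output.
outputs = [[1.0, 0.0], [3.0, -2.0], [2.0, 4.0]]
pooled = concat_pool(outputs)  # 6 values: 3 x 2
```

Pooling over time lets the classifier use information from the whole text instead of only the final hidden state.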

We load up the model's encoder with the saved weights from the language model and try our best.

In [37]:
m3.load_encoder('cyc4_enc')
lrs=np.array([1e-5,1e-5,1e-5,1e-4,1e-3])

In [38]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss   accuracy                                                                              
    0      1.104813   1.026852   0.381657  



[array([1.02685]), 0.3816565847613915]

In [39]:
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss   accuracy                                                                              
    0      1.060456   1.057459   0.37992   



[array([1.05746]), 0.3799195173540556]

In [40]:
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

epoch      trn_loss   val_loss   accuracy                                                                              
    0      1.004435   1.185917   0.381821  



[array([1.18592]), 0.38182147358036267]

In [41]:
m3.fit(lrs, 2, metrics=[accuracy], cycle_len=4, cycle_save_name='gender_class1')

epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.952848   1.104415   0.500488  
    1      0.895149   1.159471   0.501726                                                                              
    2      0.858103   1.571479   0.501086                                                                              
    3      0.846589   1.683654   0.501409                                                                              
    4      0.82649    1.329061   0.501266                                                                              
    5      0.803777   0.979512   0.501                                                                                 
    6      0.783979   0.953483   0.501084                                                                              
    7      0.772295   0.935644   0.500664                                                                           

[array([0.93564]), 0.5006637278107683]

In [42]:
m3.fit(lrs, 4, metrics=[accuracy], cycle_len=4, cycle_save_name='gender_class2')

epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.732586   0.810481   0.38408   
    1      0.732259   1.176944   0.501833                                                                              
    2      0.721015   1.184865   0.50342                                                                               
    3      0.719483   1.191819   0.502816                                                                              
    4      0.716559   1.179486   0.503789                                                                              
    5      0.709782   0.789773   0.387736                                                                              
    6      0.70994    0.934974   0.386555                                                                              
    7      0.707187   0.970563   0.504004                                                                           

[array([0.74688]), 0.38625989366836494]

In [43]:
#lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
m3.fit(lrs/10, 2, metrics=[accuracy], cycle_len=4, cycle_save_name='gender_class3')

epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.702355   0.73934    0.386541  
    1      0.701594   0.750925   0.503607                                                                              
    2      0.697262   0.760139   0.503607                                                                              
    3      0.697233   0.74761    0.387044                                                                              
    4      0.698773   0.757773   0.503585                                                                              
    5      0.697376   0.755829   0.386864                                                                              
    6      0.70034    0.751478   0.504425                                                                              
    7      0.697464   0.756986   0.504568                                                                           

[array([0.75699]), 0.504567826586341]

In [44]:
m3.fit(lrs/50, 2, metrics=[accuracy], cycle_len=4, cycle_save_name='gender_class4', cycle_mult=2)

epoch      trn_loss   val_loss   accuracy                                                                              
    0      0.691443   0.767655   0.504282  
    1      0.694669   0.780608   0.504984                                                                              
    2      0.698031   0.775461   0.503866                                                                              
    3      0.699507   0.762166   0.503723                                                                              
    4      0.696531   0.7502     0.504706                                                                              
    5      0.69829    0.753113   0.50358                                                                               
    6      0.696828   0.754551   0.504383                                                                              
    7      0.698813   0.758001   0.50358                                                                            

[array([0.75492]), 0.5034417833375088]
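
The `fit` calls above vary `n_cycle` (the second positional argument), `cycle_len` and `cycle_mult`. Assuming each cycle starts at `cycle_len` epochs and every later cycle is `cycle_mult` times longer than the one before it, the number of scheduled epochs can be worked out as below. `total_epochs` is an illustrative helper, not a fastai function.

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Epochs scheduled when each cycle is cycle_mult times the previous one."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))


print(total_epochs(2, cycle_len=4))                # 8
print(total_epochs(4, cycle_len=4))                # 16
print(total_epochs(2, cycle_len=4, cycle_mult=2))  # 12  (4 + 8)
```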

## Postmortem

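The model doesn't really go anywhere. Validation accuracy just flips between roughly 0.38 and 0.50 from epoch to epoch, and training loss flattens out around 0.70. As expected, the crappy language model created a crappy classifier. I need to make some changes to really try to fix this, but the language model takes way too long to train for me to experiment with it properly.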
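
One way to read the loss numbers, offered as an interpretation rather than a diagnosis: a classifier that guesses uniformly over k classes has a cross-entropy of ln k. The first frozen epoch's training loss (about 1.10) sits near ln 3, which matches the three outputs of the final layer, and the later plateau (about 0.70) sits just above ln 2, which is what a coin flip between two classes would give. The helper name `chance_cross_entropy` is illustrative, and the comparison is only approximate.

```python
import math


def chance_cross_entropy(n_classes):
    """Cross-entropy (in nats) of a uniform guess over n_classes."""
    return -math.log(1.0 / n_classes)


print(round(chance_cross_entropy(3), 4))  # 1.0986
print(round(chance_cross_entropy(2), 4))  # 0.6931
```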

I wonder if I could build the corpus in a more intelligent way. That, or I just accept that any serious attempt at this project will require several days of training time.

Despite my frustrations with this project, I'm glad I took the time to attempt it. I think language modeling is really interesting, and it was fun to try it out. With all the talk about transfer learning in NLP going around lately, maybe I'll come back to this with a pretrained language model and focus more on the classification aspect.