In [1]:
%reload_ext autoreload
%autoreload 2

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Language modeling

### Data

In [2]:
PATH='data/quora/'

F_NAME = 'train.csv'
TRN = f'{PATH}{F_NAME}'

%ls {PATH}

[0m[01;34mmodels[0m/  [01;34mtmp[0m/  train.csv


In [3]:
df_train = pd.read_csv(PATH + 'train.csv')

In [4]:
ques = df_train.question1[15]; ques

'What would a Trump presidency mean for current international master’s students on an F1 visa?'

Let's see how many words are in the dataset.

In [6]:
!find {TRN} -name '*.csv' | xargs cat | wc -w

7681017


The dataset seems large enough to train our own language model.
Although this was not part of the problem statement, I felt it would make the final solution more comprehensive.
The idea is to use both a pre-trained language model and a model trained on the given data,
and then compare which of them works better in the Siamese model architecture.

All the text needs to be tokenized; the tokenizer from the fastai library is used for this.

In [7]:
' '.join(spacy_tok(ques))

'What would a Trump presidency mean for current international master ’s students on an F1 visa ?'

Above is the sample question as split by the tokenizer.

The TorchText library is used to preprocess the data, and spaCy is used for tokenization.

The 'TEXT' field below (from TorchText) will hold the vocabulary of our dataset. We configure it to lowercase the text and to tokenize with spaCy.

In [5]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)

### Train Test Division

We will create a df_lang_model dataframe for training the language model.

In [6]:
ques_series = df_train.question1.append(df_train.question2, ignore_index=True)
df_lang_model = pd.DataFrame(ques_series, columns=['questions'])
df_lang_model.shape

(727722, 1)

In [7]:
VAL_RATIO = 0.2

# TODO: create a function for this
idx = np.arange(df_lang_model.shape[0])
np.random.seed(999)
np.random.shuffle(idx)

val_size = int(len(idx) * VAL_RATIO)

df_lang_model_train = df_lang_model.iloc[idx[val_size:]]
df_lang_model_val = df_lang_model.iloc[idx[:val_size]]
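
The TODO above could be addressed with a small helper. The following is a minimal sketch rather than the notebook's own code: `split_indices` is a hypothetical name, and it uses Python's `random` module instead of NumPy, so the shuffled order will differ from the one produced by `np.random.seed(999)`.

```python
import random

def split_indices(n_rows, val_ratio=0.2, seed=999):
    """Return (train_idx, val_idx): a reproducible shuffled split of range(n_rows)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)      # local RNG, global random state is untouched
    val_size = int(n_rows * val_ratio)
    # Same convention as the cell above: the first val_size shuffled rows are validation.
    return idx[val_size:], idx[:val_size]

train_idx, val_idx = split_indices(10, val_ratio=0.2)
print(len(train_idx), len(val_idx))
```

The two returned lists can be passed to `df_lang_model.iloc[...]` to build the training and validation dataframes.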

### Create Model

Let's create a model-data object from the dataframes. The last five rows of the validation set serve as a small test set.

In [8]:
df_lang_model_test = df_lang_model_val.tail()

In [9]:
bs=16; bptt=60

bptt (backpropagation through time) is the number of tokens the model reads at once, i.e. the length of each training sequence.
Increasing either bs or bptt increases GPU memory use.

In [10]:
md = LanguageModelData.from_dataframes(PATH, TEXT, 'questions', df_lang_model_train, df_lang_model_val, df_lang_model_test, 
                                       bs=bs, bptt=bptt)

The method above populates the 'TEXT' object with a vocab attribute, which stores the mapping between words and integer token IDs.

In [107]:
# storing TEXT object
# pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Below, the trn_ds attribute holds the training dataset, and trn_dl is the data loader that feeds batches to the training loop.

The cell also shows: # batches; # unique tokens in the vocab; # items in the training set; # tokens in the training set

In [15]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(1774, 83932, 1, 7952092)

Integer to String mapping.

In [16]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>',
 '<pad>',
 '?',
 '<eos>',
 'the',
 'what',
 'is',
 'i',
 'how',
 'a',
 'to',
 'in']

In [17]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

4

In [18]:
md.trn_ds[0].text[:10]

['in',
 'western',
 'married',
 'life',
 ',',
 'does',
 'the',
 'husband',
 'have',
 'to']

TorchText will handle turning words into integer IDs.

In [19]:
TEXT.numericalize([md.trn_ds[0].text[:10]])

Variable containing:
   11
  993
  744
   73
   18
   25
    4
  979
   32
   10
[torch.cuda.LongTensor of size 10x1 (GPU 0)]
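
The itos/stoi mapping and the numericalization step shown above can be illustrated with a small pure-Python sketch. This is a simplified illustration, not TorchText's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the only behaviour assumed is the one visible in the output above (special tokens first, then tokens ordered by frequency).

```python
from collections import Counter

def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Map tokens to integer IDs: specials first, then by descending frequency."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by frequency (descending), breaking ties alphabetically.
    by_freq = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + by_freq                 # int -> string
    stoi = {tok: i for i, tok in enumerate(itos)}   # string -> int
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token with its ID; unseen tokens map to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

texts = [["what", "is", "the", "best", "way", "?"], ["what", "is", "it", "?"]]
itos, stoi = build_vocab(texts)
print(itos[:5])
print(numericalize(["what", "is", "love", "?"], stoi))
```

The word "love" never occurs in the toy corpus, so it is mapped to the ID of `<unk>`.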

The 'LanguageModelData' object ('md') splits the token stream into bs parallel columns and yields batches whose sequence length is approximately bptt tokens.

Below is one batch of data.

Each batch also comes with labels, which for a language model are simply the next word at each position.

In [27]:
next(iter(md.trn_dl))

(Variable containing:
     11      7     57  ...       2     80   1150
    993     80     21  ...       3      5    589
    744     30      2  ...       5   3061      2
         ...            ⋱           ...         
     18      7     11  ...      19  46937  81205
   2037    191   3856  ...      32     49     49
     15      4      2  ...    1080    804      2
 [torch.cuda.LongTensor of size 68x64 (GPU 0)], Variable containing:
    993
     80
     21
   ⋮   
     90
    499
      3
 [torch.cuda.LongTensor of size 4352 (GPU 0)])
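
The layout of the batch shown above (a matrix of inputs plus a flat vector of next-word targets) can be reproduced in miniature. The sketch below uses plain lists and a fixed sequence length; `batchify` and `get_batch` are illustrative names, and the real loader works on tensors and varies the sequence length slightly from batch to batch.

```python
def batchify(ids, bs):
    """Split a flat list of token IDs into bs equal-length columns (extra tokens dropped)."""
    col_len = len(ids) // bs
    return [ids[c * col_len:(c + 1) * col_len] for c in range(bs)]

def get_batch(columns, start, bptt):
    """Return (inputs, targets) for up to bptt rows beginning at row `start`.

    inputs has seq_len rows of bs token IDs; targets holds the next token for
    every input position, flattened row by row.
    """
    seq_len = min(bptt, len(columns[0]) - 1 - start)
    inputs = [[col[start + t] for col in columns] for t in range(seq_len)]
    targets = [col[start + t + 1] for t in range(seq_len) for col in columns]
    return inputs, targets

ids = list(range(20))              # stand-in for a numericalized corpus
cols = batchify(ids, bs=2)         # two columns of 10 tokens each
x, y = get_batch(cols, start=0, bptt=3)
print(x)
print(y)
```

As in the real output, the first entries of the target vector are the second row of the input matrix, and the target vector has rows times columns entries.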

### Train

Initially, generic model parameters are chosen.

In [11]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Large momentum values don't work well with RNN models, so we use the Adam optimizer with a first momentum coefficient (beta1) of 0.7, lower than its default of 0.9.

In [12]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
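
To get a feel for what lowering beta1 does, the sketch below computes the exponential moving average that Adam keeps of the gradients (bias correction omitted for brevity). The helper names are hypothetical, and `1 / (1 - beta)` is only a rough rule of thumb for how many recent steps dominate the average.

```python
def ema(values, beta):
    """Exponential moving average, as used for Adam's first moment (no bias correction)."""
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

def effective_window(beta):
    """Rough number of recent steps that dominate the average."""
    return 1 / (1 - beta)

# A gradient that flips sign: the lower beta follows the change faster.
grads = [1.0] * 10 + [-1.0] * 3
for beta in (0.9, 0.7):
    print(beta, round(effective_window(beta), 1), round(ema(grads, beta)[-1], 3))
```

Three steps after the sign flip, the average with beta 0.7 has already turned negative, while the average with beta 0.9 is still positive.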

The get_model method below gives us a variant of the AWD-LSTM language model, which provides strong regularization through dropout.

The dropout values, however, have to be guessed initially and then tuned through training.

In [13]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

The model is tuned gradually in stages; the learning rate finder could also be used to choose the learning rate.

In [14]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                
    0      3.876628   3.765307  
    1      3.667433   3.563167                                
    2      3.509317   3.460986                                
    3      3.612688   3.51687                                 
    4      3.490294   3.422421                                
    5      3.401346   3.349517                                
    6      3.341496   3.329137                                
    7      3.546147   3.463429                                
    8      3.513475   3.428641                                
    9      3.460858   3.385919                                
    10     3.414455   3.343821                                
    11     3.354729   3.301025                                
    12     3.332077   3.270087                                
    13     3.256523   3.251745                                
    14     3.220865   3.251471                                



[3.2514706]
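
The 15 rows in the log above follow from the cycle settings: with cycle_len=1 and cycle_mult=2, the four cycles last 1, 2, 4 and 8 epochs. The sketch below works out where each cycle starts, together with a cosine-annealed learning rate of the kind used in SGDR-style restarts. The function names are hypothetical, and the exact schedule inside the library may differ in detail.

```python
import math

def cycle_starts(n_cycles, cycle_len=1, cycle_mult=1):
    """Return (first epoch of each cycle, total number of epochs)."""
    starts, epoch, length = [], 0, cycle_len
    for _ in range(n_cycles):
        starts.append(epoch)
        epoch += length
        length *= cycle_mult
    return starts, epoch

def cosine_lr(lr_max, progress):
    """Cosine-annealed learning rate; progress runs from 0 to 1 within one cycle."""
    return lr_max / 2 * (1 + math.cos(math.pi * progress))

starts, total = cycle_starts(4, cycle_len=1, cycle_mult=2)
print(starts, total)
print(cosine_lr(3e-3, 0.0), cosine_lr(3e-3, 0.5))
```

The rises in training loss at epochs 3 and 7 in the log coincide with the start of the third and fourth cycles, where the learning rate jumps back to its maximum.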

In [18]:
learner.save_encoder('adam1_enc')

In [17]:
learner.load_encoder('adam1_enc')

In [20]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

epoch      trn_loss   val_loss                                
    0      3.478428   3.410003  
    1      3.475994   3.39572                                 
    2      3.475303   3.368326                                
    3      3.408111   3.342648                                
    4      3.366519   3.310779                                
    5      3.320632   3.27357                                 
    6      3.267667   3.247919                                
    7      3.231382   3.228729                                
    8      3.227391   3.218729                                
    9      3.167791   3.21881                                 



[3.21881]

For question comparison, we only need half of the language model, the *encoder*, so we save that part.

In [22]:
learner.save_encoder('adam_20_enc')

In [14]:
learner.load_encoder('adam_20_enc')

In [19]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

### Test

Let's test the language model a bit by giving it a short string and seeing what it predicts.

In [32]:
m=learner.model
ss=""" How to """
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

'  How to'

We'll manually test the model here.

In [33]:
m[0].bs=1          # set batch size to 1

m.eval()           # turn off dropout

m.reset()          # reset hidden state

res,*_ = m(t)      # get predictions from model

m[0].bs=bs         # put the batch size back to what it was

The top 10 predictions for the next word after our short text are:

In [34]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['the', 'a', 'be', 'get', 'make', 'my', 'an', 'work', 'your', 'india']

Generate a bit more text:

In [36]:
print(ss,"\n")
for i in range(30):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

 How to  

way to learn about stock market ? <eos> what is the best way to learn about stock market ? <eos> what is the best way to learn about stock market ...
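
The generation loop above always takes the most likely next token, skipping index 0 (`<unk>`) in favour of the runner-up. The toy sketch below mirrors that rule with plain lists. `pick_next`, `generate` and the score table are made up for illustration, and unlike the real model the toy "model" looks only at the last token rather than carrying a hidden state.

```python
def pick_next(scores, unk_id=0):
    """Return the index of the highest score, falling back to the runner-up
    when the best index is the unknown-token ID."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[1] if ranked[0] == unk_id else ranked[0]

def generate(step_fn, first_scores, n_tokens):
    """Greedy decoding: feed each chosen token back in to get the next scores."""
    scores, out = first_scores, []
    for _ in range(n_tokens):
        tok = pick_next(scores)
        out.append(tok)
        scores = step_fn(tok)
    return out

# Toy stand-in for the model: the next scores depend only on the last token.
table = {1: [0.9, 0.1, 0.5, 0.2], 2: [0.1, 0.2, 0.3, 0.8], 3: [0.0, 0.7, 0.1, 0.2]}
print(generate(lambda tok: table[tok], [0.6, 0.3, 0.5, 0.1], 7))
```

Because the choice is deterministic, the toy sequence soon falls into a loop, much like the repeated question in the generated text above. Sampling from the predicted distribution instead of always taking the top token is a common way to reduce this.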


### Similarity

In [83]:
next(iter(learner.data.trn_dl))

(Variable containing:
     38     37     25  ...       5      2    122
   4504     40  24961  ...      25      3     48
     17  71798    332  ...   19329     20     10
         ...            ⋱           ...         
    451      2   1223  ...       3    187      5
  19144      3      9  ...       8     57      6
      2      8    591  ...      16   4681      4
 [torch.cuda.LongTensor of size 64x16 (GPU 0)], Variable containing:
   4504
     40
  24961
   ⋮   
      7
    102
     23
 [torch.cuda.LongTensor of size 1024 (GPU 0)])

In [84]:
len(learner.data.trn_dl)

8282

In [78]:
x = next(iter(learner.data.trn_dl))[0][:, 0:2]

In [56]:
learner.model

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(83932, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(83932, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500, dropout=0.05)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500, dropout=0.05)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200, dropout=0.05)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=200, out_features=83932, bias=False)
    (dropout): LockedDropout(
    )
  )
)

We will replace the LinearDecoder with the Siamese model defined below.

In [1]:
# the encoder expects a batch of token IDs, not the data loader itself
learner.model[0](x)

In [102]:
class Residual(nn.Module):
    def __init__(self, insize, outsize):
        super(Residual, self).__init__()
        drate = .3
        self.math = nn.Sequential(
                                 nn.BatchNorm1d(insize),
                                 nn.Dropout(drate),
                                 nn.Linear(insize, outsize),
                                 nn.PReLU(),
                                )
        self.skip = nn.Linear(insize, outsize)
        
    def forward(self, x):
        return self.math(x) + self.skip(x)

class Siamese(nn.Module):
    def __init__(self):
        super(Siamese, self).__init__()

        # this is final classifier
        self.classifier = nn.Sequential(
                            Residual(4*(128), 256),
                            Residual(256, 128),
                            Residual(128, 128),
                            nn.Linear(128, 1),
                            nn.Sigmoid()
                            )
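    def forward(self, x, y):
        # x, y: encodings of the two questions, each of shape (batch, d).
        # The product and squared difference are concatenated into 2*d features,
        # so the first Residual layer (4*128 = 512 inputs) expects d = 256.
        z = torch.cat([x*y, (x-y)**2], 1)
        z = self.classifier(z)   # sigmoid score in (0, 1) for the question pair
        return z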
    
test_model = Siamese()
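
One property of the features built in `Siamese.forward` is that they are symmetric in the two inputs, so the score should not depend on which question comes first. The plain-Python sketch below checks this on small vectors. `pair_features` is a hypothetical helper that mirrors the `torch.cat([x*y, (x-y)**2], 1)` expression for a single pair.

```python
def pair_features(x, y):
    """Concatenate the element-wise product and squared difference of two vectors."""
    prod = [a * b for a, b in zip(x, y)]
    sqdiff = [(a - b) ** 2 for a, b in zip(x, y)]
    return prod + sqdiff

q1 = [0.5, -1.0, 2.0]
q2 = [1.5, 0.0, 2.0]
feats = pair_features(q1, q2)
print(len(feats))                          # twice the encoding size
print(feats == pair_features(q2, q1))      # order of the questions does not matter
```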

I would need some more time to complete this.

The RNN_Encoder model needs to be added to the Siamese model with frozen weights, and then the classifier will be trained.
