This notebook was run on AWS on a `p3.8xlarge` in approximately 35 hours.

The environment is available as a public Docker container: `hamelsmu/ml-gpu`

# Build Language Model From Docstrings

The goal is to build a language model using the docstrings, and use that language model to generate an embedding for each docstring.  

In [1]:
import torch,cv2
from lang_model_utils import lm_vocab, load_lm_vocab, train_lang_model
from general_utils import save_file_pickle, load_file_pickle
import logging
from pathlib import Path
from fastai.text import *
use_cache = True

source_path = Path('./data')

with open(source_path/'train.docstring', 'r') as f:
    trn_raw = f.readlines()

with open(source_path/'valid.docstring', 'r') as f:
    val_raw = f.readlines()



In [2]:
trn_raw[:10]

['render this object as a dict of its fields .\n',
 "unset fields so they do n't appear in the message body .\n",
 'runs the daily searcher , queuing selected episodes for search\n',
 'read data from an h5 file in sxs format\n',
 'output the waveform in nrar format .\n',
 'remove the left - whitespace margin of a block of python code .\n',
 'print a line or lines of python which already contain indentation .\n',
 'print a series of lines of python .\n',
 'print a line of python , indenting it according to the current indent level .\n',
 'close this printer , flushing any remaining lines .\n']

## Pre-process data for language model

We will use the `lm_vocab` class to prepare our data for the language model.

In [3]:
vocab = lm_vocab(max_vocab=50000,
                 min_freq=15)

# fit the transform on the training data, then transform
trn_flat_idx = vocab.fit_transform_flattened(trn_raw)

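`lm_vocab` comes from this project's `lang_model_utils`, and its internals are not shown in this notebook. As a rough illustration of what `max_vocab` and `min_freq` usually control, the sketch below builds a vocabulary with `collections.Counter`. It assumes whitespace tokenization; `build_vocab` and `numericalize` are hypothetical names rather than the library's API. The special-token order is chosen to match what the outputs in this notebook show (`_xbos_` at index 2, padding at index 1); placing `_unk_` at index 0 is an assumption.

```python
from collections import Counter

# Hypothetical special tokens; the real lm_vocab may define them differently.
SPECIALS = ['_unk_', '_pad_', '_xbos_']

def build_vocab(docs, max_vocab=50000, min_freq=15):
    """Return (itos, stoi), keeping at most max_vocab tokens seen >= min_freq times."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    itos = SPECIALS + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(docs, stoi):
    """Flatten all documents into one list of ids, marking each start with _xbos_."""
    unk = stoi['_unk_']
    flat = []
    for doc in docs:
        flat.append(stoi['_xbos_'])
        flat.extend(stoi.get(tok, unk) for tok in doc.split())
    return flat

docs = ['print a line of python .', 'print a series of lines of python .']
itos, stoi = build_vocab(docs, max_vocab=10, min_freq=2)
flat_ids = numericalize(docs, stoi)
```

Tokens that fall below `min_freq` (here `line`, `series`, and `lines`) map to the unknown-token id, which is why the vocabulary can end up smaller than `max_vocab`.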


Look at the transformed data

In [4]:
trn_flat_idx[:10]

array([  2, 566,  18,  40,  36,   5, 108,   7, 144, 250])

In [5]:
[vocab.itos[x] for x in trn_flat_idx[:10]]

['_xbos_',
 'render',
 'this',
 'object',
 'as',
 'a',
 'dict',
 'of',
 'its',
 'fields']

In [6]:
# apply transform to validation data
val_flat_idx = vocab.transform_flattened(val_raw)



Save files for later use

In [15]:
if not use_cache:
    vocab.save('./data/lang_model/vocab.cls')
    save_file_pickle('./data/lang_model/trn_flat_idx_list.pkl', trn_flat_idx)
    save_file_pickle('./data/lang_model/val_flat_idx_list.pkl', val_flat_idx)



## Train Fast.AI Language Model

This step reads in the files created above and trains a [fast.ai](https://github.com/fastai/fastai/tree/master/fastai) language model. The model learns to predict the next word in a sentence using fast.ai's implementation of [AWD LSTM](https://github.com/salesforce/awd-lstm-lm).

The goal of training this model is to build a general-purpose feature extractor for text that can be used in downstream models. In this case, we will use this model to produce embeddings for function docstrings.

In [16]:
vocab = load_lm_vocab('./data/lang_model/vocab.cls')
trn_flat_idx = load_file_pickle('./data/lang_model/trn_flat_idx_list.pkl')
val_flat_idx = load_file_pickle('./data/lang_model/val_flat_idx_list.pkl')



In [3]:
if not use_cache:
    fastai_learner, lang_model = train_lang_model(model_path = './data/lang_model_weights',
                                                  trn_indexed = trn_flat_idx,
                                                  val_indexed = val_flat_idx,
                                                  vocab_size = vocab.vocab_size,
                                                  lr=3e-3,
                                                  em_sz= 500,
                                                  nh= 500,
                                                  bptt=20,
                                                  cycle_len=1,
                                                  n_cycle=3,
                                                  cycle_mult=2,
                                                  bs = 200,
                                                  wd = 1e-6)
    
else:
    # only reached when training is skipped
    logging.warning('Not re-training language model because use_cache=True')


epoch      trn_loss   val_loss
    0      3.738238   3.603623
    1      3.63832    3.484815
    2      3.591155   3.430484
    3      3.569384   3.399131
    4      3.523898   3.375731
    5      3.538859   3.360246
    6      3.493145   3.345072

data/lang_model_weights/models/langmodel_best.h5

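The number of epochs in each training run follows from the cycle settings. The sketch below assumes the usual warm-restart convention in which each cycle is `cycle_mult` times longer than the previous one, and that `train_lang_model` passes `n_cycle`, `cycle_len`, and `cycle_mult` through to the learner unchanged; `total_epochs` is a hypothetical helper, not part of fast.ai. The results agree with the epoch counts in the training logs of this notebook (epochs 0 to 6, 0 to 5, and 0 to 8).

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Sum the cycle lengths when each cycle is cycle_mult times the previous one."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

first_run = total_epochs(n_cycle=3, cycle_len=1, cycle_mult=2)   # 1 + 2 + 4
second_run = total_epochs(n_cycle=3, cycle_len=2)                # 2 + 2 + 2
third_run = total_epochs(n_cycle=2, cycle_len=3, cycle_mult=2)   # 3 + 6
print(first_run, second_run, third_run)  # 7 6 9
```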


In [4]:
if not use_cache:
    fastai_learner.fit(1e-3, 3, wds=1e-6, cycle_len=2)


epoch      trn_loss   val_loss                                  
    0      3.477406   3.329124  
    1      3.445067   3.285131                                  
    2      3.445098   3.306307                                  
    3      3.434065   3.262412                                  
    4      3.440198   3.290214                                  
    5      3.412634   3.247557                                  



[array([3.24756])]

In [5]:
if not use_cache:
    fastai_learner.fit(1e-3, 2, wds=1e-6, cycle_len=3, cycle_mult=2)


epoch      trn_loss   val_loss                                  
    0      3.459676   3.302814  
    1      3.417481   3.245754                                  
    2      3.377371   3.226938                                  
    3      3.462012   3.309963                                  
    4      3.463038   3.281091                                  
    5      3.39692    3.249816                                  
    6      3.368832   3.218649                                  
    7      3.343394   3.201463                                  
    8      3.361614   3.197675                                  



[array([3.19767])]

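Validation loss is easier to interpret as perplexity. The sketch below assumes the reported `val_loss` is a mean per-token cross-entropy in nats, which is the common convention for language models but is not stated in this notebook; under that assumption, perplexity is simply `exp(loss)`. The three values are the final validation losses from the training runs above.

```python
import math

# Final validation loss of each of the three training runs above.
val_losses = [3.345072, 3.247557, 3.197675]

# Perplexity = exp(loss), valid only if the loss is mean cross-entropy in nats.
perplexities = [math.exp(loss) for loss in val_losses]

for loss, ppl in zip(val_losses, perplexities):
    print(f'val_loss={loss:.6f}  perplexity={ppl:.1f}')
```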
Save language model and learner

In [9]:
if not use_cache:
    fastai_learner.save('lang_model_learner.fai')
    lang_model_new = fastai_learner.model.eval()
    torch.save(lang_model_new, './data/lang_model/lang_model_gpu.torch')
    torch.save(lang_model_new.cpu(), './data/lang_model/lang_model_cpu.torch')

# Load Model and Encode All Docstrings

In [2]:
from lang_model_utils import load_lm_vocab
vocab = load_lm_vocab('./data/lang_model/vocab.cls')
idx_docs = vocab.transform(trn_raw + val_raw, max_seq_len=30, padding=False)
lang_model = torch.load('./data/lang_model/lang_model_gpu.torch', 
                        map_location=lambda storage, loc: storage)



In [3]:
lang_model.eval()

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(49811, 500, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(49811, 500, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(500, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 500)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=500, out_features=49811, bias=False)
    (dropout): LockedDropout(
    )
  )
)

In [4]:
def list2arr(l):
    "Convert list into pytorch Variable."
    return V(np.expand_dims(np.array(l), -1)).cpu()

def make_prediction_from_list(model, l):
    """
    Encode a list of integers that represent a sequence of tokens.  The
    purpose is to encode a sentence or phrase.

    Parameters
    -----------
    model : fastai language model
    l : list
        list of integers, representing a sequence of tokens that you want to encode

    """
    arr = list2arr(l)  # turn list into a PyTorch Variable with bs=1
    model.reset()  # language model is stateful, so you must reset upon each prediction
    hidden_states = model(arr)[-1][-1] # RNN Hidden Layer output is last output, and only need the last layer

    #return avg-pooling, max-pooling, and last hidden state
    return hidden_states.mean(0), hidden_states.max(0)[0], hidden_states[-1]


def get_embeddings(lm_model, list_list_int):
    """
    Vectorize a list of sequences List[List[int]] using a fast.ai language model.

    Parameters
    ---------
    lm_model : fastai language model
    list_list_int : List[List[int]]
        A list of sequences to encode

    Returns
    -------
    tuple: (avg, max, last)
        A tuple containing the average-pooling and max-pooling over time steps, as well as the last time step.
    """
    n_rows = len(list_list_int)
    n_dim = lm_model[0].nhid
    avgarr = np.empty((n_rows, n_dim))
    maxarr = np.empty((n_rows, n_dim))
    lastarr = np.empty((n_rows, n_dim))

    for i in tqdm_notebook(range(len(list_list_int))):
        avg_, max_, last_ = make_prediction_from_list(lm_model, list_list_int[i])
        avgarr[i,:] = avg_.data.numpy()
        maxarr[i,:] = max_.data.numpy()
        lastarr[i,:] = last_.data.numpy()

    return avgarr, maxarr, lastarr

In [5]:
%%time
avg_hs, max_hs, last_hs = get_embeddings(lang_model, idx_docs)



CPU times: user 64d 2h 6min 3s, sys: 1h 40min 49s, total: 64d 3h 46min 53s
Wall time: 1d 58min 56s

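The three embeddings returned by `get_embeddings` are different ways of collapsing the per-time-step hidden states of one docstring into a single vector. A minimal NumPy sketch of the same pooling, using made-up numbers and omitting the batch dimension of size 1 that the real hidden states carry:

```python
import numpy as np

# Toy hidden states for one sequence: 4 time steps, hidden size 3 (made-up values).
hidden_states = np.array([[ 0.1, -0.5,  0.3],
                          [ 0.4,  0.2, -0.1],
                          [-0.2,  0.6,  0.0],
                          [ 0.3, -0.1,  0.2]])

avg_pool = hidden_states.mean(axis=0)   # average over time steps
max_pool = hidden_states.max(axis=0)    # element-wise maximum over time steps
last_state = hidden_states[-1]          # hidden state at the final time step
```

Each result has one value per hidden unit, so every docstring maps to a fixed-length vector regardless of how many tokens it contains.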

# Save Language Model Embeddings For Docstrings To Disk

In [12]:
savepath = Path('./data/lang_model_emb/')
savepath.mkdir(parents=True, exist_ok=True)  # np.save does not create missing directories
np.save(savepath/'avg_emb_dim500.npy', avg_hs)
np.save(savepath/'max_emb_dim500.npy', max_hs)
np.save(savepath/'last_emb_dim500.npy', last_hs)
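
To use the embeddings later, the saved arrays can be read back with `np.load`. The sketch below is a self-contained round trip in a temporary directory, with a small random array standing in for `avg_hs`; the 5 x 500 shape is illustrative only, since the real arrays have one row per docstring.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    savepath = Path(tmp) / 'lang_model_emb'
    savepath.mkdir(parents=True, exist_ok=True)

    # Stand-in for avg_hs: 5 docstrings x 500 dimensions of random values.
    fake_avg_hs = np.random.rand(5, 500)
    np.save(savepath / 'avg_emb_dim500.npy', fake_avg_hs)

    # Load the array back and check that nothing changed.
    loaded = np.load(savepath / 'avg_emb_dim500.npy')
    roundtrip_ok = loaded.shape == (5, 500) and np.array_equal(loaded, fake_avg_hs)
```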