This notebook ran on an AWS `p3.2xlarge` instance in approximately 15 hours.

The environment is available as a public Docker container: `hamelsmu/ml-gpu`

# Build Language Model From Docstrings

The goal is to build a language model using the docstrings, and use that language model to generate an embedding for each docstring.  

In [3]:
from lang_model_utils import preprocess_lm_data

## Pre-process data for language model

We will use the helper function `preprocess_lm_data` to prepare our data for the language model.

In [4]:
source_path = '/ds/hamel/code_search_data'
dest_path = '/ds/hamel/code_search_data/outputs'

In [6]:
preprocess_lm_data(read_data_path=source_path,
                   save_data_path=dest_path,
                   train_file='train.docstring',
                   validation_file='valid.docstring',
                   max_vocab=50000,
                   min_freq=15
                  )



`preprocess_lm_data` saves four files to the `save_data_path`:


1. `itos_dict.pkl`: a dict that maps integer indices to strings.
2. `stoi_dict.pkl`: a defaultdict that maps strings to integer indices, with a default value of zero.
3. `trn_indexed.npy`: a numpy array that is the indexed version of the training data.
4. `val_indexed.npy`: a numpy array that is the indexed version of the validation data.
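
To make the relationship between these files concrete, here is a minimal sketch of how such a vocabulary could be built and applied. It is not the code inside `lang_model_utils`: the helper name `build_vocab`, the `_unk_` token, and the exact meaning given to `max_vocab` and `min_freq` (keep at most `max_vocab` tokens that occur at least `min_freq` times) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_vocab(tokenized_docs, max_vocab=50000, min_freq=15):
    """Hypothetical helper: build itos/stoi mappings from tokenized documents.

    Assumes max_vocab caps the number of kept tokens and min_freq is the
    minimum count a token needs in order to be kept.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    # Index 0 is reserved for unknown tokens ('_unk_' is an assumed name).
    itos = dict(enumerate(['_unk_'] + kept))
    # defaultdict(int) returns 0 for any token that is not in the vocabulary.
    stoi = defaultdict(int, {tok: idx for idx, tok in itos.items()})
    return itos, stoi


docs = [['return', 'the', 'sum'], ['return', 'the', 'mean'], ['return', 'none']]
itos, stoi = build_vocab(docs, max_vocab=10, min_freq=2)

# The "indexed" data replaces every token with its integer index.
indexed = [[stoi[tok] for tok in doc] for doc in docs]
print(indexed)  # [[1, 2, 0], [1, 2, 0], [1, 0]]
```

Because the default index is zero, rare or unseen tokens all collapse onto the same unknown-token index, which is what the default value of `stoi_dict.pkl` suggests.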


In [13]:
!ls -lah {dest_path}

total 538M
drwxr-xr-x 2 root root 6.0K May 14 23:25 .
drwxr-xr-x 4 root root 6.0K May 14 23:25 ..
-rw-r--r-- 1 root root 910K May 14 23:25 itos_dict.pkl
-rw-r--r-- 1 root root 910K May 14 23:25 stoi_dict.pkl
-rw-r--r-- 1 root root 514M May 14 23:25 trn_indexed.npy
-rw-r--r-- 1 root root  22M May 14 23:25 val_indexed.npy


## Train Fast.AI Language Model

This step reads the files created above and trains a [fast.ai](https://github.com/fastai/fastai/tree/master/fastai) language model. The model learns to predict the next word in a sentence using fast.ai's implementation of [AWD LSTM](https://github.com/salesforce/awd-lstm-lm).
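
The next-word objective can be illustrated with a small sketch: each training target is simply the input sequence shifted by one token. The function `next_word_pairs` and the fixed `seq_len` are hypothetical; fast.ai's own data loader batches the token stream differently, so this shows only the idea, not the library's implementation.

```python
import numpy as np


def next_word_pairs(token_ids, seq_len):
    """Hypothetical helper: split a 1-D stream of token ids into windows.

    Returns (inputs, targets), where targets are the inputs shifted one
    position to the right, i.e. the "next word" at every step.
    """
    token_ids = np.asarray(token_ids)
    # One token is held back so that every input position has a target.
    n_windows = (len(token_ids) - 1) // seq_len
    usable = n_windows * seq_len
    inputs = token_ids[:usable].reshape(n_windows, seq_len)
    targets = token_ids[1:usable + 1].reshape(n_windows, seq_len)
    return inputs, targets


ids = np.array([5, 8, 2, 9, 4, 7, 3])
x, y = next_word_pairs(ids, seq_len=3)
print(x.tolist())  # [[5, 8, 2], [9, 4, 7]]
print(y.tolist())  # [[8, 2, 9], [4, 7, 3]]
```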

The goal of training this model is to build a general-purpose feature extractor for text that can be used in downstream models. In this case, we will use the model to produce embeddings for function docstrings.
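
One common way to turn a language model into a feature extractor is to pool its per-token hidden states into a single fixed-length vector. The sketch below assumes the hidden states are already available as an array of shape `(num_tokens, hidden_dim)`; the function `pool_hidden_states` and the three strategies are illustrative assumptions, not necessarily the approach used later in this project.

```python
import numpy as np


def pool_hidden_states(hidden, strategy='mean'):
    """Hypothetical helper: collapse (num_tokens, hidden_dim) into (hidden_dim,)."""
    hidden = np.asarray(hidden, dtype=float)
    if strategy == 'mean':
        return hidden.mean(axis=0)  # average over all token positions
    if strategy == 'max':
        return hidden.max(axis=0)   # element-wise maximum over positions
    if strategy == 'last':
        return hidden[-1]           # hidden state after the final token
    raise ValueError(f'unknown pooling strategy: {strategy}')


# Toy hidden states for a 3-token docstring with hidden_dim = 2.
hidden = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
mean_vec = pool_hidden_states(hidden, 'mean')
max_vec = pool_hidden_states(hidden, 'max')
last_vec = pool_hidden_states(hidden, 'last')
print(mean_vec.tolist(), max_vec.tolist(), last_vec.tolist())
# [3.0, 4.0] [5.0, 6.0] [5.0, 4.0]
```

Whichever strategy is chosen, every docstring ends up as a vector of the same length regardless of how many tokens it contains, which is what downstream models need.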

In [1]:
from lang_model_utils import train_lang_model

In [None]:
fastai_learner, lm_model = train_lang_model(model_path='/ds/hamel/code_search_data/outputs/lm_model',
                                            data_path='/ds/hamel/code_search_data/outputs')


epoch      trn_loss   val_loss                                      
    0      3.89538    3.692201  
 86%|████████▌ | 90702/105245 [1:17:20<12:23, 19.55it/s, loss=3.71]