These are samples used in the University of Cambridge course Machine Learning for Programming.
Scaffolding for a simple language model is provided in `language_model/`, for TensorFlow 1.X, TensorFlow 2.X, and PyTorch. Python 3.6 or later is required. If you want to re-use this, pick the framework you want to use, then install it and the requirements for this model using `pip install -r requirements.txt`.
To get started, open a console and change your current directory to `language_model/`. Alternatively, add that directory to your `PYTHONPATH` environment variable:

```
export PYTHONPATH=/path/to/language_model
```
The scaffold provides some generic code to simplify the task (such as a training loop and logic for saving and restoring models), but you need to complete the code in a number of places to obtain a working model (these are marked by `#TODO N#` in the code):
1. In `model.py`, uncomment the line corresponding to the framework you want to use.
2. In `dataset.py`, `load_data_file` needs to be filled in to read a data file and return a sequence of lists of tokens; each list is considered one sample. This should re-use the code from the first practical to provide one sample for the tokens in each method. It is common practice to normalise the capitalisation of tokens (as the embeddings of `foo` and `Foo` should be similar), so make sure that `load_data_file` transforms all tokens to lower (or upper) case. You should be able to test this as follows:

   ```
   $ python test_step2.py data/jsoup/src/main/java/org/jsoup/Jsoup.java.proto | tail -n -1
   ['public', 'static', 'boolean', 'isvalid', 'lparen', 'string', 'bodyhtml', 'comma', 'whitelist', 'whitelist', 'rparen', 'lbrace', 'return', 'new', 'cleaner', 'lparen', 'whitelist', 'rparen', 'dot', 'isvalidbodyhtml', 'lparen', 'bodyhtml', 'rparen', 'semi', 'rbrace']
   ```
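   A minimal sketch of one possible implementation, assuming a hypothetical helper `get_tokens_per_method` that wraps the proto-parsing code from the first practical:

   ```python
   from typing import List

   def load_data_file(file_path: str) -> List[List[str]]:
       """Return one token list per method in the file, with normalised case."""
       samples = []
       for method_tokens in get_tokens_per_method(file_path):  # assumed helper
           # Lower-case all tokens so that `foo` and `Foo` share an embedding:
           samples.append([token.lower() for token in method_tokens])
       return samples
   ```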
3. In `dataset.py`, `build_vocab_from_data_dir` needs to be completed to compute a vocabulary from the data. The vocabulary will be used to represent all tokens by integer IDs, and we need to consider four special tokens: the `UNK` token used to represent infrequent tokens and those not seen at training time, the `PAD` token used to make all samples the same length, the `START_SYMBOL` token used as the first token in every sample, and the `END_SYMBOL` token used as the last. To do this, we use the class `Vocabulary` from `dpu_utils.mlutils.vocabulary`. Using `load_data_file` from above, compute the frequency of tokens in the passed `data_dir` (`collections.Counter` is useful here) and use that information to add the `vocab_size` most common of them to `vocab`. You can test this step as follows:

   ```
   $ python test_step3.py data/jsoup/src/main/java/org/jsoup/
   Loaded vocabulary for dataset:
    {'%PAD%': 0, '%UNK%': 1, '%START%': 2, '%END%': 3, 'rparen': 4, 'lparen': 5, 'semi': 6, 'dot': 7, 'rbrace': 8, ' [...]
   ```
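   A possible sketch, assuming a hypothetical helper `get_data_files_from_directory` for enumerating the data files in `data_dir`, and assuming the `Vocabulary` constructor flags of recent `dpu_utils` versions:

   ```python
   from collections import Counter
   from dpu_utils.mlutils.vocabulary import Vocabulary

   START_SYMBOL = "%START%"
   END_SYMBOL = "%END%"

   def build_vocab_from_data_dir(data_dir: str, vocab_size: int) -> Vocabulary:
       # add_unk/add_pad insert the %UNK% and %PAD% special tokens:
       vocab = Vocabulary(add_unk=True, add_pad=True)
       vocab.add_or_get_id(START_SYMBOL)
       vocab.add_or_get_id(END_SYMBOL)

       token_counter = Counter()
       for file_path in get_data_files_from_directory(data_dir):  # assumed helper
           for token_seq in load_data_file(file_path):
               token_counter.update(token_seq)

       # Add the most frequent tokens, leaving room for the four special symbols:
       for token, _ in token_counter.most_common(vocab_size - 4):
           vocab.add_or_get_id(token)
       return vocab
   ```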
4. In `dataset.py`, `tensorise_token_sequence` needs to be completed to translate a token sequence into a sequence of integer token IDs of uniform length. The output of the function should always be a list of `length` token IDs from `vocab`, where longer sequences are truncated and shorter sequences are padded to the correct length. We also want to use this method to insert the `START_SYMBOL` at the beginning of each sample. The special `END_SYMBOL` needs to be appended to indicate the end of a list of tokens, whereas the special `PAD_SYMBOL` needs to be added as a filler so that all token sequences have the same length. You can test this step as follows (note that this example output uses a `count_threshold` of 2):

   ```
   $ python test_step4.py data/jsoup/src/main/java/org/jsoup/
   Sample 0:
    Real length: 50
    Tensor length: 50
    Raw tensor: [ 2 13 1 4 3 8 118 4 3 5 7 13 1 4 12 1 3 8 118 4 1 3 5 7 13 1 4 1 1 3 8 118 4 1 3 5 7 13 1 4 12 1 9 1 1 3 8 118 4 1] (truncated)
    Interpreted tensor: ['%START%', 'public', '%UNK%', 'lparen', 'rparen', 'lbrace', 'super', 'lparen', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'comma', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%'] (truncated)
   Sample 1:
    Real length: 46
    Tensor length: 50
    Raw tensor: [ 2 13 1 4 12 1 3 8 118 4 1 3 5 7 13 1 4 1 1 3 8 118 4 1 3 5 7 13 1 4 12 1 9 1 1 3 8 118 4 1 9 1 3 5 7 7 0 0] (truncated)
    Interpreted tensor: ['%START%', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'comma', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'comma', '%UNK%', 'rparen', 'semi', 'rbrace', 'rbrace', '%PAD%', '%PAD%'] (truncated)
   ...
   ```
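   One possible sketch; hedged, since the exact truncation behaviour (e.g. whether `END_SYMBOL` survives truncation, as assumed here) may differ from the reference solution:

   ```python
   from typing import List

   START_SYMBOL = "%START%"
   END_SYMBOL = "%END%"

   def tensorise_token_sequence(vocab, length: int, token_seq: List[str]) -> List[int]:
       # Reserve two slots for %START% and %END%, truncating long sequences:
       token_ids = [vocab.get_id_or_unk(START_SYMBOL)]
       token_ids.extend(vocab.get_id_or_unk(tok) for tok in token_seq[:length - 2])
       token_ids.append(vocab.get_id_or_unk(END_SYMBOL))
       # Fill short sequences with %PAD% up to the uniform length:
       pad_id = vocab.get_id_or_unk(vocab.get_pad())
       token_ids.extend(pad_id for _ in range(length - len(token_ids)))
       return token_ids
   ```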
5. The actual model needs to be built. Our goal is to learn to predict `tok[i]` based on the tokens `tok[:i]` seen so far. The process and scaffold are very similar in all frameworks. The methods `compute_logits` and `compute_loss_and_acc` need to be completed, and the `build` method can always be used to initialise weights and layers that will be re-used during training and prediction. Parameters such as `EmbeddingDim` and `RNNDim` should be hyperparameters, but values such as 64 work well. A sketch of these pieces follows the sub-list below.
   - In `compute_logits`, implement the logic to embed the `token_ids` input tensor into a distributed representation. In TF 1.X, you can use `tf.nn.embedding_lookup`; in TF 2.X, you can use `tf.keras.layers.Embedding`; and in PyTorch, you can use `torch.nn.Embedding` for this purpose. This should translate an `int32` tensor of shape `[Batch, Timesteps]` into a `float32` tensor of shape `[Batch, Timesteps, EmbeddingDim]`.
   - In `compute_logits`, implement an actual RNN consuming the results of the embedding layer. You can use `tf.keras.layers.GRU` resp. `torch.nn.GRU` (or their LSTM variants) for this. This should translate a `float32` tensor of shape `[Batch, Timesteps, EmbeddingDim]` into a `float32` tensor of shape `[Batch, Timesteps, RNNDim]`.
   - In `compute_logits`, implement a linear layer to translate the RNN output into an unnormalised probability distribution over the vocabulary. You can use `tf.keras.layers.Dense` resp. `torch.nn.Linear` for this. This should translate a `float32` tensor of shape `[Batch, Timesteps, RNNDim]` into a `float32` tensor of shape `[Batch, Timesteps, VocabSize]`.
   - In `compute_loss_and_acc`, implement a cross-entropy loss that compares the probability distribution computed at timestep `T` with the input at timestep `T+1` (which is the token that we want to predict). Note that this means that we need to discard the final RNN output, as we do not know the next token. You can use `tf.nn.sparse_softmax_cross_entropy_with_logits` resp. `torch.nn.CrossEntropyLoss` for this.
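   Below is a minimal PyTorch sketch of the four pieces above (hedged: the scaffold's real interface, hyperparameter handling, dropout, and multi-layer RNNs are omitted; masking and accuracy are only added in steps 6 and 7):

   ```python
   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class RNNSketch(nn.Module):
       def __init__(self, vocab_size: int, embedding_dim: int = 64, rnn_dim: int = 64):
           super().__init__()
           self.embedding = nn.Embedding(vocab_size, embedding_dim)
           self.rnn = nn.GRU(embedding_dim, rnn_dim, batch_first=True)
           self.project = nn.Linear(rnn_dim, vocab_size)

       def compute_logits(self, token_ids: torch.Tensor) -> torch.Tensor:
           embedded = self.embedding(token_ids)  # [B, T] -> [B, T, EmbeddingDim]
           rnn_out, _ = self.rnn(embedded)       # -> [B, T, RNNDim]
           return self.project(rnn_out)          # -> [B, T, VocabSize]

       def compute_loss(self, token_ids: torch.Tensor) -> torch.Tensor:
           logits = self.compute_logits(token_ids)
           # The output at timestep T predicts the input at timestep T+1, so
           # drop the final output (no known next token) and the first input:
           logits, targets = logits[:, :-1, :], token_ids[:, 1:]
           return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  targets.reshape(-1))
   ```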
   After completing these steps, you should be able to train the model and observe the loss going down (the accuracy value will only be filled in after step 6):

   ```
   $ python train.py trained_models data/jsoup/{,}
   Loading data ...
    Built vocabulary of 4697 entries.
   Loaded 2233 training samples from data/jsoup/.
   Loaded 2233 validation samples from data/jsoup/.
   Running model on GPU.
   Constructed model, using the following hyperparameters: {"optimizer": "Adam", "learning_rate": 0.01, "learning_rate_decay": 0.98, "momentum": 0.85, "max_epochs": 500, "patience": 5, "max_vocab_size": 10000, "max_seq_length": 50, "batch_size": 200, "token_embedding_size": 64, "rnn_type": "GRU", "rnn_num_layers": 2, "rnn_hidden_dim": 64, "rnn_dropout": 0.2, "use_gpu": true, "run_id": "RNNModel-2019-12-29-13-23-18"}
   Initial valid loss: 0.042.
   [...]
   == Epoch 1
    Train:  Loss 0.0303, Acc 0.000
    Valid:  Loss 0.0224, Acc 0.000
     (Best epoch so far, loss decreased 0.0224 from 0.0423)
     (Saved model to trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin)
   == Epoch 2
    Train:  Loss 0.0213, Acc 0.000
    Valid:  Loss 0.0195, Acc 0.000
     (Best epoch so far, loss decreased 0.0195 from 0.0224)
     (Saved model to trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin)
   [...]
   ```

   The saved models should already be usable as autocompletion models, using the provided `predict.py` script:

   ```
   $ python predict.py trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin public
   Prediction at step 0 (tokens ['public']):
    Prob 0.282: static
    Prob 0.099: void
    Prob 0.067: string
   Continuing with token static
   Prediction at step 1 (tokens ['public', 'static']):
    Prob 0.345: void
    Prob 0.173: document
    Prob 0.123: string
   Continuing with token void
   Prediction at step 2 (tokens ['public', 'static', 'void']):
    Prob 0.301: main
    Prob 0.104: isfalse
    Prob 0.089: nonullelements
   Continuing with token main
   Prediction at step 3 (tokens ['public', 'static', 'void', 'main']):
    Prob 0.999: lparen
    Prob 0.000: filterout
    Prob 0.000: iterator
   Continuing with token lparen
   Prediction at step 4 (tokens ['public', 'static', 'void', 'main', 'lparen']):
    Prob 0.886: string
    Prob 0.033: int
    Prob 0.030: object
   Continuing with token string
   ```

   Note that tokens such as `{` and `(` are represented as `lbrace` and `lparen` by the feature extractor and are used the same way here.
6. Finally, `compute_loss_and_acc` should be extended to also compute the number of (correct) predictions, so that the accuracy of the model can be computed. For this, you need to check if the most likely prediction corresponds to the ground truth. You can use `tf.argmax` resp. `torch.argmax` here. We also need to discount padding tokens, so you need to compute a mask indicating which predictions correspond to padding. Here, you can use `self.vocab.get_id_or_unk(self.vocab.get_pad())` to get the integer ID of the padding token. After completing this step, you should be able to evaluate the model:

   ```
   $ python evaluate.py trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin data/jsoup/
   Loading data ...
   Loaded trained model from trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin.
   Loaded 2233 test samples from data/jsoup/.
   Test:  Loss 24.9771, Acc 0.876
   ```
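   A hedged PyTorch sketch of the accuracy computation, using the shifted `logits` and `targets` tensors from step 5 (names are illustrative):

   ```python
   import torch

   def compute_accuracy(logits: torch.Tensor, targets: torch.Tensor,
                        pad_id: int) -> torch.Tensor:
       predictions = torch.argmax(logits, dim=-1)  # most likely token per position
       non_pad_mask = targets != pad_id            # False where the target is %PAD%
       num_correct = ((predictions == targets) & non_pad_mask).sum()
       return num_correct.float() / non_pad_mask.sum().clamp(min=1).float()
   ```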
7. To improve training, we want to ignore those parts of the sequence that are just `%PAD%` symbols introduced to get to a uniform length. To this end, we need to mask out the part of the loss that corresponds to these irrelevant tokens. You can re-use the mask computed in step 6 here.
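   A sketch of the masked loss under the same assumptions as above (alternatively, `F.cross_entropy` accepts an `ignore_index` argument to the same effect):

   ```python
   import torch
   import torch.nn.functional as F

   def masked_loss(logits: torch.Tensor, targets: torch.Tensor,
                   pad_id: int) -> torch.Tensor:
       per_token_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                        targets.reshape(-1), reduction="none")
       mask = (targets.reshape(-1) != pad_id).float()  # 0.0 at %PAD% positions
       return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
   ```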
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.