# Transformers for Code

1. AST data comes from python150k dataset `python100k_training.json` and `python50k_eval.json`
2. AST data is modified s.t. there are no nodes with type and value. If a node had a type and a value, the value was moved into a new child node. New AST data is stored in `output/new_trees.json`
3. `output/new_trees.json` is sanitized with `generate_raw.py` that strips it from any JSON syntax and retains the nodes value/type data and stored in `output/new_trees_raw.txt`
4. ~~A Huggingface Tokenizer `ByteLevelBPETokenizer` is trained on `output/new_trees_raw.txt` and stored in `tokenizer/`~~ For now I'll use a pretrained GPT2Tokenizer
5. The new AST trees are traversed in pre-order sequence and split up into smaller chunks of size $n$, $n$ being 1,000 by default in `models/trav_trans/generate_data.py` and stored as `output/dps.txt`
6. The splitted and sorted sequence will be converted in `convert.py` with the previously trained `ByteLevelBPETokenizer` and stored in `output/converted_train.txt`. The splitting function adds data that can determine if a tree was split or not. This allows us to mark every new tree with a special token `<ast>` that acts as a delimiter and tells our model that a new AST is about to be fed
7. The same steps will be applied to the `python50k_eval.json` in order to receive a `output/converted_eval.txt` evaluation dataset


## Dataset

The Dataset contains lists of encoded AST nodes, the max lengths being $1001$, as trees were split up to $1000$ node subtrees. Eventually, a special token `<ast>` is prepended to the input, resulting in a total max length of $1001$.

```
Update: AST slices were reduced to max 255. Eventually adding the special token results in a maximum length of 256
```

In [1]:
import data

cache_path = "output/inputs.pkl"

code_dataset = data.CodeDataset() # Init Dataset object
code_dataset.load_from_cache(cache_path) # Load previously cached pickled list instead of reading from file

In [2]:
len(code_dataset.__getitem__(519)) # 519 was known to be buggy and result in a larger size than 255

255

## Tokenizer

The Tokenizer was trained on the nodes contained in `py100k_train.json`. If a node contains a string including whitespaces, the whitespaces are replaced with a token `<spc>`, otherwise the tokenizer would tokenize the string into several sub-tokens due to splitting at whitespace. 
Finally, the trained Tokenizer is exported in `tokenizer/code-tokenizer.json`

In [3]:
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file = "tokenizer/code-tokenizer.json")

In [4]:
tokenizer.pad_token = "<pad>"
tokenizer.add_special_tokens({"additional_special_tokens": ["<ast>"]})
tokenizer.additional_special_tokens

['<ast>']

Using the Tokenizer decoder will ignore special tokens i.e. `<unk>`. To make these tokens visible, the parameter `skip_special_tokens` has to be set to `False`.

## Trainer

The `Trainer` class contains the training parameters and configurations.

In [5]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, GPT2LMHeadModel, GPT2Config

# Configs from facebookresearch/code-prediction-transformer with reduced context size due to GPU constraints
configuration = GPT2Config(
    n_positions = 256,
    n_ctx = 256,
    n_layers=6, 
    n_embd=300, 
    n_head=6, 
    layer_norm_epsilon=1e-6, 
    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(configuration)

data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm = False)

training_args = TrainingArguments(
    output_dir = "outputs/code-model",
    overwrite_output_dir = True,
    num_train_epochs = 3,
    per_device_train_batch_size = 6
)

trainer = Trainer(
    model = model, 
    args = training_args,
    data_collator = data_collator,
    train_dataset = code_dataset
)

In [None]:
import torch
torch.cuda.empty_cache()

trainer.train()
trainer.save_model()

Step,Training Loss
500,5.8634
1000,3.5338
1500,3.2173
2000,3.0933
2500,3.0016
