[![Github](https://img.shields.io/github/stars/lab-ml/python_autocomplete?style=social)](https://github.com/lab-ml/python_autocomplete)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/python_autocomplete/blob/master/notebooks/train.ipynb)

# Train a character-level autoregressive model on Python source code

This notebook downloads the repositories linked from the [Awesome PyTorch List](https://github.com/bharathgs/Awesome-pytorch-list/) and trains a character-level model on their Python source code.

The [evaluation notebook](https://github.com/lab-ml/python_autocomplete/blob/master/notebooks/evaluate.ipynb) evaluates the trained model on some samples. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/python_autocomplete/blob/master/notebooks/evaluate.ipynb)

### Install dependencies

In [None]:
%%capture
!pip install labml labml_python_autocomplete

Imports

In [2]:
from python_autocomplete import create_dataset
from labml.logger import inspect
from labml import monit, lab

import numpy as np

## Prepare dataset

We will prepare the dataset step by step:

1. download the `readme` file from the Awesome PyTorch List
2. pick out the GitHub links in it
3. download those repositories
4. remove non-Python files
5. merge all Python files into training and validation text files

The code for this is in [`create_dataset.py`](https://github.com/lab-ml/python_autocomplete/blob/master/python_autocomplete/create_dataset.py).
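
As a rough illustration of step 2, the sketch below pulls `(user, repo)` pairs out of README text with a regular expression. The pattern and the `extract_repos` helper are assumptions made for this example; the actual logic lives in `create_dataset.py` and may differ.

```python
import re

# Assumed pattern for illustration: https://github.com/<user>/<repo>
GITHUB_LINK = re.compile(r'https://github\.com/([\w.-]+)/([\w.-]+)')


def extract_repos(readme_text):
    """Return the sorted, de-duplicated (user, repo) pairs linked in the text."""
    repos = set()
    for user, repo in GITHUB_LINK.findall(readme_text):
        # Drop a sentence-ending period that the pattern may have swallowed
        repos.add((user, repo.rstrip('.')))
    return sorted(repos)


# Small hand-written sample, not the real README
sample = (
    "* [PyTorch](https://github.com/pytorch/pytorch): tensors\n"
    "* [vision](https://github.com/pytorch/vision)\n"
    "* Also see https://github.com/pytorch/pytorch.\n"
)
found = extract_repos(sample)
```

De-duplicating matters here because a curated list often links the same repository more than once.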

In [3]:
create_dataset.create_folders()

Get the list of repositories from [Awesome-PyTorch-list](https://github.com/bharathgs/Awesome-pytorch-list/)

In [4]:
create_dataset.get_awesome_pytorch_readme()
repos = create_dataset.get_repos_from_readme('pytorch_awesome.md')
inspect(repos)

Download zip files. For demonstration we only use 10 repos.

In [1]:
repos = repos[:10]
for i, r in monit.enum(f"Download {len(repos)} repos", repos):
    #  Download the repository zip
    zip_file = create_dataset.download_repo(r[0], r[1], i)
    # Extract the zip file
    extracted = create_dataset.extract_zip(zip_file)
    # Remove non .py files
    create_dataset.remove_files(extracted, {'.py'})


Get list of python files across all repositories

In [6]:
source_files = create_dataset.get_python_files()
np.random.shuffle(source_files)
inspect(source_files)

Split the files into training and validation and merge them into `train.py` and `valid.py`

In [7]:
train_valid_split = int(len(source_files) * 0.9)
create_dataset.concat_and_save(lab.get_data_path() / 'train.py', source_files[:train_valid_split])
create_dataset.concat_and_save(lab.get_data_path() / 'valid.py', source_files[train_valid_split:])
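
Note that the split is made per file, so no file contributes to both sets. A minimal standalone sketch of the same 90/10 split follows; the `split_files` helper and the fixed seed are assumptions added for reproducibility, not something this notebook does.

```python
import random


def split_files(files, train_fraction=0.9, seed=0):
    """Shuffle a copy of `files` and split it into (train, valid) lists."""
    shuffled = list(files)
    # A seeded generator makes the split repeatable across runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, for illustration only
files = [f"repo/file_{i}.py" for i in range(10)]
train_files, valid_files = split_files(files)
```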

## Train the model

The training script is defined in [`train.py`](https://github.com/lab-ml/python_autocomplete/blob/master/python_autocomplete/train.py).
We import the `Configs` class from it and create a new experiment with
custom configurations.
You can experiment with changing these configurations.

In [8]:
from python_autocomplete.train import Configs
from labml import experiment

Initialize `Configs` object

In [9]:
conf = Configs()

Create a new experiment

In [None]:
experiment.create(name="python_autocomplete",
                  comment='Colab demo')

Set configurations

We use a dictionary for custom configurations. To see which options are available, look at the configurations of a [previous training run](https://web.lab-ml.com/configs?uuid=39b03a1e454011ebbaff2b26e3148b3d).

In [2]:
custom_conf = {}

We will try a `transformer_model`. You can use `lstm_model` if you want to try an LSTM instead.

In [None]:
custom_conf['model'] = 'transformer_model'

Number of layers in the model. We set it to `2` for demonstration.

In [None]:
custom_conf['n_layers'] = 2

Batch size for training. You will have to reduce this if you use a larger model, because of GPU memory constraints.

In [None]:
custom_conf['batch_size'] = 64

Number of epochs. I usually set a high number and stop the training when the validation loss stops improving.

In [None]:
custom_conf['epochs'] = 32
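
Stopping here is a manual decision, but the rule of thumb can be written down. The sketch below is a hypothetical patience-based check, not part of `labml` or this project: it reports that training should stop once the last `patience` validation losses fail to beat the best earlier one.

```python
def should_stop(valid_losses, patience=3):
    """Return True if the last `patience` losses did not improve on the earlier best."""
    if len(valid_losses) <= patience:
        # Not enough history to judge yet
        return False
    best_before = min(valid_losses[:-patience])
    best_recent = min(valid_losses[-patience:])
    return best_recent >= best_before


# Made-up loss values, for illustration only
still_improving = should_stop([3.0, 2.5, 2.4, 2.3])
plateaued = should_stop([3.0, 2.5, 2.4, 2.45, 2.41, 2.43])
```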

We use the [Noam optimizer](https://lab-ml.com/labml_nn/optimizers/noam.html) (the one used in the original Transformer paper). You can also use another optimizer such as `AdamW`, paired with a warmup schedule. Transformer training usually needs a warmup period during which the learning rate is kept low for the initial training steps.

Note that although the learning rate is set to `1.0`, the actual learning rate will be $<10^{-3}$, because the [Noam optimizer](https://lab-ml.com/labml_nn/optimizers/noam.html) scales the learning rate by $\frac{1}{\sqrt{d_{model}}}$, where $d_{model}$ is the number of dimensions in the transformer feature vector, and by a step-dependent warmup and decay factor.

In [None]:
custom_conf['optimizer.optimizer'] = 'Noam'
custom_conf['optimizer.learning_rate'] = 1.0
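
To make the scaling concrete, here is a minimal sketch of the Noam schedule as given in the original Transformer paper. The values `d_model=512` and `warmup=4000` are illustrative values taken from that paper and are assumptions here; they are not necessarily what `Configs` uses.

```python
def noam_lr(step, d_model, warmup, factor=1.0):
    """Learning rate at a given step (steps start at 1) under the Noam schedule."""
    # Linear warmup for `warmup` steps, then inverse square-root decay
    schedule = min(step ** -0.5, step * warmup ** -1.5)
    return factor * d_model ** -0.5 * schedule


# Assumed values for illustration
d_model, warmup = 512, 4000

early = noam_lr(100, d_model, warmup)     # still warming up
peak = noam_lr(warmup, d_model, warmup)   # highest point of the schedule
late = noam_lr(40000, d_model, warmup)    # decaying
```

With these assumed values the peak is about $7 \times 10^{-4}$, so a configured learning rate of `1.0` acts as a multiplier on the schedule rather than as the step size itself.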

Number of characters in a sample. This defaults to `512`, but we specify it here to make it easier to change.

In [None]:
custom_conf['seq_len'] = 512
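
For an autoregressive model, each sample is a window of `seq_len` characters and its target is the same window shifted by one character. The sketch below shows that idea on a list of token ids. The non-overlapping windows and the `make_samples` helper are assumptions for illustration; the project's dataset code may slice the text differently.

```python
def make_samples(ids, seq_len):
    """Cut `ids` into (input, target) pairs where target is input shifted by one."""
    samples = []
    # Stop early enough that every target has a full `seq_len` items
    for start in range(0, len(ids) - seq_len, seq_len):
        x = ids[start:start + seq_len]
        y = ids[start + 1:start + seq_len + 1]
        samples.append((x, y))
    return samples


# Tiny example with seq_len=4 instead of 512
samples = make_samples(list(range(10)), seq_len=4)
```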

Our training loop switches between training and validation within an epoch, so that we get the validation loss (on a fraction of the validation data) more frequently. This is especially useful when an epoch takes a long time to train. `inner_iterations` should be increased for larger training datasets, since each epoch then takes longer. We set it to `5` since we are only training on $10$ repositories.

In [3]:
custom_conf['inner_iterations'] = 5
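
The interleaving can be pictured with the sketch below, which splits one epoch's training and validation batches into `inner_iterations` slices and alternates between them. This is a simplified model written for this explanation, with a hypothetical `interleave` helper; it is not the actual `labml` training loop.

```python
def interleave(train_batches, valid_batches, inner_iterations):
    """Yield (mode, batch) pairs, alternating train and valid slices in one epoch."""

    def chunk(seq, i):
        # i-th of `inner_iterations` nearly equal slices of `seq`
        start = len(seq) * i // inner_iterations
        end = len(seq) * (i + 1) // inner_iterations
        return seq[start:end]

    for i in range(inner_iterations):
        for batch in chunk(train_batches, i):
            yield 'train', batch
        for batch in chunk(valid_batches, i):
            yield 'valid', batch


# 10 training batches and 5 validation batches, split into 5 inner iterations
schedule = list(interleave(list(range(10)), list(range(5)), inner_iterations=5))
modes = [mode for mode, _ in schedule]
```

Here a validation loss is reported after every two training batches instead of once at the end of the epoch.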

In [10]:
experiment.configs(conf, custom_conf)

Add the models to be saved and loaded. If you plan to stop and later continue the training, you should also include the optimizer.

*Accessing `conf.model` loads the model and the dataset; the dataset is needed to calculate the number of tokens (distinct characters) used to initialize the model.*

In [11]:
experiment.add_pytorch_models({'model': conf.model})
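
The reason the dataset has to be read first is that, in a character-level model, the vocabulary is simply the set of distinct characters in the text. A minimal sketch of that idea follows; the helper names are hypothetical and the project's own tokenization code may differ.

```python
def build_vocab(text):
    """Map each distinct character to an integer id, and back."""
    itos = sorted(set(text))                    # id -> character
    stoi = {c: i for i, c in enumerate(itos)}   # character -> id
    return stoi, itos


def encode(text, stoi):
    return [stoi[c] for c in text]


def decode(ids, itos):
    return ''.join(itos[i] for i in ids)


# Tiny stand-in for the training text
text = "def f():\n    pass\n"
stoi, itos = build_vocab(text)
n_tokens = len(itos)  # the size the model's embedding and output layers need
round_trip = decode(encode(text, stoi), itos)
```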

Start the experiment and run it. You can optionally load and continue from a previously saved run.

In [None]:
# experiment.load('d5ba7f56d88911eaa6629b54a83956dc')
with experiment.start():
    conf.run()