# UCL-COMP0087 Project Demo

#### Colab Setup

In [None]:
# Setup
from google import colab
colab.drive.mount('/content/drive')

# all imports, login, connect drive
import os
from pathlib import Path
import requests
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive = build('drive', 'v3').files()

# Recursively resolve a Drive file ID to its path (root folder first)
def get_path(file_id):
    f = drive.get(fileId=file_id, fields='name, parents').execute()
    name = f.get('name')
    parents = f.get('parents')
    if parents:
        # Assumes each file has exactly one parent folder
        return get_path(parents[0]) / name
    return Path(name)

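# Change the working directory to the folder that contains this notebook
def chdir_notebook():
    # Colab's internal Jupyter endpoint; the address is environment-specific
    d = requests.get('http://172.28.0.2:9000/api/sessions').json()[0]
    file_id = d['path'].split('=')[1]
    path = get_path(file_id)
    # Drive is mounted at /content/drive, so build an absolute path
    nb_dir = Path('/content/drive') / path.parent
    os.chdir(nb_dir)
    return nb_dir

chdir_notebook()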

In [None]:
!pip install -r config/requirements_colab.txt

----
## A. Data preprocessing
### 1) JSON->bin
Preprocess the CoNaLa data from `.json` format and save the processed `.bin` files to the `data/conala` folder.

**Data Format**
* Source: Short natural-language questions from Stack Overflow.
* Target: Code snippet, e.g. `pandas.read_csv('file.csv', nrows=100)`.

**Preprocess**
* Canonicalization: Identify question-specific string literals and variable names with regular expressions and replace them with universal tokens `str_0`, `str_1`, ... and `var_0`, `var_1`, ...
* Lowercasing and tokenization.
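
The canonicalization step can be illustrated with a minimal sketch. This is not the project's actual preprocessing code: the `canonicalize` helper, its regular expression, and the rule "bare identifiers become `var_i`, everything else becomes `str_i`" are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: text quoted with backticks, single or double quotes.
QUOTED = re.compile(r"`([^`]+)`|'([^']+)'|\"([^\"]+)\"")

def canonicalize(question, snippet):
    """Replace quoted spans with str_i / var_i slots, then lowercase and tokenize."""
    slots = {}                          # literal text -> slot name
    counts = {"str": 0, "var": 0}

    def to_slot(match):
        literal = next(g for g in match.groups() if g is not None)
        if literal not in slots:
            # Assumption: a bare identifier is a variable, anything else a string.
            kind = "var" if literal.isidentifier() else "str"
            slots[literal] = f"{kind}_{counts[kind]}"
            counts[kind] += 1
        return slots[literal]

    question = QUOTED.sub(to_slot, question)

    # Apply the same mapping to the code, longest literal first.
    for literal, slot in sorted(slots.items(), key=lambda kv: -len(kv[0])):
        if slot.startswith("var"):
            # Whole-word match so that other identifiers are left alone.
            snippet = re.sub(rf"\b{re.escape(literal)}\b", slot, snippet)
        else:
            snippet = snippet.replace(literal, slot)

    tokens = question.lower().split()   # whitespace tokenization only
    return tokens, snippet, slots

tokens, code, slots = canonicalize(
    "Read the first 100 rows of 'file.csv' into `df`",
    "df = pandas.read_csv('file.csv', nrows=100)",
)
print(tokens)
print(code)
print(slots)
```

A real pipeline also has to handle apostrophes in ordinary words and keep the slot map so that the original literals can be restored after decoding; this sketch ignores the first and only returns the second.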

TODO: Understand the preprocessing details and note them down here.

**Data Split**
* Scraped data: Automatically mined, not human-curated. Used for pretraining.
* Train set: Human-curated (gold) training set for fine-tuning, 2185 examples in total.
* Dev set: 200 examples held out from the gold training set.
* Test set: 500 CoNaLa test instances.
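
Holding out a dev set from the gold training data could look like the sketch below. Whether the project shuffles before splitting, and which seed it uses, is not stated here; `split_train_dev`, the seed, and the placeholder data are assumptions for illustration.

```python
import random

def split_train_dev(examples, n_dev=200, seed=0):
    """Shuffle a copy of the examples and hold out the first n_dev as the dev set."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)   # fixed seed keeps the split reproducible
    return examples[n_dev:], examples[:n_dev]

# Placeholder data standing in for the processed gold examples.
gold = [f"example_{i}" for i in range(1000)]
train, dev = split_train_dev(gold)
print(len(train), len(dev))
```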

In [4]:
mined_data_file = "data/conala-corpus/conala-mined.jsonl"
topk = 200  # number of mined examples to preprocess for pretraining
!python datasets/conala/dataset.py --pretrain=$mined_data_file --topk=$topk

process gold training data...
Skipped due to exceptions: 123
use mined data:  200
from file:  data/conala-corpus/conala-mined.jsonl
Skipped due to exceptions: 8
2248 training instances
200 dev instances
process testing data...
Skipped due to exceptions: 34
466 testing instances
number of word types: 1613, number of word types w/ frequency > 1: 882
number of singletons:  731
number of words not included: 972
total token count:  22658
unk token count:  1213
number of word types: 1749, number of word types w/ frequency > 1: 763
number of singletons:  986
number of words not included: 1272
total token count:  14861
unk token count:  1558
number of word types: 1701, number of word types w/ frequency > 1: 758
number of singletons:  943
number of words not included: 1204
total token count:  35988
unk token count:  1465
generated vocabulary Vocab(source Vocabulary[size=645]words, primitive Vocabulary[size=481]words, code Vocabulary[size=501]words)
Max action len: 96


In [2]:
# Show a few examples of the processed data.
from components.dataset import Dataset
n_example = 3
train_set = Dataset.from_bin_file("data/conala/train.gold.full.bin")
for src, tgt in zip(train_set.all_source[:n_example], train_set.all_targets[:n_example]):
    print(f'Source:{src} \nTarget:{tgt} \n')

Source:['concatenate', 'elements', 'of', 'a', 'list', 'str_0', 'of', 'multiple', 'integers', 'to', 'a', 'single', 'integer'] 
Target:sum(d * 10 ** i for i, d in enumerate(str_0[::-1])) 

Source:['convert', 'a', 'list', 'of', 'integers', 'into', 'a', 'single', 'integer'] 
Target:r = int(''.join(map(str, x))) 

Source:['convert', 'a', 'datetime', 'string', 'back', 'to', 'a', 'datetime', 'object', 'of', 'format', 'str_0'] 
Target:datetime.strptime('2010-11-13 10:33:54.227806', 'str_0') 



----
## Model
To guarantee that generated code snippets are syntactically valid, we use a programming-language-independent abstract syntax tree (AST) to guide code generation [TODO:citation].

**Code <-> Series of Actions**
* Target <-> Python AST <--ASDL--> ASDL AST <--> Action series.
* Target: The code snippet.
* Python AST: Language-dependent abstract syntax tree.
* ASDL: Text file (Abstract Syntax Description Language) that specifies the grammar of Python 3.
* ASDL AST: Language-independent abstract syntax tree.
* Action series: The sequence of actions needed to generate an AST.
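
The first and last links of this chain can be sketched with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The snippet is one of the processed targets shown earlier. The `to_actions` function is a toy pre-order traversal whose `ApplyRule` and `GenToken` labels are only loosely modelled on an AST-based transition system; it skips the ASDL step entirely and is not the model's actual action set.

```python
import ast

snippet = "r = int(''.join(map(str, x)))"

# Target -> Python AST
tree = ast.parse(snippet)
print(ast.dump(tree.body[0]))

# Python AST -> Target (the mapping is reversible up to formatting)
restored = ast.unparse(tree)
print(restored)

def to_actions(node):
    """Toy linearization: one action per AST node, one per primitive field value."""
    actions = [f"ApplyRule[{type(node).__name__}]"]
    for _, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                actions.extend(to_actions(child))               # expand a sub-tree
            elif child is not None:
                actions.append(f"GenToken[{child!r}]")          # emit a terminal token
    return actions

actions = to_actions(tree.body[0])
for action in actions[:8]:
    print(action)
print(f"... {len(actions)} actions in total")
```

Because every action either expands a node or fills in a terminal, any action sequence produced this way corresponds to a well-formed tree, which is the property the AST-guided decoder relies on.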

TODO: some model graphs here? AST examples?

**Source Sequence <-> Action Sequence**
* TranX baseline: LSTM <-> LSTM
* TODO: our model, BERT?

**Technical Details**
* Initialization: `glorot_init` vs. `xavier_normal_`?
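
As a note on the initialization question: "Glorot" and "Xavier" refer to the same scheme, which comes in a uniform and a normal variant (PyTorch exposes them as `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_`). Both target the same weight variance, 2 / (fan_in + fan_out). Which variant the project's `glorot_init` uses is not stated here, so the pure-Python sketch below only compares the two variants and makes no claim about the codebase; the function names and layer sizes are illustrative.

```python
import math
import random
import statistics

def xavier_uniform(fan_in, fan_out, rng):
    """Sample weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [rng.uniform(-bound, bound) for _ in range(fan_in * fan_out)]

def xavier_normal(fan_in, fan_out, rng):
    """Sample weights from N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [rng.gauss(0.0, std) for _ in range(fan_in * fan_out)]

fan_in, fan_out = 128, 128              # illustrative layer size
rng = random.Random(0)
target_var = 2.0 / (fan_in + fan_out)

var_uniform = statistics.pvariance(xavier_uniform(fan_in, fan_out, rng))
var_normal = statistics.pvariance(xavier_normal(fan_in, fan_out, rng))

print(f"target variance : {target_var:.6f}")
print(f"uniform variant : {var_uniform:.6f}")
print(f"normal variant  : {var_normal:.6f}")
```

The two variants differ in the shape of the distribution (bounded vs. heavy-tailed), not in its scale, so any difference in training behaviour should be small and is best checked empirically.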

----
## Train

In [11]:
# TranX baseline model
!bash scripts/conala/train.sh

use glorot initialization
begin training, 500 training examples, 200 dev examples
vocab: Vocab(source Vocabulary[size=629]words, primitive Vocabulary[size=464]words, code Vocabulary[size=486]words)


Decoding:   0%|          | 0/200 [00:00<?, ?it/s]

[Epoch 1] epoch elapsed 9s
[Epoch 1] begin validation


Decoding: 100%|██████████| 200/200 [00:55<00:00,  3.60it/s]


TypeError: 'float' object is not subscriptable