# COMP0087 Group Project - Python Code Generator
Code generation is an increasingly important field in natural language processing (NLP), as it can improve programming productivity by producing code automatically. Given a natural language (NL) utterance, a code generator aims to output source code that completes the task described by the NL intent. Researchers have proposed many models for this task. In particular, TranX is a transition-based neural abstract syntax parser for code generation that achieves state-of-the-art results on the CoNaLa dataset.

However, existing code generation models suffer from various problems. For example, TranX often performs poorly on long and complex code generation tasks. Furthermore, current code generators struggle to learn dependencies between distant positions. In particular, TranX uses a standard bidirectional Long Short-Term Memory (LSTM) network as the encoder and decoder, whose sequential computation may cause this issue. The recurrent layers also give TranX high complexity and high computational cost.

To address these problems, this project uses TranX as the baseline and experiments with replacing its encoder with different networks, such as Gated Recurrent Units (GRUs) and an attentional encoder. In particular, tranX_GRU beats the TranX baseline in terms of exact match on the CoNaLa dataset. tranX_attentional_encoder achieves similar results to TranX in terms of corpus BLEU score while having lower computational complexity per layer.
(Hopefully) both candidate models beat the current state-of-the-art TranX model on the CoNaLa dataset.

----
## 1 System Architecture

##########insert pic##########

Figures 1 and 2 give a brief overview of the system for tranX_GRU and tranX_Transformer respectively.

----
## 2 Project Setup
This project can be run either on Colab or on a local machine. Please find the setup instructions in the corresponding subsection below. To run it without CUDA, simply remove the `--cuda` flag from the command-line arguments in all shell scripts under the folder named `scripts`.
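
Removing the flag by hand is fine for a few scripts, but it can also be done programmatically. The following is a minimal sketch and is not part of the repository: the helper names `strip_cuda_flag` and `strip_cuda_from_scripts` are hypothetical, and it assumes the scripts are `.sh` files that pass the flag as a standalone `--cuda` token.

```python
import re
from pathlib import Path
from typing import List

# Match `--cuda` only as a whole token (not e.g. `--cuda_device`),
# together with at most one trailing space or tab.
_CUDA_FLAG = re.compile(r"(?<!\S)--cuda(?!\S)[ \t]?")


def strip_cuda_flag(text: str) -> str:
    """Return shell-script text with every standalone `--cuda` token removed."""
    return _CUDA_FLAG.sub("", text)


def strip_cuda_from_scripts(script_dir: str = "scripts") -> List[Path]:
    """Rewrite every *.sh file under script_dir in place; return the changed files.

    Assumption: the directory layout (`scripts/**/*.sh`) is inferred from the
    commands in this notebook, not verified against the repository.
    """
    changed = []
    for path in sorted(Path(script_dir).rglob("*.sh")):
        original = path.read_text()
        updated = strip_cuda_flag(original)
        if updated != original:
            path.write_text(updated)
            changed.append(path)
    return changed
```

Calling `strip_cuda_from_scripts("scripts")` from the repository root would rewrite the files in place, so it is worth committing or backing up the scripts first.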

### 2.1 Colab Setup

In [None]:
# Setup
from google import colab
colab.drive.mount('/content/drive')

# Imports, login, connect drive
import os
from pathlib import Path
import requests
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive = build('drive', 'v3').files()

# Recursively get names
def get_path(file_id):
    f = drive.get(fileId=file_id, fields='name, parents').execute()
    name = f.get('name')
    if f.get('parents'):
        parent_id = f.get('parents')[0]  # assume 1 parent
        return get_path(parent_id) / name
    else:
        return Path(name)

# Change directory
def chdir_notebook():
    d = requests.get('http://172.28.0.2:9000/api/sessions').json()[0]
    file_id = d['path'].split('=')[1]
    path = get_path(file_id)
    nb_dir = 'drive' / path.parent
    os.chdir(nb_dir)
    return nb_dir

!cd /
chdir_notebook()

### 2.2 Local Machine Setup

In [None]:
# Clone our project repository into the local machine
git clone https://github.com/kzCassie/ucl_nlp
# Enter the project file
cd ucl_nlp

# Create virtual environments
python3 -m venv config/env
# Activate virtual environment
source config/env/bin/activate
# Install all the required packages
pip install -r requirements.txt

----
## 3 Data Loading & Data Preprocessing

Run the following shell script to get the CoNaLa json file from http://www.phontron.com/download/conala-corpus-v1.1.zip and to download the preprocessed CoNaLa zip file from Google Drive.

In [None]:
# Download the original CoNaLa json dataset
# Download the pre-processed CoNaLa zip file from Google Drive
!bash pull_data.sh

**Clarification on Data Preprocessing**

Please note that the data were preprocessed with the downloaded mined file and topk=100000 (the first k examples from the mined file) using the code below.

In [None]:
mined_data_file = "data/conala-corpus/conala-mined.jsonl" # path to the downloaded mined file
topk = 100000 # number of pretraining data to be preprocessed
!python datasets/conala/dataset.py --pretrain=$mined_data_file --topk=$topk

We preprocess the json files into several bin files and save them to the folder named `data/conala/${topk}`. These preprocessed files are then used in the next section for pre-training, fine-tuning and testing.

In particular, we preprocess the mined json file (*conala-mined.jsonl*) and save the results into *mined_100000.bin*, which is then used for model pre-training. Next, the gold training data are preprocessed from the downloaded train file (*conala-train.json*) and stored in *train.gold.full.bin*, which is used for fine-tuning. Finally, we preprocess the test json file (*conala-test.json*) into *test.bin* and use it for model testing. In total, we use around 100000, 2500 and 466 instances for pre-training, fine-tuning and testing respectively.
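
To make the role of `topk` concrete, the sketch below shows one way to keep only the first k records of a JSON Lines file. It is an illustration only, not the code in `datasets/conala/dataset.py`: the helper name `read_first_k` is hypothetical, and the `intent` and `snippet` field names in the sample records are assumptions about the mined file's schema.

```python
import json
from itertools import islice


def read_first_k(lines, k):
    """Parse the first k non-blank lines of a JSON Lines source into dicts.

    `lines` can be any iterable of strings, for example an open file handle.
    """
    non_blank = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_blank, k)]


# In-memory stand-in for a mined file (field names are assumed, see above).
sample_lines = [
    '{"intent": "sort a list", "snippet": "sorted(xs)"}\n',
    "\n",
    '{"intent": "reverse a list", "snippet": "xs[::-1]"}\n',
    '{"intent": "length of a list", "snippet": "len(xs)"}\n',
]

top2 = read_first_k(sample_lines, k=2)
for record in top2:
    print(record["intent"], "->", record["snippet"])
```

With a real file, the same helper could be applied to an open handle, for example `with open(mined_data_file, encoding="utf-8") as f: records = read_first_k(f, topk)`.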

Please see the example of the preprocessed data below.

In [2]:
# example of pre-processed data.
from components.dataset import Dataset
n_example = 3
train_set = Dataset.from_bin_file("data/conala/train.gold.full.bin")
for src, tgt in zip(train_set.all_source[:n_example],train_set.all_targets[:n_example]):
    print(f'Source:{src} \nTarget:{tgt} \n')


Source:['concatenate', 'elements', 'of', 'a', 'list', 'str_0', 'of', 'multiple', 'integers', 'to', 'a', 'single', 'integer'] 
Target:sum(d * 10 ** i for i, d in enumerate(str_0[::-1])) 

Source:['convert', 'a', 'list', 'of', 'integers', 'into', 'a', 'single', 'integer'] 
Target:r = int(''.join(map(str, x))) 

Source:['convert', 'a', 'datetime', 'string', 'back', 'to', 'a', 'datetime', 'object', 'of', 'format', 'str_0'] 
Target:datetime.strptime('2010-11-13 10:33:54.227806', 'str_0') 



----
## 4 Model Training & Fine-tuning

### tranX_LSTM (Baseline)

In [None]:
# tranX baseline model
# pretrain with mined_num = 100000
!bash scripts/tranX/pretrain.sh 100000

# fine-tune with mined_num = 100000
! bash scripts/tranX/finetune.sh 100000

# test with mined_num = 100000
!bash scripts/tranX/test.sh 100000

### tranX_GRU

In [None]:
# GRU model
# pretrain with mined_num = 100000
!bash scripts/GRU/pretrain.sh 100000

# fine-tune with mined_num = 100000
! bash scripts/GRU/finetune.sh 100000

# test with mined_num = 100000
!bash scripts/GRU/test.sh 100000

### tranX_Transformer

In [None]:
# Transformer
# pretrain with mined_num = 100000
!bash scripts/transformer/pretrain.sh 100000

# fine-tune with mined_num = 100000
! bash scripts/transformer/finetune.sh 100000

# test with mined_num = 100000
!bash scripts/transformer/test.sh 100000


----
## 5 Model Testing with the Test Set Provided in the CoNaLa Dataset

### tranX_LSTM (Baseline)

In [None]:
# tranX baseline model
# test with mined_num = 100000
!bash scripts/tranX/test.sh 100000

### tranX_GRU

In [None]:
# GRU model
# test with mined_num = 100000
!bash scripts/GRU/test.sh 100000

### tranX_attentional_encoder

In [None]:
# Transformer
# test with mined_num = 100000
!bash scripts/transformer/test.sh 100000

----

DRAFT - NOT BY ME
## 4 Model
To adhere to the syntax requirements of code snippets, we use a programming-language-independent AST to guide our code generation [TODO:citation].

**Code <-> Series of Actions**
* Target <-> Python AST <--asdl--> asdl AST <--> Action series.
* Target: code snippet.
* Python AST: Language-dependent Abstract Syntax Tree.
* asdl: Text file that specifies the grammar of Python 3.
* asdl AST: Language-independent Abstract Syntax Tree.
* Action series: Series of actions needed to generate an AST.
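
The chain above can be illustrated with Python's built-in `ast` module. This is a simplified sketch under stated assumptions, not the project's ASDL-based transition system: the function `to_actions` and the action labels `ApplyConstr[...]` and `GenToken[...]` are illustrative names, and the walk skips the intermediate asdl AST entirely.

```python
import ast


def to_actions(node):
    """Linearise a Python AST into a pre-order list of illustrative actions.

    One ApplyConstr action is emitted per AST node and one GenToken action
    per primitive (non-node, non-None) field value.
    """
    actions = [f"ApplyConstr[{type(node).__name__}]"]
    for _, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                actions.extend(to_actions(child))
            elif child is not None:
                actions.append(f"GenToken[{child!r}]")
    return actions


# Target code snippet -> Python AST -> action series.
target = "r = int(''.join(map(str, x)))"
tree = ast.parse(target)

print(ast.dump(tree))
for action in to_actions(tree):
    print(action)
```

Because the sequence is produced by walking a valid tree, replaying it can only rebuild a well-formed AST, which is the reason for generating actions rather than raw code tokens.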

TODO: some model graphs here? AST examples?

**Source Sequence <-> Action Sequence**
* Tranx baseline: LSTM <-> LSTM
* TODO: our model, bert??

**Technical Details**
* Initialization: glorot_init vs. xavier_normal_ ?
