# COMP0087 Group Project - Python Code Generator
Code generation is an increasingly important field in natural language processing (NLP), as it could improve programming productivity by producing code automatically. Given natural language (NL) utterances, a code generator aims to output source code that completes the task described in the NL intents. Researchers have proposed many models for this task. In particular, TRANX is a transition-based neural abstract syntax parser for code generation that achieves state-of-the-art results on the CoNaLa dataset.

However, existing code generation models suffer from several problems. For example, TRANX often performs poorly on long and complex code generation tasks. Furthermore, current code generators struggle to learn dependencies between distant positions. TRANX uses a standard bidirectional Long Short-Term Memory (LSTM) network as the encoder and an LSTM as the decoder, and the sequential computation of these recurrent layers may contribute to this issue. The recurrent layers also make TRANX computationally expensive.

To address these problems, this project uses TRANX as the baseline and experiments with replacing its encoder with other networks, namely Gated Recurrent Units (GRUs) and an attentional encoder. On the CoNaLa code generation task, both TRANX\_GRU and TRANX\_attentional\_encoder achieve slightly better exact match scores than TRANX (3.0% and 2.4% versus 1.7% on the test set; see section 5). In addition, our candidate models train faster per epoch than TRANX, which matches our interpretation in section 3 that the GRU and the attentional encoder have lower computational complexity per layer.

![jupyter](https://i.postimg.cc/X7J8xftm/result-table.png)
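
The complexity argument above can be made concrete with a small calculation. The sketch below counts the weights of one recurrent layer in the textbook formulation (four gate blocks for an LSTM, three for a GRU) and compares the per-layer operation counts commonly quoted for recurrent and self-attention layers, O(n·d²) and O(n²·d). All function names and sizes are illustrative assumptions; the 128 and 256 merely echo the `embed128` and `hidden256` tags in the checkpoint names in section 5, and this is not how the project measured training time.

```python
def recurrent_params(input_size: int, hidden_size: int, n_gates: int) -> int:
    """Weights of one unidirectional recurrent layer (textbook formulation).

    Each gate block has an input matrix, a recurrent matrix and one bias vector.
    Frameworks may differ slightly (PyTorch, for example, stores two bias vectors).
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_gates * per_gate


def lstm_params(input_size: int, hidden_size: int) -> int:
    # Input, forget and output gates plus the cell candidate.
    return recurrent_params(input_size, hidden_size, n_gates=4)


def gru_params(input_size: int, hidden_size: int) -> int:
    # Reset and update gates plus the hidden candidate.
    return recurrent_params(input_size, hidden_size, n_gates=3)


def recurrent_ops(seq_len: int, dim: int) -> int:
    # O(n * d^2): one d-by-d product per position, computed step by step.
    return seq_len * dim * dim


def self_attention_ops(seq_len: int, dim: int) -> int:
    # O(n^2 * d): every position is compared with every other position.
    return seq_len * seq_len * dim


d_in, d_hid = 128, 256  # assumed sizes, for illustration only
n_tokens = 20           # assumed length of a short NL intent

print("LSTM weights:", lstm_params(d_in, d_hid))
print("GRU weights: ", gru_params(d_in, d_hid))
print("recurrent ops per layer:     ", recurrent_ops(n_tokens, d_hid))
print("self-attention ops per layer:", self_attention_ops(n_tokens, d_hid))
```

Under these assumptions a GRU layer has three quarters of the weights of an LSTM layer of the same size, and self-attention needs fewer operations than a recurrent layer whenever the sequence length is smaller than the hidden size, as is the case for short intents like those shown in section 3.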

----
## 1 System Architecture
TRANX is a seq-to-action model: the input is a natural language utterance that describes the task, and the output is a series of actions corresponding to Python source code that completes the task. The workflow of TRANX is shown below.

![jupyter](https://i.postimg.cc/JzDvG5RY/Work-flow.png)
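
To give a feel for what "a series of actions" means, the following sketch parses a snippet with Python's standard `ast` module and linearizes the tree depth-first. It is a simplified stand-in, not the TRANX transition system: the real system is driven by an ASDL grammar and also uses a `Reduce` action, whereas the hypothetical `to_actions` helper here only emits one `ApplyConstr` per AST node and one `GenToken` per primitive value.

```python
import ast


def to_actions(node: ast.AST) -> list:
    """Linearize an AST in depth-first, left-to-right order.

    Simplified illustration: one ApplyConstr per AST node and one GenToken
    per primitive field value (identifiers, constants).
    """
    actions = [f"ApplyConstr({type(node).__name__})"]
    for _, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                actions.extend(to_actions(child))  # recurse into sub-trees
            elif child is not None:
                actions.append(f"GenToken({child!r})")  # leaf token
    return actions


# Snippet taken from the preprocessed examples in section 3.
tree = ast.parse("r = int(''.join(map(str, x)))")
for action in to_actions(tree.body[0]):
    print(action)
```

A decoder that predicts such a sequence one action at a time can only produce well-formed trees, which is the motivation for generating actions instead of raw code tokens.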

TRANX employs an encoder-decoder structure to score an AST by measuring the probability of a series of actions. It uses a Bi-LSTM network for the encoder and a standard LSTM for the decoder. This project replaces the encoder with two different network structures: Gated Recurrent Units (GRUs) and an attentional encoder.

The figure below gives a brief overview of part of the original TRANX model.

![jupyter](https://i.postimg.cc/25tKYHvs/TRANX.jpg)

For TRANX_GRU, we replace the encoder with a GRU network. Graphically, the LSTM encoder (highlighted by the dotted square) in the TRANX architecture above is replaced with a GRU network, as shown in the figure below.

![jupyter](https://i.postimg.cc/7hRW2Wsf/GRU.png)

For TRANX_attentional_encoder, we replace the encoder with an attentional encoder, which is the encoder of the Transformer. The changed part is shown below.

![jupyter](https://i.postimg.cc/N0GSZmHG/Transformer-enc.png)
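
The core operation of such an attentional encoder is scaled dot-product self-attention. The sketch below implements a single head in plain Python with identity projections (queries, keys and values are the token vectors themselves), an assumption made to avoid learned weights. A real Transformer encoder layer adds learned projections, multiple heads, positional information and a feed-forward sublayer, none of which are shown here; the toy embeddings are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x is a list of token vectors. Returns the attended vectors and the
    attention weights (one row of weights per query position).
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Compare this position with every position in one step.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted average of all value vectors.
        outputs.append([sum(a * v[j] for a, v in zip(attn, x)) for j in range(dim)])
    return outputs, weights


# Toy embeddings (assumed): tokens 0 and 2 are identical, token 1 differs.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
_, attn_weights = self_attention(tokens)
print([round(w, 3) for w in attn_weights[0]])
```

Token 0 attends to token 2 as strongly as to itself even though token 1 sits between them, which illustrates how attention relates distant positions without passing through a recurrent state.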

----
## 2 Project Setup and Download Project Repo
This project can be run either on Colab or on a local machine. Please follow the setup instructions in the corresponding subsection below. To run it without CUDA, remove the `--cuda` flag from the command-line arguments in all shell scripts under the `scripts` folder.
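
Removing the flag by hand from every script is error-prone, so a small helper can do it. The sketch below is an assumption about how one might automate this step, not part of the repository: `strip_cuda_flag` and `strip_cuda_from_scripts` are hypothetical names, and the helper assumes the scripts are `.sh` files somewhere under `scripts`.

```python
import re
from pathlib import Path


def strip_cuda_flag(script_text: str) -> str:
    """Remove every standalone `--cuda` argument from a shell script's text.

    Longer flags that merely start with `--cuda` are left untouched.
    """
    return re.sub(r"[ \t]*--cuda(?![\w-])", "", script_text)


def strip_cuda_from_scripts(scripts_dir: str = "scripts") -> list:
    """Rewrite all .sh files under scripts_dir in place; return the changed paths."""
    changed = []
    for path in sorted(Path(scripts_dir).rglob("*.sh")):
        original = path.read_text()
        updated = strip_cuda_flag(original)
        if updated != original:
            path.write_text(updated)
            changed.append(str(path))
    return changed


# Demonstration on a made-up command line (not copied from the repository).
print(strip_cuda_flag("python exp.py --cuda --seed 0"))
# To apply it to the project, run from the repository root:
# strip_cuda_from_scripts("scripts")
```

Since the helper edits files in place, it is worth running `git diff` afterwards to confirm that only the flag was removed.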

### 2.1 Colab Setup

Running the cell below requires you to enter an authorization code. Please follow the instructions: go to the URL shown, log in to your Google account, and copy your authorization code. Note that the authorization code is requested **TWICE**, so you may need to log in twice. Thanks for your patience.

In [None]:
# Setup
from google import colab
colab.drive.mount('/content/drive')

# Imports, login, connect drive
import os
from pathlib import Path
import requests
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive = build('drive', 'v3').files()

# Recursively get names
def get_path(file_id):
    f = drive.get(fileId=file_id, fields='name, parents').execute()
    name = f.get('name')
    if f.get('parents'):
        parent_id = f.get('parents')[0]  # assume 1 parent
        return get_path(parent_id) / name
    else:
        return Path(name)

# Change directory
def chdir_notebook():
    d = requests.get('http://172.28.0.2:9000/api/sessions').json()[0]
    file_id = d['path'].split('=')[1]
    path = get_path(file_id)
    nb_dir = 'drive' / path.parent
    os.chdir(nb_dir)
    return nb_dir

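# `!cd` runs in a subshell and does not change the notebook's working directory,
# so the directory change is done by chdir_notebook() via os.chdir.
chdir_notebook()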

Mounted at /content/drive


PosixPath('drive/My Drive/Colab Notebooks')

The code used to train and evaluate our models is available at https://github.com/kzCassie/ucl\_nlp. Note that you should run the following cell only **ONCE**.

In [None]:
%%shell
# Clone the project repo (the cell magic must be the first line of the cell)
git clone 'https://github.com/kzCassie/ucl_nlp.git'

Cloning into 'ucl_nlp'...
remote: Enumerating objects: 617, done.
remote: Counting objects: 100% (617/617), done.
remote: Compressing objects: 100% (446/446), done.
remote: Total 617 (delta 300), reused 421 (delta 154), pack-reused 0
Receiving objects: 100% (617/617), 128.51 MiB | 17.52 MiB/s, done.
Resolving deltas: 100% (300/300), done.
Checking out files: 100% (160/160), done.




In [None]:
cd ucl_nlp

/content/drive/My Drive/Colab Notebooks/ucl_nlp


### 2.2 Local Machine Setup

When running the project on a local machine, please follow the instructions below. If you are using Colab, please **IGNORE** section 2.2.

In [None]:
# Clone our project repository to the local machine
git clone https://github.com/kzCassie/ucl_nlp
# Enter the project directory
cd ucl_nlp

# Create a virtual environment
python3 -m venv config/env
# Activate virtual environment
source config/env/bin/activate
# Install all the required packages
pip install -r requirements.txt

----
## 3 Data Loading & Data Preprocessing

Run the following shell script to download the CoNaLa JSON files from http://www.phontron.com/download/conala-corpus-v1.1.zip and the preprocessed CoNaLa zip file from Google Drive.

The CoNaLa dataset was originally released by CMU. It contains a corpus of natural language intents and Python code snippets mined from Stack Overflow. The original data is in JSON format and comes in three parts: 2379 manually curated training examples, 500 manually curated test examples, and 600k automatically mined examples.

In [None]:
# Download the original CoNaLa JSON dataset
# Download the preprocessed CoNaLa zip file from Google Drive
!bash pull_data.sh

download original Conala json dataset
--2021-06-01 05:19:51--  http://www.phontron.com/download/conala-corpus-v1.1.zip
Resolving www.phontron.com (www.phontron.com)... 208.113.196.149
Connecting to www.phontron.com (www.phontron.com)|208.113.196.149|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52105440 (50M) [application/zip]
Saving to: ‘data/conala-corpus-v1.1.zip’


2021-06-01 05:19:53 (25.1 MB/s) - ‘data/conala-corpus-v1.1.zip’ saved [52105440/52105440]

Archive:  data/conala-corpus-v1.1.zip
   creating: data/conala-corpus/
  inflating: data/conala-corpus/conala-mined.jsonl  
  inflating: data/conala-corpus/conala-train.json  
  inflating: data/conala-corpus/conala-test.json  
mv: missing destination file operand after 'data/conala-corpus'
Try 'mv --help' for more information.
download preprocessed Conala zip from GoogleDrive
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   To

**Clarification on Data Preprocessing**

Hint: **YOU DON'T NEED TO RUN THE FOLLOWING CODE.** The script above has already downloaded our preprocessed data, so there is no need to preprocess it again. The code is shown only for illustration and clarification.

The data were preprocessed from the downloaded mined file with topk=100000 (the first k examples of the mined file) using the code below. This means we use all of the human-curated data, but only 100k automatically mined examples, for training.

In [None]:
mined_data_file = "data/conala-corpus/conala-mined.jsonl" # path to the downloaded mined file
topk = 100000 # number of pretraining data to be preprocessed
!python datasets/conala/dataset.py --pretrain=$mined_data_file --topk=$topk
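
The effect of `topk` can be pictured as reading only the first k records of the mined file, which is in JSON Lines format (one JSON object per line). The sketch below is an illustration under that assumption, not the code in `datasets/conala/dataset.py`; `read_topk` is a hypothetical helper.

```python
import itertools
import json


def read_topk(jsonl_path: str, topk: int) -> list:
    """Return the first `topk` records of a JSON Lines file.

    Blank lines are skipped; the rest of the file is never parsed.
    """
    with open(jsonl_path, encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in itertools.islice(non_empty, topk)]


# Usage (path taken from the cell above):
# records = read_topk("data/conala-corpus/conala-mined.jsonl", topk=100000)
# print(len(records))
```

Because `itertools.islice` stops after k lines, the remaining mined examples are never loaded into memory.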

After preprocessing, the NL intents are transformed into lists of tokens, while the Python source code snippets become series of actions through the AST parsing process explained above. An example of the preprocessed data is shown below.

In [None]:
# example of pre-processed data.
from components.dataset import Dataset
n_example = 3
train_set = Dataset.from_bin_file("data/conala/100000/train.gold.full.bin")
for src, tgt in zip(train_set.all_source[:n_example], train_set.all_targets[:n_example]):
    print(f'Source:{src} \nTarget:{tgt} \n')


Source:['concatenate', 'elements', 'of', 'a', 'list', 'str_0', 'of', 'multiple', 'integers', 'to', 'a', 'single', 'integer'] 
Target:sum(d * 10 ** i for i, d in enumerate(str_0[::-1])) 

Source:['convert', 'a', 'list', 'of', 'integers', 'into', 'a', 'single', 'integer'] 
Target:r = int(''.join(map(str, x))) 

Source:['convert', 'a', 'datetime', 'string', 'back', 'to', 'a', 'datetime', 'object', 'of', 'format', 'str_0'] 
Target:datetime.strptime('2010-11-13 10:33:54.227806', 'str_0') 
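
The `str_0` tokens in the output above are placeholder slots: a literal quoted in the intent is replaced by a slot name on both the intent and the code side. The sketch below shows the idea with a regular expression. It is a simplified assumption about the canonicalization step, not the repository's implementation (which, for example, may also distinguish variable slots), and the raw intent and snippet used here are illustrative inputs.

```python
import re

# A literal wrapped in single quotes, double quotes or backticks.
QUOTED = re.compile(r"""(?P<quote>['"`])(?P<text>.+?)(?P=quote)""")


def canonicalize(intent: str, snippet: str):
    """Replace quoted literals in the intent with str_0, str_1, ... slots.

    The same literal is replaced inside the snippet so both sides stay aligned.
    Returns the rewritten intent, the rewritten snippet and the slot map.
    """
    slots = {}

    def to_slot(match):
        literal = match.group("text")
        if literal not in slots:
            slots[literal] = f"str_{len(slots)}"
        return slots[literal]

    new_intent = QUOTED.sub(to_slot, intent)
    new_snippet = snippet
    for literal, slot in slots.items():
        new_snippet = new_snippet.replace(literal, slot)
    return new_intent, new_snippet, slots


intent = "convert a datetime string back to a datetime object of format '%Y-%m-%d %H:%M:%S.%f'"
snippet = "datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')"
new_intent, new_snippet, slot_map = canonicalize(intent, snippet)
print(new_intent)
print(new_snippet)
```

Keeping the slot map makes it possible to substitute the original literals back into the generated code after decoding.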



We preprocess the JSON files into several bin files and save them to the folder data/conala/${topk}. These preprocessed files are used in the next section for training, fine-tuning and testing. In particular, we held out 200 examples from the 2379 manually curated training examples as the validation set. We first trained the model from scratch on the 100k automatically mined examples, validating on the 200 held-out examples. Next, we fine-tuned the resulting model on the 2179 remaining training examples and evaluated it on the same 200 validation examples to obtain the final model. Finally, we applied the fine-tuned model to the 500 test examples, on which the results are reported.
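
The split sizes described above can be reproduced with a few lines. The sketch below holds out 200 of 2379 items after a seeded shuffle. How the held-out examples were actually selected is not stated here, so the shuffle, the seed and the `hold_out` helper are assumptions for illustration.

```python
import random


def hold_out(examples: list, n_dev: int = 200, seed: int = 0):
    """Split examples into (fine-tuning set, validation set).

    The input list is left unchanged; a seeded shuffle makes the split repeatable.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_dev:], shuffled[:n_dev]


curated = list(range(2379))  # stand-ins for the 2379 curated training examples
finetune_set, dev_set = hold_out(curated)
print(len(finetune_set), len(dev_set))  # 2179 200
```

Using the same 200 validation examples in both the pretraining and the fine-tuning stage keeps the model selection criterion comparable across stages.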

----
## 4 Model Training & Fine-tuning

Hint: Training and fine-tuning the models takes several hours. The best pretrained models can be downloaded using the shell script below. **If you only want to test the models and see their performance, please go to section 5.**

### Download best pretrained models from Google Drive

We have uploaded the best pretrained models to a publicly accessible Google Drive file. Please use the shell script *pull_best_models.sh* to download them.

In [None]:
# Download the zip file of best pretrained models from Google Drive
!bash pull_best_models.sh

download trained_best_models from GoogleDrive
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   408    0   408    0     0   2775      0 --:--:-- --:--:-- --:--:--  2756
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 54.9M    0 54.9M    0     0  32.8M      0 --:--:--  0:00:01 --:--:-- 80.3M
Done!


### TRANX_LSTM (Baseline)

In [None]:
# tranX baseline model
# pretrain with mined_num = 100000
!bash scripts/tranX/pretrain.sh 100000

# fine-tune with the gold set
! bash scripts/tranX/finetune.sh 100000

### TRANX_GRU

In [None]:
# GRU model
# pretrain with mined_num = 100000
!bash scripts/GRU/pretrain.sh 100000

# fine-tune with the gold set
! bash scripts/GRU/finetune.sh 100000

### TRANX_attentional_encoder

In [None]:
# attentional_encoder
# pretrain with mined_num = 100000
!bash scripts/transformer/pretrain.sh 100000

# fine-tune with the gold set
! bash scripts/transformer/finetune.sh 100000
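
Each model follows the same two-stage recipe: pretrain on the mined data, then fine-tune on the curated data. For unattended runs, the two stages can be chained so that fine-tuning starts only if pretraining succeeded. The sketch below is a convenience wrapper under the assumption that the script paths are exactly those used in the cells above; `training_commands` and `run_training` are hypothetical helpers, not part of the repository.

```python
import subprocess

# Script directories as they appear in the training cells above.
ENCODER_SCRIPT_DIRS = {
    "lstm": "scripts/tranX",
    "gru": "scripts/GRU",
    "attentional": "scripts/transformer",
}


def training_commands(encoder: str, mined_num: int = 100000) -> list:
    """Return the pretrain and fine-tune commands for one encoder, in order."""
    script_dir = ENCODER_SCRIPT_DIRS[encoder]
    return [
        ["bash", f"{script_dir}/{stage}.sh", str(mined_num)]
        for stage in ("pretrain", "finetune")
    ]


def run_training(encoder: str, mined_num: int = 100000) -> None:
    """Run both stages; check=True aborts if a stage exits with an error."""
    for command in training_commands(encoder, mined_num):
        subprocess.run(command, check=True)


print(training_commands("gru"))
# To launch training from the repository root:
# run_training("gru")
```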

----
## 5 Model Testing on the CoNaLa Test Set

### TRANX_LSTM (Baseline)

In [None]:
# tranX baseline model
!bash scripts/tranX/test.sh best_pretrained_models/finetune.conala.default_parser.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin 100000 default_parser

load model from [best_pretrained_models/finetune.conala.default_parser.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin]
Decoding: 100% 466/466 [02:46<00:00,  2.80it/s]
{'corpus_bleu': 0.3011020946247816, 'oracle_corpus_bleu': 0.42743396862112837, 'avg_sent_bleu': 0.2257651682863722, 'oracle_avg_sent_bleu': 0.38993023014444694, 'exact_match': 0.017167381974248927, 'oracle_exact_match': 0.06652360515021459}


### TRANX_GRU

In [None]:
# GRU model
!bash scripts/GRU/test.sh best_pretrained_models/finetune.conala.gru_parser.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin 100000 gru_parser

load model from [best_pretrained_models/finetune.conala.gru_parser.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin]
Decoding: 100% 466/466 [02:41<00:00,  2.88it/s]
{'corpus_bleu': 0.2856295136099089, 'oracle_corpus_bleu': 0.4230469350759587, 'avg_sent_bleu': 0.2249070823860801, 'oracle_avg_sent_bleu': 0.39904222801804207, 'exact_match': 0.030042918454935622, 'oracle_exact_match': 0.08369098712446352}


### TRANX_attentional_encoder

In [None]:
# attentional_encoder
!bash scripts/transformer/test.sh best_pretrained_models/finetune.conala.transformer_enc_parser.enc_nhead2.enc_nlayer1.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin 100000 transformer_enc_parser

load model from [best_pretrained_models/finetune.conala.transformer_enc_parser.enc_nhead2.enc_nlayer1.hidden256.embed128.action128.field64.type64.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.bin.mined_100000.bin.glorot.par_state.seed0.bin]
Decoding: 100% 466/466 [03:01<00:00,  2.56it/s]
{'corpus_bleu': 0.2819106289394482, 'oracle_corpus_bleu': 0.4184672856242742, 'avg_sent_bleu': 0.21514487082307537, 'oracle_avg_sent_bleu': 0.36651935234022703, 'exact_match': 0.023605150214592276, 'oracle_exact_match': 0.06223175965665236}
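
To compare the three runs side by side, the metric dictionaries printed above can be tabulated. The sketch below copies the `corpus_bleu` and `exact_match` values from those outputs and converts each exact match rate back into a count of correct programs, using the 466 decoded examples shown in the progress bars. The helper name `exact_match_count` is illustrative.

```python
N_TEST = 466  # number of decoded examples shown in the progress bars above

# Values copied from the evaluation outputs above.
results = {
    "TRANX_LSTM": {"corpus_bleu": 0.3011020946247816, "exact_match": 0.017167381974248927},
    "TRANX_GRU": {"corpus_bleu": 0.2856295136099089, "exact_match": 0.030042918454935622},
    "TRANX_attentional_encoder": {"corpus_bleu": 0.2819106289394482, "exact_match": 0.023605150214592276},
}


def exact_match_count(exact_match: float, n_examples: int = N_TEST) -> int:
    """Convert an exact match rate back into a number of correct programs."""
    return round(exact_match * n_examples)


print(f"{'model':<28}{'BLEU':>8}{'EM %':>8}{'EM count':>10}")
for name, metrics in results.items():
    em = metrics["exact_match"]
    print(f"{name:<28}{metrics['corpus_bleu']:>8.4f}{100 * em:>8.2f}{exact_match_count(em):>10d}")
```

The exact match rates correspond to 8, 14 and 11 correct programs out of 466 for the baseline, GRU and attentional encoder respectively, while corpus BLEU is highest for the baseline.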
