In [1]:
import os
import torch
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
device = torch.device("cuda")

In [2]:
import sys
sys.path.append('./huggingface_models/')
sys.path.append('./utils/')
from sample_utils import *
from inference_utils import *
from codenet_process_utils import *
from self_training_utils import *

In [3]:
%load_ext autoreload
%autoreload 2

## Preprocessing

### Collect accepted problems
Get problems_dict: get_codenet_dict
```
problems_dict['p00001'].keys(): ['desc', 'io', 'solutions']
```
A few problems also have a 'meta' key.
```
problems_dict['p00001']['io'].keys(): ['output', 'input']
```
The 'io' entry appears to be extracted from 'desc', but not exhaustively: 'desc' usually contains more input-output pairs than 'io' does.

### Parse the programs into codedict
Get codes_dict: get_codenet_code_dict
```
codes_dict['p00001'].keys(): ['C++', 'Java', 'Python', 'C#', 'C']
codes_dict['Java'][0].keys(): ['functions', 'program_pieces', 'function_names', 'parameter_lists', 'return_types', 'target_call', 'target_call_args', 'target_call_params', 'target_call_return_type', 'idx', 'pid', 'program_formatted', 'io']
codes_dict['Java'][0]['idx']:'s150444541.java'
codes_dict['Java'][0]['pid']:'p00100'
```

### Filter programs by function and compilation
1. Keep only programs that define functions other than main/Main: get_nonempty_functions
2. Filter by compilation: get_codenet_call_dict. In this step we do not compile the original program. Instead, we combine the import_str extracted from the original program with its functions into a new program and compile that.
3. Get the filtered programs: get_compiled_functions

We get call_dict in this step.
```
call_dict[lang] = [programs, processed_results, result_keys, error_type_dict]
```
We also get filtered_dict in this step.
```
filtered_dict["Java"][0].keys(): ['code_dic_id', "import_str", "function", "pid"]
```
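The compile check in step 2 can be sketched for the Python case as follows. This is a minimal illustration, not the repo's get_codenet_call_dict: the helper name compiles_as_python and the sample entries are hypothetical, and only the 'import_str', 'function', and 'pid' keys come from filtered_dict. Other languages would need a real compiler invocation, which is not shown.

```python
def compiles_as_python(import_str, function):
    """Return True if imports + function form syntactically valid Python."""
    # Rebuild a small program from the extracted imports and the function body,
    # instead of compiling the original submission.
    program = import_str.rstrip() + "\n\n" + function
    try:
        compile(program, "<candidate>", "exec")  # syntax check only; nothing is executed
        return True
    except SyntaxError:
        return False


# Hypothetical entries shaped like filtered_dict items.
candidates = [
    {"import_str": "import math",
     "function": "def area(r):\n    return math.pi * r * r\n",
     "pid": "p00001"},
    {"import_str": "import math",
     "function": "def broken(r:\n    return r\n",
     "pid": "p00002"},
]

kept = [c for c in candidates if compiles_as_python(c["import_str"], c["function"])]
```

Here only the first entry survives, because the second function has an unclosed parameter list.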

### Merge filtered programs
Merge all the filtered programs into one dict (merged_filtered_dict).
```
merged_filtered_dict.keys(): ['C++', 'Java', 'Python', 'C#', 'C']
merged_filtered_dict["Java"][0].keys(): ['code_dic_id', 'import_str', 'function', 'pid', 'code_dic', 'batch_id']
```

### No-tok preprocessing
Process the filtered data for model training.
1. Remove comments and empty lines: format_codestring_codenet(codestring, lang)
2. Replace newlines: notok_prepro(codestring, lang, is_plbart)
3. After decoding, reverse the no-tok preprocessing: notok_detok(codestring, lang, is_plbart)
4. Run detok_format(codestring, detokenizer) to get the detokenized version for Java and Python
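The round trip in steps 1 to 3 can be sketched as below. This is an assumption-laden illustration rather than the actual notok_prepro / notok_detok: the NEW_LINE placeholder, the helper names, and the handling of whole-line `//` comments only are all hypothetical choices made for the example.

```python
NEW_LINE = " NEW_LINE "  # hypothetical placeholder; the real token may differ


def strip_blank_and_comment_lines(code, comment_prefix="//"):
    """Drop empty lines and whole-line comments (inline comments are not handled)."""
    lines = []
    for line in code.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        lines.append(line.rstrip())
    return "\n".join(lines)


def encode_newlines(code):
    """Flatten code to a single line by replacing newlines with the placeholder."""
    return code.replace("\n", NEW_LINE)


def decode_newlines(flat):
    """Inverse of encode_newlines, applied to the model output after decoding."""
    return flat.replace(NEW_LINE, "\n")


source = "// add two ints\nint add(int a, int b) {\n\n    return a + b;\n}\n"
cleaned = strip_blank_and_comment_lines(source)
flat = encode_newlines(cleaned)
restored = decode_newlines(flat)
```

The round trip is exact as long as the placeholder string never occurs in the code itself.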

### Cached Files
codenet/codenet_problems_dict_i.json\
codenet/codenet_codedict_i.json\
codenet/codenet_call_dict_i.json\
codenet/codenet_filtered_dict_i.json\
codenet_merged_filtered_dict.json\
codenet_merged_filtered_dict_notok.json

Since "java" is a special token in plbart, we have to create the input data for plbart separately:\
codenet_merged_filtered_dict_notok_plbart.json


## Sampling:
(Use script: get_codenet_preds.py)
1. Get preds: get_preds_lang_dict_codenet
2. Merge the hypo files. Sampling is slow, so we sample several languages in parallel and merge the results afterwards.

We get preds_lang_dict in this step. \
preds_lang_dict[(lang1, lang2)] = preds
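The merge in step 2 could look roughly like this, assuming each parallel run produces its own partial dict keyed by (lang1, lang2). The function name merge_preds_dicts and the sample data are hypothetical; raising on a duplicate language pair is a design choice for the sketch, not something the source specifies.

```python
def merge_preds_dicts(partial_dicts):
    """Combine per-run dicts keyed by (lang1, lang2) into one preds_lang_dict."""
    merged = {}
    for partial in partial_dicts:
        for lang_pair, preds in partial.items():
            if lang_pair in merged:
                # Two runs sampled the same pair; refuse to overwrite silently.
                raise ValueError(f"duplicate language pair: {lang_pair}")
            merged[lang_pair] = preds
    return merged


# Hypothetical outputs of two parallel sampling runs.
run_a = {("Java", "Python"): [["def f(): pass"]]}
run_b = {("Python", "Java"): [["void f() {}"]]}

preds_lang_dict = merge_preds_dicts([run_a, run_b])
```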

## Filtering:
1. Remove duplicate preds: get_dedup_preds
2. Filter by type-matching: prep_exec_hypo_codenet
3. Filter by compilation: get_hypo_call_list

We get hypo call_dict in this step. \
call_list contains info about the processed hypos in lang2.
```
call_list = [programs, processed_results, result_keys, error_type_dict]
call_dict[(lang1, lang2)] = [new_preds, functions, function_id_dict, call_list]
```
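Step 1 of the filtering (removing duplicate preds) can be sketched as an order-preserving deduplication. Whether get_dedup_preds normalizes whitespace before comparing is an assumption made here; the helper name dedup_preserving_order is hypothetical.

```python
def dedup_preserving_order(preds):
    """Keep the first occurrence of each prediction, comparing whitespace-normalized text."""
    seen = set()
    unique = []
    for pred in preds:
        key = " ".join(pred.split())  # collapse runs of whitespace (an assumption)
        if key not in seen:
            seen.add(key)
            unique.append(pred)
    return unique


samples = ["int f() { return 1; }", "int f()  {  return 1; }", "int g() { return 2; }"]
unique_samples = dedup_preserving_order(samples)
```

Deduplicating before the type-matching and compilation filters avoids compiling the same hypothesis more than once.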

## Hypo Processing
1. Preprocess filtered hypos: get_lang_pair_dict\
    1.1 No-tok preprocessing
2. Merge lang1-lang2 and lang2-lang1
3. Split into train/val/test: get_split_lang_pair_dict
4. Write into parallel files: write_codenet_pairdata

We get lang_pair_dict in this step.
```
lang_pair_list = [src_codes, target_codes, pids]
lang_pair_dict[(lang1, lang2)] = lang_pair_list
```
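One way to split a lang_pair_list into train/val/test is sketched below. It assumes the split is made per problem id, so that all pairs from one problem land in the same split; the source does not say how get_split_lang_pair_dict decides this, and the helper names and the 80/10/10 ratio are hypothetical.

```python
import hashlib


def split_for_pid(pid, val_pct=10, test_pct=10):
    """Map a problem id to a split deterministically via a hash bucket in [0, 100)."""
    bucket = int(hashlib.md5(pid.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_lang_pair_list(lang_pair_list):
    """Split [src_codes, target_codes, pids] into three lists of the same shape."""
    src_codes, target_codes, pids = lang_pair_list
    splits = {name: [[], [], []] for name in ("train", "val", "test")}
    for src, tgt, pid in zip(src_codes, target_codes, pids):
        out = splits[split_for_pid(pid)]
        out[0].append(src)
        out[1].append(tgt)
        out[2].append(pid)
    return splits


# Hypothetical data: two pairs per problem, 50 problems.
pids = [f"p{i:05d}" for i in range(50) for _ in range(2)]
src_codes = [f"src_{i}" for i in range(len(pids))]
target_codes = [f"tgt_{i}" for i in range(len(pids))]

splits = split_lang_pair_list([src_codes, target_codes, pids])
```

Hashing the pid keeps the assignment stable across runs and prevents the same problem from leaking between train and test.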

### Cached Files
1. Preds from trained models\
    plbart_codenet_preds_lang_dict.pkl\
    codet5_codenet_preds_lang_dict.pkl
2. Hypo call_dict\
    plbart_codenet_lang_pair_call_dict.pkl
3. Generated Parallel data\
    codet5_codenet_src_hypo_pair_dict_plbart.pkl
4. Hypo split_dict\
    codenet_hypo_split_dict.json

### Preprocessing

In [1]:
path = "preprocessing_codenet.ipynb"

### Sampling

In [2]:
path = "sampling_codenet.ipynb"

### Filtering

In [3]:
path = "filtering_hypo_codenet.ipynb"

### Filtered Hypo PostProcessing

In [None]:
path = "postprocessing_filtered_hypo_codenet.ipynb"

### Utils

In [4]:
path = "utils_codenet.ipynb"