## Setting up for the curricular experiment

This assumes you have already followed the instructions in `baselines/baseline_t5`, which set up the baseline clue files for model input.

### Datasets
1. Download and unzip the xd cw crossword set from http://xd.saul.pw/xd-clues.zip.
    - Save it as `./data/original/xd/clues.tsv`.
2. Preprocess the dataset using this notebook.
3. The dataset will be saved to `k_acw_export_dir` (as a single `train.json` file).
4. This notebook also produces the anagram dataset.
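
Step 1 can also be scripted. The following is a minimal sketch, not part of the repository: `extract_first_tsv` and `fetch_xd_clues` are hypothetical helpers, and the sketch assumes the archive contains a `.tsv` clue file, which has not been verified here.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

XD_CLUES_URL = "http://xd.saul.pw/xd-clues.zip"      # URL from step 1
XD_CLUES_TSV = Path("./data/original/xd/clues.tsv")  # target path from step 1


def extract_first_tsv(zip_path, dest):
    """Copy the first .tsv member of zip_path to dest (assumes one exists)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        members = [n for n in zf.namelist() if n.lower().endswith(".tsv")]
        if not members:
            raise FileNotFoundError(f"no .tsv file found in {zip_path}")
        with zf.open(members[0]) as src, open(dest, "wb") as out:
            shutil.copyfileobj(src, out)
    return dest


def fetch_xd_clues(url=XD_CLUES_URL, dest=XD_CLUES_TSV):
    """Download the archive next to dest, then extract the clue file."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    zip_path = dest.parent / "xd-clues.zip"
    urllib.request.urlretrieve(url, zip_path)
    return extract_first_tsv(zip_path, dest)
```

Calling `fetch_xd_clues()` once from the repository root should leave the clue file at the path the preprocessing cells expect.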


In [None]:
%load_ext autoreload
%autoreload 2

from decrypt.scrape_parse.acw_load import get_clean_xd_clues
from decrypt import config
from decrypt.common.util_data import clue_list_tuple_to_train_split_json
from decrypt.common import validation_tools as vt

k_xd_orig_tsv = config.DataDirs.OriginalData.k_xd_cw        # ./data/original/xd/clues.tsv
k_acw_export_dir = config.DataDirs.DataExport.xd_cw_json

In [None]:
# defaults to strip periods, remove questions, remove abbrevs, remove fillin
stc_map, all_clues = get_clean_xd_clues(k_xd_orig_tsv,
                                        remove_if_not_in_dict=False,
                                        do_filter_dupes=True)
clue_list_tuple_to_train_split_json((all_clues,),
                                    comment='ACW set; xd cw set, all',
                                    export_dir=k_acw_export_dir,
                                    overwrite=False)

In [None]:
# produce anagram datasets
# roughly 3 minutes to complete
from decrypt.common import anagrammer
anagrammer.gen_db_with_both_inputs(update_flag="overwrite")

from decrypt.common.util_data import (
    get_anags,
    write_json_tuple
)
import json
import os

In [None]:
def make_anag_sets_json():
    all_anags = get_anags(max_num_words=-1)
    json_list = []
    for idx, a_list in enumerate(all_anags):
        json_list.append(dict(idx=idx,
                              anag_list=a_list))
    print(json_list[0])

    # normally would be (idx, input, tgt)
    output_tuple = [json_list,]

    # exist_ok=True so that re-running the cell does not fail on an existing directory
    os.makedirs(config.DataDirs.DataExport.anag_dir, exist_ok=True)
    write_json_tuple(output_tuple,
                     comment="List of all anagram groupings",
                     export_dir=config.DataDirs.DataExport.anag_dir,
                     overwrite=False)

def make_anag_indic_list_json():
    # make the indicator list
    with open(config.DataDirs.OriginalData.k_deits_anagram_list, 'r') as f:
        all_anag_indicators = f.readlines()
        print(len(all_anag_indicators))

    final_indic_list = []
    for a in all_anag_indicators:
        final_indic_list.append(a.replace('_', " ").strip())
    with open(config.DataDirs.DataExport.anag_indics, 'w') as f:
        json.dump(final_indic_list, f)

In [None]:
make_anag_sets_json()

In [None]:
make_anag_indic_list_json()
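
To make the two export formats concrete, here is a small standalone sketch on toy data. It mirrors the record construction in `make_anag_sets_json` and the normalization in `make_anag_indic_list_json`. The sample anagram groups and indicator lines are invented for illustration and do not come from the real files.

```python
# Toy stand-ins (hypothetical): the real groups come from get_anags(), the
# real indicator lines from the k_deits_anagram_list file.
toy_anags = [["listen", "silent", "tinsel"], ["night", "thing"]]
toy_indicator_lines = ["mixed_up\n", "broken\n", "out_of_order\n"]

# Same record shape as make_anag_sets_json: one dict per anagram group.
json_list = [dict(idx=idx, anag_list=a_list)
             for idx, a_list in enumerate(toy_anags)]

# Same normalization as make_anag_indic_list_json: underscores become spaces,
# and surrounding whitespace (including the trailing newline) is stripped.
final_indic_list = [line.replace("_", " ").strip()
                    for line in toy_indicator_lines]

print(json_list[0])      # {'idx': 0, 'anag_list': ['listen', 'silent', 'tinsel']}
print(final_indic_list)  # ['mixed up', 'broken', 'out of order']
```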



## Curricular training
1. At this point you should have files at:
 - `./data/clue_json/curricular/ACW/train.json`
 - `./data/clue_json/curricular/anagram/[train.json, anag_indics.json]`
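
A quick way to confirm that the files listed above exist is sketched here. `missing_files` is a hypothetical helper, not part of the repository, and the sketch assumes that the bracketed entry denotes two files, `train.json` and `anag_indics.json`, inside the anagram directory.

```python
from pathlib import Path

# Paths listed in step 1, relative to the repository root.
EXPECTED_FILES = [
    "data/clue_json/curricular/ACW/train.json",
    "data/clue_json/curricular/anagram/train.json",
    "data/clue_json/curricular/anagram/anag_indics.json",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected files that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).is_file()]


missing = missing_files()
if missing:
    print("Missing:", *missing, sep="\n  ")
else:
    print("All curricular dataset files are in place.")
```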

2. Running curricular training is the same as running the main vanilla T5 training, except that we pass an extra `--multitask` flag, which specifies the curriculum to use. See `seq2seq/multitask_config`; pass one of the names from the `multi_config` dict in that file.

For example, to train the naive split with the top-performing curricular approach (i.e., the result in Table 3 that is ACW + ACW-descramble):
```bash
python train_clues.py --default_train=base --name=naive_top_curricular --project=curricular --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random' --multitask=ACW__ACW_descramble
```

Note that the modifications on the dataset are done at the

3. To produce Table 3 of the results:
    - We don't need to do a model_eval run, since the output predictions already contain 5 generations,
      which is all we report for that table (for faster experimental iteration).
    - We need to run `load_and_run_t5` on all outputs (column 1) and on the anagram subset (column 2).
      See below for how we do this.

4. For our top result in Table 2 (main results), we
    1. scale up the curricular period (to 4 total epochs):
```bash
python train_clues.py --default_train=base --name=naive_top_curricular --project=curricular --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random' --multitask=final_top_result_scaled_up
```
    2. evaluate with the full 100 generations, as before.
For example, if epoch 10 is best, the following runs the eval set (change `run_name` to match your run):
```bash
python train_clues.py --default_val=base --name=curricular_naive_top --project=curricular --data_dir='../data/clue_json/guardian/naive_random' --ckpt_path='./wandb/run_name/files/epoch_10.pth.tar'
```
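
Since the checkpoint path depends on both the run name and the best epoch, it can help to assemble the eval command programmatically. This is only a sketch: `checkpoint_path` and `eval_command` are hypothetical helpers, not part of the repository, and the path pattern is inferred from the single example command above.

```python
import shlex


def checkpoint_path(run_name, epoch, wandb_dir="./wandb"):
    """Build a checkpoint path following the pattern of the example command."""
    return f"{wandb_dir}/{run_name}/files/epoch_{epoch}.pth.tar"


def eval_command(run_name, epoch):
    """Return the eval command line for the given run and epoch."""
    args = [
        "python", "train_clues.py",
        "--default_val=base",
        "--name=curricular_naive_top",
        "--project=curricular",
        "--data_dir=../data/clue_json/guardian/naive_random",
        f"--ckpt_path={checkpoint_path(run_name, epoch)}",
    ]
    # shlex.join quotes any argument that needs it (Python 3.8+).
    return shlex.join(args)


# "my_run" is a placeholder; substitute the directory name of your wandb run.
print(eval_command("my_run", 10))
```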


In [None]:
from decrypt.common.label_anagrams import make_label_set

labels = make_label_set()

In [None]:
# Note: run this directly on the top model's output from curricular training.
# Otherwise (e.g., if 100 beams were used), the top 5 output
# sequences would be expected to change.
# Remember not to append .json to the path.

# eval on the full output (5 beams / 5 sequences)
# this is column 1 of table 3
vt.load_and_run_t5('outputs/model_output.preds',
                   # pre_truncate=5,        # should not be needed since we have only 5 outputs
                   do_length_filter=True)

# run on the anagram subset
# this is column 2 of table 3
vt.load_and_run_t5('outputs/model_output.preds',
                   filter_fcn=vt.make_set_filter(labels, 'anag_direct'),
                   # pre_truncate=5,
                   do_length_filter=True)

# The metric we report is agg_top_match (which is computed after the filter).