## For running the T5 vanilla baseline
Here we provide code to
1. Set up the data files for running T5 (i.e., produce JSON files).
1. Run the seq2seq model to produce outputs, saving them in a directory that matches `k_t5_outputs_dir` below.
    - Train the model (produces saved checkpoints).
    - Evaluate the top-performing model (loads the best checkpoint, produces JSON outputs).
1. Run the eval code on the JSON output from model eval.

In [None]:
%load_ext autoreload
%autoreload 2

from decrypt.scrape_parse import (
    load_guardian_splits,
    load_guardian_splits_disjoint,
    load_guardian_splits_disjoint_hash
)

import os
from decrypt import config
from decrypt.common import validation_tools as vt
from decrypt.common.util_data import clue_list_tuple_to_train_split_json
import logging
logging.getLogger(__name__)


k_json_folder = config.DataDirs.Guardian.json_folder

## 1 Produce datasets

In [None]:
def make_dataset(split_type: str, overwrite=False):
    assert split_type in ['naive_random', 'naive_disjoint', 'word_init_disjoint']
    if split_type == 'naive_random':
        load_fn = load_guardian_splits
        tgt_dir = config.DataDirs.DataExport.guardian_naive_random_split
    elif split_type == 'naive_disjoint':
        load_fn = load_guardian_splits_disjoint
        tgt_dir = config.DataDirs.DataExport.guardian_naive_disjoint_split
    else:
        load_fn = load_guardian_splits_disjoint_hash
        tgt_dir = config.DataDirs.DataExport.guardian_word_init_disjoint_split

    _, _, (train, val, test) = load_fn(k_json_folder)

    os.makedirs(tgt_dir, exist_ok=True)
    # write the output as json
    try:
        clue_list_tuple_to_train_split_json((train, val, test),
                                            comment=f'Guardian data. Split: {split_type}',
                                            export_dir=tgt_dir,
                                            overwrite=overwrite)
    except FileExistsError:
        logging.warning(f'You have already generated the {split_type} dataset.\n'
                        f'It is located at {tgt_dir}\n'
                        f'To regenerate, pass overwrite=True or delete it\n')


make_dataset('naive_random')
make_dataset('word_init_disjoint')
# you can also make_dataset('naive_disjoint')
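
After exporting, it can be useful to confirm that each split directory actually contains the JSON files. The following is a minimal sketch, not part of the `decrypt` package: `summarize_export_dir` is a hypothetical helper, and it assumes only that the export directory holds `*.json` files whose top level is a JSON array or object.

```python
import json
from pathlib import Path


def summarize_export_dir(export_dir):
    """Map each *.json file name in export_dir to its number of top-level entries."""
    summary = {}
    for path in sorted(Path(export_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # len() covers JSON arrays and objects; any other top-level value counts as 1.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary
```

For instance, `summarize_export_dir(config.DataDirs.DataExport.guardian_naive_random_split)` would list whatever files `make_dataset('naive_random')` wrote, with a rough size for each.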

## 2 Running (training) the model
1. Set up the environment
    1. Set up wandb for logging (that is where metrics will show up).
    If you try to run without wandb, wandb will tell you what you need to do to initialize it.

    1. The relevant libraries used for our runs are
        - transformers==4.4.2
        - wandb==0.10.13 # this can probably be updated
        - torch==1.7.1+cu110
        - torchvision==0.8.2+cu110
    1. Choose a location for your wandb dir, e.g., `'./wandb'`.
1. Train the model
    1. The default arguments are given in `args_cryptic`. See `--default_train` and `--default_val`.
    1. In logging messages and in wandb, epochs will appear to start at 11.
    This leaves "space" for 10 "warmup" epochs for curricular training,
    so that all plots in wandb line up.
    1. From the `seq2seq` directory, run the commands below.
    This will produce model checkpoints that can then be used for evaluation.
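
Before training, the installed library versions can be compared against the ones listed above. This is a minimal sketch using the standard-library `importlib.metadata` (Python 3.8+). The helper name `find_mismatches` is hypothetical, and the sketch assumes a local build tag such as `+cu110` should be ignored when comparing.

```python
from importlib import metadata

# Versions from the list above, without the local "+cu110" build tag.
EXPECTED = {
    "transformers": "4.4.2",
    "wandb": "0.10.13",
    "torch": "1.7.1",
    "torchvision": "0.8.2",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = get_version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed
        # Strip a local build tag such as "+cu110" before comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` returns an empty dict when everything matches. A mismatch is not necessarily a problem (the wandb pin, for example, can probably be updated), so treat the result as a hint rather than a hard check.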



Baseline naive
```bash
python train_clues.py --default_train=base --name=baseline_naive --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random'
```
Baseline naive, without lengths
```bash
python train_clues.py --default_train=base --name=baseline_naive_nolens --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random' --special=no_lens
```
Baseline disjoint (word initial disjoint)
```bash
python train_clues.py --default_train=base --name=baseline_disj --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/word_initial_disjoint'
```
Baseline disjoint (word initial disjoint), without lengths
```bash
python train_clues.py --default_train=base --name=baseline_disj_nolens --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/word_initial_disjoint' --special=no_lens
```
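
The four training runs differ only in run name, data directory, and the optional `--special=no_lens` flag, so they can also be generated from a small table. The sketch below is hypothetical and not part of the repository. It uses only flags that appear in the commands above, and it assumes that each no-lengths run uses the same data directory as its with-lengths counterpart and has its own distinct run name.

```python
import shlex

NAIVE = "../data/clue_json/guardian/naive_random"
DISJOINT = "../data/clue_json/guardian/word_initial_disjoint"

# (run name, data directory, extra flags); run names are assumptions.
RUNS = [
    ("baseline_naive", NAIVE, []),
    ("baseline_naive_nolens", NAIVE, ["--special=no_lens"]),
    ("baseline_disj", DISJOINT, []),
    ("baseline_disj_nolens", DISJOINT, ["--special=no_lens"]),
]


def build_train_command(name, data_dir, extra=(), wandb_dir="./wandb"):
    """Return the argument list for one training run of train_clues.py."""
    return [
        "python", "train_clues.py",
        "--default_train=base",
        f"--name={name}",
        "--project=baseline",
        f"--wandb_dir={wandb_dir}",
        f"--data_dir={data_dir}",
        *extra,
    ]


for name, data_dir, extra in RUNS:
    # shlex.join produces a shell-safe command line to copy and paste.
    print(shlex.join(build_train_command(name, data_dir, extra)))
```

Building the argument list first also makes it straightforward to pass the same list to `subprocess.run(cmd, check=True)` from the `seq2seq` directory instead of pasting commands by hand.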

## 3 Evaluating the model
During training we generate only 5 beams for efficiency. For eval we generate 100.
1. Select the best model based on `num_match_top_sampled`. There should be
a logging statement at the end of the run that prints the location
of the best model checkpoint.
You can also find it by matching the peak in the wandb.ai metrics graph
to the corresponding saved checkpoint.
2. Run eval using that model (see commands below). This will
produce a file in a (new, different) wandb directory that looks like `epoch_11.pth.tar.preds.json` (i.e., only a single epoch).

For example,

Baseline naive, if epoch 20 is best (you'll need to set `run_name`).
This produces generations for the validation set:
```bash
python train_clues.py --default_val=base --name=baseline_naive_val --project=baseline --data_dir='../data/clue_json/guardian/naive_random' --ckpt_path='./wandb/run_name/files/epoch_20.pth.tar'
```

To produce generations for the test set, add `--test`:
```bash
python train_clues.py --default_val=base --name=baseline_naive_test --project=baseline --data_dir='../data/clue_json/guardian/naive_random' --ckpt_path='./wandb/run_name/files/epoch_20.pth.tar' --test
```

This should also be run for the no-lengths versions if you want to replicate those results.
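
Checkpoint selection and the two evaluation commands can be sketched in a few lines. Everything here is illustrative: `best_epoch` and `build_eval_command` are hypothetical helpers, the metric values are placeholders rather than real results, and the checkpoint layout `<run_dir>/files/epoch_<N>.pth.tar` is taken from the example paths above.

```python
def best_epoch(metric_by_epoch):
    """Return the epoch with the highest metric value (earliest epoch wins ties)."""
    return max(sorted(metric_by_epoch), key=metric_by_epoch.get)


def build_eval_command(name, data_dir, run_dir, epoch, test=False):
    """Return the argument list for evaluating one checkpoint with train_clues.py."""
    ckpt = f"{run_dir}/files/epoch_{epoch}.pth.tar"
    cmd = [
        "python", "train_clues.py",
        "--default_val=base",
        f"--name={name}",
        "--project=baseline",
        f"--data_dir={data_dir}",
        f"--ckpt_path={ckpt}",
    ]
    if test:
        cmd.append("--test")  # generate for the test set instead of validation
    return cmd


# Placeholder num_match_top_sampled values per epoch, read off wandb by hand.
history = {18: 3, 19: 7, 20: 9, 21: 8}
epoch = best_epoch(history)
data_dir = "../data/clue_json/guardian/naive_random"
val_cmd = build_eval_command("baseline_naive_val", data_dir, "./wandb/run_name", epoch)
test_cmd = build_eval_command("baseline_naive_test", data_dir, "./wandb/run_name", epoch, test=True)
```

Passing the checkpoint path as a single list element avoids shell-quoting mistakes when the command is run through `subprocess.run`.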


Now we produce metrics by evaluating the JSON files that were produced.
1. Change the `k_t5_outputs_dir` value to the location where you have saved the JSON files.
    - We recommend copying all of the `preds.json` files into a common directory and working from that.
    - Alternatively, you could modify the code below and pass in the full path to each of the JSON outputs (using the wandb directory path).
1. For each T5 model eval (above) that you ran (each of which produced a `*.preds.json` file), run `vt.load_and_run_t5()` to get metrics for those outputs.
1. The resulting outputs are the values we report in the tables. See `decrypt/common/validation_tools.ModelEval` for more details about the numbers that are produced. Percentages are prefixed by `agg_`.

Note that for the Main Results Table 2, the metrics we include in the table correspond to
- `agg_top_match`
- `agg_top_10_after_filter`

More details of these metric calculations can be found in `decrypt.common.validation_tools`
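
Copying the prediction files into one directory, as recommended above, can be scripted. This is a minimal sketch with a hypothetical helper, `collect_preds`. It assumes the layout `<wandb_dir>/<run_name>/files/<checkpoint>.preds.json` and prefixes each copy with its run folder name so files from different runs do not collide.

```python
import shutil
from pathlib import Path


def collect_preds(wandb_dir, out_dir):
    """Copy every *.preds.json under wandb_dir into out_dir; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(wandb_dir).rglob("*.preds.json")):
        # The first path component below wandb_dir is taken to be the run folder.
        run_name = src.relative_to(wandb_dir).parts[0]
        dst = out / f"{run_name}_{src.name}"
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Keep `out_dir` outside `wandb_dir`, otherwise a second call would pick up the earlier copies. You will probably still want to rename the copies to something shorter; remember that `load_and_run_t5` appends `.json` itself, so pass it the path without that suffix.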

In [None]:
# For example, if your output files are in
# 'decrypt/t5_outputs/'
# and you have named them, e.g.,
# baseline_naive_e12_test.json,
# then run the calls below
# (.json will be appended for you by the load_and_run_t5 function).
# Note: a better name for load_and_run_t5 would be load_and_eval.

# primary - test
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_e12_test')
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_nolens_e15_test')

# primary - val
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_e12_val')
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_nolens_e15_val')

