## For running the T5 vanilla baseline
1. Set up the data files for running T5 (i.e., produce the json files)
1. Run the seq2seq model to produce outputs, saving them in a directory that matches k_t5_outputs below
    - Train the model (produces saved checkpoints)
    - Evaluate the top-performing model (loads the top checkpoint, produces json outputs)
1. Run the eval code here using the json output from the model eval

In [5]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../decrypt')
from decrypt.scrape_parse import (
    load_guardian_splits,
    load_guardian_splits_disjoint,
    load_guardian_splits_disjoint_hash
)

import os
import config
from decrypt.common import validation_tools as vt
from decrypt.common.util_data import clue_list_tuple_to_train_split_json
import logging
logging.getLogger(__name__)


k_json_folder = config.DataDirs.Guardian.json_folder

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1 Produce datasets

In [10]:
def make_dataset(split_type: str, overwrite=False):
    assert split_type in ['naive_random', 'naive_disjoint', 'word_init_disjoint']
    if split_type == 'naive_random':
        load_fn = load_guardian_splits
        tgt_dir = config.DataDirs.DataExport.guardian_naive_random_split
    elif split_type == 'naive_disjoint':
        load_fn = load_guardian_splits_disjoint
        tgt_dir = config.DataDirs.DataExport.guardian_naive_disjoint_split
    else:
        load_fn = load_guardian_splits_disjoint_hash
        tgt_dir = config.DataDirs.DataExport.guardian_word_init_disjoint_split

    _, _, (train, val, test) = load_fn(k_json_folder)

    os.makedirs(tgt_dir, exist_ok=True)
    # write the output as json
    try:
        clue_list_tuple_to_train_split_json((train, val, test),
                                            comment=f'Guardian data. Split: {split_type}',
                                            export_dir=tgt_dir,
                                            overwrite=overwrite)
    except FileExistsError:
        logging.warning(f'You have already generated the {split_type} dataset.\n'
                        f'It is located at {tgt_dir}\n'
                        f'To regenerate, pass overwrite=True or delete it\n')


make_dataset('naive_random')
make_dataset('word_init_disjoint')
# you can also make_dataset('naive_disjoint')

INFO:decrypt.scrape_parse.guardian_load:loading from /Users/jsrozner/jsrozner/cryptic/decrypt/data/puzzles
INFO:decrypt.scrape_parse.guardian_load:Using file glob at /Users/jsrozner/jsrozner/cryptic/decrypt/data/puzzles/cryptic*.json
INFO:decrypt.scrape_parse.guardian_load:Glob has size 5518
INFO:decrypt.scrape_parse.guardian_load:Glob size matches the expected one from Decrypting paper
100%|██████████| 5518/5518 [00:12<00:00, 448.52it/s]
 70%|███████   | 100848/143991 [00:01<00:00, 76506.56it/s]

[("length punct: '", 1),
 ('invalid: clue group', 7687),
 ('invalid: invalid start char (most are continuation clues)', 607),
 ('invalid: number in clue (commonly references another clue)', 7066),
 ('invalid: regexp', 75),
 ('invalid: soln length does not match specified lens (multi box soln)', 56),
 ('invalid: unrecognized char in clue (e.g. html)', 85),
 ('invalid: zero-len clue text after regexp', 15),
 ('length punct: ,', 24644),
 ('length punct: -', 4148),
 ('length punct: .', 8),
 ('length punct: /', 1),
 ('stat: parsed_puzzle', 5518),
 ('stat: total_clues', 143991),
 (1, 119956),
 (2, 20272),
 (3, 2957),
 (4, 686),
 (5, 112),
 (6, 8)]
Total clues: len(puzz_list)


100%|██████████| 143991/143991 [00:01<00:00, 130467.53it/s]
100%|██████████| 55783/55783 [00:02<00:00, 24200.43it/s]


removed 1611 exact dupes
142380


INFO:decrypt.scrape_parse.guardian_load:Counter({1: 118540, 2: 20105, 3: 2929, 4: 686, 5: 112, 6: 8})
INFO:decrypt.scrape_parse.guardian_load:Clue list length matches Decrypting paper expected length
INFO:decrypt.scrape_parse.guardian_load:Got splits of lenghts [75847, 32628, 33905]
INFO:decrypt.scrape_parse.guardian_load:First three clues of train set:
	[GuardianClue(clue='Sailor boy in his hammock', lengths=[4], soln='abed', soln_with_spaces='abed', idx=34809, dataset=PosixPath('/Users/jsrozner/jsrozner/cryptic/decrypt/data/puzzles'), across_or_down='across', pos=(0, 2), unique_clue_id='cryptic_23048_10-across', type='cryptic', number=23048, id='crosswords/cryptic/23048', creator='Rufus', orig_lengths='4', lengths_punctuation=set()), GuardianClue(clue='With a degree, I leave this subject', lengths=[5], soln='maths', soln_with_spaces='maths', idx=412, dataset=PosixPath('/Users/jsrozner/jsrozner/cryptic/decrypt/data/puzzles'), across_or_down='across', pos=(0, 13), unique_clue_id='crypt

{'idx': 34809, 'input': 'Sailor boy in his hammock (4)', 'target': 'abed'}
{'idx': 412,
 'input': 'With a degree, I leave this subject (5)',
 'target': 'maths'}
{'idx': 116809,
 'input': 'Burrow to cure limb and make sure one gets up (3,3,5)',
 'target': 'set the alarm'}
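
The printed records suggest that each exported example is the clue text with its enumeration appended in parentheses, paired with the spaced solution as the target. The snippet below is a minimal sketch of that mapping, not the code in `clue_list_tuple_to_train_split_json`. The helper name `to_seq2seq_example` is hypothetical, and joining the lengths with commas is an assumption based on the `(3,3,5)` record above.

```python
from typing import Dict, List, Union


def to_seq2seq_example(idx: int, clue: str, lengths: List[int],
                       soln_with_spaces: str) -> Dict[str, Union[int, str]]:
    """Hypothetical helper: build one input/target record from a parsed clue."""
    # Assumption: the enumeration is the comma-joined list of word lengths.
    enumeration = ",".join(str(n) for n in lengths)
    return {
        "idx": idx,
        "input": f"{clue} ({enumeration})",
        "target": soln_with_spaces,
    }


# Values taken from the first training clue shown above.
example = to_seq2seq_example(34809, "Sailor boy in his hammock", [4], "abed")
print(example)
# {'idx': 34809, 'input': 'Sailor boy in his hammock (4)', 'target': 'abed'}
```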



## 2 Running (training) the model
1. Set up the environment
    1. Set up wandb for logging (that's where metrics will show up).
    If you try to run without it, wandb will tell you what you need to do to initialize it
1. Train the model
    1. From the `seq2seq` directory, run the commands in the box below
    1. This will produce model checkpoints

TODO: environment setup

- Choose a location for your wandb directory, e.g., `'./wandb'`
- The default arguments are defined in args_cryptic. See `--default_train` and `--default_val`
- Epochs appear to start at 11, which leaves room for 10 "warmup" epochs used in curricular training, so that plots in wandb line up across runs

Baseline naive
```bash
train_clues.py --default_train=base --name=baseline_naive --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random'
```
Baseline disjoint (word initial disjoint)
```bash
train_clues.py --default_train=base --name=baseline_disj --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/word_initial_disjoint'
```

Baseline (naive split), without lengths
```bash
train_clues.py --default_train=base --name=baseline_naive_nolens --project=baseline --wandb_dir='./wandb' --data_dir='../data/clue_json/guardian/naive_random' --special=no_lens
```
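
If you launch several runs, it can help to assemble the argument list in one place so that run names and data directories stay in sync. The following is a minimal sketch that only formats the flags shown in the commands above. The helper `build_train_command` is hypothetical and is not part of the repository.

```python
import shlex
from typing import List, Optional


def build_train_command(name: str, data_dir: str,
                        project: str = "baseline",
                        wandb_dir: str = "./wandb",
                        special: Optional[str] = None) -> List[str]:
    """Hypothetical helper: format the train_clues.py flags used in this section."""
    cmd = [
        "train_clues.py",
        "--default_train=base",
        f"--name={name}",
        f"--project={project}",
        f"--wandb_dir={wandb_dir}",
        f"--data_dir={data_dir}",
    ]
    # --special is only added when requested (e.g. special="no_lens").
    if special is not None:
        cmd.append(f"--special={special}")
    return cmd


cmd = build_train_command("baseline_naive",
                          "../data/clue_json/guardian/naive_random")
# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```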

## 3 Evaluating the model
During training we generate only 5 beams; for eval we generate 100.
1. Select the best model based on num_match_top_sampled
2. Run eval using that model
3. This produces a file in a new wandb directory named like `epoch_11.pth.tar.preds.json` (i.e., the predictions for a single epoch)

For example,

Baseline naive, if epoch 10 is best (you'll need to replace run_name with the name of your run).
This evaluates on the validation set:
```bash
train_clues.py --default_val=base --name=baseline_naive_val --project=baseline --data_dir='../data/clue_json/guardian/naive_random' --ckpt_path='./wandb/run_name/files/epoch_10.pth.tar'
```

To evaluate on the test set, add `--test`
```bash
train_clues.py --default_val=base --name=baseline_naive_val --project=baseline --data_dir='../data/clue_json/guardian/naive_random' --ckpt_path='./wandb/run_name/files/epoch_10.pth.tar' --test
```
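
Step 1 above (picking the best checkpoint) can be sketched as follows, assuming you have copied the per-epoch num_match_top_sampled values out of wandb by hand and that a higher value is better. The metric values and the helper `best_checkpoint` are hypothetical. The checkpoint path pattern follows the `--ckpt_path` examples above.

```python
from pathlib import Path
from typing import Dict


def best_checkpoint(metric_by_epoch: Dict[int, float], run_dir: Path) -> Path:
    """Hypothetical helper: return the checkpoint path for the best epoch."""
    # Assumption: a higher num_match_top_sampled means a better model.
    best_epoch = max(metric_by_epoch, key=lambda epoch: metric_by_epoch[epoch])
    return run_dir / "files" / f"epoch_{best_epoch}.pth.tar"


# Made-up metric values for illustration only.
num_match_top_sampled = {11: 0.10, 12: 0.16, 13: 0.15}
ckpt = best_checkpoint(num_match_top_sampled, Path("./wandb/run_name"))
print(ckpt.name)
```

The resulting path is what you would pass to `--ckpt_path`.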


Now we evaluate the json files that were produced
1. Change the k_t5_outputs_dir value to the location where you have saved the json files.
    - We recommend copying all of the preds.json files into a common directory and working from that.
    - Alternatively, you could modify the code below and pass in the full path to each of the json outputs (using the wandb directory path)
1. For each T5 model eval that you ran, run `load_and_run_t5()` to get metrics for those outputs
1. The resulting outputs are the values we report in the tables. See `decrypt/common/validation_tools.ModelEval` for more details about the numbers that are produced. Metrics prefixed by agg_ are the raw counts divided by the number of clues, so most of them are fractions rather than 0-100 percentages (e.g., in the output below, in_sample = 7996 over 28476 clues gives agg_in_sample ≈ 0.281)
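
The recommendation to gather all preds.json files into one directory can be sketched as below. This assumes each run's files live under `<wandb_dir>/<run_name>/files/`, as in the `--ckpt_path` examples above. The helper `collect_preds` and the `<run_name>_<file>` naming scheme are hypothetical, and the demo writes to a temporary directory rather than to real outputs.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def collect_preds(wandb_dir: Path, out_dir: Path) -> List[Path]:
    """Hypothetical helper: copy every *.preds.json into one directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumed layout: <wandb_dir>/<run_name>/files/epoch_N.pth.tar.preds.json
    for src in sorted(wandb_dir.glob("*/files/*.preds.json")):
        run_name = src.parent.parent.name
        # Prefix with the run name so files from different runs do not collide.
        dst = out_dir / f"{run_name}_{src.name}"
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files_dir = root / "wandb" / "run_name" / "files"
    files_dir.mkdir(parents=True)
    (files_dir / "epoch_11.pth.tar.preds.json").write_text("[]")
    copied = collect_preds(root / "wandb", root / "t5_outputs")
    copied_names = [p.name for p in copied]

print(copied_names)
# ['run_name_epoch_11.pth.tar.preds.json']
```

After copying, you would likely rename each file to something descriptive such as `baseline_naive_e12_test.json`, since the calls below refer to the files by those names.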

In [6]:
# For example, if your output files are in 'decrypt/t5_outputs/'
# and are named like baseline_naive_e12_test.json, run the calls below.
# (.json is appended for you by the load_and_run_t5 function)
# Note: a better name for load_and_run would be load_and_eval

# primary - test set
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_e12_test')
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_nolens_e15_test')

# primary - val set
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_e12_val')
vt.load_and_run_t5('decrypt/t5_outputs/baseline_naive_nolens_e15_val')



28476
[('agg_filter_len_pre_truncate', 44.504495013344574),
 ('agg_filtered_few', 0.007093692934400899),
 ('agg_generate_few', 0.0),
 ('agg_generate_none', 0.0),
 ('agg_in_filtered', 0.4591234723978087),
 ('agg_in_sample', 0.2807978648686613),
 ('agg_sample_len', 10.0),
 ('agg_sample_len_correct', 0.48454136816968674),
 ('agg_sample_len_pre_truncate', 100.0),
 ('agg_sample_wordct_correct', 0.9789858126141312),
 ('agg_top_10_after_filter', 0.3385658098047479),
 ('agg_top_match', 0.1630495856159573),
 ('agg_top_match_len_correct', 0.9998946481247366),
 ('agg_top_match_none', 0.00010535187526337969),
 ('agg_top_match_wordct_correct', 0.9942407641522686),
 ('agg_top_sample_result_len_correct', 0.560858266610479),
 ('agg_top_sample_result_wordct_correct', 0.9846186262115466),
 ('filter_len_pre_truncate', 1267310),
 ('filtered_few', 202),
 ('generate_few', 0),
 ('generate_none', 0),
 ('in_filtered', 13074),
 ('in_sample', 7996),
 ('sample_len', 284760),
 ('sample_len_correct', 137978),
 ('sa