# Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP
This repository is the official implementation of *Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP*. Please cite the arXiv or NeurIPS 2021 version.
The dataset is also available at https://doi.org/10.5061/dryad.n02v6wwzp
The Dryad deposit will let you download and replicate the data splits, but it has not been updated to include everything required to run the baseline and experiment notebooks.
```shell
pip install -r requirements.txt
```
```shell
git clone <anonymized>   # if using the code supplement, just unzip instead
cd decrypt
pushd ./data && unzip "*.json.zip" && popd
```
## Download data (can safely be skipped)

If you want to download the data yourself from the web (you probably don't want to):
```shell
git clone <anonymized>   # if using the code supplement, just unzip instead
cd decrypt
mkdir -p './data/puzzles'
python decrypt/scrape_parse/guardian_scrape.py --save_directory="./data/puzzles"
```
Then, when loading the splits, call `load_guardian_splits` as:

```python
load_guardian_splits("./data/puzzles", load_from_files=True, use_premade_json=False)
```
## Reproducing our splits
```python
from decrypt.scrape_parse import (
    load_guardian_splits,                # naive random split
    load_guardian_splits_disjoint,       # answer-disjoint split
    load_guardian_splits_disjoint_hash,  # word-initial disjoint split
)
from decrypt.scrape_parse.guardian_load import SplitReturn

"""
Each of these methods returns a `SplitReturn` tuple:
- soln-to-clue map (Dict[str, List[BaseClue]]): maps each solution string to
  the list of clues whose answer is that solution, which lets you see all
  clues associated with a given answer word
- list of all clues (List[BaseClue])
- tuple of three lists (the train, val, test splits), each a List[BaseClue]

Note that load_guardian_splits() will verify that
- the total glob length matches the one in the paper (i.e. the number of
  puzzles downloaded matches)
- the total clue-set length matches the one in the paper (i.e. the filtering
  is the same)
- one of the clues in our train set matches our train set (a single-clue spot
  check for randomness)

If you get an assertion error or an exception during load, please file an
issue, since the splits should be identical. Alternatively, if you don't
care, you can pass `verify=False` to `load_guardian_splits`.
"""

soln_to_clue_map, all_clues_list, (train, val, test) = load_guardian_splits()
```
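A naive random split can leak answers between train and test, which is what the two disjoint loaders avoid. As an illustration only (not the repository's implementation), here is a minimal sketch of both ideas: an answer-disjoint 80/10/10 split, and a hash of an answer's initial letters that keeps word-initial families in the same split:

```python
import hashlib
import random
from typing import Dict, List, Tuple

Clue = Tuple[str, str]  # (answer, clue text) -- simplified stand-in for BaseClue


def word_initial_bucket(answer: str, n_buckets: int = 10) -> int:
    """Hash a word-initial prefix of the answer so that answers sharing that
    prefix always land in the same bucket (and hence the same split)."""
    prefix = answer[:2].lower()
    digest = hashlib.md5(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def make_answer_disjoint_splits(
    soln_to_clues: Dict[str, List[str]], seed: int = 42
) -> Tuple[List[Clue], List[Clue], List[Clue]]:
    """Answer-disjoint split: deterministically shuffle the distinct answers,
    then send each answer's entire clue list to exactly one split, so no
    answer appears in more than one of train/val/test."""
    answers = sorted(soln_to_clues)
    random.Random(seed).shuffle(answers)
    n = len(answers)
    cut1, cut2 = int(0.8 * n), int(0.9 * n)

    def collect(subset: List[str]) -> List[Clue]:
        return [(a, c) for a in subset for c in soln_to_clues[a]]

    return (
        collect(answers[:cut1]),
        collect(answers[cut1:cut2]),
        collect(answers[cut2:]),
    )
```

The actual loaders above return `BaseClue` objects and verified splits; this sketch only shows why the three loaders produce different train/test boundaries.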
## Replicating our work
We make code available to replicate the entire paper.
Note that the directory structure is specified in `decrypt/config.py`; you can change it if you would like. Most code references this file, but the run commands (i.e. `python ...`) assume the directories are unchanged from the original `config.py`.
## Datasets and task (Section 3)
- The splits are replicated as above using the load methods
- The task is replicated in the following sections
- We provide code to replicate the metric analysis. See the implementation in the Jupyter notebooks below.
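For orientation, a common metric for this kind of candidate-generation task is top-k accuracy: the fraction of clues whose gold answer appears in a model's top-k candidate list. The paper's exact metric code lives in the notebooks; the function below is only an illustrative sketch:

```python
from typing import List


def top_k_accuracy(
    ranked_candidates: List[List[str]], answers: List[str], k: int = 10
) -> float:
    """Fraction of clues whose gold answer appears (case-insensitively)
    among the model's top-k ranked candidates."""
    hits = sum(
        ans.lower() in (c.lower() for c in cands[:k])
        for cands, ans in zip(ranked_candidates, answers)
    )
    return hits / len(answers)
```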
To run the notebooks, start your Jupyter server from the top level of the repository. The notebooks were run using PyCharm opened from the top level. If you experience import errors, it is most likely because you are not running from the top level.
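As a convenience (not part of the repository), a first notebook cell like the following can fail fast with a clear message when the working directory is wrong; it uses `decrypt/config.py` as the marker for the repository top level:

```python
from pathlib import Path


def assert_repo_root(marker: str = "decrypt/config.py") -> None:
    """Raise early with a clear message if the notebook was not started from
    the repository top level (the usual cause of import errors)."""
    if not (Path.cwd() / marker).exists():
        raise RuntimeError(
            f"Expected to run from the repo root (missing {marker}); "
            f"current directory is {Path.cwd()}"
        )
```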
## Baselines (Section 4)
Notebooks to replicate the four baselines are provided.
Note that a patch needs to be applied for the code to work with the Deits solver.
## Curriculum Learning (Section 5)
Note that details of training and evaluating the models are available in the relevant Jupyter notebooks.