This repo contains code and resources for the following papers:
- EACL 2023 paper, Document-Level Planning for Text Simplification.
- ACL 2023 Findings paper, Context-Aware Document Simplification.
We will progressively update with code, instructions, pretrained models, data, etc. as soon as we can make them available.
git clone https://github.com/liamcripwell/plan_simp.git
cd plan_simp
pip install -e .
We provide pretrained models for the components of our contextual planning system PG_Dyn
as well as several of the models proposed in Context-Aware Document Simplification on HuggingFace.
Systems can be loaded within Python as follows:
from plan_simp.models.classifier import load_planner
from plan_simp.models.bart import load_simplifier
# contextual simplification planner
planner, p_tokenizer, p_hparams = load_planner("liamcripwell/pgdyn-plan")
# simplification model
simplifier, tokenizer, hparams = load_simplifier("liamcripwell/pgdyn-simp")
An example use-case of inference on out-of-domain test data is illustrated in this script.
The Wiki-Auto data used to train the relevant planners and simplification models can be downloaded here. Please contact the authors for more information regarding the Newsela data once you have obtain a licence.
We provide a script to generate sentence-level context encodings which are used within pgdyn-plan
and conbart
.
# encode sentence-level context embeddings to be used by the planner
python plan_simp/scripts/encode_contexts.py \
--data=examples/wikiauto_docs_valid.csv \
--x_col=complex \
--id_col=pair_id \
--save_dir=fake_context_dir/
A planner can be used to dynamically generate simplified documents. The planner will iteratively predict an operation for the current sentence (given the document context) and pass this to an encoder-decoder to conditionally generate a simplification. These simplifications are then used within the context of subsequent sentences. There is no need to pass a --simple_context_dir
argument because dynamic context will be managed on-the-fly.
python plan_simp/scripts/generate.py dynamic
--clf_model_ckpt=<planner_model> # e.g. liamcripwell/pgdyn-plan
--model_ckpt=<simplification_model> # e.g. liamcripwell/pgdyn-simp
--test_file=<test_sentences>
--doc_id_col=pair_id # document identifier for each sentence
--context_doc_id=c_id
--context_dir=<context_dir>
--reading_lvl=s_level
--out_file=<output_csv>
Alternatively, plan-guided simplification can be formed with pre-determined operation labels, or with no guidance at all.
# plan-guidance with predefined operations in `"label"` column
python plan_simp/scripts/generate.py inference
--model_ckpt=<simplification_model> # e.g. liamcripwell/pgdyn-simp
--test_file=<test_sentences>
--op_col=label
--reading_lvl=s_level
--out_file=<output_csv>
# generation with end-to-end model (no plan guidance)
python plan_simp/scripts/generate.py inference
--model_ckpt=liamcripwell/ledpara
--test_file=<test_data>
--reading_lvl=s_level
--out_file=<output_csv>
Generation can also be performed within python (see the source code for more parameter details).
# basic generation with no planning
from plan_simp.models.bart import run_generator
preds = run_generator(model_ckpt="liamcripwell/ledpara", **params)
# dynamic plan-guided generation
from plan_simp.scripts.generate import Launcher
launcher = Launcher()
launcher.dynamic(model_ckpt="liamcripwell/o-conbart", clf_model_ckpt="liamcripwell/pgdyn-plan", **params)
The following commands show example use cases for training your own planner models.
For Newsela, make sure to use the --reading_lvl
flag to specify a column in the data which indicates the target reading level of the simplification.
# baseline
python plan_simp/scripts/train_clf.py
--train_file=<train_file>
--val_file=<val_file>
--x_col=complex
--y_col=label
--batch_size=32
--learning_rate=1e-5
--ckpt_metric=val_macro_f1
# contextual (PG variants)
python plan_simp/scripts/train_clf.py
--train_file=<train_file>
--val_file=<val_file>
--x_col=complex
--y_col=label
--batch_size=32
--learning_rate=1e-5
--ckpt_metric=val_macro_f1
--add_context
--context_doc_id=pair_id
--context_dir=<context_dir>
--context_window=13
# used for dynamic context loading
--simple_context_doc_id=pair_id
--simple_context_dir=<context_dir>
# sequence tagging
python plan_simp/scripts/train_tagger.py
--train_file=<train_file>
--val_file=<val_file>
--x_col=<doc_id>
--y_col=<labels>
--batch_size=32
--learning_rate=1e-5
--embed_dir=<sent_rep_dir>
# baseline
python plan_simp/scripts/eval_clf.py <planner_model> <test_set>
# contextual
python plan_simp/scripts/eval_clf.py <planner_model> <test_set>
--add_context=True
--context_dir=<context_dir>
--context_doc_id=pair_id
# used for dynamic context loading
--simple_context_dir=<context_dir>
--simple_context_doc_id=pair_id
--reading_lvl=s_level # for Newsela
# sequence tagging; for Newsela add --x_col=c_id --reading_lvl=s_level
python plan_simp/scripts/eval_tagger.py <planner_model> <test_set> --embed_dir=<sent_rep_dir>
Below is an example of how to evaluate simplification outputs from the systems. BARTScore is disabled by default but can be enabled by pointing to a model via the --bartscore_path
flag.
# evaluate simplification performance
python plan_simp/scripts/eval_simp.py \
--input_data=examples/wikiauto_docs_valid.csv \ # document-level input
--output_data=test_out.csv \ # sentence-level predictions (will be automatically merged)
--x_col=complex \
--r_col=simple \
--y_col=pred \
--doc_id_col=pair_id \
--prepro=True \
--sent_level=True
If you find this repository useful, please cite our publications:
@inproceedings{cripwell-etal-2023-document,
title = "Document-Level Planning for Text Simplification",
author = {Cripwell, Liam and
Legrand, Jo{\"e}l and
Gardent, Claire},
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.70",
pages = "993--1006",
}
@inproceedings{cripwell-etal-2023-context,
title = "Context-Aware Document Simplification",
author = {Cripwell, Liam and
Legrand, Jo{\"e}l and
Gardent, Claire},
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.834",
doi = "10.18653/v1/2023.findings-acl.834",
pages = "13190--13206",
}