
Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges

arXiv: 2405.15604

This is the official repository for the paper Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges.

This repository is under construction. Please be patient until we add more information. Thank you!

Methodology

The paper documents the detailed pipeline of this systematic literature review. The figure below gives an overview of how we sample the 244 works relevant to text generation.

Text Generation Tasks

Our literature review identifies the five most prominent areas related to text generation: open-ended text generation, summarization, translation, paraphrasing, and question answering.

| Task | Description |
| --- | --- |
| Open-ended text generation | Newly generated text is iteratively conditioned on the previous context. |
| Summarization | Generating a text from one or more texts, conveying the information in a shorter format. |
| Translation | Converting a source text in language A to a target language B. |
| Paraphrasing | Generating text that has (approximately) identical meaning but uses different words or structures. |
| Question answering | Takes a question as input text and outputs a streamlined answer or a list of possible answers. |

For each of these tasks, we identify major sub-tasks and relevant challenges.

Evaluation Metrics

We provide an overview of model-free and model-based automatic metrics as well as methodologies for human evaluation. The "Used" column gives the number of papers among our 136 filtered Semantic Scholar documents that consider the metric (by proposing, surveying, or applying it).

We find that model-free n-gram-based metrics are by far the most used metrics within the works we cover. Model-based approaches are usually employed in a hybrid manner, combining embeddings with rule-based methods. Several works use human evaluation to measure performance but often disregard inter-annotator agreement scores.

| Type | Category | Metric | Description | Used |
| --- | --- | --- | --- | --- |
| Model-free | N-gram | BLEU | Textual overlap between source and reference (precision). | 69 |
| Model-free | N-gram | ROUGE | Textual overlap between source and reference (recall). | 46 |
| Model-free | N-gram | METEOR | Textual overlap between source and reference (precision and recall). | 32 |
| Model-free | N-gram | CIDEr | Measures consensus on multiple reference texts. | 15 |
| Model-free | N-gram | chrF++ | Character-based F-score computed using n-grams. | 13 |
| Model-free | N-gram | Dist-n | Measures generation diversity by the percentage of distinct n-grams. | 8 |
| Model-free | N-gram | NIST | Alters BLEU to also consider n-gram informativeness. | 6 |
| Model-free | N-gram | Self-BLEU | Measures generation diversity by calculating BLEU between generated samples. | 2 |
| Model-free | Statistical | Perplexity | Fluency metric based on the likelihood of word sequences. | 23 |
| Model-free | Statistical | Word Error Rate | The rate of words that differ from a reference sequence, based on the Levenshtein distance. | 11 |
| Model-free | Graph | SPICE | Measures the semantic similarity of two texts by the distance between their scene graphs. | 6 |
| Model-based | Hybrid | BERTScore | Contextual token similarity to measure textual overlap. | 13 |
| Model-based | Hybrid | MoverScore | Uses contextualized embeddings and captures both intersection and deviation from the reference for a similarity score. | 6 |
| Model-based | Hybrid | Word Mover Distance | Distance metric to measure the dissimilarity of two texts. | 2 |
| Model-based | Trained | BLEURT | Models human judgment on text quality. | 4 |
| Model-based | Trained | BARTScore | Promptable metric that models human judgments on faithfulness besides precision and recall. | 3 |
| Human | Performance | Likert Scale | Humans choose on a scale, e.g., from 1 (horrible quality) to 5 (perfect quality). | 22 |
| Human | Performance | Pairwise Comparison | Humans choose the better of two samples. | 10 |
| Human | Performance | Turing Test | Quantifies how distinguishable human text is from machine-generated text. | 6 |
| Human | Performance | Binary | Humans answer binary questions with yes or no. | 3 |
| Human | Performance | Best-Worst Scaling | From a list of examples, humans are instructed to select the best and worst output. | 2 |
| Human | Agreement | Krippendorff Alpha | Measures the disagreement between annotators for nominal, ordinal, and metric data. | 4 |
| Human | Agreement | Fleiss Kappa | Measures the agreement on nominal data between a fixed number of annotators. | 4 |
| Human | Agreement | Pearson Correlation | Displays the agreement between annotators by measuring linear correlation. | 3 |
| Human | Agreement | Spearman Correlation | Displays monotonic relationships on ranked data. | 2 |
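
To make the table above concrete, the following minimal sketch (not part of this repository's pipeline) shows how some of the most used metrics could be computed with third-party packages: BLEU and chrF++ via sacrebleu, ROUGE via rouge-score, and Fleiss' kappa for inter-annotator agreement via statsmodels. The example texts and package choices are illustrative assumptions, not the tooling used in our review.

```python
# Illustrative sketch only -- not part of this repository.
# Assumes the third-party packages sacrebleu, rouge-score, numpy, and statsmodels.
import numpy as np
import sacrebleu
from rouge_score import rouge_scorer
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy system output and reference (normally one entry per test example).
hypotheses = ["the model generates a short summary of the article"]
references = [["the system produces a brief summary of the article"]]  # one reference stream

# Model-free n-gram metrics: BLEU and chrF++ (chrF with word n-grams, word_order=2).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)

# ROUGE-1 and ROUGE-L for a single hypothesis/reference pair.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0][0], hypotheses[0])

# Inter-annotator agreement: Fleiss' kappa over nominal ratings
# (rows = rated items, columns = annotators, values = category labels).
ratings = np.array([[1, 1, 2],
                    [0, 0, 0],
                    [2, 2, 1],
                    [1, 1, 1]])
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table)

print(f"BLEU:          {bleu.score:.2f}")
print(f"chrF++:        {chrf.score:.2f}")
print(f"ROUGE-L (F1):  {rouge['rougeL'].fmeasure:.3f}")
print(f"Fleiss' kappa: {kappa:.3f}")
```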

Pipeline

Remarks

The number of works that we retrieve through our systematic pipeline drops off after 2022 (see the figure below). The average number of citations also drops, indicating that citation counts are not a good proxy for the relevance of recent papers. To improve coverage, we do not apply the citation criterion to papers from 2022 onward and supplement our selection with additional, more recent works.

Manual Filtering

We reduced 279 papers to 136 papers by reading their titles and abstracts. You can download a PDF containing our relevance judgments here.


Setup

We recommend using Python 3.10 for this project.

First, install the requirements: pip install -r requirements.txt


Code

The project includes multiple scripts, each covering a separate part of the pipeline.

  1. setup.py: Defines the parameters used for searching and filtering the scientific works.
  2. tokens.py: Stores the API key required to use the Semantic Scholar API.
  3. search.py: The initial retrieval of scientific works through the Semantic Scholar API.
  4. filter.py: The automated filtering process that selects the top five works per query and year by influential citation count (see the sketch below).
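
The sketch below illustrates what such a retrieval-and-filter step can look like. It is not the code from search.py or filter.py; the endpoint, query string, and field names are assumptions based on the public Semantic Scholar Graph API, and the top-five threshold mirrors the filtering rule described above.

```python
# Illustrative sketch only -- NOT the code in search.py / filter.py.
import requests

API_KEY = "YOUR_SEMANTIC_SCHOLAR_API_KEY"  # in this repo, the key lives in tokens.py
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def search_papers(query: str, year: str, limit: int = 100) -> list[dict]:
    """Retrieve candidate papers for one search query and publication year."""
    params = {
        "query": query,
        "year": year,
        "limit": limit,
        "fields": "title,year,citationCount,influentialCitationCount",
    }
    response = requests.get(SEARCH_URL, params=params, headers={"x-api-key": API_KEY})
    response.raise_for_status()
    return response.json().get("data", [])


def top_by_influence(papers: list[dict], k: int = 5) -> list[dict]:
    """Keep the k papers with the highest influential citation count."""
    ranked = sorted(papers, key=lambda p: p.get("influentialCitationCount") or 0, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical query; the actual queries are defined in setup.py.
    candidates = search_papers("text summarization", "2021")
    for paper in top_by_influence(candidates):
        print(paper["year"], paper["influentialCitationCount"], paper["title"])
```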

Citation

If you use this repository or our paper in your research, please cite us as follows.

@misc{becker2024text,
      title={Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges}, 
      author={Jonas Becker and Jan Philip Wahle and Bela Gipp and Terry Ruas},
      year={2024},
      eprint={2405.15604},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
