Abstraction not Memory: BERT and the English Article System

This repository contains code and data associated with the paper Abstraction not Memory: BERT and the English Article System.

If you make use of this package, please cite our work.

Summary

Article prediction is a task that has long defied accurate linguistic description. As such, this task is ideally suited to evaluate models on their ability to emulate native-speaker intuition. To this end, we compare the performance of native English speakers and pre-trained models on the task of article prediction set up as a three way choice (a/an, the, zero). Our experiments with BERT show that BERT outperforms humans on this task across all articles. In particular, BERT is far superior to humans at detecting the zero article, possibly because we insert them using rules that the deep neural model can easily pick up. More interestingly, we find that BERT tends to agree more with annotators than with the corpus when inter-annotator agreement is high but switches to agreeing more with the corpus as inter-annotator agreement drops. We contend that this alignment with annotators, despite being trained on the corpus, suggests that BERT is not memorising article use, but captures a high level generalisation of article use akin to human intuition.

Data

A sample of the data used for analysis in the paper is provided below. Since we make use of data from the British National Corpus, we are unable to publish this data in its entirety here.

Please contact us at htm43@bath.ac.uk for a copy of the data.

ID	Question	TrueLabel	ModelPrediction	Person1	Person2	Person3	Person4	Person5
1	Hun Sen sold out to the Vietnamese and is now selling out to Ø Thai businessmen as well, it says. Only _____ Khmer Rouge will fight to preserve Cambodia for ø Cambodians. It is actively supported by a small part of the population and tacitly tolerated by many more, out of either fear or a simple desire not to be involved in a political battle.	the	the	the	the	null	null	the
2	In Paris, its diplomats proved to be the most methodical, hardest working and best co - ordinated of all the Cambodians. _____ Coalition Government of Democratic Kampuchea (CGDK), made up of the Khmer Rouge, the Sihanoukists and the KPNLF, is recognised by the United Nations. It has Ø nine permanent diplomatic missions abroad."	the	the	the	the	the	a	the
3	Therefore we cannot return to the politics of 1975 to 1978,’ said the Khmer Rouge leader Khieu Samphan, in Paris. We need _____ liberal economy and a liberal democracy. This is the basis for Ø national unity."	a	a	a	a	a	a	a
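To make the column layout concrete, here is a minimal Python sketch that reads rows in the tab-separated format shown above and reports how often the model and each annotator match the corpus label (`TrueLabel`). It is not part of the RScripts folder. The question text is abbreviated, the helper names `load_rows` and `accuracy` are hypothetical, and the labels are compared as plain strings without interpreting values such as `null`.

```python
import csv
import io

# Header and labels copied from the sample above; question text abbreviated.
SAMPLE_TSV = (
    "ID\tQuestion\tTrueLabel\tModelPrediction\tPerson1\tPerson2\tPerson3\tPerson4\tPerson5\n"
    "1\tOnly _____ Khmer Rouge will fight ...\tthe\tthe\tthe\tthe\tnull\tnull\tthe\n"
    "2\t_____ Coalition Government of Democratic Kampuchea ...\tthe\tthe\tthe\tthe\tthe\ta\tthe\n"
    "3\tWe need _____ liberal economy ...\ta\ta\ta\ta\ta\ta\ta\n"
)

ANNOTATOR_COLUMNS = ["Person1", "Person2", "Person3", "Person4", "Person5"]


def load_rows(handle):
    """Read tab-separated rows into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps stray quotation marks in the question text as literal characters.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def accuracy(rows, column):
    """Fraction of rows where `column` matches the corpus label (TrueLabel)."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row[column] == row["TrueLabel"])
    return hits / len(rows)


rows = load_rows(io.StringIO(SAMPLE_TSV))
scores = {col: accuracy(rows, col) for col in ["ModelPrediction"] + ANNOTATOR_COLUMNS}
for col, score in scores.items():
    print(f"{col}: {score:.2f}")
```

With only three rows these numbers say nothing about the full dataset; the sketch only shows how the columns relate to one another.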

Data Analysis Scripts

The scripts used for data analysis can be found in the RScripts folder. These scripts require the following input files.

Please email htm43@bath.ac.uk for a copy of the following files:

all.csv -- All the data.
2.csv -- All instances where 2 annotators agreed.
3.csv -- All instances where 3 annotators agreed.
4.csv -- All instances where 4 annotators agreed.
tie.csv -- Instances where there was a 2-2 tie.

NOTE: The 5th annotator is used as a control and is not included in the analysis above.
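The split by agreement level can be sketched in Python as follows. This is an illustration only, not the procedure used to produce the files: the function name `agreement_bucket` is hypothetical, and it assumes that the first four annotators (Person1 to Person4) are the ones counted, that a 2-1-1 split counts as "2 annotators agreed", and that a 2-2 split is kept separate as a tie.

```python
from collections import Counter


def agreement_bucket(labels):
    """Return '4', '3', '2', or 'tie' for the labels of four annotators.

    Assumption: a 2-1-1 split is reported as '2' and a 2-2 split as 'tie'.
    """
    if len(labels) != 4:
        raise ValueError("expected exactly four annotator labels")
    counts = sorted(Counter(labels).values(), reverse=True)
    if counts == [2, 2]:
        return "tie"
    # Size of the largest group of annotators who chose the same label.
    return str(counts[0])


# Person1..Person4 labels from the three sample rows in the Data section.
print(agreement_bucket(["the", "the", "null", "null"]))  # row 1
print(agreement_bucket(["the", "the", "the", "a"]))      # row 2
print(agreement_bucket(["a", "a", "a", "a"]))            # row 3
```

With four annotators and three possible labels, at least two annotators always choose the same label, which is consistent with there being no file for lower agreement.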

These are the data and scripts used to generate Tables 2 and 3 in the paper.
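The comparison described in the Summary, whether the model sides with the annotators or with the corpus, can be sketched in Python as below. This is not a port of the R scripts and does not reproduce the paper's tables. The function names, the dictionary keys, and the toy rows are all hypothetical, and dropping rows without a unique majority label is an assumption made here for simplicity.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label, or None when the top count is shared."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def agreement_rates(rows):
    """Share of rows where the model matches the annotator majority and the corpus label."""
    with_majority = with_corpus = usable = 0
    for row in rows:
        majority = majority_label(row["annotators"])
        if majority is None:
            continue  # assumption: rows without a unique majority are skipped
        usable += 1
        with_majority += row["model"] == majority
        with_corpus += row["model"] == row["corpus"]
    if usable == 0:
        return None
    return {
        "annotators": with_majority / usable,
        "corpus": with_corpus / usable,
        "n": usable,
    }


# Toy rows invented for illustration; they are not taken from the dataset.
toy_rows = [
    {"corpus": "the", "model": "the", "annotators": ["the", "the", "the", "a"]},
    {"corpus": "a", "model": "the", "annotators": ["the", "the", "the", "the"]},
    {"corpus": "a", "model": "a", "annotators": ["the", "the", "null", "null"]},
]
rates = agreement_rates(toy_rows)
print(rates)
```

Running the same function separately on each agreement level would show how the two rates move relative to each other as agreement among annotators drops.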

Citation

If you make use of this work, please cite us:

@inproceedings{tayyar-madabushi-etal-2022-abstraction,
    title = "Abstraction not Memory: {BERT} and the {E}nglish Article System",
    author = "Tayyar Madabushi, Harish  and
      Divjak, Dagmar  and
      Milin, Petar",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.67",
    doi = "10.18653/v1/2022.naacl-main.67",
    pages = "924--931",
    abstract = "Article prediction is a task that has long defied accurate linguistic description. As such, this task is ideally suited to evaluate models on their ability to emulate native-speaker intuition. To this end, we compare the performance of native English speakers and pre-trained models on the task of article prediction set up as a three way choice (a/an, the, zero). Our experiments with BERT show that BERT outperforms humans on this task across all articles. In particular, BERT is far superior to humans at detecting the zero article, possibly because we insert them using rules that the deep neural model can easily pick up. More interestingly, we find that BERT tends to agree more with annotators than with the corpus when inter-annotator agreement is high but switches to agreeing more with the corpus as inter-annotator agreement drops. We contend that this alignment with annotators, despite being trained on the corpus, suggests that BERT is not memorising article use, but captures a high level generalisation of article use akin to human intuition.",
}
