This repo contains the notebook and generated texts created for DAGPap22: Detecting Automatically Generated Scientific Papers, a shared task at SDP 2022 (COLING 2022). Our work compares several transformer-based models and explores additional datasets and techniques for handling imbalanced classes. As our final submission, we used an ensemble of SciBERT, RoBERTa, and DeBERTa fine-tuned with random oversampling. The model achieved an F1-score of 99.24%, and the official evaluation placed our system third.
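For readers unfamiliar with the technique, random oversampling duplicates minority-class examples at random until every class matches the majority class in size. A minimal sketch (the function name and details here are ours, not code from this repo):

```python
import random

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    is as frequent as the largest one (random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = rng.choices(items, k=target - len(items))
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels
```

Oversampling is applied to the training split only, so the duplicated examples never leak into evaluation.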
The corresponding paper is: Anna Glazkova, Maksim Glazkov. 2022. Detecting Generated Scientific Papers using an Ensemble of Transformer Models. In Proceedings of the Third Workshop on Scholarly Document Processing. Association for Computational Linguistics.
The repository includes:
- our code for fine-tuning BERT-based models;
- additional data we generated using back translation and GPT-2.
These pickle files can be loaded with:

```python
import pickle

with open('backtranslation_kp20k.pickle', 'rb') as f:
    generated_texts = pickle.load(f)
```
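As a rough illustration of how the back translation data was produced, the sketch below paraphrases a text by translating it into a pivot language and back. The translator callables are placeholders (our assumption, not this repo's code); in practice they would wrap real translation models, e.g. MarianMT pipelines from Hugging Face Transformers.

```python
def back_translate(text, translate_fwd, translate_back):
    """Paraphrase `text` via back translation: translate it into a
    pivot language, then translate the result back into the source
    language. Wording tends to drift, yielding a label-preserving
    paraphrase usable as augmented training data.

    `translate_fwd` / `translate_back` are placeholder callables;
    real usage would plug in actual translation models.
    """
    pivot = translate_fwd(text)
    return translate_back(pivot)
```

With real models, `translate_fwd` might translate English to German and `translate_back` German to English.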