GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content

This repository is no longer actively maintained. Please refer to our latest follow-up work:

Paper: Token Prediction as Implicit Classification to Identify LLM-Generated Text
Codebase: T5-Sentinel-public

Overview

📄 Link to Paper (arXiv) | 💾 Link to Dataset | 📦 Link to Checkpoint

This repository is the codebase for the paper GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content.

We collect and publish OpenGPTText, a high-quality dataset of approximately 30,000 text samples rephrased by gpt-3.5-turbo (ChatGPT).
We construct two detectors with different architectures: RoBERTa-Sentinel and T5-Sentinel.
T5-Sentinel achieves state-of-the-art performance (98% accuracy) on the OpenGPTText dataset.