# Overview

Large language models (LLMs) can now generate text that closely resembles human writing. In October 2023, a competition on detecting AI-generated text was launched on Kaggle (https://www.kaggle.com/competitions/llm-detect-ai-generated-text). The competition seeks to foster open research and transparency in AI detection techniques applicable to real-world scenarios.

The challenge at hand involves developing a robust machine learning model capable of accurately discerning whether an essay was penned by a student or generated by an LLM. The competition dataset is a diverse compilation of student-written essays alongside essays produced by various LLMs. However, upon preliminary analysis, it is evident that the training set contains a limited number of AI-generated texts. To address this limitation, I will supplement the dataset with additional text sources that are comparable to LLM-generated content.

In this notebook, I explore the competition dataset and the external datasets that can supplement it, with the goal of assembling training data that improves the model's performance.

In [2]:
import requests
import json
import pickle
import pandas as pd
import numpy as np
import seaborn as sns
import re
import math
import string
from collections import Counter
import matplotlib.pyplot as plt
import os
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings('ignore')

#### Original training set

In [13]:
path = './data/llm-detect-ai-generated-text/train_essays.csv'
train_essays = pd.read_csv(path)

The original training set from the competition (*train_essays*) consists of the following columns:
- **id** - a unique identifier for each essay,
- **prompt_id** - identifies the prompt the essay was written in response to,
- **text** - the essay text itself,
- **generated** - the target field, which indicates whether the essay was written by a student (`0`) or generated by an LLM (`1`).

In [18]:
pd.set_option('display.max_colwidth', 120)
train_essays.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,"Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first Model...",0
1,005db917,0,"Transportation is a large necessity in most countries worldwide. With no doubt, cars, buses, and other means of tran...",0
2,008f63e3,0,"""America's love affair with it's vehicles seems to be cooling"" says Elisabeth rosenthal. To understand rosenthal's p...",0
3,00940276,0,How often do you ride in a car? Do you drive a one or any other motor vehicle to work? The store? To the mall? Have ...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps one of the worlds greatest advancements and technologies. Cars get us f...,0


The training prompts are stored in a separate file, which consists of the following columns:
- **prompt_id** - a unique identifier for each prompt,
- **prompt_name** - the title of the prompt,
- **instructions** - the instructions given to students,
- **source_text** - the text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two.`. Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.
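
The `source_text` layout described above can be parsed with a few lines of Python. The following is a minimal sketch, not part of the competition materials: the helper name `split_source_text` and the sample string are hypothetical, and it assumes that blocks are separated by blank lines and that an author, when present, follows the last ` by ` in the title.

```python
import re

def split_source_text(source_text):
    """Split Markdown source text into articles with numbered paragraphs.

    Hypothetical helper: assumes blocks are separated by blank lines.
    """
    articles = []
    current = None
    for block in source_text.split('\n\n'):
        block = block.strip()
        if not block:
            continue
        if block.startswith('# '):
            # A level-one heading starts a new article: "# Title by Author"
            heading = block[2:].strip()
            title, sep, author = heading.rpartition(' by ')
            if not sep:
                # No " by " in the heading, so no author is indicated
                title, author = heading, None
            current = {'title': title, 'author': author, 'paragraphs': {}}
            articles.append(current)
            continue
        # Significant paragraphs start with their numeral, e.g. "0 Paragraph one."
        match = re.match(r'(\d+)\s+(.*)', block, flags=re.DOTALL)
        if match and current is not None:
            current['paragraphs'][int(match.group(1))] = match.group(2)
        # Subheadings and unnumbered paragraphs are skipped in this sketch
    return articles

# Made-up sample in the documented format
sample = '# Title by Some Author\n\n0 Paragraph one.\n\n1 Paragraph two.'
articles = split_source_text(sample)
print(articles[0]['title'], '|', articles[0]['author'])
print(articles[0]['paragraphs'])
```

Note that a title which itself contains ` by ` but names no author would be split incorrectly, so the author field should be treated as a best guess.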

In [10]:
pd.set_option('display.max_colwidth', 200)

path = './data/llm-detect-ai-generated-text/train_prompts.csv'
train_prompts = pd.read_csv(path)
train_prompts

Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow citizens about the advantages of limiting car usage. Your essay must be based on ideas and information that can be found in the passage set. Manage your...,"# In German Suburb, Life Goes On Without Cars by Elisabeth Rosenthal\n\n1 VAUBAN, Germany—Residents of this upscale community are suburban pioneers, going where few soccer moms or commuting execut..."
1,1,Does the electoral college work?,Write a letter to your state senator in which you argue in favor of keeping the Electoral College or changing to election by popular vote for the president of the United States. Use the informatio...,"# What Is the Electoral College? by the Office of the Federal Register\n\n1 The Electoral College is a process, not a place. The founding fathers established it in the Constitution as a compromise..."


For now let's focus on *train_essays* and check some simple statistics.

In [15]:
print(f'Number of essays in train set: {len(train_essays)}')

Number of essays in train set: 1378


In [17]:
num_generated = train_essays['generated'].value_counts()
print(f'Number of student and LLM generated essays:\n{num_generated}')

Number of student and LLM generated essays:
generated
0    1375
1       3
Name: count, dtype: int64


As mentioned earlier, the number of LLM-generated essays in the training set is very limited. The entire dataset consists of 1378 records, of which only three are AI-generated. That is about 0.2 percent of all records (3 / 1378 ≈ 0.0022), a severe class imbalance. Fortunately, the community has addressed this issue by building supplementary datasets that make effective model training possible.
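
The class shares can be checked directly with `value_counts(normalize=True)`. The snippet below is a small sketch that rebuilds the `generated` column from the counts printed above instead of reading the CSV, so the series `generated` is a stand-in for `train_essays['generated']`.

```python
import pandas as pd

# Stand-in for train_essays['generated'], rebuilt from the counts shown above
generated = pd.Series([0] * 1375 + [1] * 3, name='generated')

# normalize=True returns class shares instead of raw counts
class_share = generated.value_counts(normalize=True) * 100
llm_share = class_share.loc[1]

print(f'Share of LLM-generated essays: {llm_share:.2f}%')  # 0.22%
```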

A noteworthy example of such a dataset is the **V2 release of the DAIGT train dataset** (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset). This updated version adds new prompts and essays generated by additional language models. The rest of this section examines what the community added to create a larger and more representative training set for the competition.

#### DAIGT v2

In [20]:
path = './data/daigt_v2/train_v2_drcat_02.csv'
data = pd.read_csv(path)

In [21]:
pd.set_option('display.max_colwidth', 120)
data.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on their phone. They are always on their phone more than 5 hours a day no s...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or should not be able to use electronic devices while operating a vehicle....,0,Phones and driving,persuade_corpus,False
2,"Driving while the use of cellular devices\n\nToday, most of the society is thoughtless. Especially new drivers, all ...",0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able to use phones while operating a vehicle. Drivers who used their phone...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe ability to stay connected to people we know despite distance was originall...,0,Phones and driving,persuade_corpus,False


In [22]:
print(f'Number of essays in dataset: {len(data)}')

Number of essays in dataset: 44868


In [23]:
num_generated = data['label'].value_counts()
print(f'Number of student and LLM generated essays:\n{num_generated}')

Number of student and LLM generated essays:
label
0    27371
1    17497
Name: count, dtype: int64


In [24]:
num_prompts = data['prompt_name'].value_counts()
print(f'Prompts used:\n{num_prompts}')

Prompts used:
prompt_name
Distance learning                        5554
Seeking multiple opinions                5176
Car-free cities                          4717
Does the electoral college work?         4434
Facial action coding system              3084
Mandatory extracurricular activities     3077
Summer projects                          2701
Driverless cars                          2250
Exploring Venus                          2176
Cell phones at school                    2119
Grades for extracurricular activities    2116
Community service                        2092
"A Cowboy Who Rode the Waves"            1896
The Face on Mars                         1893
Phones and driving                       1583
Name: count, dtype: int64


The dataset was improved using the following sources:
- text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset),
- persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/),
- text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b),
- text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays),
- 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic),
- LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai),
- official train essays,
- essays generated by Darek Kłeczek with various LLMs.

In [25]:
num_sources = data['source'].value_counts()
print(f'Sources:\n{num_sources}')

Sources:
source
persuade_corpus                       25996
mistral7binstruct_v1                   2421
mistral7binstruct_v2                   2421
chat_gpt_moth                          2421
llama2_chat                            2421
kingki19_palm                          1384
train_essays                           1378
llama_70b_v1                           1172
falcon_180b_v1                         1055
darragh_claude_v6                      1000
darragh_claude_v7                      1000
radek_500                               500
NousResearch/Llama-2-7b-chat-hf         400
mistralai/Mistral-7B-Instruct-v0.1      400
cohere-command                          350
palm-text-bison1                        349
radekgpt4                               200
Name: count, dtype: int64
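
To see which sources contribute student essays and which contribute LLM-generated ones, the `source` and `label` columns can be cross-tabulated. The sketch below uses a hypothetical toy frame in place of `data`, so it runs without the CSV. The label values in the toy rows are illustrative assumptions only and do not reflect the real per-source counts.

```python
import pandas as pd

# Toy stand-in for `data`; rows and labels are made up for illustration
toy = pd.DataFrame({
    'source': ['persuade_corpus', 'persuade_corpus', 'llama2_chat',
               'train_essays', 'train_essays'],
    'label': [0, 0, 1, 0, 1],
})

# Rows: source, columns: label (0 = student, 1 = LLM), cells: essay counts
source_by_label = pd.crosstab(toy['source'], toy['label'])
print(source_by_label)
```

With the real dataset loaded, the same call would be `pd.crosstab(data['source'], data['label'])`.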
