# Human and synthetic text dataset wrangling

## Introduction

We have 5 target datasets. The plan is to download each one and save it locally, read it into Python and combine everything into a unified dataset. Ideally, running this notebook from a clone of this repo should produce the base dataset used for perplexity scoring. Here are the target datasets:

1. [Hans 2024](https://github.com/ahans30/Binoculars/tree/main), referred to as `hans`. Source: GitHub.
2. [AI vs human text](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text), referred to as `gerami`. Source: Kaggle.
3. [Human vs. LLM text corpus](https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus), referred to as `grinberg`. Source: Kaggle.
4. [Human-ChatGPT texts](https://github.com/HarshOza36/Detection-Of-Machine-Generated-Text/tree/master), referred to as `gaggar`. Source: GitHub.
5. [ai-text-detection-pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile), referred to as `yatsenko`. Source: HuggingFace.

## Notebook setup

In [1]:
# Change working directory to parent so we can import as we would
# from the perplexity ratio score root directory
%cd ..

# Standard library imports
import glob
import csv
import json
import os.path
import zipfile
import urllib.request
from pathlib import Path
from itertools import product

# PyPI imports
import pyarrow # pylint: disable=import-error
import kaggle # pylint: disable=import-error
import pandas as pd # pylint: disable=import-error
from datasets import load_dataset, utils # pylint: disable=import-error

# Internal imports
import configuration as config

/home/siderealyear/projects/llm_detector/perplexity_ratio_score


## 1. Raw data acquisition

First, download the raw data from each source so that we have a local copy archived.
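Each subsection below repeats the same "download only if we don't already have it" pattern. As a minimal sketch of that pattern in one place, the helper below wraps it in a function. The name `download_if_missing` is hypothetical and not part of this repo, and the demonstration uses a local `file://` URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(data_url: str, output_file: str) -> bool:
    '''Downloads data_url to output_file unless the file already exists.
    Returns True if a download happened, False if it was skipped.'''

    output_path = Path(output_file)

    # Skip the download if we already have a local copy
    if output_path.is_file():
        return False

    # Make sure the destination directory exists before writing
    output_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(data_url, str(output_path))

    return True


# Demonstrate with a local file:// URL so the sketch runs offline
with tempfile.TemporaryDirectory() as tmp_dir:
    source_file = Path(tmp_dir) / 'source.jsonl'
    source_file.write_text('{"text": "example"}\n', encoding='utf-8')
    target_file = Path(tmp_dir) / 'raw' / 'copy.jsonl'

    first_call = download_if_missing(source_file.as_uri(), str(target_file))
    second_call = download_if_missing(source_file.as_uri(), str(target_file))
    downloaded_text = target_file.read_text(encoding='utf-8')

print(first_call, second_call)
```

The second call is skipped because the target file exists after the first one, which is what makes re-running the notebook cheap.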

### 1.1. Hans

This dataset comes as 6 JSON-lines files, one for each combination of data source and generating model.

In [2]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/hans'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Data source info
hans_generating_models=['falcon7','llama2_13']
hans_data_sources=['cnn','cc_news','pubmed']
hans_base_url='https://raw.githubusercontent.com/ahans30/Binoculars/refs/heads/main/datasets/core'

# Loop on generating models and data sources, downloading files for each
for generating_model, data_source in product(hans_generating_models, hans_data_sources):
    output_file=f'{output_directory}/{generating_model}-{data_source}.jsonl'

    # Only download the file if we don't already have it
    if Path(output_file).is_file() is False:
        data_url=f'{hans_base_url}/{data_source}/{data_source}-{generating_model}.jsonl'
        download_result=urllib.request.urlretrieve(data_url, output_file)

### 1.2. Gerami

In [3]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/gerami'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output file
output_file=f'{output_directory}/ai-vs-human-text.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    kaggle.api.dataset_download_files('shanegerami/ai-vs-human-text', path=output_directory)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)

### 1.3. Grinberg

In [4]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/grinberg'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output file
output_file=f'{output_directory}/human-vs-llm-text-corpus.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    kaggle.api.dataset_download_files('starblasters8/human-vs-llm-text-corpus', path=output_directory)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)

Dataset URL: https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus


### 1.4. Gaggar

In [5]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/gaggar'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# File IO locations
data_url='https://github.com/HarshOza36/Detection-Of-Machine-Generated-Text/raw/refs/heads/master/data/Final%20Dataset.zip'
output_file=f'{output_directory}/data.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    download_result=urllib.request.urlretrieve(data_url, output_file)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)

### 1.5. Yatsenko

In [6]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/yatsenko'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output directory for the data
output_file=f'{output_directory}/data'

# Only download the file if we don't already have it
if Path(output_file).is_dir() is False:
    utils.disable_progress_bar()
    ds=load_dataset('artem9k/ai-text-detection-pile')

    # Save the dataset to disk
    ds.save_to_disk(output_file)

## 2. Data loading & parsing

Next, read in the data and parse it to a consistent format. The target is four columns: `text`, `synthetic` (0 for human, 1 for synthetic), `author` (human, or the model name) and `source` (the source dataset). The combined result is then saved to disk as JSON in section 3.
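To make the target layout concrete, here is a small sketch of the column-oriented holder and a helper that appends one record to every column at once, so the lists cannot drift out of alignment. The names `new_holder` and `add_record` are hypothetical. The cells below append to each list directly instead.

```python
def new_holder() -> dict:
    '''Returns an empty column-oriented holder for parsed texts.'''

    return {'text': [], 'synthetic': [], 'author': [], 'source': []}


def add_record(holder: dict, text: str, synthetic: int, author: str, source: str) -> None:
    '''Appends one record to all four columns together.'''

    holder['text'].append(text)
    holder['synthetic'].append(synthetic)
    holder['author'].append(author)
    holder['source'].append(source)


# Toy example: one human and one synthetic record from the same source
parsed_example = new_holder()
add_record(parsed_example, 'A sentence written by a person.', 0, 'human', 'hans')
add_record(parsed_example, 'A sentence written by a model.', 1, 'llama2_13', 'hans')

print({column: len(values) for column, values in parsed_example.items()})
```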

### 2.1. Hans

In [7]:
# Holder for results
parsed_text={
    'text': [],
    'synthetic': [],
    'author': [],
    'source': []
}

# Counters
human_texts=0
synthetic_texts=0

# Loop on the generating model and original text source
for generating_model, data_source in product(hans_generating_models, hans_data_sources):

    # Get the file path
    file_path=f'{config.RAW_DATA_PATH}/hans/{generating_model}-{data_source}.jsonl'

    # Loop on the JSON lines in the file, parsing each one
    with open(file_path) as input_file:
        for line in input_file:
            data=json.loads(line)

            # Get the generated text and add to parsed text
            parsed_text['source'].append('hans')
            parsed_text['synthetic'].append(1)
            parsed_text['author'].append(generating_model)

            if generating_model == 'llama2_13':
                text=data['meta-llama-Llama-2-13b-hf_generated_text_wo_prompt']

            elif generating_model == 'falcon7':
                text=data['-fs-cml-models-Falcon-falcon-7b_generated_text_wo_prompt']

            parsed_text['text'].append(text)

            synthetic_texts+=1

            # Get the human text and add to parsed text
            parsed_text['source'].append('hans')
            parsed_text['synthetic'].append(0)
            parsed_text['author'].append('human')

            if 'article' in data.keys():
                text=data['article']

            elif 'text' in data.keys():
                text=data['text']

            parsed_text['text'].append(text)

            human_texts+=1

print(f'Parsed {human_texts + synthetic_texts} texts, {human_texts} human and {synthetic_texts} synthetic')

Parsed 22542 texts, 11271 human and 11271 synthetic


### 2.2. Gerami

In [8]:
# Data file path
file_path=f'{config.RAW_DATA_PATH}/gerami/AI_Human.csv'

# Counters
human_texts=0
synthetic_texts=0

# Read the file
with open(file_path, mode='r') as input_file:
    reader=csv.reader(input_file)

    # Loop on CSV rows, parsing each
    for i, row in enumerate(reader):

        # Skip the header row
        if i > 0:
            parsed_text['source'].append('gerami')

            if row[1] == '0.0':
                parsed_text['synthetic'].append(0)
                parsed_text['author'].append('human')
                human_texts+=1

            if row[1] == '1.0':
                parsed_text['synthetic'].append(1)
                parsed_text['author'].append('unknown_model')
                synthetic_texts+=1

            parsed_text['text'].append(row[0])
            
print(f'Parsed {human_texts + synthetic_texts} texts, {human_texts} human and {synthetic_texts} synthetic')

Parsed 487235 texts, 305797 human and 181438 synthetic


### 2.3. Grinberg

Note: the CSV file seems to have some bad quoting in it. Reading it with the `csv` module fails with `Error: field larger than field limit (131072)`, likely indicating an unterminated quotation in one of the texts. We read the parquet version of the data instead.
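For reference, the `field larger than field limit` error comes from the `csv` module's per-field size limit, so it can also appear when a single text is simply longer than 131072 characters. The sketch below reproduces the error on made-up data and shows the limit being raised with `csv.field_size_limit`. Whether the real file needs a higher limit or has a genuine quoting problem is not something this example can tell us.

```python
import csv
import io

# Made-up CSV: one row whose first field exceeds the default limit
long_text = 'a' * 200_000
csv_data = f'text,source\n{long_text},Human\n'

# Calling field_size_limit() with no argument returns the current limit
default_limit = csv.field_size_limit()

# With the default limit, parsing the long field fails
try:
    rows = list(csv.reader(io.StringIO(csv_data)))
    parse_error = None

except csv.Error as error:
    parse_error = str(error)

# Raise the limit, parse again, then restore the previous value
csv.field_size_limit(1_000_000)

try:
    rows = list(csv.reader(io.StringIO(csv_data)))

finally:
    csv.field_size_limit(default_limit)

print(parse_error)
print(len(rows), len(rows[1][0]))
```

If a file still fails with a much higher limit, a bad quotation swallowing many rows into one field becomes the more likely explanation.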

In [9]:
# Data file path
file_path=f'{config.RAW_DATA_PATH}/grinberg/data.parquet'

# Counters
human_texts=0
synthetic_texts=0

# Read the file into a Pandas dataframe
data_df=pd.read_parquet(file_path)
data_df.head()

# Extract texts and sources
texts=data_df['text'].to_list()
sources=data_df['source'].to_list()

# Loop on text and source lists, parse and add them to the results
for text, source in zip(texts, sources):
    parsed_text['source'].append('grinberg')

    if source == 'Human':
        parsed_text['synthetic'].append(0)
        parsed_text['author'].append('human')
        human_texts+=1

    else:
        parsed_text['synthetic'].append(1)
        parsed_text['author'].append('unknown_model')
        synthetic_texts+=1

    parsed_text['text'].append(text)

print(f'Parsed {human_texts + synthetic_texts} texts, {human_texts} human and {synthetic_texts} synthetic')

Parsed 788922 texts, 347692 human and 441230 synthetic


### 2.4. Gaggar

In [10]:
# Data file path
file_path=f'{config.RAW_DATA_PATH}/gaggar/Complete Dataset/FinalDataset.csv'

# Counters
human_texts=0
synthetic_texts=0

# Read the file
with open(file_path, mode='r') as input_file:
    reader=csv.reader(input_file)

    # Loop on CSV rows, parsing each
    for i, row in enumerate(reader):

        # Skip the header row
        if i > 0:
            parsed_text['source'].append('gaggar')

            if row[1] == '0':
                parsed_text['synthetic'].append(0)
                parsed_text['author'].append('human')
                human_texts+=1

            if row[1] == '1':
                parsed_text['synthetic'].append(1)
                parsed_text['author'].append('GPT-3.5-turbo')
                synthetic_texts+=1

            parsed_text['text'].append(row[0])
            
print(f'Parsed {human_texts + synthetic_texts} texts, {human_texts} human and {synthetic_texts} synthetic')

Parsed 776945 texts, 400015 human and 376930 synthetic


### 2.5. Yatsenko

Attempting to load the files with `pyarrow.ipc.open_file` results in `ArrowInvalid: Not an Arrow file`, so we load them with HuggingFace's *datasets* instead.

In [11]:
# Load the dataset
utils.disable_progress_bar()
dataset=load_dataset(f'{config.RAW_DATA_PATH}/yatsenko/data')

# Counters
human_texts=0
synthetic_texts=0

# Loop over and parse the dataset
for record in dataset['train']:

    parsed_text['source'].append('yatsenko')

    if record['source'] == 'human':
        parsed_text['synthetic'].append(0)
        parsed_text['author'].append('human')
        human_texts+=1

    if record['source'] == 'ai':
        parsed_text['synthetic'].append(1)
        parsed_text['author'].append('unknown_model')
        synthetic_texts+=1

    parsed_text['text'].append(record['text'])

print(f'Parsed {human_texts + synthetic_texts} texts, {human_texts} human and {synthetic_texts} synthetic')

Parsed 1392522 texts, 1028146 human and 364376 synthetic


## 3. Save the combined dataset

In [12]:
# Get some summary stats about the file
total_texts=len(parsed_text['synthetic'])
synthetic_texts=sum(parsed_text['synthetic'])
human_texts=total_texts - synthetic_texts
percent_synthetic=(synthetic_texts/total_texts)*100
percent_human=(human_texts/total_texts)*100

print(f'Have {total_texts} texts')
print(f' Human: {human_texts}({percent_human:.1f}%)')
print(f' Synthetic: {synthetic_texts}({percent_synthetic:.1f}%)')

Have 3468166 texts
 Human: 2092921(60.3%)
 Synthetic: 1375245(39.7%)
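The same kind of summary can be broken down by source dataset. The sketch below counts human and synthetic texts per source with `collections.Counter`, using a small made-up holder rather than the real `parsed_text`, so the numbers it prints are illustrative only.

```python
from collections import Counter

# Made-up stand-in for the 'source' and 'synthetic' columns
example = {
    'synthetic': [1, 0, 0, 1, 1],
    'source': ['hans', 'hans', 'gerami', 'gerami', 'gerami']
}

# Count each (source, label) pair in a single pass
counts = Counter(zip(example['source'], example['synthetic']))

# Report human (0) and synthetic (1) counts for each source
for source in sorted(set(example['source'])):
    human = counts[(source, 0)]
    synthetic = counts[(source, 1)]
    print(f'{source}: {human} human, {synthetic} synthetic')
```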


In [13]:
# Set up output directory
output_directory=f'{config.INTERMEDIATE_DATA_PATH}'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Save it as JSON
with open(f'{output_directory}/all_texts.json', 'w', encoding='utf-8') as output_file:
    json.dump(parsed_text, output_file, ensure_ascii=False, indent=4)

Final file is 7.29 GB on disk.
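Some of that size is whitespace: with `indent=4`, every list element is written on its own indented line. The sketch below compares indented and compact serialization on made-up data. How much the real file would shrink has not been measured here, since the texts themselves make up most of the bytes.

```python
import json

# Made-up column-oriented data with many short entries
example = {
    'text': ['first text', 'second text'] * 500,
    'synthetic': [0, 1] * 500
}

# Same settings as the save cell above
indented = json.dumps(example, ensure_ascii=False, indent=4)

# Compact form: no indentation, no spaces after separators
compact = json.dumps(example, ensure_ascii=False, separators=(',', ':'))

indented_bytes = len(indented.encode('utf-8'))
compact_bytes = len(compact.encode('utf-8'))

print(f'Indented: {indented_bytes} bytes, compact: {compact_bytes} bytes')
```

Both forms decode to the same data, so the choice only trades file size against human readability.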