# Human and synthetic text dataset wrangling

## Introduction

We have five target datasets. The plan is to download each one and save it locally, read it into Python as appropriate, and combine them all into a unified dataset. Ideally, running this notebook from a clone of this repo should produce the base dataset used for perplexity scoring. Here are the target datasets:

1. [Hans 2024](https://github.com/ahans30/Binoculars/tree/main), referred to as `hans`. Source: GitHub.
2. [AI vs human text](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text), referred to as `gerami`. Source: Kaggle.
3. [Human vs. LLM text corpus](https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus), referred to as `grinberg`. Source: Kaggle.
4. [Human-ChatGPT texts](https://github.com/HarshOza36/Detection-Of-Machine-Generated-Text/tree/master), referred to as `gaggar`. Source: GitHub.
5. [ai-text-detection-pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile), referred to as `yatsenko`. Source: HuggingFace.

## Notebook setup

In [1]:
# Change working directory to parent so we can import as we would
# from the perplexity ratio score root directory
%cd ..

# Standard library imports
import os.path
import zipfile
import urllib.request
from pathlib import Path
from itertools import product

# PyPI imports
import kaggle
from datasets import load_dataset, utils

# Internal imports
import configuration as config

/mnt/arkk/llm_detector/perplexity_ratio_score


## 1. Raw data acquisition

First, download the raw data from each source so that we have a local copy archived.
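
The download-only-if-missing pattern repeated in the cells below could be factored into one helper. This is a minimal sketch, not the code this notebook runs: `download_if_missing` and the `.part` temporary suffix are hypothetical names introduced here, and writing to a temporary name first is an added precaution so that an interrupted transfer does not leave a partial file that passes the existence check.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, output_file: str) -> bool:
    """Download url to output_file unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    target = Path(output_file)

    # Skip the download when we already have a local copy
    if target.is_file():
        return False

    # Make sure the parent directory exists before writing
    target.parent.mkdir(parents=True, exist_ok=True)

    # Download under a temporary name, then rename once the transfer finishes
    partial = target.with_name(target.name + '.part')
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(target)

    return True
```

Calling the helper a second time with the same `output_file` returns `False` without touching the network, which is the behavior the per-dataset cells rely on.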

### 1.1. Hans

This dataset comes as six JSON-lines formatted files, one for each combination of data source and generating model.

In [2]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/hans'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Data source info
generating_models=['falcon7','llama2_13']
data_sources=['cnn','cc_news','pubmed']
base_url='https://raw.githubusercontent.com/ahans30/Binoculars/refs/heads/main/datasets/core'

# Loop on generating models and data sources, downloading files for each
for generating_model, data_source in product(generating_models, data_sources):
    output_file=f'{output_directory}/{generating_model}-{data_source}.jsonl'

    # Only download the file if we don't already have it
    if Path(output_file).is_file() is False:
        data_url=f'{base_url}/{data_source}/{data_source}-{generating_model}.jsonl'
        download_result=urllib.request.urlretrieve(data_url, output_file)
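
Once the files are on disk, each one can be read line by line, because JSON-lines stores one JSON object per line. The sketch below is an illustration rather than this notebook's loader: `read_jsonl` is a hypothetical helper, and it makes no assumption about which fields the Hans records contain.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON record per non-empty line of a JSON-lines file."""
    with open(path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            line = line.strip()

            # Skip blank lines, such as a trailing newline at the end of the file
            if line:
                yield json.loads(line)
```

Because the function is a generator, wrapping it in `list(...)` loads a whole file, while iterating over it directly keeps only one record in memory at a time.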

### 1.2. Gerami

In [8]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/gerami'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output file
output_file=f'{output_directory}/ai-vs-human-text.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    kaggle.api.dataset_download_files('shanegerami/ai-vs-human-text', path=output_directory)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)

Dataset URL: https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text


### 1.3. Grinberg

In [10]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/grinberg'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output file
output_file=f'{output_directory}/human-vs-llm-text-corpus.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    kaggle.api.dataset_download_files('starblasters8/human-vs-llm-text-corpus', path=output_directory)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)

Dataset URL: https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus


### 1.4. Gaggar

In [13]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/gaggar'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# File IO locations
data_url='https://github.com/HarshOza36/Detection-Of-Machine-Generated-Text/raw/refs/heads/master/data/Final%20Dataset.zip'
output_file=f'{output_directory}/data.zip'

# Only download the file if we don't already have it
if Path(output_file).is_file() is False:
    download_result=urllib.request.urlretrieve(data_url, output_file)

    # Unzip the data
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall(output_directory)
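
The zip-based cells above extract the archive only inside the download branch, so an archive that downloaded but was never unpacked would be skipped on a rerun. One possible way to decouple the two steps is sketched below. `extract_if_needed` is a hypothetical helper, not part of this notebook, and it assumes that a member whose path already exists under the output directory has been fully extracted.

```python
import zipfile
from pathlib import Path


def extract_if_needed(zip_path, output_directory):
    """Extract archive members that are not yet present in output_directory.

    Returns the list of member names that were extracted.
    """
    output_directory = Path(output_directory)
    extracted = []

    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for member in zip_ref.namelist():

            # Only unpack members we do not already have on disk
            if not (output_directory / member).exists():
                zip_ref.extract(member, output_directory)
                extracted.append(member)

    return extracted
```

Running it after the download check, rather than inside it, makes the extraction step safe to repeat: a second call returns an empty list.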

### 1.5. Yatsenko

In [6]:
# Set up output directory
output_directory=f'{config.RAW_DATA_PATH}/yatsenko'
Path(output_directory).mkdir(parents=True, exist_ok=True)

# Output directory for the data
output_file=f'{output_directory}/data'

# Only download the dataset if we don't already have a saved copy
if Path(output_file).is_dir() is False:
    utils.disable_progress_bar()
    ds=load_dataset('artem9k/ai-text-detection-pile')

    # Save the dataset to disk
    ds.save_to_disk(output_file)