This notebook creates new train and test splits from Guo et al.'s HC3 dataset on Hugging Face.
Their CSV file on Google Drive appears to mix up questions and answers (see [this GitHub issue](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/issues/30)).
That mix-up is not directly relevant here because the questions are not used, but it is hard to verify whether other issues were introduced when those splits were generated.
This project does not intend to benchmark any detectors.



In [1]:
# word-count bounds (inclusive) for the length filter below; human responses tend to be shorter in this dataset on average
MAX_WORDS = 150
MIN_WORDS = 100

In [2]:
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split

h3_dataset_hf_raw = load_dataset("Hello-SimpleAI/HC3", name="all")

In [3]:
list(h3_dataset_hf_raw["train"].info.features.keys())

['id', 'question', 'human_answers', 'chatgpt_answers', 'source']

In [4]:
# one row per question, with list-valued answer columns
h3_dataset_hf = pd.DataFrame(h3_dataset_hf_raw["train"], columns=list(h3_dataset_hf_raw["train"].info.features.keys()))
# explode both list columns so that each cell holds a single answer;
# exploding one column after the other repeats answers (duplicates are dropped in a later cell)
h3_dataset_hf = h3_dataset_hf.explode("human_answers").explode("chatgpt_answers")
# reshape to long format: one row per answer, with "author" naming the column it came from
h3_dataset_hf = pd.melt(h3_dataset_hf, id_vars=["question", "source", "id"], value_vars=["human_answers", "chatgpt_answers"], value_name="answer", var_name="author")
#h3_dataset_hf["label"] = h3_dataset_hf["author"] == "chatgpt_answers"
#h3_dataset_hf["label"] = h3_dataset_hf["label"].astype(int)
h3_dataset_hf["id"] = h3_dataset_hf["id"].astype(int)
# empty answer lists become NaN after explode; drop those rows
h3_dataset_hf = h3_dataset_hf.dropna(subset=["answer"])

# remove newline characters, as the CSV files do not contain any
h3_dataset_hf["answer"] = h3_dataset_hf["answer"].replace(r'\n', '', regex=True)
h3_dataset_hf["question"] = h3_dataset_hf["question"].replace(r'\n', '', regex=True)
h3_dataset_hf

Unnamed: 0,question,source,id,author,answer
0,"Why is every book I hear about a "" NY Times # ...",reddit_eli5,0,human_answers,"Basically there are many categories of "" Best ..."
1,"Why is every book I hear about a "" NY Times # ...",reddit_eli5,0,human_answers,"If you 're hearing about it , it 's because it..."
2,"Why is every book I hear about a "" NY Times # ...",reddit_eli5,0,human_answers,"One reason is lots of catagories . However , h..."
3,"If salt is so bad for cars , why do we use it ...",reddit_eli5,1,human_answers,salt is good for not dying in car crashes and ...
4,"If salt is so bad for cars , why do we use it ...",reddit_eli5,1,human_answers,"In Minnesota and North Dakota , they tend to u..."
...,...,...,...,...,...
123171,Is rise in pressure from 116/66 to 140/80 norm...,medicine,24317,chatgpt_answers,It's not uncommon for blood pressure to fluctu...
123172,What could cause a painless lump in the right ...,medicine,24318,chatgpt_answers,There are several possible causes of a painles...
123173,Can Acutret be given to a child for treatment ...,medicine,24319,chatgpt_answers,It is not appropriate for me to recommend a sp...
123174,Are BP of 119/65 and pulse of 35 causes for co...,medicine,24320,chatgpt_answers,It is not uncommon for people with rheumatoid ...
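
The reshaping in the cell above can be traced on a tiny example. The following is a minimal sketch, assuming a made-up frame with a single question (the values `h1`, `h2`, `c1` and the source `toy` are hypothetical); it shows why the long format contains repeated answers that need deduplication later on.

```python
import pandas as pd

# Toy frame (hypothetical values): one question, two human answers, one ChatGPT answer
toy = pd.DataFrame({
    "id": ["0"],
    "question": ["q"],
    "human_answers": [["h1", "h2"]],
    "chatgpt_answers": [["c1"]],
    "source": ["toy"],
})

# Exploding one column after the other pairs every human answer with every ChatGPT answer
wide = toy.explode("human_answers").explode("chatgpt_answers")

# Melting stacks both answer columns into one, so "c1" appears once per human answer
long_df = pd.melt(
    wide,
    id_vars=["question", "source", "id"],
    value_vars=["human_answers", "chatgpt_answers"],
    value_name="answer",
    var_name="author",
)
print(long_df[["author", "answer"]])
```

Here the single ChatGPT answer ends up in two rows, one for each human answer it was paired with.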


In [5]:
# keep only answers with MIN_WORDS to MAX_WORDS whitespace-separated words (inclusive),
# as human responses tend to be shorter in this dataset on average
doc_within_range = h3_dataset_hf["answer"].str.split().str.len().between(MIN_WORDS, MAX_WORDS)
df_min_max_len = h3_dataset_hf[doc_within_range]
df_min_max_len

Unnamed: 0,question,source,id,author,answer
0,"Why is every book I hear about a "" NY Times # ...",reddit_eli5,0,human_answers,"Basically there are many categories of "" Best ..."
4,"If salt is so bad for cars , why do we use it ...",reddit_eli5,1,human_answers,"In Minnesota and North Dakota , they tend to u..."
9,Why has nobody assassinated Kim Jong - un He i...,reddit_eli5,3,human_answers,You ca n't just go around assassinating the le...
14,How was airplane technology able to advance so...,reddit_eli5,4,human_answers,The importance of the Wright Brothers and othe...
21,What has changed that we frequently now throw ...,reddit_eli5,7,human_answers,It 's three fold : * Stuff is cheaper to mass ...
...,...,...,...,...,...
123170,What is the treatment for presence of breasts ...,medicine,24316,chatgpt_answers,"I'm sorry, but I am an AI language model and d..."
123171,Is rise in pressure from 116/66 to 140/80 norm...,medicine,24317,chatgpt_answers,It's not uncommon for blood pressure to fluctu...
123173,Can Acutret be given to a child for treatment ...,medicine,24319,chatgpt_answers,It is not appropriate for me to recommend a sp...
123174,Are BP of 119/65 and pulse of 35 causes for co...,medicine,24320,chatgpt_answers,It is not uncommon for people with rheumatoid ...
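
As a quick check of the boundary behaviour of the length filter, here is a small sketch on made-up answers (the repeated token `w` is purely illustrative). It counts whitespace-separated tokens and compares an explicit two-sided comparison with `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

MIN_WORDS, MAX_WORDS = 100, 150  # same bounds as in the notebook

# Toy answers (hypothetical): 99, 100, 150 and 151 whitespace-separated tokens
answers = pd.Series([" ".join(["w"] * n) for n in (99, 100, 150, 151)])

word_counts = answers.str.split().str.len()

# Explicit comparison of both bounds
mask_explicit = word_counts.apply(lambda n: n <= MAX_WORDS and n >= MIN_WORDS)
# Equivalent vectorised form
mask_between = word_counts.between(MIN_WORDS, MAX_WORDS)

print(mask_explicit.tolist())
print(mask_between.tolist())
```

Answers with exactly 100 or 150 words are kept; 99 and 151 fall outside the range.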


In [6]:
# filter out responses that contain some of the "indicating words" provided by Guo et al.
# note that only a subset is used here: for their filtered version, they also remove certain stock phrases like "There are several ways"


indicating_words_chatgpt_en = [
    "AI assistant",
    "AI language model",
    "I'm sorry", # e.g. ... but I am not a medical doctor; but I am an AI language; to hear about your husband's symptoms 
    "It is difficult for me",
    "Contents may violate our content",
    "This content may violate our content policy",
    "Can you please provide the statement",
    "If you have any more questions, please don't hesitate to ask.",
    "If you have any questions about",
    "Let me know if you have any other questions",
    "If you have any more questions, feel free to ask!",
    "!\rnetwork error\r\r\r\r", # loris
    "Free Research Preview.", # loris
    "Your feedback will help us improve.", # loris
    ]

remove = df_min_max_len["answer"].str.contains('|'.join(indicating_words_chatgpt_en), regex=True)

print("Removing {} documents, specifically:".format(len(df_min_max_len[remove])))
display(df_min_max_len[remove]["author"].value_counts())

df = df_min_max_len[~remove]

Removing 729 documents, specifically:


author
chatgpt_answers    728
human_answers        1
Name: count, dtype: int64
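
One caveat about the pattern built above: the phrases are joined into a regular expression without escaping, so characters such as `.` act as wildcards. The sketch below uses only the standard `re` module and a made-up test string to show the difference that `re.escape` would make. Whether escaping changes the number of removed documents in this dataset has not been checked here.

```python
import re

# Two of the phrases from the list above
phrases = ["Free Research Preview.", "I'm sorry"]

unescaped = "|".join(phrases)
escaped = "|".join(re.escape(p) for p in phrases)

# Hypothetical text that ends in "!" instead of "."
text = "Free Research Preview!"

print(bool(re.search(unescaped, text)))  # the unescaped "." matches any character
print(bool(re.search(escaped, text)))    # the escaped "." only matches a literal dot
```

An escaped pattern built this way can be passed to `str.contains` in the same manner as the unescaped one.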

In [7]:
len_before = len(df)
df = df.drop_duplicates(subset=["question", "answer", "author"])

# Some human answers are repeated under different questions (ChatGPT answers are not), so also deduplicate regardless of the question
df = df.drop_duplicates(subset=["answer", "author"])
print("Dropped {} duplicates (was {})".format(len_before - len(df), len_before))


Dropped 9517 duplicates (was 23636)


In [8]:
df[df.duplicated(subset=["answer"])].sort_values(by="answer")

Unnamed: 0,question,source,id,author,answer


In [9]:
def get_equal_numbers_of_answers(group):
    human_answers = group[group["author"] == "human_answers"]
    chatgpt_answers = group[group["author"] == "chatgpt_answers"]
    n = min(len(human_answers), len(chatgpt_answers))
    return pd.concat([human_answers.sample(n, random_state=42), chatgpt_answers.sample(n, random_state=42)])
    

In [10]:
# want a balanced dataset: sample pairs of answers to the same question

df = df.groupby(["question"]).apply(get_equal_numbers_of_answers)

In [11]:
print("{} human answers, {} chat".format(len(df[df["author"] == "human_answers"]), len(df[df["author"] == "chatgpt_answers"])))
assert len(df[df["author"] == "human_answers"]) == len(df[df["author"] == "chatgpt_answers"]), "Sampling not balanced"

1536 human answers, 1536 chat
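
To illustrate what the per-question balancing does, here is a minimal sketch on a made-up frame (questions `q1` to `q3` and their answers are hypothetical). It reuses the function from above but, as an assumption made for this example only, loops over the groups explicitly instead of calling `groupby(...).apply(...)`.

```python
import pandas as pd

def get_equal_numbers_of_answers(group):
    human_answers = group[group["author"] == "human_answers"]
    chatgpt_answers = group[group["author"] == "chatgpt_answers"]
    n = min(len(human_answers), len(chatgpt_answers))
    return pd.concat([human_answers.sample(n, random_state=42), chatgpt_answers.sample(n, random_state=42)])

# Toy data (hypothetical): q1 has 2 human + 1 ChatGPT answer, q2 only a human answer, q3 one of each
toy = pd.DataFrame({
    "question": ["q1", "q1", "q1", "q2", "q3", "q3"],
    "author": ["human_answers", "human_answers", "chatgpt_answers",
               "human_answers",
               "human_answers", "chatgpt_answers"],
    "answer": ["h1a", "h1b", "c1", "h2", "h3", "c3"],
})

# Balance each question separately, then stitch the groups back together
balanced = pd.concat(
    [get_equal_numbers_of_answers(group) for _, group in toy.groupby("question")]
)
counts = balanced["author"].value_counts()
print(counts)
```

Questions that lack an answer from one of the two authors (`q2` here) contribute nothing, and surplus answers are sampled down to the size of the smaller side.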


In [12]:
df = df.reset_index(drop=True)
df

Unnamed: 0,question,source,id,author,answer
0,""" Magic eye "" images Magic is still my only ex...",reddit_eli5,16763,human_answers,Have you ever looked at a 3D movie without gla...
1,""" Magic eye "" images Magic is still my only ex...",reddit_eli5,16763,chatgpt_answers,"Sure! ""Magic Eye"" images are pictures that con..."
2,""" Second cousin twice removed "" I do n't under...",reddit_eli5,16987,human_answers,I 'm a Viriginian with a complex family tree -...
3,""" Second cousin twice removed "" I do n't under...",reddit_eli5,16987,chatgpt_answers,Sure! A cousin is a relative who is the child ...
4,""" There are more stars in the universe than th...",reddit_eli5,1002,human_answers,Counted ? Nobody . But you can estimate based ...
...,...,...,...,...,...
3067,why would closing price of a stock be differen...,finance,21840,chatgpt_answers,There are a few reasons why the closing price ...
3068,why you have to sneeze when you have a cold Re...,reddit_eli5,8072,human_answers,"First of all , why do we cough ? Coughing is a..."
3069,why you have to sneeze when you have a cold Re...,reddit_eli5,8072,chatgpt_answers,"When you have a cold, your body is trying to g..."
3070,“Occupation” field on IRS Form 1040,finance,20684,human_answers,"It doesn't generally matter, and I'm not sure ..."


In [13]:
train, test = train_test_split(df, test_size=0.3, random_state=42, stratify=df["author"])

In [14]:
print("train", len(train))
print("test", len(test))

train 2150
test 922
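
The printed split sizes can be reproduced by hand. This is a small worked calculation, assuming that `train_test_split` rounds the test fraction up and assigns the remaining rows to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 3072  # 1536 human + 1536 ChatGPT answers, as printed earlier
test_size = 0.3

# 0.3 * 3072 = 921.6, rounded up for the test set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 2150 922
```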


In [15]:
train.to_pickle("./dataset_train.pkl")

In [16]:
test.to_pickle("./dataset_test.pkl")
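
For later use, the pickled splits can be read back with `pd.read_pickle`. The following sketch round-trips a made-up two-row frame through a temporary file (the file name `dataset_toy.pkl` and the rows are hypothetical) and derives a binary label from the `author` column, following the commented-out lines in cell 4.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for one of the splits (hypothetical rows)
toy = pd.DataFrame({
    "author": ["human_answers", "chatgpt_answers"],
    "answer": ["a human answer", "a ChatGPT answer"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dataset_toy.pkl"
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# Binary label: 1 for ChatGPT answers, 0 for human answers
loaded["label"] = (loaded["author"] == "chatgpt_answers").astype(int)
print(loaded)
```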