This notebook creates new train and test splits from Guo et al.'s Hugging Face dataset.
Their CSV file on Google Drive appears to mix up questions and answers (see [this GitHub issue](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/issues/30)).
While this is not directly relevant here (the questions won't be used), it cannot easily be verified whether other issues were introduced when those splits were generated.

Also, human answers in reddit_eli5 and open_qa appear to contain artifacts in the form of spaces added before or after punctuation. For open_qa, these artifacts are already present in the underlying WikiQACorpus, e.g. wherever a word was a link. This effectively watermarks the human text.

Those two datasets are therefore excluded.
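
The spacing artifacts described above could be flagged with a simple pattern check. The sketch below is a hypothetical heuristic, not the check used to build this dataset: the name `has_spacing_artifact` and both patterns are assumptions chosen for illustration, and they will also fire on some legitimate text (for example a number written as ` .5`).

```python
import re

# Assumed patterns: whitespace directly before closing punctuation,
# or directly after an opening bracket, e.g. "gas , held" or "( Mars )".
_SPACE_BEFORE_PUNCT = re.compile(r"\s[,.;:!?)\]]")
_SPACE_AFTER_OPEN = re.compile(r"[(\[]\s")


def has_spacing_artifact(text: str) -> bool:
    """Return True if the text shows the spacing pattern described above."""
    return bool(_SPACE_BEFORE_PUNCT.search(text) or _SPACE_AFTER_OPEN.search(text))


samples = [
    "Stars are balls of gas , held together by gravity .",
    "Stars are balls of gas, held together by gravity.",
]
flags = [has_spacing_artifact(s) for s in samples]
print(flags)
```

Applied per source, the share of flagged human answers would give a rough sense of how strongly each subset is affected.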



In [1]:
# filter by document length as human responses tend to be shorter in this dataset on average
MAX_WORDS = 150
MIN_WORDS = 50

In [2]:
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split

h3_dataset_hf_raw = load_dataset("Hello-SimpleAI/HC3", name="all")

In [3]:
h3_dataset_hf = pd.DataFrame(h3_dataset_hf_raw["train"], columns=list(h3_dataset_hf_raw["train"].info.features.keys()))
h3_dataset_hf = h3_dataset_hf.explode("human_answers").explode("chatgpt_answers")
h3_dataset_hf = pd.melt(h3_dataset_hf, id_vars=["question", "source", "id"], value_vars=["human_answers", "chatgpt_answers"], value_name="answer", var_name="author")
#h3_dataset_hf["label"] = h3_dataset_hf["author"] == "chatgpt_answers"
#h3_dataset_hf["label"] = h3_dataset_hf["label"].astype(int)
h3_dataset_hf["id"] = h3_dataset_hf["id"].astype(int)
h3_dataset_hf = h3_dataset_hf.dropna(subset=["answer"])

# h3_dataset_hf["answer"] = h3_dataset_hf["answer"].replace(r'\n','', regex=True) # for comparison only: the CSVs don't contain newlines
# h3_dataset_hf["question"] = h3_dataset_hf["question"].replace(r'\n','', regex=True)

h3_dataset_hf = h3_dataset_hf[~(h3_dataset_hf["source"].str.contains("open_qa"))] # the original human dataset has artifacts from hyperlinks, effectively watermarking human text
h3_dataset_hf = h3_dataset_hf[~(h3_dataset_hf["source"].str.contains("reddit_eli5"))] # the human answers have spacing artifacts around punctuation
h3_dataset_hf

Unnamed: 0,question,source,id,author,answer
54906,"Please explain what is ""Animal cognition""",wiki_csai,18299,human_answers,Animal cognition encompasses the mental capaci...
54907,"Please explain what is ""Human intelligence""",wiki_csai,18300,human_answers,Human intelligence is the intellectual capabil...
54908,"Please explain what is ""Oxford English Diction...",wiki_csai,18301,human_answers,The Oxford English Dictionary (OED) is the fir...
54909,"Please explain what is ""Oxford University Press""",wiki_csai,18302,human_answers,Oxford University Press (OUP) is the universit...
54910,"Please explain what is ""AI applications""",wiki_csai,18303,human_answers,Artificial intelligence (AI) has been used in ...
...,...,...,...,...,...
123171,Is rise in pressure from 116/66 to 140/80 norm...,medicine,24317,chatgpt_answers,It's not uncommon for blood pressure to fluctu...
123172,What could cause a painless lump in the right ...,medicine,24318,chatgpt_answers,There are several possible causes of a painles...
123173,Can Acutret be given to a child for treatment ...,medicine,24319,chatgpt_answers,It is not appropriate for me to recommend a sp...
123174,Are BP of 119/65 and pulse of 35 causes for co...,medicine,24320,chatgpt_answers,It is not uncommon for people with rheumatoid ...


In [4]:
# filter by document length as human responses tend to be shorter in this dataset on average

# keep answers whose whitespace-separated word count lies in [MIN_WORDS, MAX_WORDS] (both bounds inclusive)
doc_within_range = h3_dataset_hf["answer"].str.split().str.len().between(MIN_WORDS, MAX_WORDS)
df_min_max_len = h3_dataset_hf[doc_within_range]
df_min_max_len

Unnamed: 0,question,source,id,author,answer
54906,"Please explain what is ""Animal cognition""",wiki_csai,18299,human_answers,Animal cognition encompasses the mental capaci...
54907,"Please explain what is ""Human intelligence""",wiki_csai,18300,human_answers,Human intelligence is the intellectual capabil...
54915,"Please explain what is ""Natural-language under...",wiki_csai,18308,human_answers,Natural-language understanding (NLU) or natura...
54918,"Please explain what is ""Automated decision-mak...",wiki_csai,18311,human_answers,Automated decision-making (ADM) involves the u...
54926,"Please explain what is ""Knowledge representation""",wiki_csai,18319,human_answers,"Knowledge representation and reasoning (KRR, K..."
...,...,...,...,...,...
123165,Is the dental implant related to the pain in r...,medicine,24311,chatgpt_answers,It is possible that your dental implants could...
123170,What is the treatment for presence of breasts ...,medicine,24316,chatgpt_answers,"I'm sorry, but I am an AI language model and d..."
123171,Is rise in pressure from 116/66 to 140/80 norm...,medicine,24317,chatgpt_answers,It's not uncommon for blood pressure to fluctu...
123173,Can Acutret be given to a child for treatment ...,medicine,24319,chatgpt_answers,It is not appropriate for me to recommend a sp...
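
The length filter is motivated by human responses tending to be shorter. A minimal sketch of how that could be checked is to compare word counts per author. The rows below are made-up toy data, and `n_words` and `median_words` are names introduced only for this illustration.

```python
import pandas as pd

# Toy rows standing in for the real answers (invented for this sketch).
toy = pd.DataFrame({
    "author": ["human_answers", "human_answers", "chatgpt_answers", "chatgpt_answers"],
    "answer": [
        "Short reply.",
        "A slightly longer human reply.",
        "A much longer reply with many more words in it.",
        "Another fairly long machine reply with several words.",
    ],
})

# Same word count definition as the filter: split on whitespace.
toy["n_words"] = toy["answer"].str.split().str.len()
median_words = toy.groupby("author")["n_words"].median()
print(median_words)
```

On the real frame the same two lines would run on `h3_dataset_hf` before filtering.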


In [5]:
# filter out responses that contain some of the "indicating words" provided by Guo et al.
# note that only some are used: they also remove certain stock phrases like "There are several ways" for their filtered version


indicating_words_chatgpt_en = [
    "AI assistant",
    "AI language model",
    "I'm sorry", # e.g. ... but I am not a medical doctor; but I am an AI language; to hear about your husband's symptoms 
    "It is difficult for me",
    "Contents may violate our content",
    "This content may violate our content policy",
    "Can you please provide the statement",
    "If you have any more questions, please don't hesitate to ask.",
    "If you have any questions about",
    "Let me know if you have any other questions",
    "If you have any more questions, feel free to ask!",
    "!\rnetwork error\r\r\r\r", # loris
    "Free Research Preview.", # loris
    "Your feedback will help us improve.", # loris
    ]

remove = df_min_max_len["answer"].str.contains('|'.join(indicating_words_chatgpt_en), regex=True)

print("Removing {} documents, specifically:".format(len(df_min_max_len[remove])))
display(df_min_max_len[remove]["author"].value_counts())

df = df_min_max_len[~remove]

Removing 67 documents, specifically:


author
chatgpt_answers    66
human_answers       1
Name: count, dtype: int64
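
Because the phrases are joined into one regular expression, characters such as `.` act as wildcards rather than literal characters. The sketch below uses toy strings to show the difference that escaping with `re.escape` makes; the variable names are assumptions for illustration, and whether escaping would change the 67 removals reported above is not checked here.

```python
import re

import pandas as pd

phrases = ["Free Research Preview.", "AI language model"]
answers = pd.Series([
    "Free Research Previews are fun",  # matches only if "." is a wildcard
    "I am an AI language model",
    "Plain human text",
])

# Unescaped: "." matches any character, so "Previews" is a hit.
unescaped = answers.str.contains("|".join(phrases), regex=True)

# Escaped: every phrase is matched literally.
escaped = answers.str.contains("|".join(re.escape(p) for p in phrases), regex=True)

print(pd.DataFrame({"answer": answers, "unescaped": unescaped, "escaped": escaped}))
```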

In [6]:
len_before = len(df)
df = df.drop_duplicates(subset=["question", "answer", "author"])

# Some human answers are duplicated across questions (ChatGPT answers are not), so also deduplicate on answer and author alone
df = df.drop_duplicates(subset=[ "answer", "author"])
print("Dropped {} duplicates (was {})".format(len_before - len(df),len_before))


Dropped 319 duplicates (was 4213)


In [7]:
df[df.duplicated(subset=["answer"])].sort_values(by="answer")

Unnamed: 0,question,source,id,author,answer


In [8]:
def get_equal_numbers_of_answers(group):
    human_answers = group[group["author"] == "human_answers"]
    chatgpt_answers = group[group["author"] == "chatgpt_answers"]
    n = min(len(human_answers), len(chatgpt_answers))
    return pd.concat([human_answers.sample(n, random_state=42), chatgpt_answers.sample(n, random_state=42)])
    

In [9]:
# want a balanced dataset: sample pairs of answers to the same question

df = df.groupby(["question"]).apply(get_equal_numbers_of_answers)



In [10]:
print("{} human answers, {} chat".format(len(df[df["author"] == "human_answers"]), len(df[df["author"] == "chatgpt_answers"])))
assert len(df[df["author"] == "human_answers"]) == len(df[df["author"] == "chatgpt_answers"]), "Sampling not balanced"

508 human answers, 508 chat
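
The per-question balancing can be illustrated on toy data. The sketch reuses the `get_equal_numbers_of_answers` function from above; the toy frame is invented, and the groups are iterated explicitly instead of using `groupby(...).apply`, which is a substitution made for this sketch only.

```python
import pandas as pd


def get_equal_numbers_of_answers(group):
    human_answers = group[group["author"] == "human_answers"]
    chatgpt_answers = group[group["author"] == "chatgpt_answers"]
    n = min(len(human_answers), len(chatgpt_answers))
    return pd.concat([human_answers.sample(n, random_state=42), chatgpt_answers.sample(n, random_state=42)])


# q1 has two human answers but only one ChatGPT answer; q2 is already balanced.
toy = pd.DataFrame({
    "question": ["q1", "q1", "q1", "q2", "q2"],
    "author": ["human_answers", "human_answers", "chatgpt_answers", "human_answers", "chatgpt_answers"],
    "answer": ["h1", "h2", "c1", "h3", "c2"],
})

balanced = pd.concat(
    [get_equal_numbers_of_answers(group) for _, group in toy.groupby("question")]
).reset_index(drop=True)
counts = balanced["author"].value_counts()
print(counts)
```

One of the two human answers to q1 is dropped, so both authors end up with the same number of rows. A question that lacks answers from one author has `n = 0` and contributes nothing.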


In [11]:
df = df.reset_index(drop=True)
df

Unnamed: 0,question,source,id,author,answer
0,1099 versus corporation to corporation for pay...,finance,19394,human_answers,Do not mix personal accounts and corporate acc...
1,1099 versus corporation to corporation for pay...,finance,19394,chatgpt_answers,A 1099 form is a tax form used to report certa...
2,A University student wondering if investing in...,finance,20116,human_answers,You can start investing with any amount. You c...
3,A University student wondering if investing in...,finance,20116,chatgpt_answers,Investing in stocks can be a good idea if you ...
4,ADR listed in PINK,finance,21056,human_answers,"Pink Sheets is not a stock exchange per se, an..."
...,...,...,...,...,...
1011,why if change manufacturing of a product not c...,finance,20126,chatgpt_answers,There are a number of reasons why changing the...
1012,why would closing price of a stock be differen...,finance,21840,human_answers,There is more than one exchange where stock ca...
1013,why would closing price of a stock be differen...,finance,21840,chatgpt_answers,There are a few reasons why the closing price ...
1014,“Occupation” field on IRS Form 1040,finance,20684,human_answers,"It doesn't generally matter, and I'm not sure ..."


In [12]:
train, test = train_test_split(df, test_size=0.3, random_state=42, stratify=df["author"])

In [13]:
print("train", len(train))
print("test", len(test))

train 711
test 305
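
The split sizes follow from `test_size=0.3` being rounded up (0.3 × 1016 = 304.8, giving 305 test rows), and `stratify` keeps the author ratio the same in both parts. A minimal sketch on an invented, balanced toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 10 human and 10 ChatGPT rows (toy data for illustration).
toy = pd.DataFrame({
    "author": ["human_answers"] * 10 + ["chatgpt_answers"] * 10,
    "answer": [f"answer {i}" for i in range(20)],
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["author"]
)

train_counts = toy_train["author"].value_counts().to_dict()
test_counts = toy_test["author"].value_counts().to_dict()
print(train_counts, test_counts)
```

Note that stratifying on `author` balances the classes only; it does not keep both answers to the same question in the same split.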


In [14]:
train.to_pickle("./dataset_train.pkl")
train.to_csv("./dataset/train.csv", index=False, encoding="utf8")

In [15]:
test.to_pickle("./dataset_test.pkl")
test.to_csv("./dataset/test.csv", index=False, encoding="utf8")

In [20]:
# the experiments were run on the .pkl files; the .csv files are provided for convenience
from pandas.testing import assert_frame_equal
assert_frame_equal(pd.read_pickle("./dataset_train.pkl").reset_index(drop=True), pd.read_csv("./dataset/train.csv").reset_index(drop=True),check_dtype=False)
assert_frame_equal(pd.read_pickle("./dataset_test.pkl").reset_index(drop=True), pd.read_csv("./dataset/test.csv").reset_index(drop=True), check_dtype=False)
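
The round-trip check matters because the answers can contain newlines, quotes and commas, all of which have to survive CSV quoting. The sketch below shows the same kind of check on an invented toy frame, using an in-memory buffer instead of files (an assumption made to keep the example self-contained).

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal

toy = pd.DataFrame({
    "id": [1, 2],
    "answer": ["line one\nline two", 'has "quotes", and commas'],
})

# Write to CSV and read back without touching the file system.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# check_dtype=False tolerates dtype differences introduced by CSV parsing.
assert_frame_equal(toy, restored, check_dtype=False)
print("round trip ok")
```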

