### Data processing

This notebook shows how we combine four datasets of human-generated and machine-generated text, each drawn from a different context (essays, news articles, product reviews, and Wikipedia introductions), into a single dataset. To keep the final dataset balanced, we take 10,000 human-generated and 10,000 machine-generated samples from each source dataset and concatenate them.
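
The per-dataset balancing step described above can be sketched as a small helper. This is a minimal illustration, not the code used in the cells below: `balanced_sample` and its parameters are hypothetical names, and the fixed `random_state` is an assumption added so that the draw is repeatable.

```python
import pandas as pd


def balanced_sample(df, label_col="Label", n_per_class=10_000, seed=0):
    """Draw n_per_class rows for every label value and stack them.

    Hypothetical helper: the name, defaults, and seed are illustrative.
    """
    parts = []
    for label, group in df.groupby(label_col):
        if len(group) < n_per_class:
            raise ValueError(
                f"label {label!r} has only {len(group)} rows, need {n_per_class}"
            )
        # Sampling without replacement keeps the original index, so each
        # row can still be traced back to its source dataset.
        parts.append(group.sample(n=n_per_class, random_state=seed))
    return pd.concat(parts)


# Toy frame standing in for one of the source datasets.
toy = pd.DataFrame({
    "Text": [f"doc {i}" for i in range(10)],
    "Label": ["Human"] * 6 + ["Machine"] * 4,
})
toy_balanced = balanced_sample(toy, n_per_class=3)
print(toy_balanced["Label"].value_counts())
```

Raising an error when a class is too small is a design choice for this sketch; sampling with replacement would be an alternative, at the cost of duplicate rows.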

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [46]:
essays = pd.read_csv("essays_balanced.csv")
grover = pd.read_csv("grover.csv")
reviews = pd.read_csv("reviews.csv")
wiki = pd.read_csv("wiki.csv")

### Processing essays

In [3]:
essays.sample(5)

Unnamed: 0.1,Unnamed: 0,text,prompt_name,source,RDizzl3_seven,Label
6463,7764,All students should be required to join a afte...,Mandatory extracurricular activities,persuade_corpus,False,Human
16574,23298,Some schools offer distance learning as an opt...,Distance learning,persuade_corpus,False,Human
12292,38064,I believe that the minimum wage in our country...,Grades for extracurricular activities,falcon_180b_v1,False,Machine
8356,10181,The technology to read people emotional expres...,Facial action coding system,persuade_corpus,True,Human
12737,17460,"""Driveless Cars Are Coming"" this article gives...",Driverless cars,persuade_corpus,True,Human


In [4]:
essays['source'].value_counts()

source
persuade_corpus                       9515
llama2_chat                           1312
mistral7binstruct_v2                  1177
mistral7binstruct_v1                  1145
chat_gpt_moth                         1086
llama_70b_v1                           856
darragh_claude_v6                      827
darragh_claude_v7                      813
falcon_180b_v1                         752
kingki19_palm                          674
train_essays                           485
palm-text-bison1                       305
cohere-command                         296
radek_500                              253
mistralai/Mistral-7B-Instruct-v0.1     214
NousResearch/Llama-2-7b-chat-hf        186
radekgpt4                              104
Name: count, dtype: int64

In [7]:
essays.groupby(by = 'prompt_name')['source'].value_counts().head(50)

prompt_name                    source                            
"A Cowboy Who Rode the Waves"  persuade_corpus                       524
                               darragh_claude_v7                      73
                               darragh_claude_v6                      68
                               llama2_chat                            66
                               llama_70b_v1                           54
                               cohere-command                         50
                               palm-text-bison1                       50
                               chat_gpt_moth                          47
                               falcon_180b_v1                         46
                               mistral7binstruct_v2                   36
                               mistral7binstruct_v1                   34
Car-free cities                persuade_corpus                       699
                               kingki19_palm              

In [10]:
machine_essays = essays[essays['Label'] == 'Machine'].sample(n = 10000)
human_essays = essays[essays['Label'] == 'Human'].sample(n = 10000)
combined_essays = pd.concat([machine_essays, human_essays])

In [11]:
combined_essays = combined_essays[['text', 'Label', 'source']]
combined_essays['Original dataset'] = 'essays'
combined_essays['Row in original dataset'] = combined_essays.index

In [12]:
combined_essays.rename(columns = {'text':'Text', 'source':'Source'}, inplace = True)

In [13]:
combined_essays

Unnamed: 0,Text,Label,Source,Original dataset,Row in original dataset
12494,Hi my name is Timmy and I’m in grade 7. Homewo...,Machine,llama_70b_v1,essays,12494
6853,Graduating from high school a year early may s...,Machine,chat_gpt_moth,essays,6853
14687,"Dear state senator,\n\nI am writing to expres...",Machine,cohere-command,essays,14687
465,Cell phones have become an essential part of o...,Machine,falcon_180b_v1,essays,465
14846,"Dear Senator,\n\nI am writing to you today to ...",Machine,kingki19_palm,essays,14846
...,...,...,...,...,...
967,L\n\nimiting car usage is a great and helpful ...,Human,persuade_corpus,essays,967
13995,"Dear President of the United States,\n\nI thin...",Human,persuade_corpus,essays,13995
403,Should drivers be able to use their phone whil...,Human,persuade_corpus,essays,403
16369,Many students in the U.S. are being advised to...,Human,persuade_corpus,essays,16369


### Processing grover

In [37]:
grover.sample(5)

Unnamed: 0.1,Unnamed: 0,article,domain,title,date,authors,ind30k,url,label,orig_split,split,random_score,top_p
21763,21763,"Ge Min, spokesperson for the Washington, D.C. ...",theepochtimes.com,A Silent Rally Remembers Historic Appeal 20 Ye...,2019-04-15,,3551,https://www.theepochtimes.com/a-silent-rally-r...,human,gen,test,0.616154,
22823,22823,"As Congress considers a new funding bill, fami...",necn.com,Families of Overdose Victims Rally for Safe In...,2019-04-08,,8108,https://www.necn.com/news/new-england/Families...,machine,gen,test,,0.939683
23162,23162,(Reuters) — Uber expects it will be a long tim...,venturebeat.com,Uber expects a long wait before self-driving c...,2019-04-09,Author,22427,https://venturebeat.com/2019/04/08/uber-expect...,human,train_burner,test,1.009594,
5087,5087,Please enable Javascript to watch this video\n...,fox2now.com,"A multimillion-dollar contract, a mysterious s...",2019-04-12,Chris Hayes,3312,https://fox2now.com/2019/04/11/a-multimillion-...,machine,gen,train,,0.939683
18828,18828,"A Gynaecologist, Dr Kenneth Adedugba, of Life ...",vanguardngr.com,‘Men will continue to produce sperm until old ...,2019-04-07,View All Posts Nwafor Polycarp,827,https://www.vanguardngr.com/2019/04/men-will-c...,human,gen,test,-0.039887,


In [38]:
grover['label'].value_counts()

label
human      15000
machine    10000
Name: count, dtype: int64

In [39]:
# Keep all 10,000 machine rows and downsample the 15,000 human rows to match.
machine_grover = grover[grover['label'] == 'machine']
human_grover = grover[grover['label'] == 'human'].sample(n = 10000)
combined_grover = pd.concat([machine_grover, human_grover])

In [40]:
combined_grover['Label'] = combined_grover['label'].str.capitalize()
combined_grover['Original dataset'] = 'grover'
combined_grover['Row in original dataset'] = combined_grover.index
combined_grover.rename(columns = {'article':'Text'}, inplace = True)

In [41]:
combined_grover = combined_grover[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [42]:
combined_grover['Source'] = 'grover'

In [43]:
combined_grover

Unnamed: 0,Text,Label,Original dataset,Row in original dataset,Source
1,"When I first started looking at oil assets, I ...",Machine,grover,1,grover
4,The Kingston and St Andrew Municipal Corporati...,Machine,grover,4,grover
5,"news, local-news,\nTwo protests against an inv...",Machine,grover,5,grover
7,Drink-driving charges brought against Danny Dr...,Machine,grover,7,grover
13,Called by the Mirror after today's strong new ...,Machine,grover,13,grover
...,...,...,...,...,...
2642,TONY PULIS thinks Dael Fry has all the attribu...,Human,grover,2642,grover
2907,My favorite way to get to know a city is bite ...,Human,grover,2907,grover
17957,Other stars who will select the winners in the...,Human,grover,17957,grover
14402,New Delhi: Private equity (PE) fund Actis Llp ...,Human,grover,14402,grover


### Processing reviews

In [44]:
reviews.sample(5)

Unnamed: 0,category,rating,Label,Text
28806,Books_5,4.0,Machine,I love the characters in this book and the wri...
28758,Books_5,4.0,Machine,"well done just too dark and violent, with the ..."
3804,Home_and_Kitchen_5,3.0,Machine,"I'm pretty torn about this, but I thought it w..."
32997,Toys_and_Games_5,4.0,Machine,Well made and wonderful toys.Cute and very fun...
18633,Tools_and_Home_Improvement_5,4.0,Machine,Very cool light (UV 3) and the wide beam is go...


In [47]:
# Rows labelled 'OR' are treated as human-written; every other value as machine-generated.
reviews['label'] = reviews['label'].apply(lambda x: 'Human' if x == 'OR' else 'Machine')
reviews.rename(columns = {'label':'Label', 'text_':'Text'}, inplace=True)

In [48]:
machine_reviews = reviews[reviews['Label'] == 'Machine'].sample(n = 10000)
human_reviews = reviews[reviews['Label'] == 'Human'].sample(n = 10000)
combined_reviews = pd.concat([machine_reviews, human_reviews])

In [23]:
# Run this cell after the sampling cell above so that combined_reviews carries the shared columns.
combined_reviews['Original dataset'] = 'reviews'
combined_reviews['Row in original dataset'] = combined_reviews.index
combined_reviews = combined_reviews[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [49]:
combined_reviews['Source'] = "GPT-2"

In [50]:
combined_reviews

Unnamed: 0,category,rating,Label,Text,Source
16915,Tools_and_Home_Improvement_5,5.0,Machine,"My walking 9 month old, I couldn't find a good...",GPT-2
17501,Tools_and_Home_Improvement_5,5.0,Machine,"This is a great tool, it's lightweight and eas...",GPT-2
33416,Toys_and_Games_5,3.0,Machine,"Too many small pieces that go into the pieces,...",GPT-2
38131,Clothing_Shoes_and_Jewelry_5,5.0,Machine,I love this bag! It looks great and the materi...,GPT-2
426,Home_and_Kitchen_5,5.0,Machine,Very pretty and the size is perfect. It is th...,GPT-2
...,...,...,...,...,...
40355,Clothing_Shoes_and_Jewelry_5,4.0,Human,"This is the first smart watch I've used, but I...",GPT-2
18745,Tools_and_Home_Improvement_5,4.0,Human,Do your homework. If installing these bulbs i...,GPT-2
38663,Clothing_Shoes_and_Jewelry_5,5.0,Human,This is one of the most awesome body stockings...,GPT-2
3429,Home_and_Kitchen_5,4.0,Human,We use these on multi-sizes of mason jars for ...,GPT-2


### Processing wiki

In [51]:
wiki.sample(5)

Unnamed: 0.1,Unnamed: 0,id,url,title,wiki_intro,generated_intro,title_len,wiki_intro_len,generated_intro_len,prompt,generated_text,prompt_tokens,generated_text_tokens
12525,12525,45531058,https://en.wikipedia.org/wiki/Kanakanjali,Kanakanjali,Kanakanjali is a Bengali television serial whi...,Kanakanjali is a Bengali television serial whi...,1,228,113,200 word wikipedia style introduction on 'Kana...,aired on GEC channel from 1999 to 2000. It is...,31,145
8498,8498,5809498,https://en.wikipedia.org/wiki/VPS/VM,VPS/VM,VPS/VM (Virtual Processing System/Virtual Mach...,VPS/VM (Virtual Processing System/Virtual Mach...,1,158,30,200 word wikipedia style introduction on 'VPS/...,early type of computer which was used in the ...,31,27
125837,125837,37393652,https://en.wikipedia.org/wiki/Trap%20music,Trap music,"Trap is a subgenre of hip hop music, that orig...",Trap is a subgenre of hip hop that emerged in ...,2,160,72,200 word wikipedia style introduction on 'Trap...,that emerged in the early 2010s. It is a comb...,24,84
99586,99586,38068406,https://en.wikipedia.org/wiki/66%20Ophiuchi,66 Ophiuchi,66 Ophiuchi is a binary variable star in the e...,66 Ophiuchi is a binary variable star system i...,2,235,56,200 word wikipedia style introduction on '66 O...,system in the constellation Ophiuchus. It is ...,26,62
8987,8987,22983863,https://en.wikipedia.org/wiki/Deng%20Yujiao%20...,Deng Yujiao incident,The Deng Yujiao incident occurred on 10 May 2...,The Deng Yujiao incident occurred on January ...,3,208,117,200 word wikipedia style introduction on 'Deng...,"January 6, 1986, when Deng Yujiao, then a hig...",28,140


In [52]:
wiki = wiki.sample(n = 20000)

In [53]:
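# Split the 20,000 sampled articles into disjoint halves, so that no article
# contributes both its human-written and its generated introduction.
human_wiki = wiki.iloc[:10000].copy()
machine_wiki = wiki.iloc[10000:].copy()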

In [54]:
human_wiki.rename(columns = {'wiki_intro':'Text'}, inplace = True)
machine_wiki.rename(columns = {'generated_intro':'Text'}, inplace = True)

In [55]:
human_wiki['Label'] = 'Human'
machine_wiki['Label'] = 'Machine'

In [56]:
combined_wiki = pd.concat([human_wiki, machine_wiki])

In [57]:
combined_wiki['Original dataset'] = 'wiki'
combined_wiki['Row in original dataset'] = combined_wiki.index
combined_wiki = combined_wiki[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [58]:
combined_wiki['Source'] = "GPT-3 (Curie)"

In [59]:
combined_wiki

Unnamed: 0,Text,Label,Original dataset,Row in original dataset,Source
13424,Glycan nomenclature is the systematic naming o...,Human,wiki,13424,GPT-3 (Curie)
54775,Johannes Kotkas (3 February 1915 – 8 May 1998)...,Human,wiki,54775,GPT-3 (Curie)
18866,"Siruseri is a southern suburb of Chennai, Indi...",Human,wiki,18866,GPT-3 (Curie)
129302,is a Japanese actor and singer associated with...,Human,wiki,129302,GPT-3 (Curie)
78312,"Roger Charles Alperin (January 8, 1947 – Novem...",Human,wiki,78312,GPT-3 (Curie)
...,...,...,...,...,...
84969,"In decision theory, a score function, or scori...",Machine,wiki,84969,GPT-3 (Curie)
45878,James Bernard Fagan (18 May 1873 – 10 July 195...,Machine,wiki,45878,GPT-3 (Curie)
130325,Hervé Bléjean is a military officer of the Fre...,Machine,wiki,130325,GPT-3 (Curie)
102299,Stade Brestois 29 is a French football club ba...,Machine,wiki,102299,GPT-3 (Curie)


### Putting it all together

In [60]:
combined = pd.concat([combined_essays, combined_grover, combined_reviews, combined_wiki])

In [61]:
combined.to_csv('combined_data.csv', index = False)

In [62]:
combined['Label'].value_counts()

Label
Machine    40000
Human      40000
Name: count, dtype: int64

In [63]:
combined['Original dataset'].value_counts()

Original dataset
essays    20000
grover    20000
wiki      20000
Name: count, dtype: int64
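
The counts above cover only three of the four datasets (60,000 of the 80,000 rows), which suggests that the reviews rows reached `pd.concat` without an `Original dataset` value, since `value_counts` skips missing values by default. A minimal sketch of a check that could run before concatenation follows. `REQUIRED_COLUMNS` and `check_schema` are hypothetical names introduced here, and the toy frames only stand in for the real ones.

```python
import pandas as pd

# Assumed shared schema, taken from the column names used in this notebook.
REQUIRED_COLUMNS = [
    "Text",
    "Label",
    "Source",
    "Original dataset",
    "Row in original dataset",
]


def check_schema(frames):
    """Return {name: problems} for every frame that deviates from the schema."""
    problems = {}
    for name, frame in frames.items():
        missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
        extra = [c for c in frame.columns if c not in REQUIRED_COLUMNS]
        if missing or extra:
            problems[name] = {"missing": missing, "extra": extra}
    return problems


# Toy frames: one with the shared schema, one shaped like the reviews output.
good = pd.DataFrame({
    "Text": ["a"],
    "Label": ["Human"],
    "Source": ["grover"],
    "Original dataset": ["grover"],
    "Row in original dataset": [0],
})
bad = pd.DataFrame({
    "category": ["Books_5"],
    "rating": [4.0],
    "Label": ["Machine"],
    "Text": ["b"],
    "Source": ["GPT-2"],
})
report = check_schema({"good": good, "bad": bad})
print(report)
```

With the real frames, the call would look like `check_schema({'essays': combined_essays, 'grover': combined_grover, 'reviews': combined_reviews, 'wiki': combined_wiki})`; an empty result means every frame matches the shared schema.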

In [64]:
combined

Unnamed: 0,Text,Label,Source,Original dataset,Row in original dataset,category,rating
12494,Hi my name is Timmy and I’m in grade 7. Homewo...,Machine,llama_70b_v1,essays,12494.0,,
6853,Graduating from high school a year early may s...,Machine,chat_gpt_moth,essays,6853.0,,
14687,"Dear state senator,\n\nI am writing to expres...",Machine,cohere-command,essays,14687.0,,
465,Cell phones have become an essential part of o...,Machine,falcon_180b_v1,essays,465.0,,
14846,"Dear Senator,\n\nI am writing to you today to ...",Machine,kingki19_palm,essays,14846.0,,
...,...,...,...,...,...,...,...
84969,"In decision theory, a score function, or scori...",Machine,GPT-3 (Curie),wiki,84969.0,,
45878,James Bernard Fagan (18 May 1873 – 10 July 195...,Machine,GPT-3 (Curie),wiki,45878.0,,
130325,Hervé Bléjean is a military officer of the Fre...,Machine,GPT-3 (Curie),wiki,130325.0,,
102299,Stade Brestois 29 is a French football club ba...,Machine,GPT-3 (Curie),wiki,102299.0,,
