### Data processing

This notebook shows how we combine four datasets of human-generated and machine-generated text, drawn from different contexts, into a single dataset. To keep the final dataset balanced, we take 10,000 human-generated and 10,000 machine-generated samples from each source dataset and concatenate them.
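The per-dataset balancing step can be pictured as a small helper that draws the same number of rows from every class. This is only a sketch on toy data, not the code used in the cells below: `balanced_sample`, its parameters, and the fixed `seed` are hypothetical names and assumptions introduced for illustration.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=0):
    """Draw n_per_class rows from every class in label_col (hypothetical helper)."""
    parts = []
    for label, group in df.groupby(label_col):
        # Fail loudly rather than silently returning an unbalanced sample.
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        parts.append(group.sample(n=n_per_class, random_state=seed))
    return pd.concat(parts)


# Toy data standing in for one of the source datasets.
toy = pd.DataFrame({
    "Text": [f"doc {i}" for i in range(10)],
    "Label": ["Human"] * 6 + ["Machine"] * 4,
})
balanced = balanced_sample(toy, "Label", n_per_class=3)
print(balanced["Label"].value_counts())
```

With the real data, `n_per_class` would be 10,000 and `label_col` would be whichever column holds the human/machine label in that source.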

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
essays = pd.read_csv("essays_balanced.csv")
grover = pd.read_csv("grover.csv")
reviews = pd.read_csv("reviews.csv")
wiki = pd.read_csv("wiki.csv")

### Processing essays

In [3]:
essays.sample(5)

Unnamed: 0.1,Unnamed: 0,text,prompt_name,source,RDizzl3_seven,Label
10968,14333,"Dear, Principal\n\nIf your trying to decide if...",Community service,persuade_corpus,False,Human
2041,40984,Advantages of Limiting Car Usage\n\nIn recent...,Car-free cities,mistralai/Mistral-7B-Instruct-v0.1,True,Machine
12894,17200,Driverless cars really fasinates me. In the ne...,Driverless cars,persuade_corpus,True,Human
10021,12282,"What is that thing on Mars?\n\nWell, some peop...",The Face on Mars,persuade_corpus,True,Human
32,853,What makes a mistake something someone will ne...,Phones and driving,persuade_corpus,False,Human


In [4]:
essays['source'].value_counts()

source
persuade_corpus                       9515
llama2_chat                           1312
mistral7binstruct_v2                  1177
mistral7binstruct_v1                  1145
chat_gpt_moth                         1086
llama_70b_v1                           856
darragh_claude_v6                      827
darragh_claude_v7                      813
falcon_180b_v1                         752
kingki19_palm                          674
train_essays                           485
palm-text-bison1                       305
cohere-command                         296
radek_500                              253
mistralai/Mistral-7B-Instruct-v0.1     214
NousResearch/Llama-2-7b-chat-hf        186
radekgpt4                              104
Name: count, dtype: int64

In [5]:
essays.groupby(by = 'prompt_name')['source'].value_counts().head(50)

prompt_name                    source                            
"A Cowboy Who Rode the Waves"  persuade_corpus                       524
                               darragh_claude_v7                      73
                               darragh_claude_v6                      68
                               llama2_chat                            66
                               llama_70b_v1                           54
                               cohere-command                         50
                               palm-text-bison1                       50
                               chat_gpt_moth                          47
                               falcon_180b_v1                         46
                               mistral7binstruct_v2                   36
                               mistral7binstruct_v1                   34
Car-free cities                persuade_corpus                       699
                               kingki19_palm              

In [11]:
machine_essays = essays[essays['Label'] == 'Machine'].sample(n = 10000)
human_essays = essays[essays['Label'] == 'Human'].sample(n = 10000)
combined_essays = pd.concat([machine_essays, human_essays])
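`sample` draws different rows on every run unless it is given a seed, so rerunning the notebook will not reproduce the exact rows shown in the outputs. A minimal sketch of seeding the draw, on toy data; the seed value 42 and the toy frame are assumptions, not part of the original notebook:

```python
import pandas as pd

toy = pd.DataFrame({"text": [f"essay {i}" for i in range(100)]})

# The same seed selects the same rows in the same order on every call.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)
same_rows = first.index.equals(second.index)
print(same_rows)
```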

In [12]:
combined_essays = combined_essays[['text', 'Label', 'source']]
combined_essays['Original dataset'] = 'essays'
combined_essays['Row in original dataset'] = combined_essays.index
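Because `sample` keeps the original index labels, the `Row in original dataset` column can later be used to look a sampled row back up in its source file. A small sketch of that round trip, assuming the source frame still has the default integer index from `read_csv`; the toy frame and variable names are hypothetical:

```python
import pandas as pd

original = pd.DataFrame({"text": ["a", "b", "c", "d"]})

# Sample a subset and record where each row came from.
subset = original.sample(n=2, random_state=0).copy()
subset["Row in original dataset"] = subset.index

# Look each sampled row back up in the original frame by its stored label.
recovered = original.loc[subset["Row in original dataset"], "text"]
matches = bool((recovered.values == subset["text"].values).all())
print(matches)
```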

In [13]:
combined_essays.rename(columns = {'text':'Text', 'source':'Model'}, inplace = True)

In [16]:
# Human rows have no generating model, so replace the corpus name with 'Human'
combined_essays.loc[combined_essays['Label'] == 'Human', 'Model'] = 'Human'

In [17]:
combined_essays

Unnamed: 0,Text,Label,Model,Original dataset,Row in original dataset
13355,While driverless cars present many promising b...,Machine,darragh_claude_v7,essays,13355
7249,Homework Clubs: The Key to Unlocking Academic ...,Machine,llama2_chat,essays,7249
2603,"""The legalization of marijuana has been a cont...",Machine,falcon_180b_v1,essays,2603
3993,Taking the opportunity to learn new things can...,Machine,mistral7binstruct_v1,essays,3993
3773,Working with a partner is an effective way fo...,Machine,mistral7binstruct_v2,essays,3773
...,...,...,...,...,...
12907,People are always interested in looking in the...,Human,Human,essays,12907
1556,The majority of Americans have the luxury of o...,Human,Human,essays,1556
16441,Students that live far away from schools or li...,Human,Human,essays,16441
15362,"Dear Principal,\n\nI think you should go with ...",Human,Human,essays,15362


### Processing grover

In [18]:
grover.sample(5)

Unnamed: 0.1,Unnamed: 0,article,domain,title,date,authors,ind30k,url,label,orig_split,split,random_score,top_p
23669,23669,Posted\nA Florida man who authorities believe ...,abc.net.au,Cassowary kills suspected breeder in Florida,2019-04-14,Australian Broadcasting Corporation,26715,https://www.abc.net.au/news/2019-04-14/florida...,human,train_burner,test,1.224442,
3582,3582,A 36-year-old man from Oban has appeared in co...,thecourier.co.uk,Man appears in court after biker suffers leg i...,2019-04-16,Ross Gardiner,9735,https://www.thecourier.co.uk/fp/news/local/per...,machine,gen,train,,0.939683
11197,11197,A Nevada judge has dismissed a lawsuit by ranc...,oregonlive.com,Cliven Bundy’s public lands claim is 'simply d...,2019-04-09,"Maxine Bernstein, The Oregonian Oregonlive",23408,https://www.oregonlive.com/crime/2019/04/clive...,human,train_burner,val,-0.181267,
15859,15859,"HARRISBURG, Pa. (AP) — A grand jury report on ...",pennlive.com,Grand jury sees ways to reduce Pennsylvania’s ...,2019-04-15,"Mark Scolforo, The Associated Press",210,https://www.pennlive.com/news/2019/04/grand-ju...,machine,gen,test,,0.939683
4984,4984,Board members play a critical role as gatekeep...,theprovince.com,Lubomyr Luciuk: Trying to uphold Canada's refu...,2019-04-17,"Lubomyr Luciuk, More Lubomyr Luciuk",24194,https://theprovince.com/opinion/op-ed/lubomyr-...,human,train_burner,train,-0.002763,


In [19]:
grover['label'].value_counts()

label
human      15000
machine    10000
Name: count, dtype: int64

In [26]:
# Only 10,000 machine rows exist, so keep all of them
machine_grover = grover[grover['label'] == 'machine']
# Downsample the 15,000 human rows to match
human_grover = grover[grover['label'] == 'human'].sample(n = 10000)
combined_grover = pd.concat([machine_grover, human_grover])
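Here the class sizes differ (15,000 human rows against 10,000 machine rows), so only the larger class is downsampled. The same idea can be written without hard-coding the target size by reading it from the smallest class. This is a sketch on toy data under that assumption, not the notebook's own code:

```python
import pandas as pd

toy = pd.DataFrame({
    "article": [f"story {i}" for i in range(5)],
    "label": ["human", "human", "human", "machine", "machine"],
})

# Size of the smallest class; every class is downsampled to this many rows.
n_min = int(toy["label"].value_counts().min())

parts = [
    group.sample(n=n_min, random_state=0)
    for _, group in toy.groupby("label")
]
balanced = pd.concat(parts)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```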

In [27]:
combined_grover['Label'] = combined_grover['label'].str.capitalize()
combined_grover['Original dataset'] = 'grover'
combined_grover['Row in original dataset'] = combined_grover.index
combined_grover.rename(columns = {'article':'Text'}, inplace = True)

In [28]:
combined_grover = combined_grover[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [31]:
combined_grover['Model'] = 'Grover'
combined_grover.loc[combined_grover['Label'] == 'Human', 'Model'] = 'Human'

In [32]:
combined_grover

Unnamed: 0,Text,Label,Original dataset,Row in original dataset,Model
1,"When I first started looking at oil assets, I ...",Machine,grover,1,Grover
4,The Kingston and St Andrew Municipal Corporati...,Machine,grover,4,Grover
5,"news, local-news,\nTwo protests against an inv...",Machine,grover,5,Grover
7,Drink-driving charges brought against Danny Dr...,Machine,grover,7,Grover
13,Called by the Mirror after today's strong new ...,Machine,grover,13,Grover
...,...,...,...,...,...
10700,Players in the construction and building mater...,Human,grover,10700,Human
13710,A 68-year-old man has been sentenced to 90 day...,Human,grover,13710,Human
10675,A Carnival cruise ship helped the U.S. Coast G...,Human,grover,10675,Human
20319,"life-style,\nHUNTING for rocks might not sound...",Human,grover,20319,Human


### Processing reviews

In [33]:
reviews.sample(5)

Unnamed: 0,category,rating,label,text_
20454,Pet_Supplies_5,4.0,OR,"One of my dogs liked it, the other did not! S..."
6200,Sports_and_Outdoors_5,5.0,CG,Super comfortable. Hard to find a better qual...
26185,Kindle_Store_5,2.0,CG,I received this ebook at a discount in exchang...
22051,Pet_Supplies_5,3.0,OR,"Unfortunately, my 5 cats only found them fun f..."
13976,Movies_and_TV_5,5.0,OR,We are enjoying this series and are looking fo...


In [34]:
reviews['label'] = reviews['label'].apply(lambda x: 'Human' if x == 'OR' else 'Machine')
reviews.rename(columns = {'label':'Label', 'text_':'Text'}, inplace=True)
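The lambda in `In [34]` treats every code other than `OR` as machine-generated, which is fine as long as the column only ever contains `OR` and `CG`. If that is in doubt, an explicit mapping makes unexpected codes visible instead of silently labelling them `Machine`. A sketch on toy data; the `"??"` code and the `label_map` name are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"label": ["OR", "CG", "CG", "??"]})

# Explicit mapping: codes missing from the dict become NaN instead of "Machine".
label_map = {"OR": "Human", "CG": "Machine"}
toy["Label"] = toy["label"].map(label_map)

# Any row listed here carries a code the mapping did not anticipate.
unmapped = toy[toy["Label"].isna()]
print(unmapped)
```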

In [35]:
machine_reviews = reviews[reviews['Label'] == 'Machine'].sample(n = 10000)
human_reviews = reviews[reviews['Label'] == 'Human'].sample(n = 10000)
combined_reviews = pd.concat([machine_reviews, human_reviews])

In [36]:
combined_reviews['Original dataset'] = 'reviews'
combined_reviews['Row in original dataset'] = combined_reviews.index
combined_reviews = combined_reviews[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [37]:
combined_reviews['Model'] = "GPT-2"
combined_reviews.loc[combined_reviews['Label'] == 'Human', 'Model'] = 'Human'

In [38]:
combined_reviews

Unnamed: 0,Text,Label,Original dataset,Row in original dataset,Model
16594,Not a convenient size and looks a little cheap...,Machine,reviews,16594,GPT-2
30221,"Mysteries are usually good books, but I though...",Machine,reviews,30221,GPT-2
17121,a little tough to get into. The only problem i...,Machine,reviews,17121,GPT-2
11022,This product is perfect for what it is intende...,Machine,reviews,11022,GPT-2
502,Beautiful heavy weight dining room rug. The qu...,Machine,reviews,502,GPT-2
...,...,...,...,...,...
24514,Outstanding! I cannot wait to see what happens...,Human,reviews,24514,Human
24914,Mr. Reasoner's story of three different cultur...,Human,reviews,24914,Human
10607,Some obvious quality control issues. One batt...,Human,reviews,10607,Human
21696,I can't believe how much the dogs loved this t...,Human,reviews,21696,Human


### Processing wiki

In [39]:
wiki.sample(5)

Unnamed: 0.1,Unnamed: 0,id,url,title,wiki_intro,generated_intro,title_len,wiki_intro_len,generated_intro_len,prompt,generated_text,prompt_tokens,generated_text_tokens
50731,50731,1008089,https://en.wikipedia.org/wiki/Holy%20Brook,Holy Brook,The Holy Brook is a channel of the River Kenne...,The Holy Brook is a channel of water that orig...,2,157,196,200 word wikipedia style introduction on 'Holy...,water that originates in the White Mountains ...,22,226
11682,11682,46820994,https://en.wikipedia.org/wiki/Blutzeuge,Blutzeuge,"Blutzeuge (German for ""blood witness"") was a t...","Blutzeuge (German for ""blood witness"") was a N...",1,189,126,200 word wikipedia style introduction on 'Blut...,Nazi propaganda term for a person who reports...,30,142
56045,56045,14521980,https://en.wikipedia.org/wiki/Granger%20%28nam...,Granger (name),Granger is a surname of English and French ori...,Granger is a surname of English and Scottish o...,2,309,34,200 word wikipedia style introduction on 'Gran...,Scottish origin. Notable people with the surn...,24,53
41917,41917,30335473,https://en.wikipedia.org/wiki/All%20things,All things,"""all things"" is the seventeenth episode of the...","""all things"" is the seventeenth episode of the...",2,198,136,200 word wikipedia style introduction on 'All ...,the first season of the American television s...,25,166
148977,148977,207746,https://en.wikipedia.org/wiki/Plagiaulax,Plagiaulax,Plagiaulax is a genus of mammal from the Lower...,Plagiaulax is a genus of mammal from the famil...,1,153,82,200 word wikipedia style introduction on 'Plag...,the family Eulipotyphlopidae. The genus conta...,29,123


In [40]:
wiki = wiki.sample(n = 20000)

In [41]:
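# First half of the shuffled rows supplies the human-written intros
human_wiki = wiki.iloc[:10000].copy()
# Second half supplies the generated intros, so no article is used twice
machine_wiki = wiki.iloc[10000:].copy()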
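Each wiki row holds both a human-written and a generated intro for the same article. Splitting the shuffled sample into two halves means every article contributes only one of the two texts. A small sketch that checks the halves are disjoint, using a toy frame with the same two column names; the sizes and seed are assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "wiki_intro": [f"human intro {i}" for i in range(10)],
    "generated_intro": [f"generated intro {i}" for i in range(10)],
})

# Shuffle, then split the sample into a human half and a machine half.
shuffled = toy.sample(n=6, random_state=0)
human_half = shuffled.iloc[:3]
machine_half = shuffled.iloc[3:]

# An empty intersection means no article appears under both labels.
overlap = human_half.index.intersection(machine_half.index)
print(len(overlap))
```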

In [42]:
human_wiki.rename(columns = {'wiki_intro':'Text'}, inplace = True)
machine_wiki.rename(columns = {'generated_intro':'Text'}, inplace = True)

In [43]:
human_wiki['Label'] = 'Human'
machine_wiki['Label'] = 'Machine'

In [44]:
combined_wiki = pd.concat([human_wiki, machine_wiki])

In [45]:
combined_wiki['Original dataset'] = 'wiki'
combined_wiki['Row in original dataset'] = combined_wiki.index
combined_wiki = combined_wiki[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [47]:
combined_wiki['Model'] = "GPT-3 (Curie)"
combined_wiki.loc[combined_wiki['Label'] == 'Human', 'Model'] = 'Human'

In [48]:
combined_wiki

Unnamed: 0,Text,Label,Original dataset,Row in original dataset,Model
108221,"Decatur may refer to a number of places, stree...",Human,wiki,108221,Human
27528,"Joseph Jacques Omer Plante (; January 17, 1929...",Human,wiki,27528,Human
79953,"Michael Dana Butcher (born May 10, 1965) is an...",Human,wiki,79953,Human
113863,The Winnipeg Art Gallery (WAG) is an art museu...,Human,wiki,113863,Human
123850,Andrea Fantoni (1659–1734) was an Italian scul...,Human,wiki,123850,Human
...,...,...,...,...,...
72827,The Battle of Lewisham took place on 8 May 180...,Machine,wiki,72827,GPT-3 (Curie)
106421,"Asher Wright (September 7, 1803 – April 21, 18...",Machine,wiki,106421,GPT-3 (Curie)
16338,LimeWire is a discontinued free software peer-...,Machine,wiki,16338,GPT-3 (Curie)
53934,The term Diocese of Canada may refer to: \n\n1...,Machine,wiki,53934,GPT-3 (Curie)


### Putting it all together

In [51]:
combined = pd.concat([combined_essays, combined_grover, combined_reviews, combined_wiki])
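The per-source frames do not share a column order (`Model` is the third column for essays but the last for the others). `pd.concat` matches columns by name rather than by position, so the values still land in the right columns. A toy sketch of that behaviour; `ignore_index=True` is an addition here, not something the notebook uses:

```python
import pandas as pd

a = pd.DataFrame({"Text": ["x"], "Label": ["Human"], "Model": ["Human"]})
b = pd.DataFrame({"Text": ["y"], "Model": ["Grover"], "Label": ["Machine"]})

# Columns are matched by name; the result follows the first frame's order.
both = pd.concat([a, b], ignore_index=True)
print(both)
```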

In [52]:
combined.to_csv('combined_data.csv', index = False)

In [53]:
combined['Label'].value_counts()

Label
Machine    40000
Human      40000
Name: count, dtype: int64

In [54]:
combined['Original dataset'].value_counts()

Original dataset
essays     20000
grover     20000
reviews    20000
wiki       20000
Name: count, dtype: int64
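The two `value_counts` calls check the label totals and the per-dataset totals separately, which does not by itself show that every dataset is balanced internally. A cross-tabulation checks each dataset and label pair at once. A sketch on toy data with the same column names; `is_balanced` is a hypothetical variable:

```python
import pandas as pd

toy = pd.DataFrame({
    "Original dataset": ["essays", "essays", "grover", "grover"],
    "Label": ["Human", "Machine", "Human", "Machine"],
})

# Rows are source datasets, columns are labels, cells are row counts.
table = pd.crosstab(toy["Original dataset"], toy["Label"])

# Balanced means every cell holds the same count.
is_balanced = bool((table.values == table.values[0, 0]).all())
print(table)
```

On the real `combined` frame, every cell would be expected to equal 10,000.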

In [55]:
combined

Unnamed: 0,Text,Label,Model,Original dataset,Row in original dataset
13355,While driverless cars present many promising b...,Machine,darragh_claude_v7,essays,13355
7249,Homework Clubs: The Key to Unlocking Academic ...,Machine,llama2_chat,essays,7249
2603,"""The legalization of marijuana has been a cont...",Machine,falcon_180b_v1,essays,2603
3993,Taking the opportunity to learn new things can...,Machine,mistral7binstruct_v1,essays,3993
3773,Working with a partner is an effective way fo...,Machine,mistral7binstruct_v2,essays,3773
...,...,...,...,...,...
72827,The Battle of Lewisham took place on 8 May 180...,Machine,GPT-3 (Curie),wiki,72827
106421,"Asher Wright (September 7, 1803 – April 21, 18...",Machine,GPT-3 (Curie),wiki,106421
16338,LimeWire is a discontinued free software peer-...,Machine,GPT-3 (Curie),wiki,16338
53934,The term Diocese of Canada may refer to: \n\n1...,Machine,GPT-3 (Curie),wiki,53934
