### Data processing

This notebook shows how we combine four datasets of human-generated and machine-generated text, drawn from different contexts, into a single dataset. To keep the final dataset balanced, we take 10,000 human-generated and 10,000 machine-generated samples from each source dataset and concatenate them, giving 80,000 rows in total.
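
The per-dataset sampling step can be sketched as a small helper. This is a minimal illustration, not the code used in the cells of this notebook: the function name `sample_balanced`, the `seed` parameter, and the toy data are assumptions introduced here. Passing a fixed `random_state` to `DataFrame.sample` makes the draw repeatable across runs.

```python
import pandas as pd

def sample_balanced(df, label_col, n_per_class, seed=0):
    """Draw n_per_class rows (without replacement) for each value of label_col."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Toy data standing in for one of the source datasets (hypothetical values).
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(10)],
    "Label": ["Human"] * 6 + ["Machine"] * 4,
})

balanced = sample_balanced(toy, "Label", n_per_class=3, seed=42)
print(balanced["Label"].value_counts())
```

Because the original index is preserved by `sample` and `concat`, the sampled rows can still be traced back to their position in the source frame.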

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
essays = pd.read_csv("essays.csv")
grover = pd.read_csv("grover.csv")
reviews = pd.read_csv("reviews.csv")
wiki = pd.read_csv("wiki.csv")

### Processing essays

In [3]:
essays.sample(5)

Unnamed: 0,text,prompt_name,source,RDizzl3_seven,Label
35310,I recently had an experience that taught me a ...,Seeking multiple opinions,chat_gpt_moth,False,Machine
13223,So you think that aliens created the face that...,The Face on Mars,persuade_corpus,True,Human
38819,"Hey there! As an 8th grader, I'm super excite...",Distance learning,llama2_chat,False,Machine
39303,Advantages of Limiting Car Usage\n\nLimiting c...,Car-free cities,radek_500,True,Machine
21019,"Dear TEACHER_NAME,\n\nI think that we should h...",Cell phones at school,persuade_corpus,False,Human


In [4]:
essays['source'].value_counts()

source
persuade_corpus                       25996
mistral7binstruct_v1                   2421
mistral7binstruct_v2                   2421
chat_gpt_moth                          2421
llama2_chat                            2421
kingki19_palm                          1384
train_essays                           1378
llama_70b_v1                           1172
falcon_180b_v1                         1055
darragh_claude_v6                      1000
darragh_claude_v7                      1000
radek_500                               500
NousResearch/Llama-2-7b-chat-hf         400
mistralai/Mistral-7B-Instruct-v0.1      400
cohere-command                          350
palm-text-bison1                        349
radekgpt4                               200
Name: count, dtype: int64

In [5]:
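# Draw 10,000 essays per class (sampled without replacement), then stack them.
# No random_state is set, so the selected rows differ from run to run.
machine_essays = essays[essays['Label'] == 'Machine'].sample(n = 10000)
human_essays = essays[essays['Label'] == 'Human'].sample(n = 10000)
combined_essays = pd.concat([machine_essays, human_essays])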

In [6]:
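# Keep only the text and label; .copy() avoids a SettingWithCopyWarning on the assignments below.
combined_essays = combined_essays[['text', 'Label']].copy()
combined_essays['Original dataset'] = 'essays'
# The index still holds each row's label in the loaded essays frame, so record it for traceability.
combined_essays['Row in original dataset'] = combined_essays.index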

In [7]:
combined_essays.rename(columns = {'text':'Text'}, inplace = True)

In [8]:
combined_essays

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
26613,The idea of graduating high school in three ye...,Machine,essays,26613
26326,"Hey, I'm so excited to write this essay about ...",Machine,essays,26326
30579,Introduction\n\nSelf-reliance is a concept tha...,Machine,essays,30579
33547,"Sure, here's my attempt at writing an essay as...",Machine,essays,33547
33768,The legalization of marijuana is a highly deba...,Machine,essays,33768
...,...,...,...,...
6212,This is why you should join the seagoing cowbo...,Human,essays,6212
18030,As I admit that the concept of a driverless ve...,Human,essays,18030
12996,"In the article ""unmasking the face on mars"" pe...",Human,essays,12996
635,Cell phones have become very popular over the ...,Human,essays,635


### Processing grover

In [9]:
grover.sample(5)

Unnamed: 0.1,Unnamed: 0,article,domain,title,date,authors,ind30k,url,label,orig_split,split,random_score,top_p
18684,18684,Afghan security forces gather Tuesday near Bag...,courthousenews.com,3 US Soldiers and a Contractor Killed in Afgha...,2019-04-09,Associated Press,26816,https://www.courthousenews.com/3-u-s-soldiers-...,human,train_burner,test,-0.066292,
20246,20246,It's tempting to apply to a new job in the hop...,fool.com,4 Signs You're Applying to a Job for the Wrong...,2019-04-15,"Maurie Backman, Because They Often Aren'T, She...",5311,https://www.fool.com/careers/2019/04/15/4-sign...,machine,gen,test,,0.939683
9921,9921,"(Front) Chris Evans, Director Joe Russo, Brie ...",nbcmiami.com,‘Avengers: Endgame’ Has Sold Nearly Twice as M...,2019-04-09,,23926,https://www.nbcmiami.com/entertainment/enterta...,human,train_burner,train,2.403799,
9558,9558,One person is dead and a firefighter is injure...,necn.com,"1 Killed, Firefighter Injured in Fall River Blaze",2019-04-06,,7033,https://www.necn.com/news/new-england/One-Dead...,machine,gen,train,,0.939683
17171,17171,U.S. Treasurer Jovita Carranza speaks at a for...,nbcphiladelphia.com,Trump Taps Carranza as Small Business Administ...,2019-04-04,,20300,https://www.nbcphiladelphia.com/news/national-...,human,train_burner,test,-0.401165,


In [10]:
grover['label'].value_counts()

label
human      15000
machine    10000
Name: count, dtype: int64

In [11]:
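# grover has exactly 10,000 machine-generated articles, so keep them all
# and downsample the 15,000 human-written articles to match.
machine_grover = grover[grover['label'] == 'machine']
human_grover = grover[grover['label'] == 'human'].sample(n = 10000)
combined_grover = pd.concat([machine_grover, human_grover])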

In [12]:
combined_grover['Label'] = combined_grover['label'].str.capitalize()
combined_grover['Original dataset'] = 'grover'
combined_grover['Row in original dataset'] = combined_grover.index
combined_grover.rename(columns = {'article':'Text'}, inplace = True)

In [13]:
combined_grover = combined_grover[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [14]:
combined_grover

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
1,"When I first started looking at oil assets, I ...",Machine,grover,1
4,The Kingston and St Andrew Municipal Corporati...,Machine,grover,4
5,"news, local-news,\nTwo protests against an inv...",Machine,grover,5
7,Drink-driving charges brought against Danny Dr...,Machine,grover,7
13,Called by the Mirror after today's strong new ...,Machine,grover,13
...,...,...,...,...
21069,"What's happening: Next month in San Francisco,...",Human,grover,21069
19792,PSG will be crowned champions on Wednesday if ...,Human,grover,19792
16248,"Nov. 16, 1929 – March 29, 2019\nSister Elizabe...",Human,grover,16248
20646,"""Through his band Frightened Rabbit, so many p...",Human,grover,20646


### Processing reviews

In [15]:
reviews.sample(5)

Unnamed: 0,category,rating,label,text_
17207,Tools_and_Home_Improvement_5,5.0,CG,Awesome little cutter. Used it to cut the lon...
27750,Kindle_Store_5,4.0,CG,Jessica McClain is the daughter of an island b...
4755,Sports_and_Outdoors_5,5.0,OR,"Best tasting trail food we have ever tried, my..."
38953,Clothing_Shoes_and_Jewelry_5,5.0,OR,This cross is simple but elegant and just the ...
20808,Pet_Supplies_5,5.0,OR,We've used this for our previous dogs and now ...


In [16]:
reviews['label'] = reviews['label'].apply(lambda x: 'Human' if x == 'OR' else 'Machine')
reviews.rename(columns = {'label':'Label', 'text_':'Text'}, inplace=True)
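
Note that the mapping in the cell above treats every value other than 'OR' as machine-generated, including missing values or any unexpected code. A stricter variant is sketched below under stated assumptions: `LABEL_MAP` and `map_review_labels` are hypothetical names, and the mapping covers only the two codes visible in the sample output ('OR' and 'CG').

```python
import pandas as pd

# Hypothetical explicit mapping; 'OR' and 'CG' are the only codes seen in the sample above.
LABEL_MAP = {"OR": "Human", "CG": "Machine"}

def map_review_labels(labels):
    """Map raw review codes to 'Human'/'Machine', failing loudly on anything else."""
    mapped = labels.map(LABEL_MAP)          # unmapped codes become NaN
    unknown = labels[mapped.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected label codes: {list(unknown)}")
    return mapped

# Toy input standing in for the raw label column of reviews.csv.
toy_labels = pd.Series(["OR", "CG", "CG", "OR"])
print(map_review_labels(toy_labels).tolist())
```

If this were applied to the raw column before renaming, a malformed or missing label would raise an error instead of silently being counted as machine-generated.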

In [17]:
machine_reviews = reviews[reviews['Label'] == 'Machine'].sample(n = 10000)
human_reviews = reviews[reviews['Label'] == 'Human'].sample(n = 10000)
combined_reviews = pd.concat([machine_reviews, human_reviews])

In [18]:
combined_reviews['Original dataset'] = 'reviews'
combined_reviews['Row in original dataset'] = combined_reviews.index
combined_reviews = combined_reviews[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [19]:
combined_reviews

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
36938,I LOVE this top & it has the quality and comfo...,Machine,reviews,36938
15675,"got this for my son, who has arthritis. He sa...",Machine,reviews,15675
10038,I bought this receiver to use with my cell pho...,Machine,reviews,10038
35443,"Fun family game. With this set, you can play ...",Machine,reviews,35443
33998,Easy and fun to build. We have had the wooden ...,Machine,reviews,33998
...,...,...,...,...
367,This is a beautiful table runner and looks gre...,Human,reviews,367
2860,I've made candy with these several times alrea...,Human,reviews,2860
17267,Posting was misleading in that it stated it to...,Human,reviews,17267
39875,I do have to say that I received this product ...,Human,reviews,39875


### Processing wiki

In [20]:
wiki.sample(5)

Unnamed: 0.1,Unnamed: 0,id,url,title,wiki_intro,generated_intro,title_len,wiki_intro_len,generated_intro_len,prompt,generated_text,prompt_tokens,generated_text_tokens
118440,118440,7408300,https://en.wikipedia.org/wiki/San%20Andres%2C%...,"San Andres, Manila","San Andres (also San Andres Bukid, bukid being...","San Andres (also San Andres Bukid, bukid meani...",3,151,218,200 word wikipedia style introduction on 'San ...,"meaning ""mountain"" in the Tagalog language) i...",31,300
15531,15531,20874905,https://en.wikipedia.org/wiki/Lisa%20Wickham,Lisa Wickham,Lisa Wickham is a media producer-director-TV p...,Lisa Wickham is a media producer-director-TV p...,2,205,229,200 word wikipedia style introduction on 'Lisa...,. Wickham has worked in a number of positions ...,28,297
35585,35585,59558024,https://en.wikipedia.org/wiki/Miss%20Nepal%202019,Miss Nepal 2019,Hidden Treasures Ruslan FM 95.2 Miss Nepal 201...,Hidden Treasures Ruslan FM 95.2 Miss Nepal 201...,3,158,77,200 word wikipedia style introduction on 'Miss...,2019\n\nThe Miss Nepal 2019 pageant is a nati...,27,97
65599,65599,6699713,https://en.wikipedia.org/wiki/Thekla%20Reuten,Thekla Reuten,Thekla Simona Gelsomina Reuten (born 16 Septem...,Thekla Simona Gelsomina Reuten (born 16 Septem...,2,174,166,200 word wikipedia style introduction on 'Thek...,1984) is a German singer and songwriter. She ...,33,239
64663,64663,360506,https://en.wikipedia.org/wiki/History%20of%20J...,History of Jerusalem,"During its long history, Jerusalem has been at...","During its long history, Jerusalem has been sa...",3,188,247,200 word wikipedia style introduction on 'Hist...,sacked and destroyed numerous times by invadi...,24,300


In [21]:
wiki = wiki.sample(n = 20000)

In [22]:
# wiki was shuffled by sample() above, so the two halves are disjoint random sets of articles.
human_wiki = wiki.iloc[:10000].copy()
machine_wiki = wiki.iloc[10000:].copy()

In [23]:
human_wiki.rename(columns = {'wiki_intro':'Text'}, inplace = True)
machine_wiki.rename(columns = {'generated_intro':'Text'}, inplace = True)

In [24]:
human_wiki['Label'] = 'Human'
machine_wiki['Label'] = 'Machine'

In [25]:
combined_wiki = pd.concat([human_wiki, machine_wiki])

In [26]:
combined_wiki['Original dataset'] = 'wiki'
combined_wiki['Row in original dataset'] = combined_wiki.index
combined_wiki = combined_wiki[['Text', 'Label', 'Original dataset', 'Row in original dataset']]
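
One property of this split is that each sampled article contributes either its human-written intro or its generated intro, never both. The sketch below illustrates this on a toy frame; the frame contents and the `random_state` value are assumptions made for the example, not part of the processing above.

```python
import pandas as pd

# Toy frame with the same two text columns as wiki.csv (hypothetical values).
toy_wiki = pd.DataFrame({
    "title": [f"Article {i}" for i in range(6)],
    "wiki_intro": [f"human intro {i}" for i in range(6)],
    "generated_intro": [f"generated intro {i}" for i in range(6)],
})

# Shuffle once, then split positionally into two halves.
shuffled = toy_wiki.sample(n=6, random_state=0)
human_half = shuffled.iloc[:3].rename(columns={"wiki_intro": "Text"})
machine_half = shuffled.iloc[3:].rename(columns={"generated_intro": "Text"})

# No article index appears in both halves.
overlap = set(human_half.index) & set(machine_half.index)
print(overlap)
```

Keeping the halves disjoint means the final dataset never contains a human and a machine version of the same article.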

In [27]:
combined_wiki

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
34604,"Deborah Wiles (born May 5, 1953, Mobile, Alaba...",Human,wiki,34604
138105,"In computing, subject-oriented programming is ...",Human,wiki,138105
120916,"Richard Aaron Katz (March 13, 1924 – November ...",Human,wiki,120916
145880,The geography of Cornwall describes the extre...,Human,wiki,145880
71798,"David Lewis Fultz (May 29, 1875 – October 29, ...",Human,wiki,71798
...,...,...,...,...
116735,The 1934 WANFL season was the 50th season of t...,Machine,wiki,116735
101963,The Salle du Bel-Air or Salle du Bel-Air is a ...,Machine,wiki,101963
92244,Jasmine Ser Xiang Wei (born 24 September 1987)...,Machine,wiki,92244
84183,The Nantuo 181 class tug is a Chinese diesel-e...,Machine,wiki,84183


### Putting it all together

In [28]:
combined = pd.concat([combined_essays, combined_grover, combined_reviews, combined_wiki])

In [29]:
combined.to_csv('combined_data.csv', index = False)

In [30]:
combined['Label'].value_counts()

Label
Machine    40000
Human      40000
Name: count, dtype: int64

In [31]:
combined['Original dataset'].value_counts()

Original dataset
essays     20000
grover     20000
reviews    20000
wiki       20000
Name: count, dtype: int64
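
The two counts above confirm the totals per label and per dataset separately. To check that every dataset contributes the same number of rows for each label, the two columns can be cross-tabulated. This is a minimal sketch: `check_balance` and its parameters are hypothetical names introduced here, and the toy frame stands in for `combined`.

```python
import pandas as pd

def check_balance(df, expected_per_cell, labels=("Human", "Machine")):
    """Cross-tabulate dataset against label and verify every cell has the expected count."""
    counts = pd.crosstab(df["Original dataset"], df["Label"])
    # Make sure a label that is missing entirely shows up as a zero column.
    counts = counts.reindex(columns=list(labels), fill_value=0)
    if not (counts == expected_per_cell).all().all():
        raise ValueError(f"Unbalanced dataset:\n{counts}")
    return counts

# Toy frame with the same two columns as the combined dataset.
toy_combined = pd.DataFrame({
    "Original dataset": ["essays"] * 4 + ["wiki"] * 4,
    "Label": ["Human", "Human", "Machine", "Machine"] * 2,
})
print(check_balance(toy_combined, expected_per_cell=2))
```

On the real data this would be called as `check_balance(combined, 10000)`.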