### Data processing

This notebook shows how we combine four datasets of human-generated and machine-generated text, drawn from different contexts (essays, news articles, product reviews, and Wikipedia introductions), into a single dataset. To keep the final dataset balanced, we take 10,000 human-generated and 10,000 machine-generated samples from each source dataset and concatenate them, giving 80,000 rows in total.
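
The per-dataset sampling step can be sketched as one small helper. This is a minimal illustration rather than the code used in the cells below: the name `balanced_sample`, the `seed` parameter, and the toy data are assumptions introduced here. Passing a fixed `random_state` makes the draw repeatable, which the unseeded `.sample(n = 10000)` calls in this notebook are not.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=0):
    """Draw n_per_class rows for every value of label_col, reproducibly."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Toy data standing in for one source dataset (hypothetical values).
toy = pd.DataFrame({
    "Text": [f"doc {i}" for i in range(10)],
    "Label": ["Human"] * 6 + ["Machine"] * 4,
})

sample = balanced_sample(toy, "Label", n_per_class=3)
print(sample["Label"].value_counts())
```

The original index is preserved in the result, so it can still be recorded as the row number in the source dataset.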

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
essays = pd.read_csv("essays.csv")
grover = pd.read_csv("grover.csv")
reviews = pd.read_csv("reviews.csv")
wiki = pd.read_csv("wiki.csv")

### Processing essays

In [3]:
essays.sample(5)

Unnamed: 0,text,prompt_name,source,RDizzl3_seven,Label
11395,I dont't think that we should use the Facial A...,Facial action coding system,persuade_corpus,True,Human
20466,The Electoral College has changed between what...,Does the electoral college work?,persuade_corpus,True,Human
42447,"Dear senator, The electoral process is extreme...",Does the electoral college work?,train_essays,True,Human
25505,You're the president of your school. You decid...,Seeking multiple opinions,persuade_corpus,False,Human
39156,I think dress codes are beneficial to the scho...,Grades for extracurricular activities,llama_70b_v1,False,Machine


In [4]:
essays['source'].value_counts()

source
persuade_corpus                       25996
mistral7binstruct_v1                   2421
mistral7binstruct_v2                   2421
chat_gpt_moth                          2421
llama2_chat                            2421
kingki19_palm                          1384
train_essays                           1378
llama_70b_v1                           1172
falcon_180b_v1                         1055
darragh_claude_v6                      1000
darragh_claude_v7                      1000
radek_500                               500
NousResearch/Llama-2-7b-chat-hf         400
mistralai/Mistral-7B-Instruct-v0.1      400
cohere-command                          350
palm-text-bison1                        349
radekgpt4                               200
Name: count, dtype: int64

In [5]:
machine_essays = essays[essays['Label'] == 'Machine'].sample(n = 10000)
human_essays = essays[essays['Label'] == 'Human'].sample(n = 10000)
combined_essays = pd.concat([machine_essays, human_essays])

In [6]:
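# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
combined_essays = combined_essays[['text', 'Label']].copy()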
combined_essays['Original dataset'] = 'essays.csv'
combined_essays['Row in original dataset'] = combined_essays.index

In [7]:
combined_essays.rename(columns = {'text':'Text'}, inplace = True)
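
Because the combined file is later written with `index = False`, the `Row in original dataset` column is the only record of where each sample came from. A small sketch of how that column could be used to trace samples back to their source rows is given here; the toy frame and the variable names are hypothetical stand-ins for `essays.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the original essays table.
original = pd.DataFrame({
    "text": ["essay a", "essay b", "essay c", "essay d"],
    "Label": ["Human", "Machine", "Human", "Machine"],
})

# Same steps as the cells above, on the toy data.
subset = original.sample(n=2, random_state=0)[["text", "Label"]].copy()
subset["Row in original dataset"] = subset.index
subset = subset.rename(columns={"text": "Text"})

# Look each sampled row up in the original table by its saved index label.
recovered = original.loc[subset["Row in original dataset"], "text"]
matches = (recovered.to_numpy() == subset["Text"].to_numpy()).all()
print(matches)
```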

In [8]:
combined_essays

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
38586,Introduction:\nHey there! I came across your p...,Machine,essays.csv,38586
40740,"Dear [State Senator's Name],\n\nI hope this ...",Machine,essays.csv,40740
29120,Pakistan is truly an extraordinary place with ...,Machine,essays.csv,29120
42810,"This essay will analyze, discuss and prove one...",Machine,essays.csv,42810
33554,"Dear Principal Smith, \n\nI am writing to you ...",Machine,essays.csv,33554
...,...,...,...,...
13950,"dear TEACHER_NAME,\n\nI am writing to let you ...",Human,essays.csv,13950
43125,"Dear, state Senator Electoral College should b...",Human,essays.csv,43125
11808,This new technology to read emotional expressi...,Human,essays.csv,11808
42735,It's not a secret that we as humans use cars t...,Human,essays.csv,42735


### Processing grover

In [9]:
grover.sample(5)

Unnamed: 0.1,Unnamed: 0,article,domain,title,date,authors,ind30k,url,label,orig_split,split,random_score,top_p
11349,11349,(CN) – A multi-year judicial panel sparring wi...,courthousenews.com,EU Lawmakers OK Filter to Ease High Court Logjam,2019-04-09,William Dotinga,7431,https://www.courthousenews.com/eu-lawmakers-ok...,machine,gen,val,,0.939683
15975,15975,"If you spend most of your day typing, you need...",thenextweb.com,Review: The Keychron K1 and K2 are the wireles...,2019-04-02,Napier Lopez,27232,https://thenextweb.com/plugged/2019/04/02/revi...,human,train_burner,test,-0.667901,
18977,18977,opinion\nGovernment is established in a Republ...,allafrica.com,Gambia: Real Development Versus Cosmetic Devel...,2019-04-05,,8093,https://allafrica.com/stories/201904050416.html,human,gen,test,-0.008861,
4426,4426,Indie media company Oakhurst Entertainment has...,deadline.com,‘Hell Fest’ Co-Writer Seth Sherwood To Helm ‘B...,2019-04-08,Amanda N'Duka,22854,https://deadline.com/2019/04/hell-fest-seth-sh...,human,train_burner,train,-0.142662,
13240,13240,5370537724001\nThe incoming federal government...,sheppnews.com.au,Election winner must fix school funding,2019-04-17,,1105,https://www.sheppnews.com.au/@national-news/20...,human,gen,test,-2.050456,


In [10]:
grover['label'].value_counts()

label
human      15000
machine    10000
Name: count, dtype: int64

In [11]:
machine_grover = grover[grover['label'] == 'machine']
human_grover = grover[grover['label'] == 'human'].sample(n = 10000)
combined_grover = pd.concat([machine_grover, human_grover])

In [12]:
combined_grover['Label'] = combined_grover['label'].str.capitalize()
combined_grover['Original dataset'] = 'grover'
combined_grover['Row in original dataset'] = combined_grover.index
combined_grover.rename(columns = {'article':'Text'}, inplace = True)

In [13]:
combined_grover = combined_grover[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [14]:
combined_grover

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
1,"When I first started looking at oil assets, I ...",Machine,grover,1
4,The Kingston and St Andrew Municipal Corporati...,Machine,grover,4
5,"news, local-news,\nTwo protests against an inv...",Machine,grover,5
7,Drink-driving charges brought against Danny Dr...,Machine,grover,7
13,Called by the Mirror after today's strong new ...,Machine,grover,13
...,...,...,...,...
19575,Colorado lawmakers gave final approval Wednesd...,Human,grover,19575
8782,China on Wednesday said the issue of designati...,Human,grover,8782
11141,Opinion: the HSE has almost exhausted their 20...,Human,grover,11141
23389,Your Android phone could soon replace your har...,Human,grover,23389


### Processing reviews

In [15]:
reviews.sample(5)

Unnamed: 0,category,rating,label,text_
6637,Sports_and_Outdoors_5,5.0,OR,"Haven't had a chance to use this hippack yet, ..."
9595,Electronics_5,5.0,OR,"At this price point, these stands are sturdy, ..."
37366,Clothing_Shoes_and_Jewelry_5,4.0,CG,"Love this watch, very durable and looks great ..."
10460,Electronics_5,5.0,CG,"FAR FARRR FARRRRR From ""Professional"". I have..."
38634,Clothing_Shoes_and_Jewelry_5,5.0,CG,Like the hat a little small it's fine.Very goo...


In [16]:
reviews['label'] = reviews['label'].apply(lambda x: 'Human' if x == 'OR' else 'Machine')
reviews.rename(columns = {'label':'Label', 'text_':'Text'}, inplace=True)
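
The lambda above treats every code other than `OR` as machine-generated. A stricter alternative, sketched below, uses an explicit mapping and fails loudly on anything unexpected. It assumes `OR` and `CG` are the only codes, as in the sample shown above; `LABEL_MAP` and `map_review_labels` are names introduced for this illustration.

```python
import pandas as pd

# Assumption: these are the only two codes in the reviews data.
LABEL_MAP = {"OR": "Human", "CG": "Machine"}


def map_review_labels(labels):
    """Map review codes to 'Human'/'Machine', raising on unknown codes."""
    mapped = labels.map(LABEL_MAP)
    unknown = labels[mapped.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected label codes: {list(unknown)}")
    return mapped


toy_labels = pd.Series(["OR", "CG", "CG", "OR"])
mapped_labels = map_review_labels(toy_labels)
print(mapped_labels.tolist())
```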

In [17]:
machine_reviews = reviews[reviews['Label'] == 'Machine'].sample(n = 10000)
human_reviews = reviews[reviews['Label'] == 'Human'].sample(n = 10000)
combined_reviews = pd.concat([machine_reviews, human_reviews])

In [18]:
combined_reviews['Original dataset'] = 'reviews'
combined_reviews['Row in original dataset'] = combined_reviews.index
combined_reviews = combined_reviews[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [19]:
combined_reviews

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
3446,I am using them with a set of wide wooden legs...,Machine,reviews,3446
29982,Book arrived on time. We have the book on our...,Machine,reviews,29982
18255,"Looks great, but the quality is not as good as...",Machine,reviews,18255
6209,Pros\n1) Conforms to Milano Gel (Cordura) inst...,Machine,reviews,6209
22654,"My dog didn't show much interest in the food, ...",Machine,reviews,22654
...,...,...,...,...
4493,Works as described. Did not notice the honeyco...,Human,reviews,4493
30440,The book covers many ways to build and maintai...,Human,reviews,30440
31199,This years favorite book. I really wanted to u...,Human,reviews,31199
37163,I got these for my gf & she loves them. Perfec...,Human,reviews,37163


### Processing wiki

In [20]:
wiki.sample(5)

Unnamed: 0.1,Unnamed: 0,id,url,title,wiki_intro,generated_intro,title_len,wiki_intro_len,generated_intro_len,prompt,generated_text,prompt_tokens,generated_text_tokens
53497,53497,6489017,https://en.wikipedia.org/wiki/Crystal%20Computing,Crystal Computing,"Crystal Computing, later renamed Design Design...","Crystal Computing, later renamed Design Design...",2,221,201,200 word wikipedia style introduction on 'Crys...,a research project at Xerox PARC in the early...,24,238
56109,56109,33911743,https://en.wikipedia.org/wiki/Jason%20Wallace,Jason Wallace,Jason Wallace (born 1969) is an author living ...,"Jason Wallace (born 1969) is an author, speake...",2,163,178,200 word wikipedia style introduction on 'Jaso...,", speaker, and entrepreneur. Wallace is the co...",24,234
83920,83920,21539624,https://en.wikipedia.org/wiki/Tony%20Berti,Tony Berti,"Charles Anton Berti, Jr. (born June 21, 1972,...","Charles Anton Berti, Jr. (born June 15, 1953)...",2,185,199,200 word wikipedia style introduction on 'Tony...,"15, 1953) is an American actor and comedian.\...",27,300
307,307,2170212,https://en.wikipedia.org/wiki/Screeb,Screeb,Screeb is a small village in south-west Conne...,Screeb is a small village in the county of Cu...,1,158,215,200 word wikipedia style introduction on 'Scre...,"the county of Cumbria, England, close to the ...",25,299
20630,20630,24860133,https://en.wikipedia.org/wiki/Plazi,Plazi,Plazi is a Swiss-based international non-profi...,Plazi is a Swiss-based international non-profi...,1,184,188,200 word wikipedia style introduction on 'Plaz...,that aims to promote sustainability.\n\nPlazi...,27,250


In [21]:
wiki = wiki.sample(n = 20000)

In [22]:
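# Split the 20,000 sampled articles into two disjoint halves, so that no article
# contributes both its human-written and its generated introduction
human_wiki = wiki.iloc[:10000].copy()
machine_wiki = wiki.iloc[10000:].copy()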

In [23]:
human_wiki.rename(columns = {'wiki_intro':'Text'}, inplace = True)
machine_wiki.rename(columns = {'generated_intro':'Text'}, inplace = True)

In [24]:
human_wiki['Label'] = 'Human'
machine_wiki['Label'] = 'Machine'

In [25]:
combined_wiki = pd.concat([human_wiki, machine_wiki])
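
Each row of the wiki data holds both a human-written and a generated introduction for the same article, so the split above matters: the two halves should not share any articles. The sketch below checks that property on toy data; the frame contents and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for wiki.csv: one row per article, with both versions of the intro.
toy_wiki = pd.DataFrame({
    "wiki_intro": [f"human intro {i}" for i in range(8)],
    "generated_intro": [f"generated intro {i}" for i in range(8)],
})

# Sample once, then split by position so the halves cannot overlap.
shuffled = toy_wiki.sample(n=6, random_state=0)
human_half = shuffled.iloc[:3].rename(columns={"wiki_intro": "Text"})
machine_half = shuffled.iloc[3:].rename(columns={"generated_intro": "Text"})

# An empty intersection means no article appears in both halves.
overlap = human_half.index.intersection(machine_half.index)
print(len(overlap))
```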

In [26]:
combined_wiki['Original dataset'] = 'wiki'
combined_wiki['Row in original dataset'] = combined_wiki.index
combined_wiki = combined_wiki[['Text', 'Label', 'Original dataset', 'Row in original dataset']]

In [27]:
combined_wiki

Unnamed: 0,Text,Label,Original dataset,Row in original dataset
82246,"WIST may refer to: WIST-FM, a radio station (...",Human,wiki,82246
107829,Wilhelm Vöge (16 February 1868 – 30 December 1...,Human,wiki,107829
133832,Brood XXIII (also known as the Mississippi Val...,Human,wiki,133832
99298,"Eliza Taylor Ransom (May 31, 1863 – June 7, 19...",Human,wiki,99298
24841,St Bene't's is a Church of England parish chur...,Human,wiki,24841
...,...,...,...,...
80169,VisualSim Architect is an electronic system-le...,Machine,wiki,80169
127684,"Joel-Peter Witkin (born September 13, 1939) is...",Machine,wiki,127684
15268,A stock statement is a business statement that...,Machine,wiki,15268
119902,Pamela Beryl Harriman (née Digby; 20 March 192...,Machine,wiki,119902


### Putting it all together

In [28]:
combined = pd.concat([combined_essays, combined_grover, combined_reviews, combined_wiki])

In [29]:
combined.to_csv('combined_data.csv', index = False)

In [30]:
combined['Label'].value_counts()

Label
Machine    40000
Human      40000
Name: count, dtype: int64
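
The overall label counts do not show whether each source dataset is itself balanced. A cross-tabulation of source against label is one way to check that. The sketch below uses a toy frame with the same two column names; given the sampling above, each cell of the real table would be expected to hold 10,000.

```python
import pandas as pd

# Toy stand-in for the combined frame (one row per source/label pair).
toy_combined = pd.DataFrame({
    "Original dataset": ["essays.csv", "essays.csv", "grover", "grover",
                         "reviews", "reviews", "wiki", "wiki"],
    "Label": ["Human", "Machine"] * 4,
})

# Rows are source datasets, columns are labels, cells are row counts.
counts = pd.crosstab(toy_combined["Original dataset"], toy_combined["Label"])

# Balanced means every cell holds the same count.
is_balanced = (counts.to_numpy() == counts.to_numpy()[0, 0]).all()
print(counts)
print(is_balanced)
```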

In [None]:
combined['Original dataset']