In [2]:
import numpy as np
import pandas as pd
import os

# Data Import

given_essays is the official Kaggle train dataset and PaLM_generated_essays is a dataset that is exclusively LLM generated. The two datasets could be concatenated to make our training dataset, in which the proportion of student-written to LLM-generated essays would be approximately equal.

In [3]:
given_essays = pd.read_csv('train_essays.csv')
PaLM_generated_essays = pd.read_csv('LLM_generated_essay_PaLM.csv')
# essays = pd.concat([given_essays, PaLM_generated_essays])
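
A minimal sketch of the concatenation described above, using two toy frames in place of the real files. The column names `text` and `generated` are assumptions about the CSV schema and are not confirmed anywhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for given_essays and PaLM_generated_essays.
# The 'text' and 'generated' column names are assumed for illustration.
toy_given = pd.DataFrame({'text': ['essay a', 'essay b'], 'generated': [0, 0]})
toy_palm = pd.DataFrame({'text': ['essay c', 'essay d'], 'generated': [1, 1]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
toy_essays = pd.concat([toy_given, toy_palm], ignore_index=True)

# share of each class in the combined data
class_share = toy_essays['generated'].value_counts(normalize=True)
print(class_share)
```

Checking the class share right after concatenating is a quick way to confirm whether the combined data is roughly balanced.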

ArguGPT is a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources. The prompts are not the same as in the Kaggle dataset, but it covers a nice variety of LLMs. It needs to be combined with a human-written dataset. More info: https://www.kaggle.com/datasets/alejopaullier/argugpt

In [4]:
# os.path.join keeps the path portable across operating systems
argugpt_essays = pd.read_csv(os.path.join('ArguGPT', 'argugpt.csv'))

The DAIGT data is given in 4 csv files with over 30,000 rows in each. I am concatenating all the files here. The data comprises:
* Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
* Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
* Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
* Text generated with ChatGPT by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
* Official train essays
* Essays generated with various LLMs 


Kinda the ultimate dataset. More info here: https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset/

Note: Label answers the question "is it AI generated", so 0 is student written and 1 is AI generated.

In [5]:
# read every file in the DAIGT folder and stack them into one frame with a fresh index
daigt = pd.concat(
    [pd.read_csv(os.path.join('DAIGT', file)) for file in os.listdir('DAIGT')],
    ignore_index = True
)

In [6]:
daigt

Unnamed: 0,text,label,source,fold,essay_id,prompt
0,There are alot reasons to keep our the despise...,0,persuade_corpus,2,,
1,Driving smart cars that drive by themself has ...,0,persuade_corpus,4,,
2,"Dear Principal,\n\nI believe that students at ...",0,persuade_corpus,0,,
3,"Dear Principal,\n\nCommunity service should no...",0,persuade_corpus,0,,
4,My argument for the development of the driverl...,0,persuade_corpus,3,,
...,...,...,...,...,...,...
159451,"""Oh man I didn't make the soccer team!"", yelle...",0,persuade_corpus,7,F7341069C4A4,
159452,I believe that using this technology could be ...,0,persuade_corpus,8,AFE6E553DAC2,
159453,The Face on Mars is a fascinating phenomenon t...,1,falcon_180b_v1,3,falcon_180b_v1_600,You have read the article 'Unmasking the Face ...
159454,Texting & Driving\n\nUsing your phone while dr...,0,persuade_corpus,1,A5F84C104693,


# Data Cleaning

* Create modified text column without any formatting issues that would interfere with our features
    * Get rid of new line syntax (\n\n)
    * Strip white space before or after essay
    * Still to do: get rid of backslashes before apostrophes (having trouble with that)

In [7]:
daigt['modified text'] = daigt['text'].str.replace("\n\n"," ").str.replace("\n","").str.strip()
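
A minimal sketch of one way to handle the cleaning steps listed above, including the backslash-before-apostrophe case, shown on a toy Series. The helper name `clean_text` is hypothetical. Unlike the cell above, it replaces a single newline with a space rather than deleting it, so words on adjacent lines are not joined together.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Hypothetical helper: normalize whitespace and unescape apostrophes."""
    # drop a literal backslash that precedes an apostrophe, e.g. didn\'t -> didn't
    text = text.replace("\\'", "'")
    # collapse any run of whitespace (newlines, tabs, spaces) to one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# toy essay containing newlines, padding, and an escaped apostrophe
toy = pd.Series(["  Dear Principal,\n\nI didn\\'t agree.\nThanks  "])
cleaned = toy.apply(clean_text)
print(cleaned[0])
```

Using a plain `str.replace` for the backslash avoids regex escaping issues, which may be where the trouble with this step came from.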

In [8]:
# adding column that lowercases, removes punctuation (digits are kept), and splits by word into lists for efficiency in later steps
daigt['split text no punctuation'] = daigt['modified text'].str.lower().str.replace(r'[^a-zA-Z\d ]+', '', regex = True).str.split(' ')

# EDA

Questions I would like to answer
* How many prompts are there? How many essays have their prompt attached?
* Are there equal proportions of each label?
* What is the average word count/sentence length for essays of each label?
* What is the average character count of words in an essay and does it vary by label?
* How can we characterize the distribution of word counts?
* Are there special characters/syntax in essays that we need to account for?
* Notable punctuation differences?

**Q: How many prompts are there? How many essays have their prompt attached?** \
A: There are 4880 unique prompts and about 22% of essays have prompts. Of the essays with prompts, there are ~27000 AI generated essays and ~7000 human generated essays (label 1 is AI generated). If we wanted to train a model with contents from the prompts we could narrow down to this subset and still have a good amount of data.

In [41]:
# number/distribution of unique prompts
daigt['prompt'].value_counts()

Some schools offer distance learning as an option for students to attend classes from home by way of online or video conferencing. Do you think students would benefit from being able to attend classes from home? Take a position on this issue. Support your response with reasons and examples.    297
When people ask for advice, they sometimes talk to more than one person. Explain why seeking multiple opinions can help someone make a better choice. Use specific details and examples in your response.

In [18]:
# proportion of null data by column
daigt.isnull().sum()/len(daigt)

text        0.000000
label       0.000000
source      0.000000
fold        0.000000
essay_id    0.208578
prompt      0.784818
dtype: float64

In [20]:
# count of essays with each label from those where the prompt is given
daigt.dropna(subset = ['prompt'])['label'].value_counts()

1    27049
0     7263
Name: label, dtype: int64

**Q: Are there equal proportions of each label?** \
A: No! There are significantly more non-AI generated texts (115,372 vs 44,084). Will have to consider sampling to even the amounts out in the model.

In [40]:
daigt['label'].value_counts()

0    115372
1     44084
Name: label, dtype: int64
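
A minimal sketch of one sampling option, downsampling the larger class to the size of the smaller one, shown on a toy frame. This is only one possible approach (upsampling or class weights are alternatives), and the toy data is made up for illustration.

```python
import pandas as pd

# Toy frame with the same imbalance direction as daigt (more 0s than 1s)
toy = pd.DataFrame({'text': [f'essay {i}' for i in range(10)],
                    'label': [0] * 7 + [1] * 3})

# size of the smallest class
n_min = toy['label'].value_counts().min()

# draw n_min rows from each label without replacement
balanced = toy.groupby('label').sample(n=n_min, random_state=0)
print(balanced['label'].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs.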

**Q: What is the average word count/sentence length for essays of each label?** \
A: The human generated essays have an average of 418 words, while the AI generated ones have an average of 317 words. This difference seems significant, but I will confirm with A/B testing whether it would make a good feature.

In [90]:
# adding word count column
daigt['word count'] = daigt['modified text'].str.split().apply(len)

# looking at the average word count of each label -- non AI generated seems to be longer in these cases.
daigt[['label', 'word count']].groupby('label').mean()

label,word count
0,417.590811
1,316.540877


**Q: What is the average character count of words in an essay and does it vary by label?** \
A: The student-written essays have shorter words by about half a character on average (4.41 vs 4.88). Will do A/B testing to see if this would be significant enough to be a feature.

In [86]:
# function to get average word length (in characters) per essay
def avg_word_length(x):
    # x is a list of words; total characters divided by number of words
    return sum(len(word) for word in x) / len(x)

In [104]:
# create new feature for average word length in characters (removing punctuation; digits are kept)
daigt['avg word length'] = daigt['modified text'].str.replace(r'[^a-zA-Z\d ]+', '', regex = True).str.split(' ').apply(avg_word_length)

In [105]:
# group by label to see if there is a difference
daigt[['label','avg word length']].groupby('label').mean()

label,avg word length
0,4.414253
1,4.880637


**Q: How can we characterize the distribution of word counts?** \
A:
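
A minimal sketch of how the word count distribution could be summarized per label, using a toy frame with made-up numbers. On the real data the same calls would run against the `word count` column of `daigt`.

```python
import pandas as pd

# toy word counts, made up for illustration only
toy = pd.DataFrame({'label': [0, 0, 0, 1, 1, 1],
                    'word count': [100, 400, 700, 250, 300, 350]})

# count, mean, std, min, quartiles, and max of word count for each label
summary = toy.groupby('label')['word count'].describe()
print(summary)

# sample skewness per label; values far from 0 suggest an asymmetric distribution
skew = toy.groupby('label')['word count'].skew()
print(skew)
```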

In [106]:
# find most common word of each essay
# remove punctuation and make lowercase
# we could modify this to remove certain words so that the words we find are more indicative of the essay type
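from collections import Counter

def most_common(lst):
    # Counter tallies every word in one pass; most_common(1) returns [(word, count)]
    return Counter(lst).most_common(1)[0][0]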

daigt['most common word'] = daigt['modified text'].str.lower().str.replace(r'[^a-zA-Z\d ]+', '', regex = True).str.split(' ').apply(most_common)

In [108]:
daigt[['label','most common word']].groupby('label').agg(pd.Series.mode)

label,most common word
0,the
1,and


Common AI words/phrases:
* To begin with
* Here are some examples of
* In conclusion
* It is important to note
* a few things to keep in mind
* in the realm of
* embark, embarked, delve, dive, discover, invaluable, relentless, groundbreaking, endeavor, enlightening, insights, esteemed, shed light, underscores, crucial, unlock, ensure, remember
* From… to …
* Whether it’s… or…, it seems…
* a dash of
* In today's world
* overuse of demonstrative pronouns: this, that, these, and those
* lists -- maybe comma count

In [11]:
# make a boolean indicator column for each of the overused words
# note: this is a substring match, so 'dive' also matches 'diverse'
words = ['embark', 'embarked', 'delve', 'dive', 'discover', 'invaluable', 'relentless', 'groundbreaking', 'endeavor',
         'enlightening', 'insights', 'esteemed', 'shed light', 'underscores', 'crucial', 'unlock', 'ensure', 'remember']
for i in words:
    daigt[i] = daigt['modified text'].str.lower().str.contains(i, regex = False)
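
The cell above matches substrings, so a short word such as 'dive' is also found inside 'diverse'. A minimal sketch of a whole-word alternative follows, shown on a toy Series. The helper name `whole_word_flag` is hypothetical, and whether whole-word matching is preferable here is a judgment call.

```python
import re
import pandas as pd

toy = pd.Series(['We dive into diverse topics.',
                 'A diverse set of views.',
                 'This should shed light on it.'])

def whole_word_flag(texts: pd.Series, phrase: str) -> pd.Series:
    """Hypothetical helper: True where phrase occurs as a whole word or phrase."""
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return texts.str.lower().str.contains(pattern, regex=True)

# whole-word match versus plain substring match for the same word
print(whole_word_flag(toy, 'dive').tolist())
print(toy.str.lower().str.contains('dive', regex=False).tolist())
```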

In [14]:
# look at proportion of essays in each group with the words
for i in words:
    print(daigt[['label',i]].groupby('label').mean())

         embark
label          
0      0.000546
1      0.002654
       embarked
label          
0      0.000104
1      0.000000
          delve
label          
0      0.000754
1      0.003539
           dive
label          
0      0.013539
1      0.058207
       discover
label          
0      0.036664
1      0.047954
       invaluable
label            
0        0.000719
1        0.022752
       relentless
label            
0        0.000104
1        0.001543
       groundbreaking
label                
0            0.000659
1            0.003062
       endeavor
label          
0      0.010158
1      0.013157
       enlightening
label              
0          0.000173
1          0.000318
       insights
label          
0      0.000789
1      0.035546
       esteemed
label          
0      0.000104
1      0.000295
       shed light
label            
0        0.000199
1        0.000590
       underscores
label             
0         0.000000
1         0.000227
        crucial
label       

**Q: Notable punctuation differences? Differences in the use of numbers?** \
A: 
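
A minimal sketch of how punctuation and number usage could be measured per essay, using a toy frame with made-up text. The column names `digit count`, `comma count`, and `punctuation count` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'label': [0, 1],
    'modified text': ['In 2020, I said: wow!', 'First, second, and third.'],
})

# number of digit characters in each essay
toy['digit count'] = toy['modified text'].str.count(r'\d')
# number of commas, a rough proxy for list-heavy writing
toy['comma count'] = toy['modified text'].str.count(',')
# number of characters that are neither word characters nor whitespace
toy['punctuation count'] = toy['modified text'].str.count(r'[^\w\s]')

# average of each count per label
print(toy.groupby('label')[['digit count', 'comma count', 'punctuation count']].mean())
```

Dividing each count by the essay's word count would make the features comparable across essays of different lengths.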

In [135]:
# Numbers: isolate numbers

label,dive
0,0.013053
1,0.058184


0         True
1         True
2         True
3         True
4         True
          ... 
159451    True
159452    True
159453    True
159454    True
159455    True
Name: modified text, Length: 159456, dtype: bool

To do: find a way to check for typos and get a typo count per essay.

# A/B Testing

Questions I would like to answer:
* Is prompt a good feature? Should we narrow down the dataset to only include rows with prompts?
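
A minimal sketch of how a difference between the two labels, such as the word count gap noted in the EDA, could be tested. It uses a permutation test on the difference of means, which is one option among several and not necessarily the test this notebook will end up using. The helper name `permutation_test` and the toy numbers are hypothetical.

```python
import numpy as np

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Hypothetical helper: two-sided permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # shuffle the pooled values and split them into groups of the original sizes
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# toy word counts, made up for illustration only
human = [420, 390, 450, 410, 430, 400, 440, 415]
ai = [310, 330, 300, 320, 325, 305, 315, 335]
observed, p_value = permutation_test(human, ai)
print(observed, p_value)
```

A small p-value here would suggest the gap is unlikely to come from random label assignment alone. On the full dataset, a random subsample per label keeps the loop fast.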

# Train Test Split

# Feature Engineering

# Model creation/testing