# Listing 01 - Load, join, clean data

Before running this notebook, download all of the datasets discussed in the book into your local `data/` directory.

The corresponding Python scripts for this notebook are:
* [CaseStudy_4.1_00-01.py Data Prep](CaseStudy_4.1_00-01.py)
* [CaseStudy_4.1_00-02.py Summarization](CaseStudy_4.1_00-02.py)


In [None]:
import pandas as pd
import numpy as np
import random


In [None]:
df1 = pd.read_csv("data/llm-detect-ai-generated-text/train_essays.csv")
df2 = pd.read_csv("data/train_v2_drcat_02.csv")
df3 = pd.read_csv("data/Training_Essay_Data.csv")

df1["source"] = "LA-Lab"
df2["source"] = "Darek"
df3["source"] = "Sunil"

In [None]:
# Ensure that all data sets have the same column for the target
df2["generated"] = df2["label"]

In [None]:
cols = ["source", "text", "generated"]
df1 = df1.loc[:,cols]
df2 = df2.loc[:,cols]
df3 = df3.loc[:,cols]

df = pd.concat([df1, df2, df3], ignore_index=True)
records = len(df)

print(f"Joined dataset contains {records} records")
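
The alignment and join steps above can be sketched on toy data. This is a minimal illustration, not the book's code: the frames `a`, `b`, `c`, their contents, and the source names "A", "B", "C" are made up, and the explicit check for missing columns is an added assumption about what is useful here.

```python
import pandas as pd

# Toy stand-ins for the three sources (hypothetical rows and columns).
a = pd.DataFrame({"text": ["essay one"], "generated": [0], "id": ["x1"]})
b = pd.DataFrame({"text": ["essay two"], "label": [1]})
c = pd.DataFrame({"text": ["essay three"], "generated": [1]})

# Rename the differing target column so every frame uses "generated".
b = b.rename(columns={"label": "generated"})

cols = ["source", "text", "generated"]
frames = []
for name, frame in [("A", a), ("B", b), ("C", c)]:
    frame = frame.assign(source=name)
    # Fail early with a readable message if a source lacks an expected column.
    missing = set(cols) - set(frame.columns)
    if missing:
        raise KeyError(f"{name} is missing columns: {sorted(missing)}")
    frames.append(frame[cols])

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Selecting the same ordered column list from every frame before `pd.concat` keeps extra columns (such as `id` in the toy data) out of the result.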

In [None]:
# Drop duplicates
df.drop_duplicates(subset=['text'], keep='first', inplace=True, ignore_index=True)

new_records = len(df)
print("Dropped", records-new_records, "Records")

# Add a uniform random value in [0, 1) to every row
df['RANDOM'] = [random.random() for _ in range(len(df))]
df.to_csv("data/complete_dataset.csv", index=False)
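
Two details of the cell above may be worth seeing on a small example: `keep='first'` means the order of concatenation decides which copy of a duplicated text survives, and the random column changes on every run unless a seed is set. The sketch below uses made-up rows, and the seeded generator (with the arbitrary seed 42) is an assumption about what a reproducible variant might look like, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the same text appears under two sources
# with conflicting labels.
toy = pd.DataFrame({
    "source": ["A", "B", "B"],
    "text": ["same essay", "same essay", "other essay"],
    "generated": [0, 1, 1],
})

# keep="first" retains the earliest row for each distinct text,
# so the row from source "A" wins here.
deduped = toy.drop_duplicates(subset=["text"], keep="first", ignore_index=True)

# Assumption: a fixed seed makes the RANDOM column repeatable across runs.
rng = np.random.default_rng(42)
deduped["RANDOM"] = rng.random(len(deduped))
print(deduped)
```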

# Listing 02 - Summarization

A quick analysis to understand the contents of the dataset.

In [None]:
!pip install tabulate

In [None]:
# Character count per essay
df["chars"] = df['text'].apply(len)

def word_count(t):
    # Count the space-separated tokens in the text
    wds = t.split(" ")
    return len(wds)

df["words"] = df['text'].apply(word_count)

def word_len(t):
    # Mean length of the space-separated tokens in the text
    wds = t.split(" ")
    lens = [len(x) for x in wds]
    return np.mean(lens)

df["avg_wd"] = df['text'].apply(word_len)

df["creator"] = np.where(df['generated']==1,"GenAI","Human")

summary = (
    df.groupby(["source", "creator"])
    .agg({"generated": "count", "chars": "mean", "words": "mean", "avg_wd": "mean"})
    .reset_index()
)
summary = summary.round(1)

summary.columns = ["Data", "Origin", "Records", "Avg Chrs", "Avg Wds", "Avg Wd Len"]


# Display DataFrame as a Markdown Table
markdown_table = summary.to_markdown(index=False)
print(markdown_table)
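
When reading the word statistics above, it may help to remember that `split(" ")` treats only a single space as a separator. The sketch below, using made-up strings, compares it with `split()` called without arguments, which splits on any run of whitespace. It illustrates the difference only and does not suggest that the notebook's figures were computed any other way.

```python
# Hypothetical sample strings, chosen to expose the two behaviours.
multi_line = "First line here.\nSecond line."
double_space = "a  b"

# Splitting on a single space leaves "here.\nSecond" as one token
# and yields an empty string between two consecutive spaces.
print(multi_line.split(" "))
print(double_space.split(" "))

# Splitting with no argument treats newlines and repeated spaces as separators.
print(multi_line.split())
print(double_space.split())
```

For essays with many paragraph breaks, the space-only split will therefore report slightly fewer and slightly longer words than a whitespace split would.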

# Listing 03 - Feature Engineering

We apply very simple feature engineering for our baseline model, using a Python command-line application that calculates text statistics and adds them as new columns. We install this library and then run it as a Bash command in the following cell.

In [None]:
!pip install texturizer

In [None]:
%%bash
# BASH SCRIPT TO PROCESS DATA WITH SIMPLE FEATURES USING TEXTURIZER

input=data/complete_dataset.csv
output=data/complete_with_features.csv
texturizer -columns=text -literacy "$input" > "$output"


Take a look at the contents of the file we just generated.

In [None]:
!head data/complete_with_features.csv
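
Because essays usually contain line breaks, `head` may show only part of the first record. One alternative, sketched here under assumptions, is to let pandas parse the first few rows. The temporary file and its two rows are stand-ins so the example runs on its own; in the notebook the path would be `data/complete_with_features.csv`.

```python
import os
import tempfile

import pandas as pd

# Stand-in CSV (hypothetical content) with a quoted field containing
# both a comma and a newline.
path = os.path.join(tempfile.mkdtemp(), "preview_demo.csv")
pd.DataFrame(
    {"text": ["a, quoted\nessay", "another"], "generated": [0, 1]}
).to_csv(path, index=False)

# nrows limits parsing to the first records instead of the first raw lines.
preview = pd.read_csv(path, nrows=5)
print(preview.shape)
print(preview.columns.tolist())
```

Listing `preview.columns` is also a quick way to see which feature columns were appended to the original ones.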