In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

We're going to analyze a dataset of text excerpts written either by Large Language Models (LLMs) or by humans. The data is from [Kaggle](https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus), but it has been preprocessed somewhat:
- Only 10% of the data is present (sampled uniformly at random).
- The text fields have been `strip`ped of leading and trailing whitespace.
- The irrelevant columns have been removed.

The idea is to try to detect if the text was generated by an LLM. Let's explore the data a bit.

In [2]:
df = pd.read_csv('human_vs_llm.csv.gz')

Note that pandas can read compressed files natively. Let's look at the shape of the data and the columns.

In [3]:
df.shape

(78282, 2)

In [4]:
df.columns

Index(['text', 'source'], dtype='object')

In [5]:
df['source'].unique()

array(['LLM', 'Human'], dtype=object)

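So the `source` column acts as our class label: it tells us whether the text was generated by an LLM or written by a human being. In this preprocessed sample those are the only two values, so there is no `Unknown` label to get rid of. The filter two cells below still lists `Unknown`, which is harmless here because no row carries that value. Let's see the relative frequencies.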

In [6]:
(df['source'] == 'Human').sum()/df.shape[0]

0.44653943435272475

In [7]:
(~(df['source'].isin(['Human','Unknown']))).sum()/df.shape[0]

0.5534605656472752

The dataset looks fine: the split is roughly 45:55, so neither class is badly overrepresented.
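As a side note, the two ratios computed above can also be read off in a single call with `value_counts(normalize=True)`. The sketch below uses a small made-up series (`toy_source` is a hypothetical stand-in for `df['source']`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df['source']: 11 LLM texts and 9 human texts
toy_source = pd.Series(['LLM'] * 11 + ['Human'] * 9)

# normalize=True returns relative frequencies instead of raw counts
freq = toy_source.value_counts(normalize=True)
print(freq)
```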

In [8]:
df

Unnamed: 0,text,source
0,The Mongol Empire was governed by a civilian a...,LLM
1,Tito woke up with a headache. He called in sic...,Human
2,@Holt \nThanks for taking time to give me so ...,Human
3,The appendix does have use. It has a role in m...,LLM
4,A giant panda in Hong Kong called Ying Ying is...,LLM
...,...,...
78277,Fire Protection in Commercial and Industrial B...,Human
78278,Restaurants aren't required to list ingredient...,LLM
78279,"In the United States, their persistent legal c...",Human
78280,"They will pay you a percentage for your time, ...",Human


In [9]:
df['source'].value_counts()

source
LLM      43326
Human    34956
Name: count, dtype: int64

Let's perform a train-test split. By default this is done at a 3:1 ratio, which works well for us. Rows are selected randomly.

It is good practice to always explicitly define the seed so that we can repeat the experiments later.

In [10]:
SEED = 1234
df_train, df_test = train_test_split(df,random_state=SEED)
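To illustrate both points (the default 3:1 ratio and the effect of the seed), here is a small sketch on a made-up data frame. The name `toy_df` and its contents are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data frame with 100 rows, mimicking the two columns of df
toy_df = pd.DataFrame({'text': [f'doc {i}' for i in range(100)],
                       'source': ['LLM', 'Human'] * 50})

# Two splits with the same seed
a_train, a_test = train_test_split(toy_df, random_state=1234)
b_train, b_test = train_test_split(toy_df, random_state=1234)

# Default split: 75% of the rows for training, 25% for testing
print(len(a_train), len(a_test))

# The same seed selects the same rows in the same order
same_split = a_test.index.equals(b_test.index)
print(same_split)
```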

In [11]:
df_train

Unnamed: 0,text,source
33966,The colour of the sky is caused by sunlight li...,LLM
31900,The availability of the Bible in vernacular la...,LLM
48861,Front Line Employees and Service Quality Cours...,Human
27757,We study the power and limits of optimal dynam...,LLM
50355,sorry to say I did not have a good experience ...,LLM
...,...,...
55985,***India's grand plan to create world's longes...,Human
32399,The Face on Mars has long been a topic of fasc...,LLM
60620,This is a very sensitive topic so sorry for po...,LLM
34086,The American Way of Dining Out Research Paper\...,Human


In [12]:
df_test

Unnamed: 0,text,source
44700,SORRY ABOUT ALL THE ERRORS A BETTER COPY WILL ...,Human
65161,Like other posts...you get what you pay for......,LLM
45938,Counter-argument: The problem with cultural ap...,LLM
37709,Internet of Things (IoT) is the next big evolu...,LLM
4847,Learning new skills is an essential part of pe...,LLM
...,...,...
77824,is smiling and waving The camera pans around t...,LLM
60419,Inventory systems are generally software produ...,LLM
66993,Kate was walking on the sidewalk. She noticed ...,LLM
15468,"Venus, also known as the Earth's sister planet...",LLM


For our model to make *any* sense, it must beat the dummy classifier, which always predicts the most common class of the training set. Let us determine that class and see what accuracy we would get by predicting it for every text.

In [13]:
df_train['source'].value_counts()

source
LLM      32444
Human    26267
Name: count, dtype: int64

So the dummy classifier should predict that every text was generated by an LLM. Let's see the accuracy we can achieve with this.

In [14]:
acc_dummy = (df_test['source'] == 'LLM').sum()/df_test.shape[0]
acc_dummy

0.5560267743089264

As the data is almost 50:50, the dummy classifier is little better than tossing a coin.
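The same baseline is also available in scikit-learn as `DummyClassifier`. A minimal sketch on made-up labels follows; the `toy_*` arrays are assumptions for illustration, and in the notebook one would pass the real training and test data instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels; the most frequent training class is 'LLM'
toy_y_train = np.array(['LLM', 'LLM', 'LLM', 'Human', 'Human'])
toy_y_test = np.array(['LLM', 'Human', 'LLM', 'LLM'])

# The dummy classifier ignores the features, so placeholders suffice
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(toy_X_train, toy_y_train)

toy_pred = dummy.predict(toy_X_test)           # always the majority class
toy_acc = dummy.score(toy_X_test, toy_y_test)  # accuracy on the toy test set
print(toy_pred, toy_acc)
```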

Then we'll preprocess the data into bags of words using the `CountVectorizer` from scikit-learn.

The function `fit_transform` does two things: it fits the vectorizer to the data (builds a vocabulary that assigns each word an index in the count vector) and applies the transformation to that same dataset. This is equivalent to calling `fit` and then `transform`.

The function `transform` can only be applied after `fit` has been called: it will then map the known words to their indices and count them; unknown words (that were not encountered during `fit`) are ignored. This is appropriate for the test set.

In [15]:
cv = CountVectorizer()
X_train = cv.fit_transform(df_train['text'])
X_train

<58711x186107 sparse matrix of type '<class 'numpy.int64'>'
	with 12200852 stored elements in Compressed Sparse Row format>

In [16]:
X_test = cv.transform(df_test['text'])
X_test

<19571x186107 sparse matrix of type '<class 'numpy.int64'>'
	with 4016942 stored elements in Compressed Sparse Row format>
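To make the difference between `fit_transform` and `transform` concrete, here is a sketch on a toy corpus. The documents and the names `toy_cv`, `train_docs` and `test_docs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog sat down"]
test_docs = ["the bird sat"]  # "bird" does not occur in train_docs

toy_cv = CountVectorizer()
X_toy_train = toy_cv.fit_transform(train_docs)  # learns the vocabulary

# Vocabulary words ordered by their column index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# transform reuses the learned vocabulary; the unseen word "bird" is dropped
X_toy_test = toy_cv.transform(test_docs)
print(X_toy_test.toarray())
```

The test matrix has the same number of columns as the training matrix, which is why `X_train` and `X_test` above both have 186107 columns.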

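We will then convert the class labels, given as strings, into integer classes using the label encoder. `LabelEncoder` numbers the classes in sorted order, so here `Human` becomes 0 and `LLM` becomes 1. Strictly speaking, scikit-learn classifiers accept string labels as well, but numerical labels are a common convention.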

In [17]:
le = LabelEncoder()
y_train = le.fit_transform(df_train['source'])
y_train

array([1, 1, 0, ..., 1, 0, 1])

In [18]:
y_test = le.transform(df_test['source'])
y_test

array([0, 1, 1, ..., 1, 1, 0])
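A toy round trip through the encoder is a quick way to check which integer stands for which class. The three-element label list and the `toy_*` names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(['LLM', 'Human', 'LLM'])

# classes_ holds the labels in sorted order; the position is the integer code
print(toy_le.classes_)
print(toy_codes)

# inverse_transform maps the integer codes back to the original strings
toy_back = toy_le.inverse_transform(toy_codes)
print(toy_back)
```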

We will start by constructing a Bernoulli Naive Bayes classifier. The interface for all classifiers is relatively uniform in sklearn. We first `fit` the model to the data. The arguments are `X` and `y`: the observations as a matrix `X` (one observation per row, one feature per column), and the associated class labels as a vector `y`.

The data is already in this format, which we can check (the number of rows in `X` must match the length of `y`).

In [19]:
X_train.shape, y_train.shape

((58711, 186107), (58711,))

In [20]:
bnb = BernoulliNB()
bnb.fit(X_train,y_train)
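One detail worth knowing at this point: `BernoulliNB` models only whether a word occurs, not how often. With its default `binarize=0.0`, every count greater than zero is turned into 1 before fitting, so passing the raw counts from `CountVectorizer` is fine. A minimal sketch on a made-up count matrix (`toy_counts` and `toy_labels` are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word counts: 4 documents, 3 vocabulary words
toy_counts = np.array([[3, 0, 1],
                       [0, 2, 0],
                       [1, 0, 4],
                       [0, 5, 1]])
toy_labels = np.array([1, 0, 1, 0])

# Fit once on raw counts and once on explicit presence/absence indicators
nb_counts = BernoulliNB().fit(toy_counts, toy_labels)
nb_binary = BernoulliNB().fit((toy_counts > 0).astype(int), toy_labels)

# Both models learn the same per-class word probabilities
same_model = np.allclose(nb_counts.feature_log_prob_,
                         nb_binary.feature_log_prob_)
print(same_model)
```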

We can then `predict` the class labels of our test observations.

In [21]:
y_pred_bnb = bnb.predict(X_test)
y_pred_bnb

array([1, 1, 1, ..., 1, 1, 1])

Accuracy is simply the fraction of predictions that match the ground truth.

In [22]:
acc_bnb = (y_test == y_pred_bnb).sum()/y_test.shape[0]
acc_bnb

0.7103878187113587

This is clearly better than the dummy classifier (0.71 versus 0.56), but still not very reliable. Let's see if we can improve this by using a Multinomial Naive Bayes classifier. Its interface is exactly the same.

In [23]:
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
y_pred_mnb = mnb.predict(X_test)
acc_mnb = (y_test == y_pred_mnb).sum()/y_test.shape[0]
acc_mnb

0.7352204792805682

Slightly better! Let's take a more nuanced view of the results.

We say that a *positive* case is a text that was produced by an LLM, as this is consistent with the idea of "detecting LLM use".

We will then compute the true/false positives/negatives, and from them precision and recall, which are standard measures.
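For reference, with $TP$, $FP$, $FN$ and $TN$ denoting the numbers of true positives, false positives, false negatives and true negatives, the measures are defined as follows.

$$
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
$$

Precision answers "of the texts we flagged as LLM, how many really were?", while recall answers "of the texts that really were LLM, how many did we flag?".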

In [24]:
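# Map the integer labels back to strings so the comparisons below are readable
y_test_inv = le.inverse_transform(y_test)
y_pred_mnb_inv = le.inverse_transform(y_pred_mnb)
# Positive class: 'LLM'; negative class: 'Human'
tp = ((y_test_inv == 'LLM') & (y_pred_mnb_inv == 'LLM')).sum()
fp = ((y_test_inv == 'Human') & (y_pred_mnb_inv == 'LLM')).sum()
fn = ((y_test_inv == 'LLM') & (y_pred_mnb_inv == 'Human')).sum()
tn = ((y_test_inv == 'Human') & (y_pred_mnb_inv == 'Human')).sum()
acc = (tp + tn)/(tp + fp + tn + fn)
precision = tp/(tp + fp)  # fraction of texts flagged as LLM that really are LLM
recall = tp/(tp + fn)     # fraction of actual LLM texts that were flagged
acc, precision, recall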

(0.7352204792805682, 0.7207933064766037, 0.8549898915640507)

This suggests that we are fairly reliable at recovering texts that were generated by an LLM (recall of about 0.85), but the precision is lower (about 0.72), so too many texts get labelled as LLM-generated. Let's have a look at the confusion matrix.

In [25]:
np.array([[tp,fn],[fp,tn]])

array([[9304, 1578],
       [3604, 5085]])

|                     | **Predicted positive** | **Predicted negative** |
|---------------------|------------------------|------------------------|
| **Actual positive** | 9304                   | 1578                   |
| **Actual negative** | 3604                   | 5085                   |

The numbers above were entered manually, so if you run the code again, they might be inconsistent with what you get from NumPy, at least if you are using a different version of the libraries.

This confirms our expectations: the number of false positives (3604) is high compared to the number of false negatives (1578). So our model is trigger-happy about labelling human-written text as LLM-generated.
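The same quantities can also be obtained from `sklearn.metrics` instead of being computed by hand. The sketch below rebuilds label vectors from the four counts in the table purely for illustration (`y_true` and `y_hat` are hypothetical names). In the notebook itself one would pass `y_test` and `y_pred_mnb` directly, which works because the label encoder maps `LLM` to 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts copied from the confusion matrix above
tp, fn, fp, tn = 9304, 1578, 3604, 5085

# Rebuild label vectors (1 = LLM, 0 = Human) that reproduce these counts
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# Rows are actual classes, columns are predicted classes;
# labels=[1, 0] puts the positive class first, matching the table layout
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm)

# Both scores treat label 1 (LLM) as the positive class by default
prec = precision_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)
print(prec, rec)
```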