# Artificial Text Detection

This notebook demonstrates the process of training a Transformer-based model for artificial text detection.

We use the IMDB dataset as a source of human-written texts. From each text, we extract the first few words and use them to prompt the text generation model.

The steps involved are as follows:
1. Generating training and test sets using the `gpt2-small` model.
2. Generating an additional test set with the `gpt2-large` model.
3. Training the `DistilBERT` model to classify texts as human-written or machine-generated.
4. Investigating whether the detector trained on texts from the small model can detect texts generated by the large model.

Our main evaluation metric is **accuracy**.

This notebook runs in the Google Colab environment.

As a follow-up, we'll provide some suggestions on how to further enhance this code at the end of the session.

First, let's install all necessary packages, suppress warnings, and mount Google Drive.

In [None]:
!pip install transformers # supports Transformer-based models
!pip install datasets # datasets for experiments
!pip install evaluate # evaluation metrics for experiments
!pip install transformers[torch] # backend for training

In [None]:
from transformers.utils import logging

logging.set_verbosity_error()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# !mkdir '/content/drive/My Drive/atd'
# !mkdir '/content/drive/My Drive/atd/data'
# !mkdir '/content/drive/My Drive/atd/model'
output_path = '/content/drive/My Drive/atd'

Next, we import pandas to manipulate data and tqdm to track execution time, and we fix the random seed.

In [None]:
import pandas as pd # data manipulation & storage
from tqdm.auto import tqdm

In [None]:
from transformers import  set_seed # fix random seed
set_seed(0)

We are going to use two models:
* `gpt2-small` to generate the training data and the test set 1
* `gpt2-large` to generate the test set 2.

We use only one decoding strategy, top-$k$ sampling with $k = 50$, but feel free to experiment with it.

We set the minimal length of generated texts to 50 tokens.


In [None]:
model_idx = 'gpt2' # ID of the GPT2-small model
large_model_idx = 'gpt2-large' # ID of the GPT2-large model
decoding_strategy = {'min_length':50, 'top_k':50}  # the params of decoding strategy
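To make the `top_k` parameter concrete, here is a minimal pure-Python sketch of top-$k$ sampling over a toy vocabulary. The function `top_k_sample` and the logits are hypothetical illustrations, not the implementation used inside the Transformers library.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring entries of `logits`."""
    # keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # softmax restricted to the kept entries (subtract the max for stability)
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # draw one index in proportion to the renormalized probabilities
    return rng.choices(top, weights=weights, k=1)[0]

toy_vocab_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # hypothetical scores for 5 tokens
rng = random.Random(0)
samples = [top_k_sample(toy_vocab_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only the two best-scoring indices can appear
```

With `k=2`, only the tokens at indices 0 and 2 can ever be drawn; all other tokens get zero probability, no matter how often we sample.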

## Data generation

We utilize pipeline tools from the Transformers package. To generate texts with a pipeline, you need to specify the pipeline task (`text-generation`), provide the model identifier, and optionally, the GPU device identifier you intend to use.

For text generation, it's important to configure the padding token and its placement. Given that texts are generated from left to right, the padding should be positioned on the left. The purpose of padding is to standardize texts of varying lengths, enabling efficient batch processing.
Example:

[PAD] [PAD] [PAD] The

[PAD] [PAD] The cat

[PAD] The cat sat


Here, three prompts of different lengths are padded on the left so that they all have the same length.
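The padding example above can be reproduced with a few lines of plain Python. This is only a sketch: `pad_left` is a hypothetical helper working on word lists, whereas the real tokenizer works on subword IDs and does the padding for us.

```python
PAD = '[PAD]'

def pad_left(batch, width, pad_token=PAD):
    """Left-pad every token list in `batch` to `width` and build attention masks."""
    padded, mask = [], []
    for tokens in batch:
        gap = width - len(tokens)
        padded.append([pad_token] * gap + tokens)  # padding goes first
        mask.append([0] * gap + [1] * len(tokens))  # 0 marks padding positions
    return padded, mask

prompts = [['The'], ['The', 'cat'], ['The', 'cat', 'sat']]
padded, mask = pad_left(prompts, width=4)
for row in padded:
    print(' '.join(row))
```

The attention mask tells the model which positions are padding, so that the padding tokens do not influence the generated continuation.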

In [None]:
from transformers import pipeline # import pipeline tools

# initialize the pipeline
generator = pipeline('text-generation', model=model_idx, device=0)

# define which token should be used for padding
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

# define the placement of the padding token
generator.tokenizer.padding_side='left'

In [None]:
generator('The cat', do_sample=True, min_length=50) # example run on a single prompt

[{'generated_text': "The cat's name is known as Sapho, but her owners don't really know anything about her past. They say she'd been topless during a previous date with her husband.\n\nHer boyfriend told the man in charge of the relationship"}]

In [None]:
output = generator(['The cat', 'The dog'], do_sample=True, min_length=50, top_k=50, batch_size = 2) # example run on a batch
output

[[{'generated_text': 'The cat, a common male, is one of hundreds found in areas of India that are frequently inhabited by black squirrels.\n\nThe cat is often mistaken for a black squirrel but other breeds of cats — cat or dog — may also be mistaken'}],
 [{'generated_text': 'The dog\'s owner took this into account when making the claim that the owners were "trying to protect their dog".\n\nThe owner is not named and he does not know the dog\'s ownership situation, but the dog is considered his "gu'}]]

We'll utilize the IMDB dataset, typically employed for sentiment analysis benchmarking. However, our intention is to use it as a source of human-written texts. Hence, we can ignore the labels in the dataset.

In [None]:
from datasets import load_dataset # import loading function

# load the IMDB dataset
data = load_dataset('imdb')

In [None]:
data # data splits

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
data['train'][0] # single data entry

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

We pre-process the text in the following way:
1. we truncate the text to at most 100 tokens
2. we use the first 5 tokens as a prompt

Finally, we store the prompts and the texts in a data frame.


In [None]:
records = [] # an empty list to store the data
# loop over data entries
for text in tqdm(data['unsupervised']['text']):

  # split the text into whitespace-separated tokens and keep at most 100 of them
  tokens = text.split()[:100]

  # use first 5 tokens as a prompt
  prefix = ' '.join(tokens[:5]) + ' '

  # join tokens
  joined_tokens = ' '.join(tokens)

  # store prefix and joined tokens
  record = [prefix, joined_tokens]
  records.append(record)

texts = pd.DataFrame.from_records(records, columns = ['prefix', 'text']) # create a data frame

  0%|          | 0/50000 [00:00<?, ?it/s]

In [None]:
texts

Unnamed: 0,prefix,text
0,This is just a precious,This is just a precious little diamond. The pl...
1,When I say this is,When I say this is my favourite film of all ti...
2,I saw this movie because,I saw this movie because I am a huge fan of th...
3,Being that the only foreign,Being that the only foreign films I usually li...
4,After seeing Point of No,After seeing Point of No Return (a great movie...
...,...,...
49995,License To Kill (1989) is,License To Kill (1989) is an inanely dismal in...
49996,I love watching a James,I love watching a James Bond. It's not very in...
49997,I can't decide what was,I can't decide what was the worst thing about ...
49998,UGH... As an adorer of,UGH... As an adorer of the James Bond characte...


This is the core function. It takes a list of prompts and extends each one. The function's input arguments are:
* `prefixes` = prompts
* `decoding_strategy` = parameters for the decoding strategy
* `bs` = batch size

The function returns a list of generated texts.

In [None]:
def continue_prefix(prefixes, decoding_strategy, bs):
  # Generate text continuations for the given prefixes
  output = generator(
    prefixes,
    do_sample=True,
    min_length=decoding_strategy['min_length'],
    top_k=decoding_strategy['top_k'],
    batch_size=bs
  )

  # Extract the generated texts from the output
  generated_texts = [i[0]['generated_text'] for i in output]

  # Return the list of generated texts
  return generated_texts


We apply the `continue_prefix` function to our data and store the result. The same approach works with any other text generation model.



In [None]:
# generate text continuations for a list of prompts
generated_texts = continue_prefix(texts['prefix'].tolist(), decoding_strategy, bs=256)

# assign the generated texts to the 'generated_text' column in the DataFrame
texts['generated_text'] = generated_texts

# add a column 'model_index' to the DataFrame and assign the model index 'model_idx'
texts['model_index'] = model_idx

# iterate through the decoding strategy parameters and add them as columns in the DataFrame
for k,v in decoding_strategy.items():
  texts[k] = str(v)

# save the dataframe
# texts.to_csv(f"{output_path}/data/{model_idx}_texts.csv", index=None)
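One way to reuse the same logic with another model is to pass the pipeline in as an argument instead of relying on the global `generator`. The sketch below assumes exactly the output format shown earlier (a list of lists of dicts with a `generated_text` key). Both `continue_prefix_with` and `fake_generator` are hypothetical names introduced for illustration; the stand-in generator lets us check the logic without loading a model.

```python
def continue_prefix_with(text_generator, prefixes, decoding_strategy, bs):
    """Same logic as continue_prefix, but the pipeline is passed in explicitly."""
    output = text_generator(
        prefixes,
        do_sample=True,
        min_length=decoding_strategy['min_length'],
        top_k=decoding_strategy['top_k'],
        batch_size=bs,
    )
    # keep the first (and only) candidate for every prompt
    return [candidates[0]['generated_text'] for candidates in output]

def fake_generator(prefixes, **kwargs):
    """Hypothetical stand-in that mimics the pipeline's output format."""
    return [[{'generated_text': p + '<continuation>'}] for p in prefixes]

demo = continue_prefix_with(
    fake_generator, ['The cat ', 'The dog '], {'min_length': 50, 'top_k': 50}, bs=2
)
print(demo)
```

In the notebook, a second pipeline created for `large_model_idx` could be passed in place of `fake_generator` to produce the data for the test set 2.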

## Building the artificial text detector

Now, let's load the data generated with the two models. It was generated prior to today's session, so there is no need to wait until everything is generated.

In [None]:
# loading data, generated with the gpt-2 small model
texts_from_small_model = pd.read_csv(f'{output_path}/data/gpt2_texts.csv')

# loading data, generated with the gpt-2 large model
texts_from_large_model  = pd.read_csv(f'{output_path}/data/gpt2-large_texts.csv')

In [None]:
texts_from_small_model # sample

Unnamed: 0,prefix,text,generated_text,model_index,min_length,top_k
0,This is just a precious,This is just a precious little diamond. The pl...,This is just a precious urn that someone will ...,gpt2,50,50
1,When I say this is,When I say this is my favourite film of all ti...,When I say this is icky but I'm not in a hurry...,gpt2,50,50
2,I saw this movie because,I saw this movie because I am a huge fan of th...,"I saw this movie because I thought, ""If you p...",gpt2,50,50
3,Being that the only foreign,Being that the only foreign films I usually li...,Being that the only foreign urn in a world whe...,gpt2,50,50
4,After seeing Point of No,After seeing Point of No Return (a great movie...,After seeing Point of No _____________________...,gpt2,50,50
...,...,...,...,...,...,...
49995,License To Kill (1989) is,License To Kill (1989) is an inanely dismal in...,License To Kill (1989) is a work by Paul J. S...,gpt2,50,50
49996,I love watching a James,I love watching a James Bond. It's not very in...,I love watching a James vernacular. When you'v...,gpt2,50,50
49997,I can't decide what was,I can't decide what was the worst thing about ...,I can't decide what was different for me? \n...,gpt2,50,50
49998,UGH... As an adorer of,UGH... As an adorer of the James Bond characte...,UGH... As an adorer of Lethal Weaponists in m...,gpt2,50,50


In [None]:
texts_from_large_model # sample

Unnamed: 0,prefix,text,generated_text,model_index,min_length,top_k
0,This is just a precious,This is just a precious little diamond. The pl...,"This is just a precious life. So you see, the...",gpt2-large,50,50
1,When I say this is,When I say this is my favourite film of all ti...,When I say this is ******** (not a typo. I mea...,gpt2-large,50,50
2,I saw this movie because,I saw this movie because I am a huge fan of th...,I saw this movie because I had heard the nam...,gpt2-large,50,50
3,Being that the only foreign,Being that the only foreign films I usually li...,Being that the only foreign 『Ouuzou』 she knew ...,gpt2-large,50,50
4,After seeing Point of No,After seeing Point of No Return (a great movie...,After seeing Point of No I was hooked. The pl...,gpt2-large,50,50
...,...,...,...,...,...,...
995,A plastic surgeon gets suspicious,A plastic surgeon gets suspicious when the pol...,A plastic surgeon gets suspicious  To view th...,gpt2-large,50,50
996,"Obviously forgotten today, and maybe","Obviously forgotten today, and maybe that's a ...","Obviously forgotten today, and maybe not for ...",gpt2-large,50,50
997,I first saw this movie,I first saw this movie on HBO as a child. I co...,I first saw this movie but it's definitely wo...,gpt2-large,50,50
998,Today's audiences are a bit,Today's audiences are a bit spoiled and jaded....,Today's audiences are a bit icky for a horror ...,gpt2-large,50,50


We create a new dataframe, in which we include the target labels. `H` stands for the human-written texts and `M` stands for the machine-generated texts. We extract the texts from the corresponding columns of the data frame.

We will downsample the texts from the small model to make the computations faster.

In [None]:
# Create a new dataframe with target labels ('H' for human-written, 'M' for machine-generated)
def transform_data_labels(texts):
  records = []

  # loop over rows in the dataset
  for idx, row in texts.iterrows():

    # extract the human written texts and label them with H
    records.append([row['text'], 'H'])

    # extract the machine generated texts and label them with M
    records.append([row['generated_text'], 'M'])

  # store everything in a new data frame
  df = pd.DataFrame.from_records(records, columns = ['text', 'label'])
  return df

In [None]:
# apply the function 'transform_data_labels' to the texts generated from the small model
df_small_model = transform_data_labels(texts_from_small_model)

# downsample the resulting dataframe to 10,000 rows
df_small_model = df_small_model.sample(10000)

# apply the function 'transform_data_labels' to the texts generated from the large model
df_large_model = transform_data_labels(texts_from_large_model)


Now, we split the dataset into three parts using a 60/20/20 ratio and create a `DatasetDict` object, which we will further feed to the classifier.

In [None]:
from sklearn.model_selection import train_test_split # import the train_test_split function from the sklearn library


# split the df_small_model dataset into train and test sets with a 60/40 ratio
train, test = train_test_split(df_small_model, test_size=0.4)

# further split the test set into validation and test sets with a 50/50 ratio
val, test = train_test_split(test, test_size=0.5)

# reset the index of the dataframes after splitting
train.reset_index(inplace=True)
val.reset_index(inplace=True)
test.reset_index(inplace=True)
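To see why a 60/40 split followed by a 50/50 split gives a 60/20/20 ratio, here is a small pure-Python sketch of the same two-stage procedure. The helper `split_60_20_20` is a hypothetical illustration and does not reproduce the exact shuffling of `train_test_split`.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle `items`, keep 60% for training, and halve the remaining 40%."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10   # 60% of the data
    rest = items[n_train:]           # the remaining 40%
    n_val = len(rest) // 2           # half of 40% = 20% of the data
    return items[:n_train], rest[:n_val], rest[n_val:]

train_demo, val_demo, test_demo = split_60_20_20(range(10000))
print(len(train_demo), len(val_demo), len(test_demo))  # 6000 2000 2000
```

For the 10,000 downsampled rows this yields 6,000 training, 2,000 validation, and 2,000 test examples.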


In [None]:
from datasets import Dataset, DatasetDict # import necessary modules for creating datasets

# create an empty DatasetDict object
ds = DatasetDict()

# add  datasets to the DatasetDict with specified keys
# each dataset is created from a pandas dataframe (train, val, test, df_large_model)
ds['train'] = Dataset.from_pandas(train)
ds['validation'] = Dataset.from_pandas(val)
ds['test_s'] = Dataset.from_pandas(test) # <--- this is the test set # 1
ds['test_l'] = Dataset.from_pandas(df_large_model) # <--- this is the test set # 2

print(ds)


DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 6000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 2000
    })
    test_s: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 2000
    })
    test_l: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


In [None]:
# save the dataset to disk
ds.save_to_disk(f'{output_path}/data/dataset')

Saving the dataset (0/1 shards):   0%|          | 0/6000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2000 [00:00<?, ? examples/s]

We define the label converters.

In [None]:
# map class IDs to labels
id2label = {0: 'H', 1: 'M'}

# map labels to class IDs
label2id = {'H': 0, 'M': 1}


Let's start building the model! The first step is to preprocess the texts.

We import the `AutoTokenizer` class from the transformers library.
Then we load a pre-trained tokenizer for the `distilbert-base-uncased` model. A tokenizer is necessary to convert text data into a format that can be fed into the model for processing.

In [None]:
from transformers import AutoTokenizer # import  the AutoTokenizer class from the transformers library

# load a pre-trained tokenizer for the 'distilbert-base-uncased' model
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
# preprocess the texts by tokenizing them with the tokenizer's vocabulary and mapping the labels to their respective ids
def preprocess(batch):

    # tokenize the texts, truncate them to at most 128 tokens, and pad when necessary
    tokenized_batch = tokenizer(batch['text'], padding=True, truncation=True, max_length=128)

    # convert labels
    tokenized_batch['label'] = [label2id[label] for label in batch['label']]

    # return processed data
    return tokenized_batch

This code applies the `preprocess` function to the dataset `ds` using batch processing. This means that the function is applied to the data in chunks, or batches, rather than one entry at a time, which is usually faster.

In [None]:
#  apply the 'preprocess' function to the dataset 'ds' using batch processing
tokenized_ds = ds.map(preprocess, batched=True)
tokenized_ds

Map:   0%|          | 0/6000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 6000
    })
    validation: Dataset({
        features: ['index', 'text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test_s: Dataset({
        features: ['index', 'text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test_l: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

We create a data collator, which is responsible for assembling batches before they are fed to the model during training. The collator uses the provided tokenizer to pad sequences, which ensures that all sequences in a batch have the same length.

In [None]:
from transformers import DataCollatorWithPadding # import the DataCollatorWithPadding class from the transformers package

# create an instance of DataCollatorWithPadding
# it takes 'tokenizer' as an argument, which will be used for padding sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
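As a rough picture of what the collator does, the sketch below pads every sequence in a batch to the length of the longest one and builds the matching attention mask. The function `collate` and the token IDs are hypothetical; the padding ID 0 is an assumption that matches the `padding_idx=0` of the DistilBERT embeddings, and the real `DataCollatorWithPadding` additionally converts the batch to tensors.

```python
def collate(features, pad_id=0):
    """Right-pad `input_ids` to the longest sequence in the batch."""
    width = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': []}
    for f in features:
        gap = width - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_id] * gap)
        batch['attention_mask'].append([1] * len(f['input_ids']) + [0] * gap)
    return batch

toy_features = [{'input_ids': [101, 7592, 102]}, {'input_ids': [101, 102]}]  # made-up IDs
toy_batch = collate(toy_features)
print(toy_batch)
```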

In [None]:
import evaluate # import the evaluate package

accuracy = evaluate.load('accuracy') # we will use the accuracy metric as the main one

In [None]:
import numpy as np # import the numpy package

# this function takes the predictions (the logits for each class), picks the most likely class, and compares it to the gold label
def compute_metrics(eval_pred):

    # get the predicted logits and the gold labels
    predictions, labels = eval_pred

    # get the most likely prediction
    predictions = np.argmax(predictions, axis=1)

    # compute and return the accuracy value
    return accuracy.compute(predictions=predictions, references=labels)
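The metric itself is simple enough to spell out by hand. The following pure-Python sketch mirrors the argmax-then-compare logic of `compute_metrics` on made-up logits; `argmax_accuracy` is a hypothetical helper, not part of the `evaluate` package.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare with the gold labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

demo_logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]  # made-up scores
demo_labels = [0, 1, 0, 0]                                        # 0 = 'H', 1 = 'M'
print(argmax_accuracy(demo_logits, demo_labels))  # {'accuracy': 0.75}
```

Three of the four toy predictions match the gold labels, hence the accuracy of 0.75.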

Let us define the model architecture. We will use the `distilbert-base-uncased` model as a backbone for binary predictions.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer # import necessary components from the transformers library

# initialize a model for sequence classification (e.g. for text classification)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)


In [None]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
# define the training arguments for the model
training_args = TrainingArguments(
    output_dir=f'tmp/',                           # directory to save the model and results
    learning_rate=2e-5,                            # learning rate for optimization
    per_device_train_batch_size=32,              # batch size per GPU for training
    per_device_eval_batch_size=32,               # batch size per GPU for evaluation
    num_train_epochs=5,                           # number of training epochs
    weight_decay=0.01,                            # weight decay for regularization
    evaluation_strategy='epoch',                  # evaluation strategy during training (per epoch)
    save_strategy='epoch',                        # saving strategy during training (per epoch)
    load_best_model_at_end=True,                  # load the best model at the end of training
)

# initialize the Trainer with necessary components and settings
trainer = Trainer(
    model=model,                                  # model to be trained
    args=training_args,                           # training arguments defined above
    train_dataset=tokenized_ds['train'],          # training dataset
    eval_dataset=tokenized_ds['validation'],      # validation dataset
    tokenizer=tokenizer,                          # tokenizer for data processing
    data_collator=data_collator,                  # data collator for padding
    compute_metrics=compute_metrics               # function to compute evaluation metrics
)


Finally, let's train the model!

In [None]:
# train the model
trainer.train()

{'eval_loss': 0.02218625135719776, 'eval_accuracy': 0.996, 'eval_runtime': 6.92, 'eval_samples_per_second': 289.019, 'eval_steps_per_second': 9.104, 'epoch': 1.0}
{'eval_loss': 0.02113543637096882, 'eval_accuracy': 0.996, 'eval_runtime': 6.8505, 'eval_samples_per_second': 291.948, 'eval_steps_per_second': 9.196, 'epoch': 2.0}
{'loss': 0.0392, 'learning_rate': 9.361702127659576e-06, 'epoch': 2.66}
{'eval_loss': 0.02154308743774891, 'eval_accuracy': 0.996, 'eval_runtime': 6.834, 'eval_samples_per_second': 292.657, 'eval_steps_per_second': 9.219, 'epoch': 3.0}
{'eval_loss': 0.02481190487742424, 'eval_accuracy': 0.995, 'eval_runtime': 6.8541, 'eval_samples_per_second': 291.796, 'eval_steps_per_second': 9.192, 'epoch': 4.0}
{'eval_loss': 0.02079903893172741, 'eval_accuracy': 0.996, 'eval_runtime': 6.8171, 'eval_samples_per_second': 293.38, 'eval_steps_per_second': 9.241, 'epoch': 5.0}
{'train_runtime': 365.5891, 'train_samples_per_second': 82.059, 'train_steps_per_second': 2.571, 'train_los

TrainOutput(global_step=940, training_loss=0.021376898567727273, metrics={'train_runtime': 365.5891, 'train_samples_per_second': 82.059, 'train_steps_per_second': 2.571, 'train_loss': 0.021376898567727273, 'epoch': 5.0})
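Several numbers in the training log can be checked by hand. The sketch below assumes a single device, no gradient accumulation, a linear learning-rate decay to zero without warmup, and a logging interval of 500 steps; to our knowledge these are the `Trainer` defaults, but treat them as assumptions.

```python
import math

train_rows = 6000   # size of the training split
batch_size = 32     # per_device_train_batch_size
epochs = 5          # num_train_epochs
base_lr = 2e-5      # learning_rate

steps_per_epoch = math.ceil(train_rows / batch_size)  # 188
total_steps = steps_per_epoch * epochs                # 940, matches global_step

logged_step = 500   # assumed logging interval
# linear decay: the rate shrinks in proportion to the remaining steps
lr_at_log = base_lr * (total_steps - logged_step) / total_steps
epoch_at_log = logged_step / steps_per_epoch

print(total_steps, lr_at_log, round(epoch_at_log, 2))
```

Under these assumptions, the computed values agree with `global_step=940` and with the logged `learning_rate` of about 9.36e-06 at epoch 2.66.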

Now we evaluate the model on the two test sets.

In [None]:
# predict on test set from the small model
prediction = trainer.predict(tokenized_ds['test_s'])
prediction.metrics

{'test_loss': 0.017055347561836243,
 'test_accuracy': 0.997,
 'test_runtime': 6.7299,
 'test_samples_per_second': 297.183,
 'test_steps_per_second': 9.361}

In [None]:
# predict on test set from the large model
prediction = trainer.predict(tokenized_ds['test_l'])
prediction.metrics

{'test_loss': 0.03125665336847305,
 'test_accuracy': 0.9935,
 'test_runtime': 6.8253,
 'test_samples_per_second': 293.028,
 'test_steps_per_second': 9.23}

We notice only a small drop in accuracy between the two test sets (0.997 vs. 0.9935). But remember, this might not always be the case. Feel free to play around with different models and tweak the decoding strategies in the code!
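To put the two reported accuracies in absolute terms, we can convert them into the number of misclassified texts. The sketch only uses the accuracies and the test set size (2,000 rows each) shown in the outputs above.

```python
test_rows = 2000      # rows in each test set
acc_small = 0.997     # accuracy on test_s
acc_large = 0.9935    # accuracy on test_l

errors_small = round((1 - acc_small) * test_rows)
errors_large = round((1 - acc_large) * test_rows)
drop = acc_small - acc_large

print(errors_small, errors_large, round(drop, 4))  # 6 13 0.0035
```

In other words, the detector misclassifies 6 texts on the first test set and 13 on the second.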

**Task 1.** Go ahead and tweak the code for the authorship attribution task. Now, you'll be classifying between different models and human writing.

Hint 1: Since you have more than two labels (H, M1, M2), make sure to adjust all the variables that handle labels and the number of labels.
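As a possible starting point for this hint, the label mappings can be derived from a single list of class names, so that the mappings and the number of labels cannot get out of sync. The names below (`class_names`, `id2label_multi`, and so on) are hypothetical and only illustrate the idea.

```python
class_names = ['H', 'M1', 'M2']  # human, model 1, model 2

# map class IDs to labels and back
id2label_multi = dict(enumerate(class_names))
label2id_multi = {name: idx for idx, name in id2label_multi.items()}

# the value to pass as num_labels when creating the classifier
num_labels_multi = len(class_names)

print(id2label_multi, label2id_multi, num_labels_multi)
```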

**Task 2.** Let's enhance the code to incorporate multiple datasets for both training and testing. Play with the datasets available in the datasets package! You might also want to experiment with a cross-dataset setup, where you train on one dataset and evaluate on another.