<a href="https://colab.research.google.com/github/mcavol/AI-vs-Human-Text-Detector/blob/main/fine_tuned_ai_human_detector_488.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes an AI vs. human text detector.
First, run only the first block to downgrade the NumPy library, then restart the runtime. After that, you can run all the other blocks.
It uses this dataset: https://huggingface.co/datasets/ardavey/human-ai-generated-text
It fine-tunes the RoBERTa model (roberta-base) from Hugging Face, chosen here as a strong freely available model for this purpose. In the final block (7. Check input from User) you can test your own text. Note that the dataset is very small, only 488 samples, so the model's accuracy is very low: about 58% on the 88 evaluation samples in the run recorded below.

In [None]:
#@title 1. Install Libraries and Downgrade NumPy
# We force numpy to a version before 2.0 to solve the compatibility issue.
!pip install transformers[torch] datasets evaluate pyarrow "numpy<2.0" -q

print("\n✅ Libraries installed and NumPy downgraded.")
print("‼️ IMPORTANT: You must now restart the runtime.")
print("Click 'Runtime' -> 'Restart runtime' from the menu above.")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.

✅ Libraries installed and NumPy downgraded.
‼️ IMPORTANT: You must now restart the runtime.
Click 'Runtime' -> 'Restart runtime' from the menu above.
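
After the restart, it can be worth confirming that the downgrade actually took effect before running the remaining blocks. The sketch below is one way to do that using only the standard library; the helper name `is_numpy_1x` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def is_numpy_1x(version_string: str) -> bool:
    """Return True when a version string such as '1.26.4' has major version 1."""
    major = version_string.split(".")[0]
    return major == "1"


try:
    installed = version("numpy")
    status = "OK" if is_numpy_1x(installed) else "not a 1.x release, restart the runtime"
    print(f"NumPy {installed}: {status}")
except PackageNotFoundError:
    print("NumPy is not installed in this environment.")
```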


In [None]:
#@title 2. Load the ardavey/human-ai-generated-text Dataset
import pandas as pd
from datasets import Dataset

# Direct URL to the raw Parquet file of the dataset.
# Reading it with pandas keeps the loading step simple.
url = "https://huggingface.co/datasets/ardavey/human-ai-generated-text/resolve/main/data/train-00000-of-00001.parquet"

print(f"Downloading data directly from: {url}")

try:
    # Use pandas to read the Parquet file directly from the URL into a DataFrame.
    df = pd.read_parquet(url)
    print("Download and initial load successful.")

    # --- Data Processing ---
    # The dataset uses a 'class' column (0 for human, 1 for AI).
    # We must rename it to 'label' for the Trainer to use it automatically.
    df.rename(columns={'class': 'label'}, inplace=True)

    # We only need the 'text' and our new 'label' columns.
    df_final = df[['text', 'label']]

    # Convert the processed pandas DataFrame into a Hugging Face Dataset.
    dataset = Dataset.from_pandas(df_final)

    # Shuffle the dataset to ensure a good mix for training.
    dataset = dataset.shuffle(seed=42)

    print("\nDataset successfully processed and prepared!")
    print(dataset)
    print("\nExample of Human-written text (label=0):")
    print(dataset.filter(lambda example: example['label'] == 0)[0]['text'])
    print("\nExample of AI-generated text (label=1):")
    print(dataset.filter(lambda example: example['label'] == 1)[0]['text'])

except Exception as e:
    print(f"\nAn error occurred during manual download or processing: {e}")
    print("This could be a temporary network issue or a problem with the dataset URL.")

Downloading data directly from: https://huggingface.co/datasets/ardavey/human-ai-generated-text/resolve/main/data/train-00000-of-00001.parquet
Download and initial load successful.

Dataset successfully processed and prepared!
Dataset({
    features: ['text', 'label'],
    num_rows: 488
})

Example of Human-written text (label=0):


The human brain is made up of billions of neurons, each connected to thousands of others, making it a highly complex and intricate system.

Example of AI-generated text (label=1):


The ontological implications of utilizing large language models in philosophical discourse are multifaceted and warrant a nuanced examination. Specifically, the efficacy of LLMs in simulating human-like reasoning and generating coherent arguments has sparked debate among epistemologists, with some arguing that these tools represent a paradigm shift in the pursuit of knowledge, while others contend that they merely facilitate the propagation of spurious arguments.
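
With only 488 rows, the class balance matters: accuracy is informative only relative to what a trivial classifier that always predicts the most common label would score. A minimal sketch of that baseline follows; the function name and the toy labels are illustrative assumptions, and in the notebook one would pass `df_final['label'].tolist()` instead.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only (0 = human, 1 = AI); not the real dataset.
toy_labels = [0, 1, 1, 0, 1]
label, accuracy = majority_baseline(toy_labels)
print(f"Always predicting {label} scores {accuracy:.2f}")
```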


In [None]:
#@title 3. Load RoBERTa Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#@title 4. Tokenize the Dataset
def tokenize_function(examples):
    # Truncate to 512 tokens, the max for roberta-base
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Remove the original text column as it's no longer needed after tokenization
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset.set_format("torch")

# The dataset has only 488 examples, so we use 400 for training and the remaining 88 for evaluation.
train_size = 400
eval_size = 88

small_train_dataset = tokenized_dataset.select(range(train_size))
small_eval_dataset = tokenized_dataset.select(range(train_size, train_size + eval_size))

print(f"Training dataset size: {len(small_train_dataset)}")
print(f"Evaluation dataset size: {len(small_eval_dataset)}")

Training dataset size: 400
Evaluation dataset size: 88
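
Block 4 pads every sample to 512 tokens with `padding="max_length"`, which is simple but means short texts are mostly padding. An alternative that is not used in this notebook is dynamic padding, where each batch is padded only to its longest sequence (in `transformers` this is what `DataCollatorWithPadding` does). The pure-Python sketch below only illustrates the idea; `pad_batch` and the token ids are hypothetical, apart from the padding id 1, which is the `roberta-base` pad token id.

```python
def pad_batch(sequences, pad_id=1):
    """Pad each token-id list to the length of the longest list in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Made-up token ids: 0 and 2 stand for the <s> and </s> markers.
batch = [[0, 713, 16, 2], [0, 100, 2]]
padded = pad_batch(batch)

dynamic_tokens = sum(len(seq) for seq in padded)
fixed_tokens = 512 * len(batch)  # cost of padding="max_length" with max_length=512
print(padded)
print(f"{dynamic_tokens} tokens with dynamic padding vs {fixed_tokens} with fixed padding")
```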


In [None]:
#@title 5. Configure and Run the Fine-Tuning
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

model_dir = "ai-human-detector-roberta-ardavey"

training_args = TrainingArguments(
    output_dir=model_dir,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
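    # NOTE: 400 samples / batch 16 * 5 epochs = 125 steps on one device, so a
    # 500-step warmup never finishes and the learning rate stays below its peak.
    warmup_steps=500,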
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning complete!")

print("\nEvaluating the fine-tuned model...")
eval_results = trainer.evaluate()
print(f"Evaluation Accuracy: {eval_results['eval_accuracy']:.4f}")

Starting fine-tuning...


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.700262,0.534091
2,No log,0.693049,0.534091
3,No log,0.690654,0.534091
4,0.698400,0.687545,0.579545
5,0.698400,0.664786,0.579545


Fine-tuning complete!

Evaluating the fine-tuned model...


Evaluation Accuracy: 0.5795
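
One likely contributor to the flat accuracy is the learning-rate schedule. With 400 training samples, a batch size of 16, and 5 epochs, training runs for only 125 optimizer steps, while `warmup_steps=500` asks for 500 steps of warmup. The sketch below works through that arithmetic, assuming a single device, no gradient accumulation, the `TrainingArguments` default learning rate of 5e-5, and linear warmup; those assumptions are not stated in the notebook.

```python
import math

train_samples = 400
batch_size = 16
epochs = 5
warmup_steps = 500
base_lr = 5e-5  # assumed: TrainingArguments default learning rate

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs


def warmup_lr(step):
    """Learning rate during linear warmup, capped at base_lr."""
    return base_lr * min(1.0, step / warmup_steps)


print(f"Total optimizer steps: {total_steps}")
print(f"Learning rate at the last step: {warmup_lr(total_steps):.2e}")
print(f"Fraction of the peak rate reached: {warmup_lr(total_steps) / base_lr:.2f}")
```

Under these assumptions the learning rate never exceeds a quarter of its peak, so a much smaller `warmup_steps` value may be worth trying.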


In [None]:
#@title 6. Save Model and Create Inference Pipeline
import torch
from transformers import pipeline

final_model_path = "./final_ai_human_model_ardavey"
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

print(f"Model saved to {final_model_path}")

# This pipeline classifies text as AI or HUMAN.
# Use the first GPU when available; device=-1 falls back to the CPU.
device = 0 if torch.cuda.is_available() else -1
detector = pipeline("text-classification", model=final_model_path, device=device)

print("\nInference pipeline created successfully!")


#@title 7. Test the Fine-Tuned Detector
ai_text = "The synergistic application of blockchain technology and artificial intelligence is poised to redefine digital trust paradigms."
human_text = "Wow, this pizza is incredible! I haven't had one this good in ages."

def pretty_print_result(text, result):
    # LABEL_0 is Human, LABEL_1 is AI
    label_map = {'LABEL_0': 'HUMAN', 'LABEL_1': 'AI'}
    label = label_map[result[0]['label']]
    score = result[0]['score']
    print(f"Text: '{text[:80]}...'")
    print(f"--> Verdict: {label} (Confidence: {score:.4f})\n")

print("\n--- Classifying Test Cases ---")
pretty_print_result(ai_text, detector(ai_text))
pretty_print_result(human_text, detector(human_text))

Model saved to ./final_ai_human_model_ardavey


Device set to use cuda:0



Inference pipeline created successfully!

--- Classifying Test Cases ---
Text: 'The synergistic application of blockchain technology and artificial intelligence...'
--> Verdict: AI (Confidence: 0.5873)

Text: 'Wow, this pizza is incredible! I haven't had one this good in ages....'
--> Verdict: AI (Confidence: 0.5543)
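
The confidence printed by the text-classification pipeline for a two-label model is, by default, the softmax probability of the winning label, so values near 0.5 mean the two logits are almost equal. The sketch below illustrates that relationship; the logit values are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for (HUMAN, AI); a gap of about 0.35 gives roughly 0.59.
probs = softmax([0.0, 0.35])
print(f"HUMAN: {probs[0]:.4f}, AI: {probs[1]:.4f}")
```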



In [None]:
#@title 7. Check input from User
check = input("Write your text sample here:")
pretty_print_result(check, detector(check))

Write your text sample here:Yes, sure we could have a talk. I will be free tomorrow after 18:15 (Kyiv time, so after 17:15 in Paris), or on Tuesday any time after 16:00 (Kyiv time, so after 15:00 in Paris). Or tomorrow I also will be free between 16:00 and 17:15 (Kyiv time, so it's 15:00-16:15 in Paris).
Text: 'Yes, sure we could have a talk. I will be free tomorrow after 18:15 (Kyiv time, ...'
--> Verdict: AI (Confidence: 0.5718)
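
`pretty_print_result` appends `...` to every preview, even when the text is shorter than 80 characters, as in the pizza example above. A small helper along these lines would add the ellipsis only when text is actually cut; the name `preview` is hypothetical and not part of the notebook.

```python
def preview(text, width=80):
    """Return text unchanged if it fits, else its first `width` characters plus '...'."""
    if len(text) <= width:
        return text
    return text[:width] + "..."


print(preview("Wow, this pizza is incredible!"))
print(preview("x" * 100, width=10))
```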

