# NLP Project — Human vs Machine Text Classification

**Author:** Mohammed Bouadjimi  
**Date:** August 2025

This project fine-tunes a RoBERTa-based language model to distinguish human-written text from machine-generated text. The model is trained and evaluated on a labeled dataset using Hugging Face Transformers and Datasets.

---


## 1. Setup

We begin by installing the necessary libraries, importing the required modules, and downloading the dataset files from Google Drive.


In [None]:
# Author: Mohammed Bouadjimi
# NLP Assignment 2 - Bot or Not? Detecting Machine-Generated Text
# Step 1: Setup and Load Dataset

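# Install dependencies (gdown is used to fetch the dataset from Google Drive)
!pip install -q transformers datasets scikit-learn pandas matplotlib
!pip install --no-cache-dir gdown

from datasets import load_dataset
from transformers import AutoTokenizer
import pandas as pd

# Download the dataset folder (creates the SubtaskA/ directory)
!gdown --folder https://drive.google.com/drive/folders/1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc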



Retrieving folder contents
Processing file 1e_G-9a66AryHxBOwGWhriePYCCa4_29e subtaskA_dev_monolingual.jsonl
Processing file 123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL subtaskA_dev_multilingual.jsonl
Processing file 1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG subtaskA_train_monolingual.jsonl
Processing file 13-9-DakCeLFbPgCiVIU0v6_BCQx0ppz6 subtaskA_train_multilingual.jsonl
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1e_G-9a66AryHxBOwGWhriePYCCa4_29e
To: /content/SubtaskA/subtaskA_dev_monolingual.jsonl
100% 10.8M/10.8M [00:00<00:00, 40.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL
To: /content/SubtaskA/subtaskA_dev_multilingual.jsonl
100% 21.2M/21.2M [00:00<00:00, 75.0MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG
From (redirected): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI

## 2. Load Dataset

We load the human vs machine classification dataset, which includes two categories:
- `0`: Human-written text
- `1`: Machine-generated text

The monolingual JSONL files are loaded into pandas DataFrames here; the conversion to Hugging Face `Dataset` objects happens in Section 4.


In [None]:
import pandas as pd

# Load monolingual files
train_path = "SubtaskA/subtaskA_train_monolingual.jsonl"
dev_path = "SubtaskA/subtaskA_dev_monolingual.jsonl"

# Read the JSONL files
train_df = pd.read_json(train_path, lines=True)
dev_df = pd.read_json(dev_path, lines=True)

# Print basic info
print("Train set shape:", train_df.shape)
print("Validation set shape:", dev_df.shape)
print("\nColumns:", train_df.columns.tolist())

# Show sample data
train_df.head()


Train set shape: (119757, 5)
Validation set shape: (5000, 5)

Columns: ['text', 'label', 'model', 'source', 'id']


| | text | label | model | source | id |
|---|---|---|---|---|---|
| 0 | Forza Motorsport is a popular racing game that... | 1 | chatGPT | wikihow | 0 |
| 1 | Buying Virtual Console games for your Nintendo... | 1 | chatGPT | wikihow | 1 |
| 2 | Windows NT 4.0 was a popular operating system ... | 1 | chatGPT | wikihow | 2 |
| 3 | How to Make Perfume\n\nPerfume is a great way ... | 1 | chatGPT | wikihow | 3 |
| 4 | How to Convert Song Lyrics to a Song'\n\nConve... | 1 | chatGPT | wikihow | 4 |
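
Before training, it can help to check the label balance and which generators appear in each class. The sketch below shows one way to do that on a tiny in-memory JSONL sample. Only the column names come from the output above; the three rows and the variable names (`sample_jsonl`, `sample_df`, `label_counts`) are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample in JSON Lines format (one JSON object per line).
# Column names mirror the real files; the rows themselves are made up.
sample_jsonl = "\n".join([
    '{"text": "A human-written article.", "label": 0, "model": "human", "source": "wikihow", "id": 0}',
    '{"text": "A generated article.", "label": 1, "model": "chatGPT", "source": "wikihow", "id": 1}',
    '{"text": "Another generated article.", "label": 1, "model": "chatGPT", "source": "wikihow", "id": 2}',
])

# lines=True tells pandas to parse each line as a separate record
sample_df = pd.read_json(io.StringIO(sample_jsonl), lines=True)

# Count examples per class: 0 = human-written, 1 = machine-generated
label_counts = sample_df["label"].value_counts().sort_index()
print(label_counts.to_dict())

# Cross-tabulate generator against label to see which models fall in each class
print(pd.crosstab(sample_df["model"], sample_df["label"]))
```

Running the same two checks on `train_df` and `dev_df` would show whether the classes are balanced and whether the dev set contains generators that the training set does not.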


## 3. Text Cleaning

Before training the model, we clean the input text to ensure better tokenization and learning. The cleaning process involves:
- Lowercasing all characters
- Removing leading/trailing spaces
- Collapsing runs of whitespace (including newlines) into a single space
- Removing every character other than letters, digits, whitespace, and basic punctuation (`.`, `,`, `!`, `?`, `'`)

We apply this cleaning function to both the training and validation datasets and store the result in a new column called `clean_text`.


In [None]:
import re

def clean_text(text):
    # Lowercase and trim leading/trailing whitespace
    text = text.lower().strip()
    # Collapse runs of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Drop everything except letters, digits, whitespace, and . , ! ? '
    text = re.sub(r'[^a-zA-Z0-9.,!?\'\s]', '', text)
    return text

# Apply cleaning
train_df['clean_text'] = train_df['text'].apply(clean_text)
dev_df['clean_text'] = dev_df['text'].apply(clean_text)

# Preview result
train_df[['text', 'clean_text']].head()


| | text | clean_text |
|---|---|---|
| 0 | Forza Motorsport is a popular racing game that... | forza motorsport is a popular racing game that... |
| 1 | Buying Virtual Console games for your Nintendo... | buying virtual console games for your nintendo... |
| 2 | Windows NT 4.0 was a popular operating system ... | windows nt 4.0 was a popular operating system ... |
| 3 | How to Make Perfume\n\nPerfume is a great way ... | how to make perfume perfume is a great way to ... |
| 4 | How to Convert Song Lyrics to a Song'\n\nConve... | how to convert song lyrics to a song' converti... |
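
To make the effect of each cleaning step concrete, the sketch below runs the same function on a short made-up string. It also shows an edge case: because whitespace is collapsed before special characters are removed, deleting a character that sits between two spaces leaves a double space behind. `clean_text_collapse_last` is a hypothetical variant, not part of the notebook, that reorders the steps to avoid this.

```python
import re

def clean_text(text):
    # Same steps, in the same order, as the notebook's cleaning function
    text = text.lower().strip()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9.,!?\'\s]', '', text)
    return text

def clean_text_collapse_last(text):
    # Hypothetical variant: remove disallowed characters first, then collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9.,!?\'\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

raw = "  How to Make Perfume\n\nRock & Roll (really)!  "

print(repr(clean_text(raw)))
# → 'how to make perfume rock  roll really!'   (note the double space where "&" was)

print(repr(clean_text_collapse_last(raw)))
# → 'how to make perfume rock roll really!'
```

Whether the extra spaces matter in practice depends on the tokenizer; the variant is only meant to show where they come from.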


## 4. Tokenization with Hugging Face Transformers

To prepare the cleaned text for model input, we tokenize it using the `roberta-base` tokenizer from Hugging Face.

**Steps included:**
- Load the pretrained RoBERTa tokenizer.
- Define a tokenization function that:
  - Pads each input to a fixed maximum length (128 tokens),
  - Truncates any overly long text.
- Sample a smaller subset of the dataset (10,000 training and 2,000 validation) to speed up processing and training.
- Convert the pandas DataFrames into Hugging Face `Dataset` objects.
- Apply the tokenizer using `.map()` to convert raw text into token IDs, attention masks, etc.

This step transforms text into numerical format, which the model can understand.


In [None]:
from datasets import Dataset
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Tokenization function: pad or truncate every example to exactly 128 tokens
def tokenize(batch):
    return tokenizer(batch['clean_text'], padding="max_length", truncation=True, max_length=128)

# Sample 10,000 from training set and 2,000 from dev set
train_small_df = train_df[['clean_text', 'label']].sample(n=10000, random_state=42)
dev_small_df = dev_df[['clean_text', 'label']].sample(n=2000, random_state=42)

# Convert to Hugging Face Datasets
train_hf = Dataset.from_pandas(train_small_df.reset_index(drop=True))
dev_hf = Dataset.from_pandas(dev_small_df.reset_index(drop=True))

# Tokenize
train_encoded = train_hf.map(tokenize, batched=True)
dev_encoded = dev_hf.map(tokenize, batched=True)


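
The padding and truncation settings used above can be illustrated without downloading the tokenizer. The following is a toy sketch of what `padding="max_length", truncation=True` does to a single sequence, assuming RoBERTa's special token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2). The function `pad_or_truncate` and the content token ids are hypothetical placeholders; the real tokenizer also handles subword splitting, which is omitted here.

```python
# RoBERTa's special token ids: <s> = 0, <pad> = 1, </s> = 2
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2

def pad_or_truncate(token_ids, max_length):
    """Toy stand-in for padding="max_length", truncation=True on one sequence."""
    # Reserve two positions for <s> and </s>, then cut the content if it is too long
    content = token_ids[: max_length - 2]
    input_ids = [BOS_ID] + content + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad to the fixed length; padded positions get attention_mask = 0
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

# Arbitrary placeholder ids standing in for real subword tokens
short = pad_or_truncate([100, 200, 300], max_length=8)
long = pad_or_truncate(list(range(10, 30)), max_length=8)

print(short)
# → {'input_ids': [0, 100, 200, 300, 2, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}
print(long)
# → {'input_ids': [0, 10, 11, 12, 13, 14, 15, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

With `max_length=128`, any article longer than 126 content tokens is cut off, so the classifier only ever sees the opening of long texts.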

## 5. Model Setup and Training Configuration

This section describes how we fine-tune a pretrained `roberta-base` model for binary text classification using the Hugging Face `Trainer` API. The corresponding code is part of the training cell in Section 6.

**Steps:**
- **Model Definition**: Load `RobertaForSequenceClassification` with `num_labels=2` for binary classification.
- **Metric Function**: Define a function to compute accuracy and F1 score using predictions vs true labels.
- **Training Arguments**:
  - Save a checkpoint every 500 steps
  - Batch size of 16 per device
  - 3 epochs of training
  - Weight decay for regularization
  - Logging every 10 steps, with reporting to external trackers such as Weights & Biases disabled
  - Keep at most 2 saved checkpoints
- **Trainer Initialization**: Combine the model, training arguments, datasets, and metric function to prepare for training.

This configuration sets up the fine-tuning loop using Hugging Face’s high-level API to simplify training.
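
To show what the metric function computes, here is a dependency-free sketch of accuracy and binary F1 (positive class = `1`, machine-generated) derived from raw logits. The helper names `argmax_rows` and `accuracy_and_f1` and the toy logits are assumptions made for illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score`.

```python
def argmax_rows(logits):
    # Index of the largest logit in each row, i.e. the predicted class per example
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_and_f1(labels, preds, positive=1):
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy logits for five examples, columns = [human, machine]
logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.1], [0.0, 2.0]]
labels = [0, 1, 0, 1, 1]

preds = argmax_rows(logits)
print(preds)                           # → [0, 1, 1, 0, 1]
print(accuracy_and_f1(labels, preds))  # accuracy 0.6, f1 ≈ 0.667
```

Accuracy treats both classes alike, while this F1 only rewards correct detection of the machine-generated class, so the two can diverge when the classes are imbalanced.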


## 6. Model Training

We now train the model using the `Trainer` API. The cell below defines the model, metric function, and training arguments described in Section 5, then fine-tunes `roberta-base` on the 10,000-example training subset for 3 epochs.


In [None]:
from transformers import RobertaForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load the pretrained encoder with a freshly initialized 2-class classification head
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Compute metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    save_steps=500,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=False,
    report_to="none"  # no wandb
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=dev_encoded,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|---|---|
| 10 | 0.6926 |
| 20 | 0.6245 |
| 30 | 0.5822 |
| 40 | 0.5791 |
| 50 | 0.4681 |
| 60 | 0.512 |
| 70 | 0.5063 |
| 80 | 0.4164 |
| 90 | 0.445 |
| 100 | 0.5242 |




TrainOutput(global_step=1875, training_loss=0.22494450912574926, metrics={'train_runtime': 977.3615, 'train_samples_per_second': 30.695, 'train_steps_per_second': 1.918, 'total_flos': 1973332915200000.0, 'train_loss': 0.22494450912574926, 'epoch': 3.0})
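
The `global_step=1875` reported above can be checked by hand from the training configuration. The sketch below assumes a single device and no gradient accumulation (the `TrainingArguments` defaults, which the notebook does not override); the variable names are illustrative.

```python
import math

n_train = 10_000      # size of the sampled training subset
batch_size = 16       # per_device_train_batch_size, assuming one device
epochs = 3            # num_train_epochs
save_steps = 500      # checkpoint interval

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which a periodic checkpoint is scheduled
scheduled_checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch)        # → 625
print(total_steps)            # → 1875
print(scheduled_checkpoints)  # → [500, 1000, 1500]
```

With `save_total_limit=2`, only the most recent checkpoints are expected to remain in `./results` once training finishes; depending on the `transformers` version, a final checkpoint at the last step may also be written.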

## 7. Evaluation

After training, we evaluate the model on the 2,000-example development subset, reporting accuracy and F1 score.


In [None]:
# Evaluate the model on the dev set
metrics = trainer.evaluate(eval_dataset=dev_encoded)
metrics




{'eval_loss': 1.8667258024215698,
 'eval_accuracy': 0.672,
 'eval_f1': 0.6659877800407332,
 'eval_runtime': 15.0532,
 'eval_samples_per_second': 132.862,
 'eval_steps_per_second': 8.304,
 'epoch': 3.0}
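
The evaluation loss of about 1.87 is well above ln 2 ≈ 0.693, the cross-entropy of a two-class classifier that always outputs 50/50, even though accuracy is above chance. One plausible reading, not verified in this notebook, is that the model is very confident on the examples it gets wrong. The sketch below uses made-up probabilities to show how a few confident errors can push the mean loss above the uniform-guess baseline; `mean_cross_entropy` is a hypothetical helper.

```python
import math

def mean_cross_entropy(probs_for_true_class):
    # Average of -log(p), where p is the probability assigned to the correct label
    return -sum(math.log(p) for p in probs_for_true_class) / len(probs_for_true_class)

# Baseline: always predict 50/50 on two classes
uniform = mean_cross_entropy([0.5] * 4)

# Toy model: 3 of 4 examples correct with p = 0.999, 1 wrong with p = 0.001 on the true label
overconfident = mean_cross_entropy([0.999, 0.999, 0.999, 0.001])

print(round(uniform, 3))        # → 0.693
print(round(overconfident, 3))  # → 1.728
```

In this toy case accuracy is 75%, yet the loss is more than twice the uniform baseline. Comparing the gap between the final training loss (about 0.22) and the evaluation loss is therefore a useful complement to accuracy and F1.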

In [None]:
import torch

# Set device (use GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model.to(device)

# Tokenize input and move to same device
# Note: the model was fine-tuned on `clean_text` output, while this example passes raw text
text = "This is a great product!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Run inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    # Take the argmax over the class dimension (the batch holds a single example)
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()

print("Predicted class:", predicted_class)


Predicted class: 1




In [None]:
label_map = {0: "human", 1: "chatgpt"}
print("Predicted label:", label_map[predicted_class])


Predicted label: chatgpt
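
When predicting for more than one text at a time, the class index has to be taken along the last axis of the logits. The sketch below uses NumPy as a stand-in for PyTorch (`np.argmax` without `axis` flattens the array, just as `torch.argmax` does without `dim`) and also converts logits to probabilities with a softmax, which gives a rough confidence for each prediction. The logit values are made up.

```python
import numpy as np

# Toy logits for a batch of two texts, columns = [human, machine]
logits = np.array([[0.2, 1.7],
                   [2.5, -0.3]])

# Without an axis, argmax returns an index into the flattened array
flat_index = int(np.argmax(logits))
print(flat_index)  # → 2 (position of 2.5 in [0.2, 1.7, 2.5, -0.3]), not a class id

# With axis=-1, argmax returns one class id per row
per_row = np.argmax(logits, axis=-1)
print(per_row.tolist())  # → [1, 0]

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
print(probs.round(3))
```

For a single input the flattened and per-row results coincide, which is why the single-example cells above still produce a valid class id.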


In [21]:
model.save_pretrained("human_bot_model")
tokenizer.save_pretrained("human_bot_model")


('human_bot_model/tokenizer_config.json',
 'human_bot_model/special_tokens_map.json',
 'human_bot_model/vocab.json',
 'human_bot_model/merges.txt',
 'human_bot_model/added_tokens.json',
 'human_bot_model/tokenizer.json')

In [None]:
texts = [
    "I don’t believe this is true.",
    "That’s amazing news!",
    "Can anyone confirm this?",
    "This seems fake."
]

for t in texts:
    inputs = tokenizer(t, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        pred_class = torch.argmax(outputs.logits, dim=-1).item()
    print(t, "→", label_map[pred_class])


I don’t believe this is true. → chatgpt
That’s amazing news! → chatgpt
Can anyone confirm this? → chatgpt
This seems fake. → human


In [23]:
import shutil
shutil.make_archive("human_bot_model", 'zip', "human_bot_model")


'/content/human_bot_model.zip'
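
To use the archive elsewhere, it has to be unpacked back into a directory first. The sketch below is a self-contained round trip with `shutil.make_archive` and `shutil.unpack_archive` inside a temporary directory; the placeholder `config.json` and the `restored` directory name are assumptions, standing in for the real saved model files.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the saved model directory, holding one placeholder file
    model_dir = os.path.join(tmp, "human_bot_model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # Same call shape as in the notebook: base name, format, directory to archive
    archive_path = shutil.make_archive(os.path.join(tmp, "human_bot_model"), "zip", model_dir)

    # Unpack into a fresh directory and list what was restored
    restore_dir = os.path.join(tmp, "restored")
    shutil.unpack_archive(archive_path, restore_dir)
    restored_files = sorted(os.listdir(restore_dir))

print(os.path.basename(archive_path))  # → human_bot_model.zip
print(restored_files)                  # → ['config.json']
```

After unpacking the real archive, passing the restored directory path to `from_pretrained` for both the model and the tokenizer should reload what was saved.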