# **Team - Nexus Interrogators**

# **Subtask 1: Voight-Kampff AI Detection Sensitivity**

- This notebook contains our initial approach to the task: fine-tuning `bert-base-uncased` as a binary sequence classifier.

- The results are shown in the cell outputs below.

- To reproduce them, run the cells of this notebook in order.

- When training starts, wandb prompts for an API key, which must be entered before training can proceed.
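The wandb prompt can get in the way of unattended runs. The following is a minimal sketch of one way to handle it, assuming the `WANDB_API_KEY` and `WANDB_MODE` environment variables read by the wandb client; `configure_wandb` is a hypothetical helper and is not part of this notebook.

```python
import os

def configure_wandb(api_key=None):
    """Set wandb environment variables before the Trainer is created.

    Hypothetical helper: with a key, wandb can log in without prompting;
    without one, wandb logging is disabled so training does not block.
    """
    if api_key:
        os.environ["WANDB_API_KEY"] = api_key
        os.environ.pop("WANDB_MODE", None)  # fall back to normal online logging
        return "online"
    os.environ["WANDB_MODE"] = "disabled"
    return "disabled"

# Reuse a key that is already exported in the environment, if any
mode = configure_wandb(os.environ.get("WANDB_API_KEY"))
print(f"wandb mode: {mode}")
```

Another option, if experiment tracking is not needed at all, is to pass `report_to="none"` to `TrainingArguments`.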

## **Installing Dependencies**

In [None]:
# !pip install transformers datasets torch scikit-learn evaluate



## **Importing Libraries**

In [1]:
import json
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score
from datasets import Dataset
import numpy as np
import evaluate

## **Loading Data**

In [2]:
!wget -O st1data.zip "https://github.com/huzaifahtariqahmed/Voight-Kampff-Nexus-Interrogators/raw/refs/heads/main/data/subtask1.zip"
!unzip st1data.zip -d st1data


In [3]:
# Function to load and extract required fields
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            entry = json.loads(line)
            data.append({"id": entry["id"], "text": entry["text"], "label": entry["label"]})
    return pd.DataFrame(data)  # Convert to DataFrame

# Load training and validation data into DataFrames
train_df = load_data("st1data/train.jsonl")
val_df = load_data("st1data/val.jsonl")

# Display first few rows
print("Training Data:")
print(train_df.head())

print("\nValidation Data:")
print(val_df.head())


Training Data:
                                     id  \
0  ea468d03-1973-5039-86b2-ff225bb92c4e   
1  0d05f269-6d67-521d-9b5d-cc18f482c6c1   
2  c2ec79f3-da80-58f8-bef0-3e0ea7ab072f   
3  4ad37c58-0bb7-536b-997d-cfccabd0d094   
4  07747b0c-5051-5e0d-8096-b4d4ed8bd98e   

                                                text  label  
0  Duke Ellington, a titan of jazz, revolutionize...      1  
1  I reflected on the shifting dynamics of media ...      1  
2  In F. Scott Fitzgerald's "The Great Gatsby," t...      1  
3  I still chuckle when I think about that time I...      1  
4  Yoga, originating in ancient India, encompasse...      1  

Validation Data:
                                     id  \
0  7caf42b9-fd48-5e97-a0d0-0ae28a1f9603   
1  28b61fc4-e82b-5cf8-bc34-1ecdb7182993   
2  22398c76-da72-5724-973e-0981b8e9cbee   
3  3cd1e50d-e1f0-5f8f-bfb8-0b8a6048bcaa   
4  6e5745a6-0335-50cc-bdf0-fa0e1fee7518   

                                                text  label  
0  In William F

In [4]:
# Convert the DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

## **Tokenization**

In [5]:
model_name = "bert-base-uncased"  # Change this if using a different model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Remove original text column (not needed after tokenization)
train_dataset = train_dataset.remove_columns(["text"])
val_dataset = val_dataset.remove_columns(["text"])
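With `padding="max_length"` and `truncation=True`, every example ends up with the same number of token ids, and an attention mask marks which positions hold real tokens. The sketch below illustrates that idea in plain Python. It is a simplification and not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the ids are illustrative, and it ignores the special tokens that the real tokenizer keeps in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # truncation: drop the overflow
    n_pad = max_length - len(kept)             # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is truncated
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to the model maximum is simple but spends compute on padding positions. Padding each batch only to its longest example, for instance with `DataCollatorWithPadding` from transformers, is a common alternative.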


In [9]:
!python baseline_st1.py --train_file_path "st1data/train.jsonl" --dev_file_path "st1data/val.jsonl" --model $model_name --prediction_file_path results/subtask1/exp1-bert-base.csv --test_file_path "st1data/val.jsonl"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|████████████████████████| 23707/23707 [00:59<00:00, 395.27 examples/s]
Map: 100%|██████████████████████████| 3589/3589 [00:08<00:00, 405.06 examples/s]
  trainer = Trainer(
{'loss': 0.1068, 'grad_norm': 0.1047993078827858, 'learning_rate': 1.7750787224471436e-05, 'epoch': 0.34}
{'loss': 0.0515, 'grad_norm': 0.04118435084819794, 'learning_rate': 1.550157444894287e-05, 'epoch': 0.67}
 33%|█████████████                          | 1482/4446 [13:02<23:41,  2.09it/s]

## **Define Model**

In [6]:
num_labels = 2  # Binary classification: the dataset has two classes

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Training Arguments**

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)
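No scheduler is set in these arguments, so the `Trainer` default applies, which is a linear decay of the learning rate with no warmup. The sketch below reproduces that schedule in plain Python and checks it against the `learning_rate` values logged by the baseline run earlier in the notebook. The figures of 4446 total steps and logging at steps 500 and 1000 are inferred from that log and are assumptions; `linear_lr` is a hypothetical helper.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed from the baseline log: 4446 total steps, logged every 500 steps
lr_500 = linear_lr(500, 4446)
lr_1000 = linear_lr(1000, 4446)
print(lr_500, lr_1000)
```

Both values agree with the logged `1.7750787224471436e-05` and `1.550157444894287e-05` to within floating-point precision.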



## **Metrics**

In [None]:
# Load the metrics once, rather than on every evaluation call
f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; pick the highest-scoring class
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    results = {}
    # Micro F1-score (stored under the key "f1")
    results.update(f1_metric.compute(predictions=predictions, references=labels, average="micro"))
    # Macro F1-score
    results["macro_f1"] = f1_metric.compute(predictions=predictions, references=labels, average="macro")["f1"]
    # Macro Recall
    results["macro_recall"] = recall_metric.compute(predictions=predictions, references=labels, average="macro")["recall"]
    # Accuracy
    results["accuracy"] = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]

    return results
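In the training results reported later in this notebook, the `F1` and `Accuracy` values are identical at every epoch. This is expected: for single-label classification, micro-averaged F1 pools true positives, false positives and false negatives over all classes, which reduces to accuracy. Macro F1 averages the per-class scores, so it weights both classes equally. The sketch below illustrates this with a hand-rolled computation on made-up labels; `f1_scores` is a hypothetical helper, not the `evaluate` implementation.

```python
def f1_scores(labels, predictions, classes=(0, 1)):
    """Return (micro_f1, macro_f1) for single-label classification."""
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        total_tp += tp
        total_fp += fp
        total_fn += fn
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    macro = sum(per_class_f1) / len(per_class_f1)
    return micro, macro

# Made-up, imbalanced example: one class-0 item is misclassified as class 1
labels      = [1, 1, 1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 1, 1, 1, 1, 0]
micro_f1, macro_f1 = f1_scores(labels, predictions)
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(micro_f1, macro_f1, accuracy)
```

Here micro F1 equals accuracy, while macro F1 is lower because the single error falls on the minority class.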

## **Trainer**

In [None]:
# Build the Trainer with the custom metrics
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,  # Include custom metrics
)

## **Train the Model**

In [None]:
# Train the model
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: samiyaalizaidi (samiyaalizaidi-habib-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | F1 | Macro F1 | Macro Recall | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0618 | 0.064942 | 0.985511 | 0.984101 | 0.981217 | 0.985511 |
| 2 | 0.0114 | 0.164556 | 0.972694 | 0.969731 | 0.962505 | 0.972694 |
| 3 | 0.0095 | 0.082132 | 0.988298 | 0.987158 | 0.984256 | 0.988298 |



TrainOutput(global_step=8892, training_loss=0.03730698917844524, metrics={'train_runtime': 7482.6914, 'train_samples_per_second': 9.505, 'train_steps_per_second': 1.188, 'total_flos': 1.871272136825856e+16, 'train_loss': 0.03730698917844524, 'epoch': 3.0})
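The `global_step=8892` in the `TrainOutput` above can be checked against the settings used in this notebook. The sketch below does the arithmetic, taking the dataset sizes (23707 training and 3589 validation examples) from the tokenization logs and assuming a single device with no gradient accumulation; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a full run: ceil(examples / batch) per epoch.

    Assumes one device and no gradient accumulation.
    """
    return math.ceil(num_examples / batch_size) * num_epochs

# Batch size 8 and 3 epochs, as set in TrainingArguments
train_steps = total_optimizer_steps(23707, batch_size=8, num_epochs=3)
eval_batches = math.ceil(3589 / 8)
print(train_steps, eval_batches)
```

The training figure matches the reported `global_step`, and the evaluation batch count is consistent with the reported `eval_steps_per_second` multiplied by `eval_runtime`.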

## **Evaluate the Model**

In [None]:
# Evaluate on validation set
results = trainer.evaluate()
print(results)  # Will now include F1-score

{'eval_loss': 0.08213236182928085, 'eval_f1': 0.9882975759264419, 'eval_macro_f1': 0.987158293974524, 'eval_macro_recall': 0.9842563263271129, 'eval_accuracy': 0.9882975759264419, 'eval_runtime': 106.8562, 'eval_samples_per_second': 33.587, 'eval_steps_per_second': 4.202, 'epoch': 3.0}


## **Saving the Model**

In [None]:
model.save_pretrained("./st1modelv1")
tokenizer.save_pretrained("./st1tokenizerv1")

('./st1tokenizerv1/tokenizer_config.json',
 './st1tokenizerv1/special_tokens_map.json',
 './st1tokenizerv1/vocab.txt',
 './st1tokenizerv1/added_tokens.json',
 './st1tokenizerv1/tokenizer.json')