# **Team - Nexus Interrogators**

# **Subtask 2 V3: Human-AI Collaborative Text Classification**

- After fine-tuning our model on the augmented dataset, we now evaluate which of the four augmentation techniques improves our results the most.

- The four techniques used are:
    - Back Translation
    - Antonym
    - Synonym
    - Deletion

- We have created an individual subset for each technique. In this notebook, we merge each subset with the original dataset and then evaluate the performance again.

- Results can be seen below in the notebook.

- To reproduce the results, run the cells of this notebook one by one, in order.

- The external script `baseline_st2.py` is used below. Make sure it is in the same directory as this notebook.
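Since the notebook depends on files that it does not create itself, a quick pre-flight check can save a failed run. The sketch below is only an illustration: the file names are taken from the cells in this notebook, while the helper `find_missing_files` is a hypothetical name introduced here.

```python
from pathlib import Path

# Paths as they appear in the cells of this notebook.
REQUIRED_FILES = [
    "baseline_st2.py",
    "augmentation_strategies/output_backtranslation.jsonl",
    "augmentation_strategies/output_antonym.jsonl",
    "augmentation_strategies/output_synonym.jsonl",
    "augmentation_strategies/output_deletion.jsonl",
]


def find_missing_files(base_dir=".", required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under base_dir.

    Hypothetical helper, not part of baseline_st2.py.
    """
    base = Path(base_dir)
    return [rel for rel in required if not (base / rel).is_file()]


missing = find_missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

If anything is reported as missing, place it next to the notebook before running the cells below.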

## **Importing Dependencies**

In [1]:
import json
import os
from collections import Counter
import numpy as np
import pandas as pd
import torch
import evaluate
from datasets import Dataset, concatenate_datasets, load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

## **Loading Original Train Dataset**

In [None]:
target_dir = "/kaggle/working/st2data"
os.makedirs(target_dir, exist_ok=True)
!gdown "https://drive.google.com/uc?id=1u5C4o_fmjL5nQ_RtgLDShuG97Ix6_KGK" -O "/kaggle/working/st2data/train.jsonl"

In [None]:
# Function to load and extract required fields
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            entry = json.loads(line)
            data.append({"text": entry["text"], "label": entry["label"]})
    return pd.DataFrame(data)  # Convert to DataFrame

# Load the original training data into a DataFrame
original_dataset_train_df = load_data("st2data/train.jsonl")
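Before merging augmented rows into the training set, it can help to know how the labels are distributed, because each augmentation subset may shift that balance. The following is a minimal sketch that reads the same `text`/`label` JSONL format as `load_data`; the helper names `label_counts` and `label_shares` are hypothetical and not part of the original notebook.

```python
import json
from collections import Counter


def label_counts(file_path):
    """Count how many records carry each label in a JSONL file."""
    counts = Counter()
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            counts[json.loads(line)["label"]] += 1
    return counts


def label_shares(counts):
    """Convert absolute label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Example usage (path as used in this notebook):
# counts = label_counts("st2data/train.jsonl")
# print(counts, label_shares(counts))
```

Running the same check on each merged file makes it easy to see whether a technique changed the class balance.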

In [4]:
original_dataset_train = Dataset.from_pandas(original_dataset_train_df)

## **Back Translation**

### **Loading Back Translation Augmented Data**

In [5]:
bt_dataset = load_data("augmentation_strategies/output_backtranslation.jsonl")

In [6]:
bt_dataset = Dataset.from_pandas(bt_dataset)

### **Merging with Original Train Dataset**

In [7]:
aug_dataset_bt = concatenate_datasets([original_dataset_train, bt_dataset])

In [8]:
aug_dataset_bt.to_json("ind_augmentation_datasets/aug_dataset_bt.jsonl", orient="records", lines=True)

Creating json from Arrow format:   0%|          | 0/316 [00:00<?, ?ba/s]

645284476
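One thing worth checking after a merge is how many augmented rows are identical to rows that already exist in the original data, since an augmentation step can leave some texts unchanged and such rows only duplicate training examples. The sketch below works on plain lists of dictionaries; the toy rows are made up for illustration and `count_unchanged` is a hypothetical helper.

```python
def count_unchanged(original_rows, augmented_rows):
    """Count augmented rows whose (text, label) pair already exists in the original data."""
    seen = {(row["text"], row["label"]) for row in original_rows}
    return sum((row["text"], row["label"]) in seen for row in augmented_rows)


# Toy rows, invented for this example only.
original = [
    {"text": "a good movie", "label": 0},
    {"text": "the plot is thin", "label": 1},
]
augmented = [
    {"text": "a fine movie", "label": 0},
    {"text": "the plot is thin", "label": 1},  # unchanged by augmentation
]

# Concatenation keeps every row, so the merged size is the sum of both parts.
merged = original + augmented
assert len(merged) == len(original) + len(augmented)

print(count_unchanged(original, augmented))  # prints 1
```

With the DataFrames returned by `load_data`, the rows can be obtained with `df.to_dict("records")`.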

### **Finetuning & Evaluating the Model on Original Train + BT Dataset**

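Here we use the script `baseline_st2.py` to fine-tune and evaluate the model.

The parameters we pass to the script are:

- `--train_file_path`: path to the training file
- `--dev_file_path`: path to the validation file
- `--model`: model name
- `--prediction_file_path`: path where the predictions are written
- `--test_file_path`: path to the test file (currently a placeholder that points to the validation file, as we have not received the test dataset)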
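The same invocation is repeated for every technique with only the training and prediction paths changing, so the command can also be assembled programmatically. This is a sketch under the assumption that the script accepts exactly the flags shown in this notebook; `build_command` is a hypothetical helper and not part of `baseline_st2.py`.

```python
import shlex


def build_command(train_file, prediction_file,
                  dev_file="/kaggle/working/st2data/val.jsonl",
                  model="bert-base-uncased"):
    """Assemble the baseline_st2.py invocation as an argument list."""
    return [
        "python", "baseline_st2.py",
        "--train_file_path", train_file,
        "--dev_file_path", dev_file,
        # Placeholder: the validation file stands in for the missing test set.
        "--test_file_path", dev_file,
        "--model", model,
        "--prediction_file_path", prediction_file,
    ]


cmd = build_command(
    "ind_augmentation_datasets/aug_dataset_bt.jsonl",
    "results/subtask2/data_Ind_Aug_BT.csv",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` instead of typing the command by hand, which avoids copy-and-paste slips in the file names.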

In [None]:
!python baseline_st2.py --train_file_path "ind_augmentation_datasets/aug_dataset_bt.jsonl" --dev_file_path "/kaggle/working/st2data/val.jsonl" --test_file_path "/kaggle/working/st2data/val.jsonl" --model bert-base-uncased --prediction_file_path results/subtask2/data_Ind_Aug_BT.csv

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████| 315936/315936 [05:44<00:00, 917.77 examples/s]
Map: 100%|███████████████████████| 72661/72661 [01:10<00:00, 1035.27 examples/s]
  trainer = Trainer(
{'loss': 0.65, 'grad_norm': 8.346234321594238, 'learning_rate': 1.9493568317633952e-05, 'epoch': 0.03}
{'loss': 0.3804, 'grad_norm': 10.021520614624023, 'learning_rate': 1.8987136635267905e-05, 'epoch': 0.05}
{'loss': 0.3213, 'grad_norm': 3.037353754043579, 'learning_rate': 1.8480704952901856e-05, 'epoch': 0.08}
{'loss': 0.2992, 'grad_norm': 11.432672500610352, 'learning_rate': 1.7974273270535806e-05, 'epoch': 0.1}
{'loss': 0.284, 'grad_norm': 6.843706130981445, 'learning_rate': 1.746784158816976e-05, 'epoch': 0.13}

## **Antonyms**

### **Loading Antonym Augmented Data**

In [10]:
antonym_aug = load_data("augmentation_strategies/output_antonym.jsonl")

In [11]:
antonym_aug = Dataset.from_pandas(antonym_aug)

### **Merging with Original Train Dataset**

In [12]:
aug_dataset_antonym = concatenate_datasets([original_dataset_train, antonym_aug])

In [13]:
aug_dataset_antonym.to_json("ind_augmentation_datasets/aug_dataset_antonym.jsonl", orient="records", lines=True)

Creating json from Arrow format:   0%|          | 0/316 [00:00<?, ?ba/s]

668107457

### **Finetuning & Evaluating the Model on Original Train + Antonym Dataset**

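Here we use the script `baseline_st2.py` to fine-tune and evaluate the model.

The parameters we pass to the script are:

- `--train_file_path`: path to the training file
- `--dev_file_path`: path to the validation file
- `--model`: model name
- `--prediction_file_path`: path where the predictions are written
- `--test_file_path`: path to the test file (currently a placeholder that points to the validation file, as we have not received the test dataset)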

In [None]:
!python baseline_st2.py --train_file_path "ind_augmentation_datasets/aug_dataset_antonym.jsonl" --dev_file_path "/kaggle/working/st2data/val.jsonl" --test_file_path "/kaggle/working/st2data/val.jsonl" --model bert-base-uncased --prediction_file_path results/subtask2/data_Ind_Aug_Antonyn.csv

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████| 315936/315936 [05:44<00:00, 916.75 examples/s]
Map: 100%|███████████████████████| 72661/72661 [01:09<00:00, 1052.80 examples/s]
  trainer = Trainer(
{'loss': 0.6644, 'grad_norm': 12.640812873840332, 'learning_rate': 1.9493568317633952e-05, 'epoch': 0.03}
{'loss': 0.3671, 'grad_norm': 7.9746809005737305, 'learning_rate': 1.8987136635267905e-05, 'epoch': 0.05}
{'loss': 0.3111, 'grad_norm': 4.261466979980469, 'learning_rate': 1.8480704952901856e-05, 'epoch': 0.08}
{'loss': 0.2828, 'grad_norm': 18.284711837768555, 'learning_rate': 1.7974273270535806e-05, 'epoch': 0.1}
{'loss': 0.2683, 'grad_norm': 7.143779754638672, 'learning_rate': 1.746784158816976e-05, 'epoch': 0.13

## **Synonyms**

### **Loading Synonym Augmented Data**

In [15]:
synonym_aug = load_data("augmentation_strategies/output_synonym.jsonl")

In [16]:
synonym_aug = Dataset.from_pandas(synonym_aug)

### **Merging with Original Train Dataset**

In [17]:
aug_dataset_synonym = concatenate_datasets([original_dataset_train, synonym_aug])

In [18]:
aug_dataset_synonym.to_json("ind_augmentation_datasets/aug_dataset_synonym.jsonl", orient="records", lines=True)

Creating json from Arrow format:   0%|          | 0/316 [00:00<?, ?ba/s]

666782188

### **Finetuning & Evaluating the Model on Original Train + Synonym Dataset**

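Here we use the script `baseline_st2.py` to fine-tune and evaluate the model.

The parameters we pass to the script are:

- `--train_file_path`: path to the training file
- `--dev_file_path`: path to the validation file
- `--model`: model name
- `--prediction_file_path`: path where the predictions are written
- `--test_file_path`: path to the test file (currently a placeholder that points to the validation file, as we have not received the test dataset)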

In [None]:
!python baseline_st2.py --train_file_path "ind_augmentation_datasets/aug_dataset_synonym.jsonl" --dev_file_path "/kaggle/working/st2data/val.jsonl" --test_file_path "/kaggle/working/st2data/val.jsonl" --model bert-base-uncased --prediction_file_path results/subtask2/data_Ind_Aug_Synonym.csv

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████| 315936/315936 [05:50<00:00, 901.11 examples/s]
Map: 100%|███████████████████████| 72661/72661 [01:11<00:00, 1021.68 examples/s]
  trainer = Trainer(
{'loss': 0.6532, 'grad_norm': 16.309383392333984, 'learning_rate': 1.9493568317633952e-05, 'epoch': 0.03}
{'loss': 0.3563, 'grad_norm': 10.39608383178711, 'learning_rate': 1.8987136635267905e-05, 'epoch': 0.05}
{'loss': 0.3052, 'grad_norm': 5.806362152099609, 'learning_rate': 1.8480704952901856e-05, 'epoch': 0.08}
{'loss': 0.2843, 'grad_norm': 14.519922256469727, 'learning_rate': 1.7974273270535806e-05, 'epoch': 0.1}
{'loss': 0.2668, 'grad_norm': 5.035423755645752, 'learning_rate': 1.746784158816976e-05, 'epoch': 0.13}

## **Deletion**

### **Loading Deletion Augmented Data**

In [20]:
aug_deletion = load_data("augmentation_strategies/output_deletion.jsonl")

In [21]:
aug_deletion = Dataset.from_pandas(aug_deletion)

### **Merging with Original Train Dataset**

In [22]:
aug_dataset_deletion = concatenate_datasets([original_dataset_train, aug_deletion])

In [23]:
aug_dataset_deletion.to_json("ind_augmentation_datasets/aug_dataset_deletion.jsonl", orient="records", lines=True)

Creating json from Arrow format:   0%|          | 0/316 [00:00<?, ?ba/s]

666243126

### **Finetuning & Evaluating the Model on Original Train + Deletion Dataset**

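Here we use the script `baseline_st2.py` to fine-tune and evaluate the model.

The parameters we pass to the script are:

- `--train_file_path`: path to the training file
- `--dev_file_path`: path to the validation file
- `--model`: model name
- `--prediction_file_path`: path where the predictions are written
- `--test_file_path`: path to the test file (currently a placeholder that points to the validation file, as we have not received the test dataset)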

In [None]:
!python baseline_st2.py --train_file_path "ind_augmentation_datasets/aug_dataset_deletion.jsonl" --dev_file_path "/kaggle/working/st2data/val.jsonl" --test_file_path "/kaggle/working/st2data/val.jsonl" --model bert-base-uncased --prediction_file_path results/subtask2/data_Ind_Aug_Delection.csv

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████| 315936/315936 [05:45<00:00, 914.52 examples/s]
Map: 100%|███████████████████████| 72661/72661 [01:09<00:00, 1050.56 examples/s]
  trainer = Trainer(
{'loss': 0.6617, 'grad_norm': 7.396088600158691, 'learning_rate': 1.9493568317633952e-05, 'epoch': 0.03}
{'loss': 0.3759, 'grad_norm': 8.806495666503906, 'learning_rate': 1.8987136635267905e-05, 'epoch': 0.05}
{'loss': 0.3159, 'grad_norm': 2.182642698287964, 'learning_rate': 1.8480704952901856e-05, 'epoch': 0.08}
{'loss': 0.2917, 'grad_norm': 17.376754760742188, 'learning_rate': 1.7974273270535806e-05, 'epoch': 0.1}
{'loss': 0.2756, 'grad_norm': 9.432795524597168, 'learning_rate': 1.746784158816976e-05, 'epoch': 0.13}
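All four runs log the same `learning_rate` and `epoch` values, which suggests they share one training configuration. The sketch below shows how those logged values could arise. It rests on assumptions that the notebook does not confirm: a base learning rate of 2e-5, linear decay to zero without warmup, a batch size of 16, a single epoch, and logging every 500 steps. Only the number of training examples (315936) is read from the logs.

```python
import math

# Read from the "Map" progress line in the training logs.
NUM_TRAIN_EXAMPLES = 315936

# Assumptions, not confirmed by the notebook or by baseline_st2.py.
BASE_LR = 2e-5
BATCH_SIZE = 16
NUM_EPOCHS = 1
LOGGING_STEPS = 500


def linear_decay_lr(step, total_steps, base_lr=BASE_LR):
    """Learning rate after `step` optimizer steps under linear decay to zero, no warmup."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


total_steps = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE) * NUM_EPOCHS

# Print step, fraction of the epoch completed, and the scheduled learning rate.
for i in range(1, 6):
    step = i * LOGGING_STEPS
    print(step, round(step / total_steps, 2), linear_decay_lr(step, total_steps))
```

Under these assumptions, the printed values line up with the `learning_rate` and `epoch` fields in the logs, so the five log lines shown for each run would cover roughly the first 2500 optimizer steps of a single epoch.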
