Some quick reading for anyone running this whose name isn't Aditya:

So, elephant in the room. Why *ELECTRA*?

Well, it's a) faster to train, and b) 'interestingly different', yet it performs almost as well as *BERT*.

*ELECTRA* (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) 
is a pretraining approach that trains two transformer models: the generator and the discriminator.
The generator replaces tokens in a sequence and is trained as a masked language model. 
The discriminator, which is the model we’re interested in, tries to identify which tokens 
were replaced by the generator in the sequence. *ELECTRA* replaces the Masked Language Modeling 
(MLM) of *BERT* with Replaced Token Detection (RTD), which is more efficient.

And a quick rundown:

Masked Language Modeling (MLM): This technique is used in models like *BERT*. 
In MLM, some percentage of the input tokens are masked at random, and then the 
model is trained to predict those masked tokens. The model can attend to tokens bidirectionally, 
meaning it has full access to the tokens on the left and right of the masked token. 
This technique is great for tasks that require a good contextual understanding of an entire sequence.
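
To make the masking step concrete, here's a rough sketch in plain Python. It's a simplification, not *BERT*'s actual procedure: the `mask_tokens` helper, the `[MASK]` string, and the probabilities are illustrative assumptions, and real tokenizers work on token IDs rather than whole words.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string for a masked position

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with MASK_TOKEN.

    Returns the corrupted sequence and a dict mapping each masked position
    to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the chef cooked the meal".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.4, seed=1)
print(corrupted)
print(targets)
```

Only the positions in `targets` give the model something to learn from, which is the inefficiency RTD is meant to address.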

Replaced Token Detection (RTD): This technique was introduced by *ELECTRA*. In RTD, a generator replaces tokens
in the sequence, and a discriminator attempts to identify which tokens were replaced by the generator. 
Unlike MLM, which uses a transformer encoder to predict corrupted tokens, RTD uses a generator to generate
ambiguous corruptions and a discriminator to distinguish the ambiguous tokens from the original inputs. 
This pre-training task is a replacement for masking the input. *ELECTRA* was built with the same
classification and detection purposes as RoBERTa, but with a lighter network that utilizes RTD.
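
And a toy sketch of what the discriminator is asked to predict. The `rtd_labels` helper and the "generator output" here are made up for illustration; in the real setup the replacements come from a small masked language model.

```python
def rtd_labels(original, corrupted):
    """Label each position 1 if the token was replaced, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output
labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position gets a label, so the discriminator receives a training signal from the whole sequence rather than only from the masked subset.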

Google's paper for *ELECTRA* mentions that when run against the standard *BERT* model on the same NVIDIA V100 training hardware, they both ran for four days, but throughout the whole session, *ELECTRA* took about 1/4 the computational resources used by *BERT*. We'll run *ELECTRA*'s small discriminator model to exploit the lower computational demand with as much GPU parallelization as possible, in hopes of hyperaccelerating training. While running this, you'll notice the *ELECTRA* tokenizer variant takes an agonizing amount of time to process data, but the end result is a direct 8-12x training speedup.

*ELECTRA: Pre-training Text Encoders as Discriminators Rather than Generators*: https://openreview.net/pdf?id=r1xMH1BtvB

# Setup

In [None]:
%%capture
!pip show transformers
!pip show accelerate
!pip install transformers[torch] -U
!pip install accelerate -U
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install datasets
!pip install gdown

# Preprocessing <br>
Note: Code downloads dataset, no need to have it on hand.

In [None]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset
from transformers import TrainingArguments
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import os
import gdown


# 1. Prepare Dataset
# 2. Load pretrained Tokenizer, call it with dataset -> encoding
# 3. Build PyTorch Dataset with encodings
# 4. Load pretrained Model
# 5. Load HF Trainer and train it

# The data format, combined with a whole new dataset format update, makes for really terrible processing headaches.
# So I'm redoing the whole train/test separation from the original dataset.

file_id = '1E5v6t-y2Pzeyi5C-amlYibp2cXUN0MLW'
url = f'https://drive.google.com/uc?id={file_id}'
file_path = 'all_data.csv'
if not os.path.exists(file_path):
    # Download the file
    gdown.download(url, file_path, quiet=False)

# Read the csv file
all_data = pd.read_csv(file_path, engine='python')

toxicity_train_df, toxicity_test_df = train_test_split(all_data, test_size=0.40, random_state=42)

# List of categories to check
categories_to_check = ['obscene', 'sexual_explicit', 'threat', 'insult', 'identity_attack']

toxicity_train_df[categories_to_check] = toxicity_train_df[categories_to_check].apply(pd.to_numeric, errors='coerce')
toxicity_test_df[categories_to_check] = toxicity_test_df[categories_to_check].apply(pd.to_numeric, errors='coerce')

# Flag a comment as toxic (1.0) if any category score is at or above the 0.33 threshold, else 0.0
toxicity_train_df['toxic'] = (toxicity_train_df[categories_to_check] >= 0.33).any(axis=1).astype(float)
toxicity_test_df['toxic'] = (toxicity_test_df[categories_to_check] >= 0.33).any(axis=1).astype(float)

toxicity_train_df = toxicity_train_df[['comment_text', 'toxic', 'obscene', 'sexual_explicit', 'threat', 'insult', 'identity_attack']]
toxicity_test_df = toxicity_test_df[['comment_text', 'toxic', 'obscene', 'sexual_explicit', 'threat', 'insult', 'identity_attack']]

# Can be adjusted to downsample training data
sample_rate = 1.0

toxicity_train_df = toxicity_train_df.sample(frac=sample_rate, random_state=42)

print("Toxic train examples")
print(toxicity_train_df.head(4))

print("Toxic test examples")
print(toxicity_test_df.head(4))

Test Lengths of DFs

In [None]:
print(len(toxicity_train_df))
print(len(toxicity_test_df))

# Visualization of toxicity in train

In [None]:
# Count toxic and non-toxic comments
toxic_count = toxicity_train_df['toxic'].sum()
non_toxic_count = len(toxicity_train_df) - toxic_count

# Plot side-by-side bars for toxic and non-toxic comments
labels = ['Toxic Comments', 'Non-Toxic Comments']
counts = [toxic_count, non_toxic_count]

plt.bar(labels, counts, color=['red', 'blue'])
plt.ylabel('Comment Count')

plt.show()

# Visualization of toxicity in test

In [None]:
# Count toxic and non-toxic comments
toxic_count = toxicity_test_df['toxic'].sum()
non_toxic_count = len(toxicity_test_df) - toxic_count

# Plot side-by-side bars for toxic and non-toxic comments
labels = ['Toxic Comments', 'Non-Toxic Comments']
counts = [toxic_count, non_toxic_count]

plt.bar(labels, counts, color=['red', 'blue'])
plt.ylabel('Comment Count')

plt.show()

# Splitting and Labelling

In [None]:
model_name = "google/electra-small-discriminator"

# Reset index to ensure consistency
toxicity_train_df.reset_index(drop=True, inplace=True)
toxicity_test_df.reset_index(drop=True, inplace=True)

# Select relevant columns from DataFrame and drop NaN values
train_data = toxicity_train_df[['comment_text', 'toxic']].dropna()
test_data = toxicity_test_df[['comment_text', 'toxic']].dropna()

# Extract features and labels
train_texts = train_data['comment_text'].tolist()
train_labels = train_data['toxic'].tolist()
test_texts = test_data['comment_text'].tolist()
test_labels = test_data['toxic'].tolist()

# Print examples of texts & labels
print("train_texts:")
print(train_texts[:5])
print("train_labels:")
print(train_labels[:5])
print("test_texts")
print(test_texts[:5])
print("test_labels:")
print(test_labels[:5])

# Split train data into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)
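
Quick sanity check on how much data each split ends up with. This is back-of-the-envelope arithmetic only: it assumes `sample_rate` stays at 1.0 and ignores any rows removed by `dropna()`.

```python
test_fraction = 0.40  # first split: held-out test set
val_fraction = 0.20   # second split: carved out of the remaining training rows

train_share = (1 - test_fraction) * (1 - val_fraction)
val_share = (1 - test_fraction) * val_fraction
test_share = test_fraction

print(f"train={train_share:.2f}, val={val_share:.2f}, test={test_share:.2f}")
```

So roughly 48% of the original rows are used for training, 12% for validation, and 40% for testing.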


# Dataset

In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import ElectraTokenizer, ElectraForSequenceClassification
import torch

class ToxicDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).float()
        return item

    def __len__(self):
        return len(self.labels)


tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)



train_dataset = ToxicDataset(train_encodings, train_labels)
val_dataset = ToxicDataset(val_encodings, val_labels)
test_dataset = ToxicDataset(test_encodings, test_labels)

print("Train Dataset")
# Iterate over train_dataset and print some samples
for i in range(2):  # Print first 2 samples
    sample = train_dataset[i]
    print(f"Sample {i + 1}:")
    # Convert the input_ids tensor to a list and map the IDs back to token strings
    encoding_keys = tokenizer.convert_ids_to_tokens(sample["input_ids"].tolist())
    print("Tokens:", encoding_keys)  # Print the decoded tokens
    print("Label:", sample["labels"].item())  # Print label
    print()

print("Val Dataset")
# Iterate over val dataset and print some samples
for i in range(2):  # Print first 2 samples
    sample = val_dataset[i]
    print(f"Sample {i + 1}:")
    # Convert the input_ids tensor to a list and map the IDs back to token strings
    encoding_keys = tokenizer.convert_ids_to_tokens(sample["input_ids"].tolist())
    print("Tokens:", encoding_keys)  # Print the decoded tokens
    print("Label:", sample["labels"].item())  # Print label
    print()

print("Test Dataset")
# Iterate over test dataset and print some samples
for i in range(2):  # Print first 2 samples
    sample = test_dataset[i]
    print(f"Sample {i + 1}:")
    # Convert the input_ids tensor to a list and map the IDs back to token strings
    encoding_keys = tokenizer.convert_ids_to_tokens(sample["input_ids"].tolist())
    print("Tokens:", encoding_keys)  # Print the decoded tokens
    print("Label:", sample["labels"].item())  # Print label
    print()

In [None]:
print("Train Dataset")
print(len(train_dataset))

# Native PyTorch (instead of HF Trainer)

<B>Benchmark Note:</B><br>On an RTX 3060 Mobile, tokenization took ~45 minutes, and training took 1 hour and 25 minutes.

In [None]:
from torch.utils.data import DataLoader
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
from transformers import get_linear_schedule_with_warmup
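# transformers' own AdamW was deprecated and dropped from recent releases; use PyTorch's instead
# (note: torch.optim.AdamW defaults to weight_decay=0.01)
from torch.optim import AdamW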
import torch
import random

# Access GPU or CPU depending on status
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Grab the pretrained electra-small discriminator to be fine-tuned

model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator', num_labels=1)
model.to(device)
model.train()

# Initialize training params
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optim = AdamW(model.parameters(), lr=1e-5)
num_train_epochs = 1

# Mixed-precision utilities used while fine-tuning the electra-small model
from torch.cuda.amp import GradScaler, autocast

# Initialize GradScaler
scaler = GradScaler()

# Number of training steps
num_training_steps = num_train_epochs * len(train_loader)

# Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optim, num_warmup_steps=0, num_training_steps=num_training_steps)

for epoch in range(num_train_epochs):
  total_loss = 0.0
  for batch_idx, batch in enumerate(train_loader):
    optim.zero_grad()

    # Move data to device
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    # Use autocast to enable mixed precision
    with autocast():
      outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
      loss = outputs.loss

    # Scale the loss before calling backward
    scaler.scale(loss).backward()

    # Update the optimizer parameters
    scaler.step(optim)

    # Update the scale factor
    scaler.update()

    total_loss += loss.item()

    if (batch_idx + 1) % 1 == 0:  # Print progress every batch
      print(f"Epoch [{epoch + 1}/{num_train_epochs}], Batch [{batch_idx + 1}/{len(train_loader)}], Loss: {total_loss / (batch_idx + 1):.4f}")
    scheduler.step()

  print(f"Epoch [{epoch + 1}/{num_train_epochs}], Average Loss: {total_loss / len(train_loader):.4f}")

model.eval()

# Store the fine-tuned model for later use
model.save_pretrained('/usr/fine_tuned_electra_model')
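
For the curious, here's roughly what the linear scheduler in the cell above does to the learning rate. This is my own sketch of the warmup-then-linear-decay rule, not the library's code; `linear_lr_factor` and `total_steps` are made-up names and values for illustration.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 1000  # hypothetical; the notebook uses num_train_epochs * len(train_loader)
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_lr_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in this notebook, the learning rate starts at its full value and shrinks to zero by the last batch.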


Testing the model on the test dataset:

In [None]:
from sklearn.metrics import classification_report
from torch.utils.data import DataLoader

# Initialize testing params
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Store the predictions and true labels
predictions = []
true_labels = []

# Evaluate the model
model.eval()
with torch.no_grad():
    for batch in test_loader:
        # Move data to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()

        # Store predictions and true labels
        predictions.append(logits)
        true_labels.append(label_ids)

# Flatten the predictions and true values for aggregate evaluation
predictions = np.concatenate(predictions, axis=0)
true_labels = np.concatenate(true_labels, axis=0)

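# Print classification report
# The model has a single output (num_labels=1) trained as a regression on the 0/1 labels,
# so threshold the score at 0.5 rather than taking an argmax over a length-1 axis
predicted_labels = (predictions.squeeze(-1) >= 0.5).astype(int)
print(classification_report(true_labels.astype(int), predicted_labels))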
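
A note on reading the outputs: the model is loaded with `num_labels=1`, so each comment gets a single score instead of one score per class. The toy sketch below (made-up scores, and a 0.5 cutoff that is my assumption rather than a tuned value) shows why an argmax over a length-1 axis can't separate the classes, while a threshold can.

```python
def argmax_row(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

scores = [[0.08], [0.91], [0.47], [0.63]]  # hypothetical model outputs, shape (N, 1)

argmax_preds = [argmax_row(row) for row in scores]
threshold_preds = [int(row[0] >= 0.5) for row in scores]

print(argmax_preds)     # [0, 0, 0, 0]
print(threshold_preds)  # [0, 1, 0, 1]
```

The cutoff trades precision against recall, so it's worth tuning on the validation set before trusting the test numbers.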

Work in progress past this point. I'll add more code later to validate and cross-compare ELECTRA to BERT. <br><br>*If training doesn't work, the model I've trained should be available in the same GitHub directory as this notebook.*