Adam and AdamW differ in how they apply weight decay. With Adam, weight decay is usually implemented as an L2 term added to the gradient, so the decay is rescaled by the adaptive per-parameter step size. AdamW decouples weight decay from the gradient update and shrinks the weights directly, which can lead to better training performance for some models like BERT. For that reason, AdamW is generally recommended for transformers.
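
To make the difference concrete, here is a minimal pure-Python sketch of one update step for a single scalar weight. It assumes that "Adam with weight decay" means an L2 term added to the gradient and that AdamW shrinks the weight directly before the Adam step. The function names `adam_l2_step` and `adamw_step` and the hyperparameter values are illustrative, not taken from this notebook.

```python
import math

def _adam_direction(g, m, v, t, b1, b2, eps):
    """Return the bias-corrected Adam direction and the updated moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, grad, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01,
                 b1=0.9, b2=0.999, eps=1e-8):
    # Coupled decay: wd * w is folded into the gradient, so it is
    # normalized by the adaptive denominator like any other gradient.
    direction, m, v = _adam_direction(grad + wd * w, m, v, t, b1, b2, eps)
    return w - lr * direction, m, v

def adamw_step(w, grad, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01,
               b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled decay: the moments see only the task gradient,
    # and the weight is shrunk separately by the factor (1 - lr * wd).
    direction, m, v = _adam_direction(grad, m, v, t, b1, b2, eps)
    w = w * (1 - lr * wd)
    return w - lr * direction, m, v

# Zero task gradient, so only the weight decay acts on w = 1.0
w_adam, _, _ = adam_l2_step(1.0, grad=0.0)
w_adamw, _, _ = adamw_step(1.0, grad=0.0)
print(w_adam, w_adamw)
```

In this toy case the coupled version moves the weight by roughly `lr` (to about 0.9) because the decay term is normalized away, while the decoupled version shrinks it by the factor `1 - lr * wd` (to 0.999).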

In [2]:
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
import plotly.express as px
from sklearn.metrics import f1_score
import numpy as np


**Data**

In [3]:
df = pd.read_parquet("/kaggle/input/climatetext/train.parquet")
df['label_int'] = df['label'].str.split("_").str[0]
df['label_int'] = df['label_int'].astype('int')    # get int label
df.head()

Unnamed: 0,quote,label,source,url,language,subsource,id,label_int
0,"There is clear, compelling evidence that many ...",5_science_unreliable,FLICC,https://huggingface.co/datasets/fzanartu/FLICC...,en,CARDS,,5
1,"For most of the Holocene (last 10k years), sea...",1_not_happening,FLICC,https://huggingface.co/datasets/fzanartu/FLICC...,en,hamburg_test1,,1
2,"China, which hosts U.N. climate talks next wee...",4_solutions_harmful_unnecessary,FLICC,https://huggingface.co/datasets/fzanartu/FLICC...,en,CARDS,,4
3,And the fabricated documents (which Dr. Mann a...,0_not_relevant,FLICC,https://huggingface.co/datasets/fzanartu/FLICC...,en,CARDS,,0
4,It's going to be 42 here today and the hottest...,1_not_happening,FLICC,https://huggingface.co/datasets/fzanartu/FLICC...,en,hamburg_test3,,1


In [5]:
label_counts = df['label'].value_counts().reset_index()
label_counts.columns = ['label', 'count']

fig = px.bar(label_counts, x='label', y='count', title='Distribution of Labels in the Data')
fig.show()

**Model** 

***Tokenizer***

*Choosing max length*

75th percentile of quote length (measured in characters, not tokens)

In [6]:
df['quote_length'] = df['quote'].apply(len)

# Using Plotly to create a histogram of quote lengths
fig = px.histogram(df, x='quote_length', title='Distribution of Quote Lengths', 
                   labels={'quote_length': 'Length of Quote'}, 
                   nbins=30, color_discrete_sequence=['#636EFA'])  
fig.show()

In [7]:
# Calculate quantiles
quantiles = df['quote_length'].quantile([0.75, 0.9, 0.95])
print(quantiles)

0.75    365.0
0.90    544.0
0.95    722.5
Name: quote_length, dtype: float64
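
The quantiles above are character counts, while the tokenizer's `max_length` is a token count. Below is a minimal sketch of how token-length quantiles could be computed instead. `token_length_quantile` is a hypothetical helper, and the whitespace `str.split` is only a stand-in so the example runs without downloading anything; in this notebook one would pass `tokenizer.tokenize` as the `tokenize` argument.

```python
import math

def token_length_quantile(texts, tokenize, q, n_special=2):
    """Quantile (linear interpolation) of per-text token counts.

    n_special accounts for the [CLS] and [SEP] tokens added around each text.
    """
    lengths = sorted(len(tokenize(t)) + n_special for t in texts)
    pos = q * (len(lengths) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(lengths) - 1)
    return lengths[lo] + (pos - lo) * (lengths[hi] - lengths[lo])

# Stand-in tokenizer for illustration only: whitespace splitting, not WordPiece.
sample_texts = [
    "sea level is rising",
    "it is cold today",
    "climate",
    "the models are unreliable and biased",
]
print(token_length_quantile(sample_texts, str.split, 0.75))
```

Linear interpolation is also the default that pandas uses for `quantile`, so the result is comparable to the character-based numbers.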




- DistilBERT should consume less energy, since it has fewer parameters than BERT.
- Uncased checkpoint: the input is lowercased, so casing variants of a word map to the same tokens.

In [8]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 8)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Understanding the message**

*Pre-trained model weights*: Loading a checkpoint such as distilbert-base-uncased brings in weights that were pre-trained on a large text corpus. These weights belong to the DistilBERT backbone, the part of the architecture that models language patterns (token embeddings, attention layers).

*New task-specific weights*: DistilBertForSequenceClassification adapts DistilBERT to classification by adding a classification head on top of the backbone. The head's weights (pre_classifier.weight, pre_classifier.bias, classifier.weight, classifier.bias) are not part of the pre-trained checkpoint, so they are newly initialized and have to be trained for the specific classification task.

**What this means for training**

*Initialization*: The head's weights start out random and need to be fine-tuned on a dataset for the task. Fine-tuning adapts DistilBERT's general language understanding to the nuances of this classification problem.

*Necessity of training*: The message is a reminder that although the backbone is pre-trained, the head is new and untrained. Until the model is trained on the downstream task (here, text classification), its predictions are not meaningful.

**Is the code handling this?**

Yes, the code below sets up and runs this training:

*Data preparation*: The texts are encoded, the labels are prepared, and DataLoader instances are created for the training and validation sets.

*Training loop*: For each batch:

- A forward pass computes the predictions and, because labels are provided, the loss, using the model's built-in cross-entropy criterion.
- Backpropagation (loss.backward()) and an optimizer step (optimizer.step()) update the model's weights. Since all of model.parameters() is passed to the optimizer, this covers the pre-trained backbone as well as the new head.
- optimizer.zero_grad() resets the gradients for the next iteration.

*Validation step*: After each epoch, the model is evaluated on the validation set to monitor how well it generalizes beyond the training data.
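
For intuition about the loss value the forward pass returns, here is a minimal pure-Python sketch of cross-entropy computed from raw logits for a single example. `cross_entropy_from_logits` is an illustrative helper, not the library's implementation.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

# A head that scores all 8 classes equally gives a loss of ln(8)
uniform_loss = cross_entropy_from_logits([0.0] * 8, label=3)
print(round(uniform_loss, 4))  # 2.0794
```

A batch loss well below ln(8) ≈ 2.08 suggests the model has moved away from uniform guessing over the 8 classes.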

In [10]:
# Dataset class
class QuotesDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
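        # The encodings are already tensors (return_tensors='pt'), so copy them with
        # clone().detach() instead of torch.tensor(), which triggers a UserWarning.
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}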
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

# Function to encode the data
def encode_data(tokenizer, texts, labels, max_length):
    encodings = tokenizer(texts, truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
    return QuotesDataset(encodings, labels)
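
What `truncation=True` combined with `padding='max_length'` does to each text can be sketched in plain Python. This is an illustration, not the tokenizer's actual code: `pad_or_truncate` is a hypothetical helper, and the ids 101, 102 and 0 are assumed to stand for [CLS], [SEP] and [PAD] as in the BERT uncased vocabulary.

```python
def pad_or_truncate(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = content_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    n_pad = max_length - len(input_ids)
    # 0 in the mask marks padding positions the model should ignore
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([7, 8], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up exactly `max_length` long, so short quotes still occupy the full length in each forward pass.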

**Validation**

- During validation, for each batch, the model outputs logits (i.e., raw model outputs before applying an activation function like softmax).
- The logits are then converted into actual predictions using the torch.argmax function, which selects the index of the highest value in each set of logits across the specified dimension (dim=-1 means along the last dimension, effectively choosing the predicted class).
- These predictions and the true labels from the validation dataset batches are collected into lists: all_predictions and all_true_labels.
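
The steps above can be sketched without any tensors. The helper `argmax_rows` and the toy logits are illustrative assumptions; the notebook itself uses `torch.argmax` on real model outputs.

```python
def argmax_rows(logits_batch):
    """Index of the largest logit in each row (ties go to the first maximum)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_batch]

all_predictions, all_true_labels = [], []

# Two toy validation batches of (logits, true labels) with 3 classes
batches = [
    ([[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]], [1, 2]),
    ([[-0.2, -0.1, -0.3]], [1]),
]
for logits, labels in batches:
    all_predictions.extend(argmax_rows(logits))  # predicted class per example
    all_true_labels.extend(labels)

print(all_predictions, all_true_labels)
```

Applying softmax first would not change the predictions, because softmax preserves the ordering of the logits.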

In [6]:
"""
# subset of the dataset for testing
small_dataset = df.sample(frac=1, random_state=42).reset_index(drop=True)
small_dataset = small_dataset.iloc[:1000]

texts = small_dataset["quote"].to_list()
labels = small_dataset["label_int"].to_list()
"""

In [11]:
texts = df["quote"].to_list()
labels = df["label_int"].to_list()

# dict for mapping label indices to names 
label_dict = df[['label_int', 'label']].drop_duplicates().set_index('label_int')['label'].to_dict()

In [12]:
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Encode the data
# 75th percentile of quote length, measured in characters but applied here as a
# token limit (DistilBERT accepts at most 512 tokens)
max_length = 365
train_dataset = encode_data(tokenizer, X_train, y_train, max_length)
val_dataset = encode_data(tokenizer, X_test, y_test, max_length)

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

# Training settings
lr=2e-5   # need to experiment with this
epochs = 3


optimizer = AdamW(model.parameters(), lr=lr)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch) # Forward pass; model calculates loss if 'labels' are included.
        loss = outputs.loss   # Access the computed loss.
        loss.backward()       # Compute gradients.
        optimizer.step()      # Update model parameters.
        optimizer.zero_grad() # Clear gradients for the next training step.
    # Note: this is the loss of the last batch only, not the epoch average.
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

    # Validation 
    model.eval()
    all_predictions = []
    all_true_labels = []
    
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        all_predictions.extend(predictions.cpu().numpy())
        all_true_labels.extend(batch['labels'].cpu().numpy())

    # Calculate F1 scores
    f1_scores_per_class = f1_score(all_true_labels, all_predictions, average=None)
    weighted_f1_score = f1_score(all_true_labels, all_predictions, average='weighted')

    # Printing F1 scores with corresponding class names
    for index, score in enumerate(f1_scores_per_class):
        print(f"Class: {label_dict[index]}, F1 Score: {score:.4f}")
    print(f"Weighted F1 Score: {weighted_f1_score:.4f}")



To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).



Epoch 1, Loss: 0.8911074995994568
Class: 0_not_relevant, F1 Score: 0.6880
Class: 1_not_happening, F1 Score: 0.7035
Class: 2_not_human, F1 Score: 0.5650
Class: 3_not_bad, F1 Score: 0.4519
Class: 4_solutions_harmful_unnecessary, F1 Score: 0.6394
Class: 5_science_unreliable, F1 Score: 0.5835
Class: 6_proponents_biased, F1 Score: 0.5537
Class: 7_fossil_fuels_needed, F1 Score: 0.5765
Weighted F1 Score: 0.6165



Epoch 2, Loss: 0.5752794742584229
Class: 0_not_relevant, F1 Score: 0.7020
Class: 1_not_happening, F1 Score: 0.7680
Class: 2_not_human, F1 Score: 0.6667
Class: 3_not_bad, F1 Score: 0.6532
Class: 4_solutions_harmful_unnecessary, F1 Score: 0.6655
Class: 5_science_unreliable, F1 Score: 0.5695
Class: 6_proponents_biased, F1 Score: 0.5755
Class: 7_fossil_fuels_needed, F1 Score: 0.5876
Weighted F1 Score: 0.6585



Epoch 3, Loss: 0.769638180732727
Class: 0_not_relevant, F1 Score: 0.7442
Class: 1_not_happening, F1 Score: 0.7539
Class: 2_not_human, F1 Score: 0.6928
Class: 3_not_bad, F1 Score: 0.6222
Class: 4_solutions_harmful_unnecessary, F1 Score: 0.6730
Class: 5_science_unreliable, F1 Score: 0.5984
Class: 6_proponents_biased, F1 Score: 0.6767
Class: 7_fossil_fuels_needed, F1 Score: 0.5444
Weighted F1 Score: 0.6828
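
To show what the two kinds of F1 numbers reported above measure, here is a minimal pure-Python sketch of per-class F1 and its support-weighted average, which is what `average=None` and `average='weighted'` correspond to in `f1_score`. The helper names and the toy labels are illustrative.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, classes):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's count in y_true."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred, sorted(support))
    return sum(scores[c] * n for c, n in support.items()) / len(y_true)

toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0, 2]
toy_scores = per_class_f1(toy_true, toy_pred, classes=[0, 1, 2])
toy_weighted = weighted_f1(toy_true, toy_pred)
print(toy_scores, round(toy_weighted, 4))
```

Because the weights follow class frequency, a weak class lowers the weighted score only in proportion to its share of the validation set, which is why the per-class scores are worth reading alongside it.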
