<a href="https://colab.research.google.com/github/gtyellow/NLPkaggleDemo/blob/main/NLP_Kaggle_Demo_distilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code uses the distilBERT model to predict if a text is a disaster tweet or not.  This got a decent score for Kaggle but not as good as BERT or roBERTa.

I wrote this notebook in google colab because Kaggle wouldn't let me install the transformers for bert and roberta.  That may change if kaggle updates their system.  Since its not directly in kaggle, you will need to download the datasets from kaggle, then process them, then upload a submission.

I also tested BERT and roBERTa and got a slightly better score with BERT for this task.

If at all possible, use a GPU for this as a CPU will take 10 to 15 times longer.

In [1]:
#Uncomment the following to install py

#!pip install transformers torch


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import torch
from torch.utils.data import Dataset, DataLoader

from transformers import DistilBertTokenizer
from transformers import DistilBertForSequenceClassification, AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm

In [4]:
#Convert test and train datasets to csv
#You will need to download the csv files and replace the 'content/...' to the path where you saved the file.  If you use colab, you will

test_df = pd.read_csv("/content/test.csv")
train_df = pd. read_csv('/content/train.csv')

In [5]:
# Drop unnecessary columns
df = train_df.drop(columns=['id', 'keyword', 'location'])

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(df['text'], df['target'], test_size=0.2, random_state=42)


In [10]:
# Load DistilBERT tokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the text for DistilBERT
train_encodings_distilbert = distilbert_tokenizer(train_texts.tolist(), truncation=True, padding=True)
val_encodings_distilbert = distilbert_tokenizer(val_texts.tolist(), truncation=True, padding=True)


In [12]:
#This creates the dataloaders for bert and roberta

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create a DataLoader for the train data
train_dataset_distilbert = TextDataset(train_encodings_distilbert,train_labels.tolist())
val_dataset_distilbert = TextDataset(val_encodings_distilbert, val_labels.tolist())

train_loader_distilbert = DataLoader(train_dataset_distilbert, batch_size=16, shuffle=True)
val_loader_distilbert = DataLoader(val_dataset_distilbert, batch_size=16, shuffle=False)

In [14]:
#Loads models and sets perameters

# Load model
distilbert_model = DistilBertForSequenceClassification.from_pretrained('bert-base-uncased')


# Define optimizer
distilbert_optimizer = AdamW(distilbert_model.parameters(), lr=5e-5)


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'transformer.layer.0.attention.k_lin.bias', 'transformer.layer.0.attention.k_lin.weight', 'transformer.layer.0.attention.out_lin.bias', 'transformer.layer.0.attention.out_lin.weight', 'transformer.layer.0.attention.q_lin.bias', 'transformer.layer.0.attention.q_lin.weight', 'transformer.layer.0.attention.v_lin.bias', 'transformer.layer.0.attention.v_lin.weight', 'transformer.layer.0.ffn.lin1.bias', 'transformer.layer.0.ffn.lin1.weight', 'transformer.layer.0.ffn.lin2.bias', 'transformer.layer.0.ffn.lin2.weight', 'transformer.layer.0.output_layer_norm.bias', 'transformer.layer.0.output_layer_norm.weight', 'transformer.lay

In [18]:
#Function for training and evaluating the model on the train dataset

def train(model, optimizer, train_loader, val_loader, num_epochs=3, model_name="model"):
    model.train()
    num_batches = len(train_loader)

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"Model: {model_name}")
        #Loops through epochs of training.  I used 3 epochs but more epochs could marginally improve the score
        for batch_idx, batch in enumerate(tqdm(train_loader), start=1):
            optimizer.zero_grad()
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
        evaluate(model, val_loader)#calls the evaluate function below

#Function for evaluating the performance of the model.  I used accuracy and F1 score for evaluating the model.
#Accuracy is good for many use cases but will fail if there are very few positive or negative cases.  Works best on more balanced datasets.
#F1 accounts for unbalanced datasets.
def evaluate(model, val_loader):
    model.eval()
    correct = 0
    total = 0
    all_predictions = []
    all_labels = []
    with torch.no_grad():
        for batch in val_loader:
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs)
            predictions = outputs.logits.argmax(dim=-1)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    accuracy = correct / total
    f1 = f1_score(all_labels, all_predictions, average='weighted')
    print(f'Validation F1 Score: {f1:.4f}')
    print(f'Validation Accuracy: {accuracy:.4f}')


# Move models to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
distilbert_model.to(device)


# Train distilBERT model
print("Training distilBERT Model")
train(distilbert_model, distilbert_optimizer, train_loader_distilbert, val_loader_distilbert, model_name="distilBERT")



Training BERT Model
Epoch 1/3
Model: distilBERT


  0%|          | 0/381 [00:00<?, ?it/s]

Validation F1 Score: 0.7581
Validation Accuracy: 0.7623
Epoch 2/3
Model: distilBERT


  0%|          | 0/381 [00:00<?, ?it/s]

Validation F1 Score: 0.7723
Validation Accuracy: 0.7814
Epoch 3/3
Model: distilBERT


  0%|          | 0/381 [00:00<?, ?it/s]

Validation F1 Score: 0.7785
Validation Accuracy: 0.7853


In [21]:
#This predicts disaster or not with the test set for distilBERT.

# Drop unnecessary columns in test_df
test_texts = test_df['text'].tolist()

# Tokenize the text for BERT
test_encodings_distilbert = distilbert_tokenizer(test_texts, truncation=True, padding=True)

# Create a dataset for test data
class TestDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

# Create a DataLoader for the test data
test_dataset_distilbert = TestDataset(test_encodings_distilbert)
test_loader_distilbert = DataLoader(test_dataset_distilbert, batch_size=16, shuffle=False)

#Function to make predictions on the test dataframe
def make_predictions(model, test_loader):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in tqdm(test_loader):
            inputs = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**inputs)
            logits = outputs.logits
            batch_predictions = logits.argmax(dim=-1).cpu().numpy()
            predictions.extend(batch_predictions)

    return predictions

# Make predictions with the trained bert model
test_predictions = make_predictions(distilbert_model, test_loader_distilbert)

  0%|          | 0/204 [00:00<?, ?it/s]

In [22]:
# Add predictions to the test dataframe
test_df['target'] = test_predictions



In [23]:
#Drop unneeded columns to prepare file for submission
test_df.drop(columns=['keyword','location','text'], inplace=True)

In [24]:

# Optionally, save to a file
test_df.to_csv('submissions.csv', index=False)