<a href="https://colab.research.google.com/github/gtyellow/NLPkaggleDemo/blob/main/NLP_Kaggle_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook scored well on Kaggle, which put me in the top 10% of submitters.

I wrote this notebook in Google Colab because Kaggle wouldn't let me install the transformers library for BERT and RoBERTa.  That may change if Kaggle updates its system.  Since it doesn't run directly in Kaggle, you will need to download the datasets from Kaggle, process them here, and then upload a submission.

I tested BERT and RoBERTa and got a slightly better score with BERT for this task.

If at all possible, use a GPU for this, as a CPU will take 10 to 15 times longer.

In [1]:
# Uncomment the following line to install the required packages

#!pip install transformers torch


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import torch
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, RobertaTokenizer
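from transformers import BertForSequenceClassification, RobertaForSequenceClassification
# transformers.AdamW is deprecated; the PyTorch optimizer is the recommended replacement.
# Note that torch.optim.AdamW defaults to weight_decay=0.01, while the transformers version defaulted to 0.0.
from torch.optim import AdamW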
from transformers import get_scheduler
from tqdm.auto import tqdm

In [16]:
# Load the test and train csv files into DataFrames
# You will need to download the csv files and replace '/content/...' with the path where you saved them.  If you use colab, you will

test_df = pd.read_csv("/content/test.csv")
train_df = pd.read_csv('/content/train.csv')

In [4]:
# Drop unnecessary columns
df = train_df.drop(columns=['id', 'keyword', 'location'])

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(df['text'], df['target'], test_size=0.2, random_state=42)


In [5]:

# Load tokenizers
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize the text for BERT
train_encodings_bert = bert_tokenizer(train_texts.tolist(), truncation=True, padding=True)
val_encodings_bert = bert_tokenizer(val_texts.tolist(), truncation=True, padding=True)

# Tokenize the text for RoBERTa
train_encodings_roberta = roberta_tokenizer(train_texts.tolist(), truncation=True, padding=True)
val_encodings_roberta = roberta_tokenizer(val_texts.tolist(), truncation=True, padding=True)


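To make the `truncation=True, padding=True` arguments concrete, here is a minimal sketch built on a toy whitespace tokenizer. It is not the real BERT or RoBERTa tokenizer: `toy_encode`, `max_length`, the pad id of 0, and the sample sentences are all assumptions made up for illustration. It only shows the shape of the result: each sequence is cut to a maximum length, then padded to the longest sequence in the batch, and an attention mask marks which positions hold real tokens.

```python
def toy_encode(texts, max_length=8, pad_id=0):
    """Toy stand-in for a tokenizer call with truncation and padding."""
    vocab = {}
    batch = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Give each new word the next free id (0 is reserved for padding).
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        batch.append(ids[:max_length])  # truncation
    longest = max(len(ids) for ids in batch)
    # Pad every sequence to the longest one in this batch.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encodings = toy_encode(["forest fire near the town", "stay safe"])
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

The real tokenizers return the same two keys (BERT also adds `token_type_ids`), which is why `TextDataset` can loop over `self.encodings.items()` without knowing which model it serves.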

In [6]:
# Create the datasets and dataloaders for BERT and RoBERTa

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset_bert = TextDataset(train_encodings_bert, train_labels.tolist())
val_dataset_bert = TextDataset(val_encodings_bert, val_labels.tolist())
train_dataset_roberta = TextDataset(train_encodings_roberta, train_labels.tolist())
val_dataset_roberta = TextDataset(val_encodings_roberta, val_labels.tolist())

# Create dataloaders
train_loader_bert = DataLoader(train_dataset_bert, batch_size=16, shuffle=True)
val_loader_bert = DataLoader(val_dataset_bert, batch_size=16, shuffle=False)
train_loader_roberta = DataLoader(train_dataset_roberta, batch_size=16, shuffle=True)
val_loader_roberta = DataLoader(val_dataset_roberta, batch_size=16, shuffle=False)
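As a rough check on how much work one epoch involves, the split and batch sizes can be turned into a batch count. The sketch below assumes `train.csv` has 7613 rows, which is an assumption about the Kaggle file and not something this notebook prints; `batches_per_epoch` is a hypothetical helper, not part of any library.

```python
import math

def batches_per_epoch(n_rows, test_size=0.2, batch_size=16):
    """Return (train rows, validation rows, train batches, validation batches)."""
    # train_test_split rounds the validation share up and gives the rest to training.
    n_val = math.ceil(test_size * n_rows)
    n_train = n_rows - n_val
    # DataLoader keeps the final, smaller batch by default (drop_last=False).
    train_batches = math.ceil(n_train / batch_size)
    val_batches = math.ceil(n_val / batch_size)
    return n_train, n_val, train_batches, val_batches

# 7613 is an assumed row count; replace it with len(train_df) in the notebook.
print(batches_per_epoch(7613))
```

With these assumed numbers, each model takes 381 optimizer steps per epoch, so three epochs for both models comes to 2286 steps.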


In [7]:
# Load the models and set the optimizer parameters

# Load models
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
roberta_model = RobertaForSequenceClassification.from_pretrained('roberta-base')

# Define optimizer
bert_optimizer = AdamW(bert_model.parameters(), lr=5e-5)
roberta_optimizer = AdamW(roberta_model.parameters(), lr=5e-5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# Function for training the model on the train dataset and evaluating it on the validation set after every epoch

def train(model, optimizer, train_loader, val_loader, num_epochs=3, model_name="model"):
    # Loops through epochs of training.  I used 3 epochs but more epochs could marginally improve the score
    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"Model: {model_name}")
        # evaluate() leaves the model in eval mode, so switch back to training mode
        # (dropout enabled) at the start of every epoch, not just once before the loop
        model.train()
        for batch in tqdm(train_loader):
            optimizer.zero_grad()
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
        evaluate(model, val_loader)  # calls the evaluate function below

# Function for evaluating the performance of the model.  I used accuracy and F1 score.
# Accuracy is fine for many use cases but is misleading when one class is rare, so it works best on balanced datasets.
# F1 combines precision and recall, which makes it more robust on unbalanced datasets.  Here it is averaged over both classes, weighted by class size.
def evaluate(model, val_loader):
    model.eval()
    correct = 0
    total = 0
    all_predictions = []
    all_labels = []
    with torch.no_grad():
        for batch in val_loader:
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs)
            predictions = outputs.logits.argmax(dim=-1)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    accuracy = correct / total
    f1 = f1_score(all_labels, all_predictions, average='weighted')
    print(f'Validation F1 Score: {f1:.4f}')
    print(f'Validation Accuracy: {accuracy:.4f}')


"""
# Define models and optimizers
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
roberta_model = RobertaForSequenceClassification.from_pretrained('roberta-base')

bert_optimizer = AdamW(bert_model.parameters(), lr=5e-5)
roberta_optimizer = AdamW(roberta_model.parameters(), lr=5e-5)
"""

# Move models to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
bert_model.to(device)
roberta_model.to(device)

# Train BERT model
print("Training BERT Model")
train(bert_model, bert_optimizer, train_loader_bert, val_loader_bert, model_name="BERT")

# Train RoBERTa model
print("Training RoBERTa Model")
train(roberta_model, roberta_optimizer, train_loader_roberta, val_loader_roberta, model_name="RoBERTa")


Training BERT Model
Epoch 1/3
Model: BERT
Validation F1 Score: 0.8208
Validation Accuracy: 0.8207
Epoch 2/3
Model: BERT
Validation F1 Score: 0.8216
Validation Accuracy: 0.8267
Epoch 3/3
Model: BERT
Validation F1 Score: 0.8032
Validation Accuracy: 0.8030
Training RoBERTa Model
Epoch 1/3
Model: RoBERTa
Validation F1 Score: 0.8311
Validation Accuracy: 0.8313
Epoch 2/3
Model: RoBERTa
Validation F1 Score: 0.8189
Validation Accuracy: 0.8253
Epoch 3/3
Model: RoBERTa
Validation F1 Score: 0.8099
Validation Accuracy: 0.8096
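The comments on `evaluate` say that accuracy can mislead when one class is rare. A small worked example makes the point. The labels here are made up (95 negatives, 5 positives) and the "classifier" always predicts 0. The sketch computes the positive-class F1 by hand in plain Python, which is a simplification: `evaluate` uses scikit-learn's `f1_score` with `average='weighted'`, so its numbers are not directly comparable.

```python
def accuracy_and_f1(labels, predictions):
    """Accuracy and positive-class F1 from two equal-length lists of 0/1 values."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

labels = [0] * 95 + [1] * 5      # made-up, heavily unbalanced labels
always_negative = [0] * 100      # a model that never predicts the positive class
acc, f1 = accuracy_and_f1(labels, always_negative)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The always-negative model reaches 95% accuracy while finding none of the positive cases, and the F1 of 0 exposes that. In the run logged here the two metrics stay close to each other, which suggests the validation split is reasonably balanced.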


In [17]:
# Predict disaster (1) or not (0) on the test set with BERT.  To try RoBERTa instead, remove the ''' lines around the code block below.
# You can also wrap this code block in ''' to skip the BERT predictions and speed up processing.

# Extract the tweet text from test_df
test_texts = test_df['text'].tolist()

# Tokenize the text for BERT
test_encodings_bert = bert_tokenizer(test_texts, truncation=True, padding=True)

# Create a dataset for test data
class TestDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

# Create a DataLoader for the test data
test_dataset_bert = TestDataset(test_encodings_bert)
test_loader_bert = DataLoader(test_dataset_bert, batch_size=16, shuffle=False)

#Function to make predictions on the test dataframe
def make_predictions(model, test_loader):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in tqdm(test_loader):
            inputs = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**inputs)
            logits = outputs.logits
            batch_predictions = logits.argmax(dim=-1).cpu().numpy()
            predictions.extend(batch_predictions)

    return predictions

# Make predictions with the trained bert model
test_predictions = make_predictions(bert_model, test_loader_bert)
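`make_predictions` turns each row of logits into a class label by taking the index of the largest value. The sketch below shows that step on plain Python lists, without PyTorch. The logit values are made up, and `argmax_rows` is a hypothetical helper standing in for `logits.argmax(dim=-1)`.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (ties go to the first index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Made-up logits for three tweets; column 0 = not a disaster, column 1 = disaster.
fake_logits = [[1.2, -0.3], [-0.8, 2.1], [0.1, 0.4]]
print(argmax_rows(fake_logits))
```

Because only the ordering of the two logits matters, no softmax is needed before the argmax. Probabilities would only be required to apply a decision threshold other than 0.5.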

In [9]:
# Predict disaster (1) or not (0) on the test set with RoBERTa.  To use it, remove the ''' lines before and after the code block below.
# You can also wrap the BERT code block above in ''' to speed up processing.
'''

# Extract the tweet text from test_df
test_texts = test_df['text'].tolist()

# Tokenize the text for RoBERTa
test_encodings_roberta = roberta_tokenizer(test_texts, truncation=True, padding=True)

# Create a dataset for test data
class TestDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

# Create a DataLoader for the test data
test_dataset_roberta = TestDataset(test_encodings_roberta)
test_loader_roberta = DataLoader(test_dataset_roberta, batch_size=16, shuffle=False)


#Function to make predictions on the test dataframe
def make_predictions(model, test_loader):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in tqdm(test_loader):
            inputs = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**inputs)
            logits = outputs.logits
            batch_predictions = logits.argmax(dim=-1).cpu().numpy()
            predictions.extend(batch_predictions)

    return predictions

# Make predictions with the trained RoBERTa model
test_predictions = make_predictions(roberta_model, test_loader_roberta)
'''

In [22]:
# Add predictions to the test dataframe
test_df['target'] = test_predictions



In [23]:
#Drop unneeded columns to prepare file for submission
test_df.drop(columns=['keyword','location','text'], inplace=True)

In [24]:

# Save the predictions to a csv file that you can upload to Kaggle
test_df.to_csv('submissions.csv', index=False)
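Before uploading, it can be worth checking that the file has the shape the previous cells should produce: an `id` column, a `target` column, and only 0 or 1 as target values. The following is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and the demo runs on a tiny throwaway file so it works without the real data.

```python
import csv
import os
import tempfile

def check_submission(path, expected_columns=("id", "target")):
    """Return the number of data rows; raise ValueError if the file looks malformed."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if tuple(reader.fieldnames or ()) != tuple(expected_columns):
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        n_rows = 0
        for row in reader:
            if row["target"] not in ("0", "1"):
                raise ValueError(f"bad target value: {row['target']!r}")
            n_rows += 1
    return n_rows

# Demo on a made-up three-row file, not on real predictions.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_submission.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("id,target\n0,1\n2,0\n3,1\n")
    demo_rows = check_submission(demo_path)
print(demo_rows)
```

In the notebook, the call would be `check_submission('submissions.csv')`, and the returned count should equal `len(test_df)`.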