Mounting Google Drive, which is used to store the fine-tuned model and tokenizer.

In [None]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/M3/

Installing Required Packages

In [None]:
!pip install datasets
!pip install transformers
!pip install torch

In [None]:
from pprint import pprint
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

# Early tokenizer instance; it is created again from model_name further down.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
import random
import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import pandas as pd
from datasets import load_dataset

In [None]:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

model_name = "distilbert-base-uncased"

Loading the Dataset: the load_dataset function from the datasets package loads the sample configuration of HUPD/hupd, with filing-date windows that define the training and validation splits.

In [None]:
dataset_dict = load_dataset('HUPD/hupd',
                            name='sample',
                            data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
                            icpr_label=None,
                            train_filing_start_date='2016-01-01',
                            train_filing_end_date='2016-01-21',
                            val_filing_start_date='2016-01-22',
                            val_filing_end_date='2016-01-31')

Loading the training and validation splits of the HUPD/hupd dataset and converting them into Pandas dataframes. The validation split serves as the test set in the rest of the notebook.

In [None]:
train_df = dataset_dict['train'].to_pandas()
test_df = dataset_dict['validation'].to_pandas()

Creating the training data by collecting the abstract texts and the claims texts from the training set into a single list, with label 0 assigned to each abstract and label 1 to each claims text.

In [None]:
train_texts = list(train_df['abstract']) + list(train_df['claims'])
train_labels = [0] * len(train_df) + [1] * len(train_df)

Creating the test data in the same way: the abstract and claims columns of the test dataframe are collected into a single list, with label 0 for abstracts and label 1 for claims.

In [None]:
test_texts = list(test_df['abstract']) + list(test_df['claims'])
test_labels = [0] * len(test_df) + [1] * len(test_df)
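
The training and test cells above follow the same pattern, so it could be wrapped in one helper. The sketch below is only an illustration: build_texts_and_labels is a hypothetical name that does not appear in the notebook, and the toy strings stand in for real HUPD rows.

```python
def build_texts_and_labels(abstracts, claims):
    """Stack abstracts and claims into one list with matching labels.

    Hypothetical helper: label 0 marks an abstract and label 1 marks a
    claims text, mirroring the scheme used in the notebook cells.
    """
    abstracts = list(abstracts)
    claims = list(claims)
    texts = abstracts + claims
    labels = [0] * len(abstracts) + [1] * len(claims)
    return texts, labels


# Toy stand-ins for two patent applications (not real HUPD data).
toy_abstracts = ["A widget for cutting.", "A method of mixing."]
toy_claims = ["1. A widget comprising a blade.", "1. A method comprising stirring."]

texts, labels = build_texts_and_labels(toy_abstracts, toy_claims)
print(labels)
```

With the notebook's dataframes, the call would look like build_texts_and_labels(train_df['abstract'], train_df['claims']).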

Creating a tokenizer instance for the pre-trained DistilBERT model named in model_name.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

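Using the tokenizer to encode train_texts and test_texts into input IDs and attention masks. With truncation=True, texts longer than the model's maximum length are cut off; with padding=True, shorter texts are padded to the length of the longest one.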

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
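
To make the effect of truncation=True and padding=True concrete, here is a simplified pure-Python picture of what happens to a batch of token ids. It is not the tokenizer's actual implementation: pad_and_truncate is a hypothetical function, pad_id=0 and the token ids are assumptions chosen for illustration, and max_length plays the role of the model's length limit.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Simplified picture of padding=True combined with truncation=True.

    Assumptions for illustration: sequences are plain lists of token ids,
    pad_id is 0, and max_length stands in for the model's length limit.
    """
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding positions.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; a real tokenizer also manages special tokens itself.
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

One difference from the real tokenizer is that this sketch simply slices the list, whereas the tokenizer keeps the closing special token when it truncates. Either way, any text beyond the length limit never reaches the model, which may matter for long claims texts.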

This is a class definition for creating a custom PyTorch dataset named FTDataset. The dataset takes in two arguments, encodings and labels, which are the encoded texts and corresponding labels respectively.

In [None]:
class FTDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

Instantiating FTDataset objects for the train and test sets using the encoded texts (train_encodings and test_encodings) and labels (train_labels and test_labels). This step creates PyTorch Dataset objects that are later passed to the Trainer for training and evaluation.

In [None]:
train_dataset = FTDataset(train_encodings, train_labels)
test_dataset = FTDataset(test_encodings, test_labels)

Initializing a pre-trained transformer model for sequence classification using the AutoModelForSequenceClassification class from the Hugging Face Transformers library, with two output labels (abstract and claims).

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
)

Initializing the training arguments for the model, including the number of training epochs, batch size, and learning rate, among other hyperparameters.

In [None]:
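training_args = TrainingArguments(
    output_dir='./results',            # where checkpoints are written
    num_train_epochs=2,                # total number of training epochs
    per_device_train_batch_size=32,    # batch size per device during training
    per_device_eval_batch_size=64,     # batch size per device during evaluation
    warmup_steps=500,                  # number of learning-rate warmup steps
    learning_rate=5e-5,                # peak learning rate
    weight_decay=0.01,                 # strength of weight decay
    logging_dir='./logs',              # directory for training logs
    logging_steps=10,                  # log every 10 steps
)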
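
The interaction between warmup_steps, the batch size, and the number of epochs is easier to see with numbers. The sketch below assumes the Trainer's default linear schedule (linear warmup followed by linear decay), a single device, and no gradient accumulation. linear_warmup_lr is a hypothetical helper, and num_examples is a made-up dataset size, not the size of the HUPD sample.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset size; substitute len(train_dataset) in the notebook.
num_examples = 20_000
batch_size = 32
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

halfway_warmup = linear_warmup_lr(250, 5e-5, 500, total_steps)
peak = linear_warmup_lr(500, 5e-5, 500, total_steps)
print(total_steps, halfway_warmup, peak)
```

If total_steps comes out smaller than warmup_steps, training ends while the rate is still ramping up, so it is worth checking this figure against len(train_dataset).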

Creating the Trainer object, which is responsible for training the model using the specified training and evaluation datasets, along with the given hyperparameters and settings.

In [None]:
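trainer = Trainer(
    model=model,                       # the DistilBERT classifier to fine-tune
    args=training_args,                # training arguments defined above
    train_dataset=train_dataset,       # training dataset
    eval_dataset=test_dataset          # evaluation dataset
)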

Training the model on train_dataset according to the settings in training_args.

In [None]:
trainer.train()

Evaluating the trained model on test_dataset.

In [None]:
eval_results = trainer.evaluate()
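
As written, the Trainer is built without a compute_metrics function, so evaluate() reports the evaluation loss and timing figures but no accuracy. A minimal sketch of an accuracy metric follows; it is an addition for illustration rather than part of the original notebook, and the toy logits are made-up numbers.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair passed in by the Trainer."""
    logits, labels = eval_pred
    # The predicted class is the index of the larger of the two logits.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy logits for four examples and two classes (made-up numbers).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

The function would be passed as Trainer(..., compute_metrics=compute_metrics), after which the dictionary returned by evaluate() should also contain an eval_accuracy entry.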

Saving the fine-tuned DistilBERT model weights and configuration to the specified directory.

In [None]:
model.save_pretrained("./results/saved_model")

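Saving the tokenizer (its vocabulary and configuration) to the same directory, so that the model and tokenizer can be reloaded together.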

In [None]:
tokenizer.save_pretrained("./results/saved_model")