# Title: Fine-Tuning BERT for Sequence Classification: A Comprehensive Implementation and Analysis

#### Group Member Names :  
#### Kalpesh Pravin Patil
#### Sourabh Shantaram Bhoir




### INTRODUCTION:  
BERT (Bidirectional Encoder Representations from Transformers) is a powerful method for pre-training language models. It involves training a general-purpose "language understanding" model on a large text corpus, such as Wikipedia. Once pre-trained, this model can be fine-tuned for specific Natural Language Processing (NLP) tasks, like question answering, making BERT highly effective for various applications.
*********************************************************************************************************************
#### AIM :
To fine-tune and evaluate a BERT model for sequence classification, specifically paraphrase detection on the MRPC task from the GLUE benchmark, and to observe how batch size, learning rate, number of epochs, and model size affect the evaluation loss.
*********************************************************************************************************************
#### Github Repo:
https://github.com/kalpeshpravinpatil/MLP_Final_Project
*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduces BERT, a groundbreaking language model that leverages deep bidirectional transformers to achieve state-of-the-art results across numerous NLP tasks. By employing a masked language model (MLM) and next sentence prediction (NSP) during pre-training, BERT captures contextual information from both directions, overcoming the limitations of unidirectional models like GPT. This architecture eliminates the need for complex task-specific designs, allowing simple fine-tuning to excel in tasks such as question answering, sentiment analysis, and named entity recognition. BERT significantly advances benchmarks like GLUE and SQuAD, demonstrating the power of transfer learning with pre-trained models, and its open-source release has revolutionized natural language processing research and applications.

*********************************************************************************************************************
#### PROBLEM STATEMENT :
Traditional language models, such as GPT and ELMo, face limitations in leveraging full bidirectional context due to their unidirectional or shallow bidirectional architectures. This restricts their ability to handle tasks requiring deep contextual understanding, such as question answering, sentiment analysis, and natural language inference. Additionally, these models often rely on complex task-specific architectures, making their adaptation for downstream tasks inefficient.
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
Language representation models are foundational to many NLP applications, yet prior approaches predominantly focus on left-to-right or right-to-left contexts, which limit their ability to process information holistically. Complex tasks like sentence-pair relationships (e.g., natural language inference) and token-level tasks (e.g., named entity recognition) demand a representation that integrates context from both directions. Moreover, the need for task-specific architectures increases development complexity and reduces scalability, creating a bottleneck for broader adoption of NLP solutions.
*********************************************************************************************************************
#### SOLUTION:
The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel pre-trained language model designed to learn deep bidirectional representations by jointly conditioning on both left and right contexts. Through the use of two innovative pre-training objectives—masked language modeling (MLM) and next sentence prediction (NSP)—BERT overcomes the limitations of prior models. It enables efficient fine-tuning for a wide range of NLP tasks without requiring complex task-specific architectures, achieving state-of-the-art performance on multiple benchmarks, such as GLUE and SQuAD, and setting a new standard for language understanding in NLP.
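
To make the masked language modeling objective concrete, here is a minimal sketch of the token-corruption step. It assumes the recipe described in the BERT paper: about 15% of positions are selected for prediction, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative and do not come from this notebook.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "blue"])
print(corrupted)
print(labels)
```

Because the model cannot tell which of the selected positions were actually altered, it has to keep a useful representation for every token, which is the property fine-tuning relies on.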


# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|Paper reference: https://arxiv.org/pdf/1810.04805|A revolutionary bidirectional transformer-based model that pre-trains deep contextual language representations, achieving state-of-the-art results across diverse NLP tasks with minimal task-specific adaptations.|Dataset: https://huggingface.co/datasets/nyu-mll/glue|Limited coverage of complex, real-world linguistic phenomena.|


*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************




In [1]:
!git clone https://github.com/google-research/bert.git


Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Counting objects: 100% (340/340), done.
remote: Compressing objects: 100% (154/154), done.
remote: Total 340 (delta 203), reused 303 (delta 185), pack-reused 0
Receiving objects: 100% (340/340), 192.58 KiB | 5.35 MiB/s, done.
Resolving deltas: 100% (203/203), done.


In [2]:
%cd bert


/content/bert


In [5]:
from datasets import load_dataset
import os

# Load the MRPC dataset
dataset = load_dataset("glue", "mrpc")

# Create the 'data' directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Convert datasets to pandas DataFrames
train_df = dataset['train'].to_pandas()
val_df = dataset['validation'].to_pandas()

# Save the data as TSV files
train_df[['sentence1', 'sentence2', 'label']].to_csv('data/train.tsv', index=False, header=False, sep='\t')
val_df[['sentence1', 'sentence2', 'label']].to_csv('data/dev.tsv', index=False, header=False, sep='\t')

print("Data saved successfully.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/649k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Data saved successfully.


In [6]:
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("glue", "mrpc")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]
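
The tokenizer call above packs each sentence pair into one input sequence. The layout can be sketched without the library as follows. This is a simplified illustration: it assumes whitespace-split tokens instead of WordPiece, skips the mapping to vocabulary ids, and uses a hypothetical helper name `build_pair_input`.

```python
def build_pair_input(tokens_a, tokens_b, max_len):
    """Lay out a sentence pair the way BERT expects it.

    Returns (tokens, token_type_ids, attention_mask), each of length max_len.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("pair is longer than max_len; truncate first")
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    return (tokens + ["[PAD]"] * pad,
            token_type_ids + [0] * pad,
            attention_mask + [0] * pad)

tokens, type_ids, mask = build_pair_input(
    "he said hello".split(), "he greeted us".split(), max_len=12)
print(tokens)
print(type_ids)
print(mask)
```

The classification head reads the final hidden state at the `[CLS]` position, so both sentences influence the prediction through self-attention over this single sequence.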

In [8]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_datasets['train'],         # training dataset
    eval_dataset=tokenized_datasets['validation']      # evaluation dataset
)
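
The `Trainer` above is created without a `compute_metrics` callback, so evaluation reports the loss but no accuracy or F1, which are the metrics GLUE uses for MRPC. A minimal sketch of such a callback follows. It is written in plain Python so it has no extra dependencies, and it assumes label 1 is the positive (paraphrase) class.

```python
def compute_metrics(eval_pred):
    """Accuracy and F1 (positive class = 1) from logits and labels."""
    logits, labels = eval_pred
    # Predicted class is the index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(labels),
        "f1": 2 * tp / denom if denom else 0.0,
    }

# Toy logits and labels, for illustration only.
print(compute_metrics(([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
                       [1, 0, 0, 1])))
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` would add both values to the dictionary returned by `trainer.evaluate()`.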


In [9]:
trainer.train()


Step,Training Loss
10,0.7802
20,0.7874
30,0.7622
40,0.7467
50,0.7073
60,0.6665
70,0.6376
80,0.6293
90,0.5812
100,0.6831


TrainOutput(global_step=1377, training_loss=0.39418783539347485, metrics={'train_runtime': 1018.4023, 'train_samples_per_second': 10.805, 'train_steps_per_second': 1.352, 'total_flos': 2895274053181440.0, 'train_loss': 0.39418783539347485, 'epoch': 3.0})
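
The `global_step=1377` in the output above can be reproduced from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation, which is consistent with the logged value, and also shows how large a share of each run the 500 warmup steps take up for the batch size and epoch combinations used in this notebook.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

TRAIN_EXAMPLES = 3668  # size of the MRPC train split reported above
WARMUP_STEPS = 500

for batch_size, epochs in [(8, 3), (16, 3), (16, 5)]:
    steps = total_steps(TRAIN_EXAMPLES, batch_size, epochs)
    share = min(WARMUP_STEPS, steps) / steps
    print(f"batch={batch_size} epochs={epochs}: {steps} steps, "
          f"warmup covers {share:.0%}")
```

With batch size 8 and 3 epochs this gives 1377 steps, so warmup covers roughly a third of training. With batch size 16 and 3 epochs there are only 690 steps, and warmup takes up most of the run.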

In [10]:
eval_results = trainer.evaluate()
print(eval_results)


{'eval_loss': 0.6390026211738586, 'eval_runtime': 11.1132, 'eval_samples_per_second': 36.713, 'eval_steps_per_second': 4.589, 'epoch': 3.0}
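
The `eval_loss` printed above is the mean cross-entropy over the validation set. To give that number a reference point, the sketch below computes the same quantity from raw logits and shows that a two-class model with no preference between the classes scores ln 2, about 0.693. The function name `mean_cross_entropy` is illustrative.

```python
import math

def mean_cross_entropy(logits, labels):
    """Average negative log-likelihood of the true class."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
    return total / len(labels)

# Uniform guesses on a two-class problem give ln(2) per example.
chance = mean_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(round(chance, 4))  # 0.6931
```

Cross-entropy also penalizes confident mistakes heavily, so the loss alone does not say how many pairs are classified correctly. An accuracy or F1 metric is needed for that.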


In [6]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
import os

# Load the MRPC dataset
dataset = load_dataset("glue", "mrpc")

# Tokenize the dataset
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load the BERT model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Create the 'results' directory if it doesn't exist
os.makedirs('results', exist_ok=True)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
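
The tokenizer call in this cell uses `padding='max_length'` without a `max_length` argument, which pads every example to the model's maximum input length (512 tokens for `bert-base-uncased`). Padding each batch only to its longest member processes far fewer tokens. The sketch below compares the two on made-up sequence lengths. The helper `padded_tokens` and the lengths are hypothetical.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Total tokens processed when batches are padded.

    pad_to=None pads each batch to its own longest sequence (dynamic
    padding); an integer pads every sequence to that fixed length.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total

# Made-up sequence lengths, for illustration only.
lengths = [40, 52, 61, 48, 70, 45, 55, 58]
fixed = padded_tokens(lengths, batch_size=4, pad_to=512)
dynamic = padded_tokens(lengths, batch_size=4)
print(fixed, dynamic)  # 4096 524
```

In the Transformers library, one way to get per-batch padding is to drop `padding='max_length'` from the tokenizer call and pass `data_collator=DataCollatorWithPadding(tokenizer)` to the `Trainer`.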


In [7]:
training_args_lr = TrainingArguments(
    output_dir='./results_lr',          # output directory
    num_train_epochs=3,                 # number of training epochs
    per_device_train_batch_size=16,     # batch size for training (the first run used 8)
    per_device_eval_batch_size=16,      # batch size for evaluation
    learning_rate=5e-5,                 # 5e-5 is also the TrainingArguments default, so it matches the first run
    warmup_steps=500,                   # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    logging_dir='./logs_lr',            # directory for storing logs
    logging_steps=10,
)

trainer_lr = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args_lr,               # training arguments, defined above
    train_dataset=tokenized_datasets['train'],         # training dataset
    eval_dataset=tokenized_datasets['validation']      # evaluation dataset
)

# Train with learning_rate=5e-5 and batch size 16
trainer_lr.train()
eval_results_lr = trainer_lr.evaluate()
print("Learning Rate Adjustment Results:", eval_results_lr)


Step,Training Loss
10,0.7369
20,0.7467
30,0.7132
40,0.6952
50,0.6961
60,0.6695
70,0.6269
80,0.6468
90,0.61
100,0.5941


Learning Rate Adjustment Results: {'eval_loss': 0.5363577008247375, 'eval_runtime': 11.7999, 'eval_samples_per_second': 34.577, 'eval_steps_per_second': 2.203, 'epoch': 3.0}
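
The `learning_rate` passed to `TrainingArguments` is the peak value, not a constant. With the default linear scheduler, the rate climbs from zero during `warmup_steps` and then decays linearly to zero at the last step. A minimal sketch of that schedule, using the hypothetical helper name `linear_schedule_lr`:

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over warmup_steps, then decays
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Batch size 16 for 3 epochs on 3668 examples is 690 optimizer steps.
for step in (0, 250, 500, 690):
    print(step, linear_schedule_lr(step, 5e-5, 500, 690))
```

In this configuration the peak of 5e-5 is reached only at step 500 of 690, so the model spends most of the run below the nominal learning rate.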


In [8]:
training_args_bs = TrainingArguments(
    output_dir='./results_bs',           # output directory
    num_train_epochs=3,                  # number of training epochs
    per_device_train_batch_size=8,       # decreased batch size for training
    per_device_eval_batch_size=8,        # decreased batch size for evaluation
    learning_rate=2e-5,                  # lower than the 5e-5 TrainingArguments default
    warmup_steps=500,                    # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                   # strength of weight decay
    logging_dir='./logs_bs',             # directory for storing logs
    logging_steps=10,
)

trainer_bs = Trainer(
    model=model,                         # same object as the previous run, so training resumes from its fine-tuned weights
    args=training_args_bs,               # training arguments, defined above
    train_dataset=tokenized_datasets['train'],         # training dataset
    eval_dataset=tokenized_datasets['validation']      # evaluation dataset
)

# Train with the decreased batch size
trainer_bs.train()
eval_results_bs = trainer_bs.evaluate()
print("Batch Size Adjustment Results:", eval_results_bs)


Step,Training Loss
10,0.1794
20,0.0502
30,0.0383
40,0.1023
50,0.2009
60,0.0795
70,0.1111
80,0.0621
90,0.0968
100,0.198


Batch Size Adjustment Results: {'eval_loss': 0.959290623664856, 'eval_runtime': 11.5022, 'eval_samples_per_second': 35.472, 'eval_steps_per_second': 4.434, 'epoch': 3.0}


In [9]:
training_args_epochs = TrainingArguments(
    output_dir='./results_epochs',       # output directory
    num_train_epochs=5,                  # increased number of training epochs
    per_device_train_batch_size=16,      # batch size for training
    per_device_eval_batch_size=16,       # batch size for evaluation
    learning_rate=2e-5,                  # lower than the 5e-5 TrainingArguments default
    warmup_steps=500,                    # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                   # strength of weight decay
    logging_dir='./logs_epochs',         # directory for storing logs
    logging_steps=10,
)

trainer_epochs = Trainer(
    model=model,                         # same object as the two previous runs, so training resumes from their fine-tuned weights
    args=training_args_epochs,           # training arguments, defined above
    train_dataset=tokenized_datasets['train'],         # training dataset
    eval_dataset=tokenized_datasets['validation']      # evaluation dataset
)

# Train with the increased number of epochs
trainer_epochs.train()
eval_results_epochs = trainer_epochs.evaluate()
print("Number of Epochs Adjustment Results:", eval_results_epochs)


Step,Training Loss
10,0.0006
20,0.0028
30,0.046
40,0.0615
50,0.0187
60,0.0018
70,0.1015
80,0.009
90,0.0091
100,0.0006


Number of Epochs Adjustment Results: {'eval_loss': 1.069010853767395, 'eval_runtime': 11.8558, 'eval_samples_per_second': 34.414, 'eval_steps_per_second': 2.193, 'epoch': 5.0}
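
The training losses logged for the batch size and epochs runs start near 0.18 and 0.0006 instead of near 0.7, which is what one would expect when `model` still holds the weights from the previous run. For the runs to be comparable, each configuration needs its own freshly loaded model. The sketch below shows that control flow with stand-in functions. `run_experiments`, `make_model`, and `train_and_eval` are hypothetical names, not part of any library.

```python
def run_experiments(configs, make_model, train_and_eval):
    """Run each configuration on its own freshly created model.

    make_model() must return a new model on every call;
    train_and_eval(model, config) trains it and returns a metrics dict.
    """
    results = {}
    for name, config in configs.items():
        model = make_model()  # never reuse weights from a previous run
        results[name] = train_and_eval(model, config)
    return results

# Stand-ins that only demonstrate the control flow.
created = []

def make_stub_model():
    model = {"trained_steps": 0}
    created.append(model)
    return model

def stub_train_and_eval(model, config):
    model["trained_steps"] += config["steps"]
    return {"trained_steps": model["trained_steps"]}

results = run_experiments(
    {"batch_8": {"steps": 1377}, "epochs_5": {"steps": 1150}},
    make_stub_model, stub_train_and_eval)
print(results)
```

With Transformers, `make_model` could be `lambda: BertForSequenceClassification.from_pretrained(model_name, num_labels=2)`. The `Trainer` also accepts a `model_init` argument that serves the same purpose.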


In [10]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

# Load BERT-Large model and tokenizer
model_name = 'bert-large-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')

tokenized_datasets = dataset.map(tokenize_function, batched=True)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]
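
The download sizes shown for `model.safetensors` (440M for `bert-base-uncased`, 1.34G for `bert-large-uncased`) give a quick estimate of model size. The sketch below assumes the weights are stored as 32-bit floats, that the progress-bar sizes are decimal units, and that the file holds little besides weights, so the results are rough.

```python
def approx_params(checkpoint_bytes, bytes_per_param=4):
    """Rough parameter count, assuming float32 weights and no other payload."""
    return checkpoint_bytes / bytes_per_param

base = approx_params(440e6)     # model.safetensors for bert-base-uncased
large = approx_params(1.34e9)   # model.safetensors for bert-large-uncased
print(f"base ~{base / 1e6:.0f}M, large ~{large / 1e6:.0f}M, "
      f"ratio {large / base:.1f}x")
```

The estimates of about 110M and 335M parameters are close to the 110M and 340M reported in the BERT paper, and the roughly threefold difference is in line with the smaller batch size chosen for BERT-Large in the next cell.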

In [11]:
training_args = TrainingArguments(
    output_dir='./results_large',        # output directory
    num_train_epochs=3,                  # number of training epochs
    per_device_train_batch_size=8,       # batch size for training (adjusted for larger model)
    per_device_eval_batch_size=8,        # batch size for evaluation
    learning_rate=2e-5,                  # learning rate
    warmup_steps=500,                    # number of warmup steps
    weight_decay=0.01,                   # strength of weight decay
    logging_dir='./logs_large',          # directory for storing logs
    logging_steps=10,
)


In [12]:
trainer = Trainer(
    model=model,                         # the model to be trained
    args=training_args,                  # training arguments
    train_dataset=tokenized_datasets['train'],         # training dataset
    eval_dataset=tokenized_datasets['validation']      # evaluation dataset
)

trainer.train()
eval_results = trainer.evaluate()
print("Evaluation Results for BERT-Large:", eval_results)


Step,Training Loss
10,0.8724
20,0.8446
30,0.8081
40,0.849
50,0.8675
60,0.7553
70,0.7514
80,0.7428
90,0.7074
100,0.7068


Evaluation Results for BERT-Large: {'eval_loss': 0.5476898550987244, 'eval_runtime': 38.049, 'eval_samples_per_second': 10.723, 'eval_steps_per_second': 1.34, 'epoch': 3.0}


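### Results :
Evaluation loss on the MRPC validation set (408 examples) for each run:

|Run|Model|Batch size|Learning rate|Epochs|Eval loss|
|------|------|------|------|------|------|
|Baseline|bert-base-uncased|8|5e-5 (default)|3|0.6390|
|Learning rate run|bert-base-uncased|16|5e-5|3|0.5364|
|Batch size run|bert-base-uncased|8|2e-5|3|0.9593|
|Epochs run|bert-base-uncased|16|2e-5|5|1.0690|
|BERT-Large|bert-large-uncased|8|2e-5|3|0.5477|

BERT-Large, 3 epochs:
- Evaluation loss: 0.5477
- Evaluation runtime: 38 seconds for the validation set (10.72 samples/second)

Training observations:
- Training loss decreased over the run, from 0.872 at the first logged step to as low as 0.149 toward the end of training.
- Only the loss was recorded. No accuracy or F1 was computed, so convergence is judged from the loss alone.

Extended epochs (5 epochs):
- Training loss fell to near zero (0.0006 at step 10), but evaluation loss rose to 1.0690, the highest of all runs. This points to overfitting, not to further gains.
- The batch size and epochs runs reused the model object that had already been fine-tuned in the preceding run, so their results are not independent of the earlier training.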
*******************************************************************************************************************************


#### Observations :
BERT's bidirectional pretraining allows effective fine-tuning with minimal task-specific changes.
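Training for more epochs lowered the training loss but raised the evaluation loss (1.0690 after 5 epochs), a sign of overfitting, so the number of epochs should be chosen against a validation metric.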

*******************************************************************************************************************************


### Conclusion and Future Direction :
Conclusion:
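The implementation shows that a pre-trained BERT model can be fine-tuned for sequence classification on MRPC with a small amount of Hugging Face code. The lowest evaluation losses were 0.5364 (BERT-base, batch size 16) and 0.5477 (BERT-Large). Because no accuracy or F1 was measured, these runs illustrate BERT's adaptability but do not by themselves reproduce the state-of-the-art results reported in the paper.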

Future Directions:
- Extend experiments to explore other datasets or tasks (e.g., sentiment analysis, named entity recognition).
- Test different BERT variants like RoBERTa or DistilBERT for efficiency and accuracy trade-offs.
- Investigate multi-lingual applications of BERT.

*******************************************************************************************************************************
#### Learnings :
- Pre-trained Models: Pretraining with deep bidirectional representations significantly boosts downstream performance with minimal task-specific modifications.
- Fine-tuning Importance: Adjusting epochs and batch size is critical for balanced learning.
- Framework Utility: Hugging Face simplifies NLP implementations, making advanced research reproducible.
*******************************************************************************************************************************
#### Results Discussion :
Fine-tuning reduced the training loss quickly on MRPC, a well-curated task from GLUE.
The best runs reached evaluation losses of about 0.54, compared with 0.64 for the baseline configuration.
Training for 5 epochs did not help: evaluation loss rose to 1.0690 while training loss approached zero, which emphasizes the need for careful monitoring of validation metrics during fine-tuning.


*******************************************************************************************************************************
#### Limitations :
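- High Computational Requirements: Pre-training BERT is resource-intensive, often requiring multiple GPUs or TPUs. Fine-tuning is cheaper, but BERT-Large was still noticeably slower here (38 seconds per evaluation pass versus about 11 seconds for BERT-base).
- Data Dependence: Performance is strongly tied to high-quality, task-specific datasets.
- Epoch Sensitivity: Extended training risks overfitting, requiring careful validation.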
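
One common guard against the epoch sensitivity noted above is early stopping on the validation loss. The sketch below shows the selection rule on its own. The function name `best_checkpoint` and the per-epoch losses are hypothetical, since this notebook evaluated only once at the end of each run.

```python
def best_checkpoint(eval_losses, patience=1):
    """Pick the stopping point from per-epoch validation losses.

    Returns (best_epoch, stopped_epoch), both 1-indexed. Training stops
    once the loss has failed to improve for `patience` epochs in a row.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(eval_losses)

# Hypothetical per-epoch validation losses, not measured in this notebook.
print(best_checkpoint([0.60, 0.52, 0.55, 0.71, 0.90]))  # (2, 3)
```

The Transformers library offers `EarlyStoppingCallback` for this. It relies on evaluating at regular intervals during training and on `load_best_model_at_end=True` in `TrainingArguments`.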


*******************************************************************************************************************************
#### Future Extension :
- Task Expansion: Explore BERT’s utility in complex tasks like summarization, machine translation, or dialogue systems.
- Model Compression: Use lighter models (e.g., DistilBERT) for resource-constrained environments.
- Cross-Lingual Adaptation: Test BERT on multi-lingual datasets to validate its global applicability.
- Explainability: Investigate methods to interpret BERT’s decision-making process.

# References:

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805. https://arxiv.org/pdf/1810.04805

[2] Hugging Face Transformers Library: https://huggingface.co/transformers/

[3] GLUE Benchmark Dataset: https://gluebenchmark.com/

  