# Unit 1 Assignment: The Model Benchmark Challenge

**Objective**: Evaluate the architectural differences between **BERT**, **RoBERTa**, and **BART** by testing them on tasks they may or may not be designed for.

**Goal**: Understand why model architecture matters and observe real-world failures/successes.

MOHIT KUMAR   
PES2UG23CS350

## Setup: Install and Import Required Libraries

In [2]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")



Libraries imported successfully!


## Models Under Test

1. **BERT** (`bert-base-uncased`): Encoder-only model - designed for understanding, not generation
2. **RoBERTa** (`roberta-base`): Optimized Encoder-only model
3. **BART** (`facebook/bart-base`): Encoder-Decoder model - designed for sequence-to-sequence tasks

---
## Experiment 1: Text Generation

**Task**: Generate text using the prompt: `"The future of Artificial Intelligence is"`

**Expected Behavior**:
- BERT: Should **fail** (Encoder-only, not trained for generation)
- RoBERTa: Should **fail** (Encoder-only, not trained for generation)
- BART: Should **succeed** (Encoder-Decoder, designed for generation)
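
The expectation above rests on how each architecture lets tokens attend to one another. The sketch below is a toy illustration and not code from the `transformers` library: the helper names `bidirectional_mask` and `causal_mask` are hypothetical. It contrasts the full attention pattern an encoder uses with the left-to-right pattern a decoder uses during generation.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI", "is"]
for name, mask in (
    ("encoder (bidirectional)", bidirectional_mask(len(tokens))),
    ("decoder (causal)", causal_mask(len(tokens))),
):
    print(name)
    for tok, row in zip(tokens, mask):
        # Each row shows which positions this token is allowed to see.
        print(f"  {tok:>6}: {row}")
```

A model pretrained only with the first pattern has never been asked to predict a token from its left context alone, which is one reason to expect poor results when it is pushed through a generation pipeline.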

### Test 1.1: BERT - Text Generation

In [3]:
print("Testing BERT for Text Generation...\n")

try:
    generator_bert = pipeline('text-generation', model='bert-base-uncased')
    result = generator_bert("The future of Artificial Intelligence is", max_length=30)
    print("Result:", result)
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")
    print("\nObservation: BERT failed at text generation task.")

Testing BERT for Text Generation...



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: [{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]


### Test 1.2: RoBERTa - Text Generation

In [4]:
print("Testing RoBERTa for Text Generation...\n")

try:
    generator_roberta = pipeline('text-generation', model='roberta-base')
    result = generator_roberta("The future of Artificial Intelligence is", max_length=30)
    print("Result:", result)
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")
    print("\nObservation: RoBERTa failed at text generation task.")

Testing RoBERTa for Text Generation...



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: [{'generated_text': 'The future of Artificial Intelligence is'}]


### Test 1.3: BART - Text Generation

In [5]:
print("Testing BART for Text Generation...\n")

try:
    generator_bart = pipeline('text-generation', model='facebook/bart-base')
    result = generator_bart("The future of Artificial Intelligence is", max_length=30)
    print("Result:", result)
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")
    print("\nObservation: BART may not support text-generation pipeline directly.")

Testing BART for Text Generation...



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: [{'generated_text': 'The future of Artificial Intelligence is withholdived mes whatsoeverScott headers Sunderland fungyellowoka coastlineovers526 Melody mes mes mesvette mes910vette mes mes headers Sunderland Sunderland Melody headerswan BURincible BUR Canyon910 mes mes recycle mk Omn Melody mes compounds mes mes tractor inhuman fung mes Melody Melody Melody mes Melody mes910910910― Melody mes― mes fung mes Merge910 Melody fung Melody Melody counseling mes mes worthruerue fung mes mes910 mes Maxim mes mes Omn Lamb Melodyysics worth Melody Melody worthysics Melodyysics strang financingysics Sunderlandysics fung fung fung Melody mes fung Postysics discriminate fung fung scenicysicsernaut mes worth Melody counseling fung Merge worth worth worth Melody worth worth console worth worth mesclingysicsysics Sunderland fung fung discriminaterue ceiling fungrue Ens910 Post worth worth fung professors fung Melodyrue Merge worth fung fungernaut fung fung Formatrue fung fungysicsysics lol wo
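
The load warning in this cell reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized for `BartForCausalLM`, which would explain the incoherent output. Below is a minimal sketch of one alternative, assuming BART is loaded through its sequence-to-sequence class `BartForConditionalGeneration` and given a `<mask>` to infill. It was not run as part of this notebook, so no output is shown, and the helper names `append_mask` and `infill_with_bart` are hypothetical.

```python
def append_mask(prompt, mask_token="<mask>"):
    """Turn an open-ended prompt into an infilling prompt ending in a mask."""
    return f"{prompt} {mask_token}."


def infill_with_bart(prompt, max_new_tokens=20):
    """Sketch: let the pretrained seq2seq BART rewrite the masked prompt."""
    # Imported lazily so append_mask works without transformers installed.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    inputs = tokenizer(append_mask(prompt), return_tensors="pt")
    # max_new_tokens is set explicitly to avoid the max_length conflict
    # reported in the warnings above.
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


print(append_mask("The future of Artificial Intelligence is"))
```

Calling `infill_with_bart("The future of Artificial Intelligence is")` downloads the checkpoint. Because the base checkpoint is pretrained on denoising rather than open-ended continuation, the result may still be short or generic.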

---
## Experiment 2: Masked Language Modeling (Fill-Mask)

**Task**: Predict the missing word in: `"The goal of Generative AI is to [MASK] new content."`

**Expected Behavior**:
- BERT: Should **succeed** (Trained on MLM task)
- RoBERTa: Should **succeed** (Trained on MLM task)
- BART: Should **succeed** (Has masking capabilities)
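
One practical detail for this experiment: the mask placeholder differs by tokenizer. BERT uses `[MASK]`, while RoBERTa and BART use `<mask>`, as the cells below show. The sketch here builds the right prompt per model. The `MASK_TOKENS` table and `build_masked_prompt` helper are hypothetical names introduced for illustration.

```python
# Mask placeholders used by the three checkpoints in this notebook.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {mask} new content."


def build_masked_prompt(template, model_name):
    """Insert the mask token that the given model's tokenizer expects."""
    return template.format(mask=MASK_TOKENS[model_name])


for model_name in MASK_TOKENS:
    print(f"{model_name}: {build_masked_prompt(TEMPLATE, model_name)}")
```

With a loaded pipeline, the same value can be read from `unmasker.tokenizer.mask_token` instead of a hand-written table.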

### Test 2.1: BERT - Fill-Mask

In [6]:
print("Testing BERT for Fill-Mask...\n")

try:
    unmasker_bert = pipeline('fill-mask', model='bert-base-uncased')
    result = unmasker_bert("The goal of Generative AI is to [MASK] new content.")
    print("Top predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"{i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Testing BERT for Fill-Mask...



Device set to use cpu


Top predictions:
1. create (score: 0.5397)
2. generate (score: 0.1558)
3. produce (score: 0.0541)


### Test 2.2: RoBERTa - Fill-Mask

In [7]:
print("Testing RoBERTa for Fill-Mask...\n")

try:
    unmasker_roberta = pipeline('fill-mask', model='roberta-base')
    result = unmasker_roberta("The goal of Generative AI is to <mask> new content.")
    print("Top predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"{i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Testing RoBERTa for Fill-Mask...



Device set to use cpu


Top predictions:
1.  generate (score: 0.3711)
2.  create (score: 0.3677)
3.  discover (score: 0.0835)


### Test 2.3: BART - Fill-Mask

In [8]:
print("Testing BART for Fill-Mask...\n")

try:
    unmasker_bart = pipeline('fill-mask', model='facebook/bart-base')
    result = unmasker_bart("The goal of Generative AI is to <mask> new content.")
    print("Top predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"{i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Testing BART for Fill-Mask...



Device set to use cpu


Top predictions:
1.  create (score: 0.0746)
2.  help (score: 0.0657)
3.  provide (score: 0.0609)


---
## Experiment 3: Question Answering

**Task**: Answer the question `"What are the risks?"` based on context:

**Context**: `"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."`

**Expected Behavior**:
- BERT: May work but base model not fine-tuned for QA
- RoBERTa: May work but base model not fine-tuned for QA
- BART: May work for extractive QA

### Test 3.1: BERT - Question Answering

In [9]:
print("Testing BERT for Question Answering...\n")

try:
    qa_bert = pipeline('question-answering', model='bert-base-uncased')
    result = qa_bert({
        'question': 'What are the risks?',
        'context': 'Generative AI poses significant risks such as hallucinations, bias, and deepfakes.'
    })
    print(f"Answer: {result['answer']}")
    print(f"Confidence Score: {result['score']:.4f}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Testing BERT for Question Answering...



Device set to use cpu


Answer: Generative AI poses significant
Confidence Score: 0.0077


### Test 3.2: RoBERTa - Question Answering

In [10]:
print("Testing RoBERTa for Question Answering...\n")

try:
    qa_roberta = pipeline('question-answering', model='roberta-base')
    result = qa_roberta({
        'question': 'What are the risks?',
        'context': 'Generative AI poses significant risks such as hallucinations, bias, and deepfakes.'
    })
    print(f"Answer: {result['answer']}")
    print(f"Confidence Score: {result['score']:.4f}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Testing RoBERTa for Question Answering...



Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: risks such as hallucinations, bias, and deepfakes
Confidence Score: 0.0090


### Test 3.3: BART - Question Answering

In [11]:
print("Testing BART for Question Answering...\n")

try:
    qa_bart = pipeline('question-answering', model='facebook/bart-base')
    result = qa_bart({
        'question': 'What are the risks?',
        'context': 'Generative AI poses significant risks such as hallucinations, bias, and deepfakes.'
    })
    print(f"Answer: {result['answer']}")
    print(f"Confidence Score: {result['score']:.4f}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}")
    print(f"Message: {str(e)}")

Testing BART for Question Answering...



Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: Generative
Confidence Score: 0.0713
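
All three runs above warn that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized, so the answer spans come from an untrained head. The toy sketch below shows why that tends to give low confidence, assuming a span's score is the product of its softmaxed start and end probabilities. This is a simplification of what an extractive QA head does, and the `softmax` and `best_span` helpers are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Untrained head: flat logits, so every span is about equally (un)likely.
print("flat logits:  ", best_span([0.0] * 12, [0.0] * 12))

# Trained-like head: sharp peaks at the answer's first and last token.
print("peaked logits:", best_span([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]))
```

With flat logits over 12 tokens, each span scores 1/144, and the span that wins is decided by tie-breaking rather than by the question. Fine-tuning is what sharpens the peaks.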


---
## Observation Table

**Instructions**: After running all experiments above, fill out this table with your observations.

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | Failure | Generated only a long run of periods (meaningless) | BERT is Encoder-only; it was trained with bidirectional MLM, not autoregressive generation |
| | RoBERTa | Failure | Echoed the prompt and generated nothing new | RoBERTa is Encoder-only; it was never trained to predict the next token |
| | BART | Failure | Generated incoherent token salad ("mes Melody fung ...") | The `text-generation` pipeline loaded `BartForCausalLM` with a newly initialized `lm_head.weight` and `model.decoder.embed_tokens.weight`, so the pretrained Encoder-Decoder was not used as a seq2seq model |
| **Fill-Mask** | BERT | Success | Predicted "create" (53.97%), "generate" (15.58%) | BERT is trained on Masked Language Modeling (MLM) |
| | RoBERTa | Success | Predicted "generate" (37.11%), "create" (36.77%) | RoBERTa is trained on MLM with an optimized training recipe |
| | BART | Success | Predicted "create" (7.46%), but with much lower confidence | BART's denoising pretraining lets it fill masks, but single-token prediction is not what it is optimized for |
| **QA** | BERT | Partial Failure | Wrong answer ("Generative AI poses significant"), very low confidence (0.77%) | Base model not fine-tuned for QA; the `qa_outputs` head was newly initialized |
| | RoBERTa | Partial Success | Plausible answer ("risks such as hallucinations, bias, and deepfakes") but very low confidence (0.90%) | Base model not fine-tuned for QA; with a newly initialized head, the correct span is not a reliable result |
| | BART | Failure | Wrong answer "Generative" (7.13% confidence) | Base model not fine-tuned for extractive QA; the `qa_outputs` head was newly initialized |

---
## Summary & Key Learnings

### 1. Key Observations on Encoder-only Models (BERT, RoBERTa)

- Encoder-only models excel at **understanding and filling masked tokens** (MLM task)
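- They **did not generate usable text** in Experiment 1: BERT produced only periods and RoBERTa echoed the prompt, because neither was trained for autoregressive generation
- Both gave sensible Fill-Mask predictions; BERT was more confident in its top choice ("create", 53.97%), while RoBERTa split its probability almost evenly between "generate" (37.11%) and "create" (36.77%)
- Without fine-tuning, they perform poorly on Question Answering (confidence below 1%), because the QA head is newly initialized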
- Best suited for: classification, NER, sentiment analysis, and masked language modeling

### 2. Key Observations on Encoder-Decoder Models (BART)

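- BART has both encoder and decoder components, but in this experiment it did **not** generate coherent text: the `text-generation` pipeline loaded it as `BartForCausalLM` with newly initialized weights, and the output was gibberish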
- Can handle Fill-Mask tasks but with lower confidence than encoder-only models (not optimized for it)
- Without fine-tuning for QA, performance is poor despite having sophisticated architecture
- The encoder-decoder structure makes it versatile but requires task-specific fine-tuning
- Best suited for: summarization, translation, text generation, and seq2seq tasks

### 3. Why Model Architecture Matters for Specific Tasks

- **Architecture shapes capability**: Encoder-only models attend bidirectionally and are not trained to predict the next token, so they produced no usable text in Experiment 1
- **Training objective matters**: Models perform best on tasks similar to their training objective (BERT trained on MLM, so it excels at Fill-Mask)
- **Efficiency**: Using the wrong architecture wastes computational resources (forcing BERT to generate is inefficient)
- **Real-world impact**: In production, choosing the wrong architecture leads to poor user experience and failures
- **Base models need fine-tuning**: Even correct architecture needs task-specific training (as seen in QA results)

### 4. Production Model Recommendations

- **Text Generation**: GPT-2, GPT-3, or other **decoder-only models** (NOT BERT/RoBERTa) - they are specifically designed for autoregressive generation
- **Fill-Mask**: BERT or RoBERTa - they are trained on MLM and showed highest confidence scores (53.97% and 37.11%)
- **Question Answering**: A BERT-family model fine-tuned on the SQuAD dataset (e.g., `distilbert-base-cased-distilled-squad`, a distilled BERT variant) - base models aren't enough because their QA head is untrained
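
The recommendations above can be captured as a small lookup, sketched below. The `RECOMMENDED` table and `load_recommended` helper are hypothetical names. The model IDs are assumptions drawn from the list: `gpt2` for GPT-2, `bert-base-uncased` for Fill-Mask, and the SQuAD-tuned checkpoint named in the QA bullet. Loading a pipeline downloads weights, so the sketch only prints the table.

```python
# Task -> checkpoint, following the recommendations in this section.
RECOMMENDED = {
    "text-generation": "gpt2",
    "fill-mask": "bert-base-uncased",
    "question-answering": "distilbert-base-cased-distilled-squad",
}


def load_recommended(task):
    """Build a pipeline for the task using the recommended checkpoint."""
    if task not in RECOMMENDED:
        raise ValueError(f"No recommendation recorded for task: {task!r}")
    # Imported lazily so the lookup table is usable without transformers.
    from transformers import pipeline

    return pipeline(task, model=RECOMMENDED[task])


for task, model_name in RECOMMENDED.items():
    print(f"{task:>20} -> {model_name}")
```

For example, `load_recommended("question-answering")` would return a QA pipeline whose span head has been trained, unlike the base checkpoints tested in Experiment 3.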