In [1]:
from transformers import pipeline


## EXPERIMENT 1: TEXT GENERATION

In [3]:
gen_bert = pipeline("text-generation", model="bert-base-uncased", framework="pt")
gen_bert("The future of Artificial Intelligence is")


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cpu


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]

In [5]:
gen_roberta = pipeline("text-generation", model="roberta-base", framework="pt")
gen_roberta("The future of Artificial Intelligence is")



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cpu


[{'generated_text': 'The future of Artificial Intelligence is'}]

In [7]:
gen_bart = pipeline("text-generation", model="facebook/bart-base", framework="pt")
gen_bart("The future of Artificial Intelligence is", max_length=30)



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence isenda avert avert greeting victories local Reef speech SHOULD SHOULD SHOULD Designs indirect detects restroom restroom restroom escape���igrigr avertGGGGutter adaptiveUr HOR adaptive deerimov avert complicitytraditionalCoach retained killers CI adaptiveFoot nuanced Castro restroom Cassidy restroom Aboriginal restroom confusing First nominate renewables restroomotiation restroom��� flaws Abraham flaws retained restroom squash SHOULD proxy proxyController restroom restroom vacuum adaptiveCoach Gained deer outings CI restroom restroom flaws restroom deer deerBoo deer restroom restroom adaptive CI squash escapeCt outings restroomededCtBooexistioxide routine Clayton avert punishmentsseenBoo HOR227CtCt PacksBooBooBoo restroom escape Partner neat restroom adaptiveBoomia��� retained proxy proxyCoach speech restroom retained FISA restroom flawsBooBootainingCt restroomCtCt ChamCt outingsseenioxideioxideBooBoo flexible restroom restroom 

### Observation (Experiment 1: Text Generation)

- **BERT** produced degenerate output: the prompt followed by a long run of periods and no
  meaningful text. It did not perform usable text generation.

- **RoBERTa** returned only the input prompt without any additional tokens, so it did not
  continue the sequence at all.

- **BART** generated a long continuation of incoherent, repetitive tokens (for example,
  "restroom" and "Boo" recur many times). The continuation also ran well past the requested
  `max_length=30`, because a `max_new_tokens` value of 256 (not set in this notebook) took
  precedence, as the warning in the cell output notes.

### Explanation (Architectural Reason)

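BERT and RoBERTa are **encoder-only models** pretrained primarily with Masked Language Modeling.
They attend to context on both sides of a token and are never trained on autoregressive
next-token prediction, which text generation requires. The `text-generation` pipeline still
loads them with a language-modeling head (hence the `is_decoder=True` warnings in the cell
outputs), but the resulting next-token predictions are not meaningful, so the models either
emit degenerate output or fail to extend the prompt.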
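The difference between bidirectional and autoregressive attention can be made concrete with a
small sketch. The helper `build_attention_mask` is hypothetical (it is not part of
`transformers`) and only illustrates the idea: an encoder lets every position see every other
position, while a decoder restricts each position to itself and earlier positions.

```python
def build_attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means the query (row) may attend to the key (column)."""
    return [
        [1 if (not causal or key <= query) else 0 for key in range(seq_len)]
        for query in range(seq_len)
    ]


bidirectional = build_attention_mask(4, causal=False)  # encoder-style: full visibility
causal = build_attention_mask(4, causal=True)          # decoder-style: no access to later tokens

for name, mask in [("bidirectional (encoder)", bidirectional), ("causal (decoder)", causal)]:
    print(name)
    for row in mask:
        print(" ", row)
```

A model trained only under the first mask has never had to predict a token from its left
context alone, which is the situation a generation loop puts it in.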

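BART is an **encoder–decoder model**, so its decoder is architecturally capable of
autoregressive generation. In this run, however, the `text-generation` pipeline loaded it as
`BartForCausalLM`, and the warning in the cell output reports that `lm_head.weight` and
`model.decoder.embed_tokens.weight` were newly initialized rather than loaded from the
checkpoint. With those weights random, the generated tokens are essentially arbitrary, which
explains the incoherent output. The base checkpoint is also not fine-tuned for open-ended
generation, so coherent continuations would not be guaranteed even with the weights loaded
correctly.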
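The length of BART's output follows from the warning in the cell output, which states that
`max_new_tokens` takes precedence over `max_length`. The sketch below is a simplified model of
that rule, not the library's code; the function name `resolve_total_length` and the prompt
length of 8 tokens are assumptions made for illustration.

```python
def resolve_total_length(prompt_len, max_length=None, max_new_tokens=None):
    """Simplified model of the length limit.

    max_new_tokens counts only generated tokens and wins if both are set;
    max_length counts prompt tokens plus generated tokens.
    """
    if max_new_tokens is not None:
        return prompt_len + max_new_tokens
    if max_length is not None:
        return max_length
    raise ValueError("no length limit given")


prompt_len = 8  # assumed token count for the prompt, for illustration only

# Only max_length set: the total sequence is capped at 30 tokens.
print(resolve_total_length(prompt_len, max_length=30))

# Both set, as in the BART run: the cap becomes prompt length + 256.
print(resolve_total_length(prompt_len, max_length=30, max_new_tokens=256))
```

Under this reading, passing `max_new_tokens=30` instead of `max_length=30` should bound the
continuation as intended.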


## EXPERIMENT 2: FILL-MASK (Masked Language Modeling)

In [10]:
fill_bert = pipeline("fill-mask", model="bert-base-uncased", framework="pt")
fill_bert("The goal of Generative AI is to [MASK] new content.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.5396931171417236,
  'token': 3443,
  'token_str': 'create',
  'sequence': 'the goal of generative ai is to create new content.'},
 {'score': 0.15575730800628662,
  'token': 9699,
  'token_str': 'generate',
  'sequence': 'the goal of generative ai is to generate new content.'},
 {'score': 0.05405494570732117,
  'token': 3965,
  'token_str': 'produce',
  'sequence': 'the goal of generative ai is to produce new content.'},
 {'score': 0.04451525956392288,
  'token': 4503,
  'token_str': 'develop',
  'sequence': 'the goal of generative ai is to develop new content.'},
 {'score': 0.01757749542593956,
  'token': 5587,
  'token_str': 'add',
  'sequence': 'the goal of generative ai is to add new content.'}]

In [12]:
fill_roberta = pipeline("fill-mask", model="roberta-base", framework="pt")
fill_roberta("The goal of Generative AI is to <mask> new content.")


Device set to use cpu


[{'score': 0.3711312711238861,
  'token': 5368,
  'token_str': ' generate',
  'sequence': 'The goal of Generative AI is to generate new content.'},
 {'score': 0.3677145838737488,
  'token': 1045,
  'token_str': ' create',
  'sequence': 'The goal of Generative AI is to create new content.'},
 {'score': 0.08351453393697739,
  'token': 8286,
  'token_str': ' discover',
  'sequence': 'The goal of Generative AI is to discover new content.'},
 {'score': 0.021335123106837273,
  'token': 465,
  'token_str': ' find',
  'sequence': 'The goal of Generative AI is to find new content.'},
 {'score': 0.016521651297807693,
  'token': 694,
  'token_str': ' provide',
  'sequence': 'The goal of Generative AI is to provide new content.'}]

In [13]:
fill_bart = pipeline("fill-mask", model="facebook/bart-base", framework="pt")
fill_bart("The goal of Generative AI is to <mask> new content.")


Device set to use cpu


[{'score': 0.07461534440517426,
  'token': 1045,
  'token_str': ' create',
  'sequence': 'The goal of Generative AI is to create new content.'},
 {'score': 0.06571901589632034,
  'token': 244,
  'token_str': ' help',
  'sequence': 'The goal of Generative AI is to help new content.'},
 {'score': 0.060880281031131744,
  'token': 694,
  'token_str': ' provide',
  'sequence': 'The goal of Generative AI is to provide new content.'},
 {'score': 0.03593571111559868,
  'token': 3155,
  'token_str': ' enable',
  'sequence': 'The goal of Generative AI is to enable new content.'},
 {'score': 0.03319474309682846,
  'token': 1477,
  'token_str': ' improve',
  'sequence': 'The goal of Generative AI is to improve new content.'}]

### Observation (Experiment 2: Masked Language Modeling)

- **BERT** predicted highly relevant words: *"create"* (score 0.54), *"generate"* (0.16),
  and *"produce"* (0.05). The top prediction was *"create"*.

- **RoBERTa** also performed very well, predicting *"generate"* (0.371) and *"create"*
  (0.368) as the most likely tokens with nearly equal confidence, indicating strong
  contextual understanding.

- **BART** filled the masked token, but its top score was only about 0.07 and several
  candidates fit poorly (for example, *"help new content"*), so its predictions were
  less confident and less precise than those of BERT and RoBERTa.
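The confidence gap can be quantified directly from the scores reported in the cell outputs.
The sketch below copies the top-5 scores (rounded to four decimals) and compares the top-1
score with the total probability mass of the five candidates; the helper name `summarize` is
introduced here only for illustration.

```python
# Top-5 fill-mask scores copied (rounded) from the three cell outputs.
top5_scores = {
    "BERT": [0.5397, 0.1558, 0.0541, 0.0445, 0.0176],
    "RoBERTa": [0.3711, 0.3677, 0.0835, 0.0213, 0.0165],
    "BART": [0.0746, 0.0657, 0.0609, 0.0359, 0.0332],
}


def summarize(scores):
    """Return (top-1 score, total probability mass of the listed candidates)."""
    return max(scores), sum(scores)


for model, scores in top5_scores.items():
    top1, mass = summarize(scores)
    print(f"{model:8s} top-1 = {top1:.3f}   top-5 mass = {mass:.3f}")
```

BERT and RoBERTa place more than 80% of the probability on their five candidates, while BART
places about 27%, so most of BART's probability mass lies outside the listed predictions.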

### Explanation (Architectural Reason)

BERT and RoBERTa are trained using **Masked Language Modeling (MLM)**, where random tokens
are masked and the model learns to predict them using bidirectional context. This training
objective makes them highly effective for fill-mask tasks.
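The MLM objective can be sketched in a few lines. This is a simplified illustration, not the
actual pretraining code: the function `mask_tokens` is hypothetical, it always substitutes the
mask token (the published BERT recipe masks about 15% of tokens and sometimes substitutes a
random or unchanged token instead), and it uses whitespace tokens rather than subwords.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with mask_token; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(token)  # the model is trained to recover this token
        else:
            inputs.append(token)
            labels.append(None)   # unmasked positions do not contribute to the loss
    return inputs, labels


tokens = "the goal of generative ai is to create new content .".split()

# A higher mask_prob than 0.15 is used here so that a short sentence shows some masks.
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Because the unmasked tokens on both sides stay visible, the model learns to use left and right
context together, which is exactly what the fill-mask pipeline asks of it.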

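BART also sees mask tokens during pretraining, but as part of a **sequence-to-sequence
denoising objective**: the input text is corrupted and the decoder reconstructs the whole
original sequence, rather than predicting individual masked tokens as in MLM. This likely
explains why it fills the mask plausibly but with far less concentrated probabilities: its
top score here is about 0.07, against 0.54 for BERT and 0.37 for RoBERTa.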


## EXPERIMENT 3: QUESTION ANSWERING

In [16]:
qa_bert = pipeline("question-answering", model="bert-base-uncased", framework="pt")
qa_bert(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.007564860628917813,
 'start': 0,
 'end': 37,
 'answer': 'Generative AI poses significant risks'}

In [17]:
qa_roberta = pipeline("question-answering", model="roberta-base", framework="pt")
qa_roberta(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.0044537503272295,
 'start': 60,
 'end': 82,
 'answer': ', bias, and deepfakes.'}

In [18]:
qa_bart = pipeline("question-answering", model="facebook/bart-base", framework="pt")
qa_bart(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.05502151511609554,
 'start': 0,
 'end': 81,
 'answer': 'Generative AI poses significant risks such as hallucinations, bias, and deepfakes'}

### Observation (Experiment 3: Question Answering)

- **BERT** returned a partial answer: *"Generative AI poses significant risks"*.
  The span is relevant but incomplete: it omits the specific risks (hallucinations,
  bias, and deepfakes). The confidence score was very low (about 0.008).

- **RoBERTa** extracted only a fragment of the expected answer:
  *", bias, and deepfakes."* The fragment names two of the three risks but starts
  mid-list and drops "hallucinations". The confidence score was also very low
  (about 0.004).

- **BART** returned the longest span:
  *"Generative AI poses significant risks such as hallucinations, bias, and deepfakes"*.
  It contains all three risks, but it is almost the entire context rather than a
  targeted answer, and the confidence score (about 0.055) remained low.
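The `start` and `end` values in each output are character offsets into the context, so the
answers can be reproduced by slicing. The sketch below copies the offsets from the three cell
outputs and also reports how much of the context each span covers.

```python
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

# (start, end) character offsets copied from the three pipeline outputs.
spans = {"BERT": (0, 37), "RoBERTa": (60, 82), "BART": (0, 81)}

for model, (start, end) in spans.items():
    answer = context[start:end]                # same text as the 'answer' field
    coverage = (end - start) / len(context)    # fraction of the context returned
    print(f"{model:8s} {coverage:5.0%} of context -> {answer!r}")
```

BART's span covers 81 of the 82 characters in the context, which is why it looks complete: it
returns nearly everything except the final period.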

### Explanation (Architectural Reason)

All three models used in this experiment are **base pretrained checkpoints** that have
**not been fine-tuned for question answering** (for example, on SQuAD). The warnings in
the cell outputs confirm that the question-answering head (`qa_outputs.weight` and
`qa_outputs.bias`) was newly initialized in every case, which leads to low confidence
scores and inconsistent answer quality.
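The effect of an untrained head can be illustrated with a simplified version of extractive
span selection. This is a sketch under stated assumptions, not the pipeline's implementation:
`best_span` is a hypothetical helper, the tokens are whitespace-split rather than subwords, the
random logits stand in for an untrained QA head, and details such as question tokens and
maximum answer length are ignored.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits):
    """Return (start, end, score) maximizing p_start * p_end subject to start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    candidates = (
        (s, e, p_start[s] * p_end[e])
        for s in range(len(p_start))
        for e in range(s, len(p_end))
    )
    return max(candidates, key=lambda c: c[2])


tokens = "Generative AI poses significant risks such as hallucinations , bias , and deepfakes .".split()

# Stand-in for an untrained QA head: random logits that ignore the question entirely.
rng = random.Random(0)
start_logits = [rng.gauss(0, 1) for _ in tokens]
end_logits = [rng.gauss(0, 1) for _ in tokens]

start, end, score = best_span(start_logits, end_logits)
print(tokens[start:end + 1], round(score, 4))
```

Changing the seed changes the selected span, which mirrors the notebook's results: the span
each model returns depends on its random head, not on the question.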

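In this pipeline all three models perform extractive span prediction: the QA head scores
each token as a possible answer start or end. Because that head is untrained, the selected
spans are essentially arbitrary, which is why BERT and RoBERTa return incomplete or
fragmented answers. BART's more complete span should not be read as an architectural
advantage: its QA head (`BartForQuestionAnswering`) is equally untrained, and its answer is
simply almost the entire context. None of the three is reliable without QA-specific
fine-tuning.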
