# Unit 1: Model Benchmark Challenge


In [12]:
!pip install transformers torch --quiet


In [13]:
from transformers import pipeline


## Experiment 1: Text Generation
### Model: BERT (bert-base-uncased)


In [14]:
from transformers import pipeline

bert_generator = pipeline(
    "text-generation",
    model="bert-base-uncased"
)

bert_generator("The future of Artificial Intelligence is", max_length=30)


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]
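
The last warning above says that `max_length=30` was overridden by the pipeline's default `max_new_tokens` (=256), which is why the output runs far past 30 tokens. A minimal sketch of passing `max_new_tokens` directly instead; the helper name `generate_continuation` is hypothetical, and the import is deferred so the function can be defined without loading a model:

```python
def generate_continuation(model_name, prompt, max_new_tokens=30):
    """Run a text-generation pipeline with an explicit cap on new tokens."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    # max_new_tokens counts only generated tokens, not the prompt tokens.
    return generator(prompt, max_new_tokens=max_new_tokens)


# Example call (downloads the checkpoint on first use):
# generate_continuation("bert-base-uncased", "The future of Artificial Intelligence is")
```

This only changes how much text is produced; it does not make an encoder-only checkpoint generate meaningful continuations.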

### Model: RoBERTa (roberta-base)


In [15]:
roberta_generator = pipeline(
    "text-generation",
    model="roberta-base"
)

roberta_generator("The future of Artificial Intelligence is", max_length=30)


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is'}]

### Model: BART (facebook/bart-base)


In [16]:
bart_generator = pipeline(
    "text-generation",
    model="facebook/bart-base"
)

bart_generator("The future of Artificial Intelligence is", max_length=40)


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence istips competenttipstipsScore Un Veteranossibilitytipstipstips Surfacetipsbuf MediumtipsIncreases crawl crawl DOIIncreases DOIbonestipsIncreases DOI Vanilla Jeff Jeff perkScore faultyScore toughnessUnit neither Jeff flubuf faulty faultytips faulty faulty Wh Lay Koraching faultyfull faulty faulty faultybuf 291.[ Papers faulty competenttipsanalysisGround Papers Iristips faulty Lay faulty faulty Layivalry faulty faulty Kor competent parked brainstipstips faulty Sod Kortips Kor PapersSimple faultyruction Laytips Laytips Papersoak Kor Paperstips faulty324324tips Commonwealth faulty faultyivalry Kor parkedtipstips barbaric Kortips cautious Kor Laytips Noisetips faulty Marriage 247tips Lay bittentipstips psychiatric faulty faulty Papers Kor Kor concedtips oversized Layfull Laytips faulty competent Kor faulty Papers mortals faulty decisive faulty Laytips Commonwealth Lay faulty Kor faulty Teachers faulty fract faulty faulty Commonwealt
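
To compare the three generation outputs above more systematically, one option is to strip the prompt and inspect what each model actually added. The sketch below is illustrative only: `new_text` and `has_words` are hypothetical helpers, and BERT's output is represented by a shortened run of periods rather than the full string.

```python
def new_text(prompt, generated_text):
    """Return what the model added after the prompt, stripped of whitespace."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


def has_words(text):
    """True if the text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


prompt = "The future of Artificial Intelligence is"

# Stand-ins for the outputs above: BERT's run of periods is shortened here,
# and RoBERTa's output is the prompt unchanged.
bert_out = prompt + "." * 20
roberta_out = prompt

for name, out in [("BERT", bert_out), ("RoBERTa", roberta_out)]:
    added = new_text(prompt, out)
    print(name, "| added characters:", len(added), "| contains words:", has_words(added))
```

A check like this separates "nothing added" from "something added", but it cannot judge coherence: BART's output does contain words, and a human still has to read it to see that it is not a sensible continuation.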

## Experiment 2: Masked Language Modeling (Fill-Mask)
### Model: BERT (bert-base-uncased)


In [17]:
bert_mask = pipeline(
    "fill-mask",
    model="bert-base-uncased"
)

bert_mask("The goal of Generative AI is to [MASK] new content.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.5396932363510132,
  'token': 3443,
  'token_str': 'create',
  'sequence': 'the goal of generative ai is to create new content.'},
 {'score': 0.15575720369815826,
  'token': 9699,
  'token_str': 'generate',
  'sequence': 'the goal of generative ai is to generate new content.'},
 {'score': 0.05405500903725624,
  'token': 3965,
  'token_str': 'produce',
  'sequence': 'the goal of generative ai is to produce new content.'},
 {'score': 0.04451530799269676,
  'token': 4503,
  'token_str': 'develop',
  'sequence': 'the goal of generative ai is to develop new content.'},
 {'score': 0.01757744885981083,
  'token': 5587,
  'token_str': 'add',
  'sequence': 'the goal of generative ai is to add new content.'}]

### Model: RoBERTa (roberta-base)


In [18]:
roberta_mask = pipeline(
    "fill-mask",
    model="roberta-base"
)

roberta_mask("The goal of Generative AI is to <mask> new content.")


Device set to use cpu


[{'score': 0.3711312413215637,
  'token': 5368,
  'token_str': ' generate',
  'sequence': 'The goal of Generative AI is to generate new content.'},
 {'score': 0.3677145540714264,
  'token': 1045,
  'token_str': ' create',
  'sequence': 'The goal of Generative AI is to create new content.'},
 {'score': 0.08351420611143112,
  'token': 8286,
  'token_str': ' discover',
  'sequence': 'The goal of Generative AI is to discover new content.'},
 {'score': 0.021335121244192123,
  'token': 465,
  'token_str': ' find',
  'sequence': 'The goal of Generative AI is to find new content.'},
 {'score': 0.016521666198968887,
  'token': 694,
  'token_str': ' provide',
  'sequence': 'The goal of Generative AI is to provide new content.'}]

### Model: BART (facebook/bart-base)


In [19]:
bart_mask = pipeline(
    "fill-mask",
    model="facebook/bart-base"
)

bart_mask("The goal of Generative AI is to <mask> new content.")


Device set to use cpu


[{'score': 0.07461541891098022,
  'token': 1045,
  'token_str': ' create',
  'sequence': 'The goal of Generative AI is to create new content.'},
 {'score': 0.06571870297193527,
  'token': 244,
  'token_str': ' help',
  'sequence': 'The goal of Generative AI is to help new content.'},
 {'score': 0.060880109667778015,
  'token': 694,
  'token_str': ' provide',
  'sequence': 'The goal of Generative AI is to provide new content.'},
 {'score': 0.03593561053276062,
  'token': 3155,
  'token_str': ' enable',
  'sequence': 'The goal of Generative AI is to enable new content.'},
 {'score': 0.03319477662444115,
  'token': 1477,
  'token_str': ' improve',
  'sequence': 'The goal of Generative AI is to improve new content.'}]
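
The three fill-mask outputs above can be compared numerically. The sketch below copies the scores from those outputs (token strings are stripped of their leading space) and reports the top prediction, its margin over the runner-up, and the total probability mass in the top five. The `summarize` helper is hypothetical.

```python
# Scores copied from the three fill-mask outputs above (top 5 per model).
top5 = {
    "BERT": [("create", 0.5396932363510132), ("generate", 0.15575720369815826),
             ("produce", 0.05405500903725624), ("develop", 0.04451530799269676),
             ("add", 0.01757744885981083)],
    "RoBERTa": [("generate", 0.3711312413215637), ("create", 0.3677145540714264),
                ("discover", 0.08351420611143112), ("find", 0.021335121244192123),
                ("provide", 0.016521666198968887)],
    "BART": [("create", 0.07461541891098022), ("help", 0.06571870297193527),
             ("provide", 0.060880109667778015), ("enable", 0.03593561053276062),
             ("improve", 0.03319477662444115)],
}


def summarize(candidates):
    """Summarize a ranked list of (token, score) pairs."""
    scores = [score for _, score in candidates]
    return {
        "top1": candidates[0][0],
        "top1_score": scores[0],
        "margin": scores[0] - scores[1],  # gap between first and second choice
        "top5_mass": sum(scores),         # probability covered by the top 5
    }


for model, candidates in top5.items():
    s = summarize(candidates)
    print(f"{model:8s} top1={s['top1']:9s} score={s['top1_score']:.3f} "
          f"margin={s['margin']:.3f} top5_mass={s['top5_mass']:.3f}")
```

BERT and RoBERTa concentrate roughly 0.81 and 0.86 of the probability mass in their top five candidates, while BART's top five cover only about 0.27, which is one way to quantify its lower confidence on this task.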

## Experiment 3: Question Answering
### Model: BERT (bert-base-uncased)


In [20]:
bert_qa = pipeline(
    "question-answering",
    model="bert-base-uncased"
)

bert_qa(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.02191578270867467,
 'start': 46,
 'end': 82,
 'answer': 'hallucinations, bias, and deepfakes.'}

### Model: RoBERTa (roberta-base)


In [21]:
roberta_qa = pipeline(
    "question-answering",
    model="roberta-base"
)

roberta_qa(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.005414959508925676,
 'start': 38,
 'end': 71,
 'answer': 'such as hallucinations, bias, and'}

### Model: BART (facebook/bart-base)


In [22]:
bart_qa = pipeline(
    "question-answering",
    model="facebook/bart-base"
)

bart_qa(
    question="What are the risks?",
    context="Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
)


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.022303341887891293,
 'start': 0,
 'end': 31,
 'answer': 'Generative AI poses significant'}
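
The `start` and `end` values in the QA outputs above are character offsets into the context string, so each answer can be checked by slicing. The sketch below copies the offsets and answers from the three outputs; the `risks` list is an assumption introduced here to see which of the three risks named in the context each span covers.

```python
context = ("Generative AI poses significant risks such as "
           "hallucinations, bias, and deepfakes.")

# (start, end, answer) copied from the three QA outputs above.
results = {
    "BERT": (46, 82, "hallucinations, bias, and deepfakes."),
    "RoBERTa": (38, 71, "such as hallucinations, bias, and"),
    "BART": (0, 31, "Generative AI poses significant"),
}

# Assumed reference answer: the three risks named in the context.
risks = ["hallucinations", "bias", "deepfakes"]

coverage = {}
for model, (start, end, answer) in results.items():
    span = context[start:end]
    # The pipeline's answer is exactly the slice of the context.
    assert span == answer
    coverage[model] = [risk for risk in risks if risk in span]
    print(model, "->", repr(span), "| risks covered:", coverage[model])
```

By this measure BERT's span covers all three risks, RoBERTa's covers two, and BART's covers none.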

## Observation Table

| Task | Model | Classification (Success / Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
|-----|------|------------------------------------|--------------------------------------|--------------------------------------------|
| Generation | BERT | Failure | Appended only a long run of periods to the prompt; no meaningful text was generated | BERT is an encoder-only model trained with masked language modeling; it has no autoregressive decoder, and the load warning notes it was not configured with `is_decoder=True` |
| Generation | RoBERTa | Failure | Echoed the input prompt without producing any continuation | RoBERTa is also an encoder-only architecture and is not trained to generate tokens sequentially |
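| Generation | BART | Partial Success | Generated new tokens after the prompt, but the continuation is incoherent and highly repetitive | BART's encoder-decoder architecture includes an autoregressive decoder, but the pipeline loaded it as `BartForCausalLM` with `lm_head.weight` and `model.decoder.embed_tokens.weight` newly initialized (see the load warning), so the generated tokens are not meaningful |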
| Fill-Mask | BERT | Success | Ranked “create” first (score ≈ 0.54) and “generate” second (≈ 0.16) | BERT is pretrained with masked language modeling (MLM), which is exactly this task |
| Fill-Mask | RoBERTa | Success | Ranked “generate” and “create” nearly tied at the top (≈ 0.37 each) | RoBERTa is pretrained with MLM using improved training strategies |
| Fill-Mask | BART | Partial Success | Ranked “create” first but with low confidence (≈ 0.07); some candidates, such as “help”, do not fit the sentence | BART is trained as a denoising autoencoder, not pure MLM |
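| QA | BERT | Partial Success | Returned the correct span (“hallucinations, bias, and deepfakes.”) but with very low confidence (≈ 0.02) | The base checkpoint is not fine-tuned for question answering; the `qa_outputs` head is newly initialized, so the span selection is not reliable |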
| QA | RoBERTa | Partial Success | Returned a truncated span (“such as hallucinations, bias, and”) with very low confidence (≈ 0.005) | Encoder-only model without QA-specific fine-tuning; the `qa_outputs` head is newly initialized |
| QA | BART | Failure | Returned a span that does not contain the answer (“Generative AI poses significant”), with low confidence (≈ 0.02) | Encoder-decoder model not fine-tuned for extractive QA; the `qa_outputs` head is newly initialized |
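
The architectural reasons in the table can be cross-checked against each checkpoint's configuration. The sketch below is a hedged illustration: `describe_architecture` and `load_and_describe` are hypothetical helpers, and it assumes the config exposes `model_type`, `is_encoder_decoder`, and `is_decoder` (fields read defensively with `getattr` in case a library version omits one). The offline example uses a stand-in object, not a real checkpoint.

```python
from types import SimpleNamespace


def describe_architecture(config):
    """Summarize the config fields relevant to the table above."""
    return {
        "model_type": getattr(config, "model_type", None),
        "is_encoder_decoder": bool(getattr(config, "is_encoder_decoder", False)),
        "is_decoder": bool(getattr(config, "is_decoder", False)),
    }


def load_and_describe(model_name):
    """Fetch a checkpoint's config (needs network access) and summarize it."""
    # Deferred import: only needed when inspecting real checkpoints.
    from transformers import AutoConfig

    return describe_architecture(AutoConfig.from_pretrained(model_name))


# Offline illustration with a stand-in object (values are made up for the demo);
# real values come from load_and_describe("facebook/bart-base") and similar calls.
stand_in = SimpleNamespace(model_type="bart", is_encoder_decoder=True, is_decoder=False)
print(describe_architecture(stand_in))
```

Calling `load_and_describe` on the three checkpoints used here would show which ones are configured as encoder-decoder models, which is the distinction the last column of the table relies on.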
