# Exercises XP: Day 3 - BERT in Practice
Follow the prompts below. Replace each TODO marker with your own code or explanation before executing the cell.


## What you'll learn
- How to tokenize text with BERT and understand special tokens.
- How to run a pretrained sentiment pipeline.
- How to build custom BERT-based sentiment and NER analyzers.
- How to compare encoder (BERT) versus decoder (GPT) families.
- How BERT embeddings power the retrieval stage of a RAG pipeline.


## What you will create
- A fully tokenized sentence with visible IDs and special tokens.
- A working sentiment pipeline powered by a fine-tuned DistilBERT model.
- Custom helper classes for sentiment classification and NER.
- A comparison table that contrasts BERT and GPT.
- A written explanation of how BERT embeddings drive retrieval in RAG.


> Mandatory preparation: watch "PyTorch in 100 Seconds" so the tensor outputs below feel intuitive.

## Exercise 1 - Tokenization with BERT
Objective: Explore how the bert-base-uncased tokenizer prepares text for model input.

Instructions:
1. (Optional) Install the required libraries.
2. Load the tokenizer, craft a sample sentence, and encode it with padding plus truncation.
3. Print the tokens next to their integer IDs and flag the special tokens.
4. Inspect the attention mask to see how padding is hidden from the model.

Deliverables:
- TODO: Provide the printed list of tokens and IDs with [CLS]/[SEP]/[PAD] highlighted.
- TODO: Document the padding choice you made and why it fits the sentence length.
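
Before running the real tokenizer, it can help to see the layout that `padding="max_length"` and `truncation=True` are expected to produce. The snippet below is a minimal sketch, not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the word pieces are typed by hand, and right-side padding is assumed (the default for `bert-base-uncased`).

```python
from typing import Dict, List


def pad_and_mask(word_pieces: List[str], max_length: int) -> Dict[str, List]:
    """Toy re-creation of the token layout after padding and truncation."""
    # Two slots are reserved for [CLS] and [SEP]; anything beyond is truncated.
    body = word_pieces[: max_length - 2]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    real = len(tokens)
    # Pad on the right until the sequence reaches max_length.
    tokens += ["[PAD]"] * (max_length - real)
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * real + [0] * (max_length - real)
    return {"tokens": tokens, "attention_mask": attention_mask}


# Hand-written word pieces, used only for illustration.
layout = pad_and_mask(["hello", "world", "!"], max_length=8)
print(layout["tokens"])
print(layout["attention_mask"])
```

If the number of 1s in your real attention mask equals `max_length`, the sentence was probably truncated and `max_length` may need to grow.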


In [1]:
# Optional setup: install dependencies if they are missing in your environment.
%pip install -q transformers torch


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sample_sentence = "TODO: replace with a short sentence you want to tokenize"
print(sample_sentence)


In [None]:
encoding = tokenizer(
    sample_sentence,
    add_special_tokens=True,
    padding="max_length",
    truncation=True,
    max_length=24,  # TODO: adjust if your sentence needs more room
    return_attention_mask=True,
    return_tensors="pt"
)

input_ids = encoding["input_ids"][0].tolist()
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("index | token        | id")
print("-------------------------")
for idx, (token, token_id) in enumerate(zip(tokens, input_ids)):
    print(f"{idx:>5} | {token:<12} | {token_id:>5}")

print("\nAttention mask:", encoding["attention_mask"][0].tolist())
special_positions = [(i, tok) for i, tok in enumerate(tokens) if tok in tokenizer.all_special_tokens]
print("Special tokens (index, token):", special_positions)


### Exercise 1 reflection
- TODO: Describe how [CLS] and [SEP] behave inside the encoder.
- TODO: Explain how the attention mask hides padded positions from self-attention.
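
As a starting point for the second reflection, here is a minimal sketch of one common way a mask is applied: a large negative bias is added to the attention scores of padded positions before the softmax, so their weights collapse to zero. The function name `masked_softmax`, the scores, and the constant `-1e9` are assumptions for illustration; the exact constant differs between implementations.

```python
import math
from typing import List


def masked_softmax(scores: List[float], attention_mask: List[int]) -> List[float]:
    """Softmax over one row of attention scores, ignoring masked positions."""
    # Push padded positions (mask == 0) towards minus infinity.
    masked = [s if m == 1 else s - 1e9 for s, m in zip(scores, attention_mask)]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# The last position is padding, even though its raw score is the highest.
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], [1, 1, 1, 0])
print([round(w, 3) for w in weights])
```

The padded position receives zero weight, so it contributes nothing to the weighted sum of value vectors.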


## Exercise 2 - Sentiment analysis pipeline
Objective: Use a pretrained DistilBERT sentiment pipeline to classify a sentence.

Instructions:
1. Import the `pipeline` helper from transformers.
2. Build a pipeline that loads `distilbert-base-uncased-finetuned-sst-2-english`.
3. Pass in a sentence and review the predicted label and score.

Deliverables:
- TODO: Record the sentence you tested.
- TODO: Capture the label plus confidence score and interpret the result.


In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

sentence = "TODO: add a sentence whose sentiment you want to test"
prediction = sentiment_pipeline(sentence)
prediction


### Exercise 2 reflection
- TODO: Does the predicted label match your expectation? Why or why not?
- TODO: How confident is the model and what does the score tell you?
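
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below shows one possible way to turn that structure into a readable summary. The helper `summarize`, the 0.75 threshold, and the example values are hypothetical and are not real model output.

```python
from typing import Dict, List


def summarize(prediction: List[Dict[str, object]], threshold: float = 0.75) -> str:
    """Format the first prediction and flag low-confidence results."""
    top = prediction[0]
    label, score = top["label"], float(top["score"])
    # The threshold is an arbitrary choice for this sketch, not a library default.
    note = "confident" if score >= threshold else "uncertain, inspect manually"
    return f"{label} ({score:.1%}, {note})"


# Made-up values with the same shape as the pipeline's return value.
example_prediction = [{"label": "POSITIVE", "score": 0.98}]
print(summarize(example_prediction))
```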


## Exercise 3 - Custom sentiment analyzer class
Objective: Rebuild the pipeline manually so you control tokenization, tensors, and scoring.

Instructions:
1. Import `AutoTokenizer` and `AutoModelForSequenceClassification`.
2. Implement `BERTSentimentAnalyzer` with methods for initialization, preprocessing, and prediction.
3. Test the class with multiple sentences.

Hints:
- Keep a `max_length` attribute so you can reuse it while tokenizing.
- Apply `torch.softmax` to transform logits into probabilities.
- Return both the label and the probability for clarity.
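
The post-processing described in the hints can be sketched without a model. The example below works on a plain list of logits and mirrors what `torch.softmax` followed by an argmax would do. The `ID2LABEL` mapping is an assumption for illustration; in your class, read it from `model.config.id2label` instead of hard-coding it.

```python
import math
from typing import Dict, List, Union

# Assumed label mapping for this sketch; prefer model.config.id2label in practice.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def logits_to_prediction(logits: List[float]) -> Dict[str, Union[str, float]]:
    """Convert raw logits into a label and its probability."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "probability": probs[best]}


result = logits_to_prediction([-1.2, 2.3])  # hand-picked logits
print(result)
```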


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import Dict

class BERTSentimentAnalyzer:
    def __init__(self, model_name: str = "distilbert-base-uncased-finetuned-sst-2-english", max_length: int = 128):
        '''TODO: load the tokenizer/model and move the model to the proper device.'''
        raise NotImplementedError("Initialize tokenizer, model, and device here.")

    def preprocess(self, text: str) -> Dict[str, torch.Tensor]:
        '''TODO: clean the text, tokenize, and return tensors ready for inference.'''
        raise NotImplementedError("Return a dict of tensors produced by the tokenizer.")

    def predict(self, text: str) -> Dict[str, object]:
        '''TODO: run a forward pass, apply softmax, and return a label plus probability.'''
        raise NotImplementedError("Add inference and post-processing logic.")


In [None]:
# TODO: instantiate your analyzer and test several sentences once the class is ready.
# analyzer = BERTSentimentAnalyzer()
# samples = [
#     "TODO: add a clearly positive statement",
#     "TODO: add a clearly negative statement"
# ]
# for text in samples:
#     print(text)
#     print(analyzer.predict(text))


## Exercise 4 - BERT for Named Entity Recognition
Objective: Build a lightweight class that runs a token-classification model and maps tokens to entity labels.

Instructions:
1. Import `AutoTokenizer` and `AutoModelForTokenClassification`.
2. Implement `BERTNamedEntityRecognizer` with init plus a `recognize` method.
3. Tokenize sample text, run the model, convert the predictions to entity spans, and test with a short paragraph.

Deliverables:
- TODO: Return a list of dictionaries with the keys `text`, `entity`, `start`, and `end`, one per detected entity.
- TODO: Explain how you handled subword tokens that begin with `##`.
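
One possible merging strategy is sketched below. It assumes you already have, for each word piece, its BIO label and its character offsets in the original text (with a fast tokenizer, offsets can be requested via `return_offsets_mapping=True`). The helper `merge_word_pieces` and the hand-written pieces are hypothetical and are not real tokenizer or model output.

```python
from typing import Dict, List, Tuple

Piece = Tuple[str, str, int, int]  # (token, BIO label, char start, char end)


def merge_word_pieces(text: str, pieces: List[Piece]) -> List[Dict[str, object]]:
    """Merge word pieces and BIO labels into entity spans."""
    spans = []      # finished [entity_type, start, end] triples
    current = None  # the entity currently being built, if any
    for token, label, start, end in pieces:
        if token.startswith("##"):
            # A continuation piece inherits the entity of the word it belongs to.
            if current is not None:
                current[2] = end
            continue
        if label.startswith("I-") and current is not None and current[0] == label[2:]:
            # A new word that continues the same entity.
            current[2] = end
            continue
        # Anything else closes the open entity.
        if current is not None:
            spans.append(current)
            current = None
        if label != "O":
            current = [label[2:], start, end]
    if current is not None:
        spans.append(current)
    # Slice the original text so casing and spacing are preserved.
    return [{"text": text[s:e], "entity": kind, "start": s, "end": e} for kind, s, e in spans]


sample = "Angela Merkel visited Strasbourg."
pieces = [  # illustrative pieces typed by hand
    ("Angela", "B-PER", 0, 6), ("Merkel", "I-PER", 7, 13), ("visited", "O", 14, 21),
    ("Stras", "B-LOC", 22, 27), ("##bourg", "I-LOC", 27, 32), (".", "O", 32, 33),
]
entities = merge_word_pieces(sample, pieces)
print(entities)
```

Deciding the entity type from the first piece of a word is a design choice; another option is to vote over all pieces of the word.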


In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

class BERTNamedEntityRecognizer:
    def __init__(self, model_name: str = "dslim/bert-base-NER"):
        '''TODO: load the tokenizer and model, and detect the available device.'''
        raise NotImplementedError("Initialize tokenizer, model, and device.")

    def recognize(self, text: str):
        '''TODO: tokenize the text, run the model, map predictions to BIO labels, and merge word pieces.'''
        raise NotImplementedError("Return structured entities.")


In [None]:
# TODO: instantiate the recognizer and test it on text that includes people, places, or organizations.
# ner = BERTNamedEntityRecognizer()
# sample_text = "TODO: add a short paragraph with at least two entities."
# ner.recognize(sample_text)


## Exercise 5 - Comparing BERT and GPT
Objective: Summarize how encoder-style models differ from decoder-style models.

Fill the table with concise statements (one line each).

| Category | BERT | GPT |
|----------|------|-----|
| Architecture | TODO | TODO |
| Primary purpose | TODO | TODO |
| Typical use cases | TODO | TODO |
| Strengths | TODO | TODO |
| Weaknesses | TODO | TODO |
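
For the Architecture row, it may help to picture which positions each family is allowed to attend to. The sketch below builds the two visibility patterns as small matrices. The function names are hypothetical, and real models combine these patterns with the padding mask.

```python
from typing import List


def bidirectional_mask(n: int) -> List[List[int]]:
    """Encoder-style: every position can see every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[int]]:
    """Decoder-style: position i can only see positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


print("Encoder (BERT-style):")
for row in bidirectional_mask(4):
    print(row)

print("Decoder (GPT-style):")
for row in causal_mask(4):
    print(row)
```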


## Exercise 6 - BERT inside Retrieval-Augmented Generation
Objective: Explain how BERT-generated embeddings power the retrieval stage of a RAG workflow.

Address each bullet with a short paragraph:
1. TODO: Describe how BERT encodes queries and documents.
2. TODO: Explain how those embeddings are stored and searched in a vector database.
3. TODO: Outline how the retrieved passages are handed to a generative model like GPT.
4. TODO: Provide a concrete application example (industry or product) where RAG with BERT makes sense.
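
The retrieval and hand-off steps above can be sketched with toy vectors. Everything in this example is an assumption for illustration: the three-dimensional vectors stand in for pooled BERT embeddings (768 dimensions for `bert-base` models), the dictionary stands in for a vector database, and `cosine`, `top_k`, and `build_prompt` are hypothetical helpers.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force nearest-neighbour search over the whole index."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Concatenate retrieved passages into a prompt for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy embeddings and passages, made up for this sketch.
toy_index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.2],
    "careers": [0.0, 0.2, 0.9],
}
toy_passages = {
    "refund_policy": "Refunds are issued within 14 days.",
    "shipping_times": "Orders ship in 2 to 4 business days.",
    "careers": "We are hiring engineers.",
}

hits = top_k([0.8, 0.2, 0.1], toy_index, k=2)
prompt = build_prompt("How do I get my money back?", [toy_passages[doc_id] for doc_id, _ in hits])
print(hits)
print(prompt)
```

A real vector database replaces the brute-force loop with an approximate nearest-neighbour index, but the inputs and outputs keep the same shape.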
