## Exercise 1

Suppose we are trying to search some information within the materials from this course. The course is reasonably well organized, and everything is within jupyter notebooks. Jupyter notebook files (.ipynb) are actually in JSON format.

Choose any .ipynb file and open it using the json library in Python. Investigate:

How can we find a cell within the notebook?
How can we known if the cell contains Python code or Markdown annotations?

In [1]:
import json

def load_notebook_cells(path: str) -> list:
    """
    Carrega um arquivo .ipynb e retorna a lista de células.
    """
    with open(path, 'r', encoding='utf-8') as f:
        nb = json.load(f)
    return nb.get('cells', [])

load_notebook_cells('../deep_learning/01-from_sklearn_to_pytorch.ipynb')


[{'cell_type': 'markdown',
  'metadata': {},
  'source': ['# Logistic Regression with Pytorch\n',
   '\n',
   "Allright, everyone. Let's go ahead and start doing things. We will start with a brief review, so that we are all on the same page.\n",
   '\n',
   '## Review: Logistic Regression with Scikit-Learn\n',
   '\n',
   'If you have reached this point, you probably know by heart how to define a Logistic Regression pipeline in Scikit Learn:\n']},
 {'cell_type': 'code',
  'execution_count': 17,
  'metadata': {},
  'outputs': [],
  'source': ['from sklearn.feature_extraction.text import CountVectorizer\n',
   'from sklearn.linear_model import LogisticRegression\n',
   'from sklearn.pipeline import Pipeline\n',
   '\n',
   'vectorizer = CountVectorizer(binary=True)\n',
   'classifier = LogisticRegression()\n',
   'pipeline = Pipeline([("vectorizer", vectorizer),\n',
   '                        ("classifier", classifier)])']},
 {'cell_type': 'markdown',
  'metadata': {},
  'source': ['Rem

## Exercise 2

The simplest way to search for text is using keywords. Improve your code so that it:

1. Collects a keyword from the user, and
1. Indicates all notebooks/cells that contain that keyword.

Reflect: what is the best way to present these results to the user?

In [2]:
import os

def keyword_search_in_notebooks(paths: list, keyword: str) -> list:
    """
    Para cada caminho em paths, busca keyword (case-insensitive) em cada célula.
    Retorna lista de tuplas: (arquivo, índice_da_célula, tipo, snippet).
    """
    results = []
    for path in paths:
        cells = load_notebook_cells(path)
        for idx, cell in enumerate(cells):
            source = ''.join(cell.get('source', []))
            if keyword.lower() in source.lower():
                results.append((path, idx, cell['cell_type'], source[:200].replace('\n', ' ')))
    return results

# todos os notebooks da pasta '../deep_learning'
notebooks = [os.path.join('../deep_learning', f) for f in os.listdir('../deep_learning') if f.endswith('.ipynb')]
keyword_search_in_notebooks(notebooks, 'import')

[('../deep_learning\\01-from_sklearn_to_pytorch.ipynb',
  1,
  'code',
  'from sklearn.feature_extraction.text import CountVectorizer from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline  vectorizer = CountVectorizer(binary=True) classif'),
 ('../deep_learning\\01-from_sklearn_to_pytorch.ipynb',
  5,
  'code',
  "import pandas as pd from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score  df = pd.read_csv('https://raw.githubusercontent.com/tiagoft/NLP/refs/heads/main/wik"),
 ('../deep_learning\\01-from_sklearn_to_pytorch.ipynb',
  8,
  'code',
  'import torch.nn as nn linear_layer = nn.Linear(in_features=3, out_features=1) print(linear_layer)'),
 ('../deep_learning\\01-from_sklearn_to_pytorch.ipynb',
  10,
  'code',
  'import torch X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # We can call the linear_layer, and it will perform its operation: output = linear_layer(X) print(output)  # We can also access

## Exercise 3

There is an inherent fragility in the previous system: it requires the user to guess the keyword correctly. Trivia fact: in the early 2000s, the hability to guess keywords in Google had the same hype that using AI chatbots has nowadays.

We could prevent our user from trying to guess the exact keyword or phrase, afterall, we have an estimator for phrase similarity: BERT!

Improve your code so that it:

1. Collects a phrase from the user,
1. Calculates the phrase embedding $q$ using the CLS token from BERT
1. Traverses the course material calculating the embedding $x_i$ for each cell
1. Finds the $k$ (try with $k=1$, then generalize to any $k$) cells with minimal cosine distance ($d = \frac{ <q, x_1>}{||x|| ||c_i||}$) with relationship to the phrase.

Reflect: was this a better choice for retrieval? How can we measure this difference? (tip: research how information retrieval systems are evaluated!)

In [12]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Carrega BERT (PyTorch)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def embed_text(text: str) -> np.ndarray:
    """
    Retorna embedding do texto usando o token CLS do BERT.
    """
    # Tokenização e conversão para tensores PyTorch
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # CLS embedding é a primeira posição
    cls_emb = outputs.last_hidden_state[0, 0, :]
    return cls_emb.cpu().numpy().squeeze()


def semantic_search(cells: list, query: str, k) -> list:
    """
    Retorna as k células cujos embeddings são mais similares ao query.
    """
    q_emb = embed_text(query).reshape(1, -1)
    similarities = []

    for cell in cells:
        if cell['cell_type'] == 'markdown':
            text = ''.join(cell.get('source', []))
            emb = embed_text(text).reshape(1, -1)
            score = cosine_similarity(q_emb, emb)[0, 0]
            similarities.append((cell, float(score)))

    # Ordena decrescente e retorna top k
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:k]

all_notebooks = [os.path.join('../deep_learning', f) for f in os.listdir('../deep_learning') if f.endswith('.ipynb')]
# Carrega todas as células de todos os notebooks
cells = []
for path in all_notebooks:
    cells.extend(load_notebook_cells(path))
# Exemplo de busca semântica
query = "BERT"
top_results = semantic_search(cells, query, k=10)
for cell, score in top_results:
    print(f"Score: {score:.4f}\nTipo: {cell['cell_type']}\nConteúdo: {''.join(cell['source'])}\n")

Score: 0.8999
Tipo: markdown
Conteúdo: ## Making a pipeline with PyTorch

Score: 0.8927
Tipo: markdown
Conteúdo: ## 

Score: 0.8798
Tipo: markdown
Conteúdo: ### Evaluating the model

We can convert our 

Score: 0.8332
Tipo: markdown
Conteúdo: ## Visualizing embeddings

Score: 0.8321
Tipo: markdown
Conteúdo: # Modelling sequences

Score: 0.8212
Tipo: markdown
Conteúdo: 

```mermaid
graph LR;
    subgraph Inputs;
    INPUT["[CLS]
        remove
        some
        parts
        [MASK]
        a
        sentence"];
    end;
    INPUT --> BERT["BERT"];
    subgraph Outputs;
    OUTPUT["C
    T1
    T2
    T3
    T4
    T5
    T6"];
    end;
    BERT --> OUTPUT;
    Train["Loss: T4 should be the word 'of'"]
    OUTPUT --- Train;
```


This task suggests that the embedding space created by BERT should allow representing words in the context of the rest of the sentence!

To play with this task with Hugging Face's library, you can use:



Score: 0.7951
Tipo: markdown
Conteúdo: ## Exercise

Ou

## Exercise 4

Now let's leave our retrival system waiting for a while.

Make a small program that:

1. Collects a question from the user
1. Uses an API to redirect this question to an LLM, and immediately returns the answer.
1. Add prompt information so that your answers can only regard NLP-related subjects (these are called "safeguards")

In [4]:
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
genai.configure(api_key=GEMINI_API_KEY)
models = list(genai.list_models())
for m in models:
    # verificar quais modelos sao generativeContent (verificar lista supported_generation_methods por 'generateContent')
    if 'generateContent' in m.supported_generation_methods:
        print(f"Modelo: {m.name}")


# Para ver os modelos disponíveis:
# print(genai.list_models())

generation_config = genai.GenerationConfig(
    max_output_tokens=500,
    temperature=1.0,
)
gemini_model = genai.GenerativeModel(model_name="gemini-1.5-flash")

def llm_query(question: str, domain: str = 'NLP') -> str:
    """
    Envia question ao Gemini, com system prompt restringindo ao domínio.
    """
    system_prompt = (
        f"Você é um assistente especializado em {domain}. "
        "Responda somente usando informações desse domínio."
    )
    full_prompt = system_prompt + "\n\n" + question
    response = gemini_model.generate_content(
        full_prompt,
        generation_config=generation_config
    )
    return response.text

Modelo: models/gemini-1.0-pro-vision-latest
Modelo: models/gemini-pro-vision
Modelo: models/gemini-1.5-pro-latest
Modelo: models/gemini-1.5-pro-001
Modelo: models/gemini-1.5-pro-002
Modelo: models/gemini-1.5-pro
Modelo: models/gemini-1.5-flash-latest
Modelo: models/gemini-1.5-flash-001
Modelo: models/gemini-1.5-flash-001-tuning
Modelo: models/gemini-1.5-flash
Modelo: models/gemini-1.5-flash-002
Modelo: models/gemini-1.5-flash-8b
Modelo: models/gemini-1.5-flash-8b-001
Modelo: models/gemini-1.5-flash-8b-latest
Modelo: models/gemini-1.5-flash-8b-exp-0827
Modelo: models/gemini-1.5-flash-8b-exp-0924
Modelo: models/gemini-2.5-pro-exp-03-25
Modelo: models/gemini-2.5-pro-preview-03-25
Modelo: models/gemini-2.5-flash-preview-04-17
Modelo: models/gemini-2.5-flash-preview-04-17-thinking
Modelo: models/gemini-2.5-pro-preview-05-06
Modelo: models/gemini-2.0-flash-exp
Modelo: models/gemini-2.0-flash
Modelo: models/gemini-2.0-flash-001
Modelo: models/gemini-2.0-flash-exp-image-generation
Modelo: mode

## Exercise 5

Now, let's joint everything.

We are able to find specific information from our courseware. Also, we are able to use LLMs. Use both abilities to:

1. Collect a question from the user
1. Retrieve the $K$ most relevant cells from the course material
1. Use the content of these cells as part of a prompt. The prompt includes both the question and the content from the relevant cells.
1. Phrase your prompt so that the LLM can only return information that is contained in the course material.

Reflect: how does this compare to the system in Exercise 4? How can we measure the differences?

In [5]:
def rag_answer(question: str, notebook_paths: list, k) -> str:
    # Carrega todas as células de todos os notebooks
    all_cells = []
    for path in notebook_paths:
        all_cells.extend(load_notebook_cells(path))

    markdown_cells = [c for c in all_cells if c['cell_type'] == 'markdown']
    top = semantic_search(markdown_cells, question, k=k)

    retrieved_text = '\n'.join(''.join(c['source']) for c, _ in top)
    print("=== CONTEXTO ===")
    print(retrieved_text)
    print("=== FIM DO CONTEXTO ===")
    prompt = (
        "Use apenas as informações abaixo para responder à pergunta." +
        "\n\nInformações recuperadas:\n" + retrieved_text +
        "\n\nPergunta: " + question
    )
    return llm_query(prompt, domain='NLP')

# Exemplo de RAG

notebook_paths = [os.path.join('../deep_learning', f) for f in os.listdir('../deep_learning') if f.endswith('.ipynb')]
all_cells = []
for path in notebook_paths:
    all_cells.extend(load_notebook_cells(path))
N = len(all_cells)
print(N)
question = "O que é um tokenizador?"
answer = rag_answer(question, notebook_paths, N)
print(answer)

198
=== CONTEXTO ===
### Question 1: What is the underlying premise of the Bag-of-Words classifier, that is, why does BoW allow to classify these texts?
### Question 2: What is the underlying premise of a BERT-based classifier, that is, why should BERT embeddings be interesting to classify these texts?

## Making a pipeline with PyTorch
### Evaluating the model

We can convert our 
## Visualizing embeddings
### Example code to test the trained tokenizer:
## Exercise

Our usual way to approach classification is to do something in the lines of:
In fact, we could simply decide our class by thresholding our output:
# Further reading

We usually think that the last layer of the language model should be used in downstream tasks. However, [this paper](https://arxiv.org/abs/2502.02013) indicates that this is not always true! Check the paper - and see how it applies to the language model you have previously trained!

The embedding for the [CLS] token can be accessed using:

## Exercise

Make a 

## Exercise 6

If you have reached this far, let's start optimizing our systems.

To do so:

1. Identify which step of your processing pipeline takes the longer
1. Study if there are techniques or data structures that can make this specific step faster
1. If possible, implement the optimization and test the results.
1. Iterate until you cannot optimize anymore.

In [6]:
import time
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# --- 1) Profiling da implementação original ---
start = time.time()
_ = semantic_search(cells, "minha pergunta aqui", k=10)
print(f"Tempo original: {time.time() - start:.2f}s")


# --- 2) Pré-compute todos os embeddings de célula UMA ÚNICA VEZ ---
# (supondo que `cells` já está carregado)
cell_texts = [''.join(c['source']) for c in cells if c['cell_type'] == 'markdown']

# Cria um array (n_células × dimensão do embedding)
cell_embeddings = np.stack([embed_text(txt) for txt in cell_texts])


# --- 3) Função de busca semântica VETORIZADA ---
def semantic_search_fast(cell_texts: list[str],
                         cell_embeddings: np.ndarray,
                         query: str,
                         k: int = 5) -> list[tuple[str, float]]:
    """
    Retorna os top-k pares (texto_da_célula, score) mais similares.
    Usa apenas NumPy + sklearn para calcular cosseno de forma vetorizada.
    """
    # embedding da query
    q_emb = embed_text(query).reshape(1, -1)

    # cálculo de similaridade em batch
    scores = cosine_similarity(cell_embeddings, q_emb).flatten()

    # seleciona índices dos k maiores valores
    top_idxs = np.argsort(scores)[-k:][::-1]

    return [(cell_texts[i], float(scores[i])) for i in top_idxs]


# --- 4) Profiling da versão otimizada ---
start = time.time()
topk = semantic_search_fast(cell_texts, cell_embeddings, "minha pergunta aqui", k=10)
print(f"Tempo otimizado: {time.time() - start:.2f}s")

# Exibe os resultados
for txt, sc in topk:
    print(f"Score: {sc:.4f}\nTexto: {txt[:200]}")


Tempo original: 55.83s
Tempo otimizado: 0.10s
Score: 0.8230
Texto: ## Making a pipeline with PyTorch
Score: 0.8093
Texto: ## 
Score: 0.8045
Texto: ## Visualizing embeddings
Score: 0.7939
Texto: ### Evaluating the model

We can convert our 
Score: 0.7762
Texto: # Modelling sequences
Score: 0.7695
Texto: ## Exercise

Our usual way to approach classification is to do something in the lines of:
Score: 0.7602
Texto: ### Question 1: What is the underlying premise of the Bag-of-Words classifier, that is, why does BoW allow to classify these texts?
Score: 0.7581
Texto: 

```mermaid
graph LR;
    subgraph Inputs;
    INPUT["[CLS]
        remove
        some
        parts
        [MASK]
        a
        sentence"];
    end;
    INPUT --> BERT["BERT"];
    subgraph Ou
Score: 0.7574
Texto:     
There are many details in this implementation, so I made a [video exploring them all](https://youtu.be/FXtGq_TYLzM).
Score: 0.7505
Texto: ### Example code to test the trained tokenizer:
