Name: Marcela Cabrera

# Exercise 9: Using the Google Gemini API

In this exercise we will learn how to use the Google Gemini API.

## 1. Basic usage

The following code sets up a basic connection to the Google Gemini API.

In [47]:
import google.generativeai as genai

# Read the API key from a text file
with open('api-key.txt', 'r') as file:
    api_key = file.read().strip()

# Configure the Google Gemini API
genai.configure(api_key=api_key)

# Create the model (a recent, stable release)
model = genai.GenerativeModel('gemini-2.5-flash')

## 2. Retrieval

### 2.1 Load the 20 Newsgroups corpus

In [9]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroup corpus without headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = newsgroups.data
df = pd.DataFrame(docs, columns=['doc'])
df.head(10)


Unnamed: 0,doc
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
5,\n\nBack in high school I worked as a lab assi...
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...
7,"\n[stuff deleted]\n\nOk, here's the solution t..."
8,"\n\n\nYeah, it's the second one. And I believ..."
9,\nIf a Christian means someone who believes in...


### 2.2 Convert to embeddings

1. Normalize the corpus




In [11]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["doc"]).reset_index(drop=True)

# Basic cleanup: collapse every run of whitespace into a single space
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["doc_norm"] = df["doc"].astype(str).map(normalize_text)

df.head()

Unnamed: 0,doc,doc_norm
0,\n\nI am sure some bashers of Pens fans are pr...,I am sure some bashers of Pens fans are pretty...
1,My brother is in the market for a high-perform...,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...,Finally you said what you dream about. Mediter...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think! It's the SCSI card doing the DMA transf...
4,1) I have an old Jasmine drive which I cann...,1) I have an old Jasmine drive which I cannot ...
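
To make the effect of the normalization concrete, here is a small self-contained check. It repeats `normalize_text` from the cell above and applies it to a made-up sample string (loosely modeled on row 3, not taken from the corpus).

```python
import re


def normalize_text(s: str) -> str:
    # Same logic as the cell above: collapse whitespace runs, then trim the ends
    return re.sub(r"\s+", " ", s).strip()


# Made-up sample mixing newlines, a tab, and repeated spaces
sample = "\n\nThink!\n\nIt's the SCSI card\tdoing the DMA   transfers "
print(normalize_text(sample))
# prints: Think! It's the SCSI card doing the DMA transfers

# A whitespace-only document normalizes to the empty string
print(repr(normalize_text(" \n\t ")))
# prints: ''
```

Documents that normalize to an empty string produce no chunks in the next step, because the chunking loop never runs for a zero-length input.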



2. Define a `chunk_doc` function and split the texts into chunks.



In [13]:
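def chunk_doc(doc: str, max_chars: int = 800, overlap: int = 100):
    """Split doc into chunks of at most max_chars characters; consecutive
    chunks share up to `overlap` characters."""
    # Without this guard, overlap >= max_chars would make the loop below never advance
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")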
    chunks = []
    start = 0
    n = len(doc)
    while start < n:
        end = min(start + max_chars, n)
        chunk = doc[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks

records = []
for i, row in df.iterrows():
    chunks = chunk_doc(row["doc_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  I am sure some bashers of Pens fans are pretty...
 1       1         0  My brother is in the market for a high-perform...
 2       2         0  Finally you said what you dream about. Mediter...
 3       2         1  urds and Turks once upon a time! Ohhhh so swed...
 4       3         0  Think! It's the SCSI card doing the DMA transf...,
 38871)
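
The chunking above advances by `max_chars - overlap` characters per step (700 with the defaults), so the number of chunks can be predicted from the document length. The sketch below repeats `chunk_doc` so it runs on its own and compares it against a closed-form count. The helper `expected_chunks` is a hypothetical name introduced here, and the formula assumes no chunk is dropped for being empty after stripping.

```python
import math


def chunk_doc(doc: str, max_chars: int = 800, overlap: int = 100):
    # Same sliding-window logic as the cell above
    chunks = []
    start = 0
    n = len(doc)
    while start < n:
        end = min(start + max_chars, n)
        chunk = doc[start:end].strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks


def expected_chunks(n: int, max_chars: int = 800, overlap: int = 100) -> int:
    # Hypothetical helper: one chunk for the first window, then one per stride
    if n == 0:
        return 0
    if n <= max_chars:
        return 1
    stride = max_chars - overlap
    return 1 + math.ceil((n - max_chars) / stride)


# Synthetic 1501-character document with no whitespace
doc = "".join(chr(97 + i % 26) for i in range(1501))
chunks = chunk_doc(doc)
print(len(chunks), [len(c) for c in chunks])
# prints: 3 [800, 800, 101]

# The last 100 characters of one chunk open the next one
print(chunks[0][-100:] == chunks[1][:100])
# prints: True
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than `overlap` characters.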

In [14]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # E5 model trained for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index (passages); E5 expects the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]


In [15]:
# Embeddings (N x D)
# normalize_embeddings=True yields unit vectors, as required for cosine similarity
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")


### 2.3 Create a query and run the search

In [16]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "¿De qué se tratan los documentos recuperados?"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

In [17]:
print(embeddings.shape, embeddings.dtype)

(38871, 768) float32


In [18]:
!pip install faiss-cpu



In [27]:
import faiss
import numpy as np

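# Exact (brute-force) index over squared L2 distance; with unit vectors
# the ranking is the same as ranking by cosine similarity
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# D holds the distances, I the row indices of the nearest chunks
D, I = index.search(query_vec, k=10)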

Retrieve the 5 chunks most similar to the query.

In [38]:
# Retrieve the chunks most relevant to the query
top_k = 5
D, I = index.search(query_vec, k=top_k)

top_idxs = I.flatten().tolist()
top_passages = chunks_df.iloc[top_idxs].copy()
top_passages["distance"] = D.flatten()
top_passages[["doc_id", "chunk_id", "distance", "text"]].head(10)

Unnamed: 0,doc_id,chunk_id,distance,text
11494,5441,23,0.424097,"eial Convention, Inc., for distribution by the..."
1239,682,3,0.429677,-------- Introduction & Administrivia --------...
24011,11436,6,0.43066,he rest of us. The protocol may be changed sli...
15766,7435,9,0.430687,d finish up the subject for a while. Perhaps y...
9515,4624,1,0.437481,----------------------- | | 1 2 | | Select 1 i...
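
The `distance` column is easier to interpret once it is related to cosine similarity. `IndexFlatL2` reports squared Euclidean distances, and for unit vectors `||a - b||² = 2 - 2·cos(a, b)`. The sketch below checks that identity in plain Python on two made-up vectors and converts the best distance from the table above. The helper names (`unit`, `squared_l2`, `dot`, `l2sq_to_cosine`) are illustrative, not FAISS functions.

```python
import math


def unit(v):
    # Scale a vector to length 1
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    # For unit vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))


def l2sq_to_cosine(d: float) -> float:
    # Invert d = 2 - 2*cos, valid only for unit-length vectors
    return 1.0 - d / 2.0


# Two arbitrary made-up vectors, normalized to unit length
a = unit([1.0, 2.0, 3.0])
b = unit([2.0, 0.5, 1.0])
assert math.isclose(squared_l2(a, b), 2 - 2 * dot(a, b))

# Smallest distance reported in the table above
best_cosine = l2sq_to_cosine(0.424097)
print(round(best_cosine, 3))
# prints: 0.788
```

If similarity scores are preferred over distances, `faiss.IndexFlatIP` (inner product) on the same normalized embeddings should return cosine similarities directly, with larger values meaning closer matches.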


In [39]:
passages_text = []
for i, row in enumerate(top_passages.itertuples(index=False), 1):
    passages_text.append(f"Fragmento {i}:\n{row.text}")

context_block = "\n\n".join(passages_text)
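
With five chunks of at most 800 characters the context stays small, but a larger `top_k` can make the prompt grow quickly. A minimal sketch of a builder that stops adding passages once a character budget is reached is shown below. The function `build_context` and the 4000-character default are assumptions for illustration only.

```python
def build_context(passages, max_chars: int = 4000, label: str = "Fragmento") -> str:
    """Join numbered passages with blank lines, without exceeding max_chars.

    Hypothetical helper: passages are kept in ranked order and the first one
    that would overflow the budget ends the loop.
    """
    blocks = []
    used = 0
    for i, text in enumerate(passages, 1):
        block = f"{label} {i}:\n{text}"
        # Account for the "\n\n" separator between consecutive blocks
        sep = 2 if blocks else 0
        if used + sep + len(block) > max_chars:
            break
        blocks.append(block)
        used += sep + len(block)
    return "\n\n".join(blocks)


print(build_context(["first passage", "second passage"]))
```

In the notebook this could replace the loop above with `context_block = build_context(top_passages["text"].tolist())`.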

In [48]:

prompt = f"""
Eres un asistente que analiza textos.

PREGUNTA DEL USUARIO:
{query_text}

PASAJES (top {top_k}):
{context_block}

"""

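# Call Gemini. The name `model` was rebound to the SentenceTransformer in
# section 2.2, so create the Gemini model under its own name here.
gemini_model = genai.GenerativeModel('gemini-2.5-flash')
response = gemini_model.generate_content(prompt)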

print(response.text)


A continuación, el análisis de los fragmentos recuperados:

---

**1) Análisis de cada fragmento:**

*   **Fragmento 1:**
    Describe la disponibilidad de documentos médicos, incluyendo folletos en español sobre quimioterapia y radioterapia para pacientes con cáncer, distribuidos por el National Cancer Institute. También hace referencia a resúmenes de noticias sobre el SIDA de una publicación médica.

*   **Fragmento 2:**
    Se trata de la introducción a un documento de Preguntas Frecuentes (FAQ) publicado mensualmente en grupos de noticias de Usenet. Advierte a los lectores sobre la posible obsolescencia de la información y les indica cómo acceder a la versión más reciente a través de sitios de archivo.

*   **Fragmento 3:**
    El fragmento describe un protocolo de seguridad que involucra "seeds" y agentes de custodia ("escrow agents") transportando discos a agencias. El autor expresa escepticismo sobre su eficacia, insinuando que una entidad como la NSA podría tener y