# Exercise 6: Dense Retrieval and Introduction to FAISS

Name: Marcela Cabrera

In [10]:
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroup corpus without headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Limit the corpus to the first 2000 documents
newsgroupsdocs = newsgroups.data[:2000]
labels = newsgroups.target[:2000]
target_names = newsgroups.target_names
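
Stripping headers, footers, and quotes can leave some posts empty or whitespace-only, and such documents give an embedding model nothing to work with. A minimal sketch of one way to drop them while keeping the labels aligned follows. The helper name `drop_empty_docs` and the toy data are hypothetical and not part of the exercise.

```python
def drop_empty_docs(docs, labels):
    """Return (docs, labels) without documents that are empty or whitespace-only."""
    kept = [(doc, label) for doc, label in zip(docs, labels) if doc.strip()]
    # Unzip back into two parallel lists; handle the all-empty case explicitly
    if not kept:
        return [], []
    kept_docs, kept_labels = zip(*kept)
    return list(kept_docs), list(kept_labels)


# Toy stand-ins for newsgroupsdocs and labels (hypothetical data)
toy_docs = ["Pens beat the Devils.", "   \n", "", "Shuttle launch delayed."]
toy_labels = [10, 3, 7, 14]

clean_docs, clean_labels = drop_empty_docs(toy_docs, toy_labels)
print(len(clean_docs), clean_labels)
```

If a filter like this were applied before encoding, document IDs reported by later searches would refer to positions in the filtered list rather than in the original slice.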

In [17]:
print(f"Length: {len(newsgroupsdocs[0])} characters")
print("Content (first 300 characters):")
print(newsgroupsdocs[0][:300])
if len(newsgroupsdocs[0]) > 300:
    print("...")

Length: 712 characters
Content (first 300 characters):


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they

...


Part 2: Embedding Generation


In [6]:
from sentence_transformers import SentenceTransformer

# Option 1: use SBERT (simpler)
print("\nLoading SBERT model (all-MiniLM-L6-v2)...")
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')

print("Generating embeddings with SBERT...")
embeddings_sbert = model_sbert.encode(newsgroupsdocs,
                                       convert_to_numpy=True,
                                       show_progress_bar=True,
                                       batch_size=32)

print(f"✓ Embeddings generated: {embeddings_sbert.shape}")
print(f"  - Dimensions: {embeddings_sbert.shape[1]}D")
print(f"  - Size in memory: {embeddings_sbert.nbytes / 1024 / 1024:.2f} MB")


Loading SBERT model (all-MiniLM-L6-v2)...
Generating embeddings with SBERT...
Batches: 100%|██████████| 63/63 [00:51<00:00,  1.21it/s]

✓ Embeddings generated: (2000, 384)
  - Dimensions: 384D
  - Size in memory: 2.93 MB
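
The reported size can be checked by hand, assuming the embeddings are stored as 32-bit floats (4 bytes per value). The helper `embedding_size_mb` below is a hypothetical name used only for this sketch.

```python
def embedding_size_mb(n_docs, dim, bytes_per_value=4):
    """Size of a dense n_docs x dim matrix, in units of 1024 * 1024 bytes.

    bytes_per_value=4 assumes float32 storage.
    """
    return n_docs * dim * bytes_per_value / 1024 / 1024


# 2000 documents x 384 dimensions x 4 bytes = 3,072,000 bytes
size = embedding_size_mb(2000, 384)
print(f"{size:.2f} MB")  # prints "2.93 MB"
```

The same arithmetic scales linearly in both the number of documents and the embedding dimension, which is a quick way to estimate whether a larger corpus still fits in memory.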





In [7]:
print("ALTERNATIVE: Embedding generation with E5")

# Load the E5 model
from sentence_transformers import SentenceTransformer
model_e5 = SentenceTransformer('intfloat/e5-base')

# IMPORTANT: E5 expects the "passage: " prefix on documents
newsgroupsdocs_e5 = ["passage: " + doc for doc in newsgroupsdocs]

# Generate embeddings
embeddings_e5 = model_e5.encode(
    newsgroupsdocs_e5,
    convert_to_numpy=True,
    show_progress_bar=True,
    batch_size=32
)

# Use E5 as the main model
embeddings = embeddings_e5
model = model_e5
print(f"E5 embeddings generated: {embeddings_e5.shape}")


ALTERNATIVE: Embedding generation with E5
Batches: 100%|██████████| 63/63 [08:04<00:00,  7.69s/it]

E5 embeddings generated: (2000, 768)
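
E5 models use a different prefix for each side of the search: "passage: " for the documents being indexed and "query: " for the search text. A small sketch that keeps both conventions in one place follows. `as_passage` and `as_query` are hypothetical helper names, and no model is loaded here.

```python
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(doc):
    """Prefix a document the way E5 expects at indexing time."""
    return PASSAGE_PREFIX + doc


def as_query(text):
    """Prefix a search query the way E5 expects at query time."""
    return QUERY_PREFIX + text


# Toy inputs (hypothetical), only to show the resulting strings
docs_for_e5 = [as_passage(d) for d in ["The shuttle launched today.", "Change the oil."]]
query_for_e5 = as_query("space exploration")
print(docs_for_e5[0])
print(query_for_e5)
```

Keeping the prefixes in named helpers makes it harder to encode documents with one convention and queries with another.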





Part 3: Querying

In [13]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def buscar_documentos(query, top_k=5, usar_e5=False, mostrar_completo=False):
    """Print and return the top_k documents most similar to `query`.

    Uses the global `model` and `embeddings`. Set usar_e5=True when `model`
    is E5 so that the query receives the "query: " prefix.
    """
    # Prepare the query for the active model
    if usar_e5:
        query_procesada = "query: " + query
    else:
        query_procesada = query

    # Encode the query
    query_embedding = model.encode([query_procesada], convert_to_numpy=True)

    # Compute cosine similarity against every document embedding
    similarities = cosine_similarity(query_embedding, embeddings)[0]

    # Indices of the top_k most similar documents, best first
    top_indices = np.argsort(similarities)[::-1][:top_k]
    top_scores = similarities[top_indices]

    # Show results
    print(f"QUERY: '{query}'")
    print(f"{'='*80}")
    print(f"Top {top_k} most similar documents:\n")

    for rank, (idx, score) in enumerate(zip(top_indices, top_scores), 1):
        print(f"\n{'─'*80}")
        print(f"RESULT #{rank} | Similarity: {score:.4f} ({score*100:.2f}%)")
        print(f"Category: {target_names[labels[idx]]}")
        print(f"Document ID: {idx}")
        print(f"Length: {len(newsgroupsdocs[idx])} characters")
        print("\nCONTENT:")
        print("─" * 80)

        if mostrar_completo:
            print(newsgroupsdocs[idx])
        else:
            # Show the first 500 characters
            texto = newsgroupsdocs[idx][:500]
            print(texto)
            if len(newsgroupsdocs[idx]) > 500:
                print(f"\n... [Text truncated - {len(newsgroupsdocs[idx]) - 500} characters remaining]")

        print("─" * 80)

    return top_indices, top_scores

print("Search function defined: buscar_documentos()")

Search function defined: buscar_documentos()
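
The ranking step inside `buscar_documentos` amounts to scoring every document by cosine similarity (the dot product of the two vectors divided by the product of their norms) and keeping the `top_k` best scores. A dependency-free sketch on hand-made 2D vectors follows. The vectors are hypothetical toy data rather than real embeddings, and the code assumes none of them is the zero vector.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_indices(query_vec, doc_vecs, k):
    """Return (indices, scores) of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], [scores[i] for i in ranked[:k]]


# Toy 2D "embeddings" (hypothetical values chosen for easy checking)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [2.0, 0.0]

indices, scores = top_k_indices(query_vec, doc_vecs, k=2)
print(indices)  # prints [0, 2]
```

Document 0 scores 1.0 even though the query is twice as long, because cosine similarity depends only on the angle between vectors and not on their lengths.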


In [14]:
# List of example queries
consultas_ejemplo = [
    "God, religion, and spirituality",
    "space exploration",
    "car maintenance"
]

print(f"\nRunning {len(consultas_ejemplo)} example queries...\n")

# Run the searches
for consulta_num, query in enumerate(consultas_ejemplo, 1):
    print(f"# QUERY {consulta_num} of {len(consultas_ejemplo)}")

    # Search documents. Note: `model` was set to E5 above, and E5 expects
    # usar_e5=True so the query gets its "query: " prefix; the run shown
    # below used usar_e5=False.
    # To see full documents, pass mostrar_completo=True
    indices, scores = buscar_documentos(query, top_k=5, usar_e5=False, mostrar_completo=False)

    if consulta_num < len(consultas_ejemplo):
        print("\n" * 2)


Running 3 example queries...

# QUERY 1 of 3
QUERY: 'God, religion, and spirituality'
Top 5 most similar documents:


────────────────────────────────────────────────────────────────────────────────
RESULT #1 | Similarity: 0.8210 (82.10%)
Category: sci.med
Document ID: 171
Length: 474 characters

CONTENT:
────────────────────────────────────────────────────────────────────────────────

    But no one (or at least, not many people) are trying to pass off God
as a scientific fact.  Not so with Kirlian photography.  I'll admit that
it is possible that some superior intelligence exists elsewhere, and if
people want to label that intelligence "God", I'm not going to stop
them.  Anyway, let's _not_ turn this into a theological debate.  ;-)


    Read alt.fan.robert.mcelwaine sometime.  I've never been so
closed-minded before subscribing to that group.  :)

────────────────────────────────────────────────────────────────────────────────

──────────────────────────────