## Information Retrieval Final Exam

_Format:_ Practical

_Deliverable:_ Jupyter Notebook with code and analysis

_Value:_ 20 points

### Objective:

Develop an information retrieval system built on a document corpus. The work must apply preprocessing, indexing, and vector space representation techniques, and evaluate three information retrieval methods through benchmarking.

In [4]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
from sklearn.metrics import precision_score, recall_score, f1_score



In [5]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\glenn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\glenn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\glenn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Part 1: Corpus Selection and Preprocessing (4 points)  
The work uses the **20 Newsgroups** corpus, a collection of text documents drawn from discussion forums (Usenet newsgroups) across a range of categories. It can be downloaded with `sklearn.datasets.fetch_20newsgroups`.  

1. **Loading the corpus** (1 point): Download the corpus and display sample texts.  

In [6]:
# Load the corpus
# categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.mideast']  # Can be modified
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [7]:
# Show the first 5 raw documents
print("First 5 raw documents:")
for i, doc in enumerate(newsgroups.data[:5]):
    print(f"Document {i+1}: {doc}\n")

First 5 raw documents:
Document 1: 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



Document 2: My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphi

2. **Preprocessing** (3 points): Implement tokenization, stopword removal, lemmatization, and TF-IDF vectorization of the text.

In [8]:
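# Build these once; re-creating them on every call is slow across the full corpus
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """Preprocess text: tokenize, drop stopwords and non-alphanumeric tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    # Keep only alphanumeric tokens that are not stopwords
    tokens = [word for word in tokens if word.isalnum() and word not in STOP_WORDS]
    # Reduce each token to its WordNet lemma (the default part of speech is noun)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return ' '.join(tokens)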

In [9]:
# Apply preprocessing to the whole corpus
preprocessed_docs = [preprocess_text(doc) for doc in newsgroups.data]

## Part 2: Indexing and Vector Representation (4 points)  
1. Build a **vector space** representation using **TF-IDF** (2 points).  

In [10]:
# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_docs)
# TfidfVectorizer already L2-normalizes rows by default (norm='l2'), so this call changes nothing
tfidf_matrix = normalize(tfidf_matrix)
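
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with L2 normalization on a toy corpus. It is a simplified illustration, not scikit-learn's implementation: it tokenizes on whitespace only, and the names `tfidf_vectors`, `dot`, and `toy_docs` are hypothetical. The IDF formula is the smoothed variant that scikit-learn documents for `smooth_idf=True`. Because every vector has unit length, the dot product of two vectors is already their cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Empty documents map to an empty (all-zero) vector
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def dot(u, v):
    """Sparse dot product of two {term: weight} vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

toy_docs = ["graphic image processing", "image rendering software", "baseball team season"]
vecs = tfidf_vectors(toy_docs)
print(round(dot(vecs[0], vecs[0]), 6))                # self-similarity of a unit vector
print(dot(vecs[0], vecs[1]) > dot(vecs[0], vecs[2]))  # shared term "image" vs. no shared term
```

Documents that share no terms score exactly zero, which is why a purely lexical model cannot match a query to a document that uses different vocabulary for the same topic.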

2. Implement an efficient indexing structure such as **Elasticsearch**, **FAISS**, or **ChromaDB** (2 points).

In [11]:
# Connect to Elasticsearch (a local node on port 9200)
es = Elasticsearch("http://localhost:9200")
index_name = "newsgroups_index"

In [12]:
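# Create the index in Elasticsearch (drop it first if it already exists)
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

es.indices.create(index=index_name, body={
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    },
    # Apply the custom analyzer to the "text" field; without a mapping it is defined but never used
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "custom_analyzer"}
        }
    }
})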

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'newsgroups_index'})

In [13]:
# Index the documents in Elasticsearch
def generate_actions():
    for i, doc in enumerate(preprocessed_docs):
        yield {
            "_index": index_name,
            "_id": i,
            "_source": {"text": doc}
        }

bulk(es, generate_actions())  

(18846, [])
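
The notebook builds the index but does not query it. A minimal sketch of a full-text lookup against the `text` field could look like the following. The helper name `search_elasticsearch` is hypothetical, and the sketch assumes a client version whose `search` method accepts `index`, `query`, and `size` as keyword arguments (as recent `elasticsearch` Python clients do). Elasticsearch scores `match` queries with BM25 by default.

```python
def search_elasticsearch(client, index, query_text, top_n=10):
    """Run a full-text `match` query and return (doc_id, score) pairs, best first."""
    response = client.search(
        index=index,
        query={"match": {"text": query_text}},  # "text" is the field used at indexing time
        size=top_n,
    )
    # Hits come back sorted by descending relevance score
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]
```

In this notebook the call would be something like `search_elasticsearch(es, index_name, preprocess_text(query))`, since the indexed documents were preprocessed the same way. Note that Elasticsearch returns `_id` as a string, so it needs converting with `int(...)` before indexing into `newsgroups.data`.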

## Part 3: Applying Information Retrieval Techniques (6 points)  
Implement three information retrieval approaches and compare their performance:    

In [14]:
# Example query
query = "computer graphics and image processing"

1. **Exact search with the TF-IDF vector space model and cosine similarity** (2 points).

In [None]:
# Search function using TF-IDF and cosine similarity
def search_tfidf(query, top_n=10):
    # Apply the same preprocessing to the query as to the corpus
    query_vec = tfidf_vectorizer.transform([preprocess_text(query)])
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    # Indices of the top_n highest scores, best first
    top_indices = np.argsort(similarities)[-top_n:][::-1]
    return [(i, newsgroups.data[i], similarities[i]) for i in top_indices]

In [None]:
print("TF-IDF Results:")
results_tfidf = search_tfidf(query)
for idx, (doc_id, doc, score) in enumerate(results_tfidf):
    print(f"Result {idx+1}: (Score: {score:.4f})\n{doc[:500]}\n")

TF-IDF Results:
Result 1: (Score: 0.4521)

I usually use "Algorithms for graphics and image processing" by
Theodosios Pavlidis, but other people here got them same idea and now
3 of 4 copies in the libraries have been stolen!

Another reference is "Digital Image Processing" by Gonzalez and
Wintz/Wood, which is widely available but a little expensive ($55
here- I just checked today).

Result 2: (Score: 0.4441)
Archive-name: graphics/resources-list/part2
Last-modified: 1993/04/17


Computer Graphics Resource Listing : WEEKLY POSTING [ PART 2/3 ]
Last Change : 17 April 1993


14. Plotting packages

Gnuplot 3.2
-----------
  It is one of the best 2- and 3-D plotting packages, with
  online help.It's a command-line driven interactive function plotting utility
  for UNIX, MSDOS, Amiga, Archimedes, and VMS platforms (at least!).
  Fre

Result 3: (Score: 0.4428)
Archive-name: graphics/resources-list/part2
Last-modified: 1993/04/27


Computer Graphics Resource Listing : WEEKLY POSTING [ PART 2/

2. **Word2Vec-based search** (2 points).

In [17]:
# Word2Vec model trained on the preprocessed corpus
sentences = [doc.split() for doc in preprocessed_docs]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

In [None]:
def get_word2vec_vector(words):
    # Average the vectors of in-vocabulary words; fall back to a zero vector if none are known
    vectors = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)

def search_word2vec(query, top_n=5):
    query_tokens = preprocess_text(query).split()
    query_vec = get_word2vec_vector(query_tokens)
    similarities = []
    
    for doc in preprocessed_docs:
        doc_tokens = doc.split()
        doc_vec = get_word2vec_vector(doc_tokens)
        # Note: this is an unnormalized dot product, not cosine similarity, so scores are unbounded
        similarity = np.dot(query_vec, doc_vec) if np.any(doc_vec) else 0
        similarities.append(similarity)
    
    # Indices of the top_n highest scores, best first
    top_indices = np.argsort(similarities)[-top_n:][::-1]
    return [(i, newsgroups.data[i], similarities[i]) for i in top_indices]


In [None]:
print("\nWord2Vec Results:")
results_word2vec = search_word2vec(query)
for idx, (doc_id,doc, score) in enumerate(results_word2vec):
    print(f"Result {idx+1}: (Score: {score:.4f})\n{doc[:500]}\n")


Word2Vec Results:
Result 1: (Score: 62.5755)
 and  A VGA monitor..
e-mail


Result 2: (Score: 61.5370)
Not on my system.

Result 3: (Score: 56.7137)

Version 2.03 drivers are current.

Result 4: (Score: 56.4450)
Which newsgroup discusses graphic design on PCs and macs?

Result 5: (Score: 56.1364)
cica.indiana.edu pc/drivers  the current version is 2.0
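
The scores above (for example 62.5755) are far outside the range of a cosine similarity, which is consistent with `search_word2vec` ranking by a raw dot product. A raw dot product grows with vector length, so one plausible explanation for the very short posts at the top is that their averaged vectors have large norms. The sketch below illustrates the difference in pure Python with hypothetical 2-dimensional "word vectors"; `mean_vector` and `cosine` are illustrative helpers, not part of the notebook.

```python
import math

def mean_vector(vectors, size):
    """Average equal-length vectors; return a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * size
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity in [-1, 1]; defined here as 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Hypothetical data: one long vector pointing elsewhere, one short vector aligned with the query
query_vec = [1.0, 1.0]
short_doc = mean_vector([[5.0, 0.0]], size=2)
related_doc = mean_vector([[0.9, 1.1], [1.1, 0.9]], size=2)

raw = [sum(q * d for q, d in zip(query_vec, doc)) for doc in (short_doc, related_doc)]
cos = [cosine(query_vec, doc) for doc in (short_doc, related_doc)]
print(raw)  # the raw dot product ranks short_doc first
print(cos)  # cosine similarity ranks related_doc first
```

In the notebook, the equivalent change would be to divide the dot product by the product of the two vector norms, or to reuse `cosine_similarity` from scikit-learn as the TF-IDF and SBERT searches do.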




3. **Retrieval with a transformer-based model (e.g., `sentence-transformers` for embeddings)** (2 points).

In [20]:
# Transformer-based model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
X_sbert = sbert_model.encode(preprocessed_docs)


In [None]:
def search_sbert(query, top_n=10):
    query_vec = sbert_model.encode([query])[0]
    similarities = cosine_similarity([query_vec], X_sbert).flatten()
    top_indices = np.argsort(similarities)[-top_n:][::-1]
    return [(i, newsgroups.data[i], similarities[i]) for i in top_indices]

In [None]:
print("\nSBERT Results:")
results_sbert = search_sbert(query)
for idx, (doc_id,doc, score) in enumerate(results_sbert):
    print(f"Result {idx+1}: (Score: {score:.4f})\n{doc[:500]}\n")


SBERT Results:
Result 1: (Score: 0.5729)

I usually use "Algorithms for graphics and image processing" by
Theodosios Pavlidis, but other people here got them same idea and now
3 of 4 copies in the libraries have been stolen!

Another reference is "Digital Image Processing" by Gonzalez and
Wintz/Wood, which is widely available but a little expensive ($55
here- I just checked today).

Result 2: (Score: 0.5593)
Hello,

    I am searching for rendering software which has been developed
to specifically take advantage of multi-processor computer systems.
Any pointers to such software would be greatly appreciated.
    
Thanks.


Result 3: (Score: 0.5532)

What kind of polygons?  Shaded?  Texturemapped?  Hm?  More comes into play with
fast routines than just "polygons".  It would be nice to know exaclty what
system (VGA is a start, but what processor?) and a few of the specifics of the
implementation.  You need to give  more info if you want to get any answers! :P

                           

## Part 4: Evaluation through Benchmarking (6 points)  
1. **Defining a ground truth** (2 points): Select at least 10 queries and manually define the relevant documents for each.  

In [58]:
from sklearn.metrics import precision_score, recall_score, f1_score
from collections import defaultdict

# 🔹 Define the new ground truth based on content rather than document IDs
ground_truth_texts = {
    "computer graphics": "computer graphics",
    "space mission": "space mission",
    "political debate": "government politics debate",
    "baseball game": "baseball team match",
    "climate change": "global warming climate",
    "quantum mechanics": "quantum physics theory",
    "artificial intelligence": "machine learning AI",
    "financial markets": "stock investment banking",
    "healthcare technology": "medical healthcare innovation",
    "renewable energy": "solar wind energy"
}
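
A content-based ground truth like this still has to be turned into sets of relevant documents before precision and recall can be computed. One possible approach, sketched below under stated assumptions, treats a document as relevant when it contains the keywords listed for the query. The function name `build_relevant_ids`, the `match` parameter, and the toy data are all hypothetical. Keyword matching is only a heuristic proxy for the manual relevance judgments the task asks for.

```python
def build_relevant_ids(ground_truth, docs, match=all):
    """Map each query to the set of document indices whose tokens contain its keywords.

    match=all requires every keyword to appear; match=any accepts a single keyword.
    """
    # Tokenize each document once so membership checks are cheap
    doc_tokens = [set(doc.lower().split()) for doc in docs]
    relevant = {}
    for query, keyword_text in ground_truth.items():
        keywords = keyword_text.lower().split()
        relevant[query] = {
            doc_id
            for doc_id, tokens in enumerate(doc_tokens)
            if match(keyword in tokens for keyword in keywords)
        }
    return relevant

toy_docs = ["computer graphics rendering", "space mission launch", "computer hardware"]
toy_truth = {"computer graphics": "computer graphics", "space mission": "space mission"}
print(build_relevant_ids(toy_truth, toy_docs))             # strict: all keywords required
print(build_relevant_ids(toy_truth, toy_docs, match=any))  # loose: any keyword suffices
```

The strict variant can easily produce empty relevant sets for multi-keyword entries, in which case recall is undefined or zero for that query, so it is worth inspecting the set sizes before evaluating.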

2. **Computing precision and recall for each technique** (2 points): Implement the evaluation with standard metrics.

3. **Comparative analysis** (2 points): Compare the results of the three techniques and justify their effectiveness based on those results.
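
The definition of `evaluate_model_text_based` is not shown in this notebook, so the following is not a reconstruction of it. It is a minimal sketch of set-based precision, recall, and F1 per query, assuming a search function that returns `(doc_id, text, score)` tuples (as the search functions above do) and a hypothetical mapping from each query to its set of relevant document IDs. The names `precision_recall_f1`, `evaluate_search`, `toy_search`, and `toy_relevant` are illustrative.

```python
def precision_recall_f1(retrieved_ids, relevant_ids):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate_search(search_fn, relevant_by_query, top_n=10):
    """Return one (query, precision, recall, f1) tuple per query."""
    rows = []
    for query, relevant_ids in relevant_by_query.items():
        # search_fn is assumed to return (doc_id, text, score) tuples
        retrieved_ids = [doc_id for doc_id, _, _ in search_fn(query, top_n=top_n)]
        rows.append((query, *precision_recall_f1(retrieved_ids, relevant_ids)))
    return rows

# Toy stand-in for a real search function
def toy_search(query, top_n=10):
    ranked = {"computer graphics": [0, 2, 5], "space mission": [4, 1]}
    return [(doc_id, "", 1.0) for doc_id in ranked[query][:top_n]]

toy_relevant = {"computer graphics": {0, 2, 3, 7}, "space mission": {9}}
for row in evaluate_search(toy_search, toy_relevant, top_n=3):
    print(row)
```

Each row has the shape `(query, precision, recall, f1)`, which matches the four-column layout used when the results are loaded into a DataFrame. For a fair comparison, all three techniques should be evaluated with the same `top_n`.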

In [74]:
# **Run the evaluation with the new method**
tfidf_results = evaluate_model_text_based(search_tfidf, "TF-IDF")
word2vec_results = evaluate_model_text_based(search_word2vec, "Word2Vec")
sbert_results = evaluate_model_text_based(search_sbert, "SBERT")


🔍 **Evaluating:** computer graphics (TF-IDF)
🔹 Retrieved results: [('Technion - Israel Institute of Technology\n         Department of Computer Science\n\n       GRADUATE STUDIES IN COMPUTER GRAPHICS\n\nApplications are invited for graduate students wishing\nto specialize in computer graphics and related fields.\nActive research is being conducted in the fields of\nimage rendering, geometric modelling and computer animation.\nState of the art graphics workstations (Sun, Silicon Graphics)\nand video equipment are available.\nThe Technion offers full scholarship support (tuition and \nassistantships) for suitable candidates.\n\nFor more information contact', 0.48719038462142616), ("Within the next several months I'll be looking for a job in computer\ngraphics software.  I'm in need of info on graphics software companies. \nI've checked the FAQ, the resource list, and siggraph.org, haven't found\nanything.  The last Computer Graphics Career Handbook that I'm aware of,\nwas published 

KeyboardInterrupt: 

In [None]:
# Build a DataFrame with the results (assumes all three result lists share the same query order)
df_results = pd.DataFrame(tfidf_results, columns=["Query", "Precision (TF-IDF)", "Recall (TF-IDF)", "F1-score (TF-IDF)"])
df_results["Precision (Word2Vec)"], df_results["Recall (Word2Vec)"], df_results["F1-score (Word2Vec)"] = zip(*[(p, r, f) for _, p, r, f in word2vec_results])
df_results["Precision (SBERT)"], df_results["Recall (SBERT)"], df_results["F1-score (SBERT)"] = zip(*[(p, r, f) for _, p, r, f in sbert_results])


In [None]:
# Save the results
df_results.to_csv("benchmarking_results_text_based.csv", index=False)

In [None]:
print("\n✅ **Final Evaluation (content-based benchmarking):**")
print(df_results)


✅ **Final Evaluation (Benchmarking):**
                     Query  Precision (TF-IDF)  Recall (TF-IDF)  \
0        computer graphics                 0.0              0.0   
1            space mission                 0.0              0.0   
2            baseball game                 0.0              0.0   
3        quantum mechanics                 0.0              0.0   
4  artificial intelligence                 0.0              0.0   
5         renewable energy                 0.0              0.0   

   F1-score (TF-IDF)  Precision (Word2Vec)  Recall (Word2Vec)  \
0                0.0                   0.0                0.0   
1                0.0                   0.0                0.0   
2                0.0                   0.0                0.0   
3                0.0                   0.0                0.0   
4                0.0                   0.0                0.0   
5                0.0                   0.0                0.0   

   F1-score (Word2Vec)  Precision 

## Submission and Format  
Each student must submit a **Jupyter Notebook** with well-documented code and analysis. Grading will consider clarity, code quality, and depth of analysis.

## Grading Criteria (20 points)  
| Section | Points |
|---------|--------|
| Corpus loading and preprocessing | 4 |
| Indexing and vector representation | 4 |
| Implementation of three retrieval techniques | 6 |
| Benchmarking evaluation (ground truth, precision, and recall) | 6 |
| **Total** | **20** |


## Notes  
- Recommended libraries for the implementation: `scikit-learn`, `nltk`, `gensim`, `sentence-transformers`, and `faiss`.  
- Code optimization and clear presentation of results will be valued.  
- Notebooks with execution errors will not be accepted. 