# Exercise 4: Probabilistic Model

## Practice Objectives
- Understand the components of the vector space model through manual calculations and direct observation.
- Apply the vector space model with TF-IDF to retrieve relevant documents.
- Compare retrieval with BM25 against TF-IDF.
- Visually analyze the differences between the models.
- Evaluate whether the generated rankings are consistent with the documents you would consider relevant.

JORGE ROJAS

## Part 0: Loading the Corpus

In [34]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [35]:
# type of newsgroupsdocs
print(type(newsgroupsdocs))

<class 'list'>


In [36]:
# view one document
print(newsgroupsdocs[0])



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

In [38]:
# shallow size in bytes of the list object itself (the document strings are not included)
print(newsgroupsdocs.__sizeof__())

150808


In [39]:
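# vectorize newsgroupsdocs with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroupsdocs)
# shape of the document-term matrix: (documents, terms)
print(X.shape)
# transform() with the fitted vectorizer yields a matrix of the same shape
corpus_vect = vectorizer.transform(newsgroupsdocs)
# keep the matrix sparse: corpus_vect.toarray() would allocate a dense 18846 x 134410 array
print(corpus_vect.shape)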


(18846, 134410)
(18846, 134410)


In [40]:
# query: the last term in the vocabulary
query = ["ÿhooked"]
# vectorize the query
query_vect = vectorizer.transform(query)
# print the query vector (TF-IDF weights, L2-normalized)
print(query_vect.toarray())
# cosine similarity between the query and every document in the corpus
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(query_vect, corpus_vect)
print(similarity)

[[0. 0. 0. ... 0. 0. 1.]]
[[0. 0. 0. ... 0. 0. 0.]]


In [41]:
vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., '³ation', 'ýé', 'ÿhooked'],
      shape=(134410,), dtype=object)

In [42]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable





In [43]:
import nltk
from nltk.corpus import stopwords, words
import re
from sklearn.datasets import fetch_20newsgroups

# Download the required resources
nltk.download('stopwords')
nltk.download('words')

# Set of valid English words and set of English stopwords
palabras_validas = set(words.words())
stop_words = set(stopwords.words('english'))

# Load the corpus
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = newsgroups.data

# Cleaning function
def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r'[^a-z\s]', '', texto)  # keep only letters and whitespace
    palabras = texto.split()
    palabras_filtradas = [
        palabra for palabra in palabras
        if palabra in palabras_validas and palabra not in stop_words and len(palabra) > 2
    ]
    return ' '.join(palabras_filtradas)

# Clean the corpus; only the first 100 documents are used to limit run time and memory
corpus_limpio = [limpiar_texto(doc) for doc in docs[:100]]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gboy2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\gboy2\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [44]:
# Vectorize the cleaned corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_limpio)

# Shape of the vectorized corpus: (documents, terms)
print(X.shape)

(100, 2162)


In [45]:
# Vectorize the query after cleaning it with the same function
query_clean = ["sure"]
query_limpia = [limpiar_texto(q) for q in query_clean]
query_vectorizado = vectorizer.transform(query_limpia)
print(query_vectorizado.toarray())

[[0. 0. 0. ... 0. 0. 0.]]


In [46]:
# Cosine similarity between the cleaned query and each cleaned document
similarity_clean = cosine_similarity(query_vectorizado, X)
print(similarity_clean)
print(len(corpus_limpio))  # number of cleaned documents


[[0.11741628 0.         0.         0.         0.         0.19507578
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.09999192 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.04281659 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.      

## Part 1: Computing TF, DF, IDF, and TF-IDF

### Activity
1. Use the loaded corpus.
2. Build the term matrix (TF) and compute the document frequency (DF).
3. Compute TF-IDF using sklearn.
4. Display the values in a DataFrame to analyze the differences between terms.
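
Before running these steps on the real corpus, it may help to see the quantities on a tiny example. The sketch below uses a hypothetical three-document corpus (not taken from 20 Newsgroups) and only the standard library. It computes TF as raw counts, DF as the number of documents containing each term, the classic IDF `log(N / df)`, and the smoothed IDF `log((1 + N) / (1 + df)) + 1` that scikit-learn applies when `smooth_idf=True`. Tokenizing by whitespace is a simplifying assumption.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# TF: raw count of each term inside each document.
tf = [Counter(tokens) for tokens in tokenized]

# DF: number of documents that contain the term at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Classic IDF: log(N / df).
idf_classic = {term: math.log(n_docs / df_t) for term, df_t in df.items()}

# Smoothed IDF (scikit-learn's smooth_idf=True form): log((1 + N) / (1 + df)) + 1.
idf_smooth = {term: math.log((1 + n_docs) / (1 + df_t)) + 1 for term, df_t in df.items()}

# TF-IDF weight of "cat" in document 0 using the classic IDF.
tfidf_cat_doc0 = tf[0]["cat"] * idf_classic["cat"]

print("DF:", df["the"], df["cat"], df["mat"])
print("classic IDF of 'mat':", round(idf_classic["mat"], 4))
print("TF-IDF of 'cat' in doc 0:", round(tfidf_cat_doc0, 4))
```

A term that appears in a single document ("mat") gets the highest IDF, while terms shared by several documents are down-weighted.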

In [47]:
# Take a sample to keep the analysis manageable
sample_size = 1000
np.random.seed(42)
sample_indices = np.random.choice(len(newsgroupsdocs), sample_size, replace=False)
docs_sample = [newsgroupsdocs[i] for i in sample_indices]

print(f"Selected sample: {len(docs_sample)} documents")

# 1. Build the term matrix (TF)
vectorizer_count = CountVectorizer(
    max_features=1000,
    stop_words='english',
    lowercase=True
)

tf_matrix = vectorizer_count.fit_transform(docs_sample)
feature_names = vectorizer_count.get_feature_names_out()

print(f"TF matrix built: {tf_matrix.shape}")

# 2. Compute document frequency (DF)
# NOTE: summing raw counts gives the total number of occurrences of each term
# (collection frequency), not the number of documents that contain it. That is why
# 'ax' reaches 2831 in a 1000-document sample and ends up with a negative IDF.
# A true DF counts nonzero entries per column, as cell 50 does with (tf_matrix > 0).
df_values = np.array(tf_matrix.sum(axis=0)).flatten()

# 3. Compute TF-IDF using sklearn
vectorizer_tfidf = TfidfVectorizer(
    max_features=1000,
    stop_words='english',
    lowercase=True
)

tfidf_matrix = vectorizer_tfidf.fit_transform(docs_sample)

# 4. Display in a DataFrame
N = len(docs_sample)
idf_manual = np.log(N / df_values)

# Create a DataFrame for the analysis
df_analysis = pd.DataFrame({
    'term': feature_names,
    'df': df_values,
    'idf_manual': idf_manual,
    'tf_total': df_values  # sum of TF over all documents
})

# Show the most and least frequent terms
print("\nMost frequent terms (high DF):")
print(df_analysis.nlargest(10, 'df')[['term', 'df', 'idf_manual']])

print("\nMost discriminative terms (high IDF):")
print(df_analysis.nlargest(10, 'idf_manual')[['term', 'df', 'idf_manual']])

Selected sample: 1000 documents
TF matrix built: (1000, 1000)

Most frequent terms (high DF):
       term    df  idf_manual
130      ax  2831   -1.040630
351    file   477    0.740239
662  people   369    0.996959
314     edu   364    1.010601
298     don   360    1.021651
515    like   355    1.035637
486    just   343    1.070025
927     use   338    1.084709
492    know   289    1.241329
889    time   255    1.366492

Most discriminative terms (high IDF):
              term  df  idf_manual
714          proof  21    3.863233
919  understanding  21    3.863233
38              29  22    3.816713
42             300  22    3.816713
43              31  22    3.816713
72              ad  22    3.816713
87           amend  22    3.816713
116           army  22    3.816713
139      basically  22    3.816713
153           bits  22    3.816713
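
In the output above, `ax` shows a `df` of 2831 in a sample of only 1000 documents, and its IDF is negative. A document frequency can never exceed the number of documents, so that column is really a total occurrence count (collection frequency). The sketch below reproduces the effect on a hypothetical toy corpus using only the standard library. On the sparse matrix itself, the equivalent of the corrected count would be `(tf_matrix > 0).sum(axis=0)`, which is what cell 50 uses.

```python
import math
from collections import Counter

# Hypothetical toy documents; "ax" repeats heavily inside a single document.
toy_docs = [
    "ax ax ax ax ax ax file",
    "file people",
    "people like file",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Collection frequency: total occurrences across the corpus (a plain sum of counts).
collection_freq = Counter(term for tokens in tokenized for term in tokens)

# Document frequency: number of documents containing the term (never larger than n_docs).
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# IDF from the collection frequency turns negative once occurrences exceed n_docs.
idf_from_cf = math.log(n_docs / collection_freq["ax"])
# IDF from the true document frequency stays non-negative.
idf_from_df = math.log(n_docs / doc_freq["ax"])

print("collection frequency of 'ax':", collection_freq["ax"])
print("document frequency of 'ax':", doc_freq["ax"])
print("IDF from CF:", round(idf_from_cf, 4), "| IDF from DF:", round(idf_from_df, 4))
```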


In [48]:
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(corpus_limpio)
# Get the vocabulary (terms)
vocabulario = vectorizer.get_feature_names_out()

# Show the dimensions of the TF matrix
print("TF matrix shape (documents x terms):", tf_matrix.shape)

TF matrix shape (documents x terms): (100, 2162)


In [49]:
# Show the TF of an example document
print("TF of document 12 (frequency of each term):")
print(tf_matrix[12].toarray())

TF of document 12 (frequency of each term):
[[1 2 0 ... 0 0 0]]


In [50]:
# Compute DF: how many documents contain each term
df_vector = np.asarray((tf_matrix > 0).sum(axis=0)).ravel()

# Show each term with its DF
for termino, df in zip(vocabulario, df_vector):
    print(f"Term: {termino} - DF: {df}")

Term: abandoned - DF: 1
Term: ability - DF: 5
Term: able - DF: 3
Term: absent - DF: 1
Term: accelerated - DF: 1
Term: accepted - DF: 1
Term: access - DF: 1
Term: accident - DF: 2
Term: according - DF: 4
Term: accurate - DF: 1
Term: accused - DF: 1
Term: acetone - DF: 1
Term: across - DF: 2
Term: act - DF: 1
Term: acting - DF: 2
Term: action - DF: 2
Term: active - DF: 1
Term: actively - DF: 1
Term: activity - DF: 2
Term: acton - DF: 1
Term: actual - DF: 1
Term: actually - DF: 6
Term: adapter - DF: 1
Term: add - DF: 2
Term: added - DF: 1
Term: address - DF: 5
Term: adhere - DF: 1
Term: admit - DF: 1
Term: admonish - DF: 1
Term: adult - DF: 1
Term: advance - DF: 2
Term: aeon - DF: 1
Term: afford - DF: 3
Term: afresh - DF: 1
Term: afternoon - DF: 1
Term: age - DF: 1
Term: aggression - DF: 1
Term: ago - DF: 4
Term: agree - DF: 2
Term: ahead - DF: 1
Term: aid - DF: 1
Ter

In [51]:
# COSINE SIMILARITY BETWEEN THE QUERY AND EACH DOCUMENT
from sklearn.metrics.pairwise import cosine_similarity
query = ["ÿhooked 00"]
query_vect = vectorizer.transform(query)
# X is the TF-IDF matrix of corpus_limpio built in cell 44.
# Neither query token is in the cleaned vocabulary, so every similarity is 0.
similarity = cosine_similarity(query_vect, X)
print("\nSimilarity between the query and the documents:")
print(similarity)




Similarity between the query and the documents:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]]


In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Build the TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus_limpio)

# Get the vocabulary terms
terminos = vectorizer.get_feature_names_out()

# Create a DataFrame with the TF-IDF values
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=terminos)

# Show the first rows of the DataFrame (documents x terms)
print(df_tfidf.head())

# Show the first terms and their weights per document (terms x documents)
print(df_tfidf.T.head())


   abandoned   ability  able  absent  accelerated  accepted  access  accident  \
0        0.0  0.000000   0.0     0.0          0.0       0.0     0.0       0.0   
1        0.0  0.000000   0.0     0.0          0.0       0.0     0.0       0.0   
2        0.0  0.000000   0.0     0.0          0.0       0.0     0.0       0.0   
3        0.0  0.086763   0.0     0.0          0.0       0.0     0.0       0.0   
4        0.0  0.000000   0.0     0.0          0.0       0.0     0.0       0.0   

   according  accurate  ...  year  yearlong  yellow  yes  yesterday  yet  \
0        0.0       0.0  ...   0.0       0.0     0.0  0.0        0.0  0.0   
1        0.0       0.0  ...   0.0       0.0     0.0  0.0        0.0  0.0   
2        0.0       0.0  ...   0.0       0.0     0.0  0.0        0.0  0.0   
3        0.0       0.0  ...   0.0       0.0     0.0  0.0        0.0  0.0   
4        0.0       0.0  ...   0.0       0.0     0.0  0.0        0.0  0.0   

   youd  young  youth  zero  
0   0.0    0.0    0.0   0.

In [53]:
# Extract the IDF values
idf = pd.Series(vectorizer.idf_, index=terminos)

# Select the 20 most common terms (lowest IDF)
top_terms = idf.sort_values().head(20).index

# Create a DataFrame with these terms across all documents
visual_df = df_tfidf[top_terms]

# Show the table
print("\nTF-IDF values for the most common terms:")
print(visual_df)



TF-IDF values for the most common terms:
       would       one      dont      know      time       get     think  \
0   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
1   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2   0.000000  0.056973  0.117816  0.061011  0.062136  0.000000  0.000000   
3   0.000000  0.000000  0.000000  0.000000  0.058334  0.000000  0.058334   
4   0.000000  0.128645  0.000000  0.137763  0.000000  0.000000  0.000000   
..       ...       ...       ...       ...       ...       ...       ...   
95  0.039993  0.041936  0.043360  0.000000  0.045736  0.000000  0.000000   
96  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.079023   
97  0.062706  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
98  0.000000  0.000000  0.000000  0.000000  0.000000  0.144063  0.000000   
99  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

        like      also       us

## Part 2: Ranking Documents with TF-IDF

### Activity

1. Given a query, build the query vector.
2. Compute the cosine similarity between the query and each document using the TF-IDF vectors.
3. Generate a ranking of the documents ordered by relevance.
4. Show the results in a table.
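
The four steps can be sketched on a toy example before using the real matrices. The document vectors below are hypothetical TF-IDF weights over a made-up three-term vocabulary, and the `cosine` helper is written by hand only to make the formula visible; in the notebook cells, `cosine_similarity` from sklearn plays that role.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights over the vocabulary [computer, software, hockey].
doc_vectors = {
    "doc_a": [0.8, 0.6, 0.0],
    "doc_b": [0.0, 0.0, 1.0],
    "doc_c": [0.5, 0.0, 0.5],
}
# Query vector for "computer software" over the same vocabulary.
query_vector = [1.0, 1.0, 0.0]

# Score every document, then sort by score in descending order.
scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in doc_vectors.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Show the ranking as a small table.
for rank, (doc_id, score) in enumerate(ranking, start=1):
    print(f"{rank}. {doc_id}  score={score:.4f}")
```

A document that shares no term with the query (`doc_b`) gets a score of 0 and falls to the bottom, which is the same pattern seen with the `ÿhooked` query earlier.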

In [54]:
query_clean = ["sure"]
query_limpia = [limpiar_texto(q) for q in query_clean]
query_vectorizado = vectorizer.transform(query_limpia)
print(query_vectorizado.toarray())

[[0. 0. 0. ... 0. 0. 0.]]


In [55]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Compute the cosine similarity between the query and all documents
similitudes = cosine_similarity(query_vectorizado, tfidf_matrix).flatten()

# Get the index of the most relevant document
doc_mas_relevante = np.argmax(similitudes)
similitud_maxima = similitudes[doc_mas_relevante]

print(f"The most relevant document is number: {doc_mas_relevante}")
print(f"With a similarity of: {similitud_maxima:.4f}")


The most relevant document is number: 5
With a similarity of: 0.1951


In [56]:

print("\nContent of the most relevant document:\n")
print(docs[doc_mas_relevante])



Content of the most relevant document:



Back in high school I worked as a lab assistant for a bunch of experimental
psychologists at Bell Labs.  When they were doing visual perception and
memory experiments, they used vector-type displays, with 1-millisecond
refresh rates common.

So your case of 1/200th sec is quite practical, and the experimenters were
probably sure that it was 5 milliseconds, not 4 or 6 either.


Steve


In [57]:
# Check which documents contain the word 'sure' after cleaning
conteo_sure = []

for i, doc in enumerate(corpus_limpio):
    conteo = doc.split().count("sure")
    if conteo > 0:
        conteo_sure.append((i, conteo))

# Sort by frequency, descending
conteo_sure.sort(key=lambda x: x[1], reverse=True)

# Show the documents with the most occurrences of 'sure'
for idx, count in conteo_sure[:5]:
    print(f"Document {idx} has {count} occurrence(s) of 'sure'")
    print(f"First lines:\n{docs[idx][:200]}\n{'-'*50}")


Document 61 has 2 occurrence(s) of 'sure'
First lines:
completed

Why would you dispose a channel if you are going to play more
sounds soon? If you are trying to write a game, you shouldn't
be using SndPlay. Instead, make a channel and use BufferCmds
to p
--------------------------------------------------
Document 0 has 1 occurrence(s) of 'sure'
First lines:


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However,
--------------------------------------------------
Document 5 has 1 occurrence(s) of 'sure'
First lines:


Back in high school I worked as a lab assistant for a bunch of experimental
psychologists at Bell Labs.  When they were doing visual perception and
memory experiments, they used vector-type displays
--------------------------------------------------
Document 87 has 1 occurrence(s) of 'sure'
First lines:
SREBRE

In [58]:
print("sure" in palabras_validas)  
print("sure" in stop_words)        


True
False


In [59]:
idx_dominik = conteo_sure[0][0]
print(f"\nMost likely document (Dominik): {idx_dominik}")
print(docs[idx_dominik])



Most likely document (Dominik): 61
completed

Why would you dispose a channel if you are going to play more
sounds soon? If you are trying to write a game, you shouldn't
be using SndPlay. Instead, make a channel and use BufferCmds
to play sounds on it. It works great. You can add CallBacks to
the channel also to let you know when the channel is getting
empty. Before it gets empty.

7.1,

Callbacks are very reliable, I found them 100% reliable, even
under System 4.1. I was doing continuous background sound with
interrupting sound effects on System 6.0 with the IM-V
documentation.

You probably were cancelling your callback commands out of
your channels, of course you didn't get called. In general, if
you have problems with sounds working when you play one per
channel and then close the channel (with the related
slowdown), but then when you play more than one you don't
work, then you are adding more than one synthesizer to a
channel, possibly the same one multiple times. This might be

In [60]:
# 1. Define the query
query = "computer software programming"
print(f"Query: '{query}'")

# 2. Build the query vector with the vectorizer fitted on docs_sample
query_vector = vectorizer_tfidf.transform([query])

# 3. Compute cosine similarity against the sample's TF-IDF matrix.
# tfidf_matrix was reassigned to the 100-document cleaned corpus (2162 terms),
# so the sample matrix (1000 terms) is rebuilt here to keep the dimensions aligned.
tfidf_sample = vectorizer_tfidf.transform(docs_sample)
similarities = cosine_similarity(query_vector, tfidf_sample).flatten()

# 4. Generate the ranking
doc_scores = list(enumerate(similarities))
doc_scores.sort(key=lambda x: x[1], reverse=True)

# 5. Show the results in a table
results_tfidf = []
print("\nTop 10 most relevant documents (TF-IDF):")
for i, (doc_idx, score) in enumerate(doc_scores[:10]):
    preview = docs_sample[doc_idx][:150] + "..."
    results_tfidf.append({
        'Rank': i+1,
        'Doc_ID': doc_idx,
        'Score': round(score, 4),
        'Document': preview
    })
    print(f"{i+1}. Score: {score:.4f}")
    print(f"   {preview}\n")

# Create the results DataFrame
df_tfidf_results = pd.DataFrame(results_tfidf)
print("TF-IDF results table:")
print(df_tfidf_results[['Rank', 'Doc_ID', 'Score']])

Query: 'computer software programming'

## Part 3: Ranking with BM25

### Activity

1. Implement a retrieval system using the BM25 model.
2. Use the same query as in the previous exercise.
3. Compute the BM25 score for each document and generate a ranking.
4. Manually compare it with the TF-IDF ranking.
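
BM25 differs from plain TF-IDF mainly in how it treats term frequency: the contribution of a term saturates as it repeats, and it is adjusted by the document's length relative to the corpus average. The sketch below isolates that term-frequency component so both effects can be observed with hypothetical numbers. The defaults `k1=1.2` and `b=0.75` match the ones used by the `BM25` class in this notebook; the helper name `bm25_tf_component` is made up for this illustration.

```python
def bm25_tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document (IDF not included)."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: for a document of average length, the component grows with tf
# but stays below k1 + 1.
saturation = [bm25_tf_component(tf, doc_len=100, avgdl=100) for tf in (1, 2, 5, 20, 100)]
print([round(value, 3) for value in saturation])

# Length normalization: same tf, but the longer document receives a lower value.
short_doc = bm25_tf_component(3, doc_len=50, avgdl=100)
long_doc = bm25_tf_component(3, doc_len=400, avgdl=100)
print(round(short_doc, 3), round(long_doc, 3))
```

Multiplying this component by the IDF of each query term and summing over the query gives the document's BM25 score.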

In [61]:
# 1. Implement BM25
from collections import Counter

class BM25:
    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # controls term-frequency saturation
        self.b = b    # controls document-length normalization

    def fit(self, documents):
        # Tokenize: lowercase alphabetic tokens of at least 3 letters
        self.documents = [re.findall(r'\b[a-zA-Z]{3,}\b', doc.lower()) for doc in documents]

        self.N = len(self.documents)
        self.doc_lengths = [len(doc) for doc in self.documents]
        self.avgdl = sum(self.doc_lengths) / self.N

        # Document frequency: count each term once per document
        doc_freq = Counter()
        for doc in self.documents:
            doc_freq.update(set(doc))

        # Vocabulary
        self.vocab = set(doc_freq)

        # IDF; negative when a term occurs in more than half of the documents
        self.idf = {
            term: np.log((self.N - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query, doc_idx):
        query_terms = re.findall(r'\b[a-zA-Z]{3,}\b', query.lower())
        doc = self.documents[doc_idx]
        doc_len = self.doc_lengths[doc_idx]

        score = 0
        for term in query_terms:
            if term in self.vocab:
                tf = doc.count(term)
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / self.avgdl))
                score += self.idf.get(term, 0) * (numerator / denominator)

        return score

# 2. Fit BM25 on the sample
bm25 = BM25()
bm25.fit(docs_sample)

# 3. Compute the scores for the same query
bm25_scores = []
for i in range(len(docs_sample)):
    score = bm25.score(query, i)
    bm25_scores.append((i, score))

# 4. Generate the ranking
bm25_scores.sort(key=lambda x: x[1], reverse=True)

# 5. Show the results
results_bm25 = []
print("Top 10 most relevant documents (BM25):")
for i, (doc_idx, score) in enumerate(bm25_scores[:10]):
    preview = docs_sample[doc_idx][:150] + "..."
    results_bm25.append({
        'Rank': i+1,
        'Doc_ID': doc_idx,
        'Score': round(score, 4),
        'Document': preview
    })
    print(f"{i+1}. Score: {score:.4f}")
    print(f"   {preview}\n")

# Create the results DataFrame
df_bm25_results = pd.DataFrame(results_bm25)
print("BM25 results table:")
print(df_bm25_results[['Rank', 'Doc_ID', 'Score']])

Top 10 most relevant documents (BM25):
1. Score: 13.8803
   FOR SALE:

*** COMPLETE PACKAGE ONLY ***

(1) COMMODORE C64 COMPUTER LIKE NEW IN THE BOX WITH POWER SUPPLY
    AND OWNERS MANUALS 
(2) COMMODORE 1541C...

2. Score: 10.3525
   Hello,

    I am searching for rendering software which has been developed
to specifically take advantage of multi-processor computer systems.
Any poi...

3. Score: 7.2809
   : In article <1993May12.193454.29823@hal.com>, bobp@hal.com (Bob Pendleton) writes...
: >From article <1993May7.235404.22590@pony.Ingres.COM>, by mwme...

4. Score: 7.2300
   -------------------------------------------------------------------------

                               CALL FOR PAPERS


   The Eighth Internationa...

5. Score: 6.9500
   I am DESPERATELY trying to find a PC based e-mail wide area network service
or the necessary network software to establish one myself. While I am awar...

6. Score: 6.2916
   Hi there,

I'm looking for tools that can make X programming e

## Part 4: Visual Comparison between TF-IDF and BM25

### Activity

1. Use a bar chart to visualize the scores each document obtains under TF-IDF and BM25.
2. Compare the rankings visually.
3. Identify: which documents obtain higher scores in one model than in the other?
4. Suggest: what could explain this difference?
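
One practical point when plotting both models on the same axis: cosine similarities over TF-IDF vectors lie between 0 and 1, whereas BM25 scores are unbounded (the BM25 output above reaches 13.88). Rescaling each model's scores, or comparing rank positions instead of raw scores, can make the chart easier to read. The sketch below shows both ideas with hypothetical scores; `min_max` and `rank_positions` are illustrative helper names, not part of any library.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; all-equal scores map to 0.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def rank_positions(scores):
    """Map each doc_id to its 1-based rank, best score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: pos for pos, doc_id in enumerate(ordered, start=1)}

# Hypothetical scores for the same four documents under each model.
tfidf_scores = {"d1": 0.42, "d2": 0.10, "d3": 0.31, "d4": 0.00}
bm25_scores_toy = {"d1": 7.3, "d2": 13.9, "d3": 6.9, "d4": 0.0}

tfidf_norm = min_max(tfidf_scores)
bm25_norm = min_max(bm25_scores_toy)
tfidf_rank = rank_positions(tfidf_scores)
bm25_rank = rank_positions(bm25_scores_toy)

# A positive shift means the document ranks better under BM25 than under TF-IDF.
for doc_id in tfidf_scores:
    shift = tfidf_rank[doc_id] - bm25_rank[doc_id]
    print(doc_id, round(tfidf_norm[doc_id], 3), round(bm25_norm[doc_id], 3), shift)
```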

In [62]:
# 1. Prepare the data for visualization
top_20_docs = list(set([x[0] for x in doc_scores[:20]] + [x[0] for x in bm25_scores[:20]]))

tfidf_scores_dict = {doc_idx: score for doc_idx, score in doc_scores}
bm25_scores_dict = {doc_idx: score for doc_idx, score in bm25_scores}

tfidf_vals = [tfidf_scores_dict.get(doc_idx, 0) for doc_idx in top_20_docs]
bm25_vals = [bm25_scores_dict.get(doc_idx, 0) for doc_idx in top_20_docs]

# 2. Create the bar chart
plt.figure(figsize=(15, 6))

x = np.arange(len(top_20_docs))
width = 0.35

plt.bar(x - width/2, tfidf_vals, width, label='TF-IDF', alpha=0.8)
plt.bar(x + width/2, bm25_vals, width, label='BM25', alpha=0.8)

plt.xlabel('Documents')
plt.ylabel('Score')
plt.title('Score comparison: TF-IDF vs BM25')
plt.legend()
plt.xticks(x, [f'Doc{i}' for i in top_20_docs], rotation=45)
plt.tight_layout()
plt.show()

# 3. Identify differences
print("Analysis of differences:")
print("\nDocuments ranked highest by TF-IDF:")
for i in range(5):
    tfidf_doc = doc_scores[i][0]
    bm25_rank = next((j for j, (doc_idx, _) in enumerate(bm25_scores) if doc_idx == tfidf_doc), -1)
    print(f"Doc {tfidf_doc}: TF-IDF rank {i+1}, BM25 rank {bm25_rank+1}")

print("\nDocuments ranked highest by BM25:")
for i in range(5):
    bm25_doc = bm25_scores[i][0]
    tfidf_rank = next((j for j, (doc_idx, _) in enumerate(doc_scores) if doc_idx == bm25_doc), -1)
    print(f"Doc {bm25_doc}: BM25 rank {i+1}, TF-IDF rank {tfidf_rank+1}")

# 4. Possible explanations for the differences
print("\nWhat explains the difference?")
print("- BM25 normalizes term frequency by document length relative to the corpus average (parameter b)")
print("- TF-IDF weights grow linearly with term frequency, which can favor documents that repeat the query terms many times")
print("- BM25 saturates term frequency (parameter k1), so a term's contribution does not grow without bound")


## Part 5: Evaluation with a Relevant Query

### Activity

1. Choose a query and define which documents in the corpus should be considered relevant.
2. Evaluate Precision@3 or MAP for the rankings generated with TF-IDF and BM25.
3. Answer: which model gives better results with respect to your relevance criterion?
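
A small worked example can make the two metrics concrete. The ranking and relevance judgments below are hypothetical. Precision@k is the fraction of the top k results that are relevant, and average precision (AP) averages the precision measured at each rank where a relevant document appears, divided by the total number of relevant documents. MAP is the mean of AP over several queries, so with a single query the two coincide.

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k

def average_precision(ranked_ids, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

# Hypothetical ranking (best first) and relevance judgments.
ranking_toy = ["d3", "d1", "d7", "d2", "d5"]
relevant_toy = {"d3", "d2", "d9"}  # d9 is relevant but never retrieved

p_at_3 = precision_at_k(ranking_toy, relevant_toy, 3)
ap = average_precision(ranking_toy, relevant_toy)
print(f"P@3 = {p_at_3:.3f}, AP = {ap:.3f}")
```

Here only `d3` is relevant within the top 3, so P@3 is 1/3. For AP, the hits at ranks 1 and 4 contribute 1/1 and 2/4, and the unretrieved `d9` contributes nothing, giving (1 + 0.5) / 3 = 0.5.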

In [64]:
# 1. Choose a query and define the relevant documents
query_eval = "computer software programming"
print(f"Query for evaluation: '{query_eval}'")

# Relevance criterion: documents that contain at least 2 of the query terms
# (substring match on the lowercased text, so "computers" also counts for "computer")
relevant_docs = set()
query_terms = query_eval.lower().split()

for i, doc in enumerate(docs_sample):
    doc_lower = doc.lower()
    term_count = sum(1 for term in query_terms if term in doc_lower)
    if term_count >= 2:  # at least 2 query terms
        relevant_docs.add(i)

print(f"Documents considered relevant: {len(relevant_docs)}")

# 2. Compute Precision@3
def precision_at_k(ranking, relevant_docs, k):
    top_k = [doc_idx for doc_idx, _ in ranking[:k]]
    relevant_retrieved = len(set(top_k) & relevant_docs)
    return relevant_retrieved / k

precision_3_tfidf = precision_at_k(doc_scores, relevant_docs, 3)
precision_3_bm25 = precision_at_k(bm25_scores, relevant_docs, 3)

print("\nPrecision@3:")
print(f"TF-IDF: {precision_3_tfidf:.3f}")
print(f"BM25: {precision_3_bm25:.3f}")

# 3. Compute MAP (Mean Average Precision)
# With a single query, MAP equals the average precision (AP) of that query.
def calculate_map(ranking, relevant_docs):
    if not relevant_docs:
        return 0.0

    precisions = []
    relevant_retrieved = 0

    for i, (doc_idx, _) in enumerate(ranking):
        if doc_idx in relevant_docs:
            relevant_retrieved += 1
            precision = relevant_retrieved / (i + 1)
            precisions.append(precision)

    return sum(precisions) / len(relevant_docs) if precisions else 0.0

map_tfidf = calculate_map(doc_scores, relevant_docs)
map_bm25 = calculate_map(bm25_scores, relevant_docs)

print("\nMAP (Mean Average Precision):")
print(f"TF-IDF: {map_tfidf:.3f}")
print(f"BM25: {map_bm25:.3f}")

# 4. Final answer
if map_bm25 > map_tfidf:
    better_model = "BM25"
elif map_tfidf > map_bm25:
    better_model = "TF-IDF"
else:
    better_model = None  # tie

print("\nWhich model gives better results?")
if better_model is None:
    print("Both models obtain the same MAP for this query.")
else:
    print(f"According to the computed metrics, {better_model} performs better for this query.")
    print(f"This is because {better_model} places more relevant documents in higher positions of the ranking.")

Query for evaluation: 'computer software programming'
Documents considered relevant: 15

