# Exploring Ranking Models in Information Retrieval

## Objective
Understand the practical implementation and differences between the Vector Space Model and the Binary Independence Model in ranking documents relative to a user query.

### Step 1: Data Preprocessing

If the documents from the previous task are still loaded and preprocessed, reuse them; the data should be clean and ready for advanced querying.
Otherwise, write a function to load and preprocess the text documents from a specified directory. This step involves reading each file, converting the text to lowercase for uniform processing, and storing the results in a dictionary keyed by filename.
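
The cells below run the loading loop at module level. As a minimal sketch of the same steps wrapped in a function, the following may help; the name `load_documents` and its `corpus_dir` parameter are assumptions made for illustration and do not appear in the original notebook.

```python
import os
import re


def clean_text(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def load_documents(corpus_dir):
    """Return {filename: lowercased, cleaned text} for each .txt file in corpus_dir.

    Hypothetical helper: it mirrors the loop used in this notebook.
    """
    documents = {}
    for filename in sorted(os.listdir(corpus_dir)):
        if not filename.endswith(".txt"):
            continue  # ignore anything that is not a plain-text file
        file_path = os.path.join(corpus_dir, filename)
        with open(file_path, "r", encoding="utf-8") as handle:
            documents[filename] = clean_text(handle.read().lower())
    return documents
```

Calling `load_documents("../libros")` would then produce the same kind of dictionary that the `documents` variable holds below, assuming that directory exists.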

In [1]:
import os
import re
import collections
import pandas as pd
from numpy import dot
from numpy.linalg import norm
import numpy as np

In [2]:
# Define the path to the directory containing the text files
CORPUS_DIR = "../libros"
documents = {}

In [3]:
def clean_text(text):
    # Remove symbols and punctuation (anything that is not a word character or whitespace)
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    return cleaned_text

for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read().lower()
            cleaned_text = clean_text(text)
            documents[filename] = cleaned_text

In [4]:
# Dictionary mapping each document to its normalized word counts
normalized_word_counts = {}

In [5]:
for doc in documents:
    word_count = collections.Counter(documents[doc].split())
    total_words = sum(word_count.values())
    normalized_word_count = {word: count/total_words for word, count in word_count.items()}
    normalized_word_counts[doc] = normalized_word_count
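
To make the normalization concrete, here is a small worked sketch on a single toy sentence. The helper name `normalized_term_frequencies` and the sample text are assumptions for illustration only; the arithmetic is the same count-divided-by-total step used in the loop above.

```python
import collections


def normalized_term_frequencies(text):
    """Map each word to (occurrences of the word) / (total words in text)."""
    counts = collections.Counter(text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# "the" appears 2 times out of 5 words, every other word once.
tf = normalized_term_frequencies("the cat saw the dog")
print(tf)  # {'the': 0.4, 'cat': 0.2, 'saw': 0.2, 'dog': 0.2}
```

Because each value is a share of the document's total word count, the values for one document sum to 1, which keeps long and short books comparable.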

In [6]:
df = pd.DataFrame.from_dict(normalized_word_counts)

In [7]:
df_no_nan = df.fillna(0)
df_no_nan = df_no_nan.rename_axis('Files')
df_no_nan.head(10)

Files,pg100.txt,pg10676.txt,pg10681.txt,pg1080.txt,pg11.txt,pg1184.txt,pg120.txt,pg1232.txt,pg1259.txt,pg1260.txt,...,pg73447.txt,pg7370.txt,pg74.txt,pg76.txt,pg768.txt,pg84.txt,pg844.txt,pg8800.txt,pg98.txt,pg996.txt
the,0.031548,0.07061,0.024495,0.055297,0.061879,0.061475,0.064093,0.05887,0.059102,0.042232,...,0.088325,0.06104,0.053341,0.044282,0.039849,0.05609,0.033986,0.051372,0.059054,0.052207
project,0.000103,0.000965,0.000492,0.013824,0.002984,0.000233,0.001234,0.001662,0.0004,0.000478,...,0.002612,0.001479,0.001192,0.00078,0.000765,0.001127,0.00372,0.000788,0.000634,0.000221
gutenberg,9e-05,0.000745,0.000428,0.013514,0.00295,0.00019,0.00122,0.001643,0.000363,0.000462,...,0.002392,0.001462,0.001178,0.000762,0.000732,0.001114,0.003678,0.000779,0.000626,0.000205
ebook,1.3e-05,0.00011,6.9e-05,0.002019,0.000441,2.8e-05,0.000182,0.000246,7.3e-05,6.9e-05,...,0.000357,0.000218,0.000176,0.000114,0.000109,0.000166,0.00055,0.000116,9.4e-05,3.3e-05
of,0.019516,0.030938,0.019502,0.04023,0.021395,0.027768,0.025306,0.034317,0.025328,0.023714,...,0.051983,0.044183,0.021496,0.015634,0.019697,0.035357,0.021939,0.023071,0.029765,0.031301
complete,2.5e-05,5.9e-05,0.000138,0.0,0.0,8.6e-05,8.4e-05,9.4e-05,4.1e-05,0.000101,...,5.5e-05,1.7e-05,0.000149,0.0,5e-05,0.000102,4.2e-05,6.3e-05,4.3e-05,5.6e-05
works,6.9e-05,0.000355,0.000192,0.00497,0.001119,8.2e-05,0.000449,0.000793,0.000147,0.000196,...,0.000935,0.000588,0.000433,0.000315,0.000269,0.000512,0.001353,0.000394,0.000245,0.000139
william,9.2e-05,0.0,0.0,0.0,0.00017,0.0,0.0,0.0,4e-06,0.0,...,0.00033,3.4e-05,0.0,0.000131,0.0,0.00032,0.0,2.7e-05,0.0,0.0
shakespeare,9e-06,0.0,1.5e-05,0.0,3.4e-05,6e-06,0.0,0.0,4e-06,0.0,...,0.0,0.0,0.0,1.8e-05,0.0,1.3e-05,0.0,0.0,0.0,1.4e-05
this,0.007402,0.003393,0.000409,0.011028,0.005968,0.005259,0.004938,0.006837,0.004799,0.003636,...,0.005718,0.006252,0.004483,0.00255,0.002801,0.005711,0.004988,0.005239,0.004241,0.006449


In [8]:
df_no_nan.to_csv("vect_normalized.csv",index=True)

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.
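
The three requirements above can be sketched on a toy corpus as follows. This is an illustrative sketch, not the notebook's implementation: the function names (`tf_vector`, `cosine_similarity`, `rank_by_cosine`) and the three sample documents are assumptions, and raw term counts stand in for "simple term frequency".

```python
import collections

import numpy as np


def tf_vector(text, vocabulary):
    """Raw term-frequency vector of text over a fixed, ordered vocabulary."""
    counts = collections.Counter(text.lower().split())
    return np.array([counts[term] for term in vocabulary], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def rank_by_cosine(query, docs):
    """Return (doc_name, score) pairs sorted from most to least similar."""
    # One dimension per distinct term in the corpus.
    vocabulary = sorted({t for text in docs.values() for t in text.lower().split()})
    q = tf_vector(query, vocabulary)
    scores = [(name, cosine_similarity(q, tf_vector(text, vocabulary)))
              for name, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, used only to exercise the functions.
toy_docs = {
    "d1.txt": "three blind mice",
    "d2.txt": "three three little pigs",
    "d3.txt": "the little prince",
}
ranking = rank_by_cosine("three little", toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here `d2.txt` ranks first because it contains both query terms. Note that cosine similarity is sorted from highest to lowest, whereas a distance measure would be sorted from lowest to highest.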

In [9]:
df = pd.read_csv("vect_normalized.csv")

In [10]:
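import math
def euclidean_distance(vector1, vector2):
    # Straight-line distance between two vectors; a smaller value means they are more alike.
    # zip() stops at the shorter vector, so both inputs should have the same length.
    squared_diff = sum((x - y)**2 for x, y in zip(vector1, vector2))
    distance = math.sqrt(squared_diff)
    return distance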

In [11]:
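def get_query_vector(query):
    # Row for the query term: its normalized frequency in each document.
    # [1:] drops the leading 'Files' label so only the numeric values remain.
    query_vector = df.loc[df['Files'] == query].values[0][1:]
    return query_vector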

In [12]:
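def rank_documents(query):
    query_vector = get_query_vector(query)
    similarities = []  # (document name, Euclidean distance) pairs

    # Column 0 is 'Files'; every remaining column holds one document's term frequencies
    for i in range(df.shape[1] - 1):
        doc_vector = df.iloc[:, i + 1].values
        distance = euclidean_distance(query_vector, doc_vector)
        similarities.append((df.columns[i + 1], distance))

    # Ascending sort: the smallest distance (closest document) comes first
    ranked_documents = sorted(similarities, key=lambda x: x[1])
    return ranked_documents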

In [13]:
df.head(10)

Unnamed: 0,Files,pg100.txt,pg10676.txt,pg10681.txt,pg1080.txt,pg11.txt,pg1184.txt,pg120.txt,pg1232.txt,pg1259.txt,...,pg73447.txt,pg7370.txt,pg74.txt,pg76.txt,pg768.txt,pg84.txt,pg844.txt,pg8800.txt,pg98.txt,pg996.txt
0,the,0.031548,0.07061,0.024495,0.055297,0.061879,0.061475,0.064093,0.05887,0.059102,...,0.088325,0.06104,0.053341,0.044282,0.039849,0.05609,0.033986,0.051372,0.059054,0.052207
1,project,0.000103,0.000965,0.000492,0.013824,0.002984,0.000233,0.001234,0.001662,0.0004,...,0.002612,0.001479,0.001192,0.00078,0.000765,0.001127,0.00372,0.000788,0.000634,0.000221
2,gutenberg,9e-05,0.000745,0.000428,0.013514,0.00295,0.00019,0.00122,0.001643,0.000363,...,0.002392,0.001462,0.001178,0.000762,0.000732,0.001114,0.003678,0.000779,0.000626,0.000205
3,ebook,1.3e-05,0.00011,6.9e-05,0.002019,0.000441,2.8e-05,0.000182,0.000246,7.3e-05,...,0.000357,0.000218,0.000176,0.000114,0.000109,0.000166,0.00055,0.000116,9.4e-05,3.3e-05
4,of,0.019516,0.030938,0.019502,0.04023,0.021395,0.027768,0.025306,0.034317,0.025328,...,0.051983,0.044183,0.021496,0.015634,0.019697,0.035357,0.021939,0.023071,0.029765,0.031301
5,complete,2.5e-05,5.9e-05,0.000138,0.0,0.0,8.6e-05,8.4e-05,9.4e-05,4.1e-05,...,5.5e-05,1.7e-05,0.000149,0.0,5e-05,0.000102,4.2e-05,6.3e-05,4.3e-05,5.6e-05
6,works,6.9e-05,0.000355,0.000192,0.00497,0.001119,8.2e-05,0.000449,0.000793,0.000147,...,0.000935,0.000588,0.000433,0.000315,0.000269,0.000512,0.001353,0.000394,0.000245,0.000139
7,william,9.2e-05,0.0,0.0,0.0,0.00017,0.0,0.0,0.0,4e-06,...,0.00033,3.4e-05,0.0,0.000131,0.0,0.00032,0.0,2.7e-05,0.0,0.0
8,shakespeare,9e-06,0.0,1.5e-05,0.0,3.4e-05,6e-06,0.0,0.0,4e-06,...,0.0,0.0,0.0,1.8e-05,0.0,1.3e-05,0.0,0.0,0.0,1.4e-05
9,this,0.007402,0.003393,0.000409,0.011028,0.005968,0.005259,0.004938,0.006837,0.004799,...,0.005718,0.006252,0.004483,0.00255,0.002801,0.005711,0.004988,0.005239,0.004241,0.006449


In [14]:
query = str(input("Enter the word to search for: "))
ranked_docs = rank_documents(query)
print(f"Documents ranked by similarity to the query '{query}':")
for doc, score in ranked_docs:
    print(f"{doc}: {score}")

Documents ranked by similarity to the query 'three':
pg2000.txt: 0.017308322793281407
pg20228.txt: 0.0321925809569759
pg47629.txt: 0.032485335101475325
pg10681.txt: 0.03634928853441162
pg1513.txt: 0.05972695105671366
pg100.txt: 0.061166906877525025
pg5740.txt: 0.06265589992226044
pg67979.txt: 0.06345821310642252
pg844.txt: 0.06722406469796459
pg45.txt: 0.06918912487394976
pg2641.txt: 0.07039266512721219
pg2554.txt: 0.07062984196153223
pg67098.txt: 0.07093867121163172
pg21700.txt: 0.07107018443921771
pg16389.txt: 0.0713104933301948
pg2542.txt: 0.07164595000031555
pg145.txt: 0.07189486303604609
pg600.txt: 0.0724668247590083
pg1260.txt: 0.07273313031698485
pg3825.txt: 0.07276164759819782
pg768.txt: 0.07282193858575037
pg1342.txt: 0.07301142361266386
pg28054.txt: 0.07321805346129127
pg64317.txt: 0.07354189924434201
pg16.txt: 0.07370620644955433
pg8800.txt: 0.07386188461628561
pg37106.txt: 0.07405278595828135
pg514.txt: 0.07405361305566266
pg394.txt: 0.07415471388384229
pg30254.txt: 

### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
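
A minimal sketch of these requirements follows, under stated assumptions. For each query term that is present in a document, the score is multiplied by the odds ratio p_t(1 - u_t) / (u_t(1 - p_t)), where p_t is the assumed probability that the term appears in a relevant document and u_t the assumed probability that it appears in a non-relevant one. The defaults of 0.75 and 0.25, the function names, and the toy corpus are all arbitrary choices made for illustration; they are not estimated from the corpus used in this notebook.

```python
def binary_terms(text):
    """Binary representation: the set of terms present in text."""
    return set(text.lower().split())


def bim_score(query_terms, doc_terms, p, u, default_p=0.75, default_u=0.25):
    """Product of odds ratios over query terms that occur in the document.

    p[t]: assumed P(term present | relevant); u[t]: assumed P(term present | non-relevant).
    Terms missing from p or u fall back to the arbitrary defaults.
    """
    score = 1.0  # empty product: documents matching no query term score 1.0
    for term in query_terms & doc_terms:
        p_t = p.get(term, default_p)
        u_t = u.get(term, default_u)
        score *= (p_t * (1 - u_t)) / (u_t * (1 - p_t))
    return score


def rank_by_bim(query, docs, p=None, u=None):
    """Return (doc_name, score) pairs sorted from highest to lowest score."""
    p, u = p or {}, u or {}
    q = binary_terms(query)
    scores = [(name, bim_score(q, binary_terms(text), p, u))
              for name, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, used only to exercise the functions.
toy_docs = {
    "d1.txt": "three blind mice",
    "d2.txt": "three three little pigs",
    "d3.txt": "the little prince",
}
ranking = rank_by_bim("three little", toy_docs)
for name, score in ranking:
    print(f"{name}: {score}")
```

With the default probabilities, each matching query term contributes a factor of 9, so `d2.txt` (two matching terms) scores 81 while the other two documents score 9. Unlike the term-frequency vectors of Step 2, repeating "three" inside a document does not change its score, because the representation only records presence or absence.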