# Exploring Ranking Models in Information Retrieval

## Objective
Implement the Vector Space Model and the Binary Independence Model, and understand how the two differ when ranking documents against a user query.

### Step 1: Data Preprocessing

Make sure the documents from the previous task are still loaded and preprocessed, so the data is clean and ready for querying.
If they are not, load and preprocess the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary keyed by filename.

In [48]:
import os
import re
# Define the path to the directory containing the text files
CORPUS_DIR = '../../week01/data'
documents = {}
for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            # Lowercase the text and strip punctuation
            # (assumed intent: keep only word characters and whitespace)
            documents[filename] = re.sub(r"[^\w\s]", "", file.read().lower())
            # documents[filename] = file.read().lower()  # Read and convert to lowercase
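
The effect of a punctuation-stripping pattern is easiest to check on a short string. The sketch below assumes the goal is to keep only word characters and whitespace; the sample sentence and variable names are illustrative and not part of the corpus. For contrast, it also applies a pattern such as `^[\w| ]`, where the caret sits outside the brackets and therefore anchors the match to the start of the string instead of negating the character class.

```python
import re

sample = "Hello, World! It's 9 o'clock."

# Negated class: remove every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", sample.lower())

# Anchored class: matches at most one character, and only at the very start of the string.
anchored = re.sub(r"^[\w| ]", "", sample.lower())

print(cleaned)
print(anchored)
```

The first call leaves the words intact and drops the punctuation; the second only removes the leading character, so the punctuation survives.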

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.

In [47]:
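import collections

# Build the corpus vocabulary: the set of distinct terms across all documents
dictionary = set()
for doc in documents:
    word_count = collections.Counter(documents[doc].split())  # term frequencies for this document
    dictionary.update(word_count.keys())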
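
The remaining requirements (term-frequency vectors, cosine similarity, ranking) could be sketched as follows. This is a minimal illustration, not the notebook's own solution: `toy_documents` is a hypothetical three-document corpus standing in for `documents`, and the helper names `tf_vector`, `cosine_similarity`, and `rank_documents` are assumptions introduced here.

```python
import collections
import math

# Hypothetical toy corpus standing in for the `documents` dictionary.
toy_documents = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
    "c.txt": "birds sing at dawn",
}

def tf_vector(text, vocabulary):
    """Return raw term counts for `text`, one entry per vocabulary term."""
    counts = collections.Counter(text.split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query, docs):
    """Rank documents by cosine similarity to the query, highest first."""
    vocabulary = sorted({term for text in docs.values() for term in text.split()})
    query_vec = tf_vector(query.lower(), vocabulary)
    scored = [
        (name, cosine_similarity(query_vec, tf_vector(text, vocabulary)))
        for name, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_documents("cat on the mat", toy_documents)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because cosine similarity divides by both vector lengths, a long document is not favored simply for repeating a term more often; a document sharing no terms with the query scores zero.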

### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
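
To make the scoring rule concrete, here is a small worked sketch on a hand-built corpus. Everything in it is an assumption for illustration: the three-term vocabulary, the binary vectors, the fixed probabilities (0.8 and 0.2), and the helper `toy_score`. It follows the simplified product rule described in the requirements, not the log-odds form found in textbook treatments of the model.

```python
import numpy as np

# Hypothetical vocabulary and binary document vectors (rows = documents).
toy_vocabulary = ["cat", "dog", "mat"]
toy_binary_docs = np.array([
    [1, 0, 1],  # doc 0 contains "cat" and "mat"
    [1, 1, 0],  # doc 1 contains "cat" and "dog"
    [0, 1, 0],  # doc 2 contains only "dog"
])
toy_binary_query = np.array([1, 0, 1])  # query: "cat mat"

# Arbitrary, fixed probabilities (assumed values, not estimated from data).
toy_prob_rel = np.array([0.8, 0.8, 0.8])
toy_prob_non_rel = np.array([0.2, 0.2, 0.2])

def toy_score(doc):
    """Multiply one factor per query term: prob_rel if the document has it, else prob_non_rel."""
    factors = np.where(doc == 1, toy_prob_rel, toy_prob_non_rel)
    return float(np.prod(factors[toy_binary_query == 1]))

toy_scores = [toy_score(doc) for doc in toy_binary_docs]
toy_ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)

for i in toy_ranking:
    print(f"doc {i}: {toy_scores[i]:.2f}")
```

Under this rule, only the query terms contribute to the score, so any two documents that contain the same subset of query terms receive exactly the same score, however different the rest of their text is.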

In [1]:
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def read_documents(folder_path):
    texts = []
    filenames = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                texts.append(file.read())
                filenames.append(filename)
    return filenames, texts

def binarize_documents(vectorizer, corpus):
    """Fit the vectorizer on the corpus and return a 0/1 document-term matrix."""
    X = vectorizer.fit_transform(corpus).toarray()
    return (X > 0).astype(int)  # Convert counts to binary

def calculate_relevance_scores(binary_docs, binary_query, prob_rel, prob_non_rel):
    """Score each document as a product of one probability per query term."""
    scores = []
    for doc in binary_docs:
        score = 1
        for term_index, term_presence in enumerate(binary_query):
            if term_presence == 1:
                # Query term found in the document: use prob_rel; otherwise use prob_non_rel
                score *= prob_rel[term_index] if doc[term_index] == 1 else prob_non_rel[term_index]
        scores.append(score)
    return scores

# Main execution
folder_path = '../../week01/data'
filenames, corpus = read_documents(folder_path)
vectorizer = CountVectorizer(binary=True)
binary_docs = binarize_documents(vectorizer, corpus)


In [4]:
# Example query
query = input("Enter your query: ")
binary_query = vectorizer.transform([query]).toarray()[0]

# Assume arbitrary per-term probabilities (unseeded random draws, so scores change between runs)
prob_rel = np.random.uniform(0.7, 0.9, len(vectorizer.get_feature_names_out()))
prob_non_rel = np.random.uniform(0.1, 0.3, len(vectorizer.get_feature_names_out()))

scores = calculate_relevance_scores(binary_docs, binary_query, prob_rel, prob_non_rel)
sorted_indices = np.argsort(scores)[::-1]

# Output ranked documents
for index in sorted_indices:
    print(f"{filenames[index]}: {scores[index]}")

Wuthering Heights.txt: 0.720964100971606
Winnie-the-Pooh.txt: 0.720964100971606
he Adventures of Sherlock Holmes.txt: 0.720964100971606
Heart of Darkness.txt: 0.720964100971606
History of Tom Jones, a Foundling.txt: 0.720964100971606
History of Woman Suffrage, Volume III.txt: 0.720964100971606
Jane Eyre- An Autobiography.txt: 0.720964100971606
John Dewey's logical theory.txt: 0.720964100971606
Kentucky in American Letters.txt: 0.720964100971606
lato and the Other Companions of Sokrates.txt: 0.720964100971606
Leviathan.txt: 0.720964100971606
Little Women.txt: 0.720964100971606
Little Women; Or, Meg, Jo, Beth, and Amy.txt: 0.720964100971606
Magazine of western history.txt: 0.720964100971606
Memoirs of a London doll.txt: 0.720964100971606
Metamorphosis.txt: 0.720964100971606
Middlemarch.txt: 0.720964100971606
Moby Dick.txt: 0.720964100971606
My Life — Volume 1.txt: 0.720964100971606
Narrative of the Life of Frederick Douglass.txt: 0.720964100971606
Noli Me Tangere.txt: 0.720964100971606
N