**Pre-Requisites:**


*   Create a new folder named "DCU_IR_PROJECT" in your Google Drive account.
*   Download and unzip the Cranfield Collection files (cran.all.1400, cran.qry, cranqrel, and cranqrel.readme) from https://github.com/terrierteam/jtreceval, then upload them to the above folder.


*   Run the command below to mount Google Drive with your Google account, granting the requested access when prompted.

*   Update the Colab notebook settings to use a GPU/TPU for faster processing.





In [1]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




*   Install the pytrec_eval evaluation module, a Python interface to TREC's evaluation tool (trec_eval), which this project uses for evaluation.



In [2]:
# Install pytrec_eval for the model evaluations
!pip install pytrec_eval

Collecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pytrec_eval
  Building wheel for pytrec_eval (setup.py) ... done
  Created wheel for pytrec_eval: filename=pytrec_eval-0.5-cp310-cp310-linux_x86_64.whl size=308109 sha256=6b16b1aa25fa6e965862136367b4bc36aeaf3a798d90b2e933e76049b7b22ca5
  Stored in directory: /root/.cache/pip/wheels/51/3a/cd/dcc1ddfc763987d5cb237165d8ac249aa98a23ab90f67317a8
Successfully built pytrec_eval
Installing collected packages: pytrec_eval
Successfully installed pytrec_eval-0.5




*   Import the required Python packages.



In [3]:
import os
import nltk
import math
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from collections import defaultdict
from operator import itemgetter
import pytrec_eval
import pandas as pd



*   Download the NLTK "stopwords" corpus, which provides a list of common English stop words such as "a", "an", "the", "of", and "in". Stop words are very frequent words that carry little information about the topic of a text, so they are removed before indexing.
*   Download the "Punkt" tokenizer models. Punkt splits text into sentences using an unsupervised algorithm that learns abbreviations, collocations, and sentence-starting words; `word_tokenize` relies on it.



In [4]:
# Download NLTK "StopWords" and "Punkt" packages
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True



*   Declare the file paths that the rest of the notebook uses and re-uses, such as the paths to the Cranfield files and the output folders.



In [5]:
# Set the file paths
data_path = '/content/drive/MyDrive/DCU_IR_PROJECT'
collection_file = os.path.join(data_path, 'cran.all.1400')
queries_file = os.path.join(data_path, 'cran.qry')
relevance_file = os.path.join(data_path, 'cranqrel')
original_document_path = os.path.join(data_path, 'originalDocument')
tokenized_document_path = os.path.join(data_path, 'tokenizedDocument')
output_dir = os.path.join(data_path, 'output')

*  Read the content of the Cranfield collection file and split the content into individual documents based on the delimiter ".I".
*  For each document, extract the document ID and content and store each document in a dictionary, where the document ID is the key and the content is the value.
*  Create a folder named "originalDocument" (if it doesn't exist) to store individual document files with their content and write each document's content to a separate text file within the "originalDocument" folder, using the document ID as the filename.
*  Return the collection of documents as a dictionary.

In [6]:
# Load the Cranfield collection
def load_cranfield_collection(collection_file):
    with open(collection_file, 'r', encoding='utf-8') as file:
        documents = file.read().split('.I')[1:]

    collection = {}
    for i, doc in enumerate(documents, start=1):
        '''
        #Exiting the loop after 10 iterations
        if i > 10:  # Exit the loop after 10 iterations
            print("Exiting the loop after 10 iterations.")
            break
        '''
        doc_id, content = doc.split('.T\n')
        doc_id = doc_id.strip()
        content = content.strip()
        collection[doc_id] = content

        # Create a new folder "originalDocument" if it doesn't exist
        if not os.path.exists(original_document_path):
            os.makedirs(original_document_path)

        # Write the content to a new text file with doc_id as the file name
        doc_file = os.path.join(original_document_path, f'{doc_id}.txt')
        with open(doc_file, 'w', encoding='utf-8') as f:
            f.write(content)

    return collection
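
To see what the split logic above produces, the following minimal sketch applies the same `.I` / `.T` splitting to a tiny, made-up two-document string in the Cranfield layout. The sample text and the helper name `parse_cranfield_text` are illustrative assumptions, not part of the notebook. Note that the stored content keeps the `.A`, `.B`, and `.W` field markers, because only `.I` and `.T` are split on.

```python
def parse_cranfield_text(raw_text):
    """Split Cranfield-style text into {doc_id: content} (hypothetical helper)."""
    collection = {}
    for doc in raw_text.split('.I')[1:]:
        # maxsplit=1 keeps the unpacking safe if the title marker ever appears twice
        doc_id, content = doc.split('.T\n', 1)
        collection[doc_id.strip()] = content.strip()
    return collection

# Made-up sample in the Cranfield layout (.I id, .T title, .A author, .B source, .W body)
sample = (
    ".I 1\n.T\nshock wave study\n.A\nsmith,j.\n.B\njournal 1950\n.W\nshock wave study of flow .\n"
    ".I 2\n.T\nboundary layer\n.A\njones,k.\n.B\nreport 1951\n.W\nboundary layer in slip flow .\n"
)

toy_collection = parse_cranfield_text(sample)
print(list(toy_collection))   # document Ids
print(toy_collection["1"])    # title, field markers, and body of document 1
```

Because the field markers stay in the content, tokens such as ".a" or ".w" may end up in the index unless they are stripped during pre-processing.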

This pre-processing function tokenizes the document content, removes stop words, and stems the remaining tokens:
*  Tokenize: break the lower-cased content into individual words or tokens.
*  Remove stop words: drop common English words such as "the", "is", and "and" that contribute little to the meaning of the text.
*  Stem the tokens: reduce each token to its root form using the Porter stemming algorithm.
*  Create a folder named "tokenizedDocument" (if it doesn't exist) and write the pre-processed tokens to a text file, using the document Id as the filename.
*  Return the list of stemmed tokens.


In [7]:
# Function to pre-process the source documents
def preprocess_document(doc_id, doc_content):
    tokens = word_tokenize(doc_content.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Create a new folder "tokenizedDocument" if it doesn't exist
    if not os.path.exists(tokenized_document_path):
        os.makedirs(tokenized_document_path)

    # Write the tokens to a new text file with doc_id as the file name
    doc_file = os.path.join(tokenized_document_path, f'{doc_id}.txt')
    with open(doc_file, 'w', encoding='utf-8') as f:
        f.write(' '.join(stemmed_tokens))

    return stemmed_tokens

*  Read the queries from the Cranfield query file (cran.qry), split the content into individual queries on the ".I" delimiter, and return a dictionary with query Ids as the keys and query content as the values.

In [8]:
# Load the queries
def load_queries(queries_file):
    with open(queries_file, 'r', encoding='utf-8') as file:
        queries = file.read().split('.I')[1:]

    queries_dict = {}
    for query in queries:
        query_id, content = query.split('.W\n')
        queries_dict[query_id.strip()] = content.strip()

    return queries_dict

*  Calculate the Term Frequency (TF) of each token by counting its occurrences, and return a dictionary holding the TF value for each token.

In [9]:
# Function to calculate term frequency (TF)
def calculate_tf(query_tokens):
    tf = defaultdict(int)
    for token in query_tokens:
        tf[token] += 1
    return tf
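
As a small illustration of the counting step, the sketch below uses `collections.Counter` from the standard library, which produces the same counts as the loop above. The token list is made up for the example.

```python
from collections import Counter

# Made-up, already pre-processed tokens for one document
tokens = ["flow", "boundari", "layer", "flow", "flow"]

tf = Counter(tokens)
print(tf["flow"], tf["layer"], tf["wing"])  # a missing token counts as 0
```

Unlike a `defaultdict(int)`, looking up a missing token in a `Counter` does not insert a new key.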

*  Take the inverted index, a dictionary that maps each token to the list of document Ids containing that token.
*  Calculate the Inverse Document Frequency (IDF) for each token as the natural log of the total number of documents divided by the token's document frequency plus one, i.e. log(N / (df + 1)), and return a dictionary holding the IDF value for each token.

In [10]:
# Function to calculate inverse document frequency (IDF)
def calculate_idf(inverted_index, num_documents):
    idf = {}
    for token, doc_ids in inverted_index.items():
        idf[token] = math.log(num_documents / (len(doc_ids) + 1))
    return idf
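
A short worked example may help to see how the "plus one" in the denominator behaves. The sketch below repeats the same formula on a made-up inverted index with four documents; the names `toy_index` and `toy_idf` are illustrative only.

```python
import math

def calculate_idf(inverted_index, num_documents):
    # Same formula as above: log(N / (df + 1))
    return {token: math.log(num_documents / (len(doc_ids) + 1))
            for token, doc_ids in inverted_index.items()}

# Made-up inverted index over a collection of 4 documents
toy_index = {
    "shock": ["1"],                  # df = 1 -> log(4 / 2)
    "flow": ["1", "2", "3"],         # df = 3 -> log(4 / 4) = 0
    "wing": ["1", "2", "3", "4"],    # df = 4 -> log(4 / 5) < 0
}
toy_idf = calculate_idf(toy_index, 4)
for token, value in toy_idf.items():
    print(token, round(value, 4))
```

With this variant, a token that occurs in every document receives a slightly negative weight, and a token that occurs in all but one document receives a weight of zero.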

*  Build the inverted index by processing the documents to obtain the tokens, and then, for each token, append the document Ids to the corresponding list for that token.

In [11]:
# Function to build inverted index
def build_inverted_index(documents):
    inverted_index = defaultdict(list)
    for doc_id, doc_content in documents.items():
        tokens = preprocess_document(doc_id, doc_content)
        for token in tokens:
            if doc_id not in inverted_index[token]:
                inverted_index[token].append(doc_id)
    return inverted_index
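
The sketch below shows the resulting structure on three made-up, already tokenized documents. It is a variation rather than the notebook's function: it collects document Ids in a set, which avoids the repeated list membership check, and the helper name `build_toy_inverted_index` is an assumption for illustration.

```python
from collections import defaultdict

def build_toy_inverted_index(tokenized_docs):
    """Map each token to the sorted list of document Ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)  # a set ignores repeated Ids
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}

# Made-up pre-processed documents
toy_docs = {
    "1": ["shock", "wave", "flow"],
    "2": ["flow", "layer"],
    "3": ["wave", "flow", "flow"],
}
toy_index = build_toy_inverted_index(toy_docs)
print(toy_index["flow"])
print(toy_index["wave"])
```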

*  Build a term-document matrix for the VSM model: iterate through each document, pre-process its content into tokens, calculate the TF of each token, and compute the TF-IDF weight (TF multiplied by IDF) for each term in the document. The matrix is stored as a nested dictionary with tokens as rows, document Ids as columns, and TF-IDF weights as values.

In [12]:
# Function to build the term-document matrix for the Vector Space Model
def build_term_document_matrix(documents, inverted_index, idf):
    term_document_matrix = defaultdict(dict)
    for doc_id, doc_content in documents.items():
        tokens = preprocess_document(doc_id, doc_content)
        tf = calculate_tf(tokens)
        for token, freq in tf.items():
            tfidf = freq * idf[token]
            term_document_matrix[token][doc_id] = tfidf
    return term_document_matrix

*  Calculate the BM25 score between a given query and a document using the BM25 ranking formula. For each query token found in the inverted index, compute its IDF from the document frequency and its TF in the document, then combine them with length normalization controlled by `k1` and `b`. The final score is the sum of the scores for all tokens in the query.

In [13]:
# Function to calculate the BM25 score for a query and a document
def calculate_bm25_score(query_tokens, doc_tokens, inverted_index, num_documents, avg_doc_length, k1=1.5, b=0.75):
    score = 0
    doc_len = len(doc_tokens)
    for token in query_tokens:
        if token in inverted_index:
            df = len(inverted_index[token])
            idf = math.log((num_documents - df + 0.5) / (df + 0.5) + 1.0)
            tf = doc_tokens.count(token)
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * (doc_len / avg_doc_length))
            score += idf * (numerator / denominator)
    return score
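
To make the behaviour of the formula concrete, the sketch below repeats the function (so that the example runs on its own) and scores three made-up documents against a two-token query. The toy documents and index are assumptions for illustration.

```python
import math

def calculate_bm25_score(query_tokens, doc_tokens, inverted_index,
                         num_documents, avg_doc_length, k1=1.5, b=0.75):
    score = 0
    doc_len = len(doc_tokens)
    for token in query_tokens:
        if token in inverted_index:
            df = len(inverted_index[token])
            idf = math.log((num_documents - df + 0.5) / (df + 0.5) + 1.0)
            tf = doc_tokens.count(token)
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * (doc_len / avg_doc_length))
            score += idf * (numerator / denominator)
    return score

# Made-up pre-processed documents and the matching inverted index
toy_docs = {
    "1": ["shock", "flow", "flow"],
    "2": ["flow", "layer", "layer", "layer", "layer"],
    "3": ["wing"],
}
toy_index = {"shock": ["1"], "flow": ["1", "2"], "layer": ["2"], "wing": ["3"]}
avg_len = sum(len(tokens) for tokens in toy_docs.values()) / len(toy_docs)

query = ["shock", "flow"]
toy_scores = {doc_id: calculate_bm25_score(query, tokens, toy_index, len(toy_docs), avg_len)
              for doc_id, tokens in toy_docs.items()}
for doc_id, score in toy_scores.items():
    print(doc_id, round(score, 4))
```

Document 1 matches both query tokens and should score highest, document 2 matches only "flow" and is longer than average, and document 3 matches nothing and scores zero.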

*  Calculate the language model (LM) score between a given query and a document by iterating through each token in the query and computing the token's probability as a linear interpolation of the document's language model and the collection's language model, weighted by `lambd` (a scheme commonly called Jelinek-Mercer smoothing). Despite the 'bm25_lm' file prefix used later, this score does not involve BM25. The final score is the product of the probabilities for all tokens in the query.

In [14]:
# Function to calculate the Language Model score (linear interpolation smoothing) for a query and a document
def calculate_language_model_score(query_tokens, doc_tokens, token_freq, total_token_count, lambd=0.5):
    score = 1
    doc_len = len(doc_tokens)
    for token in query_tokens:
        tf = doc_tokens.count(token)
        prob = (1 - lambd) * (tf / doc_len) + lambd * (token_freq[token] / total_token_count)
        score *= prob
    return score
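
Multiplying many probabilities below 1 can underflow for long queries, so a common variation sums log-probabilities instead, which gives the same ranking. The sketch below is one possible way to do this; the function name `log_language_model_score`, the guard for empty documents, and the toy data are assumptions, not part of the notebook.

```python
import math

def log_language_model_score(query_tokens, doc_tokens, token_freq,
                             total_token_count, lambd=0.5):
    """Sum of log-probabilities under the same interpolated model (hypothetical variant)."""
    doc_len = len(doc_tokens)
    log_score = 0.0
    for token in query_tokens:
        p_doc = doc_tokens.count(token) / doc_len if doc_len else 0.0
        p_coll = token_freq.get(token, 0) / total_token_count
        prob = (1 - lambd) * p_doc + lambd * p_coll
        if prob == 0.0:
            # Token unseen in the whole collection: the product would be zero
            return float("-inf")
        log_score += math.log(prob)
    return log_score

# Made-up collection statistics: 5 tokens in total
toy_token_freq = {"flow": 3, "shock": 1, "layer": 1}
doc_1 = ["shock", "flow", "flow"]
doc_2 = ["flow", "layer"]

s1 = log_language_model_score(["shock", "flow"], doc_1, toy_token_freq, 5)
s2 = log_language_model_score(["shock", "flow"], doc_2, toy_token_freq, 5)
s_unseen = log_language_model_score(["wing"], doc_1, toy_token_freq, 5)
print(s1 > s2, s_unseen)
```

Exponentiating `s1` recovers the product that the original function would return for the same inputs.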

*  Rank documents using the VSM model: for each query token, add the token's TF-IDF weight from the term-document matrix to the score of every document containing it. The accumulated scores are then sorted in descending order to get the ranked list of documents.

In [15]:
# Function to rank documents using the Vector Space Model
def rank_documents_vsm(query_tokens, term_document_matrix):
    scores = defaultdict(float)
    for token in query_tokens:
        if token in term_document_matrix:
            for doc_id, tfidf in term_document_matrix[token].items():
                scores[doc_id] += tfidf
    ranked_docs = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return ranked_docs
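
The accumulation step can be traced on a small, made-up term-document matrix, as in the sketch below. The weights and the helper name `rank_toy` are illustrative assumptions. Note that the score is a plain sum of TF-IDF weights for the matching terms; no document length normalization (such as cosine similarity) is applied.

```python
from collections import defaultdict
from operator import itemgetter

# Made-up TF-IDF weights: token -> {doc_id: weight}
toy_matrix = {
    "shock": {"1": 0.69},
    "wave": {"1": 0.29, "3": 0.29},
    "flow": {"1": 0.0, "2": 0.0, "3": 0.0},
}

def rank_toy(query_tokens, matrix):
    scores = defaultdict(float)
    for token in query_tokens:
        # Tokens missing from the matrix contribute nothing
        for doc_id, weight in matrix.get(token, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=itemgetter(1), reverse=True)

ranked = rank_toy(["shock", "wave", "flow", "unseen"], toy_matrix)
print([doc_id for doc_id, _ in ranked])
```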

*  As with the VSM ranking, rank documents using the BM25 model: pre-process each document in the collection, calculate its BM25 score against the query tokens, and sort the documents in descending order of score to get the ranked list.

In [16]:
# Function to rank documents using BM25
def rank_documents_bm25(query_tokens, documents, inverted_index, num_documents, avg_doc_length):
    scores = defaultdict(float)
    for doc_id, doc_content in documents.items():
        doc_tokens = preprocess_document(doc_id, doc_content)
        score = calculate_bm25_score(query_tokens, doc_tokens, inverted_index, num_documents, avg_doc_length)
        scores[doc_id] = score
    ranked_docs = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return ranked_docs

*  As with the previous rankings, rank documents using the LM model: pre-process each document in the collection, calculate its LM score against the query tokens, and sort the documents in descending order of score to get the ranked list.

In [17]:
# Function to rank documents using Language Model (LM)
def rank_documents_language_model(query_tokens, documents, token_freq, total_token_count):
    scores = defaultdict(float)
    for doc_id, doc_content in documents.items():
        doc_tokens = preprocess_document(doc_id, doc_content)
        score = calculate_language_model_score(query_tokens, doc_tokens, token_freq, total_token_count)
        scores[doc_id] = score
    ranked_docs = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return ranked_docs

*  Write the ranked documents to an output file in the TREC run format, which pytrec_eval later uses to compute the MAP, P@5, and NDCG evaluation metrics. One output file is created per retrieval model (VSM, BM25, and LM), containing the ranked documents and their similarity scores. The file is opened in append mode, so each call adds one query's results to the model's file.

In [18]:
# Function to write the output file in the required format
def write_output_file(query_id, ranked_docs, model_name, results_dir):
    # Create a new folder "results" if it doesn't exist
    if not os.path.exists(results_dir):
        os.makedirs(results_dir)

    output_file = os.path.join(results_dir, f'{model_name}_output.txt')
    with open(output_file, 'a') as f:
        for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
            # trec_eval ignores the iter, rank, and run_id fields (it ranks by the similarity score),
            # so iter and run_id are set to fixed values and rank is the position in the ranked list.
            iter_value = 0
            rank_value = rank
            similarity_value = score
            run_id_value = "IR_System"
            f.write(f"{query_id} {iter_value} {doc_id} {rank_value} {similarity_value:.6f} {run_id_value}\n")
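
For reference, the sketch below formats a single line in the same six-column layout (query Id, iter, document Id, rank, similarity, run Id). The helper name `format_run_line` and the sample values are assumptions for illustration.

```python
def format_run_line(query_id, doc_id, rank, score, run_id="IR_System"):
    """Return one run-file line in the layout written above (hypothetical helper)."""
    return f"{query_id} 0 {doc_id} {rank} {score:.6f} {run_id}"

line = format_run_line("1", "184", 1, 12.3456789)
print(line)  # 1 0 184 1 12.345679 IR_System
```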

*  Read the Cranfield relevance file (cranqrel), extract the relevance judgments, and store them in a nested dictionary, which will make the relevance scores for specific queries and documents easily available at the time of the evaluation process.

In [19]:
# Load the relevance judgments
def load_relevance_judgments(relevance_file):
    relevance_judgments = defaultdict(dict)
    with open(relevance_file, 'r', encoding='utf-8') as f_qrel:
        for line in f_qrel:
            parts = line.strip().split()
            if len(parts) != 3:
                # Skip lines with incorrect format
                continue
            query_id, object_id, relevance = parts
            try:
                relevance_judgments[query_id][object_id] = int(relevance)
            except ValueError:
                # Skip lines with incorrect relevance values
                continue
    return relevance_judgments
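
The sketch below applies the same parsing rules to a few made-up lines read from an in-memory file, so the resulting nested dictionary can be inspected without the real cranqrel file. The helper name `parse_qrels` and the sample lines are assumptions.

```python
import io
from collections import defaultdict

def parse_qrels(f_qrel):
    """Parse 'query_id doc_id relevance' lines from a file-like object (hypothetical helper)."""
    judgments = defaultdict(dict)
    for line in f_qrel:
        parts = line.strip().split()
        if len(parts) != 3:
            continue  # skip lines with the wrong number of fields
        query_id, object_id, relevance = parts
        try:
            judgments[query_id][object_id] = int(relevance)
        except ValueError:
            continue  # skip lines with a non-integer relevance value
    return judgments

# Made-up sample: three valid lines, one with four fields, one with a bad relevance value
sample = io.StringIO("1 184 2\n1 29 2\n2 12 3\n2 15 3 extra\n3 7 x\n")
toy_qrels = parse_qrels(sample)
print(dict(toy_qrels))
```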

*  This is a custom replacement for `pytrec_eval.parse_run` that tolerates duplicate document Ids for the same query, so the evaluation can proceed without an AssertionError. It uses a nested defaultdict: if a document Id already exists for a query, the new score overwrites the old one.

In [20]:
# Custom implementation of parse_run of pytrec_eval to handle multiple document Ids for the same query and proceed without any error
def parse_run(f_run):
    run = defaultdict(dict)
    for line in f_run:
        query_id, _, object_id, _, score, _ = line.strip().split()
        run[query_id][object_id] = float(score)
    return run
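
The overwrite behaviour can be checked on a few made-up run lines, as in the sketch below, which repeats the function so that it runs on its own. Since `write_output_file` appends to its output file, re-running the ranking cells without deleting the old files is one way such duplicates can arise; that explanation is an assumption about how the notebook was used.

```python
import io
from collections import defaultdict

def parse_run(f_run):
    run = defaultdict(dict)
    for line in f_run:
        query_id, _, object_id, _, score, _ = line.strip().split()
        run[query_id][object_id] = float(score)  # a later duplicate overwrites the earlier score
    return run

# Made-up run lines: document 184 appears twice for query 1
sample = io.StringIO(
    "1 0 184 1 5.000000 IR_System\n"
    "1 0 29 2 3.500000 IR_System\n"
    "1 0 184 1 7.250000 IR_System\n"
)
toy_run = parse_run(sample)
print(dict(toy_run))
```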

*  For each model, read the output file having the retrieved documents and their scores.
*  Evaluate the retrieval models (VSM, BM25, LM) with the pytrec_eval library, using the relevance judgments and the evaluation measures (MAP, P@5, and NDCG).
*  Return the evaluation scores in a dictionary where the model name is the key and that model's per-query evaluation scores are the value.

In [21]:
# Function to evaluate retrieval models using pytrec_eval
def evaluate_models(model_names, output_dir, relevance_judgments):
    evaluation_results = {}
    for model_name in model_names:
        output_file = os.path.join(output_dir, f'{model_name}_output.txt')
        with open(output_file, 'r') as f:
            # Use the custom parse_run to handle duplicate document Ids for the same query
            run = parse_run(f)
            #run = pytrec_eval.parse_run(f)

        # Using MAP, P@5, and NDCG measures for the evaluation
        evaluator = pytrec_eval.RelevanceEvaluator(relevance_judgments, {'map', 'P_5', 'ndcg'})
        model_scores = evaluator.evaluate(run)
        #print(model_scores.values())
        # Store the evaluation results in a dictionary
        evaluation_results[model_name] = model_scores

    return evaluation_results
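
The evaluator returns one set of scores per query; the per-query `map` value is the average precision for that query, and MAP is its mean over all queries. The sketch below averages a made-up per-query result dictionary of the same shape. The helper name `mean_scores` and the numbers are assumptions, not results from this project.

```python
def mean_scores(model_scores, measures=("map", "P_5", "ndcg")):
    """Average each measure over all queries of one model (hypothetical helper)."""
    num_queries = len(model_scores)
    return {measure: sum(query_scores[measure] for query_scores in model_scores.values()) / num_queries
            for measure in measures}

# Made-up per-query scores in the shape returned by evaluator.evaluate(run)
toy_model_scores = {
    "1": {"map": 0.5, "P_5": 0.4, "ndcg": 0.6},
    "2": {"map": 0.25, "P_5": 0.2, "ndcg": 0.4},
}
toy_means = mean_scores(toy_model_scores)
print({measure: round(value, 4) for measure, value in toy_means.items()})
```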

**THE MAIN FUNCTION:**
1. Load the Cranfield collection from the collection file - multiple documents each with a unique document Id and content.

2. Preprocess the documents - tokenize, remove stop words, and apply stemming. Build an inverted index which maps each token to the list of document Ids containing that token.

3. Calculate the IDF - for each token use the corresponding inverted index and total number of documents.

4. Load the queries from the cran.qry file - each has a unique query Id and content.

5. Build the term-document matrix for the VSM model using the preprocessed documents, inverted index, and IDF values.

6. Rank the documents for each query using the VSM model and write the top 100 ranked documents to an output file with the prefix 'vsm'.

7. Rank the documents for each query using the BM25 model, and write the top 100 ranked documents to an output file with the prefix 'bm25'.

8. Rank the documents for each query using the LM model (a query-likelihood language model with linear interpolation smoothing), and write the top 100 ranked documents to an output file with the prefix 'bm25_lm'.

9. Load the relevance judgments for queries and documents from the cranqrel file.

10. Evaluate the retrieval models (VSM, BM25, and LM) using pytrec_eval based on the relevance judgments and the retrieved document rankings.

11. Convert the nested dictionary of evaluation results into a DataFrame, rename the columns to more readable names, and round the values to four decimal places for easier analysis.

12. Save the evaluation results DataFrame as a CSV file, containing Mean Average Precision (MAP), Precision at 5 (P@5), and Normalized Discounted Cumulative Gain (NDCG), for each retrieval model and query for the top 100 documents.

In [22]:
# The MAIN Function to run the IR System and Save & Print the evaluation results
if __name__ == "__main__":
    # Load the Cranfield collection
    documents = load_cranfield_collection(collection_file)

    # Preprocess the documents and build the inverted index
    inverted_index = build_inverted_index(documents)

    # Calculate inverse document frequency (IDF)
    num_documents = len(documents)
    idf = calculate_idf(inverted_index, num_documents)

    # Load the queries
    queries = load_queries(queries_file)

    # Build the term-document matrix for Vector Space Model
    term_document_matrix = build_term_document_matrix(documents, inverted_index, idf)

    # Rank documents using Vector Space Model
    for query_id, query_content in queries.items():
        query_tokens = preprocess_document(query_id, query_content)
        ranked_docs_vsm = rank_documents_vsm(query_tokens, term_document_matrix)
        write_output_file(query_id, ranked_docs_vsm[:100], 'vsm', output_dir)

    # Rank documents using BM25
    avg_doc_length = sum(len(preprocess_document(doc_id, doc_content)) for doc_id, doc_content in documents.items()) / num_documents
    for query_id, query_content in queries.items():
        query_tokens = preprocess_document(query_id, query_content)
        ranked_docs_bm25 = rank_documents_bm25(query_tokens, documents, inverted_index, num_documents, avg_doc_length)
        write_output_file(query_id, ranked_docs_bm25[:100], 'bm25', output_dir)

    # Rank documents using the Language Model (output prefix 'bm25_lm')
    token_freq = defaultdict(int)
    total_token_count = 0
    for doc_id, doc_content in documents.items():
        doc_tokens = preprocess_document(doc_id, doc_content)
        for token in doc_tokens:
            token_freq[token] += 1
            total_token_count += 1
    for query_id, query_content in queries.items():
        query_tokens = preprocess_document(query_id, query_content)
        ranked_docs_lm = rank_documents_language_model(query_tokens, documents, token_freq, total_token_count)
        write_output_file(query_id, ranked_docs_lm[:100], 'bm25_lm', output_dir)

    # Load the relevance judgments from the cranqrel file
    relevance_judgments = load_relevance_judgments(relevance_file)
    #print(relevance_judgments.values())

    # Evaluate retrieval models using pytrec_eval
    model_names = ['vsm', 'bm25', 'bm25_lm']
    evaluation_results = evaluate_models(model_names, output_dir, relevance_judgments)

    # Print the evaluation results
    print("\033[1m\033[4mEvaluation Results:\033[0m")

    # Convert the nested dictionary into a DataFrame
    data = {}
    for model, model_results in evaluation_results.items():
        for query, query_results in model_results.items():
            if all(key in query_results for key in ['map', 'P_5', 'ndcg']):
                data[(model, query)] = query_results

    df = pd.DataFrame.from_dict(data, orient='index')
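    # Rename the columns by key, in a fixed order, so the labels do not depend
    # on the order in which pytrec_eval returns the measures
    df = df[['map', 'P_5', 'ndcg']].rename(columns={'map': 'MAP_Score', 'P_5': 'P@5_Score', 'ndcg': 'NDCG_Score'})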
    df.index.name = 'Model'

    # Set the number of decimal places
    df = df.round(4)
    # Transpose the DataFrame to get a more readable format
    #df = df.T

    # Save the Evaluation Results to a file inside the output folder
    df.to_csv(os.path.join(output_dir, 'evaluation_results.csv'))
    print(df)



[1m[4mEvaluation Results:[0m
             MAP_Score  P@5_Score  NDCG_Score
vsm     100     0.0022        0.0      0.0462
        101     0.0000        0.0      0.0000
        102     0.0000        0.0      0.0000
        103     0.0000        0.0      0.0000
        104     0.0000        0.0      0.0000
...                ...        ...         ...
bm25_lm 218     0.0000        0.0      0.0000
        219     0.0000        0.0      0.0000
        223     0.0000        0.0      0.0000
        224     0.0000        0.0      0.0000
        225     0.0186        0.0      0.1484

[282 rows x 3 columns]
