# Developing an Information Retrieval System with Document Ranking

This project extends the Information Retrieval (IR) system developed in the previous assignments with several document ranking strategies. Use the Cranfield collection as the dataset; it is available in its original format here, or in the TREC XML format with binary tagging here.



## Project Overview

In this project, you will implement three approaches to document ranking: the vector space model,
the binary independence model, and the language model. You will then compare these ranking models using
the evaluation criteria from Lecture 7. Key components and functionalities are as follows:


### ⬜ Document Ranking – Language Model

Implement a function that ranks documents using the language model.
The function takes as input a query text and an integer giving the number of top
documents to retrieve. You may choose either Dirichlet smoothing or Jelinek-Mercer smoothing to avoid
zero probabilities. You do not need to fine-tune the parameters λ or α. However, you should discuss why your
chosen method and parameter values are preferred.
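As a point of reference, the following is a minimal sketch of query-likelihood ranking with Jelinek-Mercer smoothing. It is not the notebook's implementation: the function name `lm_rank_jm`, the default `lam=0.5`, and the toy corpus are assumptions made for illustration, and the input is assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter


def lm_rank_jm(query_tokens, tokenized_docs, k, lam=0.5):
    """Rank documents by query log-likelihood with Jelinek-Mercer smoothing.

    P(t|d) = lam * tf(t, d) / |d| + (1 - lam) * cf(t) / |C|, with 0 < lam < 1.
    Returns the indices of the top-k documents, best first.
    """
    # Collection-wide term counts for the background model
    collection_tf = Counter()
    for doc in tokenized_docs:
        collection_tf.update(doc)
    collection_len = sum(collection_tf.values())

    scores = []
    for doc_id, doc in enumerate(tokenized_docs):
        doc_tf = Counter(doc)
        doc_len = len(doc)
        log_p = 0.0
        for term in query_tokens:
            if collection_tf[term] == 0:
                continue  # term unseen in the whole collection: skip it
            p_doc = doc_tf[term] / doc_len if doc_len else 0.0
            p_col = collection_tf[term] / collection_len
            log_p += math.log(lam * p_doc + (1 - lam) * p_col)
        scores.append((log_p, doc_id))

    # Highest log-likelihood first; ties broken by lower document index
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scores[:k]]


# Hypothetical toy corpus (already stemmed), for illustration only
toy_docs = [
    ["transon", "aileron", "buzz"],
    ["shock", "wave", "flow"],
    ["transon", "flow", "shock"],
]
top = lm_rank_jm(["transon", "buzz"], toy_docs, k=2)
print(top)
```

Mixing in the collection model gives every query term that occurs somewhere in the collection a non-zero probability, so a document is not eliminated for missing a single term.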



In [3]:
from pathlib import Path  # For paths that work on both Windows and Linux
import pickle  # For writing/reading dicts to/from files
import numpy as np
from nltk.tokenize import RegexpTokenizer  # For query tokenization
from nltk.stem import PorterStemmer  # For query stemming
import math  # For calculating idf (log(N/df))
from collections import Counter  # For calculating the query's term frequencies

In [4]:
tokenizer = RegexpTokenizer(r'\w+')
ps = PorterStemmer()

In [5]:
# reading posting list from file
# posting list has been created in the last step (preprocessing)
posting_list = None
docs = None
tokenized_docs = None

with open(Path("files") / "posting_list.pkl", "rb") as f:
    posting_list = pickle.load(f)

with open(Path("files") / "docs.pkl", "rb") as f:
    docs = pickle.load(f)

with open(Path("files") / "tokenized_docs.pkl", "rb") as f:
    tokenized_docs = pickle.load(f)

In [6]:
print(f"but({len(posting_list['but'])} docs): [doc_id: frequency]",posting_list["but"])

but(219 docs): [doc_id: frequency] {13: 2, 24: 1, 27: 1, 43: 1, 48: 2, 61: 1, 65: 1, 71: 2, 93: 1, 109: 2, 113: 1, 115: 1, 124: 1, 125: 1, 131: 1, 137: 1, 139: 1, 148: 1, 151: 1, 152: 1, 155: 2, 159: 1, 167: 1, 172: 1, 175: 1, 178: 1, 184: 1, 187: 1, 188: 1, 190: 1, 192: 1, 198: 2, 200: 1, 201: 4, 203: 1, 205: 1, 208: 1, 209: 1, 211: 2, 213: 1, 220: 1, 226: 1, 228: 1, 240: 1, 243: 1, 246: 1, 251: 2, 254: 2, 260: 1, 265: 1, 271: 1, 282: 1, 283: 1, 291: 1, 295: 1, 328: 4, 337: 2, 346: 1, 347: 1, 351: 1, 369: 1, 374: 1, 389: 1, 403: 2, 416: 1, 430: 1, 440: 1, 451: 1, 458: 2, 476: 2, 483: 1, 489: 1, 498: 2, 510: 1, 514: 1, 518: 1, 519: 1, 520: 1, 521: 1, 526: 1, 535: 1, 544: 1, 546: 1, 555: 1, 561: 1, 563: 1, 565: 1, 566: 1, 568: 1, 569: 1, 571: 1, 588: 1, 594: 1, 599: 1, 603: 1, 635: 1, 639: 1, 642: 1, 651: 1, 659: 1, 660: 1, 662: 1, 666: 1, 670: 1, 672: 1, 678: 1, 684: 2, 685: 1, 691: 1, 703: 2, 717: 1, 720: 1, 726: 1, 730: 1, 738: 1, 747: 1, 751: 1, 752: 2, 755: 1, 756: 2, 759: 1, 762: 

In [57]:
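def rsv_score(query, k):
    """Return the ids of the top-k documents by idf-only retrieval status value (RSV)."""
    # Tokenize and stem the query
    query = [ps.stem(token) for token in tokenizer.tokenize(query)]

    docs_scores = np.zeros(len(docs))

    for doc_id in range(len(docs)):
        rsv = 0

        for term in set(query):
            # Only query terms that occur in the document contribute
            if term in tokenized_docs[doc_id]:
                idf = math.log(len(docs) / len(posting_list[term]), 10)
                rsv += idf

        docs_scores[doc_id] = rsv

    # Indices of the k highest scores, best first
    return docs_scores.argsort()[-k:][::-1]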

In [58]:
def enhanced_rsv_score(query, k, k1, b):
    """Return the ids of the top-k documents by Okapi BM25-style RSV.

    k1 controls term-frequency scaling; b controls document-length normalization.
    """
    # Tokenize and stem the query
    query = [ps.stem(token) for token in tokenizer.tokenize(query)]

    docs_scores = np.zeros(len(docs))

    avg_doc_length = sum(len(doc) for doc in tokenized_docs) / len(tokenized_docs)

    for doc_id in range(len(docs)):
        rsv = 0

        for term in set(query):
            if term in tokenized_docs[doc_id]:
                idf = math.log(len(docs) / len(posting_list[term]), 10)
                tf = posting_list[term][doc_id]

                # Term-frequency factor, normalized by relative document length
                factor = ((k1 + 1) * tf) / (
                    k1 * ((1 - b) + b * len(tokenized_docs[doc_id]) / avg_doc_length)
                    + tf
                )

                rsv += idf * factor

        docs_scores[doc_id] = rsv

    # Indices of the k highest scores, best first
    return docs_scores.argsort()[-k:][::-1]

In [59]:
def enhanced_rsv_score_long_query(query, k, k1, b, k3):
    """Return the ids of the top-k documents by BM25-style RSV with a query-tf factor.

    k3 controls the scaling of term frequency within the query.
    """
    # Tokenize and stem the query
    query = [ps.stem(token) for token in tokenizer.tokenize(query)]

    docs_scores = np.zeros(len(docs))

    query_term_frequency = Counter(query)

    avg_doc_length = sum(len(doc) for doc in tokenized_docs) / len(tokenized_docs)

    for doc_id in range(len(docs)):
        rsv = 0

        for term in set(query):
            if term in tokenized_docs[doc_id]:
                idf = math.log(len(docs) / len(posting_list[term]), 10)
                tf_d = posting_list[term][doc_id]
                tf_q = query_term_frequency[term]

                # Document-side factor (term frequency and length normalization)
                factor1 = ((k1 + 1) * tf_d) / (
                    k1 * ((1 - b) + b * len(tokenized_docs[doc_id]) / avg_doc_length)
                    + tf_d
                )

                # Query-side factor (term frequency within the query)
                factor2 = ((k3 + 1) * tf_q) / (k3 + tf_q)

                rsv += idf * factor1 * factor2

        docs_scores[doc_id] = rsv

    # Indices of the k highest scores, best first
    return docs_scores.argsort()[-k:][::-1]
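For reference, the three functions above can be read as one formula. This is a transcription of the code rather than an additional method. The symbols are shorthand introduced here: $N$ is `len(docs)`, $df_t$ is `len(posting_list[t])`, $tf_{t,d}$ and $tf_{t,q}$ are the term frequencies in the document and in the query, $L_d$ is the document length in tokens, and $L_{avg}$ is `avg_doc_length`.

$$
RSV_d = \sum_{t \in q \cap d} \log_{10}\frac{N}{df_t} \cdot \frac{(k_1 + 1)\, tf_{t,d}}{k_1\left((1 - b) + b\,\frac{L_d}{L_{avg}}\right) + tf_{t,d}} \cdot \frac{(k_3 + 1)\, tf_{t,q}}{k_3 + tf_{t,q}}
$$

`enhanced_rsv_score` omits the last factor, which is equivalent to setting $k_3 = 0$. `rsv_score` keeps only the idf term, which is equivalent to setting $k_1 = 0$ and $k_3 = 0$, since both fractions then reduce to 1.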

In [61]:
result = rsv_score(
    "what is the basic mechanism of the transonic aileron buzz .",
    5,
)
result

array([ 495,  519,  902, 1247, 1071], dtype=int64)

In [62]:
result2 = enhanced_rsv_score(
    "what is the basic mechanism of the transonic aileron buzz .",
    5,
    k1=0,
    b=0,
)
result2

array([ 495,  519,  902, 1247, 1071], dtype=int64)

In [63]:
result2 = enhanced_rsv_score(
    "what is the basic mechanism of the transonic aileron buzz .",
    5,
    k1=1,
    b=1,
)
result2

array([495, 902, 519, 312, 439], dtype=int64)

In [66]:
# Show the results
for i, doc_id in enumerate(result2):
    print(f"================== Result {i + 1} ==================")
    print(docs[doc_id])
    print()

a theory of transonic aileron buzz, neglecting viscous
effects .
  usaf-sponsored analysis of the unsteady perturbations of
two-dimensional transonic flow around an airfoil, where local supersonic
regions terminated by shock waves are present in the vicinity of the
airfoil .  viscous effects are neglected, and a linearized theory of
the perturbations due to harmonic oscillations of an aileron is
developed .  a series solution for the pressure distribution is obtained,
and numerical results for the nonsteady hinge moment, from the
first approximation to the solution, are presented .  as a result of
flutter analysis a stability boundary for transonic aileron buzz is
obtained .  comparison of the theoretical results with experimental
observations shows satisfactory agreement .

two dimensional transonic unsteady flow with shock
waves .
  a study is made of the unsteady flow around an
airfoil at transonic mach numbers, the situation being
such that local supersonic regions terminated by
sh

### ⬜ Long queries – Probabilistic Model (Optional)

Handle long queries by having your function switch between
the Okapi BM25 variants based on the query length.
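One way to read this requirement is sketched below: a single BM25 scorer whose query-term-frequency factor is applied only when the query is long. The function names, the `long_query_threshold=10` cutoff, and the default values of `k1`, `b`, and `k3` are illustrative assumptions, not values prescribed by the assignment; the sketch works on already tokenized input so that it runs without the pickled files.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, tokenized_docs, k, k1=1.2, b=0.75, k3=None):
    """Return top-k document indices by BM25.

    The query-tf factor is applied only when k3 is given.
    """
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    doc_tfs = [Counter(doc) for doc in tokenized_docs]
    # Document frequency: number of documents containing each term
    df = Counter(term for tf in doc_tfs for term in tf)
    query_tf = Counter(query_tokens)

    scores = []
    for doc_id, tf in enumerate(doc_tfs):
        score = 0.0
        for term, tf_q in query_tf.items():
            tf_d = tf[term]
            if tf_d == 0:
                continue  # term absent from this document
            idf = math.log10(n_docs / df[term])
            norm = k1 * ((1 - b) + b * len(tokenized_docs[doc_id]) / avg_len)
            weight = idf * ((k1 + 1) * tf_d) / (norm + tf_d)
            if k3 is not None:
                weight *= ((k3 + 1) * tf_q) / (k3 + tf_q)
            score += weight
        scores.append((score, doc_id))

    # Highest score first; ties broken by lower document index
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scores[:k]]


def rank_by_query_length(query_tokens, tokenized_docs, k,
                         long_query_threshold=10, k3=1.5):
    """Use the long-query variant only when the query exceeds the threshold."""
    is_long = len(query_tokens) > long_query_threshold
    return bm25_rank(query_tokens, tokenized_docs, k,
                     k3=k3 if is_long else None)


# Hypothetical toy corpus (already stemmed), for illustration only
toy_docs = [
    ["transon", "aileron", "buzz"],
    ["shock", "wave", "flow"],
    ["transon", "flow", "shock"],
]
short_result = rank_by_query_length(["transon", "buzz"], toy_docs, k=2)
long_result = rank_by_query_length(
    ["transon", "transon", "transon", "buzz"], toy_docs, k=2,
    long_query_threshold=2,
)
print(short_result, long_result)
```

Keeping one scorer and toggling `k3` avoids duplicating the scoring loop; the threshold itself would need to be justified for the Cranfield queries.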

In [None]:
result3 = enhanced_rsv_score_long_query(
    "what is the basic mechanism of the transonic transonic transonic aileron buzz .",
    5,
    k1=1,
    b=1,
    k3=0,
)
result3

array([495, 902, 519, 312, 439], dtype=int64)

In [67]:
result3 = enhanced_rsv_score_long_query(
    "what is the basic mechanism of the transonic transonic transonic aileron buzz .",
    5,
    k1=1,
    b=1,
    k3=1,
)
result3

array([495, 902, 312, 439,  37], dtype=int64)

In [68]:
# Show the results
for i, doc_id in enumerate(result3):
    print(f"================== Result {i + 1} ==================")
    print(docs[doc_id])
    print()

a theory of transonic aileron buzz, neglecting viscous
effects .
  usaf-sponsored analysis of the unsteady perturbations of
two-dimensional transonic flow around an airfoil, where local supersonic
regions terminated by shock waves are present in the vicinity of the
airfoil .  viscous effects are neglected, and a linearized theory of
the perturbations due to harmonic oscillations of an aileron is
developed .  a series solution for the pressure distribution is obtained,
and numerical results for the nonsteady hinge moment, from the
first approximation to the solution, are presented .  as a result of
flutter analysis a stability boundary for transonic aileron buzz is
obtained .  comparison of the theoretical results with experimental
observations shows satisfactory agreement .

two dimensional transonic unsteady flow with shock
waves .
  a study is made of the unsteady flow around an
airfoil at transonic mach numbers, the situation being
such that local supersonic regions terminated by
sh