# Developing an Information Retrieval System with Document Ranking

This project extends the Information Retrieval (IR) system developed in the previous assignments with several document ranking strategies. Use the Cranfield collection as the dataset. It is available
in the original format here or in the TREC XML format with binary tagging here.



## Project Overview

In this project, you will implement three approaches to document ranking: the vector space model,
the binary independence model, and the language model. You will then compare these ranking models using
the evaluation criteria from Lecture 7. Key components and functionalities are as follows:


### ⬜ Document Preprocessing

Your project will begin by reading and preprocessing a collection of text documents.
For each document, you only need to retain the text inside the TITLE and TEXT tags. The dataset also contains
queries and the documents relevant to each query, which will be useful in the evaluation phase.
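
As a minimal sketch of retaining both fields, the snippet below joins the `title` and `text` children of each `doc` element. The inline sample XML, its `root` wrapper element, and the helper name `doc_content` are hypothetical and exist only for illustration; the real file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document sample mimicking the TREC XML layout.
SAMPLE_XML = """<root>
<doc><docno>1</docno><title>wing in a slipstream .</title><text>lift of a wing .</text></doc>
<doc><docno>2</docno><title>shear flow</title><text></text></doc>
</root>"""


def doc_content(doc):
    """Join the TITLE and TEXT fields of one <doc>; missing or empty fields count as ''."""
    parts = [(doc.findtext(tag) or "").strip() for tag in ("title", "text")]
    return " ".join(part for part in parts if part)


sample_root = ET.fromstring(SAMPLE_XML)
contents = [doc_content(doc) for doc in sample_root.findall("doc")]
```

Joining the two fields into one string lets a single tokenization pass cover both, at the cost of no longer being able to weight title terms separately.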


In [92]:
from pathlib import Path  # For paths that work on both Windows and Linux
import pickle  # For writing/reading dicts to/from files
import xml.etree.ElementTree as ET  # For reading XML files
from nltk.tokenize import RegexpTokenizer  # For tokenization
from nltk.stem import PorterStemmer  # For stemming
import numpy as np  # For array operations

In [73]:
tokenizer = RegexpTokenizer(r'\w+')
ps = PorterStemmer()

In [14]:
DATA_PATH = Path("cranfield-dataset") / "cran.all.1400.xml"

In [75]:
root = ET.parse(DATA_PATH).getroot()

for i, doc in enumerate(root.findall('doc')):
    print(f"======== DOC {i} ========")
    print(doc.find('title').text)
    if i >= 1:  # show only the first two titles
        break

experimental investigation of the aerodynamics of a
wing in a slipstream .
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .


In [97]:
docs = []  # raw text of each document, indexed by doc_id
posting_list = {}  # term -> {doc_id: term frequency}

for doc_id, doc in enumerate(root.findall("doc")):
    text = doc.find("text").text or ""  # empty documents yield no tokens
    tokens = tokenizer.tokenize(text)
    tokens = [ps.stem(token) for token in tokens]

    docs.append(text)

    for term in tokens:
        # Create the term's postings on first sight, then bump the count
        term_postings = posting_list.setdefault(term, {})
        term_postings[doc_id] = term_postings.get(doc_id, 0) + 1
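
The same term-to-postings structure can be sketched on a toy collection with only the standard library. This is an illustration under assumptions, not the notebook's method: `toy_docs` is made-up data, `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and stemming is omitted.

```python
import re
from collections import Counter, defaultdict

# Toy documents standing in for the Cranfield texts (hypothetical data).
toy_docs = ["the wing but not the tail", "flow past a plate", "but but flow"]

# term -> Counter({doc_id: term frequency})
toy_index = defaultdict(Counter)
for doc_id, text in enumerate(toy_docs):
    # Approximates the \w+ tokenizer; no stemming is applied in this sketch.
    for term in re.findall(r"\w+", text):
        toy_index[term][doc_id] += 1

print(dict(toy_index["but"]))  # {0: 1, 2: 2}
```

Using `defaultdict(Counter)` removes the explicit existence checks, though a plain `dict` of `dict`s, as in the cell above, pickles to a simpler structure.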

In [104]:
print(f"but({len(posting_list['but'])} docs): [doc_id: frequency]", posting_list["but"])

but(219 docs): [doc_id: frequency] {13: 2, 24: 1, 27: 1, 43: 1, 48: 2, 61: 1, 65: 1, 71: 2, 93: 1, 109: 2, 113: 1, 115: 1, 124: 1, 125: 1, 131: 1, 137: 1, 139: 1, 148: 1, 151: 1, 152: 1, 155: 2, 159: 1, 167: 1, 172: 1, 175: 1, 178: 1, 184: 1, 187: 1, 188: 1, 190: 1, 192: 1, 198: 2, 200: 1, 201: 4, 203: 1, 205: 1, 208: 1, 209: 1, 211: 2, 213: 1, 220: 1, 226: 1, 228: 1, 240: 1, 243: 1, 246: 1, 251: 2, 254: 2, 260: 1, 265: 1, 271: 1, 282: 1, 283: 1, 291: 1, 295: 1, 328: 4, 337: 2, 346: 1, 347: 1, 351: 1, 369: 1, 374: 1, 389: 1, 403: 2, 416: 1, 430: 1, 440: 1, 451: 1, 458: 2, 476: 2, 483: 1, 489: 1, 498: 2, 510: 1, 514: 1, 518: 1, 519: 1, 520: 1, 521: 1, 526: 1, 535: 1, 544: 1, 546: 1, 555: 1, 561: 1, 563: 1, 565: 1, 566: 1, 568: 1, 569: 1, 571: 1, 588: 1, 594: 1, 599: 1, 603: 1, 635: 1, 639: 1, 642: 1, 651: 1, 659: 1, 660: 1, 662: 1, 666: 1, 670: 1, 672: 1, 678: 1, 684: 2, 685: 1, 691: 1, 703: 2, 717: 1, 720: 1, 726: 1, 730: 1, 738: 1, 747: 1, 751: 1, 752: 2, 755: 1, 756: 2, 759: 1, 762: 

In [108]:
# Sanity check: count whitespace-separated occurrences of "but" in the raw text of doc 13
num_but_in_doc_13 = sum(np.array(docs[13].split()) == "but")
print("Number of word 'but' in doc_13 =", num_but_in_doc_13)

Number of word 'but' in doc_13 = 2
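
The check above splits on whitespace, whereas the index was built with the `\w+` tokenizer. The two counts agree only when the word is not attached to punctuation; in the sample titles punctuation is space-separated, which may be why they match here. A small sketch of the difference, on a made-up sentence and with `re.findall` as a stand-in for `RegexpTokenizer`:

```python
import re

# Hypothetical sentence in which one "but" is glued to punctuation.
sample = "thin but stiff (but, notably, light)"

# Whitespace splitting leaves "(but," as a single token, so it is not counted.
whitespace_count = sum(tok == "but" for tok in sample.split())

# The \w+ pattern strips the surrounding punctuation, so both occurrences count.
regex_count = sum(tok == "but" for tok in re.findall(r"\w+", sample))

print(whitespace_count, regex_count)  # 1 2
```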


In [118]:
# Save the posting list to a file, creating the output directory if needed
output_dir = Path("files")
output_dir.mkdir(exist_ok=True)
with open(output_dir / "posting_list.pkl", "wb") as f:
    pickle.dump(posting_list, f)
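
A saved posting list can later be read back and turned into the document frequencies that the ranking models rely on. The sketch below round-trips a hypothetical miniature posting list through a temporary file and derives one common idf variant, `log(N / df)`; the toy data, `num_docs`, and the choice of idf formula are assumptions rather than part of the assignment.

```python
import math
import pickle
import tempfile
from pathlib import Path

# Hypothetical miniature posting list: term -> {doc_id: term frequency}.
toy_posting_list = {"but": {0: 1, 2: 2}, "flow": {1: 1, 2: 1}, "wing": {0: 1}}
num_docs = 3  # size of the toy collection

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "posting_list.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_posting_list, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Document frequency is the number of documents in each term's postings.
doc_freq = {term: len(postings) for term, postings in loaded.items()}

# One common idf variant; other weightings are equally valid.
idf = {term: math.log(num_docs / df) for term, df in doc_freq.items()}
```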