<a href="https://colab.research.google.com/github/polinak1r/Document-Ranking-Information-Retrieval/blob/main/document_ranking_information_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Outline: Document Ranking & Information Retrieval

1. **Kaggle Setup and Data Download**  
   Kaggle credentials are configured, and competition files (documents, queries, qrels) are downloaded via the Kaggle CLI.

2. **Data Loading and Inspection**  
   All downloaded JSONL/JSON files are read into Python lists or dictionaries, and basic stats (document and query counts) are displayed.

3. **Evaluation Metric (P-Found)**  
   The functions `pfound_score` and `pfound` are defined to measure retrieval performance: the first scores a single ranked list of predictions, and the second averages that score over all queries.

4. **Text Preprocessing**  
   Each title plus a short portion of the content that follows it is tokenized and stemmed. A dictionary of document frequencies (df) is built, then stop words and tokens that appear in only one document are removed.

5. **TF-IDF Construction**  
   A sparse TF matrix is created for each document, IDF values are computed, and both are combined to form the final TF-IDF matrix.

6. **Query Processing and Similarity Computation**  
   Queries undergo the same tokenization and stemming. Their term counts are assembled into a sparse matrix, and similarity scores are computed as the dot product of document TF-IDF vectors and query term counts (cosine-like, but without length normalization).

7. **Submission File Creation**  
   (Query, document) pairs and their computed scores are gathered into a dataframe and saved as a CSV file for submission.
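
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration under simplifying assumptions, not the notebook's actual code: the documents, the query, and the helper names (`tokenize`, `tfidf`) are invented here, and stemming and stop-word removal are left out so the block runs with the standard library alone.

```python
import math
import re
from collections import Counter

# Toy corpus and query; all texts and names here are illustrative only.
toy_corpus = {
    "d1": "Central banks raise interest rates to fight inflation",
    "d2": "Football club wins the national cup final",
    "d3": "Inflation slows as interest rates stay high",
}
toy_query = "interest rates and inflation"


def tokenize(text):
    # Lowercased \w+ tokens; the notebook additionally applies a Snowball stemmer.
    return re.findall(r"\w+", text.lower())


# Document frequency: number of documents containing each token.
doc_tokens = {doc_id: tokenize(text) for doc_id, text in toy_corpus.items()}
doc_freq = Counter(tok for toks in doc_tokens.values() for tok in set(toks))

# Keep tokens seen in more than one document, mirroring the df > 1 filter.
vocab = {tok for tok, n in doc_freq.items() if n > 1}
idf = {tok: math.log(len(toy_corpus) / doc_freq[tok]) for tok in vocab}


def tfidf(tokens):
    # Term frequency (count / document length) times inverse document frequency.
    counts = Counter(tokens)
    return {tok: (n / len(tokens)) * idf[tok] for tok, n in counts.items() if tok in vocab}


doc_vectors = {doc_id: tfidf(toks) for doc_id, toks in doc_tokens.items()}
query_counts = Counter(tok for tok in tokenize(toy_query) if tok in vocab)

# Score = dot product of document TF-IDF with raw query term counts.
scores = {
    doc_id: sum(weight * query_counts[tok] for tok, weight in vec.items())
    for doc_id, vec in doc_vectors.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy setup the shorter on-topic document (`d3`) ranks above the longer one (`d1`) because term frequency is divided by document length, and the unrelated document (`d2`) scores zero.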

In [None]:
!pip install -q kaggle

In [None]:
import json

with open("kaggle.json", "r") as f:
    creds = json.load(f)

In [None]:
import os

# Expose the credentials from kaggle.json to the Kaggle CLI via environment variables.
os.environ["KAGGLE_USERNAME"] = creds["username"]
os.environ["KAGGLE_KEY"] = creds["key"]

In [None]:
from pathlib import Path
import random

import numpy as np
import pandas as pd
from tqdm import tqdm

data_dir = Path("data")

In [None]:
# Also copy kaggle.json to the CLI's default location and restrict its permissions.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle --help

usage: kaggle [-h] [-v] [-W] {competitions,c,datasets,d,kernels,k,models,m,files,f,config} ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle competitions
    datasets (d)    

In [None]:
!kaggle competitions download -c nlp-nup-2024-hw1 -f documents.jsonl

Downloading documents.jsonl.zip to /content
 99% 601M/605M [00:06<00:00, 90.9MB/s]
100% 605M/605M [00:06<00:00, 102MB/s] 


In [None]:
!unzip documents.jsonl.zip

Archive:  documents.jsonl.zip
  inflating: documents.jsonl         


In [None]:
docs = []
with open('documents.jsonl') as fp:
    for line in tqdm(fp, total=367840):
        docs.append(json.loads(line))

print(f'Number of documents to search in: {len(docs)}')

100%|██████████| 367840/367840 [00:14<00:00, 25239.38it/s]

Number of documents to search in: 367840
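
The loading cell above hard-codes the line count for the progress bar. As a hedged alternative, a small generator can stream any JSONL file without knowing its length in advance. The helper name `read_jsonl` and the temporary toy file are assumptions made for this sketch; they are not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny stand-in file so the sketch runs without the competition data.
tmp_path = Path(tempfile.mkdtemp()) / "toy_documents.jsonl"
tmp_path.write_text(
    '{"id": "a", "title": "First"}\n{"id": "b", "title": "Second"}\n',
    encoding="utf-8",
)

loaded_docs = list(read_jsonl(tmp_path))
print(f"Number of documents: {len(loaded_docs)}")
```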





In [None]:
docs[0]

{'id': 'e1a4606fc85ca4c34cc1fc9dc3a85264',
 'url': 'https://www.reuters.com/article/us-puertorico-hurricane-lawsuit/judge-orders-further-extension-of-aid-to-puerto-rico-storm-evacuees-idUKKBN1KM4G6?edition-redirect=uk',
 'title': 'Judge orders further extension of aid to Puerto Rico storm evacuees',
 'contents': 'Judge orders further extension of aid to Puerto Rico storm evacuees\nBy Nate Raymond3 Min Read\nWORCESTER, Mass. (Reuters) - A federal judge on Wednesday extended until Aug. 31 an order preventing the eviction of hundreds of Puerto Rican families who fled the hurricane-ravaged island in 2017 and have been living in hotels and motels across the United States.\nFILE PHOTO: Buildings damaged by Hurricane Maria are seen in Lares, Puerto Rico, October 6, 2017. REUTERS/Lucas Jackson/File Photo\nU.S. District Judge Timothy Hillman in Worcester, Massachusetts, issued the order after hearing arguments over whether he should issue a longer-term injunction barring the federal government 

In [None]:
!kaggle competitions download -c nlp-nup-2024-hw1 -f queries_train.json

Downloading queries_train.json to /content
  0% 0.00/32.5k [00:00<?, ?B/s]
100% 32.5k/32.5k [00:00<00:00, 30.7MB/s]


In [None]:
# queries_train.json is downloaded uncompressed (see the output above), so no unzip step is needed.


In [None]:
with open('queries_train.json') as fp:
    queries = json.load(fp)

print(f'Number of train queries: {len(queries)}')

Number of train queries: 28


In [None]:
queries[0]

{'query_id': 'history-1',
 'query': 'Would the United Kingdom have been ready for WWII without the time gained through Appeasement?',
 'domain': 'history',
 'guidelines': "Many argue Britain's army was depleted in the early 1930s and stretched across the globe. UK defence spending had fallen significantly during the 1920s, from over £700 million in 1919 to 100 million in 1931.\n\nBetween 1934 and 1939, the UK launched a substantial programme of re-arming, recognising that war with Hitler was becoming increasingly likely. Although Appeasement was also motivated by Chamberlain's desire to end war, some argue this meant that the UK was more prepared in 1939 when war eventually broke out.  \n\nDespite these efforts, Germany was still better prepared for war under Hilter's single-minded preparation since he came to power in 1933. However, without Appeasement, the differential might have been much worse."}

In [None]:
!kaggle competitions download -c nlp-nup-2024-hw1 -f qrels_train.json

Downloading qrels_train.json to /content
100% 449k/449k [00:00<00:00, 1.33MB/s]
100% 449k/449k [00:00<00:00, 1.33MB/s]


In [None]:
# qrels_train.json is downloaded uncompressed (see the output above), so no unzip step is needed.


In [None]:
with open('qrels_train.json') as fp:
    qrels = json.load(fp)

print(f'Number of assessed query/document pairs: {len(qrels)}')

Number of assessed query/document pairs: 4216


In [None]:
print('Example of assessed pairs:')
qrels[0:2]

Example of assessed pairs:


[{'query_id': 'history-20',
  'doc_id': '00aa648a657bdf73369bcb093030cc41',
  'relevance': 0,
  'iteration': 'Q0'},
 {'query_id': 'history-20',
  'doc_id': '0260670b7616127813246a8c76c6d223',
  'relevance': 0,
  'iteration': 'Q0'}]

In [None]:
def pfound_score(y_true: 'npt.NDArray[np.int_]', y_score: 'npt.NDArray[np.float64]', pbreak: float = .15) -> float:
    assert y_true.shape == y_score.shape

    # Document indices ordered by descending predicted score
    indices = np.argsort(y_score)[::-1]

    y_max = max(y_true)

    pfound, plook = 0., 1.

    for rank, i in enumerate(indices):
        # Graded relevance mapped to a value in [0, 1)
        r = (2. ** y_true[i] - 1.) / (2. ** y_max)

        pfound += r * plook * pbreak ** rank

        # Discount later documents by the relevance already seen
        plook *= 1. - r

    return pfound


def pfound(qrels_list: list[dict[str, str | int]],
           y_pred: list[dict[str, str | float]],
           pbreak: float = 0.15
          ) -> float:
    assert 0 < pbreak < 1
    zero_score_qrel = {'score': 0.0, 'relevance': 0.0}

    queries = set(qrel['query_id'] for qrel in qrels_list)
    p_found_list = []
    for cur_query in queries:
        cur_y_pred_dicts = [doc_ranked for doc_ranked in y_pred
                            if doc_ranked['query_id'] == cur_query]
        y = {qrel['doc_id']: qrel for qrel in qrels_list if qrel['query_id'] == cur_query}
        cur_y_pred = np.empty(len(cur_y_pred_dicts))
        cur_y_true = np.empty(len(cur_y_pred_dicts))
        for n, y_pred_dict in enumerate(cur_y_pred_dicts):
            cur_y_pred[n] = y_pred_dict['score']
            cur_y_true[n] = y.get(y_pred_dict['doc_id'], zero_score_qrel)['relevance']

        cur_pfound = pfound_score(np.array(cur_y_true), np.array(cur_y_pred))
        p_found_list.append(cur_pfound)
    return float(np.mean(p_found_list))
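
To make the behaviour of the metric concrete, here is a small worked example. It restates the arithmetic of `pfound_score` under a different name (`pfound_score_sketch`, introduced only so this block runs on its own) and applies it to made-up relevance labels and scores.

```python
import numpy as np


def pfound_score_sketch(y_true, y_score, pbreak=0.15):
    # Same arithmetic as pfound_score above, restated so this block is self-contained.
    order = np.argsort(y_score)[::-1]
    y_max = max(y_true)
    pfound, plook = 0.0, 1.0
    for rank, i in enumerate(order):
        r = (2.0 ** y_true[i] - 1.0) / (2.0 ** y_max)
        pfound += r * plook * pbreak ** rank
        plook *= 1.0 - r
    return pfound


relevance = np.array([2, 0, 1])          # toy labels, not taken from qrels
good_scores = np.array([0.9, 0.5, 0.1])  # most relevant document ranked first
bad_scores = np.array([0.1, 0.5, 0.9])   # most relevant document ranked last

good = pfound_score_sketch(relevance, good_scores)
bad = pfound_score_sketch(relevance, bad_scores)
print(good, bad)
```

With these toy inputs the well-ordered ranking scores about 0.751 and the reversed one about 0.263, which shows how strongly the `pbreak ** rank` factor rewards placing relevant documents at the top.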

### Generating random predictions

For each (document, query) pair we simply return a random number. This baseline can be treated as a lower bound on the quality of our retrieval system.

In [None]:
#def random_similarity(doc: dict[str: str], query: dict[str: str]) -> float:
#    doc_text = doc['contents']
#    doc_title = doc['title']
#    query_text = query['query']
#    query_guidelines = query['guidelines']
#    return random.random()

Generating predictions

In [None]:
#preds = []
#
#for q in tqdm(queries):
#    for d in docs:
#        pred_sim = random_similarity(d, q)
#        preds.append({
#            'doc_id': d['id'],
#            'query_id': q['query_id'],
#            'score': pred_sim
#        })

Scoring the solution

In [None]:
#pfound(qrels, preds)

In [None]:
import json
import nltk
from collections import defaultdict, Counter
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix, coo_matrix
from math import log

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')

In [None]:
nltk.download('stopwords')
stop_set = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Tokenize only the first len(title) + 200 characters of contents (roughly the title plus a short lead), not the full text
df = defaultdict(int)

for doc in tqdm(docs):
    title_len = len(doc['title'])
    content_titles = doc['contents'][:title_len+200]
    tokens = [stemmer.stem(tok) for tok in tokenizer.tokenize(content_titles)]

    unique_tokens = set(tokens)
    for tok in unique_tokens:
        df[tok] += 1

100%|██████████| 367840/367840 [03:09<00:00, 1942.27it/s]


In [None]:
#token indexing
tok_indexed = {key: i for i, key in enumerate(df.keys())}

In [None]:
single_words = [key for key, value in df.items() if value == 1][:10]
single_words

['riphyakforb',
 'yurij',
 'burstcoin',
 'tanai',
 'heijin',
 'renshaw9',
 'decentralist',
 'universam',
 'perpetua',
 'scripturam']

In [None]:
# As shown above, many tokens occur in only one document. Drop them, along with stop words.
df_filtered = {k: v for k, v in df.items() if v > 1 and k not in stop_set}

In [None]:
filtered_tok = {key: i for i, key in enumerate(df_filtered.keys())}

In [None]:
rows, cols, tf_values, idf_values = [], [], [], []

num_docs = len(docs)
num_tokens = len(filtered_tok)

for i, doc in tqdm(enumerate(docs)):
    title_len = len(doc['title'])
    content_titles = doc['contents'][:title_len+200]
    tokens = [stemmer.stem(tok) for tok in tokenizer.tokenize(content_titles)]
    token_counts = len(tokens)
    unique_tokens = Counter(tokens)

    for tok, count in unique_tokens.items():
        if tok not in filtered_tok:
            continue
        col = filtered_tok[tok]
        row = i
        tf_item = count / token_counts
        idf_item = log(num_docs / df_filtered[tok])

        rows.append(row)
        cols.append(col)
        tf_values.append(tf_item)
        idf_values.append(idf_item)

tf_matrix = coo_matrix((tf_values, (rows, cols)), shape=(num_docs, num_tokens))
idf_matrix = coo_matrix((idf_values, (rows, cols)), shape=(num_docs, num_tokens))

tf_idf_matrix = tf_matrix.multiply(idf_matrix) #completed TF-IDF matrix

367840it [03:38, 1685.05it/s]


In [None]:
print(tf_matrix.shape, idf_matrix.shape, tf_idf_matrix.shape)

(367840, 91819) (367840, 91819) (367840, 91819)
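
The cell above stores one IDF value per non-zero entry and multiplies two matrices of the same shape. Since IDF depends only on the token, an alternative is to keep it as a single vector and scale the columns of the TF matrix. The following is a sketch of that idea on a toy matrix, not a drop-in replacement for the notebook's code; the matrix values are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Toy TF matrix: 3 documents x 4 tokens (values are illustrative).
tf = csr_matrix(np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.5],
]))

num_docs = tf.shape[0]
# Document frequency per token = number of stored non-zero entries in each column.
doc_freq = tf.getnnz(axis=0)
idf = np.log(num_docs / doc_freq)

# Right-multiplying by a diagonal matrix scales column j by idf[j].
tf_idf = tf @ diags(idf)
print(tf_idf.toarray())
```

This keeps one float per token instead of one per non-zero cell, which may matter at the scale of the full corpus.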


In [None]:
!kaggle competitions download -c nlp-nup-2024-hw1 -f queries_test.json

Downloading queries_test.json to /content
  0% 0.00/12.7k [00:00<?, ?B/s]
100% 12.7k/12.7k [00:00<00:00, 19.8MB/s]


In [None]:
# queries_test.json is downloaded uncompressed (see the output above), so no unzip step is needed.


In [None]:
with open('queries_test.json') as fp:
    qs_test = json.load(fp)

In [None]:
query_tf, query_rows, query_cols = [], [], []

# Query processing: same tokenization and stemming as for the documents
for i, query in tqdm(enumerate(qs_test)):
    tokens = [stemmer.stem(tok) for tok in tokenizer.tokenize(query['query'])]
    unique_tokens = Counter(tokens)

    for tok, count in unique_tokens.items():
        if tok not in filtered_tok:
            continue
        col = filtered_tok[tok]
        row = i
        tf_item = count
        query_rows.append(row)
        query_cols.append(col)
        query_tf.append(tf_item)

14it [00:00, 4056.95it/s]


In [None]:
#create a frequency matrix
query_tf_matrix = csr_matrix((query_tf, (query_rows, query_cols)), shape=(len(qs_test), num_tokens))

In [None]:
similarity_score = tf_idf_matrix.dot(query_tf_matrix.transpose())
similarity_score.shape

(367840, 14)
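
The scores above are raw dot products, so they are not bounded and are not true cosine similarities. A minimal sketch of the difference, using small dense arrays as stand-ins for the sparse matrices (the vectors and the helper `l2_normalize` are assumptions for illustration):

```python
import numpy as np

# Toy dense stand-ins for the TF-IDF matrix (2 docs) and query counts (1 query).
doc_vecs = np.array([
    [1.0, 1.0, 0.0],   # one document
    [3.0, 3.0, 0.0],   # same direction, three times the weight
])
query_vec = np.array([[1.0, 1.0, 0.0]])

# Unnormalized dot product, as in the cell above: larger vectors score higher.
dot_scores = doc_vecs @ query_vec.T


def l2_normalize(rows):
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return rows / norms


# Cosine similarity: dot product of unit-length vectors.
cos_scores = l2_normalize(doc_vecs) @ l2_normalize(query_vec).T
print(dot_scores.ravel(), cos_scores.ravel())
```

Whether normalization would improve the ranking on this dataset is not something the notebook tests; the sketch only shows what the two scoring variants compute.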

In [None]:
random_submission_items = []

for i, q in enumerate(qs_test):
    print(f'Generating scores for query {q["query_id"]}')
    for j, d in tqdm(enumerate(docs)):
        q_id = q['query_id']
        doc_id = d['id']
        random_submission_items.append({
            'id': f'{q_id}_{doc_id}',
            'query_id': q_id,
            'doc_id': doc_id,
            'score': similarity_score[j, i]
        })

Generating scores for query economics-1


367840it [00:13, 27442.17it/s]


Generating scores for query economics-2


367840it [00:11, 33142.62it/s]


Generating scores for query economics-3


367840it [00:11, 33406.97it/s]


Generating scores for query economics-4


367840it [00:10, 35720.20it/s]


Generating scores for query economics-6


367840it [00:09, 37597.16it/s]


Generating scores for query economics-8


367840it [00:10, 33757.35it/s]


Generating scores for query economics-12


367840it [00:11, 30798.79it/s]


Generating scores for query economics-13


367840it [00:11, 33302.31it/s]


Generating scores for query economics-17


367840it [00:09, 37702.55it/s]


Generating scores for query economics-18


367840it [00:10, 34140.58it/s]


Generating scores for query economics-19


367840it [00:11, 33266.46it/s]


Generating scores for query economics-20


367840it [00:11, 31048.89it/s]


Generating scores for query economics-21


367840it [00:09, 37353.02it/s]


Generating scores for query economics-23


367840it [00:10, 35155.98it/s]


In [None]:
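# Use a separate name so the document-frequency dict `df` built earlier is not overwritten.
submission_df = pd.DataFrame(random_submission_items)
submission_df.set_index('id', inplace=True)
submission_df.to_csv('submission.csv')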
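
The submission rows are built by indexing the score matrix one element at a time. A possible alternative, sketched here on a small dense array, is to pull out one whole column per query and zip it with the document ids. The ids and scores below are invented; with the sparse matrix from the notebook, the column could presumably be obtained with something like `similarity_score[:, i].toarray().ravel()`, which is an assumption that was not run against the real data.

```python
import numpy as np

# Toy stand-ins: 3 documents, 2 queries; scores[j, i] is document j vs. query i.
doc_ids = ["doc_a", "doc_b", "doc_c"]
query_ids = ["q1", "q2"]
scores = np.array([
    [0.2, 0.0],
    [0.0, 0.7],
    [0.5, 0.1],
])

rows = []
for i, q_id in enumerate(query_ids):
    column = scores[:, i]  # one array slice per query instead of per-element indexing
    rows.extend(
        {"id": f"{q_id}_{d_id}", "query_id": q_id, "doc_id": d_id, "score": float(s)}
        for d_id, s in zip(doc_ids, column)
    )

print(len(rows), rows[0])
```

The resulting list of dictionaries has the same keys as `random_submission_items`, so it could be passed to `pd.DataFrame` in the same way.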