## Build Dataset
First, we need to build datasets for training, validating and testing a neural ranking model.

For training, we use the qrel file of an evaluation campaign (e.g., TREC Robust04, NTCIR WWW), which contains the relevance judgements for query-document pairs. Since the qrel file only contains query ids and doc ids, we need to extract the query text from the topic file and the document text from the corpus.
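
To make the qrel structure concrete, here is a minimal sketch of a parser, assuming the standard four-column TREC layout (`query_id iteration doc_id relevance`). The function name `parse_qrels` and the sample lines are made up for illustration; they are not part of `build_datapack.py`.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrel lines into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample lines for illustration; not taken from the real Robust04 qrels.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "302 0 DOC-C 2",
]
qrels = parse_qrels(sample)
```

The result only maps ids to labels, which is why the topic file and the index are still needed to recover the actual text.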

For testing and validation, we usually use the neural ranker to rerank the top k documents retrieved by a simple retrieval model (e.g., BM25), since a neural ranker is too slow to compute a relevance score for every document in a large corpus (e.g., ClueWeb12).
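
The reranking step described above can be sketched as follows. This is a toy illustration under stated assumptions: `rerank` and `overlap_score` are hypothetical names, and the term-overlap scorer merely stands in for a neural model; the candidate list is assumed to be already ordered by the first-stage retriever.

```python
def rerank(query, candidates, score_fn, k=1000):
    """Rescore the top-k first-stage candidates and sort them by the new score."""
    top_k = candidates[:k]  # candidates are assumed to be in first-stage (e.g., BM25) order
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in top_k]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query, text):
    """Toy stand-in for a neural ranker: counts shared lowercase terms."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up candidates for illustration.
candidates = [
    ("d1", "weather report for today"),
    ("d2", "neural ranking models for search"),
    ("d3", "ranking neural networks"),
]
reranked = rerank("neural ranking models", candidates, overlap_score, k=2)
```

With `k=2`, only `d1` and `d2` are rescored, so the expensive scorer never sees `d3`; this is the cost saving that motivates reranking.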

On the Fulvus server, the Robust04 corpus is indexed by Anserini with the `-transformed` option (for more details, see https://github.com/castorini/anserini).

The path of the Robust04 index is `/ir/index/lucene-index.robust04.pos+docvectors+rawdocs+transformed`.

Then build the Robust04 datasets using the `build_datapack.py` script:

```shell
cd neural_ranking

python scripts/utils/build_datapack.py \
    --index /ir/index/lucene-index.robust04.pos+docvectors+rawdocs+transformed \
    --topic ./resources/topics_and_qrels/topics.robust04.301-450.601-700.txt \
    --qrel ./resources/topics_and_qrels/qrels.robust2004.txt \
    --output built_data/robust04
```

You will then have the Robust04 dataset for training and validation.

In [None]:
%load_ext autoreload
%autoreload 2

## Train 

In [2]:
import sys
import os
sys.path.append(os.path.abspath("../../"))

import neural_ranking
import matchzoo as mz
import copy
import torch

from neural_ranking.runners.dataset import ReRankDataset
from neural_ranking.runners.utils import ReRankTrainer

In [3]:
# Define Loss and Metrics

ranking_task = mz.tasks.Ranking(mz.losses.RankHingeLoss())
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=20),
    mz.metrics.Precision(k=30),
]
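
For readers unfamiliar with the two metrics, the sketch below shows one common formulation of precision@k and NDCG@k over a ranked list of relevance labels. It is an assumption-laden illustration: the helper names are hypothetical, and MatchZoo's own implementations may use a different gain, discount, or threshold convention.

```python
import math


def precision_at_k(relevances, k, threshold=0):
    """Fraction of the top-k positions whose label exceeds `threshold`."""
    return sum(1 for rel in relevances[:k] if rel > threshold) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up labels, listed in the order the model ranked the documents.
ranked_labels = [1, 0, 1, 0]
p_at_2 = precision_at_k(ranked_labels, k=2)
ndcg_at_4 = ndcg_at_k(ranked_labels, k=4)
```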

In [4]:
dataset = ReRankDataset("robust04", rerank_hits=1000, debug_mode=True) # debug mode will only load 100 docs from the dataset
dataset.init_topic_splits(dev_ratio=0.2, test_ratio=0, seed=2020) # split data into train and dev randomly
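
As a rough picture of what a seeded, topic-level split involves, here is a minimal sketch. It is not the implementation of `init_topic_splits`; the function `split_topics` and its rounding behaviour are assumptions made for illustration.

```python
import random


def split_topics(topic_ids, dev_ratio=0.2, test_ratio=0.0, seed=2020):
    """Shuffle topic ids with a fixed seed and cut them into train/dev/test."""
    rng = random.Random(seed)      # local generator, so global random state is untouched
    shuffled = sorted(topic_ids)   # sort first so the result does not depend on input order
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Ten made-up topic ids for illustration.
topic_ids = [str(i) for i in range(301, 311)]
train_ids, dev_ids, test_ids = split_topics(topic_ids)
```

Splitting by topic rather than by query-document pair keeps all documents of one query in the same split, so dev scores are measured on unseen queries.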

In [5]:
# Auto-fill missing configs with default settings
model, preprocessor, dataset_builder, dataloader_builder = mz.auto.prepare(
    task=ranking_task,
    model_class=mz.models.Bert,
    data_pack=dataset.pack,
    embedding=None,  # Bert does not need embedding
    preprocessor=mz.models.Bert.get_default_preprocessor(),
)

In [6]:
# Preprocess query and document using BertPreprocessor
dataset.apply_preprocessor(preprocessor)

Multi-Core Processing text_left with bert_encode: 100%|██████████| 250/250 [00:00<00:00, 3130.26it/s]
Multi-Core Processing text_right with bert_encode: 100%|██████████| 1000/1000 [00:02<00:00, 462.53it/s]
Processing text_left with Chain Transform of TruncatedLength: 100%|██████████| 230/230 [00:00<00:00, 258699.36it/s]
Processing text_right with Chain Transform of TruncatedLength: 100%|██████████| 992/992 [00:00<00:00, 501174.36it/s]
Multi-Core Processing length_left with len: 100%|██████████| 250/250 [00:00<00:00, 44987.82it/s]
Multi-Core Processing length_right with len: 100%|██████████| 1000/1000 [00:00<00:00, 24957.48it/s]
Multi-Core Processing text_left with bert_encode: 100%|██████████| 250/250 [00:00<00:00, 3089.59it/s]
Multi-Core Processing text_right with bert_encode: 100%|██████████| 500/500 [00:01<00:00, 431.50it/s]
Processing text_left with Chain Transform of TruncatedLength: 100%|██████████| 213/213 [00:00<00:00, 158767.86it/s]
Processing text_right with Chain Transform o

We employ a pairwise loss function (hinge loss), which requires sampling a positive example and a negative example for the same query from the training data, so we need a data loader to handle the sampling and batching.
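
The pairwise hinge loss itself can be sketched in a few lines of plain Python. This is an illustration of the general formula `max(0, margin - s_pos + s_neg)` averaged over pairs, with a hypothetical helper name and an assumed margin of 1.0; `mz.losses.RankHingeLoss` may differ in details such as the number of negatives per positive.

```python
def rank_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - s_pos + s_neg) over (positive, negative) pairs."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("each positive score needs exactly one negative score")
    losses = [
        max(0.0, margin - pos + neg)
        for pos, neg in zip(pos_scores, neg_scores)
    ]
    return sum(losses) / len(losses)


# Made-up model scores: the first pair is ranked correctly with enough margin,
# the second pair scores the negative document above the positive one.
loss = rank_hinge_loss(pos_scores=[2.0, 0.5], neg_scores=[0.0, 1.0])
```

A pair contributes zero loss once the positive document outscores the negative one by at least the margin, so training effort concentrates on pairs that are still misordered or too close.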

In [7]:
# Set up dataset and dataloader. For example, we need to sample positive and negative examples for training, but not for evaluation (test/dev).

def get_dataloaders(dataset, dataset_builder, dataloader_builder, batch_size=2):
    training_pack = dataset.train_pack_processed
    # Setup data
    trainset = dataset_builder.build(
        training_pack,
        batch_size=batch_size,
        sort=False,
    )
    train_loader = dataloader_builder.build(trainset)

    eval_dataset_kwargs = copy.copy(dataset_builder._kwargs)
    eval_dataset_kwargs["batch_size"] = batch_size * 2
    eval_dataset_kwargs["shuffle"] = False
    eval_dataset_kwargs["sort"] = False
    eval_dataset_kwargs["resample"] = False
    eval_dataset_kwargs["mode"] = "point"

    eval_dataset_builder = mz.dataloader.DatasetBuilder(
        **eval_dataset_kwargs,
    )
    
    dev_dataset = eval_dataset_builder.build(dataset.dev_pack_processed)
    dev_loader = dataloader_builder.build(dataset=dev_dataset, stage="dev")
    return train_loader, dev_loader

train_loader, dev_loader = get_dataloaders(dataset, dataset_builder, dataloader_builder)

In [8]:
optimizer = torch.optim.AdamW(model.parameters())
trainer = ReRankTrainer(
            model=model,
            optimizer=optimizer,
            trainloader=train_loader,
            validloader=dev_loader,
            epochs=3,
            patience=2,
            device="cuda",
            save_dir="checkpoint",
            fp16=False,
            clip_norm=5,
            batch_accumulation=2)

In [9]:
trainer.run()

[Iter-25 Loss-0.974]:
  Validation: normalized_discounted_cumulative_gain@20(0.0): 0.0306 - precision@30(0.0): 0.0018



[Iter-50 Loss-1.026]:
  Validation: normalized_discounted_cumulative_gain@20(0.0): 0.0441 - precision@30(0.0): 0.0018



[Iter-75 Loss-1.090]:
  Validation: normalized_discounted_cumulative_gain@20(0.0): 0.0306 - precision@30(0.0): 0.0018

Cost time: 19.354414463043213s
