# Experiment on NTCIR-17 Transfer Reranking Task with Train Dataset

This notebook shows how to train a BERT reranker using the train dataset of NTCIR-17 Transfer Task.

## Previous Step

- `preprocess-transfer1-train.ipynb`

## Requirement

- Java v11

## Path

In [60]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/train'
os.environ['RUN'] = '../runs/ntcir17-transfer/train'
os.environ['MODEL'] = '../models'
os.environ['VENDOR'] = '../vendors'

## Datasets

In [None]:
import sys
!{sys.executable} -m pip install -q -U ir_datasets

In [3]:
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('__file__')), '../datasets'))

In [4]:
import pandas as pd
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/train')

## GPU Checking

In [5]:
import torch
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

GPU 0: NVIDIA RTX A6000


## Create training dataset for BERT finetuning

- We split train topics into 70/13 (train/valid)
- Valid set will be used for inference too

In [None]:
queries = pd.DataFrame(dataset.queries_iter())
queries[0:5]

In [None]:
qrels = pd.DataFrame(dataset.qrels_iter())
# Relevance grades range from 0 to 2. Map grades 1 and 2 to 1 to get binary labels for fine-tuning
qrels['relevance'] = qrels['relevance'].apply(lambda x: 1 if x != 0 else 0)
qrels[0:5]

In [8]:
len(qrels)

261173

In [None]:
query_qrels = pd.merge(queries, qrels, how='right', on='query_id')
query_qrels[0:5]

In [10]:
len(query_qrels)

261173

In [None]:
docs = pd.DataFrame(dataset.docs_iter())
docs[0:5]

In [None]:
query_qrels_docs = pd.merge(query_qrels, docs, how='left', on='doc_id')
# Remove samples where the document text is empty
query_qrels_docs = query_qrels_docs[~query_qrels_docs['text_y'].isnull()]
query_qrels_docs[0:5]

In [13]:
len(query_qrels_docs)

261168

In [14]:
train_df = query_qrels_docs[(query_qrels_docs['query_id'] >= '0001') & (query_qrels_docs['query_id'] <= '0070')]
valid_df = query_qrels_docs[(query_qrels_docs['query_id'] >= '0071') & (query_qrels_docs['query_id'] <= '0083')]

In [15]:
len(train_df), len(valid_df)

(231320, 29848)
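
The split above relies on `query_id` being a zero-padded string, so that lexicographic comparison agrees with numeric order. A minimal sketch of that assumption follows. The ids are generated for illustration rather than read from the dataset, and `split_of` is a hypothetical helper.

```python
# Hypothetical ids for illustration; the real ones come from dataset.queries_iter().
query_ids = [f"{n:04d}" for n in range(1, 84)]  # '0001' ... '0083'

def split_of(query_id: str) -> str:
    """Assign a topic to 'train' or 'valid' using plain string comparison."""
    if '0001' <= query_id <= '0070':
        return 'train'
    if '0071' <= query_id <= '0083':
        return 'valid'
    return 'unused'

train_ids = [q for q in query_ids if split_of(q) == 'train']
valid_ids = [q for q in query_ids if split_of(q) == 'valid']
print(len(train_ids), len(valid_ids))

# Without zero padding the same comparison misorders ids,
# because '9' is compared with '7' character by character.
print('9' <= '70')
```

The row counts reported above are consistent with this split: 231320 + 29848 = 261168, the number of rows left after dropping empty documents.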

In [16]:
train_df = train_df.copy()
valid_df = valid_df.copy()
train_df.loc[:, 'input_text'] = '[CLS] ' + train_df['text_x'] + ' [SEP] ' + train_df['text_y'] + ' [SEP]'
valid_df.loc[:, 'input_text'] = '[CLS] ' + valid_df['text_x'] + ' [SEP] ' + valid_df['text_y'] + ' [SEP]'
train_df = train_df.drop(['text_x', 'text_y'], axis=1)
valid_df = valid_df.drop(['text_x', 'text_y'], axis=1)

In [17]:
# Shuffle the train and validation sets (fixed seed for reproducibility)
train_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)
valid_df = valid_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
train_df.head(2)

---
## Training/Validation

- This section shows how to finetune a BERT model as a reranker
- Jump to "Testing (Inference)" section if you're interested in testing a BERT reranker trained by the organisers

In [None]:
import sys
!{sys.executable} -m pip install -q -U fugashi ipadic transformers[torch] accelerate===0.20.1

In [20]:
from torch.utils.data import Dataset

In [21]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer):
        self.tokenizer = tokenizer
        self.text = dataframe['input_text'].to_list()
        self.labels = dataframe['relevance'].to_list()
        
    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        text = self.text[idx]
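        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # make truncation to max_length explicit
            return_token_type_ids=True
        )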
        return {
            'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
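
Each example is encoded to a fixed length of 512 tokens, so short inputs are padded and long ones are cut. The toy function below sketches that idea on plain lists of token ids. It is not the tokenizer's implementation: `pad_or_truncate` and `pad_id=0` are assumptions made for illustration, and the real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy fixed-length encoding: cut long inputs, pad short ones."""
    ids = list(token_ids)[:max_length]      # truncation
    attention_mask = [1] * len(ids)         # 1 marks real tokens
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad                 # padding
    attention_mask += [0] * n_pad           # 0 marks padding positions
    return ids, attention_mask

short_ids, short_mask = pad_or_truncate([2, 15, 27, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why `CustomDataset` returns it alongside `input_ids`.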

In [22]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
model = AutoModelForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese")

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

In [23]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [24]:
train_data = CustomDataset(train_df, tokenizer)
valid_data = CustomDataset(valid_df, tokenizer)

### Config

- Adjust `per_device_train_batch_size` to fit your GPU memory
- Validation runs every 1000 steps to show learning progress
- The model is saved at the end of each epoch (here, after the single training epoch)
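
As a rough check on the settings in this section, the number of optimizer steps in one epoch can be derived from the training-set size and the batch size. The sketch assumes a single GPU and no gradient accumulation, neither of which is set explicitly in the notebook. `n_devices` is an illustrative name, not a `TrainingArguments` field.

```python
import math

n_train = 231320      # len(train_df) reported earlier
batch_size = 32       # per_device_train_batch_size
n_devices = 1         # assumption: one GPU, as listed in the GPU check
eval_steps = 1000     # eval_steps in TrainingArguments

# One optimizer step consumes batch_size * n_devices examples.
steps_per_epoch = math.ceil(n_train / (batch_size * n_devices))

# Steps at which validation is triggered during the single epoch.
eval_points = list(range(eval_steps, steps_per_epoch + 1, eval_steps))
print(steps_per_epoch)
print(eval_points)
```

Under these assumptions one epoch is 7229 steps, so validation runs at steps 1000 through 7000, which matches the seven rows in the training log.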

In [25]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=os.getenv('MODEL') + '/bert_with_transfer_train',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=os.getenv('MODEL') + '/bert_with_transfer_train' + '/logs',
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
)

In [26]:
%%time

# Train model
trainer.train()

# Save the final model to the output directory
trainer.save_model()

# Free GPU memory
del train_data
torch.cuda.empty_cache()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Step,Training Loss,Validation Loss
1000,0.0895,0.030154
2000,0.0707,0.029405
3000,0.0658,0.030115
4000,0.0545,0.032583
5000,0.0514,0.029624
6000,0.0455,0.036807
7000,0.0452,0.031328


CPU times: user 1h 56min 52s, sys: 24.5 s, total: 1h 57min 16s
Wall time: 1h 57min 34s


---
## Testing (Inference)

- This section shows how to use the trained model for reranking.
- If you're downloading a BERT reranker provided by the organisers, save the files to the `../models/bert_with_transfer_train` folder (the path built from `MODEL`).
- If you've trained the model yourself, the files should already be in that folder.

### Load the saved model

In [27]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Initialize model and tokenizer
model_path = os.getenv('MODEL') + '/bert_with_transfer_train'  # replace this with your model's path
model = AutoModelForSequenceClassification.from_pretrained(model_path).to('cuda')  # if using GPU
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

### Prediction

In [29]:
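# Evaluation using Trainer.predict(), which runs inference in batches
# Note: `trainer` was created in the Training/Validation section and wraps the model trained there;
# if you skipped training, build a Trainer around the model loaded above first
eval_predictions = trainer.predict(valid_data)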



In [33]:
# Obtain probabilities of all labels
logits = torch.tensor(eval_predictions.predictions)
probabilities = torch.softmax(logits, dim=-1)
prediction_df = pd.DataFrame(probabilities.tolist())
prediction_df

Unnamed: 0,0,1
0,0.999687,0.000313
1,0.999387,0.000613
2,0.998767,0.001233
3,0.999795,0.000205
4,0.998834,0.001166
...,...,...
29843,0.999726,0.000274
29844,0.998588,0.001412
29845,0.999769,0.000231
29846,0.999804,0.000196


## Reranking

### Prepare input data

In [36]:
top1000 = pd.DataFrame(dataset.scoreddocs_iter())
top1000_val = top1000[(top1000['query_id'] >= '0071') & (top1000['query_id'] <= '0083')]
len(top1000_val)

12038

In [77]:
top1000_val.head(5)

Unnamed: 0,query_id,doc_id,score
65437,71,gakkai-0000306411,32.141903
65438,71,gakkai-0000307932,23.646512
65439,71,gakkai-0000174085,21.992958
65440,71,gakkai-0000280722,21.573175
65441,71,gakkai-0000297523,20.313225


In [38]:
top1000_val_query = pd.merge(top1000_val, queries, on='query_id', how='left')
len(top1000_val_query)

12038

In [None]:
top1000_val_query.head(5)

In [40]:
top1000_val_query_doc = pd.merge(top1000_val_query, docs, on='doc_id', how='left')
len(top1000_val_query_doc)

12038

In [None]:
top1000_val_query_doc.head(5)

In [99]:
top1000_val_data = top1000_val_query_doc.copy()
top1000_val_data.loc[:, 'input_text'] = '[CLS] ' + top1000_val_data['text_x'] + ' [SEP] ' + top1000_val_data['text_y'] + ' [SEP]'
top1000_val_data = top1000_val_data.drop(['text_x', 'text_y'], axis=1)
top1000_val_data['relevance'] = 0 # Dummy data
top1000_val_data.insert(1, 'Q0', 'Q0')
top1000_val_data.insert(5, 'Run_ID', 'BM25') # Use your own run id
top1000_val_data['rank'] = top1000_val_data.groupby('query_id')['score'].rank(method='first', ascending=False) - 1
top1000_val_data['rank'] = top1000_val_data['rank'].astype(int)
len(top1000_val_data)

12038

In [None]:
top1000_val_data.head(5)

In [75]:
test_data = CustomDataset(top1000_val_data, tokenizer)

### Inference

- The finetuned reranker estimates the probability of each label (0 or 1) for every new `input_text`
- We use the probability of label 1 as the score of each ranked document
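
To make the scoring step concrete, here is a small sketch of turning a pair of logits into a label-1 probability with a softmax. The logits are made-up numbers, not model outputs, and `softmax` is a plain-Python stand-in for `torch.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three query-document pairs: [label 0, label 1].
example_logits = [[-1.2, 0.9], [2.5, -3.0], [0.0, 0.0]]

# The reranking score is the probability assigned to label 1.
scores = [softmax(pair)[1] for pair in example_logits]
print(scores)
```

With only two labels, this probability is a monotonic function of the difference between the two logits, so ranking by it is equivalent to ranking by `logit_1 - logit_0`.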

In [76]:
%%time
eval_predictions = trainer.predict(test_data)



CPU times: user 1min 45s, sys: 108 ms, total: 1min 45s
Wall time: 1min 45s


In [120]:
logits = torch.tensor(eval_predictions.predictions)
probabilities = torch.softmax(logits, dim=-1)
prediction_df = pd.DataFrame(probabilities.tolist())
prediction_df

Unnamed: 0,0,1
0,0.285297,0.714703
1,0.756743,0.243257
2,0.995781,0.004219
3,0.973803,0.026197
4,0.793503,0.206497
...,...,...
12033,0.999472,0.000528
12034,0.999775,0.000225
12035,0.998360,0.001640
12036,0.999361,0.000639


In [None]:
# Create a new DF for reranked results
top1000_val_rerank_data = top1000_val_data.copy()
top1000_val_rerank_data['score'] = pd.DataFrame(prediction_df)[1] # Replace original scores with label 1 prob
top1000_val_rerank_data.head(5)

### Generate trec_eval format

#### BM25 Ranker

In [100]:
top1000_val_data = top1000_val_data.drop('input_text', axis=1)
top1000_val_data = top1000_val_data.drop('relevance', axis=1)
# Change the order of fields
cols = top1000_val_data.columns.tolist()
col_index = cols.index('rank') - 2
cols.insert(col_index, cols.pop(cols.index('rank')))
top1000_val_data = top1000_val_data[cols]
len(top1000_val_data)

12038

In [101]:
top1000_val_data.head(5)

Unnamed: 0,query_id,Q0,doc_id,rank,score,Run_ID
0,71,Q0,gakkai-0000306411,0,32.141903,BM25
1,71,Q0,gakkai-0000307932,1,23.646512,BM25
2,71,Q0,gakkai-0000174085,2,21.992958,BM25
3,71,Q0,gakkai-0000280722,3,21.573175,BM25
4,71,Q0,gakkai-0000297523,4,20.313225,BM25
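
The column order above (`query_id`, `Q0`, `doc_id`, `rank`, `score`, `Run_ID`) follows the usual TREC run layout. The notebook passes the DataFrames straight to PyTerrier, so no file is written. If a run file is needed for an external tool, a sketch along these lines could produce the lines. The rows are illustrative values modelled on the BM25 table, and `to_trec_lines` is a hypothetical helper.

```python
# Hypothetical rows: (query_id, doc_id, rank, score).
rows = [
    ('0071', 'gakkai-0000306411', 0, 32.141903),
    ('0071', 'gakkai-0000307932', 1, 23.646512),
]

def to_trec_lines(rows, run_id):
    """Format rows as 'qid Q0 docid rank score run_id', one result per line."""
    return [
        f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_id}"
        for qid, doc_id, rank, score in rows
    ]

lines = to_trec_lines(rows, run_id='BM25')
print('\n'.join(lines))
```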


#### BERT Reranker

In [91]:
# Add a 'rank' column within each 'query_id'; ties are broken by order of appearance
top1000_val_rerank_data['rank'] = top1000_val_rerank_data.groupby('query_id')['score'].rank(method='first', ascending=False) - 1
top1000_val_rerank_data['rank'] = top1000_val_rerank_data['rank'].astype(int)
# Sort the DataFrame by 'query_id', then by descending 'score'
top1000_val_rerank_data = top1000_val_rerank_data.sort_values(['query_id', 'score'], ascending=[True, False])
# Drop unnecessary fields
top1000_val_rerank_data = top1000_val_rerank_data.drop('input_text', axis=1)
top1000_val_rerank_data = top1000_val_rerank_data.drop('relevance', axis=1)
# Change Run ID
top1000_val_rerank_data['Run_ID'] = 'BERT_Reranker' # Use your own run id
# Change the order of fields
cols = top1000_val_rerank_data.columns.tolist()
col_index = cols.index('rank') - 2
cols.insert(col_index, cols.pop(cols.index('rank')))
top1000_val_rerank_data = top1000_val_rerank_data[cols]
# Reset index
top1000_val_rerank_data = top1000_val_rerank_data.reset_index(drop=True)
len(top1000_val_rerank_data)

12038

In [92]:
top1000_val_rerank_data.head(10)

Unnamed: 0,query_id,Q0,doc_id,rank,score,Run_ID
0,71,Q0,gakkai-0000306411,0,0.714703,BERT_Reranker
1,71,Q0,gakkai-0000010792,1,0.361807,BERT_Reranker
2,71,Q0,gakkai-0000307932,2,0.243257,BERT_Reranker
3,71,Q0,gakkai-0000297523,3,0.206497,BERT_Reranker
4,71,Q0,gakkai-0000037173,4,0.164665,BERT_Reranker
5,71,Q0,gakkai-0000313261,5,0.15433,BERT_Reranker
6,71,Q0,gakkai-0000307724,6,0.144163,BERT_Reranker
7,71,Q0,gakkai-0000297686,7,0.140508,BERT_Reranker
8,71,Q0,gakkai-0000306410,8,0.128755,BERT_Reranker
9,71,Q0,gakkai-0000297687,9,0.118234,BERT_Reranker
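
The reranking step boils down to replacing each score and re-ranking within a query. A plain-Python sketch of that logic is shown below, using made-up documents and scores. `rerank` is a hypothetical helper, written to mirror the 0-based ranks and first-come tie-breaking of `rank(method='first')`.

```python
from collections import defaultdict

# Hypothetical (query_id, doc_id, new_score) triples in their original order.
scored = [
    ('0071', 'doc-a', 0.20),
    ('0071', 'doc-b', 0.71),
    ('0071', 'doc-c', 0.20),   # ties with doc-a
    ('0072', 'doc-d', 0.05),
]

def rerank(scored):
    """Return (query_id, doc_id, rank, score), sorted by query then score descending."""
    by_query = defaultdict(list)
    for qid, doc_id, score in scored:
        by_query[qid].append((doc_id, score))
    result = []
    for qid in sorted(by_query):
        # sorted() is stable, so tied scores keep their original order.
        ordered = sorted(by_query[qid], key=lambda pair: -pair[1])
        for rank, (doc_id, score) in enumerate(ordered):  # ranks start at 0
            result.append((qid, doc_id, rank, score))
    return result

reranked = rerank(scored)
for row in reranked:
    print(row)
```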


## Experiment

### PyTerrier

In [None]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

In [None]:
import sys
!{sys.executable} -m pip install -q python-terrier

In [66]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/jovyan/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /home/jovyan/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [67]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/train')

In [108]:
topics_val = dataset_pt.get_topics()
topics_val = topics_val[(topics_val['qid'] >= '0071') & (topics_val['qid'] <= '0083')]
topics_val

Unnamed: 0,qid,query
70,71,
71,72,
72,73,
73,74,
74,75,
75,76,
76,77,
77,78,
78,79,
79,80,


In [109]:
from pyterrier.measures import *
pt.Experiment(
    [top1000_val_data, top1000_val_rerank_data],
    topics=topics_val,
    qrels=dataset_pt.get_qrels(),
    filter_by_topics=True,
    names=["BM25", "BERT Reranker"],
    eval_metrics=[nDCG@1, nDCG@5, nDCG@10, nDCG@20, MRR]
)

Unnamed: 0,name,nDCG@1,nDCG@5,nDCG@10,nDCG@20,RR
0,BM25,0.153846,0.172408,0.18142,0.25911,0.340364
1,BERT Reranker,0.307692,0.217439,0.231301,0.267536,0.421571
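
For readers unfamiliar with nDCG@k, the sketch below shows one common formulation: a linear gain with a log2 rank discount, normalised by the score of an ideal ordering. Whether PyTerrier uses exactly this gain function is an assumption here, and the relevance grades are invented for illustration. Note that the experiment above evaluates against the original graded qrels (0 to 2), not the binarised labels used for fine-tuning.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ordering."""
    ideal = sorted(all_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical grades (0-2) of the documents a system returned, in rank order.
ranked = [0, 2, 1, 0, 0]
# Hypothetical grades of every judged document for the same query.
judged = [2, 1, 1, 0, 0, 0]

print(ndcg_at_k(ranked, judged, k=5))
```

A ranking that places the judged documents in descending grade order scores 1.0, and pushing relevant documents further down lowers the value.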


In [112]:
from pyterrier.measures import *
performance = pt.Experiment(
    [top1000_val_data, top1000_val_rerank_data],
    topics=topics_val,
    qrels=dataset_pt.get_qrels(),
    filter_by_topics=True,
    names=["BM25", "BERT Reranker"],
    eval_metrics=[MRR],
    perquery=True
)

In [119]:
performance.pivot(index='name', columns='qid', values='value').sort_index(ascending=False)

name,0071,0072,0073,0074,0075,0076,0077,0078,0079,0080,0081,0082,0083
BM25,1.0,0.090909,0.25,0.052632,0.034483,0.142857,0.2,1.0,0.5,0.5,0.076923,0.5,0.076923
BERT Reranker,1.0,0.166667,1.0,0.011628,0.0625,1.0,0.142857,0.333333,1.0,0.2,0.004608,0.5,0.058824
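
The per-query values can be tied back to the aggregate MRR. The sketch below defines reciprocal rank on a toy ranking, then averages the per-query values copied from the table above. `reciprocal_rank` is an illustrative helper, not a PyTerrier function.

```python
def reciprocal_rank(ranked_relevances):
    """1 / position of the first relevant document (positions start at 1), else 0."""
    for position, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0

# A first relevant hit at position 11 gives the 0.090909 seen for topic 0072 (BM25).
print(round(reciprocal_rank([0] * 10 + [1]), 6))

# Per-query RR values copied from the table above (topics 0071-0083).
rr_bm25 = [1.0, 0.090909, 0.25, 0.052632, 0.034483, 0.142857, 0.2,
           1.0, 0.5, 0.5, 0.076923, 0.5, 0.076923]
rr_bert = [1.0, 0.166667, 1.0, 0.011628, 0.0625, 1.0, 0.142857,
           0.333333, 1.0, 0.2, 0.004608, 0.5, 0.058824]

mrr_bm25 = sum(rr_bm25) / len(rr_bm25)
mrr_bert = sum(rr_bert) / len(rr_bert)

# Count topics where the reranker does better or worse than BM25.
wins = sum(b > a for a, b in zip(rr_bm25, rr_bert))
losses = sum(b < a for a, b in zip(rr_bm25, rr_bert))
print(round(mrr_bm25, 6), round(mrr_bert, 6), wins, losses)
```

The means reproduce the RR column of the aggregate table (0.340364 and 0.421571). On individual topics the reranker improves 5, worsens 6, and ties 2, so the gain in MRR comes from a few large improvements rather than a uniform lift.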


---
## Where can we go from here?

- Increase epochs
- Try different transfer models to finetune
- Use the entire train set for training
- Use external data for finetuning (before/after the train dataset)