<a href="https://colab.research.google.com/github/len-rtz/wir-2024-relevancers/blob/main/relevancers_baseline_system_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline System Team "Relevancers" TH Köln

The following first draft retrieval systems builds onto the baseline system from https://github.com/irgroup-classrooms/wir-2024

In [27]:
# Install required libraries
!pip3 install 'tira>=0.0.139' ir-datasets 'python-terrier==0.10.0'
!pip install transformers

# Import necessary libraries
import pyterrier as pt
import pandas as pd
from nltk.corpus import stopwords
import re
import nltk
from transformers import pipeline
from pyterrier import IterDictIndexer



In [28]:
# Create an API client to interact with the TIRA platform
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

In [29]:
# Load Dataset
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

# Data Cleaning & Preprocessing

In [30]:
# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Text preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Create Index

In [31]:
# Create indexer
indexer = IterDictIndexer(
    "../data/clean_index",
    meta={'docno': 50, 'text': 4096},
    overwrite=True
)

# Create clean document iterator
def clean_docs_iter():
    for doc in pt_dataset.get_corpus_iter():
        yield {'docno': doc['docno'], 'text': clean_text(doc['text'])}

# Build index
index = indexer.index(clean_docs_iter())

Download from Zenodo: https://zenodo.org/records/14254044/files/subsampled-ms-marco-deep-learning-20241201-training-inputs.zip


Download: 100%|██████████| 9.51M/9.51M [00:01<00:00, 7.65MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

11:02:28.845 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents


# Query Rewriting

In [37]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the paraphrasing model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality")
model = AutoModelForSeq2SeqLM.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality")

def paraphrase_query(query, num_beams=5, temperature=0.7):
    # Ensure query is a string and add proper prefix
    text = f"paraphrase: {str(query)}"

    # Encode text
    encoding = tokenizer.encode_plus(
        text,
        max_length=128,
        padding='longest',
        truncation=True,
        return_tensors="pt"
    )

    # Set random seed for reproducibility
    torch.manual_seed(42)

    # Generate paraphrased outputs
    outputs = model.generate(
        input_ids=encoding['input_ids'],
        attention_mask=encoding['attention_mask'],
        max_length=128,
        min_length=10,  # Ensure output is not too short
        num_beams=num_beams,
        num_return_sequences=3,  # Generate 3 different versions
        temperature=temperature,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
        repetition_penalty=2.5,  # Penalize repetition
        length_penalty=1.0,
        no_repeat_ngram_size=2
    )

    # Decode outputs
    paraphrases = []
    for output in outputs:
        paraphrased = tokenizer.decode(output, skip_special_tokens=True)
        # Remove the prefix if it appears in the output
        paraphrased = paraphrased.replace("paraphrase: ", "")
        paraphrases.append(paraphrased)

    return paraphrases

# Test the function
original_query = "how does climate change affect wildlife?"
paraphrased_queries = paraphrase_query(original_query)

print(f"Original Query: {original_query}")
print("\nParaphrased versions:")
for i, para in enumerate(paraphrased_queries, 1):
    print(f"{i}. {para}")

tokenizer_config.json:   0%|          | 0.00/1.86k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

Original Query: how does climate change affect wildlife?

Paraphrased versions:
1. paraphrasedoutput: how can climate change affect wildlife?
2. paraphrasedoutput: How can climate change affect wildlife?
3. paraphrasedoutput: how can climate change impact wildlife?


In [40]:
class T5QueryRewriter(pt.Transformer):
    def __init__(self, num_beams=5, temperature=0.7):
        super().__init__()
        self.num_beams = num_beams
        self.temperature = temperature

    def transform(self, topics):
        # Create a copy of the input topics
        new_topics = topics.copy()

        # Determine which column contains the query
        query_column = 'text' if 'text' in new_topics.columns else 'query'

        # Paraphrase each query
        new_topics[query_column] = new_topics[query_column].apply(
            lambda x: paraphrase_query(x, self.num_beams, self.temperature)
        )

        return new_topics

In [41]:
# Print some example rewrites to verify the transformer
def print_example_rewrites(topics, t5_rewriter, num_examples=3):
    print("\nExample Query Rewrites:")
    sample_topics = topics.head(num_examples)
    rewrites = t5_rewriter.transform(sample_topics)
    query_column = 'text' if 'text' in topics.columns else 'query'

    for idx, row in sample_topics.iterrows():
        print(f"\nOriginal: {row[query_column]}")
        print(f"Rewritten: {rewrites.loc[idx, query_column]}")

# Retrieve Topics

In [42]:
# Retrieve topics
topics = pt_dataset.get_topics('text')

# Retrieval

In [None]:
# Define retrieval models
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25_rm3 = bm25 >> pt.rewrite.RM3(index) >> bm25

# Create T5 rewriter pipeline
t5_rewriter = T5QueryRewriter(num_beams=5, temperature=0.7)

# Print some example rewrites first
print_example_rewrites(topics, t5_rewriter)

# Create the combined pipelines
bm25_t5 = t5_rewriter >> bm25
bm25_t5_rm3 = t5_rewriter >> bm25_rm3

# Evaluate all models
results = pt.Experiment(
    [bm25_rm3, bm25_t5, bm25_t5_rm3],
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25+RM3", "BM25+T5", "BM25+T5+RM3"]
)

print("\nEvaluation Results:")
print(results)


Example Query Rewrites:

Original: who is aziz hashim
Rewritten: ['paraphrasedoutput: who is aziz hashim?', 'paraphrasedoutput: Who is aziz hashim?', 'paraphrasedoutput: who is aziz Hashim hashim?']

Original: who is rep scalise
Rewritten: ['paraphrasedoutput: who is rep scalise?', 'paraphrasedoutput: Who is rep scalise?', 'paraphrasedoutput: who is rep. scalise?']

Original: who killed nicholas ii of russia
Rewritten: ['paraphrasedoutput: who killed nicholas ii of russia?', 'paraphrasedoutput: who murdered nicholas ii of russia?', 'paraphrasedoutput: Who killed nicholas ii of russia?']


# Upload to TIRA

In [None]:
import os
from tira.third_party_integrations import persist_and_normalize_run

# Define the directory path for saving runs
run_dir = '../data/runs'

# Create the directory if it does not exist
os.makedirs(run_dir, exist_ok=True)

# Assign the results to the 'run' variable
run = bm25_rm3(pt_dataset.get_topics('text'))

# Persist and normalize the run
persist_and_normalize_run(
    run,
    system_name='bm25+rm3-relevancers',
    default_output=run_dir,
    upload_to_tira=pt_dataset,
)

The run file is normalized outside the TIRA sandbox, I will store it at "../data/runs".
Done. run file is stored under "../data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/7017959a-ad66-4ba6-8ac7-f67c47775207
