# Text classification reranking (7-9)
### Jakub Łubkowski, Marcin Mikuła

7. Use the classifier as a re-ranker for finding the answers to the questions. Since the re-ranker is slow, you
   have to limit the subset of possible passages to the top-n (10, 50 or 100, depending on your GPU) texts returned by a much faster model, e.g. full-text search (FTS).
8. The scheme for re-ranking is as follows:
   - Find passage candidates using FTS, where the query is the question.
   - Take top-n results returned by FTS.
   - Use the model to classify all pairs, where the first sentence is the question (query) and the second sentence is
      the passage returned by the FTS.
   - Use the score returned by the model (i.e. the probability of the **positive** outcome) to re-rank the passages.
9. Compute how much the result of searching the passages improved over the results from lab 2. Use NDCG to compare the
   results.
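
The scheme above can be sketched independently of any particular search engine or model. In this minimal sketch, `rerank` and `toy_score` are hypothetical names, and the word-overlap scorer only stands in for the classifier's positive-class probability.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score the first `top_n` candidates against the query and sort by score."""
    shortlist = candidates[:top_n]  # candidates are assumed to be in FTS order
    scored = [(passage, score_fn(query, passage)) for passage in shortlist]
    # sorted() is stable, so ties keep their original FTS order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


fts_results = [
    "interest rates on loans",
    "how to open a savings account",
    "savings account interest explained",
]
ranked = rerank("savings account interest", fts_results, toy_score, top_n=3)
```

Passing the scorer in as a function keeps the ranking logic separate from the model, so the same function could be exercised with a cheap stand-in before the slow classifier is plugged in.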


In [12]:
import pandas as pd

# Load the corpus, queries and relevance judgments (qrels) from data/raw
corpus_df = pd.read_csv('data/raw/corpus.csv')
q_df = pd.read_csv('data/raw/queries.csv')
qa_df = pd.read_csv('data/raw/qrels.csv')


In [None]:
import elasticsearch

es = elasticsearch.Elasticsearch("http://localhost:9200")
es.info()

In [17]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from sklearn.metrics import ndcg_score

class BertReranker:
    def __init__(self, model_path="model", tokenizer_path="tokenizer"):
        if torch.backends.mps.is_available():
            self.device = torch.device("mps")
        elif torch.cuda.is_available():
            self.device = torch.device("cuda")
        else:
            self.device = torch.device("cpu")
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.model.eval()  # Set to evaluation mode

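    def get_score(self, query, passage):
        """Return the probability that `passage` answers `query`."""
        # Join the pair into one string around the separator token; this must
        # match how pairs were encoded when the model was fine-tuned.
        text_pair = f"{query} {self.tokenizer.sep_token} {passage}"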
        inputs = self.tokenizer(
            text_pair,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.softmax(outputs.logits, dim=1)
            positive_score = probabilities[0][1].item()  # Probability of positive class
        
        return positive_score
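
For a two-class head, the score used for re-ranking is the softmax of the two logits, read at index 1. The arithmetic can be checked without a model. `positive_probability` is a hypothetical helper, and treating index 1 as the positive class is the same assumption `get_score` makes.

```python
import math
from typing import Sequence


def positive_probability(logits: Sequence[float], positive_index: int = 1) -> float:
    """Softmax over the logits, returning the probability at `positive_index`."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    return exps[positive_index] / sum(exps)


# For two classes this equals the sigmoid of the logit difference.
p = positive_probability([0.0, 2.0])
sigmoid = 1.0 / (1.0 + math.exp(-2.0))
```

Because softmax is monotonic in the logit difference, ranking by this probability gives the same order as ranking by `logit[1] - logit[0]`.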


In [40]:

def calculate_ndcg_with_reranking(
        qa_dataset, 
        queries_dataset, 
        es, 
        index_name, 
        field, 
        reranker=None, 
        top_n=10
    ):
    ndcg_scores = []
    
    # Convert datasets to DataFrames if they aren't already
    qa_df = pd.DataFrame(qa_dataset) if not isinstance(qa_dataset, pd.DataFrame) else qa_dataset
    queries_df = pd.DataFrame(queries_dataset) if not isinstance(queries_dataset, pd.DataFrame) else queries_dataset
    
    for query_id in qa_df['query-id'].unique():
        # Find the query text
        query_row = queries_df[queries_df['_id'] == query_id]
        if query_row.empty:
            print(f"Warning: No query found for query_id {query_id}")
            continue
        
        query = query_row.iloc[0]['text_query']
        corpus_ids = qa_df[qa_df['query-id'] == query_id]['corpus-id']
        corpus_ids_set = {int(corpus_id) for corpus_id in corpus_ids}
        
        # Perform the search
        try:
            search_results = es.search(
                index=index_name,
                body={
                    "query": {
                        "match": {
                            field: query
                        }
                    },
                    "size": top_n  # Get top_n results for reranking
                }
            )
        except Exception as e:
            print(f"Error performing search for query_id {query_id}: {e}")
            continue

        hits = search_results['hits']['hits']
        
        if reranker:
            # Rerank the results
            reranked_hits = []
            for hit in hits:
                passage = hit['_source'][field]
                score = reranker.get_score(query, passage)
                reranked_hits.append((hit, score))
            
            # Sort by reranker score
            reranked_hits.sort(key=lambda x: x[1], reverse=True)
            hits = [hit[0] for hit in reranked_hits]

        # Extract document IDs from final results
        retrieved_ids = [int(hit['_source']['corpus_id']) for hit in hits]

        # Create relevance scores for retrieved documents
        relevance_scores = [1 if doc_id in corpus_ids_set else 0 for doc_id in retrieved_ids]

        # Create true relevance scores
        true_relevance = [1] * len(corpus_ids)

        # Pad both lists to ensure they have exactly 5 elements
        relevance_scores = (relevance_scores + [0] * 5)[:5]
        true_relevance = (true_relevance + [0] * 5)[:5]

        # Calculate NDCG@5
        ndcg = ndcg_score([true_relevance], [relevance_scores], k=5)
        ndcg_scores.append(ndcg)
    
    return sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0
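
For comparison, NDCG@k with binary relevance is conventionally computed from the ranked list itself: DCG sums `rel_i / log2(i + 1)` over the top-k positions, and the ideal DCG places all relevant documents first. The sketch below is a plain-Python illustration with hypothetical helper names (`dcg_at_k`, `ndcg_at_k`); it is not the computation that produced the scores reported in this notebook.

```python
import math
from typing import Iterable, List


def dcg_at_k(relevances: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions (1-indexed ranks)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(retrieved_ids: List[int], relevant_ids: Iterable[int], k: int = 5) -> float:
    """NDCG@k for binary relevance: 1 if a retrieved id is relevant, else 0."""
    relevant = set(relevant_ids)
    gains = [1 if doc_id in relevant else 0 for doc_id in retrieved_ids]
    # Ideal ranking: all relevant documents first, capped at k positions.
    ideal = [1] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Same documents retrieved, relevant ones ranked first vs. pushed down.
perfect = ndcg_at_k([7, 3, 9], relevant_ids=[7, 3], k=5)
swapped = ndcg_at_k([9, 7, 3], relevant_ids=[7, 3], k=5)
```

With this definition, re-ranking can change the score only by moving relevant documents to different positions within the top k, which is exactly the effect a re-ranker is meant to have.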

In [46]:
index_name = "fiqa_pl_corpus"

# Initialize reranker
reranker = BertReranker()

# Test setups
setups = [
    ("No synonyms, No lemmatization", "text_without_synonyms_no_lemma"),
    ("Synonyms, No lemmatization", "text_with_synonyms_no_lemma"),
    ("No synonyms, Lemmatization", "text_without_synonyms_with_lemma"),
    ("Synonyms, Lemmatization", "text_with_synonyms_with_lemma")
]

In [43]:
print("\nNDCG@5 scores without reranking:")
for setup_name, field in setups:
    print(f"{setup_name}:")
    avg_ndcg_score = calculate_ndcg_with_reranking(
        qa_df, q_df, es, index_name, field, reranker=None
    )
    print(f"NDCG@5: {avg_ndcg_score:.4f}")
    print()


NDCG@5 scores without reranking:
No synonyms, No lemmatization:
NDCG@5: 0.7637

Synonyms, No lemmatization:
NDCG@5: 0.7637

No synonyms, Lemmatization:
NDCG@5: 0.7683

Synonyms, Lemmatization:
NDCG@5: 0.7683



In [45]:
print("\nNDCG@5 scores with reranking:")
for setup_name, field in setups:
    print(f"{setup_name}:")
    avg_ndcg_score = calculate_ndcg_with_reranking(
        qa_df, q_df, es, index_name, field, reranker=reranker, top_n=3
    )
    print(f"NDCG@5: {avg_ndcg_score:.4f}")
    print() 


NDCG@5 scores with reranking:
No synonyms, No lemmatization:
NDCG@5: 0.7691

Synonyms, No lemmatization:
NDCG@5: 0.7691

No synonyms, Lemmatization:
NDCG@5: 0.7785

Synonyms, Lemmatization:
NDCG@5: 0.7785
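
The improvement asked for in step 9 follows directly from the two sets of scores printed above. The snippet below only does that arithmetic on the reported NDCG@5 values; the dictionary names are illustrative.

```python
# NDCG@5 values copied from the two outputs above.
baseline = {
    "No synonyms, No lemmatization": 0.7637,
    "Synonyms, No lemmatization": 0.7637,
    "No synonyms, Lemmatization": 0.7683,
    "Synonyms, Lemmatization": 0.7683,
}
reranked = {
    "No synonyms, No lemmatization": 0.7691,
    "Synonyms, No lemmatization": 0.7691,
    "No synonyms, Lemmatization": 0.7785,
    "Synonyms, Lemmatization": 0.7785,
}

improvements = {}
for setup, before in baseline.items():
    after = reranked[setup]
    absolute = after - before                # gain in NDCG@5 points
    relative = 100.0 * absolute / before     # gain as a percentage of the baseline
    improvements[setup] = (absolute, relative)
    print(f"{setup}: +{absolute:.4f} NDCG@5 ({relative:.2f}% relative)")
```

With these numbers, re-ranking adds about 0.005 NDCG@5 without lemmatization and about 0.010 with it. Note that the re-ranked run used `top_n=3` while the baseline used the default `top_n=10`, so the two runs do not start from the same candidate set.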



### 1. Do you think simpler methods, like a Bayesian bag-of-words model, would work for sentence-pair classification? Justify your answer.

A Bayesian bag-of-words model would likely be insufficient for sentence-pair classification because:

1. It loses word order and context, which are crucial for understanding relationships between sentences
2. It can't effectively capture semantic relationships between question-answer pairs
3. It struggles with paraphrasing and synonyms, which are common in Q&A scenarios

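While simpler and faster to train, it would likely perform worse than our transformer re-ranker, which raised NDCG@5 from 0.7637-0.7683 (FTS alone) to 0.7691-0.7785. That gain is modest, so the expectation rests mainly on the three limitations above rather than on a direct comparison.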
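
The first limitation is easy to demonstrate: two sentences with opposite meanings can have identical bag-of-words representations. A minimal sketch with made-up example sentences:

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; word order is discarded."""
    return Counter(text.lower().split())


a = "the bank lends money to the customer"
b = "the customer lends money to the bank"

same_bag = bag_of_words(a) == bag_of_words(b)  # identical word counts
same_text = a == b                             # but different sentences
```

Any classifier that sees only these counts must assign both sentences the same score.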


### 2. What hyper-parameters have you selected for the training? What resources (papers, tutorials) have you consulted to select these hyper-parameters?

Selected hyperparameters:
- Model: bert-base-polish-uncased-v1 (Polish BERT base model)
- Batch size: 16 (train and eval)
- Number of epochs: 3
- Learning rate: Dynamic with warmup
- Warmup steps: 500
- Weight decay: 0.01
- Evaluation strategy: Steps-based (every 500 steps)
- Max sequence length: 512 tokens

Resources consulted:
1. "Fine-tuning BERT for Text Classification" tutorial from Hugging Face
2. Polish BERT model documentation (dkleczek/bert-base-polish-uncased-v1)

The parameters were chosen based on common best practices for BERT fine-tuning, with batch size and learning rate adjusted for our hardware constraints and dataset size.
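
"Dynamic with warmup" can be made concrete as a multiplier applied to the peak learning rate. The sketch below assumes linear warmup followed by linear decay, which is the default scheduler of the Hugging Face `Trainer`; whether that default was used here is an assumption. The peak learning rate and the total number of steps are not stated above, so `total_steps=3000` is purely illustrative and `lr_multiplier` is a hypothetical helper.

```python
def lr_multiplier(step: int, warmup_steps: int = 500, total_steps: int = 3000) -> float:
    """Linear warmup from 0 to 1, then linear decay back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


# Multiplier at a few points of the (illustrative) 3000-step run.
schedule = {step: lr_multiplier(step) for step in (0, 250, 500, 1750, 3000)}
```

The effective learning rate at a given step would be the peak learning rate times this multiplier, so the first 500 steps ramp up gradually instead of starting at full strength.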


### 3. Think about the pros and cons of neural-network models with respect to natural language processing. Provide at least 2 pros and 2 cons.

Pros:
1. Semantic Understanding: Good at capturing complex relationships in text
2. Adaptability: Can handle variations in language (synonyms, paraphrasing) well

Cons:
1. Resource Heavy: Requires significant computational power and memory
2. Lack of Interpretability: Hard to understand why the model makes specific decisions, unlike rule-based systems
3. Overfitting Risk: Prone to overfitting on small datasets, so reliable performance requires large amounts of training data