# INFO 4271 - Exercise 4 - Statistical Ranking

Issued: May 7, 2024

Due: May 13, 2024

Please submit the completed sheet via Ilias by the due date.

---

# 1. Generative Relevance Models
Generative retrieval models use the probabilistic language model framework for matching queries and documents.

a) Implement the `rank()` function sketched below. In class, we discussed two alternative model variants. Choose the query likelihood model.
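
For reference, a sketch of the scoring rule behind the query likelihood model, assuming a unigram document model estimated by maximum likelihood. The symbols $M_d$ (language model of document $d$), $\mathrm{tf}_{t,d}$ (count of term $t$ in $d$) and $|d|$ (document length in tokens) are notation chosen here, not taken from the sheet:

$$
P(Q \mid M_d) = \prod_{t \in Q} P(t \mid M_d), \qquad P(t \mid M_d) = \frac{\mathrm{tf}_{t,d}}{|d|}
$$

Documents are then ranked by decreasing $P(Q \mid M_d)$.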

In [7]:
from collections import Counter

# Rank a collection of documents relative to a query using the query likelihood model
def rank(Q, D):
    query_terms = Q.lower().split()

    scored_documents = []
    for doc in D:
        doc_terms = doc.lower().split()
        # Maximum-likelihood unigram model: P(t|d) = tf(t, d) / |d|
        counts = Counter(doc_terms)
        query_likelihood = 1.0
        for term in query_terms:
            if term in counts:
                query_likelihood *= counts[term] / len(doc_terms)
            else:
                # Crude smoothing: a small constant stands in for unseen terms
                query_likelihood *= 1e-6
        scored_documents.append((doc, query_likelihood))

    # Most likely document first
    return sorted(scored_documents, key=lambda pair: pair[1], reverse=True)

Q = 'french bulldog'
D = ['the french revolution was a period of upheaval in france',
     'the french bulldog is a small breed of domestic dog',
     'french is a very french language spoken by the french']

for position, (doc, score) in enumerate(rank(Q, D), start=1):
    print(f'{position}. {doc}')

1. the french bulldog is a small breed of domestic dog
2. french is a very french language spoken by the french
3. the french revolution was a period of upheaval in france


b) Probabilistic language models may encounter previously unseen query terms. Explain why this can become problematic and how you would address the issue. 

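The query likelihood is a product of per-term probabilities. If a query term does not occur in a document, its maximum-likelihood probability is 0, so the probability of the whole query becomes 0 regardless of how well the remaining terms match. Smoothing prevents this by giving unseen terms a small non-zero probability, for example by adding a small constant to every term count or by interpolating the document model with a collection-wide language model.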
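
As an alternative to a fixed constant, the following is a minimal sketch of linear interpolation (Jelinek-Mercer) smoothing, which mixes each document model with a collection-wide model. The function name `rank_smoothed` and the mixing weight `lam=0.5` are assumptions made for illustration; the weight would normally be tuned.

```python
from collections import Counter

def rank_smoothed(query, docs, lam=0.5):
    """Query likelihood with linear interpolation (Jelinek-Mercer) smoothing.

    lam weights the document model; (1 - lam) weights the collection model.
    """
    tokenized = [doc.lower().split() for doc in docs]
    collection_counts = Counter(term for terms in tokenized for term in terms)
    collection_length = sum(collection_counts.values())

    scores = []
    for index, terms in enumerate(tokenized):
        doc_counts = Counter(terms)
        likelihood = 1.0
        for term in query.lower().split():
            p_doc = doc_counts[term] / len(terms) if terms else 0.0
            p_collection = collection_counts[term] / collection_length
            likelihood *= lam * p_doc + (1 - lam) * p_collection
        scores.append((index, likelihood))
    # Most likely document first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = ['the french revolution was a period of upheaval in france',
        'the french bulldog is a small breed of domestic dog',
        'french is a very french language spoken by the french']
ranking = rank_smoothed('french bulldog', docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

Every document now receives a non-zero score because "bulldog" occurs somewhere in the collection. A term that is missing from the entire collection would still yield 0 under this sketch and needs separate handling.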

# 2. Relevance Feedback
Relevance Feedback allows us to refine the query representation after a round of user interaction with the search results. If organic feedback is not available, we can assume highly ranked documents to be *pseudo* relevant. Discuss the advantages and disadvantages of the pseudo relevance feedback scheme. Think in particular about single versus multiple rounds of feedback.

Disadvantages:
- Topic drift: The query is expanded with terms from the top-ranked documents, which are only assumed to be relevant. If these documents cover a neighbouring topic, the expanded query retrieves documents that no longer match the topic of the original query.


## Pseudo Relevance Feedback (PRF)
Pseudo relevance feedback is a technique used in information retrieval systems to improve retrieval performance without requiring extensive user interaction. Here's how it works:

1. **Initial Retrieval**:
   - The system performs an initial retrieval using the user's query to find a set of top-ranked documents.
   - These top-ranked documents are assumed to be relevant, even though they haven't been explicitly labeled as such by the user.

2. **Query Modification**:
   - Based on the assumption that the top-ranked documents contain relevant information, the system modifies the original query.
   - The modified query aims to capture the underlying concepts present in the relevant documents.

3. **Second Retrieval**:
   - The modified query is then used for a subsequent retrieval, and its results are shown to the user.
   - No explicit relevance judgements from the user are required at any point.
   - The procedure can be repeated on the new result list to refine the query representation further.
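
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the scoring function (summed term frequency), the tiny stopword list, the alphabetical tie-break, the parameters `top_docs` and `new_terms`, and the fourth document are all assumptions made for the example.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption, not a standard resource)
STOPWORDS = {'the', 'a', 'is', 'of', 'in', 'by', 'was', 'very', 'this'}

def retrieve(query_terms, docs):
    """Rank document indices by summed term frequency of the query terms."""
    scored = []
    for index, doc in enumerate(docs):
        counts = Counter(doc.lower().split())
        scored.append((sum(counts[t] for t in query_terms), index))
    # Higher score first; ties are broken by document index
    return [index for score, index in sorted(scored, key=lambda p: (-p[0], p[1]))]

def pseudo_relevance_feedback(query, docs, top_docs=1, new_terms=2):
    query_terms = query.lower().split()

    # Step 1: initial retrieval; the top documents are assumed relevant
    initial = retrieve(query_terms, docs)

    # Step 2: expand the query with frequent terms from those documents
    candidates = Counter()
    for index in initial[:top_docs]:
        candidates.update(t for t in docs[index].lower().split()
                          if t not in STOPWORDS and t not in query_terms)
    # Most frequent first; alphabetical order breaks ties deterministically
    best = sorted(candidates.items(), key=lambda p: (-p[1], p[0]))[:new_terms]
    expanded = query_terms + [term for term, _ in best]

    # Step 3: second retrieval with the modified query
    return initial, expanded, retrieve(expanded, docs)

docs = ['the french revolution was a period of upheaval in france',
        'the french bulldog is a small breed of domestic dog',
        'french is a very french language spoken by the french',
        'this dog breed needs little exercise']  # hypothetical extra document

initial, expanded, final = pseudo_relevance_feedback('bulldog', docs)
print(initial)   # [1, 0, 2, 3]
print(expanded)  # ['bulldog', 'breed', 'dog']
print(final)     # [1, 3, 0, 2]
```

In this toy collection the fourth document never mentions "bulldog", yet it moves from last to second place because the expansion terms taken from the top-ranked document match it.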

### Advantages of Pseudo Relevance Feedback:
1. **Automatic and Efficient**:
   - PRF automates the manual part of relevance feedback, reducing the need for extensive user interaction.
   - Users benefit from improved retrieval performance without spending additional time providing explicit feedback.

2. **Better Performance than Global Analysis**:
   - Evidence suggests that PRF tends to work better than global analysis methods.
   - By focusing on the top-ranked documents, PRF captures relevant signals more effectively.

3. **Effective in TREC Ad Hoc Tasks**:
   - PRF has been successfully applied in the Text Retrieval Conference (TREC) ad hoc task, where it improved retrieval effectiveness.

### Disadvantages of Pseudo Relevance Feedback:
1. **Assumption-Based Risk**:
   - The assumption that top-ranked documents are relevant may not always hold true.
   - If the initial retrieval returns irrelevant documents, for example because the query is ambiguous, the modified query inherits their terms and may perform worse than the original query.

2. **Query Drift**:
   - Query drift occurs when the modified query shifts toward the content of the top-ranked documents, even if they are not truly relevant.
   - For example, if the initial query is about "copper mines," but the top-ranked documents are all about "mines in Chile," the modified query may drift toward Chile-related content.

3. **Single vs. Multiple Rounds of Feedback**:
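   - PRF typically involves a single round of feedback. However, multiple rounds can be performed.
   - A single round limits the damage a poor initial ranking can do, but it can only add terms from the first result list and may not fully capture the user's intent.
   - Each additional round builds on the output of the previous one, so mistakes compound and the query can drift further away from the original information need.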
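
To make the effect of repeated rounds concrete, here is a small sketch built around the "copper mines" example above. The four documents, the scoring function (summed term frequency), the stopword list, and the parameters `top_docs` and `new_terms` are hypothetical and chosen only to make the drift visible.

```python
from collections import Counter

STOPWORDS = {'in', 'and'}  # tiny illustrative stopword list (assumption)

def retrieve(query_terms, docs):
    """Rank document indices by summed term frequency of the query terms."""
    scored = []
    for index, doc in enumerate(docs):
        counts = Counter(doc.lower().split())
        scored.append((sum(counts[t] for t in query_terms), index))
    # Higher score first; ties are broken by document index
    return [index for score, index in sorted(scored, key=lambda p: (-p[0], p[1]))]

def feedback_rounds(query, docs, rounds, top_docs=2, new_terms=1):
    """Apply `rounds` rounds of pseudo relevance feedback, then retrieve."""
    query_terms = query.lower().split()
    for _ in range(rounds):
        ranking = retrieve(query_terms, docs)
        candidates = Counter()
        for index in ranking[:top_docs]:
            candidates.update(t for t in docs[index].lower().split()
                              if t not in STOPWORDS and t not in query_terms)
        # Most frequent first; alphabetical order breaks ties deterministically
        best = sorted(candidates.items(), key=lambda p: (-p[1], p[0]))[:new_terms]
        query_terms = query_terms + [term for term, _ in best]
    return query_terms, retrieve(query_terms, docs)

# Hypothetical toy collection
docs = ['copper mines in chile produce copper',      # 0: on topic
        'chile mines chile economy chile exports',   # 1: mostly about Chile
        'chile economy and chile wine exports',      # 2: off topic
        'copper wire and copper pipes']              # 3: partially on topic

for rounds in range(3):
    print(rounds, *feedback_rounds('copper mines', docs, rounds))
# 0 ['copper', 'mines'] [0, 3, 1, 2]
# 1 ['copper', 'mines', 'chile'] [0, 1, 2, 3]
# 2 ['copper', 'mines', 'chile', 'economy'] [1, 0, 2, 3]
```

After one round the query has picked up "chile"; after two rounds it also contains "economy", and the document about the Chilean economy overtakes the one about copper mines.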

In summary, pseudo relevance feedback strikes a balance between automation and risk. While it offers efficiency and improved performance, careful consideration of assumptions and potential query drift is essential. Researchers continue to explore ways to enhance PRF and mitigate its limitations (1, 2).

Source: Conversation with Bing, May 11, 2024
(1) Pseudo relevance feedback - Stanford University. https://nlp.stanford.edu/IR-book/html/htmledition/pseudo-relevance-feedback-1.html.
(2) Relevance feedback and pseudo relevance feedback - Stanford University. https://nlp.stanford.edu/IR-book/html/htmledition/relevance-feedback-and-pseudo-relevance-feedback-1.html.
(3) Pseudo Relevance Feedback with Deep Language Models and Dense .... https://arxiv.org/pdf/2108.11044v1.
(4) Pseudo Relevance Feedback with Deep Language Models and Dense .... https://arxiv.org/pdf/2108.11044v2.pdf.