# INFO 4271 - Exercise 4 - Statistical Ranking

Issued: May 7, 2024

Due: May 13, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Generative Relevance Models
Generative retrieval models use the probabilistic language model framework for matching queries and documents.

a) Implement the `rank()` function sketched below. In class, we discussed two alternative model variants. Choose the query likelihood model.

In [1]:
# Rank a collection of documents relative to a query using the query likelihood model
def rank(Q, D):
    query_terms = Q.lower().split()
    document_terms = [doc.lower().split() for doc in D]

    # Maximum likelihood unigram model per document: P(t | d) = tf(t, d) / |d|
    doc_language_models = []
    for doc_terms in document_terms:
        doc_model = {}
        for term in set(doc_terms):
            doc_model[term] = doc_terms.count(term) / len(doc_terms)
        doc_language_models.append(doc_model)

    ranked_documents = []
    for i, doc_model in enumerate(doc_language_models):
        # Query likelihood: product of the per-term probabilities
        query_likelihood = 1.0
        for term in query_terms:
            if term in doc_model:
                query_likelihood *= doc_model[term]
            else:
                # Crude smoothing: unseen terms get a small constant probability
                query_likelihood *= 1e-6
        ranked_documents.append((i, query_likelihood))

    # Sort by descending query likelihood so the best match comes first
    return sorted(ranked_documents, key=lambda pair: pair[1], reverse=True)

Q = 'french bulldog'
D = ['the french revolution was a period of upheaval in france',
     'the french bulldog is a small breed of domestic dog',
     'french is a very french language spoken by the french']

print(rank(Q, D))

[(1, 0.010000000000000002), (2, 3e-07), (0, 1e-07)]
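
As a sanity check, the three scores can be reproduced by hand. The notation $M_d$ for the language model of document $d$ is an assumption made here for illustration, and the document indices 0, 1, 2 follow the order of the list `D`. The query likelihood is the product of the per-term probabilities:

$$
P(Q \mid M_d) = \prod_{q \in Q} P(q \mid M_d)
$$

Each of the three documents has 10 terms, and a query term that does not occur in the document contributes the constant $10^{-6}$:

$$
\begin{aligned}
P(Q \mid M_0) &= \tfrac{1}{10} \cdot 10^{-6} = 10^{-7} \\
P(Q \mid M_1) &= \tfrac{1}{10} \cdot \tfrac{1}{10} = 0.01 \\
P(Q \mid M_2) &= \tfrac{3}{10} \cdot 10^{-6} = 3 \cdot 10^{-7}
\end{aligned}
$$

The value `0.010000000000000002` in the printed output is $0.01$ up to floating point rounding.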


b) Probabilistic language models may encounter previously unseen query terms. Explain why this can become problematic and how you would address the issue. 

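If a query term never occurs in a document, its maximum likelihood probability under that document's language model is 0. Because the query likelihood is a product over all query terms, a single unseen term drives the probability of the whole query to 0, even if the document matches every other term well. Smoothing prevents this by reserving some probability mass for unseen terms, for example by adding a small constant to every term count (additive smoothing) or by interpolating the document model with a collection-wide language model.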
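
The idea of adding a small constant can be sketched as follows. This is a minimal illustration of additive (add-alpha) smoothing, not necessarily the variant discussed in class. The function name `additive_smoothed_model`, the parameter `alpha`, and the query term `'poodle'` are hypothetical and chosen only for this example.

```python
from collections import Counter

def additive_smoothed_model(doc_terms, vocabulary, alpha=1.0):
    """Return P(t | d) for every vocabulary term using add-alpha smoothing."""
    counts = Counter(doc_terms)
    # Adding alpha to each of the |V| counts enlarges the denominator accordingly
    denominator = len(doc_terms) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / denominator for term in vocabulary}

docs = ['the french bulldog is a small breed of domestic dog',
        'french is a very french language spoken by the french']
tokenized = [doc.split() for doc in docs]

# Vocabulary: all document terms plus a hypothetical unseen query term
vocabulary = set(term for doc in tokenized for term in doc)
vocabulary.add('poodle')

models = [additive_smoothed_model(doc, vocabulary) for doc in tokenized]

# Every term, including the unseen 'poodle', now has a non-zero probability,
# and each model still sums to 1 over the vocabulary.
for model in models:
    print(model['poodle'], sum(model.values()))
```

Unlike a fixed floor such as `1e-6`, this keeps each document model a proper probability distribution over the vocabulary, at the cost of having to know the vocabulary in advance.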

# 2. Relevance Feedback
Relevance Feedback allows us to refine the query representation after a round of user interaction with the search results. If organic feedback is not available, we can assume highly ranked documents to be *pseudo* relevant. Discuss the advantages and disadvantages of the pseudo relevance feedback scheme. Think in particular about single versus multiple rounds of feedback.

Advantages:
- It tends to increase the number of relevant documents retrieved. With multiple rounds of feedback, the refined query can move even closer to the user's information need, provided the original query was well formed.
- Results can improve even when the user's query is not well formed.
- The new query can pick up underlying concepts that the original query did not express but that the pseudo relevant documents reveal.
- No user interaction is needed to obtain feedback (users rarely give explicit feedback), which saves the user time and effort.

Disadvantages:
- Topic drift: Expanding the query with the nearest related topics risks retrieving documents that no longer match the topic of the original query. Each additional round of feedback further increases the likelihood of building on non-relevant feedback.
- If the initial query returns irrelevant documents, the second query built from the pseudo relevance feedback can end up far from what the user actually wanted.
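
The difference between one and several feedback rounds can be made concrete with a toy sketch. It assumes a very simple scheme that is not taken from the lecture: documents are scored by counting query term occurrences, and the most frequent new terms of the top `k` documents are appended to the query. The names `overlap_score` and `pseudo_relevance_feedback`, the parameters `k`, `n_expand` and `rounds`, and the tiny pre-tokenized collection are all hypothetical.

```python
from collections import Counter

def overlap_score(query_terms, doc_terms):
    """Toy score: how often the distinct query terms occur in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

def pseudo_relevance_feedback(query_terms, docs, k=2, n_expand=2, rounds=1):
    """Expand the query with frequent terms from the top-k documents of each round."""
    query_terms = list(query_terms)
    for _ in range(rounds):
        # Treat the k highest scoring documents as pseudo relevant
        ranked = sorted(docs, key=lambda d: overlap_score(query_terms, d), reverse=True)
        feedback_counts = Counter(t for d in ranked[:k] for t in d)
        # Append the most frequent feedback terms that are not yet in the query
        new_terms = [t for t, _ in feedback_counts.most_common() if t not in query_terms]
        query_terms.extend(new_terms[:n_expand])
    return query_terms

docs = [['french', 'bulldog', 'dog', 'breed'],
        ['dog', 'breed', 'training', 'puppy'],
        ['french', 'revolution', 'history']]
query = ['french', 'bulldog']

print(pseudo_relevance_feedback(query, docs, rounds=1))
print(pseudo_relevance_feedback(query, docs, rounds=2))
```

In this toy collection, one round adds `dog` and `breed`, which still fit the original query. The second round then ranks the dog training document second and adds `training` and `puppy`, a small instance of the topic drift described above.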