Before you turn this assignment in, make sure everything runs as expected by going to the menubar and running: 

**Kernel $\rightarrow$ Restart & Run All**

Please replace all spots marked with `# ADD YOUR CODE HERE` or `ADD YOUR ANSWER HERE`.

And start by filling in your name and student_id below:

In [None]:
NAME = ""
STUDENT_ID = ""

In [None]:
assert len(NAME) > 0, "Please fill in your name"
assert len(STUDENT_ID) > 0, "Please fill in your student id"

---

In [None]:
import doctest
import string

from collections import defaultdict, Counter
from math import log
from typing import Callable, List, Dict, Tuple

import nltk
import pandas as pd


if __name__ == "__main__":
    nltk.download("punkt_tab");
    nltk.download("stopwords");

    # Enable doctests
    def test(fn: Callable):
        doctest.run_docstring_examples(fn, globals(), verbose=True, name=fn.__name__)
else:
    # Disable doctests in autograding setup
    def test(fn: Callable): pass

# Week 3 - Preprocessing & Vector Space Model

Welcome to week three of the search engines course 👋

Part I of this week takes a look at preprocessing documents for retrieval. In part II, we will build our first ranked search engine using tf-idf scoring. And in part III, we will build a language-model-based search engine.

As always, for any questions, problems, or feedback please contact your TA. Good luck with the assignment!

### Resources

📚 [Preprocessing - Manning, Raghavan, Schütze - Chapter 2.2](https://nlp.stanford.edu/IR-book/pdf/02voc.pdf)

📚 [Vector Space Model - Manning, Raghavan, Schütze - Chapter 6.3](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf)

📚 [Preprocessing - Croft, Metzler, and Strohman - Chapter 4.3](https://ciir.cs.umass.edu/downloads/SEIRiP.pdf#page=110)

📚 [Vector Space Model - Croft, Metzler, and Strohman - Chapter 7.1.2](https://ciir.cs.umass.edu/downloads/SEIRiP.pdf#page=261)

# Part I - Preprocessing

The Netflix dataset that we used in the first week was already preprocessed. Since you learned more about the different steps of preprocessing documents in this week's lecture, let's take a closer look at preprocessing documents for IR. For this, we will use a subset of a real dataset from a public TREC challenge for [biomedical information retrieval about Covid-19](https://ir.nist.gov/trec-covid/). We will process medical abstracts concerning vitamins in connection with covid. To make things a bit more interesting, we added a very simple boolean search engine below, which we will use to evaluate each step of our preprocessing pipeline.

<div class="alert alert-warning">
⚠️ Note: To be very clear, this is not an endorsement of treating covid-19 with vitamins. For more info see: https://www.mayoclinic.org/diseases-conditions/coronavirus/expert-answers/coronavirus-and-vitamin-d/faq-20493088
</div>
    
Let's begin by executing the boilerplate code below:

## Setup I: Simple boolean search

In [None]:
def load_covid_data(
    url: str = "https://raw.githubusercontent.com/irlabamsterdam/uva-ir0-assignments/main/data/trec-covid.csv"
):
    return pd.read_csv(url)

def create_index(df):
    index = defaultdict(list)

    for i, row in df.iterrows():
        tokens = preprocessing(row.text)

        for token in tokens:
            index[token].append(row.doc_id)

    return index

def search_and(index, query):
    tokens = preprocessing(query)
    return set.intersection(*[set(index[token]) for token in tokens])

def precision(scores):
    return sum(scores) / len(scores) if len(scores) > 0 else 0

def recall(scores, max_scores):
    return sum(scores) / sum(max_scores)

def f1(scores, max_scores):
    p = precision(scores)
    r = recall(scores, max_scores)
    return (2 * p * r) / (p + r) if (p + r) > 0 else 0

def evaluate(df):
    # Disable execution in autograde 
    if __name__ != "__main__": return

    index = create_index(df)
    index_size = len(index)
    doc2relevance = df.set_index("doc_id").relevance.to_dict()
    max_scores = list(df.doc_id.map(doc2relevance))

    queries = [
        ("A", "Vitamin D and COVID-19"),
        ("B", "which vitamins against Sars-Cov-2?"),
        ("C", "Vitamin A or D for the coronavirus?")
    ]

    print(f"Inverted index size: {index_size} tokens\n")

    for query_id, query in queries:
        docs = search_and(index, query)
        scores = [doc2relevance[d] for d in docs]

        print(f"Query {query_id}: '{query}' -> {preprocessing(query)}")
        print(
            f"Precision: {precision(scores):.4f},",
            f"Recall: {recall(scores, max_scores):.4f},",
            f"F1: {f1(scores, max_scores):.4f},",
            f"Retrieved docs: {len(docs)}/{len(max_scores)}\n",
        )


df = load_covid_data()
df.head()

## Setup II: Preprocessing pipeline

Below is an (empty) preprocessing pipeline. In the following, we will overwrite each step with the actual preprocessing code. When calling `evaluate(df)`, we build a new index with our updated pipeline and evaluate the impact of our preprocessing on three queries (A, B, C), all expressing roughly the same information need from the TREC corpus: "Do vitamins help against Covid-19?".

#### Test Queries
* **A:** Vitamin D and COVID-19
* **B:** Which vitamins against Sars-Cov-2?
* **C:** Vitamin A or D for the coronavirus?

After running the code below, you should be able to see the evaluation output for each of the three queries. Because a boolean search returns an unordered set of documents, we evaluate the results with the set-based metrics introduced last week.
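To make the set-based metrics concrete, here is a small sketch that applies the `precision`, `recall`, and `f1` helpers from the setup cell to made-up relevance labels. The two label lists are hypothetical and not taken from the dataset.

```python
def precision(scores):
    return sum(scores) / len(scores) if len(scores) > 0 else 0

def recall(scores, max_scores):
    return sum(scores) / sum(max_scores)

def f1(scores, max_scores):
    p = precision(scores)
    r = recall(scores, max_scores)
    return (2 * p * r) / (p + r) if (p + r) > 0 else 0

# Hypothetical labels: 1 = relevant, 0 = not relevant.
retrieved = [1, 0, 1]          # relevance of the three retrieved documents
corpus = [1, 1, 1, 0, 0, 1]    # relevance of every document in the corpus

p = precision(retrieved)       # 2 of the 3 retrieved documents are relevant
r = recall(retrieved, corpus)  # 2 of the 4 relevant documents were retrieved
f = f1(retrieved, corpus)      # harmonic mean of precision and recall
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```

Retrieving fewer documents tends to raise precision and lower recall, which is why F1 is reported alongside both.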

In [None]:
def preprocessing(text: str):
    tokens = tokenize(text)
    tokens = normalize(tokens)
    tokens = stopping(tokens)
    tokens = stemming(tokens)
    return tokens

def tokenize(text: str) -> List[str]:
    return text.split()

def normalize(tokens: List[str]) -> List[str]:
    return tokens

def stopping(tokens: List[str]) -> List[str]:
    return tokens

def stemming(tokens: List[str]) -> List[str]:
    return tokens


evaluate(df)

## 1.1 Tokenization

After executing all this boilerplate, it is your turn! The first important step in our pipeline is tokenization, the process of splitting a piece of text into individual tokens. Tokenization is a complex and critical process. We might think that we have to break text on punctuation and spaces. However, we would rarely want to break tokens like `Ph.D.` or `U.K.` apart, while splitting certain pre- or suffixes is sometimes desirable. To make matters worse, tokenization is heavily language-dependent. If you are curious, take a look at an industry-grade rule-based [tokenizer from the spaCy library](https://spacy.io/usage/linguistic-features#tokenization). Modern LLMs train specific models that learn how to tokenize a piece of text (typically using statistics of co-occurring characters). But for this course, we will stick to a more classic rule-based approach.


📝 Replace the following naive tokenization on whitespace with a more sophisticated tokenizer:

1. Replace line breaks (`\n`) with spaces.
2. Replace the following special characters with spaces: `({[])}`
3. Remove any of the following punctuation characters when it is followed by more punctuation or whitespace, or when it is at the end of the text: `.,!?;:`
4. Split contractions such as `"don't" -> ["do", "n't"]` into separate tokens, using the list of contractions below.
5. Lastly, break all tokens on whitespace and return a list of all tokens.

<div class="alert alert-warning">
💡 Regular expressions such as `re.sub(pattern, replacement, text)` might come in handy but are not necessary to solve this task!
</div>
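As a rough illustration of how `re.sub` behaves, the sketch below collapses whitespace and removes a single punctuation character with a lookahead. The example strings are made up, and this is one possible pattern style rather than the expected solution.

```python
import re

# Collapse runs of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", "too    many   spaces")

# Remove ';' only when whitespace or the end of the text follows.
# (?=...) is a lookahead: it checks what comes next without consuming it.
cleaned = re.sub(r";(?=\s|$)", "", "a; b;c;")

print(collapsed)
print(cleaned)
```

Because the lookahead does not consume the following character, the space after the first `;` survives while the `;` inside `b;c` is left untouched.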

In [None]:
import re

contractions = ["'m", "'s", "'d", "'ll", "'ve", "'re", "n't"]


def tokenize(text: str) -> List[str]:
    """
    >>> tokenize("I'm, she's, we'd, he'll, we've, they're, ain't, don't, hadn't, isn't, shan't, might've")
    ['I', "'m", 'she', "'s", 'we', "'d", 'he', "'ll", 'we', "'ve", 'they', "'re", 'ai', "n't", 'do', "n't", 'had', "n't", 'is', "n't", 'sha', "n't", 'might', "'ve"]
    
    >>> tokenize("She said, `HELLO!` Then she paused.\\nDid you hear that? I wonder, should we go now?")
    ['She', 'said', '`HELLO!`', 'Then', 'she', 'paused', 'Did', 'you', 'hear', 'that', 'I', 'wonder', 'should', 'we', 'go', 'now']
    
    >>> tokenize("The committee's decision (which wasn't unanimous—there were several objections [notably from Smith, Ph.D. and Prof. Johnson]) was announced at 3:00 p.m.; however, the report won't be released until tomorrow.")
    ['The', 'committee', "'s", 'decision', 'which', 'was', "n't", 'unanimous—there', 'were', 'several', 'objections', 'notably', 'from', 'Smith', 'Ph.D', 'and', 'Prof', 'Johnson', 'was', 'announced', 'at', '3:00', 'p.m', 'however', 'the', 'report', 'wo', "n't", 'be', 'released', 'until', 'tomorrow']
    
    >>> tokenize(" This            is a lot of blank space . ")
    ['This', 'is', 'a', 'lot', 'of', 'blank', 'space']
    
    >>> tokenize("This... (pause)... is a tricky one!")
    ['This', 'pause', 'is', 'a', 'tricky', 'one']
    """

    # ADD YOUR CODE HERE

    return text.split()

In [None]:
test(tokenize)

In [None]:
evaluate(df)

## 1.2 Normalization

In step two, we normalize our tokens. Normalization maps different tokens that refer to the same concept (such as "COVID 19" and "Covid-19") to a common representation, which can help with vocabulary mismatch.

📝 Normalize the tokens below by converting all tokens to lowercase and mapping the terms `"covid"`, `"sars-cov-2"`, and `"coronavirus"` to `"covid-19"`.
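One common way to express such a mapping is a lookup table. The sketch below uses a hypothetical synonym table (`SYNONYMS`) with terms unrelated to the task, and a hypothetical helper name, purely to show the pattern.

```python
from typing import Dict, List

# Hypothetical synonym table, unrelated to the assignment's terms.
SYNONYMS: Dict[str, str] = {"u.k.": "uk", "britain": "uk"}

def normalize_example(tokens: List[str]) -> List[str]:
    lowered = [t.lower() for t in tokens]
    # dict.get returns the token itself when no mapping exists.
    return [SYNONYMS.get(t, t) for t in lowered]

normalized = normalize_example(["Britain", "U.K.", "France"])
print(normalized)
```

Lowercasing before the lookup means the table only needs lowercase keys.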

In [None]:
def normalize(tokens: List[str]) -> List[str]:
    """
    >>> normalize(["Coronavirus", "COVID", "sars-cov-2", "Sars-Cov-1"])
    ['covid-19', 'covid-19', 'covid-19', 'sars-cov-1']
    """

    # ADD YOUR CODE HERE

    return tokens

In [None]:
test(normalize)

In [None]:
evaluate(df)

## 1.3 Stopping

In week 1, we listed the most common words in our index. One observation was that the most common words in a language are not necessarily the ones that carry a lot of information. An early approach to dealing with these very common words, such as `as, an, a, the`, was to simply remove them (a step called stopping).

📝 Complete the method below to ignore all stopwords contained in the English stopword list of the NLTK library: `nltk.corpus.stopwords.words("english")`.
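The filtering pattern itself can be sketched without NLTK. The stopword set below is a hypothetical mini list for illustration only; the task uses the NLTK list instead.

```python
from typing import List

# Hypothetical mini stopword list; the task uses NLTK's English list instead.
STOPWORDS = {"a", "an", "as", "the"}

def remove_stopwords(tokens: List[str]) -> List[str]:
    # Set membership is O(1) on average, unlike scanning a list per token.
    return [t for t in tokens if t not in STOPWORDS]

kept = remove_stopwords(["the", "crown", "as", "a", "symbol"])
print(kept)
```

Converting the stopword list to a set once, outside the loop, avoids repeating a linear scan for every token.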

In [None]:
import nltk


def stopping(tokens: List[str]) -> List[str]:
    """
    >>> stopping(['each', 'his', 'here', 'other', "mightn't", 'mightn', 'not', 'the', 'have', 'same'])
    []
    >>> stopping(["hamlet", "wonders", "to", "be", "or", "not", "to", "be"])
    ['hamlet', 'wonders']
    """

    # ADD YOUR CODE HERE

    return tokens

In [None]:
test(stopping)

In [None]:
evaluate(df)

## 1.4 Stemming

Lastly, we use stemming as a way to unify word inflections to a common token (e.g., `going` -> `go`).

📝 Implement stemming using the [English SnowballStemmer](https://www.nltk.org/api/nltk.stem.SnowballStemmer.html).

In [None]:
from nltk.stem import SnowballStemmer


def stemming(tokens: List[str]) -> List[str]:
    """
    >>> stemming(["universe", "universal"])
    ['univers', 'univers']
    >>> stemming(["universe", "university"])
    ['univers', 'univers']
    >>> stemming(["likely", "like"])
    ['like', 'like']
    >>> stemming(["fairly", "fair"])
    ['fair', 'fair']
    """

    # ADD YOUR CODE HERE

    return tokens

In [None]:
test(stemming)

In [None]:
evaluate(df)

## 1.5 Inspect the processing steps

📝 Now, go back and execute all the preprocessing steps (Kernel > Restart and Run All) and inspect the evaluation results for our three queries. Describe how each preprocessing step affects the following aspects:

1. Describe how each step affects the size of our index.
2. Describe how each step influences retrieval performance.
3. Lastly, evaluate how the query text has changed with the preprocessing and how that might affect our information need.

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

# Part II - Vector Space Model and TF-IDF

Let's leave the topic of Covid and go back to the movie dataset from week 1. So far in this course, we have mostly looked at boolean search engines, which return results in no particular order. Let's change that and begin by implementing the vector space model using tf-idf weighting. Begin by loading the movie dataset:

In [None]:
def load_movies(
        url: str = "https://raw.githubusercontent.com/irlabamsterdam/uva-ir0-assignments/main/data/netflix.csv"
):
    df = pd.read_csv(url)
    df = df.fillna("")
    df["genres"] = df["genres"].str.split("|")
    df["directors"] = df["directors"].str.split("|")
    df["actors"] = df["actors"].str.split("|")
    return df

def movie_tokenizer(text: str) -> List[str]:
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = [t for s in nltk.sent_tokenize(text) for t in tokenizer.tokenize(s)]
    tokens = [t for t in tokens if not all([c in string.punctuation for c in t])]
    return tokens

movie_df = load_movies()
movie_df.head()

## 2.1 Inverted index with frequency information

Vector space models represent each document and query as vectors with an entry for each token in our corpus. In reality, we have millions of tokens in our index, so storing such large vectors (which are typically sparse, containing mostly zeros) is slow and inefficient. Thus, we again turn to an inverted index to efficiently store and retrieve information.

📝 Build an inverted index for **movie title and description** that contains the number of occurrences of a token in a document (**term frequency**), the number of (unique) documents each token appears in (**document frequency**), and lastly create one entry in your index to **count the total number of documents** in our corpus (`index["total_documents"]`). The resulting index should have the following structure:

```Python
{
    "the": {
        "postings": [
            {"id": 1029, "term_frequency": 1},
            {"id": 1038, "term_frequency": 1},
            {"id": 1155, "term_frequency": 1},
            ...
        ],
        "document_frequency": 4647
    },
    "crown": {
        "postings": [
            {"id": 117, "term_frequency": 1},
            {"id": 505, "term_frequency": 1},
            {"id": 899, "term_frequency": 1},
            ...
        ],
        "document_frequency": 11
    },
    ...
    "total_documents": 1234
}
```


<div class="alert alert-warning">
💡 Tip: The "collections.defaultdict" and "collections.Counter" classes might be helpful in this task.
</div>
<div class="alert alert-danger">
⚠️ Note: From now on use the `movie_tokenizer()` method from the cell above to preprocess your documents when creating the index, and NOT your tokenize method from task 1.1.
</div>
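For readers unfamiliar with the two classes from the tip, here is a minimal sketch on a hypothetical two-document corpus. It only demonstrates how `Counter` and `defaultdict` behave and deliberately does not build the index structure required by the task.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: document id -> tokens.
docs = {1: ["the", "crown", "the"], 2: ["crown", "jewels"]}

# Counter tallies how often each token occurs in one document.
counts = Counter(docs[1])

# defaultdict(list) creates an empty list the first time a key is accessed.
docs_per_token = defaultdict(list)
for doc_id, tokens in docs.items():
    for token in set(tokens):
        docs_per_token[token].append(doc_id)

print(counts["the"], counts["crown"])
print(docs_per_token["crown"])
```

Iterating over `set(tokens)` ensures that each document is recorded at most once per token, even when the token repeats.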

In [None]:
def create_index(df: pd.DataFrame) -> Dict[
    str, List[int]]:
    """
    >>> index = create_index(movie_df)
    >>> index["total_documents"]
    6114
    
    >>> index["headspace"]
    {'postings': [{'id': 5074, 'term_frequency': 2}, {'id': 5231, 'term_frequency': 1}, {'id': 5281, 'term_frequency': 2}], 'document_frequency': 3}
    
    >>> index["saul"]
    {'postings': [{'id': 915, 'term_frequency': 4}, {'id': 2117, 'term_frequency': 2}], 'document_frequency': 2}
    
    >>> index["burnham"]
    {'postings': [{'id': 749, 'term_frequency': 3}, {'id': 1250, 'term_frequency': 2}, {'id': 4795, 'term_frequency': 2}, {'id': 5832, 'term_frequency': 2}], 'document_frequency': 4}
    """
    index = {}

    for _, row in df.iterrows():
        # You can access details for each movie using the row variable:
        # E.g., row.id, row.title, row.description
        tokens = movie_tokenizer(row.title) + movie_tokenizer(row.description)

    # ADD YOUR CODE HERE

    return index


if __name__ == "__main__":
    movie_index = create_index(movie_df)

In [None]:
test(create_index)

## 2.2 TF-IDF

📝 Now, let's use the index to build a search engine that returns a list of documents ranked by their tf-idf scores:
- Return a list of tuples containing the movie title (you can use the helper `id2title`) and its score, e.g., `[("movie A", 10), ("movie B", 9.5),...]`.
- Return just the top-k search results sorted by score in descending order; break ties by sorting titles alphabetically.
- Round all tf-idf scores to 8 decimal places.

There are different formulations of tf-idf. For this assignment, we define tf-idf as:

#### Term Frequency (TF)
Let $f_{t,d}$ be the number of times the term $t$ occurs in document $d$ (`term_frequency` in your index posting list). Then compute the term frequency component as:

$\textrm{tf}(t, d) = 1 + \ln(f_{t,d})$

#### Inverse-Document Frequency (IDF)
Let $N_{t}$ be the number of documents that contain the term $t$ (`document_frequency` in your index), and let $N$ be the total number of documents in your corpus. Compute the inverse-document frequency as:

$\textrm{idf}(t) = \ln(\frac{N}{N_t})$

#### Compute document scores
Lastly, rank all documents by the sum of their tf-idf scores over each term $t$ in the query $q$:

$\textrm{tf-idf}(q, d) = \sum_{t \in q} \textrm{tf}(t, d) * \textrm{idf}(t)$

<div class="alert alert-warning">
💡 You can use `log()` to calculate the natural logarithm ln.
</div>

<div class="alert alert-danger">
⚠️ Don't group tf-idf scores by title, some movies (with different ids) have the same title!
</div>
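The formulas above can be checked by hand for a single-term query. The sketch below plugs in the values shown in the `create_index` doctest for the term "headspace" (3 matching documents out of 6114, with a term frequency of 2 in document 5074). The variable names are our own.

```python
from math import log

N = 6114   # total_documents, from the create_index doctest
N_t = 3    # document_frequency of "headspace", from the same doctest
f_td = 2   # term_frequency of "headspace" in document 5074

tf = 1 + log(f_td)   # 1 + ln(2)
idf = log(N / N_t)   # ln(6114 / 3)
score = round(tf * idf, 8)
print(score)
```

The result should come out close to the 12.90131457 listed for the top hits in the `search_tf_idf` doctest. With $f_{t,d} = 1$ the tf component is exactly 1, so the score reduces to the idf alone.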

In [None]:
# Create a mapping of movie ids to their title
id2title = movie_df.set_index("id").title.to_dict()


def search_tf_idf(index: Dict, tokens: List[str], top_k: int) -> List[
    Tuple[str, float]]:
    """
    >>> search_tf_idf(movie_index, ["headspace"], 3)
    [('headspace guide to sleep', 12.90131457), ('headspace unwind you mind', 12.90131457), ('headspace guide to meditation', 7.61972421)]
    
    >>> search_tf_idf(movie_index, ["trump", "obama"], 5)
    [('trump: an american dream', 11.04120227), ('american factory: a conversation with the obamas', 7.33204214), ('barry', 7.33204214), ('becoming', 7.33204214), ('our great national parks', 7.33204214)]
    
    >>> search_tf_idf(movie_index, ["breaking", "bad"], 3)
    [('breaking bad', 9.9971137), ('el camino: a breaking bad movie', 9.9971137), ('the road to el camino: behind the scenes of el camino: a breaking bad movie', 9.9971137)]
    """
    titles = []

    # ADD YOUR CODE HERE

    return titles[:top_k]

In [None]:
test(search_tf_idf)

# Part III - Language Models and Query Likelihood (LM)

In this last part, we implement language-model-based retrieval. Language models are probability distributions over sequences of words. For example, given the entire text of The Lord of the Rings, what is the probability of Gollum saying "my precious"?

If we are just concerned with how likely an individual word is given a document collection, we use a unigram model (basically the word frequency divided by the number of all words in the corpus). But we could also calculate the probability of a word given the previous word (2-gram) or the previous two words (3-gram). Modern large language models like GPT-4 or LLaMA use deep neural networks to predict the next word conditioned on a whole sequence of tokens. For this assignment, we will focus on the simple unigram model.
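As a small illustration of the difference between unigram and 2-gram estimates, consider the sketch below. The token list is a hypothetical toy text, and the estimates are plain relative frequencies without any smoothing.

```python
from collections import Counter

# Hypothetical toy text, already tokenized.
tokens = ["my", "precious", "my", "ring", "my", "precious"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # pairs of adjacent tokens

# Unigram: P(w) = count(w) / total number of tokens.
p_precious = unigrams["precious"] / len(tokens)

# 2-gram: P(w | prev) = count(prev, w) / count(prev).
p_precious_given_my = bigrams[("my", "precious")] / unigrams["my"]

print(p_precious, p_precious_given_my)
```

Here "precious" makes up 2 of 6 tokens, but it follows "my" in 2 of the 3 places where "my" occurs, so conditioning on the previous word raises its estimated probability.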

Let's begin by updating our index to build a language model.

## 3.1 An inverted index for language models

📝 Create a new inverted index where the term frequency is now the probability of a word appearing in a document. Let $f_{t,d}$ again be the number of times a token $t$ appears in document $d$, and let $|d|$ be the length of the document (its number of tokens):

$\textrm{tf}(t, d) = \frac{f_{t,d}}{|d|}$

Next, we modify the notion of document frequency and count the number of (non-unique) occurrences of a token in our corpus, divided by the number of all tokens in our corpus (i.e., the length of all documents). For example, how often was the term "netflix" used divided by the total number of words across all documents in our corpus.

We call this the `corpus_frequency` of a token, which describes how likely a term is overall to be drawn from our document collection. Round all probabilities to 8 decimal places.
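The two probabilities can be sketched on a toy example as follows. The two-document corpus is hypothetical, and the snippet computes the values for a single token rather than building the full index.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> tokens.
docs = {1: ["the", "crown", "the", "queen"], 2: ["crown", "jewels"]}

# Document-level probability: f_{t,d} / |d|.
tf_crown_doc1 = round(Counter(docs[1])["crown"] / len(docs[1]), 8)

# Corpus-level probability: occurrences across all documents / total tokens.
all_tokens = [t for tokens in docs.values() for t in tokens]
cf_crown = round(Counter(all_tokens)["crown"] / len(all_tokens), 8)

print(tf_crown_doc1, cf_crown)
```

In this toy corpus "crown" occurs once among the 4 tokens of document 1, and twice among the 6 tokens of the whole corpus.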

The resulting index should have the following structure:


```Python
{
    "crown": {
        "postings": [
            {'id': 117, 'term_frequency': 0.02273...},
            {'id': 505, 'term_frequency': 0.04348...},
            {'id': 899, 'term_frequency': 0.01667...},
            ...
        ],
        "corpus_frequency": 0.00005342
    }
}
```

In [None]:
def create_lm_index(df: pd.DataFrame) -> Dict[
    str, List[int]]:
    """ 
    >>> index = create_lm_index(movie_df)
    >>> index["headspace"]
    {'postings': [{'id': 5074, 'term_frequency': 0.08333333}, {'id': 5231, 'term_frequency': 0.01960784}, {'id': 5281, 'term_frequency': 0.07407407}], 'corpus_frequency': 2.013e-05}
    
    >>> index["saul"]
    {'postings': [{'id': 915, 'term_frequency': 0.05882353}, {'id': 2117, 'term_frequency': 0.05}], 'corpus_frequency': 2.416e-05}
    
    >>> index["burnham"]
    {'postings': [{'id': 749, 'term_frequency': 0.06521739}, {'id': 1250, 'term_frequency': 0.05405405}, {'id': 4795, 'term_frequency': 0.07142857}, {'id': 5832, 'term_frequency': 0.04166667}], 'corpus_frequency': 3.624e-05}
    """
    index = {}

    # Example of how to iterate over a dataframe
    for i, row in df.iterrows():
        # You can access details for each movie using the row variable:
        # E.g., row.id, row.title, row.description
        tokens = movie_tokenizer(row.title) + movie_tokenizer(row.description)

    # ADD YOUR CODE HERE

    return index


if __name__ == "__main__":
    lm_index = create_lm_index(movie_df)

In [None]:
test(create_lm_index)

## 3.2 Language model - Query likelihood search

Next, let's implement the query likelihood model. We want to use our unigram language model to estimate how likely a document is to be relevant for a given query, $P(d \mid q)$. Using Bayes' rule, we can see that this probability is proportional to the likelihood of the query being generated from the document, $P(q \mid d)$. Let's explain this step by step:

First, we remember Bayes' rule:

$P(d \mid q) = \normalsize \frac{P(q \mid d) P(d)}{P(q)}$

In our case $P(q)$ is a normalizing constant that is the same for all documents, thus it does not influence our ranking and we can drop it. $P(d)$ describes how likely each document is in our corpus. If we assume that all documents are equally likely, we can also drop this part. Thus it remains that:

$P(d \mid q) \propto P(q \mid d)$

These assumptions make our life much easier. What remains is to calculate the joint probability of the query terms occurring in each document, where the probability of each term is the term frequency calculated in our index above:

$P(q \mid d) = \prod_{t \in q} P(t \mid d) = \prod_{t \in q} \textrm{tf}(t, d)$


📝 Implement the query-likelihood model by ranking each document by the joint probability of all query terms occurring in the document.
- If a document does not contain a query term, assume that: $P(q \mid d) = 0$.
- Return only the top-k documents and break ties by sorting alphabetically by title.
- The result is a list of tuples containing movie title and probability (rounded to 8 places).
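To see why a single missing term zeroes the score, here is a minimal sketch of the product for one document at a time. The per-document probabilities and the helper name `likelihood` are hypothetical, and the lookup uses a plain dictionary rather than the index structure from task 3.1.

```python
from math import prod
from typing import Dict, List

def likelihood(doc_probs: Dict[str, float], query: List[str]) -> float:
    # A missing term contributes probability 0, which zeroes the product.
    return prod(doc_probs.get(t, 0.0) for t in query)

# Hypothetical P(t | d) values for two documents.
doc_a = {"queer": 0.04, "eye": 0.04}
doc_b = {"queer": 0.05}

score_a = round(likelihood(doc_a, ["queer", "eye"]), 8)
score_b = likelihood(doc_b, ["queer", "eye"])
print(score_a, score_b)
```

Document `doc_b` scores 0 even though it has the higher probability for "queer", because it never mentions "eye".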

In [None]:
def query_likelihood(index: Dict, tokens: List[str], top_k: int) -> List[
    Tuple[str, float]]:
    """
    >>> query_likelihood(lm_index, ['the', 'crown'], 3)
    [('the crown', 0.00619253), ("the king's affection", 0.00312175), ('a christmas prince: the royal wedding', 0.00135031)]
    
    >>> query_likelihood(lm_index, ['queen', 'crown'], 2)
    [('the crown', 0.00056296), ('a christmas prince: the royal wedding', 0.0001929)]
    
    >>> query_likelihood(lm_index, ["queer", "eye"], 4)
    [("queer eye: we're in japan!", 0.00694444), ('queer eye', 0.0016), ('queer eye: brazil', 0.0016), ('queer eye germany', 0.00118906)]
    """
    titles = []

    # ADD YOUR CODE HERE

    return titles[:top_k]

In [None]:
test(query_likelihood)

## 3.3 Smoothed Query Likelihood

The ranking model above is clearly not ideal: missing a single query term leads to a joint probability of 0. This problem is usually mitigated with smoothing, which assigns each term a minimum probability even if it does not occur in the document. This could be a fixed probability for each term or, more commonly, a global term probability (e.g., the corpus frequency calculated in our index) that is linearly interpolated with the document-level probability:

$P(q \mid d) = \prod_{t \in q} \left( \alpha P(t \mid c) + (1 - \alpha) P(t \mid d) \right)$

Here, $\alpha$ is a hyperparameter (e.g., $\alpha = 0.1$) and $P(t \mid c)$ is the corpus-level probability of a token (the `corpus_frequency` in our index). This approach is called linear or **Jelinek-Mercer smoothing**.
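The interpolation can be sketched for a single document as follows. All probabilities and the helper name `smoothed_likelihood` are hypothetical, and the lookup again uses plain dictionaries rather than the index from task 3.1.

```python
from math import prod
from typing import Dict, List

def smoothed_likelihood(
    doc_probs: Dict[str, float],
    corpus_probs: Dict[str, float],
    query: List[str],
    alpha: float = 0.1,
) -> float:
    # Interpolate the corpus-level and document-level probability per term.
    return prod(
        alpha * corpus_probs[t] + (1 - alpha) * doc_probs.get(t, 0.0)
        for t in query
    )

# Hypothetical probabilities: the document mentions "crown" but not "queen".
corpus = {"queen": 0.001, "crown": 0.002}
doc = {"crown": 0.05}

score = smoothed_likelihood(doc, corpus, ["queen", "crown"], alpha=0.1)
print(score)
```

The missing term "queen" now contributes $0.1 \cdot 0.001$ instead of 0, so the document keeps a small but non-zero score.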


📝 Copy your solution from task 3.2 above and add linear smoothing. Note that you need to retrieve all documents that contain at least one query term in this task (similar to an OR search). Sort the output as before and return only top_k items with the final scores rounded to 8 decimal places.

In [None]:
def smooth_query_likelihood(index: Dict, tokens: List[str], top_k: int,
                            alpha: float = 0.1) -> List[Tuple[str, float]]:
    """
    >>> smooth_query_likelihood(lm_index, ["queen", "crown"], top_k=5, alpha=0.1)    
    [('the crown', 0.00045678), ('a christmas prince: the royal wedding', 0.00015655), ('hero mask', 7.5e-07), ("the king's affection", 5.8e-07), ('babamın ceketi', 3.1e-07)]
    
    >>> smooth_query_likelihood(lm_index, ["queer", "eye"], top_k=4, alpha=0.5)
    [("queer eye: we're in japan!", 0.00173905), ('queer eye', 0.00040141), ('queer eye: brazil', 0.00040141), ('queer eye germany', 0.00029848)]
    
    >>> smooth_query_likelihood(lm_index, ["queer", "eye"], top_k=4, alpha=0.9)
    [("queer eye: we're in japan!", 7.05e-05), ('queer eye', 1.651e-05), ('queer eye: brazil', 1.651e-05), ('queer eye germany', 1.233e-05)]
    """

    results = []

    # ADD YOUR CODE HERE

    return results[:top_k]

In [None]:
test(smooth_query_likelihood)