Before you turn this assignment in, make sure everything runs as expected by going to the menu bar and selecting:

**Kernel $\rightarrow$ Restart & Run All**

Please replace all spots marked with `# ADD YOUR CODE HERE` or `ADD YOUR ANSWER HERE`.

Start by filling in your name and student ID below:

In [None]:
NAME = ""
STUDENT_ID = ""

In [None]:
assert len(NAME) > 0, "Please fill in your name"
assert len(STUDENT_ID) > 0, "Please fill in your student id"

---

In [None]:
import doctest
import itertools
import string

import pandas as pd

from collections import defaultdict, Counter
from typing import Callable, Dict, List

# Part I - Intro to IR

## 1.1 - Information Needs [1.5 pts]

📝 Explain the difference between a user's search query and their information need. Give an example of a search query that fits three different potential information needs.

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

## 1.2 - Refining information needs [1.5 pts]

📝 Without having learned specific techniques yet, reflect on your own usage of web search. Can you recall at least three things search engines do that help clarify your information need?

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

## 1.3 - Vocabulary Mismatch [1.5 pts]

📝 What is the vocabulary mismatch problem? Explain the problem and list three potential causes with examples:

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

## 1.4 - Types of Relevance [1.5 pts]

📝 In information retrieval, we often talk about search results being *relevant*. A document might be *topically relevant* to a query if it covers the same subject matter. However, a document can be relevant or irrelevant in multiple ways. **List three types of relevance beyond topical relevance with example scenarios**:

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

# Part II - Building a Boolean Search Engine for Netflix

In the following, we will build an inverted index for Netflix and use it to execute boolean search queries. Afterward, we will extend our boolean search engine to not only retrieve but rank the search results.

## 💡 Doctests
Before we start coding, note that many coding tasks in your assignment come with a little helper in the form of [doctests](https://docs.python.org/3/library/doctest.html). At the top of many functions, you will see a docstring like this:
```Python
def create_index(df: pd.DataFrame) -> Dict[str, List[int]]:    
    """
    >>> create_index(df)["witcher"]
    [3829, 4303, 4305, 4416, 4727, 5168, 5207]
    
    >>> create_index(df)["ozark"]
    [2230, 5695]
    """
    ...
    
test(create_index)
```

These are doctests, a standard Python way to define simple test cases inside docstrings. You can run them by passing your function to the `test` helper: `test(my_function)`. Note that you pass the function itself without calling it; the doctest library runs the examples from the docstring and compares their output with the expected values.

Use these tests to verify your implementation before submitting your notebook. Note, however, that the final grading is performed against additional tests that are not included in this notebook.
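
As a minimal, self-contained illustration of the mechanism, the sketch below runs the examples found in a docstring and reports how many of them failed. The function `add` and the helper `count_doctest_failures` are hypothetical names used only for this illustration; they are not part of the assignment.

```python
import doctest
from typing import Callable


def add(a: int, b: int) -> int:
    """
    >>> add(1, 2)
    3
    >>> add(-1, 1)
    0
    """
    return a + b


def count_doctest_failures(fn: Callable) -> int:
    # Parse the ">>>" examples out of the function's docstring
    parser = doctest.DocTestParser()
    test_case = parser.get_doctest(
        fn.__doc__, {fn.__name__: fn}, fn.__name__, None, 0
    )
    # Execute each example and compare actual with expected output
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test_case)
    return results.failed


print(count_doctest_failures(add))
```

A result of `0` means every example in the docstring produced the expected output.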


## Load data
Now let's start building a search engine. Execute the cell below to download a dataset of Netflix movies and shows into a pandas DataFrame:

In [None]:
def load_movies(
    url: str = "https://raw.githubusercontent.com/irlabamsterdam/uva-information-retrieval-0/main/data/netflix.csv"
) -> pd.DataFrame:
    df = pd.read_csv(url)
    df = df.fillna("")
    df["genres"] = df["genres"].str.split("|")
    df["directors"] = df["directors"].str.split("|")
    df["actors"] = df["actors"].str.split("|")
    df["production_countries"] = df["production_countries"].str.split("|")
    return df

if __name__ == "__main__":
    def test(fn: Callable):
        doctest.run_docstring_examples(fn, globals(), verbose=True, name=fn.__name__)
else:
    def test(fn: Callable): pass
    
df = load_movies()
df.head()

## 2.1 Inverted Index [2 pts]

📝  The first task is to build an inverted index for tokens in the **title and description** of each movie or show.
Since we will only learn how to clean and process text in an upcoming week, use the provided function `tokenize(text: str) -> List[str]` to split text into individual words.

Create your inverted index as a Python dictionary, pointing from each token to a **posting list of movie ids sorted in ascending order**:

```Python
{
    "netflix": [1029, 1038, 1155, ...],
    "original": [218, 508, 1029, ...],
    ...
}
```

<br/>
<div class="alert alert-danger">
⚠️ Ensure that each movie id is added only once for a given token. That is, if a movie's title and description contain the same word multiple times, your posting list for that word should contain only one reference to the movie. We will look into counting terms for ranking in week 3.
</div>
<div class="alert alert-warning">
💡 Tip: The "collections.defaultdict" class might be useful to simplify your code, but it is not required to solve this task.
</div>
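
To make the data structure concrete, here is a small sketch of an inverted index over a hypothetical three-document toy collection. The names `toy_docs` and `build_toy_index` are made up for illustration, and the plain `lower().split()` stands in for the provided `tokenize`. It shows one possible way to collect ids in sets so that duplicates are dropped before sorting; it is not the reference solution.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy collection: document id -> text (not the Netflix data)
toy_docs = {
    3: "the crown",
    1: "the witcher",
    2: "witcher witcher nightmare of the wolf",
}


def build_toy_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    # Collect ids in a set per token so repeated words count only once
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Convert each set into a posting list sorted in ascending order
    return {token: sorted(ids) for token, ids in postings.items()}


toy_index = build_toy_index(toy_docs)
print(toy_index["witcher"])  # [1, 2]
print(toy_index["the"])      # [1, 2, 3]
```

Document 2 mentions "witcher" twice, yet its id appears only once in the posting list.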

In [None]:
def tokenize(text: str) -> List[str]:    
    # Lowercase all text
    text = text.lower()
    # Naively remove all punctuation from text
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Naively split text into tokens on whitespace
    return text.split()

In [None]:
def create_index(df: pd.DataFrame) -> Dict[str, List[int]]:
    """
    >>> index = create_index(df)
    >>> index["bridgerton"]
    [2657, 5716]
    
    >>> index["witcher"]
    [2648, 3373, 3826, 4050, 4287, 4882, 5550, 5800]
    
    >>> len(index)
    25945
    """
    index = {}

    # Example of how to iterate over a dataframe
    for i, row in df.iterrows():
        # You can access details for each movie using the row variable:
        # E.g., row["id"], row["title"], row["description"]
        pass
    
    # ADD YOUR CODE HERE

    return index

index = create_index(df)

In [None]:
test(create_index)

In [None]:
answer_2_2_1 = 0
# ADD YOUR CODE HERE

In [None]:
assert answer_2_2_1 >= 0, "Vocabulary size should be a non-negative number"

### 2.2.2 Which document in your index is the longest (in terms of number of unique tokens)? Answer with its movie id. [0.5 pts]

In [None]:
answer_2_2_2 = 0
# ADD YOUR CODE HERE

In [None]:
assert 0 <= answer_2_2_2 <= 6136, "The answer should be a valid movie id"

### 2.2.3 What are the ten most common terms in the index? [0.5 pts]
Your answer should be a list of (word, number_of_documents) tuples sorted by number of documents in descending order, e.g.:
```
[("netflix", 123), ("originals", 118), ...]
```

In [None]:
answer_2_2_3 = []
# ADD YOUR CODE HERE

In [None]:
assert all([isinstance(term, str) and isinstance(freq, int) for term, freq in answer_2_2_3]), "Make sure your list has the correct format"
frequencies = list(map(lambda x: x[1], answer_2_2_3))
assert frequencies == sorted(frequencies, reverse=True), "Make sure that your list is sorted in descending order"

### 📚 Stopwords
If you take a closer look at the most common terms in our index, you may notice that they are mostly articles and prepositions. These terms tend to carry relatively little information and are typically referred to as *stopwords*. Historically, many information retrieval systems removed (i.e., "stopped") these common words during index creation. We will look into dealing with stopwords in an upcoming week.

## 2.3 Boolean AND search [2 pts]

📝  Next, we use our inverted index to answer boolean AND queries (also called conjunctive queries). Complete the function `search_and` below. The input to your function is your index and a list of keywords connected by the AND operator. The call:

`search_and(index, ["captain", "america", "avenger"])`

corresponds to the boolean query:

`captain AND america AND avenger`

The result of your function should be a list of movie titles sorted in alphabetical order.

<br/>
<div class="alert alert-warning">
💡 Tip I: Use the helper dictionary "id2title" to find a movie's title using its id.
<br/>
💡 Tip II: Set operations might help to quickly find common elements between two lists.
</div>
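
The set-based idea from Tip II can be sketched on a hypothetical toy index as follows. The names `toy_index` and `intersect_postings` are illustrative assumptions; the sketch returns document ids only, so mapping ids to titles and sorting them is left out on purpose.

```python
from typing import Dict, List

# Hypothetical toy index: token -> sorted posting list of document ids
toy_index = {
    "black": [1, 2, 5],
    "mirror": [2, 4, 5],
    "panther": [1],
}


def intersect_postings(index: Dict[str, List[int]], tokens: List[str]) -> List[int]:
    # An empty query matches nothing in this sketch
    if not tokens:
        return []
    # Unknown tokens get an empty posting list, which empties the result
    posting_sets = [set(index.get(token, [])) for token in tokens]
    # A document matches only if it appears in every posting list
    return sorted(set.intersection(*posting_sets))


print(intersect_postings(toy_index, ["black", "mirror"]))   # [2, 5]
print(intersect_postings(toy_index, ["black", "unknown"]))  # []
```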

In [None]:
# Create a mapping of movie ids to their title
id2title = df.set_index("id").title.to_dict()


def search_and(index: Dict, tokens: List[str]) -> List[str]:
    """
    >>> search_and(index, ["stranger", "things"])
    ['beyond stranger things', 'stranger things']
    
    >>> search_and(index, ["black", "mirror"])
    ['black mirror', 'black mirror: bandersnatch', 'death to 2020']
    
    >>> search_and(index, ["queer", "eye"])
    ['queer eye', 'queer eye germany', 'queer eye: brazil', "queer eye: we're in japan!"]
    """
    titles = []

    # ADD YOUR CODE HERE

    return titles

In [None]:
test(search_and)

## 2.4 Boolean OR search [2 pts]

📝  Next, complete the function `search_or` to answer disjunctive (OR) queries.

`search_or(index, ["captain", "america", "avenger"])`

corresponds to the boolean query:

`captain OR america OR avenger`

Sort the resulting titles alphabetically.
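
On a hypothetical toy index, a disjunctive query amounts to the union of the posting lists. The names `toy_index` and `union_postings` below are illustrative assumptions, and the sketch works on document ids rather than titles.

```python
from typing import Dict, List

# Hypothetical toy index: token -> sorted posting list of document ids
toy_index = {
    "black": [1, 2, 5],
    "mirror": [2, 4, 5],
    "panther": [1],
}


def union_postings(index: Dict[str, List[int]], tokens: List[str]) -> List[int]:
    matches = set()
    for token in tokens:
        # A document matches if it appears in at least one posting list
        matches.update(index.get(token, []))
    return sorted(matches)


print(union_postings(toy_index, ["black", "panther"]))   # [1, 2, 5]
print(union_postings(toy_index, ["mirror", "unknown"]))  # [2, 4, 5]
```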

In [None]:
def search_or(index: Dict, tokens: List[str]):
    """
    >>> search_or(index, ["mindhunter", "dahmer"])
    ['conversations with a killer: the jeffrey dahmer tapes', 'dahmer - monster: the jeffrey dahmer story', 'mindhunter']
    
    >>> search_or(index, ["burnham", "brennan"])
    ['bo burnham: inside', 'bo burnham: make happy', 'bo burnham: the inside outtakes', 'bo burnham: what.', 'neal brennan: 3 mics', 'neal brennan: blocks']
    """
    titles = []

    # ADD YOUR CODE HERE

    return titles

In [None]:
test(search_or)

## 2.5 Boolean AND NOT [1 pt]
📝  Third, extend your answer to the conjunctive query from 2.3 above to handle a list of negated terms.

```
search_and_not(
    index,
    ["queens", "gambit"],
    excluded_tokens=["netflix", "afterparty"],
)
```

corresponds to the boolean query:

`queens AND gambit AND NOT (netflix OR afterparty)`

Sort the resulting titles alphabetically.

<div class="alert alert-warning">
💡 Tip: Think about how you can incorporate the "search_and" and "search_or" methods from above to make this task easier.
</div>
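
In set terms, the query shape above is an intersection followed by a set difference. The sketch below illustrates this on a hypothetical toy index; `toy_index` and `and_not_postings` are made-up names, and the function returns document ids rather than titles.

```python
from typing import Dict, List

# Hypothetical toy index: token -> sorted posting list of document ids
toy_index = {
    "queens": [1, 2, 3],
    "gambit": [1, 2, 3, 7],
    "netflix": [3, 8],
    "afterparty": [9],
}


def and_not_postings(
    index: Dict[str, List[int]], tokens: List[str], excluded_tokens: List[str]
) -> List[int]:
    if not tokens:
        return []
    # Documents that contain all required tokens (AND)
    included = set.intersection(*(set(index.get(t, [])) for t in tokens))
    # Documents that contain at least one excluded token (OR)
    excluded = set().union(*(index.get(t, []) for t in excluded_tokens))
    # Keep required matches that are not excluded (AND NOT)
    return sorted(included - excluded)


print(and_not_postings(toy_index, ["queens", "gambit"], ["netflix", "afterparty"]))  # [1, 2]
print(and_not_postings(toy_index, ["queens", "gambit"], []))  # [1, 2, 3]
```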

In [None]:
def search_and_not(index: Dict, tokens: List[str], excluded_tokens: List[str] = []):
    """    
    >>> search_and_not(index, ["stranger", "things"], [])
    ['beyond stranger things', 'stranger things']
    
    >>> search_and_not(index, ["stranger", "things"], ["beyond"])
    ['stranger things']
    
    >>> search_and_not(index, ["queens", "gambit"], ["netflix", "afterparty"])
    ["creating the queen's gambit", "the queen's gambit"]
    
    >>> search_and_not(index, ["queens", "gambit"], ["netflix", "afterparty", "creating"])
    ["the queen's gambit"]
    """
    titles = []

    # ADD YOUR CODE HERE
    
    return titles

In [None]:
test(search_and_not)

## 2.6 Ranked Boolean Search [2 pts]
📝  So far, we have sorted our results mostly alphabetically. By definition, boolean search is NOT ranked and just returns all items that match the query. However, we can introduce a simple ranking for an OR search, for example, by listing documents that match more query terms at the top.

Extend your OR search from 2.4 to return a ranked list of movies. Return not only the movie titles but the number of matching query tokens for the movie. A query `["black", "mirror"]` might result in: `[("black mirror", 2), ("black panther", 1), ...]`

Rank the resulting movies from most matching keywords to least matching keywords. When two movies have the same number of matches, rank them alphabetically by movie title (A -> Z). Since the resulting list can get very long, return only the top k results.

<div class="alert alert-warning">
💡 Tip: The "collections.Counter" class might be useful to simplify your code.
</div>
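
The ranking rule (more matches first, ties broken alphabetically) can be expressed with a single sort key. The sketch below uses a hypothetical toy index and title mapping; `toy_index`, `toy_titles`, and `rank_by_matches` are illustrative names. Counting a repeated query token only once is an assumption of this sketch, not something the task prescribes.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy data (not the Netflix dataset)
toy_index = {"black": [1, 2, 5], "mirror": [2, 4, 5], "2020": [5]}
toy_titles = {
    1: "black panther",
    2: "black mirror",
    4: "mirror mirror",
    5: "death to 2020",
}


def rank_by_matches(
    index: Dict[str, List[int]],
    titles: Dict[int, str],
    tokens: List[str],
    top_k: int,
) -> List[Tuple[str, int]]:
    counts = Counter()
    # Assumption: a query token that is repeated is only counted once
    for token in set(tokens):
        counts.update(index.get(token, []))
    # Sort by match count (descending), then by title (ascending)
    ranked = sorted(
        ((titles[doc_id], n) for doc_id, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top_k]


print(rank_by_matches(toy_index, toy_titles, ["black", "mirror", "2020"], 3))
# [('death to 2020', 3), ('black mirror', 2), ('black panther', 1)]
```

Negating the count in the sort key lets one ascending sort handle both criteria at once.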

In [None]:
def search_ranked_or(index: Dict, tokens: List[str], top_k: int):
    """
    >>> search_ranked_or(index, ["world", "planet", "david", "attenborough"], 4)
    [('david attenborough: a life on our planet', 4), ('breaking boundaries: the science of our planet', 3), ('aerials', 2), ('dark tourist', 2)]
    
    >>> search_ranked_or(index, ["teenage", "drug", "lord", "fast"], 3)
    [('shiny_flakes: the teenage drug lord', 4), ('how to sell drugs online (fast)', 3), ('earth and blood', 2)]
    
    >>> search_ranked_or(index, ["black", "mirror", "2020"], 3)
    [('death to 2020', 3), ('black mirror', 2), ('black mirror: bandersnatch', 2)]
    """
    titles = []

    # ADD YOUR CODE HERE

    return titles[:top_k]

In [None]:
test(search_ranked_or)

# Part III Bonus - Parametric Search

Search systems often allow for more detailed filtering of results using search parameters. These can come in handy to only show hotels in our price range, clothes in our size, or last-minute presents that ship with same-day delivery. This functionality is traditionally enabled using parametric indices, which index the metadata of our items. In the case of movies, that might include fields like genre, director, or actors.

## Parametric Inverted Index [1 pt]
As an optional bonus task, extend the index created in task 2.1 to contain movie metadata. Add the movies' `genres`, `actors`, `directors`, and `release year` to the same index by creating compound keys such as:

```Python
{
    "genre:comedy": [1, 2, 100, ...],
    "director:quentin-tarantino": [movie_ids],
    "actor:meryl-streep": [movie_ids],
    "year:2020": [movie_ids],
    "country:nl": [movie_ids],
    ...
}
```

Afterward, you should be able to use your new index with the `search_and`, `search_or`, and `search_and_not` functions created above to answer the questions below.
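
As a rough sketch of the compound-key idea, the example below indexes two hypothetical movie records. The names `toy_movies`, `make_key`, and `build_toy_parametric_index` are made up, and the normalisation (lowercasing and replacing spaces with hyphens) is an assumption inferred from keys such as `actor:meryl-streep`. The toy records also use a plain `year` field, which may differ from the column name in the real dataset.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy records (not the Netflix dataset)
toy_movies = [
    {"id": 2, "genres": ["Comedy"], "actors": ["Meryl Streep"], "year": 2020},
    {"id": 1, "genres": ["Drama", "Comedy"], "actors": ["Meryl Streep", "Tom Hanks"], "year": 2017},
]


def make_key(field: str, value) -> str:
    # Assumed normalisation: lowercase, spaces replaced by hyphens
    return f"{field}:{str(value).strip().lower().replace(' ', '-')}"


def build_toy_parametric_index(movies: List[dict]) -> Dict[str, List[int]]:
    postings = defaultdict(set)
    for movie in movies:
        # Multi-valued fields produce one key per value
        for genre in movie["genres"]:
            postings[make_key("genre", genre)].add(movie["id"])
        for actor in movie["actors"]:
            postings[make_key("actor", actor)].add(movie["id"])
        # Single-valued fields produce exactly one key
        postings[make_key("year", movie["year"])].add(movie["id"])
    return {key: sorted(ids) for key, ids in postings.items()}


toy_parametric_index = build_toy_parametric_index(toy_movies)
print(toy_parametric_index["actor:meryl-streep"])  # [1, 2]
print(toy_parametric_index["year:2017"])           # [1]
```

Note that `load_movies` fills missing values with empty strings before splitting, so an empty field becomes a list containing one empty string; you may want to skip such values when building keys.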

In [None]:
def create_parametric_index(df: pd.DataFrame) -> Dict[str, List[int]]:
    """
    >>> parametric_index = create_parametric_index(df)
    >>> parametric_index["actor:meryl-streep"]
    [45, 239, 706, 1723, 3268, 3276, 3598, 4310, 4577]
    
    >>> parametric_index["director:martin-scorsese"]
    [224, 284, 2677, 2803]
    
    >>> parametric_index["year:1975"]
    [5, 25, 30]
    """
    index = {}

    # ADD YOUR CODE HERE

    return index


parametric_index = create_parametric_index(df)

In [None]:
test(create_parametric_index)

### 3.2.1 Query I [0.5 pts]
Which shows was comedian John Mulaney part of in 2017?

In [None]:
answer_3_2_1 = []
# ADD YOUR CODE HERE

In [None]:
assert all([isinstance(a, str) for a in answer_3_2_1]), "Expecting a list of movie titles"

### 3.2.2 Query II [0.5 pts]

Which indexed movies starring Leonardo DiCaprio were NOT directed by Martin Scorsese?

In [None]:
answer_3_2_2 = []
# ADD YOUR CODE HERE

In [None]:
assert all([isinstance(a, str) for a in answer_3_2_2]), "Expecting a list of movie titles"