Before you turn this assignment in, make sure everything runs as expected by going to the menubar and running: 

**Kernel $\rightarrow$ Restart & Run All**

Please replace all spots marked with `# ADD YOUR CODE HERE` or `ADD YOUR ANSWER HERE`.

And start by filling in your name and student_id below:

In [None]:
NAME = ""
STUDENT_ID = ""

In [None]:
assert len(NAME) > 0, "Please fill in your name"
assert len(STUDENT_ID) > 0, "Please fill in your student id"

---

In [None]:
import doctest
import numpy as np
import pandas as pd
import xgboost as xgb

from typing import Callable, Dict, List

from scipy.spatial.distance import cosine
from sklearn.metrics import ndcg_score

In [None]:
def test(fn: Callable):
    # Turn off doctests in autograding:
    if __name__ == "__main__":
        doctest.run_docstring_examples(fn, globals(), verbose=True, name=fn.__name__, optionflags=doctest.ELLIPSIS)

# Week 4 - Learning to Rank & Vocabulary Mismatch

Welcome to week four of Zoekmachines! 👋

In part I of this week's assignment, we will learn how to automatically combine features for ranking. These methods are more computationally expensive but also more powerful than the ranking approaches we discussed in previous weeks. It is common to use a fast and simple method to retrieve a candidate set of items (e.g., retrieve a top 1000 list using bm25) and then apply a slower but more advanced learning-to-rank model to create the final ranking.

In part II, we re-visit the problem of vocabulary mismatch and how Latent Semantic Indexing might help to address the issue.

Compared to previous weeks, this week is more library-heavy since you will learn some tools that are actually used in real-world industry applications.

Good luck with the assignment!

### Resources
📚 [Latent Semantic Indexing - Manning, Raghavan, Schütze - Chapter 18.4](https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf)

🌐 [Scikit-Learn TfidfVectorizer API](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

🌐 [Scikit-Learn LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

🌐 [Learning to rank with XGBoost](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html)

# Part I - Learning to rank (LTR)

In previous weeks, we've looked at basic ranking methods such as tf-idf, bm25, or language models. However, the impact of different features (such as tf, idf, or document length) on the overall ranking is tuned manually and not learned automatically, making it difficult to introduce additional features. Modern search engines consider hundreds of features for their ranking. So this week, we will automatically learn how to weigh different features to create better rankings.

We will investigate two approaches: (1) a pointwise approach using linear regression and (2) a pairwise approach employing gradient-boosted decision trees.

Let's begin by loading a larger version of the [covid TREC dataset](https://ir.nist.gov/trec-covid/) from last week. The dataset contains pairs of real search queries and their candidate documents (100 docs per query) to be ranked. Each query-document pair is accompanied by a human relevance judgment on a scale of 0, 1, 2. We load three datasets: a train dataset, a test dataset for evaluation, and a combination of both, which we use during indexing.

In [None]:
def load_data(url: str) -> pd.DataFrame:
    df = pd.read_csv(url)
    df["body"] = df["body"].fillna("")
    return df

train_df = load_data("https://raw.githubusercontent.com/irlabamsterdam/uva-ir0-assignments/main/data/trec-covid-train.csv")
test_df = load_data("https://raw.githubusercontent.com/irlabamsterdam/uva-ir0-assignments/main/data/trec-covid-test.csv")
df = pd.concat([train_df, test_df])
df.head()

## 1.1 Feature engineering: TF-IDF vectors

Before learning how to weigh features, we need to create our features. As you can imagine, there are plenty of potential features that are useful during ranking. Here, for example, is a list of common real-world features used in [Microsoft Bing](https://www.microsoft.com/en-us/research/project/mslr/). We will compute features based on the tf-idf weighting introduced last week. But instead of manually implementing an inverted index and repeating all of last week's calculations, we will take a more comfortable route and use scikit-learn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for document preprocessing and tf-idf computation.

📝 The [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class can perform, among other things, tokenization, normalization, and stopping, and it can transform a piece of raw text directly into a vector of normalized tf-idf values. In the following, create a TfidfVectorizer that:

1. Lowercases all tokens.
2. Removes English stopwords.
3. Uses a sublinear log scaling for the term frequency like we used last week (`1 + log(tf)` instead of using the raw token count `tf`).

Return a vectorizer that is trained on all documents in the `documents` list below.
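To get a feeling for what the sublinear scaling in step 3 does, here is a minimal sketch that compares raw term counts with their log-scaled counterparts. The helper name `log_scaled_tf` is made up for this illustration, and we assume the natural logarithm and a weight of 0 for terms that do not occur.

```python
import math


def log_scaled_tf(tf: int) -> float:
    """Return 1 + ln(tf) for terms that occur, and 0 for absent terms."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


# A term that occurs 100 times is not treated as 100 times more important.
for tf in (1, 2, 10, 100):
    print(f"raw tf = {tf:>3}  ->  scaled tf = {log_scaled_tf(tf):.3f}")
```

The scaled weight grows slowly with the raw count, which dampens the influence of terms that are repeated many times inside one document.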

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_vectorizer(df: pd.DataFrame) -> TfidfVectorizer:
    """
    # Check size of indexed vocabulary
    >>> vectorizer = fit_vectorizer(df)
    >>> len(vectorizer.vocabulary_)
    40111
    
    # Embed a single document "covid 19 pandemic" and check positions in vector
    >>> vector = vectorizer.transform(["covid 19 pandemic"])
    >>> np.array_equal(vector.nonzero()[1], np.array([838, 10456, 27452]))
    True
    
    # Embed a single document "covid 19 pandemic" and check tfidf values
    >>> vector = vectorizer.transform(["covid 19 pandemic"])
    >>> np.allclose(vector[vector.nonzero()], np.array([[0.47729191, 0.4755531, 0.73894633]]))
    True
    """
    vectorizer = None
    documents = list(df["query"]) + list(df["title"]) + list(df["body"])
    
    # ADD YOUR CODE HERE
    
    return vectorizer

In [None]:
test(fit_vectorizer)

In [None]:
if __name__ == "__main__":
    vectorizer = fit_vectorizer(df)

## 1.2 - Feature Engineering

Next, we use the vectorizer to engineer our features. We will only compute four features. Note that real search engines employ hundreds of features ([Yahoo!'s public LTR dataset contains 700 for example](http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf)). These features might depend only on the document (e.g., document length) and can be computed in advance, or they might depend on the current query and have to be computed on the fly.

Our train dataset contains 50 search queries each with 100 candidate documents. We already created query-document pairs so that each row of our dataset contains a document with `title` and `body` and a matching search `query`.

📝 Use the TfidfVectorizer supplied in the method below to **create vectors for the search query, document title, and body text**. Use these three vectors to compute the following features and return them in a new pandas DataFrame:

* `title_tfidf`: The cosine similarity between the query vector and title vector of the document.
* `body_tfidf`: The cosine similarity between the query vector and body vector of the document.
* `title_overlap`: The number of unique matching terms between query and title divided by the number of unique query terms.
* `body_overlap`: The number of unique matching terms between the query and body divided by the number of unique query terms.

Unique matches in this context means that repeated tokens are counted only once. E.g., if the query is "coronavirus" and the document contains the word twice, it's counted as one match. Note that you should also compute the overlap features from the tf-idf vectors rather than tokenizing the text into individual words yourself.
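The following sketch illustrates both kinds of features on hand-made dense vectors. It is only an illustration under simplifying assumptions: the five-term vocabulary and all tf-idf values are invented, the helper names `cosine_sim` and `overlap` are hypothetical, and real vectorizer output is sparse rather than a plain numpy array.

```python
import numpy as np

# Hypothetical tf-idf vectors over a five-term vocabulary (values are made up).
query = np.array([0.7, 0.0, 0.7, 0.0, 0.0])
title = np.array([0.5, 0.5, 0.0, 0.0, 0.7])


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, defined as 0 if one of the vectors is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def overlap(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Share of unique query terms that also have a nonzero entry in the document."""
    query_terms = query_vec > 0
    shared = np.logical_and(query_terms, doc_vec > 0)
    n_query_terms = query_terms.sum()
    return float(shared.sum() / n_query_terms) if n_query_terms > 0 else 0.0


print(cosine_sim(query, title))
print(overlap(query, title))
```

Because every position in a tf-idf vector corresponds to one vocabulary term, a nonzero entry tells us that the term is present, no matter how often it occurs. This is why the overlap can be read off the vectors directly.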

<div class="alert alert-warning">
💡 Tip: Note that the vectorizer returns sparse vectors which save a lot of memory since they only store nonzero entries. However, some operations might not be supported on them. If necessary, call ".todense()" on the vectors to transform them into normal numpy representations.
</div>

<div class="alert alert-warning">
💡 Tip: Scikit-learn's "sklearn.metrics.pairwise.cosine_similarity" might be helpful.
</div>

<div class="alert alert-warning">
💡 Tip: Numpy's "np.logical_and" operation might come in handy when finding common entries between two vectors.
</div>

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


def get_features(vectorizer: TfidfVectorizer, df: pd.DataFrame) -> pd.DataFrame:
    """
    >>> feature_df = get_features(vectorizer, train_df.head(1))
    >>> list(feature_df.columns)
    ['query_id', 'relevance', 'title_tfidf', 'body_tfidf', 'title_overlap', 'body_overlap']
    
    >>> get_features(vectorizer, train_df.head(3)).round(3)
       query_id  relevance  title_tfidf  body_tfidf  title_overlap  body_overlap
    0         1          1          0.0       0.144            0.0           1.0
    1         1          0          0.0       0.010            0.0           0.5
    2         1          0          0.0       0.013            0.0           0.5
    
    >>> get_features(vectorizer, train_df.tail(3)).round(3)
       query_id  relevance  title_tfidf  body_tfidf  title_overlap  body_overlap
    0        50          2        0.090       0.043          0.333         0.667
    1        50          1        0.109       0.102          0.667         0.667
    2        50          0        0.000       0.012          0.000         0.333
    """
    rows = []
    
    for i, row in df.iterrows():
        title_tfidf = 0
        body_tfidf = 0
        title_overlap = 0
        body_overlap = 0

        # ADD YOUR CODE HERE
        
        rows.append({
            "query_id": row["query_id"],
            "relevance": row["relevance"],
            "title_tfidf": title_tfidf,
            "body_tfidf": body_tfidf,
            "title_overlap": title_overlap,
            "body_overlap": body_overlap,
        })
        
    return pd.DataFrame(rows)

In [None]:
test(get_features)

In [None]:
if __name__ == "__main__":
    train_feature_df = get_features(vectorizer, train_df)
    test_feature_df = get_features(vectorizer, test_df)

## Evaluation

We have now generated features on the train and test datasets using your implementation above. Next, we add a helper method that evaluates the `nDCG@10` and `nDCG@100` when ranking by each feature on its own.

There is no task here, just execute the following cells and inspect the resulting ranking performance of each feature.
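If you want to see what happens inside an nDCG computation, here is a minimal hand-written sketch for a single query. It assumes linear gain and a log2 rank discount, which we believe mirrors the defaults of scikit-learn's `ndcg_score`. Unlike the library, it does not average over tied scores. The function names and toy labels are made up for this example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(y_true, y_score, k):
    """nDCG@k for one query: DCG of the predicted order divided by the ideal DCG."""
    # Order the relevance labels by predicted score, highest score first.
    by_score = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    ranked = [rel for _, rel in by_score]
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


y_true = [2, 1, 0, 0]
print(ndcg_at_k(y_true, [0.9, 0.5, 0.2, 0.1], k=3))  # relevant documents ranked first
print(ndcg_at_k(y_true, [0.1, 0.2, 0.5, 0.9], k=3))  # relevant documents ranked last
```

A ranking that places the most relevant documents first reaches the maximum value, while pushing them down the list lowers the score.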

In [None]:
def evaluate(df: pd.DataFrame, score_column: str) -> Dict:
    df = df.groupby(["query_id"]).agg(
        y=("relevance", list),
        y_predict=(score_column, list)
    )
    
    return {
        "score": score_column,
        "nDCG@10": ndcg_score(list(df.y), list(df.y_predict), k=10),
        "nDCG@100": ndcg_score(list(df.y), list(df.y_predict), k=100),
    }

In [None]:
if __name__ == "__main__":
    print(evaluate(test_feature_df, "title_tfidf"))
    print(evaluate(test_feature_df, "title_overlap"))
    print(evaluate(test_feature_df, "body_tfidf"))
    print(evaluate(test_feature_df, "body_overlap"))

## 1.3 Pointwise LTR

Now, let's build our first LTR model. To recap, our dataset contains search queries $q$. Each query-document pair is represented by a feature vector $x$ containing `['title_tfidf', 'body_tfidf', 'title_overlap', 'body_overlap']`. And for each pair, we have a relevance annotation obtained by asking human judges (in our dataset the column `relevance`).

The pointwise approach ignores which query-document pairs belong to the same query and just learns a function $f(x)$ that maps from our feature vectors as closely as it can to our relevance labels $y$. A classic approach is to use a linear regression model $f(x) = wx + b$, where the weights $w$ and the bias term $b$ are parameters that we need to learn by minimizing the mean squared error loss over all query-document pairs in our dataset:

$\frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2$
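As a small worked example, the sketch below evaluates the mean squared error of two hand-picked parameter settings on a toy dataset. The feature values, labels, and candidate weights are invented for illustration. Training a regression model amounts to searching for the $w$ and $b$ that make this number as small as possible.

```python
import numpy as np

# Toy data: one row per query-document pair, one column per feature (made up).
X = np.array([[0.0, 0.5],
              [1.0, 1.0],
              [0.2, 0.1]])
y = np.array([0.0, 2.0, 1.0])


def mse(w: np.ndarray, b: float) -> float:
    """Mean squared error of the linear model f(x) = w . x + b on the toy data."""
    predictions = X @ w + b
    return float(np.mean((predictions - y) ** 2))


print(mse(np.array([0.0, 0.0]), 0.0))  # a model that always predicts 0
print(mse(np.array([1.0, 1.0]), 0.0))  # a model that sums both features
```

The second parameter setting produces a much lower loss because its predictions are closer to the relevance labels.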

📝 In the following, use scikit-learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to implement a pointwise ranking approach that predicts the relevance score from our document features. Complete the class below:

1. Create and train a new regression model inside the `fit` method.
2. Use the trained model in the `predict` step to predict the relevance score for a dataframe of unseen documents.

In [None]:
from sklearn.linear_model import LinearRegression


def test_ranker():
    df = pd.DataFrame({
        "query_id": [0, 0],
        "title_tfidf": [0, 1],
        "body_tfidf": [0.5, 1],
        "title_overlap": [0, 0.5],
        "body_overlap": [1, 0],
        "relevance": [0, 5]
    })
    ranker = PointwiseRanker()
    ranker.fit(df)
    return ranker, df


class PointwiseRanker:
    def __init__(self):
        self.model = None
    
    def fit(self, train_df: pd.DataFrame):
        """
        # Check the learned feature weights and bias on a toy dataset:
        >>> ranker, _ = test_ranker()
        >>> np.allclose(ranker.model.coef_, np.array([ 2.,  1.,  1., -2.]))
        True
        >>> round(ranker.model.intercept_, 4)
        1.5
        """

        # ADD YOUR CODE HERE
        ...
    
    def predict(self, test_df: pd.DataFrame) -> np.array:
        """
        # Check predictions on a toy dataset:
        >>> ranker, df = test_ranker()
        >>> np.allclose(ranker.predict(df), df.relevance)
        True
        """
        
        predicted_relevance = np.array([])
        
        # ADD YOUR CODE HERE
        
        return predicted_relevance

In [None]:
test(PointwiseRanker.fit)
test(PointwiseRanker.predict)

In [None]:
if __name__ == "__main__":
    model = PointwiseRanker()
    model.fit(train_feature_df)
    test_feature_df["pointwise"] = model.predict(test_feature_df)
    print(evaluate(test_feature_df, "pointwise"))

## 1.4 Pairwise LTR

Instead of independently predicting a relevance score for each feature vector, pairwise approaches learn the relative preference between two document candidates belonging to the same query. That is, we learn whether one feature vector $x_i$ is more relevant than another $x_j$ for a given query, i.e., whether $y_i > y_j$.
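To make the idea of preference pairs concrete, here is a minimal sketch that builds all pairs with different labels for one query and scores them with the RankNet loss $\log(1 + e^{-(s_i - s_j)})$ for a pair where document $i$ should be ranked above document $j$. The labels and model scores are invented, the helper names are hypothetical, and the scaling factor of the sigmoid is assumed to be 1. This is not how XGBoost works internally, which samples pairs and operates on gradients.

```python
import math
from itertools import combinations

# One hypothetical query with three candidates: (relevance label, model score).
candidates = [(2, 1.5), (0, 0.3), (1, 0.9)]


def preference_pairs(items):
    """Return (score_preferred, score_other) for all pairs with different labels."""
    pairs = []
    for (y_i, s_i), (y_j, s_j) in combinations(items, 2):
        if y_i > y_j:
            pairs.append((s_i, s_j))
        elif y_j > y_i:
            pairs.append((s_j, s_i))
    return pairs


def ranknet_loss(s_preferred: float, s_other: float) -> float:
    """Small if the preferred document is scored higher, large otherwise."""
    return math.log1p(math.exp(-(s_preferred - s_other)))


pairs = preference_pairs(candidates)
losses = [ranknet_loss(s_pref, s_other) for s_pref, s_other in pairs]
print(pairs)
print(losses)
```

Pairs with equal labels carry no preference and are skipped. A correctly ordered pair always has a loss below $\log 2$, the value obtained when both documents receive the same score.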

📝 In the following, implement the pairwise ranking method RankNet that we've learned in this week's lecture using the [gradient boosted decision tree library XGBoost](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html):

1. Configure the `XGBRanker` class to use the RankNet objective.
2. Configure how the algorithm should construct document pairs. Use the `mean` construction method and sample `1` random pair for each document in our query. Note that to construct document pairs, the algorithm needs information about which documents belong to the same query (`qid`).
3. Configure your gradient boosted decision tree, use `10` estimators with a max depth of `5` each.
4. Set the random_state to `0` to make your code reproducible.

In [None]:
from xgboost import XGBRanker


def test_ranker():
    df = pd.DataFrame({
        "query_id": [0, 0, 1, 1],
        "title_tfidf": [0, 1, 0, 0.5],
        "body_tfidf": [0.5, 1, 0.5, 1],
        "title_overlap": [0, 0.5, 0, 1],
        "body_overlap": [1, 0, 1, 0],
        "relevance": [0, 1, 0, 1]
    })
    ranker = PairwiseRanker()
    ranker.fit(df)
    return ranker, df


class PairwiseRanker:
    def __init__(self):
        self.model = None
    
    def fit(self, train_df: pd.DataFrame):
        """
        # Check the pairwise model on a toy dataset:
        >>> ranker, df = test_ranker()
        >>> ranker.model.get_params()["objective"] # Model uses correct objective
        'rank:pairwise'
        >>> ranker.model.get_params()["random_state"] # Model uses correct random state
        0
        >>> ranker.model.get_params()["lambdarank_pair_method"] # Model constructs pairs in the correct way
        'mean'
        """
        
        # ADD YOUR CODE HERE
        ...
    
    def predict(self, test_df: pd.DataFrame):
        """
        # Check the pairwise model predictions on a toy dataset:
        >>> ranker, df = test_ranker()
        >>> ranker.predict(df).round(2)
        array([-0.74,  0.74, -0.74,  0.74], dtype=float32)
        """
        predicted_relevance = np.array([])
        
        # ADD YOUR CODE HERE
        
        return predicted_relevance

In [None]:
test(PairwiseRanker.fit)
test(PairwiseRanker.predict)

In [None]:
if __name__ == "__main__":
    model = PairwiseRanker()
    model.fit(train_feature_df)
    test_feature_df["pairwise"] = model.predict(test_feature_df)
    print(evaluate(test_feature_df, "pairwise"))

## 1.5.1 Motivation listwise loss

📝 Now that we've built a simple pointwise and pairwise LTR model, what is a listwise LTR model? What is the motivation to use a listwise loss over a pointwise or a pairwise approach?

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

## 1.5.2 Comparing results

Lastly, describe how the pointwise, pairwise, and listwise results compare against each other on our covid dataset. You can try out a listwise LambdaMART approach by simply switching the objective in your pairwise implementation to `"rank:ndcg"`.

Describe also how the three LTR approaches compare against ranking using our initial features from task 1.2.

<div class="alert alert-danger">
⚠️ Please make sure to switch back to the pairwise objective before submitting your notebook.
</div>

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

# Part II - Latent Semantic Indexing (LSI)

The vector space model used in previous weeks (and the one above) suffers from the vocabulary mismatch problem (see the assignment in week 1). Two issues are particularly common:

- Synonyms: Multiple words refer to the same concept: car, automobile, ...
- Polysemy: One word refers to multiple concepts: light (color, not heavy, not serious)

Synonyms can lead to underscoring relevant documents if they use a synonym of our search query, and polysemy can lead to retrieving false-positive documents for our search query. Instead of searching for exact matches in our document-term matrix, the idea of LSI is to represent documents by groups of related words (topics) rather than a set of fixed terms.
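The effect on synonyms can be illustrated with a tiny, hand-made example. In the sketch below, "car" and "automobile" never appear in the same document, so their raw term vectors are orthogonal. After projecting the terms onto the two strongest latent topics, both end up pointing in the same direction because they share the context word "engine". The vocabulary, the three documents, and the choice of `k = 2` are assumptions made for this illustration, and we use numpy's full SVD here rather than the truncated variant used later in the assignment.

```python
import numpy as np

terms = ["car", "automobile", "engine", "lamp", "bulb"]

# Binary term-document matrix (terms x docs) for three made-up documents.
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # lamp
    [0, 0, 1],  # bulb
], dtype=float)


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Factorize A and keep only the k strongest latent topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_topics = U[:, :k] * s[:k]  # one k-dimensional topic vector per term

car, automobile, lamp = (terms.index(t) for t in ("car", "automobile", "lamp"))
raw_sim = cosine_sim(A[car], A[automobile])
lsi_sim = cosine_sim(term_topics[car], term_topics[automobile])
lsi_car_lamp = cosine_sim(term_topics[car], term_topics[lamp])

print(raw_sim, lsi_sim, lsi_car_lamp)
```

In the topic space, the two synonyms become nearly identical, while "car" and "lamp" remain unrelated because they belong to different topics.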

In this part, we take a look at computing topics using Singular Value Decomposition (SVD) on NYTimes news article headlines. For that, first download a new dataset containing news headlines from the NYTimes published between 2015 and 2018:

In [None]:
article_df = pd.read_csv("https://raw.githubusercontent.com/irlabamsterdam/uva-ir0-assignments/main/data/nytimes.csv")
article_df.head()

## 2.1 Approximating the term-document matrix

One way to compute topics is to use SVD to factorize our term-document-matrix $A$ into three matrices of lower rank that approximate its original entries:

![](https://editor.analyticsvidhya.com/uploads/82407SVD.png)

📝 Describe the inputs and outputs of LSI:
1. Describe what $m$, $n$, and $r$ are in the figure above in the context of LSI (meaning when it comes to terms and documents).
2. What information is contained in matrix $U$, $\Sigma$, and $V^T$?

<div class="alert alert-info">ADD YOUR ANSWER HERE</div>

## 2.2 Constructing the term-document matrix

Let's implement LSI using SVD. We begin by creating a term-document matrix (of shape terms x docs) from our news headlines. Instead of just a binary encoding of terms and documents (like in this week's lecture slides), another common choice is to use tf-idf values in the term-document matrix.


📝  Complete the method below:

1.	Create a term-document matrix containing tf-idf values. Use the [TfidfVectorizer from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Configure the vectorizer to:
	- Remove English stopwords.
	- Only include terms that occur in at least 5 documents (document frequency).
2.	Return the term-document matrix (with shape terms x docs) and a list of vocabulary terms created by the vectorizer. This list should be sorted by the token ID assigned by the vectorizer, allowing you to look up words based on their ID.

In [None]:
def get_term_document_matrix(df: pd.DataFrame) -> tuple:
    """
    >>> term_doc_matrix, vocabulary = get_term_document_matrix(article_df)
    >>> term_doc_matrix.shape
    (3168, 10732)
    >>> term_doc_matrix.data[:10].round(2) # Check first ten entries in sparse term-document matrix
    array([0.53, 0.69, 0.46, 0.09, 0.09, 0.09, 0.06, 0.06, 0.06, 0.54])
    >>> len(vocabulary)
    3168
    >>> vocabulary[2928], vocabulary[571], vocabulary[1138], vocabulary[1448]
    ('trump', 'college', 'football', 'internet')
    >>> vocabulary[-10:]
    ['york', 'yorkers', 'young', 'younger', 'youth', 'youtube', 'zero', 'zika', 'zimbabwe', 'zone']
    """
    term_doc_matrix = np.array([])
    vocabulary = []

    # Treat each news headline as a document:
    text = list(df.title)

    # ADD YOUR CODE HERE

    return term_doc_matrix, vocabulary

In [None]:
test(get_term_document_matrix)

In [None]:
if __name__ == "__main__":
    term_doc_matrix, vocabulary = get_term_document_matrix(article_df)

## 2.3 Using SVD to compute term similarity

📝  Next, we will use SVD to approximate our term-document matrix using three smaller matrices. As we discussed above, each matrix contains different information and is helpful in solving different tasks. We want to use SVD to find semantically similar terms for this assignment. Thus, we only need one of the three matrices, the one of shape `terms x topics` (see the task above or check the lecture slides).


Implement the method below and use [scikit-learn's TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) to approximate the term-document matrix. Your matrix should be of the shape `terms x topics`.
1. Make the number of topics/components used by SVD a parameter of your method. Use 50 iterations and fix the random state to 42.
2. Next, compute the cosine similarity between all terms using their topic vectors. The resulting matrix should be of the shape `terms x terms`.
3. Lastly, set the cosine similarity of each term to itself (the diagonal) to `-1` and return the final similarity matrix.

<div class="alert alert-warning">
💡 Tip: Scikit-learn's "sklearn.metrics.pairwise.cosine_similarity" might be helpful.
</div>

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity


def get_term_similarity_matrix(term_doc_matrix, n_components):
    """
    >>> term_similarity = get_term_similarity_matrix(term_doc_matrix, n_components=10)
    >>> term_similarity.shape
    (3168, 3168)
    >>> (np.diag(term_similarity) == -1).all()
    True
    >>> np.allclose( # If true, n_components was probably not used.
    ...    get_term_similarity_matrix(term_doc_matrix, n_components=5),
    ...    get_term_similarity_matrix(term_doc_matrix, n_components=25))
    False
    >>> np.allclose( # If false, random_state is probably not fixed (should be 42).
    ...    get_term_similarity_matrix(term_doc_matrix, n_components=10),
    ...    get_term_similarity_matrix(term_doc_matrix, n_components=10))
    True
    """
    term_similarity = np.array([])

    # ADD YOUR CODE HERE

    return term_similarity

In [None]:
test(get_term_similarity_matrix)

In [None]:
if __name__ == "__main__":
    term_similarity_matrix = get_term_similarity_matrix(term_doc_matrix, n_components=200)

## 2.4 Finding similar terms

📝  Lastly, let's put everything together and find semantically similar tokens in our collection of news headlines. Complete the method below to find the top-n most similar terms for a given term with our similarity matrix (sorting by cosine similarity in descending order). Use the vocabulary to match between terms and their position in the similarity matrix.

In the lecture, we discussed the importance of choosing the number of latent components. If you want to see the impact yourself, play around with the `n_components` of the `get_term_similarity_matrix` method above and inspect how the list below changes. But make sure that you reset the number of components back to 200 before submitting your notebook.
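The general lookup pattern can be sketched on a toy example. The four-word vocabulary, the similarity values, and the helper name `most_similar` below are made up. The diagonal is already set to `-1`, so a term can never show up as its own nearest neighbor.

```python
import numpy as np

# Hypothetical vocabulary and symmetric similarity matrix (values are made up).
toy_vocabulary = ["apple", "banana", "cherry", "date"]
toy_similarity = np.array([
    [-1.0, 0.8, 0.1, 0.5],
    [0.8, -1.0, 0.3, 0.2],
    [0.1, 0.3, -1.0, 0.9],
    [0.5, 0.2, 0.9, -1.0],
])


def most_similar(term: str, top_n: int) -> list:
    """Look up the row of a term and return the terms with the highest similarity."""
    row = toy_similarity[toy_vocabulary.index(term)]
    order = np.argsort(row)[::-1]  # indices sorted by similarity, descending
    return [toy_vocabulary[i] for i in order[:top_n]]


print(most_similar("apple", top_n=2))
```

Since `np.argsort` sorts in ascending order, the index array is reversed before the first `top_n` positions are mapped back to terms.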

In [None]:
def find_similar_terms(term_similarity_matrix, vocabulary, term, top_n: int):
    """
    >>> find_similar_terms(term_similarity_matrix, vocabulary, term="joe", top_n=5)
    ['biden', 'run', 'presidential', 'taking', '2016']
    >>> find_similar_terms(term_similarity_matrix, vocabulary, term="internet", top_n=5)
    ['devices', 'privacy', 'era', 'allow', 'rules']
    >>> find_similar_terms(term_similarity_matrix, vocabulary, term="college", top_n=3)
    ['football', 'admissions', 'savings']
    """
    similar_tokens = []
    
    # ADD YOUR CODE HERE

    return similar_tokens[:top_n]

In [None]:
test(find_similar_terms)