# Coding Interview

The goal of this exercise is to simulate a retrieval problem. Implement every function yourself rather than calling a ready-made one, e.g. from `sklearn`.

You are given a vector `query`, the embedding of a query, and a matrix `vector_db`, whose rows are document embeddings.

We will use `cosine_similarity` as the metric to score the query against every row of `vector_db`, then retrieve the top similarity scores and the corresponding vectors (the documents most similar to the query).

You have **20 minutes** to implement the following functions:

* cosine_similarity
* get_top_k_scores
* get_top_k_vectors


Once done, implement test cases for each function to make sure it works as expected. Write each test by:
1. Calling the function under test.
2. Using assertions to check the conditions you propose.

For instance, the `test_cosine_similarity` function should look like this:


```python
def test_cosine_similarity(query, vector_db):
  similarity = cosine_similarity(query, vector_db) # call the target function
  assert condition_1, "message 1" # assert condition 1 is met
  assert condition_2, "message 2" # assert condition 2 is met

  print("test_cosine_similarity passed!")
```

**Use as many assertions as needed.**



> *N.B.*: *You will be evaluated on the correctness of the solution, the completeness of the test cases, and the efficiency of your code. Good coding!*




---

### Reminder

1. Cosine similarity formula for 2 vectors A and B:

$$
\frac{A \cdot B}{\|A\|_2 \cdot \|B\|_2}, \quad \text{where } \|\cdot\|_2 \text{ is the } \ell_2 \text{ norm}
$$

2. In a notebook, you can display the documentation of any function with the `?` operator. For instance, run `np.argsort?` to get the documentation of `np.argsort`, or `np.sort?` for `np.sort`.
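
To make the two reminders concrete, here is a small worked example: the formula applied by hand to two 2-dimensional vectors, followed by `np.argsort` ordering a handful of scores. The vectors and scores are made-up values chosen for easy arithmetic, not data from the exercise.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = np.sum(a * b)                # 3*4 + 4*3 = 24
norm_a = np.sqrt(np.sum(a ** 2))   # sqrt(9 + 16) = 5
norm_b = np.sqrt(np.sum(b ** 2))   # sqrt(16 + 9) = 5
cos_ab = dot / (norm_a * norm_b)   # 24 / 25 = 0.96

# np.argsort returns indices in ascending order of value;
# reversing the result gives the indices from highest to lowest score.
scores = np.array([0.2, 0.9, -0.5, 0.7])
order = np.argsort(scores)[::-1]

print(cos_ab, order)
```

Here `order` lists positions in `scores`, not the scores themselves, so the same indices can later select rows from a matrix.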

# Imports

In [None]:
import numpy as np

# Target Functions

In [None]:
def cosine_similarity(query, vector_db):
  """
  query: array of shape (d, 1), representing the embedding of a query with d dimensions.
  vector_db: array of shape (n, d), representing the embedding of n documents, each of d dimensions.
  return: array of shape (n, 1), representing similarity scores between the query and each entry in the vector_db.
  """
  raise NotImplementedError


def get_top_k_scores(similarity, k):
  """
  similarity: array of shape (n, 1) containing n similarity scores.
  k: (int) number of top scores to return.
  return: array of top k scores in descending order.
  """
  raise NotImplementedError


def get_top_k_vectors(similarity, k, vector_db):
  """
  similarity: array of shape (n, 1) containing n similarity scores.
  k: (int) number of top vectors.
  vector_db: array of shape (n, d), representing the embedding of n documents, each of d dimensions.
  return: array of the top k vectors in the vector_db, in descending order of their scores.
  """
  raise NotImplementedError
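
Since efficiency is part of the evaluation, one possible way to select the top k entries without sorting all n scores is sketched below. It relies on `np.argpartition` to isolate the k largest scores and then sorts only those k. The helper name `top_k_indices` is hypothetical and not part of the exercise; this is one approach among several, not the expected answer.

```python
import numpy as np

def top_k_indices(similarity, k):
    # Hypothetical helper: indices of the k largest scores, best first.
    flat = similarity.ravel()
    k = min(k, flat.size)
    if k <= 0:
        return np.empty(0, dtype=int)
    # argpartition moves the k largest scores into the last k slots
    # in linear time, but leaves them unordered.
    candidates = np.argpartition(flat, -k)[-k:]
    # Sort only those k candidates, in descending order of score.
    return candidates[np.argsort(flat[candidates])[::-1]]

similarity = np.array([[0.1], [0.8], [-0.3], [0.5]])
vector_db = np.arange(8.0).reshape(4, 2)

idx = top_k_indices(similarity, 2)
top_scores = similarity.ravel()[idx]   # scores in descending order
top_vectors = vector_db[idx]           # rows of vector_db in the same order
```

Computing the indices once and reusing them for both the scores and the vectors keeps the two results aligned.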

# Test Functions

In [None]:
def test_cosine_similarity(query, vector_db):
  similarity = cosine_similarity(query, vector_db)
  # Add assertions here

  print("test_cosine_similarity passed!")


def test_get_top_k_scores(similarity, k):
  top_scores = get_top_k_scores(similarity, k)
  # Add assertions here

  print("test_get_top_k_scores passed!")


def test_get_top_k_vectors(similarity, k, vector_db):
  top_vectors = get_top_k_vectors(similarity, k, vector_db)
  # Add assertions here

  print("test_get_top_k_vectors passed!")


# Tests Execution

Run the cells below to check whether all your tests pass.

In [None]:
dimension = 100
number_of_vectors = 600
top_k = 10
query = np.random.randn(dimension, 1)
vector_db = np.random.randn(number_of_vectors, dimension)

In [None]:
similarity = cosine_similarity(query, vector_db)

In [None]:
test_cosine_similarity(query, vector_db)

In [None]:
test_get_top_k_scores(similarity, top_k)

In [None]:
test_get_top_k_vectors(similarity, top_k, vector_db)