# Coding Interview

The goal of this exercise is to simulate a retrieval problem.
You have a vector `query` representing the embedding of a query, and a matrix `vector_db` whose rows are document embeddings.

We will use `cosine_similarity` as the metric to score the query against every entry of `vector_db`, then retrieve the top similarity scores and the top vectors (the documents most similar to the query).

You have 20 minutes to implement the following functions:

* cosine_similarity
* get_top_k_scores
* get_top_k_vectors


Once done, you need to implement test cases for each function to make sure they are working as expected. Write the test for each function by:
1. Calling the function to be tested.
2. Using assertions to check your proposed conditions.

For instance, `test_cosine_similarity` function should look like this:


```
def test_cosine_similarity(query, vector_db):
  similarity = cosine_similarity(query, vector_db) # call the target function
  assert condition_1, "message 1" # assert condition 1 is met
  assert condition_2, "message 2" # assert condition 2 is met
```

Use as many assertions as needed.



> **N.B.**: You will be evaluated on the correctness of the solution, the completeness of the test cases, and the efficiency of your code. Good coding!




---

### Reminder

1. Cosine similarity formula for 2 vectors A and B:

$$
\frac{A \cdot B}{\|A\|_2 \cdot \|B\|_2}, \quad \text{where } \|\cdot\|_2 \text{ is the } \ell_2 \text{ norm}
$$

2. You only need `numpy`. You can get the documentation of any function using the `?` operator. For instance, to get the documentation of `np.argsort`, you can execute the command `np.argsort?` and you will get the documentation.
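
As a quick illustration of the formula in item 1, here is a minimal sketch for two small vectors. The values of `a` and `b` are made up for the example and are not part of the exercise; `np.linalg.norm` computes the l2 norm by default for 1-D arrays.

```python
import numpy as np

# Two small example vectors (illustrative values, not from the exercise)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Dot product divided by the product of the l2 norms
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos_ab)  # approximately 0.7071, i.e. cos(45 degrees)
```

The two vectors are 45 degrees apart, so the result is 1/sqrt(2), regardless of how long either vector is.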

# Imports

In [1]:
import numpy as np

# Target Functions

In [230]:
def cosine_similarity(query, vector_db):
  """
  query: array of shape (d, 1), representing the embedding of a query with d dimensions
  vector_db: array of shape (n, d), representing the embeddings of n documents, each with d dimensions
  return: array of shape (n, 1), representing the similarity score between the query and each entry in the vector_db.
  """
  query = query / np.sqrt((query**2).sum())
  vector_db = vector_db / np.sqrt((vector_db**2).sum(axis=1, keepdims=True))
  return vector_db@query


def get_top_k_scores(similarity, k):
  """
  similarity: array of shape (n, 1) containing n similarity scores.
  k: (int) number of top scores to return
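  return: array of shape (k, 1) containing the top k scores, sorted in descending order
  """
  # Sort along axis 0: the default axis=-1 would leave an (n, 1) column unsorted
  return np.sort(similarity, axis=0)[:-k-1:-1]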


def get_top_k_vectors(similarity, k, vector_db):
  """
  similarity: array of shape (n, 1) containing n similarity scores.
  k: (int) number of top vectors.
  vector_db: array of shape (n, d), representing the embeddings of n documents, each with d dimensions
  return: array of shape (k, d) containing the top k vectors in the vector_db, most similar first
  """
  indices = np.argsort(similarity, axis=0)[:-k-1:-1]
  return vector_db[indices.flatten()]
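
Since efficiency is part of the evaluation, one possible alternative to a full sort is `np.argpartition`, which selects the k largest entries in roughly linear time so that only those k need to be sorted. The sketch below is an assumption about how this could be done, not part of the reference solution; `top_k_indices_partition` is a hypothetical helper name, and it assumes `1 <= k <= n`.

```python
import numpy as np

def top_k_indices_partition(similarity, k):
  """
  Hypothetical helper (not part of the exercise): indices of the k highest
  scores, ordered from highest to lowest.
  similarity: array of shape (n, 1) or (n,); assumes 1 <= k <= n.
  """
  scores = np.asarray(similarity).ravel()
  n = scores.shape[0]
  # Partition so the k largest scores occupy the last k positions (in arbitrary order)
  candidates = np.argpartition(scores, n - k)[n - k:]
  # Sort only those k candidates, highest score first
  return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([[0.1], [0.9], [-0.3], [0.5], [0.7]])
print(top_k_indices_partition(scores, 3))  # [1 4 3]
```

The returned indices can be used for both outputs: `similarity[idx]` gives the top scores and `vector_db[idx]` gives the top vectors. For small n the difference from a full `np.argsort` is negligible; the partition approach mainly pays off when n is much larger than k.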

# Test Functions

In [231]:
def test_cosine_similarity(query, vector_db):
  similarity = cosine_similarity(query, vector_db)
  assert similarity.min() >= -1, "cosine similarity cannot be less than -1"
  assert similarity.max() <= 1, "cosine similarity cannot be greater than 1"
  assert len(similarity) == len(vector_db), "We should have a score for each vector"

  print("test_cosine_similarity passed!")


def test_get_top_k_scores(similarity, k):
  top_scores = get_top_k_scores(similarity, k)
  assert len(top_scores) == k, f"We should have {k} scores"
  assert np.all(np.diff(top_scores, axis=0) <= 0), "scores should be sorted in descending order"

  print("test_get_top_k_scores passed!")


def test_get_top_k_vectors(similarity, k, vector_db):
  top_vectors = get_top_k_vectors(similarity, k, vector_db)
  assert len(top_vectors) == k, f"We should have {k} vectors"
  assert top_vectors.shape[1] == vector_db.shape[1], f"Vectors should have {vector_db.shape[1]} dimensions"

  unique_vector_db = np.unique(vector_db, axis=0)
  concat_db = np.concatenate([unique_vector_db, top_vectors], axis=0)
  assert np.unique(concat_db, axis=0).shape == unique_vector_db.shape, "Top vectors should be part of all vectors"

  print("test_get_top_k_vectors passed!")
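
The tests above check ranges and shapes, which random inputs can satisfy even when the values are wrong. A complementary approach, sketched here under the assumption that hand-picked inputs are acceptable for the exercise, is to compare against scores that can be worked out by hand. `test_cosine_similarity_known_values` is a hypothetical extra test; `cosine_similarity` is repeated with the same contract as the target function only so that the block runs on its own.

```python
import numpy as np

def cosine_similarity(query, vector_db):
  # Same contract as the target function, repeated so this block is self-contained
  query = query / np.linalg.norm(query)
  vector_db = vector_db / np.linalg.norm(vector_db, axis=1, keepdims=True)
  return vector_db @ query

def test_cosine_similarity_known_values():
  # Hypothetical extra test: hand-picked vectors with known cosine values
  query = np.array([[1.0], [0.0]])
  vector_db = np.array([
    [2.0, 0.0],   # same direction as the query -> 1
    [0.0, 3.0],   # orthogonal to the query     -> 0
    [-1.0, 0.0],  # opposite direction          -> -1
    [1.0, 1.0],   # 45 degrees from the query   -> 1/sqrt(2)
  ])
  expected = np.array([[1.0], [0.0], [-1.0], [1.0 / np.sqrt(2.0)]])

  similarity = cosine_similarity(query, vector_db)
  assert similarity.shape == (4, 1), "We should have one score per vector, as a column"
  assert np.allclose(similarity, expected), "Scores should match the hand-computed values"

  # Rescaling the inputs must not change the scores
  rescaled = cosine_similarity(5 * query, 0.1 * vector_db)
  assert np.allclose(rescaled, expected), "Cosine similarity should be scale invariant"

  print("test_cosine_similarity_known_values passed!")

test_cosine_similarity_known_values()
```

`np.allclose` is used instead of `==` because normalization introduces small floating-point errors.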


# Tests Execution

Here you can check if all your tests passed.

In [232]:
dimension = 100
number_of_vectors = 600
top_k = 10
query = np.random.randn(dimension, 1)
vector_db = np.random.randn(number_of_vectors, dimension)

In [233]:
similarity = cosine_similarity(query, vector_db)

In [234]:
test_cosine_similarity(query, vector_db)

test_cosine_similarity passed!


In [237]:
test_get_top_k_scores(similarity, top_k)

test_get_top_k_scores passed!


In [238]:
test_get_top_k_vectors(similarity, top_k, vector_db)

test_get_top_k_vectors passed!
