# Relevance-ranked search

Let's return to indexing toy data, as we did in the tutorial on Boolean search. This tutorial, too, was inspired by course material by Filip Ginter in Turku.

Our documents now look slightly different:

In [3]:
documents = ["This is a silly silly silly example",
             "A better example",
             "Nothing to see here nor here nor here",
             "This is a great example and a long example too"]

We can index them as we did before:

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

cv = CountVectorizer(lowercase=True, binary=True)
binary_dense_matrix = cv.fit_transform(documents).T.todense()

print("Term-document matrix:\n")
print(binary_dense_matrix)

Term-document matrix:

[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [1 0 0 0]
 [1 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


Next, we'll remove the `binary=True` optional argument from the `CountVectorizer` constructor. The default value is `binary=False`. What change can we observe?

In [8]:
cv = CountVectorizer(lowercase=True)
dense_matrix = cv.fit_transform(documents).T.todense()

print("Term-document matrix:\n")
print(dense_matrix)

Term-document matrix:

[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 2]
 [0 0 0 1]
 [0 0 3 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 2 0]
 [0 0 1 0]
 [0 0 1 0]
 [3 0 0 0]
 [1 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


If we run a query on the term "example", we get:

In [9]:
t2i = cv.vocabulary_  # shorter notation: t2i = term-to-index
print("Query: example")
print(dense_matrix[t2i["example"]])

Query: example
[[1 1 0 2]]


Instead of seeing *whether* a term occurs in a document, we now see *how many times* the term occurs in each document:

In [53]:
hits_list = np.array(dense_matrix[t2i["example"]])[0]

for i, nhits in enumerate(hits_list):
    print("Example occurs", nhits, "time(s) in document:", documents[i])

Example occurs 1 time(s) in document: This is a silly silly silly example
Example occurs 1 time(s) in document: A better example
Example occurs 0 time(s) in document: Nothing to see here nor here nor here
Example occurs 2 time(s) in document: This is a great example and a long example too


As the number and size of the documents grow, a reasonable heuristic is that the more often a search term occurs in a document, the more relevant that document is. So, if we search for "example" in our toy document collection, the fourth document is the most relevant (2 hits), the first and second documents come next (1 hit each), and the third document is irrelevant (0 hits).

If we have multiple search terms, we might think that the more times the search terms occur in total in the document, the more relevant the document is.

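Note that the bitwise operators `&` (AND) and `|` (OR) no longer behave as logical operators when our matrix contains word counts, because they act on the binary representation of each count: for instance, `2 & 1` evaluates to `0` even though both counts are nonzero. The same applies to NOT computed as `1 - x`, which turns a count of 2 into -1.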
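To make the problem concrete, here is a minimal sketch that uses two rows copied from the count matrix above, those of "example" and "great". The variable names are illustrative only. It also shows one possible workaround, which is to binarize the counts with `> 0` before applying the Boolean operators.

```python
import numpy as np

# Rows copied from the count matrix printed above
example_counts = np.array([1, 1, 0, 2])
great_counts = np.array([0, 0, 0, 1])

# Bitwise AND on raw counts: the fourth document contains both terms,
# but 2 & 1 == 0 (binary 10 & 01), so it is wrongly reported as a miss
and_on_counts = example_counts & great_counts
print("AND on counts:", and_on_counts.tolist())      # [0, 0, 0, 0]

# NOT as 1 - x produces a negative value for a count of 2
not_on_counts = 1 - example_counts
print("NOT on counts:", not_on_counts.tolist())      # [0, 0, 1, -1]

# Possible workaround: reduce the counts to 0/1 first
example_bin = (example_counts > 0).astype(int)
great_bin = (great_counts > 0).astype(int)
and_on_binary = example_bin & great_bin
not_on_binary = 1 - example_bin
print("AND on binarized:", and_on_binary.tolist())   # [0, 0, 0, 1]
print("NOT on binarized:", not_on_binary.tolist())   # [0, 0, 1, 0]
```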

Let's search for the most relevant document for the query *better example*:

In [50]:
print("Query: better example")
print("Hits of better:        ", dense_matrix[t2i["better"]])
print("Hits of example:       ", dense_matrix[t2i["example"]])
print("Hits of better example:", dense_matrix[t2i["better"]] + dense_matrix[t2i["example"]])

Query: better example
Hits of better:         [[0 1 0 0]]
Hits of example:        [[1 1 0 2]]
Hits of better example: [[1 2 0 2]]


We just added the hits together. This means that we did not search for the phrase "better example", nor did we search for "better" AND "example". What we did search for was some kind of "better" OR "example", in which the sum of the number of occurrences of "better" and "example" in a document determines the relevance of the document.

This means that the second document, which contains one occurrence each of "better" and "example", is as good a hit as the fourth document, which contains two occurrences of "example" and no occurrence of "better".
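The same summing idea can be wrapped in a small function that accepts any number of query terms. The sketch below is only one way to do it: `score_query`, `demo_t2i` and `demo_matrix` are hypothetical names, and the matrix holds just three rows copied from the output above. It assumes a plain NumPy array; the `dense_matrix` used in this tutorial would first need to be converted, for example with `np.asarray(dense_matrix)`.

```python
import numpy as np

# Three rows copied from the count matrix above (a reduced, illustrative
# vocabulary; in the tutorial the mapping comes from cv.vocabulary_)
demo_t2i = {"better": 0, "example": 1, "silly": 2}
demo_matrix = np.array([[0, 1, 0, 0],    # better
                        [1, 1, 0, 2],    # example
                        [3, 0, 0, 0]])   # silly

def score_query(query, matrix, t2i):
    """Sum the count rows of all query terms found in the vocabulary."""
    scores = np.zeros(matrix.shape[1], dtype=int)
    for term in query.lower().split():
        if term in t2i:              # unknown terms contribute nothing
            scores += matrix[t2i[term]]
    return scores

better_example_scores = score_query("better example", demo_matrix, demo_t2i)
print(better_example_scores.tolist())    # [1, 2, 0, 2]
```

Skipping unknown terms is a design choice made for this sketch; a lookup such as `t2i["unknown"]` would otherwise raise a `KeyError`.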

Let's do another query:

In [73]:
print("Query: silly example")
print("Hits of silly:        ", dense_matrix[t2i["silly"]])
print("Hits of example:      ", dense_matrix[t2i["example"]])
print("Hits of silly example:", dense_matrix[t2i["silly"]] + dense_matrix[t2i["example"]])

Query: silly example
Hits of silly:         [[3 0 0 0]]
Hits of example:       [[1 1 0 2]]
Hits of silly example: [[4 1 0 2]]


Let's also rank (sort) the results by relevance, leaving out the document without a single hit:

In [74]:
hits_list = np.array(dense_matrix[t2i["silly"]] + dense_matrix[t2i["example"]])[0]
print("Hits:", hits_list)

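# Pair each hit count with its document index, dropping documents with no hits
nhits_and_doc_ids = [(nhits, i) for i, nhits in enumerate(hits_list) if nhits > 0]
print("List of tuples (nhits, doc_idx) where nhits > 0:", nhits_and_doc_ids)

# Tuples sort by hit count first, so reverse=True puts the best match on top
ranked_nhits_and_doc_ids = sorted(nhits_and_doc_ids, reverse=True)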
print("Ranked (nhits, doc_idx) tuples:", ranked_nhits_and_doc_ids)

print("\nMatched the following documents, ranked highest relevance first:")
for nhits, i in ranked_nhits_and_doc_ids:
    print("Score of 'silly example' is", nhits, "in document:", documents[i])

Hits: [4 1 0 2]
List of tuples (nhits, doc_idx) where nhits > 0: [(4, 0), (1, 1), (2, 3)]
Ranked (nhits, doc_idx) tuples: [(4, 0), (2, 3), (1, 1)]

Matched the following documents, ranked highest relevance first:
Score of 'silly example' is 4 in document: This is a silly silly silly example
Score of 'silly example' is 2 in document: This is a great example and a long example too
Score of 'silly example' is 1 in document: A better example
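One detail of this ranking is worth noting. Because `sorted(..., reverse=True)` compares whole tuples, documents with equal scores end up ordered by *descending* document index. The sketch below uses the scores of the earlier "better example" query, where two documents tie, and shows an alternative sort key that keeps tied documents in their original order. The variable names are illustrative, and which tie-breaking rule is preferable depends on the application.

```python
# Scores of the query "better example", copied from the output above
hits = [1, 2, 0, 2]

# Same (nhits, doc_idx) pairs as before, without zero-hit documents
pairs = [(nhits, i) for i, nhits in enumerate(hits) if nhits > 0]

# reverse=True also reverses the order of tied documents
ranked_reverse = sorted(pairs, reverse=True)
print(ranked_reverse)    # [(2, 3), (2, 1), (1, 0)]

# Sort by descending score, then by ascending document index
ranked_stable = sorted(pairs, key=lambda pair: (-pair[0], pair[1]))
print(ranked_stable)     # [(2, 1), (2, 3), (1, 0)]
```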
