## Boolean search in Python on toy data

This code is adapted from, and inspired by, the notebooks by Filip Ginter for the course *Information Retrieval*, given in spring 2017 at the University of Turku.

Let's first create some toy data, that is, four sentences that we consider to be our "documents":

In [3]:
documents = ["This is a silly example",
             "A better example",
             "Nothing to see here",
             "This is a great and long example"]

### Term-document matrix

We need to import some functionality from scikit-learn (imported in code as `sklearn`), which is a free, open-source machine learning library for Python.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

We use the CountVectorizer class to create a *term-document* matrix of our data:

In [13]:
cv = CountVectorizer(lowercase=True, binary=True)
sparse_matrix = cv.fit_transform(documents)

print("Term-document matrix: (?)\n")
print(sparse_matrix)

Term-document matrix: (?)

  (0, 2)	1
  (0, 9)	1
  (0, 5)	1
  (0, 10)	1
  (1, 1)	1
  (1, 2)	1
  (2, 4)	1
  (2, 8)	1
  (2, 11)	1
  (2, 7)	1
  (3, 6)	1
  (3, 0)	1
  (3, 3)	1
  (3, 2)	1
  (3, 5)	1
  (3, 10)	1


Oops, this does not look like a matrix. That is because the matrix is stored in a _sparse_ format to save memory. How do we read this? Each line gives a (row, column) coordinate followed by the value stored there. For instance, the first two lines tell us that at coordinate (0, 2) of the matrix there is a 1, and at coordinate (0, 9) there is also a 1.

All positions in the matrix not explicitly mentioned contain a zero, so we save memory by not storing all zeros. The matrix is assumed to be sparse, that is, most of the elements are zeros.
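To make the sparse format concrete, here is a minimal sketch that rebuilds the full matrix by hand from the coordinates printed above. The names `coords`, `n_rows`, `n_cols` and `dense` are illustrative only and are not part of scikit-learn. The shape 4 x 12 is an assumption based on the four documents and the highest column index (11) in the printout.

```python
# (row, column) coordinates of the non-zero entries, copied from the sparse printout
coords = [(0, 2), (0, 9), (0, 5), (0, 10),
          (1, 1), (1, 2),
          (2, 4), (2, 8), (2, 11), (2, 7),
          (3, 6), (3, 0), (3, 3), (3, 2), (3, 5), (3, 10)]

n_rows, n_cols = 4, 12  # assumed shape: 4 documents, column indices 0..11

# Start from all zeros and fill in only the positions that were stored
dense = [[0] * n_cols for _ in range(n_rows)]
for row, col in coords:
    dense[row][col] = 1

for line in dense:
    print(line)

# Only the non-zero entries need to be stored in the sparse format
print("Stored entries:", len(coords), "out of", n_rows * n_cols)
```

On data this small the saving is modest, but with thousands of documents and a large vocabulary almost every position is zero, so storing only the non-zero coordinates pays off.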

Anyway, let's print a _dense_ version of this matrix:

In [14]:
dense_matrix = sparse_matrix.todense()

print("Term-document matrix: (?)\n")
print(dense_matrix)

Term-document matrix: (?)

[[0 0 1 0 0 1 0 0 0 1 1 0]
 [0 1 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 1 1 0 0 1]
 [1 0 1 1 0 1 1 0 0 0 1 0]]


This looks better, but there are only four rows. Since we have four documents, the rows must be the documents and the columns the terms (= words). However, we want to have a *term-document* matrix, not a *document-term* matrix.

Let's *transpose* the matrix, so that the rows and columns change places:

In [15]:
td_matrix = dense_matrix.T   # .T transposes the matrix

print("Term-document matrix:\n")
print(td_matrix)

Term-document matrix:

[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 1 0]
 [1 0 0 0]
 [1 0 0 1]
 [0 0 1 0]]


From this matrix we can read, for instance, that the term represented by the first row `[0 0 0 1]` occurs only in the fourth document (_"This is a great and long example"_). It further tells us, for example, that the term on the third row `[1 1 0 1]` occurs in all but the third document.
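Reading a row this way can also be done programmatically. The following is a small sketch that hard-codes the matrix printed above as plain Python lists; `td_rows` and `docs_for_row` are hypothetical names introduced only for this illustration.

```python
documents = ["This is a silly example",
             "A better example",
             "Nothing to see here",
             "This is a great and long example"]

# Term-document matrix copied from the output above:
# one row per term, one column per document
td_rows = [[0, 0, 0, 1],
           [0, 1, 0, 0],
           [1, 1, 0, 1],
           [0, 0, 0, 1],
           [0, 0, 1, 0],
           [1, 0, 0, 1],
           [0, 0, 0, 1],
           [0, 0, 1, 0],
           [0, 0, 1, 0],
           [1, 0, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 1, 0]]

def docs_for_row(row_idx):
    """Return the documents whose column holds a 1 in the given row."""
    return [documents[i] for i, hit in enumerate(td_rows[row_idx]) if hit]

print("Row 0 occurs in:", docs_for_row(0))
print("Row 2 occurs in:", docs_for_row(2))
```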

So, how can we know which term the different rows represent?

Here is the ordered list of terms:

In [16]:
print("\nIDX -> terms mapping:\n")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())


IDX -> terms mapping:

['and', 'better', 'example', 'great', 'here', 'is', 'long', 'nothing', 'see', 'silly', 'this', 'to']


So, the first row represents the word "and" and the third row the word "example".

Let's double-check that:

In [22]:
terms = cv.get_feature_names_out()  # replaces get_feature_names(), removed in scikit-learn 1.2

print("First term (with row index 0):", terms[0])
print("Third term (with row index 2):", terms[2])

First term (with row index 0): and
Third term (with row index 2): example


It is also possible to map the other way around, from term to index:

In [19]:
print("\nterm -> IDX mapping:\n")
print(cv.vocabulary_) # note the _ at the end


term -> IDX mapping:

{'this': 10, 'is': 5, 'silly': 9, 'example': 2, 'better': 1, 'nothing': 7, 'to': 11, 'see': 8, 'here': 4, 'great': 3, 'and': 0, 'long': 6}


`vocabulary_` (with a trailing underscore) is a Python dictionary:

In [23]:
print("Row index of 'example':", cv.vocabulary_["example"])
print("Row index of 'silly':", cv.vocabulary_["silly"])

Row index of 'example': 2
Row index of 'silly': 9
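With the term-to-index dictionary, a row can be looked up by word, which is the basic ingredient of a Boolean query. The following is a minimal sketch of one way to evaluate `example AND NOT silly`. The helper `term_row` is hypothetical and not part of scikit-learn, and returning all zeros for unknown words is an assumption made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is a silly example",
             "A better example",
             "Nothing to see here",
             "This is a great and long example"]

cv = CountVectorizer(lowercase=True, binary=True)
td_matrix = cv.fit_transform(documents).toarray().T  # rows = terms

def term_row(term):
    """Return the 0/1 row for `term` (all zeros if the term is unknown)."""
    idx = cv.vocabulary_.get(term)
    if idx is None:
        return np.zeros(len(documents), dtype=td_matrix.dtype)
    return td_matrix[idx]

# "example AND NOT silly":
#   NOT is 1 - row, AND is the element-wise bitwise & on the 0/1 rows
hits = term_row("example") & (1 - term_row("silly"))

# Indices of the non-zero positions are the matching documents
matching_docs = [documents[i] for i in np.flatnonzero(hits)]
print(matching_docs)
```

OR can be expressed the same way with the `|` operator on two rows.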
