In [1]:
import lucene

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import *
from org.apache.lucene.index import *
from org.apache.lucene.store import RAMDirectory
from org.apache.lucene.util import BytesRefIterator

In [2]:
lucene.initVM()

<jcc.JCCEnv at 0x7f8ef938b168>

----

# Information retrieval

----


The *index* is a central concept in IR. Sometimes called an inverted file, an index is a data structure optimized for search.

Let's think about files at a high level: once we define separators, a file can be understood as a mapping from integers (word positions) to words.

For example, files with the contents "Alice has a cat" and "cat chases a mouse"

form the mappings

| Position | File1 | File2 |
| ----- |:----:| :---: |
|  0 | Alice| cat |
|  1 | has| chases|
|  2 | a| a|
|  3 | cat| mouse |

At this level, an index is the inversion of this mapping (hence the name): given a word, you can find the numbers of the files that contain it.

The inverted mapping becomes

| Word | Files |
| ----  |:---: |
| Alice | 1 |
| has | 1|
| a | 1,2|
| cat| 1, 2 |
| chases| 2 |
| mouse | 2|
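
The inversion can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Lucene implements it; the names `files` and `build_inverted_index` are made up for this example, and the file contents are the words from the tables above.

```python
from collections import defaultdict

# The two example files, numbered from 1 as in the tables above.
files = {
    1: "Alice has a cat",
    2: "cat chases a mouse",
}

def build_inverted_index(files):
    """Map each word to the sorted list of file numbers that contain it."""
    postings = defaultdict(set)
    for file_id, text in files.items():
        for word in text.split():  # whitespace is our separator
            postings[word].add(file_id)
    return {word: sorted(ids) for word, ids in postings.items()}

inverted_index = build_inverted_index(files)
print(inverted_index["cat"])    # [1, 2]
print(inverted_index["Alice"])  # [1]
```

Using a set per word means a word that occurs several times in one file is still listed once for that file.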

This matters for information retrieval: if we optimize our data structure for fetching file numbers given a word, we also get a way of retrieving files for more complex queries (for example "Alice OR cat") by applying set operations to the retrieved collections.
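
As a rough sketch of that idea, Boolean queries can be answered with Python set operations. The `postings` dictionary is copied by hand from the word-to-files table, and `lookup` is a hypothetical helper for this example, not a Lucene call.

```python
# Word -> set of file numbers, copied from the table above.
postings = {
    "Alice": {1},
    "has": {1},
    "a": {1, 2},
    "cat": {1, 2},
    "chases": {2},
    "mouse": {2},
}

def lookup(word):
    """Return the files containing the word, or an empty set if it is unknown."""
    return postings.get(word, set())

alice_or_cat = lookup("Alice") | lookup("cat")    # union
alice_and_cat = lookup("Alice") & lookup("cat")   # intersection
cat_not_alice = lookup("cat") - lookup("Alice")   # difference

print(sorted(alice_or_cat))   # [1, 2]
print(sorted(alice_and_cat))  # [1]
print(sorted(cat_not_alice))  # [2]
```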

----


# Lucene

----

### Index setup

To write to an index, we need to define an analyzer and an `IndexWriter`.

We'll explore the analyzer in the next notebook. For now it suffices to say that an analyzer preprocesses strings by converting them to a canonical form (for example, splitting them into words and lowercasing them). This step is also called *normalization*.
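
To make normalization concrete, here is a toy version in plain Python. It is an assumption-laden stand-in and does not reproduce what `StandardAnalyzer` does; `toy_analyze` and its regular expression are invented for this sketch.

```python
import re

def toy_analyze(text):
    """Lowercase the text and split it into word tokens.

    A toy stand-in for an analyzer: letters, digits and apostrophes
    form tokens, everything else acts as a separator.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

print(toy_analyze("Humpty Dumpty sat on a wall,"))
# ['humpty', 'dumpty', 'sat', 'on', 'a', 'wall']
```

With such a step in place, "Cat" and "cat" end up as the same index entry, so a search for either finds both.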

In [3]:
analyzer = StandardAnalyzer()
index_writer = IndexWriter(
    RAMDirectory(),
    IndexWriterConfig(analyzer)
)

### Documents and Field Types

When we defined indices, we used files as the example.

But what about metadata, such as a title or an author name?

Lucene doesn't actually store files; it stores `Document`s. A document is a key-value mapping: for example, the contents are stored under the key (field) `content`.
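
One way to picture this in plain Python is a dictionary per document, with the inverted index keyed by the pair (field, word) so that a hit in the title is kept apart from a hit in the content. The documents, the `title` field values and the name `field_index` below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

# Hypothetical documents: each one maps field names to values.
docs = [
    {"title": "Humpty Dumpty", "content": "Humpty Dumpty sat on a wall"},
    {"title": "Wall repair", "content": "How to fix a wall"},
]

# (field, word) -> set of document numbers
field_index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for field, value in doc.items():
        for word in value.lower().split():
            field_index[(field, word)].add(doc_id)

print(sorted(field_index[("content", "wall")]))  # [0, 1]
print(sorted(field_index[("title", "wall")]))    # [1]
```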

A `FieldType` defines exactly what is stored for a field. In the following code we define the `FieldType` that we're going to use for our `content` field.

In [4]:
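text_field_type = FieldType()

# Index document ids, term frequencies and term positions for this field.
text_field_type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
# Run the value through the analyzer instead of indexing it as a single token.
text_field_type.setTokenized(True)
# Keep the original string in the index as well.
text_field_type.setStored(True)
# Keep a per-document list of the field's terms (term vector).
text_field_type.setStoreTermVectors(True)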

In [5]:
contents = [
  "Humpty Dumpty sat on a wall,",
  "Humpty Dumpty had a great fall.",
  "All the king's horses and all the king's men",
  "Couldn't put Humpty together again."
]

for content in contents:
    doc = Document()
    
    doc.add(Field('content', content, text_field_type))
    
    index_writer.addDocument(doc)

In [6]:
index_writer.commit()

7

Now that we have indexed something, let's see what actually gets stored.

In [7]:
indexReader = DirectoryReader.open(
  index_writer.getDirectory()
)

In [16]:
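for i in range(indexReader.maxDoc()):
    # Term vector of the "content" field for document i
    terms = indexReader.getTermVector(i, "content")

    # cast_ exposes the Java terms enum as a BytesRefIterator that Python can loop over
    term_list = [term.utf8ToString() for term in BytesRefIterator.cast_(terms.iterator())]
    print(term_list)

    print()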

['dumpty', 'humpty', 'sat', 'wall']

['dumpty', 'fall', 'great', 'had', 'humpty']

['all', 'horses', "king's", 'men']

['again', "couldn't", 'humpty', 'put', 'together']



We see that not every word was stored: for example, "on" and "a" are missing from the first line, and "the" and "and" from the third. The missing words were not dropped at random. We'll look into why in the following notebook.