In [1]:
import $ivy.`org.apache.lucene:lucene-core:7.2.1`
import $ivy.`org.apache.lucene:lucene-queries:7.2.1`
import $ivy.`org.apache.lucene:lucene-queryparser:7.2.1`
import $ivy.`org.apache.lucene:lucene-analyzers-common:7.2.1`

In [2]:
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document._
import org.apache.lucene.index._
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.BytesRef

We'll use the following helper to convert some iterator-like Java objects to Scala collections.

In [3]:
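import scala.language.reflectiveCalls

// Adapts a Java-style iterator whose next() returns null when exhausted
// (such as Lucene's TermsEnum) to a lazy Scala Stream.
def toScalaStream[T](iter: {def next(): T}): Stream[T] = {
  val value = iter.next()
  if (value == null) Stream.empty[T]
  else value #:: toScalaStream(iter)
}

defined function toScalaStream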

----

# Information retrieval

----


The *index* is a central concept in information retrieval (IR). Sometimes called an inverted file, an index is a data structure optimized for search.

Let's think about files at a high level: once we define separators, a file can be understood as a mapping from integers (word positions) to words.

For example, two files with the contents "Alice has a cat" and "cat chases a mouse" form the mappings

| Position | File 1 | File 2 |
| -------- | :----: | :----: |
| 0 | Alice | cat |
| 1 | has | chases |
| 2 | a | a |
| 3 | cat | mouse |
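
As a rough sketch, the position-to-word mappings in the table can be built in plain Scala. The helper name `positions` and the choice of whitespace as the separator are assumptions made for illustration only.

```scala
// Hypothetical helper: split on whitespace and map each position to its word.
def positions(contents: String): Map[Int, String] =
  contents.split("\\s+").toSeq.zipWithIndex.map { case (word, i) => i -> word }.toMap

val file1 = positions("Alice has a cat")
val file2 = positions("cat chases a mouse")
```

Looking up `file1(0)` gives the first word of the first file, mirroring the first row of the table.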

At this level, an index is the inversion of this mapping (hence the name): given a word, you can find the numbers of the files that contain it.

The inverted mapping becomes

| Word | Files |
| ---- | :---: |
| Alice | 1 |
| has | 1 |
| a | 1, 2 |
| cat | 1, 2 |
| chases | 2 |
| mouse | 2 |

This matters for information retrieval: if our data structure is optimized for fetching file numbers given a word, we can also answer more complex queries (for example "Alice OR cat") by combining the retrieved collections, here by taking the union of the files for "Alice" and the files for "cat".
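
The idea can be sketched in a few lines of plain Scala. This is a toy model, not how Lucene implements its index; the names `files`, `invertedIndex` and `lookup` are made up for this example, and words are split on whitespace without any normalization.

```scala
// The two example files, keyed by file number.
val files: Map[Int, String] = Map(
  1 -> "Alice has a cat",
  2 -> "cat chases a mouse"
)

// Invert the mapping: for each word, collect the set of files that contain it.
val invertedIndex: Map[String, Set[Int]] =
  files.toSeq
    .flatMap { case (fileId, contents) =>
      contents.split("\\s+").toSeq.map(word => word -> fileId)
    }
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => word -> pairs.map(_._2).toSet }

// Files containing a word; unknown words match no files.
def lookup(word: String): Set[Int] =
  invertedIndex.getOrElse(word, Set.empty[Int])

// "Alice OR cat" is a union of the two sets; "Alice AND cat" is an intersection.
val aliceOrCat  = lookup("Alice") | lookup("cat")
val aliceAndCat = lookup("Alice") & lookup("cat")
```

With the two example files, the OR query matches both files, while the AND query matches only file 1.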

----


# Lucene

----

### Index setup

To write to an index, we need to define an analyzer and an `IndexWriter`.

We'll explore analyzers in the next notebook. For now, it suffices to say that an analyzer preprocesses strings by converting them to a canonical form (for example, splitting them into words and lowercasing them). This step is also called *normalization*.
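
To make the idea of normalization concrete, here is a toy version in plain Scala. It is only an illustration and not what Lucene's `StandardAnalyzer` does internally; the function name `normalize` and the splitting rule are assumptions.

```scala
// Toy normalization (NOT Lucene's analyzer): lowercase the text, then split on
// any run of characters that are not letters, digits or apostrophes.
def normalize(text: String): Seq[String] =
  text.toLowerCase.split("[^\\p{L}\\p{N}']+").toSeq.filter(_.nonEmpty)

val tokens = normalize("Humpty Dumpty sat on a wall,")
```

Here `tokens` holds the six lowercased words of the line, with the trailing comma gone.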

In [11]:
val analyzer = new StandardAnalyzer()

val indexWriter = new IndexWriter(
  new RAMDirectory(),
  new IndexWriterConfig(analyzer))

analyzer: StandardAnalyzer = org.apache.lucene.analysis.standard.StandardAnalyzer@537e4b20
indexWriter: IndexWriter = org.apache.lucene.index.IndexWriter@7fa75d73

## Documents and Field Types 

When we defined indices, we used files as an example.

But what about metadata, like a title or an author name?

Lucene doesn't actually store files; it stores `Document`s. A document is a key-value mapping: for example, the contents are stored under the key (field) `content`.
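
As a hedged sketch of a document that carries metadata next to its contents, the snippet below uses Lucene's ready-made `StringField` and `TextField` classes. The field names `author` and `title` and all values are placeholders, and Lucene 7.2.1 is assumed to be on the classpath, as set up in the first cell.

```scala
import org.apache.lucene.document.{Document, Field, StringField, TextField}

// Placeholder field names and values, chosen only for illustration.
val metadataDoc = new Document()
// StringField: indexed as a single, untokenized term (suits exact-match metadata).
metadataDoc.add(new StringField("author", "Unknown", Field.Store.YES))
// TextField: passed through the analyzer and tokenized (suits free text).
metadataDoc.add(new TextField("title", "Humpty Dumpty", Field.Store.YES))
metadataDoc.add(new TextField("content", "Humpty Dumpty sat on a wall,", Field.Store.YES))

// Field values can be read back by key.
val author = metadataDoc.get("author")
```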

A `FieldType` defines exactly what is stored for a field. For example, the following code defines the `FieldType` that we're going to use for our `content` field.

In [5]:
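val textFieldType = new FieldType()

// Index document IDs, term frequencies and term positions for this field.
textFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
// Run the value through the analyzer instead of indexing it as a single term.
textFieldType.setTokenized(true)
// Keep the original value so it can be retrieved with the document.
textFieldType.setStored(true)
// Store per-document term vectors; the term listing further down reads them back.
textFieldType.setStoreTermVectors(true)

textFieldType: FieldType = stored,indexed,tokenized,termVector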

In [8]:
val contents = Array(
  "Humpty Dumpty sat on a wall,",
  "Humpty Dumpty had a great fall.",
  "All the king's horses and all the king's men",
  "Couldn't put Humpty together again."
)

contents.foreach { content =>
  val doc = new Document()
  
  doc.add(
    new Field(
      "content",
      content,
      textFieldType)
  )
  indexWriter.addDocument(doc)
}

indexWriter.commit()

res7_1: Long = 7L

Now that we've indexed something, let's see what actually gets stored.

In [9]:
val indexReader = DirectoryReader.open(
  indexWriter.getDirectory())

indexReader: DirectoryReader = StandardDirectoryReader(segments_1:4 _0(7.2.1):c4)

In [10]:
(0 until indexReader.maxDoc) foreach { i =>
  // Term vector of the "content" field for document i
  val terms = indexReader.getTermVector(i, "content")
  val termsStream = toScalaStream(terms.iterator()).map(_.utf8ToString())
  println(s"Document $i")
  termsStream.foreach { term =>
    print(term + " ")
  }
  println()
}

Document 0
dumpty humpty sat wall 
Document 1
dumpty fall great had humpty 
Document 2
all horses king's men 
Document 3
again couldn't humpty put together 


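We see that not everything is stored: words such as "on", "a", "the" and "and" are missing, and the remaining terms are lowercased with punctuation stripped. Why? The missing words aren't arbitrary; we'll look into this in the following notebook.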
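
To see exactly which words went missing, one could compare the raw lines with the terms printed above. The sketch below does this in plain Scala. The tokenizer `words` is a toy assumption rather than Lucene's analyzer, and `lines` and `storedTerms` are simply copied from the cells and output above.

```scala
// Raw lines that were indexed, copied from the indexing cell.
val lines = Seq(
  "Humpty Dumpty sat on a wall,",
  "Humpty Dumpty had a great fall.",
  "All the king's horses and all the king's men",
  "Couldn't put Humpty together again."
)

// Terms printed for each document, copied from the output above.
val storedTerms = Seq(
  Set("dumpty", "humpty", "sat", "wall"),
  Set("dumpty", "fall", "great", "had", "humpty"),
  Set("all", "horses", "king's", "men"),
  Set("again", "couldn't", "humpty", "put", "together")
)

// Toy tokenizer (an assumption): lowercase, split on non-word characters.
def words(text: String): Set[String] =
  text.toLowerCase.split("[^\\p{L}\\p{N}']+").filter(_.nonEmpty).toSet

// Words present in each raw line but absent from the stored terms.
val dropped = lines.zip(storedTerms).map { case (line, terms) => words(line) -- terms }
```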