Welcome to this notebook explaining the functionality of the [SearchArray](https://github.com/softwaredoug/searcharray) library from [Doug Turnbull](https://softwaredoug.com). The goal of *SearchArray* is to bring lexical search to the Pythonic way of working. The library extends pandas with an inverted index, similar to the one Lucene uses. There is no need to install tools like Elasticsearch, OpenSearch, or Solr: pandas and the usual Python data tooling are enough.

For the demo, I'll use the content from the [Luminis blog](https://www.luminis.eu/blog/), available as a JSON Lines file.

## Load the data into a dataframe
The data contains 100 blog posts taken from the Luminis blog, stored as a JSON Lines (jsonl) file with one complete JSON document per line. You can choose between two files: `all_documents`, which contains all 100 documents, and `few_documents`, a subset of 6 documents.
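
To make the file format concrete, here is a minimal sketch of what a JSON Lines file looks like and how it can be parsed one line at a time. The two records and their values are made-up placeholders, not rows from the Luminis data. `pd.read_json(..., lines=True)` does the equivalent per-line parsing and returns a DataFrame.

```python
import io
import json

# Two hypothetical records, one complete JSON document per line.
sample_jsonl = io.StringIO(
    '{"title": "Debug Quarkus in the cloud", "tags": ["java", "quarkus"]}\n'
    '{"title": "Getting started with OpenAI", "tags": ["ai"]}\n'
)

# Parse every non-empty line as its own JSON document.
records = [json.loads(line) for line in sample_jsonl if line.strip()]

print(len(records))
print(records[0]["title"])
```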

In [None]:
import pandas as pd

# Specify the path to your JSON file
file_path = 'data/all_documents.jsonl'

# Read the JSON file into a pandas DataFrame
blogs = pd.read_json(file_path, orient='records', lines=True)
blogs.head()

## Import SearchArray

In [None]:
from searcharray import SearchArray

## The Tokenizer
The tokenizer is a function that takes a string and returns a list of tokens. The tokens are the terms used to build the index. The tokenizer defined below lowercases the text, splits hyphenated words, splits on whitespace, and removes punctuation. Without a custom tokenizer, the default is a simple whitespace tokenizer. Test the tokenizer using a few sample texts.

```python
print(ws_punc_tokenizer('Hello, World!'))
```

In [None]:
import re
import string


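# Translation table that strips all punctuation except hyphens (built once, reused per token).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation.replace('-', ''))


def ws_punc_tokenizer(text):
    """
    Tokenizes text by lowercasing, splitting on whitespace and removing punctuation.
    A hyphen between two word characters is replaced by a space first.
    :param text: String to tokenize.
    :return: List of tokens.
    """
    text = re.sub(r'(\w)-(\w)', r'\1 \2', text.lower())
    return [token.translate(_PUNCT_TABLE) for token in text.split()]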


In [None]:
print(ws_punc_tokenizer('Hello, World!'))
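
A few more sample texts show how the tokenizer handles hyphens and apostrophes. The block restates the tokenizer from above so that it runs on its own; the sample strings are made up for illustration.

```python
import re
import string


def ws_punc_tokenizer(text):
    """Same logic as the tokenizer above, repeated so this cell is self-contained."""
    text = re.sub(r'(\w)-(\w)', r'\1 \2', text.lower())
    table = str.maketrans('', '', string.punctuation.replace('-', ''))
    return [token.translate(table) for token in text.split()]


samples = ['Hello, World!', 'State-of-the-art search', "What's new?"]
for sample in samples:
    print(sample, '->', ws_punc_tokenizer(sample))
```

Hyphenated words end up as separate tokens, while an apostrophe is simply dropped, so `What's` becomes `whats`.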

## Write the index with the tokens
Next, we use the tokenizer to add a special column to the pandas DataFrame. `SearchArray.index` tokenizes every title and builds the index in one step; the resulting column holds the indexed titles.

In [None]:
title_index = SearchArray.index(blogs['title'], tokenizer=ws_punc_tokenizer)
blogs['title_index'] = title_index

In [None]:
print(f"""The type of the response: {type(title_index)}

The result is a column or a list type, shape is the row count: {title_index.shape}

The dictionary translates the available terms into numbers.
{title_index.term_dict}

Sentences translated into arrays of numbers using the dictionary.
{title_index.term_mat}""")


Calling the _index_ method of the SearchArray class creates everything you need to execute lexical search on your content. Comparing a query to a lot of documents works best with numbers: they require less memory and are faster to compare. The _index_ method therefore creates a dictionary that translates the terms into numbers. The _term_mat_ is a matrix with the documents as rows and the terms as columns. The values are the term numbers taken from the dictionary.
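
The idea of translating terms into numbers can be sketched in a few lines of plain Python. This is a toy illustration with made-up documents, not SearchArray's internal implementation; the names `term_to_id`, `id_to_term`, and `encoded_docs` are hypothetical.

```python
# Already tokenized toy documents (hypothetical titles).
docs = [["debug", "quarkus", "apps"], ["quarkus", "in", "production"]]

term_to_id = {}    # token -> number
id_to_term = []    # number -> token (the list index is the term id)
encoded_docs = []  # every document as a list of term ids

for tokens in docs:
    ids = []
    for token in tokens:
        if token not in term_to_id:
            # Assign the next free number to a term we have not seen before.
            term_to_id[token] = len(id_to_term)
            id_to_term.append(token)
        ids.append(term_to_id[token])
    encoded_docs.append(ids)

print(term_to_id)
print(encoded_docs)  # [[0, 1, 2], [1, 3, 4]]
```

Both documents share the id `1` for `quarkus`, which is what makes comparing a query term against many documents cheap.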

## Search the index
The index is now ready for searching. The _match_ method returns a boolean array with the same length as the original data; each value indicates whether the document contains the search term. Tokenize the query with the same tokenizer that was used to build the index, so that query terms and indexed terms line up.

In [None]:
title_index.match(ws_punc_tokenizer('quarkus'))

What just happened here? How did SearchArray determine if a document is a match to our query?
1. Obtain the number representing the token 'quarkus' from the dictionary. To make lookups faster, the dictionary is a two-way dictionary. It can translate the token to a number and the number to a token.
2. Find the documents in which the term frequency for that specific term is greater than zero.
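
The two steps above can be sketched with plain Python. This is a simplified stand-in under stated assumptions: it counts tokens with a `Counter` per document instead of using term ids and the positional index, and the documents and the `match` helper are made up for illustration.

```python
from collections import Counter

# Already tokenized toy documents (hypothetical titles).
docs = [
    ["debug", "quarkus", "apps"],
    ["search", "with", "pandas"],
    ["quarkus", "and", "more", "quarkus"],
]

# Term frequencies per document, a simplified stand-in for the real index.
term_freqs = [Counter(tokens) for tokens in docs]


def match(term):
    """Return one boolean per document: True when the term occurs at least once."""
    return [freqs[term] > 0 for freqs in term_freqs]


print(match("quarkus"))  # [True, False, True]
```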

In the next code block we take a closer look at those term positions. This is the part where Doug created something fast using roaring bitmaps.

In [None]:
def token_frequencies(token: str):
    """
    Prints a summary of important information about the token and its frequency of occurrence in the documents.
    :param token: The token to find positions for.
    """
    term_id = title_index.term_dict.get_term_id(token)    
    doc_ids, term_frequencies = title_index.posns.termfreqs(term_id)

    print(f"""
Type of the object storing the positions of the terms in docs '{type(title_index.posns).__name__}'
The id of the token '{token}': {term_id}
For fun, convert the id back to the token: {title_index.term_dict.get_term(term_id)}

The document ids where the token is found: """)
    for doc_id, term_frequency in zip(doc_ids, term_frequencies):
        print(f"Document id: {doc_id} - Term frequency: {term_frequency}")


In [None]:
token_frequencies('quarkus')

The function above uses the _termfreqs_ method of the _PosnBitArray_ object. That object is the focus of another section.

## Search for multiple terms
The _match_ method also accepts multiple terms. When given a list of tokens, it matches only documents that contain all the terms and, as shown further below, in the given order. It again returns a boolean array with the same length as the original data, and the query is tokenized with the same tokenizer as the index.

In [None]:
title_index.match(ws_punc_tokenizer('debug Quarkus'))

If we change the order of the terms, the result is different, because the order of the terms is part of the match. The only way to support this is to store the positions of the terms in each document. First, the proof that the reversed order of tokens does not result in a match.
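
Why positions are needed for order-sensitive matching can be shown with a small sketch. It assumes a plain dictionary of positions per term instead of the compact structure SearchArray uses; `positions` and `phrase_match` are hypothetical helper names.

```python
def positions(tokens):
    """Map each term to the list of positions where it occurs in one document."""
    posns = {}
    for i, token in enumerate(tokens):
        posns.setdefault(token, []).append(i)
    return posns


def phrase_match(doc_tokens, phrase):
    """True when the phrase terms occur at consecutive positions, in order."""
    posns = positions(doc_tokens)
    if not phrase or any(term not in posns for term in phrase):
        return False
    # Try every occurrence of the first term and check the terms that follow it.
    return any(
        all(start + offset in posns[term] for offset, term in enumerate(phrase))
        for start in posns[phrase[0]]
    )


doc = ["how", "to", "debug", "quarkus", "apps"]
print(phrase_match(doc, ["debug", "quarkus"]))  # True
print(phrase_match(doc, ["quarkus", "debug"]))  # False
```

Both queries contain the same terms, so a plain term lookup could not tell them apart; only the stored positions can.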

In [None]:
title_index.match(ws_punc_tokenizer('Quarkus debug'))

## Scoring the search results
Next, calculate a score using BM25. With this score we can rank the documents by their relevance to the search terms. BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic information retrieval model. The BM25 score of a document is the sum of the scores of the individual query terms.
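
A common textbook form of the score of a single term $t$ in a document $d$ is given here for reference. The exact variant that SearchArray implements may differ in details:

$$\text{score}(t, d) = \text{idf}(t) \cdot \frac{tf(t, d)\,(k_1 + 1)}{tf(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{avgdl}\right)}, \qquad \text{idf}(t) = \ln\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$$

Here $tf(t, d)$ is the term frequency, $|d|$ the document length, $avgdl$ the average document length, $N$ the number of documents, and $n_t$ the number of documents containing $t$. The sketch below implements this formula directly; the function name and all statistics are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1, b):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Made-up statistics per query term:
# (term frequency in this title, number of titles containing the term).
query_term_stats = [(1, 3), (1, 10)]

# The document score is the sum of the per-term scores.
score = sum(
    bm25_term_score(tf, doc_len=5, avg_doc_len=6, n_docs=100, doc_freq=df, k1=1.2, b=0.75)
    for tf, df in query_term_stats
)
print(round(score, 3))
```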

In [None]:
bm25 = blogs['title_index'].array.score(ws_punc_tokenizer('quarkus'))
bm25

In [None]:
blogs['bm25'] = bm25
blogs[['title', 'bm25']].sort_values('bm25', ascending=False)

The BM25 score is influenced by the length of the document (or field) relative to the average length across all documents. The *b* parameter controls the impact of this length difference, and the *k1* parameter controls the impact of the term frequency on the score. The default values are *k1=1.2* and *b=0.75*, the same defaults Lucene uses.
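
The effect of both parameters can be explored with a small sketch of the term-frequency part of the textbook BM25 formula, leaving out the idf factor. This is an assumption for illustration: SearchArray's own similarity functions may differ in details, and `tf_component` and the lengths used are hypothetical.

```python
def tf_component(tf, doc_len, avg_doc_len, k1, b):
    """Term-frequency part of the textbook BM25 formula (without idf)."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


avg_len = 6

# b: how strongly a short title is favoured over a long one.
for b in (0.0, 0.75):
    short = tf_component(tf=1, doc_len=3, avg_doc_len=avg_len, k1=1.2, b=b)
    long_ = tf_component(tf=1, doc_len=12, avg_doc_len=avg_len, k1=1.2, b=b)
    print(f"b={b}: short title {short:.3f}, long title {long_:.3f}")

# k1: how much extra score repeated occurrences of a term add.
for k1 in (0.5, 2.0):
    once = tf_component(tf=1, doc_len=avg_len, avg_doc_len=avg_len, k1=k1, b=0.75)
    thrice = tf_component(tf=3, doc_len=avg_len, avg_doc_len=avg_len, k1=k1, b=0.75)
    print(f"k1={k1}: tf=1 gives {once:.3f}, tf=3 gives {thrice:.3f}")
```

With *b=0* the length of the title has no effect at all, and a larger *k1* lets repeated terms keep adding to the score for longer before it levels off.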

In [None]:
from searcharray.similarity import bm25_legacy_similarity

custom_bm25 = bm25_legacy_similarity(k1=1, b=0.1)
blogs['custom_bm25'] = blogs['title_index'].array.score(ws_punc_tokenizer('quarkus'), similarity=custom_bm25)
blogs[['title', 'bm25', 'custom_bm25']].sort_values('custom_bm25', ascending=False)

## Filter rows using the tags
Next, we use the tags field to keep only the blogs that are tagged with *java*. We start by showing the tags for the top matching rows from the previous query.

In [None]:
blogs[['title', 'bm25', 'tags']].sort_values('bm25', ascending=False)

In [None]:
# Filter rows where 'tags' field contains 'java'
java_blogs = blogs[blogs['tags'].apply(lambda tags: 'java' in tags)]
java_blogs[['title', 'bm25', 'tags']].sort_values('bm25', ascending=False)