
## SearchArray Guide

[SearchArray](http://github.com/softwaredoug/searcharray) is intended to be a very minimal API for lexical (i.e., BM25) search on top of a Pandas DataFrame.

The API is inspired by Lucene, so if you're comfortable with core search concepts from Lucene-based search engines (Solr, Elasticsearch, OpenSearch), you'll be fine. Just like Lucene, we have analyzers/tokenizers and similarities.

### WHY!?!?

* Help prototype ideas without standing up a search engine
* To let people without Solr / Elasticsearch expertise propose ideas
* Bring lexical / BM25 search into the normal Python data world


In [None]:
!pip install searcharray
from searcharray import SearchArray
import pandas as pd
import numpy as np

Collecting searcharray
  Downloading searcharray-0.0.72-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading searcharray-0.0.72-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Installing collected packages: searcharray
Successfully installed searcharray-0.0.72


### Basic Indexing

We start with basic / default tokenization that doesn't do anything special.

In [None]:
chat_transcript = [
  "Hi this is Doug, I'd like to complain about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I'd like to complain about the ski conditions in West Virginia"
]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug"],
                     "msg": chat_transcript})
msgs

Unnamed: 0,name,msg
0,Doug,"Hi this is Doug, I'd like to complain about th..."
1,Doug,"Doug, this is Tom, support for Earth's Climate..."
2,Tom,"Tom, can I speak to your manager?"
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for..."
4,Doug,I'd like to complain about the ski conditions ...


In [None]:
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'])
msgs

2025-06-17 19:49:25,453 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-17 19:49:25,456 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-17 19:49:25,458 - searcharray.indexing - INFO - Tokenizing 5 documents
2025-06-17 19:49:25,465 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-17 19:49:25,467 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-17 19:49:25,468 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-17 19:49:25,470 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-17 19:49:25,473 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-17 19:49:25,475 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-17 19:49:25,477 - searcharray.indexing - INFO - Indexing from tokenization complete


Unnamed: 0,name,msg,msg_tokenized
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'Hi', ""I'd"", 'to', 'complain', 'about',..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'how', 'can', 'help?', 'for', 'Doug,', ..."
2,Tom,"Tom, can I speak to your manager?","Terms({'manager?', 'I', 'can', 'to', 'your', '..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'Hi,', 'I', 'can', 'boss.', 'for', 'thi..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'about', ""I'd"", 'to', 'complain', 'ski'..."
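
Judging from the `Terms` output above, the default tokenizer appears to split on whitespace only, leaving case and punctuation intact (note tokens such as `'Doug,'` and `'help?'`). The following is a minimal sketch of that behavior, assuming plain `str.split()` semantics; `whitespace_tokenize` is a hypothetical helper, not SearchArray's actual implementation.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace; case and punctuation are left untouched."""
    return text.split()


tokens = whitespace_tokenize(
    "Doug, this is Tom, support for Earth's Climate, how can we help?"
)
print(tokens)

# A lowercase, punctuation-free query term does not match the indexed token.
print("doug" in tokens)   # False
print("Doug," in tokens)  # True
```

This is why a query for `doug` would find nothing in this column, and why custom tokenization matters for most real use.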


### Basic search (single term)

Searching is just a matter of calling `score` on the indexed column's underlying array.

In [None]:
msgs['score'] = msgs['msg_tokenized'].array.score("ski")
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'about', ""I'd"", 'to', 'complain', 'ski'...",0.620554
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'Hi', ""I'd"", 'to', 'complain', 'about',...",0.0
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'how', 'can', 'help?', 'for', 'Doug,', ...",0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'manager?', 'I', 'can', 'to', 'your', '...",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'Hi,', 'I', 'can', 'boss.', 'for', 'thi...",0.0
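
To see where a number like 0.620554 can come from, here is a sketch of a Lucene-style BM25 score for one term in one document. It assumes the defaults `k1=1.2` and `b=0.75` and the formula used by recent Lucene versions (without the `k1 + 1` factor in the numerator); `bm25_score` is a hypothetical helper written for illustration, not SearchArray's code. The document lengths are the whitespace token counts of the five messages.

```python
import math


def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Lucene-style BM25 for a single term in a single document."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf / (tf + k1 * length_norm)


# Whitespace token counts of the five chat messages.
doc_lens = [11, 12, 7, 12, 11]
avg_doc_len = sum(doc_lens) / len(doc_lens)

# "ski" occurs once (tf=1) and only in the last message (df=1).
ski_score = bm25_score(tf=1, df=1, doc_len=doc_lens[4],
                       avg_doc_len=avg_doc_len, num_docs=len(doc_lens))
print(f"{ski_score:.4f}")  # 0.6206
```

Under these assumptions the result agrees with the `score` column above to the displayed precision.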


### Basic search (phrase)

Phrases are just lists of terms passed to `score`.

In [None]:
msgs['score'] = msgs['msg_tokenized'].array.score(["ski", "conditions"])
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'about', ""I'd"", 'to', 'complain', 'ski'...",1.241108
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'Hi', ""I'd"", 'to', 'complain', 'about',...",0.0
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'how', 'can', 'help?', 'for', 'Doug,', ...",0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'manager?', 'I', 'can', 'to', 'your', '...",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'Hi,', 'I', 'can', 'boss.', 'for', 'thi...",0.0
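
A phrase only matches when its terms appear next to each other and in order. The sketch below counts such occurrences in a tokenized document using plain Python lists. It illustrates the idea only: `phrase_freq` is a hypothetical helper, and SearchArray's own positional index is not shown here.

```python
def phrase_freq(tokens: list[str], phrase: list[str]) -> int:
    """Count the positions where `phrase` occurs as consecutive tokens."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


doc = "I'd like to complain about the ski conditions in West Virginia".split()
print(phrase_freq(doc, ["ski", "conditions"]))  # 1
print(phrase_freq(doc, ["conditions", "ski"]))  # 0 (wrong order)
print(phrase_freq(doc, ["ski", "in"]))          # 0 (not adjacent)
```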


## Custom tokenization (aka text analysis)

You almost always want some kind of custom tokenization (stemming, etc.).

Luckily, Python has a rich ecosystem of stemmers, lemmatizers, and other text-processing tools. SearchArray intentionally avoids creating its own library of tokenizers for this reason.

Here's an example using the Snowball stemmer (via the PyStemmer package).

In [None]:
!pip install pystemmer
import Stemmer
import string

stemmer = Stemmer.Stemmer('english')

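def snowball_tokenizer(text):
  # Map typographic quotes and dashes to their ASCII equivalents
  fold_to_ascii = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-",  u"'''\"\"--") ] )

  # Lowercase and split on whitespace, then fold each token to ASCII
  split = text.lower().split()
  folded = [token.translate(fold_to_ascii) for token in split]
  # Strip ASCII punctuation from each token, then stem what remains
  return [stemmer.stemWord(token.translate(str.maketrans('', '', string.punctuation)))
          for token in folded]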

snowball_tokenizer("Mary had a little lamb!")

Collecting pystemmer
  Downloading PyStemmer-3.0.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading PyStemmer-3.0.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (731 kB)
Installing collected packages: pystemmer
Successfully installed pystemmer-3.0.0


['mari', 'had', 'a', 'littl', 'lamb']

### Indexing with custom tokenizer

We pass the `snowball_tokenizer` function to the `index` method, then tokenize the query the same way before scoring.

In [None]:
msgs['msg_snowball'] = SearchArray.index(msgs['msg'], tokenizer=snowball_tokenizer)
msgs['score'] = msgs['msg_snowball'].array.score(snowball_tokenizer("earths climate"))
msgs.sort_values('score', ascending=False)

2025-06-17 19:51:24,545 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-17 19:51:24,548 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-17 19:51:24,554 - searcharray.indexing - INFO - Tokenizing 5 documents
2025-06-17 19:51:24,556 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-17 19:51:24,558 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-17 19:51:24,560 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-17 19:51:24,562 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-17 19:51:24,565 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-17 19:51:24,567 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-17 19:51:24,570 - searcharray.indexing - INFO - Indexing from tokenization complete


Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'how', 'can', 'help?', 'for', 'Doug,', ...",1.195665,"Terms({'doug', 'climat', 'help', 'how', 'can',..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'Hi', ""I'd"", 'to', 'complain', 'about',...",0.0,"Terms({'doug', 'about', 'to', 'complain', 'lik..."
2,Tom,"Tom, can I speak to your manager?","Terms({'manager?', 'I', 'can', 'to', 'your', '...",0.0,"Terms({'can', 'to', 'your', 'speak', 'manag', ..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'Hi,', 'I', 'can', 'boss.', 'for', 'thi...",0.0,"Terms({'can', 'sue', 'for', 'you', 'boss', 'th..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'about', ""I'd"", 'to', 'complain', 'ski'...",0.0,"Terms({'west', 'about', 'to', 'complain', 'ski..."


### Searching with custom tokenizer

The `score` method expects pre-tokenized terms. The tokenizer used at index time is available as the array's `tokenizer` attribute, so you can apply it to the query directly.

In [None]:
query = "earths climate"
tokenized_phrase = msgs['msg_snowball'].array.tokenizer(query)
tokenized_phrase

['earth', 'climat']

In [None]:
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'how', 'can', 'help?', 'for', 'Doug,', ...",1.195665,"Terms({'doug', 'climat', 'help', 'how', 'can',..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'Hi', ""I'd"", 'to', 'complain', 'about',...",0.0,"Terms({'doug', 'about', 'to', 'complain', 'lik..."
2,Tom,"Tom, can I speak to your manager?","Terms({'manager?', 'I', 'can', 'to', 'your', '...",0.0,"Terms({'can', 'to', 'your', 'speak', 'manag', ..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'Hi,', 'I', 'can', 'boss.', 'for', 'thi...",0.0,"Terms({'can', 'sue', 'for', 'you', 'boss', 'th..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'about', ""I'd"", 'to', 'complain', 'ski'...",0.0,"Terms({'west', 'about', 'to', 'complain', 'ski..."


## Changing similarities

By default, scoring uses a BM25 implementation that attempts to mirror Lucene's. You can change this by passing a `similarity` argument at query time.

Each "similarity" is a factory function that itself returns a function. Notice below that we customize BM25's `k1` and `b` parameters.

In [None]:
from searcharray.similarity import bm25_similarity

custom_bm25_sim = bm25_similarity(k1=10, b=0.01)
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase, similarity=custom_bm25_sim)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'support', ""Earth's"", 'this', 'Doug,', ...",0.079493,"Terms({'support', 'tom', 'doug', 'this', 'eart..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'to', 'the', 'this', 'Doug,', 'complain...",0.0,"Terms({'doug', 'to', 'this', 'the', 'id', 'com..."
2,Tom,"Tom, can I speak to your manager?","Terms({'to', 'I', 'your', 'speak', 'can', 'man...",0.0,"Terms({'tom', 'to', 'i', 'manag', 'speak', 'ca..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'What', 'Hi,', 'Sue,', 'this', 'I', 'yo...",0.0,"Terms({'tom', 'this', 'i', 'sue', 'hi', 'you',..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'to', 'the', 'complain', 'in', 'West', ...",0.0,"Terms({'to', 'the', 'west', 'id', 'complain', ..."
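
For intuition about what `k1` and `b` control, the sketch below evaluates only the term-frequency part of a Lucene-style BM25 formula. This is an assumption about the formula's shape, written as a hypothetical `tf_component` helper; it is not SearchArray's code, and the inputs are illustrative rather than taken from the table above.

```python
def tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25: saturates toward 1.0 as tf grows."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf / (tf + k1 * length_norm)


# k1 controls how quickly repeated occurrences stop adding to the score.
for tf in (1, 2, 5, 20):
    default = tf_component(tf, doc_len=10, avg_doc_len=10)
    high_k1 = tf_component(tf, doc_len=10, avg_doc_len=10, k1=10, b=0.01)
    print(f"tf={tf:>2}  k1=1.2: {default:.3f}  k1=10: {high_k1:.3f}")

# b controls how strongly longer-than-average documents are penalized.
long_doc_default_b = tf_component(1, doc_len=20, avg_doc_len=10, b=0.75)
long_doc_small_b = tf_component(1, doc_len=20, avg_doc_len=10, b=0.01)
print(f"long doc, b=0.75: {long_doc_default_b:.3f}  b=0.01: {long_doc_small_b:.3f}")
```

With a large `k1`, the score keeps growing with term frequency for longer; with `b` near zero, document length barely matters.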


### Custom similarity

You can also write your own similarity: create a function that returns a function satisfying the contract.

Given an array of term frequencies (one per document) plus other document and term statistics, return an array of similarity scores with the same length as `term_freqs`.

The commented example below implements raw TF*IDF.

In [None]:
from searcharray.similarity import Similarity

def tf_idf_raw() -> Similarity:
    def raw(term_freqs: np.ndarray,        # TF of the term/phrase in every doc
            doc_freqs: np.ndarray,         # Doc freq of each query term (length > 1 for a phrase)
            doc_lens: np.ndarray,          # Every document's length (same shape as term_freqs)
            avg_doc_lens: float,           # Average doc length of the corpus
            num_docs: int) -> np.ndarray:  # Total number of docs in the corpus

        phrase_doc_freq = np.sum(doc_freqs)     # In case of phrase
        return term_freqs * (1.0 / phrase_doc_freq)
    return raw

raw = tf_idf_raw()
raw(term_freqs=np.asarray([5.0, 3.0]),     # Two docs with term freqs 5 and 3
    doc_freqs=np.asarray([10.0]),          # Single term, df = 10
    doc_lens=np.asarray([10, 20]),
    avg_doc_lens=15,
    num_docs=2)

array([0.5, 0.3])

In [None]:
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase, similarity=raw)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'support', ""Earth's"", 'this', 'Doug,', ...",0.5,"Terms({'support', 'tom', 'doug', 'this', 'eart..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'to', 'the', 'this', 'Doug,', 'complain...",0.0,"Terms({'doug', 'to', 'this', 'the', 'id', 'com..."
2,Tom,"Tom, can I speak to your manager?","Terms({'to', 'I', 'your', 'speak', 'can', 'man...",0.0,"Terms({'tom', 'to', 'i', 'manag', 'speak', 'ca..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'What', 'Hi,', 'Sue,', 'this', 'I', 'yo...",0.0,"Terms({'tom', 'this', 'i', 'sue', 'hi', 'you',..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'to', 'the', 'complain', 'in', 'West', ...",0.0,"Terms({'to', 'the', 'west', 'id', 'complain', ..."


## Advanced queries (edismax? multi-match?)

What about things like Solr's edismax? Or a big Elasticsearch multi-match query?

Well, in the end, these things are just math. And you know what Pandas is good at? Math!

For example, an Elasticsearch multi-match query searches different fields, multiplies each field's score by a weight (i.e., a boost), and then sums the results or takes the maximum.

First, we tokenize the query according to each field's tokenizer.

In [None]:
query = "doug ski vacation conditions"

query_as_snowball = msgs['msg_snowball'].array.tokenizer(query)
query_as_whitespace = msgs['msg_tokenized'].array.tokenizer(query)
query_as_snowball, query_as_whitespace

(['doug', 'ski', 'vacat', 'condit'], ['doug', 'ski', 'vacation', 'conditions'])

Then we get a score for each field, for each query term.

The resulting arrays are shaped num_terms x num_docs.

In [None]:
snowball_scores = np.asarray([msgs['msg_snowball'].array.score(query_term)
                              for query_term in query_as_snowball])

whitespace_scores = np.asarray([msgs['msg_tokenized'].array.score(query_term)
                                for query_term in query_as_whitespace])

snowball_scores, whitespace_scores

(array([[0.3918906 , 0.37754142, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.6205541 ],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.6205541 ]],
       dtype=float32),
 array([[0.       , 0.       , 0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.6205541],
        [0.       , 0.       , 0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.6205541]],
       dtype=float32))

## Take max-per-term (ie 'dismax')

In search, ["disjunction maximum"](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html) or "dismax" just means taking the maximum score across fields. That's easy to do with these two arrays. Dismax sits under the hood of many common queries, such as edismax or multi-match.

In [None]:
best_term_scores_per_doc = []
for term_idx in range(len(snowball_scores)):
    this_term_scores = np.max([snowball_scores[term_idx], whitespace_scores[term_idx]], axis=0)
    best_term_scores_per_doc.append(this_term_scores)
best_term_scores_per_doc

[array([0.3918906 , 0.37754142, 0.        , 0.        , 0.        ],
       dtype=float32),
 array([0.       , 0.       , 0.       , 0.       , 0.6205541],
       dtype=float32),
 array([0., 0., 0., 0., 0.], dtype=float32),
 array([0.       , 0.       , 0.       , 0.       , 0.6205541],
       dtype=float32)]

In [None]:
scores = np.sum(best_term_scores_per_doc, axis=0)
scores

array([0.3918906 , 0.37754142, 0.        , 0.        , 1.2411082 ],
      dtype=float32)

In [None]:
msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'to', 'the', 'complain', 'in', 'West', ...",1.241108,"Terms({'to', 'the', 'west', 'id', 'complain', ..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'to', 'the', 'this', 'Doug,', 'complain...",0.391891,"Terms({'doug', 'to', 'this', 'the', 'id', 'com..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'support', ""Earth's"", 'this', 'Doug,', ...",0.377541,"Terms({'support', 'tom', 'doug', 'this', 'eart..."
2,Tom,"Tom, can I speak to your manager?","Terms({'to', 'I', 'your', 'speak', 'can', 'man...",0.0,"Terms({'tom', 'to', 'i', 'manag', 'speak', 'ca..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'What', 'Hi,', 'Sue,', 'this', 'I', 'yo...",0.0,"Terms({'tom', 'this', 'i', 'sue', 'hi', 'you',..."
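
The same arithmetic extends to field boosts and to summing across fields instead of taking the maximum. The sketch below uses small made-up score arrays (the field names, values, and boosts are hypothetical, chosen only to make the math easy to follow) and vectorizes the per-term loop with `np.maximum`.

```python
import numpy as np

# Hypothetical per-term scores, shaped num_terms x num_docs, one array per field.
title_scores = np.array([[0.4, 0.0, 0.1],
                         [0.0, 0.6, 0.0]])
body_scores = np.array([[0.2, 0.3, 0.0],
                        [0.0, 0.5, 0.7]])

# Hypothetical boosts (weights) for each field.
title_boost, body_boost = 2.0, 1.0

# Dismax: for each term, keep the best boosted field score, then sum over terms.
best_per_term = np.maximum(title_scores * title_boost, body_scores * body_boost)
dismax_scores = best_per_term.sum(axis=0)

# Alternative: add the boosted field scores together, then sum over terms.
summed_scores = (title_scores * title_boost + body_scores * body_boost).sum(axis=0)

print(dismax_scores)
print(summed_scores)
```

Taking the maximum rewards the single best-matching field per term, while summing rewards documents that match in several fields at once.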


## Simulate edismax query parser

In [None]:
from searcharray.solr import edismax

msgs['score'], explain = edismax(msgs, q="ski",
                                 qf=["msg_tokenized", "msg_snowball"])
print(explain)
msgs.sort_values('score', ascending=False)

((msg_tokenized:ski^1 | msg_snowball:ski^1))~1


Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'to', 'the', 'complain', 'in', 'West', ...",0.620554,"Terms({'to', 'the', 'west', 'id', 'complain', ..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'to', 'the', 'this', 'Doug,', 'complain...",0.0,"Terms({'doug', 'to', 'this', 'the', 'id', 'com..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'support', ""Earth's"", 'this', 'Doug,', ...",0.0,"Terms({'support', 'tom', 'doug', 'this', 'eart..."
2,Tom,"Tom, can I speak to your manager?","Terms({'to', 'I', 'your', 'speak', 'can', 'man...",0.0,"Terms({'tom', 'to', 'i', 'manag', 'speak', 'ca..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'What', 'Hi,', 'Sue,', 'this', 'I', 'yo...",0.0,"Terms({'tom', 'this', 'i', 'sue', 'hi', 'you',..."
