<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/Search_Array_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SearchArray Guide

[SearchArray](http://github.com/softwaredoug/searcharray) is intended to be a very minimal API for lexical (i.e. BM25) search on top of a Pandas DataFrame.

The API is inspired by Lucene, so if you're comfortable with core search concepts from Lucene-based search engines (Solr, Elasticsearch, OpenSearch), you'll be fine. Just like Lucene, we have analyzers/tokenizers and similarities.


In [None]:
!pip install searcharray==0.0.33
from searcharray import SearchArray
import pandas as pd
import numpy as np


### Basic Indexing

We start with basic / default tokenization that doesn't do anything special.

In [None]:
chat_transcript = [
  "Hi this is Doug, I'd like to complain about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I'd like to complain about the ski conditions in West Virginia"
]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug"],
                     "msg": chat_transcript})
msgs

Unnamed: 0,name,msg
0,Doug,"Hi this is Doug, I'd like to complain about th..."
1,Doug,"Doug, this is Tom, support for Earth's Climate..."
2,Tom,"Tom, can I speak to your manager?"
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for..."
4,Doug,I'd like to complain about the ski conditions ...


In [None]:
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'])
msgs

Unnamed: 0,name,msg,msg_tokenized
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '..."
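The terms shown above (`'Doug,'`, `'manager?'`, `'boss.'`) suggest the default tokenizer splits on whitespace and leaves case and punctuation alone. Here is a minimal sketch of that behavior, assuming plain `str.split` semantics; `ws_tokenize` is a hypothetical stand-in, not SearchArray's own function.

```python
def ws_tokenize(text):
    # Hypothetical stand-in for the default tokenizer: split on runs of
    # whitespace, keeping case and punctuation exactly as written.
    return text.split()

tokens = ws_tokenize("Tom, can I speak to your manager?")
print(tokens)

# "manager?" keeps its question mark, so a query for "manager"
# would not match it under this tokenization.
print("manager" in tokens, "manager?" in tokens)
```

This is why the custom tokenization section later on matters: without lowercasing and punctuation stripping, queries have to match the indexed tokens character for character.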


### Basic search (single term)

Searching is just a matter of calling `score`.

In [None]:
msgs['score'] = msgs['msg_tokenized'].array.score("ski")
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.620554
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0
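To see where a number like 0.620554 can come from, here is a rough sketch of a Lucene-style BM25 score for a single term. It assumes the usual defaults (`k1=1.2`, `b=0.75`) and that document length is the whitespace token count; `bm25_term_score` is a hypothetical helper, not part of SearchArray. Under those assumptions it lands on the same value as the top row above.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    # Lucene-style idf: rarer terms (small df) get a larger weight.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates with k1; b controls length normalization.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf / (tf + k1 * length_norm)

# Whitespace token counts of the five messages in the transcript.
doc_lens = [11, 12, 7, 12, 11]
avg_doc_len = sum(doc_lens) / len(doc_lens)

# "ski" occurs once, and only in the last message (tf=1, df=1).
ski_score = bm25_term_score(tf=1, df=1, doc_len=doc_lens[4],
                            avg_doc_len=avg_doc_len, num_docs=len(doc_lens))
print(round(ski_score, 6))
```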


### Basic search (phrase)

Phrases are just lists of terms passed to `score`.

In [None]:
msgs['score'] = msgs['msg_tokenized'].array.score(["ski", "conditions"])
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.391891
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0
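Conceptually, a phrase only matches where its terms sit next to each other, in order. The naive sketch below counts such occurrences in a token list, assuming exact adjacency (no slop); `phrase_freq` is a hypothetical helper, and a real positional index would be far more efficient than this scan.

```python
def phrase_freq(tokens, phrase):
    # Count positions where every phrase term appears adjacent and in order.
    n = len(phrase)
    return sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

doc = "I'd like to complain about the ski conditions in West Virginia".split()
print(phrase_freq(doc, ["ski", "conditions"]))   # adjacent, in order
print(phrase_freq(doc, ["conditions", "ski"]))   # same terms, wrong order
```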


## Custom tokenization (aka text analysis)

You almost always want some kind of custom tokenization (stemming, etc).

Luckily, Python comes with a rich array of stemmers, lemmatizers, and other functionality. SearchArray intentionally avoids creating its own library of tokenizers for this reason.

Here's an example using the Snowball stemmer.

In [None]:
!pip install pystemmer
import Stemmer
import string

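stemmer = Stemmer.Stemmer('english')

# Map typographic quotes and dashes to their ASCII equivalents.
fold_to_ascii = {ord(x): ord(y) for x, y in zip("‘’´“”–-", "'''\"\"--")}
# Translation table that deletes all ASCII punctuation.
strip_punctuation = str.maketrans('', '', string.punctuation)

def snowball_tokenizer(text):
  # Lowercase and split on whitespace, then fold, strip punctuation, and stem.
  split = text.lower().split()
  folded = [token.translate(fold_to_ascii) for token in split]
  return [stemmer.stemWord(token.translate(strip_punctuation))
          for token in folded]

snowball_tokenizer("Mary had a little lamb!")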


['mari', 'had', 'a', 'littl', 'lamb']
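A tokenizer here is just a function from a string to a list of tokens, so you don't need a stemmer to get started. As a dependency-free sketch, the hypothetical `simple_tokenizer` below (not part of SearchArray) lowercases, drops punctuation, and strips possessives using only the standard library.

```python
import re

def simple_tokenizer(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Drop a trailing possessive "'s" and stray quotes: "earth's" -> "earth".
    cleaned = [re.sub(r"'s$", "", tok).strip("'") for tok in tokens]
    return [tok for tok in cleaned if tok]

print(simple_tokenizer("Doug, this is Tom, support for Earth's Climate"))
```

Unlike the Snowball version, this does no stemming, so "conditions" and "condition" would remain distinct terms.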

### Indexing with custom tokenizer

We just pass the `snowball_tokenizer` function to the `index` method.

In [None]:
msgs['msg_snowball'] = SearchArray.index(msgs['msg'], tokenizer=snowball_tokenizer)
msgs

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0,"Terms({'hi', 'like', 'weather', 'complain', 't..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.0,"Terms({'for', 'tom', 'we', 'can', 'help', 'ear..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0,"Terms({'tom', 'can', 'i', 'to', 'your', 'speak..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0,"Terms({'for', 'hi', 'tom', 'can', 'sue', 'i', ..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.391891,"Terms({'like', 'ski', 'condit', 'virginia', 'a..."


### Searching with custom tokenizer

The `score` method expects pre-tokenized terms, so tokenize the query with the same `tokenizer` used at index time. It's available on the indexed array.

In [None]:
query = "earths climate"
tokenized_phrase = msgs['msg_snowball'].array.tokenizer(query)
tokenized_phrase

['earth', 'climat']

In [None]:
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.377541,"Terms({'for', 'tom', 'we', 'can', 'help', 'ear..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0,"Terms({'hi', 'like', 'weather', 'complain', 't..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0,"Terms({'tom', 'can', 'i', 'to', 'your', 'speak..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0,"Terms({'for', 'hi', 'tom', 'can', 'sue', 'i', ..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.0,"Terms({'like', 'ski', 'condit', 'virginia', 'a..."


## Changing similarities

By default, we use a BM25 similarity that attempts to mirror Lucene's BM25 implementation. This can be changed by passing `similarity` at query time.

Each "similarity" is a factory function that itself returns a function. Notice below that we customize BM25's `k1` and `b` parameters.

In [None]:
from searcharray.similarity import bm25_similarity

custom_bm25_sim = bm25_similarity(k1=10, b=0.01)
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase, similarity=custom_bm25_sim)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.079493,"Terms({'for', 'tom', 'we', 'can', 'help', 'ear..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0,"Terms({'hi', 'like', 'weather', 'complain', 't..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0,"Terms({'tom', 'can', 'i', 'to', 'your', 'speak..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0,"Terms({'for', 'hi', 'tom', 'can', 'sue', 'i', ..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.0,"Terms({'like', 'ski', 'condit', 'virginia', 'a..."
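To get a feel for what `k1` and `b` do, here is a small sketch of just the term-frequency part of BM25, written in the same factory-returns-a-function style. It assumes the common Lucene-style form `tf / (tf + k1 * length_norm)`; `tf_saturation` is a hypothetical helper, not SearchArray's implementation.

```python
def tf_saturation(k1=1.2, b=0.75):
    # Factory: capture k1 and b, return the function that does the work.
    def saturate(tf, doc_len, avg_doc_len):
        length_norm = 1 - b + b * doc_len / avg_doc_len
        return tf / (tf + k1 * length_norm)
    return saturate

default = tf_saturation()
custom = tf_saturation(k1=10, b=0.01)

for tf in (1, 2, 5, 20):
    # Average-length document, so length_norm is 1 for both settings.
    print(tf, round(default(tf, 10, 10), 3), round(custom(tf, 10, 10), 3))
```

A larger `k1` delays saturation, so a single occurrence of a term contributes much less; a `b` near zero all but switches off document-length normalization.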


### Custom similarity

You can also write your own similarity: create a function that returns a function satisfying the contract.

Given an array of `term_freqs` for each doc, plus other doc/term stats, you should return an array of similarity scores of the same length as `term_freqs`.

See the comments in the raw TF*IDF example below.

In [None]:
from searcharray.similarity import Similarity

def tf_idf_raw() -> Similarity:
    def raw(term_freqs: np.ndarray,        # TF of the term/phrase in every doc
            doc_freqs: np.ndarray,         # Doc freq of every term (more than one entry if a phrase)
            doc_lens: np.ndarray,          # Every document's length (same shape as TF)
            avg_doc_lens: int,             # Avg doc length of corpus
            num_docs: int) -> np.ndarray:  # Total number of docs in corpus

        phrase_doc_freq = np.sum(doc_freqs)     # In case of phrase
        return term_freqs * (1.0 / phrase_doc_freq)
    return raw

raw = tf_idf_raw()
raw(term_freqs=np.asarray([5.0, 3.0]),     # Two docs with term freqs 5 and 3
    doc_freqs=np.asarray([10.0]),          # Single term, df = 10
    doc_lens=np.asarray([10, 20]),
    avg_doc_lens=15,
    num_docs=2)

array([0.5, 0.3])

In [None]:
msgs['score'] = msgs['msg_snowball'].array.score(tokenized_phrase, similarity=raw)
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.5,"Terms({'for', 'tom', 'we', 'can', 'help', 'ear..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.0,"Terms({'hi', 'like', 'weather', 'complain', 't..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0,"Terms({'tom', 'can', 'i', 'to', 'your', 'speak..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0,"Terms({'for', 'hi', 'tom', 'can', 'sue', 'i', ..."
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",0.0,"Terms({'like', 'ski', 'condit', 'virginia', 'a..."
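The same contract works for other formulas. As another sketch, here is a damped TF*IDF variant (square-root term frequency, logarithmic IDF) that takes the same five arguments as `raw` above. The name `tf_idf_log` and the exact damping are illustrative choices, not something SearchArray ships.

```python
import numpy as np

def tf_idf_log():
    # Factory returning a function with the same arguments as `raw`.
    def score(term_freqs, doc_freqs, doc_lens, avg_doc_lens, num_docs):
        df = np.sum(doc_freqs)                      # combine dfs for a phrase
        idf = 1.0 + np.log(num_docs / (df + 1.0))   # damped inverse doc freq
        return np.sqrt(term_freqs) * idf            # damped term frequency
    return score

sim = tf_idf_log()
scores = sim(term_freqs=np.asarray([4.0, 0.0, 1.0]),  # three docs
             doc_freqs=np.asarray([1.0]),             # single term, df = 1
             doc_lens=np.asarray([10, 20, 15]),
             avg_doc_lens=15,
             num_docs=5)
print(scores)
```

Docs that don't contain the term still score zero, and quadrupling the term frequency only doubles the score.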


## Advanced queries (edismax? multi-match?)

What about things like Solr's edismax? Or a big Elasticsearch multi-match query?

Well, in the end, these things are just math. And you know what Pandas is good at? Math!

For example, an Elasticsearch multi-match query searches different fields, multiplies each field's score by a weight (i.e. a boost), and then sums the scores or takes the maximum.

First we tokenize the query according to each field's tokenizer.

In [None]:
query = "doug ski vacation conditions"

query_as_snowball = msgs['msg_snowball'].array.tokenizer(query)
query_as_whitespace = msgs['msg_tokenized'].array.tokenizer(query)
query_as_snowball, query_as_whitespace

(['doug', 'ski', 'vacat', 'condit'], ['doug', 'ski', 'vacation', 'conditions'])

Then we get a score for each field, for each query term.

The resulting arrays are shaped num_terms x num_docs.

In [None]:
snowball_scores = np.asarray([msgs['msg_snowball'].array.score(query_term)
                              for query_term in query_as_snowball])

whitespace_scores = np.asarray([msgs['msg_tokenized'].array.score(query_term)
                                for query_term in query_as_whitespace])

snowball_scores, whitespace_scores

(array([[0.39189057, 0.37754144, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.62055406],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.62055406]]),
 array([[0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.62055406],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.62055406]]))

## Take max-per-term (ie 'dismax')

In search, ["disjunction maximum"](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html) or "dismax" just means taking the maximum score across fields. That's pretty easy to do with these two arrays. It sits under the hood of many higher-level queries like edismax or multi-match.
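Elasticsearch's dis_max also has a `tie_breaker` that mixes some of the non-winning field scores back in. As I understand it, the result is the best score plus `tie_breaker` times the sum of the others. Here is a sketch of that arithmetic for one query term; the `dis_max` helper and the numbers are illustrative, not SearchArray API or real output.

```python
import numpy as np

def dis_max(field_scores, tie_breaker=0.0):
    # field_scores: num_fields x num_docs scores for one query term.
    field_scores = np.asarray(field_scores, dtype=float)
    best = field_scores.max(axis=0)
    others = field_scores.sum(axis=0) - best
    # tie_breaker=0 is pure dismax; tie_breaker=1 is a plain sum.
    return best + tie_breaker * others

per_field = [[0.4, 0.0, 0.6],    # illustrative scores from one field
             [0.1, 0.3, 0.5]]    # illustrative scores from another field
print(dis_max(per_field))
print(dis_max(per_field, tie_breaker=0.5))
```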

In [None]:
best_term_scores_per_doc = []
for term_idx in range(len(snowball_scores)):
    this_term_scores = np.max([snowball_scores[term_idx], whitespace_scores[term_idx]], axis=0)
    best_term_scores_per_doc.append(this_term_scores)
best_term_scores_per_doc

[array([0.39189057, 0.37754144, 0.        , 0.        , 0.        ]),
 array([0.        , 0.        , 0.        , 0.        , 0.62055406]),
 array([0., 0., 0., 0., 0.]),
 array([0.        , 0.        , 0.        , 0.        , 0.62055406])]

In [None]:
scores = np.sum(best_term_scores_per_doc, axis=0)
scores

array([0.39189057, 0.37754144, 0.        , 0.        , 1.24110813])

In [None]:
msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,msg_tokenized,score,msg_snowball
4,Doug,I'd like to complain about the ski conditions ...,"Terms({'conditions', 'like', 'ski', 'about', '...",1.241108,"Terms({'like', 'ski', 'condit', 'virginia', 'a..."
0,Doug,"Hi this is Doug, I'd like to complain about th...","Terms({'like', 'weather', 'complain', 'Doug,',...",0.391891,"Terms({'hi', 'like', 'weather', 'complain', 't..."
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'we', 'can', 'Tom,', 'Doug,', ""E...",0.377541,"Terms({'for', 'tom', 'we', 'can', 'help', 'ear..."
2,Tom,"Tom, can I speak to your manager?","Terms({'can', 'Tom,', 'manager?', 'I', 'to', '...",0.0,"Terms({'tom', 'can', 'i', 'to', 'your', 'speak..."
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'Hi,', 'can', 'boss.', 'What', '...",0.0,"Terms({'for', 'hi', 'tom', 'can', 'sue', 'i', ..."
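The walkthrough above takes the per-term maximum without any field weights. Adding boosts is one more multiplication, and the loop collapses into a couple of NumPy calls. In this sketch the field names, boosts, and scores are all made up for illustration.

```python
import numpy as np

# Illustrative per-field scores, shaped num_terms x num_docs (not real output).
title_scores = np.array([[0.4, 0.0, 0.0],
                         [0.0, 0.0, 0.6]])
body_scores = np.array([[0.0, 0.3, 0.0],
                        [0.0, 0.0, 0.5]])

# Hypothetical boosts: a title match counts twice as much as a body match.
title_boost, body_boost = 2.0, 1.0
weighted = np.stack([title_scores * title_boost, body_scores * body_boost])

# dismax: best field per term, then sum over terms.
dismax_scores = weighted.max(axis=0).sum(axis=0)
# Alternative: sum every field's contribution.
summed_scores = weighted.sum(axis=0).sum(axis=0)

print(dismax_scores)
print(summed_scores)
```

Stacking the fields along a new first axis keeps the per-term, per-doc layout intact, so swapping `max` for `sum` is the only difference between the two strategies.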
