<a href="https://colab.research.google.com/github/kishorepv/search/blob/main/%5BBUG_FIXED%5D_0_AI_Introduction_to_Lexical_and_BM25_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Basics of lexical search

Let's walk through a basic introduction to lexical search.

### Who you are:

An ML engineer comfortable enough with the Python data stack (pandas, numpy, etc.) who wants to understand traditional search engines (e.g. Elasticsearch).

### What this is

A run through of the core concepts behind lexical search.


## This notebook: tokenization

Let's walk through the importance of tokenization control in a lexical search engine. Traditional search engines use [word-based tokenization](https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17/) and, as we'll see, that choice can make or break a search system. Lexical search engines are all about _giving you control_ over what's a match and what's not.

In [1]:
!pip install searcharray

from searcharray import SearchArray
import pandas as pd
import numpy as np

Collecting searcharray
  Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
Installing collected packages: searcharray
Successfully installed searcharray-0.0.72


## Tokenization and indexing

Let's start with some dumb text. Create a basic pandas dataframe.

In [2]:
chat_transcript = [
  "Hi this is Doug, I'd like to complain about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I'd like to complain about the ski conditions in West Virginia",
  "Oh doug thats terrible, lets see what we can do."
]

msgs = pd.DataFrame({"name": ["Doug", "Tom", "Doug", "Sue", "Doug", "Sue"],
                     "msg": chat_transcript})
msgs

| | name | msg |
|---|---|---|
| 0 | Doug | Hi this is Doug, I'd like to complain about th... |
| 1 | Tom | Doug, this is Tom, support for Earth's Climate... |
| 2 | Doug | Tom, can I speak to your manager? |
| 3 | Sue | Hi, this is Sue, Tom's boss. What can I do for... |
| 4 | Doug | I'd like to complain about the ski conditions ... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. |


### Word-based tokenization

Most lexical search approaches use word-based tokenization. The simplest version just splits on whitespace:

In [3]:
def whitespace_tokenize(text):
  return text.split()

whitespace_tokenize("Mary had a little lamb")

['Mary', 'had', 'a', 'little', 'lamb']

### When we index, we tokenize

To create an index, we pass an array of strings (here `msgs['msg']`) along with a tokenizer. The resulting column, `msg_tokenized`, is an inverted index.

In [4]:
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=whitespace_tokenize)
msgs

2025-09-21 05:26:20,640 - searcharray.indexing - INFO - Indexing begins w/ 4 workers

2025-09-21 05:26:20,642 - searcharray.indexing - INFO - 0 Batch Start tokenization

2025-09-21 05:26:20,644 - searcharray.indexing - INFO - Tokenizing 6 documents

2025-09-21 05:26:20,651 - searcharray.indexing - INFO - Tokenization -- vstacking

2025-09-21 05:26:20,653 - searcharray.indexing - INFO - Tokenization -- DONE

2025-09-21 05:26:20,655 - searcharray.indexing - INFO - Inverting docs->terms

2025-09-21 05:26:20,656 - searcharray.indexing - INFO - Encoding positions to bit array

2025-09-21 05:26:20,659 - searcharray.indexing - INFO - Batch tokenization complete

2025-09-21 05:26:20,661 - searcharray.indexing - INFO - (main thread) Processing 1 batch results

2025-09-21 05:26:20,679 - searcharray.indexing - INFO - Indexing from tokenization complete

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I'd like to complain about th... | Terms({'this', 'Hi', 'complain', "I'd", 'is', ... |
| 1 | Tom | Doug, this is Tom, support for Earth's Climate... | Terms({'help?', 'for', 'Climate,', 'can', 'Tom... |
| 2 | Doug | Tom, can I speak to your manager? | Terms({'can', 'manager?', 'Tom,', 'I', 'speak'... |
| 3 | Sue | Hi, this is Sue, Tom's boss. What can I do for... | Terms({'for', 'can', 'boss.', 'Sue,', 'this', ... |
| 4 | Doug | I'd like to complain about the ski conditions ... | Terms({'conditions', 'ski', 'in', 'Virginia', ... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'Oh', 'terrible,', 'can', 'what', 'do.'... |
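If "inverted index" is a new term, here is a minimal sketch of the idea in plain Python: a mapping from each token to the set of documents that contain it. This is only an illustration. `build_inverted_index` is a hypothetical helper, not part of SearchArray, and SearchArray's real structure is more involved (the indexing log also mentions encoding term positions).

```python
from collections import defaultdict


def whitespace_tokenize(text):
    return text.split()


def build_inverted_index(docs, tokenizer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenizer(text):
            index[token].add(doc_id)
    return index


docs = [
    "Hi this is Doug, I'd like to complain about the weather",
    "Tom, can I speak to your manager?",
    "Oh doug thats terrible, lets see what we can do.",
]

index = build_inverted_index(docs, whitespace_tokenize)
print(sorted(index["can"]))       # [1, 2]
print(sorted(index["complain"]))  # [0]
```

Looking up a term is a dictionary access, which is why the exact string the tokenizer emits matters so much: the lookup key has to equal the stored token.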


## Then let's search

Our user wants to find all messages that mention Doug. How can we solve that?

1. Compute a BM25 score for 'doug' (more on BM25 later)
2. Get the matches (anything with some BM25 score had a term match)
3. Show the matches
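Step 2 works because BM25 is built from term frequencies: a document that never contains the exact query term scores zero. A minimal sketch, assuming a common textbook form of BM25 with default-looking parameters (`k1=1.2`, `b=0.75`). `bm25_term_score` is a hypothetical helper, and SearchArray's exact variant may differ.

```python
import math


def bm25_term_score(term, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score one document for one term using a textbook BM25 formula."""
    term_freq = doc_tokens.count(term)
    if term_freq == 0:
        # No exact token match means no score at all
        return 0.0
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / n_docs
    # Rarer terms get a higher inverse document frequency
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalized
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    return idf * term_freq * (k1 + 1) / (term_freq + k1 * length_norm)


corpus = [text.split() for text in [
    "Tom, can I speak to your manager?",
    "Oh doug thats terrible",
]]
scores = [bm25_term_score("doug", doc, corpus) for doc in corpus]
print([score > 0 for score in scores])  # [False, True]
```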

We'd expect every message that says "doug" to match, right?

In [5]:
scores = msgs['msg_tokenized'].array.score("doug")
matches = msgs[scores > 0]
matches

| | name | msg | msg_tokenized |
|---|---|---|---|
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'Oh', 'terrible,', 'can', 'what', 'do.'... |


### Why so few matches?

That didn't work?!?

* We see only one match for `doug`. What gives?

Well, lexical search is exactly that - dumb string matching - *every single character must match*.


In [6]:
whitespace_tokenize("Hi this is Doug, I'd like to complain")

['Hi', 'this', 'is', 'Doug,', "I'd", 'like', 'to', 'complain']

In [7]:
whitespace_tokenize("doug is a nice guy")

['doug', 'is', 'a', 'nice', 'guy']

Compare the occurrences of "Doug" above. They are different strings:

```
Doug, != doug
```

**Takeaway:** lexical search is about _extremely precise control of string matching_. YOU decide which words to treat as equivalent, depending on your domain.

**Why does this matter?** Obviously, text matching needs to match different variants of a word. But also consider cases where we store tags or label an item in a taxonomy (i.e. this item is a hat, but it is not a cap). Compare this to embedding-based retrieval, where we have much less control over a "match", usually no more than a similarity threshold. Embeddings use a semantic sledgehammer; lexical search uses a scalpel.
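The taxonomy case can be sketched in a few lines. Here each comma-separated tag is kept whole as a single token, so `hat` matches only items tagged exactly `hat`. The item names, tags, and helpers (`tag_tokenize`, `items_tagged`) are made up for illustration.

```python
def tag_tokenize(tags):
    """Treat each comma-separated tag as one token; never split inside a tag."""
    return [tag.strip().lower() for tag in tags.split(",") if tag.strip()]


# Hypothetical catalog: item name -> comma-separated tags
items = {
    "wool beanie": "hat, winter",
    "baseball cap": "cap, summer",
    "hat box": "storage, hat box",
}


def items_tagged(tag):
    """Return item names whose tag list contains exactly this tag."""
    return [name for name, tags in items.items() if tag in tag_tokenize(tags)]


print(items_tagged("hat"))  # ['wool beanie']
print(items_tagged("cap"))  # ['baseball cap']
```

Because `hat box` stays one token, it does not leak into results for `hat`. That is the scalpel: the tokenizer, not a similarity threshold, decides what counts as the same thing.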

## Loosen up string matching

1. Lowercase
2. Remove punctuation
3. Split on whitespace

In [8]:
from string import punctuation


def better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    split = without_punctuation.split()
    return split

better_tokenize("Doug, that weirdo?"), better_tokenize("Oh this is about doug?")

(['doug', 'that', 'weirdo'], ['oh', 'this', 'is', 'about', 'doug'])
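Every loosening step is a trade-off. Stripping all punctuation also removes apostrophes, so `I'd` becomes `id` and `Tom's` becomes `toms`, which no longer matches `tom`. As one possible refinement (a sketch, not something this notebook relies on), a hypothetical `possessive_aware_tokenize` could drop a trailing `'s` before removing punctuation:

```python
import re
from string import punctuation

# Matches 's at the end of a word, as in "Tom's"
_POSSESSIVE = re.compile(r"'s\b")
_STRIP_PUNCTUATION = str.maketrans('', '', punctuation)


def possessive_aware_tokenize(text):
    lowercased = text.lower()
    without_possessive = _POSSESSIVE.sub('', lowercased)
    return without_possessive.translate(_STRIP_PUNCTUATION).split()


print(possessive_aware_tokenize("Hi, this is Sue, Tom's boss."))
# ['hi', 'this', 'is', 'sue', 'tom', 'boss']
```

This has its own cost: contractions such as `it's` collapse to `it`. Whether that is acceptable depends on your domain, which is exactly the kind of decision lexical search leaves to you.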

### Reindex and search

Now we can index with our new tokenizer and repeat the search.

In [9]:

msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=better_tokenize)
msgs

2025-09-21 05:26:20,766 - searcharray.indexing - INFO - Indexing begins w/ 4 workers

2025-09-21 05:26:20,768 - searcharray.indexing - INFO - 0 Batch Start tokenization

2025-09-21 05:26:20,770 - searcharray.indexing - INFO - Tokenizing 6 documents

2025-09-21 05:26:20,772 - searcharray.indexing - INFO - Tokenization -- vstacking

2025-09-21 05:26:20,774 - searcharray.indexing - INFO - Tokenization -- DONE

2025-09-21 05:26:20,776 - searcharray.indexing - INFO - Inverting docs->terms

2025-09-21 05:26:20,778 - searcharray.indexing - INFO - Encoding positions to bit array

2025-09-21 05:26:20,780 - searcharray.indexing - INFO - Batch tokenization complete

2025-09-21 05:26:20,782 - searcharray.indexing - INFO - (main thread) Processing 1 batch results

2025-09-21 05:26:20,787 - searcharray.indexing - INFO - Indexing from tokenization complete

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I'd like to complain about th... | Terms({'id', 'this', 'complain', 'is', 'hi', '... |
| 1 | Tom | Doug, this is Tom, support for Earth's Climate... | Terms({'for', 'earths', 'can', 'support', 'thi... |
| 2 | Doug | Tom, can I speak to your manager? | Terms({'can', 'i', 'speak', 'tom', 'your', 'ma... |
| 3 | Sue | Hi, this is Sue, Tom's boss. What can I do for... | Terms({'for', 'can', 'what', 'i', 'boss', 'sue... |
| 4 | Doug | I'd like to complain about the ski conditions ... | Terms({'id', 'conditions', 'ski', 'in', 'virgi... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'terrible', 'can', 'what', 'lets', 'do'... |


### It worked!

In [10]:
scores = msgs['msg_tokenized'].array.score("doug")
matches = msgs[scores > 0]
matches

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I'd like to complain about th... | Terms({'id', 'this', 'complain', 'is', 'hi', '... |
| 1 | Tom | Doug, this is Tom, support for Earth's Climate... | Terms({'for', 'earths', 'can', 'support', 'thi... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'terrible', 'can', 'what', 'lets', 'do'... |
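Note that the query here was already the normalized string `doug`. A real user might type `Doug?`, and that raw string would not match any indexed token. The usual remedy is to run the query through the same tokenizer as the documents. A minimal sketch in plain Python, where `search` is a hypothetical helper rather than a SearchArray function:

```python
from string import punctuation


def better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    return without_punctuation.split()


def search(query, docs, tokenizer):
    """Return ids of docs sharing at least one token with the tokenized query."""
    query_terms = set(tokenizer(query))
    return [doc_id for doc_id, text in enumerate(docs)
            if query_terms & set(tokenizer(text))]


docs = [
    "Hi this is Doug, I'd like to complain about the weather",
    "Tom, can I speak to your manager?",
    "Oh doug thats terrible, lets see what we can do.",
]

# Raw query string compared against indexed tokens: nothing matches
raw_matches = [i for i, text in enumerate(docs) if "Doug?" in better_tokenize(text)]
print(raw_matches)                            # []

# Query tokenized the same way as the documents
print(search("Doug?", docs, better_tokenize))  # [0, 2]
```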


## Breadcrumbs for Elasticsearch, Vespa, etc

Most search engines have a "tokenization" concept that controls index-time and (as we'll soon see) query-time tokenization. It happens on the server side when you send text over, though [some would argue it should be a client-side concern](https://softwaredoug.com/blog/2025/06/03/liberating-search).

In the Lucene family of search engines, these are [analyzers](https://www.elastic.co/docs/reference/text-analysis/analyzer-reference), which compose tokenization step by step (first manipulate the whole string, then split it, then manipulate individual tokens). In Vespa, the [linguistics module](https://docs.vespa.ai/en/linguistics.html) manages tokenization using OpenNLP's library of tokenizers.
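The three-stage shape of an analyzer (whole-string filters, a splitter, then per-token filters) can be mimicked in plain Python. This is a rough sketch of the concept only. `make_analyzer` and the filter functions are hypothetical names, not the API of Elasticsearch, Lucene, or Vespa.

```python
from string import punctuation


def make_analyzer(char_filters, tokenizer, token_filters):
    """Compose whole-string filters, a tokenizer, and token-list filters."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)        # stage 1: manipulate the whole string
        tokens = tokenizer(text)            # stage 2: split into tokens
        for token_filter in token_filters:
            tokens = token_filter(tokens)   # stage 3: manipulate the tokens
        return tokens
    return analyze


def expand_ampersand(text):
    return text.replace("&", " and ")


def lowercase(tokens):
    return [token.lower() for token in tokens]


def strip_edge_punctuation(tokens):
    # Trim punctuation at token edges only, and drop tokens that become empty
    stripped = (token.strip(punctuation) for token in tokens)
    return [token for token in stripped if token]


analyze = make_analyzer(
    char_filters=[expand_ampersand],
    tokenizer=str.split,
    token_filters=[lowercase, strip_edge_punctuation],
)

print(analyze("Doug & Tom's boss, Sue!"))
# ['doug', 'and', "tom's", 'boss', 'sue']
```

Swapping any one stage changes what counts as a match without touching the others, which is the appeal of composing tokenization this way.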