<a href="https://colab.research.google.com/github/jankovicsandras/plpgsql_bm25/blob/main/plpgsql_bm25_dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

README.md

# plpgsql_bm25
## BM25 search implemented in PL/pgSQL

----
### News
 - Proof of concept works.
 - PL/pgSQL BM25 search functions work in Postgres without any extensions / Rust
 - BM25 index builder works in Python

### Roadmap / TODO
 - PL/pgSQL index builder (or other languages e.g. JavaScript)
 - ```bm25topk()``` should use dynamic column names, not the fixed ```id```, ```full_description```
 - implement other algorithms from rank_bm25, not just Okapi

----
### Contributions welcome!
The author is not a Postgres / PL/pgSQL expert and gladly accepts optimizations and constructive criticism.

----
### Usage in a nutshell
Python index building:
```python
# build BM25 index
mybm25_index = mybm25okapi(tokenized_corpus)
# export wsmap to CSV
mybm25_index.exportwsmap( csvfilepath )
# import wsmap to Postgres from CSV
msq('SELECT bm25importwsmap(\''+tablename_bm25wsmap+'\',\''+csvfilepath+'\');')
```
Postgres search:
```python
msq('SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(\''+tablename+'\', \''+tablename_bm25wsmap+'\',\''+json.dumps(tokenizedquestion).replace("'","\'\'")+'\', 10);')
```
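
The search call above builds its SQL string by hand; the only escaping step is doubling single quotes in the JSON-encoded token list. A minimal sketch of that step as a helper follows. `build_topk_sql` is a hypothetical name that is not part of this repo, and the sketch assumes the table names come from trusted code.

```python
import json

def build_topk_sql(tablename, tablename_bm25wsmap, tokenizedquery, k=10):
    # hypothetical helper, not part of plpgsql_bm25
    # JSON-encode the token list, then double single quotes for the SQL string literal
    query_literal = json.dumps(tokenizedquery).replace("'", "''")
    # table names are concatenated as-is, so they must come from trusted code
    return (
        "SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk("
        f"'{tablename}', '{tablename_bm25wsmap}', '{query_literal}', {int(k)});"
    )

sql = build_topk_sql('items', 'items_bm25wsmap', ["it's", 'red'], 10)
print(sql)
```

The returned string can be passed to `msq()` in the notebook; a driver with real parameter binding would likely be the safer choice outside of a demo.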

----
### What is this?
 - https://en.wikipedia.org/wiki/Okapi_BM25
 - https://en.wikipedia.org/wiki/PL/pgSQL
 - https://github.com/dorianbrown/rank_bm25
 - TLDR:
    - BM25Okapi is a popular search algorithm.
    - Index building: Initially, there's a list of texts or documents called the corpus. Each document is split into words (or tokens) by a tokenization function (the simplest one splits on whitespace characters). The algorithm then builds a word-score-map ```wsmap```, where every word in the corpus is scored for every document based on frequencies: roughly, how rare the word is in the whole corpus and how frequent it is in the current document.
    - Search: the question text (or query string) is tokenized, then the search function looks up the words in ```wsmap``` and sums the scores for each document; the result is a list of scores, one for each document. The highest scoring document is the best match. The search function sorts the (document ID, score) pairs by score in descending order.
    - The ```wsmap``` is stored in a simple dict in Python ``` { 'word1': [doc1score, doc2score, ... ], 'word2':[doc1score, doc2score, ... ], ... }``` and in a simple table in Postgres ```|word TEXT|vl double precision[]|``` where ```vl``` holds ```[doc1score, doc2score, ... ]```.
    - Adding a new document to the corpus or changing one requires rebuilding the whole BM25 index (```wsmap```), because of how the algorithm works.
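
The index building and search steps above can be sketched in a few lines of Python. This is a simplified illustration, not the repo's `mybm25okapi` class: it uses the same Okapi scoring formula with `k1 = 1.5` and `b = 0.75`, but it skips the epsilon floor for negative idf values, and `build_wsmap`, `search` and the toy corpus are made up for this example.

```python
import math

def build_wsmap(tokenized_corpus, k1=1.5, b=0.75):
    # word -> list with one score per document
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(d) for d in tokenized_corpus) / n_docs
    vocabulary = {w for d in tokenized_corpus for w in d}
    wsmap = {}
    for word in vocabulary:
        # how rare is the word in the corpus (no floor for negative idf in this sketch)
        n_with_word = sum(1 for d in tokenized_corpus if word in d)
        idf = math.log(n_docs - n_with_word + 0.5) - math.log(n_with_word + 0.5)
        scores = []
        for d in tokenized_corpus:
            # how frequent is the word in this document, normalized by document length
            tf = d.count(word)
            scores.append(idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len)))
        wsmap[word] = scores
    return wsmap

def search(wsmap, tokenized_query, n_docs):
    # sum the per-document scores of every query word found in the map
    scores = [0.0] * n_docs
    for word in tokenized_query:
        for i, s in enumerate(wsmap.get(word, [])):
            scores[i] += s
    # (document index, score) pairs, best match first
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

corpus = [['red', 'shirt'], ['blue', 'jeans'], ['green', 'hat'], ['black', 'belt']]
wsmap = build_wsmap(corpus)
ranking = search(wsmap, ['blue', 'jeans', 'please'], len(corpus))
print(ranking[0])
```

Because every score depends on the corpus size, the average document length and the per-word document counts, changing a single document changes many entries of `wsmap`, which is why the whole map is rebuilt.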

----
### Repo contents
 - ```plpgsql_bm25_dev.ipynb``` : Jupyter notebook where I develop this.
 - ```mybm25okapi.py``` : Python BM25 index builder, see also https://github.com/dorianbrown/rank_bm25
 - ```plpgsql_bm25.sql``` : PL/pgSQL functions for search

----
### Why?
Postgres already has Full Text Search, and there are several extensions that implement BM25. But Full Text Search is not the same as BM25, and the BM25 extensions are written in Rust, which might not be available or practical, especially in hosted environments. See the Alternatives section for more info.

----
### Alternatives:

 - Postgres Full Text Search
   - https://www.postgresql.org/docs/current/textsearch.html
   - https://postgresml.org/blog/postgres-full-text-search-is-awesome


 - Rust based BM25
   - https://github.com/paradedb/paradedb/tree/dev/pg_search#overview
   - https://github.com/tensorchord/pg_bestmatch.rs


 - Postgres similarity of text using trigram matching
   - https://www.postgresql.org/docs/current/pgtrgm.html

   - NOTE: this is useful for fuzzy string matching, like spelling correction, but it is not a query->document search solution by itself. Because document and query texts differ greatly in length, the relative trigram frequencies become very small, which leads to incorrect or missing matches.
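
The length effect from the note above can be illustrated with a small Python sketch. It only approximates the idea behind trigram similarity (shared trigrams divided by all distinct trigrams): the word splitting and padding rules are simplified assumptions, not the exact behaviour of pg_trgm, and the sample strings are made up.

```python
def trigrams(text):
    # simplified trigram extraction: lowercase, split on whitespace, pad each word
    # with two leading spaces and one trailing space (an assumption for this sketch)
    grams = set()
    for word in text.lower().split():
        padded = '  ' + word + ' '
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def trigram_similarity(a, b):
    # shared trigrams divided by all distinct trigrams of both strings
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = 'black jeans'
short_doc = 'black jeens'
long_doc = 'Trendy black jeans in size 40. Price: 75 USD. Seller: Example Shop. Category: Clothing'

print(trigram_similarity(query, short_doc))
print(trigram_similarity(query, long_doc))
```

In this sketch the long document contains the query verbatim, yet it scores lower than the short misspelled string, because its many other trigrams enlarge the denominator.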

----
### Special thanks to: dorianbrown, Myon


----
### LICENSE

As https://github.com/dorianbrown/rank_bm25 has Apache-2.0 license, the derived mybm25okapi class should probably have Apache-2.0 license. The test datasets and other external code might have different licenses, please check them.

My code:

The Unlicense / PUBLIC DOMAIN

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to http://unlicense.org


## installing PostgreSQL

In [1]:
! sudo apt install gnupg2 wget nano
! sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
! curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
! sudo apt update
! sudo apt install postgresql-16 postgresql-contrib-16 postgresql-server-dev-16



In [2]:
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"

 * Starting PostgreSQL 16 database server
   ...done.
CREATE ROLE


## installing pgvector TODO: this is not required, maybe remove

In [3]:
#! export PG_CONFIG=/usr/lib/postgresql/16/bin/pg_config
#! cd /tmp && git clone --branch v0.7.2 https://github.com/pgvector/pgvector.git
#! cd /tmp/pgvector && sudo make clean
#! cd /tmp/pgvector && sudo make
#! cd /tmp/pgvector && sudo make install

## testing postgres, SqlMagic, msq()

In [4]:
# set connection
%load_ext sql
%config SqlMagic.feedback=False
%config SqlMagic.autopandas=True
%sql postgresql+psycopg2://@/postgres

# testing postgres
#df = %sql SELECT * FROM pg_catalog.pg_tables
#print(df)

# helper: run an SQL statement via SqlMagic, print the result and return it
def msq(s) :
  res = %sql $s
  print(res)
  return res

In [5]:
! pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


## Test dataset 1: Flipkart fashion products from Kaggle
#### This works, but you need to download the file and uncomment this.

In [6]:
# Manually download this dataset from Kaggle, then upload archive.zip, then unpack.
# https://www.kaggle.com/datasets/aaditshukla/flipkart-fasion-products-dataset

"""

import random, json, time

# Loading the dataset from https://www.kaggle.com/datasets/aaditshukla/flipkart-fasion-products-dataset
filename = 'flipkart_fashion_products_dataset.json'
dataset = []
try :
  with open(filename) as f:
    dataset = json.load(f)
    print(filename,'is loaded, number of items: ',len(dataset))
except Exception as ex:
  print(ex)

# item fields to keep
tracked_fields = ['title','description','selling_price','average_rating','product_details','brand','seller','actual_price','discount','category','sub_category','_id','product_details_string','full_description']

# number of items to sample from the dataset
item_num = 100
items = random.sample(dataset,item_num)

# Preprocess: fix 'description', create 'product_details_string', 'full_description'
for i, item in enumerate(items) :
  if len(item['title'].strip()) < 1 :
    item['title'] = 'Unknown item'
  if len(item['description'].strip()) < 1 :
    item['description'] = item['title']
    if len(item['brand'].strip()) > 0 :
      item['description'] += ' from ' + item['brand'].strip()
  item['product_details_string'] = ''
  if 'product_details' in item and len(item['product_details']) > 0 :
    for pd in item['product_details'] :
      for k in pd :
        item['product_details_string'] += k.strip()+': '+pd[k].strip()+'. '
    item['product_details_string'] = item['product_details_string'].strip()
  item['full_description'] = ''
  item['full_description'] += item['title'].strip() + ' ; '
  item['full_description'] += item['description'].strip() + ' ; '
  item['full_description'] += 'Selling price: '+item['selling_price'].strip() + ' ; '
  item['full_description'] += 'Average rating: '+item['average_rating'].strip() + ' ; '
  item['full_description'] += item['product_details_string'].strip() + ' ; '
  item['full_description'] += 'Brand: '+item['brand'].strip() + ' ; '
  item['full_description'] += 'Seller: '+item['seller'].strip() + ' ; '
  item['full_description'] += 'Actual price: '+item['actual_price'].strip()+' Discount: '+item['discount'].strip() + ' ; '
  item['full_description'] += 'Category: '+item['category'].strip()+' | Sub-category: '+item['sub_category'].strip() + ' ; '
  item['full_description'] += 'ID: '+item['_id'].strip()
  item['full_description'] = item['full_description'].replace('\n',' ')
  item['full_description'] = item['full_description'].replace(',',';')
  item['full_description'] = item['full_description'].strip()

# Print
#for i, item in enumerate(items) :
#  print(str(i),item['full_description'])

csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')

# Creating items table
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')

#
questions = [
  'I want a t-shirt, preferably grey.',
  'Do you have slim fit shirts, maybe orange?',
  'I need a gift for my boss.',
  'Do you sell wool socks?',
  'Where can I buy black jeans or pants?',
  'I want a blazer.'
]

"""


## Test dataset 2: generated items in several languages
#### This works, but it is not optimal for testing.

In [7]:
import random

# TODO: better multi language generation
i18n = {
  'en':{
    'colors': ['black','blue','green','cyan','red','magenta','brown','light grey','dark grey','bright blue'],
    'itemtypes': [ 'belt','cap','hat','jeans','jumper', 'shirt','shorts','sneakers','suit','tie' ],
    'adjs' : ['Fantastic','Cool','Superb','Awesome','Trendy'],
    'insizestr': ' in size ',
    'pricestr': '. Price: ',
    'currencystr': ' USD.',
    'questions': [
      'I want to buy a hat. What colors do you have?',
      'Can you recommend something green?',
      'Do you have shirts under 50 USD?',
      'What do you have in size 40?',
      'I would like to buy sneakers for my friend. Do you have something in size 46, preferably cyan or blue?',
      'What can you recommend in red?'
    ]
  },
  'hu':{
    'colors': ['fekete','kék','zöld','zöldeskék','piros','lila','barna','világosszürke','sötétszürke','ragyogó kék'],
    'itemtypes': [ 'öv','sapka','kalap','farmer','pullóver', 'ing','rövidnadrág','tornacipő','öltöny','nyakkendő' ],
    'adjs': ['Csodálatos','Menő','Szuper','Király','Trendi'],
    'insizestr': '. Méret: ',
    'pricestr': '. Ár: ',
    'currencystr': ' Ft.',
    'questions': [
      'Kalapot szeretnék. Milyen színek vannak?',
      'Tudsz-e ajánlani valami zöldet?',
      'Vannak ingek 50 Ft. alatt?',
      'Mik vannak 40-es méretben?',
      'Tornacipőt szeretnék a barátomnak. Van valami 46-os méretben, lehetőleg zöldeskék vagy kék?',
      'Mit tudsz ajánlani pirosban?'
    ]
  },
  'no':{
    'colors': ['svart','blå','grøn','grønblå','rød','lilla','brun','lysgrå','mørkgrå','lysblå'],
    'itemtypes': [ 'belt','lue','hatt','bukser','genser', 'skjorte','shorts','sko','dress','slips' ],
    'adjs': ['Fantastisk','Kult','Supert','Tøff','Trendy'],
    'insizestr': ' i størrelse ',
    'pricestr': '. Pris: ',
    'currencystr': ' kr.',
    'questions': [
      'Jeg vil kjøpe en hatt. Hva farger er det?',
      'Kan du anbefale noen grønt?',
      'Har de skjorter under 50 kr?',
      'Hva har de i størrelse 40?',
      'Jeg vil gjerne kjøpe sko til min venn. Har de nokre i størrelse 46, helst grønblå eller blå?',
      'Hva kan du anbefale i rødt?'
    ]
  }
}

sizes = [str(30+x*2) for x in range(0,10)]

lang = 'no'

items = []
for c in i18n[lang]['colors'] :
  for s in sizes :
    for ii,i in enumerate(i18n[lang]['itemtypes']) :
      items.append( { 'full_description': random.choice(i18n[lang]['adjs'])+' '+ c+' '+ i+ i18n[lang]['insizestr']+s+
                      i18n[lang]['pricestr']+str(int(s)+20+5*ii)+i18n[lang]['currencystr'] } )

#print(items)


# export to CSV for Postgres
csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')


# Postgres creating items table by importing from CSV
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')


# test questions
questions = i18n[lang]['questions']


 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []


## Test dataset 3: Wordpress related QA from Huggingface

In [8]:

! wget https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/corpus.jsonl
! wget https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/queries.jsonl
! ls -la



In [9]:
import random, json

# load from jsonl
wcorpus = []
with open('corpus.jsonl') as f:
  wcstr = f.read()
  wcorpus = wcstr.split('\n')
print('len(wcorpus)',len(wcorpus))

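# create sampled corpus, items, questions
# skip empty lines (e.g. left by a trailing newline), json.loads() would fail on them
sampledwcorpus = random.sample([line for line in wcorpus if line.strip()],10)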
items = []
qqs = []
for i in range(0,len(sampledwcorpus)) :
  wjs = json.loads(sampledwcorpus[i])
  print(i,'---------------',wjs['_id'])
  print(len(wjs['title']),wjs['title'])
  print(len(wjs['text']),wjs['text'])
  items.append( { 'full_description': wjs['text'] } )
  qqs.append( [wjs['title'],i] )

# questions and solutions
random.shuffle(qqs)
questions = [ q[0] for q in qqs ]
questionsolutions = [ q[1]+1 for q in qqs ]

# export to CSV for Postgres
csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')


# Postgres creating items table by importing from CSV
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')


len(wcorpus) 48606
0 --------------- 146002
27 Where to replace short code
21788 Hey I am new to wordpress and web design in general. I just downloaded a plugin that helps edit navigation menus. The plugin is called Simple Responsive Menu. They are asking me to " Add shortcode in your theme(header.php) or replace your current menu.               [srMenu theme_location=primary]      OR               <?php echo do_shortcode('[srMenu theme_location=primary]');?>      The theme I am using is DonateNow, and would like to know what exactly in the header.php code I should replace. The header.php code is as follows:               <!DOCTYPE html>     <!--[if lt IE 7 ]> <html lang="en" class="ie6 oldie no-js"> <![endif]-->     <!--[if IE 7 ]>    <html lang="en" class="ie7 oldie no-js"> <![endif]-->     <!--[if IE 8 ]>    <html lang="en" class="ie8 oldie no-js"> <![endif]-->     <!--[if IE 9 ]>    <html lang="en" class="ie9 no-js"> <![endif]-->     <html <?php language_attributes(); ?>>     <head

## mybm25okapi: a refactored variant of rank_bm25 Okapi

In [10]:
"""
This is a refactored variant of rank_bm25 Okapi.

Usage:
  - corpus and query must be tokenized already, e.g. corpus = [ ['one','two','three'], ['bla','two','two'] ]  ; query = [ 'Is', 'this', 'a', 'question?' ]
  - __init__(corpus) will initialize the bm25Okapi components, where self.wsmap is the most important
  - No update is possible, so if the documents change in the corpus, then __init__(corpus) must be called again (recreating all the components).
  - search with topk() or get_scores()

Usage with Postgres:
  - corpus and query must be tokenized already, e.g. corpus = [ ['one','two','three'], ['bla','two','two'] ]  ; query = [ 'Is', 'this', 'a', 'question?' ]
  - __init__(corpus) will initialize the bm25Okapi components, where self.wsmap is the most important
  - No update is possible, so if the documents change in the corpus, then __init__(corpus) must be called again (recreating all the components).
  - call exportwsmap() after init, then import wsmap into a Postgres table: COPY tablename_bm25wsmap FROM '/path-to/tablename_bm25wsmap.csv' DELIMITER ';' CSV HEADER;
  - search in Postgres by calling the plpgsql functions: SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(tablename, tablename_bm25wsmap, query, 10);
"""
import math


class mybm25okapi:
  def __init__(self, corpus):
    # constants
    self.k1 = 1.5
    self.b = 0.75
    self.epsilon = 0.25

    self.corpus_len = len(corpus)
    self.avg_doc_len = 0
    self.doc_freqs = []
    self.idf = {}
    self.doc_lens = []
    word_docs_count = {}  # word -> number of documents with word
    total_word_count = 0

    for document in corpus:
      # doc lengths and total word count
      self.doc_lens.append(len(document))
      total_word_count += len(document)

      # word frequencies in this document
      frequencies = {}
      for word in document:
        if word not in frequencies:
          frequencies[word] = 0
        frequencies[word] += 1
      self.doc_freqs.append(frequencies)

      # number of documents with word count
      for word, freq in frequencies.items():
        try:
          word_docs_count[word] += 1
        except KeyError:
          word_docs_count[word] = 1

    # average document length
    self.avg_doc_len = total_word_count / self.corpus_len

    """
    Calculates frequencies of terms in documents and in corpus.
    This algorithm sets a floor on the idf values to eps * average_idf
    """
    # collect idf sum to calculate an average idf for epsilon value
    # collect words with negative idf to set them a special epsilon value.
    # idf can be negative if word is contained in more than half of documents
    idf_sum = 0
    negative_idfs = []
    for word, freq in word_docs_count.items():
      idf = math.log(self.corpus_len - freq + 0.5) - math.log(freq + 0.5)
      self.idf[word] = idf
      idf_sum += idf
      if idf < 0:
        negative_idfs.append(word)
    self.average_idf = idf_sum / len(self.idf)
    # assign epsilon
    eps = self.epsilon * self.average_idf
    for word in negative_idfs:
      self.idf[word] = eps

    # precalc the document-dependent part of the divisor: self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_len)
    self.hds = [ self.k1 * ( 1-self.b + self.b*doc_len/self.avg_doc_len) for doc_len in self.doc_lens ]

    # words * documents score map
    self.wsmap = {}
    for word in self.idf :
      self.wsmap[word] = [0] * self.corpus_len
      word_freqs = [ (doc.get(word) or 0) for doc in self.doc_freqs ]
      for i in range(0,self.corpus_len) :
        self.wsmap[word][i] += (self.idf.get(word) or 0) * ( word_freqs[i] * (self.k1 + 1) / ( word_freqs[i] + self.hds[i] ) )


  # get a list of scores for every document
  def get_scores(self, tokenizedquery):
    # zeroes list of scores
    scores = [0] * self.corpus_len
    # for each word in tokenizedquery, if word is in wsmap, lookup and add word score for every documents' scores
    for word in tokenizedquery:
      if word in self.wsmap :
        for i in range(0,self.corpus_len) :
          scores[i] += self.wsmap[word][i]
    # return scores list (not sorted)
    return scores


  def topk(self,tokenizedquery,k=None):
    docscores = self.get_scores( tokenizedquery )
    sisc = [ [i,s] for i,s in enumerate(docscores) ]
    sisc.sort(key=lambda x:x[1],reverse=True)
    if k :
      sisc = sisc[:k]
    return sisc


  # save the words*documents score map as csv for import to Postgres: COPY tablename_bm25wsmap FROM '/path-to/tablename_bm25wsmap.csv' DELIMITER ';' CSV HEADER;
  def exportwsmap(self, csvfilename) :
    with open(csvfilename,'w+') as f:
      f.write('word;vl\n')
      for word in self.wsmap :
        f.write('"'+word.replace('"','\'')+'";{'+str(self.wsmap[word]).strip()[1:-1]+'}\n')




# tokenization function
def mytokenize(s) :
  ltrimchars = ['(','[','{','<']
  rtrimchars = ['.', '?', '!', ',', ':', ';', ')', ']', '}', '>']
  if type(s) != str : return []
  wl = s.lower().split()
  for i,w in enumerate(wl) :
    if len(w) < 1 : continue
    si = 0
    ei = len(w)
    try :
      while si < ei and w[si] in ltrimchars : si += 1
      while ei > si and w[ei-1] in rtrimchars : ei -= 1
      wl[i] = wl[i][si:ei]
    except Exception as ex:
      print('|',w,'|',ex,'|',wl)
  wl = [ w for w in wl if len(w) > 0 ]
  return wl
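
The trimming loop in `mytokenize` can also be expressed with `str.lstrip` and `str.rstrip`. The sketch below is meant to behave the same way for the inputs used in this notebook; `mytokenize_compact` is a hypothetical name used only for illustration.

```python
def mytokenize_compact(s):
    # hypothetical compact variant of mytokenize, for illustration only
    if not isinstance(s, str):
        return []
    # lowercase, split on whitespace, trim leading brackets and trailing punctuation
    words = [w.lstrip('([{<').rstrip('.?!,:;)]}>') for w in s.lower().split()]
    # drop tokens that became empty after trimming
    return [w for w in words if w]

tokens = mytokenize_compact('Do you have slim fit shirts, maybe (orange)?')
print(tokens)
```

Punctuation inside a word (e.g. `t-shirt`) is left untouched, so the same tokenizer should be applied to both the corpus and the queries.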



In [11]:
"""
plpgsql functions for bm25Okapi search
"""

# bm25importwsmap(): imports wsmap from csv file (created by mybm25okapi.exportwsmap)
msq("""
DROP FUNCTION IF EXISTS bm25importwsmap;
CREATE OR REPLACE FUNCTION bm25importwsmap(tablename_bm25wsmap TEXT, csvpath TEXT) RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
  sql_statement TEXT := '';
BEGIN
  sql_statement := 'DROP TABLE IF EXISTS ' || tablename_bm25wsmap || ';';
  EXECUTE sql_statement;
  sql_statement := 'CREATE TABLE ' || tablename_bm25wsmap || ' (word TEXT, vl double precision[]);';
  EXECUTE sql_statement;
  sql_statement := 'COPY ' || tablename_bm25wsmap || ' FROM ' || chr(39) || csvpath || chr(39) || ' DELIMITER ' || chr(39) || ';' || chr(39) || ' CSV HEADER;';
  EXECUTE sql_statement;
END;
$$
""")


# bm25scorerows(): returns the score rows (vl arrays) of the query words found in wsmap
msq("""
DROP FUNCTION IF EXISTS bm25scorerows;
CREATE OR REPLACE FUNCTION bm25scorerows(tablename TEXT, tokenizedquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
DECLARE
  w TEXT := '';
  sql_statement TEXT := '';
  tokenizedqueryjson JSON := tokenizedquery::JSON;
BEGIN
  FOR w IN SELECT * FROM json_array_elements_text(tokenizedqueryjson)
  LOOP
    sql_statement := 'SELECT vl FROM ' || tablename || ' WHERE word = $1';
    RETURN QUERY EXECUTE sql_statement USING w::TEXT;
  END LOOP;
END;
$$
""")




# bm25scoressum(): sums the score rows to one array with the document scores
msq("""
DROP FUNCTION IF EXISTS bm25scoressum;
CREATE OR REPLACE FUNCTION bm25scoressum(tablename TEXT, tokenizedquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
BEGIN
  DROP TABLE IF EXISTS xdocs;
  CREATE TABLE xdocs AS SELECT bm25scorerows(tablename, tokenizedquery);
  RETURN QUERY SELECT ARRAY_AGG(sum ORDER BY ord) FROM (SELECT ord, SUM(int) FROM xdocs, unnest(bm25scorerows) WITH ORDINALITY u(int, ord) GROUP BY ord);
END;
$$
""")


# bm25scunnest(): unnests the score array
msq("""
DROP FUNCTION IF EXISTS bm25scunnest;
CREATE OR REPLACE FUNCTION bm25scunnest(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT unnest(bm25scoressum(tablename,tokenizedquery));
END;
$$
""")


# bm25isc(): returns the index and score of the documents; index starts with 1
msq("""
DROP FUNCTION IF EXISTS bm25isc;
CREATE OR REPLACE FUNCTION bm25isc(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(id BIGINT, score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT row_number() OVER () AS id, bm25scunnest FROM bm25scunnest(tablename,tokenizedquery) ;
END;
$$
""")


# bm25topk(): returns the index, score and document sorted and limited
msq("""
DROP FUNCTION IF EXISTS bm25topk;
CREATE OR REPLACE FUNCTION bm25topk(tablename TEXT, tablename_bm25wsmap TEXT, tokenizedquery TEXT, k INT) RETURNS TABLE(id INT, score double precision, doc TEXT)
LANGUAGE plpgsql
AS $$
DECLARE
  sql_statement TEXT := '';
BEGIN
  sql_statement := 'SELECT t1.id, t2.score, t1.full_description AS doc FROM (SELECT id, full_description FROM ' || tablename || ') t1 INNER JOIN ( SELECT id, score FROM bm25isc($1,$2) ) t2 ON ( t1.id = t2.id ) ORDER BY t2.score DESC LIMIT $3;';
  RETURN QUERY EXECUTE sql_statement USING tablename_bm25wsmap, tokenizedquery, k;
END;
$$
""")



 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
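
The chain `bm25scorerows` -> `bm25scoressum` -> `bm25scunnest` -> `bm25isc` -> `bm25topk` can be read as the following Python sketch. It mirrors the steps, not the SQL itself, and assumes that document ids are 1..n in the same order as the score arrays; `topk_like_sql`, `toy_docs` and `toy_wsmap` are made-up names and values.

```python
def topk_like_sql(docs, wsmap, tokenizedquery, k):
    # bm25scorerows: one score array per query word found in wsmap
    rows = [wsmap[w] for w in tokenizedquery if w in wsmap]
    # bm25scoressum: element-wise sum of the arrays (position = document)
    sums = [sum(col) for col in zip(*rows)]
    # bm25scunnest + bm25isc: number the scores, starting with 1
    id_scores = list(enumerate(sums, start=1))
    # bm25topk: join with the documents by id, sort by score descending, limit to k
    joined = [(i, s, docs[i - 1]) for i, s in id_scores]
    return sorted(joined, key=lambda r: r[1], reverse=True)[:k]

toy_docs = ['red shirt', 'blue jeans', 'red hat']
toy_wsmap = {
    'red':  [1.0, 0.0, 1.0],
    'hat':  [0.0, 0.0, 2.0],
    'blue': [0.0, 3.0, 0.0],
}
result = topk_like_sql(toy_docs, toy_wsmap, ['red', 'hat', 'scarf'], 2)
print(result)
```

Query words that are missing from the map (`scarf` here) simply contribute nothing, which matches the `WHERE word = $1` lookup returning no row.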


## BM25Okapi methods test

In [12]:
from rank_bm25 import BM25Okapi
import json


# table and file names
tablename = 'items'
tablename_bm25wsmap = tablename+'_bm25wsmap'
csvfilepath = '/content/'+tablename_bm25wsmap+'.csv'

# preparing tokenized corpus
tokenized_corpus = [ mytokenize(item['full_description']) for item in items ]

# rank_bm25 and mybm25okapi
rank_bm25_index = BM25Okapi(tokenized_corpus)
mybm25_index = mybm25okapi(tokenized_corpus)

# Postgres
# Export wsmap to CSV
mybm25_index.exportwsmap( csvfilepath )
# Import wsmap to Postgres from CSV
msq('SELECT bm25importwsmap(\''+tablename_bm25wsmap+'\',\''+csvfilepath+'\');')


# Running the questions
for qi,q in enumerate(questions) :

  # tokenize and print question
  tokenizedquestion = mytokenize(q)
  print('\n----Question',qi,':',q,' | Tokenized: ',tokenizedquestion)
  # questionsolutions is only defined by Test dataset 3, so look it up defensively
  solutions = globals().get('questionsolutions') or []
  if qi<len(solutions) :
    print('Solution ID:',solutions[qi])

  # rank_bm25 BM25 search
  doc_scores = rank_bm25_index.get_scores( tokenizedquestion )
  bres = [ [i,s] for i,s in enumerate(doc_scores) ]
  bres.sort(key=lambda x:x[1],reverse=True)
  bres = bres[:10]

  # mybm25okapi BM25 search
  bres2 = mybm25_index.topk( tokenizedquestion, 10 )

  # Postgres BM25 search
  msq('SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(\''+tablename+'\', \''+tablename_bm25wsmap+'\',\''+json.dumps(tokenizedquestion).replace("'","\'\'")+'\', 10);')

  # Print rank_bm25, mybm25okapi results
  # the corpus may contain fewer than 10 documents
  for k in range(0,min(10,len(bres))):
    print( '|rank_bm25  |', bres[k][0]+1,  math.floor(bres[k][1]*10e5)/10e5,  items[bres[k][0] ]['full_description'] )
    print( '|mybm25okapi|', bres2[k][0]+1, math.floor(bres2[k][1]*10e5)/10e5, items[bres2[k][0]]['full_description'] )
    print(' ')


 * postgresql+psycopg2://@/postgres
  bm25importwsmap
0                

----Question 0 : get_categories hierarchical order like wp_list_categories - with name, slug & link to edit cat  | Tokenized:  ['get_categories', 'hierarchical', 'order', 'like', 'wp_list_categories', '-', 'with', 'name', 'slug', '&', 'link', 'to', 'edit', 'cat']
Solution ID: 5
 * postgresql+psycopg2://@/postgres
   id      score                                                doc
0   5  13.668083  I need to find a way to list all categories - ...
1  10   2.762337  Currently I have a header file checking to see...
2   6   2.245434  I've been searching and searching for a while ...
3   4   1.785348  I'm doing a function which requires the use of...
4   7   1.741635  I'm displaying posts from a category and want ...
5   1   1.725356  Hey I am new to wordpress and web design in ge...
6   8   1.717292  I have a plugin that has 3 pages. One of those...
7   9   0.791736  The majority of pages of one of my sites can o...
