<a href="https://colab.research.google.com/github/jankovicsandras/plpgsql_bm25/blob/main/plpgsql_bm25_dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

README.md

# plpgsql_bm25
## BM25 search implemented in PL/pgSQL

----
### News
 - Proof of concept works.
 - PL/pgSQL BM25 search functions work in Postgres without any extensions / Rust
 - BM25 index builder works in Python

### Roadmap / TODO
 - PL/pgSQL index builder (or other languages e.g. JavaScript)
 - ```bm25topk()``` should use dynamic column names, not the fixed ```id```, ```full_description```
 - implement other algorithms from rank_bm25, not just Okapi

----
### Contributions welcome!
The author is not a Postgres / PL/pgSQL expert and gladly accepts optimizations and constructive criticism.

----
### Usage in a nutshell
Python index building:
```python
# build BM25 index
mybm25_index = mybm25okapi(tokenized_corpus)
# export wsmap to CSV
mybm25_index.exportwsmap( csvfilepath )
# import wsmap to Postgres from CSV
msq('SELECT bm25importwsmap(\''+tablename_bm25wsmap+'\',\''+csvfilepath+'\');')
```
Postgres search:
```python
msq('SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(\''+tablename+'\', \''+tablename_bm25wsmap+'\',\''+json.dumps(tokenizedquestion).replace("'","\'\'")+'\', 10);')
```
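The concatenated query string above relies on manual quote escaping. A minimal sketch of an alternative, assuming a psycopg2-style driver with `%s` placeholders (psycopg2 is used elsewhere in this notebook). `build_topk_query` and the table name `items_bm25wsmap` are hypothetical and not part of the repo:

```python
import json

def build_topk_query(tablename, tablename_bm25wsmap, tokenizedquestion, k=10):
    """Return (sql, params) for a bm25topk() call that uses driver-side parameter binding."""
    sql = (
        "SELECT bm25topk.id, bm25topk.score, bm25topk.doc "
        "FROM bm25topk(%s, %s, %s, %s);"
    )
    # bm25topk() takes the tokenized query as a JSON array passed in a TEXT argument
    params = (tablename, tablename_bm25wsmap, json.dumps(tokenizedquestion), k)
    return sql, params

# hypothetical table names and query tokens
sql, params = build_topk_query("items", "items_bm25wsmap", ["wool", "socks"], 5)
print(sql)
print(params)
# With an open psycopg2 cursor this could be executed as: cur.execute(sql, params)
```

The driver then quotes the values itself, so apostrophes in query words need no special handling.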

----
### What is this?
 - https://en.wikipedia.org/wiki/Okapi_BM25
 - https://en.wikipedia.org/wiki/PL/pgSQL
 - https://github.com/dorianbrown/rank_bm25
 - TLDR:
    - BM25Okapi is a popular search algorithm.
    - Index building: The starting point is a list of texts (documents) called the corpus. Each document is split into words (tokens) by a tokenization function; the simplest one splits on whitespace characters. The algorithm then builds a word-score-map ```wsmap```, in which every word in the corpus gets a score for every document. The score combines how rare the word is in the corpus with how frequent it is in the given document.
    - Search: the question text (or query string) is tokenized, then the search function looks up the words in ```wsmap``` and sums the scores for each document; the result is a list of scores, one for each document. The highest scoring document is the best match. The search function sorts the (score, document ID) pairs by score in descending order.
    - The ```wsmap``` is stored in a simple dict in Python ``` { 'word1': [doc1score, doc2score, ... ], 'word2':[doc1score, doc2score, ... ], ... }``` and a simple table in Postgres ```|word TEXT|vl double precision[]|``` where ```vl == [doc1score, doc2score, ... ]```.
    - Adding a new document to the corpus or changing one requires rebuilding the whole BM25 index (```wsmap```), because of how the algorithm works.
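
The search step described above can be sketched in a few lines of Python. This is a toy illustration, not the repo's code: the scores in `wsmap` are made-up numbers and `search` is a hypothetical helper.

```python
# Toy word-score-map for 3 documents; the scores are made-up illustrative numbers
wsmap = {
    "hat":   [0.9, 0.0, 0.5],
    "green": [0.0, 0.7, 0.5],
    "shirt": [0.0, 0.5, 0.0],
}

def search(wsmap, tokenized_query, num_docs):
    """Sum the per-document scores of every query word found in wsmap."""
    scores = [0.0] * num_docs
    for word in tokenized_query:
        # words that are not in the index contribute nothing
        for i, s in enumerate(wsmap.get(word, [])):
            scores[i] += s
    # (document index, score) pairs, best match first
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

ranking = search(wsmap, ["green", "hat", "unknownword"], 3)
print(ranking)
# document order: 2, 0, 1 (document 2 contains both "green" and "hat")
```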

----
### Repo contents
 - ```plpgsql_bm25_dev.ipynb``` : Jupyter notebook where I develop this.
 - ```mybm25okapi.py``` : Python BM25 index builder, see also https://github.com/dorianbrown/rank_bm25
 - ```plpgsql_bm25.sql``` : PL/pgSQL functions for search

----
### Why?
Postgres already has Full Text Search, and there are several extensions that implement BM25. But Full Text Search is not the same as BM25, and the BM25 extensions are written in Rust, which might not be available or practical to install, especially in hosted environments. See the Alternatives section for more info.

----
### Alternatives:

 - Postgres Full Text Search
   - https://www.postgresql.org/docs/current/textsearch.html
   - https://postgresml.org/blog/postgres-full-text-search-is-awesome


 - Rust based BM25
   - https://github.com/paradedb/paradedb/tree/dev/pg_search#overview
   - https://github.com/tensorchord/pg_bestmatch.rs


 - Postgres similarity of text using trigram matching
   - https://www.postgresql.org/docs/current/pgtrgm.html

   - NOTE: trigram matching is useful for fuzzy string matching, such as spelling correction, but it is not a query->document search solution by itself. Because a short query and a long document differ greatly in length, the relative trigram overlap becomes very small, which leads to incorrect or missing matches.
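
The length effect can be illustrated with a simplified trigram similarity in Python. This sketch only approximates what pg_trgm does (the padding rule and the shared/union ratio are assumptions, and punctuation handling is left out); `trigrams` and `similarity` are hypothetical helpers.

```python
def trigrams(text):
    """Set of character trigrams per word, with padding that roughly imitates pg_trgm (assumption)."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both texts."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = "wool socks"
short_doc = "wool socks"
long_doc = "warm wool socks for winter hiking available in grey black and brown pack of three"

print(similarity(query, short_doc))  # identical texts
print(similarity(query, long_doc))   # same words present, but diluted by the long document
```

Both documents contain every query word, yet the long one scores far lower because its many other trigrams enlarge the union.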

----
### Special thanks to: dorianbrown, Myon, depesz, sobel, ilmari, xiaomiao and others from #postgresql

----
### LICENSE

As https://github.com/dorianbrown/rank_bm25 has Apache-2.0 license, the derived mybm25okapi class should probably have Apache-2.0 license. The test datasets and other external code might have different licenses, please check them.

My code:

The Unlicense / PUBLIC DOMAIN

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to http://unlicense.org


## installing PostgreSQL

In [None]:
! sudo apt install gnupg2 wget nano
! sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
! curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
! sudo apt update
! sudo apt install postgresql-16 postgresql-contrib-16 postgresql-server-dev-16



In [None]:
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"

 * Starting PostgreSQL 16 database server
   ...done.
CREATE ROLE


In [None]:
! pip install rank_bm25
! pip install psycopg2

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


## testing postgres, SqlMagic, msq()

In [None]:
# set connection
%load_ext sql
%config SqlMagic.feedback=False
%config SqlMagic.autopandas=True
%sql postgresql+psycopg2://@/postgres

# testing postgres
#df = %sql SELECT * FROM pg_catalog.pg_tables
#print(df)

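# msq(): run a SQL string through the %sql magic, print and return the result
# (a pandas DataFrame, because SqlMagic.autopandas is enabled above)
def msq(s) :
  res = %sql $s
  print(res)
  return res

# msq2(): run a SQL string through psycopg2 directly and print every result row
# (psycopg2 is imported in the next cell, so call msq2() only after running it)
def msq2(t) :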
  with psycopg2.connect("dbname=postgres user=root") as conn:
    with conn.cursor() as cur:
      cur.execute(t)
      res = cur.fetchall()
      for r in res:
          print(r)



In [None]:

# Note: this notebook uses the psycopg2 driver (not psycopg 3)
import psycopg2


# Connect to an existing database
with psycopg2.connect("dbname=postgres user=root") as conn:

    # Open a cursor to perform database operations
    with conn.cursor() as cur:

        # Execute a command: this creates a new table
        cur.execute("""
            DROP TABLE IF EXISTS test;
            CREATE TABLE test (
                id serial PRIMARY KEY,
                num integer,
                data text)
            """)

        # Pass data to fill a query placeholders and let Psycopg perform
        # the correct conversion (no SQL injections!)
        cur.execute(
            "INSERT INTO test (num, data) VALUES (%s, %s)",
            (100, "abc'def"))

        # Query the database and obtain data as Python objects.
        cur.execute("SELECT * FROM test")
        res = cur.fetchall()
        print(res)
        # fetchall() returns a list of row tuples, here [(1, 100, "abc'def")]

        # `cur.fetchmany()` returns rows in batches, and the cursor can also be
        # iterated directly. The rows were already consumed by `fetchall()`
        # above, so this loop prints nothing.
        for record in cur:
            print(record)

        # Make the changes to the database persistent
        conn.commit()



[(1, 100, "abc'def")]


## Test dataset 1: Flipkart fashion products from Kaggle
#### This works, but you need to download the file and uncomment this.

In [None]:
# Manually download this dataset from Kaggle, then upload archive.zip, then unpack.
# https://www.kaggle.com/datasets/aaditshukla/flipkart-fasion-products-dataset

"""

import random, json, time

# Loading the dataset from https://www.kaggle.com/datasets/aaditshukla/flipkart-fasion-products-dataset
filename = 'flipkart_fashion_products_dataset.json'
dataset = []
try :
  with open(filename) as f:
    dataset = json.load(f)
    print(filename,'is loaded, number of items: ',len(dataset))
except Exception as ex:
  print(ex)

# item fields to keep
tracked_fields = ['title','description','selling_price','average_rating','product_details','brand','seller','actual_price','discount','category','sub_category','_id','product_details_string','full_description']

# number of items to sample from the dataset
item_num = 100
items = random.sample(dataset,item_num)

# Preprocess: fix 'description', create 'product_details_string', 'full_description'
for i, item in enumerate(items) :
  if len(item['title'].strip()) < 1 :
    item['title'] = 'Unknown item'
  if len(item['description'].strip()) < 1 :
    item['description'] = item['title']
    if len(item['brand'].strip()) > 0 :
      item['description'] += ' from ' + item['brand'].strip()
  item['product_details_string'] = ''
  if 'product_details' in item and len(item['product_details']) > 0 :
    for pd in item['product_details'] :
      for k in pd :
        item['product_details_string'] += k.strip()+': '+pd[k].strip()+'. '
    item['product_details_string'] = item['product_details_string'].strip()
  item['full_description'] = ''
  item['full_description'] += item['title'].strip() + ' ; '
  item['full_description'] += item['description'].strip() + ' ; '
  item['full_description'] += 'Selling price: '+item['selling_price'].strip() + ' ; '
  item['full_description'] += 'Average rating: '+item['average_rating'].strip() + ' ; '
  item['full_description'] += item['product_details_string'].strip() + ' ; '
  item['full_description'] += 'Brand: '+item['brand'].strip() + ' ; '
  item['full_description'] += 'Seller: '+item['seller'].strip() + ' ; '
  item['full_description'] += 'Actual price: '+item['actual_price'].strip()+' Discount: '+item['discount'].strip() + ' ; '
  item['full_description'] += 'Category: '+item['category'].strip()+' | Sub-category: '+item['sub_category'].strip() + ' ; '
  item['full_description'] += 'ID: '+item['_id'].strip()
  item['full_description'] = item['full_description'].replace('\n',' ')
  item['full_description'] = item['full_description'].replace(',',';')
  item['full_description'] = item['full_description'].strip()

# Print
#for i, item in enumerate(items) :
#  print(str(i),item['full_description'])

csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')

# Creating items table
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')

#
questions = [
  'I want a t-shirt, preferably grey.',
  'Do you have slim fit shirts, maybe orange?',
  'I need a gift for my boss.',
  'Do you sell wool socks?',
  'Where can I buy black jeans or pants?',
  'I want a blazer.'
]

"""


## Test dataset 2: generated items in several languages
#### This works, but is not optimal for testing.

In [None]:
import random

# TODO: better multi language generation
i18n = {
  'en':{
    'colors': ['black','blue','green','cyan','red','magenta','brown','light grey','dark grey','bright blue'],
    'itemtypes': [ 'belt','cap','hat','jeans','jumper', 'shirt','shorts','sneakers','suit','tie' ],
    'adjs' : ['Fantastic','Cool','Superb','Awesome','Trendy'],
    'insizestr': ' in size ',
    'pricestr': '. Price: ',
    'currencystr': ' USD.',
    'questions': [
      'I want to buy a hat. What colors do you have?',
      'Can you recommend something green?',
      'Do you have shirts under 50 USD?',
      'What do you have in size 40?',
      'I would like to buy sneakers for my friend. Do you have something in size 46, preferably cyan or blue?',
      'What can you recommend in red?'
    ]
  },
  'hu':{
    'colors': ['fekete','kék','zöld','zöldeskék','piros','lila','barna','világosszürke','sötétszürke','ragyogó kék'],
    'itemtypes': [ 'öv','sapka','kalap','farmer','pullóver', 'ing','rövidnadrág','tornacipő','öltöny','nyakkendő' ],
    'adjs': ['Csodálatos','Menő','Szuper','Király','Trendi'],
    'insizestr': '. Méret: ',
    'pricestr': '. Ár: ',
    'currencystr': ' Ft.',
    'questions': [
      'Kalapot szeretnék. Milyen színek vannak?',
      'Tudsz-e ajánlani valami zöldet?',
      'Vannak ingek 50 Ft. alatt?',
      'Mik vannak 40-es méretben?',
      'Tornacipőt szeretnék a barátomnak. Van valami 46-os méretben, lehetőleg zöldeskék vagy kék?',
      'Mit tudsz ajánlani pirosban?'
    ]
  },
  'no':{
    'colors': ['svart','blå','grøn','grønblå','rød','lilla','brun','lysgrå','mørkgrå','lysblå'],
    'itemtypes': [ 'belt','lue','hatt','bukser','genser', 'skjorte','shorts','sko','dress','slips' ],
    'adjs': ['Fantastisk','Kult','Supert','Tøff','Trendy'],
    'insizestr': ' i størrelse ',
    'pricestr': '. Pris: ',
    'currencystr': ' kr.',
    'questions': [
      'Jeg vil kjøpe en hatt. Hva farger er det?',
      'Kan du anbefale noen grønt?',
      'Har de skjorter under 50 kr?',
      'Hva har de i størrelse 40?',
      'Jeg vil gjerne kjøpe sko til min venn. Har de nokre i størrelse 46, helst grønblå eller blå?',
      'Hva kan du anbefale i rødt?'
    ]
  }
}

sizes = [str(30+x*2) for x in range(0,10)]

lang = 'no'

items = []
for c in i18n[lang]['colors'] :
  for s in sizes :
    for ii,i in enumerate(i18n[lang]['itemtypes']) :
      items.append( { 'full_description': random.choice(i18n[lang]['adjs'])+' '+ c+' '+ i+ i18n[lang]['insizestr']+s+
                      i18n[lang]['pricestr']+str(int(s)+20+5*ii)+i18n[lang]['currencystr'] } )

#print(items)


# export to CSV for Postgres
csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')


# Postgres creating items table by importing from CSV
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')


# test questions
questions = i18n[lang]['questions']


 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []


## Test dataset 3: WordPress-related QA from Hugging Face

In [None]:

! wget https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/corpus.jsonl
! wget https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/queries.jsonl
! ls -la

import random, json

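# load from jsonl, skipping empty lines (e.g. after a trailing newline),
# because json.loads('') would fail if such a line were sampled below
wcorpus = []
with open('corpus.jsonl') as f:
  wcorpus = [ line for line in f.read().split('\n') if line.strip() ]
print('len(wcorpus)',len(wcorpus))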

# create sampled corpus, items, questions
sampledwcorpus = random.sample(wcorpus,20)
items = []
qqs = []
for i in range(0,len(sampledwcorpus)) :
  wjs = json.loads(sampledwcorpus[i])
  print(i,'---------------',wjs['_id'])
  print(len(wjs['title']),wjs['title'])
  print(len(wjs['text']),wjs['text'])
  items.append( { 'doctext': wjs['text'] } )
  qqs.append( [wjs['title'],i] )

# questions and solutions
random.shuffle(qqs)
questions = [ q[0] for q in qqs ]
questionsolutions = [ q[1]+1 for q in qqs ]

# export to CSV for Postgres
csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;doctext\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['doctext'].replace('"','\'')+'"\n')


# Postgres creating items table by importing from CSV
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, doctext TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')
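
In this dataset `questions` holds the shuffled titles and `questionsolutions` the 1-based `id` of the document each title belongs to, so search results can be checked against the expected document. A minimal sketch of such a check, assuming the search yields a ranked list of document ids per question; `hit_rate_at_k` and the sample rankings are hypothetical and not part of the notebook:

```python
def hit_rate_at_k(ranked_ids_per_question, solutions, k=1):
    """Fraction of questions whose expected document id appears in the top k results."""
    if not solutions:
        return 0.0
    hits = sum(
        1
        for ranked, expected in zip(ranked_ids_per_question, solutions)
        if expected in ranked[:k]
    )
    return hits / len(solutions)

# made-up rankings for three questions (document ids are 1-based, like items.id)
rankings = [[3, 1, 2], [2, 3, 1], [2, 1, 3]]
solutions = [3, 1, 2]

print(hit_rate_at_k(rankings, solutions, k=1))  # 2 of 3 questions rank the expected document first
print(hit_rate_at_k(rankings, solutions, k=3))  # every expected document is within the top 3
```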



## Test dataset 4: Oversimplified

In [None]:
# This works
"""
corpus = ['one two five', 'two three. Two TWO ','three? (THREE) three thReE, FOUr one;','one [one three] fivE?']
questions = ['Three two?','One.','<two ONE four>']
items = [ { 'full_description': ct } for ct in corpus ]

# export to CSV for Postgres
csvfilename = 'items.csv'
with open(csvfilename,'w+') as f:
  f.write('id;full_description\n')
  for i in range(0,len(items)) :
    f.write(str((i+1))+';"'+items[i]['full_description'].replace('"','\'')+'"\n')


# Postgres creating items table by importing from CSV
msq('DROP TABLE IF EXISTS items;')
msq('CREATE TABLE items (id SERIAL, full_description TEXT);')
msq('COPY items FROM \'/content/items.csv\' DELIMITER \';\' CSV HEADER;')
#msq('SELECT * from items;')
"""


## mybm25okapi: a refactored variant of rank_bm25 Okapi

In [None]:
"""
This is a refactored variant of rank_bm25 Okapi.

Usage:
  - corpus and query must be tokenized already, e.g. corpus = [ ['one','two','three'], ['bla','two','two'] ]  ; query = [ 'Is', 'this', 'a', 'question?' ]
  - __init__(corpus) will initialize the bm25Okapi components, where self.wsmap is the most important
  - No update is possible, so if the documents change in the corpus, then __init__(corpus) must be called again (recreating all the components).
  - search with topk() or get_scores()

Usage with Postgres:
  - corpus and query must be tokenized already, e.g. corpus = [ ['one','two','three'], ['bla','two','two'] ]  ; query = [ 'Is', 'this', 'a', 'question?' ]
  - __init__(corpus) will initialize the bm25Okapi components, where self.wsmap is the most important
  - No update is possible, so if the documents change in the corpus, then __init__(corpus) must be called again (recreating all the components).
  - call exportwsmap() after init, then import wsmap into a Postgres table: COPY tablename_bm25wsmap FROM '/path-to/tablename_bm25wsmap.csv' DELIMITER ';' CSV HEADER;
  - search in Postgres by calling the plpgsql functions: SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(tablename, tablename_bm25wsmap, query, 10);
"""
import math


class mybm25okapi:
  def __init__(self, corpus):
    # constants
    self.debugmode = False
    self.k1 = 1.5
    self.b = 0.75
    self.epsilon = 0.25

    self.corpus_len = len(corpus)
    self.avg_doc_len = 0
    self.word_freqs = []
    self.idf = {}
    self.doc_lens = []
    word_docs_count = {}  # word -> number of documents with word
    total_word_count = 0

    for document in corpus:
      # doc lengths and total word count
      self.doc_lens.append(len(document))
      total_word_count += len(document)

      # word frequencies in this document
      frequencies = {}
      for word in document:
        if word not in frequencies:
          frequencies[word] = 0
        frequencies[word] += 1
      self.word_freqs.append(frequencies)

      # number of documents with word count
      for word, freq in frequencies.items():
        try:
          word_docs_count[word] += 1
        except KeyError:
          word_docs_count[word] = 1

    # average document length
    self.avg_doc_len = total_word_count / self.corpus_len

    if self.debugmode : print('self.corpus_len',self.corpus_len,'\nself.doc_lens',self.doc_lens,'\ntotal_word_count',total_word_count,'\nself.word_freqs',self.word_freqs,'\nself.avg_doc_len',self.avg_doc_len,'\nword_docs_count',word_docs_count)

    # precalc "half of divisor" + self.k1 * (1 - self.b + self.b * doc_lens / self.avg_doc_len)
    self.hds = [ self.k1 * ( 1-self.b + self.b*doc_len/self.avg_doc_len) for doc_len in self.doc_lens ]
    if self.debugmode : print('self.hds',self.hds)

    """
    Calculates frequencies of terms in documents and in corpus.
    This algorithm sets a floor on the idf values to eps * average_idf
    """
    # collect idf sum to calculate an average idf for epsilon value
    # collect words with negative idf to set them a special epsilon value.
    # idf can be negative if word is contained in more than half of documents
    idf_sum = 0
    negative_idfs = []
    for word, freq in word_docs_count.items():
      #print('word',word,'freq',freq,'corpus_len',self.corpus_len,'self.corpus_len - freq + 0.5',self.corpus_len - freq + 0.5,'freq + 0.5',freq + 0.5,'ml1',math.log(self.corpus_len - freq + 0.5),'ml2',math.log(freq + 0.5))
      idf = math.log(self.corpus_len - freq + 0.5) - math.log(freq + 0.5)
      self.idf[word] = idf
      idf_sum += idf
      if idf < 0:
        negative_idfs.append(word)
      if self.debugmode : print('word',word,'self.idf[word]',self.idf[word])
    self.average_idf = idf_sum / len(self.idf)
    if self.debugmode : print('idf_sum',idf_sum,'len(self.idf)',len(self.idf),'self.average_idf',self.average_idf)
    # assign epsilon
    eps = self.epsilon * self.average_idf
    if self.debugmode : print('eps',eps)
    for word in negative_idfs:
      self.idf[word] = eps
      if self.debugmode : print('word',word,'got eps',eps)
    if self.debugmode : print('self.idf',self.idf)


    # words * documents score map
    self.wsmap = {}
    for word in self.idf :
      self.wsmap[word] = [0] * self.corpus_len
      word_freqs = [ (word_freq.get(word) or 0) for word_freq in self.word_freqs ]
      thiswordidf = (self.idf.get(word) or 0)
      if self.debugmode : print('word in self.idf',word,'thiswordidf',thiswordidf,'word_freqs',word_freqs)
      for i in range(0,self.corpus_len) :
        self.wsmap[word][i] = thiswordidf * ( word_freqs[i] * (self.k1 + 1) / ( word_freqs[i] + self.hds[i] ) ) # += replaced with =
    if self.debugmode : print('self.wsmap',self.wsmap)


  # get a list of scores for every document
  def get_scores(self, tokenizedquery):
    # zeroes list of scores
    scores = [0] * self.corpus_len
    # for each word in tokenizedquery, if word is in wsmap, lookup and add word score for every documents' scores
    for word in tokenizedquery:
      if word in self.wsmap :
        for i in range(0,self.corpus_len) :
          scores[i] += self.wsmap[word][i]
    # return scores list (not sorted)
    return scores


  def topk(self,tokenizedquery,k=None):
    docscores = self.get_scores( tokenizedquery )
    sisc = [ [i,s] for i,s in enumerate(docscores) ]
    sisc.sort(key=lambda x:x[1],reverse=True)
    if k :
      sisc = sisc[:k]
    return sisc


  # save the words*documents score map as csv for import to Postgres: COPY tablename_bm25wsmap FROM '/path-to/tablename_bm25wsmap.csv' DELIMITER ';' CSV HEADER;
  def exportwsmap(self, csvfilename) :
    with open(csvfilename,'w+') as f:
      f.write('word;vl\n')
      for word in self.wsmap :
        f.write('"'+word.replace('"','\'')+'";{'+str(self.wsmap[word]).strip()[1:-1]+'}\n')




# tokenization function
def mytokenize(s) :
  ltrimchars = ['(','[','{','<','\'','"']
  rtrimchars = ['.', '?', '!', ',', ':', ';', ')', ']', '}', '>','\'','"']
  if type(s) != str : return []
  wl = s.lower().split()
  for i,w in enumerate(wl) :
    if len(w) < 1 : continue
    si = 0
    ei = len(w)
    try :
      while si < ei and w[si] in ltrimchars : si += 1
      while ei > si and w[ei-1] in rtrimchars : ei -= 1
      wl[i] = wl[i][si:ei]
    except Exception as ex:
      print('|',w,'|',ex,'|',wl)
  wl = [ w for w in wl if len(w) > 0 ]
  return wl
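
For reference, the scoring that `mybm25okapi` precomputes into `wsmap` and sums in `get_scores()` can be written out as formulas. This is a transcription of the code above, not an independent derivation. The symbols are shorthand introduced here: $N$ is `corpus_len`, $n(w)$ is `word_docs_count[w]`, $f(w,d)$ is the count of word $w$ in document $d$, $|d|$ is the document length in tokens, $\mathrm{avgdl}$ is `avg_doc_len`, $V$ is the set of all words in the corpus, and $k_1 = 1.5$, $b = 0.75$, $\varepsilon = 0.25$.

$$
r(w) = \ln\bigl(N - n(w) + 0.5\bigr) - \ln\bigl(n(w) + 0.5\bigr),
\qquad
\bar{r} = \frac{1}{|V|} \sum_{w \in V} r(w)
$$

$$
\mathrm{idf}(w) =
\begin{cases}
r(w) & \text{if } r(w) \ge 0 \\
\varepsilon \cdot \bar{r} & \text{if } r(w) < 0
\end{cases}
$$

$$
\mathrm{wsmap}[w][d] = \mathrm{idf}(w) \cdot \frac{f(w,d)\,(k_1 + 1)}{f(w,d) + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}
$$

$$
\mathrm{score}(q, d) = \sum_{w \in q} \mathrm{wsmap}[w][d]
$$

The sum runs over the query tokens as given, so a word that occurs twice in the query is counted twice, and query words missing from `wsmap` contribute nothing.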



In [None]:
"""
plpgsql functions for bm25Okapi search
"""
# TODO: JSONB instead of JSON


################################################################################
#
#  "Old" search for imported index (without tokenizer)
#
################################################################################


# bm25importwsmap(): imports wsmap from csv file (created by mybm25okapi.exportwsmap)
msq("""
DROP FUNCTION IF EXISTS bm25importwsmap;
CREATE OR REPLACE FUNCTION bm25importwsmap(tablename_bm25wsmap TEXT, csvpath TEXT) RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
  sql_statement TEXT := '';
BEGIN
  sql_statement := 'DROP TABLE IF EXISTS ' || tablename_bm25wsmap || ';';
  EXECUTE sql_statement;
  sql_statement := 'CREATE TABLE ' || tablename_bm25wsmap || ' (word TEXT, vl double precision[]);';
  EXECUTE sql_statement;
  sql_statement := 'COPY ' || tablename_bm25wsmap || ' FROM ' || chr(39) || csvpath || chr(39) || ' DELIMITER ' || chr(39) || ';' || chr(39) || ' CSV HEADER;';
  EXECUTE sql_statement;
END;
$$
""")


# bm25scorerows(): returns the document scores row (vl) for each query word found in the wsmap table
msq("""
DROP FUNCTION IF EXISTS bm25scorerows;
CREATE OR REPLACE FUNCTION bm25scorerows(tablename TEXT, tokenizedquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
DECLARE
  w TEXT := '';
  sql_statement TEXT := '';
  tokenizedqueryjson JSON := tokenizedquery::JSON;
BEGIN
  FOR w IN SELECT * FROM json_array_elements_text(tokenizedqueryjson)
  LOOP
    sql_statement := 'SELECT vl FROM ' || tablename || ' WHERE word = $1';
    RETURN QUERY EXECUTE sql_statement USING w::TEXT;
  END LOOP;
END;
$$
""")




# bm25scoressum(): sums the score rows to one array with the document scores
msq("""
DROP FUNCTION IF EXISTS bm25scoressum;
CREATE OR REPLACE FUNCTION bm25scoressum(tablename TEXT, tokenizedquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
BEGIN
  DROP TABLE IF EXISTS xdocs;
  CREATE TABLE xdocs AS SELECT bm25scorerows(tablename, tokenizedquery);
  RETURN QUERY SELECT ARRAY_AGG(sum ORDER BY ord) FROM (SELECT ord, SUM(int) FROM xdocs, unnest(bm25scorerows) WITH ORDINALITY u(int, ord) GROUP BY ord);
END;
$$
""")


# bm25scunnest(): unnests the score array
msq("""
DROP FUNCTION IF EXISTS bm25scunnest;
CREATE OR REPLACE FUNCTION bm25scunnest(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT unnest(bm25scoressum(tablename,tokenizedquery));
END;
$$
""")


# bm25isc(): returns the index and score of the documents; index starts with 1
msq("""
DROP FUNCTION IF EXISTS bm25isc;
CREATE OR REPLACE FUNCTION bm25isc(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(id BIGINT, score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT row_number() OVER () AS id, bm25scunnest FROM bm25scunnest(tablename,tokenizedquery) ;
END;
$$
""")


# bm25topk(): returns the index, score and document sorted and limited
msq("""
DROP FUNCTION IF EXISTS bm25topk;
CREATE OR REPLACE FUNCTION bm25topk(tablename TEXT, tablename_bm25wsmap TEXT, tokenizedquery TEXT, k INT) RETURNS TABLE(id INT, score double precision, doc TEXT)
LANGUAGE plpgsql
AS $$
DECLARE
  sql_statement TEXT := '';
BEGIN
  sql_statement := 'SELECT t1.id, t2.score, t1.full_description AS doc FROM (SELECT id, full_description FROM ' || tablename || ') t1 INNER JOIN ( SELECT id, score FROM bm25isc($1,$2) ) t2 ON ( t1.id = t2.id ) ORDER BY t2.score DESC LIMIT $3;';
  RETURN QUERY EXECUTE sql_statement USING tablename_bm25wsmap, tokenizedquery, k;
END;
$$
""")


################################################################################
#
#  plpgsql bm25 index building
#
################################################################################


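# bm25simpletokenize(): split text to words on whitespace, lowercase, remove some punctuation, similar to mytokenize()
# raw string, so that the regex '\s+' reaches Postgres unchanged and Python does not warn about an invalid escape
funstr = r"""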
DROP FUNCTION IF EXISTS bm25simpletokenize;
CREATE OR REPLACE FUNCTION bm25simpletokenize(txt TEXT) RETURNS TEXT[]
LANGUAGE plpgsql
AS $$
DECLARE
  w TEXT;
  w2 TEXT;
  words TEXT[];
BEGIN
  FOREACH w IN ARRAY regexp_split_to_array(LOWER(txt), '\s+') LOOP
    w2 = RTRIM( LTRIM( w, '([{<"''' ), '.?!,:;)]}>"''' );
    IF LENGTH(w2) > 0 THEN
      words = array_append( words, w2 );
    END IF;
  END LOOP;
  RETURN words;
END;
$$
"""

def pqaddfun(t) :
  with psycopg2.connect("dbname=postgres user=root") as conn:
    with conn.cursor() as cur:
      cur.execute(t)
      conn.commit()

pqaddfun(funstr)

########################

# count_words_in_array() creates doc->words counts
msq("""
DROP FUNCTION IF EXISTS count_words_in_array;
CREATE OR REPLACE FUNCTION count_words_in_array(input_array text[]) RETURNS jsonb
LANGUAGE plpgsql
AS $$
DECLARE
    word_count jsonb := '{}';
    current_word text;
BEGIN
    FOREACH current_word IN ARRAY input_array LOOP
        IF word_count->>current_word IS NULL THEN
            word_count := jsonb_set( word_count, ARRAY[current_word], '1'::jsonb, true );
        ELSE
            word_count := jsonb_set( word_count, ARRAY[current_word], ((word_count->>current_word)::int + 1)::text::jsonb );
        END IF;
    END LOOP;
    RETURN word_count;
END;
$$;
""")


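# get_word_docs_count(): for every word in one document's word_freqs, inserts the word into the words table or increments its word_docs_count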
msq("""
DROP FUNCTION IF EXISTS get_word_docs_count;
CREATE OR REPLACE FUNCTION get_word_docs_count( wordstname TEXT, wf JSONB ) RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
  /*sql_statement TEXT := '';*/
  mkey TEXT;
BEGIN
  FOR mkey IN SELECT key FROM jsonb_each_text(wf) LOOP

    /*sql_statement := 'INSERT INTO ' || wordstname || '(word, word_docs_count) VALUES (' || chr(39) || mkey || chr(39) || ', COALESCE((SELECT word_docs_count FROM ' || wordstname || ' WHERE word = ' || chr(39) || mkey || chr(39) || ') ,1::INTEGER ) ) ON CONFLICT (word) DO UPDATE SET word_docs_count = (' || wordstname || '.word_docs_count + 1)::INTEGER;';
    EXECUTE sql_statement;*/

    EXECUTE FORMAT( 'INSERT INTO %s(word, word_docs_count) VALUES (%s, COALESCE((SELECT word_docs_count FROM %s WHERE word = %s) ,1::INTEGER ) ) ON CONFLICT (word) DO UPDATE SET word_docs_count = (%s.word_docs_count + 1)::INTEGER;', wordstname, quote_literal(mkey), wordstname, quote_literal(mkey), wordstname );

  END LOOP;
END;
$$;
""")


# get_wsmapobj(): returns the array of per-document scores for one word, ordered by document id
msq("""
DROP FUNCTION IF EXISTS get_wsmapobj;
CREATE OR REPLACE FUNCTION get_wsmapobj( docstname TEXT, word TEXT, thisidf DOUBLE PRECISION, thisk1 DOUBLE PRECISION ) RETURNS DOUBLE PRECISION[]
LANGUAGE plpgsql
AS $$
DECLARE
  /*sql_statement TEXT := '';*/
  res DOUBLE PRECISION[];
BEGIN
  /* self.wsmap[word][i] = thiswordidf * ( word_freqs[i] * (self.k1 + 1) / ( word_freqs[i] + self.hds[i] ) ) # += replaced with = */

  /*
  sql_statement := 'SELECT ARRAY_AGG( ' || thisidf || ' * COALESCE(word_freqs->>' || chr(39) || word || chr(39) || ',' || chr(39) || chr(48) || chr(39) || ')::INTEGER * ' || (thisk1+1)::DOUBLE PRECISION || ' / ( COALESCE(word_freqs->>' || chr(39) || word || chr(39) || ',' || chr(39) || chr(48) || chr(39) || ')::INTEGER + hds ) ) FROM ' || docstname || ';';
  EXECUTE sql_statement INTO res;*/

  EXECUTE FORMAT( 'SELECT ARRAY_AGG( %s * COALESCE(word_freqs->>%s,%s)::INTEGER * %s / ( COALESCE(word_freqs->>%s,%s)::INTEGER + hds ) ORDER BY id) FROM %s;', thisidf, quote_literal(word), quote_literal(0), (thisk1+1), quote_literal(word), quote_literal(0), docstname ) INTO res;

  RETURN res;
END;
$$;
""")


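# bm25createindex(): builds the <tablename>_<columnname>_bm25i_docs and <tablename>_<columnname>_bm25i_words tables for one text column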
msq("""
DROP FUNCTION IF EXISTS bm25createindex;
CREATE OR REPLACE FUNCTION bm25createindex(tablename TEXT, columnname TEXT) RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
  /*sql_statement TEXT := '';*/
  docstname TEXT := tablename || '_' ||  columnname || '_bm25i_docs';
  wordstname TEXT := tablename || '_' ||  columnname || '_bm25i_words';
  param_k1 DOUBLE PRECISION := 1.5;
  param_b DOUBLE PRECISION := 0.75;
  param_epsilon DOUBLE PRECISION := 0.25;
  corpus_len INTEGER;
  vocab_len INTEGER;
  total_word_count INTEGER;
  avg_doc_len DOUBLE PRECISION;
  idf_sum DOUBLE PRECISION;
  average_idf DOUBLE PRECISION;
  param_eps DOUBLE PRECISION;
BEGIN

  /* create bm25_params_debug table, this is only required for debugging. */
  /*
  DROP TABLE IF EXISTS bm25_params_debug;
  CREATE TABLE bm25_params_debug ( paramname TEXT PRIMARY KEY, value DOUBLE PRECISION );
  INSERT INTO bm25_params_debug(paramname,value) VALUES('param_k1',param_k1);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('param_b',param_b);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('param_epsilon',param_epsilon);
  */

  /* create docs table */

  /*sql_statement := 'DROP TABLE IF EXISTS ' || docstname || ';';*/

  EXECUTE FORMAT( 'DROP TABLE IF EXISTS %s;', docstname );

  /*sql_statement := 'CREATE TABLE ' || docstname || ' AS SELECT ' || columnname || ' AS doc, bm25simpletokenize(' || columnname || ') AS tokenized_doc FROM ' || tablename || ' ;';*/
  /*
  EXECUTE FORMAT( 'CREATE TABLE %s AS SELECT id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY, %s AS doc, bm25simpletokenize(%s) AS tokenized_doc FROM %s ;', docstname, columnname, columnname, tablename );
  */
  EXECUTE FORMAT( 'CREATE TABLE %s (id SERIAL PRIMARY KEY, doc TEXT, tokenized_doc TEXT[]);', docstname );
  EXECUTE FORMAT( 'INSERT INTO %s (doc, tokenized_doc) SELECT %s AS doc, bm25simpletokenize(%s) AS tokenized_doc FROM %s ;', docstname, columnname, columnname, tablename );

  /*EXECUTE FORMAT( 'ALTER TABLE %s ADD COLUMN id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY;', docstname );*/

  /* add doc_lens */

  /*sql_statement := 'ALTER TABLE ' || docstname || ' ADD COLUMN doc_lens INTEGER;';*/

  EXECUTE FORMAT( 'ALTER TABLE %s ADD COLUMN doc_lens INTEGER;', docstname );

  /*sql_statement := 'UPDATE ' || docstname || ' SET doc_lens=subquery.doc_lens FROM (SELECT tokenized_doc AS td, CARDINALITY(tokenized_doc) AS doc_lens FROM  ' || docstname || ') AS subquery WHERE tokenized_doc = subquery.td;';*/

  EXECUTE FORMAT( 'UPDATE %s SET doc_lens = CARDINALITY(tokenized_doc);', docstname );

  /* add word_freqs (JSONB word:count object) */

  /*sql_statement := 'ALTER TABLE ' || docstname || ' ADD COLUMN word_freqs JSONB;';*/

  EXECUTE FORMAT( 'ALTER TABLE %s ADD COLUMN word_freqs JSONB;', docstname );

  /*sql_statement := 'UPDATE ' || docstname || ' SET word_freqs=count_words_in_array(tokenized_doc);';*/

  EXECUTE FORMAT( 'UPDATE %s SET word_freqs=count_words_in_array(tokenized_doc);', docstname );

  /* total word count */

  /*sql_statement := 'SELECT SUM(doc_lens) FROM ' || docstname || ';';*/

  EXECUTE FORMAT( 'SELECT SUM(doc_lens) FROM %s;', docstname ) INTO total_word_count;

  /* this debug statement is not required */
  /*INSERT INTO bm25_params_debug(paramname,value) VALUES('total_word_count',total_word_count);*/

  /* create words table */

  /*sql_statement := 'DROP TABLE IF EXISTS ' || wordstname || ';';*/

  EXECUTE FORMAT( 'DROP TABLE IF EXISTS %s;', wordstname );

  /*sql_statement := 'CREATE TABLE ' || wordstname || ' ( word TEXT PRIMARY KEY, word_docs_count INTEGER, idf DOUBLE PRECISION );';*/

  EXECUTE FORMAT( 'CREATE TABLE %s ( word TEXT PRIMARY KEY, word_docs_count INTEGER, idf DOUBLE PRECISION );', wordstname );

  /* count docs with each word */

  /*sql_statement := 'SELECT get_word_docs_count( ' || chr(39) || wordstname || chr(39) || ', word_freqs ) FROM ' || docstname || ';';*/

  EXECUTE FORMAT('SELECT get_word_docs_count( %s, word_freqs ) FROM %s;', quote_literal(wordstname), docstname );

  /* self.avg_doc_len = total_word_count / self.corpus_len */

  /*sql_statement := 'SELECT COUNT(doc_lens) FROM ' || docstname || ' WHERE doc_lens > 0;';*/

  EXECUTE FORMAT( 'SELECT COUNT(doc_lens) FROM %s WHERE doc_lens > 0;', docstname ) INTO corpus_len;
  avg_doc_len := total_word_count::DOUBLE PRECISION / corpus_len::DOUBLE PRECISION;

  /* these debug statements are not required */
  /*INSERT INTO bm25_params_debug(paramname,value) VALUES('corpus_len',corpus_len);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('avg_doc_len',avg_doc_len);*/

  /*  # precalc "half of divisor" + self.k1 * (1 - self.b + self.b * doc_lens / self.avg_doc_len)  */

  /*sql_statement := 'ALTER TABLE ' || docstname || ' ADD COLUMN hds DOUBLE PRECISION;';*/

  EXECUTE FORMAT( 'ALTER TABLE %s ADD COLUMN hds DOUBLE PRECISION;', docstname );

  /*sql_statement := 'UPDATE ' || docstname || ' SET hds = ' || param_k1 || ' * ( 1.0::DOUBLE PRECISION - ' || param_b || ' + ' || param_b || ' * doc_lens / ' || avg_doc_len || ') ;';*/

  EXECUTE FORMAT( 'UPDATE %s SET hds = %s * ( 1.0::DOUBLE PRECISION - %s + %s * doc_lens / %s ) ;', docstname, param_k1, param_b, param_b, avg_doc_len );


  /* idf = math.log(self.corpus_len - freq + 0.5) - math.log(freq + 0.5) ; self.idf[word] = idf ; idf_sum += idf */

  /*sql_statement := 'UPDATE ' || wordstname || ' SET idf = LN( ' || corpus_len::DOUBLE PRECISION || ' - word_docs_count::DOUBLE PRECISION + 0.5::DOUBLE PRECISION) - LN( word_docs_count::DOUBLE PRECISION + 0.5::DOUBLE PRECISION)';*/

  EXECUTE FORMAT( 'UPDATE %s SET idf = LN( %s - word_docs_count::DOUBLE PRECISION + 0.5::DOUBLE PRECISION) - LN( word_docs_count::DOUBLE PRECISION + 0.5::DOUBLE PRECISION);', wordstname, corpus_len::DOUBLE PRECISION );

  /*sql_statement := 'SELECT SUM(idf) FROM ' || wordstname || ';';*/

  EXECUTE FORMAT( 'SELECT SUM(idf) FROM %s;', wordstname ) INTO idf_sum;

  /*sql_statement := 'SELECT COUNT(word) FROM ' || wordstname || ';';*/

  EXECUTE FORMAT( 'SELECT COUNT(word) FROM %s;', wordstname ) INTO vocab_len;

  average_idf = idf_sum / vocab_len::DOUBLE PRECISION;
  param_eps = param_epsilon * average_idf;

  /*sql_statement := 'UPDATE ' || wordstname || ' SET idf = ' || param_eps || ' WHERE idf < 0;';*/
  EXECUTE FORMAT( 'UPDATE %s SET idf = %s WHERE idf < 0;', wordstname, param_eps );

  /* these debug statements are not required */
  /*INSERT INTO bm25_params_debug(paramname,value) VALUES('idf_sum',idf_sum);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('vocab_len',vocab_len);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('average_idf',average_idf);
  INSERT INTO bm25_params_debug(paramname,value) VALUES('param_eps',param_eps);*/

  /*  words * documents score map  */

  /*sql_statement := 'ALTER TABLE ' || wordstname || ' ADD COLUMN wsmap DOUBLE PRECISION[];';*/

  EXECUTE FORMAT( 'ALTER TABLE %s ADD COLUMN wsmap DOUBLE PRECISION[];', wordstname );

  /*sql_statement := 'UPDATE ' || wordstname || ' SET wsmap = get_wsmapobj( ' || chr(39) || docstname || chr(39) || ', word, idf, ' || param_k1 || ');';*/

  EXECUTE FORMAT( 'UPDATE %s SET wsmap = get_wsmapobj( %s, word, idf, %s );', wordstname, quote_literal(docstname), param_k1 );

END;
$$
""")


################################################################################
#
#  "New" search for plpgsql-built-index (with built-in tokenizer)
#
################################################################################


# bm25scorerows2(): gets the document-scores row for each word of the query; the query is tokenized with bm25simpletokenize()
msq("""
DROP FUNCTION IF EXISTS bm25scorerows2;
CREATE OR REPLACE FUNCTION bm25scorerows2(tablename TEXT, mquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
DECLARE
  w TEXT := '';
  /*sql_statement TEXT := '';*/
BEGIN
  FOR w IN SELECT unnest(bm25simpletokenize(mquery))
  LOOP
    RETURN QUERY EXECUTE FORMAT( 'SELECT wsmap FROM %s WHERE word = %s;', tablename, quote_literal(w) );
  END LOOP;
END;
$$
""")


# bm25scoressum2(): sums the score rows to one array with the document scores ; TODO: instead of xdocstname maybe with temp table, race condition here?
msq("""
DROP FUNCTION IF EXISTS bm25scoressum2;
CREATE OR REPLACE FUNCTION bm25scoressum2(tablename TEXT, tokenizedquery TEXT) RETURNS SETOF double precision[]
LANGUAGE plpgsql
AS $$
DECLARE
  xdocstname TEXT := tablename || '_bm25i_temp';
BEGIN
  EXECUTE FORMAT( 'DROP TABLE IF EXISTS %s;', xdocstname );
  EXECUTE FORMAT( 'CREATE TABLE %s AS SELECT bm25scorerows2(%s, %s);', xdocstname, quote_literal(tablename), quote_literal(tokenizedquery) );
  RETURN QUERY EXECUTE FORMAT( 'SELECT ARRAY_AGG(sum ORDER BY ord) FROM (SELECT ord, SUM(int) FROM %s, unnest(bm25scorerows2) WITH ORDINALITY u(int, ord) GROUP BY ord) AS s;', xdocstname );
END;
$$
""")


# bm25scunnest2(): unnests the score array
msq("""
DROP FUNCTION IF EXISTS bm25scunnest2;
CREATE OR REPLACE FUNCTION bm25scunnest2(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT unnest(bm25scoressum2(tablename,tokenizedquery));
END;
$$
""")


# bm25isc2(): returns the index and score of the documents; index starts with 1
msq("""
DROP FUNCTION IF EXISTS bm25isc2;
CREATE OR REPLACE FUNCTION bm25isc2(tablename TEXT, tokenizedquery TEXT) RETURNS TABLE(id BIGINT, score double precision)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY SELECT row_number() OVER () AS id, bm25scunnest2 FROM bm25scunnest2(tablename,tokenizedquery) ;
END;
$$
""")


# bm25topk2(): returns the index, score and document, sorted by score and limited to k rows
msq("""
DROP FUNCTION IF EXISTS bm25topk2;
CREATE OR REPLACE FUNCTION bm25topk2(tablename TEXT, columnname TEXT, tokenizedquery TEXT, k INT) RETURNS TABLE(id INTEGER, score double precision, doc TEXT)
LANGUAGE plpgsql
AS $$
DECLARE
  docstname TEXT := tablename || '_' ||  columnname || '_bm25i_docs';
  wordstname TEXT := tablename || '_' ||  columnname || '_bm25i_words';
BEGIN
  RETURN QUERY EXECUTE FORMAT( 'SELECT t1.id, t2.score, t1.%s AS doc FROM (SELECT id, doc AS %s FROM %s) t1 INNER JOIN ( SELECT id, score FROM bm25isc2(%s,%s) ) t2 ON ( t1.id = t2.id ) ORDER BY t2.score DESC LIMIT %s;', columnname, columnname, docstname, quote_literal(wordstname), quote_literal(tokenizedquery), k );
  /*RETURN QUERY EXECUTE FORMAT( 'SELECT id, score FROM bm25isc2(%s,%s)', quote_literal(wordstname), quote_literal(tokenizedquery) );*/
END;
$$
""")



 * postgresql+psycopg2://@/postgres
Empty DataFrame
Columns: []
Index: []
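
The index-building steps of `bm25createindex()` can be easier to follow in plain Python. The block below is a minimal sketch that mirrors the same arithmetic (document lengths, `hds`, `idf`, the epsilon replacement for negative `idf`, and the final `wsmap`). The function name `build_wsmap_sketch` is an assumption made for illustration; it is not part of the notebook, and it is not claimed to match `mybm25okapi` line by line.

```python
import math
from collections import Counter


def build_wsmap_sketch(tokenized_corpus, k1=1.5, b=0.75, epsilon=0.25):
    """Illustrative mirror of bm25createindex(); hypothetical helper, not the notebook's code."""
    # Per-document word counts, like count_words_in_array()
    word_freqs = [Counter(doc) for doc in tokenized_corpus]
    doc_lens = [len(doc) for doc in tokenized_corpus]

    # Only non-empty documents are counted, like "WHERE doc_lens > 0"
    corpus_len = sum(1 for n in doc_lens if n > 0)
    avg_doc_len = sum(doc_lens) / corpus_len

    # Precalculated "half of divisor" for every document
    hds = [k1 * (1.0 - b + b * n / avg_doc_len) for n in doc_lens]

    # Number of documents containing each word, like get_word_docs_count()
    word_docs_count = Counter(w for wf in word_freqs for w in wf)

    # idf per word; negative values are replaced by epsilon * average idf
    idf = {w: math.log(corpus_len - n + 0.5) - math.log(n + 0.5)
           for w, n in word_docs_count.items()}
    eps = epsilon * (sum(idf.values()) / len(idf))
    idf = {w: (eps if v < 0 else v) for w, v in idf.items()}

    # One score per (word, document), like get_wsmapobj()
    return {
        w: [idf[w] * wf[w] * (k1 + 1) / (wf[w] + hds[i])
            for i, wf in enumerate(word_freqs)]
        for w in idf
    }
```

Each value in the returned dictionary corresponds to one `wsmap` array in the words table: position `i` holds the score of that word for document `i + 1`.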

In [None]:
from rank_bm25 import BM25Okapi
import json


# table and file names
tablename = 'items'
columnname = 'doctext' #'full_description' #
tablename_bm25wsmap = tablename+'_bm25wsmap'
csvfilepath = '/content/'+tablename_bm25wsmap+'.csv'
k = 10

# preparing tokenized corpus
tokenized_corpus = [ mytokenize(item[columnname]) for item in items ]

# rank_bm25 and mybm25okapi
rank_bm25_index = BM25Okapi(tokenized_corpus)
mybm25_index = mybm25okapi(tokenized_corpus)

# Postgres
# Export wsmap to CSV
#mybm25_index.exportwsmap( csvfilepath )
# Import wsmap to Postgres from CSV
#msq('SELECT bm25importwsmap(\''+tablename_bm25wsmap+'\',\''+csvfilepath+'\');')

##########
msq('SELECT bm25createindex(\''+tablename+'\',\''+columnname+'\');')

docstname = tablename + '_' + columnname + '_bm25i_docs'
wordstname = tablename + '_' + columnname + '_bm25i_words'
msq('SELECT * FROM '+docstname+';')
msq('SELECT * FROM '+wordstname+';')
#msq('SELECT * FROM bm25_params_debug;')
##########

# Running the questions
runquestions = True # TODO
if runquestions :
  for qi,q in enumerate(questions) :

    # tokenize and print question
    tokenizedquestion = mytokenize(q)
    print('\n----Question',qi,':',q,' | Tokenized: ',tokenizedquestion)
    if questionsolutions and qi<len(questionsolutions) :
      print('Solution ID:',questionsolutions[qi])

    # rank_bm25 BM25 search
    doc_scores = rank_bm25_index.get_scores( tokenizedquestion )
    bres = [ [i,s] for i,s in enumerate(doc_scores) ]
    bres.sort(key=lambda x:x[1],reverse=True)
    bres = bres[:k]

    # mybm25okapi BM25 search
    bres2 = mybm25_index.topk( tokenizedquestion, k )

    # Postgres BM25 search
    #sqlst = 'SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(\''+tablename+'\', \''+tablename_bm25wsmap+'\',\''+json.dumps(tokenizedquestion).replace("'","\'\'")+'\', 10);'
    #sqlst = 'SELECT * FROM bm25topk2(\''+tablename+'\', \''+columnname+'\',\''+q.replace("'","\'\'")+'\', 10);'
    #print('|',sqlst,'|')
    #msq(sqlst)
    msq2( 'SELECT * FROM bm25topk2( \''+tablename+'\', \''+columnname+'\', \''+q.replace("'","\'\'")+'\', '+str(k)+' );' )
    #msq('SELECT bm25topk.id, bm25topk.score, bm25topk.doc FROM bm25topk(\''+tablename+'\', \''+tablename_bm25wsmap+'\',\''+json.dumps(tokenizedquestion).replace("'","\'\'")+'\', 10);')

    # Print rank_bm25, mybm25okapi results
    # the loop variable is named ri so that it does not overwrite k (the result limit used above)
    for ri in range(k):
      if ri < len(bres) :
        print( '|rank_bm25  |', bres[ri][0]+1,  math.floor(bres[ri][1]*10e5)/10e5,  items[bres[ri][0] ][columnname] )
      if ri < len(bres2) :
        print( '|mybm25okapi|', bres2[ri][0]+1, math.floor(bres2[ri][1]*10e5)/10e5, items[bres2[ri][0]][columnname] )
      print(' ')


 * postgresql+psycopg2://@/postgres
  bm25createindex
0                
 * postgresql+psycopg2://@/postgres
    id                                                doc  \
0    5  How can I get all users with atleast one post?...   
1   12  I have Custom Type `cpt1` and Custom Fields `c...   
2    1  In my plugin, I have a list of titles and perm...   
3    2  I'm just trying to add a simple bit of Show/Hi...   
4    3  How would I go about adding a column to the Pa...   
5    4  I insert author box below my post and use get_...   
6    6  Title says it all. We have a front end form on...   
7    7  Under Reading Settings, I can specify how many...   
8    8  I am having 2 WordPress install.One for websit...   
9    9  I have the following code which adds post navi...   
10  10  I made a child theme defined as so:           ...   
11  11  I have a WooCommerce site and a specific distr...   
12  13  I'm looking to see if it's possible to find an...   
13  14  Is it possible to rewrite the 
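
The search chain `bm25scorerows2()`, `bm25scoressum2()`, `bm25isc2()` and `bm25topk2()` boils down to: fetch one score array per query word, sum the arrays position by position, number the documents starting with 1, then sort and limit. A minimal Python sketch of that idea follows, assuming a `wsmap` dictionary of word to per-document score list. The name `topk_from_wsmap` is hypothetical and used only for illustration.

```python
def topk_from_wsmap(wsmap, query_tokens, k=10):
    """Illustrative mirror of the PL/pgSQL search chain; hypothetical helper."""
    # One score row per query word found in the index (like bm25scorerows2);
    # a word that occurs twice in the query contributes twice
    rows = [wsmap[w] for w in query_tokens if w in wsmap]
    if not rows:
        return []

    # Element-wise sum of the rows (like bm25scoressum2)
    totals = [sum(column) for column in zip(*rows)]

    # Document ids start with 1 (like bm25isc2), best score first (like bm25topk2)
    ranked = sorted(enumerate(totals, start=1), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


example_wsmap = {"a": [1.0, 0.0, 2.0], "b": [0.5, 3.0, 0.0]}
print(topk_from_wsmap(example_wsmap, ["a", "b"], k=2))
```

Unlike the SQL version, this sketch returns an empty list when no query word is in the index, and Python's stable sort keeps tied documents in id order.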

In [None]:
# this works
# ! psql postgresql://@/postgres

## bm25simpletokenize() test

In [None]:
import json
import psycopg2

def pqbm25simpletokenize(t) :
  with psycopg2.connect("dbname=postgres user=root") as conn:
    with conn.cursor() as cur:
      cur.execute("SELECT bm25simpletokenize(%s)",(t,))
      res = cur.fetchall()
      return str(res)

txts = [
  'hello, HELLO!',
  ' I found ',
  'I found this snippet online …               function search_url_rewrite_rule() {         if ( is_search() && !empty($_GET[\'s\'])) {  ',
  'Is there a way to add\' Reattach and Unattach links in image gallery?',
  ' been queried\'s in the global $wp_query?',
  'Currently, I makes a plugin, and I '
]

for t in txts :
  print( mytokenize(t) )
  s = pqbm25simpletokenize(t)
  print(s[2:])
