<a href="https://colab.research.google.com/github/jankovicsandras/ml/blob/main/emscripten_wasm_vectordb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Emscripten Wasm vectorDB test

### PROs:
 - it works :)
 - seems to be fast (but not yet timed, TODO)

### CONs:
 - only top_1 for now, not top_k
 - more complex than a plain JavaScript implementation (needs a C toolchain and a separate compile step)

### Table of contents:
1. Installing deps
2. Creating test database and embedding
3. Building the C vectordb
4. Wasm compiling
5. Testing Wasm vectordb

### 1. Installing deps

In [None]:
# for python and embedding
!pip -q install sentence-transformers

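# for Wasm
# note: every `!` line runs in its own shell, so the environment set by emsdk_env.sh does not persist;
# that is why emcc is called by its full path below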
! git clone https://github.com/emscripten-core/emsdk.git
! cd emsdk && ./emsdk install latest && ./emsdk activate latest && source ./emsdk_env.sh
! ./emsdk/upstream/emscripten/emcc --version
! curl https://wasmtime.dev/install.sh -sSf | bash

### 2. Creating test database and embedding

In [54]:
import random
from sentence_transformers import SentenceTransformer

# TODO: better multi language generation
i18n = {
  'en':{
    'colors': ['black','blue','green','cyan','red','magenta','brown','light grey','dark grey','bright blue'],
    'itemtypes': [ 'belt','cap','hat','jeans','jumper', 'shirt','shorts','sneakers','suit','tie' ],
    'adjs' : ['Fantastic','Cool','Superb','Awesome','Trendy'],
    'insizestr': ' in size ',
    'pricestr': '. Price: ',
    'currencystr': ' USD.',
    'questions': [
      'I want to buy a hat. What colors do you have?',
      'Can you recommend something green?',
      'Do you have shirts under 50 USD?',
      'What do you have in size 40?',
      'I would like to buy sneakers for my friend. Do you have something in size 46, preferably cyan or blue?',
      'What can you recommend in red?'
    ]
  },
  'hu':{
    'colors': ['fekete','kék','zöld','zöldeskék','piros','lila','barna','világosszürke','sötétszürke','ragyogó kék'],
    'itemtypes': [ 'öv','sapka','kalap','farmer','pullóver', 'ing','rövidnadrág','tornacipő','öltöny','nyakkendő' ],
    'adjs': ['Csodálatos','Menő','Szuper','Király','Trendi'],
    'insizestr': '. Méret: ',
    'pricestr': '. Ár: ',
    'currencystr': ' Ft.',
    'questions': [
      'Kalapot szeretnék. Milyen színek vannak?',
      'Tudsz-e ajánlani valami zöldet?',
      'Vannak ingek 50 Ft. alatt?',
      'Mik vannak 40-es méretben?',
      'Tornacipőt szeretnék a barátomnak. Van valami 46-os méretben, lehetőleg zöldeskék vagy kék?',
      'Mit tudsz ajánlani pirosban?'
    ]
  },
  'no':{
    'colors': ['svart','blå','grøn','grønblå','rød','lilla','brun','lysgrå','mørkgrå','lysblå'],
    'itemtypes': [ 'belt','lue','hatt','bukser','genser', 'skjorte','shorts','sko','dress','slips' ],
    'adjs': ['Fantastisk','Kult','Supert','Tøff','Trendy'],
    'insizestr': ' i størrelse ',
    'pricestr': '. Pris: ',
    'currencystr': ' kr.',
    'questions': [
      'Jeg vil kjøpe en hatt. Hva farger er det?',
      'Kan du anbefale noen grønt?',
      'Har de skjorter under 50 kr?',
      'Hva har de i størrelse 40?',
      'Jeg vil gjerne kjøpe sko til min venn. Har de nokre i størrelse 46, helst grønblå eller blå?',
      'Hva kan du anbefale i rødt?'
    ]
  }
}

sizes = [str(30+x*2) for x in range(0,10)]

lang = 'hu'

allitems = []
for c in i18n[lang]['colors'] :
  for s in sizes :
    for ii,i in enumerate(i18n[lang]['itemtypes']) :
      allitems.append( random.choice(i18n[lang]['adjs'])+' '+ c+' '+ i+ i18n[lang]['insizestr']+s+
                      i18n[lang]['pricestr']+str(int(s)+20+5*ii)+i18n[lang]['currencystr'] )

random.shuffle(allitems)
print('len(allitems)',len(allitems))#,allitems)

# test questions
questions = i18n[lang]['questions']
random.shuffle(questions)
#print(questions)

# embedding
embeddermodel = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
zv = [0]*384
ivs = [] # item vectors
ivs.append(zv) # the first is a zeros vector, so the returned result index from the vectordb will be 1 indexed!
for i in allitems :
  ivs.append( embeddermodel.encode(i) )

qvs = [] # question vectors
for q in questions :
  qvs.append( embeddermodel.encode(q) )


len(allitems) 1000

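The C code in the next section scores items with a plain dot product, and that only matches cosine similarity when the vectors have unit length. Below is a minimal pure-Python sketch of that relationship. The helper names (`dot`, `l2_norm`, `cosine`, `normalize`) are illustrative and not part of the notebook, and whether a given embedding model returns unit-length vectors is an assumption worth checking, for example by confirming that `l2_norm(ivs[1])` is close to 1.0.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na, nb = l2_norm(a), l2_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def normalize(a):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = l2_norm(a)
    return [x / n for x in a] if n else list(a)

# toy vectors: not unit length, so dot and cosine differ
a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

# after normalization the dot product and the cosine agree
print(dot(a, b), cosine(a, b), dot(ua, ub))
```

If the stored vectors turn out not to be unit length, normalizing them before writing them into the C array would keep the ranking equivalent to cosine similarity.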

### 3. Building the C vectordb

In [55]:
# Building vdb
vector_size = 384 # all-MiniLM-L6-v2 is 384, adjust if needed
vectors_count = len( ivs )

def flatten(xss):
    return [x for xs in xss for x in xs]

vdb = flatten(ivs)

# one comma-separated line per vector, built with join to avoid repeated string concatenation
vdbstr = '\n' + ',\n'.join( ', '.join( str(x) for x in row ) for row in ivs ) + '\n'

# the C vectordb program
ccode = """
#include <emscripten.h>
#include <stdio.h>
#include <stdlib.h>

#define VECTOR_SIZE """ +str(vector_size)+ """
#define VECTORS_COUNT """ +str(vectors_count)+ """

// vdb and this C file should be generated with an external tool
// shape is  [ [0,0,...], [x0,x1...], ... items ], where the first vector should be zeros
// top_k = 1 for now; the score is a plain dot product, which equals cosine similarity only for unit-length vectors
float vdb [] = { """+vdbstr+""" };


EMSCRIPTEN_KEEPALIVE
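int gettop1(float v[]){
  int i=0, j=0, vi = 0;        // vi: index of the best match so far
  float vival = 0, ival = 0;   // vival: best score so far, ival: score of the current vector
  for(i=0; i<VECTORS_COUNT; i++){
    // dot product of the i-th stored vector with the query
    ival = 0;
    for(j=0; j<VECTOR_SIZE; j++){
      ival = ival + vdb[i*VECTOR_SIZE+j] * v[j];
    }
    // strict comparison against an initial 0: if no score is positive, vi stays 0 (the zeros placeholder)
    if( ival > vival ){
      vival = ival;
      vi = i;
    }
  }
  return vi;
}// End of gettop1()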


EMSCRIPTEN_KEEPALIVE
int main(int argc, char **argv) {

  // argv to input vector parsing; extra arguments beyond VECTOR_SIZE are ignored to avoid writing past v[]
  float v[VECTOR_SIZE] = {0};
  for(int argi=1; argi<argc && argi<=VECTOR_SIZE; argi++){
    v[argi-1] = atof( argv[argi] );
  }

  // get best match's index from vectordb
  int vi = gettop1( v );

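  // print the index; stdout is what the caller reads
  printf("%d", vi);
  // exit status 0 signals success (exit statuses are too narrow to carry an index reliably)
  return 0;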

}// End of main()

"""

# print(ccode)

# write the C program to file
with open('vdb.c','w+') as f:
  f.write(ccode)

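For cross-checking the Wasm result, the scan in `gettop1()` can be mirrored in a few lines of Python. This is a sketch rather than part of the notebook: `gettop1_ref` follows the C logic above (strict comparison against an initial score of 0), while `gettopk_ref` is a hypothetical top_k generalization built on `heapq.nlargest`, shown only to indicate one way the top_1 limitation might be lifted.

```python
import heapq

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def gettop1_ref(vectors, query):
    """Mirror of the C gettop1(): index of the highest strictly positive score.

    Returns 0 (the zeros placeholder) when no score is above 0.
    """
    best_idx, best_val = 0, 0.0
    for i, vec in enumerate(vectors):
        score = dot(vec, query)
        if score > best_val:
            best_val, best_idx = score, i
    return best_idx

def gettopk_ref(vectors, query, k):
    """Hypothetical top_k variant: indices of the k best strictly positive scores."""
    scored = [(dot(vec, query), i) for i, vec in enumerate(vectors)]
    positive = [(s, i) for s, i in scored if s > 0.0]
    best = heapq.nlargest(k, positive, key=lambda si: si[0])
    return [i for _, i in best]

# toy database: index 0 is the zeros placeholder, as in the notebook
toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(gettop1_ref(toy, [0.0, 1.0]))
print(gettopk_ref(toy, [0.0, 1.0], 2))
```

In the notebook, `gettop1_ref(ivs, qvs[i])` would be expected to agree with the index printed by the Wasm module, apart from possible float32 rounding differences on near ties.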

### 4. Wasm compiling

In [56]:
! ./emsdk/upstream/emscripten/emcc -O3 -s WASM=1 -s STANDALONE_WASM -o a.js vdb.c
! ls -la

total 6528
drwxr-xr-x  1 root root    4096 Feb 26 13:12 .
drwxr-xr-x  1 root root    4096 Feb 26 08:00 ..
-rw-r--r--  1 root root   12043 Feb 26 13:12 a.js
-rwxr-xr-x  1 root root 1558233 Feb 26 13:12 a.wasm
drwxr-xr-x  4 root root    4096 Feb 22 14:24 .config
drwxr-xr-x 12 root root    4096 Feb 26 13:11 emsdk
drwxr-xr-x  1 root root    4096 Feb 22 14:24 sample_data
-rw-r--r--  1 root root 5088718 Feb 26 13:11 vdb.c

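Most of `a.wasm` is likely the embedded vector table. A quick arithmetic check, assuming 4-byte `float` values (as on wasm32) and the 1000 items plus the zeros placeholder from section 2:

```python
VECTOR_SIZE = 384          # dimensions per vector (all-MiniLM-L6-v2)
VECTORS_COUNT = 1000 + 1   # 1000 items plus the zeros placeholder
FLOAT_BYTES = 4            # assumption: C float is 4 bytes on wasm32

# raw size of the float table compiled into the module
table_bytes = VECTOR_SIZE * VECTORS_COUNT * FLOAT_BYTES
wasm_bytes = 1558233       # size of a.wasm taken from the listing above

print(table_bytes)               # 1537536
print(wasm_bytes - table_bytes)  # 20697
```

The roughly 20 KB that remain would be code, runtime support and section overhead. The exact split is an estimate, since the toolchain may pack or drop zero-filled data.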

### 5. Testing Wasm vectordb

In [57]:
import os
basecommand = '/root/.wasmtime/bin/wasmtime a.wasm'

# loop questions
for i in range(0,len(questions)) :
  print('\n--------\n',questions[i])
  # question vector
  inputvector = qvs[i]
  # creating CLI command, this could be optimized
  fullcommand = basecommand + ''
  for n in inputvector :
    fullcommand += ' '+str(n)
  # vdb search
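  rawresultidx = os.popen(fullcommand).read()
  resultidx = int(rawresultidx)-1  # -1 is important because vdb[1] is allitems[0], vdb[0] is a zeros placeholder
  # print; resultidx == -1 means only the zeros placeholder matched, i.e. no item scored above 0
  if resultidx < 0 :
    print( 'no match' )
  else :
    print( resultidx, allitems[ resultidx ] )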



--------
 Mit tudsz ajánlani pirosban?
389 Trendi piros öltöny. Méret: 34. Ár: 94 Ft.

--------
 Tudsz-e ajánlani valami zöldet?
64 Trendi zöldeskék rövidnadrág. Méret: 42. Ár: 92 Ft.

--------
 Mik vannak 40-es méretben?
9 Menő világosszürke nyakkendő. Méret: 40. Ár: 105 Ft.

--------
 Vannak ingek 50 Ft. alatt?
779 Trendi kék öltöny. Méret: 40. Ár: 100 Ft.

--------
 Kalapot szeretnék. Milyen színek vannak?
684 Szuper sötétszürke kalap. Méret: 40. Ár: 70 Ft.

--------
 Tornacipőt szeretnék a barátomnak. Van valami 46-os méretben, lehetőleg zöldeskék vagy kék?
461 Szuper világosszürke tornacipő. Méret: 36. Ár: 91 Ft.

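The PROs list notes that speed has not been timed yet. One simple way to get a first number is a best-of-N wall-clock measurement with `time.perf_counter`. The helper name `best_time` is illustrative, and the workload below is only a stand-in so the block runs on its own.

```python
import time

def best_time(fn, repeats=5):
    """Run fn() `repeats` times; return (best wall-clock seconds, last result)."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Stand-in workload; in the notebook this could be
# lambda: os.popen(fullcommand).read()
seconds, value = best_time(lambda: sum(range(100000)))
print(f'best of 5: {seconds * 1000:.3f} ms, result {value}')
```

Timing the full `wasmtime` command this way includes process startup and module loading, so it measures more than the vector scan itself.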