In [1]:
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Dense Retrieval using Milvus

## Goals

- Understand the Python Milvus client (`pymilvus`)
- Embed Stack Overflow titles and index them in Milvus
- Compute evaluation metrics

In [2]:
!ls

00_data_fetch_bq.ipynb		 Untitled.ipynb
00_data_fetch_spark.ipynb	 __pycache__
01_b_setup.ipynb		 ann_benchmark_recall.ipynb
01_data_cleanup.ipynb		 metrics_utils.py
01_data_subset.ipynb		 old
02_retrieval_dense_milvus.ipynb  test_setup.ipynb
02_retrieval_sparse.ipynb	 workshop_setup.ipynb


## Imports

In [3]:
import datetime
import pickle
import uuid
import numpy as np
import time
import pandas as pd
import tqdm
import torch
import metrics_utils
import rich


In [4]:
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)
import pymilvus

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer


In [5]:
pd.options.display.max_colwidth = 500 # increase column width

## Data

In [6]:
# Full dataset (overridden below)
path_posts = "gs://np-public-training-temp/stackoverflow/final/posts.parquet"
path_posts_related = "gs://np-public-training-temp/stackoverflow/final/related_posts.parquet"

# Subset of the data; these assignments replace the paths above
path_posts = "gs://np-public-training-temp/stackoverflow/final_subset/posts.parquet"
path_posts_related = "gs://np-public-training-temp/stackoverflow/final_subset/related_posts.parquet"


In [7]:
collection_name = "stackoverflow"

## Model

Luckily, there is an open-source model trained on Stack Overflow data and published on Hugging Face.

[Hugging Face Model Card](https://huggingface.co/flax-sentence-embeddings/stackoverflow_mpnet-base)

SentenceTransformer is a library that makes it much easier to train and use models, especially those geared toward similarity.

In [8]:
model_name  = 'flax-sentence-embeddings/stackoverflow_mpnet-base'

### Sentence Transformer Api

In [9]:
model = SentenceTransformer(model_name)


In [10]:
rich.print ( list(model.children()) )

The model uses mean pooling, as specified by the `pooling_mode_mean_tokens` parameter.

In [11]:
text = "What does 'super' do in Python? - difference between super().__init__() and explicit superclass __init__()"
resp = model.encode(text, output_value=None)

rich.print(resp)

In [12]:
resp['token_embeddings'].shape

torch.Size([38, 768])
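
To make mean pooling concrete, here is a minimal sketch in plain Python. It assumes the token embeddings arrive as one vector per token plus an attention mask (1 for real tokens, 0 for padding). The tiny 3-dimensional vectors and the `mean_pool` helper are illustrative only and are not part of the SentenceTransformer API.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (hypothetical helper)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]


# Toy input: three token vectors, the last one is padding and is ignored.
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)
```

In the notebook the same idea would collapse the `[38, 768]` token matrix into a single 768-dimensional sentence vector.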

### Tokenizer

In [13]:
text = "The quick brown FOX was running "
text = "Python pandas memory isssue"

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [15]:
tokenizer(text)

{'input_ids': [0, 18754, 25466, 2019, 3642, 26358, 6346, 2067, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
tokens = tokenizer.tokenize(text) 
rich.print ( tokens)

In [17]:
input_ids= tokenizer.convert_tokens_to_ids(tokens)
input_ids

[18754, 25466, 2019, 3642, 26358, 6346, 2067]

In [18]:
decoded_string = tokenizer.decode(input_ids)
decoded_string

'python pandas memory isssue'

#### special tokens

In [19]:
tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '[UNK]',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

0 => beginning of sentence    
2 => end of sentence

In [20]:
tokenizer.convert_ids_to_tokens([0, 2])

['<s>', '</s>']
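
The difference between the ids returned by `tokenizer(text)` and those returned by `convert_tokens_to_ids` is just these two special tokens. The sketch below reproduces that wrapping by hand; `add_special_tokens` and `strip_special_tokens` are hypothetical helpers written for illustration, not tokenizer methods.

```python
BOS_ID = 0  # '<s>', beginning of sentence
EOS_ID = 2  # '</s>', end of sentence


def add_special_tokens(token_ids):
    """Wrap plain token ids with the begin and end markers."""
    return [BOS_ID, *token_ids, EOS_ID]


def strip_special_tokens(input_ids):
    """Drop the begin and end markers again."""
    return [i for i in input_ids if i not in (BOS_ID, EOS_ID)]


# Ids produced by convert_tokens_to_ids earlier in this notebook.
plain_ids = [18754, 25466, 2019, 3642, 26358, 6346, 2067]

wrapped = add_special_tokens(plain_ids)
print(wrapped)
print(strip_special_tokens(wrapped) == plain_ids)
```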

In [21]:
tokenizer.convert_ids_to_tokens(tokenizer(text)['input_ids'])

['<s>', 'python', 'panda', '##s', 'memory', 'iss', '##su', '##e', '</s>']
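
The `##` prefix marks a piece that continues the previous token, which is how unknown words such as the misspelled `isssue` are still representable. The following is a minimal sketch of greedy longest-match subword splitting in the WordPiece style. The `toy_vocab` is an assumption chosen so that the split matches the output above; the real vocabulary is far larger and the real tokenizer also handles casing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the model's real one.
toy_vocab = {"python", "panda", "##s", "memory", "iss", "##su", "##e"}

text = "python pandas memory isssue"
tokens = [piece for word in text.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```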

In [22]:
dim = model.get_sentence_embedding_dimension()
dim

768

Every sentence, regardless of its length, is represented as a 768-dimensional vector.

## Milvus Collection Setup 

In [23]:
df = pd.read_parquet(path_posts)
df['Tags']  = df['Tags'].apply(lambda x: " ".join( x.tolist()))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [24]:
len(df)

219841

In [25]:
connections.connect("default", host="localhost", port="19530")


A Milvus `collection` is the equivalent of an Elastic Search `index` (or a database table).

Each collection is meant for a separate use case.

In [26]:
utility.list_collections()

['hello_milvus', 'stackoverflow']

In [27]:
if collection_name in utility.list_collections():
    utility.drop_collection(collection_name)

In [28]:
?Collection

Init signature: Collection(name, schema=None, using='default', shards_num=2, **kwargs)
Docstring:      This is a class corresponding to collection in milvus.
Init docstring:
Constructs a collection by name, schema and other parameters.
Connection information is contained in kwargs.

:param name: the name of collection
:type name: str

:param schema: the schema of collection
:type schema: class `schema.CollectionSchema`

:param using: Milvus link of create collection
:type using: str

:param shards_num: How wide to scale collection. Corresponds to how many active datanodes
                can be used on insert.
:type shards_num: int

:param kwargs:
    * *consistency_level* (``str/int``) --
    Which consistency level to use when searching 

Unlike Elastic Search, Milvus requires us to specify the document schema beforehand.

Currently, Milvus stores the metadata for a document in MySQL, hence some of the data type names (for example, `VARCHAR`).

In [29]:
fields = [
    FieldSchema(name="Id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="AcceptedAnswerId", dtype=DataType.INT64),
    FieldSchema(name="Title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="QuestionBody", dtype=DataType.VARCHAR, max_length=50_000),
    FieldSchema(name="Tags", dtype=DataType.VARCHAR, max_length=5000),
    FieldSchema(name="ViewCount", dtype=DataType.INT64),
    FieldSchema(name="AnswerCount", dtype=DataType.INT64),
    FieldSchema(name="CommentCount", dtype=DataType.INT64),
    FieldSchema(name="Score", dtype=DataType.INT64),
    FieldSchema(name="AnswerId", dtype=DataType.INT64),
    FieldSchema(name="AcceptedAnswerBody", dtype=DataType.VARCHAR, max_length=50_000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim) ,
    
    #FieldSchema(name="CreationDate", dtype=DataType.VARCHAR),

]

schema = CollectionSchema(fields, "collection containing stackoverflow")

stackoverflow_milvus = Collection(collection_name, schema, consistency_level="Strong")
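
Because the schema is fixed up front, it can help to check rows against it before inserting. The sketch below is a plain-Python illustration of that idea and does not use the Milvus API: `SCHEMA` and `validate_row` are hypothetical, the limits mirror a few fields of the schema above, and length is counted in characters as a simplification.

```python
# Hypothetical, simplified mirror of three fields from the schema above.
SCHEMA = {
    "Id": {"type": int},
    "Title": {"type": str, "max_length": 500},
    "Tags": {"type": str, "max_length": 5000},
}


def validate_row(row, schema):
    """Return a list of human-readable problems found in one row."""
    errors = []
    for name, rule in schema.items():
        value = row.get(name)
        if value is None:
            errors.append(f"{name}: missing value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"{name}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
            continue
        max_length = rule.get("max_length")
        if max_length is not None and len(value) > max_length:
            errors.append(f"{name}: length {len(value)} exceeds {max_length}")
    return errors


row_ok = {"Id": 1, "Title": "Short title", "Tags": "python"}
row_bad = {"Id": "1", "Title": "x" * 501, "Tags": None}

print(validate_row(row_ok, SCHEMA))
print(validate_row(row_bad, SCHEMA))
```

This is also why the missing values are replaced with defaults later in the notebook: every declared field needs a value of the declared type.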

In [30]:
schema

{
  auto_id: False
  description: collection containing stackoverflow
  fields: [{
    name: Id
    description: 
    type: 5
    is_primary: True
    auto_id: False
  }, {
    name: AcceptedAnswerId
    description: 
    type: 5
  }, {
    name: Title
    description: 
    type: 21
    params: {'max_length': 500}
  }, {
    name: QuestionBody
    description: 
    type: 21
    params: {'max_length': 50000}
  }, {
    name: Tags
    description: 
    type: 21
    params: {'max_length': 5000}
  }, {
    name: ViewCount
    description: 
    type: 5
  }, {
    name: AnswerCount
    description: 
    type: 5
  }, {
    name: CommentCount
    description: 
    type: 5
  }, {
    name: Score
    description: 
    type: 5
  }, {
    name: AnswerId
    description: 
    type: 5
  }, {
    name: AcceptedAnswerBody
    description: 
    type: 21
    params: {'max_length': 50000}
  }, {
    name: embedding
    description: 
    type: 101
    params: {'dim': 768}
  }]
}

In [31]:
fields = [f.name for f in schema.fields]

In [32]:
fields

['Id',
 'AcceptedAnswerId',
 'Title',
 'QuestionBody',
 'Tags',
 'ViewCount',
 'AnswerCount',
 'CommentCount',
 'Score',
 'AnswerId',
 'AcceptedAnswerBody',
 'embedding']

Replace NaN or NA values with a default value.

In [33]:

df[['AcceptedAnswerId','AnswerId']] = df[['AcceptedAnswerId','AnswerId']].fillna(-1).astype(int)

cols = ['ViewCount','AnswerCount','CommentCount' ,'Score' ]
df[cols] = df[cols ].fillna(0).astype(int)


df[['AcceptedAnswerBody']] = df[['AcceptedAnswerBody']].fillna("")



In [34]:
df.head()

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody
1,15020895,-1,Python int-byte efficient data structure,"i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree s...",python data-structures,155,0,3,1,2013-02-22 09:33:26.360,-1,
9,68487902,-1,Why does the Variance of Laplace very different for OpenCV and scikit-image?,"TL;DR: How can I use skimage.filters.laplace(image).var() in a way to get the same value as cv2.Laplacian(image, CV_64F).var() and skimage.filters.sobel(image) to get same value as cv2.Sobel(image) ?\nI have the following code to find the Laplace Variance for blur detection\n[CODE]\nSo when I try to find the Laplace variance from OpenCV and scikit-image, it gives me two different values:\n[CODE]\nWhich one should I use or how can I get same number from both the functions?\nAlso, How can I us...",python opencv image-processing computer-vision scikit-image,391,0,5,1,2021-07-22 15:50:34.220,-1,
15,61391327,-1,Why input never ends,"I have python 3.7 installed and I have this code:\n\n[CODE]\n\nI was writing the name and press enter but the input is not over, it is still running and waiting for more inputs\n\nEdit: the problem is that input is never ending, doesn't matter how many enters I press\n",python python-3.x input,104,1,6,3,2020-04-23 15:43:03.497,-1,
27,28852710,-1,Crashes with piecewise linear objective for gurobi 6.0.2 / setPWLObj,"We have a complex optimization problem which includes several quadratic terms with integer and continous variables (using Anaconda Python / Pycharm with Gurobi 6.0.2). We applied the setPWLObj function to apprixmate the quadratic objective components. The code for this is as follows:\n\n[CODE]\n\nWith l1 and l2 being continous variables.\n\nThe problem behaves inconsistently. Running it on a Mac mostly delivers the exit codes 138 and 139 (correspondent to Bus Error 10), sometimes the same pr...",python crash gurobi piecewise,403,1,1,3,2015-03-04 10:58:16.370,-1,
29,24043029,-1,Python TypeError: plotdatehist() got an unexpected keyword argument,"apologies beforehand if this is a stupid question...\n\nI've been using some Manchester University code to record, analyse, and graphically display bird box activity using IR emitters/receivers using a Raspberry Pi.\nAnyway, I've run into a problem in the graphical display part. \n\nThe part of the code causing the error is: \n\n[CODE]\n\nand the error which keeps coming up reads\n\n[CODE]\n\nI've heard that similar problems can be fixed by updating software, but as far as I can tell everyth...",python typeerror,419,0,7,0,2014-06-04 16:42:32.257,-1,


In [35]:
len(df)

219841

In [36]:
df_subset = df.head(5_000_000).copy()

In [37]:
df_subset['Title'].tolist();

In [38]:
?model.encode

Signature:
model.encode(
    sentences: Union[str, List[str]],
    batch_size: int = 32,
    show_progress_bar: bool = None,
    output_value: str = 'sentence_embedding',
    convert_to_numpy: bool = True,
    convert_to_tensor: bool = False,
    device: str = None,
    normalize_embeddings

## Embedding Generation

In [39]:
embeddings = model.encode(df_subset['Title'].head(1000).tolist() , show_progress_bar=True)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Generating embeddings on a CPU can be very slow.

Embed on a GPU if one is available; otherwise, fetch the precomputed embeddings.
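
The "compute once, reuse afterwards" pattern can be sketched independently of the model. In the sketch below, `load_or_compute` and `fake_embed` are hypothetical stand-ins, and `pickle` replaces the parquet file used in this notebook only to keep the example dependency-free.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the cached result if present; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_embed():
    """Stand-in for the expensive model.encode(...) call."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.pkl"
    first = load_or_compute(path, fake_embed)   # computes and writes the cache
    second = load_or_compute(path, fake_embed)  # reads the cache

print(len(calls))
```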

In [41]:


if torch.cuda.is_available():
    embeddings = model.encode(df_subset['Title'].tolist() , show_progress_bar=True)
    df_subset['embedding'] = embeddings.tolist()
    df_subset.to_parquet(path_posts.replace(".parquet", "_with_embedding.parquet") , index=False)
    


Batches:   0%|          | 0/6871 [00:00<?, ?it/s]

In [42]:
embeddings.shape

(219841, 768)
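
A quick back-of-the-envelope check ties these numbers together. The sketch assumes the default `batch_size` of 32 shown in the `model.encode` signature and assumes the embeddings are stored as 4-byte floats; the second assumption is not confirmed anywhere in this notebook.

```python
import math

num_vectors = 219_841
dim = 768
batch_size = 32        # default from the model.encode signature
bytes_per_value = 4    # assumption: float32 embeddings

# Number of batches reported by the progress bars.
batches_full = math.ceil(num_vectors / batch_size)
batches_sample = math.ceil(1000 / batch_size)

# Raw size of the embedding matrix under the float32 assumption.
raw_bytes = num_vectors * dim * bytes_per_value

print(batches_sample, batches_full)
print(f"{raw_bytes / 2**20:.1f} MiB")
```

The batch counts agree with the totals shown by the two progress bars, 32 for the 1,000-title sample and 6871 for the full run.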

In [43]:
df_subset = pd.read_parquet( path_posts.replace(".parquet", "_with_embedding.parquet") )

In [44]:
df_subset.head()

Unnamed: 0,Id,AcceptedAnswerId,Title,QuestionBody,Tags,ViewCount,AnswerCount,CommentCount,Score,CreationDate,AnswerId,AcceptedAnswerBody,embedding
0,15020895,-1,Python int-byte efficient data structure,"i am currently storing key-values of type int-int. For fast access, I am currently using an BTrees.IIBTree structure stored in memory. It is not stored on disk at all since we need the most recent data.\n\nHowever, the current solution barely fits into memory, so I am looking for a more efficient database or data structure in terms of access time. In case it would be stored in memory it also needs to be efficient in terms of memory space. \n\nOne idea would be to replace the BTrees.IIBTree s...",python data-structures,155,0,3,1,2013-02-22 09:33:26.360,-1,,"[-0.012659057043492794, -0.00613340875133872, -0.0010714539093896747, -0.05187602341175079, 0.016947930678725243, 0.018191726878285408, -0.0036008567549288273, -0.0025095134042203426, 0.048929549753665924, -0.056598395109176636, -0.03610250726342201, -0.07298106700181961, -0.06097523495554924, 0.030608760192990303, -0.008724302984774113, -0.005664358846843243, -0.003994342405349016, 0.007222541607916355, -0.003428444731980562, -0.0011291344417259097, -0.06662901490926743, -0.0584153793752193..."
1,68487902,-1,Why does the Variance of Laplace very different for OpenCV and scikit-image?,"TL;DR: How can I use skimage.filters.laplace(image).var() in a way to get the same value as cv2.Laplacian(image, CV_64F).var() and skimage.filters.sobel(image) to get same value as cv2.Sobel(image) ?\nI have the following code to find the Laplace Variance for blur detection\n[CODE]\nSo when I try to find the Laplace variance from OpenCV and scikit-image, it gives me two different values:\n[CODE]\nWhich one should I use or how can I get same number from both the functions?\nAlso, How can I us...",python opencv image-processing computer-vision scikit-image,391,0,5,1,2021-07-22 15:50:34.220,-1,,"[0.06565584242343903, -0.031634315848350525, 0.006259895395487547, 0.03216732665896416, -0.0666438490152359, 0.000994484405964613, 0.014307020232081413, 0.02116939052939415, 0.04959907755255699, 0.0003712112666107714, 0.009125420823693275, 0.05929381772875786, 0.012205596081912518, -0.02416669763624668, -0.03371993079781532, -0.02006007730960846, 0.02398102730512619, -0.01687006652355194, 0.04898605868220329, -0.07479659467935562, -0.0025706428568810225, -0.06757359206676483, 0.0030513843521..."
2,61391327,-1,Why input never ends,"I have python 3.7 installed and I have this code:\n\n[CODE]\n\nI was writing the name and press enter but the input is not over, it is still running and waiting for more inputs\n\nEdit: the problem is that input is never ending, doesn't matter how many enters I press\n",python python-3.x input,104,1,6,3,2020-04-23 15:43:03.497,-1,,"[0.01758185401558876, -0.07531370967626572, 0.0016971321310847998, 0.04993859678506851, 0.031192289665341377, -0.011618325486779213, 0.0057819378562271595, 0.031561750918626785, -0.03394051268696785, 0.0017980141565203667, 0.0721777155995369, -0.01667066290974617, 0.01838006265461445, 0.007994672283530235, -0.06731881201267242, -0.01252695545554161, 0.02153550088405609, 0.005166086368262768, -0.00957895815372467, -0.03164404258131981, 0.006995020899921656, 0.018143707886338234, -0.0579609908..."
3,28852710,-1,Crashes with piecewise linear objective for gurobi 6.0.2 / setPWLObj,"We have a complex optimization problem which includes several quadratic terms with integer and continous variables (using Anaconda Python / Pycharm with Gurobi 6.0.2). We applied the setPWLObj function to apprixmate the quadratic objective components. The code for this is as follows:\n\n[CODE]\n\nWith l1 and l2 being continous variables.\n\nThe problem behaves inconsistently. Running it on a Mac mostly delivers the exit codes 138 and 139 (correspondent to Bus Error 10), sometimes the same pr...",python crash gurobi piecewise,403,1,1,3,2015-03-04 10:58:16.370,-1,,"[0.010070848278701305, 0.06428138166666031, -0.0021203288342803717, -0.013878368772566319, 0.006991712376475334, -0.011957813985645771, 0.01164327748119831, 0.010017339140176773, -0.018830085173249245, 0.07985765486955643, 0.009358108043670654, 0.045219000428915024, 0.02817448601126671, 0.0421956330537796, 0.040011510252952576, -0.034388672560453415, 0.019705161452293396, -0.001457356265746057, 0.036921314895153046, 0.0048420061357319355, 0.0034806530456990004, 0.02671915665268898, 0.0008028..."
4,24043029,-1,Python TypeError: plotdatehist() got an unexpected keyword argument,"apologies beforehand if this is a stupid question...\n\nI've been using some Manchester University code to record, analyse, and graphically display bird box activity using IR emitters/receivers using a Raspberry Pi.\nAnyway, I've run into a problem in the graphical display part. \n\nThe part of the code causing the error is: \n\n[CODE]\n\nand the error which keeps coming up reads\n\n[CODE]\n\nI've heard that similar problems can be fixed by updating software, but as far as I can tell everyth...",python typeerror,419,0,7,0,2014-06-04 16:42:32.257,-1,,"[0.034622225910425186, 0.0662810429930687, 0.03625532612204552, -0.033592548221349716, 0.01620665192604065, 0.018059229478240013, 0.02985021099448204, 0.05047150328755379, 0.022751474753022194, 0.03492686152458191, 0.05066610127687454, -0.0318586565554142, -0.027931421995162964, 0.0064952014945447445, -0.03989172726869583, -0.052240967750549316, -0.005318492650985718, 0.017922651022672653, 0.001835482893511653, 0.017695773392915726, -0.01469376403838396, -0.0010026713134720922, 0.00470648007..."


In [45]:
df_subset = df_subset [fields]

In [46]:
df_subset.iloc[0].to_dict();

In [47]:
df_subset.dtypes

Id                     int64
AcceptedAnswerId       int64
Title                 object
QuestionBody          object
Tags                  object
ViewCount              int64
AnswerCount            int64
CommentCount           int64
Score                  int64
AnswerId               int64
AcceptedAnswerBody    object
embedding             object
dtype: object

In [48]:
embeddings = None

## Embedding Insertion

In [49]:
insert_result = stackoverflow_milvus.insert( df_subset  )



In [50]:
insert_result

(insert count: 219841, delete count: 0, upsert count: 0, timestamp: 437096132418994178, success count: 219841, err count: 0)

In [51]:
stackoverflow_milvus.num_entities

219841

In [52]:
stackoverflow_milvus.indexes

[]

The embeddings are inserted, but no index has been created yet.

Milvus supports several approximate nearest neighbor (ANN) index types:

https://milvus.io/docs/index.md

In [53]:
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 100},
}

# nlist = number of clusters to create

# index = {
#     "index_type": "FLAT",
#     "metric_type": "L2",
#     "params": {}
# }
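
To build intuition for `nlist` and for the `nprobe` search parameter used later, here is a toy inverted-file (IVF) index in plain Python. It is a sketch under simplifying assumptions (random 8-dimensional points, a basic k-means, hypothetical helper names) and not how Milvus implements `IVF_FLAT` internally.

```python
import random


def sq_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Very small k-means that returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_l2(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_ivf(points, nlist):
    """Cluster the points into nlist cells and record which ids fall in each."""
    centroids = kmeans(points, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, p in enumerate(points):
        nearest = min(range(nlist), key=lambda c: sq_l2(p, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def ivf_search(query, points, centroids, lists, nprobe, limit):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    candidates.sort(key=lambda idx: sq_l2(query, points[idx]))
    return candidates[:limit]


def exact_search(query, points, limit):
    """Brute-force scan over every point, comparable to a FLAT index."""
    return sorted(range(len(points)), key=lambda idx: sq_l2(query, points[idx]))[:limit]


rng = random.Random(42)
points = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(points, nlist=10)

query = points[0]
approx = ivf_search(query, points, centroids, lists, nprobe=3, limit=3)
full = ivf_search(query, points, centroids, lists, nprobe=10, limit=3)
exact = exact_search(query, points, limit=3)
print(approx, full, exact)
```

With `nprobe` equal to `nlist` every cell is scanned, so the result matches the exact search; smaller `nprobe` values scan fewer candidates and may miss some true neighbors.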



In [54]:
?stackoverflow_milvus.create_index

Signature:
stackoverflow_milvus.create_index(
    field_name,
    index_params={},
    timeout=None,
    **kwargs,
) -> pymilvus.orm.index.Index
Docstring:
Creates index for a specified field. Return Index Object.

:param field_name: The name of the field to create an index for.
:type  field_name: str

:param index_params: The indexing parameters.
:type  index_params: dict

:param timeout: An optional duration of time in seconds to allow for the RPC. When timeout
                is set to None, client waits until server response or error occur
:type  timeout: float

:p

create the index

In [55]:
stackoverflow_milvus.create_index("embedding", index)

Status(code=0, message='')

In [56]:
stackoverflow_milvus.indexes

[<pymilvus.orm.index.Index at 0x7f2af0142e50>]

Load the collection into memory so that it can be searched.

In [57]:
stackoverflow_milvus.load()



## Embedding Retrieval

In [59]:
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 3}
    
}
# nprobe = number of clusters to search, in the range [1, nlist]

In [60]:
vectors_to_search = list(df_subset.iloc[0:1]['embedding'])







In [61]:
len(vectors_to_search) , len(vectors_to_search[0])

(1, 768)

In [62]:
?stackoverflow_milvus.search


In [63]:
?time.time

Docstring:
time() -> floating point number

Return the current time in seconds since the Epoch.
Fractions of a second may be present if the system clock provides them.
Type:      builtin_function_or_method


In [64]:
start_time = time.time()
result = stackoverflow_milvus.search(
    data=vectors_to_search,
    anns_field="embedding",
    param=search_params,
    limit=3,
    output_fields=["Id"],
)
end_time = time.time()


print((end_time - start_time))

0.1374046802520752
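
A single `time.time()` measurement can be noisy, for example because the first call may include warm-up costs. A slightly more robust sketch repeats the call and reports the median and the best time using `time.perf_counter`. The `time_call` helper is hypothetical, and `fake_search` stands in for the `stackoverflow_milvus.search(...)` call so that the example runs without a Milvus server.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn several times; return (median, best) duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations)


def fake_search():
    """Stand-in workload for the real search call."""
    return sum(i * i for i in range(10_000))


median_s, best_s = time_call(fake_search)
print(f"median: {median_s:.6f}s, best: {best_s:.6f}s")
```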


In [65]:
for hits in result:
    for hit in hits:
        print(f"hit: {hit}, score:{hit.score} id: {hit.entity.get('Id')} , data:{hit.entity._row_data} ")

hit: (distance: 0.0, id: 15020895), score:0.0 id: 15020895 , data:{'Id': 15020895} 
hit: (distance: 0.5060367584228516, id: 25471026), score:0.5060367584228516 id: 25471026 , data:{'Id': 25471026} 
hit: (distance: 0.5167953968048096, id: 46005777), score:0.5167953968048096 id: 46005777 , data:{'Id': 46005777} 


Just like ES, we get back the id, the score, and the metadata stored at insertion time (here, the fields requested via `output_fields`).

In [66]:
hit.score

0.5167953968048096
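
With the `L2` metric a smaller score means a closer match, which is why the query vector itself comes back first with a distance of 0. For unit-length vectors, squared L2 distance and cosine similarity are directly related by `||a - b||^2 = 2 - 2 * cos(a, b)`. The sketch below checks that identity numerically on made-up vectors. Whether this model's embeddings are unit-normalized, and whether the reported score is the squared distance, are assumptions that this notebook does not verify.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up vectors for illustration only.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

lhs = sq_l2(a, b)
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
print(sq_l2(a, a))  # a vector compared with itself
```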

## Evaluate on golden data

### helper code

In [67]:
def format_resp(hits, row):
    """Convert the Milvus hits for one query into a list of flat result dicts."""
    payload = []
    query = row['PostTitle']

    for hit in hits:
        doc_id = int(hit.entity.get('Id'))

        r = {
            'query': query,
            'query_id': row['PostId'],
            'doc_id': doc_id,
            # A hit is relevant if it appears in the golden list of related posts
            'is_relevant': doc_id in row['RelatedPostIds'],
            'score': hit.score,
            'doc_title': hit.entity.get('Title'),
        }
        payload.append(r)
    return payload


def evaluate_relevancy_hits(df, search_params, num_hits=20, batch_size=10):
    """Encode the query titles in batches, search Milvus, and label each hit."""
    payload_all = []
    print(f"Encoding {len(df)} vectors")

    for pos in tqdm.trange(0, len(df), batch_size):
        df_subset = df.iloc[pos:pos + batch_size]

        vectors_to_search = model.encode(list(df_subset['PostTitle']))

        result = stackoverflow_milvus.search(
            data=vectors_to_search,
            anns_field="embedding",
            param=search_params,
            limit=num_hits,
            output_fields=["Id", "Title"],
        )

        # Pair each query row with its list of hits
        for hits, row in zip(result, df_subset.to_dict(orient='records')):
            payload = format_resp(hits, row)
            payload_all.extend(payload)

    print("formatted response")

    df_res = pd.DataFrame(payload_all)
    return df_res
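
The batching loop in `evaluate_relevancy_hits` can be looked at in isolation. The sketch below uses a hypothetical `iter_batches` helper over a plain list to show how stepping by `batch_size` covers every query exactly once, including a shorter final batch.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for pos in range(0, len(items), batch_size):
        yield items[pos:pos + batch_size]


# Same number of queries as the golden dataset used in this notebook.
queries = [f"query {i}" for i in range(6114)]
batches = list(iter_batches(queries, 10))

print(len(batches))       # number of search calls
print(len(batches[-1]))   # size of the final, partial batch
```

With 6114 queries and a batch size of 10 this gives 612 batches, the same total that the progress bar reports for the full evaluation run.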

In [68]:
pdf_related = pd.read_parquet(path_posts_related)

In [69]:
pdf_related

Unnamed: 0,PostId,PostTitle,RelatedPostIds,RelatedPostTitles,num_candidates
1,3494593,Shading a kernel density plot between two points.,"[3494593, 14863744, 14094644, 16504452, 48853178, 36948624, 47308146, 34029811, 31215748, 29499914, 41484896, 7787114, 27189453, 23680729, 36224394, 18742693]","[Shading a kernel density plot between two points., adding percentile lines to a density plot, draw the following shaded area in R, color a portion of the normal distribution, How can I shade the area under a curve?, Shade area under a curve, Shading a region under a PDF, Fill different colors for each quantile in geom_density() of ggplot, How to shade part of a density curve in ggplot (with no y axis data), r density plot - fill area under curve, Fill negative value area below geom_line, po...",16
2,37949409,Dictionary in a numpy array?,"[37949409, 47689224, 61517741]","[Dictionary in a numpy array?, How to access the elements in numpy array of sets?, opening npy array. can view but not index?]",3
8,19876079,Cannot find module cv2 when using OpenCV,"[19876079, 62443365, 64580641, 45606137, 60294113, 65227902, 63039959]","[Cannot find module cv2 when using OpenCV, How to use opencv module in python(I'm using pycharm), build opencv from source: ModuleNotFoundError: No module named 'cv2', ImportError: No module named cv2 when executing Python script, 'opencv-python' installed but still shows 'ModuleNotFoundError: No module named cv2 ', Installed OpenCV successfully, but cannot import it within modules, On raspberry pi terminal cv2 works but on my project didnt work how can i fix this]",7
12,35082143,Error: package or namespace load failed for ‘car’,"[35082143, 65941744, 68515009, 56409535]","[Error: package or namespace load failed for ‘car’, Error: package or namespace load failed for ‘tidyverse’ there is no package called ‘reprex’, Truble loading 'Hmisc', > library(ez) Error: package or namespace load failed for ‘ez’ in loadNamespace]",4
14,2673651,inheritance from str or int,"[2673651, 48465797, 3120562, 15085917, 3238350, 4827303, 29751474, 50051365, 5693942, 59567148, 30045106, 37764447, 65568299, 24736813, 38873373]","[inheritance from str or int, Inherited class of int doesn't take additional arguments, Python, subclassing immutable types, Inheriting from immutable types, Subclassing int in Python, problem subclassing builtin type, Customizing immutable types in Python, Class inheritance not working while creating a Dimension custom class with int parent class in Python 3.6, Subclassing int and overriding the __init__ method - Python, How to inherit class complex in python?, Python how to extend `str` an...",15
...,...,...,...,...,...
33231,28419763,Expand Text widget to fill the entire parent Frame in Tkinter,"[28419763, 48171462]","[Expand Text widget to fill the entire parent Frame in Tkinter, Resize Entry with window in Tkinter]",2
33234,40332743,Source code for str.split?,"[40332743, 51355719]","[Source code for str.split?, where can I find implementation of str method in python?]",2
33241,27443414,Cannot perform a backup or restore operation within a transaction,"[27443414, 53216877]","[Cannot perform a backup or restore operation within a transaction, Can't perform a backup or restore operation within a transaction]",2
33243,48536681,What is the exact meaning of stride's list in tf.nn.conv2d?,"[48536681, 47305022]","[What is the exact meaning of stride's list in tf.nn.conv2d?, What is the meaning of 2D stride in convolution?]",2


#### Params 1

In [70]:
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 3}
}

In [71]:
vectors_to_search = model.encode( list( pdf_related.iloc[0:5]['PostTitle']) )

    

In [72]:
result = stackoverflow_milvus.search(data=vectors_to_search, anns_field="embedding", param=search_params, limit=20
                                     , output_fields=["Id","Title"]
                                    )

In [73]:
result

<pymilvus.orm.search.SearchResult at 0x7f2c7f464a90>

In [74]:
payload_all = []

for hits , row in zip( result, pdf_related.iloc[0:5].to_dict(orient='records') ):
    payload = format_resp(hits, row)
    payload_all.extend(payload)

df_res = pd.DataFrame(payload_all)


In [75]:
df_res

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,3494593,True,4.924334e-13,Shading a kernel density plot between two points.
1,Shading a kernel density plot between two points.,3494593,27294822,False,2.057712e-01,Shading a kernel density estimate between two points - with transparency
2,Shading a kernel density plot between two points.,3494593,60270301,False,5.561024e-01,Kernel Density Plots and Histogram overlay
3,Shading a kernel density plot between two points.,3494593,50526344,False,6.594170e-01,Points with density gradient
4,Shading a kernel density plot between two points.,3494593,8808751,False,6.652752e-01,Difference between two density plots
...,...,...,...,...,...,...
95,inheritance from str or int,2673651,42359156,False,7.071114e-01,Class inheritance
96,inheritance from str or int,2673651,20604142,False,7.118337e-01,Python Inheritance With No Arguments
97,inheritance from str or int,2673651,65127155,False,7.207956e-01,inheritance with the extention in python
98,inheritance from str or int,2673651,48465797,True,7.299086e-01,Inherited class of int doesn't take additional arguments


In [76]:
len(pdf_related)


6114

In [77]:
df_res = evaluate_relevancy_hits(pdf_related.iloc[0:50] , search_params=search_params)

Encoding 50 vectors


100%|██████████| 5/5 [00:00<00:00,  5.16it/s]


formatted response


In [78]:
df_res = evaluate_relevancy_hits(pdf_related , search_params=search_params)

Encoding 6114 vectors


100%|██████████| 612/612 [02:02<00:00,  4.99it/s]


formatted response


In [79]:

df_agg_res  = df_res.groupby(['query_id'], as_index=False).apply (lambda x: pd.Series(metrics_utils.all_metrics(x['is_relevant'])))


In [80]:
df_agg_res

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
0,972,1.0,0.6,0.4,1.0,0.775000
1,8948,1.0,0.6,0.3,1.0,0.682967
2,20794,1.0,0.4,0.2,1.0,1.000000
3,32404,1.0,0.4,0.2,1.0,0.550000
4,32899,1.0,0.6,0.4,1.0,0.892857
...,...,...,...,...,...,...
6109,71792480,1.0,0.2,0.1,1.0,1.000000
6110,71992622,1.0,0.4,0.2,1.0,1.000000
6111,72050038,1.0,0.2,0.1,1.0,1.000000
6112,72369460,1.0,0.2,0.1,1.0,1.000000


In [81]:
df_agg_res.drop(columns='query_id').agg(np.mean)

p@1     0.999673
p@5     0.289696
p@10    0.164213
mrr     0.999836
map     0.882925
dtype: float64
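
The implementation of `metrics_utils.all_metrics` is not shown in this notebook. As a rough guide to what the columns mean, here is a sketch of common definitions of precision@k, reciprocal rank, and average precision over a ranked list of relevance flags. It assumes that average precision is normalized by the number of relevant hits actually retrieved; the real helper may define things differently. The example ranking is made up.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@rank taken at each relevant position."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Made-up ranking of 20 hits with relevant results at ranks 1, 4, and 20.
relevance = [True, False, False, True] + [False] * 15 + [True]

print(precision_at_k(relevance, 1))
print(precision_at_k(relevance, 5))
print(precision_at_k(relevance, 10))
print(reciprocal_rank(relevance))
print(average_precision(relevance))
```

Averaging the per-query reciprocal rank and average precision over all queries gives the `mrr` and `map` figures.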

#### Params 2

In [82]:
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 20}
}

In [83]:
df_res = evaluate_relevancy_hits(pdf_related , search_params=search_params)

Encoding 6114 vectors


100%|██████████| 612/612 [02:02<00:00,  5.00it/s]


formatted response


In [84]:
df_agg_res  = df_res.groupby(['query_id'], as_index=False).apply (lambda x: pd.Series(metrics_utils.all_metrics(x['is_relevant'])))


In [85]:
df_agg_res

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
0,972,1.0,0.6,0.4,1.0,0.775000
1,8948,1.0,0.6,0.3,1.0,0.643810
2,20794,1.0,0.4,0.2,1.0,1.000000
3,32404,1.0,0.4,0.2,1.0,0.550000
4,32899,1.0,0.6,0.4,1.0,0.892857
...,...,...,...,...,...,...
6109,71792480,1.0,0.2,0.1,1.0,1.000000
6110,71992622,1.0,0.4,0.2,1.0,1.000000
6111,72050038,1.0,0.2,0.1,1.0,1.000000
6112,72369460,1.0,0.2,0.1,1.0,1.000000


In [86]:
# Average each metric over all queries.
df_agg_res.drop(columns='query_id').mean()

p@1     0.999836
p@5     0.294504
p@10    0.168678
mrr     0.999918
map     0.872481
dtype: float64
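
On IVF-style indexes, `nprobe` is the number of index partitions scanned per query, so it trades search time for recall. The toy below illustrates that trade-off with plain NumPy. It is a sketch under stated assumptions, not how Milvus is implemented: the data is random, the centroids are sampled rather than trained with k-means, and every name (`ivf_search`, `exact_search`, `recall`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors, n_lists, top_k = 16, 2000, 32, 10
vectors = rng.normal(size=(n_vectors, dim)).astype(np.float32)

# Assumption: sample centroids from the data instead of running k-means.
centroids = vectors[rng.choice(n_vectors, size=n_lists, replace=False)]


def sq_l2(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise squared L2 distance between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


# Assign every vector to the partition of its nearest centroid.
assignments = sq_l2(vectors, centroids).argmin(axis=1)


def ivf_search(query: np.ndarray, nprobe: int) -> np.ndarray:
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    nearest_lists = sq_l2(query[None, :], centroids)[0].argsort()[:nprobe]
    candidates = np.flatnonzero(np.isin(assignments, nearest_lists))
    dists = sq_l2(query[None, :], vectors[candidates])[0]
    return candidates[dists.argsort()[:top_k]]


def exact_search(query: np.ndarray) -> np.ndarray:
    """Brute-force top-k over every vector, used as ground truth."""
    return sq_l2(query[None, :], vectors)[0].argsort()[:top_k]


queries = rng.normal(size=(50, dim)).astype(np.float32)


def recall(nprobe: int) -> float:
    """Share of the exact top-k neighbours that the partitioned search finds."""
    found = sum(
        len(set(ivf_search(q, nprobe)) & set(exact_search(q))) for q in queries
    )
    return found / (len(queries) * top_k)


recalls = {nprobe: recall(nprobe) for nprobe in (1, 4, 16, n_lists)}
print(recalls)
```

Scanning every partition (`nprobe == n_lists`) reduces to the brute-force search, so recall reaches 1.0 at the cost of touching every vector.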

In [99]:
df_agg_res [  (df_agg_res['p@1'] < 1) & (df_agg_res['p@5'] >= 0.4) ]

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
3144,26767591,0.0,0.8,0.6,0.5,0.693142


In [118]:
query_id = 6668963

In [119]:
df_res [ df_res.query_id==query_id ].iloc[0]['query']

'How to prevent ifelse() from turning Date objects into numeric objects'

In [120]:
df_res [ df_res.query_id==query_id ][['doc_title','is_relevant'] ]

Unnamed: 0,doc_title,is_relevant
20200,How to prevent ifelse() from turning Date objects into numeric objects,True
20201,Is ifelse() coercing datetimes to numeric?,True
20202,mutating a new variable with ifelse() loses date format,True
20203,How do I stop implicit date conversion when using ifelse with date time data?,True
20204,R ifelse avoiding change in date format,True
20205,using ifelse with Dates in R,True
20206,ifelse Statement Returning Number Instead Of Date,True
20207,R- date time variable loses format after ifelse,True
20208,Replace nested ifelse while working with dates in R,False
20209,Ifelse function and date handling when using lubridate,True


## Cleanup

In [87]:
connections.disconnect('default')

In [88]:
connections.list_connections()

[('default', None)]

In [106]:
df_agg_res.to_parquet("../tmp/df_agg_res__faiss.parquet", index=False)
df_agg_res.head()

Unnamed: 0,query_id,p@1,p@5,p@10,mrr,map
0,972,1.0,0.6,0.4,1.0,0.775
1,8948,1.0,0.6,0.3,1.0,0.64381
2,20794,1.0,0.4,0.2,1.0,1.0
3,32404,1.0,0.4,0.2,1.0,0.55
4,32899,1.0,0.6,0.4,1.0,0.892857


In [107]:
df_res.to_parquet("../tmp/df_res__faiss.parquet", index=False)
df_res.head()

Unnamed: 0,query,query_id,doc_id,is_relevant,score,doc_title
0,Shading a kernel density plot between two points.,3494593,3494593,True,3.698816e-13,Shading a kernel density plot between two points.
1,Shading a kernel density plot between two points.,3494593,27294822,False,0.2057712,Shading a kernel density estimate between two points - with transparency
2,Shading a kernel density plot between two points.,3494593,60270301,False,0.5561023,Kernel Density Plots and Histogram overlay
3,Shading a kernel density plot between two points.,3494593,7787114,True,0.6285322,polygon in density plot?
4,Shading a kernel density plot between two points.,3494593,50526344,False,0.659417,Points with density gradient


## Comparison

In [126]:
df_agg_faiss = pd.read_parquet("../tmp/df_agg_res__faiss.parquet")
df_agg_es = pd.read_parquet("../tmp/df_agg_res__elasticsearch.parquet")

df_both = pd.merge(df_agg_faiss, df_agg_es, on="query_id",suffixes=('_faiss', '_es') )

df_res_es = pd.read_parquet("../tmp/df_res__elasticsearch.parquet")

In [127]:
df_both

Unnamed: 0,query_id,p@1_faiss,p@5_faiss,p@10_faiss,mrr_faiss,map_faiss,p@1_es,p@5_es,p@10_es,mrr_es,map_es
0,972,1.0,0.6,0.4,1.0,0.775000,1.0,0.2,0.3,1.0,0.449060
1,8948,1.0,0.6,0.3,1.0,0.643810,1.0,0.4,0.3,1.0,0.666667
2,20794,1.0,0.4,0.2,1.0,1.000000,1.0,0.2,0.1,1.0,1.000000
3,32404,1.0,0.4,0.2,1.0,0.550000,1.0,0.4,0.3,1.0,0.591667
4,32899,1.0,0.6,0.4,1.0,0.892857,1.0,0.8,0.4,1.0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...
6109,71792480,1.0,0.2,0.1,1.0,1.000000,1.0,0.2,0.1,1.0,1.000000
6110,71992622,1.0,0.4,0.2,1.0,1.000000,1.0,0.4,0.2,1.0,1.000000
6111,72050038,1.0,0.2,0.1,1.0,1.000000,1.0,0.2,0.1,1.0,1.000000
6112,72369460,1.0,0.2,0.1,1.0,1.000000,1.0,0.2,0.1,1.0,0.558824


In [128]:
df_both [ df_both['p@5_faiss'] > 2 *df_both['p@5_es']].sort_values(['p@5_faiss'])

Unnamed: 0,query_id,p@1_faiss,p@5_faiss,p@10_faiss,mrr_faiss,map_faiss,p@1_es,p@5_es,p@10_es,mrr_es,map_es
1761,13990465,1.0,0.2,0.1,1.0,1.000000,0.0,0.0,0.1,0.166667,0.166667
5673,58483210,1.0,0.2,0.1,1.0,1.000000,0.0,0.0,0.1,0.100000,0.100000
4499,41970582,1.0,0.2,0.2,1.0,0.666667,0.0,0.0,0.1,0.166667,0.166667
5567,56104377,1.0,0.2,0.1,1.0,1.000000,0.0,0.0,0.1,0.142857,0.142857
5952,65404788,1.0,0.2,0.1,1.0,1.000000,0.0,0.0,0.0,0.083333,0.083333
...,...,...,...,...,...,...,...,...,...,...,...
928,6668963,1.0,1.0,0.9,1.0,0.956277,1.0,0.4,0.3,1.000000,0.722222
455,3099219,1.0,1.0,0.9,1.0,0.921950,1.0,0.4,0.3,1.000000,0.547503
3706,32988099,1.0,1.0,0.8,1.0,0.939577,1.0,0.2,0.3,1.000000,0.443056
5435,53902507,1.0,1.0,0.5,1.0,0.852381,1.0,0.4,0.5,1.000000,0.645833


In [129]:
df_res [ df_res.query_id==6668963 ][['doc_title','is_relevant'] ].head()

Unnamed: 0,doc_title,is_relevant
20200,How to prevent ifelse() from turning Date objects into numeric objects,True
20201,Is ifelse() coercing datetimes to numeric?,True
20202,mutating a new variable with ifelse() loses date format,True
20203,How do I stop implicit date conversion when using ifelse with date time data?,True
20204,R ifelse avoiding change in date format,True


In [130]:
df_res_es[ df_res_es.query_id==6668963 ][['doc_title','is_relevant'] ].head()

Unnamed: 0,doc_title,is_relevant
20200,How to prevent ifelse() from turning Date objects into numeric objects,True
20201,How to iterate over list of Dates without coercion to numeric?,False
20202,R Why ifelse() changes datatype,True
20203,"In R, why does subtracting numerics from NA return NA but subtracting dates from NA return an error?",False
20204,to_python() and from_db_value() methods overlapping in function?,False


### Limitations

- Can't add new fields
- Order of fields matters
- Field size matters

**Can vectors with duplicate primary keys be inserted into Milvus?**    
Yes. Milvus does not check if vector primary keys are duplicates.


**When vectors with duplicate primary keys are inserted, does Milvus treat it as an update operation?**    
No. Milvus does not currently support update operations and does not check whether entity primary keys are duplicates. You are responsible for ensuring that entity primary keys are unique; if they aren't, Milvus may contain multiple entities with the same primary key.

If this occurs, it is undefined which copy of the data a query returns. The Milvus FAQ states that this limitation will be fixed in future releases.

https://milvus.io/docs/product_faq.md#Can-vectors-with-duplicate-primary-keys-be-inserted-into-Milvus
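
Since Milvus does not reject duplicate primary keys, one option is to deduplicate on the client before inserting. The block below is a minimal sketch with pandas, assuming the entities sit in a DataFrame. The helper `drop_duplicate_keys` and the column name `post_id` are hypothetical and do not come from this notebook.

```python
from typing import Tuple

import pandas as pd


def drop_duplicate_keys(df: pd.DataFrame, key: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Keep the first row for each primary key and return the dropped rows too."""
    dropped = df[df.duplicated(subset=key, keep="first")]
    unique = df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
    return unique, dropped


# Made-up rows: post_id 972 appears twice.
toy_posts = pd.DataFrame(
    {
        "post_id": [972, 8948, 972],
        "title": ["first copy", "another post", "second copy"],
    }
)

unique_posts, dropped_posts = drop_duplicate_keys(toy_posts, "post_id")
print(unique_posts)
print(dropped_posts)
```

This only catches duplicates within the batch being inserted; keys that already exist in the collection would need a separate check.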

## References

https://github.com/milvus-io/pymilvus/blob/master/examples/hello_milvus.ipynb


https://milvus.io/tools/sizing/