# Using PyMilvus's Model To Generate Text Embeddings
Quickly enhance your search capabilities with text embeddings using PyMilvus (version higher than 2.4.0). This guide shows how to use PyMilvus models to extract rich text embeddings, laying the foundation for powerful search functionality.  
In this doc, we will go through **dense embedding** models, **sparse embedding** models, and **hybrid** models to show how to use them in action.  
Let's begin by installing the dependencies (you could also use `virtualenv` to create a new environment):

In [None]:
! pip install "pymilvus[model]"

For most use cases, you can generate embeddings for storage or retrieval by simply using `ef(texts)`. However, when you need different processing for queries and documents, you could use two specific functions. Documents are processed with `encode_documents` to generate their embeddings, which are then stored in the vector database. For retrieval, the query is processed using `encode_queries` to create its embedding, which is then used to search the database.
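The store-then-search workflow described above can be sketched without any model download. `ToyEmbeddingFunction` below is a hypothetical stand-in written only for this illustration (it is not a PyMilvus class); it exposes the same three entry points so the index-time and query-time roles are visible.

```python
import math
from typing import List


class ToyEmbeddingFunction:
    """Hypothetical stand-in for an embedding function; not part of PyMilvus."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Toy featurizer: hash each word into one of `dim` buckets, then L2-normalize.
        vec = [0.0] * self.dim
        for word in text.lower().split():
            vec[sum(ord(c) for c in word) % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def __call__(self, texts: List[str]) -> List[List[float]]:
        # Generic path: identical processing for any text.
        return [self._embed(t) for t in texts]

    def encode_documents(self, documents: List[str]) -> List[List[float]]:
        # In this toy the document path equals the generic path; real models may differ.
        return self(documents)

    def encode_queries(self, queries: List[str]) -> List[List[float]]:
        # In this toy the query path equals the generic path; real models may differ.
        return self(queries)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


ef = ToyEmbeddingFunction()
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# Index time: embed the documents and keep them next to their text.
stored = list(zip(docs, ef.encode_documents(docs)))

# Query time: embed the query and rank stored documents by similarity.
query_vec = ef.encode_queries(["Alan Turing research"])[0]
ranked = sorted(stored, key=lambda item: dot(query_vec, item[1]), reverse=True)
print("best match:", ranked[0][0])
```

In a real pipeline the list `stored` would be replaced by a vector database collection, and the toy featurizer by one of the embedding functions introduced below.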

**Dense embedding** is a technique used in natural language processing to represent words or phrases as continuous, dense vectors in a high-dimensional space, capturing semantic relationships.

## OpenAI Embedding Function
OpenAI offers dense [embedding services](https://platform.openai.com/docs/guides/embeddings), but to access them, users must sign up and obtain an API key. With the API key properly set in your environment variables, you can start using tools like OpenAIEmbeddingFunction to generate dense embeddings.


In [1]:
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

from pymilvus import model

# initialize using 'text-embedding-3-large'
openai_ef = model.dense.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-large", # Specify the model name
    dimensions=512 # Set the embedding dimensionality according to MRL feature.
)

# Get the embeddings the general way.
queries = docs
queries_embeddings = openai_ef(queries)
docs_embeddings = openai_ef(docs)

# Get the embeddings with the query- and document-specific methods.
queries_embeddings = openai_ef.encode_queries(queries)
docs_embeddings = openai_ef.encode_documents(docs)

# now we can check the dimension of embedding from results and the embedding function.
print("dense dim:", openai_ef.dim, queries_embeddings[0].shape)
print("dense dim:", openai_ef.dim, docs_embeddings[0].shape)

dense dim: 512 (512,)
dense dim: 512 (512,)


When using `OpenAIEmbeddingFunction`, `encode_queries` and `encode_documents` follow exactly the same procedure, so you can use `openai_ef(texts)` instead.  
- `openai_ef(texts)`: same as the other two functions.
- `openai_ef.encode_queries(queries)`: same as the other two functions.
- `openai_ef.encode_documents(documents)`: same as the other two functions.
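
The `dimensions=512` argument in the cell above refers to the MRL (Matryoshka Representation Learning) feature. As a rough sketch of the idea, and assuming that an MRL-trained model concentrates the most useful information in the leading coordinates, a longer embedding can be shortened by truncating it and re-normalizing. The helper `shorten_embedding` is hypothetical and only illustrates the arithmetic; it is not how the OpenAI service is implemented.

```python
import math
from typing import List


def shorten_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    truncated = vec[:dim]
    norm = math.sqrt(sum(v * v for v in truncated))
    if norm == 0.0:
        return list(truncated)
    return [v / norm for v in truncated]


full = [0.5, 0.5, 0.5, 0.5]          # toy unit-length "embedding"
short = shorten_embedding(full, 2)   # keep the first two coordinates
print(len(short), short)
```

Re-normalizing matters when similarity is computed with a dot product, because a truncated vector no longer has unit length.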


Additionally, you may initialize the OpenAIEmbeddingFunction by directly providing OpenAI's official parameters, such as api_key and base_url, as part of the function's configuration.

In [4]:
# initialize using api_key directly.
openai_ef = model.dense.OpenAIEmbeddingFunction(model_name="text-embedding-3-small", api_key="sk-api-key")
# get the embeddings
queries_embeddings = openai_ef.encode_queries(queries)
docs_embeddings = openai_ef.encode_documents(docs)
print("dense dim:", openai_ef.dim, queries_embeddings[0].shape)
print("dense dim:", openai_ef.dim, docs_embeddings[0].shape)

dense dim: 1536 (1536,)
dense dim: 1536 (1536,)


## Sentence Transformer Embedding Function

In addition to hosted services like OpenAI, there exists a variety of powerful open-source dense embedding models. For these, the SentenceTransformerEmbeddingFunction based on [Sentence-Transformer](https://www.sbert.net/) can be utilized to extract text embeddings.

In [2]:
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
    "The Turing Test, proposed by Alan Turing, is a measure of a machine's ability to exhibit intelligent behavior.",
    "Deep learning is a subset of machine learning in artificial intelligence that has networks capable of learning unsupervised from data that is unstructured or unlabeled.",
    "The concept of neural networks, which are vital to deep learning algorithms, was inspired by the understanding of the human brain's structure and function.",
    "Artificial intelligence applications range from natural language processing to expert systems, and from automated reasoning to machine learning.",
    "The development of quantum computing holds the potential to drastically increase the processing power available for artificial intelligence systems.",
    "In the field of robotics, artificial intelligence is used to enable autonomous decision-making by robots in complex environments.",
    "Ethical considerations in AI research and application are becoming increasingly important as the technology advances and becomes more integrated into daily life.",
    "Reinforcement learning, a type of machine learning algorithm, enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.",
    "AI has the potential to revolutionize industries by optimizing processes, enhancing decision-making, and creating new opportunities for innovation."
]

from pymilvus import model

# initialize the SentenceTransformerEmbeddingFunction
sentence_transformer_ef = model.dense.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2", # Specify the model name
    device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

# Get the embeddings the general way.
queries = docs
queries_embeddings = sentence_transformer_ef(queries)
docs_embeddings = sentence_transformer_ef(docs)

# Get the embeddings with the query- and document-specific methods.
queries_embeddings = sentence_transformer_ef.encode_queries(queries)
docs_embeddings = sentence_transformer_ef.encode_documents(docs)

print("dense dim:", sentence_transformer_ef.dim, queries_embeddings[0].shape)
print("dense dim:", sentence_transformer_ef.dim, docs_embeddings[0].shape)

dense dim: 384 (384,)
dense dim: 384 (384,)


When using `SentenceTransformerEmbeddingFunction`, `encode_queries` and `encode_documents` prepend **query_instruction** and **doc_instruction** respectively; otherwise the three calls behave the same.  
- `sentence_transformer_ef(texts)`: prepend nothing, just process the raw text.
- `sentence_transformer_ef.encode_queries(queries)`: prepend the **query_instruction** to each query.
- `sentence_transformer_ef.encode_documents(documents)`: prepend the **doc_instruction** to each document.
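
The prepending step in the list above can be pictured as a small text transformation that runs before the model sees the input. The sketch below assumes plain string concatenation; the helper name `with_instruction` is hypothetical, and the exact joining used inside the library may differ.

```python
from typing import List


def with_instruction(instruction: str, texts: List[str]) -> List[str]:
    """Prepend an instruction to every text (assumed: simple concatenation)."""
    return [instruction + text for text in texts]


query_instruction = "Represent this sentence for searching relevant passages: "
queries = ["Who was Alan Turing?"]

# What the model would receive for a query versus for the raw text.
print(with_instruction(query_instruction, queries))
print(with_instruction("", queries))  # empty instruction leaves the text unchanged
```

Under this assumption, whether the instruction string ends with a space affects what the model receives, so it is worth checking how the chosen model expects its instruction to be formatted.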

Additionally, the initialization of SentenceTransformerEmbeddingFunction may incorporate features from Sentence Transformer, such as specifying parameters like the `batch_size`. Some models require adding an instruction before the actual text input. 

In [6]:
# BAAI/bge-small-en-v1.5 suggests prepending an instruction when generating embeddings.
sentence_transformer_ef = model.dense.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-small-en-v1.5",
    device="cpu",
    batch_size=8,
    query_instruction="Represent this sentence for searching relevant passages:",
    doc_instruction="Represent this sentence for searching relevant passages:",
)
queries_embeddings = sentence_transformer_ef.encode_queries(docs)
docs_embeddings = sentence_transformer_ef.encode_documents(docs)

print("dense dim:", sentence_transformer_ef.dim, queries_embeddings[0].shape)
print("dense dim:", sentence_transformer_ef.dim, docs_embeddings[0].shape)

dense dim: 384 (384,)
dense dim: 384 (384,)


**Sparse embedding** represents words or phrases using vectors where most elements are zero, with each non-zero element indicating the presence (and typically the weight) of a specific token in the vocabulary. Sparse embedding models are efficient and interpretable, making them suitable for tasks where exact term matches are crucial.
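
To make the definition concrete, here is a minimal sketch that stores a sparse vector as an `{index: weight}` dictionary over a tiny, made-up vocabulary. The vocabulary, the helper names, and the use of raw term counts as weights are all assumptions for illustration; the embedding functions below return SciPy sparse arrays with learned or statistical weights instead.

```python
from typing import Dict, List

# Hypothetical six-word vocabulary mapping each token to a vector position.
vocab = {"alan": 0, "turing": 1, "london": 2, "research": 3, "england": 4, "ai": 5}


def to_sparse(tokens: List[str], vocab: Dict[str, int]) -> Dict[int, float]:
    """Count known tokens; positions that never occur are simply absent (zero)."""
    vec: Dict[int, float] = {}
    for tok in tokens:
        if tok in vocab:
            idx = vocab[tok]
            vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Only positions that are non-zero in both vectors contribute."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


doc = to_sparse("alan turing research ai".split(), vocab)
query = to_sparse("turing london".split(), vocab)
print(doc, query, sparse_dot(query, doc))
```

Because only shared non-zero positions contribute to the score, a match can be traced back to specific vocabulary entries, which is where the interpretability comes from.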

## Splade Embedding Function

[SPLADE](https://arxiv.org/abs/2109.10086) embedding is a model that offers highly sparse representations for documents and queries, inheriting desirable properties from bag-of-words (BOW) models such as exact term matching and efficiency. You can use a SPLADE model easily with `SpladeEmbeddingFunction`.


In [3]:
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
    "The Turing Test, proposed by Alan Turing, is a measure of a machine's ability to exhibit intelligent behavior.",
]
  
from pymilvus.model.sparse import SpladeEmbeddingFunction

# default using model_name naver/splade-cocondenser-ensembledistil. 
# other valid options: 
# - naver/splade_v2_max
# - naver/splade_v2_distil
# - naver/splade-cocondenser-selfdistil.
splade_ef = SpladeEmbeddingFunction()

queries = docs

# Get the embeddings the general way.
queries_embeddings = splade_ef(queries)
docs_embeddings = splade_ef(docs)

# Get the embeddings with the query- and document-specific methods.
queries_embeddings = splade_ef.encode_queries(queries)
docs_embeddings = splade_ef.encode_documents(docs)

# since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse dim:", splade_ef.dim, list(queries_embeddings)[0].shape)
print("sparse dim:", splade_ef.dim, list(docs_embeddings)[0].shape)


sparse dim: 30522 (1, 30522)
sparse dim: 30522 (1, 30522)


When using `SpladeEmbeddingFunction`, `encode_queries` and `encode_documents` prepend **query_instruction** and **doc_instruction** respectively. In addition, **k_tokens_query** prunes the query results and **k_tokens_document** prunes the document results.  
- `splade_ef(texts)`: prepend nothing, and do not prune the results.
- `splade_ef.encode_queries(queries)`: prepend the **query_instruction** to each query, and use **k_tokens_query** to prune the query results.
- `splade_ef.encode_documents(documents)`: prepend the **doc_instruction** to each document, and use **k_tokens_document** to prune the document results.


By default, the model outputs the result directly. However, in some situations you may want to retain only the top k largest values in each embedding. In such cases, set the parameters `k_tokens_query` and `k_tokens_document` for queries and documents, respectively.

In [12]:
#Initialize the SpladeEmbeddingFunction retaining the top 64 tokens for queries and the top 128 tokens for documents.
splade_ef = SpladeEmbeddingFunction(device="cpu", k_tokens_query=64, k_tokens_document=128)
    
queries_embeddings = splade_ef.encode_queries(queries)
docs_embeddings = splade_ef.encode_documents(docs)

print("sparse dim:", splade_ef.dim, list(queries_embeddings)[0].shape)
print("sparse dim:", splade_ef.dim, list(docs_embeddings)[0].shape)

print("query embedding non-zero elements:", list(queries_embeddings)[0].nnz)
print("document embedding non-zero elements:", list(docs_embeddings)[0].nnz)


sparse dim: 30522 (1, 30522)
sparse dim: 30522 (1, 30522)
query embedding non-zero elements: 64
document embedding non-zero elements: 128
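
The effect of `k_tokens_query` and `k_tokens_document` can be pictured as keeping only the k largest weights of a sparse vector and dropping the rest. The sketch below works on a plain `{index: weight}` dictionary with made-up values; `keep_top_k` is a hypothetical helper and not the library's implementation.

```python
import heapq
from typing import Dict


def keep_top_k(sparse_vec: Dict[int, float], k: int) -> Dict[int, float]:
    """Return a new sparse vector holding only the k largest weights."""
    top = heapq.nlargest(k, sparse_vec.items(), key=lambda item: item[1])
    return dict(top)


# Made-up token ids and weights, for illustration only.
weights = {101: 0.2, 2054: 1.7, 3000: 0.9, 4500: 0.05, 7592: 1.1}
pruned = keep_top_k(weights, 3)
print(sorted(pruned))  # ids of the three highest-weighted tokens
```

Pruning trades a little recall for smaller vectors, which is why different limits for queries and documents can make sense.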


## BM25 Embedding Function
[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a ranking function used in information retrieval to estimate the relevance of documents to a given search query. It enhances the basic term frequency approach by incorporating document length normalization and term frequency saturation. BM25 can generate sparse embeddings by representing documents as vectors of term importance scores, allowing for efficient retrieval and ranking in sparse vector spaces.


In [4]:
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

# there are some built-in analyzers for several languages, now we use 'en' for English.
analyzer = build_default_analyzer(language="en")

corpus = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print("tokens:", tokens)



tokens: ['artifici', 'intellig', 'found', 'academ', 'disciplin', '1956']
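
Judging from the output above, the built-in English analyzer lowercases the text, drops punctuation and common stop words, and stems the remaining words. The first three steps can be sketched with the standard library alone; the stop-word list here is a small hypothetical sample, and stemming is left out, so the tokens stay unstemmed.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for this illustration.
STOPWORDS = {"a", "an", "the", "was", "as", "in", "to", "of", "is"}


def toy_analyzer(text: str) -> List[str]:
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(toy_analyzer("Artificial intelligence was founded as an academic discipline in 1956."))
# ['artificial', 'intelligence', 'founded', 'academic', 'discipline', '1956']
```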


The BM25 algorithm processes text by first breaking it into tokens using a built-in analyzer, as shown with English language tokens like 'artifici', 'intellig', and 'academ'. It then gathers statistics on these tokens, evaluating their frequency and distribution across documents. The core of BM25 calculates the relevance score of each token based on its importance, with rarer tokens receiving higher scores. This concise process enables effective ranking of documents by relevance to a query. 

So we need to fit the model on a dataset (or corpus) to gather these statistics.



In [5]:
bm25_ef = BM25EmbeddingFunction(analyzer)

# Fit the model on the corpus to get the statistics of the corpus.
bm25_ef.fit(corpus)
docs = [
    "The field of artificial intelligence was established as an academic subject in 1956.",
    "Alan Turing was the pioneer in conducting significant research in artificial intelligence.",
    "Originating in Maida Vale, London, Turing grew up in the southern regions of England.",
    "In 1956, artificial intelligence emerged as a scholarly field.",
    "Turing, originally from Maida Vale, London, was brought up in the south of England."
]
queries = docs

# Get the embeddings with the query- and document-specific methods.
queries_embeddings = bm25_ef.encode_queries(queries)
docs_embeddings = bm25_ef.encode_documents(docs)

# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse dim:", bm25_ef.dim, list(queries_embeddings)[0].shape)
print("sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)


sparse dim: 21 (1, 21)
sparse dim: 21 (1, 21)


When using `BM25EmbeddingFunction`, `encode_queries` and `encode_documents` are not mathematically interchangeable, and `bm25_ef(texts)` is not implemented.  
- `bm25_ef(texts)`: is not available.
- `bm25_ef.encode_queries(queries)`: has its own distinct implementation.
- `bm25_ef.encode_documents(documents)`: has its own distinct implementation.
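
One way to see why the two encoders cannot be swapped is to split the standard BM25 score into a document-side factor (saturated, length-normalized term frequency) and a query-side factor (inverse document frequency), so that their sparse dot product reproduces the score. The class below is a sketch of that general idea with commonly used defaults `k1=1.5` and `b=0.75`; `ToyBM25` is hypothetical, and the sketch does not claim to mirror PyMilvus's internals.

```python
import math
from collections import Counter
from typing import Dict, List


class ToyBM25:
    """Hypothetical BM25 encoder over pre-tokenized text; not the PyMilvus class."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b

    def fit(self, corpus: List[List[str]]) -> None:
        # Gather corpus statistics: vocabulary, document frequency, average length.
        self.avgdl = sum(len(doc) for doc in corpus) / len(corpus)
        self.vocab: Dict[str, int] = {}
        doc_freq: Counter = Counter()
        for doc in corpus:
            for tok in sorted(set(doc)):
                doc_freq[tok] += 1
                self.vocab.setdefault(tok, len(self.vocab))
        n_docs = len(corpus)
        self.idf = {
            tok: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for tok, n in doc_freq.items()
        }

    def encode_document(self, tokens: List[str]) -> Dict[int, float]:
        # Document side: term frequency with saturation and length normalization.
        tf = Counter(t for t in tokens if t in self.vocab)
        length_norm = 1.0 - self.b + self.b * len(tokens) / self.avgdl
        return {
            self.vocab[t]: f * (self.k1 + 1.0) / (f + self.k1 * length_norm)
            for t, f in tf.items()
        }

    def encode_query(self, tokens: List[str]) -> Dict[int, float]:
        # Query side: one IDF weight per distinct known token.
        return {self.vocab[t]: self.idf[t] for t in set(tokens) if t in self.vocab}


corpus = [["alan", "turing", "research"], ["turing", "london"], ["ai", "research", "research"]]
bm25 = ToyBM25()
bm25.fit(corpus)

doc_vec = bm25.encode_document(corpus[2])
query_vec = bm25.encode_query(["research"])
score = sum(w * doc_vec.get(i, 0.0) for i, w in query_vec.items())
print("vocabulary size:", len(bm25.vocab), "score:", round(score, 4))
```

In this sketch the vocabulary size plays the role of the sparse dimension, which would be consistent with the small `dim` printed above for a three-sentence corpus.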

Fitting data each time is time-consuming; we offer save and load features for efficiency.

In [6]:
bm25_ef.save("example.json")
new_bm25_ef = BM25EmbeddingFunction(analyzer)
new_bm25_ef.load("example.json")

queries_embeddings = new_bm25_ef.encode_queries(queries)
docs_embeddings = new_bm25_ef.encode_documents(docs)

# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse dim:", new_bm25_ef.dim, list(queries_embeddings)[0].shape)
print("sparse dim:", new_bm25_ef.dim, list(docs_embeddings)[0].shape)


sparse dim: 21 (1, 21)
sparse dim: 21 (1, 21)



Calculating statistics on a corpus whose distribution is similar to your data is crucial for accurate results. When no such corpus is available, we provide pre-built statistics fitted on the MS MARCO dataset, for English only. Loading them downloads a pre-built JSON file to a local path.

In [7]:
prebuilt_bm25_ef = BM25EmbeddingFunction(analyzer)
# load the pre-built json file without fitting the corpus.
prebuilt_bm25_ef.load()
queries_embeddings = prebuilt_bm25_ef.encode_queries(queries)
docs_embeddings = prebuilt_bm25_ef.encode_documents(docs)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse dim:", prebuilt_bm25_ef.dim, list(queries_embeddings)[0].shape)
print("sparse dim:", prebuilt_bm25_ef.dim, list(docs_embeddings)[0].shape)

path is None, using default bm25_msmarco_v1.json.


sparse dim: 3889385 (1, 3889385)
sparse dim: 3889385 (1, 3889385)


Finally, let's show analyzers for other languages.

In [8]:
de_text = "Alan Turing war die erste Person, die umfangreiche Forschungen im Bereich der KI durchführte."
fr_text = "Alan Turing était la première personne à mener des recherches approfondies en IA."
zh_text = "艾伦·图灵是第一个进行人工智能领域深入研究的人。"

de_analyzer = build_default_analyzer(language="de")
fr_analyzer = build_default_analyzer(language="fr")
zh_analyzer = build_default_analyzer(language="zh")

de_tokens = de_analyzer(de_text)
fr_tokens = fr_analyzer(fr_text)
zh_tokens = zh_analyzer(zh_text)

print("de_tokens", de_tokens)
print("fr_tokens", fr_tokens)
print("zh_tokens", zh_tokens)


Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache


Loading model cost 0.627 seconds.
Prefix dict has been built successfully.


de_tokens ['alan', 'turing', 'erst', 'person', 'umfangreich', 'forschung', 'bereich', 'ki', 'durchfuhrt']
fr_tokens ['alan', 'turing', 'premi', 'person', 'men', 'recherch', 'approfond', 'ia']
zh_tokens ['艾伦', '图灵', '第一个', '人工智能', '领域', '深入研究', '人']



In the realm of embedding models, hybrid architectures exist that generate both dense and sparse embeddings. We refer to these models as hybrid models and introduce BGE-M3 as an example of such a model.
## BGE-M3 Embedding Function



The [BGE-M3](https://arxiv.org/abs/2402.03216) is named for its capabilities in Multi-Linguality, Multi-Functionality, and Multi-Granularity. Capable of supporting over 100 languages, BGE-M3 sets new benchmarks in multi-lingual and cross-lingual retrieval tasks. Its unique ability to perform dense retrieval, multi-vector retrieval, and sparse retrieval within a single framework makes it an ideal choice for a wide range of information retrieval (IR) applications.  

(Caution: before running this section, restart the Jupyter Python kernel first.)

In [1]:
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
    "The Turing Test, proposed by Alan Turing, is a measure of a machine's ability to exhibit intelligent behavior.",
]
  
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
import numpy as np

# please set the use_fp16 to False when you are using cpu.
# by default the return options is:
#  return_dense True
#  return_sparse True
#  return_colbert_vecs False 
bge_m3_ef = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3', # Specify the model name
    device='cpu', # Specify the device to use, e.g., 'cpu' or 'cuda:0'
    use_fp16=False # Specify whether to use fp16. Set to `False` if `device` is `cpu`.
)
queries = docs

# Get the embeddings the general way.
queries_embeddings = bge_m3_ef(queries)
docs_embeddings = bge_m3_ef(docs)

# Get the embeddings with the query- and document-specific methods.
queries_embeddings = bge_m3_ef.encode_queries(queries)
docs_embeddings = bge_m3_ef.encode_documents(docs)

print("dense query dim:", bge_m3_ef.dim["dense"], queries_embeddings["dense"][0].shape)
print("dense document dim:", bge_m3_ef.dim["dense"], docs_embeddings["dense"][0].shape)

# Since the sparse embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse query dim:", bge_m3_ef.dim["sparse"], list(queries_embeddings["sparse"])[0].shape)
print("sparse document dim:", bge_m3_ef.dim["sparse"], list(docs_embeddings["sparse"])[0].shape)


loading existing colbert_linear and sparse_linear---------
dense query dim: 1024 (1024,)
dense document dim: 1024 (1024,)
sparse query dim: 250002 (1, 250002)
sparse document dim: 250002 (1, 250002)


When using `BGEM3EmbeddingFunction`, `encode_queries` and `encode_documents` follow exactly the same procedure, so you can use `bge_m3_ef(texts)` instead.  
- `bge_m3_ef(texts)`: same as the other two functions.
- `bge_m3_ef.encode_queries(queries)`: same as the other two functions.
- `bge_m3_ef.encode_documents(documents)`: same as the other two functions.


Although BGE-M3 is capable of generating both dense and sparse embeddings simultaneously, it can also be configured to function as a standard dense or sparse embedding generator by adjusting its return options.

In [2]:
# use bge-m3 as a dense embedding.
bge_m3_ef = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3', # Specify the model name
    device='cpu', # Specify the device to use, e.g., 'cpu' or 'cuda:0'
    use_fp16=False, # Specify whether to use fp16. Set to `False` if `device` is `cpu`.
    return_sparse=False # only allow the dense embedding output
)
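queries = docs
# Re-encode the queries with the dense-only function so the printed shape
# does not come from a stale variable left over from the previous cell.
queries_embeddings = bge_m3_ef.encode_queries(queries)
docs_embeddings = bge_m3_ef.encode_documents(docs)
print("dense query dim:", bge_m3_ef.dim["dense"], queries_embeddings["dense"][0].shape)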


loading existing colbert_linear and sparse_linear---------
dense query dim: 1024 (1024,)



Similarly, we can utilize the BGEM3EmbeddingFunction as a sparse embedding function.






In [3]:
# use bge-m3 as a sparse embedding.
bge_m3_ef = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3', # Specify the model name
    device='cpu', # Specify the device to use, e.g., 'cpu' or 'cuda:0'
    use_fp16=False, # Specify whether to use fp16. Set to `False` if `device` is `cpu`.
    return_dense=False # only allow the sparse embedding output
)
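queries = docs
# Re-encode the queries with the sparse-only function so the printed shape
# does not come from a stale variable left over from an earlier cell.
queries_embeddings = bge_m3_ef.encode_queries(queries)
docs_embeddings = bge_m3_ef.encode_documents(docs)
# Since the sparse embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
print("sparse query dim:", bge_m3_ef.dim["sparse"], list(queries_embeddings["sparse"])[0].shape)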


loading existing colbert_linear and sparse_linear---------
sparse query dim: 250002 (1, 250002)
