# Custom embeddings

This notebook shows how to use custom embeddings in Lilac.

To use a custom embedding, you have to register an embedding function with Lilac, but you do not have to compute the embeddings for the entire dataset inside Lilac. Embeddings that already live in an existing vector store can be loaded with [`Dataset.load_embedding`](https://docs.lilacml.com/api_reference/data.html#lilac.data.Dataset.load_embeddings).

For more information on embeddings, see our [Embeddings](https://docs.lilacml.com/datasets/dataset_embeddings.html) guide.


In [2]:
%load_ext autoreload
%autoreload 2
import lilac as ll

ll.set_project_dir('./data')

items = [
  {'id': '0_', 'text': 'This is some fake data'},
  {'id': '1_', 'text': 'This is some more fake data'},
  {'id': '2_', 'text': 'This is even more fake data'},
  {'id': '3_', 'text': 'I love plants'},
]
# Load a fake dataset from dictionaries.
try:
  ds = ll.get_dataset('local', 'load_embedding')
except Exception as e:
  ds = ll.from_dicts('local', 'load_embedding', items)



## Register an embedding function

For embeddings to be useful in Lilac, we must be able to compute new embeddings after the dataset has been loaded.

This means we have to register an embedding function under a name, so that Lilac can call it during a semantic search (to embed the query) or a concept search (to embed the concept data).
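
To make the role of the name concrete, here is a minimal sketch of a name-to-function registry. It is not Lilac's implementation: `EMBEDDING_REGISTRY`, `register` and `toy_embed` are hypothetical names invented for this illustration, and the "embedding" is a toy character count rather than a real model.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an embedding name to the function that computes it.
EMBEDDING_REGISTRY: Dict[str, Callable[[str], List[float]]] = {}


def register(name: str, fn: Callable[[str], List[float]]) -> None:
  """Store an embedding function under a name."""
  EMBEDDING_REGISTRY[name] = fn


def toy_embed(text: str) -> List[float]:
  """Stand-in for a real model: counts of vowels, consonants and spaces."""
  vowels = sum(ch in 'aeiou' for ch in text.lower())
  consonants = sum(ch.isalpha() for ch in text) - vowels
  spaces = text.count(' ')
  return [float(vowels), float(consonants), float(spaces)]


register('toy_embedding', toy_embed)

# At search time only the name is known. The registry resolves it to a function,
# so the query is embedded in the same vector space as the stored documents.
query_vector = EMBEDDING_REGISTRY['toy_embedding']('This is some data')
print(query_vector)
```

The point of the sketch is that stored vectors alone are not enough: a search has to embed new text with the same function that produced them.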


In [9]:
import numpy as np

try:
  from sentence_transformers import SentenceTransformer
except ImportError:
  raise ImportError(
    'Could not import the "sentence_transformers" python package. '
    'Please install it with `pip install "sentence_transformers"`.'
  )

embedding_model = SentenceTransformer('thenlper/gte-small')


def _embed(text):
  # Call the gte-small embedding model.
  return np.array(embedding_model.encode(text))


# Make an embedding class.
class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def compute(self, data):
    for text in data:
      embedding = _embed(text)
      # Yield a single chunk that spans the whole text. To chunk the text instead,
      # yield a list with one chunk embedding per span.
      yield [ll.chunk_embedding(0, len(text), embedding)]


print('Testing the embedding on a single item...')
print(next(MyEmbedding().compute(['This is some text'])))

# Register the embedding under 'my_embedding' so it can be used by Lilac.
ll.register_embedding(MyEmbedding, exists_ok=True)

Testing the embedding on a single item...
[{'__span__': {'start': 0, 'end': 17}, 'embedding': array([-4.39735241e-02, -9.28446930e-03,  4.57611308e-02, -3.19548771e-02,
        8.43660533e-03,  9.48431529e-03,  5.90084903e-02,  5.59187271e-02,
       -1.78449824e-02, -5.01370244e-02,  8.36663414e-03, -5.73770069e-02,
        9.36811697e-03,  1.31201018e-02, -1.38591407e-02, -2.79680942e-03,
        1.27376560e-02, -2.13926788e-02, -5.75558171e-02,  3.24781537e-02,
        7.07704574e-02,  2.67298613e-02, -2.15108655e-02, -1.52723445e-02,
        5.37770353e-02,  2.32838802e-02, -1.45876221e-02, -3.92815508e-02,
       -8.67534149e-03, -1.66421592e-01, -7.23247370e-03,  2.60695047e-03,
        4.72562760e-02, -5.58551177e-02, -1.39868818e-03, -8.96183029e-03,
       -5.59775271e-02,  6.02203384e-02, -2.89687067e-02,  3.60556208e-02,
        3.68024521e-02, -3.64479795e-02, -5.02835214e-03, -4.53025363e-02,
       -2.61498820e-02, -7.03798011e-02, -5.70576228e-02, -3.78287770e-02,
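
The `compute` method in the cell above yields a single span covering the whole text. If you want several chunks per document, one option is to split the text into fixed-size character windows and embed each window separately. The sketch below shows only the span arithmetic: `chunk_spans`, `toy_embed` and `compute_chunked` are hypothetical helpers, and the records are plain dictionaries so the example runs without Lilac. In a real signal, each record would presumably be built with `ll.chunk_embedding(start, end, embedding)` as in the cell above.

```python
from typing import Iterator, List, Tuple


def chunk_spans(text: str, chunk_size: int) -> Iterator[Tuple[int, int]]:
  """Yield (start, end) character offsets covering `text` in fixed-size windows."""
  for start in range(0, len(text), chunk_size):
    yield start, min(start + chunk_size, len(text))


def toy_embed(text: str) -> List[float]:
  """Stand-in for a real embedding model; returns a 1-dimensional vector."""
  return [float(len(text))]


def compute_chunked(text: str, chunk_size: int = 10) -> List[dict]:
  """Build one record per chunk (hypothetical record shape, not Lilac's)."""
  return [
    {'span': (start, end), 'embedding': toy_embed(text[start:end])}
    for start, end in chunk_spans(text, chunk_size)
  ]


chunks = compute_chunked('This is some text')
print(chunks)
```

Fixed-size windows are the simplest choice. Splitting on sentence or paragraph boundaries usually gives more meaningful chunks, but the offsets are tracked in the same way.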
      

## Load full-document embeddings from our vector store

First, let's compute full-document embeddings manually with gte-small, from the sentence_transformers library.

Our vector store is just a dictionary in this example.


In [10]:
vector_store = {}
for item in items:
  vector_store[item['id']] = _embed(item['text'])


# Look up the precomputed embedding for an item by its id.
def _load_embedding(item):
  return vector_store[item['id']]


# Load the embeddings into Lilac.
ds.load_embedding(
  load_fn=_load_embedding, index_path='text', embedding='my_embedding', overwrite=True
)

Load embedding my_embedding on load_embedding:('text',): 100%|██████████| 4/4 [00:00<00:00, 2032.62it/s]


hnswlib index creation took 0.001s.
hnswlib add items took 0.000s.
Wrote embedding index to ./data/datasets/local/load_embedding/text/my_embedding
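
Because the load function looks up each item by `id`, an item that is missing from the vector store would raise a `KeyError` partway through loading. A small pre-flight check can catch this, along with vectors of inconsistent length. The sketch below is one possible check under stated assumptions: `missing_ids` and `vector_dims` are hypothetical helpers, and the vectors are made-up two-dimensional lists rather than real gte-small embeddings.

```python
from typing import Dict, List, Set


def missing_ids(items: List[dict], vector_store: Dict[str, List[float]]) -> List[str]:
  """Return the ids of items that have no vector in the store."""
  return [item['id'] for item in items if item['id'] not in vector_store]


def vector_dims(vector_store: Dict[str, List[float]]) -> Set[int]:
  """Return the set of vector lengths found in the store (ideally one value)."""
  return {len(vector) for vector in vector_store.values()}


# Toy data: the third item has no stored vector.
toy_items = [{'id': '0_'}, {'id': '1_'}, {'id': '2_'}]
toy_store = {'0_': [0.1, 0.2], '1_': [0.3, 0.4]}

print('Missing ids:', missing_ids(toy_items, toy_store))
print('Vector dimensions:', vector_dims(toy_store))
```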


## Semantic search with our custom embedding

Now we can rank documents with our custom embedding.


In [15]:
# Select rows using a semantic search.
rows = ds.select_rows(
  ['text'],
  searches=[
    ll.SemanticSearch(path='text', query='This is some data', embedding='my_embedding'),
  ],
)

for row in rows:
  print(
    row['text'],
    row['text.semantic_similarity(embedding=my_embedding,query=This is some data)'][0]['score'],
  )

Computing signal "semantic_similarity" on local/load_embedding:('text',) took 0.016s.
This is some fake data 0.9254916310310364
This is some more fake data 0.9084776043891907
This is even more fake data 0.8841889500617981
I love plants 0.7808101177215576
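
Conceptually, a semantic search embeds the query and scores every document vector against it. The sketch below ranks documents by cosine similarity using only the standard library. It is an illustration, not Lilac's code: this notebook does not state which similarity measure Lilac uses, so cosine similarity is an assumption, and the two-dimensional vectors are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
  """Cosine of the angle between two non-zero vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(y * y for y in b))
  return dot / (norm_a * norm_b)


def rank(
  query_vector: List[float], doc_vectors: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
  """Score every document against the query and sort from best to worst."""
  scored = [
    (doc_id, cosine_similarity(query_vector, vector))
    for doc_id, vector in doc_vectors.items()
  ]
  return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors: 'a' points the same way as the query, 'c' is orthogonal to it.
doc_vectors = {'a': [1.0, 0.0], 'b': [1.0, 1.0], 'c': [0.0, 1.0]}
ranking = rank([1.0, 0.0], doc_vectors)

for doc_id, score in ranking:
  print(doc_id, round(score, 4))
```

A brute-force scan like this is fine for a handful of rows. The log output earlier in the notebook shows that Lilac builds an hnswlib index when embeddings are loaded, which avoids scoring every vector on larger datasets.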
