# Semantic Search Engine Built With Embeddings

This is the first demonstration of an embeddings-based search engine.

The benefit of this search engine is the ability to search by _meaning_ rather than just by keyword matching.

# GENERAL WARNING
Do not run this notebook without `uv` and the venv required.

## Opening these notebooks for the first time
To install all the dependencies required to execute this notebook, run `uv sync` in the root of the git repo.

To run this notebook and ensure that `uv` is taking care of your packages, run this command:

### For Jupyter Lab
```sh
uv run --with jupyter jupyter lab
```

### For VS Code
Install the required Jupyter/Python extensions.

When asked for a kernel, select `.venv/bin/python`.

Here we're loading the dataset.

In [5]:
from datasets import load_dataset

dictionary_dataset = load_dataset("MAKILINGDING/english_dictionary", split="train")

In [6]:
dictionary_dataset.set_format("pandas")
df = dictionary_dataset[:]

In [9]:
df.head()

| Unnamed: 0 | word | definition |
|---|---|---|
| 0 | a | the first letter of the english and of many ot... |
| 1 | a 1 | a registry mark given by underwriters (as at l... |
| 2 | a b c | the first three letters of the alphabet, used ... |
| 3 | a cappella | in church or chapel style; -- said of composit... |
| 4 | a fortiori | with stronger reason. |


Joining word and definition into a column called `combined` to feed to the transformer model.

In [4]:
cols = ['word', 'definition']
df['combined'] = df[cols].apply(lambda row: '\n'.join(row.values.astype(str)), axis=1)
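
The row-wise `apply` above does the job, but on a frame of this size a vectorized string concatenation may be faster. The following is a minimal sketch on a tiny hand-made frame (`toy_df` and its rows are stand-ins, not the real dataset) showing that both approaches give the same `combined` column:

```python
import pandas as pd

# Toy stand-in for the dictionary DataFrame (hypothetical rows, for illustration only).
toy_df = pd.DataFrame(
    {
        "word": ["a cappella", "a fortiori"],
        "definition": ["in church or chapel style", "with stronger reason."],
    }
)

# Vectorized version: word, newline, definition, computed column-wise.
toy_df["combined"] = (
    toy_df["word"].astype(str) + "\n" + toy_df["definition"].astype(str)
)

# Row-wise version, as used in the notebook, for comparison.
cols = ["word", "definition"]
row_wise = toy_df[cols].apply(lambda row: "\n".join(row.values.astype(str)), axis=1)

print(toy_df["combined"].tolist())
```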

# Converting into a Dataset object
This allows us to use a Hugging Face model and the `transformers` library directly.

In [6]:
from datasets import Dataset

dictionary_dataset = Dataset.from_pandas(df)
dictionary_dataset

Dataset({
    features: ['word', 'definition', 'combined'],
    num_rows: 124088
})

---

Tread very carefully from here.
This section runs the model over the whole dataset to compute embeddings (no training is involved). We already have the embeddings dataset, so there is no need to run this code again.
I am only keeping the code here for reference.

Run this code only if you know what you are doing.

In [7]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# Importing PyTorch
Since my laptop only has a CPU, `torch.device()` is passed `"cpu"`. If you have an NVIDIA GPU, replace `"cpu"` with `"cuda"`.
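
If you would rather not edit the string by hand, one common pattern is to let PyTorch pick the device at runtime. This is a small sketch of that idea, not what the notebook itself does:

```python
import torch

# Use the GPU when PyTorch can see a CUDA device, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device.type)
```

On a CPU-only machine this resolves to the same `"cpu"` device used in the cell below.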

In [8]:
import torch

device = torch.device("cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

# About The Transformer Being Used
I chose `multi-qa-mpnet-base-dot-v1` because it's a decently sized transformer that is *built* for semantic search.

It has been trained on 215M (question, answer) pairs from diverse sources, so it's good at creating sentence embeddings. Since we're working with definitions, this is exactly what we need.

This model is based on the [mpnet-base](https://huggingface.co/microsoft/mpnet-base) model from Microsoft.
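
The `dot` in the model name refers to the dot product, which is the score this model is meant to be compared with. As a rough illustration of how ranking by dot product works, here is a sketch using small made-up vectors in place of real 768-dimensional embeddings (all numbers are invented for the example):

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for real document embeddings.
doc_embeddings = np.array(
    [
        [0.9, 0.1, 0.0],  # document 0
        [0.1, 0.8, 0.2],  # document 1
        [0.0, 0.1, 0.9],  # document 2
    ]
)
# Made-up embedding of a search query.
query_embedding = np.array([0.2, 0.9, 0.1])

# One dot product per document; a higher score means a closer match.
scores = doc_embeddings @ query_embedding

# Document indices sorted from best match to worst.
ranking = np.argsort(-scores)

print(ranking)
```

Document 1 points in nearly the same direction as the query, so it gets the highest score and comes first in `ranking`.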

In [10]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
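
To make the `[:, 0]` indexing concrete: `last_hidden_state` has shape `(batch, tokens, hidden)`, and CLS pooling keeps only the first token's vector for each sequence. The sketch below uses a small NumPy array as a stand-in for the model output (the shapes are hypothetical and chosen to be easy to read); the same slicing applies to PyTorch tensors.

```python
import numpy as np

# Fake "last_hidden_state": 2 sequences, 4 tokens each, hidden size 3.
batch_size, seq_len, hidden_size = 2, 4, 3
last_hidden_state = np.arange(
    batch_size * seq_len * hidden_size, dtype=float
).reshape(batch_size, seq_len, hidden_size)

# Same indexing as cls_pooling: keep only the first token of every sequence.
pooled = last_hidden_state[:, 0]

print(pooled.shape)  # (2, 3)
```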

In [10]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [11]:
embedding = get_embeddings(dictionary_dataset["combined"][0])
embedding.shape

torch.Size([1, 768])

This just shows the embedding size: our transformer outputs one 768-dimensional vector per input, so a single text gives a tensor of shape $(1 \times 768)$.

In [None]:
embeddings_dataset = dictionary_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["combined"]).detach().cpu().numpy()[0]}
)

This code was run and generated an embeddings dataset.

The following code saved it to a `.csv` file.

In [None]:
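# The file name must be a string; unquoted, it is parsed as an attribute lookup on an undefined name.
embeddings_dataset.to_csv("dictionary_embeddings.csv", index=False)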

To see the actual search part of the model, check the `semantic-search-with-dataset.ipynb` notebook in this repo.