# Semantic Search with FAISS 

In this notebook, we will build a search engine that helps us find answers to questions about a dataset.

## Using Embeddings for Semantic Search 

As we know, Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can pool the individual token embeddings to create a vector representation of whole sentences, paragraphs, or even documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between a query embedding and each document embedding, and returning the documents with the highest scores.
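To make the ranking step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the query embedding are made-up toy values (real sentence embeddings have hundreds of dimensions), and `dot` and `rank_by_similarity` are hypothetical helper names, not part of any library.

```python
# Toy 3-dimensional "embeddings"; the values are invented for illustration only.
corpus = {
    "How do I load a dataset?": [0.9, 0.1, 0.0],
    "Conda build fails": [0.1, 0.8, 0.2],
    "Progress bars are too noisy": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # hypothetical embedding of a user query


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query, documents):
    """Return (document, score) pairs sorted from most to least similar."""
    scored = [(text, dot(query, vector)) for text, vector in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity(query_embedding, corpus)
for text, score in ranking:
    print(f"{score:.2f}  {text}")
```

The document whose vector points in roughly the same direction as the query gets the largest score and comes first.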

In this notebook we’ll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that are based on matching keywords in a query with the documents.

## Loading and Preparing the Dataset

In [1]:
from datasets import load_dataset

In [2]:
issues_dataset = load_dataset('lewtun/github-issues', split='train')
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Here we’ve specified the default train split in load_dataset(), so it returns a Dataset instead of a DatasetDict. The first order of business is to filter out the pull requests, as these are rarely useful for answering user queries and would introduce noise in our search engine. As should be familiar by now, we can use the Dataset.filter() function to exclude these rows from our dataset. While we’re at it, let’s also filter out rows with no comments, since these provide no answers to user queries:

In [3]:
issues_dataset = issues_dataset.filter(lambda x: not x['is_pull_request'] and len(x['comments']) > 0)
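Dataset.filter() keeps the rows for which the function returns True. As a rough illustration of what the predicate above does, here is the same condition applied to a few hypothetical records in plain Python. The records and the `keep_issue` name are invented for this sketch.

```python
# Hypothetical records mimicking the two fields the filter inspects.
toy_issues = [
    {"title": "Add new metric", "is_pull_request": True, "comments": ["LGTM"]},
    {"title": "Conda build fails", "is_pull_request": False, "comments": ["Why 1.9 ?"]},
    {"title": "Typo in docs", "is_pull_request": False, "comments": []},
]


def keep_issue(example):
    """Keep real issues (not pull requests) that have at least one comment."""
    return (not example["is_pull_request"]) and len(example["comments"]) > 0


kept = [issue["title"] for issue in toy_issues if keep_issue(issue)]
print(kept)
```

Only the second record survives: the first is a pull request and the third has no comments.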

In [4]:
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

We’re down to 808 rows from 3019. That is a big drop, but don’t let the numbers scare you. We can also see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let’s use the Dataset.remove_columns() function to drop the rest:

In [5]:
columns = issues_dataset.column_names
columns_to_include = ['title', 'body', 'comments', 'html_url']
# Every column that is not in the keep-list gets dropped
columns_to_exclude = [column for column in columns if column not in columns_to_include]

issues_dataset = issues_dataset.remove_columns(columns_to_exclude)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [6]:
issues_dataset[0]['comments']

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let’s first switch to the Pandas DataFrame format:

In [7]:
issues_dataset.set_format('pandas')
dataframe = issues_dataset[:]

In [8]:
dataframe.head(10)

,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"[Cool, I think we can do both :), @lhoestq now...",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,[Hi ! I guess the caching mechanism should hav...,## Describe the bug\r\nAfter upgrading to data...
2,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,[I tried `unshuffled_original_da` and it is al...,## Describe the bug\r\n\r\nCannot download OSC...
3,https://github.com/huggingface/datasets/issues...,load_dataset using default cache on Windows ca...,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfo...",## Describe the bug\r\nStandard process to dow...
4,https://github.com/huggingface/datasets/issues...,to_tf_dataset keeps a reference to the open da...,"[I did some investigation and, as it seems, th...",To reproduce:\r\n```python\r\nimport datasets ...
5,https://github.com/huggingface/datasets/issues...,Conda build fails,[Why 1.9 ?\r\n\r\nhttps://anaconda.org/Hugging...,## Describe the bug\r\nCurrent `datasets` vers...
6,https://github.com/huggingface/datasets/issues...,Mutable columns argument breaks set_format,[Pushed a fix to my branch #2731 ],## Describe the bug\r\nIf you pass a mutable l...
7,https://github.com/huggingface/datasets/issues...,Datasets 1.12 dataset.filter TypeError: get_in...,"[Thanks for reporting, I'm looking into it :),...",## Describe the bug\r\nUpgrading to 1.12 cause...
8,https://github.com/huggingface/datasets/issues...,"""File name too long"" error for file locks","[Hi, the filename here is less than 255\r\n```...",## Describe the bug\r\n\r\nGetting the followi...
9,https://github.com/huggingface/datasets/issues...,Unwanted progress bars when accessing examples,[doing a patch release now :)],When accessing examples from a dataset formatt...


When we explode the DataFrame, we expect to get one row for each comment:

In [9]:
comments_dataframe = dataframe.explode('comments', ignore_index=True)
comments_dataframe.head(10)

,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...
4,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Well it can cause issue with anyone that updat...,## Describe the bug\r\nAfter upgrading to data...
5,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"I just merged a fix, let me know if you're sti...",## Describe the bug\r\nAfter upgrading to data...
6,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Definitely works on several manual cases with ...,## Describe the bug\r\nAfter upgrading to data...
7,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Fixed by #2947.,## Describe the bug\r\nAfter upgrading to data...
8,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,I tried `unshuffled_original_da` and it is als...,## Describe the bug\r\n\r\nCannot download OSC...
9,https://github.com/huggingface/datasets/issues...,load_dataset using default cache on Windows ca...,"Hi @daqieq, thanks for reporting.\r\n\r\nUnfor...",## Describe the bug\r\nStandard process to dow...
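The effect of explode can be sketched in plain Python on a couple of made-up rows. This is only an illustration of the semantics, not how pandas implements it, and the `explode` helper below is a hypothetical name. One difference to keep in mind: pandas turns an empty list into a single row holding NaN, whereas this sketch simply emits no row for it. That case does not arise here because issues without comments were already filtered out.

```python
# Made-up rows: each issue carries a list of comments.
toy_rows = [
    {"title": "Protect master branch", "comments": ["Cool", "Now implemented"]},
    {"title": "Conda build fails", "comments": ["Why 1.9 ?"]},
]


def explode(rows, column):
    """Emit one output row per list element, copying the other fields."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # replicate all other column values
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded


exploded_rows = explode(toy_rows, "comments")
for row in exploded_rows:
    print(row)
```

Two input rows with three comments in total become three output rows, with the title repeated for each comment of the same issue.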


Good, we can see that the rows have been replicated, with the comments column now containing individual comments. Now let’s switch back from pandas to a Dataset:

In [10]:
from datasets import Dataset

In [11]:
comments_dataset = Dataset.from_pandas(comments_dataframe)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

The number of rows has increased substantially, from 808 to 2964.

Now that we have one comment per row, let’s create a new comment_length column that contains the number of words per comment:

In [12]:
comments_dataset = comments_dataset.map(lambda x: {"comment_length": len(x['comments'].split())})


In [13]:
comments_dataset[0]

{'html_url': 'https://github.com/huggingface/datasets/issues/2945',
 'title': 'Protect master branch',
 'comments': 'Cool, I think we can do both :)',
 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n  - Currently, simple merge commits are already disabled\r\n  - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n  - ~~This protection would rejec

We can use this new column to filter out short comments, which typically include things like “cc @lewtun” or “Thanks!” that are not relevant for our search engine. There’s no precise number to select for the filter, but around 15 words seems like a good start:

In [14]:
comments_dataset = comments_dataset.filter(lambda x: x['comment_length'] > 15)
comments_dataset


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

The number of rows has decreased from 2964 to 2175.
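The word-count filter above can be tried on a few toy comments in plain Python. The comments and the `MIN_WORDS` constant are made up for illustration; the threshold of 15 mirrors the one used in the notebook.

```python
# Toy comments: two short acknowledgements and one substantive reply.
toy_comments = [
    "Thanks!",
    "cc @lewtun",
    "I just merged a fix, let me know if you are still seeing the problem "
    "after upgrading to the latest release.",
]
MIN_WORDS = 15  # same threshold as in the notebook; a judgment call, not a rule


def comment_length(comment):
    """Number of whitespace-separated words in a comment."""
    return len(comment.split())


lengths = [comment_length(comment) for comment in toy_comments]
kept_comments = [c for c in toy_comments if comment_length(c) > MIN_WORDS]
print(lengths)
print(kept_comments)
```

Splitting on whitespace is a crude word count (punctuation stays attached to words), but it is enough to separate one-line acknowledgements from longer replies.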

Finally, let’s concatenate the issue title, body, and comments together in a new text column:

In [15]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }

In [16]:
comments_dataset = comments_dataset.map(concatenate_text)



In [17]:
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

## Creating Text Embeddings 

We know that we can obtain token embeddings by using the AutoModel class. All we need to do is pick a suitable checkpoint to load the model from. Fortunately, there’s a library called `sentence-transformers` that is dedicated to creating embeddings, and we can load one of its checkpoints:

In [18]:
from transformers import AutoTokenizer, TFAutoModel

In [19]:
model_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModel.from_pretrained(model_checkpoint, from_pt=True)


We’d like to represent each entry in our GitHub issues corpus as a single vector, so we need to pool or average our token embeddings in some way. One popular approach is to perform *CLS pooling* on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token:

In [20]:
def cls_pooling(model_output): 
    return model_output.last_hidden_state[:, 0]
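To see what the slicing `[:, 0]` selects, here is a small sketch that uses nested Python lists as a stand-in for a tensor of shape (batch, tokens, hidden size). The numbers are invented. Mean pooling is included only as a contrast: it is an alternative way of averaging token embeddings, not what this notebook uses, and `cls_pool` and `mean_pool` are hypothetical helper names.

```python
# Toy "last hidden state": 2 sequences, 3 tokens each, hidden size 2.
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [9.0, 9.0]],  # third token is padding
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],  # 0 marks the padding token
]


def cls_pool(hidden):
    """Take the vector of the first token of every sequence."""
    return [sequence[0] for sequence in hidden]


def mean_pool(hidden, mask):
    """Average the token vectors of each sequence, ignoring padding."""
    pooled = []
    for sequence, token_mask in zip(hidden, mask):
        n_real = sum(token_mask)  # number of non-padding tokens
        dims = len(sequence[0])
        pooled.append([
            sum(token[d] * m for token, m in zip(sequence, token_mask)) / n_real
            for d in range(dims)
        ])
    return pooled


cls_vectors = cls_pool(last_hidden_state)
mean_vectors = mean_pool(last_hidden_state, attention_mask)
print(cls_vectors)
print(mean_vectors)
```

Either way, each sequence ends up as one vector of the hidden size, which is what we need for comparing documents.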

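Next, we create a helper function that will tokenize a list of documents, feed the resulting tensors to the model, and finally apply CLS pooling to the outputs. Since we are using TensorFlow here, we don’t need to move the tensors to the GPU ourselves; TensorFlow handles device placement automatically: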

In [21]:
def get_embeddings(text_list):
    # Tokenize the text, padding and truncating so the batch forms a rectangular tensor
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors='tf')
    # Feed the tokenized inputs to the model
    model_output = model(**encoded_input)
    # Keep only the first-token ([CLS]) vector for each input
    return cls_pooling(model_output)

In [22]:
comments_dataset['text'][0]

'Protect master branch \n After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n  - Currently, simple merge commits are already disabled\r\n  - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n  - ~~This protection would reject direct pushes to master branch~~\r\n  - ~~If so, for each release (when we need to commit directly to the master branch), we should p

In [23]:
# Check the above function by feeding it the first text entry in our corpus and inspecting the output shape 
embedding = get_embeddings(comments_dataset['text'][0])
embedding.shape

TensorShape([1, 768])

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let’s create a new embeddings column as follows:

In [None]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)




Notice that we’ve converted the embeddings to NumPy arrays. That’s because 🤗 Datasets requires this format when we try to index them with FAISS, which we’ll do next.