Hello!

In this notebook we will use embeddings to create a semantic search engine over GitHub issues.

But first, we need access to a GPU. To do that, follow these steps:
1. Click "Runtime" in the menu at the top left corner of your screen.
2. Click "Change runtime type".
3. Under "Hardware accelerator", click "T4 GPU" and then click "Save".

Bam -- now you have access to a [T4 GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/)! Pretty easy.

Now, install these packages to run this notebook.

In [None]:
!pip install sentence-transformers datasets pandas tqdm transformers faiss-gpu

In [None]:
import torch
import pandas as pd

from transformers import AutoTokenizer, AutoModel
from datasets import Dataset, load_dataset

# Background

We previously learned how logits are the key to an LLM's "intelligence". Recall that logits are unnormalized scores, one per token in the vocabulary, that become a probability distribution over the next token once they are passed through a softmax. The logits of a "good" LLM should model natural language very well.
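
As a quick refresher, here is a minimal sketch of how logits turn into next-token probabilities. The three-token vocabulary and the logit values are made up for illustration; a real LLM produces one logit per token in a vocabulary of tens of thousands.

```python
import math

def softmax(logits):
    """Convert unnormalized scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary (made-up numbers).
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

The token with the largest logit ends up with the largest probability, and the probabilities sum to 1.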

_Great, we now have high quality logits. But what can we do with them?_

Logits don't simply enable an LLM to be good at natural language, i.e. speaking or writing. The logits encode an understanding of what each word means, and to a larger extent, enable the model to understand concepts, ideas, feelings, and more. Through logits, LLMs attempt to understand how our world works.

In this section we will talk about one of the downstream use cases for logits: semantic search.


_Semantic search_ is a kind of search in which the actual _meaning_ of the search query, rather than its exact wording, is used to find results. This is in contrast to other search engines that use keyword matching or URLs for search. Embeddings from an LLM power semantic search!
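
To make the contrast concrete, here is a toy sketch comparing keyword matching with similarity between vectors. The two documents and the 2-D vectors are invented by hand purely for illustration; in the rest of the notebook the vectors come from a real model.

```python
import math

def keyword_score(query, doc):
    """Count how many query words literally appear in the document."""
    doc_words = set(doc.lower().split())
    return sum(word in doc_words for word in query.lower().split())

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = "load dataset offline"
query_vec = [0.8, 0.2]  # hand-made stand-in for an embedding

# Hand-made stand-ins for document embeddings (not produced by any model).
docs = {
    "use cached data without internet": [0.9, 0.1],
    "offline meeting notes": [0.1, 0.9],
}

best_keyword = max(docs, key=lambda d: keyword_score(query, d))
best_semantic = max(docs, key=lambda d: cosine(query_vec, docs[d]))

print("keyword match picks: ", best_keyword)
print("vector similarity picks:", best_semantic)
```

Keyword matching latches onto the shared word "offline", while the vector comparison prefers the document whose (assumed) meaning is closer to the query even though they share no words.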

### Goals

In this section we'll use embeddings to develop a semantic search engine for GitHub issues and comments from the [Datasets repository](https://github.com/huggingface/datasets), a popular repository developed by HuggingFace to manage datasets for AI models.

What does this mean?


In the [Datasets repository](https://github.com/huggingface/datasets), there are tons of issues posted about the code. The issues can be found here: https://github.com/huggingface/datasets/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc.

However, the search bar on GitHub is pretty limited. Click on the URL and take a look at all those issues -- there are so many of them! It's hard to search through them with GitHub's search bar.

So we want to design a better way to search through these issues using semantic search + logits + embeddings.

Let's dive in!

# Downloading and Processing Data

Let's download a dataset of all the GitHub issues in the Datasets repository. We can do that with the `load_dataset` function.

In [None]:
issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Here is what one entry of this dataset looks like.

In [None]:
issues_dataset[0]

{'url': 'https://api.github.com/repos/huggingface/datasets/issues/2955',
 'repository_url': 'https://api.github.com/repos/huggingface/datasets',
 'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/labels{/name}',
 'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/comments',
 'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/events',
 'html_url': 'https://github.com/huggingface/datasets/pull/2955',
 'id': 1003999469,
 'node_id': 'PR_kwDODunzps4sHuRu',
 'number': 2955,
 'title': 'Update legacy Python image for CI tests in Linux',
 'user': {'login': 'albertvillanova',
  'id': 8515462,
  'node_id': 'MDQ6VXNlcjg1MTU0NjI=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/8515462?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/albertvillanova',
  'html_url': 'https://github.com/albertvillanova',
  'followers_url': 'https://api.github.com/users/albertvillanova/followers',
  'following_u

Q0: How many entries are in this dataset? Hint: you can call `len()` on a Dataset object.

In [None]:
# write code for answer here

A0: ANSWER-HERE

Now we've got our dataset but there is a lot of data in there that we don't need.

Let's filter out the pull requests, as these tend to be rarely used for answering user queries and will introduce noise in our search engine.

In [None]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 984
})

While we're at it, let's also filter out rows with no comments, since these provide no answers to user queries. Use the [Dataset.filter()](https://huggingface.co/docs/datasets/v2.18.0/process#select-and-filter) function:

In [None]:
# write code here

In [None]:
assert len(issues_dataset) == 808

We can see that there are a lot of columns in our dataset, most of which we don't need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let's use the `Dataset.remove_columns()` function to drop the rest.

If you need it, [here](https://huggingface.co/docs/datasets/v2.18.0/process#remove) is an example showing how to use the `Dataset.remove_columns()` function.

In [None]:
# write code here

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [None]:
assert list(issues_dataset.features.keys()) == ['html_url', 'title', 'comments', 'body']

To create our embeddings we'll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple.

We'll use pandas's [DataFrame.explode() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) to do this. That means we will temporarily turn our dataset into a pandas DataFrame and then convert it back into a HuggingFace dataset.
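
If "exploding" a column is new to you, here is a small sketch of what `DataFrame.explode()` does. The two-row DataFrame is a made-up toy, not our issues dataset.

```python
import pandas as pd

# Toy data: each row holds a list of comments, like our `comments` column.
toy_df = pd.DataFrame(
    {
        "title": ["Issue A", "Issue B"],
        "comments": [["first comment", "second comment"], ["only comment"]],
    }
)

# Each list element gets its own row; the other columns are repeated.
# ignore_index=True renumbers the rows 0, 1, 2, ... afterwards.
exploded = toy_df.explode("comments", ignore_index=True)
print(exploded)
```

"Issue A" had two comments, so it now appears in two rows, one per comment.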

In [None]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
comments_df = df.explode("comments", ignore_index=True)
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Now that we have one comment per row, let's create a new `comment_length` column that contains the number of words per comment:

In [None]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2964
})

We can use `comment_length` to filter out short comments, which typically include things like “cc @lewtun” or “Thanks!” that are not relevant for our search engine. There's no precise number to select for the filter, but around 15 words seems like a good start.
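
To see why a word-count threshold helps, here is a plain-Python sketch on three toy comments. The long comment is invented for illustration, and keeping comments with strictly more than 15 words is an assumption about how to apply the threshold.

```python
# Two short comments from the text above plus one made-up longer comment.
comments = [
    "Thanks!",
    "cc @lewtun",
    "I ran into the same error when loading the dataset from a local "
    "directory, and clearing the cache folder fixed it for me.",
]

min_words = 15  # assumed threshold, as suggested above

# Same word count as the comment_length column: split on whitespace.
word_counts = [len(c.split()) for c in comments]

# Keep only comments that are longer than the threshold.
kept = [c for c, n in zip(comments, word_counts) if n > min_words]

print(word_counts)
print(kept)
```

Only the longer comment survives, which is the kind of text that can actually answer a user's query.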

Again, use the [Dataset.filter()](https://huggingface.co/docs/datasets/v2.18.0/process#select-and-filter) function:

In [None]:
# write code here

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

In [None]:
assert len(comments_dataset) == 2175

Having cleaned up our dataset a bit, let’s concatenate the issue title, description, and comments together in a new text column. As usual, we’ll write a simple function that we can pass to Dataset.map():

In [None]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

We've finished processing our dataset!

Q2: Please inspect three GitHub comments from our dataset. Look at the `text` column and summarize what each comment is about.

A2: ANSWER-HERE

# Embed Data

We’re finally ready to create some embeddings! Let’s dive in.

#### Loading a Pretrained Encoder model.

We previously downloaded the GPT2 model from HuggingFace. Now we will download a different model, one that specializes in creating embeddings that work well for semantic search.

We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1). It is one of hundreds of encoder models available. The download happens automatically the first time `from_pretrained()` is called, and may take up to a minute.

In [None]:
model_id = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

To speed up the embedding process, it helps to place the model and inputs on a GPU device. If there is no GPU available, creating these embeddings will take too long. In that case, you can simply download the embeddings (which I already computed) by running

```python
embeddings_dataset = load_dataset("eitanturok/github-issues-embeddings", split="train")
```

However, you hopefully won't need to do this. Now, let's go back to placing the model on the GPU.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

We’d like to represent each entry in our GitHub issues corpus as a single vector. How can we do that?

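Here is where the logits come into play!

If we feed a single GitHub issue into our model and take the output at the first token position, we get a single vector that should capture the meaning of the whole issue. (Strictly speaking, this vector is a hidden state of the encoder rather than a set of next-token logits, which is why it is usually called an embedding; this notebook uses the two words loosely.)

Let's do this. Here is a function that extracts this vector from the model output.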

In [None]:
def get_logits(model_output):
    return model_output.last_hidden_state[:, 0]
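
The indexing `[:, 0]` is easy to misread, so here is a small NumPy sketch of what it selects. The array is random data standing in for `model_output.last_hidden_state`; its shape `(batch, sequence length, hidden size)` is an assumption based on the 768-wide layers printed above.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 5, 768

# Random stand-in for model_output.last_hidden_state (not real model output).
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# [:, 0] keeps every item in the batch but only the first token position.
first_token = last_hidden_state[:, 0]

print(last_hidden_state.shape)
print(first_token.shape)
```

Each text in the batch is reduced from one vector per token to a single vector, which is what we will store as its embedding.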

Now let's look at the text of a single GitHub comment in our dataset. We will want to embed this text and get the logits for it.

In [None]:
text = comments_dataset["text"][0]
print(text)

Protect master branch 
 After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:
- 00cc036fea7c7745cfe722360036ed306796a3f2
- 13ae8c98602bbad8197de3b9b425f4c78f582af1
- ...

I propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:
- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch
  - Currently, simple merge commits are already disabled
  - I propose to disable rebase merging as well
- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~
  - ~~This protection would reject direct pushes to master branch~~
  - ~~If so, for each release (when we need to commit directly to the master branch), we should previously disable the pr

Now, using `text` and the `get_logits` function, fill in `get_embeddings`. `get_embeddings` should tokenize `text`, place the tokenized text on the GPU, feed it into the model, call `get_logits()` on the model output, and return the result.

In [None]:
# Write answer here.

def get_embeddings(text: str):
  pass

We can use Dataset.map() to apply our get_embeddings() function to each piece of text in `comments_dataset`, so let’s create a new embeddings column as follows:

In [None]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Notice that we’ve converted the embeddings to NumPy arrays — that’s because 🤗 Datasets requires this format when we try to index them with FAISS, which we’ll do next. Now let's inspect the embeddings a bit:

In [None]:
embeddings_dataset

In [None]:
embeddings_dataset[0]

Q3: How many dimensions does the embedding of each GitHub comment have?

A3:

# Semantic Search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we’ll use a special data structure in 🤗 Datasets called a FAISS index. FAISS (short for Facebook AI Similarity Search) is a library created by Facebook that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index -- this is the heart of the library -- that allows one to find which embeddings are similar to an input embedding.
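
To get a feel for the question an index answers, here is a brute-force sketch in NumPy. It is not how FAISS is implemented (FAISS is far more efficient); the function name `nearest_neighbors`, the tiny 2-D corpus, and the choice of squared L2 distance are all assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors(index_vectors, query, k):
    """Return the k smallest squared L2 distances and the matching row ids."""
    diffs = index_vectors - query           # broadcast the query over every row
    distances = np.sum(diffs ** 2, axis=1)  # squared L2 distance per row
    order = np.argsort(distances)[:k]       # row ids of the k closest vectors
    return distances[order], order

# Toy 2-D "embeddings" (made-up numbers).
corpus = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
query = np.array([0.9, 0.8])

scores, ids = nearest_neighbors(corpus, query, k=2)
print(ids)
print(scores)
```

Comparing the query against every vector like this gets slow as the corpus grows, which is the problem an index is designed to solve.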

Creating a FAISS index in 🤗 Datasets is simple — we use the Dataset.add_faiss_index() function and specify which column of our dataset we’d like to index:


In [None]:
embeddings_dataset.add_faiss_index(column="embeddings")

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let’s test this out by first embedding a question as follows:


In [None]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let’s collect these in a pandas.DataFrame so we can easily sort them:

In [None]:
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments:

In [None]:
for _, row in samples_df.iterrows():
    print(f"SIMILARITY SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print(f"COMMENT: {row.comments}")
    print("=" * 50)
    print()

This is great!!!

By using semantic search, we got the five Github issues that are most similar to our query "How can I load a dataset offline?"

To recap, we got these results by:
1. embedding all of the Github issues
2. putting these embeddings in a vector database
3. embedding the query "How can I load a dataset offline?"
4. finding which embeddings in the vector database are most similar to the query's embedding

Congrats -- these are the steps for implementing a semantic search engine, and you just built one!

### Questions

Q4: Manually inspect the first two Github issues we returned -- click on their "URL"s. Look through the Github issue there -- do these issues actually have to do with loading a dataset offline?

A4: ANSWER-HERE

Q5: We want to compare our semantic search engine to GitHub's search engine, which works like a traditional keyword-based search. Does GitHub's search engine give the same results we got for our query "How can I load a dataset offline?"?

To compare these, please go to the datasets repository on Github here: https://github.com/huggingface/datasets.

Navigate to the issues tab and go to the search bar in the middle of the screen (it should say filters next to it). Delete "is:open" and "sort:updated-desc" but keep "is:issue" -- these are tags that Github uses to filter our search. Then enter our query into the search bar "How can I load a dataset offline?".

Take a look at the results from Github's search engine. How many of the search results given by Github's search engine did we also get with our semantic search engine?

A5: ANSWER-HERE

Q6: Are there any Github issues that our search engine returned but Github's search engine did not return? What about vice versa?

A6: ANSWER-HERE

Q7: Which search engine do you think returned more relevant results to the query "How can I load a dataset offline?"? In other words, which search engine do you think did a better job for this one example?

A7: ANSWER-HERE