In [37]:
from datasets import load_dataset
from utils import check_file_or_folder_existence

### Semantic search with FAISS

In section 5, we created a dataset of GitHub issues and comments from the 🤗 Datasets repository. In this section we’ll use this information to build a search engine that can help us find answers to our most pressing questions about the library!

### TOC
1. [Using embeddings for semantic search](#using-embeddings-for-semantic-search)
2. [Loading and preparing the dataset](#loading-and-preparing-the-dataset)

As we saw in Chapter 1, Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.
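As a minimal sketch of the pooling and scoring idea, the snippet below averages made-up token vectors into one vector per document and ranks the documents by dot product against a query. The two-dimensional vectors and the names `mean_pool`, `docs`, and `query_tokens` are illustrative assumptions; a real model produces vectors with hundreds of dimensions and typically masks out padding tokens before averaging.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average token vectors of shape [num_tokens, dim] into a single vector."""
    return token_embeddings.mean(axis=0)


# Made-up token embeddings for three toy documents (not real model output).
docs = {
    "doc_a": np.array([[1.0, 0.0], [0.8, 0.2]]),
    "doc_b": np.array([[0.0, 1.0], [0.2, 0.8]]),
    "doc_c": np.array([[0.5, 0.5], [0.7, 0.3]]),
}
query_tokens = np.array([[1.0, 0.0], [0.9, 0.1]])

doc_names = list(docs)
doc_matrix = np.stack([mean_pool(docs[name]) for name in doc_names])
query_vector = mean_pool(query_tokens)

# One dot product per document; higher means more similar to the query.
scores = doc_matrix @ query_vector
ranking = [doc_names[i] for i in np.argsort(-scores)]
print(ranking)
```

The ranking step is just a sort over the scores, which is the part a library such as FAISS is designed to speed up for large corpora.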

In this section we’ll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that are based on matching keywords in a query with the documents.
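To illustrate one limitation of keyword matching, here is a small sketch that scores documents by word overlap with the query. The function `keyword_overlap` and the example strings are hypothetical and only meant to show the failure mode; this is not how any particular search engine is implemented.

```python
def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    union = query_words | document_words
    return len(query_words & document_words) / len(union) if union else 0.0


query = "load data without internet"
doc_relevant = "use datasets in offline mode"  # same meaning, different words
doc_irrelevant = "load data with internet streaming"  # shared words, other meaning

print(keyword_overlap(query, doc_relevant))
print(keyword_overlap(query, doc_irrelevant))
```

The relevant document shares no words with the query, so a purely keyword-based score ranks it below the irrelevant one. Embedding-based search aims to avoid this by comparing meaning rather than surface tokens.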

<img src="images/semantic_search/semantic-search.svg" alt="Alternative text" />

#### Loading and preparing the dataset

The first thing we need to do is download our dataset of GitHub issues, so let’s use the load_dataset() function as usual:

In [38]:
# issues_dataset.cleanup_cache_files()

In [39]:
from datasets import load_dataset

issues_dataset = load_dataset("hjerpe/github-kubeflow-issues", split="train", download_mode="force_redownload")
issues_dataset

ConnectionError: Couldn't reach 'hjerpe/github-kubeflow-issues' on the Hub (ConnectionError)

Here we’ve specified the default train split in load_dataset(), so it returns a Dataset instead of a DatasetDict. The first order of business is to filter out the pull requests, as these tend to be rarely used for answering user queries and will introduce noise in our search engine. As should be familiar by now, we can use the Dataset.filter() function to exclude these rows in our dataset. While we’re at it, let’s also filter out rows with no comments, since these provide no answers to user queries:

In [None]:
def remove_empty_entries(example):
    return {"comments": [x for x in example["comments"] if x]}

    
issues_dataset = issues_dataset.map(remove_empty_entries)

In [None]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'is_pull_request'],
    num_rows: 453
})

We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let’s use the Dataset.remove_columns() function to drop the rest:

In [None]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = [column for column in columns if column not in columns_to_keep]
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 453
})

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let’s first switch to the Pandas DataFrame format:

In [None]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]

If we inspect the first row in this DataFrame we can see there are two comments associated with this issue:

In [None]:
df["comments"][0].tolist()

["Hugging Face's datasets library may prioritize remote configurations. Make sure there are no conflicting configurations causing the library to prefer downloading data\r\nMay  be try debugging\r\nraw_datasets = load_dataset('json', data_files=data_files)\r\nprint(raw_datasets)\r\n",
 "It doesn't download them but writes them to the local HF cache. The logging could indeed be better. Does loading the dataset succeed? If it doesn't, can you share the error stack trace?"]

In [None]:
df["comments"][2].tolist()

['Thanks for reporting. We are investigating it.',
 'This issue is caused by latest `pandas` release 2.1.0 (released yesterday Aug 30).\r\n\r\nSee: https://github.com/huggingface/datasets/actions/runs/6035484010/job/16375932085?pr=6198\r\n',
 "People using previous releases of `datasets` should pin `pandas` in their local environment:\r\n```\r\npython -m pip install 'pandas<2.1.0'\r\n```"]

When we explode df, we expect to get one row for each of these comments. Let’s check if that’s the case:

In [None]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

|  | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/kubeflow/pipelines/issues/6199 | v2 backend API | `Hugging Face's datasets library may prioritize...` | `- [x] #6169 \r\n- [x] #6170 \r\n- [x] #6171\r\...` |
| 1 | https://github.com/kubeflow/pipelines/issues/6199 | v2 backend API | `It doesn't download them but writes them to th...` | `- [x] #6169 \r\n- [x] #6170 \r\n- [x] #6171\r\...` |
| 2 | https://github.com/kubeflow/pipelines/issues/6198 | v2 UI tracker | `_The documentation is not available anymore as...` | `* POC and Design\r\n * [x] https://github.co...` |
| 3 | https://github.com/kubeflow/pipelines/issues/6198 | v2 UI tracker | `<details>\n<summary>Show benchmarks</summary>\...` | `* POC and Design\r\n * [x] https://github.co...` |
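The replication behaviour of DataFrame.explode() is easier to verify on a tiny frame. The sketch below uses a made-up two-row DataFrame (the titles and comments are placeholders, not rows from the issues dataset) and shows that each list element gets its own row while the other column is repeated.

```python
import pandas as pd

# Toy data: one row per issue, with a list of comments in each row.
toy_df = pd.DataFrame(
    {
        "title": ["issue A", "issue B"],
        "comments": [["first", "second"], ["third"]],
    }
)

# One row per comment; ignore_index=True renumbers the rows from 0.
exploded = toy_df.explode("comments", ignore_index=True)
print(exploded)
```

Two issues with three comments in total become three rows, and "issue A" appears twice because it had two comments.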


Great, we can see the rows have been replicated, with the comments column containing the individual comments! Now that we’re finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame in memory:

In [None]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 1557
})

Okay, this has given us about 1,500 comments to work with!

✏️ Try it out! See if you can use Dataset.map() to explode the comments column of issues_dataset without resorting to the use of Pandas. This is a little tricky; you might find the “Batch mapping” section of the 🤗 Datasets documentation useful for this task.
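One possible approach to this exercise, sketched under assumptions: with batched=True, Dataset.map() passes the function a dictionary that maps each column name to a list of values, and the function may return a different number of rows as long as every returned column has the same length. The helper name `explode_comments` and the toy batch are hypothetical; the block runs on a plain dictionary so it can be checked without the library.

```python
def explode_comments(batch: dict) -> dict:
    """Turn one row per issue into one row per comment.

    `batch` maps each column name to a list of values, mirroring the
    structure Dataset.map() passes in when batched=True.
    """
    exploded = {column: [] for column in batch}
    other_columns = [column for column in batch if column != "comments"]
    for row_index, comments in enumerate(batch["comments"]):
        for comment in comments:
            exploded["comments"].append(comment)
            # Repeat the remaining column values once per comment.
            for column in other_columns:
                exploded[column].append(batch[column][row_index])
    return exploded


# Toy batch standing in for what Dataset.map() would provide.
toy_batch = {
    "title": ["issue A", "issue B"],
    "comments": [["first", "second"], ["third"]],
}
result = explode_comments(toy_batch)
print(result)
```

Assuming the dataset is back in its default Python format (for example after `issues_dataset.reset_format()`), this could be applied as `issues_dataset.map(explode_comments, batched=True, remove_columns=issues_dataset.column_names)`, so that the returned columns replace the originals.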

In [None]:
# Exploratory check of batched mapping: split each comment on periods.
explore = comments_dataset.map(
    lambda batch: {"b": [x.split(".") for x in batch["comments"]]},
    remove_columns=comments_dataset.column_names,
    batched=True,
)
explore["b"]

[["Hugging Face's datasets library may prioritize remote configurations",
  " Make sure there are no conflicting configurations causing the library to prefer downloading data\r\nMay  be try debugging\r\nraw_datasets = load_dataset('json', data_files=data_files)\r\nprint(raw_datasets)\r\n"],
 ["It doesn't download them but writes them to the local HF cache",
  ' The logging could indeed be better',
  " Does loading the dataset succeed? If it doesn't, can you share the error stack trace?"],
 ['_The documentation is not available anymore as the PR was closed or merged',
  '_'],
 ['<details>\n<summary>Show benchmarks</summary>\n\nPyArrow==8',
  '0',
  '0\n\n<details>\n<summary>Show updated benchmarks!</summary>\n\n### Benchmark: benchmark_array_xd',
  'json\n\n| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | rea