# NB_080124T0722_create_my_own_dataset_semantic_search

# 1.Goal

- create my own dataset [link](https://huggingface.co/learn/nlp-course/chapter5/5#creating-your-own-dataset)
- semantic search [link]https://huggingface.co/learn/nlp-course/chapter5/6#semantic-search-with-faiss)

# 2.Introduction

Sometimes the dataset that you need to build an NLP application doesn’t exist, so you’ll need to create it yourself. In this section we’ll show you how to create a corpus of GitHub issues, which are commonly used to track bugs or features in GitHub repositories. This corpus could be used for various purposes, including:
- Exploring how long it takes to close open issues or pull requests
- Training a multilabel classifier that can tag issues with metadata based on the issue’s description (e.g., “bug,” “enhancement,” or “question”)
- Creating a semantic search engine to find which issues match a user’s query

# 3.Steps

    - worworking with dataset
      - fetch issues
      - load locally as HF dataset
      - cleaning up the data
      - augmenting the dataset
      - uploading the dataset to the HF hub
      - creating a dataset card 
    - semantic search
      - loading and prepating dataset
        - load
        - filter only for issues not pull request
        - remain only the requires columns
        - convert to pandas and explode
        - convert back to dataset HF
        - filter comments longer 15 sympbols length
        - concatenate  issue title, description, and comments together
      - creating text embeddings
        - download model for tokenization
        - Create embedding  using CLSpooling

# 4.Tools

In [11]:
!pip install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.1-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.9 (from ipywidgets)
  Downloading widgetsnbextension-4.0.9-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.9 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.9-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.1-py3-none-any.whl (139 kB)
   ---------------------------------------- 0.0/139.4 kB ? eta -:--:--
   -------- ------------------------------- 30.7/139.4 kB 1.4 MB/s eta 0:00:01
   -------------------------------- ------- 112.6/139.4 kB 1.3 MB/s eta 0:00:01
   ---------------------------------------- 139.4/139.4 kB 1.4 MB/s eta 0:00:00
Downloading jupyterlab_widgets-3.0.9-py3-none-any.whl (214 kB)
   ---------------------------------------- 0.0/214.9 kB ? eta -:--:--
   -------------------------------- ------- 174.1/214.9 kB 5.3 MB/s eta 0:00:01
   ---------------------------------------- 214.9/214.9 kB 2.6 MB/s eta 0:0

In [13]:
!pip install --upgrade jupyter ipywidgets


Collecting jupyter
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting notebook (from jupyter)
  Downloading notebook-7.0.6-py3-none-any.whl.metadata (10 kB)
Collecting qtconsole (from jupyter)
  Downloading qtconsole-5.5.1-py3-none-any.whl.metadata (5.1 kB)
Collecting jupyter-console (from jupyter)
  Downloading jupyter_console-6.6.3-py3-none-any.whl (24 kB)
Collecting nbconvert (from jupyter)
  Downloading nbconvert-7.14.0-py3-none-any.whl.metadata (7.7 kB)
Collecting beautifulsoup4 (from nbconvert->jupyter)
  Using cached beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting bleach!=5.0.0 (from nbconvert->jupyter)
  Downloading bleach-6.1.0-py3-none-any.whl.metadata (30 kB)
Collecting defusedxml (from nbconvert->jupyter)
  Downloading defusedxml-0.7.1-py2.py3-none-any.whl (25 kB)
Collecting jupyterlab-pygments (from nbconvert->jupyter)
  Downloading jupyterlab_pygments-0.3.0-py3-none-any.whl.metadata (4.4 kB)
Collecting mistune<4,>=2.0.3 (from nbconvert->jupyte

In [16]:
!pip install --upgrade tqdm




In [58]:
!pip install faiss

ERROR: Could not find a version that satisfies the requirement faiss (from versions: none)
ERROR: No matching distribution found for faiss


# 5.Create dataset

## 5.0.Inititalizing

In [3]:
# Copy your GitHub token here
GITHUB_TOKEN = "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"
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"}

## 5.1.Fetch issues

### EDA

In [7]:
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)
response.status_code, type(response), response.json()

(200,
 requests.models.Response,
 [{'url': 'https://api.github.com/repos/huggingface/datasets/issues/6566',
   'repository_url': 'https://api.github.com/repos/huggingface/datasets',
   'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/6566/labels{/name}',
   'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/6566/comments',
   'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/6566/events',
   'html_url': 'https://github.com/huggingface/datasets/issues/6566',
   'id': 2069495429,
   'node_id': 'I_kwDODunzps57Wf6F',
   'number': 6566,
   'title': 'I train controlnet_sdxl in bf16 datatype, got unsupported ERROR in datasets',
   'user': {'login': 'HelloWorldBeginner',
    'id': 25008090,
    'node_id': 'MDQ6VXNlcjI1MDA4MDkw',
    'avatar_url': 'https://avatars.githubusercontent.com/u/25008090?v=4',
    'gravatar_id': '',
    'url': 'https://api.github.com/users/HelloWorldBeginner',
    'html_url': 'https://github.com/HelloW

### fetch issues implementation

In [1]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm
import requests


def fetch_issues(
    owner="huggingface",
    repo="datasets",
    # num_issues=10_000,
    num_issues=1_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(
            f"{base_url}/{owner}/{repo}/{query}", headers=HEADERS
        )
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print(f"Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(
        f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True
    )
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )

In [4]:
# Depending on your internet connection, this can take several minutes to run...
fetch_issues()

  0%|          | 0/10 [00:00<?, ?it/s]

Downloaded all the issues for datasets! Dataset stored at ./datasets-issues.jsonl


## 5.2.Load locally as HF dataset

In [5]:
from datasets import load_dataset

In [6]:
issues_dataset = load_dataset(
    "json", data_files="datasets-issues.jsonl", split="train"
)
issues_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request'],
    num_rows: 1000
})

## 5.3.Cleaning up the data

### EDA

In [7]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/huggingface/datasets/pull/6509
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/6509', 'html_url': 'https://github.com/huggingface/datasets/pull/6509', 'diff_url': 'https://github.com/huggingface/datasets/pull/6509.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/6509.patch', 'merged_at': datetime.datetime(2023, 12, 19, 9, 31, 3)}

>> URL: https://github.com/huggingface/datasets/issues/6540
>> Pull request: None

>> URL: https://github.com/huggingface/datasets/issues/5768
>> Pull request: None



ps:  
Here we can see that each pull request is associated with various URLs, while ordinary issues have a None entry. We can use this distinction to create a new is_pull_request column that checks whether the pull_request field is None or not:

### Implementation

In [8]:
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": False if x["pull_request"] is None else True}
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [9]:
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 1000
})

## 5.4.Augmenting the dataset

### EDA

In [10]:
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=HEADERS)
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/comments/897594128',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'issue_url': 'https://api.github.com/repos/huggingface/datasets/issues/2792',
  'id': 897594128,
  'node_id': 'IC_kwDODunzps41gDMQ',
  'user': {'login': 'bhavitvyamalik',
   'id': 19718818,
   'node_id': 'MDQ6VXNlcjE5NzE4ODE4',
   'avatar_url': 'https://avatars.githubusercontent.com/u/19718818?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bhavitvyamalik',
   'html_url': 'https://github.com/bhavitvyamalik',
   'followers_url': 'https://api.github.com/users/bhavitvyamalik/followers',
   'following_url': 'https://api.github.com/users/bhavitvyamalik/following{/other_user}',
   'gists_url': 'https://api.github.com/users/bhavitvyamalik/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/bhavitvyamalik/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/

### Implementation 

In [12]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=HEADERS)
    return [r["body"] for r in response.json()]


# Test our function works as expected
get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']

In [13]:
# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [15]:
issues_with_comments_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 1000
})

## 5.5.Uploading the dataset to the HF

In [18]:
from huggingface_hub import notebook_login

notebook_login()

# huggingface-cli login  # for terminal through cli

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [23]:
issues_with_comments_dataset.push_to_hub("ilbaks/github-issues")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CommitInfo(commit_url='https://huggingface.co/datasets/ilbaks/github-issues/commit/4c21e09eab5c5480d71567689323c6b557261c1b', commit_message='Upload dataset', commit_description='', oid='4c21e09eab5c5480d71567689323c6b557261c1b', pr_url=None, pr_revision=None, pr_num=None)

# 6.Semantic search with FAISS

## 6.1.Loading and preparing dataset

### 6.1.1.load dataset

In [25]:
from datasets import load_dataset

issues_dataset = load_dataset("ilbaks/github-issues", split="train")
issues_dataset

Downloading readme:   0%|          | 0.00/8.59k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.47M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 1000
})

### 6.1.2.filter only for issues not pull request

In [26]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 449
})

### 6.1.3.remain only the requires columns

In [27]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 449
})

### 6.1.4.convert to pandas and explode

In [28]:
# Set the format of the dataset to 'pandas' for easier manipulation
issues_dataset.set_format("pandas")

# Convert the entire dataset to a Pandas DataFrame
df = issues_dataset[:]

In [30]:
issues_dataset, df

(Dataset({
     features: ['html_url', 'title', 'comments', 'body'],
     num_rows: 449
 }),
                                               html_url  \
 0    https://github.com/huggingface/datasets/issues...   
 1    https://github.com/huggingface/datasets/issues...   
 2    https://github.com/huggingface/datasets/issues...   
 3    https://github.com/huggingface/datasets/issues...   
 4    https://github.com/huggingface/datasets/issues...   
 ..                                                 ...   
 444  https://github.com/huggingface/datasets/issues...   
 445  https://github.com/huggingface/datasets/issues...   
 446  https://github.com/huggingface/datasets/issues...   
 447  https://github.com/huggingface/datasets/issues...   
 448  https://github.com/huggingface/datasets/issues...   
 
                                                  title  \
 0     `drop_last_batch=True` for IterableDataset ma...   
 1    `ImportError`: cannot import name 'insecure_ha...   
 2          Document

In [31]:
df["comments"][0].tolist()

["My current workaround this issue is to return `None` in the second element and then filter out samples which have `None` in  them.\r\n\r\n```python\r\ndef merge_samples(batch):\r\n    if len(batch['a']) == 1:\r\n        batch['c'] = [batch['a'][0]]\r\n        batch['d'] = [None]\r\n    else:\r\n        batch['c'] = [batch['a'][0]]\r\n        batch['d'] = [batch['a'][1]]\r\n    return batch\r\n    \r\ndef filter_fn(x):\r\n    return x['d'] is not None\r\n\r\n# other code...\r\nmapped = mapped.filter(filter_fn)\r\n```"]

In [32]:
# transform each element of those lists into a separate row
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,`drop_last_batch=True` for IterableDataset ma...,My current workaround this issue is to return ...,### Describe the bug\r\n\r\nScenario:\r\n- Int...
1,https://github.com/huggingface/datasets/issues...,`ImportError`: cannot import name 'insecure_ha...,@Wauplin Do you happen to know what's up?,### Describe the bug\n\nYep its not [there](ht...
2,https://github.com/huggingface/datasets/issues...,`ImportError`: cannot import name 'insecure_ha...,<del>Installing `datasets` from `main` did the...,### Describe the bug\n\nYep its not [there](ht...
3,https://github.com/huggingface/datasets/issues...,`ImportError`: cannot import name 'insecure_ha...,@wasertech upgrading `huggingface_hub` to a ne...,### Describe the bug\n\nYep its not [there](ht...


### 6.1.5.convert back to dataset HF

In [33]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 1392
})

### 6.1.6.filter comments longer 15 sympbols length

In [34]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

Map:   0%|          | 0/1392 [00:00<?, ? examples/s]

In [35]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Filter:   0%|          | 0/1392 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 1019
})

### 6.1.7.concatenate  issue title, description, and comments together

In [36]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/1019 [00:00<?, ? examples/s]

In [37]:
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 1019
})

## 6.2.Creating text embeddings

### 6.2.1.Download model for tokenization

In [38]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

### 6.2.2.Create embeddings  using CLSpooling

 represent each entry in our GitHub issues corpus as a single vector, so we need to “pool” or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token.

In [40]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs [explanation with gpt](https://chat.openai.com/share/6c58d2f2-4798-459f-bde9-e689491337be)

In [45]:
import torch

device = torch.device("cuda")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [43]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [41]:
comments_dataset["text"][0]

' `drop_last_batch=True` for IterableDataset map function is ignored with multiprocessing DataLoader  \n ### Describe the bug\r\n\r\nScenario:\r\n- Interleaving two iterable datasets of unequal lengths (`all_exhausted`), followed by a batch mapping with batch size 2 to effectively merge the two datasets and get a sample from each dataset in a single batch, with `drop_last_batch=True` to skip the last batch in case it doesn\'t have two samples.\r\n\r\nWhat works:\r\n- Using DataLoader with `num_workers=0`\r\n\r\nWhat does not work:\r\n- Using DataLoader with `num_workers=1`, errors in the last batch.\r\n\r\nBasically, `drop_last_batch=True` is ignored when using multiple dataloading workers.\r\n\r\nPlease take a look at the minimal repro script below.\r\n\r\n### Steps to reproduce the bug\r\n\r\n```python\r\nfrom datasets import Dataset, interleave_datasets\r\nfrom torch.utils.data import DataLoader\r\n\r\n\r\ndef merge_samples(batch):\r\n    assert len(batch[\'a\']) == 2, "Batch size m

In [46]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [47]:
embeddings_dataset = comments_dataset.map(
    lambda x: {
        "embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]
    }
)

Map:   0%|          | 0/1019 [00:00<?, ? examples/s]

In [48]:
embeddings_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 1019
})

In [54]:
len(embeddings_dataset["embeddings"][0])

768

In [51]:
embeddings_dataset["embeddings"][0][:10]

[-0.2956484258174896,
 -0.06441400945186615,
 -0.03195429593324661,
 0.1042032539844513,
 -0.08406780660152435,
 -0.0683007761836052,
 0.6586068868637085,
 0.20464758574962616,
 0.23949621617794037,
 0.4104593098163605]

## 6.3.Using FAISS for efficient similarity search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we’ll use a special data structure in 🤗 Datasets called a [FAISS index](https://faiss.ai/). FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding.

In [55]:
embeddings_dataset.add_faiss_index(column="embeddings")

ImportError: You must install Faiss to use FaissIndex. To do so you can run `conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. A community supported package is also available on pypi: `pip install faiss-cpu` or `pip install faiss-gpu`. Note that pip may not have the latest version of FAISS, and thus, some of the latest features and bug fixes may not be available.

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let’s test this out by first embedding a question as follows:

In [None]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let’s collect these in a pandas.DataFrame so we can easily sort them:

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments:

In [None]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()


"""
COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`

We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.

Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :)

I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature.

----------

> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?

Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones.
For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.

----------

About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
>
> 1. (online machine)
>
> ```
>
> import datasets
>
> data = datasets.load_dataset(...)
>
> data.save_to_disk(/YOUR/DATASET/DIR)
>
> ```
>
> 2. copy the dir from online to the offline machine
>
> 3. (offline machine)
>
> ```
>
> import datasets
>
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
>
> ```
>
>
>
> HTH.


SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`

HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""