# Semantic search with FAISS (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries, plus FAISS, to run this notebook.

In [55]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [78]:
import pandas as pd

In [79]:
issues_df = pd.read_csv('github_issues.csv')

In [80]:
pd.set_option('display.max_colwidth', 1000)

In [81]:
issues_df.head()

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"['Cool, I think we can do both :)'\n '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - Currently, simple merge commits are already disabled\r\n - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n - ~~This protection would reject direct pushes to master branch~~\r\n - ~~If so, for each release (when we need to commit directly to the master branch), we should previously disable the ..."
1,https://github.com/huggingface/datasets/issues/2943,Backwards compatibility broken for cached datasets that use `.filter()`,"[""Hi ! I guess the caching mechanism should have considered the new `filter` to be different from the old one, and don't use cached results from the old `filter`.\r\nTo avoid other users from having this issue we could make the caching differentiate the two, what do you think ?""\n ""If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of `transformers` CI tests.""\n ""Well it can cause issue with anyone that updates `datasets` and re-run some code that uses filter, so I'm creating a PR""\n ""I just merged a fix, let me know if you're still having this kind of issues :)\r\n\r\nWe'll do a release soon to make this fix available""\n 'Definitely works on several manual cases with our dummy datasets, thank you @lhoestq !'\n 'Fixed by #2947.']","## Describe the bug\r\nAfter upgrading to datasets `1.12.0`, some cached `.filter()` steps from `1.11.0` started failing with \r\n`ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}`\r\n\r\nRelated feature: https://github.com/huggingface/datasets/pull/2836\r\n\r\n:question: This is probably a `wontfix` bug, since it can be solved by simply cleaning the related cache dirs, but the workaround could be useful for someone googling the error :) \r\n\r\n## Workaround\r\nRemove the cache for the given dataset, e.g. `rm -rf ~/.cache/huggingface/datasets/librispeech_asr`.\r\n\r\n## Steps to reproduce the bug\r\n1. Delete `~/.cache/huggingface/datasets/librispeech_asr` if it exists.\r\n\r\n2. `pip install datasets==1.11.0` and run the following snippet:\..."
2,https://github.com/huggingface/datasets/issues/2941,OSCAR unshuffled_original_ko: NonMatchingSplitsSizesError,['I tried `unshuffled_original_da` and it is also not working'],"## Describe the bug\r\n\r\nCannot download OSCAR `unshuffled_original_ko` due to `NonMatchingSplitsSizesError`.\r\n\r\n## Steps to reproduce the bug\r\n\r\n```python\r\n>>> dataset = datasets.load_dataset('oscar', 'unshuffled_original_ko')\r\nNonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=25292102197, num_examples=7345075, dataset_name='oscar'), 'recorded': SplitInfo(name='train', num_bytes=25284578514, num_examples=7344907, dataset_name='oscar')}]\r\n```\r\n\r\n## Expected results\r\n\r\nLoading is successful.\r\n\r\n## Actual results\r\n\r\nLoading throws above error.\r\n\r\n## Environment info\r\n\r\n- `datasets` version: 1.12.1\r\n- Platform: Linux-5.4.0-81-generic-x86_64-with-glibc2.29\r\n- Python version: 3.8.10\r\n- PyArrow version: 5.0.0\r\n"
3,https://github.com/huggingface/datasets/issues/2937,load_dataset using default cache on Windows causes PermissionError: [WinError 5] Access is denied,"[""Hi @daqieq, thanks for reporting.\r\n\r\nUnfortunately, I was not able to reproduce this bug:\r\n```ipython\r\nIn [1]: from datasets import load_dataset\r\n ...: ds = load_dataset('wiki_bio')\r\nDownloading: 7.58kB [00:00, 26.3kB/s]\r\nDownloading: 2.71kB [00:00, ?B/s]\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\\r\n1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...\r\nDownloading: 334MB [01:17, 4.32MB/s]\r\nDataset wiki_bio downloaded and prepared to C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9. Subsequent calls will reuse thi\r\ns data.\r\n```\r\n\r\nThis kind of error messages usually happen because:\r\n- Your running Python script hasn't ...","## Describe the bug\r\nStandard process to download and load the wiki_bio dataset causes PermissionError in Windows 10 and 11.\r\n\r\n## Steps to reproduce the bug\r\n```python\r\nfrom datasets import load_dataset\r\nds = load_dataset('wiki_bio')\r\n```\r\n\r\n## Expected results\r\nIt is expected that the dataset downloads without any errors.\r\n\r\n## Actual results\r\nPermissionError see trace below:\r\n```\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...\r\nTraceback (most recent call last):\r\n File ""<stdin>"", line 1, in <module>\r\n File 
""C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\load.py"", line 1112, in load_dataset\r\n builder_instance.download_and_prepare(\r\n File ""C:\Users..."
4,https://github.com/huggingface/datasets/issues/2934,"to_tf_dataset keeps a reference to the open data somewhere, causing issues on windows","[""I did some investigation and, as it seems, the bug stems from [this line](https://github.com/huggingface/datasets/blob/8004d7c3e1d74b29c3e5b0d1660331cd26758363/src/datasets/arrow_dataset.py#L325). The lifecycle of the dataset from the linked line is bound to one of the returned `tf.data.Dataset`. So my (hacky) solution involves wrapping the linked dataset with `weakref.proxy` and adding a custom `__del__` to `tf.python.data.ops.dataset_ops.TensorSliceDataset` (this is the type of a dataset that is returned by `tf.data.Dataset.from_tensor_slices`; this works for TF 2.x, but I'm not sure `tf.python.data.ops.dataset_ops` is a valid path for TF 1.x) that deletes the linked dataset, which is assigned to the dataset object as a property. Will open a draft PR soon!""\n 'Thanks a lot for investigating !']","To reproduce:\r\n```python\r\nimport datasets as ds\r\nimport weakref\r\nimport gc\r\n\r\nd = ds.load_dataset(""mnist"", split=""train"")\r\nref = weakref.ref(d._data.table)\r\ntfd = d.to_tf_dataset(""image"", batch_size=1, shuffle=False, label_cols=""label"")\r\ndel tfd, d\r\ngc.collect()\r\nassert ref() is None, ""Error: there is at least one reference left""\r\n```\r\n\r\nThis causes issues because the table holds a reference to an open arrow file that should be closed. So on windows it's not possible to delete or move the arrow file afterwards.\r\n\r\nMoreover the CI test of the `to_tf_dataset` method isn't able to clean up the temporary arrow files because of this.\r\n\r\ncc @Rocketknight1"


In [82]:
issues_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808 entries, 0 to 807
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   html_url  808 non-null    object
 1   title     808 non-null    object
 2   comments  808 non-null    object
 3   body      805 non-null    object
dtypes: object(4)
memory usage: 25.4+ KB


In [83]:
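# Assign the result back instead of calling inplace=True on a column accessor,
# which may operate on a temporary copy and leave the DataFrame unchanged.
issues_df["body"] = issues_df["body"].fillna("No text")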

In [84]:
issues_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808 entries, 0 to 807
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   html_url  808 non-null    object
 1   title     808 non-null    object
 2   comments  808 non-null    object
 3   body      808 non-null    object
dtypes: object(4)
memory usage: 25.4+ KB


In [85]:
issues_df['comment_length'] = issues_df['comments'].map(lambda x: len(x.split()))
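
Note that the `comments` column arrives from the CSV as a single string per issue (the printed form of a list), not as a Python list. The lambda above therefore counts whitespace-separated tokens across all comments of an issue. A minimal sketch of that computation follows; the sample string is a shortened, hypothetical cell value and `comment_length` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical, shortened cell value: all comments of one issue as ONE string.
comments_cell = "['Cool, I think we can do both :)'\n '@lhoestq now the 2 are implemented.']"


def comment_length(cell: str) -> int:
    """Count whitespace-separated tokens, mirroring len(x.split()) in the cell above."""
    return len(cell.split())


# Brackets and quotes stay attached to their neighbouring tokens,
# so the count is only a rough proxy for the number of words.
print(comment_length(comments_cell))
```

Because of this, the threshold used in the next cell filters on an approximate word count over all comments, not on the number of comments.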

In [86]:
issues_df.head(2)

Unnamed: 0,html_url,title,comments,body,comment_length
0,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"['Cool, I think we can do both :)'\n '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - Currently, simple merge commits are already disabled\r\n - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n - ~~This protection would reject direct pushes to master branch~~\r\n - ~~If so, for each release (when we need to commit directly to the master branch), we should previously disable the ...",71
1,https://github.com/huggingface/datasets/issues/2943,Backwards compatibility broken for cached datasets that use `.filter()`,"[""Hi ! I guess the caching mechanism should have considered the new `filter` to be different from the old one, and don't use cached results from the old `filter`.\r\nTo avoid other users from having this issue we could make the caching differentiate the two, what do you think ?""\n ""If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of `transformers` CI tests.""\n ""Well it can cause issue with anyone that updates `datasets` and re-run some code that uses filter, so I'm creating a PR""\n ""I just merged a fix, let me know if you're still having this kind of issues :)\r\n\r\nWe'll do a release soon to make this fix available""\n 'Definitely works on several manual cases with our dummy datasets, thank you @lhoestq !'\n 'Fixed by #2947.']","## Describe the bug\r\nAfter upgrading to datasets `1.12.0`, some cached `.filter()` steps from `1.11.0` started failing with \r\n`ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}`\r\n\r\nRelated feature: https://github.com/huggingface/datasets/pull/2836\r\n\r\n:question: This is probably a `wontfix` bug, since it can be solved by simply cleaning the related cache dirs, but the workaround could be useful for someone googling the error :) \r\n\r\n## Workaround\r\nRemove the cache for the given dataset, e.g. `rm -rf ~/.cache/huggingface/datasets/librispeech_asr`.\r\n\r\n## Steps to reproduce the bug\r\n1. Delete `~/.cache/huggingface/datasets/librispeech_asr` if it exists.\r\n\r\n2. `pip install datasets==1.11.0` and run the following snippet:\...",142


In [87]:
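# .copy() makes the filtered frame independent, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
issues_df = issues_df[issues_df.comment_length > 15].copy()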

In [88]:
issues_df['alltext'] = issues_df['title'] + issues_df['comments'] + issues_df['body']
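
The concatenation above joins the three fields with nothing between them, so the last word of the title is fused to the first character of the comments. One possible variant, sketched here as an assumption and not as what the notebook ran, inserts a separator between the fields. `build_alltext` and its `sep` parameter are hypothetical names; switching to it would change the embeddings and scores shown further down.

```python
def build_alltext(title: str, comments: str, body: str, sep: str = " \n ") -> str:
    """Join the three text fields in the notebook's order, with a separator between them."""
    return sep.join([title, comments, body])


# Toy values standing in for one row of the DataFrame.
print(build_alltext("Protect master branch", "['Cool, I think we can do both :)']", "After accidental merge commit ..."))
```

With pandas, the same idea could be written as `issues_df['title'] + " \n " + issues_df['comments'] + " \n " + issues_df['body']`.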

In [89]:
issues_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 713 entries, 0 to 807
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   html_url        713 non-null    object
 1   title           713 non-null    object
 2   comments        713 non-null    object
 3   body            713 non-null    object
 4   comment_length  713 non-null    int64 
 5   alltext         713 non-null    object
dtypes: int64(1), object(5)
memory usage: 39.0+ KB


In [93]:
from datasets import Dataset

issues_dataset = Dataset.from_pandas(issues_df)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'alltext', '__index_level_0__'],
    num_rows: 713
})

In [94]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [95]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
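
To see what the `[:, 0]` slice selects, here is a dependency-free sketch on nested lists. It assumes the hidden states have the layout (batch, tokens, hidden size), which matches the `(1, 768)` result shape shown later; `cls_pooling_lists` and the toy numbers are illustrative only.

```python
# Toy stand-in for last_hidden_state: 2 sequences, 3 tokens each, hidden size 4.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]


def cls_pooling_lists(hidden):
    """Keep only the first token's vector of every sequence (the [:, 0] slice)."""
    return [sequence[0] for sequence in hidden]


pooled = cls_pooling_lists(last_hidden_state)
print(len(pooled), len(pooled[0]))  # one vector of hidden size per sequence
```

Each input text is thus represented by a single vector, the one at the first token position, regardless of how many tokens the text contains.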

In [96]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [97]:
embedding = get_embeddings(issues_dataset["alltext"][0])
embedding.shape

TensorShape([1, 768])

In [98]:
embedding[0:1]

<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-1.73256457e-01, -7.45784268e-02, -1.29896492e-01,
        -1.91489965e-01, -2.53821760e-01, -2.37534806e-01,
         1.72121376e-01,  2.30887443e-01, -3.39684263e-02,
        -1.02802664e-02,  2.30401605e-01, -3.47656347e-02,
        -1.21524930e-01,  2.46941462e-01, -4.85174544e-02,
         1.59288332e-01,  1.60679877e-01,  3.09095345e-02,
        -1.09926477e-01, -3.74729186e-03, -5.27047366e-03,
        -8.98313373e-02,  1.90739408e-01,  5.71844056e-02,
        -5.15568368e-02, -4.92822677e-02,  7.48017877e-02,
         1.71193346e-01, -4.42930341e-01, -4.17683303e-01,
         9.23647732e-02,  2.22596422e-01, -3.05245630e-02,
         5.22523046e-01, -9.90139379e-05,  4.36451882e-02,
         2.02781215e-01,  2.80355811e-02, -1.27011716e-01,
        -2.62358546e-01, -4.68172342e-01, -4.17620450e-01,
        -1.07307851e-01, -9.51803550e-02,  1.65016726e-01,
        -4.29863967e-02, -2.01586802e-02,  7.90846497e-02,
      

In [99]:
embeddings_dataset = issues_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["alltext"]).numpy()[0]}
)
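
The `map` call above runs the model once per row. Since `get_embeddings` accepts a list of texts, the rows could instead be processed in groups. The helper below is a minimal sketch of that grouping step only; `chunks` is a hypothetical name and the batch size is arbitrary.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = ["issue one", "issue two", "issue three", "issue four", "issue five"]
for batch in chunks(texts, 2):
    # In the notebook, each batch could be passed to get_embeddings(batch).
    print(batch)
```

`Dataset.map` also offers `batched=True` with a `batch_size` argument for this purpose; in that mode the function receives lists of values and must return one embedding per row.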




In [100]:
embeddings_dataset.add_faiss_index(column="embeddings")


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'alltext', '__index_level_0__', 'embeddings'],
    num_rows: 713
})

In [101]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [102]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
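
Conceptually, an exhaustive ("flat") nearest-neighbour lookup scores every stored vector against the query and keeps the `k` best. The sketch below shows that idea in plain Python. It assumes squared L2 distance as the score, which is an assumption made here and not something the notebook states; `squared_l2` and `nearest` are illustrative names, not FAISS or Datasets functions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, k):
    """Return the k (distance, index) pairs with the smallest distance to the query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]


# Toy 2-dimensional "embeddings" and a query.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
query = [0.9, 1.2]

for distance, index in nearest(query, vectors, k=2):
    print(index, round(distance, 2))
```

If the scores returned by the index are distances of this kind, smaller values mean closer matches, so it is worth checking which metric the index uses before deciding how to sort the results.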

In [103]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [104]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: ['Transferred to `datasets` based on the stack trace.'
 "Hi @lkcao !\r\nYour issue is indeed related to `datasets`. In addition to installing the package manually, you will need to download the `text.py` script on your server. You'll find it (under `datasets/datasets/text`: https://github.com/huggingface/datasets/blob/master/datasets/text/text.py.\r\nThen you can change the line 221 of `run_mlm_new.py` into:\r\n```python\r\n  datasets = load_dataset('/path/to/text.py', data_files=data_files)\r\n```\r\nWhere `/path/to/text.py` is the path on the server where you saved the `text.py` script."
 "We're working on including the local dataset builders (csv, text, json etc.) directly in the `datasets` package so that they can be used offline"
 "The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)\r\nYou can now use them offline\r\n```python\r\ndatasets = load_dataset('text', data_files=data_files)\r\n```\r\n\r\nWe'll do a new r