<a href="https://colab.research.google.com/github/kisakiwata/CV_huggingface/blob/main/Semantic_search_with_Bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semantic search (Bert, PyTorch)



Explore the comments on issues reported in the Hugging Face `datasets` GitHub repository and store the comments as embeddings.

Install the Transformers, Datasets, and Evaluate libraries, plus FAISS

In [42]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu



In [43]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [44]:
# each row of the dataset is a dictionary (a JSON-like record)
issues_dataset[0]

{'url': 'https://api.github.com/repos/huggingface/datasets/issues/2955',
 'repository_url': 'https://api.github.com/repos/huggingface/datasets',
 'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/labels{/name}',
 'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/comments',
 'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/2955/events',
 'html_url': 'https://github.com/huggingface/datasets/pull/2955',
 'id': 1003999469,
 'node_id': 'PR_kwDODunzps4sHuRu',
 'number': 2955,
 'title': 'Update legacy Python image for CI tests in Linux',
 'user': {'login': 'albertvillanova',
  'id': 8515462,
  'node_id': 'MDQ6VXNlcjg1MTU0NjI=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/8515462?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/albertvillanova',
  'html_url': 'https://github.com/albertvillanova',
  'followers_url': 'https://api.github.com/users/albertvillanova/followers',
  'following_u

In [45]:
# keep only genuine issues: drop pull requests and drop issues that have no comments

issues_dataset = issues_dataset.filter(
    lambda x: (not x["is_pull_request"]) and len(x["comments"]) > 0
)
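
The predicate passed to `filter` can be checked on a few hand-written records before running it on the full dataset. The sketch below uses plain Python lists and made-up records (the values are assumptions for illustration, not rows from the dataset) to show which combinations survive.

```python
# Toy records with only the two fields the filter inspects (values are made up).
records = [
    {"is_pull_request": True, "comments": ["LGTM"]},
    {"is_pull_request": False, "comments": []},
    {"is_pull_request": False, "comments": ["Thanks for reporting!"]},
]


def keep(record):
    # Keep genuine issues (not pull requests) that have at least one comment.
    return (not record["is_pull_request"]) and len(record["comments"]) > 0


kept = [r for r in records if keep(r)]
print(len(kept))
```

Only the third record passes: the first is a pull request and the second has no comments.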

In [46]:
# column names
issues_dataset.column_names

['url',
 'repository_url',
 'labels_url',
 'comments_url',
 'events_url',
 'html_url',
 'id',
 'node_id',
 'number',
 'title',
 'user',
 'labels',
 'state',
 'locked',
 'assignee',
 'assignees',
 'milestone',
 'comments',
 'created_at',
 'updated_at',
 'closed_at',
 'author_association',
 'active_lock_reason',
 'pull_request',
 'body',
 'timeline_url',
 'performed_via_github_app',
 'is_pull_request']

In [47]:
# removing unnecessary columns

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
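
The column removal above relies on `symmetric_difference`, which matches a plain set difference only when every name in `columns_to_keep` really is a column. A small sketch with a shortened, made-up column list (an assumption for illustration) shows where the two diverge.

```python
columns = ["url", "title", "body", "html_url", "comments", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Symmetric difference: names that appear in exactly one of the two sets.
sym = set(columns_to_keep).symmetric_difference(columns)
# Plain difference: names in `columns` that are not in `columns_to_keep`.
diff = set(columns) - set(columns_to_keep)
print(sorted(sym), sorted(diff))

# With a misspelled name in the keep list the two results differ:
# the symmetric difference also contains the misspelled name itself.
typo_keep = ["title", "bodyy"]
sym_typo = set(typo_keep).symmetric_difference(columns)
diff_typo = set(columns) - set(typo_keep)
print("bodyy" in sym_typo, "bodyy" in diff_typo)
```

With the plain difference, only names that actually exist in the dataset can end up in the removal list.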

In [48]:
# converting to pandas dataframe

issues_dataset.set_format("pandas")
df = issues_dataset[:]

In [49]:
df.head(5)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | [Cool, I think we can do both :), @lhoestq now... | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | [Hi ! I guess the caching mechanism should hav... | ## Describe the bug\r\nAfter upgrading to data... |
| 2 | https://github.com/huggingface/datasets/issues... | OSCAR unshuffled_original_ko: NonMatchingSplit... | [I tried `unshuffled_original_da` and it is al... | ## Describe the bug\r\n\r\nCannot download OSC... |
| 3 | https://github.com/huggingface/datasets/issues... | load_dataset using default cache on Windows ca... | [Hi @daqieq, thanks for reporting.\r\n\r\nUnfo... | ## Describe the bug\r\nStandard process to dow... |
| 4 | https://github.com/huggingface/datasets/issues... | to_tf_dataset keeps a reference to the open da... | [I did some investigation and, as it seems, th... | To reproduce:\r\n```python\r\nimport datasets ... |


In [50]:
# each entry in the comments column is a list of comments (two for the first issue)

df["comments"][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

In [51]:
# use explode to give each comment its own row, repeating the other columns

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
| 2 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | Hi ! I guess the caching mechanism should have... | ## Describe the bug\r\nAfter upgrading to data... |
| 3 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | If it's easy enough to implement, then yes ple... | ## Describe the bug\r\nAfter upgrading to data... |
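
To make the effect of `explode` concrete, here is a pure-Python equivalent on two made-up rows (the titles and comments are assumptions for illustration). It is a sketch of the reshaping only, not of how pandas implements it.

```python
rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]

# One output row per comment; every other field is copied unchanged.
exploded = [
    {**row, "comments": comment}
    for row in rows
    for comment in row["comments"]
]

for row in exploded:
    print(row)
```

One difference worth knowing: this comprehension silently drops rows whose list is empty, whereas pandas keeps such a row with a missing value. Issues without comments were filtered out earlier, so the distinction does not matter here.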


In [52]:
# now converting back to Dataset format
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [53]:
comments_dataset['comments'][0]

'Cool, I think we can do both :)'

In [54]:
# creating a new column that holds the number of words in each comment
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

In [55]:
len(comments_dataset["comments"][0].split())

8

In [56]:
# split a comment into words to check its length; comments of 15 words or fewer are treated as uninformative and dropped in the next step
comments_dataset["comments"][0].split()

['Cool,', 'I', 'think', 'we', 'can', 'do', 'both', ':)']

In [57]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

In [58]:
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})
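
The length rule can be written as two small helpers, which makes the threshold easy to test in isolation. The names `word_count` and `is_informative` are hypothetical and not part of the notebook; this is a minimal sketch assuming whitespace-separated words are an acceptable proxy for comment length.

```python
def word_count(comment):
    # str.split() with no arguments splits on any run of whitespace.
    return len(comment.split())


def is_informative(comment, min_words=15):
    # Mirrors the notebook's rule: keep only comments with more than 15 words.
    return word_count(comment) > min_words


short_comment = "Cool, I think we can do both :)"
long_comment = " ".join(["word"] * 16)
print(word_count(short_comment), is_informative(short_comment))
print(word_count(long_comment), is_informative(long_comment))
```

Punctuation stays attached to the neighbouring word, so the count is approximate, which is enough for a coarse filter like this one.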

In [59]:
# concatenate the issue title, description, and comments together in a new text column

def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]
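
String concatenation with `+` fails if any of the three fields is `None`. The map above ran without error on this data, but a defensive variant is sketched below in case a field is missing; `concatenate_text_safe` is a hypothetical name, and treating a missing field as an empty string is an assumption about the desired behaviour.

```python
def concatenate_text_safe(example):
    # Treat a missing (None) field as an empty string before joining.
    parts = [example.get(key) or "" for key in ("title", "body", "comments")]
    # Same " \n " separator as concatenate_text above.
    return {"text": " \n ".join(parts)}


complete = {"title": "T", "body": "B", "comments": "C"}
missing_body = {"title": "T", "body": None, "comments": "C"}
print(repr(concatenate_text_safe(complete)["text"]))
print(repr(concatenate_text_safe(missing_body)["text"]))
```

For rows where all three fields are present, the result is identical to the original function.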

In [60]:
comments_dataset['text'][0]

'Protect master branch \n After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n  - Currently, simple merge commits are already disabled\r\n  - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n  - ~~This protection would reject direct pushes to master branch~~\r\n  - ~~If so, for each release (when we need to commit directly to the master branch), we should p

### Create Text Embeddings

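- Obtain token embeddings from a pretrained model.
- Pick a suitable checkpoint to load the model from.
- Pick the tokenizer that matches the checkpoint; the task here is asymmetric semantic search (a short query matched against longer documents).

- Chosen checkpoint: "sentence-transformers/multi-qa-mpnet-base-dot-v1", which is reported to work well for semantic search.
- Sentence Transformers checkpoints provide token embeddings that can be pooled into one vector per text.
- (`from_pt=True` converts PyTorch weights to TensorFlow; it is not needed here because this notebook uses PyTorch.)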

In [61]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [62]:
model

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [63]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#!nvidia-smi
#device = torch.device("cuda")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [64]:
# pooling turns the per-token embeddings into a single embedding per text
# here we take the first token ([CLS]) as the representation of the whole text
# there are several other ways to derive a sentence vector, see
# https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca

# quote from the tutorial:
# we’d like to represent each entry in our GitHub issues corpus as a single vector,
# so we need to “pool” or average our token embeddings in some way.
# One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token.



def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
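
The comments above mention that CLS pooling is only one of several ways to get a sentence vector. The sketch below contrasts it with masked mean pooling on a tiny hand-made batch. It uses plain Python lists instead of tensors so it runs without a model, and the numbers, the `cls_pool` and `mean_pool` names, and the attention mask are all assumptions for illustration.

```python
# Toy token embeddings: batch_size=2, sequence_length=3, hidden_size=2.
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [9.0, 9.0]],  # last position is padding
]
# 1 marks a real token, 0 marks padding.
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]


def cls_pool(hidden):
    # Same idea as last_hidden_state[:, 0]: take the first token of each sequence.
    return [sequence[0] for sequence in hidden]


def mean_pool(hidden, mask):
    # Average the token vectors of each sequence, ignoring padded positions.
    pooled = []
    for sequence, seq_mask in zip(hidden, mask):
        n_real = sum(seq_mask)
        dims = len(sequence[0])
        pooled.append([
            sum(tok[d] * m for tok, m in zip(sequence, seq_mask)) / n_real
            for d in range(dims)
        ])
    return pooled


print(cls_pool(last_hidden_state))
print(mean_pool(last_hidden_state, attention_mask))
```

Both functions return one vector per input text; which pooling works better depends on how the checkpoint was trained, so it is worth following the checkpoint's own recommendation.
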

In [65]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # putting inputs into GPU device
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [71]:
sample = comments_dataset["text"][0]

## class transformers.modeling_outputs.BaseModelOutputWithPooling

`( last_hidden_state: FloatTensor = None, pooler_output: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )`

* last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
* pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
* hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
* attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.



In [76]:
# tokenize the sample and move the inputs to the same device as the model
encoded_input = tokenizer(
    sample, padding=True, truncation=True, return_tensors="pt"
)
encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
model_output = model(**encoded_input)

In [68]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [78]:
# shape: (batch_size, sequence_length, hidden_size)
# truncation=True trims inputs that exceed the model's maximum input length; this sample has 469 tokens

model_output.last_hidden_state.shape

torch.Size([1, 469, 768])

In [79]:
# shape: (batch_size, hidden_size), derived from the first token
model_output.pooler_output.shape

torch.Size([1, 768])

In [81]:
len(comments_dataset["text"][0])

1897

In [86]:
model_output.last_hidden_state[: , 0].shape

torch.Size([1, 768])

In [87]:
# now apply this to every row and convert the result to NumPy, the format FAISS requires
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

## Using FAISS for efficient similarity search

* FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

* The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding.
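
To see what a nearest-neighbour lookup computes, the sketch below does an exhaustive search over a handful of made-up 2-dimensional vectors using squared L2 distance. It illustrates the idea only: FAISS offers much faster index structures, and `squared_l2` and `nearest` are hypothetical helper names, not FAISS functions.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, k):
    # Score every stored vector against the query, then keep the k closest.
    scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
query = [0.9, 0.9]
top = nearest(query, vectors, k=2)
for distance, index in top:
    print(index, distance)
```

With a distance metric, a smaller value means a closer match, so the first result here is the vector at index 1. A similarity metric such as the dot product works the other way round, with larger values ranking first.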

In [88]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

In [91]:
embeddings_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

In [92]:
# We can now perform queries on this index by doing a nearest neighbor lookup

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [93]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [96]:
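import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
# note: the default index built by add_faiss_index is assumed to use L2 distance,
# in which case a smaller score means a closer match; adjust the sort order if needed
samples_df.sort_values("scores", ascending=False, inplace=True)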

In [97]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505023956298828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555545806884766
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's n