# Hugging Face semantic search

This notebook explores semantic search using the `sentence-transformers/msmarco-bert-base-dot-v5` embedding model.

References:

- SBERT [semantic search guide][semantic_search_guide]
- SBERT [pre-trained models][sbert_pretrained_models]
- Installation: [`sentence-transformers`][sentence_transformers]
- Code reference [`semantic_search.py`][semantic_search_py]
- Model info: [`msmarco-bert-base-dot-v5`][msmarco-bert-base-dot-v5]



[semantic_search_guide]: https://www.sbert.net/examples/applications/semantic-search/README.html
[sbert_pretrained_models]: https://www.sbert.net/docs/pretrained_models.html#msmarco-passage-models
[sentence_transformers]: https://www.sbert.net/docs/installation.html
[semantic_search_py]: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py
[msmarco-bert-base-dot-v5]: https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5

## Setup

### Working directory

In [1]:
# Change the working directory to project root
import pathlib
import os
ROOT_DIR = pathlib.Path.cwd().parent
os.chdir(ROOT_DIR)

### Dependencies

In [2]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import torch
from src.torch_utils import get_torch_device, min_max_scaler, standard_scaler
from src.data import load_as_dataframe, slice_by_claim, \
    get_claim_texts, get_evidence_texts, get_paired_texts



In [3]:
# Get the best torch device
torch_device = get_torch_device()

Torch device is 'mps'


In [4]:
# Directories for saving/loading
EMB_PATH = ROOT_DIR.joinpath("result", "embeddings", "*")
EMB_PATH

PosixPath('/Users/johnsonzhou/git/comp90042-project/result/embeddings/*')

## Load Data

### Load from JSON

In [5]:
data_names = ["train-claims", "dev-claims", "evidence"]
train_claims, dev_claims = load_as_dataframe(data_names)

Loaded train-claims
Loaded dev-claims
Loaded evidence


### Sample inspection

In [6]:
sample_claim = slice_by_claim(train_claims, None, 2, ["SUPPORTS", "DISPUTED"])
sample_claim

| claim | claim_text | claim_label | evidences | evidence_text |
| --- | --- | --- | --- | --- |
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-442946 | At very high concentrations (100 times atmosph... |
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-1194317 | Plants can grow as much as 50 percent faster i... |
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-12171 | Higher carbon dioxide concentrations will favo... |
| claim-2510 | In 1946, PDO switched to a cool phase. | SUPPORTS | evidence-530063 | There is evidence of reversals in the prevaili... |
| claim-2510 | In 1946, PDO switched to a cool phase. | SUPPORTS | evidence-984887 | 1945/1946: The PDO changed to a "cool" phase, ... |


In [7]:
sample_claim_texts = get_claim_texts(sample_claim)
sample_claim_texts

['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
 'In 1946, PDO switched to a cool phase.']

In [8]:
sample_evidence_texts = get_evidence_texts(sample_claim)
sample_evidence_texts

['At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.',
 'Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.',
 'Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.',
 'There is evidence of reversals in the prevailing polarity (meaning changes in cool surface waters versus warm surface waters within the region) of the oscillation occurring around 1925, 1947, and 1977; the last two reversals corresponded with dramatic shifts in salmon production regimes in the North Pacific Ocean.',
 '1945/1946: The PDO changed to a "cool" phase, the pattern of this regime shift is similar to the 19

## Select HF embedding model

For our task, the queries are **short** claims and the documents to retrieve are **long** evidence passages, so this is an **asymmetric semantic search** problem. According to the SBERT semantic search guide, models trained on MS MARCO are the most suitable for this setting.

In [9]:
embedder = SentenceTransformer(
    "sentence-transformers/msmarco-bert-base-dot-v5",
    device=torch_device
)
embedder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

## Explore embeddings

### Claim embeddings

In [10]:
sample_claim_embeddings = embedder.encode(
    sample_claim_texts, convert_to_tensor=True, device=torch_device
)

print(sample_claim_embeddings.shape)
print(sample_claim_embeddings)

torch.Size([2, 768])
tensor([[ 0.0797, -0.1350, -0.1297,  ...,  0.3374,  0.2907, -0.1008],
        [-0.4509, -0.2364,  0.0293,  ...,  0.0458,  0.0530,  0.0498]],
       device='mps:0')


### Evidence embeddings

In [11]:
sample_evidence_embeddings = embedder.encode(
    sample_evidence_texts, convert_to_tensor=True, device=torch_device
)

print(sample_evidence_embeddings.shape)
print(sample_evidence_embeddings)

torch.Size([5, 768])
tensor([[ 0.0680,  0.4074, -0.1831,  ...,  0.3127,  0.5444,  0.3580],
        [-0.1647,  0.1412,  0.2400,  ...,  0.3577,  0.3601,  0.0757],
        [ 0.0445, -0.0545,  0.1051,  ...,  0.4495,  0.0975,  0.0455],
        [-0.4079,  0.1890,  0.1829,  ..., -0.0434,  0.2071,  0.2800],
        [-0.4422, -0.3580,  0.2699,  ...,  0.1765,  0.0556,  0.0731]],
       device='mps:0')


## Semantic Search

### Similarity scoring

The `msmarco-bert-base-dot-v5` model is designed to be used with dot product scoring.
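As a rough illustration of what dot-product scoring computes, the sketch below builds a claims-by-evidences score matrix in plain Python. The helper name `dot_score_matrix` and the toy 3-dimensional vectors are assumptions made for illustration only; the notebook itself relies on `util.dot_score` applied to 768-dimensional embeddings.

```python
def dot_score_matrix(queries, passages):
    """Return a len(queries) x len(passages) matrix of dot products."""
    return [
        [sum(q_i * p_i for q_i, p_i in zip(q, p)) for p in passages]
        for q in queries
    ]


# Toy embeddings (hypothetical values, 3 dimensions instead of 768)
toy_claims = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
toy_evidences = [[2.0, 0.0, 1.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]

toy_dot_scores = dot_score_matrix(toy_claims, toy_evidences)
# Rows are claims, columns are evidences
for row in toy_dot_scores:
    print(row)
```

Because the dot product is not normalised by vector length, its values are unbounded and depend on the magnitude of the embeddings, which is why the raw scores in this notebook are not confined to the range [-1, 1].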

In [12]:
scorer = util.dot_score

sample_scores = scorer(
    a=sample_claim_embeddings,
    b=sample_evidence_embeddings
)
sample_scores
# Rows are claims
# Columns are evidences

tensor([[176.9947, 173.2845, 171.6822, 163.7634, 162.2655],
        [157.8367, 159.5848, 157.4214, 169.8187, 183.0319]], device='mps:0')

### Predict related evidence using arbitrary cut-off

In [13]:
score_cutoff = 170

#### Cutoff vs sample evidences

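This shows that a fixed cutoff score can work on a small evidence set: with a cutoff of 170, all three evidences paired with the first claim are selected, along with one of the two evidences paired with the second claim, and no evidence belonging to the other claim is selected.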

In [14]:
for claim_text, evidence_scores in zip(sample_claim_texts, sample_scores):
    related_evidences = evidence_scores > score_cutoff
    print("Claim text:\n", claim_text, "\n")
    print((
        sample_claim.reset_index()[["evidences", "evidence_text"]][
            related_evidences.cpu().numpy()
        ]
    ), "\n")

Claim text:
 Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. 

          evidences                                      evidence_text
0   evidence-442946  At very high concentrations (100 times atmosph...
1  evidence-1194317  Plants can grow as much as 50 percent faster i...
2    evidence-12171  Higher carbon dioxide concentrations will favo... 

Claim text:
 In 1946, PDO switched to a cool phase. 

         evidences                                      evidence_text
4  evidence-984887  1945/1946: The PDO changed to a "cool" phase, ... 



#### Cutoff vs training evidences

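A fixed cutoff score is harder to use when the corpus is larger: scored against all 4,122 training evidence rows, many evidences that are not paired with the claim also exceed the cutoff of 170.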
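One possible way around a hand-picked cutoff on a larger corpus is to rank the evidences and keep only the `k` best per claim. The sketch below is a minimal plain-Python version of that idea, not the approach fixed by this notebook. The function `top_k_unique` and the toy ids and scores are hypothetical, and the deduplication step assumes that the evidence table can list the same evidence id on several rows (once for each claim it is paired with).

```python
import heapq


def top_k_unique(evidence_ids, scores, k):
    """Return the k best (evidence_id, score) pairs, one entry per evidence id."""
    best = {}
    for evidence_id, score in zip(evidence_ids, scores):
        # Keep only the highest score seen for each evidence id
        if evidence_id not in best or score > best[evidence_id]:
            best[evidence_id] = score
    return heapq.nlargest(k, best.items(), key=lambda pair: pair[1])


# Hypothetical ids and scores; "evidence-b" appears twice on purpose
toy_ids = ["evidence-a", "evidence-b", "evidence-c", "evidence-b", "evidence-d"]
toy_id_scores = [171.2, 176.9, 169.4, 176.9, 173.1]

top_two = top_k_unique(toy_ids, toy_id_scores, k=2)
print(top_two)
```

Ranking sidesteps the question of what absolute score counts as "related", at the cost of always returning `k` results even when fewer (or more) evidences are actually relevant.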

In [15]:
# Get embeddings for all evidence in train_claims
train_evidence_embeddings_fp = \
    EMB_PATH.with_name("train_evidence_msmarco_bert_base_dot_v5.pt")
if os.path.exists(train_evidence_embeddings_fp):
    # Load it from file if we have already created it
    with open(train_evidence_embeddings_fp, mode="rb") as f:
        train_evidence_embeddings = torch.load(f=f, map_location=torch_device)
else:
    # Create the embeddings and save it to file
    train_evidence_embeddings = embedder.encode(
        sentences=get_evidence_texts(train_claims),
        device=torch_device,
        batch_size=64,
        show_progress_bar=True,
        convert_to_tensor=True,
    )
    with open(train_evidence_embeddings_fp, mode="wb") as f:
        torch.save(obj=train_evidence_embeddings, f=f)

In [16]:
sample_scores = scorer(
    a=sample_claim_embeddings,
    b=train_evidence_embeddings
)
print(sample_scores.shape)
sample_scores
# Rows are claims
# Columns are evidences

torch.Size([2, 4122])


tensor([[176.9947, 173.2845, 171.6822,  ..., 162.7350, 164.4235, 163.3744],
        [157.8367, 159.5848, 157.4214,  ..., 159.3624, 162.1377, 161.1283]],
       device='mps:0')

In [17]:
for claim_text, evidence_scores in zip(sample_claim_texts, sample_scores):
    # Attach this claim's dot-product scores to the evidence rows
    related_evidences = (
        train_claims.reset_index()
        .assign(score=evidence_scores.cpu().numpy())
        .get(["evidences", "score", "evidence_text"])
    )
    print("Claim text:\n", claim_text, "\n")
    print(related_evidences[related_evidences["score"] > score_cutoff], "\n")

Claim text:
 Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. 

             evidences       score   
0      evidence-442946  176.994675  \
1     evidence-1194317  173.284454   
2       evidence-12171  171.682175   
3      evidence-338219  170.441483   
7     evidence-1177431  173.101151   
...                ...         ...   
4044    evidence-91099  171.580597   
4045   evidence-208053  173.186478   
4093  evidence-1009038  171.619537   
4094   evidence-864864  170.431671   
4096   evidence-296879  172.676620   

                                          evidence_text  
0     At very high concentrations (100 times atmosph...  
1     Plants can grow as much as 50 percent faster i...  
2     Higher carbon dioxide concentrations will favo...  
3     While ‘climate change’ can be due to natural f...  
7     There is no convincing scientific evidence tha...  
...                        

### Cosine similarity scoring

Exploratory work on cosine similarity scoring, motivated by the cosine similarity loss used to train Siamese BERT networks ([ref](https://arxiv.org/abs/1908.10084)).
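For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and lies in [-1, 1]. The sketch below shows this in plain Python; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, while the notebook itself uses `util.cos_sim`.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [-3.0, 0.0, 1.0]  # orthogonal to u

sim_uv = cosine_similarity(u, v)  # close to 1: same direction
sim_uw = cosine_similarity(u, w)  # 0: orthogonal vectors
print(sim_uv, sim_uw)
```

Doubling the length of a vector doubles its dot product with any other vector but leaves the cosine similarity unchanged, which is the main practical difference between the two scorers.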

In [18]:
scorer = util.cos_sim

sample_scores = scorer(
    a=sample_claim_embeddings,
    b=sample_evidence_embeddings
)
sample_scores
# Rows are claims
# Columns are evidences

tensor([[0.9191, 0.9136, 0.9141, 0.8516, 0.8493],
        [0.8205, 0.8422, 0.8390, 0.8840, 0.9589]], device='mps:0')

In [19]:
min_max_scaler(X=sample_scores, dim=1)

tensor([[1.0000, 0.9209, 0.9277, 0.0338, 0.0000],
        [0.0000, 0.1571, 0.1338, 0.4589, 1.0000]], device='mps:0')

In [20]:
standard_scaler(X=sample_scores, dim=1)

tensor([[ 0.8273,  0.6729,  0.6861, -1.0601, -1.1261],
        [-0.8744, -0.4818, -0.5400,  0.2722,  1.6241]], device='mps:0')

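The cells above compute cosine similarity scores and then try two row-wise scaling methods to spread the results apart. Min-max scaling always maps each claim's lowest score to 0 and its highest to 1, whereas the standard scaler expresses each score as a number of standard deviations from that claim's mean. The standard scaler looks much more useful here, since a single cutoff then has a comparable meaning across claims.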
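The implementations of `min_max_scaler` and `standard_scaler` live in `src.torch_utils` and are not shown in this notebook. The sketch below is an assumption about what they compute for a single row, written in plain Python with the hypothetical names `min_max_scale` and `standard_scale`. The input is the first row of cosine scores printed above, rounded to four decimals, so the results should only match the notebook output to about two decimal places. Using the sample standard deviation (dividing by n - 1) is also an assumption, chosen because it comes closest to the printed values.

```python
import statistics


def min_max_scale(row):
    """Map the row linearly so its minimum becomes 0 and its maximum 1."""
    lo, hi = min(row), max(row)
    return [(x - lo) / (hi - lo) for x in row]


def standard_scale(row):
    """Subtract the row mean and divide by the sample standard deviation."""
    mean = statistics.mean(row)
    std = statistics.stdev(row)  # sample std, divides by n - 1 (assumption)
    return [(x - mean) / std for x in row]


# First row of the cosine similarity scores, as printed (rounded)
first_row = [0.9191, 0.9136, 0.9141, 0.8516, 0.8493]

scaled_min_max = min_max_scale(first_row)
scaled_standard = standard_scale(first_row)
print([round(x, 4) for x in scaled_min_max])
print([round(x, 4) for x in scaled_standard])
```

Note that `min_max_scale` would divide by zero on a row whose scores are all equal, and `standard_scale` likewise; a production version would need to guard against that case.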

### Predict related evidence using scaled cut-off

In [29]:
scaled_cutoff = 0.9

#### Cutoff vs training evidences

In [48]:
sample_scores = scorer(
    a=sample_claim_embeddings,
    b=train_evidence_embeddings
)
print(sample_scores.shape)
sample_scores
# Rows are claims
# Columns are evidences

torch.Size([2, 4122])


tensor([[0.9191, 0.9136, 0.9141,  ..., 0.8648, 0.8742, 0.8550],
        [0.8205, 0.8422, 0.8390,  ..., 0.8477, 0.8630, 0.8441]],
       device='mps:0')

In [49]:
sample_scores = standard_scaler(sample_scores, dim=1)
sample_scores

tensor([[ 2.4697,  2.1836,  2.2082,  ..., -0.3455,  0.1460, -0.8508],
        [-1.7638, -0.3436, -0.5544,  ...,  0.0135,  1.0106, -0.2211]],
       device='mps:0')

In [50]:
for claim_text, evidence_scores in zip(sample_claim_texts, sample_scores):
    # Attach this claim's standardised cosine scores to the evidence rows
    related_evidences = (
        train_claims.reset_index()
        .assign(score=evidence_scores.cpu().numpy())
        .get(["evidences", "score", "evidence_text"])
    )
    print("Claim text:\n", claim_text, "\n")
    print(related_evidences[related_evidences["score"] > scaled_cutoff], "\n")

Claim text:
 Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. 

             evidences     score   
0      evidence-442946  2.469691  \
1     evidence-1194317  2.183646   
2       evidence-12171  2.208215   
3      evidence-338219  1.475389   
7     evidence-1177431  1.935595   
...                ...       ...   
4094   evidence-864864  1.086352   
4096   evidence-296879  1.146516   
4098   evidence-374377  0.981687   
4099   evidence-984234  1.003293   
4100   evidence-537406  1.328993   

                                          evidence_text  
0     At very high concentrations (100 times atmosph...  
1     Plants can grow as much as 50 percent faster i...  
2     Higher carbon dioxide concentrations will favo...  
3     While ‘climate change’ can be due to natural f...  
7     There is no convincing scientific evidence tha...  
...                                                

In [51]:
for claim_text, evidence_scores in zip(sample_claim_texts, sample_scores):
    # Attach this claim's standardised cosine scores, then show the top 10
    related_evidences = (
        train_claims.reset_index()
        .assign(score=evidence_scores.cpu().numpy())
        .get(["evidences", "score", "evidence_text"])
    )
    print("Claim text:\n", claim_text, "\n")
    print(related_evidences.sort_values(by="score", ascending=False).head(10), "\n")

Claim text:
 Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. 

             evidences     score   
1607   evidence-808385  3.194276  \
2408   evidence-808385  3.194276   
1153   evidence-808385  3.194276   
2788   evidence-316364  3.147586   
2285   evidence-499734  3.057801   
2384  evidence-1167485  2.924245   
1964   evidence-862615  2.893133   
962    evidence-149105  2.877542   
2245   evidence-453635  2.819163   
790    evidence-364767  2.794708   

                                          evidence_text  
1607  Pollutants emitted into the atmosphere by huma...  
2408  Pollutants emitted into the atmosphere by huma...  
1153  Pollutants emitted into the atmosphere by huma...  
2788  There are few studies of the health effects of...  
2285  While the full implications of elevated CO2 on...  
2384  Less direct geological evidence indicates that...  
1964  Elevated CO2 increases 