# Hugging Face sentence embedding

Using `sentence-transformers/msmarco-bert-base-dot-v5`

References:

- SBERT [semantic search guide][semantic_search_guide]
- SBERT [pre-trained models][sbert_pretrained_models]
- Installation: [`sentence-transformers`][sentence_transformers]
- Code reference [`semantic_search.py`][semantic_search_py]
- Model info: [`msmarco-bert-base-dot-v5`][msmarco-bert-base-dot-v5]



[semantic_search_guide]: https://www.sbert.net/examples/applications/semantic-search/README.html
[sbert_pretrained_models]: https://www.sbert.net/docs/pretrained_models.html#msmarco-passage-models
[sentence_transformers]: https://www.sbert.net/docs/installation.html
[semantic_search_py]: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py
[msmarco-bert-base-dot-v5]: https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5

## Setup

### Working directory

In [1]:
# Change the working directory to project root
import pathlib
import os
ROOT_DIR = pathlib.Path.cwd().parent
os.chdir(ROOT_DIR)

### Dependencies

In [2]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import torch
from src.sys_platform import is_apple_silicon
from src.torch_mps import get_mps_device
from src.data import (
    load_as_dataframe,
    slice_by_claim,
    get_claim_texts,
    get_evidence_texts,
    get_paired_texts,
)


## Load data

### Load from JSON

In [3]:
data_names = ["train-claims", "dev-claims", "evidence"]
train_claims, dev_claims = load_as_dataframe(data_names)

Loaded train-claims
Loaded dev-claims
Loaded evidence


### Sample inspection

In [4]:
sample_claim = slice_by_claim(train_claims, None, 2, ["SUPPORTS", "DISPUTED"])
sample_claim

| claim | claim_text | claim_label | evidences | evidence_text |
|---|---|---|---|---|
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-442946 | At very high concentrations (100 times atmosph... |
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-1194317 | Plants can grow as much as 50 percent faster i... |
| claim-1937 | Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. | DISPUTED | evidence-12171 | Higher carbon dioxide concentrations will favo... |
| claim-2510 | In 1946, PDO switched to a cool phase. | SUPPORTS | evidence-530063 | There is evidence of reversals in the prevaili... |
| claim-2510 | In 1946, PDO switched to a cool phase. | SUPPORTS | evidence-984887 | 1945/1946: The PDO changed to a "cool" phase, ... |


In [5]:
get_claim_texts(sample_claim)

['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
 'In 1946, PDO switched to a cool phase.']

In [6]:
get_evidence_texts(sample_claim)

['At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.',
 'Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.',
 'Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.',
 'There is evidence of reversals in the prevailing polarity (meaning changes in cool surface waters versus warm surface waters within the region) of the oscillation occurring around 1925, 1947, and 1977; the last two reversals corresponded with dramatic shifts in salmon production regimes in the North Pacific Ocean.',
 '1945/1946: The PDO changed to a "cool" phase, the pattern of this regime shift is similar to the 19

In [7]:
sample_pairs = get_paired_texts(sample_claim)
sample_pairs

[['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
  'Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
  'Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
  'In 1946, PDO switched to a cool phase.',
  'In 1946, PDO switched to a cool phase.'],
 ['At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.',
  'Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in cli

## Select HF embedding model

In this task, each query is a **short** claim and each corpus entry is a **longer** evidence passage, so the retrieval problem is an **asymmetric semantic search**. The semantic search guide recommends models trained on MS MARCO for this setting, which is why `msmarco-bert-base-dot-v5` is used here.
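
The `dot` in the model name indicates that it is meant to be scored with a dot product rather than cosine similarity. As a rough illustration of what that ranking step looks like, the sketch below scores a query vector against a few passage vectors and keeps the best matches. The vectors, `rank_passages`, and `dot` are made up for this example and are not part of the notebook or the library.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs, highest dot product first."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors, invented for illustration only.
toy_query = [1.0, 0.0, 2.0]
toy_passages = [
    [0.5, 0.5, 0.5],   # score 1.5
    [2.0, 1.0, 3.0],   # score 8.0
    [-1.0, 0.0, 1.0],  # score 1.0
]

top_hits = rank_passages(toy_query, toy_passages, top_k=2)
print(top_hits)  # [(1, 8.0), (0, 1.5)]
```

With real embeddings the same idea is usually applied to tensors; `sentence_transformers.util.semantic_search` accepts a `score_function` argument and `util.dot_score` can be passed for it, though that path is not exercised in this notebook.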

In [8]:
embedder = SentenceTransformer("sentence-transformers/msmarco-bert-base-dot-v5")

In [9]:
# Use MPS device if using Apple Silicon
if is_apple_silicon():
    mps_device = get_mps_device()
    embedder.to(device=mps_device)

MPS is available


## Explore embeddings

### Claim embeddings

In [10]:
claim_embeddings = embedder.encode(sample_pairs[0], convert_to_tensor=True)

print(claim_embeddings.shape)
print(claim_embeddings)

torch.Size([5, 768])
tensor([[ 0.0797, -0.1350, -0.1297,  ...,  0.3374,  0.2907, -0.1008],
        [ 0.0797, -0.1350, -0.1297,  ...,  0.3374,  0.2907, -0.1008],
        [ 0.0797, -0.1350, -0.1297,  ...,  0.3374,  0.2907, -0.1008],
        [-0.4509, -0.2364,  0.0293,  ...,  0.0458,  0.0530,  0.0498],
        [-0.4509, -0.2364,  0.0293,  ...,  0.0458,  0.0530,  0.0498]])
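
The repeated rows in the output above appear because each claim occurs once per paired evidence passage, so the same text is encoded several times. One possible way to avoid the redundant work is to encode each distinct text once and expand the result back to the original order. The sketch below shows the idea with a hypothetical stand-in encoder (`toy_encode`); `encode_unique` is likewise an illustrative helper, not a function from this project.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def encode_unique(
    texts: Sequence[str],
    encode_fn: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Encode each distinct text once, then expand back to the input order."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    unique_texts = list(dict.fromkeys(texts))
    unique_vecs = encode_fn(unique_texts)
    lookup = dict(zip(unique_texts, unique_vecs))
    return [lookup[t] for t in texts]


def toy_encode(batch: List[str]) -> List[Vector]:
    """Hypothetical stand-in encoder: maps a text to [length, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]


claims = ["claim A", "claim A", "claim A", "claim B text", "claim B text"]
vectors = encode_unique(claims, toy_encode)
print(len(vectors))  # 5
print(vectors[0], vectors[3])  # [7.0, 2.0] [12.0, 3.0]
```

In the notebook, `encode_fn` could plausibly be a thin wrapper around `embedder.encode`; the returned rows would then be tensors rather than Python lists, and would need stacking before further use.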


### Evidence embeddings

In [11]:
evidence_embeddings = embedder.encode(sample_pairs[1], convert_to_tensor=True)

print(evidence_embeddings.shape)
print(evidence_embeddings)

torch.Size([5, 768])
tensor([[ 0.0680,  0.4074, -0.1831,  ...,  0.3127,  0.5444,  0.3580],
        [-0.1647,  0.1412,  0.2400,  ...,  0.3577,  0.3601,  0.0757],
        [ 0.0445, -0.0545,  0.1051,  ...,  0.4495,  0.0975,  0.0455],
        [-0.4079,  0.1890,  0.1829,  ..., -0.0434,  0.2071,  0.2800],
        [-0.4422, -0.3580,  0.2699,  ...,  0.1765,  0.0556,  0.0731]])
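
Because `sample_pairs[0]` and `sample_pairs[1]` are aligned row by row, a natural next step is to score each claim against its own paired evidence. The sketch below does this with plain Python lists and toy 2-dimensional vectors; `paired_dot_scores` and the numbers are invented for illustration and do not reproduce the embeddings above.

```python
from typing import List, Sequence


def paired_dot_scores(
    left: Sequence[Sequence[float]],
    right: Sequence[Sequence[float]],
) -> List[float]:
    """Dot product of left[i] with right[i] for every aligned row i."""
    if len(left) != len(right):
        raise ValueError("both inputs must have the same number of rows")
    return [sum(a * b for a, b in zip(u, v)) for u, v in zip(left, right)]


# Toy vectors standing in for aligned claim and evidence embeddings.
toy_claims = [[1.0, 2.0], [1.0, 2.0], [0.0, -1.0]]
toy_evidence = [[3.0, 0.5], [0.0, 1.0], [2.0, 2.0]]

pair_scores = paired_dot_scores(toy_claims, toy_evidence)
print(pair_scores)  # [4.0, 2.0, -2.0]
```

For the actual tensors, `(claim_embeddings * evidence_embeddings).sum(dim=1)` computes the same row-wise scores, while `util.dot_score(claim_embeddings, evidence_embeddings)` returns the full 5 x 5 score matrix whose diagonal holds them.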
