# Sentence Similarity with Pretrained BERT
In this notebook, we use pretrained [BERT](https://arxiv.org/abs/1810.04805) as a sentence encoder to measure sentence similarity. The encoder is a [feature extractor](../../utils_nlp/models/bert/extract_features.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert).

### 00 Global Settings

In [1]:
import sys
import os
import torch
import itertools
import numpy as np
import pandas as pd
from collections import OrderedDict

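sys.path.append("../../")
# snli and Split are used in section 02; this import path is assumed from the utils_nlp package layout
from utils_nlp.dataset import Split, snli
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.extract_features import BERTSentenceEncoder, PoolingStrategy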

In [2]:
# path config
CACHE_DIR = "./temp"
os.makedirs(CACHE_DIR, exist_ok=True)  # no-op if the directory already exists

# model config
LANGUAGE = Language.ENGLISH
TO_LOWER = True
MAX_SEQ_LENGTH = 128
NUM_GPUS = 0

### 01 Define the Sentence Encoder with Pretrained BERT

By default, `BERTSentenceEncoder` uses the pretrained BERT model.

In [3]:
se_pretrained = BERTSentenceEncoder(
    language=LANGUAGE,
    num_gpus=NUM_GPUS,
    cache_dir=CACHE_DIR,
    to_lower=TO_LOWER,
    max_len=MAX_SEQ_LENGTH,
)

### 02 Define the Sentence Encoder with Finetuned BERT

Alternatively, we can fine-tune a `BERTSequenceClassifier`, initialized with pretrained weights, on the SNLI dataset and then pass it to the BERT sentence encoder as `bert_model`.

In [4]:
train_df = snli.load_pandas_df(
    os.path.join(CACHE_DIR, "data"), file_split=Split.TRAIN
)
train_df = snli.clean_df(train_df)

In [5]:
tokenizer = Tokenizer(LANGUAGE, to_lower=TO_LOWER, cache_dir=CACHE_DIR)

In [6]:
train_tokens_sentence1 = tokenizer.tokenize(train_df.sentence1)
train_tokens_sentence2 = tokenizer.tokenize(train_df.sentence2)

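# Pair the i-th token list of sentence1 with the i-th token list of sentence2.
# zip avoids building a ragged NumPy array from token lists of different lengths.
train_token_pairs = [
    list(pair) for pair in zip(train_tokens_sentence1, train_tokens_sentence2)
]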

100%|██████████| 549361/549361 [01:50<00:00, 4961.30it/s]
100%|██████████| 549361/549361 [01:14<00:00, 7360.49it/s]


In [7]:
se = BERTSentenceEncoder(
    language=LANGUAGE,
    num_gpus=NUM_GPUS,
    cache_dir=CACHE_DIR,
    to_lower=TO_LOWER,
    max_len=MAX_SEQ_LENGTH,
)

### 03 Compute the Sentence Encodings

The `encode` method of the sentence encoder accepts a list of texts to encode, the indices of the layers to extract embeddings from, and the pooling strategy used to combine token embeddings into one vector per sentence. The embedding size is 768. By default the result is a DataFrame; setting the `as_numpy` parameter to `True` returns just the `values` column as a list of numpy arrays.
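
To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling over token embeddings. It is not the library's implementation: the function name `mean_pool`, the toy shapes, and the use of an attention mask to skip padding are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    Both names are hypothetical and only used in this sketch.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for sequences with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1, None)
    return summed / counts


# Toy batch: 2 sentences, 4 token positions, hidden size 3 (BERT base would use 768).
token_embeddings = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

sentence_vectors = mean_pool(token_embeddings, attention_mask)
print(sentence_vectors.shape)  # one vector of size `hidden` per sentence
```

Each sentence is reduced to one fixed-size vector regardless of its token count, which is what makes the encodings directly comparable.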

In [4]:
se.encode(
    ["Coffee is good", "The moose is across the street"],
    layer_indices=[-2],
    pooling_strategy=PoolingStrategy.MEAN,
    as_numpy=False
)

100%|██████████| 2/2 [00:00<00:00, 2003.97it/s]


| | text_index | layer_index | values |
|---|---|---|---|
| 0 | 0 | -2 | [0.03808079, 0.0926698, 0.0366186, -0.12183700... |
| 1 | 1 | -2 | [0.08424112, 0.099506006, -0.38437766, 0.21644... |
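
Once each sentence has an encoding, a common way to score similarity is the cosine of the angle between the two vectors. The sketch below uses short toy vectors in place of real 768-dimensional encodings; the helper name `cosine_similarity` is hypothetical and not part of `utils_nlp`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine similarity of two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


# Toy stand-ins for two sentence encodings.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]

sim = cosine_similarity(u, v)  # approximately 0.5 for these toy vectors
print(round(sim, 3))
```

With real encodings, the two inputs would be the arrays returned by `encode(..., as_numpy=True)`, assuming one array per input sentence as described above.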
