# [Module] Semantic Textual Similarity & Semantic Search

### References


# 1. Background

# 2. Usage Examples

## 2.1. Training Data
- https://www.sbert.net/docs/training/overview.html

In [1]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

#Define the model, either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
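
`CosineSimilarityLoss` fits the cosine similarity of the two sentence embeddings to the float `label` of each `InputExample`; by default the library measures the gap with mean squared error. The sketch below reproduces that computation in plain Python on hand-made 2-D vectors. The vectors and the helper names `cosine_similarity` and `cosine_mse_loss` are illustrative assumptions, not part of sentence-transformers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_mse_loss(pairs, labels):
    """Mean squared error between cosine similarities and gold labels."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Hypothetical embeddings standing in for the two sentence pairs above.
pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction -> cosine 1.0
         ([0.0, 1.0], [1.0, 0.0])]   # orthogonal -> cosine 0.0
labels = [0.8, 0.3]

loss = cosine_mse_loss(pairs, labels)
print(round(loss, 4))
```

Training lowers this value by moving the embeddings so that each pair's cosine similarity approaches its label.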



In [2]:
from sentence_transformers import evaluation
sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
scores = [0.3, 0.6, 0.2]

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

# ... Your other code to load training data

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100, evaluator=evaluator, evaluation_steps=500)
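
`EmbeddingSimilarityEvaluator` scores a model by correlating the cosine similarity of each sentence pair with its gold score, and it reports Pearson and Spearman coefficients. A minimal plain-Python sketch of those two coefficients follows. The `predicted` values are made-up cosine scores, and `pearson`, `ranks`, and `spearman` are hypothetical helpers (ties are not handled).

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

gold = [0.3, 0.6, 0.2]            # the `scores` list from the cell above
predicted = [0.35, 0.70, 0.10]    # hypothetical cosine similarities

pearson_r = pearson(predicted, gold)
spearman_r = spearman(predicted, gold)
print(f"Pearson: {pearson_r:.4f}  Spearman: {spearman_r:.4f}")
```

Spearman compares only the ordering, so it reaches 1.0 here even though the predicted values differ from the gold scores.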


## 2.2. Continue Training on Other Data
- https://www.sbert.net/docs/training/overview.html

In [3]:
"""
This example loads the pre-trained SentenceTransformer model 'nli-distilroberta-base-v2' from the server.
It then fine-tunes this model for some epochs on the STS benchmark dataset.
Note: In this example, you must specify a SentenceTransformer model.
If you want to fine-tune a huggingface/transformers model like bert-base-uncased, see training_nli.py and training_stsbenchmark.py
"""
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, util, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
import logging
from datetime import datetime
import os
import gzip
import csv

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

# Check whether the dataset exists. If not, download it
sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'

if not os.path.exists(sts_dataset_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)




# Model and training configuration
model_name = 'nli-distilroberta-base-v2'
train_batch_size = 16
num_epochs = 4  # Matches the logged run below (4 epochs, 144 warm-up steps)
model_save_path = 'output/training_stsbenchmark_continue_training-'+model_name+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")



# Load a pre-trained sentence transformer model
model = SentenceTransformer(model_name)

# Convert the dataset to a DataLoader ready for training
logging.info("Read STSbenchmark train dataset")

train_samples = []
dev_samples = []
test_samples = []
with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)

        if row['split'] == 'dev':
            dev_samples.append(inp_example)
        elif row['split'] == 'test':
            test_samples.append(inp_example)
        else:
            train_samples.append(inp_example)



train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)


# Development set: Measure correlation between cosine score and gold labels
logging.info("Read STSbenchmark dev dataset")
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')


# Configure the training. The dev-set evaluator is passed to fit() below
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)  # 10% of the training steps are used for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))


# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)


##############################################################################
#
# Load the stored model and evaluate its performance on STS benchmark dataset
#
##############################################################################

model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)

2022-08-11 14:05:29 - Load pretrained SentenceTransformer: nli-distilroberta-base-v2



2022-08-11 14:05:39 - Use pytorch device: cuda
2022-08-11 14:05:39 - Read STSbenchmark train dataset
2022-08-11 14:05:39 - Read STSbenchmark dev dataset
2022-08-11 14:05:39 - Warmup-steps: 144


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/360 [00:00<?, ?it/s]

2022-08-11 14:06:01 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 0:
2022-08-11 14:06:02 - Cosine-Similarity :	Pearson: 0.8814	Spearman: 0.8820
2022-08-11 14:06:02 - Manhattan-Distance:	Pearson: 0.8482	Spearman: 0.8508
2022-08-11 14:06:02 - Euclidean-Distance:	Pearson: 0.8499	Spearman: 0.8530
2022-08-11 14:06:02 - Dot-Product-Similarity:	Pearson: 0.8242	Spearman: 0.8260
2022-08-11 14:06:02 - Save model to output/training_stsbenchmark_continue_training-nli-distilroberta-base-v2-2022-08-11_14-05-29



2022-08-11 14:06:24 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 1:
2022-08-11 14:06:26 - Cosine-Similarity :	Pearson: 0.8884	Spearman: 0.8888
2022-08-11 14:06:26 - Manhattan-Distance:	Pearson: 0.8621	Spearman: 0.8647
2022-08-11 14:06:26 - Euclidean-Distance:	Pearson: 0.8641	Spearman: 0.8673
2022-08-11 14:06:26 - Dot-Product-Similarity:	Pearson: 0.8408	Spearman: 0.8418
2022-08-11 14:06:26 - Save model to output/training_stsbenchmark_continue_training-nli-distilroberta-base-v2-2022-08-11_14-05-29



2022-08-11 14:06:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 2:
2022-08-11 14:06:50 - Cosine-Similarity :	Pearson: 0.8894	Spearman: 0.8895
2022-08-11 14:06:50 - Manhattan-Distance:	Pearson: 0.8684	Spearman: 0.8725
2022-08-11 14:06:50 - Euclidean-Distance:	Pearson: 0.8698	Spearman: 0.8745
2022-08-11 14:06:50 - Dot-Product-Similarity:	Pearson: 0.8510	Spearman: 0.8517
2022-08-11 14:06:50 - Save model to output/training_stsbenchmark_continue_training-nli-distilroberta-base-v2-2022-08-11_14-05-29



2022-08-11 14:07:12 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 3:
2022-08-11 14:07:13 - Cosine-Similarity :	Pearson: 0.8889	Spearman: 0.8888
2022-08-11 14:07:13 - Manhattan-Distance:	Pearson: 0.8689	Spearman: 0.8728
2022-08-11 14:07:13 - Euclidean-Distance:	Pearson: 0.8703	Spearman: 0.8747
2022-08-11 14:07:13 - Dot-Product-Similarity:	Pearson: 0.8510	Spearman: 0.8510
2022-08-11 14:07:13 - Load pretrained SentenceTransformer: output/training_stsbenchmark_continue_training-nli-distilroberta-base-v2-2022-08-11_14-05-29
2022-08-11 14:07:14 - Use pytorch device: cuda
2022-08-11 14:07:14 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-08-11 14:07:16 - Cosine-Similarity :	Pearson: 0.8631	Spearman: 0.8629
2022-08-11 14:07:16 - Manhattan-Distance:	Pearson: 0.8369	Spearman: 0.8383
2022-08-11 14:07:16 - Euclidean-Distance:	Pearson: 0.8400	Spearman: 0.8415
2022-08-11 14:07:16 - Dot-Product-Similarity:	Pearson: 0.8228	Spearman:

0.862920923634329
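
The `Warmup-steps: 144` entry in the log follows from the formula in the cell: 360 batches per epoch (the `Iteration` total) times 4 epochs (the `Epoch` total) times 0.1. The sketch below recomputes that number and illustrates a linear warm-up followed by a linear decay, which is the schedule `fit()` applies when its `scheduler` argument is left at `'WarmupLinear'`. The peak learning rate of 2e-5 is assumed to be the library default, and `linear_warmup_lr` is a hypothetical helper, not library code.

```python
import math

steps_per_epoch = 360   # "Iteration" total in the log above
num_epochs = 4          # "Epoch" total in the log above
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * 0.1)

def linear_warmup_lr(step, peak_lr=2e-5):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))

print("warmup_steps:", warmup_steps)
for step in (0, 72, 144, 792, 1440):
    print(step, linear_warmup_lr(step))
```

The rate starts at zero, reaches its peak at step 144, and falls back to zero at the last training step.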

## 2.3. Multitask Training
- https://www.sbert.net/docs/training/overview.html
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py
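
The cell below builds the model from a `models.Transformer` module followed by `models.Pooling` with `pooling_mode_mean_tokens=True`, so the sentence vector is the average of the token embeddings, with padding positions excluded through the attention mask. A plain-Python sketch of that averaging, using made-up 2-D token vectors and a hypothetical `mean_pool` helper:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; skip padding tokens."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens followed by one padding token (illustrative values only).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_embeddings, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the padding vector is masked out, sentences of different lengths map to vectors of the same fixed size without the padding skewing the average.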

In [None]:
"""
This is an example of how to train SentenceTransformers in a multi-task setup.
The system trains BERT on the AllNLI and STSbenchmark datasets.
"""
from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses
from sentence_transformers import LoggingHandler, SentenceTransformer, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import InputExample
import logging
from datetime import datetime
import gzip
import csv
import os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

# Model and training configuration
model_name = 'bert-base-uncased'
batch_size = 16
model_save_path = 'output/training_multi-task_'+model_name+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")


# Check whether the datasets exist. If not, download them
nli_dataset_path = 'datasets/AllNLI.tsv.gz'
sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'

if not os.path.exists(nli_dataset_path):
    util.http_get('https://sbert.net/datasets/AllNLI.tsv.gz', nli_dataset_path)

if not os.path.exists(sts_dataset_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)



# Use BERT for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


# Convert the dataset to a DataLoader ready for training
logging.info("Read AllNLI train dataset")
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}
train_nli_samples = []
with gzip.open(nli_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if row['split'] == 'train':
            label_id = label2int[row['label']]
            train_nli_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=label_id))


train_dataloader_nli = DataLoader(train_nli_samples, shuffle=True, batch_size=batch_size)
train_loss_nli = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=len(label2int))

logging.info("Read STSbenchmark train dataset")
train_sts_samples = []
dev_sts_samples = []
test_sts_samples = []
with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)

        if row['split'] == 'dev':
            dev_sts_samples.append(inp_example)
        elif row['split'] == 'test':
            test_sts_samples.append(inp_example)
        else:
            train_sts_samples.append(inp_example)


train_dataloader_sts = DataLoader(train_sts_samples, shuffle=True, batch_size=batch_size)
train_loss_sts = losses.CosineSimilarityLoss(model=model)


logging.info("Read STSbenchmark dev dataset")
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_sts_samples, name='sts-dev')

# Configure the training
num_epochs = 4  # Matches the logged run below (4 epochs, 144 warm-up steps)

warmup_steps = math.ceil(len(train_dataloader_sts) * num_epochs * 0.1)  # 10% of the training steps are used for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))


# Here we define the two train objectives: train_dataloader_nli with train_loss_nli (i.e., SoftmaxLoss for NLI data)
# and train_dataloader_sts with train_loss_sts (i.e., CosineSimilarityLoss for STSbenchmark data)
# You can pass as many (dataloader, loss) tuples as you like. They are iterated in a round-robin way.
train_objectives = [(train_dataloader_nli, train_loss_nli), (train_dataloader_sts, train_loss_sts)]

# Train the model
model.fit(train_objectives=train_objectives,
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path
          )



##############################################################################
#
# Load the stored model and evaluate its performance on STS benchmark dataset
#
##############################################################################

model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_sts_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



2022-08-11 14:07:28 - Use pytorch device: cuda
2022-08-11 14:07:28 - Read AllNLI train dataset
2022-08-11 14:07:37 - Softmax loss: #Vectors concatenated: 3
2022-08-11 14:07:37 - Read STSbenchmark train dataset
2022-08-11 14:07:37 - Read STSbenchmark dev dataset
2022-08-11 14:07:37 - Warmup-steps: 144


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/360 [00:00<?, ?it/s]

2022-08-11 14:08:56 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 0:
2022-08-11 14:08:59 - Cosine-Similarity :	Pearson: 0.8356	Spearman: 0.8376
2022-08-11 14:08:59 - Manhattan-Distance:	Pearson: 0.7746	Spearman: 0.7817
2022-08-11 14:08:59 - Euclidean-Distance:	Pearson: 0.7749	Spearman: 0.7823
2022-08-11 14:08:59 - Dot-Product-Similarity:	Pearson: 0.7151	Spearman: 0.7318
2022-08-11 14:08:59 - Save model to output/training_multi-task_bert-base-uncased-2022-08-11_14-07-16
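
The log shows 360 iterations per epoch even though AllNLI is far larger than STSbenchmark. This is consistent with `fit()` defaulting `steps_per_epoch` to the length of the shortest dataloader and drawing one batch from every objective at each step. A toy sketch of that round-robin loop, using plain lists in place of dataloaders (`round_robin_epoch` and the batch names are hypothetical):

```python
def round_robin_epoch(loaders, iterators):
    """Run one epoch: each step draws one batch from every objective in turn."""
    steps_per_epoch = min(len(loader) for loader in loaders)
    schedule = []
    for _ in range(steps_per_epoch):
        for task_id, loader in enumerate(loaders):
            try:
                batch = next(iterators[task_id])
            except StopIteration:
                # Restart an exhausted loader so it keeps supplying batches
                iterators[task_id] = iter(loader)
                batch = next(iterators[task_id])
            schedule.append(batch)
    return schedule

nli_batches = [f"nli-{i}" for i in range(5)]   # stands in for the large dataset
sts_batches = [f"sts-{i}" for i in range(2)]   # stands in for the small dataset
loaders = [nli_batches, sts_batches]
iterators = [iter(loader) for loader in loaders]  # kept alive across epochs

epoch_1 = round_robin_epoch(loaders, iterators)
epoch_2 = round_robin_epoch(loaders, iterators)
print(epoch_1)
print(epoch_2)
```

In this toy run each epoch consumes every batch of the small dataset but only part of the large one, so the large dataset advances a little further with each epoch.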



# 3. Kernel Restart


- Details on restarting the kernel are available at the link below. Open it and see "3.커널 리스타팅" (kernel restarting) at the very bottom.
    - [Restart details](https://github.com/gonsoomoon-ml/NLP-HuggingFace-On-SageMaker/blob/main/1_NSMC-Classification/2_WarmingUp/0.1.warming_up_yelp_review.ipynb)

In [None]:
import IPython

# Shut down the running kernel and restart it (restart=True)
IPython.Application.instance().kernel.do_shutdown(True)