# Text Match

This notebook covers:

    I. Presentation of the text similarity problem
    II. Model
    III. Realization
    IV. Improvement

## I. Presentation

### 1. definition

Text matching deals with relationships between texts, such as similarity or matching.
It can be applied to dialogue, text search, question answering, similarity scoring, etc.
Multiple-choice tasks can also be framed as text matching.

In this notebook, we concentrate on matching two texts.
When we talk about similarity here, we mean semantic similarity.

Example:

| text1  |  text2 | similar |
|----------|:-------|:-------:|
| the cat ate the mouse | the mouse ate the cat food | 0 |
| I like the product | It is not a bad product | 1 |
| The film came out yesterday| The ticket is expensive | 0 |
| He will come tomorrow no matter what | Nothing will stop him from coming tomorrow | 1 |

### 2. data processing

    ___________________________________________________________
    |CLS|       sentence1       |SEP|      sentence2      |SEP|
    -----------------------------------------------------------
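
As a rough illustration of the layout above, the sketch below builds the `[CLS] sentence1 [SEP] sentence2 [SEP]` sequence by hand, together with the segment ids that tell BERT which tokens belong to which sentence. The whitespace split and the `build_pair` helper are assumptions made for illustration only; the real tokenizer uses WordPiece and returns integer ids.

```python
def build_pair(sentence1, sentence2):
    """Toy version of a BERT sentence-pair encoding (whitespace tokens, not WordPiece)."""
    toks1 = sentence1.lower().split()
    toks2 = sentence2.lower().split()

    tokens = ["[CLS]"] + toks1 + ["[SEP]"] + toks2 + ["[SEP]"]
    # segment 0 covers [CLS] + sentence1 + first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(toks1) + 2) + [1] * (len(toks2) + 1)
    # no padding here, so every position is attended to
    attention_mask = [1] * len(tokens)

    return {"tokens": tokens, "token_type_ids": token_type_ids, "attention_mask": attention_mask}


pair = build_pair("the cat ate the mouse", "the mouse ate the cat food")
print(pair["tokens"])
print(pair["token_type_ids"])
```

The same pattern shows up in the real tokenizer output later in the notebook: `token_type_ids` switches from 0 to 1 right after the first `[SEP]`.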

## II. Model

The model is the same as the classification model used in basics, but with an output of dimension 1.

Since the output dimension is 1, the prediction is score-based: we consider a pair similar when the output score is > 0.5, and not similar otherwise.

In the model code:

```python
if self.num_labels == 1:
    self.config.problem_type = "regression"
```

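Since our `num_labels` is 1, the problem type of the model becomes regression. This can be checked with `model.config.problem_type` after downloading the model. Note that the value may still be `None` right after loading, because the assignment above happens inside the forward pass, the first time labels are provided.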

```python
if self.config.problem_type == "regression":
    loss_fct = MSELoss()
    if self.num_labels == 1:
        loss = loss_fct(logits.squeeze(), labels.squeeze())
```

The loss function is MSELoss.
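
To make the regression objective concrete, here is a small hand computation of the mean squared error between a batch of scores and 0/1 labels, which is what `MSELoss` computes with its default mean reduction. The scores and labels are made-up values, and `mse` is a hypothetical helper written in plain Python rather than PyTorch.

```python
def mse(scores, labels):
    """Mean of squared differences, as MSELoss does with the default 'mean' reduction."""
    assert len(scores) == len(labels)
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


scores = [0.9, 0.2, 0.6, 0.1]   # made-up model outputs (one scalar per text pair)
labels = [1.0, 0.0, 0.0, 1.0]   # 1.0 = similar, 0.0 = not similar

loss = mse(scores, labels)
print(round(loss, 4))
```

Training pushes the score towards 1.0 for similar pairs and towards 0.0 for dissimilar ones, which is why a 0.5 threshold is a natural decision rule.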

## III. Realization

In [1]:
# Select which GPU to use.
# I have 2 GPUs and only want to use one, so I need to run this.
# This cell must be run first.
# Skip it if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
## define repos for data and model

# data

ckp_data = "PiC/phrase_similarity"

# model

ckp = "google-bert/bert-base-uncased"

### 1. import

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer



### 2. load dataset

In [4]:
data = load_dataset(ckp_data)
data

DatasetDict({
    train: Dataset({
        features: ['phrase1', 'phrase2', 'sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 7004
    })
    validation: Dataset({
        features: ['phrase1', 'phrase2', 'sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['phrase1', 'phrase2', 'sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2000
    })
})

In [6]:
# show one example

data["train"][0]

{'phrase1': 'newly formed camp',
 'phrase2': 'recently made encampment',
 'sentence1': 'newly formed camp is released from the membrane and diffuses across the intracellular space where it serves to activate pka.',
 'sentence2': 'recently made encampment is released from the membrane and diffuses across the intracellular space where it serves to activate pka.',
 'label': 0,
 'idx': 0}

### 3. Split data

Skipped: the dataset already comes with train, validation and test splits.

### 4. tokenization

In [5]:
# load tokenizer

tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [8]:
# data processing function

def process(samples):

    toks = tokenizer(samples["sentence1"], samples["sentence2"], max_length=128, truncation=True)
    toks["labels"] = [float(l) for l in samples["label"]]

    return toks

In [9]:
# tokenize

tokenized_data = data.map(process, batched=True, remove_columns=data["train"].column_names)
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7004
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [10]:
# show some data

tokenized_data["train"][2]

{'input_ids': [101, 3602, 2008, 2755, 1015, 2515, 2025, 7868, 2151, 3327, 3252, 2006, 1996, 2275, 5675, 1035, 3515, 1012, 102, 3602, 2008, 2755, 1015, 2515, 2025, 7868, 2151, 3563, 3968, 23664, 2006, 1996, 2275, 5675, 1035, 3515, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': 0.0}

### 5. load model

In [11]:
# load model

model = AutoModelForSequenceClassification.from_pretrained(ckp, num_labels=1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
model.config.problem_type

In [13]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

### 6. define metric

In [14]:
import evaluate

acc_func = evaluate.load("accuracy")
f1_func = evaluate.load("f1")

In [15]:
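def metric(pred):

    # pred unpacks into (predictions, labels); predictions has shape [N, 1]
    preds, refs = pred
    # flatten to shape [N] so each p is a scalar, then threshold the score at 0.5
    preds = [int(p > 0.5) for p in preds.reshape(-1)]
    refs = [int(l > 0.5) for l in refs]
    
    acc = acc_func.compute(predictions=preds, references=refs)
    f1 = f1_func.compute(predictions=preds, references=refs)

    # merge both results into a single dict: {"accuracy": ..., "f1": ...}
    acc.update(f1)

    return acc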

### 7. train args

In [16]:
args = TrainingArguments(
        output_dir="../tmp/checkpoints",
        per_device_train_batch_size=64,
        per_device_eval_batch_size=128,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_steps=10,
        learning_rate=2e-5,
        weight_decay=0.01,
        metric_for_best_model="f1",
        load_best_model_at_end=True,
)



### 8. trainer

In [17]:
from transformers import DataCollatorWithPadding

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],  # per-epoch evaluation uses the test split here; the validation split is unused
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=metric
)

### 9. train + eval

In [18]:
trainer.evaluate(eval_dataset=tokenized_data["test"])


{'eval_loss': 0.3240305185317993,
 'eval_accuracy': 0.5,
 'eval_f1': 0.0,
 'eval_runtime': 3.3586,
 'eval_samples_per_second': 595.485,
 'eval_steps_per_second': 4.764}

In [27]:

trainer.train()

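| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|:-----:|:-------------:|:---------------:|:--------:|:--:|
| 1 | 0.1593 | 0.200395 | 0.7165 | 0.729097 |
| 2 | 0.1001 | 0.227135 | 0.705 | 0.7158 |
| 3 | 0.0537 | 0.240205 | 0.7025 | 0.718143 |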



TrainOutput(global_step=330, training_loss=0.10457922545346347, metrics={'train_runtime': 100.8683, 'train_samples_per_second': 208.311, 'train_steps_per_second': 3.272, 'total_flos': 919141441454256.0, 'train_loss': 0.10457922545346347, 'epoch': 3.0})

In [20]:
trainer.evaluate(eval_dataset=tokenized_data["test"])


{'eval_loss': 0.20629166066646576,
 'eval_accuracy': 0.7085,
 'eval_f1': 0.7329363261566652,
 'eval_runtime': 2.621,
 'eval_samples_per_second': 763.073,
 'eval_steps_per_second': 6.105,
 'epoch': 3.0}

### 10. inference

In [38]:
from transformers import pipeline

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

In [59]:
%%time
# the label returned by the pipeline is meaningless here, since it is always LABEL_0
# so we use the score, which is the raw scalar output of the model

out = pipe({"text": "This place is not what I want.", "text_pair": "The weather is hot."}, function_to_apply="none")
print(out)
pred = "similar" if out["score"] > 0.5 else "not similar"
pred

{'label': 'LABEL_0', 'score': 0.1934397965669632}
CPU times: user 36.6 ms, sys: 22.3 ms, total: 58.9 ms
Wall time: 53.4 ms


'not similar'

In [56]:
model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}

In [40]:
subset = "test"
for i in range(0,20):
    print("ref: ", data[subset][i]["label"])
    out = pipe({"text": data[subset][i]["sentence1"], "text_pair": data[subset][i]["sentence2"]}, function_to_apply="none")
    print("pred: ", out["score"])

ref:  0
pred:  0.008944655768573284
ref:  0
pred:  0.11162159591913223
ref:  1
pred:  0.2571737766265869
ref:  0
pred:  0.3957631289958954
ref:  0
pred:  0.14410561323165894
ref:  0
pred:  0.08695762604475021
ref:  0
pred:  0.023133572190999985
ref:  1
pred:  0.8889474272727966
ref:  1
pred:  0.0772826075553894
ref:  0
pred:  0.32186418771743774
ref:  0
pred:  0.06406279653310776
ref:  0
pred:  0.035871099680662155
ref:  0
pred:  0.8888132572174072
ref:  0
pred:  0.14307324588298798
ref:  0
pred:  0.1918620616197586
ref:  1
pred:  0.8433682918548584
ref:  1
pred:  0.5614240765571594
ref:  0
pred:  0.06535700708627701
ref:  0
pred:  0.28326693177223206
ref:  0
pred:  0.9595742225646973
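
As a sanity check, the 20 reference/score pairs printed above can be scored by hand with the same 0.5 threshold used in the metric function. The sketch below is plain Python (the `evaluate` library is not needed here); the scores are copied from the output above and rounded to four decimals.

```python
refs = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
scores = [0.0089, 0.1116, 0.2572, 0.3958, 0.1441, 0.0870, 0.0231, 0.8889, 0.0773, 0.3219,
          0.0641, 0.0359, 0.8888, 0.1431, 0.1919, 0.8434, 0.5614, 0.0654, 0.2833, 0.9596]

# same decision rule as in the metric function
preds = [int(s > 0.5) for s in scores]

tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))  # true positives
fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))  # false positives
fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))  # false negatives

accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
```

On this small sample the accuracy is 0.80 and the F1 is 0.60; 20 pairs are too few to compare meaningfully with the full test-set numbers.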


## IV. Improvement

### A. Problem

If we have many texts to match, the technique above (take one pair of texts and test their similarity) becomes inefficient.

E.g. suppose one text has to be matched against a database of 1M texts. With the HF pipeline, one inference takes about 50 ms, so the full scan takes 1M * 50 ms ≈ 50,000 s ≈ 14 h.
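
A quick back-of-the-envelope check of this estimate: the 50 ms figure comes from the wall time measured in the inference cell, and the variable names are only illustrative.

```python
n_candidates = 1_000_000   # size of the text database
latency_s = 0.050          # about 50 ms per pair with the HF pipeline

total_s = n_candidates * latency_s
total_h = total_s / 3600

print(f"{total_s:.0f} s = {total_h:.1f} h")
```

This assumes one pair per call with no batching, which is how the pipeline was used above.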

### B. Solution

We can use another strategy for this kind of problem: embed the candidate texts once and store the embeddings. When a new text has to be matched, we embed it and simply look for the closest candidate.
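
A minimal sketch of this idea, assuming the candidate embeddings have already been computed and stored. The 3-dimensional vectors, the example texts and the `closest` helper are all made up for illustration; real BERT embeddings have 768 dimensions and a large index would normally use a vectorised or approximate search rather than a Python loop.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(query_vec, index):
    """Return (text, similarity) for the stored candidate most similar to query_vec."""
    scored = ((text, cosine(query_vec, vec)) for text, vec in index.items())
    return max(scored, key=lambda item: item[1])


# Made-up embeddings standing in for real sentence vectors, computed once ahead of time.
index = {
    "how do I reset my password": [0.9, 0.1, 0.0],
    "what is the weather today": [0.0, 0.2, 0.9],
    "opening hours of the store": [0.1, 0.9, 0.1],
}

query_vec = [0.8, 0.2, 0.1]   # hypothetical embedding of the incoming text
best_text, best_score = closest(query_vec, index)
print(best_text, round(best_score, 3))
```

The expensive model call happens once per text instead of once per pair; matching a new text then only costs one embedding plus cheap vector comparisons.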

### C. HF solution

Steps 1 to 3 are unchanged.

### 4. tokenization

In [28]:
def process(samples):

    sentences = []
    labels = []

    for ind in range(len(samples["label"])):
        sentences.append(samples["sentence1"][ind])
        sentences.append(samples["sentence2"][ind])
        labels.append(1.0 if samples["label"][ind] == 1 else -1.0)  # the cosine loss expects targets of 1 (similar) or -1 (not similar)

    toks = tokenizer(sentences, max_length=128, truncation=True, padding="max_length")
    toks = {k: [v[i:i+2] for i in range(0, len(v), 2)] for k, v in toks.items()}
    toks["labels"] = labels

    return toks

tokenized_data = data.map(process, batched=True, remove_columns=data["train"].column_names)
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7004
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
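
The regrouping step inside `process` is easy to misread, so here is the same slicing applied to a toy list, together with the label remapping. The strings below are placeholders for tokenized sentences and are not taken from the dataset.

```python
# Toy stand-in for one tokenizer field: sentences were appended as s1, s2, s1, s2, ...
flat = ["a1", "a2", "b1", "b2", "c1", "c2"]

# Group consecutive items two by two, so each row holds [sentence1, sentence2]
pairs = [flat[i:i + 2] for i in range(0, len(flat), 2)]
print(pairs)   # [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]

# Labels are remapped for the cosine loss: 1 stays 1.0, 0 becomes -1.0
raw_labels = [0, 1, 0]
labels = [1.0 if y == 1 else -1.0 for y in raw_labels]
print(labels)  # [-1.0, 1.0, -1.0]
```

After batching, each field therefore has shape `[batch, 2, seq_len]`, which is why padding to a fixed `max_length` is used in this version of `process`.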

In [29]:
# show data

print(tokenized_data["train"][0])

{'input_ids': [[101, 4397, 2719, 3409, 2003, 2207, 2013, 1996, 10804, 1998, 28105, 2015, 2408, 1996, 26721, 16882, 2686, 2073, 2009, 4240, 2000, 20544, 1052, 2912, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 3728, 2081, 4372, 26468, 3672, 2003, 2207, 2013, 1996, 10804, 1998, 28105, 2015, 2408, 1996, 26721, 16882, 2686, 2073, 2009, 4240, 2000, 20544, 1052, 2912, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0,

### 5. model

In this step, instead of only loading a model, we have to define a custom one.

In [13]:
from transformers import BertForSequenceClassification, BertPreTrainedModel, BertModel
from typing import Optional
import torch
from torch.nn import CosineSimilarity, CosineEmbeddingLoss

# We define our own model since there is no ready-to-use model for this problem:
# no existing model handles our new input shape [batch, 2, seq_len].
# The code below is adapted from BertForSequenceClassification.

class EmbedModel(BertPreTrainedModel):

    def __init__(self, config):

        super().__init__(config)
        self.bert = BertModel(config)
        self.post_init()
        
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # get sentence1 and sentence2
        s1_input_ids, s2_input_ids = input_ids[:, 0], input_ids[:, 1]
        s1_attention_mask, s2_attention_mask = attention_mask[:, 0], attention_mask[:, 1]
        s1_token_type_ids, s2_token_type_ids = token_type_ids[:, 0], token_type_ids[:, 1]


        s1_outputs = self.bert(
            s1_input_ids,
            attention_mask=s1_attention_mask,
            token_type_ids=s1_token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        s1_pooled_output = s1_outputs[1] # [batch, hidden]

        s2_outputs = self.bert(
            s2_input_ids,
            attention_mask=s2_attention_mask,
            token_type_ids=s2_token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        s2_pooled_output = s2_outputs[1] # [batch, hidden]

        # compute similarity of the hiddens
        simi = CosineSimilarity()(s1_pooled_output, s2_pooled_output)

        # loss
        loss = None
        if labels is not None:
            loss_fct = CosineEmbeddingLoss(margin=0.3)
            loss = loss_fct(s1_pooled_output, s2_pooled_output, labels)

        output = (simi,)
        return ((loss,) + output) if loss is not None else output  # the loss must come first in the returned tuple
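
To see what the loss with margin 0.3 rewards, here is a plain-Python sketch of the per-pair cosine embedding loss: `1 - cos` for a target of 1, and `max(0, cos - margin)` for a target of -1. The vectors are made up, and the helper names are hypothetical; PyTorch additionally averages the per-pair values over the batch by default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_embedding_loss(u, v, label, margin=0.3):
    """Per-pair loss: 1 - cos for label +1, max(0, cos - margin) for label -1."""
    cos = cosine(u, v)
    if label == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)


u = [1.0, 0.0]
v = [0.6, 0.8]   # cos(u, v) is 0.6

loss_similar = cosine_embedding_loss(u, v, label=1)      # penalised because cos < 1
loss_dissimilar = cosine_embedding_loss(u, v, label=-1)  # penalised because cos > margin
print(round(loss_similar, 3), round(loss_dissimilar, 3))
```

Dissimilar pairs stop contributing to the loss once their cosine similarity drops below the margin, so the model is not forced to push them all the way to -1.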


In [14]:
# construct model

model = EmbedModel.from_pretrained(ckp)
model

EmbedModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_af

### 6. metric

In [18]:
import evaluate

acc_func = evaluate.load("accuracy")
f1_func = evaluate.load("f1")

def metric(pred):

    preds, refs = pred
    preds = [int(p > 0.7) for p in preds]
    refs = [int(l > 0) for l in refs]  # convert the targets back from {1, -1} to {1, 0}

    acc = acc_func.compute(predictions=preds, references=refs)
    f1 = f1_func.compute(predictions=preds, references=refs)

    acc.update(f1)

    return acc

### 7. train args

In [30]:
# train

args = TrainingArguments(
        output_dir="./checkpoints",
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_steps=10,
        learning_rate=2e-5,
        weight_decay=0.01,
        metric_for_best_model="f1",
        load_best_model_at_end=True,
)



### 8. trainer

In [31]:

trainer = Trainer(
    model = model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    compute_metrics=metric
)

### 9. train + eval

In [32]:
trainer.train()

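| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|:-----:|:-------------:|:---------------:|:--------:|:--:|
| 1 | 0.0948 | 0.276205 | 0.623 | 0.694242 |
| 2 | 0.098 | 0.282983 | 0.637 | 0.692633 |
| 3 | 0.1519 | 0.280566 | 0.642 | 0.698145 |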


TrainOutput(global_step=657, training_loss=0.11748833834127023, metrics={'train_runtime': 241.5772, 'train_samples_per_second': 86.978, 'train_steps_per_second': 2.72, 'total_flos': 2764195109535744.0, 'train_loss': 0.11748833834127023, 'epoch': 3.0})

### 10. inference

In [22]:
class SimilarityPipeline:

    def __init__(self, model, tokenizer) -> None:
        self.model = model.bert
        self.tokenizer = tokenizer
        self.device = model.device

    def preprocess(self, senA, senB):
        return self.tokenizer([senA, senB], max_length=128, truncation=True, return_tensors="pt", padding=True)

    def predict(self, inputs):
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        return self.model(**inputs)[1]  # [2, 768]

    def postprocess(self, logits):
        cos = CosineSimilarity()(logits[None, 0, :], logits[None, 1, :]).squeeze().cpu().item()
        return cos

    def __call__(self, senA, senB, return_vector=False):
        inputs = self.preprocess(senA, senB)
        logits = self.predict(inputs)
        result = self.postprocess(logits)
        if return_vector:
            return result, logits
        else:
            return result

In [33]:
pipe = SimilarityPipeline(model, tokenizer)

In [36]:
pipe(data["test"][0]["sentence1"], data["test"][0]["sentence2"], return_vector=True)

(-0.5793790817260742,
 tensor([[ 0.9772,  0.9492,  1.0000,  ...,  1.0000,  0.9650, -0.9397],
         [-0.9744, -0.5174,  0.7518,  ...,  0.1605, -0.6033,  0.9713]],
        device='cuda:0', grad_fn=<TanhBackward0>))

## Some Doc:
 - https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python