# Sentence Pair Classification 

Predict whether two sentences are paraphrases, duplicates, or otherwise semantically similar.



## Huggingface

### Sentence Pair Tokenization (creating input to model)
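How a sentence pair is encoded depends on the model: each tokenizer inserts its own special tokens and separators, so use the tokenizer that matches the checkpoint.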

* [Huggingface Youtube - Preprocessing sentence pairs (PyTorch)](https://www.youtube.com/watch?v=0u3ioSwev3s)
* [Colab for Huggingface Youtube - Preprocessing sentence pairs (PyTorch)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/sentence_pairs_pt.ipynb)
> This notebook regroups the code sample of the video below, which is a part of the [Hugging Face course](https://huggingface.co/course).


### Huggingface Forum Topic for Sentence Pair Classification

* [Use two sentences as inputs for sentence classification](https://discuss.huggingface.co/t/use-two-sentences-as-inputs-for-sentence-classification/5444)

> In BERT, two sentences are provided as follows to the model: ```[CLS] sentence1 [SEP] sentence2 [SEP] [PAD] [PAD] [PAD]```.
> You can prepare them using BertTokenizer, simply by providing two sentences:
> ```
> from transformers import BertTokenizer
> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
> 
> sentence_a = "this is a sentence"
> sentence_b = "this is another sentence"
> 
> encoding = tokenizer(sentence_a, sentence_b, padding="max_length", truncation=True)
> ```
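
To make the quoted layout concrete, here is a minimal pure-Python sketch of what a BERT-style pair encoding contains. It is a simplified illustration, not the `BertTokenizer` implementation: it splits on whitespace instead of using WordPiece, keeps tokens as strings instead of vocabulary ids, and `encode_pair` is a hypothetical helper name. The real tokenizer returns `input_ids`, `token_type_ids`, and `attention_mask`.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Lay out two token lists as [CLS] a [SEP] b [SEP] [PAD]..."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; truncate it first")
    # Segment ids: 0 for [CLS] + sentence 1 + first [SEP], 1 for sentence 2 + last [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


encoding = encode_pair(
    "this is a sentence".split(),
    "this is another sentence".split(),
    max_length=12,
)
for name, values in encoding.items():
    print(name, values)
```

The segment ids are what let the model tell the first sentence from the second, and the attention mask tells it to ignore the padding positions.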

* [Train a Bert Classifier with more than 2 Input Text Columns](https://discuss.huggingface.co/t/train-a-bert-classifier-with-more-than-2-input-text-columns/59895)

> ```
> def tokenize_function(examples):
>     return tokenizer(examples["text1"], examples["text2"])
> ```
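
The tokenizer call takes at most two text inputs (`text` and `text_pair`), so more than two text columns have to be merged into two segments before tokenizing. The sketch below shows one possible workaround; it is an assumption for illustration, not a summary of the forum thread. The helper `to_two_segments`, the column names, and the sample rows are all made up, and the batch is represented as a plain dict of lists.

```python
def to_two_segments(examples, first_columns, second_columns, joiner=" "):
    """Collapse several text columns of a batch into two segments per row."""
    n_rows = len(examples[first_columns[0]])
    first = [joiner.join(examples[c][i] for c in first_columns) for i in range(n_rows)]
    second = [joiner.join(examples[c][i] for c in second_columns) for i in range(n_rows)]
    return first, second


# Hypothetical batch: a dict mapping column name to a list of row values
batch = {
    "title": ["Refund policy", "Shipping times"],
    "body": ["How do I get a refund?", "When will my order arrive?"],
    "candidate": ["Can I return my order?", "Where is my invoice?"],
}

first, second = to_two_segments(batch, ["title", "body"], ["candidate"])
print(first)
print(second)
```

The two resulting lists can then be passed to the tokenizer in place of `examples["text1"]` and `examples["text2"]`. Which columns belong in which segment, and what joiner to use, depends on the task.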

## SageMaker

* [Sentence Pair Classification - HuggingFace](https://sagemaker.readthedocs.io/en/v2.143.0/algorithms/text/sentence_pair_classification_hugging_face.html)

## Sentence Transformer for Sentence Pair Classification 

* [Cross-Encoders](https://www.sbert.net/examples/applications/cross-encoder/README.html)

> SentenceTransformers also supports loading Cross-Encoders for sentence pair scoring and sentence pair classification tasks. Cross-Encoders can be used whenever you have a pre-defined set of sentence pairs you want to score. For example, you have 100 sentence pairs and you want to get similarity scores for these 100 pairs. For a Cross-Encoder, we pass both sentences simultaneously to the Transformer network. It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair:
>   
> <img src="image/Bi_vs_Cross-Encoder.png" align="left" width=500/>
>   
> As detailed in our [paper](https://arxiv.org/abs/1908.10084), Cross-Encoders achieve better performance than Bi-Encoders. However, for many applications they are not practical, as they do not produce embeddings that we could, e.g., index or efficiently compare using cosine similarity.
> 
> ### Cross-Encoders Usage
> 
> ```
> from sentence_transformers.cross_encoder import CrossEncoder
> 
> model = CrossEncoder("model_name_or_path")
> scores = model.predict([["My first", "sentence pair"], ["Second text", "pair"]])
> ```
> You pass a list of sentence pairs to ```model.predict```. Note that Cross-Encoders do not work on individual sentences; you have to pass sentence pairs. As the model name, you can pass any model or path that is compatible with the Huggingface [AutoModel](https://huggingface.co/transformers/model_doc/auto.html) class. For a full example that scores a query against all possible sentences in a corpus, see [cross-encoder_usage.py](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/cross-encoder/cross-encoder_usage.py).
>   
> ### Combining Bi- and Cross-Encoders
> 
> Cross-Encoders achieve higher performance than Bi-Encoders; however, they do not scale well to large datasets. Here, it can make sense to combine Cross- and Bi-Encoders, for example in Information Retrieval / Semantic Search scenarios: first, you use an efficient Bi-Encoder to retrieve, e.g., the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination.
>   
> For more details on combining Bi- and Cross-Encoders, see [Application - Information Retrieval](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
>
> ### Training Cross-Encoders
> 
> See [Cross-Encoder Training](https://www.sbert.net/examples/training/cross-encoder/README.html) for how to train your own Cross-Encoder models.

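The retrieve-then-re-rank idea quoted above can be sketched without any model downloads. In this illustration both scorers are loudly simplified stand-ins and not the sentence-transformers API: `embed` plus `cosine` play the role of a Bi-Encoder (one vector per sentence, computed once), and `pair_score` plays the role of a Cross-Encoder (it needs both texts at the same time). The corpus and query are made up.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a Bi-Encoder: one bag-of-words vector per sentence."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def pair_score(query, sentence):
    """Stand-in for a Cross-Encoder: scores the two texts jointly (Jaccard overlap)."""
    a, b = set(query.lower().split()), set(sentence.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve_and_rerank(query, corpus, corpus_vectors, top_k):
    q = embed(query)
    # Stage 1: cheap comparison against the precomputed corpus vectors
    ranked = sorted(range(len(corpus)), key=lambda i: cosine(q, corpus_vectors[i]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: run the expensive pair scorer only on the top_k (query, hit) pairs
    reranked = sorted(candidates, key=lambda i: pair_score(query, corpus[i]), reverse=True)
    return [(corpus[i], pair_score(query, corpus[i])) for i in reranked]


corpus = [
    "how do i reset my password",
    "how do i change my email address",
    "best pizza in town",
    "reset password link not working",
]
corpus_vectors = [embed(s) for s in corpus]  # computed once, reusable for every query

hits = retrieve_and_rerank("how to reset my password", corpus, corpus_vectors, top_k=2)
for sentence, score in hits:
    print(f"{score:.2f}  {sentence}")
```

The design point is the cost split: stage 1 touches every corpus entry but only compares precomputed vectors, while stage 2 calls the pair scorer just `top_k` times per query.
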
### Sentence Transformer - Cross Encoder Models

* [Huggingface - Sentence Transformers - Cross-Encoders](https://huggingface.co/cross-encoder)

## Fine Tuning Sentence Pair Classification Models

* [GitHub (sukhijapiyush) - Fine-Tuning BERT for Sentence-Pair Classification](https://github.com/sukhijapiyush/Fine-Tune-Bert-for-Sentence-Pair-Classification)
* [Colab (NadirEM/nlp-notebooks) - Fine_tune_ALBERT_sentence_pair_classification.ipynb](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)
* [Kaggle - Quora Question Pairs Competition](https://www.kaggle.com/competitions/quora-question-pairs)
> Can you identify question pairs that have the same intent?

* [Kaggle - Fine tune BERT for Queation-pair classification](https://www.kaggle.com/code/sharanharsoor/fine-tune-bert-for-queation-pair-classification)


# HuggingFace BERT Sentence Pair Classification

```
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification
)
import torch

MODEL_NAME: str = "bert-base-cased-finetuned-mrpc"
```

```
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

phrase_0 = "Machine Learning (ML) makes predictions from data"
phrase_1 = "ML uses data to compute a prediction."

print(f"\nFirst phrase: {phrase_0}")
print(f"\nSecond phrase: {phrase_1}")

# Encode both phrases together as a single sentence-pair input
phrase_tokenized = tokenizer(phrase_0, phrase_1, return_tensors="pt")
```

Output:

```

First phrase: Machine Learning (ML) makes predictions from data

Second phrase: ML uses data to compute a prediction.
```


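```
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**phrase_tokenized).logits
    # Softmax over the class dimension turns the logits into pseudo-probabilities
    probabilities = torch.softmax(logits, dim=1).numpy()

print(f"\nPseudo-probabilities of not-a-para, is-a-para: {probabilities}")
```

Output:

```

Pseudo-probabilities of not-a-para, is-a-para: [[0.05828979 0.94171023]]
```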

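To turn the two pseudo-probabilities into a decision, take the index of the larger one. The sketch below reimplements softmax in plain Python to show what that step computes and then picks the label. The logits are hypothetical values chosen for illustration, not the model's actual outputs, and the label order is taken from the print statement above.

```python
import math


def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["not-a-para", "is-a-para"]  # index 0, index 1

logits = [-1.2, 1.6]  # hypothetical logits, not taken from the model run
probabilities = softmax(logits)
predicted = LABELS[probabilities.index(max(probabilities))]

print(probabilities, predicted)
```

With the output reported above, index 1 holds the larger value, so the pair would be labelled a paraphrase.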