In Sentence Transformers, a "CrossEncoder" is a type of model architecture that is designed to encode and compare pairs of sentences simultaneously. Unlike the "Siamese" architecture, where the two sentences are encoded separately and then their representations are compared, the CrossEncoder processes both sentences together in a single forward pass.

The CrossEncoder architecture is commonly used for tasks that involve comparing pairs of sentences, such as sentence similarity, semantic textual similarity, paraphrase identification, and natural language inference. It's especially useful when you want to calculate a single similarity score or label for the entire sentence pair.

Here's a high-level overview of how a CrossEncoder works:

1. **Input Encoding:** The two sentences in the pair are tokenized and encoded together as a single input. They are processed through a pre-trained transformer-based model, such as BERT, RoBERTa, or other variants, which produces a fixed-length vector representation for the entire sentence pair.

2. **Similarity Calculation:** The encoded sentence pair representation is then passed through one or more fully connected layers or other transformation functions to calculate a similarity score. This score indicates how similar the two sentences are in terms of their semantic meaning, with higher scores indicating higher similarity.

3. **Training:** During training, CrossEncoder models are often fine-tuned using supervised learning on tasks that require sentence pair comparisons. The model is trained with labeled data, where each sentence pair is associated with a similarity score or a binary label (e.g., paraphrase/non-paraphrase).
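
To make the "encoded together as a single input" step more concrete, here is a toy sketch of how two sentences could be joined into one sequence. It assumes whitespace tokenization and BERT-style `[CLS]`/`[SEP]` markers; real models use subword tokenizers and their own special tokens, and `build_pair_input` is a hypothetical helper, not part of Sentence Transformers.

```python
# Toy illustration only: real models use subword tokenizers and
# model-specific special tokens (BERT-style markers are assumed here).

def build_pair_input(sentence_a, sentence_b):
    """Join two sentences into one token sequence, as a cross-encoder sees them."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids mark which sentence each token belongs to (0 = first, 1 = second).
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("A man is eating", "Someone eats food")
print(tokens)
print(segment_ids)
```

Because both sentences sit in the same sequence, attention layers can relate words of one sentence to words of the other, which is what a separately encoded (Siamese) setup cannot do.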

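The CrossEncoder architecture is particularly useful when you need an accurate score or label for a specific sentence pair. Because both sentences are read together, it usually scores pairs more accurately than separately encoded sentences do. The trade-off is speed: every pair needs its own forward pass, so comparing a large number of sentences against each other becomes expensive.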

**What is the NLI task in NLP?**

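Natural Language Inference (NLI) is the task of determining whether a given “hypothesis” logically follows from a “premise”. In layman's terms, you need to decide whether the hypothesis is true when the premise is your only knowledge about the subject. Each pair gets one of three labels: entailment (the hypothesis follows), contradiction (the hypothesis conflicts with the premise), or neutral (the premise does not settle it either way).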

#### MultiNLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

**Examples**

| Premise | Label | Hypothesis |
|---|---|---|
| The Old One always comforted Ca'daan, except today. | neutral | Ca'daan knew the Old One very well. |
| Your gift is appreciated by each and every student who will benefit from your generosity. | neutral | Hundreds of students will benefit from your generosity. |
| yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or | contradiction | August is a black out month for vacations in the company. |
| At the other end of Pennsylvania Avenue, people began to line up for a White House tour. | entailment | People formed a line at the end of Pennsylvania Avenue. |
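
As a small sketch of how such examples look as data, the snippet below stores two of the rows above as (premise, hypothesis, label) tuples and converts the string labels to integer ids. The label order is an assumption here; it matches the `label2int` mapping used later in this notebook.

```python
# Assumed label order; it mirrors the label2int mapping used later in the notebook.
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}

examples = [
    ("The Old One always comforted Ca'daan, except today.",
     "Ca'daan knew the Old One very well.",
     "neutral"),
    ("At the other end of Pennsylvania Avenue, people began to line up for a White House tour.",
     "People formed a line at the end of Pennsylvania Avenue.",
     "entailment"),
]

# Replace each string label with its integer id.
encoded = [(premise, hypothesis, label2int[label]) for premise, hypothesis, label in examples]

for premise, hypothesis, label_id in encoded:
    print(label_id, "|", premise, "->", hypothesis)
```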

#### Fine Tuning SBERT

In [None]:
%pip install sentence-transformers   
%pip install torch torchvision torchaudio

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import torch
torch.__version__

#### Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#### My personal training dataset

In [None]:
!cp /content/my_new_ds_train_eval_test.zip /content/drive/My\ Drive/Colab\ Notebooks/ML\ Project\ 2/my_new_ds_train_eval_test.zip

#### Download NLI Training Dataset

In [None]:
from torch.utils.data import DataLoader
import math

from sentence_transformers import LoggingHandler, util
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CESoftmaxAccuracyEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import os
import gzip
import csv
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data-ques.csv')

In [11]:
df.head(2)

Unnamed: 0,question,text,hypothesis,label
0,What is NLP?,Natural Language Processing (NLP) is a field o...,NLP is a subfield of AI that deals with the pr...,entailment
1,What are the common NLP tasks?,Common NLP tasks include part-of-speech taggin...,NLP tasks mainly focus on analyzing the syntax...,entailment


In [7]:
len(df)

447

In [6]:
df.drop_duplicates(keep=False, inplace=True)
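
One thing worth double-checking in the cell above: with `keep=False`, pandas drops every row that has a duplicate, including the first occurrence, whereas `keep='first'` retains one copy. The sketch below emulates both behaviours on plain tuples (no pandas involved, and the rows are made up) just to show the difference.

```python
from collections import Counter

# Made-up rows; the first and third are identical.
rows = [("q1", "a1"), ("q2", "a2"), ("q1", "a1"), ("q3", "a3")]
counts = Counter(rows)

# Emulates keep=False: every row that occurs more than once is removed entirely.
keep_false = [row for row in rows if counts[row] == 1]

# Emulates keep='first': the first occurrence of each row survives.
seen = set()
keep_first = []
for row in rows:
    if row not in seen:
        seen.add(row)
        keep_first.append(row)

print(keep_false)
print(keep_first)
```

If the goal is only to remove repeated rows while keeping one copy of each, `keep='first'` is probably the option to use.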

In [None]:
# Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
logger = logging.getLogger(__name__)

# As dataset, we use SNLI + MultiNLI
# Check if the dataset exists. If not, download it
nli_dataset_path = 'datasets/AllNLI.tsv.gz'

if not os.path.exists(nli_dataset_path):
    util.http_get('https://sbert.net/datasets/AllNLI.tsv.gz', nli_dataset_path)


# Read the AllNLI.tsv.gz file and create the training dataset
logger.info("Read AllNLI train dataset")

In [None]:
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}

train_samples = []
dev_samples = []

In [None]:
with gzip.open(nli_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        label_id = label2int[row['label']]
        if row['split'] == 'train':
            train_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=label_id))
        else:
            dev_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=label_id))
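
To see what the reading loop above does without downloading anything, here is a minimal sketch that runs the same `csv.DictReader` logic on a tiny in-memory TSV. The three rows are invented and only the four columns the loop actually uses are included; plain tuples stand in for `InputExample`.

```python
import csv
import io

# Invented sample with only the columns the loop reads: split, sentence1, sentence2, label.
sample_tsv = (
    "split\tsentence1\tsentence2\tlabel\n"
    "train\tA man eats.\tSomeone is eating.\tentailment\n"
    "train\tA man eats.\tNobody is eating.\tcontradiction\n"
    "dev\tA dog runs.\tThe dog is brown.\tneutral\n"
)

label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}
train_pairs, dev_pairs = [], []

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    pair = (row["sentence1"], row["sentence2"], label2int[row["label"]])
    # Same rule as in the notebook: anything not marked "train" goes to the dev list.
    if row["split"] == "train":
        train_pairs.append(pair)
    else:
        dev_pairs.append(pair)

print(len(train_pairs), len(dev_pairs))
```

Note that with a plain `else`, rows from any other split present in the file would also land in the dev list.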

In [None]:
# with gzip.open(nli_dataset_path, 'rt', encoding='utf8') as fIn:
#     df_x = pd.read_csv(fIn, sep='\t', names=['sentence1', 'sentence2','label_id'], quoting=csv.QUOTE_NONE)

with gzip.open(nli_dataset_path, 'rt', encoding='utf8') as fIn:
    df_x = pd.read_csv(fIn, sep='\t', quoting=csv.QUOTE_NONE)

In [None]:
df_x.head(1)

In [None]:
df_x.head(1)['sentence1']

In [None]:
df_x.head(1)['sentence2']

In [None]:
train_batch_size = 16
num_epochs = 4
model_save_path = 'output/training_allnli-' + \
    datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [None]:
# Define our CrossEncoder model. We use distilroberta-base as the base model and set it up to predict 3 labels
model = CrossEncoder('distilroberta-base', num_labels=len(label2int))

In [None]:
# We wrap train_samples, which is a list of InputExample, in a PyTorch DataLoader
train_dataloader = DataLoader(
    train_samples, shuffle=True, batch_size=train_batch_size)

# During training, we use CESoftmaxAccuracyEvaluator to measure the accuracy on the dev set.
evaluator = CESoftmaxAccuracyEvaluator.from_input_examples(
    dev_samples, name='AllNLI-dev')


# Use the first 10% of all training steps for warm-up
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
logger.info("Warmup-steps: {}".format(warmup_steps))

**Note: training on the full dataset takes a long time, around 20 hours.**

In [None]:
# Train the model
model.fit(train_dataloader=train_dataloader,
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=10000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)

### Semantic Similarity Between two sentences

In [None]:
from sentence_transformers import CrossEncoder

# a Cross-Encoder already fine-tuned on the STSb dataset
model = CrossEncoder('cross-encoder/stsb-roberta-base')

In [None]:
text1 = """
Gradient descent is an optimization algorithm which is commonly-used to train machine learning models and neural networks. Training data helps these models learn over time, and the cost function within gradient descent specifically acts as a barometer, gauging its accuracy with each iteration of parameter updates.
"""
text2 = """
Gradient descent (GD) is not an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning(DL) to minimise a cost/loss function (e.g. in a linear regression). Due to its importance and ease of implementation, this algorithm is usually taught at the beginning of almost all machine learning courses.
"""

In [None]:
scores = model.predict([(text1,text2)])

In [None]:
scores
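
When several candidate sentences are scored against the same text, the scores can be used to rank them. The sketch below uses hypothetical numbers in place of real `model.predict(...)` output, so no model is loaded; the sentences are made up as well.

```python
# Hypothetical scores standing in for model.predict(...) output; no model is loaded here.
query = "Gradient descent is an optimization algorithm."
candidates = [
    "Paris is the capital of France.",
    "Gradient descent minimises a loss function step by step.",
    "Neural networks are trained with optimizers.",
]
scores = [0.12, 0.87, 0.45]

# Sort candidates from most to least similar to the query.
ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)

for sentence, score in ranked:
    print(f"{score:.2f}  {sentence}")
```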

#### Use already fine-tuned Cross-Encoder model

1. NLI trained Cross Encoder

In [None]:
# pretrained model
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=512)

In [None]:
texts = [
    ("good public policy should make society healthier happier  safer and more productive", "scoial sciencists should be able to help us understand the world better"),
]

In [None]:
texts[0]

In [None]:
scores_n = model.predict(texts)
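
For a three-label NLI model, each row of scores holds one raw value per class. A minimal sketch of turning such a row into probabilities with softmax is shown below. The logits are hypothetical, and the label order is assumed to be contradiction, entailment, neutral; check the model card for the model you actually load.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


nli_labels = ["contradiction", "entailment", "neutral"]  # assumed label order
example_logits = [-2.1, 0.3, 3.4]  # hypothetical scores for one sentence pair

probs = softmax(example_logits)
best = nli_labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

As far as I know, `predict` also accepts `apply_softmax=True`, which would return probabilities directly.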

In [None]:
# the NLI model returns one row of three scores per sentence pair
label_mapping = ["contradiction", "entailment", "neutral"]

# original scores
print(scores_n)

In [None]:
# for easy reading: convert each row of scores to a single label
labels = [label_mapping[score_max] for score_max in scores_n.argmax(axis=1)]
print(labels)

2. STSb trained Cross Encoder

In [None]:
# pretrained model
model = CrossEncoder('cross-encoder/stsb-distilroberta-base')

In [None]:
# the STSb model returns a single similarity score per pair, so it is stored separately from the NLI scores
scores_sts = model.predict(texts)

### Train Cross Encoder

In [None]:
train_batch_size = 16
num_epochs = 4
model_save_path = 'output/training_stsbenchmark_continue_training-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

### Train Cross Encoder Model on Custom DataSet

In [None]:
train_batch_size = 16
num_epochs = 3
model_save_path = 'output/my_training_ds-' + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [None]:
# cross encoder model
my_model = CrossEncoder("distilroberta-base", num_labels=len(label2int), max_length=512)

# distilroberta-base is a smaller version of RoBERTa-base. It has 6 layers, 768 hidden size, 12 attention heads and about 82M parameters.


#### Drop Column

In [None]:
# inplace=True modifies df directly instead of returning a new DataFrame
df.drop('question', axis='columns', inplace=True)

#### String to Number

In [None]:
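# convert the string labels to integer ids using label2int
# (assigning the result back avoids an in-place replace on a column selection, which newer pandas versions warn about)
df['label'] = df['label'].map(label2int)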

#### Splitting the dataset into train and test

In [None]:
# split the data into train_df (80%) and test_df (20%); train_test_split returns two parts per input
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
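
If a separate validation set is wanted in addition to train and test, the data has to be cut into three parts. Below is a minimal standard-library sketch of that idea. `three_way_split` is a hypothetical helper and the 70/10/20 proportions are assumptions, not values from this notebook; with scikit-learn the same effect is usually obtained by calling `train_test_split` twice.

```python
import random


def three_way_split(rows, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle rows and cut them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_fraction)
    n_val = round(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train_rows, val_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```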

#### Adding data as input example

In [None]:
train_samples_my = []
test_samples = []

# for i in range(len(train_df)):
#     train_samples_my.append(InputExample(texts=[train_df.iloc[i]['question'], train_df.iloc[i]['answer']], label=train_df.iloc[i]['label']))

# another way, using list comprehensions instead of an explicit for loop
train_samples_my = [InputExample(texts=[train_df.iloc[i]['text'], train_df.iloc[i]['hypothesis']], label=train_df.iloc[i]['label']) for i in range(len(train_df))]

test_samples = [InputExample(texts=[test_df.iloc[i]['text'], test_df.iloc[i]['hypothesis']], label=test_df.iloc[i]['label']) for i in range(len(test_df))]


In [None]:
train_dataloader_my = DataLoader(train_samples_my, shuffle=True,batch_size=train_batch_size)

In [None]:
evaluator_my = CESoftmaxAccuracyEvaluator.from_input_examples(test_samples, name='my_training_ds_evaluator')
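
As far as I understand it, the accuracy this kind of evaluator reports is the share of pairs whose highest-scoring class equals the gold label. The sketch below reproduces that calculation on hypothetical scores; `softmax_accuracy` is an illustrative helper, not the library's implementation.

```python
def softmax_accuracy(score_rows, gold_labels):
    """Fraction of rows whose arg-max class matches the gold label."""
    predicted = [row.index(max(row)) for row in score_rows]
    correct = sum(pred == gold for pred, gold in zip(predicted, gold_labels))
    return correct / len(gold_labels)


# Hypothetical class scores for four sentence pairs (three classes each).
scores = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [2.2, 0.1, 0.4],
]
gold = [1, 0, 1, 0]

accuracy = softmax_accuracy(scores, gold)
print(accuracy)
```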

In [None]:
warmup_steps = math.ceil(len(train_dataloader_my) * num_epochs * 0.1)  # 10% of all training steps are used for warm-up

# warmup_steps controls the learning rate during the first training steps.
# During warm-up, the learning rate is increased linearly from 0 to the specified learning rate.
# After the warm-up phase, the learning rate decreases again for the rest of training.
# This is a common technique in transfer learning, especially for fine-tuning BERT-style models.

logger.info("Warmup-steps: {}".format(warmup_steps))
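
Here is a worked sketch of the warm-up arithmetic and of a linear warm-up followed by linear decay. The row count (357) and the base learning rate (2e-5) are assumptions for illustration, and `warmup_linear_lr` is a hypothetical helper that mimics this kind of schedule rather than the library's own scheduler code.

```python
import math


def warmup_linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_rows = 357      # assumed number of training rows
batch_size = 16
num_epochs = 3
base_lr = 2e-5      # assumed base learning rate

steps_per_epoch = math.ceil(num_rows / batch_size)   # batches per epoch
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * 0.1)          # 10% of all steps

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, warmup_linear_lr(step, base_lr, warmup_steps, total_steps))
```

The rate starts at 0, reaches the base value exactly when warm-up ends, and falls back to 0 at the last step.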

In [None]:
# Train the model
my_model.fit(train_dataloader=train_dataloader_my,
             evaluator=evaluator_my,
             epochs=num_epochs,
             evaluation_steps=2000,
             warmup_steps=warmup_steps,
             save_best_model=True,
             output_path=model_save_path)

# evaluation_steps sets how often (in training steps) the model is evaluated with the evaluator.
# save_best_model=True means that only the best model (highest evaluator score) is saved to disk.

#### Load my fine-tuned cross-encoder model

In [None]:
model_reloaded = CrossEncoder(model_save_path)

In [None]:
# Evaluate the reloaded model on the test set; the result is a single accuracy value.
# The model predicts 3 class labels, so an accuracy evaluator is used here
# (a correlation evaluator expects one continuous score per pair).
evaluator = CESoftmaxAccuracyEvaluator.from_input_examples(test_samples, name='eval_test')
evaluator(model_reloaded, output_path=model_save_path)