# Loading Dataset

**We use the Multi-Genre Natural Language Inference (MNLI) corpus,
a collection of 392,702 sentence pairs, each annotated with one of three labels:
entailment, neutral, or contradiction. We will use a subset of 50,000 annotated
sentence pairs to create a minimal example that does not need to train for hours
on end.**

In [2]:
from datasets import load_dataset
# Loading MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset(
 "glue", "mnli", split="train"
).select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

In [3]:
print(train_dataset[3])
print(train_dataset[6])

{'premise': 'How do you know? All this is their information again.', 'hypothesis': 'This information belongs to them.', 'label': 0}
{'premise': 'But a few Christian mosaics survive above the apse is the Virgin with the infant Jesus, with the Archangel Gabriel to the right (his companion Michael, to the left, has vanished save for a few feathers from his wings).', 'hypothesis': 'Most of the Christian mosaics were destroyed by Muslims.  ', 'label': 1}
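To make the integer labels in the printed rows easier to read, the mapping noted in the loading cell can be applied to each example. The following is a minimal sketch: `LABEL_NAMES` and `describe` are illustrative names introduced here, not part of the `datasets` API, and the sample row is copied from the output above.

```python
# Label ids as noted in the loading cell: 0 = entailment, 1 = neutral, 2 = contradiction
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def describe(example):
    """Return a copy of an MNLI-style row with a readable label name added."""
    return {**example, "label_name": LABEL_NAMES[example["label"]]}

# Row copied from the printed output of train_dataset[3]
sample = {
    "premise": "How do you know? All this is their information again.",
    "hypothesis": "This information belongs to them.",
    "label": 0,
}

readable = describe(sample)
print(readable["label_name"])  # entailment
```

Assuming `train_dataset` from the loading cell is in scope, `describe(train_dataset[3])` would attach the same readable name to the real row.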


# Training the model 

**We would typically choose an existing sentence-transformers model
and fine-tune it, but in this example we are going to train an embedding
model from scratch.
This means that we have to define two things. First, a pretrained Transformer
model that produces embeddings for individual words (tokens). We will use the BERT
base model (uncased), as it is a good introductory model. Second, a loss function
to optimize the model, which is defined in the next section.**

In [4]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
# Using a base model
embedding_model = SentenceTransformer('bert-base-uncased')
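BERT on its own produces one vector per token, not one per sentence. When `SentenceTransformer` is given a plain Transformer checkpoint such as `bert-base-uncased`, it adds a mean-pooling layer on top by default, averaging the token vectors into a single sentence embedding. The sketch below illustrates only that pooling step in plain Python, using made-up two-dimensional token vectors; `mean_pool` is a hypothetical helper, not a library function.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vector)]
            count += 1
    # Guard against an all-padding input to avoid dividing by zero
    return [t / max(count, 1) for t in total]

# Toy input: three token vectors, the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The padded vector `[9.0, 9.0]` does not affect the result, which is why the attention mask is part of the pooling step.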



# Loss Function to optimize the model

In [5]:
from sentence_transformers import losses
# Softmax loss: a 3-way classifier on top of the sentence embeddings
train_loss = losses.SoftmaxLoss(
 model=embedding_model,
 sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
 num_labels=3,
)
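In its default configuration, `SoftmaxLoss` embeds the premise and the hypothesis separately as u and v, concatenates (u, v, |u - v|), and passes that vector to a linear classifier trained with cross-entropy over the three labels. The sketch below walks through that computation in plain Python. It assumes toy two-dimensional embeddings and an all-zero classifier standing in for an untrained one; the helper names are hypothetical and do not mirror the library's internals.

```python
import math

def build_features(u, v):
    """Concatenate (u, v, |u - v|), the default input to the classifier."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def linear(features, weights, bias):
    """One logit per label: weights has one row per label."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(logits, label):
    """Negative log-softmax of the correct label (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Toy sentence embeddings (made up for illustration)
u = [0.5, -1.0]   # premise
v = [0.0, 1.0]    # hypothesis

features = build_features(u, v)
weights = [[0.0] * len(features) for _ in range(3)]  # untrained: all zeros
bias = [0.0] * 3

loss = cross_entropy(linear(features, weights, bias), label=0)
print(features)        # [0.5, -1.0, 0.0, 1.0, 0.5, 2.0]
print(round(loss, 4))  # 1.0986
```

With uniform predictions over three classes the loss is ln 3 ≈ 1.0986, which gives a useful reference point: a training loss near that value means the classifier is still close to guessing. For BERT base, whose embeddings have 768 dimensions, the concatenated feature vector has 3 × 768 = 2304 entries.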


# Model Evaluation

**We can evaluate the performance of our model using the Semantic
Textual Similarity Benchmark (STSB), a collection of human-labeled sentence
pairs with similarity scores between 0 and 5. We divide each score by 5 so that
the labels fall between 0 and 1.**

In [6]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
# Creating an embedding similarity evaluator for STSB
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
 sentences1=val_sts["sentence1"],
 sentences2=val_sts["sentence2"],
 scores=[score/5 for score in val_sts["label"]],
 main_similarity="cosine",
)
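The evaluator embeds both sentences of each pair, computes their cosine similarity, and reports how well those similarities correlate with the human scores, using both Pearson and Spearman correlation. The following is a minimal sketch of that idea in plain Python. The embedding pairs and gold scores are made up for illustration, the helper names are hypothetical, and the `ranks` function ignores ties, which a full Spearman implementation would handle.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank of each value (0 = smallest); ties are not averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy embeddings for three sentence pairs (made up)
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [score / 5 for score in [5.0, 2.5, 0.0]]  # normalized as in the cell above

predicted = [cosine(a, b) for a, b in embedding_pairs]
print(spearman(predicted, gold))  # 1.0
print(pearson(predicted, gold) > 0.9)  # True
```

Spearman only compares the ordering of the pairs, so it reaches 1.0 here even though the cosine values do not match the gold scores exactly, while Pearson is slightly lower because the relationship is not perfectly linear.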

# Defining the training arguments

**Now that we have our evaluator, we create SentenceTransformerTrainingArguments,
similar to training with Hugging Face Transformers.**

In [7]:
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

In [8]:
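args = SentenceTransformerTrainingArguments(
 output_dir="base_embedding_model",  # where checkpoints are written
 num_train_epochs=1,                 # one pass over the 50,000 pairs
 per_device_train_batch_size=32,
 per_device_eval_batch_size=32,
 warmup_steps=100,                   # learning-rate warmup steps
 fp16=True,                          # mixed-precision training
 eval_steps=100,                     # only takes effect when eval_strategy="steps"
 logging_steps=100,                  # log the training loss every 100 steps
)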

# Training the embedding model

In [9]:
from sentence_transformers.trainer import SentenceTransformerTrainer
# Training
trainer = SentenceTransformerTrainer(
 model=embedding_model,
 args=args,
 train_dataset=train_dataset,
 loss=train_loss,
 evaluator=evaluator
)
trainer.train()

Step,Training Loss
100,1.0798
200,0.934
300,0.8773
400,0.841
500,0.8279
600,0.8236
700,0.8002
800,0.789
900,0.7748
1000,0.7718


TrainOutput(global_step=1563, training_loss=0.8113608772146038, metrics={'train_runtime': 539.8233, 'train_samples_per_second': 92.623, 'train_steps_per_second': 2.895, 'total_flos': 0.0, 'train_loss': 0.8113608772146038, 'epoch': 1.0})
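The `global_step=1563` in the output follows from the dataset size and the batch size. A short check of that arithmetic, assuming a single device and no gradient accumulation (neither is stated in the run above):

```python
import math

num_examples = 50_000     # rows selected from MNLI
batch_size = 32           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
warmup_steps = 100        # warmup_steps
train_runtime = 539.8233  # seconds, from TrainOutput

# The last, partially filled batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps
samples_per_second = num_examples * num_epochs / train_runtime

print(total_steps)                   # 1563
print(round(warmup_fraction, 3))     # 0.064
print(round(samples_per_second, 3))  # 92.623
```

Under these assumptions the warmup covers roughly the first 6% of training, and the computed throughput agrees with the reported `train_samples_per_second`.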

# Evaluating Results

In [14]:
# Evaluating our trained model
result = evaluator(embedding_model)
result

{'pearson_cosine': 0.5345397045887939,
 'spearman_cosine': 0.6082469138789448,
 'pearson_manhattan': 0.5801962729306847,
 'spearman_manhattan': 0.6068780695191375,
 'pearson_euclidean': 0.571048137696305,
 'spearman_euclidean': 0.6029259030941507,
 'pearson_dot': 0.5044776644612751,
 'spearman_dot': 0.5443529748240566,
 'pearson_max': 0.5801962729306847,
 'spearman_max': 0.6082469138789448}

In [16]:
print(result['pearson_cosine'])

0.5345397045887939
