#### Evaluating teacher and student models before distillation training

In [1]:
!uv pip install -qU sentence-transformers
!uv pip install -q transformers
!uv pip install -q datasets 
!uv pip install -q ipywidgets
!uv pip install -q pandas 
!uv pip install -q 'accelerate>=0.26.0'

In [2]:
from datasets import load_dataset

eval_dataset = load_dataset('mteb/sts17-crosslingual-sts', 'en-en', split='test')
eval_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'score', 'lang'],
    num_rows: 250
})

In [3]:
eval_dataset.set_format(type='pandas')
eval_df = eval_dataset[:]
eval_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sentence1  250 non-null    object 
 1   sentence2  250 non-null    object 
 2   score      250 non-null    float64
 3   lang       250 non-null    object 
dtypes: float64(1), object(3)
memory usage: 7.9+ KB


In [4]:
eval_df.head() 

|   | sentence1 | sentence2 | score | lang |
| - | --------- | --------- | ----- | ----- |
| 0 | A person is on a baseball team. | A person is playing basketball on a team. | 2.4 | en-en |
| 1 | Our current vehicles will be in museums when e... | The car needs to some work | 0.2 | en-en |
| 2 | A woman supervisor is instructing the male wor... | A woman is working as a nurse. | 1.0 | en-en |
| 3 | A bike is next to a couple women. | A child next to a bike. | 2.0 | en-en |
| 4 | The group is eating while taking in a breathta... | A group of people take a look at an unusual tree. | 2.2 | en-en |


#### Student model 

In [5]:
import torch 
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling 

student_model_id = 'FacebookAI/xlm-roberta-base'
transformer_module = Transformer(student_model_id, model_args=dict(torch_dtype=torch.float16))
pooling_module = Pooling(
    word_embedding_dimension=transformer_module.get_word_embedding_dimension(),
    pooling_mode_cls_token=False,
    pooling_mode_mean_tokens=True
)
student_model = SentenceTransformer(modules=[transformer_module, pooling_module])
student_model.to('cuda')
student_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [6]:
student_model[0].auto_model.dtype 

torch.float16

#### Teacher model 

In [7]:
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling

teacher_model_id = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
transformer_module = Transformer(teacher_model_id, model_args=dict(torch_dtype=torch.float16))
pooling_module = Pooling(
    word_embedding_dimension=transformer_module.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False
)
teacher_model = SentenceTransformer(modules=[transformer_module, pooling_module])
teacher_model.to('cuda')
teacher_model 

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [8]:
teacher_model[0].auto_model.dtype 

torch.float16

#### Defining evaluator 

In [9]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

embed_sim_eval = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset['sentence1'],
    sentences2=eval_dataset['sentence2'],
    scores=[score / 5.0 for score in eval_dataset['score']],
    name='en-en-test',
    show_progress_bar=False
)

In [10]:
model = teacher_model  # set to student_model to evaluate the student instead

In [11]:
from sentence_transformers import SentenceTransformerTrainer 

trainer = SentenceTransformerTrainer(
    model=model,
    evaluator=embed_sim_eval
) 
trainer.evaluate()

{'eval_model_preparation_time': 0.0019,
 'eval_en-en-test_pearson_cosine': 0.7372142557329914,
 'eval_en-en-test_spearman_cosine': 0.7581291730279394,
 'eval_runtime': 0.5112,
 'eval_samples_per_second': 0.0,
 'eval_steps_per_second': 0.0}
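
The `spearman_cosine` value above is a rank correlation between the cosine similarity of each sentence pair's embeddings and the gold score. The block below is a minimal dependency-free sketch of that computation, not the evaluator's actual implementation; the embeddings and gold scores are hypothetical toy values, and the helper names (`cosine`, `average_ranks`, `spearman`) are chosen only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 2-d embeddings for three sentence pairs (not model output).
emb1 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
emb2 = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
gold = [4.8, 2.5, 0.2]  # hypothetical gold scores on the 0-5 scale

predicted = [cosine(u, v) for u, v in zip(emb1, emb2)]
rho = spearman(predicted, [g / 5.0 for g in gold])
print(rho)
```

Because Spearman correlation depends only on ranks, dividing the gold scores by 5.0 does not change it; the rescaling only puts the gold scores on a 0 to 1 range.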

#### Results 

Performance is measured as the Spearman correlation (×100) between the cosine similarity of the two sentence embeddings and the gold score, for each model configuration.

| Model                                              | En - En | En - Ua | Ua - Ua |
| -------------------------------------------------- | ------- | ------- | ------- |
| XLM-RoBERTa (mean pooling, float32)                | 52.2    | -       | -       |
| XLM-RoBERTa (mean pooling, float16)                | 52.2    | -       | -       |
| XLM-RoBERTa (cls token, float32)                   | 5.8     | -       | -       |
| multi-qa-mpnet-base-dot-v1 (cls token, float32)    | 76.8    | -       | -       |
| multi-qa-mpnet-base-dot-v1 (cls token, float16)    | 76.7    | -       | -       |
| multi-qa-mpnet-base-dot-v1 (mean pooling, float32) | 76.0    | -       | -       |
 

#### Conclusions

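Loading the models in half precision (`float16`) has practically no effect on the results: XLM-RoBERTa stays at 52.2, and `multi-qa-mpnet-base-dot-v1` moves from 76.8 to 76.7.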
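
As rough intuition for why half precision changes so little, the sketch below rounds two vectors to IEEE 754 half precision and compares the cosine similarity before and after. It uses only the standard library (`struct` format `'e'`); the vector values are hypothetical, and the sketch rounds only the final vectors, whereas a real `float16` forward pass accumulates rounding error in every layer.

```python
import math
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embedding values, chosen only for illustration.
u = [0.1234567, -0.7654321, 0.3333333, 0.9876543]
v = [0.2345678, -0.6543210, 0.4444444, 0.8765432]

full = cosine(u, v)
half = cosine([to_float16(a) for a in u], [to_float16(b) for b in v])
difference = abs(full - half)
print(full, half, difference)
```

Half precision keeps about three significant decimal digits per component, so the similarity shifts only slightly and the ranking of sentence pairs, which is all Spearman correlation looks at, is mostly preserved.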

`multi-qa-mpnet-base-dot-v1` (a monolingual model) clearly outperforms XLM-RoBERTa on the `en-en` test pairs of the STS17 (Semantic Textual Similarity) dataset: 76.8 vs. 52.2, each with its initial pooling strategy.

Switching XLM-RoBERTa from mean pooling (its initial configuration) to the CLS token causes a large drop, from 52.2 to 5.8.

In contrast, switching `multi-qa-mpnet-base-dot-v1` from the CLS token (its initial configuration) to mean pooling costs less than one point (76.8 to 76.0).
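
The two pooling strategies compared here differ only in how per-token vectors are reduced to one sentence vector. The following is a minimal sketch on plain Python lists, assuming hypothetical token vectors and an attention mask where 1 marks a real token and 0 marks padding. In the notebook itself this step is handled by the `Pooling` module on tensors; the functions `cls_pooling` and `mean_pooling` are illustrative names, not library calls.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector (the CLS position) as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Hypothetical token vectors for one sentence:
# the CLS position, two word pieces, and one padding token.
tokens = [[0.0, 4.0], [1.0, 2.0], [2.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pooling(tokens)
mean_vec = mean_pooling(tokens, mask)
print(cls_vec, mean_vec)
```

One plausible reading of the results, stated as an assumption rather than something this notebook tests: a model that has not been fine-tuned for sentence embeddings has no training signal that makes its CLS vector summarize the sentence, while averaging all token vectors still captures some of the content.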
