# Evaluate Embeddings with the MTEB package and SageMaker processing

The objective of this notebook is to evaluate embeddings with MTEB on a Sentence Similarity task. On the cloud side, we'll use SageMaker processing to spin compute resources up and down without the hassle of managing them.

## Setup and general S3 bucket configuration

In [None]:
!pip install sagemaker

In [None]:
from sagemaker import session
from sagemaker import get_execution_role

sagemaker_session=session.Session()

BUCKET=sagemaker_session.default_bucket()
S3_OUTPUT_PATH="mteb/eval"

## Using a SageMaker processing script for Sentence Transformers

Let's create a dedicated directory for the sentence transformers evaluation code, which targets the STS Benchmark.

In [None]:
!mkdir -p sbertscripts/

Now we just need to create an evaluation script. We'll focus on the [STS Benchmark task](https://paperswithcode.com/dataset/sts-benchmark), in English only.

In [None]:
%%writefile sbertscripts/embeval.py

import argparse
from mteb import MTEB
from sentence_transformers import SentenceTransformer
from mteb.tasks import STSBenchmarkSTS

def stsb_mteb_evaluate_model(model, output_folder):
    # Restrict MTEB to the English STS Benchmark task and score the test split only
    evaluation = MTEB(tasks=[STSBenchmarkSTS(langs=["en"])], task_langs=['en'])
    results = evaluation.run(model, output_folder=output_folder, eval_splits=['test'])
    return results

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name")
    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))
    # Files written to this folder are uploaded to S3 by the ProcessingOutput of the job
    output_path_folder = "/opt/ml/processing/eval/"
    model = SentenceTransformer(args.model_name)
    res = stsb_mteb_evaluate_model(model, output_path_folder)

Although we're going to use a pre-built container, we will customize it so that it can use the MTEB package. To do so, we need to add a `requirements.txt` file in the scripts directory.

In [None]:
%%writefile sbertscripts/requirements.txt
transformers
mteb
datasets
accelerate==0.20.3

### Rationale for the use of PyTorch

At the time of writing, the HuggingFace SageMaker processor doesn't have a GPU image. So instead of GPU-based instances, this time we'll be cost-effective and use CPUs: the evaluation output will be analyzed further later on, so it isn't needed immediately or with low latency.

Hence, we'll use the PyTorch processor, with CPU support. For each sentence transformer model chosen, we're going to launch a processing job.

### Launching SageMaker processing job

In [None]:
from sagemaker.pytorch import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

def run_sm_processing_job(model_name, script_dir = "sbertscripts"):
    #Initialize the PyTorch Processor
    model_suffix = model_name.split('/')[-1]
    hfp = PyTorchProcessor(
        role=get_execution_role(), 
        instance_count=1,
        instance_type='ml.m5.2xlarge',
        framework_version='1.13.1',
        base_job_name=f"mteb-eval-{model_suffix}",
        py_version="py39",
        max_runtime_in_seconds=600
    )

    #Run the processing job; with wait=False the call returns immediately and the job handle stays available as hfp.latest_job
    s3_destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}/{model_name}'
    hfp.run(
        code='embeval.py',
        source_dir=script_dir,
        outputs=[
            ProcessingOutput(output_name='eval', source='/opt/ml/processing/eval/', destination=s3_destination)
        ],
        arguments = ["--model-name", model_name], 
        wait=False
    )
    return {"s3eval":s3_destination, "model_name":model_name, "processor":hfp}

Let's submit a processing job for every SBERT model we want to evaluate.

In [None]:
l=[]
sberts = ["sentence-transformers/all-mpnet-base-v2", "sentence-transformers/all-MiniLM-L6-v2", "intfloat/e5-large-v2"]
for model_name in sberts:
    l.append(run_sm_processing_job(model_name))

Note that we passed `wait=False`, so the jobs run asynchronously and we need to wait until all of them are complete before collecting the results.

__TO DO__: add a handler that waits until job completion, based on the job status.

In [None]:
## TO DO: add a handler that waits until job completion, based on the job status.
m=l[0]['processor']
mm=m.latest_job.describe()['ProcessingJobStatus']
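
One possible way to handle the waiting is a small polling loop around the status lookup shown above. The sketch below is only an illustration: `wait_for_status`, `TERMINAL_STATUSES` and the fake status source are hypothetical names introduced here, and the loop is written against a plain callable so that it runs without AWS access.

```python
import time
from typing import Callable

# Processing job statuses after which polling can stop
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(get_status: Callable[[], str],
                    poll_seconds: float = 30.0,
                    timeout_seconds: float = 1800.0) -> str:
    """Poll `get_status` until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Stand-in status source (hypothetical) so the sketch runs without a real job
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0.01, timeout_seconds=5)
print(final_status)
```

In the notebook, the callable would presumably be something like `lambda: job['processor'].latest_job.describe()['ProcessingJobStatus']`, applied to each entry of `l`.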


Let's collect the results from their respective S3 locations inside a local `sbertresults` directory.

In [None]:
!rm -rf sbertresults/
!aws s3 cp --recursive s3://{BUCKET}/{S3_OUTPUT_PATH}/sentence-transformers/ ./sbertresults/

!aws s3 cp s3://{BUCKET}/{S3_OUTPUT_PATH}/intfloat/e5-large-v2/STSBenchmark.json E5largeV2results.json

Now it's time to see the results.

In [None]:
!pygmentize sbertresults/all-mpnet-base-v2/STSBenchmark.json

In [None]:
!pygmentize sbertresults/all-MiniLM-L6-v2/STSBenchmark.json

In [None]:
!pygmentize E5largeV2results.json

Both MPNet and MiniLM achieve excellent results on STS-B.
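
To compare models side by side rather than reading each JSON file, the main score can be pulled out programmatically. The sketch below assumes the layout written by older MTEB releases (a top-level split key such as `test` containing `cos_sim` with `pearson` and `spearman`); check the files you downloaded, since the layout may differ between versions. The helper name and the sample values are hypothetical placeholders, not real results.

```python
import json

# Hypothetical sample mirroring the assumed layout; the numbers are placeholders, not real scores
sample_json = """
{
  "mteb_dataset_name": "STSBenchmark",
  "test": {
    "cos_sim": {"pearson": 0.5, "spearman": 0.25},
    "evaluation_time": 1.0
  }
}
"""


def sts_spearman(result: dict, split: str = "test") -> float:
    """Return the cosine-similarity Spearman correlation for the given split."""
    return result[split]["cos_sim"]["spearman"]


def rank_models(results_by_model: dict) -> list:
    """Sort (model, score) pairs from best to worst Spearman correlation."""
    scores = {name: sts_spearman(res) for name, res in results_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample_result = json.loads(sample_json)
ranking = rank_models({"placeholder-model": sample_result})
print(ranking)
```

With the real files, each result would be loaded with `json.load(open(path))`, for instance from `sbertresults/all-mpnet-base-v2/STSBenchmark.json`.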

## Evaluating fastText with SageMaker processing

Let's evaluate FastText on MTEB with SageMaker processing. [FastText](https://fasttext.cc/) provides static pre-trained embeddings for 157 languages, as well as tokenization at the subword level.
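
To give an intuition for the subword level, FastText represents a word through its character n-grams, with `<` and `>` marking the word boundaries. The function below is a simplified illustration written for this notebook, not the library's implementation; the n-gram length of 3 is chosen only for readability.

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Return the character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


ngrams = char_ngrams("where")
print(ngrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word vector is built from its n-gram vectors, even a word that never appeared in the training data can still receive an embedding.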

What's great with MTEB is that we can create custom model evaluation classes. The only requirement for these classes is an `encode` method that takes a list of sentences as input and returns a list of vectors. You can do whatever you want inside that class, even call external APIs!
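
As a minimal sketch of that contract, here is a toy class whose `encode` method turns each sentence into a hashed bag-of-words vector. `HashingBagOfWordsModel` is a hypothetical name and the embedding is deliberately naive; it only illustrates the shape of the interface. In practice, returning NumPy arrays, as the FastText class below does, is probably the safer choice.

```python
import hashlib
from typing import List


class HashingBagOfWordsModel:
    """Toy encoder: counts tokens hashed into `dim` buckets (illustration only)."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def encode(self, sentences: List[str], batch_size: int = 32, **kwargs) -> List[List[float]]:
        vectors = []
        for sentence in sentences:
            vec = [0.0] * self.dim
            for token in sentence.lower().split():
                # Stable hash so the same token always lands in the same bucket
                digest = hashlib.md5(token.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % self.dim] += 1.0
            vectors.append(vec)
        return vectors


toy_model = HashingBagOfWordsModel(dim=16)
toy_vectors = toy_model.encode(["A man is playing a guitar.", "A man plays the guitar."])
print(len(toy_vectors), len(toy_vectors[0]))
```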

### Creating an evaluation script and requirements

As usual, let's keep our work tidy, create a dedicated folder, put our evaluation script as well as requirements, and run the processing job.

In [None]:
!mkdir -p fasttextscripts/

In [None]:
%%writefile fasttextscripts/embeval.py

from huggingface_hub import hf_hub_download
import fasttext
from mteb import MTEB
import string


class NaiveAvgFastTextModel():
    def __init__(self):
        # Download and load the model once, instead of on every encode() call
        model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
        self.ftmodel = fasttext.load_model(model_path)

    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding (unused here)

        Returns:
            `List[np.ndarray]`: List of embeddings for the given sentences
        """
        res = []
        for sentence in sentences:
            # Strip punctuation and lowercase before averaging the word vectors
            unpunkt_sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
            res.append(self.ftmodel.get_sentence_vector(unpunkt_sentence))
        return res

if __name__=='__main__':
    output_path_folder = "/opt/ml/processing/eval/"
    model = NaiveAvgFastTextModel()
    evaluation = MTEB(tasks=["STSBenchmark"])
    evaluation.run(model, eval_splits=["test"], output_folder=output_path_folder)

In [None]:
%%writefile fasttextscripts/requirements.txt
transformers
mteb
datasets
accelerate==0.20.3
huggingface_hub
fasttext

### Launching SageMaker processing job

Apart from the script directory, the SageMaker processing job is no different from the one above. Let's reuse the utility function defined earlier.

In [None]:
ftres=run_sm_processing_job("fasttext", script_dir = "fasttextscripts")


In [None]:
ftres

In [None]:
!aws s3 cp s3://{BUCKET}/{S3_OUTPUT_PATH}/fasttext/STSBenchmark.json ftbench.json

In [None]:
!pygmentize ftbench.json

We notice that, although lower than those of transformer-based embeddings, FastText scores are respectable on a similarity task. Old but gold!