# Sentence Embedding Evaluation

This study evaluates four sentence embedding methods on semantic similarity tasks: Word2Vec, GloVe, BERT, and SentenceTransformer. T5 is imported in the setup cell but is not evaluated below.

## 1. Setup

We import the required libraries, load the helper functions from utils.py, and fix the random seeds for reproducibility.


In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import random
import os

import gensim.downloader as api
from torchtext.vocab import GloVe
from transformers import BertTokenizer, BertModel, T5Tokenizer, T5Model
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from utils import clone_senteval_repo, download_transfer_data, init_senteval_params, preprocess_text, setup_senteval

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

## 2. Data Loading and Preprocessing

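We use the helper functions in utils.py to clone the SentEval repository, download its transfer-task data, and initialize the evaluation parameters and task list. The tasks evaluated below are SICKRelatedness and STSBenchmark.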

In [3]:
clone_senteval_repo()
download_transfer_data()
params, tasks = init_senteval_params()
setup_senteval()

import senteval

## 3. Model Loading and Embedding Functions

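In this section, we load the models and define functions to obtain sentence embeddings. Each function follows the SentEval batcher signature (params, batch), where batch is a list of tokenized sentences, and returns one embedding per sentence.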


In [None]:
w2v_model = api.load("word2vec-google-news-300")

def w2v_emb(params, batch):
    embeddings = []
    for sent in batch:
        sent = sent or ['.']
        sentvec = [w2v_model[word] for word in sent if word in w2v_model]
        sentvec = np.mean(sentvec, axis=0) if sentvec else np.zeros(300)
        embeddings.append(sentvec)
    return np.vstack(embeddings)
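
The averaging step used for the static word vectors can be illustrated on a toy example. This is a minimal sketch: `toy_vectors` and `mean_pool` are hypothetical names, and the 3-dimensional vectors stand in for the 300-dimensional pretrained ones.

```python
import numpy as np

# Hypothetical toy vocabulary standing in for a pretrained word-vector model.
toy_vectors = {
    "cats": np.array([1.0, 0.0, 2.0]),
    "sleep": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# "soundly" is out of vocabulary, so it is skipped.
in_vocab = mean_pool(["cats", "sleep", "soundly"], toy_vectors, dim=3)
# No known tokens at all: the fallback zero vector is returned.
all_oov = mean_pool(["soundly"], toy_vectors, dim=3)

print(in_vocab)  # [2. 1. 1.]
print(all_oov)   # [0. 0. 0.]
```

Out-of-vocabulary words contribute nothing to the average, so two sentences that differ only in unknown words receive identical embeddings.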

In [None]:
glove_model = api.load("glove-wiki-gigaword-300")

def glove_emb(params, batch):
    embeddings = []
    for sent in batch:
        sent = sent or ['.']
        sentvec = [glove_model[word] for word in sent if word in glove_model]
        sentvec = np.mean(sentvec, axis=0) if sentvec else np.zeros(300)
        embeddings.append(sentvec)
    return np.vstack(embeddings)

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-tiny')
bert_model = BertModel.from_pretrained('prajjwal1/bert-tiny')

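def bert_emb(params, batch):
    embeddings = []
    for sent in batch:
        # SentEval passes tokenized sentences; rejoin them and use '.' for empty input.
        sent = ' '.join(sent) or '.'
        # Every sentence is padded to 128 tokens, so the mean below also covers [PAD] positions.
        inputs = bert_tokenizer(sent, return_tensors="pt", padding='max_length', truncation=True, max_length=128)
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Mean-pool the final hidden states over the sequence dimension.
        sentvec = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(sentvec)
    return np.vstack(embeddings)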
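
Because `bert_emb` pads every sentence to 128 tokens and averages over all positions, padding positions enter the mean. A common alternative is to weight the average by the attention mask. The sketch below shows the idea in NumPy on made-up numbers. `masked_mean` is a hypothetical helper, not part of the notebook, and the reported results were not produced with it.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average token vectors over real tokens only.

    hidden_states: array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# One sentence with four positions; the last two are padding (values are made up).
hidden = np.array([[[2.0, 4.0], [4.0, 8.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

plain = hidden.mean(axis=1)         # padding included
masked = masked_mean(hidden, mask)  # padding excluded

print(plain)   # [[6.  7.5]]
print(masked)  # [[3. 6.]]
```

With a PyTorch model, the same computation would apply to `outputs.last_hidden_state` and `inputs['attention_mask']`.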

In [None]:
model_st = SentenceTransformer('distilbert-base-nli-mean-tokens')

def st_emb(params, batch):
    batch = [' '.join(sent) or '.' for sent in batch]
    embeddings = model_st.encode(batch)
    return embeddings
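
The setup cell imports `cosine_similarity` without using it. As an illustration of how two sentence embeddings could be compared directly, here is a minimal NumPy sketch of cosine similarity. `cosine_sim` is a hypothetical helper and the vectors are toy values, not model outputs.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
```

Cosine similarity ignores vector length, so it depends only on the direction of the embeddings.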

## 4. Evaluation

We evaluate each embedding function with SentEval and report the Pearson correlation, Spearman correlation, and mean squared error (MSE) for each task.
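
The three metrics collected in the results table can be illustrated on toy data. The sketch below computes them with NumPy only. The helper names and the `gold` and `pred` arrays are hypothetical, and the rank function assumes there are no ties. SentEval computes these values internally, so this is only meant to show what each number measures.

```python
import numpy as np

def pearson(x, y):
    """Linear correlation between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """Rank of each value (0 = smallest); assumes no tied values."""
    return np.argsort(np.argsort(values))

def spearman(x, y):
    """Pearson correlation computed on ranks instead of raw scores."""
    return pearson(ranks(x), ranks(y))

def mse(x, y):
    """Mean squared difference between predicted and gold scores."""
    return float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))

gold = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # made-up gold similarity scores
pred = np.array([0.5, 1.0, 2.5, 2.0, 5.0])  # made-up predicted scores

print(round(spearman(gold, pred), 6))  # 0.9
print(mse(gold, pred))                 # 0.5
```

Pearson and Spearman reward predictions that move together with the gold scores, while MSE also penalizes predictions that are correctly ordered but on the wrong scale.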


In [8]:
se_w2v = senteval.engine.SE(params, w2v_emb)
results_w2v = se_w2v.eval(tasks)

In [9]:
se_glove = senteval.engine.SE(params, glove_emb)
results_glove = se_glove.eval(tasks)

In [10]:
se_st = senteval.engine.SE(params, st_emb)
results_st = se_st.eval(tasks)

In [13]:
se_bert = senteval.engine.SE(params, bert_emb)
results_bert = se_bert.eval(tasks)

In [22]:
results = {
    'w2v': results_w2v, 'glove': results_glove,
    'st': results_st, 'bert': results_bert
}

df_list = []
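# Use a distinct loop variable so the task list `tasks` from the setup cell is not overwritten.
for model_name, task_results in results.items():
    for task_name, metrics in task_results.items():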
        data_row = {
            'Model': model_name,
            'Task': task_name,
            'Pearson': metrics.get('pearson', None),
            'Spearman': metrics.get('spearman', None),
            'MSE': metrics.get('mse', None)
        }
        df_list.append(data_row)

df = pd.DataFrame(df_list)
df

| Model | Task | Pearson | Spearman | MSE |
|---|---|---|---|---|
| w2v | SICKRelatedness | 0.801426 | 0.715178 | 0.364368 |
| w2v | STSBenchmark | 0.642349 | 0.619679 | 1.646826 |
| glove | SICKRelatedness | 0.786947 | 0.703704 | 0.388517 |
| glove | STSBenchmark | 0.657788 | 0.645334 | 1.544022 |
| st | SICKRelatedness | 0.850944 | 0.800511 | 0.282272 |
| st | STSBenchmark | 0.753489 | 0.758108 | 1.094154 |
| bert | SICKRelatedness | 0.778124 | 0.686694 | 0.401963 |
| bert | STSBenchmark | 0.650744 | 0.63698 | 1.551939 |
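
To compare models side by side, the long-format results can be pivoted so that each task becomes a column. The sketch below re-enters the Pearson values reported above. The variable names are hypothetical, and in the notebook the same call could be applied to `df` directly.

```python
import pandas as pd

# Pearson correlations copied from the results above.
rows = [
    ("w2v", "SICKRelatedness", 0.801426),
    ("w2v", "STSBenchmark", 0.642349),
    ("glove", "SICKRelatedness", 0.786947),
    ("glove", "STSBenchmark", 0.657788),
    ("st", "SICKRelatedness", 0.850944),
    ("st", "STSBenchmark", 0.753489),
    ("bert", "SICKRelatedness", 0.778124),
    ("bert", "STSBenchmark", 0.650744),
]
df_pearson = pd.DataFrame(rows, columns=["Model", "Task", "Pearson"])

# One row per model, one column per task.
wide = df_pearson.pivot(index="Model", columns="Task", values="Pearson")

# Name of the best-scoring model for each task.
best = wide.idxmax()
print(wide)
print(best)
```

The same pivot works for the Spearman and MSE columns by changing the `values` argument, with `idxmin` in place of `idxmax` for MSE.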


## 5. Conclusions


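Model Depth and Complexity: SentenceTransformers, the deepest model evaluated, achieved the highest Pearson and Spearman correlations and the lowest MSE on both tasks. The much smaller BERT model (prajjwal1/bert-tiny) did not beat the averaged word vectors, so the results are consistent with the hypothesis that model complexity matters, although they do not separate depth from the effect of fine-tuning.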

Training Data and Fine-tuning: SentenceTransformers, which is fine-tuned for sentence comparison tasks, outperformed all other models on both tasks (for example, a Pearson correlation of 0.753 on STSBenchmark against at most 0.658 for the others), which supports the importance of task-specific fine-tuning.

Tokenization and Input Representation: Although BERT uses subword tokenization, it scored slightly below GloVe on both tasks (Pearson 0.778 vs. 0.787 on SICKRelatedness and 0.651 vs. 0.658 on STSBenchmark). This suggests that tokenization alone does not determine embedding quality and that other factors, such as model size, pooling, and training objective, also play a role.

Traditional Embeddings vs. Modern Techniques: SentenceTransformers outperformed the traditional Word2Vec and GloVe embeddings on both tasks, whereas the small BERT model did not. In these experiments, modern techniques improved performance when the model was also trained to produce sentence-level representations.

Impact of Model Architecture: SentenceTransformers and BERT both use attention mechanisms, yet only SentenceTransformers clearly outperformed the word-vector baselines. BERT was the weakest model on SICKRelatedness and fell between Word2Vec and GloVe on STSBenchmark. Architecture therefore plays a role in embedding quality, but it is not sufficient on its own.

Overall Quality: All embedding methods reached Pearson correlations above 0.64 on both tasks, indicating that each is potentially useful in practice. For tasks that require a deep understanding of sentence semantics, however, specialized models such as SentenceTransformers are the better choice.