# Task 1: Sentence Transformer Implementation

### We have a few options here. The simplest route is to use the SentenceTransformer library, which takes a transformer model and applies mean pooling to the token embeddings. That route is straightforward and requires very little decision making. To make it more interesting, I'll skip SentenceTransformer, use the Hugging Face Transformers library directly, and code the mean pooling by hand.

In [1]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel

#### Load the model, tokenizer, and dataset from Hugging Face. I chose gte-base as the model because it's lightweight yet ranks high on the MTEB leaderboard. For the dataset I chose the SMS Spam Collection, which is small and should be sufficient for this task.

In [2]:
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")
df = pd.read_csv("hf://datasets/codesignal/sms-spam-collection/sms-spam-collection.csv")

#### A few sample sentences

In [3]:
print(df[:5]["message"].values)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 'U dun say so early hor... U c already then say...'
 "Nah I don't think he goes to usf, he lives around here though"]


#### Use tokenizer to convert the sentences into token ids for the transformer model

In [4]:
inputs = tokenizer(
    df[:5]["message"].tolist(),  # Let's just grab the first 5 and encode those
    padding=True,  # Make sure to pad out to the longest sequence of these inputs
    truncation=True,  # truncate any sequence longer than what the model supports
    return_tensors='pt'  # return pytorch tensors
)
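
To make `padding=True` and the attention mask concrete, here is a minimal pure-Python sketch of what padding a batch amounts to. The `pad_batch` helper, the pad id of 0, and the token ids are made up for illustration; they are not the tokenizer's actual output or implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build a matching attention mask.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is padded on the right, and its mask marks the padded positions with 0 so that later steps can ignore them.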

#### Run the token ids through the model to get the token embeddings

In [5]:
with torch.no_grad():  # We're not training, so no need to calculate gradients
    outputs = model(**inputs)  # Give the model 'input_ids', 'token_type_ids', and 'attention_mask'

#### To mimic the pooling that SentenceTransformers (SBERT) performs, we're going to average the token embeddings in each sentence down to a single vector. Rather than simply calling `outputs.last_hidden_state.mean(dim=1)`, which would also average in the padding positions, we apply the attention mask to the token embeddings so that only the actual tokens in each sentence are counted.
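
Written as a formula, the masked average looks like this. The notation is introduced here for illustration: $\mathbf{h}_{i,t}$ is the embedding of token $t$ in sentence $i$, $m_{i,t} \in \{0, 1\}$ is the corresponding attention mask entry, and $T$ is the padded sequence length.

$$
\mathbf{e}_i = \frac{\sum_{t=1}^{T} m_{i,t}\,\mathbf{h}_{i,t}}{\sum_{t=1}^{T} m_{i,t}}
$$

The numerator sums only the real tokens, and the denominator counts them, so padding affects neither.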

In [6]:
def average_pool(last_hidden_states, attention_mask):
    # We don't want to include padded tokens, so use masked_fill to zero them out.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
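
To see why the mask matters, here is a small self-contained check on a toy tensor. The values are made up, and the padded position is given a deliberately large value so that its effect on a naive mean is obvious; `average_pool` is repeated so the snippet runs on its own.

```python
import torch


def average_pool(last_hidden_states, attention_mask):
    # Zero out padded positions, then divide by the number of real tokens.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Toy batch: 2 sentences, 3 positions, hidden size 2 (made-up values)
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

masked_mean = average_pool(hidden, mask)
naive_mean = hidden.mean(dim=1)

print(masked_mean)  # first row averages only the two real tokens
print(naive_mean)   # first row is skewed by the padded position
```

For the sentence without padding the two results agree; for the padded sentence only the masked version returns the average of the real tokens.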

In [7]:
print(embeddings)
print(embeddings.shape)

tensor([[-0.3814,  0.0221,  0.6724,  ...,  0.3914,  0.2581, -0.3310],
        [ 0.4042, -0.0752,  0.3601,  ...,  0.0812,  0.3644,  0.1573],
        [-0.0445, -0.0166,  0.9526,  ..., -0.1386, -0.1383,  0.1609],
        [-0.4066,  0.3460,  0.5974,  ...,  0.0803,  0.4214, -0.0915],
        [-0.0602, -0.0859,  0.0939,  ...,  0.3480,  0.8346,  0.2323]])
torch.Size([5, 768])
