## What representation should we use for playlist names?

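Maybe we could encode each playlist name with RoBERTa and take the final hidden state of the first token as the embedding. In RoBERTa this token is `<s>`, which plays the role of BERT's [CLS]. This would give us a fixed-size representation for each playlist name, regardless of how many tokens the name contains.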

In [1]:
import pandas as pd

In [2]:
from data_storage.DBManager import DBManager
db = DBManager()

query = """
SELECT playlist_name
FROM playlist
LIMIT 25
"""
df = pd.read_sql(query, db.get_connection())
df

INFO:root:Attempting to connect to database.
INFO:root:Connected to database


Unnamed: 0,playlist_name
0,Throwbacks
1,Awesome Playlist
2,korean
3,mat
4,90s
5,Wedding
6,I Put A Spell On You
7,2017
8,BOP
9,old country


In [3]:
# Encode using RoBERTa's first-token embedding (<s>, the counterpart of BERT's [CLS])
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
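def encode_text(text):
    """Return the final-layer hidden state of the first token as a (1, 768) tensor."""
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        # model(...)[0] is the last hidden state, shape (batch, seq_len, hidden).
        # Index 0 along seq_len selects <s>, RoBERTa's counterpart of [CLS].
        cls_embedding = model(input_ids)[0][:, 0, :]
    return cls_embedding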

text = "This is a test"
out = encode_text(text)
out

tensor([[-4.6293e-02,  7.3385e-02, -3.4762e-02, -9.8094e-02,  8.4887e-02,
         -8.0023e-02, -3.9277e-02,  4.0365e-02,  7.1148e-02, -7.5925e-02,
         -3.6650e-03,  3.0005e-02,  4.7014e-02, -3.1616e-02,  8.2024e-02,
          8.1583e-03, -6.6087e-02, -5.4865e-03,  2.4064e-02, -3.0748e-02,
         -1.2845e-01,  4.1122e-02, -5.3852e-02,  7.0078e-02,  1.9351e-03,
          4.3178e-02,  8.2619e-02,  7.0314e-02, -6.5753e-02, -1.8443e-02,
         -2.5071e-02, -3.4473e-02,  3.8447e-02, -2.1146e-02,  6.3702e-02,
          8.9470e-02,  4.7023e-02,  1.3934e-02, -1.1149e-01,  1.6021e-02,
         -6.2424e-03,  5.0203e-02,  6.6321e-03,  4.7572e-03,  7.9707e-02,
          1.3910e-02,  2.4194e-02, -1.9361e-03, -3.2750e-02,  7.7194e-03,
          3.4637e-02,  6.3855e-02, -4.5850e-02,  1.0708e-02, -8.6082e-02,
         -4.5220e-03,  3.2621e-03,  7.8260e-02,  6.4172e-02, -5.5234e-02,
          1.2567e-03, -1.3607e-01, -1.3126e-01,  1.9669e-04,  2.9736e-02,
         -4.4052e-02, -4.3773e-02,  4.

In [5]:
# Encode playlist name

playlist_embs = {name: encode_text(name) for name in df.playlist_name}

In [6]:
playlist_embs

{'Throwbacks': tensor([[-4.2796e-02,  8.6097e-02, -2.5533e-03, -1.0006e-01,  5.8668e-02,
          -9.2504e-02, -3.9496e-02,  2.7001e-02,  6.7875e-02, -6.6409e-02,
          -9.1678e-04,  3.1354e-02,  5.1940e-02, -4.4621e-02,  9.8448e-02,
          -6.9358e-03, -5.6692e-02,  2.2650e-02,  3.7914e-02, -4.0698e-02,
          -1.2417e-01,  5.9431e-02, -4.1437e-02,  8.4560e-02, -1.9346e-03,
           2.8403e-02,  4.8600e-02,  7.2124e-02, -6.1859e-02, -4.2098e-03,
          -4.0692e-02, -4.5841e-02,  3.2244e-02, -3.2172e-02,  5.0030e-02,
           9.9001e-02,  3.9348e-02,  2.1437e-02, -1.1216e-01,  6.6809e-03,
           1.5237e-02,  5.3983e-02, -1.8297e-03,  2.4480e-02,  6.2513e-02,
           2.6052e-02,  1.2140e-02, -1.7457e-02, -3.2098e-02,  2.1428e-02,
           2.2571e-02,  9.7436e-02, -5.0430e-02,  1.1080e-02, -9.6472e-02,
           7.8312e-03,  5.2764e-03,  6.7556e-02,  7.1824e-02, -5.2628e-02,
           9.3482e-03, -1.2202e-01, -9.7462e-02, -4.4127e-02,  1.7907e-02,
          -

In [7]:
# Compute cosine similarity between playlist names
from sklearn.metrics.pairwise import cosine_similarity

for name1, emb1 in playlist_embs.items():
    for name2, emb2 in playlist_embs.items():
        sim = cosine_similarity(emb1, emb2)
        print(f"{name1} vs {name2}: {sim}")

Throwbacks vs Throwbacks: [[0.9999998]]
Throwbacks vs Awesome Playlist: [[0.9994164]]
Throwbacks vs korean : [[0.9987885]]
Throwbacks vs mat: [[0.99945396]]
Throwbacks vs 90s: [[0.9995743]]
Throwbacks vs Wedding: [[0.9994364]]
Throwbacks vs I Put A Spell On You: [[0.9981877]]
Throwbacks vs 2017: [[0.99965566]]
Throwbacks vs BOP: [[0.9994837]]
Throwbacks vs old country : [[0.99823093]]
Throwbacks vs abby : [[0.99862474]]
Throwbacks vs VIBE: [[0.9993436]]
Throwbacks vs relax: [[0.99915075]]
Throwbacks vs sleep: [[0.99899334]]
Throwbacks vs 90's : [[0.9987458]]
Throwbacks vs New Songs: [[0.9994335]]
Throwbacks vs slow hands: [[0.9992969]]
Throwbacks vs Mom's playlist: [[0.99914837]]
Throwbacks vs SARAH: [[0.99940634]]
Throwbacks vs melancholy: [[0.99895614]]
Throwbacks vs mixtape: [[0.9991863]]
Throwbacks vs Sad Songs: [[0.9995886]]
Throwbacks vs fall '17: [[0.9991873]]
Throwbacks vs ✔️: [[0.9990892]]
Throwbacks vs Twenty one pilots: [[0.9986772]]
Awesome Playlist vs Throwbacks: [[0.99941

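The cosine similarity of these embeddings appears to always be around 0.99: every pair printed above scores higher than 0.998, even for names as unrelated as "Throwbacks" and "korean". The raw RoBERTa first-token embeddings therefore barely separate playlist names. The cells below compare this with SBERT.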
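One plausible reason for similarities this close to 1, not verified on this data, is that all embeddings share a large common component and differ only by small offsets. The dependency-free sketch below illustrates that effect on synthetic vectors; the dimension, the shared value of 20, and the noise scale are arbitrary assumptions chosen for illustration, not properties of RoBERTa.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(vectors):
    """Subtract the per-dimension mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n_vectors = 64, 5

# Each synthetic "embedding" is a large shared vector plus small noise.
shared = [20.0] * dim
vectors = [[s + rng.gauss(0.0, 1.0) for s in shared] for _ in range(n_vectors)]

# Compare every unordered pair once.
pairs = [(i, j) for i in range(n_vectors) for j in range(i + 1, n_vectors)]
raw_sims = [cosine(vectors[i], vectors[j]) for i, j in pairs]

# Removing the mean strips the shared component before comparing.
centered = center(vectors)
centered_sims = [cosine(centered[i], centered[j]) for i, j in pairs]

print(f"raw:      min={min(raw_sims):.4f} max={max(raw_sims):.4f}")
print(f"centered: min={min(centered_sims):.4f} max={max(centered_sims):.4f}")
```

If the playlist embeddings behave like this, subtracting their mean before calling `cosine_similarity` might spread the scores out, though whether the remaining differences are meaningful would still need checking.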

In [9]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

query_embedding = model.encode("How big is London")
passage_embedding = model.encode([
    "London has 9,787,426 inhabitants at the 2011 census",
    "London is known for its finacial district",
])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: multi-qa-MiniLM-L6-cos-v1


INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu


Similarity: tensor([[0.5472, 0.6330]])
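The score above comes from `util.dot_score` rather than cosine similarity. The model card for `multi-qa-MiniLM-L6-cos-v1` describes its embeddings as normalized to unit length, in which case the two measures should coincide. The small sketch below shows that equivalence on two made-up vectors; the values are arbitrary and are not model output.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Arbitrary example vectors with norms 5 and 3.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)                        # computed on the raw vectors
dot_unit = dot(normalize(a), normalize(b))   # dot product after normalizing

print(cos_ab, dot_unit)
```

For a model that does not normalize its output, `util.cos_sim` from `sentence_transformers` is the safer choice when the goal is to compare against the cosine similarities computed for RoBERTa earlier.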
