[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/spotify-podcast-search/spotify-podcast-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/spotify-podcast-search/spotify-podcast-search.ipynb)

# Podcast Search

In this notebook we will work through the techniques described by Spotify R&D on [how they implemented semantic search](https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/) (or *natural language search*) to improve the podcast discovery process for users.

## Data

Spotify used four data sources for training and evaluation:

1. *(query, episode)* pairs from successful podcast searches (found in past search logs)
2. Where a successful podcast search occurred *after* an initially unsuccessful search, *(query_prior_to_successful_reformulation, episode)* pairs were created. The idea is that these initial queries may be natural language queries that the user then changed to fit a more rigid search format.
3. Synthetic queries generated from popular episode titles and descriptions. They fine-tuned a BART model on MS MARCO and then used it to generate the queries, creating *(synthetic_query, episode)* pairs.
4. A small curated set of *semantic* queries, manually written for popular podcast episodes. This set is used for evaluation only.

Unfortunately, we don't have access to Spotify's past search logs, which rules out any emulation of sources **1** and **2**. Given enough time we could also curate a set of semantic queries, but we will not do so here (feel free to do this if you want).

That leaves us with option **3**. This is probably the most interesting technique used by Spotify, and fortunately we can replicate it. All we need is a dataset containing podcast metadata, which we can [find here](https://www.kaggle.com/datasets/listennotes/all-podcast-episodes-published-in-december-2017).

Before getting started we must install all prerequisites:

In [None]:
!pip install -U kaggle sentence-transformers pinecone-client tqdm

### Data Download

We need to use the Kaggle API to download our podcast metadata dataset. This is installed using `pip install kaggle`. An account and API key are needed, and the key should be stored in the location displayed when attempting to `import kaggle` (if no error appears, the API key has already been added).

In [None]:
import kaggle
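
If the import complains, it can help to check where credentials are being looked for. The following is a minimal sketch, assuming the usual Kaggle conventions of a `kaggle.json` token in `~/.kaggle/` (or in the directory named by `KAGGLE_CONFIG_DIR`) and the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables. The helper name `kaggle_credentials_source` is hypothetical and not part of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_source(env=None, home=None):
    """Return where Kaggle credentials appear to be configured, or None.

    Hypothetical helper: it only inspects the environment and the file
    system, it does not call the Kaggle API.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # option 1: credentials passed as environment variables
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"
    # option 2: the kaggle.json token file in the config directory
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    token = config_dir / "kaggle.json"
    return str(token) if token.is_file() else None


print(kaggle_credentials_source())
```

A result of `None` suggests the API key has not been added yet.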

Once the API key is added, we move onto the data download.

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

Kaggle hosts two different types of datasets: competition and standalone. We need to know which of those our podcasts dataset is because the method for downloading each is different. Fortunately, it's easy to identify from nothing more than the URL. Competition datasets contain `/c/` at the beginning of the URL path, whereas standalone datasets contain `/datasets/`. Here are two examples:

Competition: [https://www.kaggle.com/c/titanic](https://www.kaggle.com/c/titanic)

Standalone: [https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc](https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc)
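
The URL rule above can be sketched in a few lines. This is only an illustration: `parse_kaggle_url` is a hypothetical helper, and it assumes the two path layouts shown in the examples (`/c/<name>` and `/datasets/<owner>/<name>`).

```python
from urllib.parse import urlparse


def parse_kaggle_url(url):
    """Classify a Kaggle URL and extract the identifier from its path."""
    # split the path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "c":
        return "competition", parts[1]
    if len(parts) >= 3 and parts[0] == "datasets":
        # standalone datasets are identified as "<owner>/<name>"
        return "standalone", f"{parts[1]}/{parts[2]}"
    return "unknown", None


print(parse_kaggle_url("https://www.kaggle.com/c/titanic"))
print(parse_kaggle_url("https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc"))
```

For a standalone dataset, the second element has the same `<owner>/<name>` form as the dataset location passed to the download method.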

If we take a look at the podcasts dataset page we will see that it is a *standalone dataset*:

[https://www.kaggle.com/datasets/listennotes/all-podcast-episodes-published-in-december-2017](https://www.kaggle.com/datasets/listennotes/all-podcast-episodes-published-in-december-2017)

And so we can download it using the `dataset_download_file` method, for which we must pass the dataset location (`listennotes/all-podcast-episodes-published-in-december-2017`), filename(s) (`podcasts.csv`, `episodes.csv`), and target save location (current directory, `./`).

In [None]:
api.dataset_download_file(
    'listennotes/all-podcast-episodes-published-in-december-2017',
    file_name='podcasts.csv',
    path='./'
)
api.dataset_download_file(
    'listennotes/all-podcast-episodes-published-in-december-2017',
    file_name='episodes.csv',
    path='./'
)

This will download both of our files as zip files, which we extract using the `zipfile` library.

In [None]:
import zipfile

with zipfile.ZipFile('podcasts.csv.zip', 'r') as zipref:
    zipref.extractall('./')
with zipfile.ZipFile('episodes.csv.zip', 'r') as zipref:
    zipref.extractall('./')

We have two datasets here. `podcasts` details the podcast shows themselves: their title, description, and author. The `episodes` dataset details specific episodes from those podcasts, including the episode title, description, publication date, etc.

In [1]:
import pandas as pd

podcasts = pd.read_csv('podcasts.csv')
podcasts.head()

Unnamed: 0,uuid,title,image,description,language,categories,website,author,itunes_id
0,8d62d3880db2425b890b986e58aca393,"Ecommerce Conversations, by Practical Ecommerce",http://is4.mzstatic.com/image/thumb/Music6/v4/...,Listen in as the Practical Ecommerce editorial...,English,Technology,http://www.practicalecommerce.com,Practical Ecommerce,874457373
1,cbbefd691915468c90f87ab2f00473f9,Eat Sleep Code Podcast,http://is4.mzstatic.com/image/thumb/Music71/v4...,On the show we’ll be talking to passionate peo...,English,Tech News | Technology,http://developer.telerik.com/,Telerik,1015556393
2,73626ad1edb74dbb8112cd159bda86cf,SoundtrackAlley,http://is5.mzstatic.com/image/thumb/Music71/v4...,A podcast about soundtracks and movies from my...,English,Podcasting | Technology,https://soundtrackalley.podbean.com,Randy Andrews,1158188937
3,0f50631ebad24cedb2fee80950f37a1a,The Tech M&A Podcast,http://is1.mzstatic.com/image/thumb/Music71/v4...,The Tech M&A Podcast pulls from the best of th...,English,Business News | Technology | Tech News | Business,http://www.corumgroup.com,Timothy Goddard,538160025
4,69580e7b419045839ca07af06cf0d653,"The Tech Informist - For fans of Apple, Google...",http://is4.mzstatic.com/image/thumb/Music62/v4...,The tech news show with two guys shooting the ...,English,Gadgets | Tech News | Technology,http://techinformist.com,The Tech Informist,916080498


In [2]:
episodes = pd.read_csv('episodes.csv')
episodes.head()

Unnamed: 0,title,audio,audio_length,description,pub_date,uuid,podcast_uuid
0,Piątek - 01 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,490,"święci męczennicy jezuiccy Edmund Campion SJ, ...",2017-12-01 00:00:00+00,fd5d891411174c7ca953c1f54657c3eb,811c18cf575841b3bef4601978f17ca9
1,Sobota - 02 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,481,"bł. Rafał Chyliński, prezbiter, Łk 21, 34-36",2017-12-02 00:00:00+00,5c28fa0a27b342cd92ff03c16a8019c2,811c18cf575841b3bef4601978f17ca9
2,Niedziela - 03 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,667,"Pierwsza Niedziela Adwentu, Mk 13, 33-37",2017-12-03 00:00:00+00,efdc9f4f07fa4c4883f8848256066cec,811c18cf575841b3bef4601978f17ca9
3,Introduction to Luke,http://www.wgcr.net/images/TimelessTruths/TTT-...,1691,Luke 1:1-4 -,2017-12-03 11:30:05+00,cc2860165fa84d1092f6b45f19255a87,36ed4e62dcd94412a5211cc9bd76ba7c
4,"Dear Science: Lightning, Dead Cats and Hand Sa...",http://95bfm.com/sites/default/files/291117_De...,1152,<p>Today on Dear Science with AUT's Allan Blac...,2017-12-27 11:00:00+00,69bd409e0469433581ccc76cf7b664ad,fa36a26a1879453f95da1379c737cd6d


### Data Preparation

Spotify stated that their episode data consists of the podcast title, podcast description, episode title, episode description, and other metadata concatenated together. We will replicate this by first merging the *episodes* and *podcasts* dataframes with an *inner join* on the podcast ID features.

In [3]:
episodes = episodes.merge(
    podcasts,
    left_on='podcast_uuid',
    right_on='uuid',
    suffixes=('_ep', '_pod')
)
episodes.head()

Unnamed: 0,title_ep,audio,audio_length,description_ep,pub_date,uuid_ep,podcast_uuid,uuid_pod,title_pod,image,description_pod,language,categories,website,author,itunes_id
0,Piątek - 01 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,490,"święci męczennicy jezuiccy Edmund Campion SJ, ...",2017-12-01 00:00:00+00,fd5d891411174c7ca953c1f54657c3eb,811c18cf575841b3bef4601978f17ca9,811c18cf575841b3bef4601978f17ca9,Modlitwa w drodze,http://is4.mzstatic.com/image/thumb/Music62/v4...,\n\t\t\tModlitwa w drodze to propozycja duchow...,Polish,Training | Spirituality | Education | Christia...,http://www.modlitwawdrodze.pl,Modlitwa w drodze,412783872
1,Sobota - 02 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,481,"bł. Rafał Chyliński, prezbiter, Łk 21, 34-36",2017-12-02 00:00:00+00,5c28fa0a27b342cd92ff03c16a8019c2,811c18cf575841b3bef4601978f17ca9,811c18cf575841b3bef4601978f17ca9,Modlitwa w drodze,http://is4.mzstatic.com/image/thumb/Music62/v4...,\n\t\t\tModlitwa w drodze to propozycja duchow...,Polish,Training | Spirituality | Education | Christia...,http://www.modlitwawdrodze.pl,Modlitwa w drodze,412783872
2,Niedziela - 03 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,667,"Pierwsza Niedziela Adwentu, Mk 13, 33-37",2017-12-03 00:00:00+00,efdc9f4f07fa4c4883f8848256066cec,811c18cf575841b3bef4601978f17ca9,811c18cf575841b3bef4601978f17ca9,Modlitwa w drodze,http://is4.mzstatic.com/image/thumb/Music62/v4...,\n\t\t\tModlitwa w drodze to propozycja duchow...,Polish,Training | Spirituality | Education | Christia...,http://www.modlitwawdrodze.pl,Modlitwa w drodze,412783872
3,Poniedziałek - 04 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,654,"św. Jan Damasceński, prezbiter i doktor Kościo...",2017-12-04 00:00:00+00,a6034db279244d21a34c0723d1495fb8,811c18cf575841b3bef4601978f17ca9,811c18cf575841b3bef4601978f17ca9,Modlitwa w drodze,http://is4.mzstatic.com/image/thumb/Music62/v4...,\n\t\t\tModlitwa w drodze to propozycja duchow...,Polish,Training | Spirituality | Education | Christia...,http://www.modlitwawdrodze.pl,Modlitwa w drodze,412783872
4,Wtorek - 05 grudnia,https://cdneu.modlitwawdrodze.pl/prayers/MWD_2...,535,"św. Saba Jerozolimski, prezbiter, Łk 10, 21-24",2017-12-05 00:00:00+00,c35d5236d451454fa0bb5e95a7137d35,811c18cf575841b3bef4601978f17ca9,811c18cf575841b3bef4601978f17ca9,Modlitwa w drodze,http://is4.mzstatic.com/image/thumb/Music62/v4...,\n\t\t\tModlitwa w drodze to propozycja duchow...,Polish,Training | Spirituality | Education | Christia...,http://www.modlitwawdrodze.pl,Modlitwa w drodze,412783872


We now have all the features we'd like to concatenate in one place: *title_ep*, *title_pod*, *description_ep*, and *description_pod*. We can ignore the remaining features.

Before concatenation, we should remove any record that contains null/empty values within any of these features, and strip excessive whitespace found at the start or end of features.

In [4]:
features = ['title_ep', 'description_ep', 'title_pod', 'description_pod']
# strip whitespace
episodes[features] = episodes[features].apply(lambda x: x.str.strip())

In [5]:
print(f"Before: {len(episodes)}")
episodes = episodes[
    ~episodes[features].isnull().any(axis=1)
]
print(f"After: {len(episodes)}")

Before: 873820
After: 778182


Now concatenate.

In [6]:
episodes = episodes['title_ep'] + '. ' + episodes['description_ep'] + '. ' \
    + episodes['title_pod'] + '. ' + episodes['description_pod']
episodes = episodes.to_list()

In [7]:
episodes[50:53]

['Fancy New Band: Running Stitch. <p>Running Stitch join Hannah to play sme new tracks ahead of their EP release next year. Cheers NZ On Air Music!</p>. 95bFM. Audio on demand from selected shows',
 "Political Commentary w/ David Slack: December 21, 2017. <p>It's the end of the year, and let's face it... 2017 hasn't been a great one for empathy. From\xa0the public treatment of our politicians\xa0to the treament of our least fortunate citizens, David Slack reckons it's about time we all took pause. It is Christmas, after all.</p>. 95bFM. Audio on demand from selected shows",
 'From the Crate w/ Troy Ferguson: December 21, 2017. <p>LP exploration with the ever-knowledgeable Troy, featuring the following new cakes and/or tasty re-releases:</p>\n\n<ul>\n\t<li>Ken Boothe - <em>You Keep Me Hangin\' On</em></li>\n\t<li>The New Sounds -\xa0<em>The Big Score</em></li>\n\t<li>Jitwam -\xa0<em>Keepyourbusinesstoyourself</em></li>\n</ul>\n\n<p>All available from and thanks to\xa0<a href="http://www

Let's shuffle our data too.

In [8]:
from random import shuffle

shuffle(episodes)

### Query Generation

We now have episodes, but no queries, and we need *(query, episode)* pairs to fine-tune a model. Spotify generated synthetic queries from the episode text (which we have). To do this they fine-tuned a BART model on MS MARCO, then used it to generate the queries.

We don't need to fine-tune a BART model ourselves, as there are already plenty of readily available query generation models that have been fine-tuned on the same MS MARCO dataset. We will initialize one of these (a T5 model rather than BART) using the HuggingFace *transformers* library.

In [9]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# after testing many BART and T5 query generation models, this seemed best
model_name = 'doc2query/all-t5-base-v1'

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name
).to(device)

Now we begin generating queries. The Spotify article doesn't state how many queries were produced for each episode, so we will assume three queries per episode, in line with the approach taken by the GenQ and GPL techniques.

In [10]:
# (OPTIONAL) it will take a long time to produce queries for the entire dataset, let's drop some episodes
episodes = episodes[:100_000]

In [11]:
from tqdm.auto import tqdm

batch_size = 128  # larger batch size == faster processing
num_queries = 3  # number of queries to generate for each episode
pairs = []
ep_batch = []

for ep in tqdm(episodes):
    # remove tab + newline characters if present
    ep_batch.append(ep.replace('\t', ' ').replace('\n', ' '))
    
    # we encode in full batches (a final partial batch is never processed)
    if len(ep_batch) == batch_size:
        # tokenize the passage
        inputs = tokenizer(
            ep_batch,
            truncation=True,
            padding=True,
            max_length=256,
            return_tensors='pt'
        )

        # generate three queries per episode
        outputs = model.generate(
            input_ids=inputs['input_ids'].to(device),
            attention_mask=inputs['attention_mask'].to(device),
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries
        )

        # decode query to human readable text
        decoded_output = tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )

        # loop through to pair query and episodes
        for i, query in enumerate(decoded_output):
            query = query.replace('\t', ' ').replace('\n', ' ')  # remove newline + tabs
            ep_idx = i // num_queries  # get index of episode to match query
            pairs.append([query, ep_batch[ep_idx]])
        
        ep_batch = []

  0%|          | 0/100000 [00:00<?, ?it/s]
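
The pairing step in the loop relies on the generated queries coming back grouped by input, so that output `i` belongs to episode `i // num_queries`. A minimal sketch of that mapping on toy strings, with no model involved (`pair_queries_with_episodes` is a hypothetical helper name):

```python
def pair_queries_with_episodes(decoded_queries, ep_batch, num_queries):
    """Pair each generated query with the episode it was generated from.

    Assumes the queries are ordered by input episode, with num_queries
    consecutive queries per episode.
    """
    if len(decoded_queries) != len(ep_batch) * num_queries:
        raise ValueError("expected num_queries outputs per episode")
    return [
        [query, ep_batch[i // num_queries]]
        for i, query in enumerate(decoded_queries)
    ]


toy_pairs = pair_queries_with_episodes(
    ["a1", "a2", "a3", "b1", "b2", "b3"], ["episode A", "episode B"], 3
)
print(toy_pairs)
```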

In [12]:
pairs[0], pairs[100], pairs[130], pairs[250], pairs[550]

(['what is psalm 51 verse',
  'Psalm 51:19h. Psalm 51. SermonAudio.com: MP3. The latest MP3 feed from SermonAudio.com.'],
 ['who is the host of bat and fleming',
  'BART & FLEMING 16: "Indies at the Oscars". <p>Deadline Hollywood\'s Peter Bart and Mike Fleming Jr discuss the recent waves of sexual harassment and assault allegations in Hollywood, as well as the trend for the Oscars to embrace independent films in recent years. Produced by David Janove.</p>. The Deadline Podcast. The Deadline Podcast is the one stop shop for all of Deadline Hollywood\'s podcasts including BART & FLEMING and TV TALK with Dominic Patten and Pete Hammond.'],
 ['елитике оруие : акрта секретна рорамма "лма"?',
  'Космическое оружие СССР: почему была закрыта секретная программа "Алмаз"? (831). <table width="100%"><tr><td><div style="float:left;width:235px;"><table cellpadding=0 cellspacing=0><tr><td style="border-bottom:0px;"><img src="http://file2.podfm.ru/37/374/3746/37462/images/pod_28169.jpg?2" ></td></tr>

We now have *(synthetic_query, episode)* pairs that we can use for fine-tuning a sentence embedding model.

## Models

Spotify tested using pretrained BERT models, but as they are not fine-tuned for producing sentence embeddings they did not use them. It looks like they also tested the performance of the original SBERT model, which has been fine-tuned for sentence embeddings, but were not happy with the results.

In the end, they used the Universal Sentence Encoder (USE) model, taken from TFHub. This is a great approach, but to keep things as simple as possible we will instead use a DistilUSE model supported by the *sentence-transformers* library, `distiluse-base-multilingual-cased-v2`. This will allow us to use the *sentence-transformers* fine-tuning utilities.

To initialize this model we do:

In [13]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'distiluse-base-multilingual-cased-v2'
).to(device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)

When fine-tuning with the *sentence-transformers* library we need to reformat our data into a list of `InputExample` objects. The exact format varies based on the training task. Ours is a ranking task (more on that soon), so all we need are two text items per example, i.e. our *(query, episode)* pairs.

In [14]:
from sentence_transformers import InputExample

eval_split = int(0.01 * len(pairs))
test_split = int(0.19 * len(pairs))
print("Eval samples: " + str(eval_split) + "\nTest samples: " + str(test_split))

# we separate a number of these for testing
test_pairs = pairs[-test_split:]
pairs = pairs[:-test_split]

# and take a small number of samples for evaluation
eval_pairs = pairs[-eval_split:]
pairs = pairs[:-eval_split]

train = []

for (query, episode) in tqdm(pairs):
    train.append(InputExample(texts=[query, episode]))

Eval samples: 2999
Test samples: 56981


  0%|          | 0/239924 [00:00<?, ?it/s]
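
The split sizes above can be reproduced with simple arithmetic. The sketch below assumes the query generation loop processed only full batches of 128 episodes, which is inferred from the code rather than stated, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_pairs, eval_frac=0.01, test_frac=0.19):
    """Return (train, eval, test) sizes using the same int() truncation."""
    eval_n = int(eval_frac * n_pairs)
    test_n = int(test_frac * n_pairs)
    return n_pairs - eval_n - test_n, eval_n, test_n


# 100,000 episodes in full batches of 128, three queries per episode
n_full_batches = 100_000 // 128
n_pairs = n_full_batches * 128 * 3
print(n_pairs, split_sizes(n_pairs))
```

Under that assumption the training count matches the total shown in the progress bar.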

As mentioned, we are going to be using a ranking optimization function. That means that the model is tasked with learning how to identify the correct *episode* from a batch of episodes when given a *query*. The model does this by embedding matching *(query, episode)* pairs as closely as possible in a vector space. We measure the proximity of these embeddings using *cosine similarity*, i.e. the cosine of the angle between the two embeddings (vectors).
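
To make the idea concrete, here is a simplified pure-Python sketch of an in-batch ranking loss: every query is scored against every episode in the batch with cosine similarity, and the loss is the cross-entropy of picking the matching episode. It is an illustration rather than the library's implementation, and `scale=20.0` is an assumed value chosen to sharpen the softmax.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(query_embs, episode_embs, scale=20.0):
    """Mean cross-entropy where episode i is the positive for query i."""
    losses = []
    for i, q in enumerate(query_embs):
        scores = [scale * cosine(q, e) for e in episode_embs]
        # numerically stable log-sum-exp over all episodes in the batch
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


queries_2d = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(queries_2d, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(queries_2d, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is close to zero when each query sits next to its own episode and large when the episodes are swapped.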

Because we are using this ranking optimization function, we need to ensure we do not place duplicate queries or episodes in the same training batch, as this will confuse our model when it is told that despite two queries/episodes being the same, one is correct and the other is not.

Sentence transformers enforces this no-duplicates-per-batch rule with the `NoDuplicatesDataLoader`. We can initialize it, alongside a `batch_size` parameter (a larger batch gives each query more in-batch negatives), like so:

In [15]:
from sentence_transformers.datasets import NoDuplicatesDataLoader

batch_size = 64

loader = NoDuplicatesDataLoader(train, batch_size=batch_size)
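
For intuition, the no-duplicates rule can be sketched with a greedy batcher in plain Python. This is not how the library loader is implemented, and `no_duplicate_batches` is a hypothetical helper. It simply defers any pair whose query or episode text already appears in the batch being built.

```python
def no_duplicate_batches(pairs, batch_size):
    """Yield batches in which no query or episode text repeats."""
    remaining = list(pairs)
    while remaining:
        batch, seen, leftover = [], set(), []
        for query, episode in remaining:
            fits = len(batch) < batch_size
            if fits and query not in seen and episode not in seen:
                batch.append((query, episode))
                seen.update((query, episode))
            else:
                # defer this pair to a later batch
                leftover.append((query, episode))
        yield batch
        remaining = leftover


toy = [("q1", "ep A"), ("q2", "ep A"), ("q3", "ep A"), ("q4", "ep B")]
toy_batches = list(no_duplicate_batches(toy, batch_size=2))
print(toy_batches)
```

Unlike this sketch, which also yields smaller trailing batches, a production loader may drop incomplete batches.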

Now we initialize the loss function. As we're optimizing by ranking (as described above), we will use `MultipleNegativesRankingLoss`, known as *MNR loss*.

In [16]:
from sentence_transformers.losses import MultipleNegativesRankingLoss

loss = MultipleNegativesRankingLoss(model)

One final thing before moving on to fine-tuning: we need to set up our training metrics. Spotify describes in-batch metrics, and we will do the same by adding an evaluator to the fit function. Again, sentence transformers provides strong support for this via the `RerankingEvaluator`.

Before initializing the evaluator we need to remove any duplicate episodes, of which there will be plenty (as we created 3 queries per episode).

In [17]:
dedup_eval_pairs = []
seen_eps = set()  # set membership checks are much faster than list checks

for (query, episode) in eval_pairs:
    if episode not in seen_eps:
        seen_eps.add(episode)
        dedup_eval_pairs.append((query, episode))

eval_pairs = dedup_eval_pairs
print(f"{len(eval_pairs)} unique eval pairs")

1001 unique eval pairs


In [18]:
from sentence_transformers.evaluation import RerankingEvaluator

# we must format samples into a list of:
# {'query': '<query>', 'positive': ['<positive>'], 'negative': [<all negatives>]}
eval_set = []
eval_episodes = [pair[1] for pair in eval_pairs]

for i, (query, episode) in enumerate(eval_pairs):
    negatives = eval_episodes[:i] + eval_episodes[i+1:]
    eval_set.append(
        {'query': query, 'positive': [episode], 'negative': negatives}
    )
    
evaluator = RerankingEvaluator(eval_set, mrr_at_k=5, batch_size=batch_size)
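
The `mrr_at_k=5` argument refers to mean reciprocal rank at 5. As a rough sketch of what that metric measures, independent of the evaluator's internals (`mrr_at_k` here is a hypothetical standalone function):

```python
def mrr_at_k(ranked_lists, positives, k=5):
    """Mean reciprocal rank of the positive within the top-k results."""
    total = 0.0
    for ranked, positive in zip(ranked_lists, positives):
        for rank, candidate in enumerate(ranked[:k], start=1):
            if candidate == positive:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(positives)


toy_rankings = [
    ["a", "b", "c"],                 # positive at rank 1 -> 1.0
    ["b", "a", "c"],                 # positive at rank 2 -> 0.5
    ["c", "b", "x", "y", "z", "a"],  # positive outside top 5 -> 0.0
]
toy_mrr = mrr_at_k(toy_rankings, ["a", "a", "a"], k=5)
print(toy_mrr)
```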

Let's check the zero-shot performance of the model.

In [19]:
evaluator(model, output_path='./')

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

0.6827534406474566

We're now ready to fine-tune our model. The Spotify article doesn't give any detail as to the parameters used here, so we will try the typical values of training for *1 epoch* and warming up the learning rate for the first *10%* of steps.

In [20]:
epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    evaluator=evaluator,
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='distiluse-podcast-nq',
    show_progress_bar=True
)



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/3748 [00:00<?, ?it/s]

Batches:   0%|          | 0/16 [00:00<?, ?it/s]
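
To see what the warmup setting does, here is a small sketch of a linear warmup followed by linear decay, assuming the scheduler behaves like the common "warmup linear" schedule. The function returns a multiplier applied to the base learning rate, and `warmup_linear_multiplier` is a hypothetical name.

```python
def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramp up 0 -> 1, then decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


# 3748 iterations for one epoch, as shown in the progress bar above
total_steps = 3748 * 1
warmup = int(total_steps * 0.1)
print(warmup)
print([warmup_linear_multiplier(s, warmup, total_steps)
       for s in (0, warmup // 2, warmup, total_steps)])
```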

---

## Evaluation

For the final evaluation step we want to emulate a more *real-world* scenario. That is, rather than calculating MRR@5 across small batches of data (as done with the evaluation set), we should index many episodes and calculate similar metrics when searching across this larger index.

Earlier we set aside the test data in `test_pairs`; we can use that now.

We will encode episodes using the `model`, and the embeddings will then be indexed in a Pinecone vector database (you can sign up for free [here](https://app.pinecone.io)).

Start by initializing the vector index.

In [38]:
from pinecone import Pinecone, PodSpec

# assumes pinecone-client v3+, where a client object replaces pinecone.init()
pinecone = Pinecone(
    api_key='<<YOUR_API_KEY>>'  # app.pinecone.io
)

# check if an evaluation index already exists, if not, create it
if 'evaluation' not in pinecone.list_indexes().names():
    pinecone.create_index(
        'evaluation', dimension=model.get_sentence_embedding_dimension(),
        metric='cosine',
        # a pod-based index is an assumption here, adjust to your project
        spec=PodSpec(environment='us-west1-gcp')
    )
    
# now connect to the index
index = pinecone.Index('evaluation')

Before indexing our test data, we should remove duplicates (as we did before for the eval set).

In [22]:
dedup_test_pairs = []
seen_eps = set()  # set membership checks are much faster than list checks

for (query, episode) in test_pairs:
    if episode not in seen_eps:
        seen_eps.add(episode)
        dedup_test_pairs.append((query, episode))

test_pairs = dedup_test_pairs
print(f"{len(test_pairs)} unique test pairs")

18579 unique test pairs


Now we can begin encoding and indexing embeddings. 

In [23]:
to_upsert = []
eps_seen = []
queries = []
eps_batch = []
id_batch = []
upsert_batch = 64

for i, (query, episode) in enumerate(tqdm(test_pairs)):
    # do this to avoid episode duplication in index
    if episode not in eps_seen:
        eps_seen.append(episode)
        queries.append((query, str(i)))
        eps_batch.append(episode)
        id_batch.append(str(i))
    # on reaching upsert_batch we encode and upsert (a final partial batch is not upserted)
    if len(eps_batch) == upsert_batch:
        embeds = model.encode(eps_batch).tolist()
        # insert to index
        index.upsert(vectors=list(zip(id_batch, embeds)))
        # refresh batch
        eps_batch = []
        id_batch = []
    
# (optional) take a look at the index stats
index.describe_index_stats()

  0%|          | 0/18579 [00:00<?, ?it/s]

{'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18560}}}
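
The index holds 18560 vectors rather than 18579 because the loop above only upserts when a batch reaches 64 items, so the last partial batch appears never to be sent. A small sketch of the arithmetic, plus a hypothetical `chunks` helper that would also yield the final partial batch:

```python
def chunks(items, size):
    """Yield consecutive slices of items, including a shorter final one."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


n_items, upsert_batch = 18579, 64
# what a "full batches only" loop sends
full_only = (n_items // upsert_batch) * upsert_batch
# what chunking with a final partial batch would send
sizes = [len(c) for c in chunks(list(range(n_items)), upsert_batch)]
print(full_only, sum(sizes), sizes[-1])
```

Iterating over `chunks(...)` instead would cover every episode, at the cost of the counts no longer matching the output shown above.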

In [24]:
recall_at_k = []

for (query, i) in queries:
    # encode the query to a single embedding (a flat list of floats)
    xq = model.encode(query).tolist()
    res = index.query(vector=xq, top_k=30)
    # get IDs of the returned matches
    ids = [x['id'] for x in res['matches']]
    recall_at_k.append(1 if i in ids else 0)

In [25]:
sum(recall_at_k)/len(recall_at_k)

0.883309112438775
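
The metric computed here is recall@30 with one relevant episode per query: the fraction of queries whose expected ID appears among the returned IDs. A minimal sketch on toy IDs (`recall_at_k_score` is a hypothetical name). Note that the index IDs are strings, so the expected IDs must be strings too.

```python
def recall_at_k_score(retrieved_ids, expected_ids):
    """Fraction of queries whose expected ID is among the retrieved IDs."""
    hits = [
        1 if expected in retrieved else 0
        for retrieved, expected in zip(retrieved_ids, expected_ids)
    ]
    return sum(hits) / len(hits)


toy_score = recall_at_k_score([["3", "7"], ["1", "2"]], ["7", "9"])
# comparing an int against string IDs never matches
int_score = recall_at_k_score([["1"]], [1])
str_score = recall_at_k_score([["1"]], [str(1)])
print(toy_score, int_score, str_score)
```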

So far this looks great, but it assumes that our synthetic queries are perfect, and they are not. Instead, we need to measure model performance on more realistic queries, which in this case we must manually create. Let's take a few episodes and, for each, manually write a query that we believe should match it.

In [34]:
curated = {
    "funny show about after uni party house": 1,
    "interview with cookbook author": 8,
    "eat better during xmas holidays": 14,
    "superhero film analysis": 27,
    "how to tell more engaging stories": 33,
    "how to make money with online content": 34,
    "why is technology so addictive": 38
}

In [39]:
recall_at_k = []

for query, i in curated.items():
    # encode the query to a single embedding (a flat list of floats)
    xq = model.encode(query).tolist()
    res = index.query(vector=xq, top_k=30)
    # get IDs of the returned matches
    ids = [x['id'] for x in res['matches']]
    # index IDs are strings, so cast the curated integer ID before comparing
    recall_at_k.append(1 if str(i) in ids else 0)
    
sum(recall_at_k)/len(recall_at_k)

0.5714285714285714

Let's compare this to the zero-shot performance, for which we will need to create a new index.

In [42]:
from pinecone import PodSpec  # assumes pinecone-client v3+

zero_model = SentenceTransformer(
    'distiluse-base-multilingual-cased-v2'
).to(device)

# check if a zero-shot evaluation index already exists, if not, create it
if 'eval-zero' not in pinecone.list_indexes().names():
    pinecone.create_index(
        'eval-zero', dimension=zero_model.get_sentence_embedding_dimension(),
        metric='cosine',
        # a pod-based index is an assumption here, adjust to your project
        spec=PodSpec(environment='us-west1-gcp')
    )
    
# now connect to the index
index = pinecone.Index('eval-zero')

In [43]:
to_upsert = []
eps_seen = []
queries = []
eps_batch = []
id_batch = []
upsert_batch = 64

for i, (query, episode) in enumerate(tqdm(test_pairs)):
    # do this to avoid episode duplication in index
    if episode not in eps_seen:
        eps_seen.append(episode)
        queries.append((query, str(i)))
        eps_batch.append(episode)
        id_batch.append(str(i))
    # on reaching upsert_batch we encode and upsert (a final partial batch is not upserted)
    if len(eps_batch) == upsert_batch:
        embeds = zero_model.encode(eps_batch).tolist()
        # insert to index
        index.upsert(vectors=list(zip(id_batch, embeds)))
        # refresh batch
        eps_batch = []
        id_batch = []
    
# (optional) take a look at the index stats
index.describe_index_stats()

  0%|          | 0/18579 [00:00<?, ?it/s]

{'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18560}}}

In [44]:
recall_at_k = []

for query, i in curated.items():
    # encode the query to a single embedding (a flat list of floats)
    xq = zero_model.encode(query).tolist()
    res = index.query(vector=xq, top_k=30)
    # get IDs of the returned matches
    ids = [x['id'] for x in res['matches']]
    # index IDs are strings, so cast the curated integer ID before comparing
    recall_at_k.append(1 if str(i) in ids else 0)
    
sum(recall_at_k)/len(recall_at_k)

0.2857142857142857

That's a large difference: recall@30 on the curated queries doubles, from roughly 0.29 to 0.57. Despite not being able to follow Spotify's training process exactly due to a lack of data, we were able to work with synthetic queries only and produce an impressive performance gain.

## Delete the Indexes

Once you're done with the indexes, delete them to save resources.

In [None]:
pinecone.delete_index('evaluation')
pinecone.delete_index('eval-zero')

---