# Building an Approximate Nearest Neighbours Index

This tutorial shows how to build an approximate nearest neighbours (ANN) index for a given set of embeddings.

We use Spotify's [Annoy](https://github.com/spotify/annoy) library for this task.

This tutorial covers the following steps:

1. Build the Annoy index from the embeddings saved in the TSV file
2. Get track information from BigQuery
3. Use the index to find tracks similar to a given one

<a href="https://colab.research.google.com/github/ksalama/data2cooc2emb2ann/blob/master/03-Building_an_Approximate_Nearest_Neighbours_Index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [None]:
# !pip install -r requirements.txt

In [22]:
# If using COLAB
try:
    from google.colab import auth
    auth.authenticate_user()
except ImportError:
    # Not running in Colab, so skip the interactive authentication.
    pass

In [1]:
import os
from annoy import AnnoyIndex
from datetime import datetime
from google.cloud import bigquery

In [6]:
PROJECT_ID = 'ksalama-cloudml'
WORKSPACE = './workspace'
embeddings_file_path = os.path.join(WORKSPACE, 'embeddings.tsv')
index_file_path = os.path.join(WORKSPACE, 'embed-ann.index')

## 1. Build Annoy Index

In [13]:
def build_embeddings_index(embeddings_file, embedding_size, num_trees):
    """Build an Annoy index from a TSV file whose rows are: item_id, v1, v2, ..."""
    annoy_index = AnnoyIndex(embedding_size, metric='angular')
    idx2item_mapping = dict()
    item2idx_mapping = dict()

    idx = 0

    with open(embeddings_file) as embedding_file:
        for line in embedding_file:
            parts = line.rstrip('\n').split('\t')
            item_id = parts[0]
            embedding = [float(v) for v in parts[1:]]

            # Annoy identifies items by integer index, so keep both lookup directions.
            idx2item_mapping[idx] = item_id
            item2idx_mapping[item_id] = idx

            annoy_index.add_item(idx, embedding)
            idx += 1

    print("{} items were added to the index".format(idx))
    annoy_index.build(n_trees=num_trees)
    print("Index is built")
    return annoy_index, idx2item_mapping, item2idx_mapping
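
The index above is created with `metric='angular'`. As a rough illustration of what that metric measures, the sketch below computes the Euclidean distance between two L2-normalised vectors, `sqrt(2 - 2*cos(u, v))`, which is how the Annoy README describes its angular distance. The helper name `angular_distance` is hypothetical and not part of Annoy.

```python
import math


def angular_distance(u, v):
    """Euclidean distance between the L2-normalised vectors: sqrt(2 - 2*cos(u, v))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumes neither vector is all zeros; a zero vector has no direction.
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.sqrt(2.0 - 2.0 * cosine)


print(angular_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(angular_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value depends only on direction, not on vector length: it is 0 for vectors pointing the same way and reaches its maximum of 2 for opposite vectors.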


In [14]:
num_trees = 100
embedding_size = 32

index, idx2item_mapping, item2idx_mapping = build_embeddings_index(
    embeddings_file_path, embedding_size, num_trees)

39195 items were added to the index
Index is built
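
The Setup cell defines `index_file_path`, but the notebook never writes the index to it. The sketch below shows one way the built index could be persisted together with the id mapping, which Annoy does not store itself. `AnnoyIndex.save` and `AnnoyIndex.load` are real Annoy methods; the helper names and the `.mapping.json` sidecar file are assumptions made for illustration.

```python
import json
import os


def mapping_path_for(index_path):
    # Hypothetical naming convention: keep the mapping next to the index file.
    return index_path + '.mapping.json'


def save_index_with_mapping(annoy_index, idx2item, index_path):
    """Write the built Annoy index plus a JSON sidecar holding the item ids."""
    os.makedirs(os.path.dirname(index_path) or '.', exist_ok=True)
    annoy_index.save(index_path)
    # Position i in the list is the item id for Annoy index i (indices run 0..n-1).
    item_ids = [idx2item[i] for i in range(len(idx2item))]
    with open(mapping_path_for(index_path), 'w') as f:
        json.dump(item_ids, f)


def load_mapping(index_path):
    """Rebuild both lookup dictionaries from the JSON sidecar."""
    with open(mapping_path_for(index_path)) as f:
        item_ids = json.load(f)
    idx2item = dict(enumerate(item_ids))
    item2idx = {item_id: idx for idx, item_id in idx2item.items()}
    return idx2item, item2idx


def load_index_with_mapping(index_path, embedding_size, metric='angular'):
    from annoy import AnnoyIndex  # imported here so the helpers above work without Annoy
    annoy_index = AnnoyIndex(embedding_size, metric)
    annoy_index.load(index_path)  # the size and metric must match those used at build time
    idx2item, item2idx = load_mapping(index_path)
    return annoy_index, idx2item, item2idx
```

With the notebook's variables, this would be called as `save_index_with_mapping(index, idx2item_mapping, index_file_path)` and later `load_index_with_mapping(index_file_path, embedding_size)`.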


## 2. Get track information from BigQuery

In [16]:
track_ids = ",".join(item2idx_mapping.keys())

query = '''
    SELECT DISTINCT
      tracks_data_id AS track_id,
      tracks_data_title AS track_title, 
      tracks_data_artist_name AS artist_name, 
      tracks_data_album_title AS album_title 
    FROM 
      `bigquery-samples.playlists.playlist`
    WHERE
        tracks_data_id IN ({})
'''.format(track_ids)
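
The query above interpolates ids read from the TSV file straight into the SQL text. A minimal sketch of a guard, assuming track ids are always plain non-negative integers (as in the sample rows), is to validate each id before joining. `build_id_list` is a hypothetical helper, not part of the BigQuery client.

```python
def build_id_list(track_ids):
    """Join ids into a comma-separated list, rejecting anything that is not all digits."""
    cleaned = []
    for track_id in track_ids:
        track_id = str(track_id).strip()
        # Assumption: valid ids consist only of ASCII digits.
        if not (track_id.isascii() and track_id.isdigit()):
            raise ValueError('Unexpected track id: {!r}'.format(track_id))
        cleaned.append(track_id)
    return ','.join(cleaned)


print(build_id_list(['3637082', '250011']))
```

In the notebook this would replace the plain join: `track_ids = build_id_list(item2idx_mapping.keys())`.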

In [17]:
bq_client = bigquery.Client(project=PROJECT_ID)
query_job = bq_client.query(query)
results = query_job.result().to_dataframe()
display(results.head())



|   | track_id | track_title | artist_name | album_title |
|---|---|---|---|---|
| 0 | 3637082 | He's Got The Whole World In His Hands |  | Lady Blue Part 1 |
| 1 | 250011 | L'amour dans la rue | K | L'arbre rouge |
| 2 | 5447851 | Est-Ce Que C'est Ça | M | Mister Mystère |
| 3 | 5447858 | Amssétou | M | Mister Mystère |
| 4 | 3355751 | Le blues de soustons (live) | M | le tour de m |


## 3. Find similar items

In [19]:
def get_similar_items(item_id, num_matches=10):
    # Translate the track id into its Annoy index.
    idx = item2idx_mapping[item_id]

    # search_k=-1 keeps Annoy's default search budget.
    similar_idx = index.get_nns_by_item(
        idx, num_matches, search_k=-1, include_distances=False)

    # Map the Annoy indices back to track ids.
    similar_item_ids = [idx2item_mapping[i] for i in similar_idx]

    # Rows come back in DataFrame order, not in similarity order.
    similar_items = results[results['track_id'].isin(similar_item_ids)]
    return similar_items
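
Filtering with `isin` returns rows in DataFrame order, so the similarity ranking produced by Annoy is lost. The sketch below shows one possible way to restore it with pandas. The helper `order_by_rank` and the added `rank` column are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def order_by_rank(results, ranked_ids, id_column='track_id'):
    """Return the matching rows sorted by their position in ranked_ids."""
    rank = {item_id: position for position, item_id in enumerate(ranked_ids)}
    matches = results[results[id_column].isin(list(rank))].copy()
    matches['rank'] = matches[id_column].map(rank)
    return matches.sort_values('rank')


# Toy data standing in for the BigQuery results.
toy_results = pd.DataFrame({
    'track_id': ['a', 'b', 'c'],
    'track_title': ['Track A', 'Track B', 'Track C'],
})
print(order_by_rank(toy_results, ['c', 'a']))
```

Inside `get_similar_items`, the same idea would apply as `order_by_rank(results, similar_item_ids)`, assuming the ids in `results['track_id']` have the same type as the keys of `item2idx_mapping`.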

In [21]:
get_similar_items('5447851')

|   | track_id | track_title | artist_name | album_title |
|---|---|---|---|---|
| 2 | 5447851 | Est-Ce Que C'est Ça | M | Mister Mystère |
| 17888 | 555438 | Il Me Dit Que Je Suis Belle | Liane Foly;Natasha St-Pier;Julie Zenatti;Jenifer | La Foire Aux Enfoires |
| 20724 | 4311052 | Next Time | Soan | Next Time |
| 24283 | 1123687 | Bad Medicine | Bon Jovi | Cross Road |
| 27358 | 62723999 | When I Was Your Man | Bruno Mars | Unorthodox Jukebox |
| 31206 | 797541 | These Streets | Paolo Nutini | These Streets |
| 31830 | 3774054 | Colours | Calvin Harris | I Created Disco (Bonus Version) |
| 33496 | 2170512 | Le Coeur Grenadine | Laurent Voulzy | Belle Ile En Mer |
| 36972 | 2288566 | Go With The Flow | Queens of the Stone Age | Go With The Flow |
| 38189 | 3368674 | Quello Che Non C'è | Afterhours | Quello Che Non C'è |
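
Because the index is approximate, it can be useful to compare its answers with an exact search on a small sample. The sketch below is a brute-force baseline in plain Python that ranks items by cosine similarity, which gives the same ordering as angular distance. The function names and the `embeddings` dictionary (item id to vector) are hypothetical.

```python
import math


def exact_nearest(query_id, embeddings, num_matches=10):
    """Rank every item by cosine similarity to the query (linear scan per query)."""
    def normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    query = normalise(embeddings[query_id])
    scored = []
    for item_id, vec in embeddings.items():
        unit = normalise(vec)
        scored.append((sum(a * b for a, b in zip(query, unit)), item_id))
    # Highest cosine similarity first, i.e. smallest angular distance first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:num_matches]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


toy_embeddings = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}
print(exact_nearest('a', toy_embeddings, num_matches=3))
```

Feeding the ids returned by the Annoy index and the ids from `exact_nearest` into `recall_at_k` gives a rough indication of whether `num_trees` is large enough for the data.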
