# Two - Tower Retreival Model

### Key resources:
* Many pages [here](https://www.tensorflow.org/recommenders/examples/deep_recommenders) include great techniques to build custom TFRS Models

### Goals:
* Show how to model off of most data types 
  * (String, Existing Embeddings (vectors), 
  * Floats (Normalized), 
  * Categorical with vocab, 
  * High Dim Categorical (Embed)
* Leverage class templates to create custom 2 Tower Models quick/easy

## SPOTIFY Create the tensorflow.io interface for the event and product table in Bigquery
Best practices from Google are in this blog post

In [1]:
# !gsutil mb -l us-central1 gs://spotify-tfrecords-blog

In [11]:
# set variables
SEED = 41781897
PROJECT_ID = 'hybrid-vertex'
DROP_FIELDS = ['modified_at', 'row_number', 'seed_playlist_tracks']
TF_RECORDS_DIR = 'gs://spotify-tfrecords-blog'
BATCH_SIZE = 10

#### Quick counts on training data



#### Quick counts on the training records for track

In [2]:
%%bigquery TOTAL_PLAYLISTS
select count(1) from hybrid-vertex.spotify_train_3.train

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 563.30query/s] 
Downloading: 100%|██████████| 1/1 [00:01<00:00,  1.02s/rows]


In [3]:
TOTAL_PLAYLISTS = TOTAL_PLAYLISTS.values[0][0]
TOTAL_PLAYLISTS

65346428

### Set the tf.io pipelines function from bigquery

[Great blog post here on it](https://towardsdatascience.com/how-to-read-bigquery-data-from-tensorflow-2-0-efficiently-9234b69165c8)

In [6]:
# !pip install tensorflow-recommenders -q --user

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx-bsl 1.8.0 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.49.0 which is incompatible.
tfx-bsl 1.8.0 requires pyarrow<6,>=1, but you have pyarrow 8.0.0 which is incompatible.
tensorflow-transform 1.8.0 requires pyarrow<6,>=1, but you have pyarrow 8.0.0 which is incompatible.
tensorflow-transform 1.8.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<2.9,>=1.15.5, but you have tensorflow 2.9.0rc2 which is incompatible.
google-cloud-recommendations-ai 0.2.0 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.8.0 which is incompatible.
apache-beam 2.39.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.5.1 which is incompatible.
apache-beam 2.39.0 requires httplib2<0.20.0,>=0.8, but you have httplib2 

In [7]:
pip install -U tensorflow-io==0.15.0 --user 

Collecting tensorflow-io==0.15.0
  Downloading tensorflow_io-0.15.0-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m22.3/22.3 MB[0m [31m65.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: tensorflow-io
  Attempting uninstall: tensorflow-io
    Found existing installation: tensorflow-io 0.16.0
    Uninstalling tensorflow-io-0.16.0:
      Successfully uninstalled tensorflow-io-0.16.0
Successfully installed tensorflow-io-0.15.0
Note: you may need to restart the kernel to use updated packages.


In [8]:
import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import warnings
warnings.filterwarnings("ignore") #do this b/c there's an info-level bug that can safely be ignored
import json
import tensorflow as tf
# import tensorflow_recommenders as tfrs
import datetime
from tensorflow.python.lib.io import file_io
from tensorflow.train import BytesList, Feature, FeatureList, Int64List, FloatList
from tensorflow.train import SequenceExample, FeatureLists



# def bq_to_tfdata(client, row_restriction, table_id, col_names, dataset, batch_size=BATCH_SIZE):
#     TABLE_ID = table_id
#     COL_NAMES = col_names
#     DATASET = dataset
#     bqsession = client.read_session(
#         "projects/" + PROJECT_ID,
#         PROJECT_ID, TABLE_ID, DATASET,
#         COL_NAMES,
#         requested_streams=2,
#         row_restriction=row_restriction)
#     dataset = bqsession.parallel_read_rows()
#     return dataset.prefetch(1).shuffle(batch_size*10).batch(batch_size)

2022-06-30 17:29:37.656877: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2022-06-30 17:29:37.657154: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA


## Get the song metadata

To get a pipeline working we need the metadata for the table along with the table information. The following functions are helpers that give us the metadata into the proper types for `tf`


For each table id, programatically get
* Column names
* Column types

## Metadata dictionary to translate from BQ to tensorflow

From the DDL we are going to get the types for use in a  to create a `BigQueryReadSession` from `tensorflow_io.bigquery` 

In [16]:
bq_2_tf_dict = {'name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
 'collaborative': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'pid': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'description': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'duration_ms_playlist': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'pid_pos_id': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'pos': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'artist_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'track_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'artist_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'track_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'album_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'duration_ms_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
#  'album_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'track_pop_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'artist_pop_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
 'artist_genres_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
#  'artist_followers_seed': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
#  'pos_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'artist_name_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'artist_uri_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'track_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'track_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'album_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'album_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
#  'duration_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
#  'duration_ms_seed_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
#  'n_songs': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'num_artists': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
#  'num_albums': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
# 'pos_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.int64},
# 'track_uri_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
# 'track_name_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
# 'duration_ms_seed_songs_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},
# 'album_name_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
# 'artist_pop_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},
# 'artists_followers_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},              
# 'track_pop_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.int64},  
               }

In [17]:
client = BigQueryClient()
batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'train_flatten', 'spotify_train_3',
        bq_2_tf_dict,
        requested_streams=2,)
dataset = bqsession.parallel_read_rows()
dataset = dataset.prefetch(1).shuffle(batch_size*10).batch(batch_size)

In [18]:
for x in dataset.take(1):
    print(x)

OrderedDict([('artist_genres_pl', <tf.Tensor: shape=(1, 22), dtype=string, numpy=
array([[b"'bachata', 'latin'",
        b"'latin', 'latin hip hop', 'reggaeton', 'trap latino'",
        b"'latin', 'latin hip hop', 'reggaeton', 'reggaeton flow', 'trap latino'",
        b'',
        b"'latin', 'reggaeton', 'reggaeton colombiano', 'trap latino'",
        b"'cubaton', 'latin', 'latin hip hop', 'latin pop'",
        b"'latin', 'latin hip hop', 'reggaeton', 'reggaeton flow'",
        b"'chicano rap', 'pop rap'", b"'dominican pop'",
        b"'latin', 'latin hip hop', 'reggaeton', 'reggaeton flow', 'trap latino'",
        b"'latin', 'latin hip hop', 'reggaeton', 'reggaeton flow', 'trap latino'",
        b"'latin', 'latin hip hop', 'reggaeton', 'reggaeton flow', 'trap latino'",
        b"'latin', 'modern salsa', 'salsa', 'salsa peruana', 'salsa puertorriquena', 'tropical'",
        b"'latin', 'latin hip hop', 'reggaeton', 'trap latino'",
        b"'bachata', 'latin', 'latin hip hop', 'latin po

### Confirm matching data and order for arrays

In [42]:
%%bigquery
select * from `hybrid-vertex.spotify_mpd.ordered_position_training` where pid_pos_id = '11834-10'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 360.40query/s]                          
Downloading: 100%|██████████| 1/1 [00:01<00:00,  1.14s/rows]


Unnamed: 0,pid_pos_id,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,num_edits,num_artists,...,duration_ms,album_name,pos_seed,pos_artist_name,track_uri_seed,artist_uri_seed,track_name_seed,album_uri_seed,duration_ms_seed,album_name_seed
0,11834-10,Throwback,False,11834,1499126400,75,70,1,7,51,...,170626.0,What Was I Thinking Of,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[Guns N' Roses, Big Sounds Band, The Bangles, ...","[spotify:track:7gXdAqJLCa5aYUeLVxosOz, spotify...","[spotify:artist:3qm84nBOXUEQ2vnTfUTTFC, spotif...","[Knockin' On Heaven's Door, Karma Chameleon (f...","[spotify:album:5NL0MCTSbQtO13G62ofWAf, spotify...","[336000.0, 237089.0, 204560.0, 238266.0, 14937...","[Use Your Illusion II, 80s Songs from the Big ..."


### Run adapts, and preprocess string lookups

In [46]:
for k in bq_2_tf_dict:
    print(k)

name
collaborative
pid
description
duration_ms_playlist
pid_pos_id
pos
artist_name_seed
track_uri_seed
artist_uri_seed
track_name_seed
album_uri_seed
duration_ms_seed
album_name_seed
track_pop_seed
artist_pop_seed
artist_genres_seed
artist_followers_seed
pos_seed_track
artist_name_seed_track
artist_uri_seed_track
track_name_seed_track
track_uri_seed_track
album_name_seed_track
album_uri_seed_track
duration_seed_track
duration_ms_seed_pl
n_songs
num_artists
num_albums
pos_seed_pl
track_uri_seed_pl
track_name_seed_pl
duration_ms_seed_songs_pl
album_name_seed_pl
artist_pop_seed_pl
artists_followers_seed_pl
track_pop_seed_pl


# Organize fields by transforms

## Stringlookup get vocab 
- track_uri_seed
- artist_uri_seed
- album_uri_seed
- artist_uri_seed_track
- track_uri_seed_track
- album_uri_seed_track
- track_uri_seed_pl

## TextVectorization (NLPish)
- name
- description
- artist_name_seed
- artist_name_seed_track
- track_name_seed_track
- album_name_seed_track
- track_name_seed_pl
- album_name_seed_pl
- album_name_seed
- artist_genres_seed

## Rich features
- collaborative
- duration_ms_playlist
- track_name_seed
- track_pop_seed
- artist_pop_seed
- duration_seed_track
#### --- playlist features
- n_songs
- num_artists
- num_albums
- duration_ms_seed_pl
- artist_pop_seed_pl
- artists_followers_seed_pl
- track_pop_seed_pl
- artist_followers_seed
- duration_ms_seed_songs_pl
- duration_ms_seed

#not used
#pid
##Identifier pid_pos_id
##POS id not used, infering order in dataset pos_seed_track
##No POS pos_seed_pl
#pos

#### Loop over values to find uniques to save to a vocab file

We will save this in `gs://spotify-assets-blog`

In [54]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

vocab_lookup_feats = [
'track_uri_seed',
'artist_uri_seed',
'album_uri_seed',
# 'artist_uri_seed_track',
# 'track_uri_seed_track',
# 'album_uri_seed_track',
# 'track_uri_seed_pl', # ragged playlist
]

vocab_query = [f"select distinct {field} from `hybrid-vertex.spotify_train_3.train_flatten`" for field in vocab_lookup_feats]
vocab_dict = {}
for field, query in zip(vocab_lookup_feats, vocab_query):
    data = client.query(query).result()
    vocab_dict.update({field: list(d[0] for d in data)})

In [56]:
### quick counts

for k in vocab_dict:
    print(f"{k} counts: {len(vocab_dict[k])}")

track_uri_seed counts: 2249561
artist_uri_seed counts: 294110
album_uri_seed counts: 730377


### Quick query to find the max length of repeated fields

Find max counts to pad ragged

In [20]:
%%bigquery
with counts as (select 	pid_pos_id, count(distinct x) as distinct_counts from `hybrid-vertex.spotify_train_3.train_flatten` inner join UNNEST(track_uri_seed_pl) x group by 1)
select max(counts.distinct_counts) from counts

Query complete after 0.00s: 100%|██████████| 16/16 [00:00<00:00, 7311.13query/s]                       
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.01rows/s]


Unnamed: 0,f0_
0,341


#### Recycling Below

In [None]:
def ragged_unique_collection(ds, field):
    data = np.array([''])
    for x in ds.map(lambda x: x[field]).batch(1):
        y = np.unique(np.concatenate(np.concatenate(x.numpy())))
        data = np.concatenate([data, y])
    data = np.unique(data)
    return(data)

track_uri_seed_pl_vocab = ragged_unique_collection(dataset, 'track_uri_seed_pl') #unique values = 

Cause: could not parse the source code:

    for x in ds.map(lambda x: x[field]).batch(1):

This error may be avoided by creating the lambda in a standalone statement.

Cause: could not parse the source code:

    for x in ds.map(lambda x: x[field]).batch(1):

This error may be avoided by creating the lambda in a standalone statement.



KeyboardInterrupt: 

In [None]:
unique_list

In [None]:
hi

In [None]:
!gsutil mb -l us-central1 gs://spotify-assets-blog

### Save the arrays to google storage

In [None]:
### save above arrays - use naming convention

### Text Vectorization section
Loop over and save the layers to a bucket

In [None]:
text_vector_feats = ['name',
'description',
'artist_name_seed',
'artist_name_seed_track',
'track_name_seed_track',
'album_name_seed_track',
'track_name_seed_pl',
'album_name_seed_pl',
'album_name_seed',
'artist_genres_seed',
                    ]

MAX_TOKENS = 100_000

vectorizors = []

def create_vectorizor_layers(ds, field, n_tokens, ngrams=3):
    name = f"{field}-{n_tokens}-{ngrams}"
    tv_layer = tf.keras.layers.TextVectorization(max_tokens=n_tokens, name=name, ngrams=ngrams)
    return(tv_layer.adapt(ds.map(lambda x: x[field]).batch(1000))
           
for feat in text_vector_feats:
           vectorizors.append(create_vectorizor_layers(dataset, feat, MAX_TOKENS)
    

In [64]:
for x in dataset.take(1):
    print(x)

### Create a function to process data to new ds using map - then write DS to gcs

In [11]:
# ### Vocab to get string lookups
# import numpy as np
# import keras 

# vocab_lookup_feats = [
# 'track_uri_seed',
# 'artist_uri_seed',
# 'album_uri_seed',
# 'artist_uri_seed_track',
# 'track_uri_seed_track',
# 'album_uri_seed_track',
# # 'track_uri_seed_pl', # ragged playlist
# ]

# def get_unique_np(ds, field) -> np.array:
#     unique = np.unique(np.concatenate(list(ds.map(lambda x: x[field]).batch(1000))))
#     return(unique)

# unique_list = []
# for feat in vocab_lookup_feats:
#     unique_list.append(get_unique_np(dataset, feat))
# text_vector_feats

Using TensorFlow backend.


KeyboardInterrupt: 