# Two-Tower Retrieval Model

### Key resources:
* Many of the pages [here](https://www.tensorflow.org/recommenders/examples/deep_recommenders) cover useful techniques for building custom TFRS models

### Goals:
* Show how to model most data types:
  * Strings and existing embeddings (vectors)
  * Floats (normalized)
  * Categorical with a vocabulary
  * High-dimensional categorical (embedded)
* Leverage class templates to create custom two-tower models quickly and easily

## Spotify: create the tensorflow-io interface for the event and product tables in BigQuery
Best practices from Google are in this blog post

In [1]:
# !gsutil mb -l us-central1 gs://spotify-tfrecords-blog

In [1]:
# set variables
DROPOUT = False
DROPOUT_RATE = 0.2
EMBEDDING_DIM = 64
MAX_TOKENS = 100_000
BATCH_SIZE = 256
ARCH = [128, 64]
NUM_EPOCHS = 1
SEED = 41781897
PROJECT_ID = 'hybrid-vertex'
DROP_FIELDS = ['modified_at', 'row_number', 'seed_playlist_tracks']
N_RECORDS_PER_TFRECORD_FILE = 15000  # roughly 100 MB per file
TF_RECORDS_DIR = 'gs://spotify-tfrecords-blog'

#### Quick counts on the training records

In [2]:
%%bigquery TOTAL_PLAYLISTS
select count(1) from hybrid-vertex.spotify_train_3.train



In [3]:
TOTAL_PLAYLISTS = TOTAL_PLAYLISTS.values[0][0]
TOTAL_PLAYLISTS

65346428

### Define the tensorflow-io pipeline function for BigQuery

[Great blog post here on it](https://towardsdatascience.com/how-to-read-bigquery-data-from-tensorflow-2-0-efficiently-9234b69165c8)

In [4]:
# !pip install tensorflow-recommenders -q --user

In [5]:
!pip install tensorflow-io --upgrade --user



In [6]:
import datetime
import json
import warnings

import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

# Suppress warnings: there is an info-level bug that can safely be ignored.
warnings.filterwarnings("ignore")

import tensorflow_recommenders as tfrs
from tensorflow.python.lib.io import file_io
from tensorflow.train import BytesList, Feature, FeatureList, Int64List, FloatList
from tensorflow.train import SequenceExample, FeatureLists



def bq_to_tfdata(client, row_restriction, table_id, col_names, dataset, batch_size=BATCH_SIZE):
    """Read a BigQuery table into a shuffled, batched tf.data pipeline."""
    bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, table_id, dataset,
        col_names,
        requested_streams=2,
        row_restriction=row_restriction or "")  # "" applies no row filter
    rows = bqsession.parallel_read_rows()
    # Shuffle and batch first; prefetch last so loading overlaps with consumption.
    return rows.shuffle(batch_size * 10).batch(batch_size).prefetch(1)

## Get the song metadata

To get a pipeline working, we need the table's metadata along with the table information. The following helper functions convert that metadata into the proper `tf` types.

For each table id, programmatically get:
* Column names
* Column types

In [7]:
%%bigquery schema
SELECT * FROM hybrid-vertex.spotify_train_3.INFORMATION_SCHEMA.TABLES
where table_name in ('train');



In [8]:
schema # we will get the fields out of the ddl field

Unnamed: 0,table_catalog,table_schema,table_name,table_type,is_insertable_into,is_typed,creation_time,base_table_catalog,base_table_schema,base_table_name,snapshot_time_ms,ddl,default_collation_name
0,hybrid-vertex,spotify_train_3,train,BASE TABLE,YES,NO,2022-06-24 15:53:36.907000+00:00,,,,NaT,CREATE TABLE `hybrid-vertex.spotify_train_3.tr...,


## Helper functions to pull metadata from ddl statements

From the DDL we extract the field names and types needed to create a `BigQueryReadSession` from `tensorflow_io.bigquery`.

In [9]:
# Function to convert a BigQuery type string to a tf data type

def conv_dtype_to_tf(dtype_str):
    if dtype_str == 'FLOAT64':
        return dtypes.float64
    elif dtype_str == 'INT64':
        return dtypes.int64
    else:
        # STRING, ARRAY<STRING>, and anything unrecognized are read as strings
        return dtypes.string

def get_metadata_from_ddl(ddl, drop_field=None):
    drop_field = drop_field or []  # allow calling without fields to drop
    fields = []
    types = []
    ddl = ddl.values[0]
    for line in ddl.splitlines():
        if line[:1] == ' ':  # only indented lines hold field definitions
            line = line.replace(',', '')  # drop commas
            # lines look like '  name TYPE', so index 2 is the name and 3 the type
            space_delim = line.split(' ')
            if space_delim[2] not in drop_field:
                fields.append(space_delim[2])
                types.append(conv_dtype_to_tf(space_delim[3]))
    return fields, types


playlist_fields, playlist_types = get_metadata_from_ddl(schema.ddl[schema.table_name == 'train'], DROP_FIELDS)

# Target format for the read session's field dictionary:
# {"field_a_name": {"mode": "repeated", "output_type": dtypes.int64}}

In [10]:
# Function to convert string type representation to tf data types

def conv_dtype_to_tf(dtype_str):
    if dtype_str == 'FLOAT64':
        return dtypes.float64
    elif dtype_str == 'INT64':
        return dtypes.int64
    elif dtype_str == 'ARRAY<INT64>':
        return dtypes.int64
    elif dtype_str == 'ARRAY<FLOAT64>':
        return dtypes.float64
    else: 
        return dtypes.string

def get_metadata_from_ddl(ddl, drop_field=None):
    drop_field = drop_field or []  # allow calling without fields to drop
    field_dict = {}
    ddl = ddl.values[0]
    for line in ddl.splitlines():
        if line[:1] == ' ':  # only indented lines hold field definitions
            line = line.replace(',', '')  # drop commas
            space_delim = line.split(' ')
            name, bq_type = space_delim[2], space_delim[3]
            if name in drop_field:
                continue
            # builds {"field_a_name": {"mode": ..., "output_type": dtypes.int64}}
            if 'ARRAY' in bq_type:
                mode = BigQueryClient.FieldMode.REPEATED
            else:
                mode = BigQueryClient.FieldMode.NULLABLE
            field_dict[name] = {"mode": mode, "output_type": conv_dtype_to_tf(bq_type)}
    return field_dict


playlist_fields_dict = get_metadata_from_ddl(schema.ddl[schema.table_name == 'train'], DROP_FIELDS) 
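
The helpers above split each DDL line on single spaces, so they depend on the exact two-space indentation of the column lines. As a rough sketch of an alternative, a regular expression can pick out the column name and type regardless of how much whitespace surrounds them. The `SAMPLE_DDL` string, `COLUMN_RE`, and `parse_ddl_columns` below are hypothetical names made up for illustration, and the sketch only handles simple `name TYPE` lines, not nested `STRUCT` definitions.

```python
import re

# Hypothetical DDL in the general shape of INFORMATION_SCHEMA.TABLES.ddl.
SAMPLE_DDL = """CREATE TABLE `my-project.my_dataset.train`
(
  name STRING,
  pid INT64,
  artist_genres_seed ARRAY<STRING>,
  duration_ms_seed FLOAT64,
  modified_at INT64
);"""

# An indented line holding a column name followed by its type.
COLUMN_RE = re.compile(r"^\s+(\w+)\s+([A-Z0-9_]+(?:<.+>)?)")


def parse_ddl_columns(ddl, drop_fields=()):
    """Return {column_name: (type_string, is_repeated)} for a simple DDL."""
    columns = {}
    for line in ddl.splitlines():
        match = COLUMN_RE.match(line)
        if not match:
            continue  # not an indented column definition
        name, type_str = match.groups()
        if name in drop_fields:
            continue
        columns[name] = (type_str, type_str.startswith("ARRAY<"))
    return columns


columns = parse_ddl_columns(SAMPLE_DDL, drop_fields=["modified_at"])
for name, (type_str, is_repeated) in columns.items():
    print(name, type_str, "REPEATED" if is_repeated else "NULLABLE")
```

The `is_repeated` flag plays the same role as the `'ARRAY' in ...` check above, so the result could be mapped onto `BigQueryClient.FieldMode` values in the same way.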

In [11]:
NULLABLE = BigQueryClient.FieldMode.NULLABLE
REPEATED = BigQueryClient.FieldMode.REPEATED

bq_2_tf_dict = {
    'name': {'mode': NULLABLE, 'output_type': dtypes.string},
    'collaborative': {'mode': NULLABLE, 'output_type': dtypes.string},
    'pid': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'description': {'mode': NULLABLE, 'output_type': dtypes.string},
    'duration_ms_playlist': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'pid_pos_id': {'mode': NULLABLE, 'output_type': dtypes.string},
    'pos': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'artist_name_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'track_uri_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'artist_uri_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'track_name_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'album_uri_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'duration_ms_seed': {'mode': NULLABLE, 'output_type': dtypes.float64},
    'album_name_seed': {'mode': NULLABLE, 'output_type': dtypes.string},
    'track_pop_seed': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'artist_pop_seed': {'mode': NULLABLE, 'output_type': dtypes.float64},
    'artist_genres_seed': {'mode': REPEATED, 'output_type': dtypes.string},
    'artist_followers_seed': {'mode': NULLABLE, 'output_type': dtypes.float64},
    'pos_seed_track': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'artist_name_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'artist_uri_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'track_name_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'track_uri_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'album_name_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'album_uri_seed_track': {'mode': NULLABLE, 'output_type': dtypes.string},
    'duration_seed_track': {'mode': NULLABLE, 'output_type': dtypes.float64},
    'duration_ms_seed_pl': {'mode': NULLABLE, 'output_type': dtypes.float64},
    'n_songs': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'num_artists': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'num_albums': {'mode': NULLABLE, 'output_type': dtypes.int64},
    'seed_playlist_tracks': {'mode': REPEATED, 'output_type': dtypes.string},
}

In [12]:
# Overrides the dictionary above: only `track_uri` is read from `unique_track_features`.
bq_2_tf_dict = {'track_uri': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

In [14]:
client = BigQueryClient()
batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_track_features', 'spotify_train_3',
        bq_2_tf_dict,
        requested_streams=2,)
dataset = bqsession.parallel_read_rows()
dataset = dataset.prefetch(1).shuffle(batch_size*10).batch(batch_size)

NotImplementedError: unable to open file: libtensorflow_io.so, from paths: ['/home/jupyter/.local/lib/python3.7/site-packages/tensorflow_io/python/ops/libtensorflow_io.so']
caused by: ['/home/jupyter/.local/lib/python3.7/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase8FinalizeEPNS_15OpKernelContextESt8functionIFNS_8StatusOrISt10unique_ptrIS1_NS_4core15RefCountDeleterEEEEvEE']
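
An `undefined symbol` error when loading `libtensorflow_io.so` most likely means the installed `tensorflow-io` wheel was built against a different TensorFlow release than the one in the environment. A quick first check is to print both installed versions and compare them against the compatibility table in the tensorflow-io README. The sketch below uses only the standard library (`importlib.metadata`, which assumes Python 3.8 or newer); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Print the two versions side by side so they can be compared.
for package in ("tensorflow", "tensorflow-io"):
    print(f"{package}: {installed_version(package)}")
```

If the two versions are not a supported pair, reinstalling `tensorflow-io` at a version that matches the installed TensorFlow is the usual next step.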

In [None]:
for x in dataset.take(1):
    print(x)

### With the helper functions set, the tf.data pipelines are created from BigQuery below

### The song audio data is ready to use in training with this pipeline; no pre-processing is needed because there are no nested elements

In [None]:
# Validate playlist data
playlist_train_pipeline = bq_to_tfdata(
    BigQueryClient(),
    row_restriction="",  # empty string applies no row filter
    table_id='train',
    col_names=bq_2_tf_dict,
    dataset='spotify_train_3',
    batch_size=1)  # set to one to process each record and maintain shape
# for line in playlist_train_pipeline.take(1):
#     print(line)  # should come out based on batch size

## Pull one record to check that the pipeline parses rows properly

In [7]:
for _ in playlist_train_pipeline.take(1):
    print(_)

NameError: name 'playlist_train_pipeline' is not defined

# do some data wrangling on the text data
# tf.train.Example(features=tf.train.Features(feature=feature))
for _ in playlist_train_pipeline.map(lambda x: tf.io.parse_sequence_example(tf.io.serialize_tensor(x['tracks'][0]), sequence_features=feature_description, context_features=context_features, name='tracks')).take(1):
    tensor = _
    print(_)

Since the track data is stored as the text of a dictionary, we execute eagerly, grab the values, and run a string `eval` on them, for example:
`eval("{'pos': 0, 'artist_name': 'King Crimson', 'track_uri': 'spotify:track:173gp7NIXqk0MEo8K7Av4a', 'artist_uri': 'spotify:artist:7M1FPw29m5FbicYzS2xdpi', 'track_name': '21st Century Schizoid Man', 'album_uri': 'spotify:album:0ga8Q4tTXaFf9q3LvT8hrC', 'duration_ms': 657517, 'album_name': 'Radical Action To Unseat the Hold of Monkey Mind (Live)'}")`
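
Because `eval` executes whatever expression the string contains, a more defensive option for data that consists only of Python literals is `ast.literal_eval` from the standard library. The sketch below is a suggested alternative rather than what the notebook runs; `raw_tracks` is a shortened, hypothetical payload in the same shape as the example above, and `parse_tracks` is a made-up helper name. Bytes are decoded first because `ast.literal_eval` expects a `str`.

```python
import ast

# A shortened, hypothetical payload in the same shape as the example above.
raw_tracks = b"[{'pos': 0, 'artist_name': 'King Crimson', 'duration_ms': 657517}]"


def parse_tracks(raw):
    """Parse a Python-literal string (or UTF-8 bytes) into Python objects."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # literal_eval only accepts literals (dicts, lists, strings, numbers, ...)
    # and raises ValueError for anything else, such as function calls.
    return ast.literal_eval(text)


tracks = parse_tracks(raw_tracks)
print(tracks[0]["artist_name"], tracks[0]["duration_ms"])
```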

## This function parses the playlist data and breaks down the nested fields to conform to `SequenceExample`
The 'flat' features come along as `context_features` in `SequenceExample`.
A second helper function writes the parsed examples to the destination `gs://` path.

In [14]:
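# Not wrapped in @tf.function: the body calls .numpy(), which requires eager execution.
def get_tensor_from_tracks(tensor):
    key_list = ['pos', 'artist_name', 'track_uri', 'artist_uri', 'track_name', 'album_uri', 'duration_ms', 'album_name']

    # the tracks arrive as the text of a Python literal, so evaluate it
    tracks = tensor["tracks"][0]
    tracks = tracks.numpy()
    tracks = eval(tracks)

    # one list of values per track-level key
    y = {key: [] for key in key_list}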

    for track in tracks:
        y['pos'].append(track['pos'])
        y['artist_name'].append(track['artist_name'].encode('utf8'))
        y['artist_uri'].append(track['artist_uri'].encode('utf8'))
        y['track_name'].append(track['track_name'].encode('utf8'))
        y['album_uri'].append(track['album_uri'].encode('utf8'))
        y['duration_ms'].append(track['duration_ms'])
        y['album_name'].append(track['album_name'].encode('utf8'))
        y['track_uri'].append(track['track_uri'].encode('utf8'))
        


    # set list types
    pos = Int64List(value=y['pos'])
    artist_name = BytesList(value=y['artist_name'])
    artist_uri = BytesList(value=y['artist_uri'])
    track_name = BytesList(value=y['track_name'])
    album_uri = BytesList(value=y['album_uri'])
    duration_ms = Int64List(value=y['duration_ms'])
    album_name = BytesList(value=y['album_name'])
    track_uri = BytesList(value=y['track_uri'])



    # context (playlist-level) features; each holds the values of one tensor
    sample_dict = {
        "pl_name" : Feature(bytes_list=BytesList(value=tensor['pl_name'].numpy())),
        "collaborative" : Feature(bytes_list=BytesList(value=tensor['collaborative'].numpy())),
        "modified_at_playlist" : Feature(int64_list=Int64List(value=tensor['modified_at_playlist'].numpy())),
        "num_tracks" : Feature(int64_list=Int64List(value=tensor['num_tracks'].numpy())),
        "num_albums" : Feature(int64_list=Int64List(value=tensor['num_albums'].numpy())),
        "num_followers" : Feature(int64_list=Int64List(value=tensor['num_followers'].numpy())),
        "num_edits" : Feature(int64_list=Int64List(value=tensor['num_edits'].numpy())),
        "duration_ms" : Feature(int64_list=Int64List(value=tensor['duration_ms'].numpy())),
        "num_artists" : Feature(int64_list=Int64List(value=tensor['num_artists'].numpy())),
        "genres" : Feature(bytes_list=BytesList(value=tensor['genres'].numpy())),
        #"track_pop" : Feature(int64_list=Int64List(value=tensor['track_pop'].numpy())),
        "time_signature" : Feature(float_list=FloatList(value=tensor['time_signature'].numpy())),
        "uri" : Feature(bytes_list=BytesList(value=tensor['uri'].numpy())),
        "tempo" : Feature(float_list=FloatList(value=tensor['tempo'].numpy())),
        "valence" : Feature(float_list=FloatList(value=tensor['valence'].numpy())),
        "liveness" : Feature(float_list=FloatList(value=tensor['liveness'].numpy())),
        "instrumentalness" : Feature(float_list=FloatList(value=tensor['instrumentalness'].numpy())),
        "acousticness" : Feature(float_list=FloatList(value=tensor['acousticness'].numpy())),
        "speechiness" : Feature(float_list=FloatList(value=tensor['speechiness'].numpy())),
        "mode" : Feature(float_list=FloatList(value=tensor['mode'].numpy())),
        "loudness" : Feature(float_list=FloatList(value=tensor['loudness'].numpy())),
        "key" : Feature(float_list=FloatList(value=tensor['key'].numpy())),
        "energy" : Feature(float_list=FloatList(value=tensor['energy'].numpy())),
        "danceability" : Feature(float_list=FloatList(value=tensor['danceability'].numpy())),
        "artist_name" : Feature(bytes_list=BytesList(value=tensor['artist_name'].numpy())),
        "track_name" : Feature(bytes_list=BytesList(value=tensor['track_name'].numpy())),
        "album_name" : Feature(bytes_list=BytesList(value=tensor['album_name'].numpy())),
        # read from 'description' (previously copied from 'album_name' by mistake)
        "description" : Feature(bytes_list=BytesList(value=tensor['description'].numpy()))
    }

    # combine feature list

    pos = FeatureList(feature=[Feature(int64_list=pos)]) 
    artist_name = FeatureList(feature=[Feature(bytes_list=artist_name)])
    artist_uri = FeatureList(feature=[Feature(bytes_list=artist_uri)])
    track_name = FeatureList(feature=[Feature(bytes_list=track_name)])
    album_uri = FeatureList(feature=[Feature(bytes_list=album_uri)])
    duration_ms = FeatureList(feature=[Feature(int64_list=duration_ms)])
    album_name = FeatureList(feature=[Feature(bytes_list=album_name)])
    track_uri = FeatureList(feature=[Feature(bytes_list=track_uri)])
            

    #create the sequence
    seq = SequenceExample(context=tf.train.Features(feature=sample_dict),
                          feature_lists=FeatureLists(feature_list={
                               "pos": pos,
                               "artist_name": artist_name,
                               "track_name": track_name,
                               "album_uri": album_uri,
                               "duration_ms": duration_ms,
                               "album_name": album_name,
                               "track_uri": track_uri,
                              "artist_uri": artist_uri
    }))
    
    return seq


def write_a_tfrec(lns, n_records_per_file, file_counter, subfolder):
    # write the serialized examples to a single TFRecord file
    path = TF_RECORDS_DIR + "/" + subfolder + "/file_%.2i-%i.tfrec" % (n_records_per_file, file_counter)
    with tf.io.TFRecordWriter(path) as writer:
        for example in lns:
            writer.write(example.SerializeToString())

## Now iterate over the pipeline
Creating files with batches of `N_RECORDS_PER_TFRECORD_FILE`

This takes about 30 minutes on a machine with 64 vCPUs and 57.6 GB of RAM.
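
One detail worth getting right when writing fixed-size files is the final, partially filled batch, which is easy to drop by accident. The following is a minimal sketch of the batching logic on its own, using only the standard library; `chunked` is a hypothetical helper, and a short `range` stands in for the real pipeline.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield lists of up to chunk_size items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Toy stand-in for the pipeline: 7 records, 3 records per file.
file_sizes = [len(chunk) for chunk in chunked(range(7), 3)]
print(file_sizes)  # [3, 3, 1]
```

In this notebook, each chunk would correspond to one call to `write_a_tfrec`, with the chunk's index used as the file counter.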

In [15]:
from tqdm import tqdm

# get_tensor_from_tracks calls .numpy(), so functions must run eagerly
tf.config.run_functions_eagerly(True)

# timestamp used as the output subfolder name
ct = str(datetime.datetime.now()).replace(" ", ":")
print(f"Timestamp for folder expected: {ct}")
records = []
file_count = 0
for line in tqdm(playlist_train_pipeline, total=TOTAL_PLAYLISTS):
    records.append(get_tensor_from_tracks(line))
    if len(records) == N_RECORDS_PER_TFRECORD_FILE:  # write a file and reset the batch
        write_a_tfrec(records, n_records_per_file=N_RECORDS_PER_TFRECORD_FILE, subfolder=ct, file_counter=file_count)
        file_count += 1
        records = []
if records:  # write the final, partially filled file
    write_a_tfrec(records, n_records_per_file=N_RECORDS_PER_TFRECORD_FILE, subfolder=ct, file_counter=file_count)

Timestamp for folder expected: 2022-04-26:13:17:31.737488


  0%|          | 0/68441867 [00:00<?, ?it/s]2022-04-26 13:17:31.758739: E tensorflow/core/framework/dataset.cc:577] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-26 13:17:31.758789: E tensorflow/core/framework/dataset.cc:581] UNIMPLEMENTED: Cannot merge options for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-26 13:17:31.759263: E tensorflow/core/framework/dataset.cc:577] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-26 13:17:31.759292: E tensorflow/core/framework/dataset.cc:581] UNIMPLEMENTED: Cannot merge options for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
 11%|█         | 7592670/68441867 [9:32:46<76:30:23, 220.93it/s] 


### Parse the records to ensure this worked

In [None]:
# #move the files around

# !gsutil cp gs://spotify-tfrecords/$ct/* gs://spotify-tfrecords
# !gsutil rm gs://spotify-tfrecords/$ct/*

In [12]:
import tensorflow as tf

sequence_features = {'pos': tf.io.RaggedFeature(tf.int64), 
                     'artist_name':  tf.io.RaggedFeature(tf.string), 
                     'track_uri':  tf.io.RaggedFeature(tf.string), 
                     'artist_uri': tf.io.RaggedFeature(tf.string), 
                     'track_name': tf.io.RaggedFeature(tf.string), 
                     'album_uri': tf.io.RaggedFeature(tf.string),
                     'duration_ms': tf.io.RaggedFeature(tf.int64), 
                     'album_name': tf.io.RaggedFeature(tf.string)
                    }
context_features = {"pl_name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "collaborative" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "modified_at_playlist" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_tracks" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_albums" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_followers" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_edits" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "duration_ms" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_artists" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "description" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "genres" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "time_signature" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "uri" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "tempo" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "valence" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "liveness" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "instrumentalness" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "acousticness" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "speechiness" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "mode" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "loudness" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "key" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "energy" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "danceability" : tf.io.FixedLenFeature(dtype=tf.float32, shape=(1,)),
                    "artist_name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "track_name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "album_name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,))
                   }

def parse_tfrecord_fn(example):
    example = tf.io.parse_single_sequence_example(example, sequence_features=sequence_features, context_features=context_features)
    return example

### Parse the TFRecord dataset

In [None]:
from google.cloud import storage

client = storage.Client()
files = []
for blob in client.list_blobs('spotify-tfrecords'):
    files.append(blob.public_url.replace("https://storage.googleapis.com/", "gs://"))

In [14]:
from pprint import pprint
raw_dataset = tf.data.TFRecordDataset("gs://spotify-tfrecords/2022-04-25:12:34:55.692336/file_1000-0.tfrec")

tf_record_pipeline = raw_dataset.map(parse_tfrecord_fn)

for _ in tf_record_pipeline.take(1):
    pprint(_)

2022-04-25 13:10:08.097630: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at example_parsing_ops.cc:480 : INVALID_ARGUMENT: Inconsistent max number of elements for feature acousticness: expected 1, but found 500

InvalidArgumentError: Inconsistent max number of elements for feature acousticness: expected 1, but found 500
	 [[{{node ParseSingleSequenceExample/ParseSequenceExample/ParseSequenceExampleV2}}]] [Op:IteratorGetNext]
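
The error says the parser was told to expect exactly one value for `acousticness` (the `FixedLenFeature` with `shape=(1,)` in `context_features`), but the serialized record holds 500. Either the writer should store a single value per context feature, or the parsing spec should allow a variable length (for example with `tf.io.VarLenFeature` or `tf.io.RaggedFeature`). A cheap way to catch this before writing is to check the lengths of the context values. The sketch below is plain Python under stated assumptions: `find_wrong_length_features` is a hypothetical helper, and `record` is made-up data shaped like the failing case.

```python
def find_wrong_length_features(context_values, expected_length=1):
    """Return {feature_name: actual_length} for features of the wrong length."""
    return {
        name: len(values)
        for name, values in context_values.items()
        if len(values) != expected_length
    }


# Hypothetical record: one single-valued feature and one with a value per track.
record = {"pl_name": [b"FeSTa"], "acousticness": [0.1] * 500}
problems = find_wrong_length_features(record)
print(problems)  # {'acousticness': 500}
```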


# Model Draft Stuff

In [None]:
# %%writefile -a vertex_train/trainer/task.py

class PlaylistsModel(tf.keras.Model):
    def __init__(self, layer_sizes, adapt_data):
        super().__init__()
        
        # start with lookups on low-cardinality categorical items
        colab_vocab = tf.constant(['true','false'], name='colab_vocab', dtype='string')
        
        self.colab = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=colab_vocab, mask_token=None, name="colab_lookup", output_mode='count')
        ], name="colab")
        
        # create text vectorizers to be fed to an embedding layer
        self.artist_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="artist_tv", ngrams=2)
        
        self.album_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="album_tv", ngrams=2)
        
        self.description_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="description_tv", ngrams=2)
        
        # named album_embedding because call() looks it up under that name
        self.album_embedding = tf.keras.Sequential([
            self.album_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="album_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="album_embedding_model")
        
        self.artist_embedding = tf.keras.Sequential([
            self.artist_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="artist_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="artist_embedding")
        
        # call() also uses a description embedding, so define it like the others
        self.description_embedding = tf.keras.Sequential([
            self.description_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="description_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="description_embedding")
        
        ###############
        ### adapt the vectorizers to the data
        ###############
        
        self.artist_vectorizor.adapt(adapt_data.map(lambda x: x['artist_name']))
        self.album_vectorizor.adapt(adapt_data.map(lambda x: x['album_name'])) 
        self.description_vectorizor.adapt(adapt_data.map(lambda x: x['description']))
        
        # Then construct the layers.
        self.dense_layers = tf.keras.Sequential(name="dense_layers_query")
        
        initializer = tf.keras.initializers.GlorotUniform(seed=SEED)
        # Use the ReLU activation for all but the last layer.
        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu", kernel_initializer=initializer))
            if DROPOUT:
                self.dense_layers.add(tf.keras.layers.Dropout(DROPOUT_RATE))
        # No activation for the last layer
        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, kernel_initializer=initializer))
        ### ADDING L2 NORM AT THE END
        self.dense_layers.add(tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, 1, epsilon=1e-12, name="normalize_dense")))


    def call(self, data):    
        all_embs = tf.concat(
                [
                    self.album_embedding(data['album_name']),
                    self.artist_embedding(data['artist_name']),
                    self.colab(data['collaborative']),
                    self.description_embedding(data['description'])
                ], axis=1)
        return self.dense_layers(all_embs)

## Use the example output to decide how to process your features

```
OrderedDict([
 ('album_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm'], dtype=object)>),
 ('artist_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carrot Green'], dtype=object)>),
 ('collaborative', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'false'], dtype=object)>),
 ('description', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b''], dtype=object)>),
 ('duration_ms', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'358500'], dtype=object)>),
 ('modified_at', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1505692800])>),
 ('name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'FeSTa'], dtype=object)>),
 ('num_albums', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([82])>),
 ('num_artists', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([66])>),
 ('num_edits', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([48])>),
 ('num_followers', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>),
 ('num_tracks', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([85])>),
 ('pos', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'45'], dtype=object)>),
 ('track_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm - Carrot Green Remix'], dtype=object)>)
])
```
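
One thing the output above makes visible is that `duration_ms` and `pos` arrive with `dtype=string` even though they hold numbers, so they need converting before they can be treated as normalized floats. The sketch below shows the idea in plain Python on hypothetical values; `to_floats` and `standardize` are made-up helper names. Inside a TensorFlow model, the analogous steps would likely be `tf.strings.to_number` followed by a `tf.keras.layers.Normalization` layer.

```python
# Hypothetical values for a numeric feature that arrives as byte strings.
raw_durations = [b"358500", b"241000", b"198250"]


def to_floats(byte_values):
    """Decode byte strings such as b'358500' into floats."""
    return [float(value.decode("utf-8")) for value in byte_values]


def standardize(values):
    """Scale values to zero mean and unit variance (population variance)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant input
    return [(v - mean) / std for v in values]


normalized = standardize(to_floats(raw_durations))
print(normalized)
```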

In [None]:
#### Tests

In [None]:
# data[0].keys()#originally got the values from this
feature_description = {'pos': tf.io.RaggedFeature(tf.int64), 
                     'artist_name':  tf.io.RaggedFeature(tf.string), 
                     'track_uri':  tf.io.RaggedFeature(tf.string), 
                     'artist_uri': tf.io.RaggedFeature(tf.string), 
                     'track_name': tf.io.RaggedFeature(tf.string), 
                     'album_uri': tf.io.RaggedFeature(tf.string),
                     'duration_ms': tf.io.RaggedFeature(tf.int64), 
                     'album_name': tf.io.RaggedFeature(tf.string)
                    }
context_features = {"name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "collaborative" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,)),
                    "modified_at" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_tracks" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_albums" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_followers" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_edits" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "duration_ms" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "num_artists" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1,)),
                    "description" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1,))
                   }