# Two-Tower Retrieval Model

### Key resources:
* The TensorFlow Recommenders examples, starting [here](https://www.tensorflow.org/recommenders/examples/deep_recommenders), include great techniques for building custom TFRS models

### Goals:
* Show how to model most data types:
  * Strings
  * Existing embeddings (vectors)
  * Floats (normalized)
  * Categoricals with a vocabulary
  * High-dimensional categoricals (embedded)
* Leverage class templates to create custom two-tower models quickly and easily
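
As a rough illustration of what the preprocessing for these data types does, here is a minimal pure-Python sketch. The function names and sample values are hypothetical and not part of this notebook; in a model, the Keras layers `Normalization`, `StringLookup`, and `Hashing` play these roles.

```python
import math
import zlib


def normalize(values):
    """Z-score a list of floats, similar in spirit to a normalization layer."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    return [(v - mean) / std for v in values]


def vocab_lookup(value, vocab):
    """Map a low-cardinality category to an index; 0 is reserved for unknowns."""
    return vocab.index(value) + 1 if value in vocab else 0


def hash_bucket(value, num_buckets):
    """Map a high-cardinality category to a bucket id that can index an embedding table."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


tempos = normalize([90.0, 120.0, 150.0])
collab_id = vocab_lookup("false", ["true", "false"])
unknown_id = vocab_lookup("maybe", ["true", "false"])
artist_bucket = hash_bucket("spotify:artist:5bgfj5zUoWpyeVatGDjn6H", 1000)
```

Hashing trades a small collision rate for not having to store a vocabulary, which is why it suits very high-cardinality fields.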

## Spotify: create the tensorflow-io interface for the playlist and track tables in BigQuery
Best practices from Google are in this blog post

In [402]:
# set variables
DROPOUT = False
DROPOUT_RATE = 0.2
EMBEDDING_DIM = 64
MAX_TOKENS = 100_000
BATCH_SIZE = 256
ARCH = [128, 64]
NUM_EPOCHS = 1
SEED = 41781897
PROJECT_ID = 'jtotten-project'
DROP_FIELDS = ['pid', 'track_uri', 'artist_uri', 'album_uri']

#### Quick counts on the training data

#### Row count for the `playlists` table

In [403]:
%%bigquery
select count(1) from jtotten-project.spotify_mpd.playlists

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 990.16query/s] 
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.17rows/s]


Unnamed: 0,f0_
0,66346428


#### Row count for the `track_audio` table

In [404]:
%%bigquery
select count(1) from jtotten-project.spotify_mpd.track_audio

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1395.31query/s]
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.31rows/s]


Unnamed: 0,f0_
0,2261490


### Define the tf.data pipeline function for BigQuery

[Great blog post here on it](https://towardsdatascience.com/how-to-read-bigquery-data-from-tensorflow-2-0-efficiently-9234b69165c8)

In [405]:
import json
import warnings

import tensorflow as tf
import tensorflow_recommenders as tfrs
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

warnings.filterwarnings("ignore")  # suppress an info-level bug that can safely be ignored


def bq_to_tfdata(client, row_restriction, table_id, col_names, col_types, dataset, batch_size=BATCH_SIZE):
    """Read a BigQuery table into a shuffled, batched tf.data.Dataset."""
    bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, table_id, dataset,
        col_names, col_types,
        requested_streams=2,
        row_restriction=row_restriction)
    rows = bqsession.parallel_read_rows()
    # Shuffle individual rows, then batch; prefetch goes last so that reading
    # the next batch overlaps with consuming the current one.
    return rows.shuffle(batch_size * 10).batch(batch_size).prefetch(1)

## Get the song metadata

To build a pipeline we need each table's column names and types, along with the table identifiers. The helper functions below convert that metadata into the types `tf` expects.

For each table ID, programmatically get:
* Column names
* Column types

In [406]:
%%bigquery schema
SELECT * FROM jtotten-project.spotify_mpd.INFORMATION_SCHEMA.TABLES
where table_name in ('track_audio', 'playlists_track_string');

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 773.57query/s]                          
Downloading: 100%|██████████| 2/2 [00:00<00:00,  2.60rows/s]


In [407]:
schema # we will get the fields out of the ddl field

Unnamed: 0,table_catalog,table_schema,table_name,table_type,is_insertable_into,is_typed,creation_time,base_table_catalog,base_table_schema,base_table_name,snapshot_time_ms,ddl
0,jtotten-project,spotify_mpd,track_audio,BASE TABLE,YES,NO,2022-04-06 17:46:25.801000+00:00,,,,NaT,CREATE TABLE `jtotten-project.spotify_mpd.trac...
1,jtotten-project,spotify_mpd,playlists_track_string,BASE TABLE,YES,NO,2022-04-22 22:50:46.601000+00:00,,,,NaT,CREATE TABLE `jtotten-project.spotify_mpd.play...


## Helper functions to pull metadata from DDL statements

In [408]:
# Convert a BigQuery type name to the matching TensorFlow dtype

def conv_dtype_to_tf(dtype_str):
    if dtype_str == 'FLOAT64':
        return dtypes.float64
    elif dtype_str == 'INT64':
        return dtypes.int64
    else:
        # every other BigQuery type is read as a string
        return dtypes.string

def get_metadata_from_ddl(ddl, drop_field=None):
    """Return (field names, tf dtypes) parsed from a one-element Series holding a DDL statement."""
    drop_field = drop_field or []  # the default None would fail the `in` check below
    fields = []
    types = []
    ddl = ddl.values[0]
    for line in ddl.splitlines():
        if line[:1] == ' ':  # only the indented lines define fields
            # drop the comma, then split "  name TYPE" into ['', '', name, TYPE]
            space_delim = line.replace(',', '').split(' ')
            if space_delim[2] in drop_field:
                continue
            fields.append(space_delim[2])
            types.append(conv_dtype_to_tf(space_delim[3]))
    return fields, types


track_audio_fields, track_audio_types = get_metadata_from_ddl(schema.ddl[schema.table_name == 'track_audio'], DROP_FIELDS)
playlist_fields, playlist_types = get_metadata_from_ddl(schema.ddl[schema.table_name == 'playlists_track_string'], DROP_FIELDS) 
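
To make the parsing step concrete, here is a minimal sketch that runs without BigQuery or TensorFlow. The DDL string is made up (it only assumes one indented `name TYPE` pair per line, which is the shape the helper relies on), and `parse_ddl` and `TYPE_MAP` are hypothetical names that return type names as plain strings instead of `tf` dtypes.

```python
SAMPLE_DDL = """CREATE TABLE `my-project.my_dataset.track_audio`
(
  track_uri STRING,
  danceability FLOAT64,
  artist_pop INT64,
  genres STRING
);"""

# Anything not listed here falls back to "string"
TYPE_MAP = {"FLOAT64": "float64", "INT64": "int64"}


def parse_ddl(ddl, drop_fields=()):
    """Return (field names, type names) for the indented column lines of a DDL string."""
    fields, types = [], []
    for line in ddl.splitlines():
        if not line.startswith(" "):  # column definitions are the indented lines
            continue
        name, bq_type = line.replace(",", "").split()[:2]
        if name in drop_fields:
            continue
        fields.append(name)
        types.append(TYPE_MAP.get(bq_type, "string"))
    return fields, types


fields, types = parse_ddl(SAMPLE_DDL, drop_fields=["track_uri"])
print(list(zip(fields, types)))
```

Dropped fields disappear from both lists, so names and types stay aligned when they are passed on to the read session.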

In [409]:
# Quick check on data
for a, b in zip(playlist_fields, playlist_types):
    print(a +" : " + str(b))

name : <dtype: 'string'>
collaborative : <dtype: 'string'>
modified_at : <dtype: 'int64'>
num_tracks : <dtype: 'int64'>
num_albums : <dtype: 'int64'>
num_followers : <dtype: 'int64'>
tracks : <dtype: 'string'>
num_edits : <dtype: 'int64'>
duration_ms : <dtype: 'int64'>
num_artists : <dtype: 'int64'>
description : <dtype: 'string'>


In [410]:
# Quick check on data
for a, b in zip(track_audio_fields, track_audio_types):
    print(a +" : " + str(b))
    
DROP_TRACK_AUDIO_FIELDS = ['pid', 'track_uri', 'artist_uri', 'album_uri']

artist_name : <dtype: 'string'>
track_name : <dtype: 'string'>
album_name : <dtype: 'string'>
name : <dtype: 'string'>
danceability : <dtype: 'float64'>
energy : <dtype: 'float64'>
key : <dtype: 'float64'>
loudness : <dtype: 'float64'>
mode : <dtype: 'float64'>
speechiness : <dtype: 'float64'>
acousticness : <dtype: 'float64'>
instrumentalness : <dtype: 'float64'>
liveness : <dtype: 'float64'>
valence : <dtype: 'float64'>
tempo : <dtype: 'float64'>
type : <dtype: 'string'>
id : <dtype: 'string'>
uri : <dtype: 'string'>
track_href : <dtype: 'string'>
analysis_url : <dtype: 'string'>
time_signature : <dtype: 'float64'>
artist_pop : <dtype: 'int64'>
track_pop : <dtype: 'string'>
genres : <dtype: 'string'>
duration_ms : <dtype: 'int64'>


### With the helper functions set, create the tf.data pipelines from BigQuery

In [411]:
track_train_pipeline = bq_to_tfdata(
    BigQueryClient(), row_restriction=None, table_id='track_audio',
    col_names=track_audio_fields, col_types=track_audio_types,
    dataset='spotify_mpd', batch_size=1)  # switch to BATCH_SIZE after testing

In [412]:
### Validate we are getting records

for line in track_train_pipeline.take(1):
    print(line) #should come out based on batch size

2022-04-23 22:47:52.086736: E tensorflow/core/framework/dataset.cc:577] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-23 22:47:52.086783: E tensorflow/core/framework/dataset.cc:581] UNIMPLEMENTED: Cannot merge options for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-23 22:47:52.087160: E tensorflow/core/framework/dataset.cc:577] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-04-23 22:47:52.087201: E tensorflow/core/framework/dataset.cc:581] UNIMPLEMENTED: Cannot merge options for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.


OrderedDict([('acousticness', <tf.Tensor: shape=(1,), dtype=float64, numpy=array([0.993])>), ('album_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Fundo Musical'], dtype=object)>), ('analysis_url', <tf.Tensor: shape=(1,), dtype=string, numpy=
array([b'https://api.spotify.com/v1/audio-analysis/2V3MAaxLBbS370RqeXpc72'],
      dtype=object)>), ('artist_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Fundo Musical Latino Star'], dtype=object)>), ('artist_pop', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>), ('danceability', <tf.Tensor: shape=(1,), dtype=float64, numpy=array([0.609])>), ('duration_ms', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([337525])>), ('energy', <tf.Tensor: shape=(1,), dtype=float64, numpy=array([0.103])>), ('genres', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'unknown'], dtype=object)>), ('id', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'2V3MAaxLBbS370RqeXpc72'], dtype=object)>), ('instrumentalness', <tf.Tens

In [413]:
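# Manually set the dtype for the nested tracks column. Look it up by name:
# tracks is not the last field, so index -1 would change description instead.
playlist_types[playlist_fields.index('tracks')] = dtypes.string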

In [516]:
## Validate playlist data
playlist_train_pipeline = bq_to_tfdata(
    BigQueryClient(), row_restriction=None, table_id='playlists_track_string',
    col_names=playlist_fields, col_types=playlist_types,
    dataset='spotify_mpd', batch_size=1)
for line in playlist_train_pipeline.take(1):
    print(line)  # should come out based on batch size


OrderedDict([('collaborative', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'true'], dtype=object)>), ('description', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b''], dtype=object)>), ('duration_ms', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([12300276])>), ('modified_at', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1418256000])>), ('name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'DIVA'], dtype=object)>), ('num_albums', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([34])>), ('num_artists', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([17])>), ('num_edits', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([16])>), ('num_followers', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>), ('num_tracks', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([51])>), ('tracks', <tf.Tensor: shape=(1,), dtype=string, numpy=
array([b'[{\'pos\': 0, \'artist_name\': \'Lianne La Havas\', \'track_uri\': \'spotify:track:77fhhezMh8llUQ4ma1RBsr\', \'artist_ur

## Pulling one record shows the rows parse correctly; `tracks` arrives as a stringified list of dicts

In [49]:
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [415]:
# data[0].keys()#originally got the values from this
feature_description = {'pos': tf.io.RaggedFeature(tf.int64), 
                     'artist_name':  tf.io.RaggedFeature(tf.string), 
                     'track_uri':  tf.io.RaggedFeature(tf.string), 
                     'artist_uri': tf.io.RaggedFeature(tf.string), 
                     'track_name': tf.io.RaggedFeature(tf.string), 
                     'album_uri': tf.io.RaggedFeature(tf.string),
                     'duration_ms': tf.io.RaggedFeature(tf.int64), 
                     'album_name': tf.io.RaggedFeature(tf.string)
                    }
context_features = {"name" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1)),
                    "collaborative" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1)),
                    "modified_at" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "num_tracks" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "num_albums" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "num_followers" :tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "num_edits" :tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "duration_ms" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "num_artists" : tf.io.FixedLenFeature(dtype=tf.int64, shape=(1)),
                    "description" : tf.io.FixedLenFeature(dtype=tf.string, shape=(1))
                   }

In [40]:
# https://www.tensorflow.org/tutorials/load_data/tfrecord

In [331]:
x = []

In [482]:
import ast

def reader(pipeline):
    """Yield each record's stringified track list parsed into a list of dicts."""
    for record in pipeline:
        # Each record is a scalar string tensor holding a Python-literal list of dicts.
        # ast.literal_eval parses it without executing arbitrary code, unlike eval.
        yield ast.literal_eval(record.numpy().decode('utf8'))

In [483]:
eval("{'pos': 0, 'artist_name': 'King Crimson', 'track_uri': 'spotify:track:173gp7NIXqk0MEo8K7Av4a', 'artist_uri': 'spotify:artist:7M1FPw29m5FbicYzS2xdpi', 'track_name': '21st Century Schizoid Man', 'album_uri': 'spotify:album:0ga8Q4tTXaFf9q3LvT8hrC', 'duration_ms': 657517, 'album_name': 'Radical Action To Unseat the Hold of Monkey Mind (Live)'}")

{'pos': 0,
 'artist_name': 'King Crimson',
 'track_uri': 'spotify:track:173gp7NIXqk0MEo8K7Av4a',
 'artist_uri': 'spotify:artist:7M1FPw29m5FbicYzS2xdpi',
 'track_name': '21st Century Schizoid Man',
 'album_uri': 'spotify:album:0ga8Q4tTXaFf9q3LvT8hrC',
 'duration_ms': 657517,
 'album_name': 'Radical Action To Unseat the Hold of Monkey Mind (Live)'}
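
Since the `tracks` column holds Python-literal text, `ast.literal_eval` is a safer drop-in for `eval`. The sketch below uses an abbreviated two-track sample in the same shape as the `tracks` column; the variable names are illustrative only.

```python
import ast

raw = (b"[{'pos': 0, 'artist_name': 'K CAMP', 'track_name': '5 Minutes', 'duration_ms': 203080}, "
       b"{'pos': 1, 'artist_name': 'Post Malone', 'track_name': 'Go Flex', 'duration_ms': 179613}]")

# Decode the bytes first: literal_eval expects a str
tracks = ast.literal_eval(raw.decode("utf-8"))
track_names = [t["track_name"] for t in tracks]
total_ms = sum(t["duration_ms"] for t in tracks)

# Unlike eval, literal_eval rejects anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

This matters here because the strings come from a data table, and `eval` would execute whatever expression a row happened to contain.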

In [790]:
import ast

# Keys split by value type, for the per-track dicts and the playlist-level context
INT_TRACK_KEYS = ['pos', 'duration_ms']
STR_TRACK_KEYS = ['artist_name', 'track_uri', 'artist_uri', 'track_name', 'album_uri', 'album_name']
INT_CONTEXT_KEYS = ['modified_at', 'num_tracks', 'num_albums', 'num_followers',
                    'num_edits', 'duration_ms', 'num_artists']
STR_CONTEXT_KEYS = ['name', 'collaborative', 'description']

def get_tensor_from_tracks(tensor):
    """Convert one playlist record (batch size 1) into a tf.train.SequenceExample.

    This calls .numpy(), so it only works eagerly and is not decorated
    with @tf.function.
    """
    # tracks is a stringified list of dicts; literal_eval parses it without running code
    tracks = ast.literal_eval(tensor["tracks"][0].numpy().decode('utf8'))

    def ints(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

    def strings(values):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

    # playlist-level values go in the context
    context = {key: ints(tensor[key].numpy()) for key in INT_CONTEXT_KEYS}
    context.update({key: strings(tensor[key].numpy()) for key in STR_CONTEXT_KEYS})

    # One FeatureList per track attribute, with one Feature per track, so the keys
    # line up with the per-key entries in feature_description.
    feature_lists = {
        key: tf.train.FeatureList(feature=[ints([track[key]]) for track in tracks])
        for key in INT_TRACK_KEYS}
    feature_lists.update({
        key: tf.train.FeatureList(
            feature=[strings([track[key].encode('utf8')]) for track in tracks])
        for key in STR_TRACK_KEYS})

    # create the sequence
    return tf.train.SequenceExample(
        context=tf.train.Features(feature=context),
        feature_lists=tf.train.FeatureLists(feature_list=feature_lists))
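
For reference, here is a minimal sketch of a `SequenceExample` round trip, assuming TensorFlow 2.x in eager mode. The two tracks and the playlist name are sample values, and `int_step` / `bytes_step` are hypothetical helpers. Each track attribute gets its own `FeatureList` with one `Feature` per track, which is the layout that per-key `sequence_features` can parse.

```python
import tensorflow as tf


def int_step(value):
    """Wrap one integer as a single step of a FeatureList."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def bytes_step(value):
    """Wrap one string as a single step of a FeatureList."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))


tracks = [{"pos": 0, "track_name": "5 Minutes"}, {"pos": 1, "track_name": "Go Flex"}]

example = tf.train.SequenceExample(
    context=tf.train.Features(feature={"name": bytes_step("da best")}),
    feature_lists=tf.train.FeatureLists(feature_list={
        "pos": tf.train.FeatureList(feature=[int_step(t["pos"]) for t in tracks]),
        "track_name": tf.train.FeatureList(
            feature=[bytes_step(t["track_name"]) for t in tracks]),
    }),
)

# Parse the serialized proto back into tensors
context, sequence = tf.io.parse_single_sequence_example(
    example.SerializeToString(),
    context_features={"name": tf.io.FixedLenFeature(shape=(1,), dtype=tf.string)},
    sequence_features={
        "pos": tf.io.RaggedFeature(tf.int64),
        "track_name": tf.io.RaggedFeature(tf.string),
    },
)
```

Note that the parser takes the serialized proto (`SerializeToString()`), not a stringified Python object or a tensor passed through `tf.io.serialize_tensor`.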

In [779]:
print(tf.executing_eagerly())

True


In [791]:
tf.config.run_functions_eagerly(True)

for line in playlist_train_pipeline.take(10):
    q = get_tensor_from_tracks(line) #should come out based on batch size


In [792]:
q

context {
  feature {
    key: "collaborative"
    value {
      bytes_list {
        value: "false"
      }
    }
  }
  feature {
    key: "description"
    value {
      bytes_list {
        value: ""
      }
    }
  }
  feature {
    key: "duration_ms"
    value {
      int64_list {
        value: 5932935
      }
    }
  }
  feature {
    key: "modified_at"
    value {
      int64_list {
        value: 1476403200
      }
    }
  }
  feature {
    key: "name"
    value {
      bytes_list {
        value: "da best"
      }
    }
  }
  feature {
    key: "num_albums"
    value {
      int64_list {
        value: 22
      }
    }
  }
  feature {
    key: "num_artists"
    value {
      int64_list {
        value: 19
      }
    }
  }
  feature {
    key: "num_edits"
    value {
      int64_list {
        value: 6
      }
    }
  }
  feature {
    key: "num_followers"
    value {
      int64_list {
        value: 1
      }
    }
  }
  feature {
    key: "num_tracks"
    value {
      int

In [699]:
line
split_string_tensors = tf.strings.split(line, sep='},')
fixed_string_tensors = tf.strings.as_string(split_string_tensors)
fixed_string_tensors[0][0]

InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: float, double, int32, uint8, int16, int8, int64, bfloat16, uint16, half, uint32, uint64, complex64, complex128, bool, variant
	; NodeDef: {{node AsString}}; Op<name=AsString; signature=input:T -> output:string; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, DT_INT8, DT_INT64, DT_BFLOAT16, DT_UINT16, DT_HALF, DT_UINT32, DT_UINT64, DT_COMPLEX64, DT_COMPLEX128, DT_BOOL, DT_VARIANT]; attr=precision:int,default=-1; attr=scientific:bool,default=false; attr=shortest:bool,default=false; attr=width:int,default=-1; attr=fill:string,default=""> [Op:AsString]

In [538]:
#  feature_lists: {
#     feature_list: {
#       key  : "movie_ratings"
#       value: {
#         feature: {
#           float_list: {
#             value: [ 4.5 ]
#           }
#         }
#         feature: {
#           float_list: {
#             value: [ 5.0 ]
#           }
#         }
#       }
#     }
#     feature_list: {
#       key  : "movie_names"
#       value: {
#         feature: {
#           bytes_list: {
#             value: [ "The Shawshank Redemption" ]
#           }
#         }
#         feature: {
#           bytes_list: {
#             value: [ "Fight Club" ]
#           }
#         }
#       }
#     }
#     feature_list: {
#       key  : "actors"
#       value: {
#         feature: {
#           bytes_list: {
#             value: [ "Tim Robbins", "Morgan Freeman" ]
#           }
#         }
#         feature: {
#           bytes_list: {
#             value: [ "Brad Pitt", "Edward Norton", "Helena Bonham Carter" ]
#           }
#         }
#       }
#     }
#   }

In [None]:
#formatting of message to conform to sequence example



In [418]:
tf.io.parse_sequence_example(str(x[0]), sequence_features=feature_description)

2022-04-23 22:52:05.953466: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at example_parsing_ops.cc:480 : INVALID_ARGUMENT: Invalid protocol message input, example id: <unknown>


InvalidArgumentError: Invalid protocol message input, example id: <unknown> [Op:ParseSequenceExampleV2]

# do some data wrangling on the text data
# tf.train.Example(features=tf.train.Features(feature=feature))
for _ in playlist_train_pipeline.map(lambda x: tf.io.parse_sequence_example(tf.io.serialize_tensor(x['tracks'][0]), sequence_features=feature_description, context_features=context_features, name='tracks')).take(1):
    tensor = _
    print(_)

In [156]:
tensor.numpy()[0]

AttributeError: 'tuple' object has no attribute 'numpy'

In [157]:
# for tracks in playlist_train_pipeline.map(lambda x: x['tracks']).take(1):
def get_tracks_from_tracks(tracks):
    for track in tracks:
        track = tf.train.SequenceExample(track)
    return track

In [121]:
serialized_songs = tf.io.serialize_tensor(
    tensor.numpy()[0]
)
serialized_songs
tf.io.parse_sequence_example(serialized_songs, sequence_features=feature_description)

AttributeError: 'tuple' object has no attribute 'numpy'

tensor.numpy()

In [141]:
playlist_train_pipeline2 = playlist_train_pipeline.map(lambda x: tf.io.serialize_tensor(x['tracks'][0]))
# playlist_train_pipeline3 = playlist_train_pipeline2.map(get_tracks_from_tracks)
for _ in playlist_train_pipeline2.map(lambda x: tf.io.parse_single_sequence_example(x, sequence_features=feature_description)).take(1):
    tensor = _
    print(_)


({}, {'album_name': <tf.RaggedTensor []>, 'album_uri': <tf.RaggedTensor []>, 'artist_name': <tf.RaggedTensor []>, 'artist_uri': <tf.RaggedTensor []>, 'duration_ms': <tf.RaggedTensor []>, 'pos': <tf.RaggedTensor []>, 'track_name': <tf.RaggedTensor []>, 'track_uri': <tf.RaggedTensor []>})


In [131]:
for _ in playlist_train_pipeline2.take(1):
    tensor2 = _


In [136]:
tensor2

<tf.Tensor: shape=(), dtype=string, numpy=b"[{'pos': 0, 'artist_name': 'K CAMP', 'track_uri': 'spotify:track:7hRwwjy1cmwGkzBnJlhtnY', 'artist_uri': 'spotify:artist:5bgfj5zUoWpyeVatGDjn6H', 'track_name': '5 Minutes', 'album_uri': 'spotify:album:2GPRyHEFB2uzrB8aFyAmkP', 'duration_ms': 203080, 'album_name': '5 Minutes'}, {'pos': 1, 'artist_name': 'Post Malone', 'track_uri': 'spotify:track:5yuShbu70mtHXY0yLzCQLQ', 'artist_uri': 'spotify:artist:246dkjvS1zLTtiykXe5h60', 'track_name': 'Go Flex', 'album_uri': 'spotify:album:5s0rmjP8XOPhP6HhqOhuyC', 'duration_ms': 179613, 'album_name': 'Stoney'}, {'pos': 2, 'artist_name': 'Marc E. Bassy', 'track_uri': 'spotify:track:0LFqxv7O10lq49g4wpJ1ht', 'artist_uri': 'spotify:artist:3tQx1LPXbsYjE9VwN1Peaa', 'track_name': 'Relapse (feat. Iamsu!)', 'album_uri': 'spotify:album:2KjK2W2EFrzLF3iTRWnKlJ', 'duration_ms': 184500, 'album_name': 'Only The Poets Mixtape (Vol. 1)'}, {'pos': 3, 'artist_name': 'Kodak Black', 'track_uri': 'spotify:track:5v7kaZNsnyByrSJOfO8

In [74]:
tensor.numpy()[0]

b'[{\'pos\': 0, \'artist_name\': \'George Strait\', \'track_uri\': \'spotify:track:1TanmIWbaUj5NVwJ3k4XPd\', \'artist_uri\': \'spotify:artist:5vngPClqofybhPERIqQMYd\', \'track_name\': \'Write This Down\', \'album_uri\': \'spotify:album:2Kudx2lMsMx3svYdb2xe2F\', \'duration_ms\': 219600, \'album_name\': \'Always Never The Same\'}, {\'pos\': 1, \'artist_name\': \'Keith Urban\', \'track_uri\': \'spotify:track:50nzorQ9gi2md8UpFi8aJT\', \'artist_uri\': \'spotify:artist:0u2FHSq3ln94y5Q57xazwf\', \'track_name\': \'Somewhere In My Car\', \'album_uri\': \'spotify:album:5rESCws46ubPJlqOeb30Rv\', \'duration_ms\': 236906, \'album_name\': \'Fuse\'}, {\'pos\': 2, \'artist_name\': \'Blake Shelton\', \'track_uri\': \'spotify:track:289hx4t6fH2BBe8p6cnXo1\', \'artist_uri\': \'spotify:artist:1UTPBmNbXNTittyMJrNkvw\', \'track_name\': \'Neon Light\', \'album_uri\': \'spotify:album:0daIqjuhsQqXoeII3pBSeT\', \'duration_ms\': 221401, \'album_name\': \'BRINGING BACK THE SUNSHINE\'}, {\'pos\': 3, \'artist_name\'

In [13]:
# %%writefile -a vertex_train/trainer/task.py

class PlaylistsModel(tf.keras.Model):
    def __init__(self, layer_sizes, adapt_data):
        super().__init__()
        
        # start with lookups on low-cardinality categorical items
        colab_vocab = tf.constant(['true','false'], name='colab_vocab', dtype='string')
        
        self.colab = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=colab_vocab, mask_token=None, name="colab_lookup", output_mode='count')
        ], name="colab")
        
        # create text vectorizers to be fed to an embedding layer
        self.artist_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="artist_tv", ngrams=2)
        
        self.album_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="album_tv", ngrams=2)
        
        # give this layer its own name rather than reusing "album_tv"
        self.description_vectorizor = tf.keras.layers.TextVectorization(
            max_tokens=MAX_TOKENS, name="description_tv", ngrams=2)
        
        # named album_embedding to match its use in call()
        self.album_embedding = tf.keras.Sequential([
            self.album_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="album_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="album_embedding_model")
        
        self.artist_embedding = tf.keras.Sequential([
            self.artist_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="artist_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="artist_embedding")
        
        # call() also uses a description embedding, built the same way
        self.description_embedding = tf.keras.Sequential([
            self.description_vectorizor,
            tf.keras.layers.Embedding(MAX_TOKENS+1, EMBEDDING_DIM , mask_zero=True, name="description_emb"),
            tf.keras.layers.GlobalAveragePooling1D()
        ], name="description_embedding")
        
        ###############
        ### adapt the vectorizers
        ###############
        
        self.artist_vectorizor.adapt(adapt_data.map(lambda x: x['artist_name']))
        self.album_vectorizor.adapt(adapt_data.map(lambda x: x['album_name']))
        self.description_vectorizor.adapt(adapt_data.map(lambda x: x['description']))
        
        # Then construct the layers.
        self.dense_layers = tf.keras.Sequential(name="dense_layers_query")
        
        initializer = tf.keras.initializers.GlorotUniform(seed=SEED)
        # Use the ReLU activation for all but the last layer.
        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu", kernel_initializer=initializer))
            if DROPOUT:
                self.dense_layers.add(tf.keras.layers.Dropout(DROPOUT_RATE))
        # No activation for the last layer
        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, kernel_initializer=initializer))
        ### ADDING L2 NORM AT THE END
        self.dense_layers.add(tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, 1, epsilon=1e-12, name="normalize_dense")))


    def call(self, data):    
        all_embs = tf.concat(
                [
                    self.album_embedding(data['album_name']),
                    self.artist_embedding(data['artist_name']),
                    self.colab(data['collaborative']),
                    self.description_embedding(data['description'])
                ], axis=1)
        return self.dense_layers(all_embs)
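
The L2 normalization at the end of the dense stack means that, assuming the candidate tower is normalized the same way, the dot product between the two towers' outputs equals cosine similarity. A minimal pure-Python sketch with made-up two-dimensional embeddings (all names here are illustrative):

```python
import math


def l2_normalize(vec, epsilon=1e-12):
    """Scale a vector to unit length, guarding against a zero norm."""
    norm = math.sqrt(max(sum(v * v for v in vec), epsilon))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = l2_normalize([3.0, 4.0])
candidates = {
    "track_a": l2_normalize([6.0, 8.0]),   # same direction as the query
    "track_b": l2_normalize([4.0, -3.0]),  # orthogonal to the query
}
scores = {name: dot(query, emb) for name, emb in candidates.items()}
best = max(scores, key=scores.get)
```

Because the scores are bounded between -1 and 1, vector magnitude no longer influences ranking; only direction does.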

## Use the example output to decide how to process each feature

```
OrderedDict([('album_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm'], dtype=object)>), ('artist_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carrot Green'], dtype=object)>), ('collaborative', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'false'], dtype=object)>), ('description', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b''], dtype=object)>), ('duration_ms', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'358500'], dtype=object)>), ('modified_at', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1505692800])>), ('name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'FeSTa'], dtype=object)>), ('num_albums', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([82])>), ('num_artists', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([66])>), ('num_edits', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([48])>), ('num_followers', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>), ('num_tracks', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([85])>), ('pos', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'45'], dtype=object)>), ('track_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm - Carrot Green Remix'], dtype=object)>)])
```

In [14]:
#### Tests