# Two-Tower Retrieval Model

### Key resources:
* Many of the TensorFlow Recommenders tutorials, such as [this one](https://www.tensorflow.org/recommenders/examples/deep_recommenders), cover useful techniques for building custom TFRS models

### Goals:
* Show how to model most data types:
  * Strings
  * Existing embeddings (vectors)
  * Floats (normalized)
  * Categorical with a vocabulary
  * High-dimensional categorical (embedded)
* Leverage class templates to create custom two-tower models quickly and easily

## Spotify: create the tensorflow.io interface for the event and product tables in BigQuery
Best practices from Google are in this blog post

In [1]:
# !gsutil mb -l us-central1 gs://spotify-tfrecords-blog

In [17]:
# set variables
SEED = 41781897
PROJECT_ID = 'hybrid-vertex'
DROP_FIELDS = ['modified_at', 'row_number', 'seed_playlist_tracks']
TF_RECORDS_DIR = 'gs://spotify-tfrecords-blog'

#### Quick counts on the training records for track

In [18]:
%%bigquery TOTAL_PLAYLISTS
select count(1) from hybrid-vertex.spotify_train_3.train



In [19]:
TOTAL_PLAYLISTS = TOTAL_PLAYLISTS.values[0][0]
TOTAL_PLAYLISTS

65346428

### Define the tf.io pipeline function for BigQuery

[Great blog post here on it](https://towardsdatascience.com/how-to-read-bigquery-data-from-tensorflow-2-0-efficiently-9234b69165c8)

In [20]:
# !pip install tensorflow-recommenders -q --user

In [22]:
import datetime
import json
import warnings

import tensorflow as tf
# import tensorflow_recommenders as tfrs
from tensorflow.python.framework import dtypes
from tensorflow.python.lib.io import file_io
from tensorflow.train import BytesList, Feature, FeatureList, Int64List, FloatList
from tensorflow.train import SequenceExample, FeatureLists
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

# Silence an info-level bug that can safely be ignored.
warnings.filterwarnings("ignore")

# BATCH_SIZE is not defined in the cells shown here; this value is a placeholder.
BATCH_SIZE = 1


def bq_to_tfdata(client, row_restriction, table_id, col_names, dataset, batch_size=BATCH_SIZE):
    """Read a BigQuery table into a shuffled, batched tf.data.Dataset."""
    # After the parent, the arguments are: project, table, dataset, selected fields.
    bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, table_id, dataset,
        col_names,
        requested_streams=2,
        row_restriction=row_restriction)
    rows = bqsession.parallel_read_rows()
    # Shuffle, then batch; prefetch last so reading overlaps with downstream work.
    return rows.shuffle(batch_size * 10).batch(batch_size).prefetch(1)

## Get the song metadata

To get a pipeline working we need each table's metadata (column names and types) along with the table information. The following functions are helpers that convert the metadata into the proper types for `tf`.

For each table ID, programmatically get:
* Column names
* Column types

## Metadata dictionary to translate from BQ to tensorflow

From the DDL we get the column types and store them in a dictionary, which is then used to create a `BigQueryReadSession` from `tensorflow_io.bigquery`.
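As a rough illustration of that translation, the sketch below maps rows shaped like the output of BigQuery's `INFORMATION_SCHEMA.COLUMNS` view (`column_name`, `data_type`) to the mode and dtype names used in the dictionary. The helper name `bq_columns_to_spec` and the small type table are assumptions made for this example and are not part of the original notebook. It also assumes every non-array column is `NULLABLE`, as in the dictionary here. The strings it returns would still need to be resolved to `BigQueryClient.FieldMode` and `dtypes` members, for example with `getattr`.

```python
# Hypothetical helper: not part of the original notebook.
# Maps BigQuery column types to the names of the TensorFlow dtypes.
BQ_TO_TF_DTYPE = {
    "STRING": "string",
    "INT64": "int64",
    "FLOAT64": "float64",
    "BOOL": "bool",
}


def bq_columns_to_spec(columns):
    """Turn (column_name, data_type) pairs into {'mode', 'output_type'} entries.

    `data_type` follows the INFORMATION_SCHEMA.COLUMNS convention, where an
    array column is reported as, for example, 'ARRAY<STRING>'.
    """
    spec = {}
    for name, data_type in columns:
        data_type = data_type.strip().upper()
        if data_type.startswith("ARRAY<") and data_type.endswith(">"):
            mode = "REPEATED"
            element_type = data_type[len("ARRAY<"):-1]
        else:
            # Assumption: non-array columns are treated as NULLABLE.
            mode = "NULLABLE"
            element_type = data_type
        if element_type not in BQ_TO_TF_DTYPE:
            raise ValueError(f"Unsupported BigQuery type for {name!r}: {data_type}")
        spec[name] = {"mode": mode, "output_type": BQ_TO_TF_DTYPE[element_type]}
    return spec


example_columns = [
    ("pid", "INT64"),
    ("name", "STRING"),
    ("artist_pop_seed", "FLOAT64"),
    ("track_uri_seed_pl", "ARRAY<STRING>"),
]
spec = bq_columns_to_spec(example_columns)
print(spec["track_uri_seed_pl"])  # {'mode': 'REPEATED', 'output_type': 'string'}
```

Keeping the intermediate result as plain strings makes the mapping easy to inspect and test without a BigQuery connection; resolving to the real enum and dtype objects can happen in one final pass.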

In [32]:
bq_2_tf_dict = {'name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
 'collaborative': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'pid': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
 'description': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'duration_ms_playlist': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'pid_pos_id': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'pos': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
 'artist_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'track_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'artist_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'track_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'album_uri_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'duration_ms_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.float64},
 'album_name_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'track_pop_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'artist_pop_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.float64},
 'artist_genres_seed': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.string},
 'artist_followers_seed': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.float64},
 'pos_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'artist_name_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'artist_uri_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'track_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'track_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'album_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'album_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.string},
 'duration_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.float64},
 'duration_ms_seed_pl': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.float64},
 'n_songs': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'num_artists': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'num_albums': {'mode': BigQueryClient.FieldMode.NULLABLE,
  'output_type': dtypes.int64},
 'pos_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.int64},
 'track_uri_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.string},
 'track_name_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.string},
 'duration_ms_seed_songs_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.float64},
 'album_name_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.string},
 'artist_pop_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.float64},
 'artists_followers_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.float64},
 'track_pop_seed_pl': {'mode': BigQueryClient.FieldMode.REPEATED,
  'output_type': dtypes.int64},
}

In [39]:
client = BigQueryClient()
batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'train_flatten', 'spotify_train_3',
        bq_2_tf_dict,
        requested_streams=2,)
dataset = bqsession.parallel_read_rows()
# Shuffle, then batch; prefetch last so reading overlaps with downstream work.
dataset = dataset.shuffle(batch_size * 10).batch(batch_size).prefetch(1)

In [40]:
for x in dataset.take(1):
    print(x)

OrderedDict([('album_name_seed', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'What Was I Thinking Of'], dtype=object)>), ('album_name_seed_pl', <tf.Tensor: shape=(1, 10), dtype=string, numpy=
array([[b'Use Your Illusion II', b'80s Songs from the Big Screen',
        b'Different Light', b"She's So Unusual", b"Surfin' USA",
        b'50 Big Ones: Greatest Hits',
        b'Summer Days (And Summer Nights)',
        b'Nursery Rhymes & Bible Songs for Kids - Childrens Music & Hymns & Sunday School Songs for Praise & Christian Worship',
        b'Songs with the Greatest Piano Riffs in the World!',
        b'The Savage Young Beatles Feat. Tony Sheridan']], dtype=object)>), ('album_name_seed_track', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Savage Young Beatles Feat. Tony Sheridan'], dtype=object)>), ('album_uri_seed', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'spotify:album:5UvwZF8MTWOpirUaZx6px9'], dtype=object)>), ('album_uri_seed_track', <tf.Tensor: shape=(1

### Serialize the data to disk after we run a pipeline thru our tr

Since the data is stored as a text dictionary, we execute eagerly, grab the values, and parse each string with `eval`:
`eval("{'pos': 0, 'artist_name': 'King Crimson', 'track_uri': 'spotify:track:173gp7NIXqk0MEo8K7Av4a', 'artist_uri': 'spotify:artist:7M1FPw29m5FbicYzS2xdpi', 'track_name': '21st Century Schizoid Man', 'album_uri': 'spotify:album:0ga8Q4tTXaFf9q3LvT8hrC', 'duration_ms': 657517, 'album_name': 'Radical Action To Unseat the Hold of Monkey Mind (Live)'}")`
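`eval` executes arbitrary Python, so for dictionary literals like the one above the standard library's `ast.literal_eval` is a safer substitute. A minimal sketch, where `parse_track` is a hypothetical helper name and the input is assumed to arrive as the UTF-8 bytes you get from calling `.numpy()` on a scalar string tensor:

```python
import ast


def parse_track(raw):
    """Parse a stringified track dictionary without executing arbitrary code."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    track = ast.literal_eval(text)  # accepts literals only; no function calls
    if not isinstance(track, dict):
        raise ValueError("expected a dict literal")
    return track


# Shortened version of the example string above, as bytes.
raw = (
    b"{'pos': 0, 'artist_name': 'King Crimson', "
    b"'track_uri': 'spotify:track:173gp7NIXqk0MEo8K7Av4a', "
    b"'duration_ms': 657517}"
)
track = parse_track(raw)
print(track["artist_name"], track["duration_ms"])  # King Crimson 657517
```

Anything that is not a plain literal, such as a function call embedded in the string, raises `ValueError` instead of being executed.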

## Use the example output to decide how to process your features

```
OrderedDict([('album_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm'], dtype=object)>), ('artist_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carrot Green'], dtype=object)>), ('collaborative', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'false'], dtype=object)>), ('description', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b''], dtype=object)>), ('duration_ms', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'358500'], dtype=object)>), ('modified_at', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1505692800])>), ('name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'FeSTa'], dtype=object)>), ('num_albums', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([82])>), ('num_artists', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([66])>), ('num_edits', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([48])>), ('num_followers', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>), ('num_tracks', <tf.Tensor: shape=(1,), dtype=int64, numpy=array([85])>), ('pos', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'45'], dtype=object)>), ('track_name', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'The Helm - Carrot Green Remix'], dtype=object)>)])
```
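One thing this output makes visible is that `duration_ms` and `pos` arrive as strings (`b'358500'`, `b'45'`), while the `num_*` counts are already `int64`. Below is a small sketch of casting such fields after eager execution, assuming each value has already been pulled out of its tensor with `.numpy()`. The helper name `cast_numeric_fields` and the choice of which fields to cast are illustrative, not taken from the original notebook. Inside a `tf.data` pipeline, `tf.strings.to_number` plays the same role.

```python
# Hypothetical helper: operates on plain Python/bytes values, e.g. the result
# of calling .numpy() on each tensor of a size-1 batch.
NUMERIC_STRING_FIELDS = ("duration_ms", "pos")  # assumed from the sample output


def cast_numeric_fields(row, numeric_fields=NUMERIC_STRING_FIELDS):
    """Decode bytes to str and cast string-encoded numeric fields to float."""
    clean = {}
    for key, value in row.items():
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        if key in numeric_fields:
            value = float(value)
        clean[key] = value
    return clean


row = {
    "album_name": b"The Helm",
    "duration_ms": b"358500",
    "pos": b"45",
    "num_tracks": 85,
}
clean_row = cast_numeric_fields(row)
print(clean_row)
```

Casting to float up front keeps these fields ready for normalization, while text fields stay as strings for vocabulary lookup or embedding.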