## Create Vocab Files

> The Keras preprocessing layers API allows developers to build Keras-native input
processing pipelines. These input processing pipelines can be used as independent
preprocessing code in non-Keras workflows, combined directly with Keras models, and
exported as part of a Keras SavedModel.

> With Keras preprocessing layers, you can build and export models that are truly
end-to-end: models that accept raw images or raw structured data as input; models that
handle feature normalization or feature value indexing on their own.

See [Preprocessing Layers Colab](https://colab.sandbox.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/preprocessing_layers.ipynb#scrollTo=b1d403f04693) for a guided example.

#### TensorFlow-IO workarounds

[How to compile TensorFlow 2.0 with AVX2/FMA instructions on Mac](https://technofob.com/2019/06/14/how-to-compile-tensorflow-2-0-with-avx2-fma-instructions-on-mac/)


In [7]:
# !pip install tensorflow-recommenders -q --user
# !pip install -U tensorflow-io==0.16.0 --user # ENCOUNTER ERRORS WITH 0.16
# !pip install -U tensorflow-io==0.15.0 --user

### Notebook Setup

In [69]:
# set variables
SEED = 41781897
PROJECT_ID = 'hybrid-vertex'
BQ_LOCATION='us-central1'
# DROP_FIELDS = ['modified_at', 'row_number', 'seed_playlist_tracks']
TF_RECORDS_DIR = 'gs://spotify-tfrecords-blog'
BATCH_SIZE = 10

In [66]:
import warnings
warnings.filterwarnings("ignore") #do this b/c there's an info-level bug that can safely be ignored

from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

import json
import tensorflow as tf
# import tensorflow_recommenders as tfrs
import datetime
from tensorflow.python.lib.io import file_io
from tensorflow.train import BytesList, Feature, FeatureList, Int64List, FloatList
from tensorflow.train import SequenceExample, FeatureLists

import os

import numpy as np
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'


In [67]:
bq_2_tf_dict = {
    'name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'collaborative': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'pid': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
    # 'duration_ms_playlist': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
    'pid_pos_id': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    # candidate features
    'pos_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
    'artist_name_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'track_uri_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'artist_uri_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'track_name_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'album_uri_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'duration_ms_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'album_name_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'track_pop_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'artist_pop_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'artist_genres_can': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'artist_followers_can': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    # seed track features
    'pos_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64},
    'artist_name_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'artist_uri_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'track_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'track_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'album_name_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'album_uri_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'duration_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'track_pop_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'artist_pop_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'artist_genres_seed_track': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'artist_followers_seed_track': {'mode':BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    ### playlist features
    'duration_ms_seed_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'n_songs_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'num_artists_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'num_albums_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.float64},
    'description_pl': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string},
    'pos_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.int64},
    'artist_name_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
    'track_uri_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
    'track_name_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
    'duration_ms_songs_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},
    'album_name_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
    'artist_pop_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},
    'artists_followers_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.float64},              
    'track_pop_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.int64},
    'artist_genres_pl': {'mode': BigQueryClient.FieldMode.REPEATED, 'output_type': dtypes.string},
}

In [143]:
client = BigQueryClient()

In [72]:
# client = BigQueryClient()
batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'train_flatten', 'spotify_train_3',
        bq_2_tf_dict,
        requested_streams=2,)
dataset = bqsession.parallel_read_rows()
dataset = dataset.prefetch(1).shuffle(batch_size*10).batch(batch_size)

In [73]:
type(dataset)

tensorflow.python.data.ops.dataset_ops.BatchDataset

In [74]:
for x in dataset.take(1):
    print(x)

OrderedDict([('album_name_can', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'A Lovely Way To Spend Christmas'], dtype=object)>), ('album_name_pl', <tf.Tensor: shape=(1, 138), dtype=string, numpy=
array([[b'The Little Christmas Tree',
        b"Grandma Got Run over by a Reindeer: Essential Rock 'N Roll Christmas Oldies with Blue Christmas, Run Rudolph Run, Feliz Navidad, Snoopy's Christmas, I Saw Mommy Kissing Santa Clause & More",
        b'The Little Christmas Tree',
        b'Santa Claus Is Coming to Town: 30 Kids Christmas Songs Including Frosty the Snowman, Rudolph the Red Nosed Reindeer, Jingle Bells, Here Comes Santa Claus & More Oldies!',
        b'Santa Claus Is Coming to Town: 30 Kids Christmas Songs Including Frosty the Snowman, Rudolph the Red Nosed Reindeer, Jingle Bells, Here Comes Santa Claus & More Oldies!',
        b'The Little Christmas Tree', b'The Little Christmas Tree',
        b'The Little Christmas Tree', b'The Little Christmas Tree',
        b'The Little 

## Vocabs and string lookups

### Working with lookup layers with very large vocabularies

You may find yourself working with a very large vocabulary in a `TextVectorization`, a `StringLookup` layer,
or an `IntegerLookup` layer. Typically, a vocabulary larger than 500MB would be considered "very large".

In such a case, for best performance, you should avoid using `adapt()`.
Instead, pre-compute your vocabulary in advance
(you could use Apache Beam or TF Transform for this)
and store it in a file. Then load the vocabulary into the layer at construction
time by passing the filepath as the `vocabulary` argument.

See [preprocessing_layers Colab](https://colab.sandbox.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/preprocessing_layers.ipynb#scrollTo=143ce01c5558) for more details.
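
As a minimal sketch of that pattern, the snippet below writes a toy vocabulary to a text file (one term per line) and passes the path to `StringLookup`. The file name `genres_vocab.txt` and the three genre terms are made up for illustration. The index layout relies on the `tf.keras.layers.StringLookup` defaults (one out-of-vocabulary index at position 0, no mask token).

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy vocabulary; in practice this would be precomputed offline.
vocab_terms = ["rock", "pop", "jazz"]

# Write one term per line, the format the lookup layers read from a file.
vocab_path = os.path.join(tempfile.mkdtemp(), "genres_vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(vocab_terms) + "\n")

# Pass the path as `vocabulary`; no call to adapt() is needed.
lookup = tf.keras.layers.StringLookup(vocabulary=vocab_path)

# Index 0 is reserved for out-of-vocabulary terms by default.
indices = lookup(tf.constant(["pop", "metal", "rock"])).numpy().tolist()
print(indices)  # [2, 0, 1]
```

The same `vocabulary=<path>` argument is accepted by `IntegerLookup` and `TextVectorization`.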

#### BigQuery SQL

In [None]:
# flatten artist_features table and grab unique artist genres

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_artist_features_flat` AS
# SELECT
#   artist_name,
#   artist_uri,
#   artist_pop,
#   ARRAY_TO_STRING(artist_genres," ") as artist_genres,
#   artist_followers
#   FROM `hybrid-vertex.spotify_train_3.unique_artist_features`
# ORDER BY LENGTH(artist_genres) desc

# -- CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_artist_genres` AS 
# -- SELECT DISTINCT TRIM(artist_genres) as u_artist_genres FROM `hybrid-vertex.spotify_uniques.unique_artist_features_flat`

In [None]:
# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_track_uris` AS 
# SELECT DISTINCT TRIM(track_uri) as u_track_uri FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_track_names` AS 
# SELECT DISTINCT TRIM(track_name) as u_track_name FROM `hybrid-vertex.spotify_mpd.playlists`


# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_artist_uris` AS 
# SELECT DISTINCT TRIM(artist_uri) as u_artist_uri FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_artist_names` AS 
# SELECT DISTINCT TRIM(artist_name) as u_artist_name FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_album_names` AS 
# SELECT DISTINCT TRIM(album_name) as u_album_name FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_album_uris` AS 
# SELECT DISTINCT TRIM(album_uri) as u_album_uri FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_playlist_names` AS 
# SELECT DISTINCT TRIM(name) as u_playlist_names FROM `hybrid-vertex.spotify_mpd.playlists`

# CREATE OR REPLACE TABLE `hybrid-vertex.spotify_uniques.unique_playlist_descriptions` AS 
# SELECT DISTINCT TRIM(description) as u_descriptions FROM `hybrid-vertex.spotify_mpd.playlists`
# ORDER BY LENGTH(u_descriptions) desc

### Read unique tables

In [None]:

# uni_tables = [
#     'unique_album_names',
#     'unique_album_uris',
#     'unique_artist_genres',
#     'unique_artist_names',
#     'unique_artist_uris',
#     'unique_playlist_names',
#     'unique_track_uris',
#     'unique_track_names',
#     'unique_playlist_descriptions',
# ]

# uni_table_cols = [
#     'u_album_name',
#     'u_album_uri',
#     'u_artist_genres',
#     'u_artist_name',
#     'u_artist_uri',
#     'u_playlist_names',
#     'u_track_uri',
#     'u_track_name',
#     'u_descriptions',
# ]


In [84]:
unique_dict = {'u_descriptions': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_playlist_descriptions', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_pl_description_ds = bqsession.parallel_read_rows()
unique_pl_description_ds = unique_pl_description_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_pl_descriptions = np.unique(np.concatenate(list(unique_pl_description_ds.map(lambda x: x['u_descriptions']).batch(1000))))

In [85]:
unique_dict = {'u_track_name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_track_names', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_track_name_ds = bqsession.parallel_read_rows()
unique_track_name_ds = unique_track_name_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_track_names = np.unique(np.concatenate(list(unique_track_name_ds.map(lambda x: x['u_track_name']).batch(1000))))

In [86]:
unique_dict = {'u_track_uri': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_track_uris', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_track_uri_ds = bqsession.parallel_read_rows()
unique_track_uri_ds = unique_track_uri_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_track_uris = np.unique(np.concatenate(list(unique_track_uri_ds.map(lambda x: x['u_track_uri']).batch(1000))))

In [87]:
unique_dict = {'u_playlist_names': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_playlist_names', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_playlist_names_ds = bqsession.parallel_read_rows()
unique_playlist_names_ds = unique_playlist_names_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_playlist_names = np.unique(np.concatenate(list(unique_playlist_names_ds.map(lambda x: x['u_playlist_names']).batch(1000))))

In [88]:
unique_dict = {'u_artist_uri': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_artist_uris', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_artist_uri_ds = bqsession.parallel_read_rows()
unique_artist_uri_ds = unique_artist_uri_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_artist_uris = np.unique(np.concatenate(list(unique_artist_uri_ds.map(lambda x: x['u_artist_uri']).batch(1000))))

In [89]:
unique_dict = {'u_artist_name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_artist_names', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_artist_name_ds = bqsession.parallel_read_rows()
unique_artist_name_ds = unique_artist_name_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_artist_names = np.unique(np.concatenate(list(unique_artist_name_ds.map(lambda x: x['u_artist_name']).batch(1000))))

In [90]:
unique_dict = {'u_artist_genres': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_artist_genres', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_artist_genres_ds = bqsession.parallel_read_rows()
unique_artist_genres_ds = unique_artist_genres_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_artist_genres = np.unique(np.concatenate(list(unique_artist_genres_ds.map(lambda x: x['u_artist_genres']).batch(1000))))

In [91]:
unique_dict = {'u_album_name': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_album_names', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_album_names_ds = bqsession.parallel_read_rows()
unique_album_names_ds = unique_album_names_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_album_names = np.unique(np.concatenate(list(unique_album_names_ds.map(lambda x: x['u_album_name']).batch(1000))))

In [92]:
unique_dict = {'u_album_uri': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_album_uris', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_album_uris_ds = bqsession.parallel_read_rows()
unique_album_uris_ds = unique_album_uris_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_album_uris = np.unique(np.concatenate(list(unique_album_uris_ds.map(lambda x: x['u_album_uri']).batch(1000))))

In [145]:
unique_dict = {'u_pids': {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.int64}}

batch_size = 1
bqsession = client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, 'unique_playlist_ids', 'spotify_uniques',
        unique_dict,
        requested_streams=2,)
unique_pid_ds = bqsession.parallel_read_rows()
unique_pid_ds = unique_pid_ds.prefetch(1).shuffle(batch_size*10).batch(batch_size)

uni_pl_pids = np.unique(np.concatenate(list(unique_pid_ds.map(lambda x: x['u_pids']).batch(1000))))

In [146]:
uni_pl_pids

array([     0,      1,      2, ..., 999997, 999998, 999999])
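
The read cells above differ only in table name and column name, so they could be folded into one helper. The sketch below is an assumption about how that refactor might look, not code from the original notebook: `read_unique_values` is a hypothetical name, and the caller passes in the `BigQueryClient` and the field spec so that the function itself only depends on NumPy.

```python
import numpy as np


def read_unique_values(client, project_id, dataset_id, table_id, column,
                       field_spec, batch_size=1000):
    """Read one column of a BigQuery table and return its sorted unique values.

    `client` is expected to behave like tensorflow_io's BigQueryClient;
    `field_spec` is the {'mode': ..., 'output_type': ...} dict for `column`.
    """
    session = client.read_session(
        "projects/" + project_id,
        project_id, table_id, dataset_id,
        {column: field_spec},
        requested_streams=2,
    )
    ds = session.parallel_read_rows().map(lambda row: row[column]).batch(batch_size)
    # Each batch is a 1-D tensor; concatenate them and deduplicate.
    return np.unique(np.concatenate([batch.numpy() for batch in ds]))


# Possible usage, with the names defined earlier in this notebook:
# uni_track_uris = read_unique_values(
#     client, PROJECT_ID, 'spotify_uniques', 'unique_track_uris', 'u_track_uri',
#     {'mode': BigQueryClient.FieldMode.NULLABLE, 'output_type': dtypes.string})
```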

In [125]:
# uni_pl_descriptions
# uni_track_names
# uni_track_uris
# uni_playlist_names
# uni_artist_uris
# uni_artist_names
# uni_artist_genres
# uni_album_names
# uni_album_uris

print(f"uni_pl_descriptions length: {len(uni_pl_descriptions)}")
print(f"uni_track_names length: {len(uni_track_names)}")
print(f"uni_track_uris length: {len(uni_track_uris)}")
print(f"uni_playlist_names length: {len(uni_playlist_names)}")
print(f"uni_artist_uris length: {len(uni_artist_uris)}")
print(f"uni_artist_names length: {len(uni_artist_names)}")
print(f"uni_artist_genres length: {len(uni_artist_genres)}")
print(f"uni_album_names length: {len(uni_album_names)}")
print(f"uni_album_uris length: {len(uni_album_uris)}")

uni_pl_descriptions length: 18138
uni_track_names length: 1483753
uni_track_uris length: 2262292
uni_playlist_names length: 74028
uni_artist_uris length: 295860
uni_artist_names length: 287710
uni_artist_genres length: 39397
uni_album_names length: 571625
uni_album_uris length: 734684


### Add numerical feature stats

* TODO: create query for this step
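
One way to approach the TODO above is to generate the statistics query instead of typing the values by hand. The sketch below only builds a SQL string. The helper name `build_min_max_query` is hypothetical, and the table name is taken from the read session earlier in this notebook. It covers scalar (NULLABLE) columns only, since REPEATED columns such as `artist_pop_pl` would need an `UNNEST` first.

```python
def build_min_max_query(table, columns):
    """Return a BigQuery SQL string selecting MIN and MAX for each column."""
    select_parts = []
    for col in columns:
        select_parts.append(f"MIN({col}) AS min_{col}")
        select_parts.append(f"MAX({col}) AS max_{col}")
    select_clause = ",\n  ".join(select_parts)
    return f"SELECT\n  {select_clause}\nFROM `{table}`"


query = build_min_max_query(
    "hybrid-vertex.spotify_train_3.train_flatten",
    ["duration_ms_seed_pl", "n_songs_pl", "num_artists_pl", "num_albums_pl"],
)
print(query)
```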

In [135]:
min_duration_ms_seed_pl = 0
max_duration_ms_seed_pl = 629322792

min_n_songs_pl = 1
max_n_songs_pl = 375

min_n_artists_pl = 1
max_n_artists_pl = 237

min_n_albums_pl = 1
max_n_albums_pl = 244

min_artist_pop = 0
max_artist_pop = 100

min_duration_ms_songs_pl = -1
max_duration_ms_songs_pl = 20744575

min_artist_followers = 0
max_artist_followers = 94437255

min_track_pop = 0
max_track_pop = 96

In [136]:
print(max_duration_ms_seed_pl)
# print(min_duration_ms_seed_pl)

629322792


### Save to pickle

In [147]:
# save to pickle
import pickle as pkl

vocab_dict = {
    'name': uni_playlist_names,
    'artist_name_can': uni_artist_names,
    'track_uri_can': uni_track_uris,
    'artist_uri_can': uni_artist_uris,
    'track_name_can': uni_track_names,
    'album_uri_can': uni_album_uris,
    'album_name_can': uni_album_names,
    'artist_genres_can': uni_artist_genres,
    'unique_pids':uni_pl_pids,
    # seed track 
    'artist_name_seed_track': uni_artist_names,
    'artist_uri_seed_track': uni_artist_uris,
    'track_name_seed_track': uni_track_names,
    'track_uri_seed_track': uni_track_uris,
    'album_name_seed_track': uni_album_names,
    'album_uri_seed_track': uni_album_uris,
    'artist_genres_seed_track': uni_artist_genres,
    # playlist seq
    'description_pl': uni_pl_descriptions,
    'artist_name_pl': uni_artist_names,
    'track_uri_pl': uni_track_uris,
    'track_name_pl': uni_track_names,
    'album_name_pl': uni_album_names,
    'artist_genres_pl': uni_artist_genres,
    # numerical_stats
    'min_duration_ms_seed_pl':min_duration_ms_seed_pl,
    'max_duration_ms_seed_pl':max_duration_ms_seed_pl,
    'min_n_songs_pl':min_n_songs_pl,
    'max_n_songs_pl':max_n_songs_pl,
    'min_n_artists_pl':min_n_artists_pl,
    'max_n_artists_pl':max_n_artists_pl,
    'min_n_albums_pl':min_n_albums_pl,
    'max_n_albums_pl':max_n_albums_pl,
    'min_artist_pop':min_artist_pop,
    'max_artist_pop':max_artist_pop,
    'min_duration_ms_songs_pl':min_duration_ms_songs_pl,
    'max_duration_ms_songs_pl':max_duration_ms_songs_pl,
    'min_artist_followers':min_artist_followers,
    'max_artist_followers':max_artist_followers,
    'min_track_pop':min_track_pop,
    'max_track_pop':max_track_pop,
}
    
import time

TIMESTAMP = time.strftime("%Y%m%d-%H%M%S")

VERSION = 'v1'

VOCAB_FILENAME = f'string_vocabs_{VERSION}_{TIMESTAMP}.txt'

with open(VOCAB_FILENAME, 'wb') as handle:
    pkl.dump(vocab_dict, handle)

In [148]:
with open(VOCAB_FILENAME, 'rb') as pickle_file:
    vocab_dict2 = pkl.load(pickle_file)
vocab_dict2

{'name': array([b'! 2017 Songs', b'! <3', b'! DJ', ...,
        b'\xf0\x9f\xa6\x84\xf0\x9f\xa6\x84',
        b'\xf0\x9f\xa6\x84\xf0\x9f\xa6\x84\xf0\x9f\xa6\x84',
        b'\xf0\x9f\xa6\x8b\xf0\x9f\xa6\x8b\xf0\x9f\xa6\x8b'], dtype=object),
 'artist_name_can': array([b'!!!', b'!Dela Dap', b'!Distain', ..., b'\xed\x9b\x88 Hun',
        b'\xef\xbc\x92\xef\xbc\x98\xef\xbc\x91\xef\xbc\x94',
        b'\xef\xbc\xad\xef\xbc\xa9\xef\xbc\xb9\xef\xbc\xa1\xef\xbc\xb6\xef\xbc\xa9'],
       dtype=object),
 'track_uri_can': array([b'spotify:track:0000uJA4xCdxThagdLkkLR',
        b'spotify:track:0002yNGLtYSYtc0X6ZnFvp',
        b'spotify:track:00039MgrmLoIzSpuYKurn9', ...,
        b'spotify:track:7zzwsf6tmZETUV26kR7D5z',
        b'spotify:track:7zzxEH0xUl5k3p6IxUfgAO',
        b'spotify:track:7zzyrYnZIfvYAGwl7lRb7X'], dtype=object),
 'artist_uri_can': array([b'spotify:artist:0001ZVMPt41Vwzt1zsmuzp',
        b'spotify:artist:0001cekkfdEBoMlwVQvpLg',
        b'spotify:artist:0001wHqxbF2YYRQxGdbyER', ...,

In [149]:
type(vocab_dict2['max_duration_ms_seed_pl'])

int

### Save Pickle File to GCS

In [150]:
from google.cloud import storage
# import time

# TIMESTAMP = time.strftime("%Y%m%d-%H%M%S")

storage_client = storage.Client()

BUCKET_NAME = 'spotify-v1'
# VERSION = 'v1'
# bucket = storage_client.bucket(BUCKET_NAME)

bucket = storage_client.get_bucket(BUCKET_NAME)
blob = bucket.blob(f'vocabs/{VERSION}_string_vocabs/{VOCAB_FILENAME}')
blob.upload_from_filename(VOCAB_FILENAME)

File saved to: `gs://spotify-v1/vocabs/v1_string_vocabs/string_vocabs_v1_20220701-174939.txt`

#### To use Vocab file during training...

In [111]:
!pwd

/home/jupyter/spotify_mpd_two_tower/jt-dev


In [114]:
len(vocab_dict_load["name"])

74028

In [115]:
import pickle as pkl
from google.cloud import storage

bucket = BUCKET_NAME
file_path = 'vocabs/v1_string_vocabs'
file_name = 'string_vocabs_v1_20220701-174939.txt'
destination_file = 'downloaded_v1_20220701-174939.txt'


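# download with the GCS client; `client` above is the BigQuery client
storage_client = storage.Client()
with open(destination_file, 'wb') as file_obj:
    storage_client.download_blob_to_file(
        f'gs://{bucket}/{file_path}/{file_name}', file_obj)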

    
with open(f'{destination_file}', 'rb') as pickle_file:
    vocab_dict_load = pkl.load(pickle_file)

# to use:
# vocab_dict_load['feature_name']

#### TODO: Inspect simple model/layers

In [123]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# vocabulary size, plus 2 for the padding and OOV tokens that
# TextVectorization reserves in addition to the supplied vocabulary
max_tokens_pl_names = len(vocab_dict_load["name"]) + 2
OUTPUT_DIM = 32

# one raw playlist-name string per example
input_layer = tf.keras.Input(shape=(1,), dtype='string', name='input_layer')

# playlist name text vectorization
pl_name_text_vectorizer = layers.TextVectorization(
    max_tokens=max_tokens_pl_names,
    name="names_TXT_vectorizer",
    vocabulary=vocab_dict_load["name"],
    ngrams=2,
)

# playlist name embedding
pl_name_embedding = layers.Embedding(
    input_dim=max_tokens_pl_names,
    output_dim=OUTPUT_DIM,
    name="names_EMBED_layer",
)

pl_name_model = tf.keras.Sequential(
    [
        pl_name_text_vectorizer,
        pl_name_embedding,
        # embedding output is (batch, tokens, OUTPUT_DIM): pool over the token axis
        layers.GlobalAveragePooling1D(name="pooling_layer"),
    ],
    name="pl_name_model",
)
# call the sub-model on the input tensor before adding the output layer
outputs = layers.Dense(1, name="output_layer")(pl_name_model(input_layer))
model = tf.keras.Model(inputs=[input_layer], outputs=[outputs])

keras.utils.plot_model(
    model,
    show_shapes=True,
    show_layer_names=True,
    rankdir='LR',   # 'LR'=horizontal plot; 'TB'=vertical plot
)

## Available preprocessing

### Text preprocessing

- `tf.keras.layers.TextVectorization`: turns raw strings into an encoded
  representation that can be read by an `Embedding` layer or `Dense` layer.

### Numerical features preprocessing

- `tf.keras.layers.Normalization`: performs feature-wise normalization of
  input features.
- `tf.keras.layers.Discretization`: turns continuous numerical features
  into integer categorical features.
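
As an illustration of `Discretization`, the sketch below buckets track popularity into four bins. The boundaries are an assumption made for this example: they split the 0 to 96 range recorded for `track_pop` earlier in this notebook into equal-width bins.

```python
import numpy as np
import tensorflow as tf

# Equal-width interior boundaries over the observed track_pop range [0, 96].
min_track_pop, max_track_pop = 0, 96
bin_boundaries = np.linspace(min_track_pop, max_track_pop, num=5)[1:-1].tolist()
# bin_boundaries == [24.0, 48.0, 72.0]

discretizer = tf.keras.layers.Discretization(bin_boundaries=bin_boundaries)

# Each value is mapped to the index of the bin it falls into.
bucket_ids = discretizer(tf.constant([[0.0, 30.0, 96.0]])).numpy().ravel().tolist()
print(bucket_ids)  # [0, 1, 3]
```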

### Categorical features preprocessing

- `tf.keras.layers.CategoryEncoding`: turns integer categorical features
  into one-hot, multi-hot, or count dense representations.
- `tf.keras.layers.Hashing`: performs categorical feature hashing, also known as
  the "hashing trick".
- `tf.keras.layers.StringLookup`: turns string categorical values into an encoded
  representation that can be read by an `Embedding` layer or `Dense` layer.
- `tf.keras.layers.IntegerLookup`: turns integer categorical values into an
  encoded representation that can be read by an `Embedding` layer or `Dense`
  layer.
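
For very high-cardinality string features, `Hashing` avoids storing a vocabulary at all. The sketch below hashes a few track URIs into a fixed number of bins. The bin count of 1000 is an arbitrary choice for illustration, and distinct URIs can collide in the same bin.

```python
import tensorflow as tf

NUM_BINS = 1000
hasher = tf.keras.layers.Hashing(num_bins=NUM_BINS)

track_uris = tf.constant([
    ["spotify:track:0000uJA4xCdxThagdLkkLR"],
    ["spotify:track:0002yNGLtYSYtc0X6ZnFvp"],
    ["spotify:track:0000uJA4xCdxThagdLkkLR"],  # repeated on purpose
])

bin_ids = hasher(track_uris).numpy().ravel().tolist()
# The same string always maps to the same bin, and every id is below NUM_BINS.
print(bin_ids[0] == bin_ids[2], all(0 <= b < NUM_BINS for b in bin_ids))
```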
  
  
  
See [Blog](https://towardsdatascience.com/you-should-try-the-new-tensorflows-textvectorization-layer-a80b3c6b00ee) on `TextVectorization` Layer