# Data prep

## In this notebook we will load the songs from the zip file and transform the data to prepare it for two-tower training

Steps
1. Create a BigQuery dataset
2. Load the Million Playlist Dataset into BigQuery
3. Create pipelines to download audio and artist features for training

In [1]:
# Set your variables for your project, region, and dataset name
SOURCE_BUCKET = 'spotify-million-playlist-dataset'
PROJECT_ID = 'hybrid-vertex'
REGION = 'us-central1'
BQ_DATASET = 'spotify_e2e_test'

import time
from google.cloud import bigquery

bigquery_client = bigquery.Client(project=PROJECT_ID)

In [2]:
# # Create a BigQuery dataset (one-time operation)
# # Construct a full Dataset object to send to the API.
# dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET}")

# # TODO(developer): Specify the geographic location where the dataset should reside.
# # The load and query cells below use location='US'; the dataset location must match.
# dataset.location = REGION

# # Send the dataset to the API for creation, with an explicit timeout.
# # Raises google.api_core.exceptions.Conflict if the Dataset already
# # exists within the project.
# dataset = bigquery_client.create_dataset(dataset, timeout=30)  # Make an API request.
# print("Created dataset {}.{}".format(bigquery_client.project, dataset.dataset_id))

#### Download the data 
(You can also `curl` the data from the source; see the README.)

In [3]:
# !gsutil cp gs://{SOURCE_BUCKET}/spotify_million_playlist_dataset.zip .
# !unzip -n spotify_million_playlist_dataset.zip 

#### This step can take up to 30 minutes

Iteration occurs over 1000 json files if you are using the full dataset

This should give you a `playlists` BigQuery table with 1,000,000 rows (1,000 files of 1,000 playlists each, one row per playlist)

In [16]:
import os
import json
import pandas as pd
from tqdm import tqdm

data_files = os.listdir('data')

def load_data(filename: str) -> pd.DataFrame:
    """Read one JSON slice file into a DataFrame with one row per playlist."""
    with open(f'data/{filename}') as f:
        json_dict = json.load(f)
    df = pd.DataFrame(json_dict['playlists'])
    # serialize the nested track list as a JSON string so BigQuery's JSON functions can parse it
    df['tracks'] = df['tracks'].map(json.dumps)
    return df

# make sure the playlists table does not already contain data
# loop over the json files, convert each to pandas, then upload/append in batches

### set this to load fewer files (1,000 playlists each) - this scales significantly later!
n_files_limit = None  # e.g. 10 loads only the first 10 files
files_to_load = data_files[:n_files_limit]
batch_size = 15  # files per upload
batch = []

for i, filename in enumerate(tqdm(files_to_load), start=1):
    batch.append(load_data(filename))
    # upload when the batch is full or the last file has been read
    if len(batch) == batch_size or i == len(files_to_load):
        pd.concat(batch).to_gbq(
            destination_table=f'{BQ_DATASET}.playlists', 
            project_id=PROJECT_ID, # TODO: param
            location='US', 
            progress_bar=False, 
            reauth=True, 
            if_exists='append'
        )
        batch = []  # start the next batch empty so no file is uploaded twice

100%|██████████| 1000/1000 [31:09<00:00,  1.87s/it] 
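
The batching idea can be checked without touching BigQuery. This is a minimal sketch, assuming the goal is to upload each file exactly once in groups of `batch_size`; the helper `chunked` and the file names are hypothetical and not part of the notebook.

```python
from typing import Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical file names standing in for the contents of the data folder.
file_names = [f"slice_{n}.json" for n in range(1000)]
batches = list(chunked(file_names, 15))

print(len(batches))      # number of uploads
print(len(batches[-1]))  # size of the final, partial batch
```

With 1,000 files and 15 files per batch, this gives 67 uploads, the last one holding the remaining 10 files.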


Now the data is loaded, but each playlist's tracks are stored as one large string that needs to be parsed. We will use BigQuery's JSON functions to address this.

![](img/tracks-string.png)
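
One caveat worth checking: the `tracks` string only parses reliably if it was serialized as JSON. The sketch below uses an invented sample record to compare Python's `str()` with `json.dumps()`. A strict JSON parser rejects the former, so serializing with `json.dumps()` avoids depending on how lenient a given parser is.

```python
import json

# Invented sample record; the real playlists contain many more fields.
tracks = [{"pos": 0, "track_name": "Don't Stop", "artist_name": "Example Artist"}]

as_repr = str(tracks)         # Python repr: mostly single-quoted, not JSON
as_json = json.dumps(tracks)  # valid JSON: always double-quoted

def is_valid_json(text: str) -> bool:
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(as_repr))
print(is_valid_json(as_json))
```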

### Import bigquery and run parameterized queries to shape the data

This query parses the JSON strings into BigQuery structs so they can be manipulated in subsequent queries

In [17]:
%%time

json_extract_query = f"""create or replace table `{PROJECT_ID}.{BQ_DATASET}.playlists_nested` as (
with json_parsed as (SELECT * except(tracks), JSON_EXTRACT_ARRAY(tracks) as json_data FROM `{PROJECT_ID}.{BQ_DATASET}.playlists` )

select json_parsed.* except(json_data),
ARRAY(SELECT AS STRUCT
JSON_EXTRACT_SCALAR(json_data, "$.pos") as pos, 
JSON_EXTRACT_SCALAR(json_data, "$.artist_name") as artist_name,
JSON_EXTRACT_SCALAR(json_data, "$.track_uri") as track_uri,
JSON_EXTRACT_SCALAR(json_data, "$.artist_uri") as artist_uri,
JSON_EXTRACT_SCALAR(json_data, "$.track_name") as track_name,
JSON_EXTRACT_SCALAR(json_data, "$.album_uri") as album_uri,
JSON_EXTRACT_SCALAR(json_data, "$.duration_ms") as duration_ms,
JSON_EXTRACT_SCALAR(json_data, "$.album_name") as album_name
from json_parsed.json_data
) as tracks,
from json_parsed) """

bigquery_client.query(json_extract_query).result()

CPU times: user 148 ms, sys: 51.1 ms, total: 199 ms
Wall time: 36.1 s


<google.cloud.bigquery.table._EmptyRowIterator at 0x7f495488fe90>

Now `playlists_nested` holds the parsed string data as an array of structs, which makes the data much easier to process

![](img/playlists-nested.png)
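
For intuition, here is a rough Python equivalent of what the query does to a single `tracks` string. It is only a sketch: `parse_tracks` and the sample record are made up for illustration, and the real work happens in BigQuery. Every field comes back as a string (or `None` when absent), mirroring how `JSON_EXTRACT_SCALAR` returns `STRING` values.

```python
import json
from typing import Dict, List, Optional

TRACK_FIELDS = ["pos", "artist_name", "track_uri", "artist_uri",
                "track_name", "album_uri", "duration_ms", "album_name"]

def parse_tracks(tracks_json: str) -> List[Dict[str, Optional[str]]]:
    """Turn a JSON array string into a list of records with string fields."""
    records = []
    for element in json.loads(tracks_json):
        # Numbers are cast to str because JSON_EXTRACT_SCALAR yields strings too.
        records.append({
            field: None if element.get(field) is None else str(element[field])
            for field in TRACK_FIELDS
        })
    return records

# Invented sample with only a few of the fields present.
sample = json.dumps([{"pos": 0, "artist_name": "Example Artist",
                      "track_uri": "spotify:track:abc", "duration_ms": 226863}])
parsed = parse_tracks(sample)
print(parsed[0]["duration_ms"], parsed[0]["album_name"])
```

Because `pos` and `duration_ms` end up as strings, later steps that need numbers have to cast them.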

## Next we get the unique track features to put in a BQ table

We will then use this table to call the Spotify API and enrich each track and artist with additional data

In [18]:
%%time

unique_tracks_sql = f"""create or replace table `{PROJECT_ID}.{BQ_DATASET}.tracks_unique` as (
SELECT distinct 
    track.track_uri,
    track.album_uri,
    track.artist_uri, 
FROM `{PROJECT_ID}.{BQ_DATASET}.playlists_nested`, UNNEST(tracks) as track)
"""

bigquery_client.query(unique_tracks_sql).result()

CPU times: user 24.3 ms, sys: 29.4 ms, total: 53.7 ms
Wall time: 10.9 s


<google.cloud.bigquery.table._EmptyRowIterator at 0x7f4955f3c9d0>

### Core dataset loading complete

We now have the table of unique IDs that we will use to fetch additional audio and artist features
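
The `SELECT DISTINCT ... UNNEST(tracks)` step can be pictured in plain Python as flattening every playlist's track list and keeping each URI triple once. This is a sketch with invented sample rows; `unique_tracks` is a hypothetical helper, not something the notebook defines.

```python
from typing import Dict, List, Set, Tuple

def unique_tracks(playlists: List[Dict]) -> Set[Tuple[str, str, str]]:
    """Collect distinct (track_uri, album_uri, artist_uri) triples."""
    seen = set()
    for playlist in playlists:            # one row of playlists_nested
        for track in playlist["tracks"]:  # the UNNEST(tracks) step
            seen.add((track["track_uri"], track["album_uri"], track["artist_uri"]))
    return seen

# Two invented playlists that share one track.
shared = {"track_uri": "spotify:track:a", "album_uri": "spotify:album:x",
          "artist_uri": "spotify:artist:1"}
other = {"track_uri": "spotify:track:b", "album_uri": "spotify:album:y",
         "artist_uri": "spotify:artist:1"}
rows = [{"tracks": [shared, other]}, {"tracks": [shared]}]

print(len(unique_tracks(rows)))
```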

___________
# Spotify API Feature Extraction

___________

* Spotify Million Playlist Dataset Challenge [Homepage](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
* [Spotify Web API docs](https://developer.spotify.com/documentation/web-api/reference/#/)

**Community Examples**
* [Extracting song lists](https://github.com/tojhe/recsys-spotify/blob/master/processing/songlist_extraction.py)
* [Constructing audio features with the Spotify API](https://github.com/tojhe/recsys-spotify/blob/master/processing/audio_features_construction.py)
* [Using Spotify API](https://towardsdatascience.com/extracting-song-data-from-the-spotify-api-using-python-b1e79388d50)

#### After reading through these, create a new Spotify App and get the client ID and secret
![](img/spotify-dev-console.png)

This example uses a local JSON credentials file. If you are concerned about the visibility of your API keys, please see [GCP Secret Manager](https://cloud.google.com/secret-manager)

Below is an example of reading the credentials if you add the JSON file to Secret Manager (keys: `secret`, `id`)

```python
import json
from google.cloud import secretmanager

# Note: copy/paste this resource name from Secret Manager in the console
SECRET_VERSION = 'projects/934903580331/secrets/spotify-creds1/versions/1'

sm_client = secretmanager.SecretManagerServiceClient()

response = sm_client.access_secret_version(request={"name": SECRET_VERSION})

# the secret holds the JSON credentials file, with keys `id` and `secret`
payload = json.loads(response.payload.data.decode("UTF-8"))
```
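
Once the payload bytes are in hand, it can help to fail early if the expected keys are missing. The sketch below assumes the secret holds a JSON object with the keys `id` and `secret`, as described above. The helper name `parse_spotify_creds` and the placeholder bytes are hypothetical.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("id", "secret")

def parse_spotify_creds(raw: bytes) -> Dict[str, str]:
    """Decode the secret payload and check that both expected keys exist."""
    creds = json.loads(raw.decode("UTF-8"))
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"credentials payload is missing keys: {missing}")
    return {key: creds[key] for key in REQUIRED_KEYS}

# Placeholder bytes standing in for response.payload.data
fake_payload = b'{"id": "my-client-id", "secret": "my-client-secret"}'
creds = parse_spotify_creds(fake_payload)
print(sorted(creds))
```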

### Package Installation

In [None]:
# ! pip3 install -U spotipy google-cloud-storage google-cloud-aiplatform gcsfs --user -q
# ! pip3 install --user kfp google-cloud-pipeline-components --upgrade -q
# !pip3 install --user -q google-cloud-secret-manager

In [2]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
! python3 -c "import google.cloud.aiplatform; print('aiplatform SDK version: {}'.format(google.cloud.aiplatform.__version__))"

KFP SDK version: 1.8.19
google_cloud_pipeline_components version: 1.0.39
aiplatform SDK version: 1.22.0


### Constants - set these to match your config

In [68]:
PROJECT_ID = 'hybrid-vertex' #update
LOCATION = 'us-central1' 

BUCKET_NAME = 'matching-engine-content'
# BUCKET_NAME = 'spotify-million-playlists'

VERSION = 10
PIPELINE_VERSION = f'v{VERSION}' # pipeline code
PIPELINE_TAG = f'{PIPELINE_VERSION}-spotify-feature-enrich'
print("PIPELINE_TAG:", PIPELINE_TAG)

PIPELINE_TAG: v10-spotify-feature-enrich


Create Bucket if needed

In [69]:
# !gsutil mb -l $LOCATION gs://$BUCKET_NAME

In [70]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import re
from tqdm import tqdm

import os

import pandas as pd
import numpy as np
import json

import time

import gcsfs

# GCP
from google.cloud import aiplatform
from google.cloud import storage

# Pipelines
from typing import Any, Callable, Dict, NamedTuple, Optional, List
from google_cloud_pipeline_components import aiplatform as gcc_aip
from google_cloud_pipeline_components.types import artifact_types

# Kubeflow SDK
# TODO: fix these
from kfp.v2 import dsl
import kfp
import kfp.v2.dsl
from kfp.v2.google import client as pipelines_client
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, component)



### Clients & credentials

Set up the Vertex AI client for pipelines

Your Spotify credentials should be stored in a JSON file

In [71]:
# # Setup clients
aiplatform.init(
    project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_NAME
)

In [72]:
# Get spotify credentials
# This file has id and secret stored as attributes


###########################################
### CAUTION THIS APPROACH WILL HAVE THE CREDENTIALS APPEAR IN THE CONSOLE - 
### USE SECRET MANAGER APPROACH IN EACH COMPONENT AS NEEDED (PROVIDED ABOVE)
###########################################


with open('spotify-creds.json') as creds:
    spotify_creds = json.load(creds)

# Create Pipeline Components

### Audio Features

[Link to the audio features API and the related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

In [84]:
@kfp.v2.dsl.component(
    base_image="python:3.9",
    packages_to_install=['fsspec', 'google-cloud-bigquery',
                         'google-cloud-storage',
                         'gcsfs',
                         'spotipy','requests','db-dtypes',
                         'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.17.4',
                        ])

def call_spotify_api_audio(
    project: str,
    location: str,
    client_id: str,
    batch_size: int,
    batches_to_store: int,
    target_table: str,
    client_secret: str,
    unique_table: str,
    sleep_param: float,
) -> NamedTuple('Outputs', [('done_message', str),]):
    print(f'pip install complete')
    import os
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    import pandas as pd
    import json
    import time
    from google.cloud import storage
    import gcsfs
    import numpy as np
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    from google.cloud import bigquery
    import pandas_gbq

    # print(f'package import complete')

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    
    bq_client = bigquery.Client(
      project=project, location=location
    )

    logging.info(f'spotipy auth complete')
    def spot_audio_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10 )
        ############################################################################
        # Create Track Audio Features DF
        ############################################################################
        
        uri_stripped = [u.replace('spotify:track:', '') for u in uri]  # strip the URI prefix to get bare track IDs
        #getting track popularity
        tracks = sp.tracks(uri_stripped)
        #Audio features
        a_feats = sp.audio_features(uri)
        features = pd.json_normalize(a_feats)#.to_dict('list')
        
        features['track_pop'] = pd.json_normalize(tracks['tracks'])['popularity']
        
        features['track_uri'] = uri
        return features

    bq_client = bigquery.Client(
      project=project, location='US'
    )

    # NOTE: LIMIT 233 restricts this run to a small sample of tracks; remove it to enrich every track
    query = f"select distinct track_uri from `{unique_table}` LIMIT 233" 


    #refactor
    schema = [{'name':'danceability', 'type': 'FLOAT'},
            {'name':'energy', 'type': 'FLOAT'},
            {'name':'key', 'type': 'FLOAT'},
            {'name':'loudness', 'type': 'FLOAT'},
            {'name':'mode', 'type': 'INTEGER'},
            {'name':'speechiness', 'type': 'FLOAT'},
            {'name':'acousticness', 'type': 'FLOAT'},
            {'name':'instrumentalness', 'type': 'FLOAT'},
            {'name':'liveness', 'type': 'FLOAT'},
            {'name':'valence', 'type': 'FLOAT'},
            {'name':'followers', 'type': 'FLOAT'},
            {'name':'tempo', 'type': 'FLOAT'},
            {'name':'type', 'type': 'STRING'},
            {'name':'id', 'type': 'STRING'},
            {'name':'uri', 'type': 'STRING'},
            {'name':'track_href', 'type': 'STRING'},
            {'name':'analysis_url', 'type': 'STRING'},
            {'name':'duration_ms_y', 'type': 'INTEGER'},
            {'name':'time_signature', 'type': 'INTEGER'},
            {'name':'track_uri', 'type': 'STRING'},
    ]

    tracks = bq_client.query(query).result().to_dataframe()
    track_list = tracks.track_uri.to_list()
    logging.info('finished downloading tracks')
    last_index = len(track_list) - 1  # index of the final track

    uri_batch = []       # URIs waiting to be sent to the Spotify API
    stored_batches = []  # feature DataFrames waiting to be written to BQ

    for i, uri in enumerate(track_list):
        uri_batch.append(uri)
        is_last = i == last_index
        if len(uri_batch) == batch_size or is_last:  # grab a batch of 50 songs
            logging.info(f"calling the API at i: {i} \n uri_batch length: {len(uri_batch)}")
            audio_featureDF = None
            ### Try/except block for the API call
            try:
                audio_featureDF = spot_audio_features(uri_batch, client_id, client_secret)
            except ReadTimeout:
                logging.info('Spotify timed out... trying again...')
                try:
                    audio_featureDF = spot_audio_features(uri_batch, client_id, client_secret)
                except (RequestException, spotipy.exceptions.SpotifyException) as err:
                    logging.info(f"Retry failed, skipping batch: {err}")
            except HTTPError as err:
                logging.info(f"HTTP Error: {err}")
            except spotipy.exceptions.SpotifyException as spotify_error:
                logging.info(f"Spotify error: {spotify_error}")
            time.sleep(sleep_param)
            uri_batch = []  # always start a fresh batch, even after a failed call
            if audio_featureDF is not None:
                stored_batches.append(audio_featureDF)
            # Accumulate batches on the machine, then write to BQ once enough are stored or at the end
            if stored_batches and (len(stored_batches) >= batches_to_store or is_last):
                try:
                    pd.concat(stored_batches).to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                        )
                except pandas_gbq.gbq.InvalidSchema:
                    logging.info('invalid schema, skipping')
                logging.info(f'{i+1} of {len(track_list)} complete!')
                stored_batches = []

    logging.info(f'audio features appended')
    return (
          f'DONE',
      )
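
The retry handling inside the component can also be factored into a small helper. The following is a generic sketch, not the component's actual code. `call_with_retry` is a hypothetical name, and the simulated `TimeoutError` stands in for the `ReadTimeout` that `requests` would raise.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T],
                    retry_on: Tuple[Type[BaseException], ...],
                    attempts: int = 3,
                    sleep_seconds: float = 0.05) -> Optional[T]:
    """Call fn(); on a listed exception wait and retry. Return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(sleep_seconds)
    return None

# Simulated API call that times out twice before succeeding.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = call_with_retry(flaky, retry_on=(TimeoutError,), sleep_seconds=0)
print(result, calls["n"])
```

Returning `None` on failure lets the caller skip a batch explicitly rather than reuse stale results from an earlier call.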

### Artists 

[Link to artist API and related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artist)

In [85]:
### Artist tracks api call

@kfp.v2.dsl.component(
    base_image="python:3.9",
    packages_to_install=['fsspec','google-cloud-bigquery',
                         'google-cloud-storage',
                         'gcsfs',
                         'spotipy','requests','db-dtypes',
                         'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.17.4',
                        'google-cloud-secret-manager'])

def call_spotify_api_artist(
    project: str,
    location: str,
    unique_table: str,
    batch_size: int,
    batches_to_store: int,
    client_id: str,
    client_secret: str,
    sleep_param: float,
    target_table: str,
) -> NamedTuple('Outputs', [('done_message', str),]):
    print(f'pip install complete')
    import os
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    import pandas as pd
    import json
    import time
    from google.cloud import storage
    import gcsfs
    import numpy as np
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    from google.cloud import bigquery
    import pandas_gbq
    pd.options.mode.chained_assignment = None  # default='warn'

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    storage_client = storage.Client(
        project=project
    )
    
    logging.info(f'spotipy auth complete')
    
    def spot_track_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10 )

        ############################################################################
        # Create Artist Features DF
        ############################################################################ 

        uri = [u.replace('"', '') for u in uri] #fix the quotes 

        artists = sp.artists(uri)
        features = pd.json_normalize(artists['artists'])
        smaller_features = features[['genres', 'popularity', 'name', 'followers.total']]
        smaller_features.columns = ['genres',  'popularity', 'name',  'followers']
        smaller_features['artist_uri'] = uri
        smaller_features['genres'] = smaller_features['genres'].map(lambda x: '_'.join(x))

        return smaller_features
        

    bq_client = bigquery.Client(
      project=project, location='US'
    )


    query = f"select distinct artist_uri from `{unique_table}`"
    

    schema = [{'name': 'popularity', 'type': 'INTEGER'},
            {'name':'genres', 'type': 'STRING'},
            {'name':'followers', 'type': 'INTEGER'},
            {'name':'artist_uri', 'type': 'STRING'}
    ]
    ats = bq_client.query(query).result().to_dataframe()
    artist_set = ats.artist_uri.to_list()
    last_index = len(artist_set) - 1  # index of the final artist
    logging.info('finished downloading artists')

    uri_batch = []       # URIs waiting to be sent to the Spotify API
    stored_batches = []  # feature DataFrames waiting to be written to BQ

    for i, uri in enumerate(artist_set):
        uri_batch.append(uri)
        is_last = i == last_index
        if len(uri_batch) == batch_size or is_last:  # grab a batch of 50 artists
            artist_featureDF = None
            ### Try/except block for the API call
            try:
                artist_featureDF = spot_track_features(uri_batch, client_id, client_secret)
            except ReadTimeout:
                logging.info('Spotify timed out... trying again...')
                try:
                    artist_featureDF = spot_track_features(uri_batch, client_id, client_secret)
                except (RequestException, spotipy.exceptions.SpotifyException) as err:
                    logging.info(f"Retry failed, skipping batch: {err}")
            except HTTPError as err:
                logging.info(f"HTTP Error: {err}")
            except spotipy.exceptions.SpotifyException as spotify_error:
                logging.info(f"Spotify error: {spotify_error}")
            time.sleep(sleep_param)
            uri_batch = []  # always start a fresh batch, even after a failed call
            if artist_featureDF is not None:
                stored_batches.append(artist_featureDF)
            # Accumulate batches on the machine, then write to BQ once enough are stored or at the end
            if stored_batches and (len(stored_batches) >= batches_to_store or is_last):
                try:
                    pd.concat(stored_batches).to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                        )
                except pandas_gbq.gbq.InvalidSchema:
                    logging.info('invalid schema, skipping')
                logging.info(f'{i+1} of {len(artist_set)} complete!')
                stored_batches = []
    
    logging.info(f'artist features appended')
    return (
          f'DONE',
      )
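
To see what the artist component keeps from each API response, here is a plain-Python sketch of the flattening step. The sample artist is invented and `flatten_artist` is a hypothetical helper. The component itself does this with `pd.json_normalize` and takes `artist_uri` from the request list rather than from the response.

```python
from typing import Any, Dict

def flatten_artist(artist: Dict[str, Any]) -> Dict[str, Any]:
    """Keep the fields used downstream and join the genre list into one string."""
    return {
        "artist_uri": artist["uri"],
        "name": artist["name"],
        "popularity": artist["popularity"],
        "followers": artist["followers"]["total"],  # nested under followers.total
        "genres": "_".join(artist["genres"]),       # list of genres -> single string
    }

# Invented record shaped like one entry of the `artists` response list.
sample_artist = {
    "uri": "spotify:artist:123",
    "name": "Example Artist",
    "popularity": 71,
    "followers": {"href": None, "total": 123},
    "genres": ["hip hop", "rap"],
}
flat = flatten_artist(sample_artist)
print(flat["genres"], flat["followers"])
```

Note that genre names can contain spaces, so the `_` separator only marks boundaries between genres, not between words.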

## Build Pipeline

In [86]:
from typing import Dict

@kfp.v2.dsl.pipeline(
  name=f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-')
)
def pipeline(
    project: str,
    location: str,
    unique_table: str,
    target_table_audio: str,
    target_table_artist: str,
    batch_size: int,
    batches_to_store: int,
    sleep_param: float,
    spotify_id: str = spotify_creds['id'],
    spotify_secret: str = spotify_creds['secret'],
    ):


    # call_spotify_api_artist_op = call_spotify_api_artist(
    #     project=project,
    #     location=location,
    #     client_id=spotify_id,
    #     client_secret=spotify_secret,
    #     batch_size=batch_size,
    #     sleep_param=sleep_param,
    #     unique_table=unique_table,
    #     target_table=target_table_artist,
    #     batches_to_store=batches_to_store,
    # ).set_display_name("Get Artist Features From Spotify API")

    call_spotify_api_audio_op = call_spotify_api_audio(
        project=project,
        location=location,
        client_id=spotify_id,
        client_secret=spotify_secret,
        batch_size=batch_size,
        sleep_param=sleep_param,
        unique_table=unique_table,
        target_table=target_table_audio,
        batches_to_store=batches_to_store,
    ).set_display_name("Get Track Audio Features From Spotify API")#.after(call_spotify_api_artist_op)

### Compile the pipeline to json
This can be stored on GCS as well for broader orchestration purposes

In [87]:
kfp.v2.compiler.Compiler().compile(
  pipeline_func=pipeline, 
  package_path='custom_container_pipeline_spec.json',
)

### Set the pipeline parameters

Use a dictionary whose values match the parameter types defined by your pipeline

In [88]:
# jtotten-project #
GCS_BUCKET = 'matching-engine-content'

ideal_batch_size = 200_000
bts = ideal_batch_size / 50

PIPELINE_PARAMETERS = dict(
    project = PROJECT_ID,
    location = 'us-central1',
    unique_table = f'{PROJECT_ID}.{BQ_DATASET}.tracks_unique', 
    target_table_audio = f'{PROJECT_ID}.{BQ_DATASET}.audio_features',
    target_table_artist = f'{PROJECT_ID}.{BQ_DATASET}.artist_features',
    batch_size = 50,
    batches_to_store = 2, # int(bts),
    sleep_param = 0.05,
)

PIPELINE_PARAMETERS

{'project': 'hybrid-vertex',
 'location': 'us-central1',
 'unique_table': 'hybrid-vertex.spotify_e2e_test.tracks_unique',
 'target_table_audio': 'hybrid-vertex.spotify_e2e_test.audio_features',
 'target_table_artist': 'hybrid-vertex.spotify_e2e_test.artist_features',
 'batch_size': 50,
 'batches_to_store': 2,
 'sleep_param': 0.05}
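
A note on the `bts` arithmetic: `batches_to_store` is declared as an `int` in the pipeline signature, and true division returns a float, which is presumably why the commented-out value is wrapped in `int(...)`. The sketch below compares the two division operators. `api_batch_size` is just an illustrative name for the 50 used above.

```python
ideal_batch_size = 200_000  # rows to hold in memory before each BigQuery write
api_batch_size = 50         # URIs per Spotify API call

bts_float = ideal_batch_size / api_batch_size  # true division gives a float
bts_int = ideal_batch_size // api_batch_size   # floor division gives an int

print(bts_float, type(bts_float).__name__)
print(bts_int, type(bts_int).__name__)
```

Floor division keeps the value an integer without an explicit cast.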

In [89]:
job = aiplatform.PipelineJob(display_name = f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-'),
                             template_path = 'custom_container_pipeline_spec.json',
                             pipeline_root = f'gs://{BUCKET_NAME}/{VERSION}',
                             parameter_values = PIPELINE_PARAMETERS,
                             project = PROJECT_ID,
                             location = LOCATION,
                              enable_caching=False)

job.submit()

Creating PipelineJob
PipelineJob created. Resource name: projects/934903580331/locations/us-central1/pipelineJobs/spotify-feature-enrichment-v10-spotify-feature-enrich-20230228012842
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/934903580331/locations/us-central1/pipelineJobs/spotify-feature-enrichment-v10-spotify-feature-enrich-20230228012842')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/spotify-feature-enrichment-v10-spotify-feature-enrich-20230228012842?project=934903580331


The pipeline should look like this. It can take some time to run, depending on your parameters

![](img/feature-enrich-pipeline.png)

#### [The next notebook](01-bq-data-prep.ipynb) will finish feature prep now that all the data is loaded in BQ