# Part 2: Process the item embedding data in BigQuery and export it to Cloud Storage

This notebook is the second of five notebooks that guide you through running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to complete the following tasks:

1. Process the item embeddings data in BigQuery to generate a single embedding vector for each item.
1. Use a Dataflow pipeline to write the embedding vector data to CSV files and export the files to a Cloud Storage bucket. 

Before starting this notebook, you must run the [01_train_bqml_mf_pmi](01_train_bqml_mf_pmi.ipynb) notebook to calculate item PMI data and then train a matrix factorization model with it.

## Setup

Import the required libraries, configure the environment variables, and authenticate your GCP account.



### Import libraries

In [1]:
import os
import numpy as np
import apache_beam as beam
from datetime import datetime
from google.cloud import storage

### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `REGION`: The region to use for the Dataflow job.

In [17]:
PROJECT_ID = 'vertex-stuff' # Change to your project.
BUCKET = 'jsw-matching-engine' # Change to the bucket you created.
REGION = 'us-central1'
BQ_DATASET_NAME = 'css_retail'

!gcloud config set project $PROJECT_ID

Updated property [core/project].
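
Because `BUCKET` must be the bare bucket name, a small check can catch a pasted `gs://` URI before it ends up in later paths. The helper below is a minimal sketch and not part of the solution code; the name `normalize_bucket_name` is hypothetical.

```python
def normalize_bucket_name(value: str) -> str:
    """Return a bare bucket name, dropping a leading gs:// and trailing slashes."""
    name = value.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    # Anything still containing a slash is an object path, not a bucket name.
    if not name or "/" in name:
        raise ValueError(f"Expected a bucket name, got: {value!r}")
    return name


print(normalize_bucket_name("gs://myBucket"))  # myBucket
```

If you use it, you could wrap the assignment as `BUCKET = normalize_bucket_name(BUCKET)` right after setting the variable.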


### Authenticate your GCP account
This is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [20]:
try:
    from google.colab import auth
    auth.authenticate_user()
    print("Colab user is authenticated.")
except ImportError:
    # Not running in Colab, so no explicit authentication step is needed here.
    pass

## Process the item embeddings data

You run the [sp_ExractEmbeddings](sql_scripts/sp_ExractEmbeddings.sql) stored procedure to process the item embeddings data and write the results to the `item_embeddings` table.

This stored procedure works as follows:

1. Uses the [ML.WEIGHTS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-weights) function to extract the item embedding matrices from the `item_matching_model` model.
1. Aggregates these matrices to generate a single embedding vector for each item.

    Because BigQuery ML matrix factorization models are designed for user-item recommendation use cases, they generate two embedding matrices, one for users and the other for items. In this use case, however, both embedding matrices represent items, each along a different axis of the feedback matrix. For more information about how the feedback matrix is calculated, see [Understanding item embeddings](https://cloud.google.com/solutions/real-time-item-matching#understanding_item_embeddings).
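
To make the aggregation step concrete, the following sketch combines two toy factor matrices into one vector per item. The element-wise sum is an assumption made for illustration; the actual aggregation is defined in the stored procedure's SQL, which is not shown in this notebook. The matrices and the name `combine_item_factors` are hypothetical.

```python
import numpy as np

# Hypothetical toy factors: 3 items, 4 latent dimensions.
# Each item appears once on the row axis and once on the column axis
# of the feedback matrix, so it has two factor vectors.
row_factors = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.0, -1.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
])
col_factors = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 2.0, 3.0, 4.0],
])


def combine_item_factors(rows, cols):
    # Assumption: aggregate the two vectors of each item by element-wise sum.
    return rows + cols


item_embeddings = combine_item_factors(row_factors, col_factors)
print(item_embeddings.shape)  # (3, 4)
```

The result has one row per item and one column per latent dimension, which matches the shape of the `embedding` column that the stored procedure writes.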


### Run the `sp_ExractEmbeddings` stored procedure

In [19]:
import json
params = {"dataset" : BQ_DATASET_NAME}

In [20]:
%%bigquery --project $PROJECT_ID --params $params

CALL css_retail.sp_ExractEmbeddings() 



Get a count of the records in the `item_embeddings` table:

In [21]:
%%bigquery --project $PROJECT_ID

SELECT COUNT(*) embedding_count
FROM css_retail.item_embeddings;



| | embedding_count |
|---|---|
| 0 | 2933 |


See a sample of the data in the `item_embeddings` table:

In [22]:
%%bigquery --project $PROJECT_ID

SELECT *
FROM css_retail.item_embeddings
LIMIT 5;



| | item_Id | embedding | bias |
|---|---|---|---|
| 0 | 2565 | [-1.7000731697684879, 4.32766909722082, -8.684... | 0.296106 |
| 1 | 23047 | [-16.45777759001077, 14.336397805487517, -2.60... | -0.399879 |
| 2 | 4363 | [-0.22027002165007695, 4.4951353161154834, -8.... | 1.32686 |
| 3 | 7182 | [-2.699671782372556, -0.7021591058614676, -10.... | -0.605113 |
| 4 | 16404 | [0.22598407970649526, 6.216626707381438, 0.529... | 0.318354 |


## Export the item embedding vector data

Export the item embedding data to Cloud Storage by using a Dataflow pipeline. This pipeline does the following:

1. Reads the item embedding records from the `item_embeddings` table in BigQuery.
1. Writes each item embedding record to a CSV file.
1. Writes the item embedding CSV files to a Cloud Storage bucket.

The pipeline is implemented in the [embeddings_exporter/pipeline.py](embeddings_exporter/pipeline.py) module.
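
As a rough illustration of the CSV layout, the sketch below formats one record as the item ID followed by the embedding values, all comma-separated. It runs locally without Beam; the sample record and the name `to_csv_line` are hypothetical.

```python
def to_csv_line(record):
    """Format one record as 'item_Id,v1,v2,...'."""
    item_id = record["item_Id"]
    values = ",".join(str(value) for value in record["embedding"])
    return f"{item_id},{values}"


# Hypothetical record with a 3-dimensional embedding.
sample = {"item_Id": "2565", "embedding": [-1.5, 4.25, 0.5]}
line = to_csv_line(sample)
print(line)  # 2565,-1.5,4.25,0.5
```

Each output line therefore has one more field than the embedding has dimensions, and the files carry no header row.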

### Configure the pipeline variables

Configure the variables needed by the pipeline:

In [23]:
runner = 'DataflowRunner'
timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks-bqml-export-embeddings-{timestamp}'
bq_dataset_name = BQ_DATASET_NAME
embeddings_table_name = 'item_embeddings'
output_dir = f'gs://{BUCKET}/bqml/item_embeddings'
project = PROJECT_ID
temp_location = os.path.join(output_dir, 'tmp')
region = REGION

print(f'runner: {runner}')
print(f'job_name: {job_name}')
print(f'bq_dataset_name: {bq_dataset_name}')
print(f'embeddings_table_name: {embeddings_table_name}')
print(f'output_dir: {output_dir}')
print(f'project: {project}')
print(f'temp_location: {temp_location}')
print(f'region: {region}')

runner: DataflowRunner
job_name: ks-bqml-export-embeddings-210930213013
bq_dataset_name: css_retail
embeddings_table_name: item_embeddings
output_dir: gs://jsw-matching-engine/bqml/item_embeddings
project: vertex-stuff
temp_location: gs://jsw-matching-engine/bqml/item_embeddings/tmp
region: us-central1


In [24]:
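try:
    os.chdir(os.path.join(os.getcwd(), 'embeddings_exporter'))
except OSError:
    # The directory is missing or is already the working directory's parent; stay where we are.
    pass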

### Run the pipeline

It takes about 5 minutes to run the pipeline. You can see the graph for the running pipeline in the [Dataflow Console](https://console.cloud.google.com/dataflow/jobs).

In [36]:
def export_embeddings_pipeline(args):
    """Reads item embeddings from BigQuery and writes them to Cloud Storage as CSV.

    Relies on the notebook variables PROJECT_ID, BQ_DATASET_NAME,
    embeddings_table_name, and output_dir that are set in earlier cells.
    """
    pipeline_options = beam.options.pipeline_options.PipelineOptions(**args)
    with beam.Pipeline(options=pipeline_options) as p:
        def get_query(dataset_name, table_name):
            # Select only the columns that the export needs.
            query = f'''
            SELECT 
                item_Id,
                embedding
            FROM 
                `{dataset_name}.{table_name}`;
            '''
            return query

        def to_csv(entry):
            # Produce one line per item: item_Id followed by the embedding values.
            item_Id = entry['item_Id']
            embedding = entry['embedding']
            csv_string = f'{item_Id},'
            csv_string += ','.join([str(value) for value in embedding])
            return csv_string

        query = get_query(BQ_DATASET_NAME, embeddings_table_name)
        # Output files are named <output_dir>/embeddings-<shard>.csv.
        output_prefix = os.path.join(output_dir, 'embeddings')

        _ = (
            p
            | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
                project=PROJECT_ID, query=query, use_standard_sql=True, flatten_results=False)
            | 'ConvertToCsv' >> beam.Map(to_csv)
            | 'WriteToCloudStorage' >> beam.io.WriteToText(
                file_path_prefix=output_prefix,
                file_name_suffix='.csv')
        )

In [37]:
import os
from datetime import datetime

RUNNER = 'DataflowRunner'

job_name = f'extract-embeddings-{datetime.utcnow().strftime("%y%m%d%H%M%S")}'

args = {
    'job_name': job_name,
    'runner': RUNNER,
    'project': PROJECT_ID,
    'temp_location': f'gs://{BUCKET}/dataflow_tmp',
    'region': REGION
}

print("Pipeline args are set.")

Pipeline args are set.


In [None]:
print("Running pipeline...")
%time export_embeddings_pipeline(args)
print("Pipeline is done.")



Running pipeline...




### List the CSV files that were written to Cloud Storage

In [None]:
!gsutil ls {output_dir}/embeddings-*.csv
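
To sanity-check an exported file after downloading it, you can parse its lines back into an item ID and a vector. The following is a minimal sketch that assumes each line is the item ID followed by the comma-separated embedding values; the sample line and the name `parse_embedding_line` are hypothetical.

```python
import numpy as np


def parse_embedding_line(line):
    """Split one CSV line into (item_id, embedding vector)."""
    item_id, *values = line.strip().split(",")
    vector = np.array([float(value) for value in values], dtype=np.float64)
    return item_id, vector


# Hypothetical line in the exported format.
item_id, vector = parse_embedding_line("2565,-1.5,4.25,0.5\n")
print(item_id, vector.shape)  # 2565 (3,)
```

The item ID is kept as a string here, so convert it explicitly if your downstream code expects integers.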

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**