# Part 2: Process the item embedding data in BigQuery and export it to Cloud Storage

This notebook is the second of five notebooks that guide you through running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to complete the following tasks:

1. Process the item embeddings data in BigQuery to generate a single embedding vector for each item.
1. Use a Dataflow pipeline to write the embedding vector data to CSV files and export the files to a Cloud Storage bucket.

Before starting this notebook, you must run the [01_train_bqml_mf_pmi](01_train_bqml_mf_pmi.ipynb) notebook to calculate item PMI data and then train a matrix factorization model with it.

After completing this notebook, run the [03_create_embedding_lookup_model](03_create_embedding_lookup_model.ipynb) notebook to create a model to serve the item embedding data.



## Setup

Import the required libraries, configure the environment variables, and authenticate your GCP account.



In [32]:
!pip install -q tensorflow==2.5.0 --user 

In [33]:
!pip install -U -q apache-beam[gcp] --user

### Import libraries

In [2]:
import os
import numpy as np
# import tensorflow_io as tf_io
import apache_beam as beam
from datetime import datetime
from google.cloud import storage


### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `REGION`: The region to use for the Dataflow job.

In [3]:
PROJECT_ID = 'rec-ai-demo-326116' # Change to your project.
BUCKET = 'rec_bq_jsw' # Change to the bucket you created.
REGION = 'us-central1'
BQ_DATASET_NAME = 'css_retail'

!gcloud config set project $PROJECT_ID

Updated property [core/project].
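
Because a `BUCKET` value that includes the `gs://` prefix produces malformed paths later in the notebook, it can help to normalize the value once up front. The following is a minimal sketch; the helper name `normalize_bucket_name` is hypothetical and is not part of this solution's code.

```python
def normalize_bucket_name(value):
    """Return a bare bucket name, e.g. 'myBucket' for 'gs://myBucket/'."""
    name = value.strip()
    # Drop the URI scheme if the caller supplied a full gs:// path.
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    # Anything with a remaining slash is an object path, not a bucket name.
    if not name or "/" in name:
        raise ValueError(f"Expected a bucket name, got: {value!r}")
    return name
```

For example, `BUCKET = normalize_bucket_name(BUCKET)` leaves a correct value unchanged and strips the prefix from `gs://myBucket`.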


### Authenticate your GCP account
This is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [20]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab; assume the environment is already authenticated.
  pass

## Process the item embeddings data

Run the [sp_ExractEmbeddings](sql_scripts/sp_ExractEmbeddings.sql) stored procedure to process the item embeddings data and write the results to the `item_embeddings` table.

This stored procedure works as follows:

1. Uses the [ML.WEIGHTS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-weights) function to extract the item embedding matrices from the `item_matching_model` model.
1. Aggregates these matrices to generate a single embedding vector for each item.

    Because BigQuery ML matrix factorization models are designed for user-item recommendation use cases, they generate two embedding matrices: one for users and the other for items. In this use case, however, both embedding matrices represent items, each along a different axis of the feedback matrix. For more information about how the feedback matrix is calculated, see [Understanding item embeddings](https://cloud.google.com/solutions/real-time-item-matching#understanding_item_embeddings).
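
To make the aggregation step concrete, the following is a minimal Python sketch, not the stored procedure itself. It assumes that the two embedding rows available for each item (one per axis of the feedback matrix) are combined by element-wise summation, and that the bias terms are summed as well; the authoritative logic is in `sql_scripts/sp_ExractEmbeddings.sql`. The function name `aggregate_embeddings` and the sample values are hypothetical.

```python
def aggregate_embeddings(rows):
    """Merge (item_id, factor_weights, bias) rows into one entry per item.

    Assumption: rows that share an item ID are combined by element-wise sum.
    """
    vectors = {}
    biases = {}
    for item_id, factor_weights, bias in rows:
        if item_id in vectors:
            # Add this row's weights to the running vector, factor by factor.
            vectors[item_id] = [
                a + b for a, b in zip(vectors[item_id], factor_weights)
            ]
            biases[item_id] += bias
        else:
            vectors[item_id] = list(factor_weights)
            biases[item_id] = bias
    return {item_id: (vectors[item_id], biases[item_id]) for item_id in vectors}


# Hypothetical weights: each item appears once per axis of the feedback matrix.
sample_rows = [
    ("1537", [0.5, -1.0], 1.0),
    ("1537", [0.25, 2.0], 0.5),
    ("279", [1.0, 1.0], 2.0),
    ("279", [-1.0, 0.5], 0.25),
]
item_embeddings = aggregate_embeddings(sample_rows)
```

Each item ends up with a single vector of the same dimensionality as the model's factors, which is the shape that the `item_embeddings` table stores.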


### Run the `sp_ExractEmbeddings` stored procedure

In [4]:
%%bigquery --project $PROJECT_ID

CALL css_retail.sp_ExractEmbeddings() 



Get a count of the records in the `item_embeddings` table:

In [5]:
%%bigquery --project $PROJECT_ID

SELECT COUNT(*) embedding_count
FROM css_retail.item_embeddings;



|   | embedding_count |
|---|---|
| 0 | 2933 |


See a sample of the data in the `item_embeddings` table:

In [6]:
%%bigquery --project $PROJECT_ID

SELECT *
FROM css_retail.item_embeddings
LIMIT 5;



|   | item_Id | embedding | bias |
|---|---|---|---|
| 0 | 1537 | [-0.8755925302815449, -19.211703889626925, 2.5... | 2.888588 |
| 1 | 19971 | [-10.195254169694085, -30.868551705240407, 1.8... | 3.336067 |
| 2 | 14344 | [-15.008873697815245, -14.252017324271174, 18.... | 3.141889 |
| 3 | 6677 | [-0.5205412664424082, -9.694137743083775, 0.86... | 2.5674 |
| 4 | 279 | [0.5785013221257957, -11.059790907075458, 8.05... | 4.809565 |


## Export the item embedding vector data

Export the item embedding data to Cloud Storage by using a Dataflow pipeline. This pipeline does the following:

1. Reads the item embedding records from the `item_embeddings` table in BigQuery.
1. Writes each item embedding record to a CSV file.
1. Writes the item embedding CSV files to a Cloud Storage bucket.

The pipeline is implemented in the [embeddings_exporter/pipeline.py](embeddings_exporter/pipeline.py) module.
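
As an illustration of the record-to-CSV step, the following sketch converts one `item_embeddings` record to a CSV line and parses it back. It assumes that each line holds the item ID followed by the embedding values and that the bias is not written; the actual format is defined in `embeddings_exporter/pipeline.py`. The function names `to_csv_line` and `from_csv_line` are hypothetical.

```python
def to_csv_line(record):
    """Format a record as '<item_Id>,<v0>,<v1>,...' (assumed layout)."""
    values = ",".join(str(value) for value in record["embedding"])
    return f"{record['item_Id']},{values}"


def from_csv_line(line):
    """Parse a line produced by to_csv_line into (item_id, embedding)."""
    item_id, *values = line.strip().split(",")
    return item_id, [float(value) for value in values]


# Hypothetical record shaped like a row of the item_embeddings table.
record = {"item_Id": 1537, "embedding": [-0.5, 2.25], "bias": 2.888588}
line = to_csv_line(record)
parsed_id, parsed_embedding = from_csv_line(line)
```

Keeping one item per line means a downstream reader can rebuild each vector without holding the whole file in memory.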

### Configure the pipeline variables

Configure the variables needed by the pipeline:

In [7]:
runner = 'DataflowRunner'
timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks-bqml-export-embeddings-{timestamp}'
bq_dataset_name = BQ_DATASET_NAME
embeddings_table_name = 'item_embeddings'
output_dir = f'gs://{BUCKET}/bqml/item_embeddings'
project = PROJECT_ID
temp_location = os.path.join(output_dir, 'tmp')
region = REGION

print(f'runner: {runner}')
print(f'job_name: {job_name}')
print(f'bq_dataset_name: {bq_dataset_name}')
print(f'embeddings_table_name: {embeddings_table_name}')
print(f'output_dir: {output_dir}')
print(f'project: {project}')
print(f'temp_location: {temp_location}')
print(f'region: {region}')

runner: DataflowRunner
job_name: ks-bqml-export-embeddings-210921192453
bq_dataset_name: css_retail
embeddings_table_name: item_embeddings
output_dir: gs://rec_bq_jsw/bqml/item_embeddings
project: rec-ai-demo-326116
temp_location: gs://rec_bq_jsw/bqml/item_embeddings/tmp
region: us-central1


In [8]:
# Switch to the pipeline directory unless the notebook is already running there.
if os.path.isdir('embeddings_exporter'):
    os.chdir('embeddings_exporter')

### Run the pipeline

It takes about 5 minutes to run the pipeline. You can see the graph for the running pipeline in the [Dataflow Console](https://console.cloud.google.com/dataflow/jobs).

In [9]:


!python runner.py \
  --runner={runner} \
  --job_name={job_name} \
  --bq_dataset_name={bq_dataset_name} \
  --embeddings_table_name={embeddings_table_name} \
  --output_dir={output_dir} \
  --project={project} \
  --temp_location={temp_location} \
  --region={region}






### List the CSV files that were written to Cloud Storage

In [11]:
!gsutil rm -r {temp_location} # Clean up the intermediate pipeline files only.

Removing gs://rec_bq_jsw/bqml/item_embeddings/tmp/ks-bqml-export-embeddings-210921192453.1632252311.148268/apache_beam-2.32.0-cp37-cp37m-manylinux1_x86_64.whl#1632252312130967...
Removing gs://rec_bq_jsw/bqml/item_embeddings/tmp/ks-bqml-export-embeddings-210921192453.1632252311.148268/dataflow_python_sdk.tar#1632252311660714...
Removing gs://rec_bq_jsw/bqml/item_embeddings/tmp/ks-bqml-export-embeddings-210921192453.1632252311.148268/pipeline.pb#1632252312184432...
Removing gs://rec_bq_jsw/bqml/item_embeddings/tmp/ks-bqml-export-embeddings-210921192453.1632252311.148268/workflow.tar.gz#1632252311711157...
/ [4 objects]                                                                   
Operation completed over 4 objects.                                              


In [12]:
!gsutil ls {output_dir}/embeddings-*.csv

gs://rec_bq_jsw/bqml/item_embeddings/embeddings-00000-of-00001.csv
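
The same listing can be produced from Python with the `google.cloud.storage` client that this notebook imports. The following is a minimal sketch; the helper name `list_embedding_files` is hypothetical, and it takes the client as a parameter so that it can be exercised without credentials.

```python
def list_embedding_files(client, bucket_name, prefix):
    """Return sorted gs:// URIs of the CSV objects under the given prefix.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return sorted(
        f"gs://{bucket_name}/{blob.name}"
        for blob in blobs
        if blob.name.endswith(".csv")
    )


# Usage in this notebook (requires an authenticated environment):
# files = list_embedding_files(
#     storage.Client(project=PROJECT_ID), BUCKET, "bqml/item_embeddings/embeddings-")
```

Filtering on the `.csv` suffix excludes any temporary objects that share the prefix.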


## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**