# Part 2: Process the item embedding data in BigQuery and export it to Cloud Storage

This notebook is the second of five notebooks that guide you through running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to complete the following tasks:

1. Process the song embeddings data in BigQuery to generate a single embedding vector for each song.
1. Use a Dataflow pipeline to write the embedding vector data to CSV files and export the files to a Cloud Storage bucket. 

Before starting this notebook, you must run the [01_train_bqml_mf_pmi](01_train_bqml_mf_pmi.ipynb) notebook to calculate item PMI data and then train a matrix factorization model with it.

After completing this notebook, run the [03_create_embedding_lookup_model](03_create_embedding_lookup_model.ipynb) notebook to create a model to serve the item embedding data.



## Setup

Import the required libraries, configure the environment variables, and authenticate your GCP account.



In [11]:
!pip install tensorflow --upgrade


In [None]:
!pip install -U -q apache-beam[gcp]

### Import libraries

In [1]:
import os
import numpy as np
from tensorflow import io as tf_io  # Provides the tf_io.gfile helpers used below.
import apache_beam as beam
from datetime import datetime


### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `REGION`: The region to use for the Dataflow job.
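
Because `BUCKET` is later interpolated into a `gs://{BUCKET}/...` path, a value that already includes the scheme would produce a malformed URI. The following is a minimal sketch of a guard against that; `normalize_bucket_name` is a hypothetical helper and is not part of the solution code.

```python
def normalize_bucket_name(bucket):
  """Return a bare bucket name, dropping any 'gs://' prefix and slashes."""
  name = bucket.strip()
  if name.startswith('gs://'):
    name = name[len('gs://'):]
  name = name.strip('/')
  # Reject empty values and values that still contain an object path.
  if not name or '/' in name:
    raise ValueError(f'Expected a bare bucket name, got: {bucket!r}')
  return name


print(normalize_bucket_name('gs://myBucket'))  # myBucket
```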

In [None]:
PROJECT_ID = 'yourProject' # Change to your project.
BUCKET = 'yourBucketName' # Change to the bucket you created.
REGION = 'yourDataflowRegion' # Change to your Dataflow region.
BQ_DATASET_NAME = 'recommendations'

!gcloud config set project $PROJECT_ID

### Authenticate your GCP account
This is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  pass  # Not running in Colab; credentials are assumed to be configured already.

## Process the item embeddings data

Run the [sp_ExractEmbeddings](sql_scripts/sp_ExractEmbeddings.sql) stored procedure to process the item embeddings data and write the results to the `item_embeddings` table.

This stored procedure works as follows:

1. Uses the [ML.WEIGHTS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-weights) function to extract the item embedding matrices from the `item_matching_model` model.
1. Aggregates these matrices to generate a single embedding vector for each item.

    Because BigQuery ML matrix factorization models are designed for user-item recommendation use cases, they generate two embedding matrices: one for users and one for items. In this use case, however, both embedding matrices represent items, each along a different axis of the feedback matrix. For more information about how the feedback matrix is calculated, see [Understanding item embeddings](https://cloud.google.com/solutions/real-time-item-matching#understanding_item_embeddings).
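
As an illustration of the aggregation step, the following sketch combines the two vectors learned for each item into one. It assumes the combination is an element-wise sum, which may not match the stored procedure exactly; the authoritative logic is in the `sp_ExractEmbeddings` SQL script. The item IDs, values, and the `combine_embeddings` helper are made up for the example.

```python
# Hypothetical toy factors: each item has one vector per axis of the feedback matrix.
row_embeddings = {'item_a': [0.1, 0.2], 'item_b': [0.5, -0.3]}
col_embeddings = {'item_a': [0.3, 0.0], 'item_b': [-0.1, 0.4]}


def combine_embeddings(rows, cols):
  """Sum the row-axis and column-axis vectors for items present in both."""
  return {
      item: [r + c for r, c in zip(rows[item], cols[item])]
      for item in rows
      if item in cols
  }


item_embeddings = combine_embeddings(row_embeddings, col_embeddings)
print(item_embeddings['item_a'])
```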


### Run the `sp_ExractEmbeddings` stored procedure

In [None]:
%%bigquery --project $PROJECT_ID

CALL recommendations.sp_ExractEmbeddings() 

Get a count of the records in the `item_embeddings` table:

In [None]:
%%bigquery --project $PROJECT_ID

SELECT COUNT(*) embedding_count
FROM recommendations.item_embeddings;

See a sample of the data in the `item_embeddings` table:

In [None]:
%%bigquery --project $PROJECT_ID

SELECT *
FROM recommendations.item_embeddings
LIMIT 5;

## Export the item embedding vector data

Export the item embedding data to Cloud Storage by using a Dataflow pipeline. This pipeline does the following:

1. Reads the item embedding records from the `item_embeddings` table in BigQuery.
1. Writes each item embedding record to a CSV file.
1. Writes the item embedding CSV files to a Cloud Storage bucket.

The pipeline is implemented in the [embeddings_exporter/pipeline.py](embeddings_exporter/pipeline.py) module.
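
The CSV-writing step can be pictured as a per-record formatting function. The sketch below assumes that each record is a dictionary with `item_Id` and `embedding` keys, and that each output line holds the item ID followed by the vector values. Both are assumptions, and `to_csv_line` is an illustrative name; see `embeddings_exporter/pipeline.py` for the actual implementation.

```python
def to_csv_line(record):
  """Format one embedding record as 'item_id,v1,v2,...'."""
  values = ','.join(str(value) for value in record['embedding'])
  return f"{record['item_Id']},{values}"


# Hypothetical record, shaped like a row read from the item_embeddings table.
print(to_csv_line({'item_Id': '2114406', 'embedding': [0.25, -1.5, 3.0]}))
```

In a Beam pipeline, a function like this would typically be applied with `beam.Map` between the BigQuery read and the text write.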

### Configure the pipeline variables

Configure the variables needed by the pipeline:

In [None]:
runner = 'DataflowRunner'
timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks-bqml-export-embeddings-{timestamp}'
bq_dataset_name = BQ_DATASET_NAME
embeddings_table_name = 'item_embeddings'
output_dir = f'gs://{BUCKET}/bqml/item_embeddings'
project = PROJECT_ID
temp_location = os.path.join(output_dir, 'tmp')
region = REGION

print(f'runner: {runner}')
print(f'job_name: {job_name}')
print(f'bq_dataset_name: {bq_dataset_name}')
print(f'embeddings_table_name: {embeddings_table_name}')
print(f'output_dir: {output_dir}')
print(f'project: {project}')
print(f'temp_location: {temp_location}')
print(f'region: {region}')

In [None]:
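try:
  # Run the pipeline from the directory that contains runner.py.
  os.chdir(os.path.join(os.getcwd(), 'embeddings_exporter'))
except OSError:
  pass  # Most likely already inside the embeddings_exporter directory.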

### Run the pipeline

It takes about 5 minutes to run the pipeline. You can see the graph for the running pipeline in the [Dataflow Console](https://console.cloud.google.com/dataflow/jobs).

In [None]:
if tf_io.gfile.exists(output_dir):
  print("Removing {} contents...".format(output_dir))
  tf_io.gfile.rmtree(output_dir)

print("Creating output: {}".format(output_dir))
tf_io.gfile.makedirs(output_dir)

!python runner.py \
  --runner={runner} \
  --job_name={job_name} \
  --bq_dataset_name={bq_dataset_name} \
  --embeddings_table_name={embeddings_table_name} \
  --output_dir={output_dir} \
  --project={project} \
  --temp_location={temp_location} \
  --region={region}

### List the CSV files that were written to Cloud Storage

In [None]:
!gsutil ls {output_dir}/embeddings-*.csv
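
To spot-check an exported file, each line can be parsed back into an ID and a vector. This sketch assumes the line layout is the item ID followed by comma-separated float values, which should be confirmed against the actual files; `parse_embedding_line` is a hypothetical helper, and the sample line is made up.

```python
def parse_embedding_line(line):
  """Split 'item_id,v1,v2,...' into (item_id, [v1, v2, ...])."""
  item_id, *values = line.strip().split(',')
  return item_id, [float(value) for value in values]


# Hypothetical line, as it might appear in one of the exported CSV files.
item_id, embedding = parse_embedding_line('2114406,0.25,-1.5,3.0\n')
print(item_id, len(embedding))
```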

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**