# Import the sample data into BigQuery and Datastore

This notebook is the first of two notebooks that guide you through completing the prerequisites for running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to complete the following tasks:

1. Import the `products` table.
1. Create the `vw_item_groups` view that contains the item data used to compute item co-occurrence.
1. Export product information to Datastore to make it available for lookup when making similar product recommendations.

Before starting this notebook, you must [set up the GCP environment](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann#set-up-the-gcp-environment).

## Setup

Install the required Python packages, configure the environment variables, and authenticate your GCP account. 

### Import libraries

In [1]:
import os
from datetime import datetime
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore

### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `BQ_REGION`: The region to use for the BigQuery dataset.
+ `DF_REGION`: The region to use for the Dataflow job. Choose the same region that you used for the `BQ_REGION` variable to avoid issues around reading/writing in different locations.

In [3]:
PROJECT_ID = 'USER-SET' # Change to your project.
BUCKET = 'USER-SET' # Change to the bucket you created.
BQ_REGION = 'USER-SET' # Change to your BigQuery region.
DF_REGION = 'USER-SET' # Change to your Dataflow region.
BQ_DATASET_NAME = 'css_retail'
BQ_TABLE_NAME = 'products'
DS_KIND = 'product'

!gcloud config set project $PROJECT_ID

Updated property [core/project].
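The two constraints described above (a bare bucket name, and matching BigQuery and Dataflow locations) can be checked before anything is launched. The following is a minimal sketch: `normalize_bucket` and `regions_match` are hypothetical helpers that are not part of this notebook, and the sample values are placeholders. Note that a multi-region BigQuery location such as `US` never matches a Dataflow region in this simple comparison, so treat a `False` result as a prompt to double-check rather than as an error.

```python
def normalize_bucket(bucket: str) -> str:
    """Return the bare bucket name, dropping a leading gs:// and a trailing slash."""
    name = bucket.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    # A bucket name is a single path component, so reject empty names and object paths.
    if not name or "/" in name:
        raise ValueError(f"Expected a bucket name, got {bucket!r}")
    return name


def regions_match(bq_region: str, df_region: str) -> bool:
    """Compare the two location strings, ignoring case and surrounding whitespace."""
    return bq_region.strip().lower() == df_region.strip().lower()


print(normalize_bucket("gs://myBucket"))
print(regions_match("us-central1", "US-CENTRAL1"))
```

You could apply these helpers to `BUCKET`, `BQ_REGION`, and `DF_REGION` right after setting them.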


### Authenticate your GCP account
Authentication is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [4]:
try:
    from google.colab import auth
    auth.authenticate_user()
    print("Colab user is authenticated.")
except ImportError:
    pass  # Not running in Colab; assume the environment is already authenticated.

### Create the BigQuery dataset

### Define the Dataflow pipeline

The pipeline selects products where the `track_data_title` field isn't NULL and the `track_data_id` field is greater than 0.

### Run the Dataflow pipeline

This pipeline takes approximately 15 minutes to run.

The data uses the `css_retail` dataset:
```sql
SELECT 
  USER_ID, 	
  ORDER_ID 
FROM 
  order_items
```

## Create the `vw_item_groups` view

Create the `css_retail.vw_item_groups` view to focus on product data.

To adapt this view to your own data, map your item identifier (for example, the product SKU) to `item_id`, and your context identifier (for example, the purchase order number) to `group_id`.

In [5]:
%%bigquery 

CREATE OR REPLACE VIEW `css_retail.vw_item_groups`
AS
SELECT 
  userInfo.userID as group_id, 
  pd.id as item_id 
FROM 
  `css_retail.purchase_complete`,
  UNNEST(productEventDetail.productDetails) as pd

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 969.11query/s] 
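To illustrate why each row of the view pairs a `group_id` with an `item_id`, the following sketch counts how often two items appear in the same group. It is only an illustration in plain Python: the rows are made-up sample data, `count_cooccurrences` is a hypothetical helper, and the solution's actual co-occurrence computation is not shown in this notebook.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (group_id, item_id) rows, shaped like the output of vw_item_groups.
rows = [
    ("user_1", "sku_a"),
    ("user_1", "sku_b"),
    ("user_1", "sku_c"),
    ("user_2", "sku_a"),
    ("user_2", "sku_b"),
]


def count_cooccurrences(rows):
    """Count, for each unordered item pair, the number of groups containing both."""
    groups = {}
    for group_id, item_id in rows:
        groups.setdefault(group_id, set()).add(item_id)

    counts = Counter()
    for items in groups.values():
        # Sort so that each pair is always keyed in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = count_cooccurrences(rows)
print(counts[("sku_a", "sku_b")])  # 2: both user_1 and user_2 have sku_a and sku_b
```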


## Export product information to Datastore

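Export the product attributes (cost, department, category, sub-category, name, and retail price) from the `css_retail.products` table to Datastore, using the product `id` as the entity key.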

### Define the Dataflow pipeline

In [6]:
def create_entity(product_info, kind):

    from apache_beam.io.gcp.datastore.v1new.types import Entity
    from apache_beam.io.gcp.datastore.v1new.types import Key

    product_id = product_info.pop("id")
    key = Key([kind, product_id])
    product_entity = Entity(key)
    product_entity.set_properties(product_info)
    return product_entity

def run_export_to_datatore_pipeline(args):

    query = f'''
      SELECT  
        id, 
        MAX(COST) cost, 
        MAX(DEPARTMENT) department,
        MAX(CATEGORY) category,
        MAX(SUB_CATEGORY) sub_category,
        MAX(NAME) name,
        MAX(RETAIL_PRICE) RETAIL_PRICE
      FROM 
        `{BQ_DATASET_NAME}.{BQ_TABLE_NAME}`
        GROUP BY ID
    '''

    pipeline_options = beam.options.pipeline_options.PipelineOptions(**args)
    with beam.Pipeline(options=pipeline_options) as pipeline:

      _ = (
        pipeline
        | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
            project=PROJECT_ID, query=query, use_standard_sql=True)
        | 'ConvertToDatastoreEntity' >> beam.Map(create_entity, DS_KIND)
        | 'WriteToDatastore' >> WriteToDatastore(project=PROJECT_ID)
      )
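The `create_entity` function above removes `id` from each row with `pop`, uses it in the key path, and stores the remaining fields as entity properties. The following sketch mirrors that split in plain Python, without Beam or Datastore, so it can be run anywhere. `split_key_and_properties` and the sample row are hypothetical; unlike `create_entity`, the sketch copies the row first so the caller's dictionary is left unchanged.

```python
def split_key_and_properties(product_info, kind):
    """Return ([kind, id], properties) without mutating the input row."""
    properties = dict(product_info)  # Shallow copy; the original row keeps its "id".
    product_id = properties.pop("id")
    return [kind, product_id], properties


# Hypothetical row, shaped like one result of the BigQuery query above.
row = {"id": 42, "name": "Hypothetical product", "cost": 9.5}
key_path, properties = split_key_and_properties(row, "product")

print(key_path)
print(properties)
```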


### Run the Dataflow pipeline

This pipeline takes approximately 15 minutes to run, although the run recorded in the output below took about 57 minutes.

In [10]:
import os
from datetime import datetime

DATASET = 'css_retail'
RUNNER = 'DataflowRunner'

job_name = f'load-datastore-{datetime.utcnow().strftime("%y%m%d%H%M%S")}'

args = {
    'job_name': job_name,
    'runner': RUNNER,
    'project': PROJECT_ID,
    'temp_location': f'gs://{BUCKET}/dataflow_tmp',
    'region': DF_REGION
}

print("Pipeline args are set.")

Pipeline args are set.
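The job name above is built from a fixed prefix and a UTC timestamp. The following sketch does the same with a timezone-aware timestamp (`datetime.utcnow()` is deprecated as of Python 3.12) and validates the result. `make_job_name` and `JOB_NAME_PATTERN` are hypothetical names, and the pattern encodes an assumed naming rule (lowercase letters, digits, and hyphens, starting with a letter and ending with a letter or digit), so check the Dataflow documentation for the authoritative rule.

```python
import re
from datetime import datetime, timezone

# Assumed naming rule, not taken from this notebook.
JOB_NAME_PATTERN = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def make_job_name(prefix, now=None):
    """Build '<prefix>-<yymmddHHMMSS>' from a UTC timestamp and validate it."""
    now = now or datetime.now(timezone.utc)
    name = f"{prefix}-{now.strftime('%y%m%d%H%M%S')}"
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid job name: {name!r}")
    return name


fixed = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(make_job_name("load-datastore", fixed))  # load-datastore-200102030405
```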


In [11]:
# Enable the service
!gcloud services enable datastore.googleapis.com dataflow.googleapis.com

### Also enable Datastore in the console

Operation "operations/acf.p2-424192748592-e78d75c3-58b0-46fd-ab0a-e87ea5319418" finished successfully.


In [12]:
print("Running pipeline...")
%time run_export_to_datatore_pipeline(args)
print("Pipeline is done.")

Running pipeline...




CPU times: user 5.15 s, sys: 267 ms, total: 5.41 s
Wall time: 57min 20s
Pipeline is done.


After running the pipeline, you can view the product entries on the [Datastore Entities page](https://console.cloud.google.com/datastore/entities):

<img src="figures/datastore.png" style="width:600px;"/> 

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**