# Preprocessing Google Analytics data for a neural network hybrid recommendation system

This notebook demonstrates how to implement a hybrid recommendation system on Google Analytics data, using a neural network to combine content-based and collaborative filtering recommendation models. We will combine the embeddings learned in [wals.ipynb](../wals.ipynb) with the content-based features from [content_based_using_neural_networks.ipynb](../content_based_using_neural_networks.ipynb).

First, we preprocess our data with BigQuery and Cloud Dataflow so that it can be used later in our neural network hybrid recommendation model.

Apache Beam only works in Python 2 at the moment, so we're going to switch to the Python 2 kernel. In the menu above, click the dropdown arrow and select `python2`.

In [1]:
%%bash
source activate py2env
pip uninstall -y google-cloud-dataflow
conda install -y pytz==2018.4
pip install apache-beam[gcp]

Uninstalling google-cloud-dataflow-2.0.0:
  Successfully uninstalled google-cloud-dataflow-2.0.0
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local/envs/py2env

  added / updated specs: 
    - pytz==2018.4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.0.2r             |       h7b6447c_0         3.2 MB  defaults
    certifi-2019.3.9           |           py27_0         155 KB  defaults
    ca-certificates-2019.1.23  |                0         126 KB  defaults
    ------------------------------------------------------------
                                           Total:         3.4 MB

The following packages will be UPDATED:

    ca-certificates: 2018.03.07-0      defaults --> 2019.1.23-0       defaults
    certifi:         2018.11.29-py27_0 defaults --> 2019.3.9-py27_0   defaults
    openssl:         1.0.2p-h14c3975_0 defaults 



  current version: 4.5.12
  latest version: 4.6.8

Please update conda by running

    $ conda update -n base -c defaults conda


openssl-1.0.2r       | 3.2 MB    | ########## | 100% 
certifi-2019.3.9     | 155 KB    | ########## | 100% 
ca-certificates-2019 | 126 KB    | ########## | 100% 
google-cloud-monitoring 0.28.0 has requirement google-api-core<0.2.0dev,>=0.1.1, but you'll have google-api-core 1.8.1 which is incompatible.
googledatastore 7.0.1 has requirement httplib2<0.10,>=0.9.1, but you'll have httplib2 0.11.3 which is incompatible.
Cannot uninstall 'dill'. It is a distutils installed project and thus we cannot accurately determine which files belong t

In [2]:
%%bash
conda update -n base -c defaults conda

Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs: 
    - conda


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1b             |       h7b6447c_1         4.0 MB  defaults
    conda-4.6.8                |           py27_0         1.6 MB  defaults
    ------------------------------------------------------------
                                           Total:         5.6 MB

The following packages will be UPDATED:

    ca-certificates: 2018.03.07-0      defaults --> 2019.1.23-0       defaults
    certifi:         2018.11.29-py27_0 defaults --> 2019.3.9-py27_0   defaults
    conda:           4.5.12-py27_0     defaults --> 4.6.8-py27_0      defaults
    openssl:         1.1.1a-h7b6447c_0 defaults --> 1.1.1b-h7b6447c_1 defaults

Proceed ([y]/n)? 

Downloading and Extracting Packages
Preparing transaction: ..

openssl-1.1.1b       | 4.0 MB    | ########## | 100% 
conda-4.6.8          | 1.6 MB    | ########## | 100% 


In [3]:
%%bash
pip install google-cloud-dataflow

Collecting google-cloud-dataflow
  Downloading https://files.pythonhosted.org/packages/72/29/3aaa67a276bcb07c3e85a6048b4d9610542082d262256cd6d232d6a8c00a/google-cloud-dataflow-2.5.0.tar.gz
Collecting apache-beam[gcp]==2.5.0 (from google-cloud-dataflow)
  Downloading https://files.pythonhosted.org/packages/ff/10/a59ba412f71fb65412ec7a322de6331e19ec8e75ca45eba7a0708daae31a/apache_beam-2.5.0-cp27-cp27mu-manylinux1_x86_64.whl (2.2MB)
Collecting httplib2<0.10,>=0.8 (from apache-beam[gcp]==2.5.0->google-cloud-dataflow)
  Downloading https://files.pythonhosted.org/packages/ff/a9/5751cdf17a70ea89f6dde23ceb1705bfb638fd8cee00f845308bf8d26397/httplib2-0.9.2.tar.gz (205kB)
Collecting typing<3.7.0,>=3.6.0 (from apache-beam[gcp]==2.5.0->google-cloud-dataflow)
  Using cached https://files.pythonhosted.org/packages/cc/3e/29f92b7aeda5b078c86d14f550bf85cff809042e3429ace7af6193c3bc9f/typing-3.6.6-py2-none-any.whl
Collecting hdfs<3.0.0,>=2.1.0 (from apache-beam[gcp]==2.5.0->google-cloud-dataflow)
Collecti

pandas-gbq 0.3.0 has requirement google-cloud-bigquery>=0.28.0, but you'll have google-cloud-bigquery 0.25.0 which is incompatible.
google-cloud-monitoring 0.28.0 has requirement google-cloud-core<0.29dev,>=0.28.0, but you'll have google-cloud-core 0.25.0 which is incompatible.
datalab 1.1.3 has requirement httplib2>=0.10.3, but you'll have httplib2 0.9.2 which is incompatible.


In [4]:
%%bash
source activate py2env
pip uninstall -y google-cloud-dataflow
conda install -y pytz==2018.4
pip install apache-beam[gcp]

Uninstalling google-cloud-dataflow-2.5.0:
  Successfully uninstalled google-cloud-dataflow-2.5.0
Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.



Now restart the notebook's session kernel!

In [1]:
# Import helpful libraries and setup our project, bucket, and region
import os

output = os.popen("gcloud config get-value project").readlines()
project_name = output[0][:-1]

# change these to try this notebook out
PROJECT = project_name
BUCKET = project_name
#BUCKET = BUCKET.replace("qwiklabs-gcp-", "inna-bckt-")
REGION = 'europe-west1'  ## note: Cloud ML Engine is not available in europe-west3!

print(PROJECT)
print(BUCKET)
print(REGION)

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

qwiklabs-gcp-ca86714dede2236a
qwiklabs-gcp-ca86714dede2236a
europe-west1


In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


<h2> Create ML dataset using Dataflow </h2>
Let's use Cloud Dataflow to read in the BigQuery data, do some preprocessing, and write it out as CSV files.

First, let's create the hybrid dataset query that we will use in our Cloud Dataflow pipeline. It combines some content-based features with the user and item embeddings learned in our WALS matrix factorization collaborative filtering lab, which we extracted from the trained WALSMatrixFactorization Estimator and uploaded to BigQuery.
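The label in this dataset, `next_content_id`, is the article a visitor viewed right after the current one; the query derives it with a `LEAD` window function partitioned by visitor and ordered by hit time. The following is only a minimal plain-Python sketch of that idea, not the query itself. The function name `next_content_pairs` and the toy tuples are hypothetical.

```python
from collections import defaultdict

def next_content_pairs(hits):
    """Pair each page view with the same visitor's next page view.

    `hits` is an iterable of (visitor_id, time, content_id) tuples.
    This mirrors LEAD(...) OVER (PARTITION BY visitor ORDER BY time):
    the last hit of each visitor has no successor and is dropped.
    """
    by_visitor = defaultdict(list)
    for visitor_id, time, content_id in hits:
        by_visitor[visitor_id].append((time, content_id))

    pairs = []
    for visitor_id, visits in by_visitor.items():
        visits.sort()  # order by hit time within one visitor
        for (_, current), (_, following) in zip(visits, visits[1:]):
            pairs.append((visitor_id, current, following))
    return pairs

# Hypothetical toy data, not taken from the GA sample.
toy_hits = [
    ("u1", 30, "c"), ("u1", 10, "a"), ("u1", 20, "b"),
    ("u2", 5, "x"),
]
print(next_content_pairs(toy_hits))
```

Visitor `u2` has a single page view, so it contributes no training row, which corresponds to the query's filter on a non-null next content id.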

In [3]:
query_hybrid_dataset = """
WITH CTE_site_history AS (
  SELECT
      fullVisitorId as visitor_id,
      (SELECT MAX(IF(index = 10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
      (SELECT MAX(IF(index = 7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
      (SELECT MAX(IF(index = 6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
      (SELECT MAX(IF(index = 2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
      SPLIT(RPAD((SELECT MAX(IF(index = 4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') AS year_month_array,
      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) AS nextCustomDimensions
  FROM 
    `cloud-training-demos.GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"
      AND
      fullVisitorId IS NOT NULL
      AND
      hits.time != 0
      AND
      hits.time IS NOT NULL
      AND
      (SELECT MAX(IF(index = 10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
),
CTE_training_dataset AS (
SELECT
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) AS next_content_id,
  
  visitor_id,
  content_id,
  category,
  REGEXP_REPLACE(title, r",", "") AS title,
  REGEXP_EXTRACT(author_list, r"^[^,]+") AS author,
  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970, 1, 1), MONTH) AS months_since_epoch
FROM
  CTE_site_history
WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL)

SELECT
  CAST(next_content_id AS STRING) AS next_content_id,
  
  CAST(training_dataset.visitor_id AS STRING) AS visitor_id,
  CAST(training_dataset.content_id AS STRING) AS content_id,
  CAST(IFNULL(category, 'None') AS STRING) AS category,
  CONCAT("\\"", REPLACE(TRIM(CAST(IFNULL(title, 'None') AS STRING)), "\\"",""), "\\"") AS title,
  CAST(IFNULL(author, 'None') AS STRING) AS author,
  CAST(months_since_epoch AS STRING) AS months_since_epoch,
  
  IFNULL(user_factors._0, 0.0) AS user_factor_0,
  IFNULL(user_factors._1, 0.0) AS user_factor_1,
  IFNULL(user_factors._2, 0.0) AS user_factor_2,
  IFNULL(user_factors._3, 0.0) AS user_factor_3,
  IFNULL(user_factors._4, 0.0) AS user_factor_4,
  IFNULL(user_factors._5, 0.0) AS user_factor_5,
  IFNULL(user_factors._6, 0.0) AS user_factor_6,
  IFNULL(user_factors._7, 0.0) AS user_factor_7,
  IFNULL(user_factors._8, 0.0) AS user_factor_8,
  IFNULL(user_factors._9, 0.0) AS user_factor_9,
  
  IFNULL(item_factors._0, 0.0) AS item_factor_0,
  IFNULL(item_factors._1, 0.0) AS item_factor_1,
  IFNULL(item_factors._2, 0.0) AS item_factor_2,
  IFNULL(item_factors._3, 0.0) AS item_factor_3,
  IFNULL(item_factors._4, 0.0) AS item_factor_4,
  IFNULL(item_factors._5, 0.0) AS item_factor_5,
  IFNULL(item_factors._6, 0.0) AS item_factor_6,
  IFNULL(item_factors._7, 0.0) AS item_factor_7,
  IFNULL(item_factors._8, 0.0) AS item_factor_8,
  IFNULL(item_factors._9, 0.0) AS item_factor_9,
  
  FARM_FINGERPRINT(CONCAT(CAST(visitor_id AS STRING), CAST(content_id AS STRING))) AS hash_id
FROM CTE_training_dataset AS training_dataset
LEFT JOIN `cloud-training-demos.GA360_test.user_factors` AS user_factors
  ON CAST(training_dataset.visitor_id AS FLOAT64) = CAST(user_factors.user_id AS FLOAT64)
LEFT JOIN `cloud-training-demos.GA360_test.item_factors` AS item_factors
  ON CAST(training_dataset.content_id AS STRING) = CAST(item_factors.item_id AS STRING)
"""

Let's pull a sample of our data into a dataframe to see what it looks like.

In [4]:
import google.datalab.bigquery as bq
df_hybrid_dataset = bq.Query(query_hybrid_dataset + "LIMIT 100").execute().result().to_dataframe()
df_hybrid_dataset.head()

Unnamed: 0,next_content_id,visitor_id,content_id,category,title,author,months_since_epoch,user_factor_0,user_factor_1,user_factor_2,...,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,hash_id
0,299837992,1000593816586876859,230814320,Stars & Kultur,"""Kritik an Meghan Markle immer lauter""",Elisabeth Spitzer,562,0.000592,0.000627,-0.000848,...,-2.335446e-14,1.702269e-13,6.061748e-14,-5.093584e-16,-7.285724e-14,-1.158683e-13,-1.558101e-13,2.011165e-13,1.281463e-14,4641499907841586690
1,299826767,1001769331926555188,299836255,News,"""Blümel Kneissl &Co.: Das sind die Fixstarter""",,574,0.000749,0.000923,-0.001529,...,-5.072439e-05,0.0007677825,0.0001595652,0.0003168983,-0.000456539,0.0001829965,-0.0006903299,0.0008621884,0.000115119,-3618990996027508246
2,299921761,1001769331926555188,299826767,Lifestyle,"""Titanic-Regisseur: Darum musste Jack sterben""",Elisabeth Mittendorfer,574,0.000749,0.000923,-0.001529,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-8356988980360872262
3,299912085,1001769331926555188,299921761,News,"""Bitcoin knackt 10.000-Dollar-Marke""",Stefan Hofer,574,0.000749,0.000923,-0.001529,...,18.47323,8.823291,14.86902,5.051981,15.10309,46.37218,-23.84849,-12.16274,15.94312,1549964685624042309
4,299836841,1001769331926555188,299912085,News,"""Erster ÖBB-Containerzug nach China unterwegs""",Stefan Hofer,574,0.000749,0.000923,-0.001529,...,-6.662306e-12,-1.106192e-11,-1.411706e-11,5.926889e-12,-1.093412e-11,4.748844e-12,8.56772e-12,7.530454e-12,1.197553e-11,731115923694303975
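The `months_since_epoch` values in the sample above (562 and 574) come from the query's `DATE_DIFF(..., DATE(1970, 1, 1), MONTH)` expression. As a rough sketch of the same arithmetic, assuming the underlying custom dimension holds strings of the form `YYYY.MM` (which is what the query's `SPLIT` on `.` suggests), the calculation could look like this in plain Python. The function name is hypothetical.

```python
def months_since_epoch(year_month):
    """Whole months between January 1970 and a 'YYYY.MM' string."""
    # Look only at the first 7 characters, the width the query passes to RPAD.
    year_text, month_text = year_month[:7].strip().split(".")
    return (int(year_text) - 1970) * 12 + (int(month_text) - 1)

print(months_since_epoch("2017.11"))  # 574
print(months_since_epoch("2016.11"))  # 562
```

Under that assumption, November 2017 gives 47 * 12 + 10 = 574, matching the sample rows.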


In [5]:
df_hybrid_dataset.describe()

Unnamed: 0,user_factor_0,user_factor_1,user_factor_2,user_factor_3,user_factor_4,user_factor_5,user_factor_6,user_factor_7,user_factor_8,user_factor_9,...,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,hash_id
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.000231,0.000559,-0.0003461505,-0.001118,-0.00011,-0.000146,-7.4e-05,-0.000347,0.0007827963,0.000331,...,0.524821,-0.1213014,0.1451377,-0.08779055,0.3551918,0.1528627,-0.3217224,0.1274825,0.0569884,3.813501e+17
std,0.001241,0.001661,0.001405995,0.001622,0.001246,0.001433,0.001509,0.002023,0.001600359,0.002058,...,3.702127,2.453433,1.491142,1.174458,2.399064,5.682699,2.435434,2.951234,1.955282,5.393236e+18
min,-0.002702,-0.001083,-0.003664789,-0.005264,-0.002658,-0.003539,-0.00336,-0.004766,-0.002352181,-0.00629,...,-0.03886719,-22.70137,-0.6738291,-10.19035,-0.1282159,-32.37039,-23.84849,-12.16274,-11.15637,-9.124378e+18
25%,-0.000357,-0.000337,-0.0008085197,-0.00228,-0.001122,-5.3e-05,-0.000739,-0.001188,-6.077433e-08,-0.001001,...,-2.268464e-16,-7.290205000000001e-17,-2.123815e-05,-0.00114593,-2.168568e-08,-3.884151e-15,-0.001959489,-2.505747e-14,-2.548617e-15,-4.253324e+18
50%,0.000353,0.000137,-0.0003112987,-0.000554,1e-05,1.2e-05,-0.000107,-7.8e-05,0.0003225584,-2.7e-05,...,3.627039e-26,2.244701e-16,-7.865934000000001e-23,-3.780486e-16,-8.690795e-28,3.645398e-21,-9.524429e-16,2.543406e-24,1.783951e-24,7.393969e+17
75%,0.001067,0.000785,3.298481e-07,-0.000103,0.000416,0.000657,0.000147,0.000147,0.001384437,0.001526,...,0.0001043613,1.528129e-05,6.483593e-14,1.1956060000000002e-17,6.438866e-13,0.0001829965,3.399927e-20,1.580518e-06,0.0002364778,4.983149e+18
max,0.003468,0.006136,0.003817978,0.003158,0.003781,0.003383,0.003922,0.004065,0.004557515,0.004465,...,32.29185,8.823291,14.86902,5.051981,18.77586,46.37218,0.01651876,26.71273,15.94312,8.962769e+18


In [10]:
import apache_beam as beam
import datetime, os

def to_csv(rowdict):
  # Turn one BigQuery row (a dict keyed by column name) into a single CSV line
  CSV_COLUMNS = 'next_content_id,visitor_id,content_id,category,title,author,months_since_epoch'.split(',')
  FACTOR_COLUMNS = ["user_factor_{}".format(i) for i in range(10)] + ["item_factor_{}".format(i) for i in range(10)]
    
  # String features: write 'None' for a missing or NULL column, otherwise the UTF-8 encoded value
  data = ','.join(['None' if k not in rowdict else (rowdict[k].encode('utf-8') if rowdict[k] is not None else 'None') for k in CSV_COLUMNS])
  data += ','
  # Numeric factors: append the 10 user factors followed by the 10 item factors
  data += ','.join([str(rowdict[k]) if k in rowdict else 'None' for k in FACTOR_COLUMNS])
  yield ('{}'.format(data))
  
def preprocess(in_test_mode):
  import shutil, os, subprocess
  job_name = 'preprocess-hybrid-recommendation-features' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

  if in_test_mode:
      print('Launching local job ... hang on')
      OUTPUT_DIR = './preproc/features'
      shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
      os.makedirs(OUTPUT_DIR)
  else:
      print('Launching Dataflow job {} ... hang on'.format(job_name))
      OUTPUT_DIR = 'gs://{0}/hybrid_recommendation/preproc/features/'.format(BUCKET)
      try:
        subprocess.check_call('gsutil -m rm -r {}'.format(OUTPUT_DIR).split())
      except:
        pass

  options = {
      'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
      'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
      'job_name': job_name,
      'project': PROJECT,
      'teardown_policy': 'TEARDOWN_ALWAYS',
      'no_save_main_session': True
  }
  opts = beam.pipeline.PipelineOptions(flags = [], **options)
  if in_test_mode:
    RUNNER = 'DirectRunner'
  else:
    RUNNER = 'DataflowRunner'
  p = beam.Pipeline(RUNNER, options = opts)
  
  query = query_hybrid_dataset

  if in_test_mode:
    query = query + ' LIMIT 100' 

  for step in ['train', 'eval']:
    if step == 'train':
      selquery = 'SELECT * FROM ({}) WHERE MOD(ABS(hash_id), 10) < 9'.format(query)
    else:
      selquery = 'SELECT * FROM ({}) WHERE MOD(ABS(hash_id), 10) = 9'.format(query)

    (p 
     | '{}_read'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query = selquery, use_standard_sql = True))
     | '{}_csv'.format(step) >> beam.FlatMap(to_csv)
     | '{}_out'.format(step) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{}.csv'.format(step))))
    )

  job = p.run()
  if in_test_mode:
    job.wait_until_finish()
    print("Done!")
    
preprocess(in_test_mode = False)

Launching Dataflow job preprocess-hybrid-recommendation-features-190321-093248 ... hang on
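The pipeline above assigns each row to train or eval from `MOD(ABS(hash_id), 10)`, so the same (visitor, content) pair always lands in the same split and roughly 90% of pairs go to training. The sketch below illustrates that idea in Python 3 under one loud assumption: it uses SHA-256 from `hashlib` as a stand-in hash, whereas the query uses BigQuery's `FARM_FINGERPRINT`, so the bucket numbers will not match BigQuery's. The function names and ids are hypothetical.

```python
import hashlib

def stable_bucket(visitor_id, content_id, num_buckets=10):
    """Map a (visitor, content) pair to a bucket in [0, num_buckets).

    SHA-256 is a stand-in here; the notebook's query uses BigQuery's
    FARM_FINGERPRINT, which produces different numbers.
    """
    key = "{}{}".format(visitor_id, content_id).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def split_name(visitor_id, content_id):
    # Buckets 0-8 go to training, bucket 9 goes to evaluation.
    return "train" if stable_bucket(visitor_id, content_id) < 9 else "eval"

# Hypothetical ids, used only to show the proportions.
names = [split_name(v, c) for v in range(100) for c in range(100)]
print(names.count("train") / len(names))
```

Because the split depends only on the hashed key, rerunning the pipeline does not move rows between the two sets.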


Let's check our files to make sure everything went as expected.

In [23]:
%%bash
rm -rf features
mkdir features

In [24]:
!gsutil -m cp -r gs://{BUCKET}/hybrid_recommendation/preproc/features/*.csv* features/

Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/eval.csv-00000-of-00002...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/eval.csv-00001-of-00002...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/train.csv-00000-of-00004...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/train.csv-00001-of-00004...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/train.csv-00002-of-00004...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/features/train.csv-00003-of-00004...
| [6/6 files][114.6 MiB/114.6 MiB] 100% Done                                    
Operation completed over 6 objects/114.6 MiB.                                    


In [25]:
!head -3 features/*

==> features/eval.csv-00000-of-00002 <==
299965853,7041455396912725884,299935287,Lifestyle,"Nach Manspreading: Was es mit #Womanspreading auf sich hat",Marlene Patsalidis,574,-0.000496511696838,0.000584936060477,-0.000896223180462,-0.000102142788819,-0.000267917377641,-0.000802059425041,0.000927461369429,-0.000913450552616,-0.000238060747506,0.000163289223565,-0.406514137983,0.459142923355,1.98274052143,-0.673829138279,-2.80434513092,1.27803337574,1.03079307079,-2.11008024216,-1.51691091061,0.213096797466
299907275,2977646036924619540,299935287,Lifestyle,"Nach Manspreading: Was es mit #Womanspreading auf sich hat",Marlene Patsalidis,574,-1.19302658277e-05,-5.02647708345e-06,-6.38119900032e-06,8.50290507515e-06,7.17811872164e-07,-5.58365354664e-06,-8.84674136614e-06,-8.21020876174e-06,-1.57803640377e-06,2.38888187596e-05,-0.406514137983,0.459142923355,1.98274052143,-0.673829138279,-2.80434513092,1.27803337574,1.03079307079,-2.11008024216,-1.51691091061,0.213096797466
299925700,654206
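Each line in these files holds the 7 feature columns followed by the 10 user factors and 10 item factors, with the title wrapped in double quotes. As a minimal sketch of how such a line could be read back (Python 3, standard `csv` module), assuming every factor column holds a number, consider the following. The function name `parse_line` is hypothetical, and the factor values in the sample are placeholders rather than real output.

```python
import csv

FEATURE_COLUMNS = ["next_content_id", "visitor_id", "content_id", "category",
                   "title", "author", "months_since_epoch"]
FACTOR_COLUMNS = (["user_factor_{}".format(i) for i in range(10)] +
                  ["item_factor_{}".format(i) for i in range(10)])

def parse_line(line):
    """Turn one CSV line into a dict keyed by column name."""
    values = next(csv.reader([line]))
    columns = FEATURE_COLUMNS + FACTOR_COLUMNS
    if len(values) != len(columns):
        raise ValueError(
            "expected {} fields, got {}".format(len(columns), len(values)))
    row = dict(zip(columns, values))
    for name in FACTOR_COLUMNS:
        row[name] = float(row[name])  # factors are numeric, the rest stay strings
    return row

# Leading fields copied from the eval sample above; factor values are placeholders.
sample = ('299965853,7041455396912725884,299935287,Lifestyle,'
          '"Nach Manspreading: Was es mit #Womanspreading auf sich hat",'
          'Marlene Patsalidis,574,' + ','.join(['0.0'] * 20))
row = parse_line(sample)
print(row["title"], row["months_since_epoch"], len(row))
```

Using a CSV parser instead of `str.split(',')` keeps the quoted title in one field.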

<h2> Create vocabularies using Dataflow </h2>

Now let's use Cloud Dataflow to read the BigQuery data again and write out vocabulary files for our categorical features, one value per line.

In [7]:
query_vocabularies = """
SELECT
  CAST((SELECT MAX(IF(index = index_value, value, NULL)) FROM UNNEST(hits.customDimensions)) AS STRING) AS grouped_by
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
  UNNEST(hits) AS hits
WHERE
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index = index_value, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY
  grouped_by
"""

In [8]:
import apache_beam as beam
import datetime, os

def to_txt(rowdict):
  # Pull columns from BQ and create a line

  # Write out rows for each input row for grouped by column in rowdict
  return '{}'.format(rowdict['grouped_by'].encode('utf-8'))
  
def preprocess(in_test_mode):
  import shutil, os, subprocess
  job_name = 'preprocess-hybrid-recommendation-vocab-lists' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

  if in_test_mode:
      print('Launching local job ... hang on')
      OUTPUT_DIR = './preproc/vocabs'
      shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
      os.makedirs(OUTPUT_DIR)
  else:
      print('Launching Dataflow job {} ... hang on'.format(job_name))
      OUTPUT_DIR = 'gs://{0}/hybrid_recommendation/preproc/vocabs/'.format(BUCKET)
      try:
        subprocess.check_call('gsutil -m rm -r {}'.format(OUTPUT_DIR).split())
      except:
        pass

  options = {
      'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
      'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
      'job_name': job_name,
      'project': PROJECT,
      'teardown_policy': 'TEARDOWN_ALWAYS',
      'no_save_main_session': True
  }
  opts = beam.pipeline.PipelineOptions(flags = [], **options)
  if in_test_mode:
      RUNNER = 'DirectRunner'
  else:
      RUNNER = 'DataflowRunner'
      
  p = beam.Pipeline(RUNNER, options = opts)
  
  def vocab_list(index, name):
    query = query_vocabularies.replace("index_value", "{}".format(index))

    (p 
     | '{}_read'.format(name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
     | '{}_txt'.format(name) >> beam.Map(to_txt)
     | '{}_out'.format(name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{0}_vocab.txt'.format(name))))
    )

  # Call vocab_list function for each
  vocab_list(10, 'content_id') # content_id
  vocab_list(7, 'category') # category
  vocab_list(2, 'author') # author
  
  job = p.run()
  if in_test_mode:
    job.wait_until_finish()
    print("Done!")
    
preprocess(in_test_mode = False)

Launching Dataflow job preprocess-hybrid-recommendation-vocab-lists-190321-092057 ... hang on
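The vocabulary query above reads one custom dimension per hit with `MAX(IF(index = index_value, value, NULL))` and then groups by that value to keep each distinct value once. A rough plain-Python sketch of the same logic follows, assuming hits shaped like GA360 export rows and at most one entry per index in `customDimensions`. The function names and toy hits are hypothetical, and the sorting is added here only to make the output stable; the query imposes no order.

```python
def custom_dimension(hit, index):
    """Return the value stored under `index` in a hit's customDimensions, or None."""
    for dimension in hit.get("customDimensions", []):
        if dimension["index"] == index:
            return dimension["value"]
    return None

def vocabulary(hits, index):
    """Distinct non-null values of one custom dimension across PAGE hits."""
    values = set()
    for hit in hits:
        if hit.get("type") != "PAGE":
            continue  # only include hits on pages, as in the query
        value = custom_dimension(hit, index)
        if value is not None:
            values.add(value)
    return sorted(values)

# Hypothetical hits shaped like GA360 export rows.
toy_hits = [
    {"type": "PAGE", "customDimensions": [{"index": 7, "value": "News"},
                                           {"index": 10, "value": "299969709"}]},
    {"type": "PAGE", "customDimensions": [{"index": 7, "value": "Lifestyle"}]},
    {"type": "EVENT", "customDimensions": [{"index": 7, "value": "Ignored"}]},
    {"type": "PAGE", "customDimensions": [{"index": 7, "value": "News"}]},
]
print(vocabulary(toy_hits, 7))   # ['Lifestyle', 'News']
print(vocabulary(toy_hits, 10))  # ['299969709']
```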


We also compute the vocabulary counts, i.e. the number of distinct values in each vocabulary, along with the global mean of `months_since_epoch`.

In [9]:
import apache_beam as beam
import datetime, os

def count_to_txt(rowdict):
  # Pull columns from BQ and create a line

  # Write out count
  return '{}'.format(rowdict['count_number'])
  
def mean_to_txt(rowdict):
  # Pull columns from BQ and create a line

  # Write out mean
  return '{}'.format(rowdict['mean_value'])
  
def preprocess(in_test_mode):
  import shutil, os, subprocess
  job_name = 'preprocess-hybrid-recommendation-vocab-counts' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

  if in_test_mode:
      print('Launching local job ... hang on')
      OUTPUT_DIR = './preproc/vocab_counts'
      shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
      os.makedirs(OUTPUT_DIR)
  else:
      print('Launching Dataflow job {} ... hang on'.format(job_name))
      OUTPUT_DIR = 'gs://{0}/hybrid_recommendation/preproc/vocab_counts/'.format(BUCKET)
      try:
        subprocess.check_call('gsutil -m rm -r {}'.format(OUTPUT_DIR).split())
      except:
        pass

  options = {
      'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
      'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
      'job_name': job_name,
      'project': PROJECT,
      'teardown_policy': 'TEARDOWN_ALWAYS',
      'no_save_main_session': True
  }
  opts = beam.pipeline.PipelineOptions(flags = [], **options)
  if in_test_mode:
      RUNNER = 'DirectRunner'
  else:
      RUNNER = 'DataflowRunner'
      
  p = beam.Pipeline(RUNNER, options = opts)
  
  def vocab_count(index, column_name):
    query = """
SELECT
  COUNT(*) AS count_number
FROM ({})
""".format(query_vocabularies.replace("index_value", "{}".format(index)))

    (p 
     | '{}_read'.format(column_name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
     | '{}_txt'.format(column_name) >> beam.Map(count_to_txt)
     | '{}_out'.format(column_name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{0}_vocab_count.txt'.format(column_name))))
    )
    
  def global_column_mean(column_name):
    query = """
SELECT
  AVG(CAST({1} AS FLOAT64)) AS mean_value
FROM ({0})
""".format(query_hybrid_dataset, column_name)
    
    (p 
     | '{}_read'.format(column_name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
     | '{}_txt'.format(column_name) >> beam.Map(mean_to_txt)
     | '{}_out'.format(column_name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{0}_mean.txt'.format(column_name))))
    )
    
  # Call vocab_count function for each column we want the vocabulary count for
  vocab_count(10, 'content_id') # content_id
  vocab_count(7, 'category') # category
  vocab_count(2, 'author') # author
  
  # Call global_column_mean function for each column we want the mean for
  global_column_mean('months_since_epoch') # months_since_epoch
  
  job = p.run()
  if in_test_mode:
    job.wait_until_finish()
    print("Done!")
    
preprocess(in_test_mode = False)

Launching Dataflow job preprocess-hybrid-recommendation-vocab-counts-190321-093051 ... hang on


Let's check our files to make sure everything went as expected.

In [14]:
%%bash
rm -rf vocabs
mkdir vocabs

In [15]:
!gsutil -m cp -r gs://{BUCKET}/hybrid_recommendation/preproc/vocabs/*.txt* vocabs/

Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocabs/author_vocab.txt-00000-of-00001...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocabs/category_vocab.txt-00000-of-00001...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocabs/content_id_vocab.txt-00000-of-00001...
/ [3/3 files][178.5 KiB/178.5 KiB] 100% Done                                    
Operation completed over 3 objects/178.5 KiB.                                    


In [16]:
!head -3 vocabs/*

==> vocabs/author_vocab.txt-00000-of-00001 <==
Moritz Gottsauner-Wolf
Brigitte Schokarth
Ursula Horvath

==> vocabs/category_vocab.txt-00000-of-00001 <==
News
Stars & Kultur
Lifestyle

==> vocabs/content_id_vocab.txt-00000-of-00001 <==
299969709
299326744
299496976


In [20]:
%%bash
rm -rf vocab_counts
mkdir vocab_counts

In [21]:
!gsutil -m cp -r gs://{BUCKET}/hybrid_recommendation/preproc/vocab_counts/*.txt* vocab_counts/

Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocab_counts/author_vocab_count.txt-00000-of-00001...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocab_counts/category_vocab_count.txt-00000-of-00001...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocab_counts/content_id_vocab_count.txt-00000-of-00001...
Copying gs://qwiklabs-gcp-ca86714dede2236a/hybrid_recommendation/preproc/vocab_counts/months_since_epoch_mean.txt-00000-of-00001...
/ [4/4 files][   26.0 B/   26.0 B] 100% Done                                    
Operation completed over 4 objects/26.0 B.                                       


In [22]:
!head -3 vocab_counts/*

==> vocab_counts/author_vocab_count.txt-00000-of-00001 <==
1103

==> vocab_counts/category_vocab_count.txt-00000-of-00001 <==
3

==> vocab_counts/content_id_vocab_count.txt-00000-of-00001 <==
15634

==> vocab_counts/months_since_epoch_mean.txt-00000-of-00001 <==
573.60733908
