# Neural network hybrid recommendation system on Google Analytics data model and training

This notebook demonstrates how to implement a hybrid recommendation system using a neural network to combine content-based and collaborative filtering recommendation models using Google Analytics data. We are going to use the learned user embeddings from [wals.ipynb](../wals.ipynb) and combine that with our previous content-based features from [content_based_using_neural_networks.ipynb](../content_based_using_neural_networks.ipynb)

Now that we have our data preprocessed from BigQuery and Cloud Dataflow, we can build our neural network hybrid recommendation model to our preprocessed data. Then we can train locally to make sure everything works and then use the power of Google Cloud ML Engine to scale it out.

We're going to use TensorFlow Hub to use trained text embeddings, so let's first pip install that and reset our session.

In [1]:
!pip install tensorflow_hub

Collecting tensorflow_hub
  Using cached https://files.pythonhosted.org/packages/9e/f0/3a3ced04c8359e562f1b91918d9bde797c8a916fcfeddc8dc5d673d1be20/tensorflow_hub-0.3.0-py2.py3-none-any.whl
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.3.0


Now reset the notebook's session kernel! Since we're no longer using Cloud Dataflow, we'll be using the python3 kernel from here on out so don't forget to change the kernel if it's still python2.

In [1]:
# Import helpful libraries and setup our project, bucket, and region
import os
import tensorflow as tf
import tensorflow_hub as hub

PROJECT = 'qwiklabs-gcp-4a684069c4776675'
BUCKET = 'colaborative-filtering-agea'
REGION = 'us-central1'

NUMBER_OF_DAYS=5
# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

  from ._conv import register_converters as _register_converters
W0308 16:58:05.404504 139879471625984 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [6]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


In [7]:
import google.datalab.bigquery as bq

In [8]:
def create_hybrid_query(number_of_days, eval):
  query = """
    SELECT 
     h.*
    FROM `AGEA_ASL.Dataset_Hybrid` h
    JOIN `AGEA_ASL.Dataset_A`a ON h.content_id = a.content_id
    WHERE date_diff(CAST('2018-12-31' as DATE),CAST(a.day_write as DATE), day) < {}
  """.format(number_of_days)
  
  if(eval):
    query+= "AND ABS(MOD(h.hash_id,10)) < 3"
  else:
    query+= "AND ABS(MOD(h.hash_id,10)) >= 3"
  return query
  

In [9]:
def get_hybrid_data(number_of_days, eval):
  query = create_hybrid_query(number_of_days,eval)
  return bq.Query(query).execute().result().to_dataframe()
   

In [10]:
hybrid_eval_data = get_hybrid_data(NUMBER_OF_DAYS, eval=True)

In [139]:
hybrid_eval_data.to_csv('data/hybrid_dataset_eval.csv',index=False)

In [12]:
hybrid_training_data = get_hybrid_data(NUMBER_OF_DAYS, eval=False)

In [140]:
hybrid_training_data.to_csv('data/hybrid_dataset_train.csv',index=False)

In [164]:
!head data/hybrid_dataset_train.csv


user_id,content_id,title,section_1,tag_1,d2v,user_factors,item_factors,hash_id,next_article,doc2vec_avg,gender,age
6684642,uU9fSN3rZ,Prioridad para Mauricio Macri: evitar cualquier disparada del dólar,Opinión,Política económica,-0.5431375503540038|1.5081772804260252|0.28818225860595703|-0.22279983758926392|-0.75957942008972157|1.6420205831527708|0.017720999196171761|-1.839012503623962|-0.17818327248096469|-0.58572870492935181|-1.1695746183395384|-2.074692964553833|0.84690624475479115|-0.227681428194046|1.3209612369537351|-0.11642815172672273|0.031647033989429481|-1.0423736572265625|0.98129534721374545|-0.27149784564971924|-0.73087340593338013|-1.3783644437789917|0.65492689609527588|1.7048561573028562|1.3716107606887815|0.26091566681861877|-0.72071701288223255|-2.7966279983520508|1.5285329818725584|2.1016929149627686|-0.75358551740646373|-1.2115024328231809|-0.46927991509437561|-0.96890002489089944|0.478855550289154|2.7522444725036621e-05|-0.934307873249054|1.2002404928207395|-0.076965

In [149]:
!gsutil cp data/hybrid_dataset_*.csv gs://{BUCKET}/hybrid/

Copying file://data/hybrid_dataset_eval.csv [Content-Type=text/csv]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Copying file://data/hybrid_dataset_train.csv [Content-Type=text/csv]...         
/ [2 files][  3.4 GiB/  3.4 GiB]   36.1 MiB/s                                   
Operation completed over 2 objects/3.4 GiB.                                      

In [150]:
def count_distinct_values(column_name, eval=False):
  query = """
    SELECT COUNT(DISTINCT {}) FROM 
      ({})
  """.format(column_name, create_hybrid_query(NUMBER_OF_DAYS,eval))
  return bq.Query(query).execute().result().to_dataframe()['f0_'][0]

In [151]:
number_of_content_ids = count_distinct_values('content_id',eval)
number_of_sections = count_distinct_values('section_1',eval)
number_of_tags = count_distinct_values('tag_1',eval)

In [152]:
def get_column_values(column_name,eval=False):
  query = """
    SELECT DISTINCT {} FROM 
      ({})
  """.format(column_name, create_hybrid_query(NUMBER_OF_DAYS,eval))
  return bq.Query(query).execute().result().to_dataframe()

In [153]:
sections_vocab = get_column_values('section_1')['section_1']

In [154]:
content_id_vocab = get_column_values('content_id')['content_id']

In [155]:
content_id_vocab.to_csv('data/content_id_vocab',index=False)

<h2> Create hybrid recommendation system model using TensorFlow </h2>

Now that we've created our training and evaluation input files as well as our categorical feature vocabulary files, we can create our TensorFlow hybrid recommendation system model.

Let's first get some of our aggregate information that we will use in the model from some of our preprocessed files we saved in Google Cloud Storage.

In [156]:
from tensorflow.python.lib.io import file_io

In [157]:
# Determine CSV and label columns
CSV_COLUMNS = 'user_id,content_id,title,section_1,tag_1,d2v,user_factors,items_factors,next_article,doc2vec_avg,hash_id,gender,age'.split(',')
LABEL_COLUMN = 'next_content_id'
NON_FEATURES_COLUMNS = ['user_id','content_id','hash_id']
# Set default values for each CSV column
DEFAULTS = [["Unknown"]*len(CSV_COLUMNS)]

Create input function for training and evaluation to read from our preprocessed CSV files.

In [158]:
# Create input function for train and eval
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(records = value_column, record_defaults = DEFAULTS)
      features = dict(zip(COLUMNS, columns))          
      label = features.pop(LABEL_COLUMN)
      for non_feature in NON_FEATURES_COLUMNS:
            features.pop(non_feature)
      return features, label

    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename = filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(filenames = file_list).map(map_func = decode_csv)

    if mode == tf.estimator.ModeKeys.TRAIN:
      num_epochs = None # indefinitely
      dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
      num_epochs = 1 # end-of-input after this

    dataset = dataset.repeat(count = num_epochs).batch(batch_size = batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn

Next, we will create our feature columns using our read in features.

In [159]:
# Create feature columns to be used in model
def create_feature_columns(args):
  # Create content_id feature column
  content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "content_id",
    hash_bucket_size = number_of_content_ids)

  # Embed content id into a lower dimensional representation
  embedded_content_column = tf.feature_column.embedding_column(
    categorical_column = content_id_column,
    dimension = args['content_id_embedding_dimensions'])

  # Create section feature column
  categorical_section_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key = "section_1",
    vocabulary_list=sections_vocab,
    num_oov_buckets = 1)

  # Convert section category column into indicator column so that it can be used in a DNN
  indicator_section_column = tf.feature_column.indicator_column(categorical_column = categorical_section_column)

  # Create title feature column using TF Hub
  embedded_title_column = hub.text_embedding_column(
    key = "title", 
    module_spec = "https://tfhub.dev/google/nnlm-es-dim50-with-normalization/1",
    trainable = False)

  # Create tag feature column
  tag_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "tag_1",
    hash_bucket_size = number_of_tags + 1)

  # Embed tag into a lower dimensional representation
  embedded_tag_column = tf.feature_column.embedding_column(
    categorical_column = tag_column,
    dimension = args['tag_embedding_dimensions'])

  # Create months since epoch boundaries list for our binning
  age_boundaries = list(range(15, 100))

  # Create age feature column using raw data
  age_column = tf.feature_column.numeric_column(
    key = "age")

  # Create bucketized age feature column using our boundaries
  age_bucketized = tf.feature_column.bucketized_column(
    source_column = age_column,
    boundaries = age_boundaries)

  # Cross our categorical section column and bucketized age column
  crossed_age_since_section_column = tf.feature_column.crossed_column(
    keys = [categorical_section_column, age_bucketized],
    hash_bucket_size = len(age_boundaries) * (number_of_sections + 1))

  # Convert crossed categorical category and bucketized months since epoch column into indicator column so that it can be used in a DNN
  indicator_crossed_age_since_section_column = tf.feature_column.indicator_column(categorical_column = crossed_age_since_section_column)
  
  # Create list of feature columns
  feature_columns = [embedded_content_column,
                     embedded_tag_column,
                     indicator_section_column,
                     embedded_title_column,
                     crossed_age_since_section_column] #+ user_factors + item_factors + d2v + d2v_avg

  return feature_columns

Now we'll create our model function

In [160]:
# Create custom model function for our custom estimator
def model_fn(features, labels, mode, params):
  # Create neural network input layer using our feature columns defined above
  net = tf.feature_column.input_layer(features = features)

  # Create hidden layers by looping through hidden unit list
  for units in params['hidden_units']:
    net = tf.layers.dense(inputs = net, units = units, activation = tf.nn.relu)

  # Compute logits (1 per class) using the output of our last hidden layer
  logits = tf.layers.dense(inputs = net, units = params['n_classes'], activation = None)

  # Find the predicted class indices based on the highest logit (which will result in the highest probability)
  predicted_classes = tf.argmax(input = logits, axis = 1)

  # Read in the content id vocabulary so we can tie the predicted class indices to their respective content ids
  content_id_names = tf.constant(value = [x.rstrip() for x in content_id_vocab])

  # Gather predicted class names based predicted class indices
  predicted_class_names = tf.gather(params = content_id_names, indices = predicted_classes)

  # If the mode is prediction
  if mode == tf.estimator.ModeKeys.PREDICT:
    # Create predictions dict
    predictions_dict = {
        'class_ids': tf.expand_dims(input = predicted_classes, axis = -1),
        'class_names' : tf.expand_dims(input = predicted_class_names, axis = -1),
        'probabilities': tf.nn.softmax(logits = logits),
        'logits': logits
    }

    # Create export outputs
    export_outputs = {"predict_export_outputs": tf.estimator.export.PredictOutput(outputs = predictions_dict)}

    return tf.estimator.EstimatorSpec( # return early since we're done with what we need for prediction mode
      mode = mode,
      predictions = predictions_dict,
      loss = None,
      train_op = None,
      eval_metric_ops = None,
      export_outputs = export_outputs)

  # Continue on with training and evaluation modes

  # Create lookup table using our content id vocabulary
  table = tf.contrib.lookup.index_table_from_file(
    vocabulary_file = tf.gfile.Glob(filename = "data/content_id_vocab*")[0])

  # Look up labels from vocabulary table
  labels = table.lookup(keys = labels)

  # Compute loss using sparse softmax cross entropy since this is classification and our labels (content id indices) and probabilities are mutually exclusive
  loss = tf.losses.sparse_softmax_cross_entropy(labels = labels, logits = logits)

  # Compute evaluation metrics of total accuracy and the accuracy of the top k classes
  accuracy = tf.metrics.accuracy(labels = labels, predictions = predicted_classes, name = 'acc_op')
  top_k_accuracy = tf.metrics.mean(values = tf.nn.in_top_k(predictions = logits, targets = labels, k = params['top_k']))
  map_at_k = tf.metrics.average_precision_at_k(labels = labels, predictions = predicted_classes, k = params['top_k'])

  # Put eval metrics into a dictionary
  eval_metrics = {
    'accuracy': accuracy,
    'top_k_accuracy': top_k_accuracy,
    'map_at_k': map_at_k}

  # Create scalar summaries to see in TensorBoard
  tf.summary.scalar(name = 'accuracy', tensor = accuracy[1])
  tf.summary.scalar(name = 'top_k_accuracy', tensor = top_k_accuracy[1])
  tf.summary.scalar(name = 'map_at_k', tensor = map_at_k[1])

  # If the mode is evaluation
  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec( # return early since we're done with what we need for evaluation mode
        mode = mode,
        predictions = None,
        loss = loss,
        train_op = None,
        eval_metric_ops = eval_metrics,
        export_outputs = None)

  # Continue on with training mode

  # If the mode is training
  assert mode == tf.estimator.ModeKeys.TRAIN

  # Create a custom optimizer
  optimizer = tf.train.AdagradOptimizer(learning_rate = params['learning_rate'])

  # Create train op
  train_op = optimizer.minimize(loss = loss, global_step = tf.train.get_global_step())

  return tf.estimator.EstimatorSpec( # final return since we're done with what we need for training mode
    mode = mode,
    predictions = None,
    loss = loss,
    train_op = train_op,
    eval_metric_ops = None,
    export_outputs = None)

Now create a serving input function

In [161]:
# Create serving input function
def serving_input_fn():  
  feature_placeholders = {
    colname : tf.placeholder(dtype = tf.string, shape = [None]) \
    for colname in 'user_id,content_id,title,section_1,tag_1,user_factors,items_factors,doc2vec_avg,gender'.split(',')
  }
  feature_placeholders['age'] = tf.placeholder(dtype = tf.int32, shape = [None])
  
  features = {
    key: tf.expand_dims(tensor, -1) \
    for key, tensor in feature_placeholders.items()
  }
    
  return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)

Now that all of the pieces are assembled let's create and run our train and evaluate loop

In [162]:
# Create train and evaluate loop to combine all of the pieces together.
tf.logging.set_verbosity(tf.logging.INFO)
def train_and_evaluate(args):
  estimator = tf.estimator.Estimator(
    model_fn = model_fn,
    model_dir = args['output_dir'],
    params={
      #'feature_columns': create_feature_columns(args),
      'hidden_units': args['hidden_units'],
      'n_classes': number_of_content_ids,
      'learning_rate': args['learning_rate'],
      'top_k': args['top_k'],
      'bucket': args['bucket']
    })

  train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset(filename = args['train_data_paths'], mode = tf.estimator.ModeKeys.TRAIN, batch_size = args['batch_size']),
    max_steps = args['train_steps'])

  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)

  eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset(filename = args['eval_data_paths'], mode = tf.estimator.ModeKeys.EVAL, batch_size = args['batch_size']),
    steps = None,
    start_delay_secs = args['start_delay_secs'],
    throttle_secs = args['throttle_secs'],
    exporters = exporter)

  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Run train_and_evaluate!

In [163]:
# Call train and evaluate loop
import shutil

outdir = 'hybrid_recommendation_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time

arguments = {
  'bucket': BUCKET,
  'train_data_paths': "gs://{}/hybrid/hybrid_dataset_train.csv*".format(BUCKET),
  'eval_data_paths': "gs://{}/hybrid_/hybrid_dataset_eval.csv*".format(BUCKET),
  'output_dir': outdir,
  'batch_size': 128,
  'learning_rate': 0.1,
  'hidden_units': [256, 128, 64],
  'content_id_embedding_dimensions': 10,
  'category_embedding_dimensions': 10,
  'tag_embedding_dimensions':10,
  'top_k': 10,
  'train_steps': 1000,
  'start_delay_secs': 30,
  'throttle_secs': 30
}

train_and_evaluate(arguments)

INFO:tensorflow:Using default config.


I0308 18:37:12.971140 139879471625984 tf_logging.py:116] Using default config.


INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_session_config': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_evaluation_master': '', '_tf_random_seed': None, '_num_worker_replicas': 1, '_keep_checkpoint_max': 5, '_master': '', '_task_type': 'worker', '_service': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_train_distribute': None, '_model_dir': 'hybrid_recommendation_trained', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f37a4381b70>, '_save_checkpoints_steps': None, '_is_chief': True, '_num_ps_replicas': 0}


I0308 18:37:12.973239 139879471625984 tf_logging.py:116] Using config: {'_save_summary_steps': 100, '_session_config': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_evaluation_master': '', '_tf_random_seed': None, '_num_worker_replicas': 1, '_keep_checkpoint_max': 5, '_master': '', '_task_type': 'worker', '_service': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_train_distribute': None, '_model_dir': 'hybrid_recommendation_trained', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f37a4381b70>, '_save_checkpoints_steps': None, '_is_chief': True, '_num_ps_replicas': 0}


INFO:tensorflow:Running training and evaluation locally (non-distributed).


I0308 18:37:12.976098 139879471625984 tf_logging.py:116] Running training and evaluation locally (non-distributed).


INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 30 secs (eval_spec.throttle_secs) or training is finished.


I0308 18:37:12.977734 139879471625984 tf_logging.py:116] Start train and evaluate loop. The evaluate will happen after 30 secs (eval_spec.throttle_secs) or training is finished.


ValueError: Shape of a default must be a length-0 or length-1 vector for 'DecodeCSV' (op: 'DecodeCSV') with input shapes: [], [13].

## Run on module locally

Now let's place our code into a python module with model.py and task.py files so that we can train using Google Cloud's ML Engine! First, let's test our module locally.

In [None]:
%writefile requirements.txt
tensorflow_hub

In [None]:
%bash
echo "bucket=${BUCKET}"
rm -rf hybrid_recommendation_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/hybrid_recommendations_module
python -m trainer.task \
  --bucket=${BUCKET} \
  --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \
  --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \
  --output_dir=${OUTDIR} \
  --batch_size=128 \
  --learning_rate=0.1 \
  --hidden_units="256 128 64" \
  --content_id_embedding_dimensions=10 \
  --author_embedding_dimensions=10 \
  --top_k=10 \
  --train_steps=1000 \
  --start_delay_secs=30 \
  --throttle_secs=60

# Run on Google Cloud ML Engine
If our module locally trained fine, let's now use of the power of ML Engine to scale it out on Google Cloud.

In [None]:
%bash
OUTDIR=gs://${BUCKET}/hybrid_recommendation/small_trained_model
JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/hybrid_recommendations_module/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --bucket=${BUCKET} \
  --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \
  --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \
  --output_dir=${OUTDIR} \
  --batch_size=128 \
  --learning_rate=0.1 \
  --hidden_units="256 128 64" \
  --content_id_embedding_dimensions=10 \
  --author_embedding_dimensions=10 \
  --top_k=10 \
  --train_steps=1000 \
  --start_delay_secs=30 \
  --throttle_secs=30

Let's add some hyperparameter tuning!

In [None]:
%%writefile hyperparam.yaml
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 5
    maxParallelTrials: 1
    hyperparameterMetricTag: accuracy
    params:
    - parameterName: batch_size
      type: INTEGER
      minValue: 8
      maxValue: 64
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.01
      maxValue: 0.1
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: hidden_units
      type: CATEGORICAL
      categoricalValues: ['1024 512 256', '1024 512 128', '1024 256 128', '512 256 128', '1024 512 64', '1024 256 64', '512 256 64', '1024 128 64', '512 128 64', '256 128 64', '1024 512 32', '1024 256 32', '512 256 32', '1024 128 32', '512 128 32', '256 128 32', '1024 64 32', '512 64 32', '256 64 32', '128 64 32']
    - parameterName: content_id_embedding_dimensions
      type: INTEGER
      minValue: 5
      maxValue: 250
      scaleType: UNIT_LOG_SCALE
    - parameterName: author_embedding_dimensions
      type: INTEGER
      minValue: 5
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE

In [None]:
%bash
OUTDIR=gs://${BUCKET}/hybrid_recommendation/hypertuning
JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/hybrid_recommendations_module/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  --config=hyperparam.yaml \
  -- \
  --bucket=${BUCKET} \
  --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \
  --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \
  --output_dir=${OUTDIR} \
  --batch_size=128 \
  --learning_rate=0.1 \
  --hidden_units="256 128 64" \
  --content_id_embedding_dimensions=10 \
  --author_embedding_dimensions=10 \
  --top_k=10 \
  --train_steps=1000 \
  --start_delay_secs=30 \
  --throttle_secs=30

Now that we know the best hyperparameters, run a big training job!

In [None]:
%bash
OUTDIR=gs://${BUCKET}/hybrid_recommendation/big_trained_model
JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/hybrid_recommendations_module/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --bucket=${BUCKET} \
  --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \
  --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \
  --output_dir=${OUTDIR} \
  --batch_size=128 \
  --learning_rate=0.1 \
  --hidden_units="256 128 64" \
  --content_id_embedding_dimensions=10 \
  --author_embedding_dimensions=10 \
  --top_k=10 \
  --train_steps=10000 \
  --start_delay_secs=30 \
  --throttle_secs=30