## Content-Based Filtering Using Neural Networks

This lab relies on files created in the [content_based_preproc.ipynb](./content_based_preproc.ipynb) notebook. Be sure to complete the TODOs in that notebook and run the code there before completing this lab.  
Also, we'll be using the **python3** kernel from here on out so don't forget to change the kernel if it's still python2.

This lab illustrates:
1. how to build feature columns for a model using tf.feature_column
2. how to create custom evaluation metrics and add them to Tensorboard
3. how to train a model and make predictions with the saved model

Tensorflow Hub should already be installed. You can check using pip freeze.

In [1]:
%%bash
pip freeze | grep tensor

tensorboard==1.8.0
tensorflow==1.8.0


If 'tensorflow-hub' isn't one of the outputs above, then you'll need to install it by running the cell below. After the pip install completes, click **"Reset Session"** on the notebook so that the Python environment picks up the new package.

In [2]:
%%bash
pip install tensorflow-hub

Collecting tensorflow-hub
  Downloading https://files.pythonhosted.org/packages/9e/f0/3a3ced04c8359e562f1b91918d9bde797c8a916fcfeddc8dc5d673d1be20/tensorflow_hub-0.3.0-py2.py3-none-any.whl (73kB)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.3.0
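
The same check can also be made from Python instead of the shell. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of the lab.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be located on the Python path."""
    return importlib.util.find_spec(module_name) is not None

# tensorflow_hub is the import name of the tensorflow-hub package.
print("tensorflow_hub installed:", is_installed("tensorflow_hub"))
```

If this prints `False` right after a successful pip install, the kernel most likely has not been restarted yet.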


In [1]:
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

output = os.popen("gcloud config get-value project").readlines()
project_name = output[0][:-1]

PROJECT = project_name
BUCKET = project_name
#BUCKET = BUCKET.replace("qwiklabs-gcp-", "inna-bckt-")
REGION = 'europe-west1'  ## note: Cloud ML Engine not available in europe-west3!

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

print(PROJECT)
print(BUCKET)
print(REGION)

  from ._conv import register_converters as _register_converters
W0320 07:34:35.182858 139766874433280 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


qwiklabs-gcp-0b5ac7ef62bf23c8
qwiklabs-gcp-0b5ac7ef62bf23c8
europe-west1


In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Build the feature columns for the model.

To start, we'll load the list of categories, authors and article ids we created in the previous **Create Datasets** notebook.

In [3]:
categories_list = open("categories.txt").read().splitlines()
authors_list = open("authors.txt").read().splitlines()
content_ids_list = open("content_ids.txt").read().splitlines()
mean_months_since_epoch = 523

In the cell below we'll define the feature columns to use in our model. If necessary, remind yourself of the [various feature columns](https://www.tensorflow.org/api_docs/python/tf/feature_column) available.  
For the embedded_title_column feature column, use a TensorFlow Hub module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German language embedding module.  
Explore the text embedding TensorFlow Hub modules [available here](https://alpha.tfhub.dev/). Filter by setting the language to 'German'. The 50-dimensional embedding should be sufficient for our purposes. 

In [7]:
#TODO (done): use a TensorFlow Hub module to create a text embedding column for the article "title". 
# Use the module available at https://alpha.tfhub.dev/ filtering by German language.
embedded_title_column = hub.text_embedding_column(
    key = "title",
    module_spec = "https://tfhub.dev/google/nnlm-de-dim50/1",
    trainable = False
)

#TODO (done): create an embedded categorical feature column for the article id; i.e. "content_id".
content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "content_id",
    hash_bucket_size = len(content_ids_list) + 1
)
embedded_content_column = tf.feature_column.embedding_column(
    categorical_column = content_id_column,
    dimension = 10
)

#TODO (done): create an embedded categorical feature column for the article "author"
author_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "author",
    hash_bucket_size = len(authors_list) + 1
)
embedded_author_column = tf.feature_column.embedding_column(
    categorical_column = author_column,
    dimension = 3
)

#TODO (done): create a categorical feature column for the article "category"
category_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key = "category",
    vocabulary_list = categories_list,
    num_oov_buckets = 1
)
category_column = tf.feature_column.indicator_column(category_column_categorical)
## note: indicator_column creates a multi-hot-encoded column.

#TODO (done): create a bucketized numeric feature column of values for the "months since epoch"
months_since_epoch_boundaries = list(range(400,700,20))
months_since_epochs_numeric = tf.feature_column.numeric_column(key = "months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(
  source_column = months_since_epochs_numeric,
  boundaries = months_since_epoch_boundaries
)

#TODO (done): create a crossed feature column using the "category" and "months since epoch" values
crossed_months_since_category_column = tf.feature_column.indicator_column(
  tf.feature_column.crossed_column(
    keys = [category_column_categorical, months_since_epoch_bucketized], 
    hash_bucket_size = len(months_since_epoch_boundaries) * (len(categories_list) + 1)
  )
)

feature_columns = [embedded_content_column,
                   embedded_author_column,
                   category_column,
                   embedded_title_column,
                   crossed_months_since_category_column] 
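
To make the bucketized and crossed columns above more concrete, here is a framework-free sketch of what they compute. It assumes the same boundaries as the cell above. The helper names `bucket_index` and `cross_bucket`, the MD5 hash, and the bucket count of `15 * 4` (three categories plus one out-of-vocabulary bucket) are illustrative assumptions; TensorFlow uses its own fingerprint function, so real bucket ids will differ.

```python
import bisect
import hashlib

# Same boundaries as in the cell above: 400, 420, ..., 680.
boundaries = list(range(400, 700, 20))

def bucket_index(value, boundaries):
    """Bucket 0 is (-inf, 400), bucket i is [b[i-1], b[i]), the last is [680, +inf)."""
    return bisect.bisect_right(boundaries, value)

def cross_bucket(category, month_bucket, hash_bucket_size):
    """Hash a (category, bucket) pair into a fixed number of buckets.

    Illustration only: this is not TensorFlow's hashing scheme.
    """
    key = "{}_X_{}".format(category, month_bucket).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size

month_bucket = bucket_index(574, boundaries)
print(month_bucket)  # 9, i.e. the interval [560, 580)
print(cross_bucket("News", month_bucket, hash_bucket_size=15 * 4))
```

Fifteen boundaries produce sixteen buckets, since values below the first boundary and at or above the last one each get their own bucket.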

### Create the input function.

Next we'll create the input function for our model. This input function reads the data from the csv files we created in the previous labs. 

In [8]:
record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
      def decode_csv(value_column):
          columns = tf.decode_csv(value_column,record_defaults=record_defaults)
          features = dict(zip(column_keys, columns))          
          label = features.pop(label_key)         
          return features, label

      # Create list of files that match pattern
      file_list = tf.gfile.Glob(filename)

      # Create dataset from file list
      dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

      if mode == tf.estimator.ModeKeys.TRAIN:
          num_epochs = None # indefinitely
          dataset = dataset.shuffle(buffer_size = 10 * batch_size)
      else:
          num_epochs = 1 # end-of-input after this

      dataset = dataset.repeat(num_epochs).batch(batch_size)
      return dataset.make_one_shot_iterator().get_next()
  return _input_fn
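
The role of `record_defaults` in `decode_csv` can be sketched without TensorFlow. The snippet below is an approximation using the standard `csv` module: an empty field takes its default, and the type of the default decides how a non-empty field is parsed. The function name `decode_line` is hypothetical; the sample row is one of the training rows shown later in this lab, where the author field is empty.

```python
import csv

COLUMN_KEYS = ["visitor_id", "content_id", "category", "title", "author",
               "months_since_epoch", "next_content_id"]
# 523 mirrors mean_months_since_epoch from the cell that loads the lists.
RECORD_DEFAULTS = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown", 523, "Unknown"]
LABEL_KEY = "next_content_id"

def decode_line(line):
    """Parse one CSV line into (features, label), filling empty fields with defaults."""
    fields = next(csv.reader([line]))
    values = []
    for raw, default in zip(fields, RECORD_DEFAULTS):
        # Empty field: use the default. Otherwise cast to the default's type.
        values.append(default if raw == "" else type(default)(raw))
    features = dict(zip(COLUMN_KEYS, values))
    label = features.pop(LABEL_KEY)
    return features, label

line = ("1000196974485173657,299925086,News,"
        "Marihuana-Adventkalender findet in Kanada reißenden Absatz,,574,299826775")
features, label = decode_line(line)
print(features["author"], features["months_since_epoch"], label)
```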

### Create the model and train/evaluate


Next, we'll build our model which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use `tf.feature_column.input_layer` to create the dense input layer to our network from the feature columns. The network itself is a stack of fully connected layers; the number of layers and the number of hidden units in each are set through the `hidden_units` parameter.

Currently, we compute the accuracy between our predicted 'next article' and the actual 'next article' read by the visitor. Resolve the TODOs in the cell below by adding additional performance metrics to assess our model. You will need to 
* use the [tf.metrics library](https://www.tensorflow.org/api_docs/python/tf/metrics) to compute an additional performance metric
* add this additional metric to the metrics dictionary, and 
* include it in the tf.summary that is sent to Tensorboard.
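
As a framework-free illustration of one such additional metric, the sketch below computes top-k accuracy: a prediction counts as correct when the true label is among the k highest-scoring classes. The function names and the toy scores are assumptions made for this example, and tie handling is simplified compared with `tf.nn.in_top_k`.

```python
def in_top_k(scores, target, k):
    """Return True if `target` is among the indices of the k largest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return target in ranked[:k]

def top_k_accuracy(batch_scores, labels, k):
    """Fraction of examples whose true label is in the top k predictions."""
    hits = [in_top_k(scores, label, k) for scores, label in zip(batch_scores, labels)]
    return sum(hits) / len(hits)

# Toy batch: three examples, three classes.
batch_scores = [[0.1, 0.9, 0.3],
                [0.8, 0.1, 0.4],
                [0.2, 0.3, 0.5]]
labels = [2, 0, 0]

print(top_k_accuracy(batch_scores, labels, k=1))
print(top_k_accuracy(batch_scores, labels, k=2))
```

With many candidate articles, top-k accuracy is often more informative than plain accuracy, because a recommender usually presents several suggestions at once.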

In [13]:
def model_fn(features, labels, mode, params):
  net = tf.feature_column.input_layer(features, params['feature_columns'])
  for units in params['hidden_units']:
    net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
  # Compute logits (1 per class).
  logits = tf.layers.dense(net, params['n_classes'], activation=None) 

  predicted_classes = tf.argmax(logits, 1)
  from tensorflow.python.lib.io import file_io
    
  with file_io.FileIO('content_ids.txt', mode='r') as ifp:
    content = tf.constant([x.rstrip() for x in ifp])
  predicted_class_names = tf.gather(content, predicted_classes)
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class_ids': predicted_classes[:, tf.newaxis],
        'class_names' : predicted_class_names[:, tf.newaxis],
        'probabilities': tf.nn.softmax(logits),
        'logits': logits,
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
  table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
  labels = table.lookup(labels)
  # Compute loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Compute evaluation metrics.
  accuracy = tf.metrics.accuracy(labels=labels,
                                 predictions=predicted_classes,
                                 name='acc_op')
  #TODO (done): Compute the top_10 accuracy, using the tf.nn.in_top_k and tf.metrics.mean functions in Tensorflow
  top_10_accuracy = tf.metrics.mean(
    tf.nn.in_top_k(predictions = logits, targets = labels, k = 10)
  )
  
  metrics = {
    'accuracy': accuracy,
    #TODO (done): Add top_10_accuracy to the metrics dictionary
    'top_10_accuracy': top_10_accuracy
  }
  
  ## note: second element [1] is the `update_op`, which is the updated metric after each batch...
  tf.summary.scalar('accuracy', accuracy[1])
  #TODO (done): Add the top_10_accuracy metric to the Tensorboard summary
  tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])

  if mode == tf.estimator.ModeKeys.EVAL:
      return tf.estimator.EstimatorSpec(
          mode, loss=loss, eval_metric_ops=metrics)

  # Create training op.
  assert mode == tf.estimator.ModeKeys.TRAIN

  optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
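
The `tf.metrics` functions used above return a pair: a tensor holding the current value and an update operation that folds in a new batch and returns the updated value. The class below is a plain-Python analogy of that streaming behaviour, not TensorFlow's implementation; the name `StreamingMean` is hypothetical.

```python
class StreamingMean:
    """Running mean accumulated across batches."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_values):
        """Fold in one batch and return the updated mean (like the update op)."""
        self.total += sum(batch_values)
        self.count += len(batch_values)
        return self.value()

    def value(self):
        """Return the mean over everything seen so far (like the value tensor)."""
        return self.total / self.count if self.count else 0.0

metric = StreamingMean()
print(metric.update([1, 0, 1, 1]))  # mean of the first batch
print(metric.update([0, 0, 0, 0]))  # mean over both batches
```

This is why the summaries in the model function log element `[1]` of each metric: it is the value after the current batch has been included.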


### Train and Evaluate

In [14]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],
      'n_classes': len(content_ids_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 200)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1de0aa68d0>, '_task_type': 'worker', '_master': '', '_num_ps_replicas': 0, '_model_dir': 'content_based_model_trained', '_save_checkpoints_steps': None, '_is_chief': True, '_train_distribute': None, '_session_config': None, '_save_summary_steps': 100, '_global_id_in_cluster': 0, '_tf_random_seed': None, '_save_checkpoints_secs': 600, '_task_id': 0, '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_evaluation_master': '', '_service': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 60 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Initialize variable input_layer/title_hub_module_embedding/module/embeddings/part_0:0 from checkpoint b'/tmp/tfhub_modules/e40ef097142ae1de637df7021ce148ffe836e262/variables/variables' with embeddings
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into content_based_model_trained/model.ckpt.
INFO:tensorflow:step = 1, loss = 9.656757
INFO:tensorflow:global_step/sec: 1.99256
INFO:tensorflow:step = 101, loss = 5.3687167 (50.198 sec)
INFO:tensorflow:Saving checkpoints for 110 into content_based_model_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 5.306711.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Initialize variable input_layer/title_hub_module_embedding/module/embeddings/part_0:0 from checkpoint b'/tmp/tfhub_modules/e40ef097142ae1de637df7021ce148ffe836e262/variables/variables' with embeddings
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-20-07:56:52
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-110
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-03-20-07:57:10
INFO:tensorflow:Saving dict for global step 110: accuracy = 0.026407281, global_step = 110, loss = 5.4312987, top_10_accuracy = 0.20270324
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Initialize variable input_layer/title_hub_module_embedding/module/embeddings/part_0:0 from checkpoint b'/tmp/tfhub_modules/e40ef097142ae1de637df7021ce148ffe836e262/variables/variables' with embeddings
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-110
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 111 into content_based_model_trained/model.ckpt.
INFO:tensorflow:step = 111, loss = 5.3442545
INFO:tensorflow:Saving checkpoints for 200 into content_based_model_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 5.255849.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Initialize variable input_layer/title_hub_module_embedding/module/embeddings/part_0:0 from checkpoint b'/tmp/tfhub_modules/e40ef097142ae1de637df7021ce148ffe836e262/variables/variables' with embeddings
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-20-07:58:05
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-03-20-07:58:23
INFO:tensorflow:Saving dict for global step 200: accuracy = 0.030548068, global_step = 200, loss = 5.297825, top_10_accuracy = 0.2377046


### Make predictions with the trained model. 

With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at how our model predicts on the first five examples of the training set.  
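To start, we'll create a new file 'first_5.csv' which contains the first five examples of our training set. We'll also save the content IDs of the articles being read (the second column of each row) to a file 'first_5_content_ids' so we can later look up the current article and compare it with our recommendation. 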

In [15]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids

1000196974485173657,299836841,News,"ÖVP will Studiengebühren FPÖ in Verhandlungen ""flexibel""",Raffaela Lindorfer,574,299959410
1000196974485173657,299959410,News,Koalition: Bildungspapier mit mehr Pflichten und Noten,Peter Temel,574,299925086
1000196974485173657,299925086,News,Marihuana-Adventkalender findet in Kanada reißenden Absatz,,574,299826775
1000196974485173657,299826775,Lifestyle,Auf Bank ausgeruht: Pensionist muss Strafe zahlen,Marlene Patsalidis,574,299930679
1000196974485173657,299930679,News,Wintereinbruch naht: Erster Schnee im Osten möglich,Daniela Wahl,574,299950903


Recall, to make predictions on the trained model we pass a list of examples through the input function. Complete the code below to make predictions on the examples contained in the "first_5.csv" file we created above. 

In [16]:
#TODO: Use the predict method on our trained model to find the predictions for the examples contained in "first_5.csv".
output = list(
  estimator.predict(
    input_fn = read_dataset(filename = "first_5.csv", mode = tf.estimator.ModeKeys.PREDICT)
  )
)
print(output)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Initialize variable input_layer/title_hub_module_embedding/module/embeddings/part_0:0 from checkpoint b'/tmp/tfhub_modules/e40ef097142ae1de637df7021ce148ffe836e262/variables/variables' with embeddings
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


[{'class_ids': array([4]), 'class_names': array([b'299410466'], dtype=object), 'probabilities': array([1.1793807e-02, 2.0582383e-02, 1.1280167e-03, ..., 2.0453667e-06,
       1.2166000e-06, 2.2878965e-06], dtype=float32), 'logits': array([ 8.577154  ,  9.134015  ,  6.2300406 , ..., -0.08259842,
       -0.60211575,  0.02945683], dtype=float32)}, {'class_ids': array([59]), 'class_names': array([b'299836255'], dtype=object), 'probabilities': array([9.5619652e-03, 1.7562550e-02, 1.1298506e-03, ..., 3.4744508e-06,
       2.2263582e-06, 3.9051861e-06], dtype=float32), 'logits': array([ 7.8327594 ,  8.440735  ,  5.6970515 , ..., -0.08735287,
       -0.53242224,  0.02951602], dtype=float32)}, {'class_ids': array([59]), 'class_names': array([b'299836255'], dtype=object), 'probabilities': array([9.8370342e-03, 1.7713044e-02, 1.1635502e-03, ..., 3.7038003e-06,
       2.3744481e-06, 4.1816338e-06], dtype=float32), 'logits': array([ 7.7908287 ,  8.378976  ,  5.6561503 , ..., -0.0937212 ,
       -0.
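
Each element of the output above is a dictionary with the keys `class_ids`, `class_names`, `probabilities`, and `logits`. The sketch below shows, in plain Python, how those four entries relate to one another for a single example. The helper names and the three-item vocabulary are made up for illustration; in the lab the vocabulary comes from content_ids.txt.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, vocabulary):
    """Build a prediction dictionary shaped like the estimator output."""
    class_id = max(range(len(logits)), key=lambda i: logits[i])
    return {
        "class_ids": [class_id],
        "class_names": [vocabulary[class_id]],
        "probabilities": softmax(logits),
        "logits": logits,
    }

vocabulary = ["article_a", "article_b", "article_c"]  # hypothetical content ids
prediction = to_prediction([1.0, 3.0, 2.0], vocabulary)
print(prediction["class_names"], prediction["probabilities"])
```

The predicted class is simply the position of the largest logit, and the same position holds the largest probability because softmax preserves ordering.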

In [17]:
import numpy as np
# Each "class_names" entry is a one-element array of bytes; take the element and decode it.
recommended_content_ids = [d["class_names"][0].decode('UTF-8') for d in output]
content_ids = open("first_5_content_ids").read().splitlines()

Finally, we'll map the content id back to the article title. We can then compare the article the visitor was reading with our model's recommendation for the first of our examples. This can all be done in BigQuery. Look through the query below and make sure it is clear what is being returned.

In [18]:
import google.datalab.bigquery as bq
recommended_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(recommended_content_ids[0])

current_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(content_ids[0])
recommended_title = bq.Query(recommended_title_sql).execute().result().to_dataframe()['title'].tolist()[0]
current_title = bq.Query(current_title_sql).execute().result().to_dataframe()['title'].tolist()[0]
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))

Current title: ÖVP will Studiengebühren, FPÖ in Verhandlungen "flexibel" 
Recommended title: Carfentanil: Der „serial killer“ ist in Österreich aufgetaucht
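
The subquery `SELECT MAX(IF(index=N, value, NULL)) FROM UNNEST(hits.customDimensions)` used in both queries picks a single custom dimension out of a repeated field. A plain-Python sketch of that lookup is shown below. The sample hit is invented for illustration; judging from the queries, index 6 holds the title and index 10 holds the content id.

```python
def custom_dimension(hit, index):
    """Return the value stored under `index` in a hit's customDimensions, or None."""
    values = [d["value"] for d in hit["customDimensions"] if d["index"] == index]
    # MAX over a single match returns that match; no match yields NULL (None here).
    return max(values) if values else None

# Hypothetical hit record shaped like one row of UNNEST(hits).
hit = {
    "type": "PAGE",
    "customDimensions": [
        {"index": 6, "value": "Example article title"},
        {"index": 10, "value": "123456789"},
    ],
}

print(custom_dimension(hit, 10))  # content id
print(custom_dimension(hit, 6))   # title
print(custom_dimension(hit, 99))  # None, no such dimension
```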


### Tensorboard

As usual, we can monitor the performance of our training job using Tensorboard. 

In [20]:
from google.datalab.ml import TensorBoard
TensorBoard().start('content_based_model_trained')

4313

In [21]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print("Stopped TensorBoard with pid {}".format(pid))

Stopped TensorBoard with pid 4295
Stopped TensorBoard with pid 4313
