## Content-Based Filtering Using Neural Networks

This notebook relies on files created in the [content_based_preproc.ipynb](./content_based_preproc.ipynb) notebook, so run that notebook before starting this one.  
We'll use the **Python 3** kernel from here on, so switch kernels if the notebook is still running Python 2.

This lab illustrates:
1. how to build feature columns for a model using `tf.feature_column`
2. how to create custom evaluation metrics and add them to TensorBoard
3. how to train a model and make predictions with the saved model

TensorFlow Hub should already be installed. You can check by running `pip freeze`.

In [1]:
%%bash
pip freeze | grep tensor

tensorboard==1.15.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==1.15.3
tensorflow-cloud==0.1.13
tensorflow-data-validation==0.23.1
tensorflow-datasets==1.2.0
tensorflow-estimator==1.15.1
tensorflow-hub==0.7.0
tensorflow-io==0.8.1
tensorflow-metadata==0.23.0
tensorflow-model-analysis==0.23.0
tensorflow-probability==0.8.0
tensorflow-serving-api==1.15.0
tensorflow-transform==0.23.0


Let's make sure we install the necessary version of tensorflow-hub. After doing the pip install below, click **"Restart the kernel"** on the notebook so that the Python environment picks up the new packages.

In [3]:
!pip3 install tensorflow-hub==0.7.0
!pip3 install --upgrade tensorflow==1.15.3
!pip3 install google-cloud-bigquery==1.10

#### **Note**: Please ignore any incompatibility warnings and errors and re-run the cell to view the installed tensorflow version.

In [4]:
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

PROJECT = 'qwiklabs-gcp-01-493597224b34' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'qwiklabs-gcp-01-493597224b34' # REPLACE WITH YOUR BUCKET NAME
REGION = 'australia-southeast1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

tf.__version__

'1.15.3'

In [5]:
# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = tf.__version__

In [6]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Build the feature columns for the model.

To start, we'll load the list of categories, authors and article ids we created in the previous **Create Datasets** notebook.

In [7]:
categories_list = open("categories.txt").read().splitlines()
authors_list = open("authors.txt").read().splitlines()
content_ids_list = open("content_ids.txt").read().splitlines()

mean_months_since_epoch = 523

In the cell below we'll define the feature columns to use in our model. If necessary, remind yourself of the [various feature columns](https://www.tensorflow.org/api_docs/python/tf/feature_column) available.  
For the embedded_title_column feature column, use a TensorFlow Hub module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German language embedding module.  
Explore the text embedding TensorFlow Hub modules [available here](https://alpha.tfhub.dev/). Filter by setting the language to 'German'. The 50-dimensional embedding should be sufficient for our purposes. 

In [9]:
embedded_title_column = hub.text_embedding_column(
    key="title", 
    module_spec="https://tfhub.dev/google/nnlm-de-dim50/1",
    trainable=False
)

content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="content_id",
    hash_bucket_size=len(content_ids_list) + 1
)

embedded_content_column = tf.feature_column.embedding_column(
    categorical_column=content_id_column,
    dimension=10
)

author_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="author",
    hash_bucket_size=len(authors_list) + 1
)

embedded_author_column = tf.feature_column.embedding_column(
    categorical_column=author_column,
    dimension=3
)

category_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="category",
    vocabulary_list=categories_list,
    num_oov_buckets=1
)
category_column = tf.feature_column.indicator_column(category_column_categorical)

months_since_epoch_boundaries = list(range(400, 700, 20)) # [400, 420, 440, ..., 660, 680]
months_since_epoch_column = tf.feature_column.numeric_column(key="months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(
    source_column = months_since_epoch_column,
    boundaries = months_since_epoch_boundaries
)

crossed_months_since_category_column = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column(
        keys = [category_column_categorical, months_since_epoch_bucketized], 
        hash_bucket_size = len(months_since_epoch_boundaries) * (len(categories_list) + 1)
    )
)

feature_columns = [
    embedded_content_column,
    embedded_author_column,
    category_column,
    embedded_title_column,
    crossed_months_since_category_column
] 
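
As a rough illustration of what the bucketized `months_since_epoch` column does with these boundaries, the pure-Python sketch below mimics the bucketing rule, where each boundary is the inclusive lower edge of the next bucket. `bucket_index` is a hypothetical helper written for this example, not part of TensorFlow.

```python
import bisect

# Same boundaries as in the cell above: [400, 420, ..., 680].
boundaries = list(range(400, 700, 20))

def bucket_index(value, boundaries):
    """Return the bucket a value falls into, treating each boundary as the
    inclusive lower edge of the next bucket."""
    return bisect.bisect_right(boundaries, value)

# N boundaries produce N + 1 buckets (one below the first, one above the last).
num_buckets = len(boundaries) + 1

for months in (350, 400, 523, 699):
    print(months, "->", bucket_index(months, boundaries))
```

Values below 400 share the first bucket and values of 680 or more share the last one, so the bucketized column has one more bucket than there are boundaries.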

### Create the input function.

Next we'll create the input function for our model. This input function reads the data from the csv files we created in the previous labs. 

In [10]:
record_defaults = [["Unknown"], ["Unknown"], ["Unknown"], ["Unknown"], ["Unknown"], [mean_months_since_epoch], ["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"

def read_dataset(filename, mode, batch_size = 512):
    def _input_fn():
        def decode_csv(value_column):
            columns = tf.io.decode_csv(value_column, record_defaults=record_defaults)
            features = dict(zip(column_keys, columns))          
            label = features.pop(label_key)         
            return features, label

        # Create list of files that match pattern
        file_list = tf.io.gfile.glob(filename)

        # Create dataset from file list
        dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return _input_fn
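
To make the decoding step concrete, here is a minimal sketch in plain Python (no TensorFlow) of what `decode_csv` does to one line: split the fields, fill empty fields with the defaults, and pop the label out of the feature dictionary. `parse_line` is a hypothetical helper for illustration only, and the sample line follows the format of the training CSV.

```python
import csv

column_keys = ["visitor_id", "content_id", "category", "title", "author",
               "months_since_epoch", "next_content_id"]
defaults = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown", 523, "Unknown"]
label_key = "next_content_id"

def parse_line(line):
    """Split one CSV line into (features, label), filling empty fields with defaults."""
    values = next(csv.reader([line]))
    row = {}
    for key, value, default in zip(column_keys, values, defaults):
        if value == "":
            row[key] = default          # missing field: fall back to the default
        elif isinstance(default, int):
            row[key] = int(value)       # numeric column
        else:
            row[key] = value            # string column
    label = row.pop(label_key)
    return row, label

# The author field is empty in this line, so it falls back to "Unknown".
line = ("1045356747303546594,299792812,News,"
        "Bundesliga: Kein Videobeweis beim Schlager Rapid-Salzburg,,574,299779564")
features, label = parse_line(line)
print(features["author"], features["months_since_epoch"], label)
```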

### Create the model and train/evaluate


Next, we'll build our model, which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use `tf.feature_column.input_layer` to create the dense input layer of the network. The network is a simple feed-forward stack of dense layers; the number and size of the hidden layers are set through the `hidden_units` parameter.

The model reports the accuracy between the predicted 'next article' and the article the visitor actually read next. We also add top 10 accuracy as a second performance metric: we compute it, add it to the metrics dictionary below, and write it to `tf.summary` so that the value is reported to TensorBoard as well.

In [12]:
def model_fn(features, labels, mode, params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    for units in params['hidden_units']:
        net = tf.layers.dense(net, units=units, activation=tf.nn.relu)

    # Compute logits (1 per class).
    logits = tf.layers.dense(net, params['n_classes'], activation=None) 

    predicted_classes = tf.argmax(logits, 1)
    from tensorflow.python.lib.io import file_io
    
    with file_io.FileIO('content_ids.txt', mode='r') as ifp:
        content = tf.constant([x.rstrip() for x in ifp])
    
    predicted_class_names = tf.gather(content, predicted_classes)
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'class_ids': predicted_classes[:, tf.newaxis],
            'class_names' : predicted_class_names[:, tf.newaxis],
            'probabilities': tf.nn.softmax(logits),
            'logits': logits,
        }
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
    labels = table.lookup(labels)
  
    # Compute loss.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Compute evaluation metrics.
    accuracy = tf.metrics.accuracy(
        labels=labels,
        predictions=predicted_classes,
        name='acc_op'
    )
    top_10_accuracy = tf.metrics.mean(
        tf.nn.in_top_k(predictions=logits, 
                       targets=labels, 
                       k=10)
    )
    eval_metrics = {
        'accuracy': accuracy, 'top_10_accuracy' : top_10_accuracy
    }
  
    tf.summary.scalar('accuracy', accuracy[1])
    tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metrics)

    # Create training op.
    assert mode == tf.estimator.ModeKeys.TRAIN

    optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
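
The top 10 accuracy metric in `model_fn` counts an example as correct when the true label is among the ten highest-scoring classes. The sketch below shows the same idea in plain Python on a toy example with four classes; `top_k_accuracy` is a hypothetical helper, and it does not reproduce how TensorFlow breaks ties between equal logits.

```python
def top_k_accuracy(logits, labels, k):
    """Fraction of examples whose true class index is among the k highest logits."""
    hits = 0
    for row, label in zip(logits, labels):
        # Indices of the k largest logits in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

logits = [
    [0.1, 2.0, 0.3, 1.5],  # class ranking: 1, 3, 2, 0
    [3.0, 0.2, 0.1, 0.4],  # class ranking: 0, 3, 1, 2
    [0.5, 0.6, 0.7, 0.8],  # class ranking: 3, 2, 1, 0
]
labels = [3, 1, 3]

top_1 = top_k_accuracy(logits, labels, k=1)  # only the third example is a hit
top_2 = top_k_accuracy(logits, labels, k=2)  # the first and third examples are hits
print(top_1, top_2)
```

With many thousands of candidate articles, plain accuracy is a very strict measure, which is why the looser top 10 variant is a useful companion metric.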

### Train and Evaluate

In [13]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time

# tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
        'feature_columns': feature_columns,
        'hidden_units': [200, 100, 50],
        'n_classes': len(content_ids_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 2000
)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60
)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'content_based_model_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb8b7e71b10>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.


INFO:tensorflow:Calling model_fn.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2022-03-26 22:55:17.540114: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2022-03-26 22:55:18.296252: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-03-26 22:55:18.302931: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2199995000 Hz
2022-03-26 22:55:18.303192: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563bf58ddd70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-26 22:55:18.303216: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


INFO:tensorflow:Saving checkpoints for 0 into content_based_model_trained/model.ckpt.
INFO:tensorflow:loss = 9.657322, step = 1
INFO:tensorflow:global_step/sec: 8.53615
INFO:tensorflow:loss = 5.750519, step = 101 (11.717 sec)
INFO:tensorflow:global_step/sec: 8.68299
INFO:tensorflow:loss = 4.7716455, step = 201 (11.517 sec)
INFO:tensorflow:global_step/sec: 8.70939
INFO:tensorflow:loss = 4.968508, step = 301 (11.482 sec)
INFO:tensorflow:global_step/sec: 8.64966
INFO:tensorflow:loss = 4.6547394, step = 401 (11.561 sec)
INFO:tensorflow:global_step/sec: 8.41465
INFO:tensorflow:loss = 5.766776, step = 501 (11.886 sec)
INFO:tensorflow:global_step/sec: 8.20565
INFO:tensorflow:loss = 5.5914464, step = 601 (12.184 sec)
INFO:tensorflow:global_step/sec: 8.63053
INFO:tensorflow:loss = 4.7110977, step = 701 (11.587 sec)
INFO:tensorflow:global_step/sec: 8.67429
INFO:tensorflow:loss = 4.6834335, step = 801 (11.530 sec)
INFO:tensorflow:global_step/sec: 8.62521
INFO:tensorflow:loss = 4.2940283, step = 901 (11.592 sec)
INFO:tensorflow:global_step/sec: 8.57679
INFO:tensorflow:loss = 5.5169578, step = 1001 (11.659 sec)
INFO:tensorflow:global_step/sec: 8.51176
INFO:tensorflow:loss = 4.62759, step = 1101 (11.749 sec)
INFO:tensorflow:global_step/sec: 8.66077
INFO:tensorflow:loss = 4.8340554, step = 1201 (11.546 sec)
INFO:tensorflow:global_step/sec: 8.69398
INFO:tensorflow:loss = 4.5686646, step = 1301 (11.502 sec)
INFO:tensorflow:global_step/sec: 8.60412
INFO:tensorflow:loss = 5.4971094, step = 1401 (11.622 sec)
INFO:tensorflow:global_step/sec: 8.65997
INFO:tensorflow:loss = 5.3723993, step = 1501 (11.547 sec)
INFO:tensorflow:global_step/sec: 8.613
INFO:tensorflow:loss = 4.8238387, step = 1601 (11.610 sec)
INFO:tensorflow:global_step/sec: 8.4069
INFO:tensorflow:loss = 4.5170774, step = 1701 (11.895 sec)
INFO:tensorflow:global_step/sec: 8.73248
INFO:tensorflow:loss = 4.234061, step = 1801 (11.452 sec)
INFO:tensorflow:global_step/sec: 8.69798
INFO:tensorflow:loss = 5.352684, step = 1901 (11.497 sec)
INFO:tensorflow:Saving checkpoints for 2000 into content_based_model_trained/model.ckpt.


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2022-03-26 22:59:14.108150: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2022-03-26T22:59:14Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2022-03-26-22:59:19
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.0403922, global_step = 2000, loss = 5.096409, top_10_accuracy = 0.2619243
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Loss for final step: 4.5175714.


({'accuracy': 0.0403922,
  'loss': 5.096409,
  'top_10_accuracy': 0.2619243,
  'global_step': 2000},
 [])

This takes a while to complete. In the run shown above, the model reaches about **26% top 10 accuracy** (`top_10_accuracy = 0.2619243`).
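
To put that number in context, it helps to compare it with chance. The sketch below is a back-of-the-envelope estimate only: it assumes the untrained network predicts roughly uniformly, in which case the first logged loss is close to ln(N) for N classes. The implied class count is therefore an approximation, not the actual length of `content_ids_list`.

```python
import math

first_step_loss = 9.657322    # loss logged at step 1
top_10_accuracy = 0.2619243   # evaluation result at step 2000

# Cross-entropy of a uniform guess over N classes is ln(N), so the first
# logged loss gives a rough estimate of N (an approximation, not a count).
implied_classes = math.exp(first_step_loss)

# Picking 10 distinct classes at random hits the label with probability 10 / N.
chance_top_10 = 10 / implied_classes

print(round(implied_classes), chance_top_10, top_10_accuracy / chance_top_10)
```

Under that assumption the model is far better than random guessing, even though its absolute accuracy is low.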

### Make predictions with the trained model. 

With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at how our model predicts on the first five examples of the training set.  
To start, we'll create a new file 'first_5.csv' which contains the first five rows of our training set. We'll also save the content IDs of the articles being read (the second column) to a file 'first_5_content_ids' so we can compare each current article with the model's recommendation. 

In [14]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv

awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids

1030878773401944300,299974496,News,Kurier TV-News: Die Baustelle Bildung,Stefan Berndl,574,299830996
1030878773401944300,299830996,News,Wie die Schule in der Neuzeit ankommen könnte,Martina Salomon,574,299901255
1045356747303546594,299792812,News,Bundesliga: Kein Videobeweis beim Schlager Rapid-Salzburg,,574,299779564
1045356747303546594,299779564,Stars & Kultur,Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar ,Elisabeth Spitzer,574,299809748
104581626240810883,299982579,News,VIDEO: Basejumper springen von Berg in Flugzeug,Mathias Kainz,574,299935287
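
For readers less familiar with `awk`, the following sketch shows in plain Python which columns matter here, using the first two rows printed above. The second field is the article currently being read (what the `awk` command extracts) and the last field is the label, the article read next.

```python
import csv

rows = [
    "1030878773401944300,299974496,News,Kurier TV-News: Die Baustelle Bildung,Stefan Berndl,574,299830996",
    "1030878773401944300,299830996,News,Wie die Schule in der Neuzeit ankommen könnte,Martina Salomon,574,299901255",
]

parsed = list(csv.reader(rows))
current_ids = [fields[1] for fields in parsed]   # second column: content_id
next_ids = [fields[-1] for fields in parsed]     # last column: next_content_id

print(current_ids)
print(next_ids)
```

Both rows belong to the same visitor, so the label of the first row is the current article of the second.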


To make predictions with the trained model, we pass the examples through the same input function, this time in `PREDICT` mode. The code below makes predictions on the examples contained in the "first_5.csv" file we created above. 

In [15]:
output = list(
    estimator.predict(
        input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)
    )
)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2022-03-26 23:02:49.209728: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [23]:
import numpy as np

# Each prediction holds a length-1 array of bytes; .item() extracts the scalar
# (np.asscalar is deprecated in favour of ndarray.item()).
recommended_content_ids = [d["class_names"].item().decode('UTF-8') for d in output]
content_ids = open("first_5_content_ids").read().splitlines()


Finally, we map the content id back to the article title. Let's compare our model's recommendation for the first example. This can be done in BigQuery. Look through the query below and make sure it is clear what is being returned.

In [24]:
from google.cloud import bigquery

template_recommended_title_sql="""
#standardSQL
SELECT (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, UNNEST(hits) AS hits
WHERE hits.type = "PAGE" # only include hits on pages
AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1"""

template_current_title_sql="""
#standardSQL
SELECT (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, UNNEST(hits) AS hits
WHERE hits.type = "PAGE" # only include hits on pages
AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1"""
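
If the `MAX(IF(index=..., value, NULL))` pattern is unfamiliar, the sketch below mimics it in plain Python on made-up data: each hit carries a list of custom dimensions, index 10 holds the content id used in the filter, and index 6 holds the title that is returned. The hit records and the helpers `dimension_value` and `title_for` are hypothetical and exist only for this illustration.

```python
def dimension_value(custom_dimensions, index):
    """Return the value stored under `index`, or None if it is absent
    (the role MAX(IF(index=..., value, NULL)) plays in the query)."""
    values = [d["value"] for d in custom_dimensions if d["index"] == index]
    return max(values) if values else None

# Made-up hits; real rows carry many more custom dimensions.
hits = [
    {"type": "PAGE", "customDimensions": [
        {"index": 6, "value": "Example title A"},
        {"index": 10, "value": "111"}]},
    {"type": "EVENT", "customDimensions": [
        {"index": 6, "value": "Example title B"},
        {"index": 10, "value": "222"}]},
    {"type": "PAGE", "customDimensions": [
        {"index": 6, "value": "Example title C"},
        {"index": 10, "value": "222"}]},
]

def title_for(content_id, hits):
    """Return the title of one PAGE hit whose content id matches, if any."""
    for hit in hits:
        if hit["type"] != "PAGE":
            continue  # only include hits on pages
        if dimension_value(hit["customDimensions"], 10) == content_id:
            return dimension_value(hit["customDimensions"], 6)
    return None

print(title_for("222", hits))
```

The query's `LIMIT 1` has no `ORDER BY`, so BigQuery may return any matching row; the sketch simply takes the first match.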

In [25]:

client = bigquery.Client()  # create the client once and reuse it for every query

def lookup_title(sql):
    """Run a query and return the first title in the result, stripped of whitespace."""
    return client.query(sql).to_dataframe()['title'].tolist()[0].strip()

recommended_title_sqls = [
    template_recommended_title_sql.format(recommended_content_ids[ix]) for ix in range(5)
]
current_title_sqls = [
    template_current_title_sql.format(content_ids[ix]) for ix in range(5)
]

# Keep the titles as str; encoding them to bytes would print b'...' with escaped umlauts.
recommended_title = [lookup_title(sql) for sql in recommended_title_sqls]
current_title = [lookup_title(sql) for sql in current_title_sqls]

for ix in range(5):
    print(f"{ix}. Current title: {current_title[ix]} / Recommended title: {recommended_title[ix]}")
    
# Not great

0. Current title: Kurier TV-News: Die Baustelle Bildung / Recommended title: Auf Bank ausgeruht: Pensionist muss Strafe zahlen
1. Current title: Wie die Schule in der Neuzeit ankommen könnte / Recommended title: Fahnenskandal von Mailand: Die Austria zeigt Flagge
2. Current title: Bundesliga: Kein Videobeweis beim Schlager Rapid-Salzburg / Recommended title: Fahnenskandal von Mailand: Die Austria zeigt Flagge
3. Current title: Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar / Recommended title: "Hat mich gerettet": Andre Agassi über Steffi Graf & Familie
4. Current title: VIDEO: Basejumper springen von Berg in Flugzeug / Recommended title: 27-Jähriger soll betagte Nachbarin vergewaltigt haben
