##### Copyright 2018 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build a linear model with Estimators

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/estimators/linear"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/estimators/linear.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/estimators/linear.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

This tutorial uses the `tf.estimator` API in TensorFlow to solve a benchmark binary classification problem. Estimators are TensorFlow's most scalable and production-oriented model type. For more information see the [Estimator guide](https://www.tensorflow.org/guide/estimators).

## Overview

Using census data that describes a person's age, education, marital status, and occupation (the *features*), we will try to predict whether the person earns more than 50,000 dollars a year (the target *label*). We will train a *logistic regression* model that, given an individual's information, outputs a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.
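To make the "number between 0 and 1" concrete, here is a minimal sketch of a logistic regression prediction in plain Python. The function name `predict_probability`, the weights, and the bias are made up for illustration; a trained Estimator learns its own values.

```python
import math

def predict_probability(features, weights, bias):
    """Return sigmoid(bias + w . x) for one example given as a dict."""
    logit = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights, not learned from the census data.
example = {'age': 39.0, 'education_num': 13.0}
weights = {'age': 0.02, 'education_num': 0.1}
probability = predict_probability(example, weights, bias=-3.0)
print(probability)
```

Whatever the weights are, the sigmoid squashes the weighted sum into the open interval (0, 1), which is why the output can be read as a probability.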

Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).

## Setup

Import TensorFlow, feature column support, and supporting modules:

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

And let's enable [eager execution](https://www.tensorflow.org/guide/eager) to inspect this program as we run it:

In [3]:
tf.enable_eager_execution()

## Download the official implementation

We'll use the [wide and deep model](https://github.com/tensorflow/models/tree/master/official/wide_deep/) available in TensorFlow's [model repository](https://github.com/tensorflow/models/). Download the code by cloning the repository:

In [4]:
! pip install requests
! git clone --depth 1 https://github.com/tensorflow/models


Add the root directory of the repository to your Python path:

In [6]:
models_path = os.path.join(os.getcwd(), 'models')

sys.path.append(models_path)

Download the dataset:

In [7]:
from official.wide_deep import census_dataset
from official.wide_deep import census_main

census_dataset.download("/tmp/census_data/")

### Command line usage

The repo includes a complete program for experimenting with this type of model.

To execute the tutorial code from the command line, first add the path to `tensorflow/models` to your `PYTHONPATH`.

In [8]:
# Shell equivalent: export PYTHONPATH=${PYTHONPATH}:"$(pwd)/models"
# When running from Python, set `os.environ` instead; otherwise the subprocess will not see the directory.

if "PYTHONPATH" in os.environ:
  os.environ['PYTHONPATH'] += os.pathsep +  models_path
else:
  os.environ['PYTHONPATH'] = models_path

Use `--help` to see what command line options are available:

In [9]:
!python -m official.wide_deep.census_main --help

Train DNN on census income dataset.
flags:

/home/jupyter/DLforCompMath/Tutorial2/models/official/wide_deep/census_main.py:
  -bs,--batch_size:
    Batch size for training and evaluation. When using multiple gpus, this is
    the
    global batch size for all devices. For example, if the batch size is 32 and
    there are 4 GPUs, each GPU will get 8 examples on each step.
    (default: '40')
    (an integer)
  --[no]clean:
    If set, model_dir will be removed if it exists.
    (default: 'false')
  -dd,--data_dir:
    The location of the input data.
    (default: '/tmp/census_data')
  --[no]download_if_missing:
    Download data to data_dir if it is not already present.
    (default: 'true')
  -ebe,--epochs_between_evals:
    The number of training epochs to run between evaluations.
    (default: '2')
    (an integer)
  -ed,--export_dir:
    If set, a SavedModel serialization of the model will be exported to this
    directory at the end of training. See the R

Now run the model:


In [10]:
!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2

I0723 06:40:58.792081 139835868321600 estimator.py:201] Using config: {'_model_dir': '/tmp/census_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': device_count {
  key: "GPU"
  value: 0
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2dec3846d8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
W0723 06:40:58.792726 139835868321600 tf_logging.py:161] 'cpuinfo' not imported. CPU info will not be logged.
I0723 06:40:59.021049 139835868321600 logger.py:152] Benchmark run: {'model_name': 'wide_de

I0723 06:41:10.532779 139835868321600 basic_session_run_hooks.py:680] global_step/sec: 164.341
I0723 06:41:10.533220 139835868321600 basic_session_run_hooks.py:247] average_loss = 0.3098352, loss = 12.393408 (0.608 sec)
I0723 06:41:10.533364 139835868321600 basic_session_run_hooks.py:247] loss = 12.393408, step = 1401 (0.608 sec)
I0723 06:41:11.128962 139835868321600 basic_session_run_hooks.py:680] global_step/sec: 167.737
I0723 06:41:11.129641 139835868321600 basic_session_run_hooks.py:247] average_loss = 0.3877038, loss = 15.508152 (0.596 sec)
I0723 06:41:11.129849 139835868321600 basic_session_run_hooks.py:247] loss = 15.508152, step = 1501 (0.596 sec)
I0723 06:41:11.726249 139835868321600 basic_session_run_hooks.py:680] global_step/sec: 167.421
I0723 06:41:11.726667 139835868321600 basic_session_run_hooks.py:247] average_loss = 0.20675388, loss = 8.270155 (0.597 sec)
I0723 06:41:11.726832 139835868321600 basic_session_run_hooks.py:247] loss = 8.270155, step = 1601 (0.597 sec)
I0723

## Read the U.S. Census data

This example uses the [U.S. Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income) from 1994 and 1995. We have provided the [census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py) script to download the data and perform a little cleanup.

Since the task is a *binary classification problem*, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise. For reference, see the `input_fn` in [census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).

Let's look at the data to see which columns we can use to predict the target label:

In [11]:
!ls  /tmp/census_data/

adult.data  adult.test


In [12]:
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

[pandas](https://pandas.pydata.org/) provides some convenient utilities for data analysis. Here's a list of columns available in the Census Income dataset:

In [13]:
import pandas

train_df = pandas.read_csv(train_file, header=None, names=census_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header=None, names=census_dataset._CSV_COLUMNS)

train_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


The columns are grouped into two types: *categorical* and *continuous* columns:

* A column is called *categorical* if its value can only be one of the categories in a finite set. For example, the relationship status of a person (wife, husband, unmarried, etc.) or the education level (high school, college, etc.) are categorical columns.
* A column is called *continuous* if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
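As a rough illustration of the distinction, the sketch below labels a column by inspecting a few sample values. The helper `column_kind` is hypothetical and the rule is only a heuristic: an integer-coded category would be mislabeled as continuous, so in practice the grouping is a modeling decision.

```python
def column_kind(values):
    """Label a column 'continuous' if every sample is a number, else 'categorical'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    return 'continuous' if is_number else 'categorical'

# A few values copied from the first rows of the training data.
sample = {
    'age': [39, 50, 38],
    'education': ['Bachelors', 'Bachelors', 'HS-grad'],
    'capital_gain': [2174, 0, 0],
    'relationship': ['Not-in-family', 'Husband', 'Not-in-family'],
}
kinds = {name: column_kind(values) for name, values in sample.items()}
print(kinds)
```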

## Converting Data into Tensors

When building a `tf.estimator` model, the input data is specified by using an *input function* (or `input_fn`). This builder function returns a `tf.data.Dataset` of batches of `(features-dict, label)` pairs. It is not called until it is passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.

The input builder function returns the following pair:

1. `features`: A dict from feature names to `Tensors` or `SparseTensors` containing batches of features.
2. `labels`: A `Tensor` containing batches of labels.

The keys of the `features` are used to configure the model's input layer.

Note: The input function is called while constructing the TensorFlow graph, *not* while running the graph. It returns a representation of the input data as a sequence of TensorFlow graph operations.
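The shape of what an input function produces can be sketched without TensorFlow. The generator below is a hypothetical plain-Python stand-in (the name `batches` and the `seed` parameter are assumptions): it yields `(features, labels)` batches from a dict of columns, with optional shuffling and a fixed number of epochs. A real `input_fn` returns graph operations rather than Python lists.

```python
import random

def batches(columns, label_key, num_epochs, shuffle, batch_size, seed=0):
    """Yield (features_dict, labels) batches from a dict of equal-length lists."""
    n = len(columns[label_key])
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(n))
        if shuffle:
            rng.shuffle(order)  # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            features = {k: [v[i] for i in idx]
                        for k, v in columns.items() if k != label_key}
            labels = [columns[label_key][i] for i in idx]
            yield features, labels

# Toy data: ages and labels from the first five training rows.
toy = {'age': [39, 50, 38, 53, 28],
       'income_bracket': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K']}
for features, labels in batches(toy, 'income_bracket', num_epochs=1,
                                shuffle=False, batch_size=2):
    print(features, labels)
```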

For small problems like this, it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:

In [14]:
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  label = df[label_key]
  ds = tf.data.Dataset.from_tensor_slices((dict(df),label))

  if shuffle:
    ds = ds.shuffle(10000)

  ds = ds.batch(batch_size).repeat(num_epochs)

  return ds

Since we have eager execution enabled, it's easy to inspect the resulting dataset:

In [15]:
ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys())[:5])
  print()
  print('A batch of Ages  :', feature_batch['age'])
  print()
  print('A batch of Labels:', label_batch )

Some feature keys: ['age', 'workclass', 'fnlwgt', 'education', 'education_num']

A batch of Ages  : tf.Tensor([27 66 21 64 29 39 41 48 22 39], shape=(10,), dtype=int32)

A batch of Labels: tf.Tensor(
[b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'>50K' b'>50K'
 b'<=50K' b'<=50K'], shape=(10,), dtype=string)


But this approach has severely limited scalability. Larger datasets should be streamed from disk. The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`:

<!-- TODO(markdaoust): This `input_fn` should use `tf.contrib.data.make_csv_dataset` -->

In [16]:
import inspect
print(inspect.getsource(census_dataset.input_fn))

def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = 

This `input_fn` returns equivalent output:

In [17]:
ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Feature keys:', list(feature_batch.keys())[:5])
  print()
  print('Age batch   :', feature_batch['age'])
  print()
  print('Label batch :', label_batch )

INFO:tensorflow:Parsing /tmp/census_data/adult.data


I0723 06:41:51.094161 140376063911744 census_dataset.py:167] Parsing /tmp/census_data/adult.data


Feature keys: ['age', 'workclass', 'fnlwgt', 'education', 'education_num']

Age batch   : tf.Tensor([42 57 25 29 17 46 62 37 23 56], shape=(10,), dtype=int32)

Label batch : tf.Tensor([False  True  True False False  True False  True False False], shape=(10,), dtype=bool)
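The streaming idea behind that `input_fn` can be sketched with the standard library alone: read one line at a time, parse it, and split off the label. This is not the repository's implementation; `stream_examples` and the shortened three-column list are assumptions, and the two rows are made-up stand-ins for lines of the data file.

```python
import csv
import io

COLUMNS = ['age', 'workclass', 'income_bracket']  # shortened, hypothetical column list

def stream_examples(lines, columns=COLUMNS):
    """Lazily yield (features, label) pairs, one parsed CSV line at a time."""
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        features = dict(zip(columns, row))
        features['age'] = int(features['age'])
        label = features.pop('income_bracket') == '>50K'  # binary label
        yield features, label

# An in-memory stand-in for an open file; a real file object works the same way.
fake_file = io.StringIO('39, State-gov, <=50K\n52, Self-emp-inc, >50K\n')
examples = list(stream_examples(fake_file))
print(examples)
```

Because the function is a generator, only the current line is held in memory, which is the property that lets the approach scale to files larger than RAM.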


Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap a configurable input function in an object with the expected signature. For this notebook, configure `train_inpf` to iterate over the data twice:

In [18]:
import functools

train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)
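To see what `functools.partial` does here, the sketch below wraps a hypothetical stand-in function (`configurable_input_fn` is not part of the tutorial code) that simply reports the configuration it was called with.

```python
import functools

def configurable_input_fn(data_file, num_epochs, shuffle, batch_size):
    """Hypothetical stand-in that just reports its configuration."""
    return {'data_file': data_file, 'num_epochs': num_epochs,
            'shuffle': shuffle, 'batch_size': batch_size}

train_fn = functools.partial(configurable_input_fn, '/tmp/census_data/adult.data',
                             num_epochs=2, shuffle=True, batch_size=64)

# The wrapped callable takes no arguments; all settings were bound in advance.
config = train_fn()
print(config)
```

A `lambda` that calls the function with fixed arguments would work equally well; `partial` just keeps the bound arguments inspectable.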

## Selecting and Engineering Features for the Model

Estimators use a system called [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A *feature column* can be either one of the raw inputs in the original features `dict` (a *base feature column*), or a new column created using transformations defined over one or more base columns (a *derived feature column*).

A feature column is an abstract concept of any raw or derived variable that can be used to predict the target label.

### Base Feature Columns

#### Numeric columns

The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:

In [19]:
age = fc.numeric_column('age')

The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:

In [20]:
fc.input_layer(feature_batch, [age]).numpy()

array([[42.],
       [57.],
       [25.],
       [29.],
       [17.],
       [46.],
       [62.],
       [37.],
       [23.],
       [56.]], dtype=float32)

The following will train and evaluate a model using only the `age` feature:

In [22]:
classifier = tf.estimator.LinearClassifier(feature_columns=[age])
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()  # used for display in notebook
print(result)

{'accuracy': 0.7570788, 'accuracy_baseline': 0.76377374, 'auc': 0.67835695, 'auc_precision_recall': 0.3113919, 'average_loss': 0.5236284, 'label/mean': 0.23622628, 'loss': 33.432137, 'precision': 0.1572327, 'prediction/mean': 0.25249344, 'recall': 0.00650026, 'global_step': 1018}


Similarly, we can define a `NumericColumn` for each continuous feature column
that we want to use in the model:

In [23]:
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

my_numeric_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]

fc.input_layer(feature_batch, my_numeric_columns).numpy()

array([[4.2000e+01, 0.0000e+00, 0.0000e+00, 9.0000e+00, 4.0000e+01],
       [5.7000e+01, 1.5024e+04, 0.0000e+00, 1.5000e+01, 3.5000e+01],
       [2.5000e+01, 0.0000e+00, 0.0000e+00, 1.3000e+01, 4.0000e+01],
       [2.9000e+01, 2.2020e+03, 0.0000e+00, 1.0000e+01, 5.0000e+01],
       [1.7000e+01, 0.0000e+00, 0.0000e+00, 7.0000e+00, 3.0000e+01],
       [4.6000e+01, 0.0000e+00, 0.0000e+00, 1.1000e+01, 4.0000e+01],
       [6.2000e+01, 0.0000e+00, 0.0000e+00, 1.3000e+01, 3.0000e+01],
       [3.7000e+01, 0.0000e+00, 0.0000e+00, 1.1000e+01, 4.8000e+01],
       [2.3000e+01, 0.0000e+00, 0.0000e+00, 9.0000e+00, 4.8000e+01],
       [5.6000e+01, 0.0000e+00, 0.0000e+00, 9.0000e+00, 9.0000e+01]],
      dtype=float32)

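Note that `input_layer` orders its output by column name rather than by list position, so the columns above are `age`, `capital_gain`, `capital_loss`, `education_num`, and `hours_per_week`.

You could retrain a model on these features by changing the `feature_columns` argument to the constructor: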

In [24]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

accuracy: 0.78177017
accuracy_baseline: 0.76377374
auc: 0.75969464
auc_precision_recall: 0.5255299
average_loss: 1.5091102
global_step: 1018
label/mean: 0.23622628
loss: 96.35225
precision: 0.5705344
prediction/mean: 0.28281045
recall: 0.30811232
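When reading these metrics, it helps to compare `accuracy` against `accuracy_baseline`. The reported baseline is consistent with always predicting the more frequent class, which can be checked from `label/mean`. The sketch below uses only numbers printed in the two evaluation outputs above; the helper name `majority_class_accuracy` is made up for illustration.

```python
def majority_class_accuracy(label_mean):
    """Accuracy of always predicting the more frequent of two classes."""
    return max(label_mean, 1.0 - label_mean)

label_mean = 0.23622628          # 'label/mean' from the evaluation output
age_only_accuracy = 0.7570788    # accuracy of the age-only model
numeric_accuracy = 0.78177017    # accuracy of the numeric-columns model

baseline = majority_class_accuracy(label_mean)
print('baseline:', baseline)
print('age-only beats baseline:', age_only_accuracy > baseline)
print('numeric columns beat baseline:', numeric_accuracy > baseline)
```

By this measure the age-only model does slightly worse than always guessing the majority class, while the model with all numeric columns does somewhat better.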


#### Categorical columns

To define a feature column for a categorical feature, create a `CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.

If you know the set of all possible feature values of a column, and there are only a few of them, use `categorical_column_with_vocabulary_list`. Each key in the list is assigned an auto-incremented ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0, `Not-in-family` to 1, and so on.

In [25]:
relationship = fc.categorical_column_with_vocabulary_list(
    'relationship',
    ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative'])

This creates a sparse one-hot vector from the raw input feature.

The `input_layer` function we're using is designed for DNN models and expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense step).

Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.

Run the input layer, configured with both the `age` and `relationship` columns:

In [26]:
fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])

<tf.Tensor: id=6800, shape=(10, 7), dtype=float32, numpy=
array([[42.,  0.,  1.,  0.,  0.,  0.,  0.],
       [57.,  0.,  0.,  1.,  0.,  0.,  0.],
       [25.,  1.,  0.,  0.,  0.,  0.,  0.],
       [29.,  0.,  1.,  0.,  0.,  0.,  0.],
       [17.,  0.,  0.,  0.,  1.,  0.,  0.],
       [46.,  1.,  0.,  0.,  0.,  0.,  0.],
       [62.,  0.,  1.,  0.,  0.,  0.,  0.],
       [37.,  0.,  0.,  1.,  0.,  0.,  0.],
       [23.,  0.,  0.,  0.,  0.,  1.,  0.],
       [56.,  1.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)>
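The first row of that output can be reproduced by hand. The sketch below is a plain-Python approximation of the vocabulary lookup plus indicator encoding, not the TensorFlow implementation; `one_hot` and `dense_row` are hypothetical helpers, and the relationship value for the row is inferred from the position of the 1 in the output.

```python
VOCABULARY = ['Husband', 'Not-in-family', 'Wife', 'Own-child',
              'Unmarried', 'Other-relative']

def one_hot(value, vocabulary=VOCABULARY):
    """Return a dense one-hot list; values outside the vocabulary give all zeros."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

def dense_row(age, relationship):
    """Concatenate the numeric age with the one-hot relationship vector."""
    return [float(age)] + one_hot(relationship)

row = dense_row(42, 'Not-in-family')
print(row)
```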

If we don't know the set of possible values in advance, use the `categorical_column_with_hash_bucket` instead:

In [27]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

Here, each possible value in the feature column `occupation` is hashed to an integer ID as it is encountered in training. The example batch has a few different occupations:

In [28]:
for item in feature_batch['occupation'].numpy():
    print(item.decode())

Transport-moving
Sales
Exec-managerial
Craft-repair
?
Exec-managerial
Farming-fishing
Adm-clerical
Sales
Transport-moving


If we run `input_layer` with the hashed column, we see that the output shape is `(batch_size, hash_bucket_size)`:

In [29]:
occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])

occupation_result.numpy().shape

(10, 1000)

It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension. Notice how any duplicate occupations are mapped to the same pseudo-random index:

In [30]:
tf.argmax(occupation_result, axis=1).numpy()

array([420, 631, 800, 466,  65, 800, 936,  96, 631, 420])
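The same behavior can be sketched with a standard-library hash. This is only an illustration: the helper `hash_bucket` and the choice of SHA-256 are assumptions, and TensorFlow uses its own hash function internally, so these indices are not expected to match the ones printed above. What carries over is that equal strings always land in the same bucket.

```python
import hashlib

def hash_bucket(value, hash_bucket_size=1000):
    """Map a string to a stable bucket index in [0, hash_bucket_size)."""
    digest = hashlib.sha256(value.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size

occupations = ['Transport-moving', 'Sales', 'Exec-managerial',
               'Sales', 'Transport-moving']
indices = [hash_bucket(name) for name in occupations]
print(indices)
```

Python's built-in `hash()` is avoided here because string hashing is randomized per process by default, which would make the bucket assignment unstable between runs.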

Note: Hash collisions are unavoidable, but often have minimal impact on model quality. The effect may be noticeable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions.

No matter how we choose to define a `CategoricalColumn`, each feature string is mapped to an integer ID, either by looking up a fixed mapping or by hashing. Under the hood, the `LinearModel` class is responsible for managing the mapping and creating a `tf.Variable` to store the model parameters (model *weights*) for each feature ID. The model parameters are learned through the model training process described later.
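A minimal sketch of that idea, assuming made-up weights: a linear model keeps one weight per feature ID, and the prediction for an example is the bias plus the weights of the IDs that are active in it. The dictionary layout and `linear_logit` are hypothetical; the bucket index 800 is the one observed above for `Exec-managerial`.

```python
import math

# Hypothetical learned parameters: one weight per (column, feature ID) pair.
weights = {
    ('relationship', 0): 0.9,    # 'Husband'
    ('relationship', 1): -0.4,   # 'Not-in-family'
    ('occupation', 800): 0.6,    # hash bucket observed for 'Exec-managerial'
}
bias = -1.0

def linear_logit(active_ids):
    """Sum the weights of the active feature IDs; unseen IDs contribute 0."""
    return bias + sum(weights.get(feature_id, 0.0) for feature_id in active_ids)

logit = linear_logit([('relationship', 0), ('occupation', 800)])
probability = 1.0 / (1.0 + math.exp(-logit))
print(logit, probability)
```

Because only the active IDs contribute, the model never needs to materialize the dense one-hot vector, which is why linear Estimators can work directly with sparse categorical columns.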

Let's apply the same approach to define the other categorical features:

In [31]:
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])


my_categorical_columns = [relationship, occupation, education, marital_status, workclass]

It's easy to use both sets of columns to configure a model that uses all these features:

In [32]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

accuracy: 0.82820463
accuracy_baseline: 0.76377374
auc: 0.8771789
auc_precision_recall: 0.66650975
average_loss: 0.71846026
global_step: 1018
label/mean: 0.23622628
loss: 45.871574
precision: 0.64460987
prediction/mean: 0.25261232
recall: 0.6079043


### Derived feature columns

#### Make Continuous Features Categorical through Bucketization

Sometimes the relationship between a continuous feature and the label is not linear. For example, consider *age* and *income*: a person's income may grow in the early stage of their career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw `age` as a real-valued feature column might not be a good choice because the model can only learn one of three cases:

1.  Income always increases at some rate as age grows (positive correlation),
2.  Income always decreases at some rate as age grows (negative correlation), or
3.  Income stays the same no matter at what age (no correlation).

If we want to learn the fine-grained correlation between income and each age group separately, we can leverage *bucketization*. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a `bucketized_column` over `age` as:

In [33]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

`boundaries` is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).

With bucketing, the model sees each bucket as a one-hot feature:

In [34]:
fc.input_layer(feature_batch, [age, age_buckets]).numpy()

array([[42.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [57.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [25.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [29.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [17.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [46.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [62.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [37.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [23.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [56.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]],
      dtype=float32)
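
The one-hot rows above can be reproduced without TensorFlow. The following is a minimal sketch, assuming each boundary opens a new bucket (so age 25 lands in the 25-29 bucket, as in the output above). The helper names `bucket_index`, `one_hot`, and `bucketize` are illustrative and are not part of the `tf.feature_column` API.

```python
import bisect

# Boundaries copied from the `age_buckets` definition above.
BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(value, boundaries=BOUNDARIES):
    """Return the bucket ID of `value`; each boundary opens a new bucket."""
    # bisect_right counts the boundaries <= value, so 25 maps to bucket 2.
    return bisect.bisect_right(boundaries, value)


def one_hot(index, depth):
    """Return a list of `depth` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def bucketize(value, boundaries=BOUNDARIES):
    """One-hot encode `value` over len(boundaries) + 1 buckets."""
    return one_hot(bucket_index(value, boundaries), len(boundaries) + 1)


for age_value in [42, 25, 17]:
    print(age_value, bucket_index(age_value), bucketize(age_value))
```

Because every bucket becomes its own one-hot feature, a linear model can learn a separate weight per age group, which is what lets it represent a rise followed by a decline.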

#### Learn complex relationships with crossed columns

Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning > 50,000 dollars) may be different for different occupations. Therefore, if we only learn one model weight each for `education="Bachelors"` and `education="Masters"`, we won't capture every education-occupation combination (e.g. distinguishing between `education="Bachelors" AND occupation="Exec-managerial"` and `education="Bachelors" AND occupation="Craft-repair"`).

To learn the differences between different feature combinations, we can add *crossed feature columns* to the model:

In [35]:
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

We can also create a `crossed_column` over more than two columns. Each constituent column can be a categorical base feature column (`CategoricalColumn`), a bucketized real-valued feature column, or even another `CrossedColumn`. For example:

In [36]:
age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

These crossed columns always use hash buckets to avoid an exponential explosion in the number of categories, and to put control over the number of model weights in the hands of the user.
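
To make the hashing idea concrete, here is a minimal sketch of how a feature combination could be mapped to one of `hash_bucket_size` buckets. The function `crossed_bucket`, the `_X_` separator, and the use of MD5 are all assumptions made for illustration; TensorFlow uses its own internal hash, so the bucket IDs printed here will not match what `crossed_column` produces.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Map a tuple of feature values to a bucket ID in [0, hash_bucket_size)."""
    # Join the values with a separator so ('a', 'bc') and ('ab', 'c') differ.
    key = '_X_'.join(str(v) for v in values).encode('utf-8')
    # MD5 is used only because it is deterministic across runs; it is an
    # assumption of this sketch, not the hash TensorFlow uses internally.
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % hash_bucket_size


pairs = [
    ('Bachelors', 'Exec-managerial'),
    ('Bachelors', 'Craft-repair'),
    ('Masters', 'Exec-managerial'),
]
for pair in pairs:
    print(pair, crossed_bucket(pair, hash_bucket_size=1000))
```

The model holds one weight per bucket, so the number of weights is fixed by `hash_bucket_size` no matter how many combinations occur. Combinations that hash to the same bucket share a weight.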

For a visual example of the effect of hash buckets with crossed columns, see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb).


## Define the logistic regression model

After processing the input data and defining all the feature columns, we can put them together and build a *logistic regression* model. The previous section showed several types of base and derived feature columns, including:

*   `CategoricalColumn`
*   `NumericColumn`
*   `BucketizedColumn`
*   `CrossedColumn`

All of these are subclasses of the abstract `FeatureColumn` class and can be added to the `feature_columns` argument of a model:

In [37]:
import tempfile

base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
]

model = tf.estimator.LinearClassifier(
    model_dir=tempfile.mkdtemp(),
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(learning_rate=0.1))

INFO:tensorflow:Using default config.


INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpw6k2m8r3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fab2841a400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


The model automatically learns a bias term, which controls the prediction made without observing any features. The learned model files are stored in `model_dir`.
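
The role of the bias term can be sketched in plain Python. The example below assumes the standard logistic regression form, where the probability is the sigmoid of the bias plus the weights of the active one-hot features. The feature names, weights, and bias are hypothetical values chosen for illustration, not values learned by the model above.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(active_features, weights, bias):
    """Probability of the positive class for a set of active one-hot features."""
    # Each active feature contributes its weight; unknown features contribute 0.
    logit = bias + sum(weights.get(name, 0.0) for name in active_features)
    return sigmoid(logit)


# Hypothetical weights and bias, chosen only for illustration.
weights = {'education=Masters': 0.8, 'occupation=Exec-managerial': 0.6}
bias = -1.2

print(predict_probability([], weights, bias))  # bias only
print(predict_probability(['education=Masters'], weights, bias))
```

With no active features, the prediction reduces to `sigmoid(bias)`, which is the sense in which the bias controls the prediction made without observing any features.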

## Train and evaluate the model

After adding all the features to the model, let's train the model. Training a model is just a single command using the `tf.estimator` API:

In [38]:
train_inpf = functools.partial(census_dataset.input_fn, train_file,
                               num_epochs=40, shuffle=True, batch_size=64)

model.train(train_inpf)

clear_output()  # used for notebook display

After the model is trained, evaluate the accuracy of the model by predicting the labels of the holdout data:

In [39]:
results = model.evaluate(test_inpf)

clear_output()

for key,value in sorted(results.items()):
  print('%s: %0.2f' % (key, value))

accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.64
precision: 0.69
prediction/mean: 0.24
recall: 0.55


The first line of the output should display something like: `accuracy: 0.84`, which means the accuracy is 84%. You can try using more features and transformations to see if you can do better!
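
To judge whether 84% is good, it helps to compare it with `accuracy_baseline`. The sketch below assumes that this metric is the accuracy obtained by always predicting the more frequent class, which can be derived from `label/mean`. The function name `majority_class_accuracy` is illustrative.

```python
def majority_class_accuracy(label_mean):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(label_mean, 1.0 - label_mean)


# `label/mean` as printed by the first evaluation run in this tutorial.
label_mean = 0.23622628
baseline = majority_class_accuracy(label_mean)

print('baseline: %0.4f' % baseline)
print('improvement over baseline: %0.4f' % (0.84 - baseline))
```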

Once the model is evaluated, we can use it to predict, from an individual's information, whether that individual earns more than 50,000 dollars a year.

Let's look in more detail at how the model performed:

In [40]:
import numpy as np

predict_df = test_df[:20].copy()

pred_iter = model.predict(
    lambda:easy_input_function(predict_df, label_key='income_bracket',
                               num_epochs=1, shuffle=False, batch_size=10))

classes = np.array(['<=50K', '>50K'])
pred_class_id = []

for pred_dict in pred_iter:
  pred_class_id.append(pred_dict['class_ids'])

predict_df['predicted_class'] = classes[np.array(pred_class_id)]
predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']

clear_output()

predict_df[['income_bracket','predicted_class', 'correct']]

| | income_bracket | predicted_class | correct |
|---|---|---|---|
| 0 | <=50K | <=50K | True |
| 1 | <=50K | <=50K | True |
| 2 | >50K | <=50K | False |
| 3 | >50K | <=50K | False |
| 4 | <=50K | <=50K | True |
| 5 | <=50K | <=50K | True |
| 6 | <=50K | <=50K | True |
| 7 | >50K | >50K | True |
| 8 | <=50K | <=50K | True |
| 9 | <=50K | <=50K | True |
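
The accuracy, precision, and recall reported by `evaluate` can be recomputed by hand for the ten rows shown above. This is a minimal sketch using the standard definitions, with `>50K` treated as the positive class. It covers only these ten rows, so the numbers differ from the full test-set metrics.

```python
# (income_bracket, predicted_class) pairs copied from the ten rows shown above.
rows = [
    ('<=50K', '<=50K'), ('<=50K', '<=50K'), ('>50K', '<=50K'),
    ('>50K', '<=50K'), ('<=50K', '<=50K'), ('<=50K', '<=50K'),
    ('<=50K', '<=50K'), ('>50K', '>50K'), ('<=50K', '<=50K'),
    ('<=50K', '<=50K'),
]

POSITIVE = '>50K'

# Count the four confusion-matrix cells.
tp = sum(1 for label, pred in rows if label == POSITIVE and pred == POSITIVE)
fp = sum(1 for label, pred in rows if label != POSITIVE and pred == POSITIVE)
fn = sum(1 for label, pred in rows if label == POSITIVE and pred != POSITIVE)
tn = len(rows) - tp - fp - fn

accuracy = (tp + tn) / len(rows)
precision = tp / (tp + fp) if tp + fp else float('nan')
recall = tp / (tp + fn) if tp + fn else float('nan')

print('accuracy: %0.2f, precision: %0.2f, recall: %0.2f'
      % (accuracy, precision, recall))
```

On this small sample the model never predicts `>50K` incorrectly, but it misses two of the three high earners. That is the same pattern as the full evaluation, where precision is higher than recall.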


For a working end-to-end example, download our [example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py) and set the `model_type` flag to `wide`.

## Adding Regularization to Prevent Overfitting

Regularization is a technique used to avoid overfitting. Overfitting happens when a model performs well on the data it is trained on, but worse on test data that the model has not seen before. Overfitting can occur when a model is excessively complex, such as having too many parameters relative to the number of training examples. Regularization allows you to control the model's complexity and make the model more generalizable to unseen data.
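
As a rough illustration of how regularization controls complexity, the sketch below adds the textbook L1 and L2 penalty terms to an unregularized loss value. The function `regularized_loss` and the sample numbers are hypothetical, and the exact scaling an optimizer applies to these strengths (FTRL included) may differ from this form.

```python
def regularized_loss(data_loss, weights, l1_strength=0.0, l2_strength=0.0):
    """Add textbook L1 and L2 penalties to an unregularized loss value."""
    # L1 penalizes the sum of absolute weights; L2 penalizes the sum of squares.
    l1_penalty = l1_strength * sum(abs(w) for w in weights)
    l2_penalty = l2_strength * sum(w * w for w in weights)
    return data_loss + l1_penalty + l2_penalty


# Hypothetical weights and loss, for illustration only.
weights = [0.5, -2.0, 0.0]
print(regularized_loss(1.0, weights))                   # no penalty
print(regularized_loss(1.0, weights, l1_strength=0.1))  # L1 only
print(regularized_loss(1.0, weights, l2_strength=0.1))  # L2 only
```

Larger weights raise the penalized loss, so the optimizer is pushed toward smaller weights unless they reduce the data loss enough to pay for themselves.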

You can add L1 and L2 regularization to the model with the following code:

In [41]:
model_l1 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=10.0,
        l2_regularization_strength=0.0))

model_l1.train(train_inpf)

results = model_l1.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))

In [None]:
model_l2 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.0,
        l2_regularization_strength=10.0))

model_l2.train(train_inpf)

results = model_l2.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))

These regularized models don't perform much better than the base model. Let's look at the models' weight distributions to better see the effect of the regularization:

In [None]:
def get_flat_weights(model):
  weight_names = [
      name for name in model.get_variable_names()
      if "linear_model" in name and "Ftrl" not in name]

  weight_values = [model.get_variable_value(name) for name in weight_names]

  weights_flat = np.concatenate([item.flatten() for item in weight_values], axis=0)

  return weights_flat

weights_flat = get_flat_weights(model)
weights_flat_l1 = get_flat_weights(model_l1)
weights_flat_l2 = get_flat_weights(model_l2)

The models have many zero-valued weights caused by unused hash bins (there are many more hash bins than categories in some columns). We can mask these weights when viewing the weight distributions:

In [None]:
weight_mask = weights_flat != 0

weights_base = weights_flat[weight_mask]
weights_l1 = weights_flat_l1[weight_mask]
weights_l2 = weights_flat_l2[weight_mask]

Now plot the distributions:

In [None]:
plt.figure()
_ = plt.hist(weights_base, bins=np.linspace(-3,3,30))
plt.title('Base Model')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l1, bins=np.linspace(-3,3,30))
plt.title('L1 - Regularization')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l2, bins=np.linspace(-3,3,30))
plt.title('L2 - Regularization')
_=plt.ylim([0,500])



Both types of regularization squeeze the distribution of weights towards zero. L2 regularization has a greater effect in the tails of the distribution, eliminating extreme weights. L1 regularization produces more exactly-zero values; in this case it sets ~200 weights to zero.
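
The different shapes of the two distributions can be understood from how each penalty shrinks a single weight. The sketch below uses the closed-form shrinkage steps for a generic L1 penalty (soft-thresholding) and a generic L2 penalty (multiplicative scaling). It is not the FTRL update itself, and the function names, weights, and strength of 0.5 are hypothetical.

```python
def l1_shrink(weight, threshold):
    """Soft-thresholding: move toward zero by `threshold`, clamping at zero."""
    if weight > threshold:
        return weight - threshold
    if weight < -threshold:
        return weight + threshold
    return 0.0


def l2_shrink(weight, strength):
    """Multiplicative shrinkage: scale the weight toward zero."""
    return weight / (1.0 + strength)


# Hypothetical weights, for illustration only.
weights = [-2.5, -0.3, 0.05, 0.4, 3.0]
l1_result = [l1_shrink(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 0.5) for w in weights]

print('L1:', l1_result)
print('L2:', l2_result)
```

In this sketch L1 sets every weight smaller in magnitude than the threshold to exactly zero and moves the rest by a fixed amount. L2 never produces an exact zero from a nonzero weight, but it pulls the largest weights the furthest.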