# Intro to Classifying Structured Data with TensorFlow
This notebook demonstrates classifying structured data. The code presented here can become a starting point for a problem you care about. Our goal is to introduce a variety of techniques (especially feature engineering) rather than to aim for high accuracy on the demo dataset we'll explore.

### Notes
* If you run this notebook multiple times, you'll want to restore it to a clean state. When you run the notebook, the Estimators will write logs and checkpoint files to disk. These will be in a *./graphs* directory in the same folder as this notebook. Delete this to restore to a clean state.


* We'll demonstrate two types of input functions. First, the pre-built Pandas input function, and second, one written using the new [Datasets API](https://www.tensorflow.org/programmers_guide/datasets). At the time of writing (v1.3) the Datasets API is in contrib. When it moves to core (most likely in v1.4) we'll update this notebook.
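
As a rough illustration of the clean-up step described in the first note, the sketch below deletes log and checkpoint directories if they exist. The helper name `reset_run_dirs` is hypothetical, and the directory names are assumptions based on the `model_dir` values used in this notebook.

```python
import shutil
from pathlib import Path


def reset_run_dirs(*dirs):
    """Delete the given log/checkpoint directories if they exist."""
    for d in dirs:
        # ignore_errors=True makes this a no-op when the directory is absent
        shutil.rmtree(Path(d), ignore_errors=True)


# Assumed directory names; adjust them if you change model_dir.
reset_run_dirs("graphs", "graphs_datasets")
```

Running this before the first training cell should give the Estimators an empty directory to write into.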

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections

import numpy as np
import pandas as pd

from IPython.display import Image

import tensorflow as tf
print('This code requires TensorFlow v1.3+')
print('You have:', tf.__version__)

This code requires TensorFlow v1.3+
You have: 2.5.0


### About the dataset

Here, we'll work with the [Adult dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/) from the 1994 US Census. Our task is to predict whether an individual has an income over $50,000 per year, based on attributes such as their age and occupation. This is a generic problem with a variety of numeric and categorical attributes, which makes it useful for demonstration purposes.

A great way to get to know the dataset is by using [Facets](https://github.com/pair-code/facets) - an open source tool for visualizing and exploring data. At the time of writing, the [online demo](https://pair-code.github.io/facets/) has the Census data preloaded. Try it! In the screenshot below, each dot represents a person, that is, a row from the CSV. They're colored by the label we want to predict ('blue' for less than 50k / year, 'red' for more). In the online demo, clicking on a person will show the attributes, or columns from the CSV file, that describe them - such as their age and occupation.

In [2]:
Image(filename='./images/facets1.jpg', width=500)

<IPython.core.display.Image object>

In [3]:
census_train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
census_test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
census_train_path = tf.keras.utils.get_file('census.train', census_train_url)
census_test_path = tf.keras.utils.get_file('census.test', census_test_url)

The dataset is missing a header, so we'll add one here. You can find descriptions of these columns in the [names file](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names).

In [4]:
column_names = [
  'age', 'workclass', 'fnlwgt', 'education', 'education-num',
  'marital-status', 'occupation', 'relationship', 'race', 'gender',
  'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
  'income'
]

### Load the data using Pandas

In the first half of this notebook, we'll assume the dataset fits into memory. Should you need to work with larger files, you can use the Datasets API to read them.

In [5]:
# Notes
# 1) We provide the header from above.
# 2) The test file has a line we want to discard at the top, so we include the parameter 'skiprows=1'
census_train = pd.read_csv(census_train_path, index_col=False, names=column_names) 
census_test = pd.read_csv(census_test_path, skiprows=1, index_col=False, names=column_names) 

# Drop any rows that have missing elements
# Of course there are other ways to handle missing data, but we'll
# take the simplest approach here.
census_train = census_train.dropna(how="any", axis=0)
census_test = census_test.dropna(how="any", axis=0)
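
One caveat worth knowing: the Adult files mark unknown values with a `?` placeholder rather than leaving the field empty, so `dropna` on its own may not remove those rows. The sketch below, run on three made-up rows written in the same ", "-separated style, shows one way to have Pandas treat the placeholder as missing. The sample rows and the shortened column list are assumptions for illustration only.

```python
import io

import pandas as pd

# Three made-up rows in the same ", "-separated style as adult.data.
sample_csv = io.StringIO(
    "39, State-gov, Bachelors\n"
    "54, ?, Some-college\n"
    "28, Private, ?\n"
)

sample = pd.read_csv(
    sample_csv,
    names=["age", "workclass", "education"],
    skipinitialspace=True,  # strip the space that follows each comma
    na_values="?",          # treat the '?' placeholder as missing
)

cleaned = sample.dropna(how="any", axis=0)
print(len(sample), "rows before,", len(cleaned), "rows after")
```

Note that `skipinitialspace=True` also changes the string values themselves, so any vocabulary list written with a leading space would need to be adjusted to match.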

### Correct formatting problems with the Census data
As it happens, there's a small formatting problem with the testing CSV file that we'll fix here. The labels in the testing file are written differently than they are in the training file. Notice the extra "." after "<=50K" and ">50K" in the screenshot below.

You can open the CSVs in your favorite text editor to see the error, or you can see it with Facets in "overview mode" - which makes it easy to catch this kind of mistake early.

In [6]:
Image(filename='./images/facets2.jpg', width=500)

<IPython.core.display.Image object>

In [7]:
# Separate the label we want to predict into its own object 
# At the same time, we'll convert it into true/false to fix the formatting error
census_train_label = census_train.pop('income').apply(lambda x: ">50K" in x)
census_test_label = census_test.pop('income').apply(lambda x: ">50K" in x)
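
To see why the substring test fixes the formatting difference, here is a small plain-Python sketch that applies the same check to label strings in both styles. The helper name `to_bool_label` is hypothetical; the strings follow the formats described above.

```python
# Label strings as they appear in the two files (note the leading space,
# and the trailing '.' in the test file).
train_style = [" <=50K", " >50K"]
test_style = [" <=50K.", " >50K."]


def to_bool_label(raw):
    """Return True for the over-50K class, whichever file the string came from."""
    return ">50K" in raw


print([to_bool_label(x) for x in train_style])  # [False, True]
print([to_bool_label(x) for x in test_style])   # [False, True]
```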

I find it useful to print out the shape of the data as I go, as a sanity check.

In [8]:
print("Training examples: %d" % census_train.shape[0])
print("Training labels: %d" % census_train_label.shape[0])
print()
print("Test examples: %d" % census_test.shape[0])
print("Test labels: %d" % census_test_label.shape[0])

Training examples: 32561
Training labels: 32561

Test examples: 16281
Test labels: 16281


Likewise, I like to see the head of each file, to help spot errors early on. First for the training examples...

In [9]:
census_train.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


... and now for the labels. Notice the label column is now true/false.

In [10]:
census_train_label.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True
9     True
Name: income, dtype: bool

In [11]:
# Likewise, you could do a spot check of the testing examples and labels.
# census_test.head()
# census_test_label.head()

# Estimators and Input Functions

[TensorFlow Estimators](https://www.tensorflow.org/get_started/estimator) provide a high-level API you can use to train your models. Here, we'll use Canned Estimators ("models-in-a-box"). These handle many implementation details for you, so you can focus on solving your problem (e.g., by coming up with informative features using the feature engineering techniques we introduce below). 

To learn more about Estimators, you can watch this talk from Google I/O by Martin Wicke: [Effective TensorFlow for Non-Experts](https://www.youtube.com/watch?v=5DknTFbcGVM). Here's a diagram of the methods we'll use here.

In [12]:
Image(filename='./images/estimators1.jpeg', width=400)

<IPython.core.display.Image object>

You can probably guess the purpose of methods like train, evaluate, and predict. What may be new to you, though, are [Input Functions](https://www.tensorflow.org/get_started/estimator#describe_the_training_input_pipeline). These are responsible for reading your data, preprocessing it, and sending it to the model. When you use an input function, your code will read *estimator.train(your_input_function)* rather than *estimator.train(your_training_data)*.

First, we'll use a [pre-built](https://www.tensorflow.org/get_started/input_fn) input function. This is useful for working with a Pandas dataset that you happen to already have in memory, as we do here. Next, we'll use the [Datasets API](https://www.tensorflow.org/programmers_guide/datasets) to write our own. The Datasets API will become the standard way of writing input functions moving forward. It's in contrib in TensorFlow v1.3, but will most likely move to core in v1.4.

### Input functions for training and testing data
Why do we need two input functions? There are a couple of differences in how we handle our training and testing data. We want the training input function to loop over the data indefinitely (returning batches of examples and labels when called). We want the testing input function to run for just one epoch, so we can make one prediction for each testing example. We'll also want to shuffle the training data, but not the testing data (so we can compare the predictions to the labels later).
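
The difference between the two behaviours can be sketched in plain Python. This is only a conceptual model, not how `pandas_input_fn` is implemented: the `batches` generator, its parameters, and the toy data are all assumptions made for illustration.

```python
import itertools
import random


def batches(examples, labels, batch_size, num_epochs=None, shuffle=False, seed=0):
    """Yield (examples, labels) batches; num_epochs=None repeats forever."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(range(len(examples)))
        if shuffle:
            rng.shuffle(order)  # a new order on every pass over the data
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield [examples[i] for i in idx], [labels[i] for i in idx]


ages = [39, 50, 38, 53, 28]                    # toy examples
over_50k = [False, True, False, True, False]   # toy labels

# Testing style: one epoch, original order, so predictions line up with labels.
test_batches = list(batches(ages, over_50k, batch_size=2, num_epochs=1))
print(test_batches)

# Training style: shuffled and endless, so we decide how many batches to take.
train_stream = batches(ages, over_50k, batch_size=2, num_epochs=None, shuffle=True)
first_four = list(itertools.islice(train_stream, 4))
print(first_four)
```

Because the training stream never ends, the number of training steps has to be given separately, which is what the `steps` argument to `estimator.train` does later in the notebook.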

In [30]:
def create_train_input_fn(): 
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
        x=census_train,
        y=census_train_label, 
        batch_size=32,
        num_epochs=None, # Repeat forever
        shuffle=True)

In [31]:
def create_test_input_fn():
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
        x=census_test,
        y=census_test_label, 
        num_epochs=1, # Just one epoch
        shuffle=False)  # Don't shuffle so we can compare to census_test_label later

See the bottom of the notebook for an example of doing this with the new Datasets API.

# Feature Engineering

Now we'll specify the features we'll use and how we'd like them represented. To do so, we'll use tf.feature_column. Basically, this module lets you represent a column from the CSV file in a variety of interesting ways. Our goal here is to demonstrate how to work with different types of features, rather than to aim for an accurate model. Here are five different types we'll use in our Linear model:

* A numeric_column. This is just a real-valued attribute.
* A bucketized_column. TensorFlow automatically buckets a numeric column for us.
* A categorical_column_with_vocabulary_list. This is just a categorical column, where you know the possible values in advance. This is useful when you have a small number of possibilities.
* A categorical_column_with_hash_bucket. This is a useful way to represent categorical features when you have a large number of values. Beware of hash collisions.
* A crossed_column. Linear models cannot consider interactions between features, so we'll ask TensorFlow to cross features for us.

In the Deep model, we'll also use:

* An embedding column(!). This automatically creates an embedding for categorical data.

You can learn more about feature columns in the [Large Scale Linear Models Tutorial](https://www.tensorflow.org/tutorials/linear#feature_columns_and_transformations), in the [Wide & Deep tutorial](https://www.tensorflow.org/tutorials/wide_and_deep#define_base_feature_columns), and in the [API doc](https://www.tensorflow.org/api_docs/python/tf/feature_column).

Following is a demo of a couple of the things you can do.

In [15]:
# A list of the feature columns we'll use to train the Linear model
feature_columns = []

In [16]:
# To start, we'll use the raw, numeric value of age.
age = tf.feature_column.numeric_column('age')
feature_columns.append(age)

Next, we'll add a bucketized column. Bucketing divides the data based on ranges, so the classifier can consider each independently. This is especially helpful to linear models. Here's what the buckets below look like for age, as seen using Facets.

In [17]:
Image(filename='./images/buckets.jpeg', width=400)

<IPython.core.display.Image object>

In [18]:
age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'), 
    boundaries=[31, 46, 60, 75, 90] # specify the ranges
)

feature_columns.append(age_buckets)
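
To make the bucketing concrete, here is a plain-Python sketch of the mapping from an age to a bucket index and its one-hot form. It assumes the left-inclusive, right-exclusive ranges documented for `bucketized_column`; the helper names are hypothetical and this is not TensorFlow's implementation.

```python
import bisect

boundaries = [31, 46, 60, 75, 90]  # same boundaries as the cell above


def bucket_index(value, boundaries):
    """Index of the half-open range [lower, upper) that contains value."""
    return bisect.bisect_right(boundaries, value)


def one_hot(index, size):
    """A list of zeros with a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


for age in (25, 31, 45, 60, 95):
    idx = bucket_index(age, boundaries)
    print(age, idx, one_hot(idx, len(boundaries) + 1))
```

Five boundaries produce six buckets, so the linear model gets one weight per age range instead of a single weight for the raw age.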

You can also generate evenly spaced boundaries, if you prefer not to list them by hand.

In [19]:
# age_buckets = tf.feature_column.bucketized_column(
#    tf.feature_column.numeric_column('age'),
#    boundaries=list(range(20, 100, 10))  # 20, 30, ..., 90
# )

In [20]:
# Here's a categorical column
# We're specifying the possible values. The raw CSV has a space after each
# comma and pd.read_csv keeps it by default, so each value starts with a space.
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        " Bachelors", " HS-grad", " 11th", " Masters", " 9th",
        " Some-college", " Assoc-acdm", " Assoc-voc", " 7th-8th",
        " Doctorate", " Prof-school", " 5th-6th", " 10th", " 1st-4th",
        " Preschool", " 12th"
    ])

feature_columns.append(education)

If you prefer not to specify the vocab in code, you can read it from a file or use a categorical_column_with_hash_bucket instead. Beware of hash collisions.

In [21]:
# A categorical feature with a possibly large number of values
# and the vocabulary not specified in advance.
native_country = tf.feature_column.categorical_column_with_hash_bucket('native-country', 1000)
feature_columns.append(native_country)
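
The idea behind a hash bucket, and why collisions happen, can be sketched in a few lines of plain Python. The hash function here (MD5 from `hashlib`) is only a stand-in chosen for illustration; TensorFlow uses its own fingerprint function, so the bucket ids will differ. The country strings are example values.

```python
import hashlib
from collections import Counter


def hash_bucket(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


countries = ["United-States", "Cuba", "Mexico", "India", "Canada", "Germany"]

# With plenty of buckets, collisions are unlikely but still possible.
print({c: hash_bucket(c, 1000) for c in countries})

# With fewer buckets than values, some values must share a bucket.
small = Counter(hash_bucket(c, 4) for c in countries)
print(small)
```

When two values land in the same bucket, the model cannot tell them apart, which is the trade-off for not having to list the vocabulary in advance.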

Now let's create a crossed column for age and education. Here's what this looks like.

In [22]:
Image(filename='./images/crossed.jpeg', width=400)

<IPython.core.display.Image object>

In [23]:
age_cross_education = tf.feature_column.crossed_column(
    [age_buckets, education],
    hash_bucket_size=int(1e4) # Using a hash is handy here
)
feature_columns.append(age_cross_education)
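
Conceptually, a crossed column turns each combination of values into a single categorical id. The sketch below shows that idea in plain Python, again with MD5 as a stand-in hash; the `cross_id` helper and the key format are assumptions, not TensorFlow's internals.

```python
import hashlib


def cross_id(age_bucket, education, hash_bucket_size=int(1e4)):
    """Hash the (age bucket, education) pair into a single feature id."""
    key = "{}_X_{}".format(age_bucket, education)
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % hash_bucket_size


# Each distinct pair gets its own id (up to collisions), so a linear model can
# learn a separate weight for, say, "age bucket 1 AND Bachelors".
print(cross_id(1, "Bachelors"))
print(cross_id(1, "HS-grad"))
print(cross_id(3, "Bachelors"))
```

Hashing is handy here because the number of possible pairs grows quickly, and most pairs never need to be enumerated by hand.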

## Train a Canned Linear Estimator

Note: logs and a checkpoint file will be written to *model_dir*. Delete this from disk before rerunning the notebook for a clean start.
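
If writing the first checkpoint fails because the path cannot be found, one thing that might help is to build *model_dir* with `os.path.join` and create it before training. This is a sketch under that assumption, not a confirmed fix; the helper name `ensure_model_dir` is hypothetical.

```python
import os


def ensure_model_dir(*parts):
    """Create the directory os.path.join(*parts) if needed and return its path."""
    path = os.path.join(*parts)  # uses the platform's own path separator
    os.makedirs(path, exist_ok=True)
    return path


linear_dir = ensure_model_dir("graphs", "linear")
print(linear_dir)
```

The returned path could then be passed to the Estimator as `model_dir=linear_dir`.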

In [32]:
train_input_fn = create_train_input_fn()
estimator = tf.estimator.LinearClassifier(feature_columns, model_dir='graphs/linear', n_classes=2)
estimator.train(train_input_fn, steps=1000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'graphs/linear', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into graphs/linear\model.ckpt.


NotFoundError: Failed to create a NewWriteableFile: graphs/linear\model.ckpt-0_temp\part-00000-of-00001.data-00000-of-00001.tempstate8354314742948008978 : The system cannot find the path specified.
; No such process
	 [[node save/SaveV2 (defined at C:\Users\elhatri\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py:1497) ]]

Errors may have originated from an input operation.
Input Source operations connected to node save/SaveV2:
 linear/linear_model/bias_weights/Read/ReadVariableOp (defined at C:\Users\elhatri\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\linear.py:1471)	
 training/Ftrl/linear/linear_model/age_bucketized/weights/accumulator/Read/ReadVariableOp (defined at C:\Users\elhatri\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\head\base_head.py:910)	
 global_step/Read/ReadVariableOp (defined at C:\Users\elhatri\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py:1079)	
 linear/linear_model/education/weights/Read/ReadVariableOp (defined at C:\Users\elhatri\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\linear.py:1462)



## Evaluate

In [29]:
test_input_fn = create_test_input_fn()
estimator.evaluate(test_input_fn)


### Predict

The Estimator's predict method returns a generator object. This bit of code demonstrates how to retrieve predictions for individual examples.

In [None]:
# reinitialize the input function
test_input_fn = create_test_input_fn()

predictions = estimator.predict(test_input_fn)
for i, prediction in enumerate(predictions):
    true_label = census_test_label.iloc[i]  # positional lookup, independent of the index
    predicted_label = prediction['class_ids'][0]
    # Uncomment the following line to see probabilities for individual classes
    # print(prediction)
    print("Example %d. Actual: %d, Predicted: %d" % (i, true_label, predicted_label))
    if i == 4:  # stop after the first five examples
        break

## What features can you use to achieve higher accuracy?
This dataset is imbalanced, so an accuracy of around 75% is *low* in this context (this could be achieved merely by predicting *everyone* makes less than 50k / year). In fact, if you look through the predictions closely, you'll find that many are zero. We'll get a little smarter as we go.
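
A quick way to put an accuracy figure in context is to compute what a model would score by always predicting the most common class. The sketch below does this on made-up labels; the helper name is hypothetical.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)


# Made-up labels: 3 of 12 people earn more than 50K.
toy_labels = [True] * 3 + [False] * 9
print(majority_baseline_accuracy(toy_labels))  # 0.75
```

The same function should also accept a boolean Pandas Series such as `census_test_label`, which gives the number a trained model needs to beat.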

## Train a Deep Model

### Add an embedding feature(!) and update the feature columns
Instead of using a hash to represent categorical features, here we'll use a learned embedding. (Cool, right?) We'll also update how the features are represented for our deep model. Here, we'll use a different combination of features than before, just for fun.
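
At its core, an embedding is a lookup table with one dense vector per category id. The NumPy sketch below shows only the lookup step, with random values standing in for weights that training would learn. The sizes match the occupation column defined below; the bucket ids are hypothetical.

```python
import numpy as np

num_buckets = 100    # hash buckets for 'occupation', as in the cell below
embedding_dim = 10   # length of each learned vector

rng = np.random.default_rng(0)
# One row per bucket. Training would update these values; here they are random.
embedding_table = rng.normal(size=(num_buckets, embedding_dim))

bucket_ids = np.array([3, 42, 3])    # hypothetical hashed ids for three examples
dense = embedding_table[bucket_ids]  # lookup: one dense vector per example

print(dense.shape)  # (3, 10)
```

Examples that share a bucket id get the same vector, and the network sees 10 numbers per example rather than a 100-wide one-hot vector.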

In [None]:
# We'll provide vocabulary lists for features with just a few terms
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass',
    [' Self-emp-not-inc', ' Private', ' State-gov', ' Federal-gov',
     ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'])

education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education',
    [' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th', ' Some-college',
     ' Assoc-acdm', ' Assoc-voc', ' 7th-8th', ' Doctorate', ' Prof-school',
     ' 5th-6th', ' 10th', ' 1st-4th', ' Preschool', ' 12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital-status',
    [' Married-civ-spouse', ' Divorced', ' Married-spouse-absent',
     ' Never-married', ' Separated', ' Married-AF-spouse', ' Widowed'])
     
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship',
    [' Husband', ' Not-in-family', ' Wife', ' Own-child', ' Unmarried',
     ' Other-relative'])

In [None]:
feature_columns = [

    # Use indicator columns for low dimensional vocabularies
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(marital_status),
    tf.feature_column.indicator_column(relationship),

    # Use embedding columns for high dimensional vocabularies
    tf.feature_column.embedding_column(  # now using embedding!
        # params are hash buckets, embedding size
        tf.feature_column.categorical_column_with_hash_bucket('occupation', 100), 10),
    
    # numeric features
    tf.feature_column.numeric_column('age'),
    tf.feature_column.numeric_column('education-num'),
    tf.feature_column.numeric_column('capital-gain'),
    tf.feature_column.numeric_column('capital-loss'),
    tf.feature_column.numeric_column('hours-per-week'),   
]

In [None]:
estimator = tf.estimator.DNNClassifier(hidden_units=[256, 128, 64], 
                                       feature_columns=feature_columns, 
                                       n_classes=2, 
                                       model_dir='graphs/dnn')

In [None]:
train_input_fn = create_train_input_fn()
estimator.train(train_input_fn, steps=2000)

In [None]:
test_input_fn = create_test_input_fn()
estimator.evaluate(test_input_fn)

That's a little better.

### TensorBoard
If you like, you can start TensorBoard by running this from a terminal (in the same directory as this notebook):

```$ tensorboard --logdir=graphs```

then pointing your web-browser to ```http://localhost:6006``` (check the TensorBoard output in the terminal in case it's running on a different port).

When that launches, you'll be able to see a variety of graphs that compare the linear and deep models.

In [None]:
Image(filename='./images/tensorboard.jpeg', width=500)

## Datasets API
Here, I'll demonstrate how to use the new [Datasets API](https://www.tensorflow.org/programmers_guide/datasets), which you can use to write complex input pipelines from simple, reusable pieces.

At the time of writing (v1.3) this API is in contrib. It's most likely moving into core in v1.4, which is good news. Using TensorFlow 1.4, the below can be written using *regular* Python code to parse the CSV file, via the *Dataset.from_generator()* method. This improves productivity a lot: you can use Python to read, parse, and apply whatever logic you wish to your input data, then take advantage of the reusable pieces of the Datasets API (e.g., batch, shuffle, repeat) as well as the optional performance tuning (e.g., prefetch, parallel processing).

In combination with Estimators, this means you can train and tune deep models at scale on data of almost any size, entirely using a high-level API. I'll update this notebook after v1.4 is released with an example. It's neat. 
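
To give a feel for the "regular Python" half of that idea, here is a sketch of a generator that parses CSV lines into (features, label) pairs. The shortened column list, the sample rows, and the `parse_rows` name are assumptions for illustration. The commented lines at the end show how such a generator might be handed to `tf.data.Dataset.from_generator`; they are left as comments so the sketch runs without TensorFlow.

```python
import csv
import io

COLUMNS = ["age", "workclass", "income"]  # shortened, hypothetical header


def parse_rows(lines):
    """Yield (features, label) pairs from an iterable of CSV lines."""
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        record = dict(zip(COLUMNS, row))
        # startswith handles both ">50K" and the test file's ">50K."
        label = record.pop("income").startswith(">50K")
        record["age"] = int(record["age"])
        yield record, label


sample = io.StringIO("39, State-gov, <=50K\n\n52, Self-emp-inc, >50K.\n")
for features, label in parse_rows(sample):
    print(features, label)

# A generator like this could then be wrapped for TensorFlow, for example:
# dataset = tf.data.Dataset.from_generator(
#     lambda: parse_rows(open(path)),
#     output_types=({"age": tf.int32, "workclass": tf.string}, tf.bool))
```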

In [None]:
# I'm going to reset the notebook to show you how to do this from a clean slate
%reset -f 

import collections
import tensorflow as tf

census_train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
census_test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
census_train_path = tf.keras.utils.get_file('census.train', census_train_url)
census_test_path = tf.keras.utils.get_file('census.test', census_test_url)

In [None]:
# Provide default values for each of the CSV columns
# and a header at the same time.
csv_defaults = collections.OrderedDict([
  ('age',[0]),
  ('workclass',['']),
  ('fnlwgt',[0]),
  ('education',['']),
  ('education-num',[0]),
  ('marital-status',['']),
  ('occupation',['']),
  ('relationship',['']),
  ('race',['']),
  ('sex',['']),
  ('capital-gain',[0]),
  ('capital-loss',[0]),
  ('hours-per-week',[0]),
  ('native-country',['']),
  ('income',['']),
])

In [None]:
# Decode a line from the CSV.
def csv_decoder(line):
    """Convert a CSV row to a dictionary of features."""
    parsed = tf.io.decode_csv(line, list(csv_defaults.values()))
    return dict(zip(csv_defaults.keys(), parsed))

# The train file has an extra empty line at the end.
# We'll use this method to filter that out.
def filter_empty_lines(line):
    return tf.not_equal(tf.strings.length(line), 0)

def create_train_input_fn(path):
    def input_fn():
        dataset = (
            tf.data.TextLineDataset(path)  # create a dataset from a file
                .filter(filter_empty_lines)  # ignore empty lines
                .map(csv_decoder)  # parse each row
                .shuffle(buffer_size=1000)  # shuffle the dataset
                .repeat()  # repeat indefinitely
                .batch(32))  # batch the data

        # create iterator
        columns = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()

        # separate the label and convert it to true/false
        income = tf.equal(columns.pop('income'), " >50K")
        return columns, income
    return input_fn

def create_test_input_fn(path):
    def input_fn():
        dataset = (
            tf.data.TextLineDataset(path)
                .skip(1)  # The test file has a strange first line, we want to ignore this.
                .filter(filter_empty_lines)
                .map(csv_decoder)
                .batch(32))

        # create iterator
        columns = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()

        # separate the label and convert it to true/false
        # (labels in the test file end with a '.', e.g. " >50K.")
        income = tf.equal(columns.pop('income'), " >50K.")
        return columns, income
    return input_fn

## Here's code you can use to test the Dataset input functions

In [None]:
# Build the input pipeline inside a graph so it can be run with a session.
with tf.Graph().as_default():
    train_input_fn = create_train_input_fn(census_train_path)
    next_batch = train_input_fn()

    with tf.compat.v1.Session() as sess:
        features, label = sess.run(next_batch)
        print(features['education'])
        print(label)

        print()

        features, label = sess.run(next_batch)
        print(features['education'])
        print(label)

From here, you can use the input functions to train and evaluate your Estimators. I'll add some minimal code to do this, just to show the mechanics.

In [None]:
train_input_fn = create_train_input_fn(census_train_path)
test_input_fn = create_test_input_fn(census_test_path)

feature_columns = [
    tf.feature_column.numeric_column('age'),
]

estimator = tf.estimator.DNNClassifier(hidden_units=[256, 128, 64],
                                       feature_columns=feature_columns,
                                       n_classes=2,
                                       # creating a new folder in case you haven't cleared
                                       # the old one yet
                                       model_dir='graphs_datasets/dnn')

estimator.train(train_input_fn, steps=100)
# Evaluate on the held-out test file rather than on the training data.
estimator.evaluate(test_input_fn, steps=100)

This would be a good time to clean up the logs and checkpoints on disk, by deleting ```./graphs``` and ```./graphs_datasets```.

## Next steps

## To learn more about feature engineering

Check out the [Wide and Deep tutorial](https://www.tensorflow.org/tutorials/wide_and_deep). Also, see that tutorial for another kind of Estimator you can try that combines the Linear and Deep models.

## To learn more about Datasets

Check out the [programmer's guide](https://www.tensorflow.org/programmers_guide/datasets), and check back after v1.4 is released for the Dataset.from_generator method, which I think will improve productivity a lot.