# Build a linear model with Estimators

- This tutorial uses the `tf.estimator` API in TensorFlow to solve a benchmark binary classification problem. Estimators are TensorFlow's most scalable and production-oriented model type. For more information see the [Estimator guide](https://www.tensorflow.org/guide/estimators).

## Overview

- Using census data that contains a person's age, education, marital status, and occupation (the *features*), we will try to predict whether or not the person earns more than 50,000 dollars a year (the target *label*). We will train a *logistic regression* model that, given an individual's information, outputs a number between 0 and 1. This number can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.
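- To make the "number between 0 and 1" concrete, here is a minimal sketch of how a logistic regression model turns a weighted sum of features into a probability. The weights and bias are made-up values chosen for illustration, not parameters learned from the census data, and the helper names are hypothetical:

```python
import math


def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    # Branch on the sign of z so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(features, weights, bias):
    """Logistic regression: a weighted sum of features passed through a sigmoid."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical parameters, for illustration only.
weights = {"age": 0.03, "education_num": 0.25, "hours_per_week": 0.02}
bias = -5.0

person = {"age": 39, "education_num": 13, "hours_per_week": 40}
p = predict_probability(person, weights, bias)
print("Estimated probability of income > 50K:", round(p, 3))
```

- A score of zero maps to a probability of exactly 0.5. Larger positive scores approach 1 and larger negative scores approach 0.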

- **Key Point**: As a modeler and developer, think about how the data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).

### Setup

- Import TensorFlow, feature column support, and supporting modules:

In [2]:
import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

- And let's enable [eager execution](https://www.tensorflow.org/guide/eager) to inspect this program as we run it:

In [3]:
tf.enable_eager_execution()

### Download the official implementation

- We'll use the [wide and deep model](https://github.com/tensorflow/models/tree/master/official/wide_deep/) available in TensorFlow's [model repository](https://github.com/tensorflow/models/). Download the code, add the root directory to your Python path, and jump to the `wide_deep` directory:

In [3]:
!pip install -q requests
!git clone --depth 1 https://github.com/tensorflow/models

Cloning into 'models'...
remote: Enumerating objects: 3025, done.
remote: Counting objects: 100% (3025/3025), done.
remote: Compressing objects: 100% (2544/2544), done.
remote: Total 3025 (delta 534), reused 2097 (delta 404), pack-reused 0
Receiving objects: 100% (3025/3025), 370.36 MiB | 5.99 MiB/s, done.
Resolving deltas: 100% (534/534), done.
Checking out files: 100% (2859/2859), done.


- Add the root directory of the repository to your Python path:

In [4]:
models_path = os.path.join(os.getcwd(), 'models')
sys.path.append(models_path)

- Download the dataset:

In [5]:
from official.wide_deep import census_dataset
from official.wide_deep import census_main

census_dataset.download("/tmp/census_data/")

### Command line usage

- The repo includes a complete program for experimenting with this type of model.

- To execute the tutorial code from the command line, first add the path to tensorflow/models to your `PYTHONPATH`.

In [None]:
# From a shell you would run: export PYTHONPATH=${PYTHONPATH}:"$(pwd)/models"
# When running from Python, set `os.environ` instead; otherwise subprocesses will not see the directory.

if "PYTHONPATH" in os.environ:
    os.environ['PYTHONPATH'] += os.pathsep + models_path
else:
    os.environ['PYTHONPATH'] = models_path

- Use `--help` to see what command line options are available:

In [7]:
!python -m official.wide_deep.census_main --help

/usr/bin/python3: Error while finding module specification for 'official.wide_deep.census_main' (ModuleNotFoundError: No module named 'official')


- Now run the model:

In [None]:
!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2

### Read the U.S. Census data

- This example uses the [U.S. Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income) from 1994 and 1995. We have provided the [census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py) script to download the data and perform a little cleanup.

- Since the task is a *binary classification problem*, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise. For reference, see the `input_fn` in [census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).
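- As a rough sketch of that labeling rule in plain Python (the helper name `to_binary_label` and the sample bracket values are hypothetical, chosen for illustration; the official `input_fn` performs the equivalent comparison with TensorFlow ops):

```python
def to_binary_label(income_bracket):
    """Return 1 for the '>50K' bracket and 0 for anything else."""
    # strip() guards against stray whitespace around the raw CSV field.
    return 1 if income_bracket.strip() == ">50K" else 0


# Illustrative bracket strings in the format the dataset uses.
brackets = ["<=50K", "<=50K", ">50K", "<=50K"]
labels = [to_binary_label(b) for b in brackets]

# Fraction of positive examples in this tiny sample.
positive_rate = sum(labels) / len(labels)
print(labels, positive_rate)
```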

- Let's look at the data to see which columns we can use to predict the target label:

In [8]:
!ls /tmp/census_data/

adult.data  adult.test


In [9]:
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

- [pandas](https://pandas.pydata.org/) provides some convenient utilities for data analysis. Here's a list of columns available in the Census Income dataset:

In [10]:
import pandas

train_df = pandas.read_csv(train_file, header=None, names=census_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header=None, names=census_dataset._CSV_COLUMNS)

train_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


- The columns are grouped into two types: *categorical* and *continuous* columns:

    - A column is called *categorical* if its value can only be one of the categories in a finite set. For example, the relationship status of a person (wife, husband, unmarried, etc.) or the education level (high school, college, etc.) are categorical columns.
    
    - A column is called *continuous* if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
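- One quick way to get a first guess at this grouping is to look at the Python type of each value in a sample row. The sketch below uses a few fields from the first row shown above. The helper `split_columns` is hypothetical, and the type check is only a heuristic: a numerically coded column can still be categorical in meaning.

```python
def split_columns(row):
    """Partition column names into (categorical, continuous) by value type."""
    categorical, continuous = [], []
    for name, value in row.items():
        # Treat ints and floats as continuous; bool is excluded because
        # Python's bool is a subclass of int.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            continuous.append(name)
        else:
            categorical.append(name)
    return categorical, continuous


# A subset of the first training row.
row = {
    "age": 39,
    "workclass": "State-gov",
    "education": "Bachelors",
    "relationship": "Not-in-family",
    "capital_gain": 2174,
    "hours_per_week": 40,
}

categorical, continuous = split_columns(row)
print("categorical:", categorical)
print("continuous:", continuous)
```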

### Converting Data into Tensors

- When building a `tf.estimator` model, the input data is specified by using an *input function* (or `input_fn`). This builder function returns a `tf.data.Dataset` of batches of `(features-dict, label)` pairs. It is not called until it is passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.

- The input builder function returns the following pair:

    1. `features`: A dict from feature names to `Tensors` or `SparseTensors` containing batches of features.
    2. `labels`: A `Tensor` containing batches of labels.
    
- The keys of the `features` are used to configure the model's input layer.

- **Note**: The input function is called while constructing the TensorFlow graph, *not* while running the graph. It returns a representation of the input data as a sequence of TensorFlow graph operations.
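- The shape of that `(features, labels)` pair, and the fact that a builder function does nothing until it is called, can be illustrated without TensorFlow. The following is a plain-Python analogy under stated assumptions, not the `tf.data` API: `batched_pairs` is a hypothetical helper, lists stand in for tensors, and the rows are abbreviated from the table above.

```python
import functools


def batched_pairs(rows, label_key, batch_size):
    """Yield (features, labels) batches; features maps column name -> list of values."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One list per feature column, excluding the label column.
        features = {
            key: [r[key] for r in chunk] for key in chunk[0] if key != label_key
        }
        labels = [r[label_key] for r in chunk]
        yield features, labels


rows = [
    {"age": 39, "education": "Bachelors", "income_bracket": "<=50K"},
    {"age": 50, "education": "Bachelors", "income_bracket": "<=50K"},
    {"age": 38, "education": "HS-grad", "income_bracket": "<=50K"},
]

# Binding the arguments does not read any data yet; like an input_fn,
# the builder only runs when something calls it.
train_input_fn = functools.partial(batched_pairs, rows, "income_bracket", 2)

batches = list(train_input_fn())
for features, labels in batches:
    print(sorted(features.keys()), features["age"], labels)
```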

- For small problems like this, it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:

In [16]:
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
    label = df[label_key]
    ds = tf.data.Dataset.from_tensor_slices((dict(df), label))
    
    if shuffle:
        ds = ds.shuffle(10000)
        
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    return ds

- Since we have eager execution enabled, it's easy to inspect the resulting dataset:

In [26]:
ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
    print('Some feature keys:', list(feature_batch.keys())[:5])
    print()
    print('A batch of Ages :', feature_batch['age'])
    print()
    print('A batch of Labels:', label_batch)

Some feature keys: ['age', 'workclass', 'fnlwgt', 'education', 'education_num']

A batch of Ages : tf.Tensor([20 37 49 23 33 47 32 44 41 44], shape=(10,), dtype=int32)

A batch of Labels: tf.Tensor(
[b'<=50K' b'>50K' b'>50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K'
 b'<=50K' b'<=50K'], shape=(10,), dtype=string)


- But this approach has severely limited scalability. Larger datasets should be streamed from disk. The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`:

In [28]:
import inspect
print(inspect.getsource(census_dataset.input_fn))

def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = 