# Introduction to tf.estimator

In [None]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf

In [None]:
# Python version 3.5 or 3.6
assert sys.version_info >= (3, 5)
assert sys.version_info < (3, 7)
# TensorFlow 2.0 or later (compare the major version as a number, not as a string)
assert int(tf.__version__.split(".")[0]) >= 2

This end-to-end walkthrough trains a logistic regression model using the `tf.estimator` API. Logistic regression is often used as a baseline for other, more complex algorithms.

_NB : This notebook is drawn from one of the [TensorFlow Tutorials](https://www.tensorflow.org/alpha/tutorials/estimators/linear)_

# Input Data Management

You will use the Titanic dataset with the (rather morbid) goal of predicting passenger survival, given characteristics such as gender, age, class, etc.

## Download the data

In [None]:
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

## Explore the data

In [None]:
dftrain.head()

In [None]:
dftrain.describe()

# Feature Engineering for the Model

Estimators use a system called [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and *feature columns* describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw inputs in the original features `dict` (a *base feature column*), or a new column created using transformations defined over one or multiple base columns (a *derived feature column*).

The linear estimator uses both numeric and categorical features. Feature columns work with all TensorFlow estimators and their purpose is to define the features used for modeling. Additionally, they provide some feature engineering capabilities like one-hot-encoding, normalization, and bucketization.
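
To make the transformations mentioned above more concrete, here is a minimal pure-Python sketch of one-hot encoding and bucketization. It is not how TensorFlow implements feature columns; the helper names `one_hot` and `bucketize`, the vocabulary, and the age boundaries are assumptions chosen only for illustration.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list marking the position of `value` in `vocabulary`."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    With sorted boundaries [b0, b1, ...], bucket 0 is (-inf, b0),
    bucket 1 is [b0, b1), and so on.
    """
    return sum(1 for boundary in boundaries if value >= boundary)


# Hypothetical vocabulary and boundaries, not taken from the dataset.
sex_vocabulary = ["male", "female"]
age_boundaries = [18, 40, 65]

print(one_hot("female", sex_vocabulary))  # [0.0, 1.0]
print(bucketize(30.0, age_boundaries))    # 1
```

Either transformation turns a raw value into something a linear model can attach a weight to: one weight per vocabulary entry, or one weight per bucket.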

In [None]:
feature_columns = []

> <div class="mark">Use the function numeric_column to prepare the numeric columns</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/numeric_column

In [None]:
NUMERIC_COLUMNS = ['age', 'fare']

for feature_name in NUMERIC_COLUMNS:
    feature_columns.append( # TODO

> <div class="mark">Use the function categorical_column_with_vocabulary_list to prepare the categorical columns</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list

In [None]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']

for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique()
    feature_columns.append( # TODO

# Prepare input functions

The `input_function` specifies how data is converted to a `tf.data.Dataset` that feeds the input pipeline in a streaming fashion. A `tf.data.Dataset` can take in multiple sources such as a dataframe, a csv-formatted file, and more.

In [None]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            ds = ds.shuffle(1000)
        ds = ds.batch(batch_size).repeat(num_epochs)
        return ds
    return input_function

In [None]:
train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)
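
The order of operations in `make_input_fn` (shuffle, then batch, then repeat) determines what the batches look like. The following pure-Python sketch mimics that order on a toy list of 10 examples. It is only an approximation: the generator name `batches` is hypothetical, and it assumes the whole dataset fits in the shuffle buffer, so that each pass is fully reshuffled.

```python
import random


def batches(examples, batch_size, num_epochs, shuffle=True, seed=0):
    """Yield batches using the same order of operations as `make_input_fn`:
    shuffle each pass, cut it into batches, and repeat for `num_epochs` passes."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        epoch = list(examples)
        if shuffle:
            rng.shuffle(epoch)
        for start in range(0, len(epoch), batch_size):
            yield epoch[start:start + batch_size]


sizes = [len(batch) for batch in batches(range(10), batch_size=4, num_epochs=2)]
print(sizes)  # [4, 4, 2, 4, 4, 2]
```

Because batching happens before repeating, the last batch of each pass can be smaller than `batch_size`, and batches never mix examples from two different passes.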

You can inspect the dataset:

In [None]:
ds = make_input_fn(dftrain, y_train, batch_size=10)()
for feature_batch, label_batch in ds.take(1):
    print('Some feature keys:', list(feature_batch.keys()))
    print()
    print('A batch of class:', feature_batch['class'].numpy())
    print()
    print('A batch of Labels:', label_batch.numpy())

# Train the estimator

After adding all the base features to the model, let's train the model. Training a model is just a single command using the `tf.estimator` API. 
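
Before training, it may help to see what a logistic regression model computes at prediction time: a weighted sum of the input features passed through a sigmoid. The sketch below is plain Python, not the estimator's implementation, and the weights, bias, and feature names are hypothetical values rather than anything learned from the Titanic data.

```python
import math


def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    logit = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical weights, not values learned from the Titanic data.
weights = {"sex=female": 2.0, "age": -0.02}
bias = -1.0
passenger = {"sex=female": 1.0, "age": 30.0}

p = predict_probability(passenger, weights, bias)
print(round(p, 3))  # 0.599
```

Training consists of adjusting the weights and the bias so that these probabilities match the observed labels as closely as possible.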

We will use the `LinearClassifier` estimator.

> <div class="mark">Instantiate the Estimator</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier

In [None]:
linear_est = # TODO

> <div class="mark">Train the Estimator</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier#train

In [None]:
linear_est. # TODO

> <div class="mark">Evaluate the results on the test set</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier#evaluate

In [None]:
result = linear_est. # TODO

print(result)

# Add Derived Feature Columns

Using each base feature column separately may not be enough to explain the data. For example, the correlation between age and the label may differ between genders. Therefore, if you only learn a single model weight for `gender="Male"` and `gender="Female"`, you won't capture every age-gender combination (e.g. distinguishing between `gender="Male"` AND `age="30"` and `gender="Male"` AND `age="40"`).

To learn the differences between different feature combinations, you can add *crossed feature columns* to the model, using `crossed_column`.
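
Conceptually, a crossed column maps each combination of feature values to one of a fixed number of hash buckets, and the model learns one weight per bucket. The sketch below illustrates the idea in plain Python. The helper `crossed_bucket` is hypothetical and uses MD5 as a stand-in hash; TensorFlow uses its own fingerprint function, so the actual bucket ids will differ.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a bucket id in [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()  # stand-in hash, not TensorFlow's
    return int(digest, 16) % hash_bucket_size


a = crossed_bucket([30.0, "male"], 100)
b = crossed_bucket([40.0, "male"], 100)
print(a, b)
```

Since the number of buckets is fixed, two different combinations can land in the same bucket and share a weight; a larger bucket size makes such collisions less likely at the cost of more parameters.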

> <div class="mark">Create a Crossed Column between age and sex with a bucket size of 100 </div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/crossed_column

In [None]:
age_x_gender = # TODO

Concatenate all columns

In [None]:
derived_feature_columns = [age_x_gender]
all_columns = feature_columns + derived_feature_columns

> <div class="mark">Instantiate, train and evaluate the estimator using all created columns</div><i class="fa fa-lightbulb-o "></i>

Documentation : 
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier#train
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/LinearClassifier#evaluate

In [None]:
# Create Linear Estimator
linear_est = # TODO

# Train the estimator
linear_est. # TODO

# Evaluate the estimator
result = linear_est. # TODO

print(result)

The results should be slightly better than when training on the base features only. You can try using more features and transformations to see if you can do better!

Now you can use the trained model to make predictions on passengers from the evaluation set. TensorFlow models are optimized to make predictions on a batch, or collection, of examples at once. Earlier, the `eval_input_fn` was defined using the entire evaluation set.

In [None]:
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities')
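
Predicted probabilities like these can be turned into class predictions by applying a threshold, and then compared with the true labels. The sketch below shows the idea in plain Python. The `accuracy` helper, the 0.5 threshold, and the example probabilities and labels are assumptions made for illustration, not outputs of the model above.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


# Hypothetical probabilities and labels, not model outputs.
example_probs = [0.1, 0.8, 0.4, 0.9]
example_labels = [0, 1, 1, 1]

print(accuracy(example_probs, example_labels))  # 0.75
```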

# Bonus

> <div class="mark">Try other kinds of estimators, for example the DNNLinearCombinedClassifier to perform wide & deep learning</div><i class="fa fa-lightbulb-o "></i>

Documentation : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/DNNLinearCombinedClassifier