In [208]:
import tensorflow as tf
tf.__version__

'1.5.0'

### Load the data
    kaggle competitions download -c titanic

In [161]:
import pandas as pd
train_csv = pd.read_csv("/Users/mzielinski/.kaggle/competitions/titanic/train.csv",
                        names = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
                        skipinitialspace=True, low_memory=False, 
                        skiprows=1, na_values=[])

train_csv.fillna({'PassengerId': '', 'Survived': 0, 'Pclass': 0, 'Name': '', 'Sex': '', 'Age': 0, 'SibSp': 0,
       'Parch': 0, 'Ticket': '', 'Fare': 0, 'Cabin': '', 'Embarked': ''}, inplace=True)

In [162]:
train_csv

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [163]:
labels = train_csv["Survived"]

In [164]:
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
features = train_csv[feature_cols]

### Define the feature columns
A feature column is an object describing how the model should use raw input data from the features dictionary. When you build an Estimator model, you pass it a list of feature columns that describes each of the features you want the model to use. The `tf.feature_column` module provides many options for representing data to the model.

To create feature columns, call functions from the `tf.feature_column` module. As the following figure shows, all nine functions return either a Categorical-Column or a Dense-Column object, except `bucketized_column`, which inherits from both classes:

<img src="https://www.tensorflow.org/images/feature_columns/some_constructors.jpg" width=400>

Although `tf.feature_column.numeric_column` provides optional arguments, calling it with only the feature key is a fine way to specify a numerical value with the default data type (`tf.float32`) as input to your model. The cell below additionally sets `default_value` to the mean of each column:

In [165]:
# Numeric columns
numeric_age = tf.feature_column.numeric_column("Age", default_value=features["Age"].mean())
numeric_fare = tf.feature_column.numeric_column("Fare", default_value=features["Fare"].mean())

numeric_features = [numeric_age, numeric_fare]
numeric_features

[_NumericColumn(key='Age', shape=(1,), default_value=(23.79929292929293,), dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Fare', shape=(1,), default_value=(32.204207968574636,), dtype=tf.float32, normalizer_fn=None)]

Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. To do so, create a **bucketized column**. For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:
<img src="https://www.tensorflow.org/images/feature_columns/bucketized_column.jpg" width=300>

In [166]:
bucketized_age = tf.feature_column.bucketized_column(numeric_age, [0, 10, 20, 30, 40, 50, 60, 100])
one_hot_age = tf.feature_column.indicator_column(bucketized_age)
one_hot_age

_IndicatorColumn(categorical_column=_BucketizedColumn(source_column=_NumericColumn(key='Age', shape=(1,), default_value=(23.79929292929293,), dtype=tf.float32, normalizer_fn=None), boundaries=(0, 10, 20, 30, 40, 50, 60, 100)))
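
To make the bucketing concrete, here is a minimal pure-Python sketch of what the bucketized and indicator columns above compute for a single age. It does not call TensorFlow: the helper names `bucket_index` and `one_hot` are made up for illustration, and the convention that a value equal to a boundary falls into the upper bucket is assumed to match `bucketized_column`.

```python
import bisect

# Same boundaries as the `bucketized_age` column above.
AGE_BOUNDARIES = [0, 10, 20, 30, 40, 50, 60, 100]

def bucket_index(value, boundaries):
    """Return the bucket a value falls into; n boundaries give n + 1 buckets."""
    return bisect.bisect_right(boundaries, value)

def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]

num_buckets = len(AGE_BOUNDARIES) + 1
age_bucket = bucket_index(22.0, AGE_BOUNDARIES)
age_vector = one_hot(age_bucket, num_buckets)
print(age_bucket, age_vector)
```

Eight boundaries produce nine buckets, because values below the first boundary and at or above the last one each get a bucket of their own.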

**Categorical identity** columns can be seen as a special case of bucketized columns. In traditional bucketized columns, each bucket represents a range of values (for example, from 1960 to 1979). In a categorical identity column, each bucket represents a single, unique integer. For example, let's say you want to represent the integer range [0, 4). That is, you want to represent the integers 0, 1, 2, or 3. In this case, the categorical identity mapping looks like this:

<img src="https://www.tensorflow.org/images/feature_columns/categorical_column_with_identity.jpg" width=300>

In [167]:
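categorical_identity_cols = ["Pclass", "SibSp", "Parch"]
categorical_identity_features = [
    tf.feature_column.categorical_column_with_identity(
        key,
        # The bucket count is derived from the number of distinct values.
        # Any value outside [0, num_buckets) is replaced by default_value.
        num_buckets=len(features[key].unique()) + 1,
        default_value=0
    ) for key in categorical_identity_cols]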
one_hot_identity_features = [
    tf.feature_column.indicator_column(key) for key in categorical_identity_features]
one_hot_identity_features

[_IndicatorColumn(categorical_column=_IdentityCategoricalColumn(key='Pclass', num_buckets=4, default_value=0)),
 _IndicatorColumn(categorical_column=_IdentityCategoricalColumn(key='SibSp', num_buckets=8, default_value=0)),
 _IndicatorColumn(categorical_column=_IdentityCategoricalColumn(key='Parch', num_buckets=8, default_value=0))]
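
The following pure-Python sketch mimics the identity mapping for one value at a time. It is an illustration only: `identity_one_hot` is a hypothetical helper, and the fallback to `default_value` for out-of-range inputs is assumed to mirror how `categorical_column_with_identity` behaves when a default is set.

```python
def identity_one_hot(value, num_buckets, default_value=0):
    """One-hot encode an integer, using `default_value` when it is out of range."""
    if not 0 <= value < num_buckets:
        value = default_value
    return [1 if i == value else 0 for i in range(num_buckets)]

# Pclass 3 with 4 buckets lands in the last bucket.
pclass_vector = identity_one_hot(3, num_buckets=4)

# An 8 does not fit into [0, 8), so it is encoded as the default bucket 0.
sibsp_vector = identity_one_hot(8, num_buckets=8)

print(pclass_vector)
print(sibsp_vector)
```

Because the number of buckets above is derived from the count of distinct values rather than from the largest value, a column with gaps in its values can silently send its largest entries to the default bucket.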

We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. **Categorical vocabulary** columns provide a good way to represent strings as a one-hot vector. For example:
<img src="https://www.tensorflow.org/images/feature_columns/categorical_column_with_vocabulary.jpg" width=300>

In [168]:
categorical_dictionary_cols = ["Sex", "Embarked"]
categorical_dictionary_features = [
    tf.feature_column.categorical_column_with_vocabulary_list(
        key,
        list(features[key].dropna().unique())
    ) for key in categorical_dictionary_cols] 
one_hot_dictionary_features = [
    tf.feature_column.indicator_column(key) for key in categorical_dictionary_features]
one_hot_dictionary_features

[_IndicatorColumn(categorical_column=_VocabularyListCategoricalColumn(key='Sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 _IndicatorColumn(categorical_column=_VocabularyListCategoricalColumn(key='Embarked', vocabulary_list=('S', 'C', 'Q', ''), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
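
As a rough sketch of the vocabulary mapping, the pure-Python helper below turns a string into a one-hot vector by its position in the vocabulary. `vocabulary_one_hot` is a hypothetical name, and the all-zeros result for unknown strings assumes that the indicator column ignores the out-of-vocabulary id (`default_value=-1`).

```python
def vocabulary_one_hot(value, vocabulary):
    """One-hot encode a string by its position in `vocabulary`.

    Strings that are not in the vocabulary produce a vector of zeros.
    """
    return [1 if entry == value else 0 for entry in vocabulary]

# Same vocabulary as the 'Embarked' column above; '' stands for a missing value.
EMBARKED_VOCABULARY = ["S", "C", "Q", ""]

known_vector = vocabulary_one_hot("C", EMBARKED_VOCABULARY)
unknown_vector = vocabulary_one_hot("X", EMBARKED_VOCABULARY)
print(known_vector)
print(unknown_vector)
```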

Combining features into a single feature, better known as **feature crosses**, enables the model to learn separate weights for each combination of features.

Additionally, instead of representing the data as a one-hot vector of many dimensions, an **embedding column** represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.

In [169]:
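# Cross the age bucket with sex and hash each combination into 5000 buckets.
categorical_crossed_age_sex = tf.feature_column.crossed_column(
    [bucketized_age, categorical_dictionary_features[0]],
    hash_bucket_size=5000)
# Represent each hashed combination as a trainable 9-dimensional vector.
categorical_embedding_crossed_age_sex = tf.feature_column.embedding_column(
    categorical_crossed_age_sex, dimension=9)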
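
Conceptually, the crossed column hashes each (age bucket, sex) combination into one of 5000 ids, and the embedding column looks up a 9-dimensional row for that id. The sketch below imitates both steps in plain Python under loud assumptions: MD5 is only a stand-in for TensorFlow's own fingerprint function, `crossed_bucket` is a hypothetical helper, and the random table stands in for weights that the model would actually learn.

```python
import hashlib
import random

HASH_BUCKET_SIZE = 5000   # same as the crossed column above
EMBEDDING_DIMENSION = 9   # same as the embedding column above

def crossed_bucket(age_bucket, sex, hash_bucket_size=HASH_BUCKET_SIZE):
    """Hash an (age bucket, sex) pair into one of `hash_bucket_size` ids."""
    key = "{}_X_{}".format(age_bucket, sex).encode("utf-8")
    # MD5 is a stand-in here; TensorFlow uses a different hash internally.
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size

# Stand-in for the trainable embedding matrix: one row per hash bucket.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIMENSION)]
    for _ in range(HASH_BUCKET_SIZE)
]

bucket_id = crossed_bucket(3, "male")
embedding_vector = embedding_table[bucket_id]
print(bucket_id, len(embedding_vector))
```

With only nine age buckets and two sexes there are far fewer combinations than hash buckets, so collisions are unlikely, though hashing never rules them out.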

In [170]:
categorical_features = one_hot_identity_features + one_hot_dictionary_features + [one_hot_age, categorical_embedding_crossed_age_sex]
all_features = numeric_features + categorical_features

### Define the input function

The `tf.data` module contains a collection of classes that allow you to easily load data, manipulate it, and pipe it into your model. Taking slices from an array is the simplest way to get started with `tf.data`.

<img src="https://www.tensorflow.org/images/feature_columns/inputs_to_model_bridge.jpg" width=600>

In [171]:
def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)

* The `shuffle` method uses a fixed-size buffer to shuffle the items as they pass through. In this case the `buffer_size` (1000) is greater than the number of examples in the Dataset, ensuring that the data is completely shuffled.

* The `repeat` method restarts the Dataset when it reaches the end. To limit the number of epochs, set the count argument.

* The `batch` method collects a number of examples and stacks them, to create batches. This adds a dimension to their shape. The new dimension is added as the first dimension.
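
The three steps can be imitated with plain Python generators. This is a simplified sketch and not how `tf.data` is implemented: the functions `shuffle`, `repeat`, and `batch` below are hypothetical stand-ins that work on lists instead of tensors, and the shuffle-buffer logic is an approximation of the behaviour described above.

```python
import itertools
import random

def shuffle(items, buffer_size, rng):
    """Yield items in random order using a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]            # emit a random buffered item
            buffer[idx] = item           # and put the new item in its place
    rng.shuffle(buffer)                  # drain what is left in random order
    for item in buffer:
        yield item

def repeat(make_epoch, count=None):
    """Restart the data `count` times, or forever when `count` is None."""
    epochs = itertools.count() if count is None else range(count)
    for _ in epochs:
        for item in make_epoch():
            yield item

def batch(items, batch_size):
    """Group consecutive items into lists of at most `batch_size` elements."""
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

rng = random.Random(0)
examples = list(range(10))
batches = list(batch(repeat(lambda: shuffle(examples, 1000, rng), count=2), 4))
print(len(batches))
```

Because repeating comes before batching, a batch may contain examples from the end of one epoch and the start of the next.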

In [172]:
train_input_fn(features, labels, 32)

<BatchDataset shapes: ({SibSp: (?,), Parch: (?,), Embarked: (?,), Age: (?,), Sex: (?,), Fare: (?,), Pclass: (?,)}, (?,)), types: ({SibSp: tf.int64, Parch: tf.int64, Embarked: tf.string, Age: tf.float64, Sex: tf.string, Fare: tf.float64, Pclass: tf.int64}, tf.int64)>

You can also use the pre-made `pandas_input_fn`:

In [217]:
train_input_fn2 = tf.estimator.inputs.pandas_input_fn(features, labels, 32, shuffle=True, num_epochs=10)

### Build an Estimator

In [173]:
# Build a DNN with 2 hidden layers and 10 nodes in each hidden layer.
classifier = tf.estimator.DNNClassifier(
    feature_columns=all_features,
    # Two hidden layers of 10 nodes each.
    hidden_units=[10, 10],
    # The model must choose between 2 classes (survived or not).
    n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x129444828>, '_service': None, '_master': '', '_log_step_count_steps': 100, '_is_chief': True, '_num_worker_replicas': 1, '_session_config': None, '_model_dir': '/var/folders/7h/zprvl3sd2ds38fr4p3bfxj640000gn/T/tmpdcpazf1n', '_keep_checkpoint_max': 5, '_num_ps_replicas': 0}


In [219]:
# Train the estimator
classifier.train(
    steps=1000,
    input_fn=lambda : train_input_fn(features, labels, 32))

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tfestimator/model.ckpt-1000
INFO:tensorflow:Saving checkpoints for 1001 into /tmp/tfestimator/model.ckpt.
INFO:tensorflow:step = 1001, loss = 10.976455
INFO:tensorflow:global_step/sec: 256.731
INFO:tensorflow:step = 1101, loss = 10.90387 (0.391 sec)
INFO:tensorflow:global_step/sec: 380.823
INFO:tensorflow:step = 1201, loss = 9.59169 (0.262 sec)
INFO:tensorflow:global_step/sec: 387.713
INFO:tensorflow:step = 1301, loss = 10.369967 (0.258 sec)
INFO:tensorflow:global_step/sec: 368.498
INFO:tensorflow:step = 1401, loss = 14.00698 (0.272 sec)
INFO:tensorflow:global_step/sec: 351.339
INFO:tensorflow:step = 1501, loss = 9.899971 (0.284 sec)
INFO:tensorflow:global_step/sec: 360.706
INFO:tensorflow:step = 1601, loss = 7.324501 (0.277 sec)
INFO:tensorflow:global_step/sec: 400.07
INFO:tensorflow:step = 1701, loss = 11.931451 (0.251 sec)
INFO:tensorflow:global_step/sec: 371.573
INFO:tensorflow:step = 1801, l

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x129e8eba8>

In [220]:
# Train the estimator
classifier.train(
    steps=1000,
    input_fn=train_input_fn2)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tfestimator/model.ckpt-2000
INFO:tensorflow:Saving checkpoints for 2001 into /tmp/tfestimator/model.ckpt.
INFO:tensorflow:step = 2001, loss = 6.2639484
INFO:tensorflow:global_step/sec: 201.484
INFO:tensorflow:step = 2101, loss = 17.137714 (0.500 sec)
INFO:tensorflow:global_step/sec: 248.535
INFO:tensorflow:step = 2201, loss = 9.366995 (0.400 sec)
INFO:tensorflow:Saving checkpoints for 2279 into /tmp/tfestimator/model.ckpt.
INFO:tensorflow:Loss for final step: 4.3477426.


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x129e8eba8>

### Evaluate the estimator and make predictions

<img src="https://www.tensorflow.org/images/first_train_calls.png" width=600>

In [194]:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features=dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset

In [195]:
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(features, labels, 32))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

INFO:tensorflow:Starting evaluation at 2018-04-08-09:41:15
INFO:tensorflow:Restoring parameters from /var/folders/7h/zprvl3sd2ds38fr4p3bfxj640000gn/T/tmpdcpazf1n/model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2018-04-08-09:41:16
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8327722, accuracy_baseline = 0.6161616, auc = 0.89540535, auc_precision_recall = 0.87436295, average_loss = 0.38446283, global_step = 1000, label/mean = 0.3838384, loss = 12.234157, prediction/mean = 0.41623342

Test set accuracy: 0.833
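
One way to read the logged metrics is to reproduce two of them by hand. The arithmetic below is an interpretation, not something stated in the log itself: it assumes that `accuracy_baseline` is the accuracy of always predicting the majority class, that `loss` is the summed loss averaged over batches, and that the Kaggle training file has 891 rows.

```python
import math

# Values copied from the evaluation log above.
label_mean = 0.3838384
average_loss = 0.38446283
batch_size = 32

# Assumption: number of rows in the Kaggle Titanic training file.
num_examples = 891

# Always predicting the majority class (did not survive) gives the baseline.
accuracy_baseline = max(label_mean, 1.0 - label_mean)

# The last batch is smaller, so divide the total loss by the number of batches.
num_batches = math.ceil(num_examples / batch_size)
loss_per_batch = average_loss * num_examples / num_batches

print(round(accuracy_baseline, 4), num_batches, round(loss_per_batch, 3))
```

Note that the evaluation above runs on the same `features` and `labels` used for training, so the reported accuracy describes the training data and not a held-out test set.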



In [196]:
# Pass labels=None so that eval_input_fn builds a features-only dataset.
predictions = classifier.predict(
    input_fn=lambda: eval_input_fn(features, None, 32))
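
`predict` returns a lazy generator that yields one dictionary per example, so nothing is computed until you iterate over it. The helper below is a small sketch of how such output could be consumed: `summarize_predictions` is a hypothetical name, the keys `class_ids` and `probabilities` are assumed to be present as they are for the canned `DNNClassifier`, and the sample dictionaries are hand-written stand-ins, not real model results.

```python
def summarize_predictions(predictions, limit=5):
    """Collect (class id, probability) pairs from the first `limit` predictions."""
    summary = []
    for index, prediction in enumerate(predictions):
        if index >= limit:
            break
        class_id = int(prediction["class_ids"][0])
        probability = float(prediction["probabilities"][class_id])
        summary.append((class_id, probability))
    return summary

# Hand-written stand-ins shaped like the estimator's output.
fake_predictions = [
    {"class_ids": [0], "probabilities": [0.9, 0.1]},
    {"class_ids": [1], "probabilities": [0.3, 0.7]},
]
print(summarize_predictions(fake_predictions))
```

With a trained classifier you would call `summarize_predictions(predictions)` on the generator returned by `classifier.predict`.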

### Checkpoints

Estimators automatically write the following to disk:

* **checkpoints**, which are versions of the model created during training.
* **event files**, which contain information that TensorBoard uses to create visualizations.

To specify the top-level directory in which the Estimator stores its information, assign a value to the optional `model_dir` argument of any Estimator's constructor.

In [202]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=all_features,
    hidden_units=[10, 10],
    n_classes=2,
    model_dir="/tmp/tfestimator/")

print("Model checkpoint in {}".format(classifier.model_dir))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x129f2b780>, '_service': None, '_master': '', '_log_step_count_steps': 100, '_is_chief': True, '_num_worker_replicas': 1, '_session_config': None, '_model_dir': '/tmp/tfestimator/', '_keep_checkpoint_max': 5, '_num_ps_replicas': 0}
Model checkpoint in /tmp/tfestimator/


The first call to `train` adds checkpoints and other files to the `model_dir` directory:

In [198]:
classifier.train(
    steps=1000,
    input_fn=lambda : train_input_fn(features, labels, 32))

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tfestimator/model.ckpt.
INFO:tensorflow:step = 1, loss = 21.429394
INFO:tensorflow:global_step/sec: 237.103
INFO:tensorflow:step = 101, loss = 13.965596 (0.424 sec)
INFO:tensorflow:global_step/sec: 359.583
INFO:tensorflow:step = 201, loss = 14.733107 (0.278 sec)
INFO:tensorflow:global_step/sec: 333.319
INFO:tensorflow:step = 301, loss = 13.6704645 (0.302 sec)
INFO:tensorflow:global_step/sec: 293.268
INFO:tensorflow:step = 401, loss = 12.029801 (0.340 sec)
INFO:tensorflow:global_step/sec: 319.947
INFO:tensorflow:step = 501, loss = 12.056955 (0.312 sec)
INFO:tensorflow:global_step/sec: 411.414
INFO:tensorflow:step = 601, loss = 12.0845375 (0.243 sec)
INFO:tensorflow:global_step/sec: 387.555
INFO:tensorflow:step = 701, loss = 15.8097925 (0.258 sec)
INFO:tensorflow:global_step/sec: 390.593
INFO:tensorflow:step = 801, loss = 10.677053 (0.255 sec)
INFO:tensorflow:global_step/sec: 387.007
INFO:tenso

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x123a76940>

In [199]:
!ls /tmp/tfestimator/

checkpoint                              model.ckpt-1.meta
events.out.tfevents.1523196162.Siri-Ann model.ckpt-1000.data-00000-of-00001
graph.pbtxt                             model.ckpt-1000.index
model.ckpt-1.data-00000-of-00001        model.ckpt-1000.meta
model.ckpt-1.index


By default, the `Estimator` saves checkpoints in the `model_dir` according to the following schedule:

* Writes a checkpoint every 10 minutes (600 seconds).
* Writes a checkpoint when the train method starts (first iteration) and completes (final iteration).
* Retains only the 5 most recent checkpoints in the directory.
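
The retention rule in the last bullet can be pictured as a fixed-length queue. The sketch below is a plain-Python illustration and not TensorFlow code: `retained_checkpoints` is a hypothetical helper and the list of saved steps is invented for the example.

```python
from collections import deque

def retained_checkpoints(saved_steps, keep_checkpoint_max=5):
    """Return the checkpoint steps still on disk after saving `saved_steps` in order."""
    # A deque with maxlen drops the oldest entry whenever a new one is added.
    kept = deque(maxlen=keep_checkpoint_max)
    for step in saved_steps:
        kept.append(step)
    return list(kept)

# Hypothetical global steps at which checkpoints were written.
saved = [1, 1000, 1001, 2000, 2001, 3000, 3001]
print(retained_checkpoints(saved))
```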

You may alter the default schedule by taking the following steps:

* Create a `RunConfig` object that defines the desired schedule.
* When instantiating the `Estimator`, pass that `RunConfig` object to the `Estimator`'s `config` argument.

In [205]:
my_checkpointing_config = tf.estimator.RunConfig(
    save_checkpoints_secs = 20*60,  # Save checkpoints every 20 minutes.
    keep_checkpoint_max = 10,       # Retain the 10 most recent checkpoints.
)

classifier = tf.estimator.DNNClassifier(
    feature_columns=all_features,
    hidden_units=[10, 10],
    n_classes=2,
    model_dir='/tmp/tfestimator/',
    config=my_checkpointing_config)

INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 1200, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x129e8eb70>, '_service': None, '_master': '', '_log_step_count_steps': 100, '_is_chief': True, '_num_worker_replicas': 1, '_session_config': None, '_model_dir': '/tmp/tfestimator/', '_keep_checkpoint_max': 10, '_num_ps_replicas': 0}
