This tutorial shows how to load a CSV file with the `tf.data` APIs.

In [1]:
!pip install -q tf-nightly

ERROR: google-colab 1.0.0 has requirement google-auth~=1.4.0, but you'll have google-auth 1.10.0 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
ERROR: tb-nightly 2.2.0a20200106 has requirement grpcio>=1.24.3, but you'll have grpcio 1.15.0 which is incompatible.

In [2]:
import tensorflow as tf
import pandas as pd
import functools
import numpy as np

print("TensorFlow Version: {}".format(tf.__version__))
print("TF Eager Mode: {}".format(tf.executing_eagerly()))
print("GPU {} available".format("is" if tf.config.experimental.list_physical_devices("GPU") else "not"))

TensorFlow Version: 2.1.0-dev20200106
TF Eager Mode: True
GPU is available


# Load Datasets

Here you use the `titanic` dataset to predict whether a passenger survived, based on attributes such as age, sex, and ticket class. The dataset is a CSV file hosted on Google Cloud Storage.

In [0]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

`tf.keras.utils.get_file()` downloads a file from a URL, caches it locally, and returns the local path. If the file is already in the cache, it is not downloaded again.

In [4]:
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("test.csv", TEST_DATA_URL)
train_file_path, test_file_path

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


('/root/.keras/datasets/train.csv', '/root/.keras/datasets/test.csv')
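The download-then-cache behaviour can be pictured with a small standard-library sketch. This is not how `get_file()` is implemented (the real function also supports hash checks and archive extraction); the names `cached_download` and `fake_fetch` are hypothetical, and the fetch step is a stand-in so the example runs without network access.

```python
import os
import tempfile

def cached_download(fname, fetch, cache_dir):
    """Return the local path for fname, calling fetch() only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, fname)
    if not os.path.exists(path):
        # Cache miss: fetch the bytes once and store them on disk.
        with open(path, "wb") as handle:
            handle.write(fetch())
    return path

calls = []

def fake_fetch():
    # Hypothetical stand-in for an HTTP download; records each invocation.
    calls.append(1)
    return b"survived,sex\n0,male\n"

cache_dir = tempfile.mkdtemp()
first = cached_download("train.csv", fake_fetch, cache_dir)
second = cached_download("train.csv", fake_fetch, cache_dir)
```

The second call returns the same path without invoking `fake_fetch` again, which is why re-running the download cell is cheap.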

In [0]:
# Make numpy values much easier to read
np.set_printoptions(precision=3, suppress=True)

Inspect the data.

In [6]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n


## Load the Data via Pandas

You can load the CSV file with pandas and then pass the data to TensorFlow/Keras.

In [0]:
pd_csv = pd.read_csv(train_file_path)

In [8]:
pd_csv.head(5)

   survived     sex   age  n_siblings_spouses  parch     fare  class     deck  embark_town  alone
0         0    male  22.0                   1      0     7.25  Third  unknown  Southampton      n
1         1  female  38.0                   1      0  71.2833  First        C    Cherbourg      n
2         1  female  26.0                   0      0    7.925  Third  unknown  Southampton      y
3         1  female  35.0                   1      0     53.1  First        C  Southampton      n
4         0    male  28.0                   0      0   8.4583  Third  unknown   Queenstown      y


## Load the Data via tf.data

You can also load the CSV file with the `tf.data.experimental.make_csv_dataset` API. It reads the file in batches instead of loading it into memory all at once, so it scales to files that do not fit in memory.

In [0]:
LABEL_COL = "survived"
LABELS = [0, 1]

In [0]:
def get_dataset(file_path, **kwargs):
  # batch_size: kept small so that the example batches are easy to read
  dataset = tf.data.experimental.make_csv_dataset(
      file_path, batch_size=5, label_name=LABEL_COL, na_value='?', num_epochs=1, 
      ignore_errors=True, **kwargs)
  return dataset

In [0]:
raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
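Each element of these datasets is a `(features, label)` pair: `features` maps every column name to a batch of values, and `label` holds the matching batch of the `survived` column. The structure can be sketched in plain Python as follows. The helper `iter_csv_batches` and the sample rows are made up for illustration; the real API additionally infers column types, shuffles the rows, and returns tensors rather than lists of strings.

```python
import csv
import io
from itertools import islice

def iter_csv_batches(lines, label_name, batch_size):
    """Yield (features, labels) batches; features maps column name -> list of values."""
    reader = csv.DictReader(lines)
    while True:
        # Pull at most batch_size rows without reading the whole file.
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        # Split the label column off from the remaining feature columns.
        labels = [row.pop(label_name) for row in rows]
        features = {name: [row[name] for row in rows] for name in rows[0]}
        yield features, labels

sample = io.StringIO(
    "survived,sex,age\n"
    "0,male,22.0\n"
    "1,female,38.0\n"
    "1,female,26.0\n"
)
batches = list(iter_csv_batches(sample, label_name="survived", batch_size=2))
```

With three rows and a batch size of 2, the sketch produces one full batch and one final batch containing a single row.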

In [0]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key, value.numpy()))
    print("Label: {}".format(label))

In [13]:
show_batch(raw_train_data)

sex                 : [b'male' b'male' b'male' b'male' b'male']
age                 : [21. 45. 62. 22. 41.]
n_siblings_spouses  : [0 0 0 0 2]
parch               : [0 0 0 0 0]
fare                : [ 7.925  8.05  10.5    7.25  14.108]
class               : [b'Third' b'Third' b'Second' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'y' b'y' b'y' b'y' b'n']
Label: [0 1 1 0 0]


The column names were read correctly from the header row of the file. If the names are missing or wrong, you can pass them explicitly through the `column_names` argument.

In [14]:
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 
               'class', 'deck', 'embark_town', 'alone']
temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)
show_batch(temp_dataset)

sex                 : [b'male' b'male' b'male' b'female' b'male']
age                 : [28. 25. 28. 28. 28.]
n_siblings_spouses  : [0 0 0 8 0]
parch               : [0 0 0 2 0]
fare                : [221.779   7.65    7.75   69.55   56.496]
class               : [b'First' b'Third' b'Third' b'Third' b'Third']
deck                : [b'C' b'F' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Queenstown' b'Southampton' b'Southampton']
alone               : [b'y' b'y' b'y' b'n' b'y']
Label: [0 0 0 0 0]


If you want to load only some of the columns and omit the others, pass the `select_columns` argument.

In [15]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']
temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)
show_batch(temp_dataset)

age                 : [32.5 28.  45.  37.  19. ]
n_siblings_spouses  : [0 3 0 0 0]
class               : [b'Second' b'Third' b'First' b'Third' b'Third']
deck                : [b'E' b'unknown' b'B' b'unknown' b'unknown']
alone               : [b'y' b'n' b'y' b'y' b'y']
Label: [1 0 0 0 0]


# Data Preprocessing

A CSV file can contain a variety of data types. In general, you convert each example into a fixed-length numeric vector before feeding it to the model.

TensorFlow has built-in APIs for these conversions, such as `tf.feature_column`. You can also preprocess the data with tools like `nltk` or `sklearn` and then pass the result to TensorFlow. The main advantage of doing the preprocessing inside TensorFlow is that it becomes part of the model, so the exported model accepts raw inputs and needs no separate preprocessing step when it is loaded elsewhere.

## Continuous Data

In [16]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS, 
                           column_defaults=DEFAULTS)
show_batch(temp_dataset)

age                 : [27. 35. 29. 28. 17.]
n_siblings_spouses  : [1. 1. 0. 0. 0.]
parch               : [0. 0. 0. 0. 0.]
fare                : [ 53.1    90.    211.337  26.55   12.   ]
Label: [1 1 1 0 1]
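Note that `n_siblings_spouses` and `parch` now come back as floats, because each entry in `column_defaults` sets both the column's type and the value used for missing fields. A rough plain-Python sketch of that rule, under the assumption that empty fields and the `na_value` string count as missing (`parse_row` is a hypothetical helper, not a TensorFlow function):

```python
def parse_row(fields, defaults, na_value="?"):
    """Cast each CSV field to the type of its default; fall back to the default if missing."""
    parsed = []
    for text, default in zip(fields, defaults):
        if text == "" or text == na_value:
            # Missing field: substitute the column default.
            parsed.append(default)
        else:
            # Present field: convert using the default's type (int or float here).
            parsed.append(type(default)(text))
    return parsed

DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
# Made-up row in the order survived, age, n_siblings_spouses, parch, fare.
row = parse_row(["1", "38.0", "1", "", "71.2833"], DEFAULTS)
```

Here the integer-looking field `"1"` for `n_siblings_spouses` becomes `1.0`, and the empty `parch` field is filled with `0.0`.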


In [17]:
example_batch, label_batch = next(iter(temp_dataset))
example_batch

OrderedDict([('age',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([32., 41., 28., 47., 26.], dtype=float32)>),
             ('n_siblings_spouses',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 1., 0.], dtype=float32)>),
             ('parch',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 1., 0., 1., 0.], dtype=float32)>),
             ('fare',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 30.5  ,  19.5  , 110.883,  52.554,   8.05 ], dtype=float32)>)])

A simple function packs all the columns into a single tensor with one row per example.

In [0]:
def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

In [19]:
packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())

[[47.     1.     0.    14.5  ]
 [28.     0.     1.    55.   ]
 [ 9.     4.     2.    31.388]
 [49.     0.     0.     0.   ]
 [ 0.75   2.     1.    19.258]]

[0 1 0 0 1]
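Stacking along `axis=-1` turns a set of per-column vectors into one row per example. For one-dimensional columns this is the same as transposing a list of columns, which the following plain-Python sketch illustrates. The function `pack_columns` is a hypothetical stand-in for `tf.stack(..., axis=-1)`, and the values are the first two rows of the output above.

```python
def pack_columns(features):
    """Turn {column: [v0, v1, ...]} into one row per example."""
    # zip(*columns) walks the columns in parallel, yielding one tuple per example.
    return [list(row) for row in zip(*features.values())]

batch = {
    "age": [47.0, 28.0],
    "n_siblings_spouses": [1.0, 0.0],
    "parch": [0.0, 1.0],
    "fare": [14.5, 55.0],
}
packed = pack_columns(batch)
```

Four columns of length 2 become two rows of length 4, which mirrors the `(batch_size, num_features)` shape of the packed tensor.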


A more general way is to define a preprocessor that collects a list of numeric features and packs them into a single column. Let's look at the data first.

In [20]:
show_batch(raw_train_data)

sex                 : [b'male' b'female' b'female' b'male' b'male']
age                 : [51. 44. 28. 28. 47.]
n_siblings_spouses  : [0 0 1 0 0]
parch               : [0 0 0 0 0]
fare                : [26.55  27.721 15.5    7.75   9.   ]
class               : [b'First' b'First' b'Third' b'Third' b'Third']
deck                : [b'E' b'B' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Queenstown' b'Queenstown' b'Southampton']
alone               : [b'y' b'y' b'n' b'y' b'y']
Label: [1 1 1 0 0]


In [0]:
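class PackNumericFeatures(object):
  """Packs the named numeric columns into a single "numeric" feature."""

  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    # Remove the numeric columns from the feature dictionary.
    numeric_features = [features.pop(name) for name in self.names]
    # Cast them to a common dtype so that they can be stacked.
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    # Stack them into one tensor of shape (batch_size, len(names)).
    numeric_features = tf.stack(numeric_features, axis=-1)
    features["numeric"] = numeric_features
    return features, labels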

In [0]:
NUMERIC_FEATURES = ["age", "n_siblings_spouses", "parch", "fare"]

packed_train_data = raw_train_data.map(PackNumericFeatures(NUMERIC_FEATURES))
packed_test_data = raw_test_data.map(PackNumericFeatures(NUMERIC_FEATURES))

In [23]:
show_batch(packed_train_data)

sex                 : [b'male' b'female' b'male' b'male' b'male']
class               : [b'Third' b'Second' b'Third' b'First' b'First']
deck                : [b'F' b'unknown' b'unknown' b'C' b'C']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']
alone               : [b'y' b'y' b'y' b'y' b'y']
numeric             : [[ 19.      0.      0.      7.65 ]
 [ 34.      0.      0.     13.   ]
 [ 36.      0.      0.      7.496]
 [ 28.      0.      0.    221.779]
 [ 28.      0.      0.     29.7  ]]
Label: [0 1 0 0 1]


In [24]:
example_batch, label_batch = next(iter(packed_train_data))
example_batch, label_batch

(OrderedDict([('sex',
               <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'male', b'male', b'male'], dtype=object)>),
              ('class',
               <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Second', b'Third', b'Third', b'Third', b'Second'], dtype=object)>),
              ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=
               array([b'unknown', b'unknown', b'unknown', b'unknown', b'unknown'],
                     dtype=object)>),
              ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
               array([b'Southampton', b'Southampton', b'Cherbourg', b'Queenstown',
                      b'Southampton'], dtype=object)>),
              ('alone',
               <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'y', b'n', b'y', b'n'], dtype=object)>),
              ('numeric', <tf.Tensor: shape=(5, 4), dtype=float32, numpy=
               array([[25.   ,  1.   ,  0.   , 26.   ],
                  

### Data Normalization

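Continuous features usually benefit from normalization, because features on very different scales (here `parch` ranges from 0 to 5 while `fare` ranges from 0 to about 512) can slow down or destabilize training.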

In [25]:
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

             age  n_siblings_spouses     parch       fare
count      627.0               627.0     627.0      627.0
mean   29.631308            0.545455  0.379585  34.385399
std    12.511818             1.15109  0.792999   54.59773
min         0.75                 0.0       0.0        0.0
25%         23.0                 0.0       0.0     7.8958
50%         28.0                 0.0       0.0    15.0458
75%         35.0                 1.0       0.0    31.3875
max         80.0                 8.0       5.0   512.3292


In [0]:
MEAN = np.array(desc.T["mean"])
STD = np.array(desc.T["std"])

In [0]:
def normalize_numeric_data(data, mean, std):
  return (data-mean) / std

A `normalizer_fn` must accept a single argument, so bind the `MEAN` and `STD` values in advance with the `functools.partial` function.

In [28]:
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)
normalizer

functools.partial(<function normalize_numeric_data at 0x7f0680361158>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598]))
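As a worked example of the arithmetic, the same z-score normalization can be written in plain Python. The means and standard deviations are copied from the `describe()` output above; the function name `normalize_row` and the input row are chosen only for illustration.

```python
import functools

def normalize_row(values, mean, std):
    """Apply (value - mean) / std to each feature of one example."""
    return [(v - m) / s for v, m, s in zip(values, mean, std)]

# Statistics for age, n_siblings_spouses, parch, fare from the describe() table.
MEAN = [29.631308, 0.545455, 0.379585, 34.385399]
STD = [12.511818, 1.15109, 0.792999, 54.59773]

# Bind the statistics so that the resulting callable takes only the data.
normalizer = functools.partial(normalize_row, mean=MEAN, std=STD)

# A passenger aged 28 travelling alone on a 7.75 fare (illustrative values).
row = normalizer([28.0, 0.0, 0.0, 7.75])
rounded = [round(x, 3) for x in row]
```

For this row the result rounds to `[-0.13, -0.474, -0.479, -0.488]`: every feature is slightly below its mean, expressed in units of standard deviations.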

In [29]:
numeric_column = tf.feature_column.numeric_column('numeric', 
                                                  normalizer_fn=normalizer, 
                                                  shape=[len(NUMERIC_FEATURES)])
numeric_column = [numeric_column]
numeric_column

[NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7f0680361158>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))]

After creating the feature columns, pass them to a `tf.keras.layers.DenseFeatures` layer, which turns a dictionary of raw features into a single dense tensor.

This layer can then be used as the first layer of a model.

In [30]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_column)

# normalized result
numeric_layer(example_batch).numpy()

array([[-0.37 ,  0.395, -0.479, -0.154],
       [ 0.109, -0.474, -0.479, -0.487],
       [-0.13 ,  0.395,  0.782, -0.351],
       [-0.13 , -0.474, -0.479, -0.488],
       [ 0.589,  0.395, -0.479, -0.154]], dtype=float32)

## Categorical Data

Categorical data differs from continuous data in that each value comes from a limited set of categories. Where `tf.feature_column.numeric_column` describes a numeric feature, `tf.feature_column.categorical_column_with_vocabulary_list` wrapped in `tf.feature_column.indicator_column` describes a categorical feature as a one-hot vector.

In [0]:
CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone': ['y', 'n']
}

In [32]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
      key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
categorical_columns, type(cat_col), type(tf.feature_column.indicator_column(cat_col))

([IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
  IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))],
 tensorflow.python.feature_column.feature_column_v2.VocabularyListCa

In [33]:
example_batch

OrderedDict([('sex',
              <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'male', b'male', b'male'], dtype=object)>),
             ('class',
              <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Second', b'Third', b'Third', b'Third', b'Second'], dtype=object)>),
             ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=
              array([b'unknown', b'unknown', b'unknown', b'unknown', b'unknown'],
                    dtype=object)>),
             ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
              array([b'Southampton', b'Southampton', b'Cherbourg', b'Queenstown',
                     b'Southampton'], dtype=object)>),
             ('alone',
              <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'y', b'n', b'y', b'n'], dtype=object)>),
             ('numeric', <tf.Tensor: shape=(5, 4), dtype=float32, numpy=
              array([[25.   ,  1.   ,  0.   , 26.   ],
                     [31.   ,  0

In [34]:
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
categorical_layer(example_batch).numpy()

array([[0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1., 0.]], dtype=float32)

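The `DenseFeatures` layer one-hot encodes each categorical column and concatenates the results. For example, the entity `['male', 'Third', 'unknown', 'Southampton', 'y']` has the per-column encodings `(1,0)`, `(0,0,1)`, `(0,0,0,0,0,0,0,0,0,0)`, `(0,1,0)`, and `(1,0)`. A value that is not in the vocabulary, such as `unknown` for `deck`, becomes an all-zero vector. As the output above shows, the encodings are concatenated in alphabetical order of the column keys (`alone`, `class`, `deck`, `embark_town`, `sex`), not in the order in which the columns were defined.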
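The encoding can be reproduced with a short plain-Python sketch. It is only an illustration of the resulting vector, not the TensorFlow implementation; the helper names are hypothetical, and sorting by key is an assumption chosen because it matches the layer output shown above.

```python
CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone': ['y', 'n'],
}

def one_hot(value, vocabulary):
    """Return a one-hot vector; values outside the vocabulary give all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def encode(example):
    """Concatenate the one-hot vectors of all columns, ordered by column key."""
    encoded = []
    for key in sorted(CATEGORIES):
        encoded.extend(one_hot(example[key], CATEGORIES[key]))
    return encoded

# First row of example_batch shown above.
example = {'sex': 'male', 'class': 'Second', 'deck': 'unknown',
           'embark_town': 'Southampton', 'alone': 'n'}
vector = encode(example)
```

The resulting 20-element vector equals the first row of the `categorical_layer` output: two slots for `alone`, three for `class`, ten zeros for the out-of-vocabulary `deck`, three for `embark_town`, and two for `sex`.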

## Combine Continuous and Categorical Columns

In [35]:
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns + numeric_column)
preprocessing_layer(example_batch).numpy()[0], len(preprocessing_layer(example_batch).numpy()[0])

(array([ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,
         0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
         1.   ,  0.   , -0.37 ,  0.395, -0.479, -0.154,  1.   ,  0.   ],
       dtype=float32), 24)

# Build and Train the Model

## Build the Model

In [0]:
def build_seq():
  model = tf.keras.Sequential([
    preprocessing_layer,
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  return model

In [0]:
def compile_model():
  """Build a sequential model."""
  model = build_seq()

  model.compile(loss='binary_crossentropy', 
                optimizer='adam', 
                metrics=['accuracy'])
  return model

model = compile_model()  

Another way to build the model is the Keras functional API, which is more flexible than a linear `Sequential` stack because the inputs are declared explicitly and can be named.

In [0]:
def build_model(inputs):
  x = preprocessing_layer(inputs)
  x = tf.keras.layers.Dense(128, activation='elu')(x)
  x = tf.keras.layers.Dense(128, activation='elu')(x)
  y = tf.keras.layers.Dense(1, activation='sigmoid')(x)
  return y

Because the first layer is a `DenseFeatures` layer, you cannot feed the model a single tensor with a fixed shape. Instead, the input must be a dictionary that maps each feature name to its own input tensor.

In [0]:
def build_inputs():
  column_names = ['sex', 'class', 'deck', 'embark_town', 'alone', 'numeric']
  total_columns = {}
  for item in column_names:
    if item == "numeric":
      # The packed numeric feature holds four floats per example.
      total_columns[item] = tf.keras.layers.Input(shape=(4,), dtype=tf.float32, name=item)
    else:
      # Each categorical feature is a single string per example.
      total_columns[item] = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name=item)
  return total_columns

In [48]:
def compile_model_adv():
  """Build and compile a model with the functional API."""
  inputs = build_inputs()
  outputs = build_model(inputs)
  model = tf.keras.Model(inputs, outputs)

  model.compile(loss='binary_crossentropy', 
                optimizer='adam', 
                metrics=['accuracy'])
  return model

model = compile_model_adv()  
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
alone (InputLayer)              [(None, 1)]          0                                            
__________________________________________________________________________________________________
class (InputLayer)              [(None, 1)]          0                                            
__________________________________________________________________________________________________
deck (InputLayer)               [(None, 1)]          0                                            
__________________________________________________________________________________________________
embark_town (InputLayer)        [(None, 1)]          0                                            
____________________________________________________________________________________________

## Train and Evaluate the Model

In [0]:
train_data = packed_train_data.shuffle(500)
test_data = packed_test_data

In [50]:
model.fit(train_data, epochs=20, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f06802e0780>

In [51]:
loss, acc = model.evaluate(test_data, verbose=2)
print("Loss: {}, Acc: {}".format(loss, acc))

Loss: 0.4275430693941296, Acc: 0.8219696879386902
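The two numbers reported here are the mean binary cross-entropy and the fraction of examples classified correctly at a 0.5 threshold. The following plain-Python sketch shows how such values are computed from predicted probabilities. The labels and probabilities are invented for illustration, and the clipping constant `eps` is an assumption rather than a value taken from this notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 to keep log() finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Invented labels and predicted survival probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
```

In this toy batch the first two predictions fall on the correct side of the threshold and the last two do not, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.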


## Prediction

In [55]:
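# make_csv_dataset shuffles by default, so iterating test_data twice
# (once for predict() and once for the labels) would pair each prediction
# with the wrong label. Take one batch and predict on it instead.
features, labels = next(iter(test_data))
predictions = model.predict(features)

for prediction, survived in zip(predictions, labels.numpy()):
  print("Predicted  {:.2%}, Actual".format(prediction[0]),
        ("Survived" if bool(survived) else "Died"))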

Predicted  10.15%, Actual Survived
Predicted  29.16%, Actual Died
Predicted  67.71%, Actual Survived
Predicted  20.60%, Actual Died
Predicted  17.60%, Actual Survived
