<a href="https://colab.research.google.com/github/nigoda/machine_learning/blob/main/8_Titanic_csvdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**How to load CSV data from a file into a tf.data.Dataset**

In [None]:
try:
  # % tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf

In [None]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

In [None]:
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

In [None]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

## **Load data**
To start, let's look at the top of the CSV file to see how it is formatted.

In [None]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n


You can load this using pandas and pass the NumPy arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with TensorFlow and tf.data, use the tf.data.experimental.make_csv_dataset function.

The only column you need to identify explicitly is the one with the value that the model is intended to predict.


In [None]:
LABEL_COLUMN = 'survived'
LABEL = [0, 1]
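
For comparison, here is a minimal sketch of the pandas route mentioned above. It assumes a tiny in-memory CSV (three rows copied from the file preview) instead of the downloaded file, and the names `sample_csv`, `features`, and `labels` are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Three rows copied from the preview above; a stand-in for train.csv.
sample_csv = io.StringIO(
    "survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone\n"
    "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n\n"
    "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n\n"
    "1,female,26.0,0,0,7.925,Third,unknown,Southampton,y\n"
)

df = pd.read_csv(sample_csv)

# Split off the label column; everything left is a feature.
labels = df.pop("survived").to_numpy()

# Keep one NumPy array per column, mirroring a column-based layout.
features = {name: df[name].to_numpy() for name in df.columns}

print(labels)           # one label per row
print(features["age"])  # a single column as a NumPy array
```

Arrays like these can then be handed to `tf.data.Dataset.from_tensor_slices((features, labels))`, which is a reasonable choice when the whole file fits in memory.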

Now read the CSV data from the file and create a dataset.
(For the full documentation, see tf.data.experimental.make_csv_dataset.)

In [None]:
def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,  # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value='?',
      num_epochs=1,
      ignore_errors=True,
      **kwargs)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

In [None]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key, value.numpy()))

Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).

It might help to see this yourself.

In [None]:
show_batch(raw_train_data)

sex                 : [b'female' b'female' b'female' b'male' b'male']
age                 : [28. 38. 38. 36. 47.]
n_siblings_spouses  : [2 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [ 23.25   80.    227.525   7.896  34.021]
class               : [b'Third' b'First' b'First' b'Third' b'First']
deck                : [b'unknown' b'B' b'C' b'unknown' b'D']
embark_town         : [b'Queenstown' b'unknown' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'y' b'y' b'y' b'y']
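
The column-based layout can be sketched with plain NumPy. The batch below reuses the values printed above for three of the columns; `batch` and `rows` are illustrative names, not part of the tutorial's code.

```python
import numpy as np

# A batch maps each column name to an array of length batch_size.
batch = {
    "sex": np.array([b"female", b"female", b"female", b"male", b"male"]),
    "age": np.array([28.0, 38.0, 38.0, 36.0, 47.0]),
    "parch": np.array([0, 0, 0, 0, 0]),
}

# Every column holds exactly one value per example in the batch.
batch_size = len(batch["age"])
assert all(len(col) == batch_size for col in batch.values())

# Row-based view: zip the columns to recover one tuple per example.
rows = list(zip(*batch.values()))
print(batch_size)  # 5
print(rows[0])     # the first example as (sex, age, parch)
```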


As you can see, the columns in the CSV are named. The dataset constructor picks these names up automatically. If the file you are working with does not contain the column names in the first line, pass them as a list of strings to the column_names argument of the make_csv_dataset function.

In [None]:
CSV_COLUMN = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare','class', 'deck','embark_town', 'alone']
temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMN)
show_batch(temp_dataset)

sex                 : [b'male' b'male' b'female' b'female' b'female']
age                 : [19. 39. 34. 28. 21.]
n_siblings_spouses  : [0 0 0 1 0]
parch               : [0 0 1 0 0]
fare                : [ 10.5    0.    23.   133.65   7.75]
class               : [b'Second' b'First' b'Second' b'First' b'Third']
deck                : [b'unknown' b'A' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Queenstown']
alone               : [b'y' b'y' b'n' b'n' b'y']
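
To illustrate what supplying column names does, the sketch below parses a headerless CSV with Python's standard `csv` module, which takes the names through its `fieldnames` parameter. This is an analogy only, not how `make_csv_dataset` is implemented; the two data rows are copied from the file preview.

```python
import csv
import io

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch',
               'fare', 'class', 'deck', 'embark_town', 'alone']

# Same data as the preview, but without the header line.
headerless = io.StringIO(
    "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n\n"
    "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n\n"
)

# With no header in the file, the names must come from the caller.
reader = csv.DictReader(headerless, fieldnames=CSV_COLUMNS)
records = list(reader)

print(records[0]["sex"], records[0]["embark_town"])  # male Southampton
```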


In [None]:
SELECT_COLUMNS = ['survived','age','n_siblings_spouses','class','deck','alone']
temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)
show_batch(temp_dataset)

age                 : [34. 20. 42. 28. 28.]
n_siblings_spouses  : [0 1 1 0 0]
class               : [b'First' b'Third' b'Second' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'F' b'unknown']
alone               : [b'y' b'n' b'n' b'y' b'y']


## **Data preprocessing**

A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed-length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: tf.feature_column.

You can preprocess your data using any tool you like (such as nltk or sklearn), and just pass the processed output to TensorFlow.

The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.

### **Continuous data**
If your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:

In [None]:
SELECT_COLUMNS = ['survived','age','n_siblings_spouses','parch','fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path,
                           select_columns = SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)

age                 : [80. 49. 28. 31. 28.]
n_siblings_spouses  : [0. 0. 0. 0. 0.]
parch               : [0. 0. 0. 0. 0.]
fare                : [30.     0.    35.5   13.     7.896]


In [None]:
example_batch, labels_batch = next(iter(temp_dataset))

Here's a simple function that will pack together all the columns:

In [None]:
def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

Apply this to each element of the dataset:

In [None]:
packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())

[[36.     1.     2.    27.75 ]
 [18.     0.     1.     9.35 ]
 [38.     0.     0.    80.   ]
 [37.     1.     0.    26.   ]
 [28.     0.     0.     7.775]]

[0 1 1 0 1]
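
The effect of stacking along the last axis can be checked with NumPy, whose `np.stack` takes the same `axis` argument as `tf.stack`. The column values below are taken from the packed output above; the variable names are illustrative.

```python
import numpy as np

# One array per column, each of length batch_size (5 here).
features = {
    "age": np.array([36.0, 18.0, 38.0, 37.0, 28.0]),
    "n_siblings_spouses": np.array([1.0, 0.0, 0.0, 1.0, 0.0]),
    "parch": np.array([2.0, 1.0, 0.0, 0.0, 0.0]),
    "fare": np.array([27.75, 9.35, 80.0, 26.0, 7.775]),
}

# axis=-1 makes each example a row and each feature a column.
packed = np.stack(list(features.values()), axis=-1)

# axis=0 would instead keep one row per feature.
by_feature = np.stack(list(features.values()), axis=0)

print(packed.shape)      # (5, 4)
print(by_feature.shape)  # (4, 5)
print(packed[0])         # first example: age, n_siblings_spouses, parch, fare
```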


If you have mixed datatypes you may want to separate out these simple numeric fields. The tf.feature_column API can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:

In [None]:
show_batch(raw_train_data)

sex                 : [b'female' b'male' b'female' b'male' b'male']
age                 : [28. 30. 28. 35. 21.]
n_siblings_spouses  : [1 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [24.    27.75  79.2   26.     7.775]
class               : [b'Second' b'First' b'First' b'Second' b'Third']
deck                : [b'unknown' b'C' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Cherbourg' b'Cherbourg' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'y' b'y' b'y' b'y']


In [None]:
example_batch, labels_batch = next(iter(temp_dataset))
example_batch

OrderedDict([('age',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([47., 28., 35., 63., 30.], dtype=float32)>),
             ('n_siblings_spouses',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 0., 0., 0., 0.], dtype=float32)>),
             ('parch',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 0., 0., 0., 0.], dtype=float32)>),
             ('fare',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([52.554,  7.787,  7.896,  9.587, 13.   ], dtype=float32)>)])

In [None]:
labels_batch

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 1, 0], dtype=int32)>

So define a more general preprocessor that selects a list of numeric features and packs them into a single column:

In [None]:
class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names
  
  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis = -1)
    features['numeric'] = numeric_features

    return features, labels

In [None]:
NUMERIC_FEATURES = ['age', 'n_siblings_spouses','parch','fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

In [None]:
show_batch(packed_train_data)

sex                 : [b'male' b'male' b'female' b'female' b'male']
class               : [b'First' b'First' b'First' b'First' b'Third']
deck                : [b'E' b'C' b'E' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'y' b'n' b'n' b'n' b'y']
numeric             : [[ 47.      0.      0.     38.5  ]
 [ 36.      1.      0.     78.85 ]
 [ 39.      1.      1.     79.65 ]
 [ 45.      1.      1.    164.867]
 [ 36.      0.      0.      7.896]]
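
The same select, cast, and stack logic can be traced on plain NumPy arrays. The function `pack_numeric` below is a hypothetical stand-in for `PackNumericFeatures.__call__`, written without TensorFlow so each step is easy to inspect; the values come from the first two examples in the output above.

```python
import numpy as np

def pack_numeric(features, names):
    """Move the named columns into a single float32 'numeric' matrix."""
    features = dict(features)  # copy so the caller's dict is not mutated
    numeric = [features.pop(name).astype(np.float32) for name in names]
    features["numeric"] = np.stack(numeric, axis=-1)
    return features

batch = {
    "sex": np.array([b"male", b"male"]),
    "age": np.array([47.0, 36.0]),
    "n_siblings_spouses": np.array([0, 1]),  # integer column
    "parch": np.array([0, 0]),               # integer column
    "fare": np.array([38.5, 78.85]),
}

packed = pack_numeric(batch, ["age", "n_siblings_spouses", "parch", "fare"])

print(sorted(packed))           # ['numeric', 'sex']
print(packed["numeric"].dtype)  # float32
print(packed["numeric"].shape)  # (2, 4)
```

The cast matters because the integer and float columns must share one dtype before they can be stacked into a single matrix.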


In [None]:
show_batch(packed_test_data)

sex                 : [b'female' b'male' b'male' b'male' b'female']
class               : [b'Third' b'Third' b'Third' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Queenstown' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'n' b'y' b'y' b'y' b'n']
numeric             : [[ 5.     4.     2.    31.388]
 [28.     0.     0.     7.738]
 [21.     0.     0.     8.05 ]
 [43.     0.     0.     6.45 ]
 [21.     1.     0.     9.825]]


In [None]:
example_batch, labels_batch = next(iter(packed_train_data))

### **Data Normalization**

Continuous data should always be normalized.

In [None]:
import pandas as pd
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

| | age | n_siblings_spouses | parch | fare |
|---|---|---|---|---|
| count | 627.0 | 627.0 | 627.0 | 627.0 |
| mean | 29.631308 | 0.545455 | 0.379585 | 34.385399 |
| std | 12.511818 | 1.15109 | 0.792999 | 54.59773 |
| min | 0.75 | 0.0 | 0.0 | 0.0 |
| 25% | 23.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 28.0 | 0.0 | 0.0 | 15.0458 |
| 75% | 35.0 | 1.0 | 0.0 | 31.3875 |
| max | 80.0 | 8.0 | 5.0 | 512.3292 |


In [None]:
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

In [None]:
def normalize_numeric_data(data, mean, std):
  # Center the data on the mean and scale it by the standard deviation.
  return (data - mean) / std

Now create a numeric column. The tf.feature_column.numeric_column API accepts a normalizer_fn argument, which will be run on each batch.

Bind the MEAN and STD to the normalizer function using functools.partial.

In [None]:
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]

# See what you just created.
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7f5c052c2510>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))

When you train the model, include this feature column to select and center this block of numeric data:

In [None]:
example_batch['numeric']

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[16.   ,  0.   ,  0.   ,  9.5  ],
       [21.   ,  0.   ,  0.   , 10.5  ],
       [27.   ,  1.   ,  0.   , 21.   ],
       [41.   ,  0.   ,  5.   , 39.688],
       [28.   ,  1.   ,  0.   , 15.85 ]], dtype=float32)>

In [None]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_column)
numeric_layer(example_batch).numpy()

array([[-1.089, -0.474, -0.479, -0.456],
       [-0.69 , -0.474, -0.479, -0.437],
       [-0.21 ,  0.395, -0.479, -0.245],
       [ 0.909, -0.474,  5.827,  0.097],
       [-0.13 ,  0.395, -0.479, -0.339]], dtype=float32)

The mean based normalization used here requires knowing the mean of each column ahead of time.
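
The normalized values above can be reproduced by hand. The sketch below applies the same `(data - mean) / std` formula with NumPy, using the mean and standard deviation from the `describe()` table and the first row of `example_batch['numeric']`. It also shows how `functools.partial` fixes the two statistics so the resulting function takes only the data.

```python
import functools

import numpy as np

def normalize_numeric_data(data, mean, std):
    # Center on the mean, then scale by the standard deviation.
    return (data - mean) / std

# Statistics copied from the describe() table above.
MEAN = np.array([29.631308, 0.545455, 0.379585, 34.385399])
STD = np.array([12.511818, 1.15109, 0.792999, 54.59773])

# Bind the statistics once; the result is a one-argument function.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

row = np.array([16.0, 0.0, 0.0, 9.5])  # first row of example_batch['numeric']
print(np.round(normalizer(row), 3))    # compare with the first row of the layer output
```

Note that `describe()` reports the sample standard deviation, so recomputing it with `np.std` needs `ddof=1` to get matching numbers.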

### **Categorical data**

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the tf.feature_column API to create a collection with a tf.feature_column.indicator_column for each categorical column.

In [None]:
CATEGORIES = {
    'sex' : ['male','female'],
    'class' : ['First','Second','Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton','Queenstown'],
    'alone' : ['y', 'n']
}

In [None]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
      key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

In [None]:
# See what you just created
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [None]:
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

[1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
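
A rough NumPy picture of what an indicator column over a vocabulary list computes: each value becomes a one-hot vector over the vocabulary, and a value that is not in the vocabulary comes out as all zeros (consistent with the `default_value=-1, num_oov_buckets=0` settings shown above, as far as the encoded output is concerned). The helper `one_hot` is illustrative and is not a TensorFlow function.

```python
import numpy as np

def one_hot(values, vocab):
    """One-hot encode values against vocab; unknown values map to all zeros."""
    index = {word: i for i, word in enumerate(vocab)}
    encoded = np.zeros((len(values), len(vocab)), dtype=np.float32)
    for row, value in enumerate(values):
        col = index.get(value)
        if col is not None:  # out-of-vocabulary values stay all-zero
            encoded[row, col] = 1.0
    return encoded

deck_vocab = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
encoded = one_hot(['C', 'unknown', 'E'], deck_vocab)

print(encoded[0])  # 1.0 at index 2, the position of 'C'
print(encoded[1])  # all zeros, because 'unknown' is not in the vocabulary
```

The all-zero case is why vocabulary entries have to match the data exactly: a misspelled entry silently encodes every occurrence of the real value as zeros.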


This will become part of a data processing input later when you build the model.

### **Combined preprocessing layer**

Add the two feature column collections and pass them to a tf.keras.layers.DenseFeatures to create an input layer that will extract and preprocess both input types:

In [None]:
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)

In [None]:
print(preprocessing_layer(example_batch).numpy()[0])

[ 1.     0.     0.     0.     1.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.    -1.089 -0.474
 -0.479 -0.456  1.     0.   ]
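
Judging from the printed vector, the combined layer concatenates the per-column outputs in alphabetical order of column name, which is why the four normalized numeric values sit between the `embark_town` and `sex` slots. The NumPy sketch below rebuilds that vector from its pieces under this assumption; the `pieces` dictionary is illustrative, with values read off the outputs above.

```python
import numpy as np

# Per-column encodings for the first example, read off the outputs above.
pieces = {
    "sex": np.array([1.0, 0.0]),         # 'male'
    "class": np.array([0.0, 0.0, 1.0]),  # 'Third'
    "deck": np.zeros(10),                # value not in the vocabulary
    "embark_town": np.zeros(3),          # value not matched by the vocabulary
    "alone": np.array([1.0, 0.0]),       # 'y'
    "numeric": np.array([-1.089, -0.474, -0.479, -0.456]),
}

# Assumption: outputs are concatenated in sorted column-name order.
combined = np.concatenate([pieces[name] for name in sorted(pieces)])

print(combined.shape)  # (24,)
print(combined)
```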


### **Building the model**
Build a tf.keras.Sequential, starting with the preprocessing_layer.

In [None]:
model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'])
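
To make the layer stack concrete, here is a NumPy sketch of the forward pass such a network performs: two ReLU layers of width 128 followed by a single sigmoid unit. The weights are random placeholders (a trained model would have learned values), and the input width of 24 is an assumption taken from the length of the preprocessing output printed earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder weights; Keras initializes and then trains these.
w1, b1 = rng.normal(scale=0.1, size=(24, 128)), np.zeros(128)
w2, b2 = rng.normal(scale=0.1, size=(128, 128)), np.zeros(128)
w3, b3 = rng.normal(scale=0.1, size=(128, 1)), np.zeros(1)

x = rng.normal(size=(5, 24))   # a batch of 5 preprocessed examples
h1 = relu(x @ w1 + b1)         # Dense(128, activation='relu')
h2 = relu(h1 @ w2 + b2)        # Dense(128, activation='relu')
probs = sigmoid(h2 @ w3 + b3)  # Dense(1, activation='sigmoid')

print(probs.shape)  # (5, 1)
```

Because the last layer is a single sigmoid unit, each output lies between 0 and 1 and can be read as a survival probability, which is the form the `binary_crossentropy` loss expects.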

### **Train, evaluate and predict**

Now the model can be trained.

In [None]:
train_data = packed_train_data.shuffle(500)
test_data = packed_test_data

In [None]:
model.fit(train_data, epochs = 20)

Epoch 1/20
Consider rewriting this model with the Functional API.
Consider rewriting this model with the Functional API.
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f5c04702da0>

In [None]:
test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))



Test Loss 0.4398212134838104, Test Accuracy 0.8333333134651184


In [None]:
# Use tf.keras.Model.predict to infer labels on a batch or a dataset of batches.
# The dataset is reshuffled each time it is iterated, so predict on a single
# batch to keep every prediction aligned with its label.
features, labels = next(iter(test_data))
predictions = model.predict(features)

# Show some results
for prediction, survived in zip(predictions, labels):
  print('Predicted survival : {:.2%}'.format(prediction[0]),
        " | Actual outcome : ",
        ("SURVIVED" if bool(survived) else "DIED"))

Predicted survival : 11.02%  | Actual outcome :  SURVIVED
Predicted survival : 25.95%  | Actual outcome :  SURVIVED
Predicted survival : 10.99%  | Actual outcome :  DIED
Predicted survival : 22.44%  | Actual outcome :  DIED
Predicted survival : 88.92%  | Actual outcome :  SURVIVED
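
As a sanity check on how accuracy and binary cross-entropy relate to these probabilities, the sketch below scores just the five predictions printed above. It assumes a 0.5 decision threshold for accuracy, and the result describes this one batch only, not the full test set.

```python
import numpy as np

# Probabilities and outcomes copied from the printed lines above.
probs = np.array([0.1102, 0.2595, 0.1099, 0.2244, 0.8892])
labels = np.array([1, 1, 0, 0, 1])

# Accuracy: threshold at 0.5, then compare with the true labels.
predicted = (probs > 0.5).astype(int)
accuracy = np.mean(predicted == labels)

# Binary cross-entropy, averaged over the batch.
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(accuracy)  # 0.6
print(round(float(bce), 3))
```

The two survivors with low predicted probabilities count as misses for accuracy and also dominate the loss, since cross-entropy penalizes confident wrong predictions heavily.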


In [None]:
print(survived)

tf.Tensor(0, shape=(), dtype=int32)
