# Tabular Data for training (Titanic dataset)

When working with tabular data, you need to pay special attention to the categorical columns. This type of data requires you to ponder which representation is most appropriate: it may be a class index or a binary representation (one-hot encoding).
Columns of tabular data can also be combined with other columns via **crossed feature columns**.
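
As a minimal, library-free sketch of these ideas, the snippet below encodes a `class` value first as an integer index and then as a one-hot vector, and finally crosses `sex` with `class` so that every combination gets its own slot. The helper names `to_index`, `one_hot` and `cross` are hypothetical and are not part of TensorFlow.

```python
from itertools import product

# Vocabularies for two categorical columns of the Titanic dataset.
SEX_VOCAB = ["male", "female"]
CLASS_VOCAB = ["First", "Second", "Third"]


def to_index(value, vocab):
    """Class representation: position of the value in its vocabulary."""
    return vocab.index(value)


def one_hot(value, vocab):
    """Binary representation: 1 at the value's position, 0 elsewhere."""
    return [1 if item == value else 0 for item in vocab]


# Crossed feature (illustrative): one category per (sex, class) pair.
CROSS_VOCAB = list(product(SEX_VOCAB, CLASS_VOCAB))


def cross(sex, passenger_class):
    """One-hot vector over every (sex, class) combination."""
    return one_hot((sex, passenger_class), CROSS_VOCAB)


print(to_index("Third", CLASS_VOCAB))  # 2
print(one_hot("Third", CLASS_VOCAB))   # [0, 0, 1]
print(cross("female", "First"))        # [0, 0, 0, 1, 0, 0]
```

Note that `feature_column.crossed_column` does not enumerate the combinations like this; it hashes each combination into a fixed number of buckets (`hash_bucket_size`), so the sketch only illustrates the idea of a cross.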

For this exercise we are going to be using the Titanic dataset again.

In [1]:
import functools
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Get the file paths of the titanic dataset
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv",TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
print(f'train: {train_file_path}')
print(f'test: {test_file_path}')

train: /Users/ness/.keras/datasets/train.csv
test: /Users/ness/.keras/datasets/eval.csv


# Use pandas for a brief Data Exploration

In [3]:
titanic_df = pd.read_csv(train_file_path, header='infer')
titanic_df

| | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.2500 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.9250 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 622 | 0 | male | 28.0 | 0 | 0 | 10.5000 | Second | unknown | Southampton | y |
| 623 | 0 | male | 25.0 | 0 | 0 | 7.0500 | Third | unknown | Southampton | y |
| 624 | 1 | female | 19.0 | 0 | 0 | 30.0000 | First | B | Southampton | y |
| 625 | 0 | female | 28.0 | 1 | 2 | 23.4500 | Third | unknown | Southampton | n |


# Create datasets (training and test) by choosing the 'survived' column as the target column (`label_name` parameter)

In [16]:
LABEL_COLUMN = 'survived'

train_ds = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=3,
    label_name=LABEL_COLUMN,
    na_value="?",
    num_epochs=1, 
    ignore_errors=True
)

test_ds = tf.data.experimental.make_csv_dataset(
    test_file_path, 
    batch_size=3,   
    label_name=LABEL_COLUMN, 
    na_value="?", 
    num_epochs=1,
    ignore_errors=True
)

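def describe_dataset(dataset):
    # Each element of the dataset is a (features, label) pair for one batch
    for features, label in dataset.take(1):
        print("target:            {}".format(label.numpy()))
        for key, value in features.items():
            # .numpy() prints the raw batch values instead of the tf.Tensor wrapper
            print("{:20s}: {}".format(key, value.numpy()))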

print("-------------Train dataset----------------")
describe_dataset(train_ds)
print("-------------Test dataset----------------")
describe_dataset(test_ds)

-------------Train dataset----------------
target:            [1 1 0]
sex                 : [b'male' b'female' b'male']
age                 : [28.  4. 29.]
n_siblings_spouses  : [0 1 1]
parch               : [0 1 0]
fare                : [35.5    23.      7.0458]
class               : [b'First' b'Second' b'Third']
deck                : [b'C' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton']
alone               : [b'y' b'n' b'n']
-------------Test dataset----------------
target:            [0 1 0]
sex                 : [b'female' b'male' b'male']
age                 : [ 9.  38.  36.5]
n_siblings_spouses  : [1 1 0]
parch               : [1 0 2]
fare                : [15.2458 90.     26.    ]
class               : [b'Third' b'First' b'Second']
deck                : [b'unknown' b'C' b'F']
embark_town         : [b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n']


# Preprocess Tabular data

Steps to preprocess data:
1. Designate columns by feature types
    - Numeric columns: `age`, `n_siblings_spouses`, `parch`, and `fare`
    - Categorical columns: `sex`, `class`, `deck`, `embark_town`, `alone`
2. Decide whether or not to embed or cross columns
3. Choose the columns of interest, possibly as an experiment
4. Create a Feature Layer for consumption by the training paradigm
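
Step 1 can be sketched without any framework. The snippet below is an illustration under assumptions: the two sample rows are copied from the data preview with the label column left out, and the helper `is_numeric` is hypothetical. It treats a column as numeric when every sampled value parses as a float, and as categorical otherwise.

```python
# Two sample rows copied from the Titanic preview (label column left out).
rows = [
    {"sex": "male", "age": "22.0", "n_siblings_spouses": "1", "parch": "0",
     "fare": "7.2500", "class": "Third", "deck": "unknown",
     "embark_town": "Southampton", "alone": "n"},
    {"sex": "female", "age": "38.0", "n_siblings_spouses": "1", "parch": "0",
     "fare": "71.2833", "class": "First", "deck": "C",
     "embark_town": "Cherbourg", "alone": "n"},
]


def is_numeric(values):
    """Return True when every value in the column parses as a float."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


numeric_columns = []
categorical_columns = []
for name in rows[0]:
    column_values = [row[name] for row in rows]
    if is_numeric(column_values):
        numeric_columns.append(name)
    else:
        categorical_columns.append(name)

print("numeric:", numeric_columns)
print("categorical:", categorical_columns)
```

A parsing heuristic like this is only a starting point: a column that stores categories as integer codes would be classified as numeric, so the final split is still a modelling decision.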

### Let's start by creating all the numeric columns and collecting them in a list called `feature_columns`

In [24]:
feature_columns = [feature_column.numeric_column(header) for header in ['age', 'n_siblings_spouses', 'parch', 'fare']]
print(feature_columns)


[NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='n_siblings_spouses', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


## Age

In [22]:
# Let's bucketize the values of the age column using the boundaries 23, 28, 35.
# Three boundaries produce four buckets: below 23, [23, 28), [28, 35), and 35 or above.
# If you are wondering why this specific set of ages: they are the quartiles (25%, 50%, 75%) of the age distribution in the dataset:
titanic_df.describe()

| | survived | age | n_siblings_spouses | parch | fare |
|---|---|---|---|---|---|
| count | 627.0 | 627.0 | 627.0 | 627.0 | 627.0 |
| mean | 0.38756 | 29.631308 | 0.545455 | 0.379585 | 34.385399 |
| std | 0.487582 | 12.511818 | 1.15109 | 0.792999 | 54.59773 |
| min | 0.0 | 0.75 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 23.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 0.0 | 28.0 | 0.0 | 0.0 | 15.0458 |
| 75% | 1.0 | 35.0 | 1.0 | 0.0 | 31.3875 |
| max | 1.0 | 80.0 | 8.0 | 5.0 | 512.3292 |


In [26]:
# Feature column for age bucketized
age = feature_column.numeric_column('age') # returns a type: NumericColumn
age_buckets = feature_column.bucketized_column(age, boundaries=[23,28,35]) # returns type BucketizedColumn
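
To make the bucketing concrete, here is a plain-Python sketch of the mapping that a bucketized column with boundaries `[23, 28, 35]` performs. The helpers `bucket_index` and `bucket_one_hot` are hypothetical names used only for illustration. The sketch relies on the documented behaviour that each bucket includes its left boundary and excludes its right one, so the three boundaries yield four buckets.

```python
from bisect import bisect_right

AGE_BOUNDARIES = [23, 28, 35]


def bucket_index(age, boundaries=AGE_BOUNDARIES):
    """Index of the bucket containing `age`.

    bisect_right counts the boundaries that are <= age, so an age equal
    to a boundary falls into the bucket that starts at that boundary.
    """
    return bisect_right(boundaries, age)


def bucket_one_hot(age, boundaries=AGE_BOUNDARIES):
    """One-hot vector with len(boundaries) + 1 slots."""
    vector = [0] * (len(boundaries) + 1)
    vector[bucket_index(age, boundaries)] = 1
    return vector


for age in [4.0, 23.0, 29.0, 36.5]:
    print(age, bucket_index(age), bucket_one_hot(age))
```

With these boundaries, an age of 4.0 lands in bucket 0, 23.0 in bucket 1, 29.0 in bucket 2, and 36.5 in bucket 3.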

# Categorical Columns
It helps to build a dictionary that holds the distinct (unique) values of each categorical column.

In [32]:
h = {}
for col in titanic_df:
    if col in ['sex', 'class', 'deck', 'embark_town', 'alone']:
        print(col, ':', titanic_df[col].unique())
        h[col] = titanic_df[col].unique()

sex : ['male' 'female']
class : ['Third' 'First' 'Second']
deck : ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town : ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone : ['n' 'y']


## With the dictionary containing the unique values for the categorical columns you can start creating one-hot encodings

In [33]:
# The first argument is the feature key and must match the name of the CSV column

# Encoding Categorical Column: sex
sex_type = feature_column.categorical_column_with_vocabulary_list('sex', h.get('sex').tolist())
sex_type_one_hot = feature_column.indicator_column(sex_type)

# Encoding Categorical Column: class
class_type = feature_column.categorical_column_with_vocabulary_list('class', h.get('class').tolist())
class_type_one_hot = feature_column.indicator_column(class_type)

# Encoding Categorical Column: deck
deck_type = feature_column.categorical_column_with_vocabulary_list('deck', h.get('deck').tolist())
deck_type_one_hot = feature_column.indicator_column(deck_type)

# Encoding Categorical Column: embark_town
embark_town_type = feature_column.categorical_column_with_vocabulary_list('embark_town', h.get('embark_town').tolist())
embark_town_type_one_hot = feature_column.indicator_column(embark_town_type)

# Encoding Categorical Column: alone
alone_type = feature_column.categorical_column_with_vocabulary_list('alone', h.get('alone').tolist())
alone_one_hot = feature_column.indicator_column(alone_type)

# Inspect one of the columns and its one-hot wrapper
print(sex_type)
print(sex_type_one_hot)


VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
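
The printed column shows `default_value=-1` and `num_oov_buckets=0`, which govern what happens to a value that is missing from the vocabulary. The following plain-Python sketch illustrates that lookup for the `deck` column. It rests on assumptions: the helpers `lookup` and `indicator` are hypothetical, the deck value `'T'` is an invented out-of-vocabulary example, and the all-zeros vector for the id -1 reflects my understanding of how an indicator column treats that id rather than something shown in this notebook.

```python
# Vocabulary taken from the unique values printed for the `deck` column.
DECK_VOCAB = ["unknown", "C", "G", "A", "B", "D", "F", "E"]
DEFAULT_VALUE = -1  # id assigned to values that are not in the vocabulary


def lookup(value, vocab, default_value=DEFAULT_VALUE):
    """Map a string to its vocabulary index, or default_value if absent."""
    try:
        return vocab.index(value)
    except ValueError:
        return default_value


def indicator(value, vocab):
    """One-hot vector; an out-of-vocabulary value yields all zeros."""
    vector = [0] * len(vocab)
    index = lookup(value, vocab)
    if index >= 0:
        vector[index] = 1
    return vector


print(lookup("C", DECK_VOCAB), indicator("C", DECK_VOCAB))
print(lookup("T", DECK_VOCAB), indicator("T", DECK_VOCAB))  # 'T' is not in the vocabulary
```

If unseen categories are expected at prediction time, setting `num_oov_buckets` to a positive number instead of relying on `default_value` gives them dedicated slots.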
