# Basic use of Keras-batchflow on Titanic data

The example below shows the most basic use of keras-batchflow: predicting survival in the Titanic disaster. It uses the well-known [Titanic dataset](https://www.kaggle.com/c/titanic/data) from [Kaggle](https://www.kaggle.com).

This dataset mixes categorical and numeric variables, which makes it a good showcase for the features of keras-batchflow.

## Data pre-processing

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('../data/titanic/train.csv')
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Suppose that, after exploratory analysis and model selection, only a few columns were chosen as features: **Pclass, Sex, Age, and Embarked**.

Let's see if there are any NAs to fill:

In [3]:
data[['Pclass', 'Sex', 'Age', 'Embarked', 'Survived']].isna().sum()

Pclass        0
Sex           0
Age         177
Embarked      2
Survived      0
dtype: int64

Let's fill those NAs:

In [4]:
data['Age'] = data['Age'].fillna(0)
data['Embarked'] = data['Embarked'].fillna('')

## Batch generator

I would like to build a simple neural network that uses embeddings for all categorical features and predicts whether a passenger survived.

To build such a model, I need to provide the number of levels of each categorical feature when declaring the embedding layers. Keras-batchflow can determine this parameter for each feature automatically, so I will build a generator first.

To build a batchflow generator, you first need to define your encoders, which map each categorical value to its integer representation. I will use sklearn's LabelEncoder for this purpose.

In [5]:
from sklearn.preprocessing import LabelEncoder

class_enc = LabelEncoder().fit(data['Pclass'])
sex_enc = LabelEncoder().fit(data['Sex'])
embarked_enc = LabelEncoder().fit(data['Embarked'].astype(str))
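
To make the mapping concrete, here is a minimal pure-Python sketch of what a label encoder does. It is not sklearn's implementation: `fit_label_mapping` and the sample values are made up for illustration. The only behaviour it borrows from `LabelEncoder` is that the distinct values are sorted before being numbered.

```python
def fit_label_mapping(values):
    """Assign consecutive integers to the sorted distinct values."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}

# Hypothetical sample of the Embarked column after NAs were filled with ''
embarked = ['S', 'C', 'S', '', 'Q']

mapping = fit_label_mapping(embarked)
encoded = [mapping[value] for value in embarked]

print(mapping)
print(encoded)
```

Note that the empty string used to fill the NAs becomes a category of its own, so a column with three real ports plus the fill value has four levels.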

Now I can define a batch generator. I will use the basic `BatchGenerator` class.

In [6]:
from keras_batchflow.batch_generator import BatchGenerator

x_structure = [
    ('Pclass', class_enc),
    ('Sex', sex_enc),
    ('Embarked', embarked_enc),
    ('Age', None),
]
y_structure = ('Survived', None)

bg_train = BatchGenerator(data,
                          x_structure=x_structure,
                          y_structure=y_structure,
                          shuffle=True,
                          batch_size=32)
# for simplicity, the test generator reuses the same dataframe
bg_test = BatchGenerator(data,
                         x_structure=x_structure,
                         y_structure=y_structure,
                         shuffle=True,
                         batch_size=32)

Using TensorFlow backend.


I can now check the first batch it generates:

In [7]:
bg_train[0]

([array([2, 0, 1, 2, 1, 2, 0, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 0, 1, 2, 0, 1,
         2, 2, 1, 2, 2, 1, 1, 2, 1, 2]),
  array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
         1, 1, 0, 1, 1, 0, 1, 1, 1, 1]),
  array([3, 3, 3, 2, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 3, 1, 1, 3,
         1, 1, 3, 3, 3, 3, 3, 2, 3, 3]),
  array([43.  , 35.  ,  0.  , 19.  , 31.  , 45.  ,  0.  ,  0.  ,  0.  ,
         24.  ,  0.  , 25.  , 39.  , 22.  , 52.  , 26.  ,  0.  , 16.  ,
         18.  , 22.  , 56.  , 26.  , 30.  ,  0.  ,  4.  , 26.  , 19.  ,
         27.  ,  0.83,  0.  , 28.  , 11.  ])],
 array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
        0, 1, 1, 0, 0, 1, 1, 0, 0, 0]))

This is exactly what Keras expects:

- the batch is a tuple (X, y)
- X is a list of numpy arrays, which is how Keras expects multiple inputs to be passed
- y is a single numpy array
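
The indexing contract described in the list above can be sketched with a toy class. This is an illustrative stand-in, not the keras-batchflow implementation: `ToyBatchGenerator` and its arguments are hypothetical, it works on plain Python lists instead of numpy arrays, and it does no shuffling or encoding.

```python
import math

class ToyBatchGenerator:
    """Illustrative stand-in; not the keras-batchflow implementation."""

    def __init__(self, columns, x_names, y_name, batch_size):
        self.columns = columns        # dict: column name -> list of values
        self.x_names = x_names        # names of the input columns, in order
        self.y_name = y_name          # name of the target column
        self.batch_size = batch_size

    def __len__(self):
        # number of batches, counting a final partial batch
        n_rows = len(self.columns[self.y_name])
        return math.ceil(n_rows / self.batch_size)

    def __getitem__(self, index):
        rows = slice(index * self.batch_size, (index + 1) * self.batch_size)
        x = [self.columns[name][rows] for name in self.x_names]  # one entry per input
        y = self.columns[self.y_name][rows]
        return x, y

# Hypothetical, already-encoded data
toy_columns = {
    'Pclass': [2, 0, 1, 2, 1],
    'Age': [43.0, 35.0, 0.0, 19.0, 31.0],
    'Survived': [0, 1, 0, 1, 0],
}
toy_bg = ToyBatchGenerator(toy_columns, ['Pclass', 'Age'], 'Survived', batch_size=2)
first_x, first_y = toy_bg[0]
print(len(toy_bg), first_x, first_y)
```

Each batch is a pair whose first element holds one sequence per input feature, mirroring the (X, y) layout of the real generator's output.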

Before building a Keras model, I'd like to show two keras-batchflow helpers for automated model creation: `shape` and `metadata`.

In [8]:
bg_train.shape

([(None,), (None,), (None,), (None,)], (None,))

In [9]:
bg_train.metadata

([{'name': 'Pclass',
   'encoder': LabelEncoder(),
   'shape': (None,),
   'dtype': dtype('int64'),
   'n_classes': 3},
  {'name': 'Sex',
   'encoder': LabelEncoder(),
   'shape': (None,),
   'dtype': dtype('int64'),
   'n_classes': 2},
  {'name': 'Embarked',
   'encoder': LabelEncoder(),
   'shape': (None,),
   'dtype': dtype('int64'),
   'n_classes': 4},
  {'name': 'Age',
   'encoder': None,
   'shape': (None,),
   'dtype': dtype('float64'),
   'n_classes': None}],
 {'name': 'Survived',
  'encoder': None,
  'shape': (None,),
  'dtype': dtype('int64'),
  'n_classes': None})
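
As a rough sketch of how this metadata can drive model construction, the snippet below separates categorical from numeric features using `n_classes` and computes the size of each embedding table. The list is a hand-simplified copy of the X metadata shown above (encoder, shape and dtype omitted), and the embedding width of 10 is an assumption made for the example.

```python
# Simplified copy of the X metadata (encoder, shape and dtype omitted)
metadata_x_example = [
    {'name': 'Pclass', 'n_classes': 3},
    {'name': 'Sex', 'n_classes': 2},
    {'name': 'Embarked', 'n_classes': 4},
    {'name': 'Age', 'n_classes': None},
]
EMBEDDING_DIM = 10  # assumed embedding width

# Features with a class count get an embedding; the rest are numeric inputs
categorical = [m['name'] for m in metadata_x_example if m['n_classes'] is not None]
numeric = [m['name'] for m in metadata_x_example if m['n_classes'] is None]

# An embedding table holds one vector of EMBEDDING_DIM weights per class
embedding_weights = {
    m['name']: m['n_classes'] * EMBEDDING_DIM
    for m in metadata_x_example
    if m['n_classes'] is not None
}

print(categorical, numeric, embedding_weights)
```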

## Keras model

In [10]:
from keras.layers import Input, Embedding, Dense, Concatenate, Lambda
from keras.models import Model
import keras.backend as K

metadata_x, metadata_y = bg_train.metadata
# define categorical and numeric inputs from X metadata
inputs = [Input(batch_shape=m['shape'], dtype=m['dtype']) for m in metadata_x]
# Define embeddings for categorical features (where n_classes not None) and connect them to inputs
embs = [Embedding(m['n_classes'], 10)(inp) for m, inp in zip(metadata_x, inputs) if m['n_classes'] is not None]
# separate numeric inputs
num_inps = [inp for m, inp in zip(metadata_x, inputs) if m['n_classes'] is None]
# expand dimension of num_inputs and convert to default float32 dtype
num_x = [Lambda(lambda x: K.expand_dims(K.cast(x, 'float32'), axis=-1))(ni) for ni in num_inps]
# merge all inputs
x = Concatenate()(embs + num_x)
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)
survived = Dense(1, activation='sigmoid')(x)

model = Model(inputs, survived)
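
As a sanity check on the architecture above, the layer sizes can be worked out by hand. The sketch below assumes each embedding yields 10 values per sample, the numeric input contributes one value after `expand_dims`, and every `Dense` layer uses a bias term (the Keras default). `dense_params` is a helper written for this example only.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

n_categorical = 3    # Pclass, Sex, Embarked
n_numeric = 1        # Age
embedding_dim = 10

# Width of the tensor produced by the Concatenate layer
concat_width = n_categorical * embedding_dim + n_numeric

# Walk through the three Dense layers, feeding each output width forward
layer_units = [64, 32, 1]
params = []
width = concat_width
for units in layer_units:
    params.append(dense_params(width, units))
    width = units

print(concat_width, params)
```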



In [11]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None,)              0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None,)              0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None,)              0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None,)              0                                            
____________________________________________________________________________________________

In [12]:
model.compile('adam', 'binary_crossentropy')
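
For readers unfamiliar with the loss named in the compile call, here is a minimal pure-Python sketch of binary cross-entropy averaged over a batch. It is not the Keras implementation; the function and the clipping constant `eps` are assumptions chosen for the example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over the batch."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss,
# confident and wrong predictions give a large one
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(confident_right, confident_wrong)
```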

