## Titanic Competition with TensorFlow Decision Forests

This notebook demonstrates some basic preprocessing: ticket strings are split into a prefix and a number, and passenger names are tokenized. It then walks through training a baseline Gradient Boosted Trees (GBT) model with default parameters using TensorFlow Decision Forests (TF-DF).

### Import dependencies

In [5]:
import numpy as np
import pandas as pd
import os

!pip install tensorflow
!pip install tensorflow_decision_forests

import tensorflow as tf
import tensorflow_decision_forests as tfdf

print(f"Found TF-DF {tfdf.__version__}")



Found TF-DF 1.6.0


### Load datasets

In [7]:
train_df = pd.read_csv("/Users/moshingliu/Desktop/项目实习XDF/互联网项目/train.csv")
test_df = pd.read_csv("/Users/moshingliu/Desktop/项目实习XDF/互联网项目/test.csv")

train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### Prepare dataset

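We apply a few transformations to the dataset: punctuation is stripped from passenger names, and each ticket is split into its number (`Ticket_number`) and its prefix (`Ticket_item`, or `NONE` when the ticket has no prefix).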

In [8]:
def preprocess(df):
    df = df.copy()
    
    def normalize_name(x):
        return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])
    
    def ticket_number(x):
        return x.split(" ")[-1]
        
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])
    
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)                     
    return df
    
preprocessed_train_df = preprocess(train_df)
preprocessed_test_df = preprocess(test_df)

preprocessed_train_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_number,Ticket_item
0,1,0,3,Braund Mr Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S,21171,A/5
1,2,1,1,Cumings Mrs John Bradley Florence Briggs Thayer,female,38.0,1,0,PC 17599,71.2833,C85,C,17599,PC
2,3,1,3,Heikkinen Miss Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S,3101282,STON/O2.
3,4,1,1,Futrelle Mrs Jacques Heath Lily May Peel,female,35.0,1,0,113803,53.1,C123,S,113803,NONE
4,5,0,3,Allen Mr William Henry,male,35.0,0,0,373450,8.05,,S,373450,NONE
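
To make the ticket handling concrete, the sketch below re-applies the two ticket helpers from `preprocess` to individual strings, without pandas. The first three tickets come from the rows shown above; `"AB CD 123"` is a hypothetical value, included only to show how a multi-part prefix is joined with underscores.

```python
def ticket_number(ticket: str) -> str:
    # The last space-separated piece is treated as the ticket number.
    return ticket.split(" ")[-1]


def ticket_item(ticket: str) -> str:
    # Everything before the last piece is the prefix; "NONE" if there is none.
    items = ticket.split(" ")
    if len(items) == 1:
        return "NONE"
    return "_".join(items[:-1])


# "AB CD 123" is a made-up ticket used only for illustration.
samples = ["A/5 21171", "113803", "STON/O2. 3101282", "AB CD 123"]
parsed = [(ticket_number(t), ticket_item(t)) for t in samples]

for ticket, (number, item) in zip(samples, parsed):
    print(f"{ticket!r:>20} -> number={number!r}, item={item!r}")
```

Note that `Ticket_number` remains a string after this step, so it is not automatically a numerical column.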


In [9]:
input_features = list(preprocessed_train_df.columns)
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
#input_features.remove("Ticket_number")

print(f"Input features: {input_features}")

Input features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'Ticket_number', 'Ticket_item']
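
The same selection can be written as a single filter over the column names. This is only an equivalent sketch: the `columns` list is copied by hand from the preprocessed table above, and `excluded` is a name introduced here for illustration.

```python
# Column names as they appear in the preprocessed DataFrame shown earlier.
columns = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp",
    "Parch", "Ticket", "Fare", "Cabin", "Embarked",
    "Ticket_number", "Ticket_item",
]

# Identifier, label, and the raw ticket string are not used as inputs.
excluded = {"Ticket", "PassengerId", "Survived"}

# Keep the original column order while dropping the excluded names.
input_features = [c for c in columns if c not in excluded]
print(f"Input features: {input_features}")
```

Unlike repeated calls to `list.remove`, this form does not raise an error if one of the excluded columns is missing, which matters for the test set because it has no `Survived` column.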


### Convert Pandas dataset to TensorFlow Dataset

In [11]:
def tokenize_names(features, labels=None):
    """Divide the names into tokens. TF-DF can consume text tokens natively."""
    features["Name"] =  tf.strings.split(features["Name"])
    return features, labels

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_train_df,label="Survived").map(tokenize_names)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_test_df).map(tokenize_names)
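
As a plain-Python analogue of what happens to one name, the sketch below applies the same normalization as `preprocess` and then splits on whitespace. It uses `str.split()` as a stand-in for `tf.strings.split`, on the assumption that both split on runs of whitespace when no separator is given; it illustrates the idea and is not the TensorFlow code path itself.

```python
def normalize_name(name: str) -> str:
    # Strip surrounding punctuation from each space-separated part.
    return " ".join(part.strip(",()[].\"'") for part in name.split(" "))


# Raw name taken from the training rows shown earlier.
raw_name = "Futrelle, Mrs. Jacques Heath (Lily May Peel)"

normalized = normalize_name(raw_name)
# Stand-in for tf.strings.split: one token per whitespace-separated word.
tokens = normalized.split()

print(normalized)
print(tokens)
```

Each name becomes a variable-length list of tokens, so a title such as `Mrs` or a shared surname can be picked up by the model as an individual item rather than as part of one unique string.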

### Train model with default parameters

#### Train model
First, we train a Gradient Boosted Trees model with the default parameters.

In [13]:
model = tfdf.keras.GradientBoostedTreesModel(
    verbose=0,
    features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features=True,
    random_seed=1234,
)
model.fit(train_ds)

self_evaluation = model.make_inspector().evaluation()
print(f"Accuracy: {self_evaluation.accuracy} Loss:{self_evaluation.loss}")



Accuracy: 0.804347813129425 Loss:0.8922085165977478


[INFO 23-10-19 23:01:35.4418 EDT kernel.cc:1233] Loading model from path /var/folders/09/74dcx6ys20v6brjfqswsm1gm0000gn/T/tmpkmyrb2bd/model/ with prefix 0a13d9a3f2314d69
[INFO 23-10-19 23:01:35.4435 EDT abstract_model.cc:1344] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 23-10-19 23:01:35.4436 EDT kernel.cc:1061] Use fast generic engine


### Train model with improved default parameters

Next, we set several hyperparameters explicitly when creating the GBT model.

In [14]:
model = tfdf.keras.GradientBoostedTreesModel(
    verbose=0,
    features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features=True,
    min_examples=1,
    categorical_algorithm="RANDOM",
    shrinkage=0.05,
    split_axis="SPARSE_OBLIQUE",
    sparse_oblique_normalization="MIN_MAX",
    sparse_oblique_num_projections_exponent=2.0,
    num_trees=2000,
    random_seed=1234,
    
)
model.fit(train_ds)

self_evaluation = model.make_inspector().evaluation()
print(f"Accuracy: {self_evaluation.accuracy} Loss:{self_evaluation.loss}")



Accuracy: 0.77173912525177 Loss:1.0119096040725708


[INFO 23-10-19 23:12:06.9968 EDT kernel.cc:1233] Loading model from path /var/folders/09/74dcx6ys20v6brjfqswsm1gm0000gn/T/tmphm_zbb82/model/ with prefix dfa88ff065b44edb
[INFO 23-10-19 23:12:07.0019 EDT decision_forest.cc:660] Model loaded with 54 root(s), 2824 node(s), and 10 input feature(s).
[INFO 23-10-19 23:12:07.0019 EDT abstract_model.cc:1344] Engine "GradientBoostedTreesGeneric" built
[INFO 23-10-19 23:12:07.0019 EDT kernel.cc:1061] Use fast generic engine
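
To put the two self-evaluation accuracies in perspective, the sketch below recovers the simple fractions behind the printed floats. It assumes that the self-evaluation is computed on a small validation split held out during training, and `assumed_validation_size = 92` is an inference (the smallest size consistent with both fractions), not a value reported in the output above.

```python
from fractions import Fraction

# Accuracies copied from the two training runs above.
default_accuracy = 0.804347813129425
tuned_accuracy = 0.77173912525177

# Find the closest fractions with a small denominator.
default_frac = Fraction(default_accuracy).limit_denominator(200)
tuned_frac = Fraction(tuned_accuracy).limit_denominator(200)
print(default_frac, tuned_frac)  # 37/46 71/92

# Assumption: both runs were evaluated on the same 92 held-out examples.
assumed_validation_size = 92
default_correct = default_frac * assumed_validation_size
tuned_correct = tuned_frac * assumed_validation_size
gap = default_correct - tuned_correct
print(default_correct, tuned_correct, gap)  # 74 71 3
```

Under that assumption, the drop from about 0.804 to about 0.772 corresponds to three examples, so the comparison between the two configurations should be read with caution.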


Next, we print the model summary, which includes the variable importances computed by the model.

In [15]:
model.summary()

Model: "gradient_boosted_trees_model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1 (1.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1 (1.00 Byte)
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (11):
	Age
	Cabin
	Embarked
	Fare
	Name
	Parch
	Pclass
	Sex
	SibSp
	Ticket_item
	Ticket_number

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1.           "Sex"  0.431313 ################
    2.           "Age"  0.373986 ############
    3.          "Fare"  0.250863 ####
    4.          "Name"  0.225272 ###
    5.   "Ticket_item"  0.182111 
    6.      "Embarked"  0.181387 
    7. "Ticket_number"  0.180897 
    8.        "Pclass"  0.178279 
    9.         "Parch"  0.175249 
   10.         "SibSp"  0.172167 

Variable Importance: NUM_AS_ROOT:
    1.  "Sex" 39.00000