# Multi-Label Classification with Enzyme Substrate Dataset using TensorFlow Decision Forests

This notebook walks you through training a baseline Gradient Boosted Trees model with TensorFlow Decision Forests on the Multi-Label Classification with Enzyme Substrate Dataset made available for this Playground Series competition.

The goal of the model is to predict EC1 and EC2 from all the features, excluding the other secondary labels: EC3, EC4, EC5 and EC6.

Roughly, the code will look as follows:

```
import tensorflow_decision_forests as tfdf
import pandas as pd

dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

model = tfdf.keras.GradientBoostedTreesModel(...)
model.fit(tf_dataset)

model.summary()  # prints the summary itself, so no print() is needed

create_submission(test_ds)  # placeholder for the submission step at the end of the notebook
```

Decision Forests are a family of tree-based models that includes Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them. For this dataset, handling multiple labels is particularly easy, as you'll see.

# Import the libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow_decision_forests as tfdf
import tensorflow as tf

import matplotlib.pyplot as plt



In [2]:
print(f"TensorFlow Decision Forests version: {tfdf.__version__}")

TensorFlow Decision Forests version: 1.3.0


# Load the Dataset

In [3]:
train_pd = pd.read_csv("/kaggle/input/playground-series-s3e18/train.csv")
test_pd = pd.read_csv("/kaggle/input/playground-series-s3e18/test.csv")

In [4]:
train_pd.head()

,id,BertzCT,Chi1,Chi1n,Chi1v,Chi2n,Chi2v,Chi3v,Chi4n,EState_VSA1,...,SlogP_VSA3,VSA_EState9,fr_COO,fr_COO2,EC1,EC2,EC3,EC4,EC5,EC6
0,0,323.390782,9.879918,5.875576,5.875576,4.304757,4.304757,2.754513,1.749203,0.0,...,4.794537,35.527357,0,0,1,1,0,0,0,0
1,1,273.723798,7.259037,4.441467,5.834958,3.285046,4.485235,2.201375,1.289775,45.135471,...,13.825658,44.70731,0,0,0,1,1,0,0,0
2,2,521.643822,10.911303,8.527859,11.050864,6.665291,9.519706,5.824822,1.770579,15.645394,...,17.964475,45.66012,0,0,1,1,0,0,1,0
3,3,567.431166,12.453343,7.089119,12.833709,6.478023,10.978151,7.914542,3.067181,95.639554,...,31.961948,87.509997,0,0,1,1,0,0,0,0
4,4,112.770735,4.414719,2.866236,2.866236,1.875634,1.875634,1.03645,0.727664,17.980451,...,9.589074,33.333333,2,2,1,0,1,1,1,0


# Quick basic dataset exploration

In [5]:
train_pd.describe()

,id,BertzCT,Chi1,Chi1n,Chi1v,Chi2n,Chi2v,Chi3v,Chi4n,EState_VSA1,...,SlogP_VSA3,VSA_EState9,fr_COO,fr_COO2,EC1,EC2,EC3,EC4,EC5,EC6
count,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,...,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0,14838.0
mean,7418.5,515.153604,9.135189,5.854307,6.738497,4.43257,5.253221,3.418749,1.773472,29.202823,...,13.636941,49.309959,0.458215,0.459226,0.667745,0.798962,0.313789,0.279081,0.144831,0.15157
std,4283.505982,542.45637,6.819989,4.647064,5.866444,3.760516,4.925065,3.436208,1.865898,31.728679,...,14.598554,29.174824,0.667948,0.668111,0.471038,0.40079,0.464047,0.448562,0.351942,0.358616
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-5.430556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3709.25,149.103601,4.680739,2.844556,2.932842,1.949719,2.034468,1.160763,0.503897,5.969305,...,4.794537,30.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,7418.5,290.987941,6.48527,4.052701,4.392859,2.970427,3.242775,1.948613,1.073261,17.353601,...,9.589074,41.666667,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
75%,11127.75,652.652585,11.170477,7.486791,8.527859,5.788793,6.60935,4.50207,2.534281,44.876559,...,14.912664,56.09065,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
max,14837.0,4069.95978,69.551167,50.174588,53.431954,32.195368,34.579313,22.880836,16.07281,363.705954,...,115.406157,384.450519,8.0,8.0,1.0,1.0,1.0,1.0,1.0,1.0


Define the labels to use during training and the columns to drop from the dataset.

In [6]:
primary_labels = ["EC1", "EC2"]
secondary_labels = ["EC3", "EC4", "EC5", "EC6"]
non_feature_columns = ["id"]
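
Before training, it can help to check how balanced each primary label is. The `describe()` output above reports means of about 0.67 for `EC1` and 0.80 for `EC2`, which for 0/1 columns is the fraction of positive examples. A minimal sketch of the same check on a tiny hand-made frame (`toy_pd` and its values are illustrative assumptions, not competition data):

```python
import pandas as pd

primary_labels = ["EC1", "EC2"]

# Hypothetical toy frame standing in for train_pd.
toy_pd = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "EC1": [1, 0, 1, 1],
    "EC2": [1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of positive examples.
positive_rate = toy_pd[primary_labels].mean()
print(positive_rate)
```

On the real data, the same expression would be applied to `train_pd`.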

Create a TensorFlow dataset from the pandas DataFrame. Pay special attention to the use of multiple label keys (`EC1` and `EC2`). The other labels are dropped because this notebook does not use them.

In [7]:
def to_tf_dataset(pd_dataset: pd.DataFrame, label_keys: list[str], dropped_features: list[str]) -> tf.data.Dataset:
    # Features: every column that is neither a label nor explicitly dropped.
    features = dict(pd_dataset.drop(label_keys + dropped_features, axis=1))
    # Labels: one dictionary entry per label key.
    labels = dict(pd_dataset[label_keys])
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(100)

train_tf = to_tf_dataset(train_pd, label_keys=primary_labels, dropped_features=non_feature_columns + secondary_labels)
# The test set is passed without label keys, so only the id column is dropped.
test_tf = to_tf_dataset(test_pd, label_keys=[], dropped_features=non_feature_columns)
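
To see what the helper hands to `tf.data.Dataset.from_tensor_slices`, the sketch below reproduces only the pandas part of the split on a toy frame, without any TensorFlow call. The frame `toy_pd`, its values, and the variable name `dropped_columns` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one feature, two kept labels and one dropped label.
toy_pd = pd.DataFrame({
    "id": [0, 1],
    "BertzCT": [323.39, 273.72],
    "EC1": [1, 0],
    "EC2": [1, 1],
    "EC3": [0, 1],
})

label_keys = ["EC1", "EC2"]
dropped_columns = ["id", "EC3"]

# dict() over a DataFrame yields a {column name: Series} mapping.
features = dict(toy_pd.drop(label_keys + dropped_columns, axis=1))
labels = dict(toy_pd[label_keys])

print(sorted(features))  # names of the feature columns
print(sorted(labels))    # names of the label columns
```

Both dictionaries are keyed by column name, which is why the labels later come back from the model under the same keys.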

# Train the model

Training a Gradient Boosted Trees model is straightforward.

For the multi-label case, you only need one extra parameter at model creation: `multitask`, which declares each label as its own classification task.

In [8]:
model = tfdf.keras.GradientBoostedTreesModel(
    multitask=[tfdf.keras.MultiTaskItem(label=l, task=tfdf.keras.Task.CLASSIFICATION) for l in primary_labels],
    verbose=1,
)
model.fit(train_tf)

Use /tmp/tmp4nfxo1ur as temporary training directory
Reading training dataset...
Training dataset read in 0:00:08.108786. Found 14838 examples.
Training model...
Model trained in 0:00:04.951313
Compiling model...


[INFO 23-07-06 15:01:58.8473 UTC kernel.cc:1242] Loading model from path /tmp/tmp4nfxo1ur/model/ with prefix ef49929f46df4571_0
[INFO 23-07-06 15:01:58.8569 UTC abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 23-07-06 15:01:58.8569 UTC kernel.cc:1074] Use fast generic engine
[INFO 23-07-06 15:01:58.8646 UTC kernel.cc:1242] Loading model from path /tmp/tmp4nfxo1ur/model/ with prefix ef49929f46df4571_1
[INFO 23-07-06 15:01:58.8681 UTC kernel.cc:1074] Use fast generic engine


Model compiled.


<keras.callbacks.History at 0x7a9ffa2aecb0>
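
The cell above trains on every row of `train_pd`, so the notebook has no held-out data for estimating model quality. One possible approach, sketched here with pandas only, is to set aside a random fraction of rows before building the TensorFlow datasets. The helper name `split_train_valid`, the 20% fraction, and the seed are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def split_train_valid(df: pd.DataFrame, valid_fraction: float = 0.2, seed: int = 42):
    """Hypothetical helper: random row-wise split into train and validation frames."""
    valid = df.sample(frac=valid_fraction, random_state=seed)
    train = df.drop(valid.index)  # every row that was not sampled for validation
    return train, valid

# Toy frame standing in for train_pd.
toy_pd = pd.DataFrame({"id": range(10), "EC1": [0, 1] * 5})
train_part, valid_part = split_train_valid(toy_pd)
print(len(train_part), len(valid_part))
```

Each part could then be passed through `to_tf_dataset` in the same way as `train_pd`.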

For more information about the trained model, call the `summary` method.

In [9]:
model.summary()

Model: "gradient_boosted_trees_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "MULTITASKER"
Task: CLASSIFICATION
Label: "EC1"

Input Features (31):
	BertzCT
	Chi1
	Chi1n
	Chi1v
	Chi2n
	Chi2v
	Chi3v
	Chi4n
	EState_VSA1
	EState_VSA2
	ExactMolWt
	FpDensityMorgan1
	FpDensityMorgan2
	FpDensityMorgan3
	HallKierAlpha
	HeavyAtomMolWt
	Kappa3
	MaxAbsEStateIndex
	MinEStateIndex
	NumHeteroatoms
	PEOE_VSA10
	PEOE_VSA14
	PEOE_VSA6
	PEOE_VSA7
	PEOE_VSA8
	SMR_VSA10
	SMR_VSA5
	SlogP_VSA3
	VSA_EState9
	fr_COO
	fr_COO2

No weights

Variable Importance disabled i.e. compute_oob_variable_importances=false.
Cannot compute model self evaluation:This model does not support evaluation reports.
model #0:
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "EC1"

Input Features (31):


With the model trained, you can now run inference on the test data to prepare the submission. For this multitask model, `predict` returns a dictionary with one array of predictions per label.

In [10]:
prediction = model.predict(test_tf)

prediction



{'EC1': array([[0.47080842],
        [0.80730313],
        [0.77015865],
        ...,
        [0.44579405],
        [0.5141975 ],
        [0.4168353 ]], dtype=float32),
 'EC2': array([[0.7818088 ],
        [0.85070294],
        [0.7597029 ],
        ...,
        [0.8409997 ],
        [0.84569997],
        [0.8245511 ]], dtype=float32)}
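
Each entry in the dictionary above has shape `(n, 1)`, so it usually needs flattening before further use. If hard 0/1 labels were ever needed instead of scores, a cut-off could be applied as sketched below. The 0.5 threshold is an assumption, and the arrays are just the first three rows copied from the output above.

```python
import numpy as np

# First three rows of each label, copied from the prediction output above.
prediction = {
    "EC1": np.array([[0.47080842], [0.80730313], [0.77015865]], dtype=np.float32),
    "EC2": np.array([[0.7818088], [0.85070294], [0.7597029]], dtype=np.float32),
}

threshold = 0.5  # assumed cut-off, not taken from the notebook

# Flatten each (n, 1) array to shape (n,) and compare against the threshold.
hard_labels = {
    name: (scores.flatten() >= threshold).astype(int)
    for name, scores in prediction.items()
}
print(hard_labels)
```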

# Create the submission

In [11]:
prediction_pd = pd.DataFrame({
    "id": test_pd["id"],
    "EC1": prediction["EC1"].flatten(),
    "EC2": prediction["EC2"].flatten(),
})

prediction_pd.to_csv("submission.csv", index=False)

prediction_pd

,id,EC1,EC2
0,14838,0.470808,0.781809
1,14839,0.807303,0.850703
2,14840,0.770159,0.759703
3,14841,0.706230,0.824318
4,14842,0.794809,0.750287
...,...,...,...
9888,24726,0.631759,0.759918
9889,24727,0.775003,0.813876
9890,24728,0.445794,0.841000
9891,24729,0.514198,0.845700


In [12]:
!head submission.csv

id,EC1,EC2
14838,0.47080842,0.7818088
14839,0.80730313,0.85070294
14840,0.77015865,0.7597029
14841,0.7062302,0.824318
14842,0.79480875,0.7502875
14843,0.43836853,0.81279343
14844,0.5377519,0.8449801
14845,0.5772134,0.8367062
14846,0.6690013,0.7614374
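
Before uploading, a few basic checks on the submission frame can catch common mistakes such as missing rows or values outside the expected range. The sketch below uses a hypothetical helper, `check_submission`, and a toy frame built from the first three rows shown above. The specific checks are assumptions about what a sensible file looks like, not rules stated by the competition.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, n_expected: int) -> None:
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(sub.columns) == ["id", "EC1", "EC2"], "unexpected columns"
    assert len(sub) == n_expected, "row count differs from the test set"
    assert sub["id"].is_unique, "duplicate ids"
    scores = sub[["EC1", "EC2"]]
    assert not scores.isna().any().any(), "missing predictions"
    assert ((scores >= 0) & (scores <= 1)).all().all(), "scores outside [0, 1]"

# Toy frame using the first three rows of the submission shown above.
toy_sub = pd.DataFrame({
    "id": [14838, 14839, 14840],
    "EC1": [0.47080842, 0.80730313, 0.77015865],
    "EC2": [0.7818088, 0.85070294, 0.7597029],
})
check_submission(toy_sub, n_expected=3)
print("submission looks consistent")
```

On the real data, the call would be `check_submission(prediction_pd, n_expected=len(test_pd))`.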
