# Titanic Example
An example of using `smart-pandas` to define a schema for an ML project using the Titanic dataset.

## Setup

In [1]:
from smart_pandas import pandas as pd
from sklearn import datasets
import numpy as np

In [2]:
def get_data() -> pd.DataFrame:
    data_objects = datasets.fetch_openml(name='titanic', version=1)
    data = data_objects["frame"]
    data.loc[:, "id"] = np.arange(data.shape[0])
    data.drop(columns=["boat", "body", "home.dest"], inplace=True)
    return data

In [3]:
data = get_data()

## Init Smart Pandas Config

Let's initialise the smart-pandas config, and then have a look at the different properties it gives us.

In [4]:
data.smart_pandas.init(config_path="titanic_config.yaml")

In [5]:
data.smart_pandas.state

State(name=StateName.RAW, ml_stage=MLStage.TRAINING)

We can see that the current state of our dataframe is 'RAW, TRAINING'. That implies two things:
- We are yet to calculate any of our 'derived_features'
- Our target column is present in our dataframe
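
To make the two implications above concrete, here is a minimal sketch of how a state could be inferred from the columns that are present. It does not reproduce smart-pandas internals: `infer_state` is a hypothetical helper, the enums are re-declared locally, and `MLStage.INFERENCE` is an assumed name for the case where the target is absent.

```python
from enum import Enum


class StateName(Enum):
    RAW = "raw"
    PROCESSED = "processed"


class MLStage(Enum):
    TRAINING = "training"
    INFERENCE = "inference"  # assumed name for the no-target case


def infer_state(columns, derived_features, target):
    """Hypothetical helper: infer (state, stage) from which columns exist."""
    present = set(columns)
    # PROCESSED only once every derived feature has been calculated.
    name = StateName.PROCESSED if set(derived_features) <= present else StateName.RAW
    # TRAINING whenever the target column is available.
    stage = MLStage.TRAINING if target in present else MLStage.INFERENCE
    return name, stage


raw_columns = ["pclass", "sex", "cabin", "survived"]
print(infer_state(raw_columns, ["number_of_cabins"], "survived"))
```

With the raw Titanic columns this gives the RAW and TRAINING pair, matching the state shown above.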

In [6]:
data.smart_pandas.target.head()

|   | survived |
|---|----------|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [7]:
data.smart_pandas.raw_features.head()

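|   | pclass | sex | age | sibsp | parch | cabin | embarked | fare |
|---|--------|-----|-----|-------|-------|-------|----------|------|
| 0 | 1 | female | 29.0 | 0 | 0 | B5 | S | 211.3375 |
| 1 | 1 | male | 0.9167 | 1 | 2 | C22 C26 | S | 151.55 |
| 2 | 1 | female | 2.0 | 1 | 2 | C22 C26 | S | 151.55 |
| 3 | 1 | male | 30.0 | 1 | 2 | C22 C26 | S | 151.55 |
| 4 | 1 | female | 25.0 | 1 | 2 | C22 C26 | S | 151.55 |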


In [8]:
data.smart_pandas.config.derived_features

['number_of_cabins']

We can see that we only have one 'derived_feature' that we are expecting to generate later. Finally, let's validate our data in its current state against the internal Pandera schema.

In [9]:
# validate the raw data against the schema
data.smart_pandas.validate(inplace=True)

You'll notice that even though our dataset is missing the 'number_of_cabins' feature, it still passed the data validation. That is because the schema is built dynamically based on the data state and the different column tags. Given we are still in the RAW state, it is not expecting any 'derived_features' to exist, and hence does not include them in the schema.
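
As a rough illustration of that idea, the sketch below builds the list of expected columns from a tag mapping and the current state. It is only a sketch under assumptions: the `COLUMN_TAGS` mapping, the tag names, and the `expected_columns` and `missing_columns` helpers are hypothetical, and the real schema is built by smart-pandas from `titanic_config.yaml` using Pandera.

```python
# Hypothetical column-to-tag mapping (assumption); the real tags live in the config file.
COLUMN_TAGS = {
    "survived": "target",
    "pclass": "raw_feature",
    "cabin": "raw_feature",
    "number_of_cabins": "derived_feature",
    "name": "metadata",
}


def expected_columns(state):
    """Columns a schema for the given state would check."""
    # In the RAW state, derived features are not expected to exist yet.
    skipped_tags = {"derived_feature"} if state == "RAW" else set()
    return [col for col, tag in COLUMN_TAGS.items() if tag not in skipped_tags]


def missing_columns(columns, state):
    """Expected columns that are absent from the data."""
    return [col for col in expected_columns(state) if col not in columns]


raw_columns = ["survived", "pclass", "cabin", "name"]
print(missing_columns(raw_columns, "RAW"))        # []
print(missing_columns(raw_columns, "PROCESSED"))  # ['number_of_cabins']
```

The same data passes in the RAW state and fails in the PROCESSED state, because only the second schema includes the derived feature.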

## Feature Engineering

In [10]:
def engineer_features(data: pd.DataFrame) -> pd.DataFrame:
    data["cabin"] = data["cabin"].fillna("missing")
    data["cabin"] = data["cabin"].astype(str)
    data["number_of_cabins"] = data["cabin"].apply(lambda x: len(x.split(" ")))

    data["embarked"] = data["embarked"].fillna("missing")
    data["age"] = data["age"].fillna(data["age"].mean())
    data["fare"] = data["fare"].fillna(data["fare"].mean())
    return data

In [11]:
data = engineer_features(data)
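
To see what the 'number_of_cabins' expression produces, here is the same `split` logic applied to a few plain strings. The `number_of_cabins` function is a standalone helper written for this illustration; the first two cabin values are taken from the raw features shown earlier.

```python
def number_of_cabins(cabin: str) -> int:
    # Same expression as the lambda in engineer_features.
    return len(cabin.split(" "))


print(number_of_cabins("B5"))       # 1
print(number_of_cabins("C22 C26"))  # 2
print(number_of_cabins("missing"))  # 1
```

Note that a passenger with no recorded cabin is filled with "missing" first and therefore counts as 1 cabin rather than 0. If a zero count were preferred, that case would need to be handled separately.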

Now that we've created the 'number_of_cabins' feature, let's update the state and check it again.

In [12]:
data.smart_pandas.update_state()
data.smart_pandas.state

State(name=StateName.PROCESSED, ml_stage=MLStage.TRAINING)

You'll see the state has changed to PROCESSED, which indicates that all modelling features are present in the data. Now when we run `validate()` again, the schema expects the 'number_of_cabins' feature and validates it as well.

In [13]:
data.smart_pandas.validate(inplace=True)

## Modelling
The modelling step below uses CatBoost to train a simple model on the data. We are committing a huge collection of modelling faux pas in this notebook (such as not splitting our data into separate samples for training and evaluation), but it is simply to demonstrate the functionality of smart-pandas, so cut us some slack :)

Notice how, when training the ML model, we can access the correct parts of the data through the smart-pandas attributes rather than having to list them explicitly. This means we can define all our features in one place (the config).

In [14]:
from catboost import CatBoostClassifier

In [15]:
model = CatBoostClassifier()
model.fit(
    data.smart_pandas.model_features,
    data.smart_pandas.target,
    # smart-pandas has no tag for categorical features yet (it's in the backlog), so select them by dtype
    cat_features=[x for x in data.smart_pandas.config.model_features if data[x].dtype == "object"],
    verbose=False,
)

<catboost.core.CatBoostClassifier at 0x1255b3af0>

In [16]:
predictions = model.predict(data.smart_pandas.model_features)
predictions

array([1, 1, 1, ..., 0, 0, 0])

## Building Output Payload
If you want to bundle up your output to pass on to another service, you can easily attach any extra information you need through the metadata tags.

In [17]:
output = data.smart_pandas.metadata.copy()
output["predictions"] = predictions

In [18]:
output.head()

|   | name | ticket | predictions |
|---|------|--------|-------------|
| 0 | Allen, Miss. Elisabeth Walton | 24160 | 1 |
| 1 | Allison, Master. Hudson Trevor | 113781 | 1 |
| 2 | Allison, Miss. Helen Loraine | 113781 | 1 |
| 3 | Allison, Mr. Hudson Joshua Creighton | 113781 | 0 |
| 4 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | 113781 | 1 |
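
If the downstream service expects JSON, the payload could be serialised along the lines of the sketch below. This is one possible approach rather than a smart-pandas feature: the lists stand in for the first two rows of `output`, and the `records` key is an arbitrary choice made for this example.

```python
import json

# Stand-ins for the first two rows of the output shown above.
names = ["Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor"]
tickets = ["24160", "113781"]
predictions = [1, 1]

records = [
    # int() matters when predictions come from NumPy, since the json module
    # cannot serialise NumPy integer types directly.
    {"name": name, "ticket": ticket, "predictions": int(pred)}
    for name, ticket, pred in zip(names, tickets, predictions)
]

payload = json.dumps({"records": records})
print(payload)
```

With the real dataframe, `output.to_dict(orient="records")` should give a similar list of row dictionaries to pass to `json.dumps`.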
