# Introduction to `mleko` with Titanic

This notebook is a quick introduction to the `mleko` package. We will use the Titanic dataset to predict whether a passenger survived.

The library provides two subpackages for data processing and model training:
- `dataset`: Subpackage for handling and processing datasets.
  - `ingest`: Module for ingesting (loading) data from various sources.
    - `BaseIngester`: Base class for all ingesters.
    - `KaggleIngester`: Ingests data from Kaggle.
    - `S3Ingester`: Ingests data from Amazon S3.
  - `convert`: Module for converting data into different formats.
    - `BaseConverter`: Base class for all converters.
    - `CSVToVaexConverter`: Converts CSV data into a Vaex DataFrame.
  - `split`: Module for splitting datasets into training and testing sets.
    - `BaseSplitter`: Base class for all splitters.
    - `RandomSplitter`: Splits data randomly.
    - `ExpressionSplitter`: Splits data based on a given expression.
  - `transform`: Module for transforming datasets.
    - `BaseTransformer`: Base class for all transformers.
    - `CompositeTransformer`: Combines multiple transformers.
    - `FrequencyEncoderTransformer`: Encodes categorical variables based on their frequency.
    - `LabelEncoderTransformer`: Encodes categorical variables with unique labels.
    - `MaxAbsScalerTransformer`: Scales each feature by its maximum absolute value.
    - `MinMaxScalerTransformer`: Scales each feature to a given range.
  - `feature_select`: Module for feature selection.
    - `BaseFeatureSelector`: Base class for all feature selectors.
    - `CompositeFeatureSelector`: Combines multiple feature selectors.
    - `InvarianceFeatureSelector`: Selects features based on their invariance.
    - `MissingRateFeatureSelector`: Selects features based on their missing rate.
    - `PearsonCorrelationFeatureSelector`: Selects features based on their Pearson correlation.
    - `VarianceFeatureSelector`: Selects features based on their variance.
- `model`: Subpackage for building and training models.
  - `BaseModel`: Base class for all models.
  - `LGBMModel`: Trains a LightGBM model.
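
To make the encoder and scaler descriptions above concrete, here is a plain-Python sketch of the underlying arithmetic. It does not use `mleko`; the variable names are made up for illustration, and encoding frequencies as proportions (rather than raw counts) is an assumption of this sketch, not a statement about how `FrequencyEncoderTransformer` behaves.

```python
from collections import Counter

# Frequency encoding: replace each category by how often it occurs.
# Using proportions instead of raw counts is an assumption of this sketch.
embarked = ["S", "C", "S", "Q", "S"]
counts = Counter(embarked)
frequency_encoded = [counts[value] / len(embarked) for value in embarked]

# Max-abs scaling: divide by the largest absolute value, giving values in [-1, 1].
values = [-4.0, 0.0, 2.0, 8.0]
largest = max(abs(v) for v in values)
max_abs_scaled = [v / largest for v in values]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
low, high = min(values), max(values)
min_max_scaled = [(v - low) / (high - low) for v in values]

print(frequency_encoded)
print(max_abs_scaled)
print(min_max_scaled)
```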


# Configuration
This section contains configurations for the notebook. 

In [None]:
%reload_ext autoreload
%autoreload 2

## Constants
Define various constants that will be used throughout the notebook.

In [None]:
# Kaggle dataset identifier
OWNER_SLUG = "yasserh"
DATASET_SLUG = "titanic-dataset"
DATASET_NAME = f"{OWNER_SLUG}/{DATASET_SLUG}"

# Define meta features of the dataset not used as model inputs
TARGET_FEATURE = "Survived"
ID_COLUMN = "PassengerId"
META_FEATURES = [ID_COLUMN, TARGET_FEATURE]

# General Configuration
RANDOM_STATE = 1337

# Download Data
In this cell, we use the `KaggleIngester` from the `mleko` library to download the Titanic dataset from Kaggle.

In [None]:
from mleko.dataset.ingest import KaggleIngester


# Fetch data from Kaggle and return paths to the downloaded files
csv_paths = KaggleIngester(
    destination_directory=f"data/{DATASET_NAME}/raw", 
    owner_slug=OWNER_SLUG, 
    dataset_slug=DATASET_SLUG
).fetch_data()


## Fetching Data from S3
In addition to the `KaggleIngester`, `mleko` also provides the `S3Ingester` for downloading datasets from Amazon S3.

Here's an example of how you can use it:
```python
from mleko.dataset.ingest import S3Ingester

csv_paths = S3Ingester(
    destination_directory="data",
    s3_bucket_name="mleko-datasets",
    s3_key_prefix="kaggle/nehaprabhavalkar/indian-food-101",
    aws_profile_name="mleko",
    aws_region_name="eu-west-1",
    num_workers=64,  # Number of workers to use for downloading files.
    check_s3_timestamps=True,  # Ensure that all files are from the same date.
).fetch_data()
```

# Clean Data

Here, we use the `CSVToVaexConverter` from `mleko` to clean the data. 

The converter reads the CSV file, drops unnecessary columns, handles missing values, and converts the data into a Vaex DataFrame for efficient processing.
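
As a rough illustration of what "dropping columns" and "dropping rows with a missing target" mean, the following sketch performs both steps with the standard library only. The inline CSV and all names are invented for the example, and this is not how `CSVToVaexConverter` is implemented.

```python
import csv
import io

# A tiny in-memory CSV standing in for the downloaded file (made-up rows).
raw_csv = io.StringIO(
    "PassengerId,Survived,Ticket,Age\n"
    "1,0,A/5 21171,22\n"
    "2,1,PC 17599,\n"
    "3,,STON/O2,26\n"
)

drop_columns = {"Ticket"}
required_columns = ["Survived"]

cleaned = []
for row in csv.DictReader(raw_csv):
    # Skip rows where a required column (here the target) is empty.
    if any(row[col] == "" for col in required_columns):
        continue
    # Drop unwanted columns and represent empty cells as None.
    cleaned.append(
        {key: (value if value != "" else None) for key, value in row.items() if key not in drop_columns}
    )

print(cleaned)
```

The cell below does the actual conversion with `mleko`.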

In [None]:
from mleko.dataset.convert import CSVToVaexConverter


clean_schema, clean_df = CSVToVaexConverter(
    cache_directory=f"data/{DATASET_NAME}/converted",
    drop_columns=["Ticket"],
    meta_columns=META_FEATURES,
    drop_rows_with_na_columns=[TARGET_FEATURE],  # Drop rows with missing target values
    random_state=RANDOM_STATE,  # We like reproducibility
).convert(csv_paths)

Investigate the data to see what columns are available and what their data types are.

In [None]:
clean_schema.features

In [None]:
clean_df.head(10)

# Split Train/Val and Test Dataset

In this section, we split the cleaned data into a training/validation set and a test set. 

We use the `RandomSplitter` from `mleko` to perform a stratified random split, ensuring that both sets have the same proportion of class labels.
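
For readers unfamiliar with stratification, here is a minimal standard-library sketch of the idea: shuffle and cut each class separately so that both parts keep roughly the same class proportions. The function `stratified_split` and the toy rows are hypothetical and unrelated to the internals of `RandomSplitter`.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, first_fraction, seed):
    """Split rows into two lists, cutting each class at the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    first, second = [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)  # Shuffle within the class before cutting.
        cut = round(len(group) * first_fraction)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second


# Toy data with 40 positives and 60 negatives.
rows = [{"PassengerId": i, "Survived": int(i % 5 < 2)} for i in range(100)]
train_val, test = stratified_split(rows, "Survived", first_fraction=0.9, seed=1337)

for name, part in (("Train/Val", train_val), ("Test", test)):
    rate = sum(r["Survived"] for r in part) / len(part)
    print(f"{name}: {len(part)} rows, positive rate {rate:.2f}")
```

Both parts end up with the same positive rate as the full toy dataset. The cell below performs the real split with `RandomSplitter`.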

In [None]:
from mleko.dataset.split import RandomSplitter


clean_train_val_df, clean_test_df = RandomSplitter(
    cache_directory=f"data/{DATASET_NAME}/split",
    data_split=(0.90, 0.10),  # 90% train/val, 10% test
    shuffle=True,  # Shuffle the data before splitting
    stratify=TARGET_FEATURE,  # Stratify on the target feature
    random_state=RANDOM_STATE,  # We like reproducibility
).split(clean_df)

Ensure the class balance is maintained in the train/val and test sets.

In [None]:
def print_split_stats(df, split_name):
    total_count = df.shape[0]
    survival_count = df[TARGET_FEATURE].sum()  # type: ignore
    survival_rate = survival_count / total_count

    print(f"{split_name}: {survival_rate * 100:.3f}% (Survived: {survival_count:3d}, Total: {total_count:3d})")


print_split_stats(clean_train_val_df, "Train/Val")
print_split_stats(clean_test_df, "Test")

## Splitting Based on Boolean Expressions

For more complex splits, you can use the `ExpressionSplitter` from `mleko` to split the data based on a given boolean expression.

It is suitable for splitting data based on time, location, or any other condition like the one below:
```python
from mleko.dataset.split import ExpressionSplitter

train_val_df, test_df = ExpressionSplitter(
    cache_directory=f"data/{DATASET_NAME}/split",
    # Train/val set: passengers who embarked at Southampton, plus those who
    # embarked at Cherbourg and paid a fare below 50.
    expression="(Embarked == 'S') | ((Embarked == 'C') & (Fare < 50))",
).split(clean_df)
```
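
To see how such an expression partitions rows, the sketch below evaluates the same condition on four made-up passengers in plain Python. Rows for which the condition holds go to the first part and all others to the second; the helper `in_first_split` is hypothetical.

```python
passengers = [
    {"Embarked": "S", "Fare": 80.0},
    {"Embarked": "C", "Fare": 30.0},
    {"Embarked": "C", "Fare": 90.0},
    {"Embarked": "Q", "Fare": 10.0},
]


def in_first_split(p):
    # Same logic as "(Embarked == 'S') | ((Embarked == 'C') & (Fare < 50))".
    return p["Embarked"] == "S" or (p["Embarked"] == "C" and p["Fare"] < 50)


matching = [p for p in passengers if in_first_split(p)]
rest = [p for p in passengers if not in_first_split(p)]

print("first part:", matching)
print("second part:", rest)
```

Note that the fare condition applies only to Cherbourg passengers: the Southampton passenger with a fare of 80 still lands in the first part.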

# Feature Engineering & Transformation 
In this section, we perform feature engineering and transformation. We write custom transformers, use predefined ones, and combine them inside a `CompositeTransformer` from `mleko`.

In [None]:
clean_train_val_df

## Custom Transformers
It is important that the ML pipeline is flexible and allows for easy experimentation, be it with different feature engineering techniques or different models.

In many cases, the classes provided by `mleko` will be sufficient for your needs. However, you can also create your own custom classes by inheriting from the corresponding base class. For transformers, this means inheriting from `BaseTransformer` and implementing the `__init__`, `_fit`, `_transform`, and `_fingerprint` methods.

In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Hashable

import vaex
import vaex.ml

from mleko.dataset import DataSchema
from mleko.dataset.transform import BaseTransformer
from mleko.utils import auto_repr


class IsAloneTransformer(BaseTransformer):
    @auto_repr
    def __init__(
        self,
        cache_directory: str | Path,
        cache_size: int = 1,
    ) -> None:
        super().__init__(cache_directory, [], cache_size)
        self._transformer = None

    def _fit(self, data_schema: DataSchema, _dataframe: vaex.DataFrame) -> tuple[DataSchema, None]:
        """No fitting required for this transformer."""
        ds = data_schema.copy().add_feature("IsAlone", "boolean")
        return ds, self._transformer

    def _transform(self, data_schema: DataSchema, dataframe: vaex.DataFrame) -> tuple[DataSchema, vaex.DataFrame]:
        """Add a new feature to the dataset indicating whether the passenger was alone or not."""
        df = dataframe.copy()
        df["IsAlone"] = (df["SibSp"] + df["Parch"]) == 0  # type: ignore
        ds = data_schema.copy().add_feature("IsAlone", "boolean")
        return ds, df

    def _fingerprint(self) -> Hashable:
        return super()._fingerprint()


class FeatureDropperTransformer(BaseTransformer):
    @auto_repr
    def __init__(
        self,
        cache_directory: str | Path,
        features: list[str] | tuple[str, ...],
        cache_size: int = 1,
    ) -> None:
        super().__init__(cache_directory, features, cache_size)
        self._transformer = None

    def _fit(self, data_schema: DataSchema, _dataframe: vaex.DataFrame) -> tuple[DataSchema, None]:
        """No fitting required for this transformer."""
        ds = data_schema.copy().drop_features(self._features)
        return ds, self._transformer

    def _transform(self, data_schema: DataSchema, dataframe: vaex.DataFrame) -> tuple[DataSchema, vaex.DataFrame]:
        """Drop the specified features from the dataset."""
        df = dataframe.drop(self._features, inplace=False)
        ds = data_schema.copy().drop_features(self._features)
        return ds, df

    def _fingerprint(self) -> Hashable:
        return super()._fingerprint()

Transformers can be applied individually or combined into a `CompositeTransformer`, which chains several transformers into one and works like a small pipeline, so you do not have to apply each of them yourself. Custom transformers are used in the same way as the predefined ones, such as `LabelEncoderTransformer` or `FrequencyEncoderTransformer`.

Transformers follow the common `fit` and `transform` pattern, similar to `scikit-learn`. The same is true for feature selectors and models.
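
The point of the pattern is that anything learned from the data is learned once, on the training data, and then reused unchanged on other data. The sketch below shows this with a tiny label encoder written in plain Python. `TinyLabelEncoder` is a hypothetical stand-in, not `mleko`'s `LabelEncoderTransformer`, and mapping unseen values to `-1` is a choice made only for this example.

```python
class TinyLabelEncoder:
    """Hypothetical stand-in used to illustrate fit/transform."""

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Learn one integer code per distinct value, in sorted order.
        self.mapping = {value: code for code, value in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Reuse the learned mapping; unseen values become -1 in this sketch.
        return [self.mapping.get(value, -1) for value in values]


train_values = ["S", "C", "S", "Q"]
test_values = ["Q", "S", "X"]

encoder = TinyLabelEncoder().fit(train_values)  # Fit on training data only.
train_codes = encoder.transform(train_values)
test_codes = encoder.transform(test_values)  # Same mapping applied to test data.

print(encoder.mapping, train_codes, test_codes)
```

The cell below follows the same order: `fit_transform` on the train/val data, then `transform` on the test data.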

In [None]:
from mleko.dataset.transform import CompositeTransformer, LabelEncoderTransformer


composite_transformer = CompositeTransformer(
    cache_directory=f"data/{DATASET_NAME}/transform",
    transformers=[
        FeatureDropperTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
            features=["Name"],
        ),
        IsAloneTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
        ),
        LabelEncoderTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
            features=["Sex", "Embarked", "IsAlone"],
        ),
    ],
)

transform_schema, _, transform_train_val_df = composite_transformer.fit_transform(clean_schema, clean_train_val_df)
_, transform_test_df = composite_transformer.transform(clean_schema, clean_test_df)

Ensure the transformed dataset has correct data types.

In [None]:
transform_schema.features

In [None]:
transform_train_val_df

# Feature Selection

Here, we use the `CompositeFeatureSelector` from `mleko` to select the most relevant features for our model. We use three selectors: `MissingRateFeatureSelector` to remove features with too many missing values, `InvarianceFeatureSelector` to remove invariant features, and `PearsonCorrelationFeatureSelector` to remove highly correlated features. We also display a correlation matrix for the selected numerical features.

Just like transformers, feature selectors can be applied individually or combined into a `CompositeFeatureSelector`, and allow for custom feature selectors.
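
The two threshold-based criteria can be sketched in a few lines of plain Python. The columns below are invented, the thresholds mirror the ones used in this notebook (0.5 and 0.7), and the rule for which feature of a correlated pair gets dropped (here: the later one) is an assumption of this sketch rather than documented `mleko` behaviour.

```python
import math


def missing_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


columns = {
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Cabin": [None, "C85", None, None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 21.08],
    "FareDoubled": [14.5, 142.56, 15.84, 106.2, 16.1, 42.16],
}

# Step 1: keep columns whose missing rate does not exceed 0.5.
kept = [name for name, vals in columns.items() if missing_rate(vals) <= 0.5]

# Step 2: keep a column only if it is not too correlated with one already kept.
selected = []
for name in kept:
    if all(abs(pearson(columns[name], columns[other])) <= 0.7 for other in selected):
        selected.append(name)

print("after missing-rate filter:", kept)
print("after correlation filter:", selected)
```

Here `Cabin` is removed for its missing rate and `FareDoubled` for being a linear copy of `Fare`.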

In [None]:
from mleko.dataset.feature_select import (
    CompositeFeatureSelector,
    InvarianceFeatureSelector,
    MissingRateFeatureSelector,
    PearsonCorrelationFeatureSelector,
)


composite_feature_selector = CompositeFeatureSelector(
    cache_directory=f"data/{DATASET_NAME}/feature_select",
    feature_selectors=[
        MissingRateFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            missing_rate_threshold=0.5,
            ignore_features=META_FEATURES,
        ),
        InvarianceFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            ignore_features=META_FEATURES,
        ),
        PearsonCorrelationFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            correlation_threshold=0.7,
            ignore_features=META_FEATURES,
        ),
    ],
)

data_schema, _, feature_select_train_val_df = composite_feature_selector.fit_transform(
    transform_schema, transform_train_val_df
)
_, test_df = composite_feature_selector.transform(transform_schema, transform_test_df)

The `Cabin` feature has too many missing values, so we drop it. No other feature was dropped.

In [None]:
data_schema.features

In [None]:
feature_select_train_val_df

# Train Model

We further split our training/validation data into a training set and a validation set. The `LGBMModel` is trained on the training set and evaluated on the validation set.

In [None]:
train_df, val_df = RandomSplitter(
    cache_directory=f"data/{DATASET_NAME}/split",
    data_split=(0.80, 0.20),
    shuffle=True,
    stratify=TARGET_FEATURE,
    random_state=RANDOM_STATE,
).split(feature_select_train_val_df, cache_group="train_val")

Ensure the class balance is maintained in the training, validation and test sets.

In [None]:
print_split_stats(train_df, "Train")
print_split_stats(val_df, "Val")
print_split_stats(test_df, "Test")

Train the model and evaluate it on the validation set.

In [None]:
from mleko.model import LGBMModel


lgbm_model = LGBMModel(
    cache_directory=f"data/{DATASET_NAME}/model",
    objective="binary",
    target=TARGET_FEATURE,
    num_iterations=100,
    ignore_features=META_FEATURES,
    metric=["average_precision", "auc"],
)

model, metrics, p_train_df, p_val_df = lgbm_model.fit_transform(data_schema, train_df, val_df, {})

In [None]:
import lightgbm


ax = lightgbm.plot_metric(metrics, metric="auc")
ax = lightgbm.plot_metric(metrics, metric="average_precision")

# `mleko` Pipeline

The `mleko` pipeline is used to streamline the entire process. Pipelines are very flexible and allow users to define a directed acyclic graph (DAG) of operations. You can chain together all operations in a single pipeline or create multiple pipelines for different tasks.

We create two pipelines:
- a pre-processing pipeline that handles data ingestion, conversion, splitting, transformation, and feature selection
- a model pipeline that trains and evaluates the model

Next, we instantiate all the components needed for the pre-processing pipeline.

## Pre-Processing Pipeline

In [None]:
kaggle_ingester = KaggleIngester(
    destination_directory=f"data/{DATASET_NAME}/raw", 
    owner_slug=OWNER_SLUG, 
    dataset_slug=DATASET_SLUG
)

In [None]:
csv_to_vaex_converter = CSVToVaexConverter(
    cache_directory=f"data/{DATASET_NAME}/converted",
    drop_columns=["Ticket"],
    meta_columns=META_FEATURES,
    drop_rows_with_na_columns=[TARGET_FEATURE],
    random_state=RANDOM_STATE,
)

In [None]:
random_splitter_90_10 = RandomSplitter(
    cache_directory=f"data/{DATASET_NAME}/split",
    data_split=(0.90, 0.10),
    shuffle=True,
    stratify=TARGET_FEATURE,
    random_state=RANDOM_STATE,
)

In [None]:
composite_transformer = CompositeTransformer(
    cache_directory=f"data/{DATASET_NAME}/transform",
    transformers=[
        FeatureDropperTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
            features=["Name"],
        ),
        IsAloneTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
        ),
        LabelEncoderTransformer(
            cache_directory=f"data/{DATASET_NAME}/transform",
            features=["Sex", "Embarked", "IsAlone"],
        ),
    ],
)

In [None]:
composite_feature_selector = CompositeFeatureSelector(
    cache_directory=f"data/{DATASET_NAME}/feature_select",
    feature_selectors=[
        MissingRateFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            missing_rate_threshold=0.5,
            ignore_features=META_FEATURES,
        ),
        InvarianceFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            ignore_features=META_FEATURES,
        ),
        PearsonCorrelationFeatureSelector(
            cache_directory=f"data/{DATASET_NAME}/feature_select",
            correlation_threshold=0.7,
            ignore_features=META_FEATURES,
        ),
    ],
)

In [None]:
random_splitter_80_20 = RandomSplitter(
    cache_directory=f"data/{DATASET_NAME}/split",
    data_split=(0.80, 0.20),
    shuffle=True,
    stratify=TARGET_FEATURE,
    random_state=RANDOM_STATE,
)

Define the pre-processing pipeline performing all the dataset pre-processing steps before training the model.

Each `PipelineStep` accepts a component instance (for example an ingester or a transformer), a list of input names, and a list of output names. The step runs the component, the input names are used to fetch the required data produced by previous steps, and the output names are used to store the step's results for use by subsequent steps.
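
The mechanism of named inputs and outputs can be pictured as steps reading from and writing to a shared dictionary. The following sketch shows that idea in plain Python; `ToyStep` and the plain `dict` container are hypothetical and much simpler than `mleko`'s real pipeline classes.

```python
class ToyStep:
    """Hypothetical step: reads named inputs from a dict, writes named outputs."""

    def __init__(self, func, inputs, outputs):
        self.func = func
        self.inputs = inputs
        self.outputs = outputs

    def run(self, container):
        # Fetch the inputs by name, call the wrapped function, store the results.
        args = [container[name] for name in self.inputs]
        results = self.func(*args)
        if len(self.outputs) == 1:
            results = (results,)
        container.update(zip(self.outputs, results))
        return container


steps = [
    ToyStep(lambda: [3, 1, 2], inputs=[], outputs=["raw"]),
    ToyStep(lambda raw: sorted(raw), inputs=["raw"], outputs=["clean"]),
    ToyStep(lambda clean: (clean[:2], clean[2:]), inputs=["clean"], outputs=["train", "test"]),
]

container = {}
for step in steps:
    container = step.run(container)

print(container)
```

Because each step only refers to names, a later step can consume the output of any earlier step, which is what makes the step list describe a DAG rather than a strict chain.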

In [None]:
from mleko.pipeline import Pipeline
from mleko.pipeline.steps import ConvertStep, FeatureSelectStep, IngestStep, SplitStep, TransformStep


pre_pipeline = Pipeline(
    steps=[
        IngestStep(kaggle_ingester, outputs=["raw_csv"]),
        ConvertStep(csv_to_vaex_converter, inputs=["raw_csv"], outputs=["clean_data_schema", "clean_df"]),
        SplitStep(
            random_splitter_90_10,
            inputs=["clean_df"],
            outputs=["train_val_clean_df", "test_clean_df"],
            cache_group="train_val_test",
        ),
        TransformStep(
            composite_transformer,
            action="fit_transform",
            inputs=["clean_data_schema", "train_val_clean_df"],
            outputs=["transform_data_schema", "composite_transformer", "transform_train_val_df"],
            cache_group="train_val",
        ),
        TransformStep(
            composite_transformer,
            action="transform",
            inputs=["clean_data_schema", "test_clean_df"],
            outputs=["transform_data_schema", "transform_test_df"],
            cache_group="test",
        ),
        FeatureSelectStep(
            composite_feature_selector,
            action="fit_transform",
            inputs=["transform_data_schema", "transform_train_val_df"],
            outputs=["data_schema", "composite_feature_selector", "selected_train_val_df"],
            cache_group="train_val",
        ),
        FeatureSelectStep(
            composite_feature_selector,
            action="transform",
            inputs=["transform_data_schema", "transform_test_df"],
            outputs=["data_schema", "test_df"],
            cache_group="test",
        ),
        SplitStep(
            random_splitter_80_20,
            inputs=["selected_train_val_df"],
            outputs=["train_df", "val_df"],
            cache_group="train_val",
        ),
    ]
)

Display the `Pipeline` to double-check the order in which its steps will be executed.

In [None]:
pre_pipeline

Execute the pre-processing pipeline and store the output in `pre_data_container`.

In [None]:
pre_data_container = pre_pipeline.run(force_recompute=True)

## Model Pipeline

In [None]:
from mleko.pipeline.steps import ModelStep


lgbm_model = LGBMModel(
    cache_directory=f"data/{DATASET_NAME}/model",
    objective="binary",
    num_leaves=11,
    target=TARGET_FEATURE,
    num_iterations=100,
    ignore_features=META_FEATURES,
    metric=["average_precision", "auc"],
)

model_pipeline = Pipeline(
    steps=[
        ModelStep(
            lgbm_model,
            action="fit_transform",
            inputs=["data_schema", "train_df", "val_df"],
            outputs=["lgbm_model", "metrics", "pred_train_df", "pred_val_df"],
        ),
        ModelStep(
            lgbm_model,
            action="transform",
            inputs=["data_schema", "test_df"],
            outputs=["pred_test_df"],
        ),
    ]
)

Run the model pipeline by feeding the output of the pre-processing pipeline into it.

In [None]:
data_container = model_pipeline.run(data_container=pre_data_container)
result = data_container.data

The outputs of every step are stored in the `data_container.data` object under the output names given in the steps.

In [None]:
list(result.keys())

In [None]:
ax = lightgbm.plot_metric(result["metrics"], metric="auc")
ax = lightgbm.plot_metric(result["metrics"], metric="average_precision")