# mloda demo: How can we make feature engineering shareable?

### Define dummy data as a plugin

In [1]:
import numpy as np
from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator


class DummyData(AbstractFeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        n_samples = features.get_options_key("n_samples") or 100
        return {
            "age": np.random.randint(18, 80, n_samples),
            "weight": np.random.normal(70, 15, n_samples),
            "state": np.random.choice(["CA", "NY", "TX", "FL"], n_samples),
            "gender": np.random.choice(["M", "F"], n_samples),
        }

    @classmethod
    def input_data(cls):
        return DataCreator({"age", "weight", "state", "gender"})

### Request mlodaAPI to create features

In [2]:
# We load dependencies.
from mloda_core.api.request import mlodaAPI

# Load plugins into namespace
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
from mloda_plugins.compute_framework.base_implementations.pyarrow.table import PyarrowTable

# from mloda_core.abstract_plugins.plugin_loader.plugin_loader import PluginLoader
# plugin_loader = PluginLoader.all()

result = mlodaAPI.run_all(["age", "weight", "state", "gender"], compute_frameworks=["PyarrowTable", "PandasDataframe"])
print(result)

[   state     weight gender  age
0     CA  64.880888      F   76
1     TX  55.829132      F   37
2     NY  93.605064      F   31
3     NY  52.031323      F   27
4     TX  73.107226      F   25
..   ...        ...    ...  ...
95    TX  52.434296      M   21
96    NY  88.033971      F   52
97    FL  80.829318      M   57
98    CA  55.786040      M   73
99    TX  65.051152      F   60

[100 rows x 4 columns]]


### Alternative options to consume data

- API data
- Files
- Databases
- Streams
- ...

However, data ingestion is not the core of mloda.

### Chain features - automatic dependency resolution

In [None]:
# Load plugin into namespace again
from mloda_plugins.compute_framework.base_implementations.polars.lazy_dataframe import PolarsLazyDataframe
from mloda_plugins.feature_group.experimental.aggregated_feature_group.polars_lazy import (
    PolarsLazyAggregatedFeatureGroup,
)


result = mlodaAPI.run_all(
    ["age__sum_aggr"],
    compute_frameworks=["PolarsLazyDataframe"],
)
print(result)

As long as the plugins exist, we can run any data transformation.

### What is behind the "age__sum_aggr" syntax?

In [4]:
from mloda_core.abstract_plugins.components.feature import Feature
from mloda_core.abstract_plugins.components.options import Options

feature = Feature(
    name="CustomConfiguration",
    options=Options(
        context={"aggregation_type": "sum", "mloda_source_features": Feature("age", options={"n_samples": 5})}
    ),
)

result = mlodaAPI.run_all(
    [feature],
    compute_frameworks=["PolarsLazyDataframe"],
)
print(result)

[shape: (5, 1)
┌─────────────────────┐
│ CustomConfiguration │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 211                 │
│ 211                 │
│ 211                 │
│ 211                 │
│ 211                 │
└─────────────────────┘]
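
To make the relationship between the two requests concrete, here is a minimal sketch of how a name such as `"age__sum_aggr"` could be read as shorthand for the explicit configuration above. The helper `parse_aggregated_name` and the constants `SEPARATOR` and `AGGR_SUFFIX` are hypothetical and written only for this illustration; mloda's own parsing may work differently.

```python
from typing import Optional, Tuple

# Assumptions for this sketch only: the source feature and the suffix are
# joined by a double underscore, and aggregation suffixes end in "_aggr".
SEPARATOR = "__"
AGGR_SUFFIX = "_aggr"


def parse_aggregated_name(feature_name: str) -> Optional[Tuple[str, str]]:
    """Split 'age__sum_aggr' into ('age', 'sum'); return None if the name does not fit."""
    source, sep, suffix = feature_name.rpartition(SEPARATOR)
    if not sep or not source or not suffix.endswith(AGGR_SUFFIX):
        return None
    aggregation_type = suffix[: -len(AGGR_SUFFIX)]
    if not aggregation_type:
        return None
    return source, aggregation_type


print(parse_aggregated_name("age__sum_aggr"))  # ('age', 'sum')
print(parse_aggregated_name("age"))            # None
```

The two parts returned here play the same role as `mloda_source_features` and `aggregation_type` in the explicit `Options` shown above.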


### How the chaining essentially works

```python
class AbstractFeatureGroup(ABC):

    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        # In principle, the resolver checks whether the feature group depends on another input feature
        # and, if so, adds that feature to the chain of features that need to be resolved.
        # Simplified sketch, not the actual implementation: "age__sum_aggr" depends on "age".
        # Assumes FeatureName converts to its plain string name.
        name = str(feature_name)
        if name.endswith("__sum_aggr"):
            return {Feature(name[: -len("__sum_aggr")])}
        return None

    # How does mloda know that a feature matches a feature group?
    # This is customizable, but the defaults make some good guesses.
    @classmethod
    def match_feature_group_criteria(
        cls,
        feature_name: Union[FeatureName, str],
        options: Options,
        data_access_collection: Optional[DataAccessCollection] = None,
    ) -> bool:
        ...  # body omitted in this sketch
```
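
The interplay of matching and input features can be sketched as a toy resolver in plain Python. Everything here (`ToyFeatureGroup`, `ToyData`, `ToySumAggregation`, `resolve`) is hypothetical and independent of mloda; it only illustrates the idea of walking from a requested feature back to its dependencies, and it omits cycle detection.

```python
from typing import List, Optional, Sequence, Set, Type


class ToyFeatureGroup:
    """Hypothetical stand-in for a feature group; not mloda's real class."""

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        raise NotImplementedError

    @classmethod
    def input_features(cls, feature_name: str) -> Optional[Set[str]]:
        return None  # no dependencies by default


class ToyData(ToyFeatureGroup):
    columns = {"age", "weight"}

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name in cls.columns


class ToySumAggregation(ToyFeatureGroup):
    suffix = "__sum_aggr"

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name.endswith(cls.suffix) and len(feature_name) > len(cls.suffix)

    @classmethod
    def input_features(cls, feature_name: str) -> Optional[Set[str]]:
        return {feature_name[: -len(cls.suffix)]}


def resolve(
    feature_name: str,
    groups: Sequence[Type[ToyFeatureGroup]],
    order: Optional[List[str]] = None,
) -> List[str]:
    """Return the features in computation order, dependencies first."""
    order = [] if order is None else order
    if feature_name in order:
        return order
    group = next((g for g in groups if g.matches(feature_name)), None)
    if group is None:
        raise ValueError(f"No feature group matches {feature_name!r}")
    for dependency in sorted(group.input_features(feature_name) or ()):
        resolve(dependency, groups, order)
    order.append(feature_name)
    return order


print(resolve("age__sum_aggr", [ToySumAggregation, ToyData]))
# ['age', 'age__sum_aggr']
```

Requesting only `"age__sum_aggr"` is enough: the resolver discovers that `"age"` must be produced first, which mirrors the request in the cells above.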

### Now we have chaining and matching. Why do we do this?


```python
class AbstractFeatureGroup(ABC):

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        """
        This function should be used to calculate the feature.
        """
        
        # data is the incoming data from other feature dependencies or data via API

        # features is the configuration
```

### Business knowledge is in the data and in the configuration, but not in the plugin definition.

## Big idea

**Separate business logic from transformation logic:**

- Plugins = generic transformations (shareable across companies)
- Data + Config = your business knowledge (stays private)

→ Stop rewriting "sum of a column" at every company

→ Build a shared ecosystem of feature engineering plugins
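
As a rough illustration of this separation, the sketch below applies one generic transformation to two unrelated datasets, with all business meaning supplied through data and configuration. The function `aggregate_column`, the `AGGREGATIONS` table, and the configuration keys are hypothetical and do not use mloda; the result is repeated for every row to mirror the output shown earlier.

```python
from typing import Callable, Dict, List

# Generic, shareable part: knows nothing about what a column means.
AGGREGATIONS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "max": max,
    "min": min,
}


def aggregate_column(data: Dict[str, List[float]], config: Dict[str, str]) -> List[float]:
    """Aggregate one column and repeat the result for every row."""
    values = data[config["source_feature"]]
    result = AGGREGATIONS[config["aggregation_type"]](values)
    return [result] * len(values)


# Private part: each team brings its own data and configuration.
patients = {"age": [30, 41, 52]}
orders = {"order_value": [9.5, 20.0]}

print(aggregate_column(patients, {"source_feature": "age", "aggregation_type": "sum"}))
# [123, 123, 123]
print(aggregate_column(orders, {"source_feature": "order_value", "aggregation_type": "max"}))
# [20.0, 20.0]
```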