# Core 7.5 Feature Store - Engines

In this section, we will take a look at the ingestion and transformation engines in the Feature Store, which allow data to be ingested in batch as well as in real-time.

---

### References

Much of the following content is derived from the official documentation:
- [Feature Store: Data ingestion](https://docs.mlrun.org/en/stable/feature-store/feature-store-data-ingestion.html)

---

### What is an engine?

So what exactly is an `engine` in the context of the feature store? Simply put, it is the processing framework for ingesting and transforming data into the Feature Store. For example, we would use a different `engine` to ingest data in batch than we would for real-time.
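
To make the batch versus real-time distinction concrete, here is a minimal plain-Python sketch. It does not use MLRun and is not how the engines are implemented; the function names and the sample stock records are hypothetical and exist only for this illustration. The same transformation is applied once over a whole dataset and once record by record.

```python
from typing import Iterable, Iterator, List


def add_price_cents(record: dict) -> dict:
    """Transformation shared by both styles: derive the price in cents."""
    return {**record, "price_cents": round(record["price"] * 100)}


def run_batch(rows: List[dict]) -> List[dict]:
    """Batch style: the whole dataset is available and processed in one call."""
    return [add_price_cents(row) for row in rows]


def run_real_time(events: Iterable[dict]) -> Iterator[dict]:
    """Real-time style: each record is processed on its own as it arrives."""
    for event in events:
        yield add_price_cents(event)


# Hypothetical sample data, not taken from the course datasets
records = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 20.5},
]

batch_result = run_batch(records)
real_time_result = list(run_real_time(iter(records)))

# Both styles produce the same features; only the execution model differs
print(batch_result == real_time_result)
```

The transformation logic is identical in both cases. What changes is the framework that drives it, which is the role an `engine` plays in the Feature Store.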

---

### What engines are supported?

At this time, two engines are supported for batch: `pandas` and `spark`. Chances are you already use at least one of these tools for data processing today, so you can apply that experience to the Iguazio Feature Store.

At this time, one engine is supported for real-time: `storey`. This is an engine of our own design that is part of `MLRun` and allows for complex transformations and flows in real-time. If you have completed our `Real-Time Pipelines` module, we are using the same underlying technology here.

![](img/transformation-engines.png)

---

### What are the differences between the engines?

The main difference between the engines is the framework used to process the data and apply transformations. For example:
- The `pandas` engine is designed for batch data that fits into memory and is transformed using Pandas dataframes
- The `spark` engine is designed for batch data that does not fit into memory and is transformed using Spark dataframes
- The `storey` engine is designed for real-time data (i.e. individual records) and is transformed using Python functions and classes
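
These rules of thumb can be written down as a small helper. This is only a sketch: `choose_engine`, its parameters, and the idea of comparing the data size against a memory budget are hypothetical and introduced here for illustration. MLRun does not provide such a function.

```python
def choose_engine(
    real_time: bool,
    data_size_bytes: int = 0,
    memory_budget_bytes: int = 0,
) -> str:
    """Suggest an engine name following the rules of thumb above.

    Hypothetical helper for illustration only; it is not part of MLRun.
    """
    if real_time:
        # Individual records arriving one at a time
        return "storey"
    if data_size_bytes <= memory_budget_bytes:
        # Batch data that fits into memory
        return "pandas"
    # Batch data that does not fit into memory
    return "spark"


GIB = 1024**3

print(choose_engine(real_time=True))
print(choose_engine(real_time=False, data_size_bytes=1 * GIB, memory_budget_bytes=8 * GIB))
print(choose_engine(real_time=False, data_size_bytes=500 * GIB, memory_budget_bytes=8 * GIB))
```

Treat the result as a starting point rather than a strict rule. As the ingestion examples later in this section show, the `storey` engine can also ingest a dataframe.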

---

### Specifying an Engine

Specifying an `engine` in the Feature Store is done at the `Feature Set` level. If you do not specify an engine (like in our previous examples), MLRun will default to the `storey` engine. You can specify an engine like so:

```python
import mlrun.feature_store as fstore

my_feature_set = fstore.FeatureSet(
    name="stocks",
    entities=[fstore.Entity("ticker")],
    description="Stock data per ticker",
    engine="pandas"
)
```
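
Because the engine is passed as a plain string, a typo is easy to make. The sketch below validates an engine name before it is handed to `fstore.FeatureSet`. `SUPPORTED_ENGINES`, `DEFAULT_ENGINE`, and `resolve_engine` are hypothetical names for this illustration; they mirror the three engines and the default described in this section and are not exposed by MLRun.

```python
from typing import Optional

# Engine names and default as described in this section (assumed, not read from MLRun)
SUPPORTED_ENGINES = ("storey", "pandas", "spark")
DEFAULT_ENGINE = "storey"


def resolve_engine(engine: Optional[str] = None) -> str:
    """Return a validated engine name, falling back to the default when none is given."""
    if engine is None:
        return DEFAULT_ENGINE
    normalized = engine.strip().lower()
    if normalized not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    return normalized


print(resolve_engine())
print(resolve_engine("Pandas"))
```

The returned string could then be passed as the `engine` argument when creating the feature set.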

---

### Ingestion Example per Engine

#### Setup

```python
import pandas as pd
import mlrun
import mlrun.feature_store as fstore
from mlrun.datastore.sources import DataFrameSource, CSVSource
from pyspark.sql import SparkSession

project = mlrun.get_or_create_project("iguazio-academy", context="./")

data = pd.read_csv("data/heart_disease_categorical.csv")

data.head()
```

> 2022-04-27 23:28:07,955 [info] loaded project iguazio-academy from MLRun DB


|   | patient_id | age | sex | cp | exang | fbs | slope | thal |
|---|---|---|---|---|---|---|---|---|
| 0 | e443544b-8d9e-4f6c-9623-e24b6139aae0 | 52 | male | typical_angina | no | False | downsloping | normal |
| 1 | 8227d3df-16ab-4452-8ea5-99472362d982 | 53 | male | typical_angina | yes | True | upsloping | normal |
| 2 | 10c4b4ba-ab40-44de-8aba-6bdb062192c4 | 70 | male | typical_angina | yes | False | upsloping | normal |
| 3 | f0acdc22-7ee6-4817-a671-e136211bc0a6 | 61 | male | typical_angina | no | False | downsloping | normal |
| 4 | 2d6b3bca-4841-4618-9a8c-ca902010b009 | 62 | female | typical_angina | no | True | flat | reversable_defect |


#### Storey Engine

```python
storey_set = fstore.FeatureSet(
    name="heart-disease-storey",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via storey engine",
    engine="storey"
)
```

```python
fstore.ingest(featureset=storey_set, source=DataFrameSource(df=data))
```

Converting input from bool to <class 'numpy.uint8'> for compatibility.


| patient_id | age | sex | cp | exang | fbs | slope | thal |
|---|---|---|---|---|---|---|---|
| e443544b-8d9e-4f6c-9623-e24b6139aae0 | 52 | male | typical_angina | no | False | downsloping | normal |
| 8227d3df-16ab-4452-8ea5-99472362d982 | 53 | male | typical_angina | yes | True | upsloping | normal |
| 10c4b4ba-ab40-44de-8aba-6bdb062192c4 | 70 | male | typical_angina | yes | False | upsloping | normal |
| f0acdc22-7ee6-4817-a671-e136211bc0a6 | 61 | male | typical_angina | no | False | downsloping | normal |
| 2d6b3bca-4841-4618-9a8c-ca902010b009 | 62 | female | typical_angina | no | True | flat | reversable_defect |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5d2fc80f-ed64-4e1c-9c95-3baace09118b | 59 | male | atypical_angina | yes | False | downsloping | reversable_defect |
| 01548a7e-0f68-4308-80de-cd93fdbfb903 | 60 | male | typical_angina | yes | False | flat | normal |
| f8c97cc1-8a3a-4b8e-965c-58e75c2379e6 | 47 | male | typical_angina | yes | False | flat | reversable_defect |
| d7fc9e01-b792-44da-88fa-a0057527da3f | 50 | female | typical_angina | no | False | downsloping | reversable_defect |


#### Pandas Engine

```python
pandas_set = fstore.FeatureSet(
    name="heart-disease-pandas",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via pandas engine",
    engine="pandas"
)
```

```python
fstore.ingest(featureset=pandas_set, source=DataFrameSource(df=data))
```

Converting input from bool to <class 'numpy.uint8'> for compatibility.


|   | patient_id | age | sex | cp | exang | fbs | slope | thal |
|---|---|---|---|---|---|---|---|---|
| 0 | e443544b-8d9e-4f6c-9623-e24b6139aae0 | 52 | male | typical_angina | no | False | downsloping | normal |
| 1 | 8227d3df-16ab-4452-8ea5-99472362d982 | 53 | male | typical_angina | yes | True | upsloping | normal |
| 2 | 10c4b4ba-ab40-44de-8aba-6bdb062192c4 | 70 | male | typical_angina | yes | False | upsloping | normal |
| 3 | f0acdc22-7ee6-4817-a671-e136211bc0a6 | 61 | male | typical_angina | no | False | downsloping | normal |
| 4 | 2d6b3bca-4841-4618-9a8c-ca902010b009 | 62 | female | typical_angina | no | True | flat | reversable_defect |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 963 | 5d2fc80f-ed64-4e1c-9c95-3baace09118b | 59 | male | atypical_angina | yes | False | downsloping | reversable_defect |
| 964 | 01548a7e-0f68-4308-80de-cd93fdbfb903 | 60 | male | typical_angina | yes | False | flat | normal |
| 965 | f8c97cc1-8a3a-4b8e-965c-58e75c2379e6 | 47 | male | typical_angina | yes | False | flat | reversable_defect |
| 966 | d7fc9e01-b792-44da-88fa-a0057527da3f | 50 | female | typical_angina | no | False | downsloping | reversable_defect |


#### Spark Engine

```python
spark_set = fstore.FeatureSet(
    name="heart-disease-spark",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via spark engine",
    engine="spark"
)
```

```python
spark = SparkSession.builder.appName("Spark function").getOrCreate()
```

```python
v3io_data_path = "v3io://users/nick/igz_repos/iguazio-academy/modules/core/7_feature_store/data/heart_disease_categorical.csv"
```

```python
fstore.ingest(featureset=spark_set, source=CSVSource(path=v3io_data_path), spark_context=spark)
```

> 2022-04-27 23:32:56,581 [info] writing to target parquet, spark options {'path': 'v3io://projects/iguazio-academy/FeatureStore/heart-disease-spark/parquet/sets/heart-disease-spark-latest', 'format': 'parquet'}
> 2022-04-27 23:32:57,030 [info] writing to target nosql, spark options {'path': 'v3io://projects/iguazio-academy/FeatureStore/heart-disease-spark/nosql/sets/heart-disease-spark-latest', 'format': 'io.iguaz.v3io.spark.sql.kv', 'key': 'patient_id'}


DataFrame[patient_id: string, age: int, sex: string, cp: string, exang: string, fbs: boolean, slope: string, thal: string]

---