# Core 7.4 Feature Store - Basic Ingestion

In this section, we will perform a basic ingestion of a dataset. Feature retrieval and additional configuration will be covered in later sections.

---

### References

Much of the following content is derived from the official documentation:
- [Feature sets](https://docs.mlrun.org/en/latest/feature-store/feature-sets.html#)

---

### Example Overview

In this example, we will ingest multiple datasets as `Feature Sets`. In future sections, we will retrieve/join them together into a single `Feature Vector`.

---

### Setup

In [9]:
import pandas as pd
import mlrun
import mlrun.feature_store as fstore
from mlrun.datastore.sources import DataFrameSource, CSVSource, ParquetSource

project = mlrun.get_or_create_project("iguazio-academy", context="./")

> 2022-04-22 21:59:10,254 [info] created and saved project iguazio-academy


---

### What data are we using?

We will be using a simple heart disease dataset for ingestion and retrieval. The dataset is split across three files to simulate multiple data sources. Every file contains the same `patient_id` column, which will be used to join them together.
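To make the role of the shared key concrete, here is a minimal pandas sketch. The rows, IDs, and the choice of one column per frame are made-up placeholders, not values from the real files. It only illustrates the idea of joining on `patient_id`; the actual join in later sections goes through the feature store rather than `pd.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for the three files; IDs and values are illustrative only.
categorical = pd.DataFrame(
    {"patient_id": ["a1", "b2", "c3"], "sex": ["male", "female", "male"]}
)
continuous = pd.DataFrame(
    {"patient_id": ["a1", "b2", "c3"], "chol": [212, 203, 174]}
)
target = pd.DataFrame(
    {"patient_id": ["a1", "b2", "c3"], "target": [0, 0, 1]}
)

# Join all three frames on the shared key column.
joined = categorical.merge(continuous, on="patient_id").merge(target, on="patient_id")
print(joined)
```

Because every frame carries the same key, each merge keeps one row per patient and simply adds columns.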

---

### Define Feature Sets

We will need to define a `Feature Set` per data source. This will look something like the following:

In [10]:
categorical_fset = fstore.FeatureSet(
    name="heart-disease-categorical",
    entities=[fstore.Entity("patient_id")],
    description="Categorical columns for heart disease dataset"
)

In [11]:
continuous_fset = fstore.FeatureSet(
    name="heart-disease-continuous",
    entities=[fstore.Entity("patient_id")],
    description="Continuous columns for heart disease dataset"
)

In [12]:
target_fset = fstore.FeatureSet(
    name="heart-disease-target",
    entities=[fstore.Entity("patient_id")],
    description="Target column for heart disease dataset"
)

---

### Ingest Data into Feature Sets

Now that our `Feature Sets` are defined, we can ingest data into them. This example uses `DataFrameSource`, `CSVSource`, and `ParquetSource` to show one source type per feature set, although a single source type would be enough if all of the data were in one format.

#### Ingest Categorical data using DataFrameSource

In [13]:
categorical_df = pd.read_csv("data/heart_disease_categorical.csv")

In [14]:
fstore.ingest(
    featureset=categorical_fset,
    source=DataFrameSource(df=categorical_df)  # Use DataFrameSource and pandas dataframe from above
)

Converting input from bool to <class 'numpy.uint8'> for compatibility.


patient_id,age,sex,cp,exang,fbs,slope,thal
e443544b-8d9e-4f6c-9623-e24b6139aae0,52,male,typical_angina,no,False,downsloping,normal
8227d3df-16ab-4452-8ea5-99472362d982,53,male,typical_angina,yes,True,upsloping,normal
10c4b4ba-ab40-44de-8aba-6bdb062192c4,70,male,typical_angina,yes,False,upsloping,normal
f0acdc22-7ee6-4817-a671-e136211bc0a6,61,male,typical_angina,no,False,downsloping,normal
2d6b3bca-4841-4618-9a8c-ca902010b009,62,female,typical_angina,no,True,flat,reversable_defect
...,...,...,...,...,...,...,...
5d2fc80f-ed64-4e1c-9c95-3baace09118b,59,male,atypical_angina,yes,False,downsloping,reversable_defect
01548a7e-0f68-4308-80de-cd93fdbfb903,60,male,typical_angina,yes,False,flat,normal
f8c97cc1-8a3a-4b8e-965c-58e75c2379e6,47,male,typical_angina,yes,False,flat,reversable_defect
d7fc9e01-b792-44da-88fa-a0057527da3f,50,female,typical_angina,no,False,downsloping,reversable_defect
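The log line printed before the table mentions a cast from `bool` to `numpy.uint8`. The notebook does not say which component emits it, and the returned table still shows `True`/`False` for `fbs`, so the cast presumably happens in an internal step rather than in the stored values (this is an inference, not something the source states). As a rough illustration of what such a cast does, here is a pandas sketch on a made-up boolean column:

```python
import pandas as pd

# Hypothetical boolean column, similar in shape to `fbs` above.
fbs = pd.Series([False, True, False], name="fbs")

# Cast booleans to unsigned 8-bit integers: False -> 0, True -> 1.
as_uint8 = fbs.astype("uint8")

print(as_uint8.tolist())
print(as_uint8.dtype)
```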


#### Ingest Continuous data using CSVSource

In [15]:
fstore.ingest(
    featureset=continuous_fset,
    source=CSVSource(path="./data/heart_disease_continuous.csv")  # Use CSVSource and path to CSV file
)

patient_id,trestbps,chol,restecg,thalach,oldpeak,ca
e443544b-8d9e-4f6c-9623-e24b6139aae0,125,212,1,168,1.0,2.0
8227d3df-16ab-4452-8ea5-99472362d982,140,203,0,155,3.1,0.0
10c4b4ba-ab40-44de-8aba-6bdb062192c4,145,174,1,125,2.6,0.0
f0acdc22-7ee6-4817-a671-e136211bc0a6,148,203,1,161,0.0,1.0
2d6b3bca-4841-4618-9a8c-ca902010b009,138,294,1,106,1.9,3.0
...,...,...,...,...,...,...
5d2fc80f-ed64-4e1c-9c95-3baace09118b,140,221,1,164,0.0,0.0
01548a7e-0f68-4308-80de-cd93fdbfb903,125,258,0,141,2.8,1.0
f8c97cc1-8a3a-4b8e-965c-58e75c2379e6,110,275,0,118,1.0,1.0
d7fc9e01-b792-44da-88fa-a0057527da3f,110,254,0,159,0.0,0.0


#### Ingest Target data using ParquetSource

In [16]:
fstore.ingest(
    featureset=target_fset,
    source=ParquetSource(path="./data/heart_disease_target.parquet")  # Use ParquetSource and path to parquet file
)

patient_id,target
e443544b-8d9e-4f6c-9623-e24b6139aae0,0
8227d3df-16ab-4452-8ea5-99472362d982,0
10c4b4ba-ab40-44de-8aba-6bdb062192c4,0
f0acdc22-7ee6-4817-a671-e136211bc0a6,0
2d6b3bca-4841-4618-9a8c-ca902010b009,0
...,...
5d2fc80f-ed64-4e1c-9c95-3baace09118b,1
01548a7e-0f68-4308-80de-cd93fdbfb903,0
f8c97cc1-8a3a-4b8e-965c-58e75c2379e6,0
d7fc9e01-b792-44da-88fa-a0057527da3f,1
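The three outputs list the same `patient_id` values in the same order, at least in the rows shown. Before joining, it can be useful to confirm this programmatically. A minimal sketch follows, using small hypothetical frames indexed by `patient_id` rather than the real ingestion results; the frame names and values are placeholders.

```python
import pandas as pd

# Hypothetical frames indexed by patient_id, mimicking the shape of the outputs above.
ids = pd.Index(["a1", "b2", "c3"], name="patient_id")
frames = {
    "categorical": pd.DataFrame({"sex": ["male", "female", "male"]}, index=ids),
    "continuous": pd.DataFrame({"chol": [212, 203, 174]}, index=ids),
    "target": pd.DataFrame({"target": [0, 0, 1]}, index=ids),
}

# Compare every index against the first one; Index.equals checks values and order.
reference = frames["categorical"].index
mismatched = [name for name, df in frames.items() if not df.index.equals(reference)]

print("mismatched frames:", mismatched)
```

An empty list means every frame has identical keys in identical order, so a later join on `patient_id` should not drop or duplicate rows.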


---