# Feature sets

In Iguazio, features are kept in a logical group called a feature set. A feature set has metadata and a list of associated features. <br>
You can think of a feature set as similar to a table in a database.
A feature set contains the following information:
- Metadata - general information about the feature set that is helpful for search and organization. Examples are project, name, owner, last update, description, and labels.
- Key attributes - entity (the join key) and timestamp key
- Transformation reference - the transformation logic (e.g. aggregation, enrichment)
- Target stores - a feature set can be saved to an online store, an offline store, or both
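
To make the four parts listed above concrete, here is a minimal sketch of the information a feature set carries, written as a plain Python dataclass. The class `FeatureSetSketch` and its field names are hypothetical and are not the MLRun `FeatureSet` class; they only mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FeatureSetSketch:
    """Hypothetical stand-in for a feature set (not the MLRun class)."""

    # Metadata: used for search and organization
    project: str
    name: str
    description: str = ""
    labels: Dict[str, str] = field(default_factory=dict)
    # Key attributes: entity column(s) act as the join key
    entities: List[str] = field(default_factory=list)
    timestamp_key: Optional[str] = None
    # Transformation reference: names of the steps applied at ingestion
    transformations: List[str] = field(default_factory=list)
    # Target stores: online, offline, or both
    targets: List[str] = field(default_factory=lambda: ["online", "offline"])


stocks_sketch = FeatureSetSketch(
    project="demo",  # illustrative project name
    name="stocks",
    entities=["ticker"],
)
print(stocks_sketch.entities, stocks_sketch.targets)
```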


### Benefits:
* Ensure the same computation for both training and serving
* Enable users to search for features across projects with a business context
* Share and reuse features
* Feature versioning
* Calculate features in real time - run real-time feature engineering on live events


### Key attributes:
* Name - The feature set name, which is unique within a project.
* Entities - Each feature set must be associated with one or more index columns. When joining feature sets, the entity is used as the key column.
* Timestamp key - (optional) the column that holds the timestamp of each record.
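
The role of the entity as a join key can be illustrated with a small, library-free sketch. The function `join_on_entity` is a hypothetical helper, not part of the feature store API, and the sample rows are illustrative.

```python
def join_on_entity(left_rows, right_rows, entity):
    """Left-join two lists of dict rows on a shared entity column."""
    # Index the right-hand rows by their entity value for O(1) lookup
    right_by_key = {row[entity]: row for row in right_rows}
    joined = []
    for row in left_rows:
        merged = dict(right_by_key.get(row[entity], {}))
        merged.update(row)  # values from the left row win on name clashes
        joined.append(merged)
    return joined


quotes = [{"ticker": "GOOG", "bid": 720.5}, {"ticker": "MSFT", "bid": 51.95}]
stocks = [
    {"ticker": "GOOG", "exchange": "NASDAQ"},
    {"ticker": "MSFT", "exchange": "NASDAQ"},
]
rows = join_on_entity(quotes, stocks, "ticker")
print(rows)
```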


## Create a feature set
Creating a feature set comprises the following steps:
- Create a feature set object with a basic definition of its name, entity, and (optionally) timestamp key. <br>

- Add transformations - the Iguazio feature store lets you create a variety of transformations, such as aggregations, joins, and filters, as well as custom logic. <br>
Transformations can run either as a batch process or in real time by processing live events. 
There are two engines that can be used for transformation: <br>
1) A graph engine called storey, an asynchronous streaming library for real-time event processing and feature extraction. (Link to the transformation section to be added.)<br>
2) Spark <br>

- Ingest data into the feature set - ingestion can be done as a batch or in real time. <br>
In this step the user defines the data source, scheduling, target data stores, and the ingestion type.
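
As an illustration of the aggregation transformations mentioned in the steps above, the following sketch keeps a running sum over a sliding time window, the kind of calculation behind a feature name such as `asks_sum_5h`. It is a simplified, library-free example and not storey's implementation; the class name `SlidingWindowSum` is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum of the values seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Add an event and return the sum over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that fell out of the window (timestamp - window, timestamp]
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total


window = SlidingWindowSum(window_seconds=3600)
sums = [window.add(ts, v) for ts, v in [(0, 10.0), (1800, 5.0), (3600, 1.0), (7200, 2.0)]]
print(sums)
```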


### Batch Ingestion

Ingest data into the feature store from various sources. The source can be a local DataFrame, files (e.g. CSV, Parquet), or a URL (e.g. S3, Azure blob). The ingestion then runs the graph transformations, infers metadata and statistics, and writes the results to the default or specified targets.

When targets are not specified, data is stored in the configured default targets (usually NoSQL for real-time and Parquet for offline).

Ingestion can be done locally (i.e. running as a Python process in the Jupyter pod) or as an MLRun job.

#### Ingest data (locally)

The example below runs a simple ingestion locally, in the Jupyter notebook pod.

In [None]:
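# Imports assumed by this example (module paths may differ between MLRun versions)
import pandas as pd
import mlrun.feature_store as fs
from mlrun.feature_store import FeatureSet, Entity, ingest
from mlrun.datastore.sources import CSVSource
from mlrun.datastore.targets import CSVTarget

# Simple feature set that reads a CSV file as a DataFrame and ingests it as is
stocks_set = FeatureSet("stocks", entities=[Entity("ticker")])
stocks = pd.read_csv("stocks.csv")
df = ingest(stocks_set, stocks, infer_options=fs.InferOptions.default())

# Specify a CSV file as the source and a CSV file as the target
source = CSVSource("mycsv", path="stocks.csv")
targets = [CSVTarget("mycsv", path="./new_stocks.csv")]
ingest(stocks_set, source, targets)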

#### Ingest data using an MLRun job

The example below runs the ingestion as an MLRun job, so the ingestion process runs in its own pod on the Kubernetes cluster. <br>
This job can also be scheduled to run at a later time.

In [None]:
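# Run the ingestion as a remote job (import paths may differ between MLRun versions)
from mlrun import mount_v3io
from mlrun.feature_store import RunConfig

stocks_set = FeatureSet("stocks", entities=[Entity("ticker")])
config = RunConfig(image='mlrun/mlrun').apply(mount_v3io())
df = ingest(stocks_set, stocks, run_config=config)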

### Real time ingestion

In real-time use cases (e.g. real-time fraud detection), users need to create features on live data (e.g. z-score calculation). <br>
Iguazio's feature store enables users to start a real-time ingestion service using a serverless function framework called Nuclio. <br>
When you run deploy_ingestion_service, the feature store creates a real-time function (also known as a Nuclio function). The function triggers support the following sources: HTTP, Kafka, v3io stream, etc. <br>
The trigger, as well as other parameters, can be configured in the Nuclio UI. <br> 


In [None]:
# Create a real-time function that receives HTTP requests
# The "ingest" function runs the feature engineering logic on live events
source = HTTPSource()
func = mlrun.code_to_function("ingest", kind="serving").apply(mount_v3io())
config = RunConfig(function=func)
fs.deploy_ingestion_service(my_set, source, run_config=config)
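
The z-score calculation mentioned above as an example of a live feature can be sketched without any feature store dependency. The class below uses Welford's online algorithm to keep a running mean and variance, and scores each new event against the history seen so far. The class name `RunningZScore` and the sample values are illustrative assumptions, not part of the ingestion service.

```python
import math


class RunningZScore:
    """Score each new value against the running mean/std of earlier values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, value):
        # Score against history first; needs at least two earlier values
        z = 0.0
        if self.count >= 2:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0:
                z = (value - self.mean) / std
        # Welford's update: fold the new value into the running statistics
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return z


scorer = RunningZScore()
scores = [scorer.update(v) for v in [10.0, 12.0, 11.0, 13.0, 50.0]]
print(scores)
```

In a fraud detection flow, an event whose score exceeds some chosen threshold would be flagged; the threshold itself is an application decision.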

### Simulation
During the development phase, users may want to check their feature set definition and simulate the creation of the feature set without actually ingesting data. This gives them a preview of the results, so they can decide whether to start the ingestion process or change the feature set definition. 
The simulation command infers the source data schema and runs the graph (assuming there is one) on a small subset of the data. 


In [None]:
fs.infer_metadata(
    quotes_set,
    quotes,
    entity_columns=["ticker"],
    timestamp_key="time",
    options=fs.InferOptions.default(),
)
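
To show what inferring a schema from a small subset of data means in practice, here is a simplified, library-free sketch. It is not how `infer_metadata` is implemented; the function `infer_schema`, its `sample_size` parameter, and the sample rows are hypothetical.

```python
from datetime import datetime


def infer_schema(rows, sample_size=100):
    """Infer a column -> type-name mapping from the first `sample_size` rows."""
    schema = {}
    for row in rows[:sample_size]:
        for column, value in row.items():
            if value is None:
                continue  # missing values carry no type information
            type_name = type(value).__name__
            previous = schema.get(column)
            if previous is None or previous == type_name:
                schema[column] = type_name
            elif {previous, type_name} == {"int", "float"}:
                schema[column] = "float"  # widen int to float
            else:
                schema[column] = "str"  # fall back for mixed types
    return schema


sample = [
    {"ticker": "GOOG", "bid": 720, "time": datetime(2020, 1, 1, 10, 0)},
    {"ticker": "MSFT", "bid": 51.95, "time": datetime(2020, 1, 1, 10, 1)},
]
schema = infer_schema(sample)
print(schema)
```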

### Data sources

For batch ingestion, the feature store supports DataFrames and files (i.e. CSV and Parquet). <br>
For real-time ingestion, the source can be HTTP, Kafka, a v3io stream, etc.
A real-time source maps to a Nuclio event trigger. <br>
Note that users can also create a custom source.

### Target stores
By default, the feature store stores the data as Parquet files for training and in the Iguazio key-value store for online serving. <br>
When working with the Iguazio platform, the Parquet files are stored in the "Projects" container, under the <project name>/fs/parquet folder. <br>
The key-value table is stored in the "Projects" container, under the <project name>/fs/nosql folder. <br>
Additional supported targets are Azure blob and S3.

## Create a feature set with transformation

A feature set contains an execution graph of operations that are performed when data is ingested, or when simulating the data flow to infer its metadata. <br>
This graph uses MLRun's serving graph. <br>
To learn more about creating the transformation process, go to ADD LINK

## Consume features for training

### Create a feature vector

To retrieve features, you first need to create a feature vector. <br>
A feature vector is a logical definition of a list of features that are based on one or more feature sets. <br>
By default the feature vector is saved only as a logical definition, but users can persist it by using the "target=" parameter.
 

In [None]:
features = [
    "stock-quotes.multi",
    "stock-quotes.asks_sum_5h as total_ask",
    "stock-quotes.bids_min_1h",
    "stock-quotes.bids_max_1h",
    "stocks.*",
]

vector = fs.FeatureVector("stocks-vec", features)
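
The feature references in the list above follow a `<feature set>.<feature>` pattern, with an optional `as <alias>` suffix and `*` to select all features of a set. The sketch below splits such a reference into its parts. `parse_feature_ref` is a hypothetical helper written for illustration, not a feature store function.

```python
def parse_feature_ref(ref):
    """Split '<set>.<feature>[ as <alias>]' into (set, feature, alias)."""
    alias = None
    if " as " in ref:
        ref, alias = ref.split(" as ", 1)
        alias = alias.strip()
    # Split on the first dot only: the part before it is the feature set name
    set_name, feature = ref.strip().split(".", 1)
    return set_name, feature, alias


parsed = [
    parse_feature_ref(ref)
    for ref in ["stock-quotes.asks_sum_5h as total_ask", "stocks.*"]
]
print(parsed)
```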

Once we have a feature vector, we can run get_offline_features to retrieve its features. <br>
This command fetches the data from the "offline" feature store and returns a DataFrame. <br>
You can also write the result as a Parquet file. <br>


In [None]:
resp = fs.get_offline_features(vector)
resp.to_dataframe()

You can also join another dataset while retrieving the data. This is done by using entity_rows and entity_timestamp_column. <br>
The data is joined based on the feature set entity column(s).

In [None]:
resp = fs.get_offline_features(vector, entity_rows=trades, entity_timestamp_column="time")
resp.to_dataframe()
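
One way to picture a join that uses both an entity column and a timestamp column is an as-of join: for each entity row, take the latest feature values recorded at or before that row's timestamp. The sketch below assumes this matching rule, which the text above does not spell out, and it is not the feature store's implementation. The function `as_of_join` and the sample rows (with integer timestamps) are illustrative.

```python
from bisect import bisect_right


def as_of_join(entity_rows, feature_rows, entity, time_col):
    """For each entity row, attach the latest feature row at or before its time."""
    # Group feature rows per entity value, sorted by time
    history = {}
    for row in sorted(feature_rows, key=lambda r: r[time_col]):
        history.setdefault(row[entity], []).append(row)

    joined = []
    for row in entity_rows:
        candidates = history.get(row[entity], [])
        times = [c[time_col] for c in candidates]
        pos = bisect_right(times, row[time_col])  # number of rows with time <= row time
        merged = dict(candidates[pos - 1]) if pos else {}
        merged.update(row)  # the entity row's own values (incl. timestamp) win
        joined.append(merged)
    return joined


features = [
    {"ticker": "GOOG", "time": 1, "bid": 720.0},
    {"ticker": "GOOG", "time": 5, "bid": 721.0},
]
trades = [
    {"ticker": "GOOG", "time": 3, "price": 720.5},
    {"ticker": "GOOG", "time": 7, "price": 722.0},
    {"ticker": "MSFT", "time": 2, "price": 52.0},
]
result = as_of_join(trades, features, entity="ticker", time_col="time")
print(result)
```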

## Consume features for online inference

By default, feature sets are ingested to both the "offline" and "online" feature stores. To consume the features in online applications, use the get_online_feature_service API. <br>
To do that, first initialize the online service and then get the relevant features. <br>
A single get request can return features for one or more keys.

In [None]:
# Pass the feature vector object defined earlier
service = fs.get_online_feature_service(vector)

In [22]:
service.get([{"ticker": "GOOG"}, {"ticker": "MSFT"}])

[{'asks_sum_5h': 2162.74,
  'bids_min_1h': 720.5,
  'bids_max_1h': 720.5,
  'multi': 2161.5,
  'name': 'Alphabet Inc',
  'exchange': 'NASDAQ'},
 {'asks_sum_5h': 207.97,
  'bids_min_1h': 51.95,
  'bids_max_1h': 52.01,
  'multi': 156.03,
  'name': 'Microsoft Corporation',
  'exchange': 'NASDAQ'}]
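
Conceptually, the online lookup above is a key-value read: each requested entity value is a key, and the response holds one record per key. The following in-memory sketch illustrates this idea. `OnlineStoreSketch` is a hypothetical class, not the online feature service, and returning `None` for an unknown key is an assumption of this sketch.

```python
class OnlineStoreSketch:
    """In-memory key-value table keyed by a single entity column."""

    def __init__(self, entity):
        self.entity = entity
        self._table = {}

    def put(self, row):
        """Store the latest feature values for the row's entity key."""
        key = row[self.entity]
        self._table[key] = {k: v for k, v in row.items() if k != self.entity}

    def get(self, entity_rows):
        """Return one record per requested key, or None if the key is unknown."""
        return [self._table.get(r[self.entity]) for r in entity_rows]


store = OnlineStoreSketch(entity="ticker")
store.put({"ticker": "GOOG", "name": "Alphabet Inc", "exchange": "NASDAQ"})
store.put({"ticker": "MSFT", "name": "Microsoft Corporation", "exchange": "NASDAQ"})
found = store.get([{"ticker": "GOOG"}, {"ticker": "AAPL"}])
print(found)
```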

## Show statistics and metadata

Run get_stats_table() to view the statistics of a feature set or feature vector: count, mean, min, max, std, hist (histogram), unique values, top, and frequency.

In [None]:
stocks_set.get_stats_table()

In [17]:
service.vector.get_stats_table()

| feature | count | mean | min | max | std | hist | unique | top | freq |
|---|---|---|---|---|---|---|---|---|---|
| multi | 8.0 | 925.27875 | 155.85 | 2161.5 | 1024.751408 | [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | | | |
| total_ask | 8.0 | 617.91875 | 51.96 | 2162.74 | 784.87798 | [[4, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,... | | | |
| bids_min_1h | 8.0 | 308.41125 | 51.95 | 720.5 | 341.596673 | [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | | | |
| bids_max_1h | 8.0 | 308.42625 | 51.95 | 720.5 | 341.583803 | [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | | | |
| name | 3.0 | | | | | | 3.0 | Apple Inc | 1.0 |
| exchange | 3.0 | | | | | | 1.0 | NASDAQ | 3.0 |
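
For numeric features, the columns of the statistics table can be reproduced with the standard library alone. The sketch below computes count, mean, min, max, sample standard deviation, and a fixed-width histogram. It is not the code behind get_stats_table(); the function `describe`, the bin count, and the sample values are assumptions made for illustration.

```python
import statistics


def describe(values, bins=5):
    """Compute basic statistics and an equal-width histogram for numeric values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value falls into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": lo,
        "max": hi,
        "std": statistics.stdev(values),  # sample standard deviation
        "hist": counts,
    }


stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(stats)
```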


## Viewing and managing features in the UI


Users can search for features across feature sets and view their metadata and statistics in the feature store dashboard. <br>
Future versions will also let users create and manage feature sets from the UI.