In [None]:
!odsc conda install -s fspyspark32_p38_cpu_v3

Oracle Data Science service sample notebook.

Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Enhancing Real-time Capabilities: Streaming Use Cases in Feature Store</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:
---
Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training-serving skew all increase model development time and degrade model performance. A feature store addresses many of these problems because it provides a centralised way to transform and access data at training and serving time, and it helps define a standardised pipeline for ingesting and querying data. This notebook shows how streaming data is ingested into Feature Store using the append and complete ingestion modes.

Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8


## Contents:

- <a href='#introduction'>1. Introduction</a>
- <a href='#pre_requisites'>2. Pre-requisites to Running this Notebook</a>
    - <a href='#setup_setup'>2.1. Setup</a>
    - <a href='#policies_'>2.2. Policies</a>
    - <a href='#authentication'>2.3. Authentication</a>
    - <a href='#variables'>2.4. Variables</a>
- <a href='#dataexploration'>3. Streaming data</a>
    - <a href='#dataexploration'>3.1. Exploration of data stream in feature store</a>
    - <a href='#feature_store'>3.2. Create feature store logical entities</a>
    - <a href='#ingestion_modes'>3.3. Ingestion Modes</a>
        - <a href='#append'>3.3.1. Append</a>
        - <a href='#complete'>3.3.2. Complete</a>
- <a href='#references'>4. References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when you add a database name, `database_name = "<database_name>"` becomes `database_name = "production"`.
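
As a minimal sketch of how to catch a placeholder that was left in by mistake, the following helper scans a dictionary of settings for values that still contain angle-bracket text. The helper name `find_unfilled_placeholders` and the sample settings are hypothetical and are not part of the feature store SDK.

```python
import re

# Matches values that still contain an unfilled "<...>" placeholder.
PLACEHOLDER_PATTERN = re.compile(r"<[^<>]+>")


def find_unfilled_placeholders(settings):
    """Return the names of settings whose values still contain <placeholder> text."""
    return [
        name
        for name, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER_PATTERN.search(value)
    ]


# Hypothetical sample values, used only for illustration.
settings = {
    "database_name": "production",     # placeholder replaced
    "metastore_id": "<metastore_id>",  # placeholder still present
}
unfilled = find_unfilled_placeholders(settings)
print(unfilled)
```

Running a check like this before creating any resources makes a forgotten placeholder fail early rather than inside a service call.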

---

<a id='introduction'></a>
# 1. Introduction

OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.

Review the following key terms to understand the Data Science feature store:


* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer id ‘2536’ can be called one feature vector.

* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.

* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Put another way, an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.

* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.

* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation and statistics results.

* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.

* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation and statistics results.
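
To make the relationship between features and a feature vector concrete, here is a toy, in-memory sketch. It uses plain dictionaries rather than the feature store API, and the feature names and values are invented for illustration.

```python
# Toy stand-in for a feature group: one row of feature values per customer id.
# Feature names and values are hypothetical.
customer_features = {
    2536: {"age": 41, "country": "US", "orders_last_30d": 3},
    2537: {"age": 29, "country": "DE", "orders_last_30d": 0},
}


def get_feature_vector(feature_rows, key, features=None):
    """Return all features, or only the requested subset, for one primary key."""
    row = feature_rows[key]
    if features is None:
        return dict(row)
    return {name: row[name] for name in features}


full_vector = get_feature_vector(customer_features, 2536)
partial_vector = get_feature_vector(customer_features, 2536, ["age", "orders_last_30d"])
print(full_vector)
print(partial_vector)
```

Both results are feature vectors for customer id 2536: the first holds every feature, the second only a subset.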

<a id='pre_requisites'></a>
# 2. Pre-requisites to Running this Notebook
Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v2) conda environment.

You can customize `fspyspark32_p38_cpu_v2`, publish it, and use it as a runtime environment for a Notebook session.


<a id='setup_setup'></a>
### 2.1. Setup

<a id='setup_spark-defaults'></a>
### `spark-defaults.conf`

The `spark-defaults.conf` file defines the properties that Spark uses. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually, but the `odsc data-catalog config` command line tool is ideal for setting up the file because it gathers information about your environment and uses it to build the file.

The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is needed because settings have default values, or they take values from your notebook session environment. Following are common parameters that you may need to override.

The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is set with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If `--authentication` isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.

Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.

The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where to write the file.

You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.

```bash
odsc data-catalog config --authentication resource_principal --metastore <metastore_id>
```
For more assistance, use the following command in a terminal window:

```bash
odsc data-catalog config --help
```
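
After the file is generated, it can be useful to confirm which properties it sets. The following is a minimal sketch that assumes the file uses the common layout of one property per line, with the name and value separated by whitespace and `#` starting a comment. The helper `parse_spark_defaults` is hypothetical and is not part of `odsc` or ADS.

```python
import os


def parse_spark_defaults(text):
    """Parse spark-defaults.conf content into a dict of property name -> value."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # name and value are separated by whitespace
        value = parts[1].strip() if len(parts) > 1 else ""
        properties[parts[0]] = value
    return properties


# SPARK_CONF_DIR and its default location are described in the text above.
conf_dir = os.environ.get("SPARK_CONF_DIR", "/home/datascience/spark_conf_dir")
conf_path = os.path.join(conf_dir, "spark-defaults.conf")

if os.path.exists(conf_path):
    with open(conf_path) as conf_file:
        # Print only the property names, because values may be sensitive.
        for name in parse_spark_defaults(conf_file.read()):
            print(name)
else:
    print(f"No spark-defaults.conf found at {conf_path}")
```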

<a id='policies_'></a>
### 2.2. Policies
This section covers the creation of dynamic groups and policies needed to use the service.

* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)
* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)
* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)

<a id="authentication"></a>
### 2.3. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.<br>
To set up authentication, use ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```.

In [None]:
import ads
ads.set_auth(auth="resource_principal", client_kwargs={"fs_service_endpoint": "<api_gateway_endpoint>"})

<a id="variables"></a>
### 2.4. Variables
To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `<compartment_id>` and a `<metastore_id>` for the offline feature store.

In [None]:
import os

compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
metastore_id = "<metastore_id>"

# Path to the stream data directory
stream_data_dir = "oci://{bucket}@{namespace}"
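
Object Storage locations follow the `oci://<bucket>@<namespace>/<path>` form, so sub-paths such as a checkpoint directory need a `/` between the base URI and each segment. The following is a small sketch of a helper that joins segments safely. The name `join_oci_path` and the bucket, namespace, and folder names are hypothetical.

```python
def join_oci_path(base_uri, *parts):
    """Join path segments onto an oci://bucket@namespace URI using single slashes."""
    if not base_uri.startswith("oci://"):
        raise ValueError(f"Expected an oci:// URI, got: {base_uri!r}")
    segments = [base_uri.rstrip("/")]
    # Drop empty segments and stray slashes so the result never contains "//" in the path.
    segments += [part.strip("/") for part in parts if part.strip("/")]
    return "/".join(segments)


# Hypothetical bucket, namespace, and folder names.
checkpoint_dir = join_oci_path("oci://my-bucket@my-namespace", "food-reviews", "checkpoint")
print(checkpoint_dir)
```

Building paths this way avoids accidentally appending a folder name directly to the namespace.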

In [None]:
import pandas as pd
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.feature_group import FeatureGroup
from ads.feature_store.model_details import ModelDetails
from ads.feature_store.dataset import Dataset
from ads.feature_store.common.enums import DatasetIngestionMode

from ads.feature_store.feature_group_expectation import ExpectationType
from great_expectations.core import ExpectationSuite, ExpectationConfiguration
from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar

<a id='dataexploration'></a>
### 3.1. Exploration of stream data in feature store

A Spark session managed by the feature store can be created using ```SparkSessionSingleton(metastore_id=<metastore_id>).get_spark_session()```. This session is configured with the specified `metastore_id` for seamless integration with Feature Store functionality.

In [None]:
from ads.feature_store.common.spark_session_singleton import SparkSessionSingleton
from pyspark.sql.types import StructType

# Get the spark session managed by the feature store
spark = SparkSessionSingleton(metastore_id=metastore_id).get_spark_session()

# Define the schema for the streaming data frame
food_reviews_schema = StructType() \
    .add("Time", "string") \
    .add("ProductId", "string") \
    .add("UserId", "string") \
    .add("Score", "string") \
    .add("Summary", "string") \
    .add("Text", "string")

food_reviews_streaming_df = spark.readStream \
    .option("sep", ",") \
    .option("header", True) \
    .schema(food_reviews_schema) \
    .csv(f"{stream_data_dir}/")

<a id="feature_store"></a>
### 3.2. Create feature store logical entities

#### 3.2.1. Feature Store
A feature store is the top-level entity of the feature store service.

In [None]:
feature_store_resource = (
    FeatureStore().
    with_description("Data consisting of food reviews").
    with_compartment_id(compartment_id).
    with_name("food reviews").
    with_offline_config(metastore_id=metastore_id)
)

<a id="create_feature_store"></a>
##### Create Feature Store

Call the ```.create()``` method of the Feature store instance to create a feature store.

In [None]:
feature_store = feature_store_resource.create()
feature_store

#### 3.2.2. Entity
An entity is a group of semantically related features.

In [None]:
entity = feature_store.create_entity(
    name="food reviews streaming use case",
    description="description for food reviews details"
)
entity

<a id="create_transformation"></a>
#### 3.2.3 Transformation
Transformations in a feature store refer to the operations and processes applied to raw data to create, modify, or derive new features that can be used as inputs for ML models.
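
In SQL mode, a transformation is written as a Python function that returns a query string. As a rough sketch, assuming the function is called with the name of a table or temporary view that holds the input data, the following shows what such a function renders. The function name `count_reviews_per_user` and the view name `food_reviews_view` are hypothetical, and how the feature store supplies the input name is an assumption here.

```python
def count_reviews_per_user(input_df):
    # input_df is assumed to be the name of a table or view, not a DataFrame object.
    return f"""
        SELECT UserId, COUNT(*) AS ReviewCount
        FROM {input_df}
        GROUP BY UserId
    """


# Render the query for a hypothetical view name to inspect the generated SQL.
rendered_sql = count_reviews_per_user("food_reviews_view")
print(rendered_sql)
```

Rendering the query locally like this is a quick way to check the SQL text before registering the function as a transformation.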

In [None]:
def calculate_average_score_per_product(input_df):
    # Perform aggregation to calculate average score for each product
    return f"""
        SELECT ProductId, AVG(CAST(Score AS DOUBLE)) AS AvgScore
        FROM {input_df}
        GROUP BY ProductId
    """

In [None]:
from ads.feature_store.transformation import TransformationMode

average_score_transformation = feature_store.create_transformation(
    transformation_mode=TransformationMode.SQL,
    source_code_func=calculate_average_score_per_product,
    name="calculate_average_score_per_product",
)
average_score_transformation

In [None]:
def select_relevant_columns(input_df):
    # Select relevant columns from the streaming DataFrame
    return f"""
        SELECT Time, ProductId, UserId, Score, Summary, Text
        FROM {input_df}
    """

In [None]:
from ads.feature_store.transformation import TransformationMode

transformation = feature_store.create_transformation(
    transformation_mode=TransformationMode.SQL,
    source_code_func=select_relevant_columns,
    name="select_relevant_columns",
)
transformation

<a id="ingestion_modes"></a>
### 3.3. Ingestion modes

Feature store currently offers two modes for streaming ingestion:

###### 1. Append Mode (Default)

In this default mode, only the new rows added to the Result Table since the last trigger are written to the sink. This mode is suitable for queries where rows already added to the Result Table do not change.

###### 2. Complete Mode

In Complete Mode, the entire Result Table is written to the sink after every trigger. This mode is supported only for aggregation queries.
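
The difference between the two modes can be illustrated with a toy simulation in plain Python. This is not Spark or feature store code: each inner list stands for the rows that arrive in one trigger, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def run_append(batches):
    """Emit only the rows that arrived in each trigger."""
    return [list(batch) for batch in batches]


def run_complete(batches):
    """Emit the full running average score per product after every trigger."""
    totals = defaultdict(lambda: [0.0, 0])  # product id -> [sum of scores, count]
    outputs = []
    for batch in batches:
        for product_id, score in batch:
            totals[product_id][0] += score
            totals[product_id][1] += 1
        # The whole result table is emitted, including products with no new rows.
        outputs.append({p: s / n for p, (s, n) in totals.items()})
    return outputs


# Two triggers of (ProductId, Score) rows.
batches = [
    [("A", 5.0), ("B", 3.0)],
    [("A", 3.0)],
]
append_outputs = run_append(batches)
complete_outputs = run_complete(batches)
print(append_outputs)
print(complete_outputs)
```

In the second trigger, the append output contains only the new row for product `A`, while the complete output re-emits the averages for both `A` and `B`.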


<a id="append"></a>
#### 3.3.1. Append

In ``append`` mode, new data is added to the existing table. If the table already exists, the new data is appended to it, extending the dataset. This mode is suitable for scenarios where you want to continuously add new records without modifying or deleting existing data. It preserves the existing data and only appends the new data to the end of the table.

For ``append`` mode, the transformation attached to the ``FeatureGroup`` should not contain aggregation operations.


<a id="create_feature_group_for_append"></a>
##### 3.3.1.1 Feature Group

Create feature group for food reviews.

In [None]:
from ads.feature_store.statistics_config import StatisticsConfig

stats_config = StatisticsConfig().with_is_enabled(False)
non_aggregated_fg = entity.create_feature_group(
    primary_keys=["ProductId"],
    schema_details_dataframe=food_reviews_streaming_df,
    statistics_config=stats_config,
    name="food_reviews_feature_group_with_non_aggregation",
    transformation_id=transformation.id
)

non_aggregated_fg

In [None]:
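# Start streaming ingestion in the default append mode; the checkpoint path is joined with "/"
query = non_aggregated_fg.materialise_stream(input_dataframe=food_reviews_streaming_df, checkpoint_dir=f"{stream_data_dir}/chec")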

<a id="complete"></a>
#### 3.3.2. Complete

Use ``complete`` as the ingestion mode when you want to aggregate the data and write the entire result to the sink after every trigger. This mode applies only to streaming aggregations. For example, a streaming word count combines the counts from new data with the previous counts and writes the full result to the sink.

<a id="create_feature_group_for_complete"></a>
##### 3.3.2.1 Feature Group

Create feature group for food reviews with streaming aggregated data.

In [None]:
from ads.feature_store.statistics_config import StatisticsConfig
from ads.feature_store.common.enums import (
    ExpectationType,
    EntityType,
    StreamingIngestionMode,
    BatchIngestionMode,
)

stats_config = StatisticsConfig().with_is_enabled(False)
aggregated_fg = entity.create_feature_group(
    primary_keys=["ProductId"],
    schema_details_dataframe=food_reviews_streaming_df,
    statistics_config=stats_config,
    name="food_reviews_feature_group_with_data_aggregation",
    transformation_id=average_score_transformation.id
)

aggregated_fg

In [None]:
# Start streaming ingestion for the aggregated feature group in complete mode
query = aggregated_fg.materialise_stream(input_dataframe=food_reviews_streaming_df, checkpoint_dir="oci://demo-2@idogsu2ylimg/food-reviews/checkpoint-complete", ingestion_mode=StreamingIngestionMode.COMPLETE)