Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Data Flow Studio: Big Data Operations in Feature Store</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:

This notebook demonstrates how to run Feature Store on interactive Spark workloads on a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** lets you work interactively with remote Spark clusters from Jupyter notebooks through Livy, a Spark REST server. It includes a set of magic commands for running Spark code interactively.



## Contents:

- <a href="#introduction">1. Introduction</a>
- <a href='#pre_requisites'>2. Pre-requisites</a>
    - <a href='#policies_'>2.1 Policies</a>
    - <a href='#prerequisites_helpers'>2.2 Helpers</a>
    - <a href='#authentication'>2.3 Authentication</a>
    - <a href='#variables'>2.4 Variables</a>
- <a href='#dataflow_magic'>3. Dataflow Magic</a>
    - <a href='#load_extension'>3.1. Load extension</a>
    - <a href='#create_session'>3.2. Create DataFlow Session</a>
    - <a href='#data_exploration'>3.3. Data exploration</a>
    - <a href='#load_featuregroup'>3.4. Creation of logical entities of feature group</a>
        - <a href='#create_feature_store'>3.4.1 Creation of feature store</a>
        - <a href='#create_entity'>3.4.2 Creation of entity</a>
        - <a href='#create_feature_group'>3.4.3 Creation of feature group</a>
        - <a href='#materialise_feature_store'>3.4.4 Materialisation of feature group</a>
        - <a href='#query_feature_group'>3.4.5 Querying of feature group</a>
- <a href='#references'>4. References</a>

---


Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)


In [None]:
# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.
!pip install --pre --no-deps oracle-ads==2.9.0rc0

<a id="introduction"></a>
# 1. Introduction

Oracle Feature Store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.

The following are some key terms that will help you understand OCI Data Science Feature Store:


* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer ID ‘2536’ can be called one feature vector.

* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.

* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the Feature Store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, and document.

* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, which reduces redundant work and ensures consistency in feature engineering.

* **Feature Group Job**: A feature group job is the execution instance of a feature group. Each feature group job includes validation results and statistics results.

* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.

* **Dataset Job**: A dataset job is the execution instance of a dataset. Each dataset job includes validation results and statistics results.
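
To make the terms above concrete, here is a minimal plain-Python sketch. It is not the Feature Store API: the rows, the `customer_id` key, and the `feature_vector` helper are hypothetical and exist only to illustrate what a feature vector is.

```python
# Illustrative only: plain Python, not the Feature Store API.
# Each row holds the feature values of one customer (hypothetical data).
rows = [
    {"customer_id": "2536", "age": 41, "avg_basket": 58.2, "visits_30d": 7},
    {"customer_id": "9001", "age": 29, "avg_basket": 12.5, "visits_30d": 2},
]


def feature_vector(rows, key_column, key_value, features=None):
    """Return the feature values stored for one primary key.

    `features` optionally restricts the vector to a subset of features.
    """
    for row in rows:
        if row[key_column] == key_value:
            names = features or [c for c in row if c != key_column]
            return {name: row[name] for name in names}
    raise KeyError(f"no row with {key_column}={key_value!r}")


# All features of customer 2536, then only a subset of them.
full_vector = feature_vector(rows, "customer_id", "2536")
subset_vector = feature_vector(rows, "customer_id", "2536", ["age"])
print(full_vector)
print(subset_vector)
```

Both results count as feature vectors in the sense defined above, because each one is tied to a single identifier key.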

<a id='pre_requisites'></a>
# 2. Pre-requisites to Running this Notebook

Data Flow Sessions are accessible through the following conda environment: 

* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**

The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository for understanding tables backed by files on object storage. You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore ID of the Hive Metastore is tied to the feature store construct of the Feature Store service.

<a id='policies_'></a>
## 2.1. Policies
This section covers the creation of dynamic groups and policies needed to use the service.

* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)
* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)
* [Data Catalog Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)

<a id="prerequisites_helpers"></a>
## 2.2 Helpers
This section provides a helper function that is used across the notebook to prepare arguments for the magic commands. The function is particularly useful when you want to pass Python variables as arguments to the Spark magic commands.

In [None]:
import json


def prepare_command(command: dict) -> str:
    """Converts dictionary command to the string formatted commands."""
    return f"'{json.dumps(command)}'"
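
As a quick check of what the helper produces, the following sketch repeats the function so that it runs on its own and applies it to a small dictionary. The keys and values are hypothetical and only show the shape of the result.

```python
import json


def prepare_command(command: dict) -> str:
    """Same helper as above, repeated so this example runs on its own."""
    return f"'{json.dumps(command)}'"


# Hypothetical values, used only to show the shape of the result.
example = prepare_command({"displayName": "demo", "numExecutors": 2})
print(example)

# Stripping the surrounding single quotes gives back valid JSON.
round_trip = json.loads(example[1:-1])
print(round_trip)
```

The result is a JSON document wrapped in single quotes, so it can be passed as one argument to a magic command. A value that itself contains a single quote may break that quoting, so it is safer to avoid such values in the dictionary.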

<a id="authentication"></a>
## 2.3. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.<br> 
To set up authentication, use either ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```. 

In [None]:
import ads

ads.set_auth("resource_principal")  # Supported values: resource_principal, api_key
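
If the same notebook is run both inside and outside a notebook session, the choice between the two modes can be derived from the environment. The sketch below is one possible approach, not part of ADS. It assumes that the `OCI_RESOURCE_PRINCIPAL_VERSION` environment variable is set wherever a resource principal is available, and `pick_auth_mode` is a hypothetical helper.

```python
import os


def pick_auth_mode(env=None):
    """Return "resource_principal" when the resource principal variable is set,
    otherwise fall back to "api_key".

    Assumption: OCI_RESOURCE_PRINCIPAL_VERSION is present whenever a
    resource principal can be used.
    """
    env = os.environ if env is None else env
    if env.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
        return "resource_principal"
    return "api_key"


auth_mode = pick_auth_mode()
print(auth_mode)
# ads.set_auth(auth_mode) would then apply the choice.
```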

<a id="variables"></a>
## 2.4. Variables
To run this notebook, you must provide some information about your tenancy configuration. To connect to the Hive Metastore, replace `<metastore_id>` with the OCID of the Hive Metastore.

To create and run a Data Flow session, you must specify a `<compartment_id>`, a `<metastore_id>`, a `<logs_bucket_uri>` bucket for storing logs, and a `<custom_conda_environment_uri>`. These resources must be in the same compartment.

In [None]:
import os

# The compartment OCID is read from the notebook session environment.
compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
metastore_id = "<metastore_id>"
logs_bucket_uri = "<logs_bucket_uri>"

custom_conda_environment_uri = "oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda"
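
Because several of these values start out as placeholders, it can help to check them before creating a session. The following is a minimal sketch under that assumption. `unset_variables` is a hypothetical helper and the sample values are made up for illustration.

```python
import re

# Matches values that are still a template placeholder such as "<metastore_id>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unset_variables(variables: dict) -> list:
    """Return the names whose value is missing or still a `<...>` placeholder."""
    return [
        name
        for name, value in variables.items()
        if value is None or PLACEHOLDER.match(str(value))
    ]


# Hypothetical values for illustration.
missing = unset_variables(
    {
        "compartment_id": "ocid1.compartment.oc1..example",
        "metastore_id": "<metastore_id>",
        "logs_bucket_uri": None,
    }
)
print(missing)
```

In the notebook, the same function could be called with the real variables, and session creation skipped while the returned list is not empty.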

<a id="dataflow_magic"></a>
# 3. Data Flow Spark Magic
Data Flow Spark Magic commands allow you to work interactively with Data Flow Spark clusters (sessions) in Jupyter notebooks through the Livy REST API. They provide a set of Jupyter Notebook cell magic commands that turn Jupyter into an integrated Spark development environment for remote clusters. 

**Data Flow Magic allows you to:**

* Run Spark code against Data Flow remote Spark cluster
* Create a Data Flow Spark Session with SparkContext and HiveContext against Data Flow remote Spark cluster
* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. matplotlib)

<a id="load_extension"></a>
### 3.1. Load Spark Magic Commands and Getting Help
Data Flow Spark Magic is a JupyterLab extension that you need to activate in your notebook using the `%load_ext dataflow.magics` magic command.<br>
After the extension is activated, the `%help` command can be used to get the list of supported commands.

In [None]:
%load_ext dataflow.magics

<a id="create_session"></a>
### 3.2. Create Session
To create a new Data Flow cluster session, use the `%create_session` magic command.

In [None]:
command = prepare_command(
    {
        "compartmentId": compartment_id,
        "displayName": "spark_session_via_notebook",
        "language": "PYTHON",
        "sparkVersion": "3.2.1",
        "numExecutors": 8,
        "metastoreId": metastore_id,
        "driverShape": "VM.Standard2.1",
        "executorShape": "VM.Standard2.1",
        "driverShapeConfig": {"ocpus": 2, "memoryInGBs": 16},
        "executorShapeConfig": {"ocpus": 2, "memoryInGBs": 16},
        "type": "SESSION",
        "logsBucketUri": logs_bucket_uri,
        "configuration": {
            "spark.archives": custom_conda_environment_uri,
            "fs.oci.client.hostname": "https://objectstorage.us-ashburn-1.oraclecloud.com"
        },
    }
)

%create_session -l python -c $command

In [None]:
%%spark
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

import ads
from ads.feature_store.entity import Entity
from ads.feature_store.feature_group import FeatureGroup
from ads.feature_store.feature_group_expectation import ExpectationType
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.input_feature_detail import FeatureDetail, FeatureType
from ads.feature_store.statistics_config import StatisticsConfig
from ads.feature_store.transformation import TransformationMode
import os

# Set the authentication for Feature Store operations.
# Replace {api_gateway} with the host of your Feature Store API gateway.
ads.set_auth(auth="resource_principal", client_kwargs={"fs_service_endpoint": "https://{api_gateway}/20230101"})

# Variables
compartment_id = "<compartment_id>"
metastore_id = "<metastore_id>"

<a id="data_exploration"></a>
### 3.3. Data exploration

In [None]:
%%spark
# Parquet files carry their own schema, so the CSV-style options header and inferSchema are not needed.
df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet")
df_nyc_tlc = df_nyc_tlc.select("vendor_id", "pickup_at", "dropoff_at")

df_nyc_tlc.show()
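
The read path above uses a character class, `201[1,2,3,4,5,6,7,8]`, to select the year folders. The sketch below uses Python's `fnmatch` module as a stand-in to show which folder names such a class matches. This assumes that the object storage glob treats character classes the way `fnmatch` does, which is not verified here, and the list of year names is made up for illustration.

```python
from fnmatch import fnmatchcase

original = "201[1,2,3,4,5,6,7,8]"  # pattern used in the read path
compact = "201[1-8]"               # shorter range form

# Hypothetical folder names to test the patterns against.
years = [str(y) for y in range(2009, 2021)]

matched_original = [y for y in years if fnmatchcase(y, original)]
matched_compact = [y for y in years if fnmatchcase(y, compact)]
print(matched_original)
print(matched_original == matched_compact)
```

Under `fnmatch` rules, the commas inside the brackets are themselves members of the class, so the range form expresses the same selection of years more precisely.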

<a id="load_featuregroup"></a>
### 3.4. Create feature store logical entities

<a id="create_feature_store"></a>
#### 3.4.1 Creation of Feature Store
The feature store is the top-level entity of the Feature Store service.

In [None]:
%%spark
feature_store_resource = FeatureStore(). \
    with_description("Feature Store Description"). \
    with_compartment_id(compartment_id). \
    with_display_name("FeatureStore"). \
    with_offline_config(metastore_id=metastore_id)

feature_store = feature_store_resource.create()
feature_store

<a id="create_entity"></a>
#### 3.4.2 Creation of Entity
An entity is a group of semantically related features.

In [None]:
%%spark
entity = feature_store.create_entity()
entity

<a id="create_feature_group"></a>
#### 3.4.3 Creation of Feature group
A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource.

In [None]:
%%spark

# Initialize Expectation Suite
expectation_suite_trans = ExpectationSuite(expectation_suite_name="feature_definition")
expectation_suite_trans.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "vendor_id"}
    )
)

stats_config = StatisticsConfig().with_is_enabled(False)

feature_group = entity.create_feature_group(
    primary_keys=["vendor_id"],
    schema_details_dataframe=df_nyc_tlc,  # infer the schema from the data frame
    expectation_suite=expectation_suite_trans,
    expectation_type=ExpectationType.LENIENT,
    statistics_config=stats_config,
    name="feature_group_big_data",
)

feature_group
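
The feature group above attaches a not-null expectation on `vendor_id` with `ExpectationType.LENIENT`. The plain-Python sketch below illustrates the idea behind such a check. It is not how ADS or Great Expectations implement it, and it assumes that a lenient policy only reports failures while a strict policy stops the job. `null_count` and `apply_policy` are hypothetical helpers, and the rows are made-up data.

```python
def null_count(rows, column):
    """Count the rows where `column` is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def apply_policy(failures, policy):
    """Decide what to do with failed checks under a given policy.

    "STRICT" aborts on any failure; "LENIENT" only reports it.
    """
    if failures and policy == "STRICT":
        raise ValueError(f"{failures} row(s) failed the expectation")
    return {"success": failures == 0, "failures": failures}


# Hypothetical rows: one of them has no vendor_id.
rows = [{"vendor_id": "CMT"}, {"vendor_id": None}, {"vendor_id": "VTS"}]
report = apply_policy(null_count(rows, "vendor_id"), "LENIENT")
print(report)
```

With the lenient policy the failure is recorded and processing continues. Passing `"STRICT"` with the same rows raises an error instead.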

<a id="materialise_feature_store"></a>
#### 3.4.4 Materialisation of Feature group

In [None]:
%%spark
# Parquet files carry their own schema, so the CSV-style options header and inferSchema are not needed.
df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet")
df_nyc_tlc = df_nyc_tlc.select("vendor_id", "pickup_at", "dropoff_at").limit(1000)

feature_group.materialise(df_nyc_tlc)

<a id="query_feature_group"></a>
#### 3.4.5 Feature group Querying

In [None]:
%%spark
feature_group.select().show()

In [None]:
%%spark
feature_group.select(["vendor_id", "pickup_at"]).show()

In [None]:
%%spark
feature_group.filter(feature_group.vendor_id == "CMT").show()

<a id='references'></a>
# 4. References
- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)
- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)