Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Data Flow Studio: Big Data Operations in Feature Store</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:

This notebook demonstrates how to run interactive Spark workloads on a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used to work interactively with remote Spark clusters in Jupyter notebooks through Livy, a Spark REST server. It includes a set of magic commands for running Spark code interactively.



## Contents:

- <a href="#concepts">1. Introduction</a>
- <a href='#pre-requisites'>2. Pre-requisites</a>
    - <a href='#policies'>2.1. Policies</a>
    - <a href='#prerequisites_helpers'>2.2. Helpers</a>
    - <a href='#prerequisites_authentication'>2.3. Authentication</a>
    - <a href='#prerequisites_variables'>2.4. Variables</a>
- <a href='#dataflow_magic'>3. Data Flow Spark Magic</a>
    - <a href='#load_extension'>3.1. Load Spark Magic Commands and Getting Help</a>
    - <a href='#create_session'>3.2. Create Session</a>
    - <a href='#data_exploration'>3.3. Data exploration</a>
    - <a href='#load_featuregroup'>3.4. Create feature store logical entities</a>
        - <a href='#create_feature_store'>3.4.1 Creation of feature store</a>
        - <a href='#create_entity'>3.4.2 Creation of entity</a>
        - <a href='#create_feature_group'>3.4.3 Creation of feature group</a>
        - <a href='#materialise_feature_store'>3.4.4 Materialisation of feature group</a>
        - <a href='#query_feature_group'>3.4.5 Querying of feature group</a>
- <a href='#ref'>4. References</a>

---


Compatible conda pack: [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8

<img src="https://objectstorage.us-ashburn-1.oraclecloud.com/p/jkC0Ow2ARR8rTw0ykUpLjr6a_9wZnb9PTDYif8pdKxMK_nrbpcSj0mSIeecQCjsE/n/idogsu2ylimg/b/demo-2/o/download.png"/>

---

<a id="concepts"></a>
# 1. Introduction

Oracle Feature Store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with the infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.

The following are some key terms that will help you understand OCI Data Science Feature Store:


* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer id ‘2536’ can be called one feature vector.

* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.

* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, and document.

* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, which reduces redundant work and ensures consistency in feature engineering.

* **Feature Group Job**: Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.

* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.

* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results.
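
The relationships between these terms can be sketched with plain Python data classes. This is only an illustration of the hierarchy described above. The class names `EntitySketch` and `FeatureGroupSketch`, the helper `feature_vector`, and the entity name `taxi_trip` are hypothetical and are not part of the ADS feature store API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureGroupSketch:
    """Hypothetical stand-in for a feature group: named features plus primary keys."""
    name: str
    primary_keys: List[str]
    features: List[str]


@dataclass
class EntitySketch:
    """Hypothetical stand-in for an entity: a container of related feature groups."""
    name: str
    feature_groups: Dict[str, FeatureGroupSketch] = field(default_factory=dict)

    def add_feature_group(self, group: FeatureGroupSketch) -> None:
        self.feature_groups[group.name] = group

    def list_features(self) -> List[str]:
        # Collect every feature reachable from this entity, without duplicates.
        seen: List[str] = []
        for group in self.feature_groups.values():
            for feature in group.features:
                if feature not in seen:
                    seen.append(feature)
        return seen


def feature_vector(row: Dict[str, object], group: FeatureGroupSketch) -> Dict[str, object]:
    """Return the feature values of one row, restricted to the group's features."""
    return {name: row[name] for name in group.features}


trips = FeatureGroupSketch(
    name="feature_group_big_data",
    primary_keys=["vendor_id"],
    features=["vendor_id", "pickup_at", "dropoff_at"],
)
taxi = EntitySketch(name="taxi_trip")  # hypothetical entity name
taxi.add_feature_group(trips)

row = {
    "vendor_id": "CMT",
    "pickup_at": "2011-01-29 02:38:35",
    "dropoff_at": "2011-01-29 02:47:07",
    "unused_column": 1,
}
print(taxi.list_features())
print(feature_vector(row, trips))
```

Listing the features of an entity corresponds to the discovery step mentioned in the **Entity** definition, and `feature_vector` shows how one row reduces to the values of a single key.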

<a id='pre-requisites'></a>
# 2. Pre-requisites 

Data Flow Sessions are accessible through the following conda environment: 

* **PySpark 3.2 and Feature Store (pyspark_3_v1)**

<a id='policies'></a>
## 2.1. Policies
This section covers the creation of dynamic groups and policies needed to use the service.

* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)
* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)
* [Data Catalog Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)

<a id="prerequisites_helpers"></a>
## 2.2. Helpers
This section provides a helper function that is used across the notebook to prepare arguments for the magic commands. It is particularly useful when you want to pass Python variables as arguments to the Spark magic commands.

In [4]:
import json


def prepare_command(command: dict) -> str:
    """Convert a command dictionary to a single-quoted JSON string for a magic command."""
    return f"'{json.dumps(command)}'"
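
To see what the helper produces, the following sketch repeats its definition and prints the result for a small dictionary. The dictionary values are hypothetical and do not come from a real tenancy.

```python
import json


def prepare_command(command: dict) -> str:
    """Serialize a dict to JSON and wrap it in single quotes for a magic command."""
    return f"'{json.dumps(command)}'"


# Hypothetical example values, used only to show the output format.
example = prepare_command({"displayName": "demo", "numExecutors": 2})
print(example)  # '{"displayName": "demo", "numExecutors": 2}'

# Stripping the outer single quotes gives back valid JSON.
round_trip = json.loads(example[1:-1])
print(round_trip["numExecutors"])  # 2
```

Because the JSON is wrapped in single quotes, a value that itself contains a single quote would probably need extra escaping before it is passed to a magic command.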

<a id="prerequisites_authentication"></a>
## 2.3. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.<br> 
To set up authentication, use ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```.

In [5]:
import ads

ads.set_auth("resource_principal")  # Supported values: resource_principal, api_key
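
If the same notebook is run both inside a notebook session and on a local machine, the choice between the two modes can be made at run time. The sketch below is one possible approach, not the method ADS prescribes. The helper `choose_auth_mode` is hypothetical, and it assumes that the environment variable `OCI_RESOURCE_PRINCIPAL_VERSION` is set wherever a resource principal is available.

```python
import os
from typing import Mapping, Optional


def choose_auth_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an ADS auth mode from the environment (hypothetical helper).

    Assumption: OCI_RESOURCE_PRINCIPAL_VERSION is present only where a
    resource principal can be used, for example in a notebook session.
    """
    env = os.environ if env is None else env
    if env.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
        return "resource_principal"
    return "api_key"


mode = choose_auth_mode()
print(mode)

# In the notebook, the result would then be passed to ADS:
# import ads
# ads.set_auth(mode)
```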

<a id="prerequisites_variables"></a>
## 2.4. Variables
To run this notebook, you must provide some information about your tenancy configuration. To connect to the Hive metastore, replace `<metastore_id>` with the OCID of the Hive metastore. Connecting to the metastore is optional.  

To create and run a Data Flow session, you must specify a `<compartment_id>`, a `<metastore_id>`, a `<logs_bucket_uri>` bucket for storing logs, and a `<custom_conda_environment_uri>`. These resources must be in the same compartment.

In [6]:
compartment_id = "<compartment_id>"
metastore_id = "<metastore_id>"
logs_bucket_uri = "<logs_bucket_uri>"

custom_conda_environment_uri = "oci://service-conda-packs-fs@bigdatadatasciencelarge/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda"
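
A placeholder that is left unreplaced only fails later, when the session is created. As a guard, a small check along the following lines can be run first. This is a minimal sketch: the helper `find_unset` and the example metastore value are hypothetical.

```python
import re
from typing import Dict, List

# Matches values that are still a placeholder such as "<compartment_id>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_unset(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings whose value is still a <placeholder>."""
    return [name for name, value in settings.items() if PLACEHOLDER.match(value.strip())]


settings = {
    "compartment_id": "<compartment_id>",
    "metastore_id": "ocid1.datacatalogmetastore.oc1..example",  # hypothetical value
    "logs_bucket_uri": "<logs_bucket_uri>",
}
missing = find_unset(settings)
print(missing)  # ['compartment_id', 'logs_bucket_uri']
```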

<a id="dataflow_magic"></a>
# 3. Data Flow Spark Magic
Data Flow Spark Magic commands allow you to interactively work with Data Flow Spark clusters (sessions) in Jupyter notebooks through the Livy REST API. It provides a set of Jupyter Notebook cell magic commands to turn Jupyter into an integrated Spark development environment for remote clusters. 

**Data Flow Magic allows you to:**

* Run Spark code against Data Flow remote Spark cluster
* Create a Data Flow Spark Session with SparkContext and HiveContext against Data Flow remote Spark cluster
* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. matplotlib)

<a id="load_extension"></a>
### 3.1. Load Spark Magic Commands and Getting Help
Data Flow Spark Magic is a JupyterLab extension that you need to activate in your notebook using the `%load_ext dataflow.magics` magic command.<br>
After the extension is activated, the `%help` command can be used to get the list of supported commands.

In [7]:
%load_ext dataflow.magics

<a id="create_session"></a>
### 3.2. Create Session
To create a new Data Flow cluster session, use the `%create_session` magic command.

In [8]:
command = prepare_command(
    {
        "compartmentId": compartment_id,
        "displayName": "spark_session_via_notebook",
        "language": "PYTHON",
        "sparkVersion": "3.2.1",
        "numExecutors": 8,
        "metastoreId": metastore_id,
        "driverShape": "VM.Standard2.1",
        "executorShape": "VM.Standard2.1",
        "driverShapeConfig": {"ocpus": 2, "memoryInGBs": 16},
        "executorShapeConfig": {"ocpus": 2, "memoryInGBs": 16},
        "type": "SESSION",
        "logsBucketUri": logs_bucket_uri,
        "configuration": {
            "spark.archives": custom_conda_environment_uri,
            "fs.oci.client.hostname": "https://objectstorage.us-ashburn-1.oraclecloud.com"
        },
    }
)

%create_session -l python -c $command

Setting up the Cluster..


Cluster is ready..
Starting Spark application..


| Session ID | Kind | State | Current session |
|---|---|---|---|
| ocid1.dataflowapplication.oc1.iad.anuwcljsnif7xwia5uvy54rp5ybm2u2va6sg2azmpmtsw4i7s2wpqy3thj3a | pyspark | IN_PROGRESS | Dataflow Run |


SparkSession available as 'spark'.
SparkContext available as 'sc'.


In [9]:
%%spark
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

import ads
from ads.feature_store.entity import Entity
from ads.feature_store.feature_group import FeatureGroup
from ads.feature_store.feature_group_expectation import ExpectationType
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.input_feature_detail import FeatureDetail, FeatureType
from ads.feature_store.statistics_config import StatisticsConfig
from ads.feature_store.transformation import TransformationMode
import os

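# Set up authentication for the feature store operations.
# The service_endpoint below belongs to one specific deployment; replace it with the endpoint of your own feature store API.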
ads.set_auth(auth="resource_principal", client_kwargs={"service_endpoint": "https://pac7vnpvfa2xkagazweggatqwy.apigateway.us-ashburn-1.oci.customer-oci.com/20230101"})

# Variables
compartment_id = "<compartment_id>"
metastore_id = "<metastore_id>"

<a id="data_exploration"></a>
### 3.3. Data exploration

In [10]:
%%spark
# Parquet files store their own schema, so CSV-style options such as header and inferSchema are not needed.
df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet")
df_nyc_tlc = df_nyc_tlc.select("vendor_id", "pickup_at", "dropoff_at")

df_nyc_tlc.show()

+---------+-------------------+-------------------+
|vendor_id|          pickup_at|         dropoff_at|
+---------+-------------------+-------------------+
|      CMT|2011-01-29 02:38:35|2011-01-29 02:47:07|
|      CMT|2011-01-28 10:38:19|2011-01-28 10:42:18|
|      CMT|2011-01-28 23:49:58|2011-01-28 23:57:44|
|      CMT|2011-01-28 23:52:09|2011-01-28 23:59:21|
|      CMT|2011-01-28 10:34:39|2011-01-28 11:25:50|
|      CMT|2011-01-28 23:50:00|2011-01-28 23:58:11|
|      CMT|2011-01-29 02:38:48|2011-01-29 02:50:37|
|      CMT|2011-01-29 02:41:16|2011-01-29 02:45:45|
|      CMT|2011-01-28 23:50:51|2011-01-29 00:07:55|
|      CMT|2011-01-29 02:41:34|2011-01-29 03:08:14|
|      CMT|2011-01-28 23:50:22|2011-01-29 00:03:23|
|      CMT|2011-01-29 02:40:30|2011-01-29 02:43:08|
|      CMT|2011-01-29 02:42:47|2011-01-29 02:50:31|
|      CMT|2011-01-28 23:51:10|2011-01-29 00:03:19|
|      CMT|2011-01-28 05:07:16|2011-01-28 05:12:25|
|      CMT|2011-01-29 02:42:31|2011-01-29 02:55:56|
|      CMT|2
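
The year component of the read path, `201[1,2,3,4,5,6,7,8]`, is a glob character class that selects the directories for 2011 through 2018. The sketch below uses Python's `fnmatch` module as an approximation of the glob matching applied to the path. It is an assumption that the two behave the same for this pattern, and the candidate directory names are generated for illustration only.

```python
from fnmatch import fnmatchcase

# Year component of the path used in the read above.
YEAR_PATTERN = "201[1,2,3,4,5,6,7,8]"

# Hypothetical directory names, generated only to exercise the pattern.
candidates = [str(year) for year in range(2009, 2021)]
matched = [name for name in candidates if fnmatchcase(name, YEAR_PATTERN)]
print(matched)  # ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018']

# A range expresses the same set of years more compactly.
matched_range = [name for name in candidates if fnmatchcase(name, "201[1-8]")]
print(matched == matched_range)  # True
```

The commas inside the brackets are treated as members of the character class rather than as separators, so the range form `201[1-8]` is the tighter way to write the same selection.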

<a id="load_featuregroup"></a>
### 3.4. Create feature store logical entities

<a id="create_feature_store"></a>
#### 3.4.1 Creation of Feature Store
The feature store is the top-level entity of the feature store service.

In [11]:
%%spark
feature_store_resource = (
    FeatureStore()
    .with_description("Feature Store Description")
    .with_compartment_id(compartment_id)
    .with_display_name("FeatureStore")
    .with_offline_config(metastore_id=metastore_id)
)

feature_store = feature_store_resource.create()
feature_store


kind: featurestore
spec:
  compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a
  description: Feature Store Description
  displayName: FeatureStore
  id: 8893420628AB925DBEF259F660862F31
  offlineConfig:
    metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaanif7xwiaavhd2liaebamr3tbjzio3uw2lxuteoa5ejsfvhqufbsa
type: featureStore

<a id="create_entity"></a>
#### 3.4.2 Creation of Entity
An entity is a group of semantically related features.

In [12]:
%%spark
entity = feature_store.create_entity()
entity


kind: entity
spec:
  compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a
  featureStoreId: 8893420628AB925DBEF259F660862F31
  id: 5748B756C5CEE21176FCCDFDB64FA08F
  name: entity_resource-sticky-salmon-2023-07-14-05:46.01
type: entity

<a id="create_feature_group"></a>
#### 3.4.3 Creation of Feature group
A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource.

In [13]:
%%spark

# Initialize Expectation Suite
expectation_suite_trans = ExpectationSuite(expectation_suite_name="feature_definition")
expectation_suite_trans.add_expectation(
    ExpectationConfiguration(
        expectation_type="EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL",
        kwargs={"column": "vendor_id"}
    )
)

stats_config = StatisticsConfig().with_is_enabled(False)

feature_group = entity.create_feature_group(
    primary_keys=["vendor_id"],
    schema_details_dataframe=df_nyc_tlc,  # infer the schema from the data frame
    expectation_suite=expectation_suite_trans,
    expectation_type=ExpectationType.LENIENT,
    statistics_config=stats_config,
    name="feature_group_big_data",
)

feature_group


kind: FeatureGroup
spec:
  compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a
  entityId: 5748B756C5CEE21176FCCDFDB64FA08F
  expectationDetails:
    createRuleDetails:
    - arguments:
        column: vendor_id
      levelType: ERROR
      name: Rule-0
      ruleType: EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL
    expectationType: LENIENT
    name: feature_definition
    validationEngineType: GREAT_EXPECTATIONS
  featureStoreId: 8893420628AB925DBEF259F660862F31
  id: 6BAC94626CABC8944E7C29F5D9C8FC5E
  inputFeatureDetails:
  - featureType: STRING
    name: vendor_id
    orderNumber: 1
  - featureType: TIMESTAMP
    name: pickup_at
    orderNumber: 2
  - featureType: TIMESTAMP
    name: dropoff_at
    orderNumber: 3
  isInferSchema: false
  name: feature_group_big_data
  primaryKeys:
    items:
    - name: vendor_id
  statisticsConfig:
    isEnabled: false
type: featureGroup
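
The rule `EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL` attached to the feature group checks that `vendor_id` contains no nulls. The sketch below shows the idea in plain Python, without Great Expectations or ADS. The helpers `count_nulls` and `validate_not_null` are hypothetical, and the `strict` flag reflects an assumption about the expectation types: a strict expectation fails the job, whereas a lenient one, as configured above, only records the result.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def count_nulls(rows: List[Row], column: str) -> int:
    """Count the rows in which the given column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def validate_not_null(rows: List[Row], column: str, strict: bool) -> Dict[str, object]:
    """Hypothetical not-null check: raise in strict mode, report in lenient mode."""
    nulls = count_nulls(rows, column)
    success = nulls == 0
    if strict and not success:
        raise ValueError(f"{nulls} null value(s) found in column '{column}'")
    return {"column": column, "null_count": nulls, "success": success}


rows = [
    {"vendor_id": "CMT", "pickup_at": "2011-01-29 02:38:35"},
    {"vendor_id": None, "pickup_at": "2011-01-28 10:38:19"},
]

# Lenient mode: the failure is reported and processing continues.
report = validate_not_null(rows, "vendor_id", strict=False)
print(report)  # {'column': 'vendor_id', 'null_count': 1, 'success': False}
```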

<a id="materialise_feature_store"></a>
#### 3.4.4 Materialisation of Feature group

In [14]:
%%spark
# Parquet files store their own schema, so CSV-style options such as header and inferSchema are not needed.
df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet")
df_nyc_tlc = df_nyc_tlc.select("vendor_id", "pickup_at", "dropoff_at").limit(1000)

feature_group.materialise(df_nyc_tlc)

Calculating Metrics: 100%|##########| 8/8 [01:04<00:00,  8.12s/it]

<a id="query_feature_group"></a>
#### 3.4.5 Feature group Querying

In [15]:
%%spark
feature_group.select().show()

+---------+-------------------+-------------------+
|vendor_id|          pickup_at|         dropoff_at|
+---------+-------------------+-------------------+
|      VTS|2011-02-27 04:00:00|2011-02-27 04:14:00|
|      VTS|2011-02-27 20:38:00|2011-02-27 20:46:00|
|      VTS|2011-02-27 17:47:00|2011-02-27 17:58:00|
|      VTS|2011-02-26 19:56:00|2011-02-26 20:04:00|
|      VTS|2011-02-23 13:05:00|2011-02-23 13:10:00|
|      VTS|2011-02-27 03:48:00|2011-02-27 04:01:00|
|      VTS|2011-02-27 17:52:00|2011-02-27 18:02:00|
|      VTS|2011-02-27 00:44:00|2011-02-27 01:04:00|
|      VTS|2011-02-27 04:08:00|2011-02-27 04:22:00|
|      VTS|2011-02-27 11:53:00|2011-02-27 12:05:00|
+---------+-------------------+-------------------+
only showing top 10 rows

In [16]:
%%spark
feature_group.select(["vendor_id", "pickup_at"]).show()

+---------+-------------------+
|vendor_id|          pickup_at|
+---------+-------------------+
|      VTS|2011-02-27 04:00:00|
|      VTS|2011-02-27 20:38:00|
|      VTS|2011-02-27 17:47:00|
|      VTS|2011-02-26 19:56:00|
|      VTS|2011-02-23 13:05:00|
|      VTS|2011-02-27 03:48:00|
|      VTS|2011-02-27 17:52:00|
|      VTS|2011-02-27 00:44:00|
|      VTS|2011-02-27 04:08:00|
|      VTS|2011-02-27 11:53:00|
+---------+-------------------+
only showing top 10 rows

In [17]:
%%spark
feature_group.filter(feature_group.vendor_id == "CMT").show()

+---------+---------+----------+
|vendor_id|pickup_at|dropoff_at|
+---------+---------+----------+
+---------+---------+----------+

<a id='ref'></a>
# 4. References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)