In [None]:
# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.

!odsc conda install --uri https://objectstorage.us-ashburn-1.oraclecloud.com/p/qnzzHQPGQYghdyH206yDk25MZH1FaMGdNNhKUl74BhRsW4muvFyGViKIqpxgnxI3/n/ociodscdev/b/ads_conda_pack_builds/o/PySpark_3/teamcity_20230512_084146_38972446/f227145b7ee5fc1c73a69ebaa671b81e/PySpark_3.2_and_Feature_Store.tar.gz

Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Feature store quickstart</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:
---
Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training-serving skew all lead to increased model development time and worse model performance. A feature store is well positioned to solve many of these problems because it provides a centralised way to transform and access data at both training and serving time, and it helps define a standardised pipeline for ingesting and querying data.

## Contents:

- <a href="#concepts">1. Introduction</a>
- <a href='#pre-requisites'>2. Pre-requisites</a>
    - <a href='#policies'>2.1 Policies</a>
    - <a href='#prerequisites_authentication'>2.2 Authentication</a>
    - <a href='#prerequisites_variables'>2.3 Variables</a>
- <a href='#featurestore_overview'>3. Feature store quickstart using APIs</a>
    - <a href='#create_featurestore'>3.1. Create feature store</a>
    - <a href='#create_entity'>3.2. Create business entity in feature store</a>
    - <a href='#create_featuregroup'>3.3. Create feature group and upload data to feature group</a>
    - <a href='#query_featuregroup'>3.4. Query feature group</a>
    - <a href='#create_dataset'>3.5. Create dataset from multiple or one feature group</a>
    - <a href='#query_dataset'>3.6. Query dataset</a>
- <a href='#featurestore_yaml'>4. Feature store quickstart using YAML</a>
- <a href='#ref'>5. References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials under your agreement with Oracle.

This [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset license is used in this notebook.

---

<a id="concepts"></a>
# 1. Introduction

Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.

The following are some key terms that will help you understand OCI Data Science Feature Store:


* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer id '2536' can be called one feature vector.
* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.
* **Entity**: An entity is a group of semantically related features.
* **Datasource**: Features are engineered from raw data stored in various data sources (for example, object storage, Oracle Database, or Oracle MySQL).
* **Feature Group**: A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource.
* **Dataset**: Datasets are created from features stored in the feature store service and are used to train models and to perform online model inference.
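
To make the terms above concrete, here is a minimal plain-Python sketch. It does not use the feature store API; the sample rows, the `customer_id` key, the feature names, and the `feature_vector` helper are all hypothetical and exist only to illustrate what a feature vector is.

```python
# Hypothetical rows of feature data keyed by "customer_id" (illustrative values only).
rows = [
    {"customer_id": "2536", "age": 41, "avg_basket": 23.5, "visits_30d": 4},
    {"customer_id": "9001", "age": 29, "avg_basket": 11.0, "visits_30d": 9},
]


def feature_vector(rows, key_name, key_value, features=None):
    """Return the feature values for one primary key, optionally a subset."""
    for row in rows:
        if row[key_name] == key_value:
            # Default to every column except the key itself.
            names = features if features is not None else [n for n in row if n != key_name]
            return {name: row[name] for name in names}
    raise KeyError(f"no row with {key_name}={key_value!r}")


# All features of customer '2536' form one feature vector ...
full_vector = feature_vector(rows, "customer_id", "2536")
# ... and so does any subset of them.
subset_vector = feature_vector(rows, "customer_id", "2536", ["age", "visits_30d"])
```

In these terms, each column such as `age` is a feature, and the customer that the columns describe plays the role of the entity.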

<a id='pre-requisites'></a>
# 2. Pre-requisites

Notebook Sessions are accessible through the following conda environment: 

* **PySpark 3.2 and Feature store 1.0 (pyspark32_p38_cpu_feature_store_v1)**

You can customize `pyspark32_p38_cpu_feature_store_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. 

<a id='policies'></a>
### 2.1. Policies
This section covers the creation of dynamic groups and policies needed to use the service.

* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)
* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)

<a id="prerequisites_authentication"></a>
### 2.2. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook Spark cluster.<br> 
To set up authentication, use ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```. 

In [None]:
import ads
ads.set_auth(auth="api_key", client_kwargs={"service_endpoint": "http://localhost:21000/20230101"})

<a id="prerequisites_variables"></a>
### 2.3. Variables
To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `<compartment_id>` and a `<metastore_id>`. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository for understanding tables backed by files on object storage, and the id of the Hive Metastore is tied to the feature store construct of the feature store service.

In [4]:
import os

compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
metastore_id = "<metastore_id>"
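
Because `os.environ.get` returns `None` when a variable is unset, and a forgotten placeholder such as `<metastore_id>` is still a valid string, it can help to check both values before creating any resources. The following is a minimal sketch and not part of ADS; the `missing_settings` helper and the example dictionary are hypothetical.

```python
import re

# Matches a value that is still an unfilled "<...>" placeholder.
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def missing_settings(settings):
    """Return the names of settings that are unset or still placeholders."""
    missing = []
    for name, value in settings.items():
        if not value or _PLACEHOLDER.match(value):
            missing.append(name)
    return missing


# Example values only; in this notebook you would pass compartment_id and metastore_id.
example = {
    "compartment_id": None,            # environment variable not set
    "metastore_id": "<metastore_id>",  # placeholder left unchanged
    "database_name": "production",     # filled in
}
unresolved = missing_settings(example)
```

In the notebook, a call such as `missing_settings({"compartment_id": compartment_id, "metastore_id": metastore_id})` should return an empty list before you continue.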

<a id="featurestore_overview"></a>
# 3. Feature store quickstart using APIs
By default, the **PySpark 3.2, Feature store and Data Flow** conda environment includes the pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deequ](https://github.com/awslabs/deequ) libraries. In the ADS feature store module, you can use either the Python programmatic interface or the YAML interface to define feature store entities. The following section describes how to create feature store entities using the programmatic interface.

In [6]:
import pandas as pd 
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.dataset import Dataset

<a id="create_featurestore"></a>
### 3.1 Create feature store
A feature store is the top-level construct that provides logical segregation of resources.

In [None]:
feature_store_resource = FeatureStore().\
    with_description("Data consisting of bike riders data").\
    with_compartment_id(compartment_id).\
    with_display_name("Bike rides").\
    with_offline_config(metastore_id=metastore_id)

In [None]:
feature_store = feature_store_resource.create()

<a id="create_entity"></a>
### 3.2 Create entity
An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities include customer, product, transaction, review, image, and document.

In [None]:
entity = feature_store.create_entity(
    display_name="Bike rides",
    description="description for bike riders"
)

<a id="create_featuregroup"></a>
### 3.3 Create feature group
A feature group is the code that contains instructions on the ingestion of raw data and the computation of the feature values. This [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset license is used in this notebook.

In [None]:
from ads.feature_store.feature_group import FeatureGroup

In [None]:
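# Instantiate FeatureGroup first, as with FeatureStore() and Dataset(); the with_* builders are called on an instance.
feature_group = FeatureGroup().with_primary_keys(["ride_id"])\
    .with_display_name("city_bike_feature_group")\
    .with_entity_id(entity.id)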

In [None]:
profiles_df = pd.read_csv("/data/biketrips/JC-201901-citibike-tripdata.csv")
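
Because the feature group above declares `ride_id` as its primary key, it may be worth confirming that the column exists and holds unique, non-empty values before materialising. This notebook does not state whether the service enforces this, so treat the following as an optional pre-check sketched in plain Python; the `primary_key_problems` helper and the sample rows are hypothetical.

```python
from collections import Counter


def primary_key_problems(rows, key_columns):
    """Return a list of human-readable problems with the key columns."""
    if not rows:
        return ["no rows"]
    missing = [c for c in key_columns if c not in rows[0]]
    if missing:
        return [f"missing key column(s): {', '.join(missing)}"]

    problems = []
    # Count how often each (possibly composite) key value occurs.
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    duplicates = [key for key, n in counts.items() if n > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated key value(s)")
    if any(value is None or value == "" for key in counts for value in key):
        problems.append("empty key value(s)")
    return problems


# Illustrative rows only; "A2" appears twice on purpose.
sample = [
    {"ride_id": "A1", "rideable_type": "docked_bike"},
    {"ride_id": "A2", "rideable_type": "electric_bike"},
    {"ride_id": "A2", "rideable_type": "docked_bike"},
]
problems = primary_key_problems(sample, ["ride_id"])
```

With the DataFrame loaded above, the rows could be obtained with `profiles_df.to_dict("records")`.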

In [None]:
feature_group.materialise(profiles_df)

<a id="query_featuregroup"></a>
### 3.4 Query feature group
The feature store provides a DataFrame API to ingest data into the feature store. You can also retrieve feature data as a DataFrame, which can either be used directly to train models or materialized to files for later use in training.

In [None]:
query = feature_group.select(["ride_id","rideable_type"]) 
query.show()

<a id="create_dataset"></a>
### 3.5 Create dataset
A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference.

In [None]:
dataset_resource = Dataset()\
    .with_description("Dataset consisting of a subset of features in feature group: bike riders")\
    .with_compartment_id(compartment_id)\
    .with_name("Bike riders dataset")\
    .with_entity_id(entity.id)\
    .with_feature_store_id(feature_store.id)\
    .with_query(query)

In [None]:
dataset = dataset_resource.create()

In [None]:
dataset.materialise()
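
Conceptually, a dataset built from more than one feature group is a join of their rows on a shared key. The plain-Python sketch below illustrates that idea only. It does not use the feature store API, and the second group, its `duration_min` column, the sample values, and the `join_on_key` helper are hypothetical.

```python
def join_on_key(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the columns of both rows; the key column is identical in each.
            joined.append({**row, **match})
    return joined


# Two illustrative "feature groups" that share the ride_id key.
rides = [
    {"ride_id": "A1", "rideable_type": "docked_bike"},
    {"ride_id": "A2", "rideable_type": "electric_bike"},
]
durations = [
    {"ride_id": "A1", "duration_min": 12},
    {"ride_id": "A3", "duration_min": 30},
]

# Only ride "A1" appears in both groups, so only it survives the inner join.
training_rows = join_on_key(rides, durations, "ride_id")
```

An inner join is used here for simplicity; which join semantics the service applies is not covered in this notebook.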

<a id="query_dataset"></a>
### 3.6 Query dataset
The feature store provides a DataFrame API to ingest data into the feature store. You can also retrieve feature data as a DataFrame, which can either be used directly to train models or materialized to files for later use in training.

In [None]:
query = dataset.select(["ride_id","rideable_type"]) 
query.show()

<a id="featurestore_yaml"></a>
# 4. Feature store quickstart using YAML
In the ADS feature store module, you can use either the Python programmatic interface or YAML to define feature store entities. The following section describes how to create feature store entities using YAML as an interface.

In [None]:
from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar

<a id='ref'></a>
# 5. References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)