diff --git a/notebook_examples/feature_store_querying.ipynb b/notebook_examples/feature_store_querying.ipynb
new file mode 100644
index 00000000..776f9a52
--- /dev/null
+++ b/notebook_examples/feature_store_querying.ipynb
@@ -0,0 +1,1518 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "id": "5ff263c5",
+ "metadata": {},
+ "source": [
+ "qweews@notebook{feature_store-querying.ipynb,\n",
+ " title: Feature store handling querying operations\n",
+ " summary: Using feature store to transform, store and query your data using pandas like interface to query and join\n",
+ " developed_on: fspyspark32_p38_cpu_v1,\n",
+ " keywords: feature store, querying,\n",
+ " license: Universal Permissive License v 1.0\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03f57fba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!odsc conda install -s fspyspark32_p38_cpu_v1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb694927",
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2023-05-24T08:26:08.572567Z",
+ "start_time": "2023-05-24T08:26:08.328013Z"
+ },
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n",
+ "!pip install --pre --no-deps oracle-ads==2.9.0rc0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08c20e45",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "Oracle Data Science service sample notebook.\n",
+ "\n",
+ "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n",
+ "\n",
+ "***\n",
+ "\n",
+ "# Feature store handling querying operations\n",
+ "
by the Oracle Cloud Infrastructure Data Science Service.
\n",
+ "\n",
+ "---\n",
+ "# Overview:\n",
+ "---\n",
+ "Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training serving skew all lead to increased model development time and poor model performance. Feature store can be used to solve many of the problems becuase it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data.This notebook demonstrates how to use feature store using a notebook spark session.\n",
+ "\n",
+ "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n",
+ "\n",
+ "## Contents:\n",
+ "\n",
+ "- 1. Introduction\n",
+ "- 2. Pre-requisites to Running this Notebook\n",
+ " - 2.1. Setup\n",
+ " - 2.2 Policies\n",
+ " - 2.3 Authentication\n",
+ " - 2.4 Variables\n",
+ "- 3. Feature store querying\n",
+ " - 3.1. Exploration of data in feature store\n",
+ " - 3.2. Create feature store logical entities\n",
+ " - 3.3. Explore feature groups\n",
+ " - 3.4. Select subset of features\n",
+ " - 3.5. Filter feature groups\n",
+ " - 3.6. Apply joins on feature groups\n",
+ " - 3.7. Create dataset from multiple or one feature group\n",
+ " - 3.8. Free form sql query\n",
+ " - 3.9. Feature store Entities using YAML\n",
+ "- 4. References\n",
+ "\n",
+ "---\n",
+ "\n",
+ "**Important:**\n",
+ "\n",
+ "Placeholder text for required values are surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name to `database_name = \"\"` would become `database_name = \"production\"`.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c794835",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 1. Introduction\n",
+ "\n",
+ "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n",
+ "\n",
+ "Review the following key terms to understand the Data Science feature store:\n",
+ "\n",
+ "\n",
+ "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n",
+ "\n",
+ "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n",
+ "\n",
+ "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n",
+ "\n",
+ "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n",
+ "\n",
+ "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation results and statistics results.\n",
+ "\n",
+ "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n",
+ "\n",
+ "* **Dataset Job**: dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9cadd7f",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 2. Pre-requisites to Running this Notebook\n",
+ "\n",
+ "Notebook Sessions are accessible through the following conda environment: \n",
+ "\n",
+ "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n",
+ "\n",
+ "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45568cac",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 2.1. Setup\n",
+ "\n",
+ "\n",
+ "### `spark-defaults.conf`\n",
+ "\n",
+ "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually. However, the `odsc data-catalog config` commandline tool is ideal for setting up the file because it gathers information about your environment, and uses that to build the file.\n",
+ "\n",
+ "The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is needed because settings have default values, or they take values from your notebook session environment. Following are common parameters that you may need to override.\n",
+ "\n",
+ "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is sent with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n",
+ "\n",
+ "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.\n",
+ "\n",
+ "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where to write the file.\n",
+ "\n",
+ "You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --authentication resource_principal --metastore \n",
+ "```\n",
+ "For more assistance, use the following command in a terminal window:\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --help\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b83d4381",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 2.2. Policies\n",
+ "This section covers the creation of dynamic groups and policies needed to use the service.\n",
+ "\n",
+ "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)\n",
+ "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)\n",
+ "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n",
+ "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2c8084e8",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 2.3. Authentication\n",
+ "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n",
+ "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "419755bd",
+ "metadata": {
+ "ExecuteTime": {
+ "start_time": "2023-05-24T08:26:08.577504Z"
+ },
+ "is_executing": true,
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import ads\n",
+ "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "869ab9aa",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 2.4. Variables\n",
+ "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` which is the OCID of the Data Catalog metastore."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "622e7cd2",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n",
+ "metastore_id = \"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6b93dd8",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 3. Feature store querying\n",
+ "By default the **PySpark 3.2 and Feature Store Python 3.8** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) library. In an ADS feature store module, you can either use the Python programmatic or YAML interface to define feature store entities. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join or filter DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1c7e9f28",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import warnings\n",
+ "warnings.filterwarnings(\"ignore\", message=\"iteritems is deprecated\")\n",
+ "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a617d39a",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from ads.feature_store.feature_store import FeatureStore\n",
+ "from ads.feature_store.feature_group import FeatureGroup\n",
+ "from ads.feature_store.model_details import ModelDetails\n",
+ "from ads.feature_store.dataset import Dataset\n",
+ "from ads.feature_store.common.enums import DatasetIngestionMode\n",
+ "\n",
+ "from ads.feature_store.feature_group_expectation import ExpectationType\n",
+ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n",
+ "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5473731e",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.1. Exploration of data in feature store"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "120e216e",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n",
+ "flights_df = flights_df.head(100)\n",
+ "flights_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6bc0ed52",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")\n",
+ "airports_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0ab07769",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n",
+ "airlines_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e339416",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.2. Create feature store logical entities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "99671727",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.2.1 Feature Store\n",
+ "Feature store is the top level entity for feature store service\n",
+ "\n",
+ "Call the ```.create()``` method of the Feature store instance to create a feature store."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3a210b6a",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store_resource = (\n",
+ " FeatureStore().\n",
+ " with_description(\"Data consisting of flights\").\n",
+ " with_compartment_id(compartment_id).\n",
+ " with_display_name(\"flights details\").\n",
+ " with_offline_config(metastore_id=metastore_id)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1757874f",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true,
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store = feature_store_resource.create()\n",
+ "feature_store"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c4206e4f",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "#### 3.2.2 Entity\n",
+ "An entity is a group of semantically related features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3ab9d52e",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "entity = feature_store.create_entity(\n",
+ " display_name=\"Flight details2\",\n",
+ " description=\"description for flight details\"\n",
+ ")\n",
+ "entity"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7290a601",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "#### 3.2.3 Feature group\n",
+ "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f6660b7e",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "##### Flights Feature Group\n",
+ "\n",
+ "Create feature group for flights\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cbeed679",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights = (\n",
+ " FeatureGroup()\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_primary_keys([\"FLIGHT_NUMBER\"])\n",
+ " .with_name(\"flights_feature_group\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_schema_details_from_dataframe(flights_df)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa811b52",
+ "metadata": {
+ "collapsed": false,
+ "jupyter": {
+ "outputs_hidden": false
+ },
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.create()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e0da6f0d",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8da122b1",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.materialise(flights_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67016236",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "##### Airport Feature Group\n",
+ "\n",
+ "Create feature group for airport"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5fabe3f2",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "expectation_suite_airports = ExpectationSuite(\n",
+ " expectation_suite_name=\"test_airports_df\"\n",
+ ")\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_not_be_null\",\n",
+ " kwargs={\"column\": \"IATA_CODE\"},\n",
+ " )\n",
+ ")\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_be_between\",\n",
+ " kwargs={\"column\": \"LATITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_be_between\",\n",
+ " kwargs={\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99294b1c",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports = (\n",
+ " FeatureGroup()\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_primary_keys([\"IATA_CODE\"])\n",
+ " .with_name(\"airport_feature_group\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_schema_details_from_dataframe(airports_df)\n",
+ " .with_expectation_suite(\n",
+ " expectation_suite=expectation_suite_airports,\n",
+ " expectation_type=ExpectationType.LENIENT,\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "91301811",
+ "metadata": {
+ "collapsed": false,
+ "jupyter": {
+ "outputs_hidden": false
+ },
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.create()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "767ce780",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.materialise(airports_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1fce8a51",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "15ac3504",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "##### Airlines Feature Group\n",
+ "\n",
+ "Create feature group for airlines\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "192ad5cd",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a95c8833",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "expectation_suite_airlines = ExpectationSuite(\n",
+ " expectation_suite_name=\"test_airlines_df\"\n",
+ ")\n",
+ "expectation_suite_airlines.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_not_be_null\",\n",
+ " kwargs={\"column\": \"IATA_CODE\"},\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0976a401",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines = (\n",
+ " FeatureGroup()\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_primary_keys([\"IATA_CODE\"])\n",
+ " .with_name(\"airlines_feature_group\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_schema_details_from_dataframe(airlines_df)\n",
+ " .with_expectation_suite(\n",
+ " expectation_suite=expectation_suite_airlines,\n",
+ " expectation_type=ExpectationType.STRICT,\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b9d7ef56",
+ "metadata": {
+ "collapsed": false,
+ "jupyter": {
+ "outputs_hidden": false
+ },
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.create()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b926675d",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.materialise(airlines_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "707c723d",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "683f2b6f",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.3. Explore feature groups"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d06d4dbc",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.get_features_df()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5fb0ed95",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.get_features_df()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "edd5c063",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.get_features_df()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "647a3818",
+ "metadata": {},
+ "source": [
+ "You can retrieve feature data in a DataFrame, that can either be used to train models."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3484d5af",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.select().show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "53111598",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.select().show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1947ac2b",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.select().show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5ff4ffc6",
+ "metadata": {},
+ "source": [
+ "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific ingestion job.You can use `to_pandas()` or `to_json()` to view the statistics.\n",
+ "You can visualize feature statistics with `to_viz()`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "30d69581",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.get_statistics().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "02bec075",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_flights.get_statistics().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e31148f6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.get_statistics().to_viz()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11f3a879",
+ "metadata": {},
+ "source": [
+ "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results for a specific ingestion job."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d382ff25",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.get_validation_output().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "442d4462",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.get_validation_output().to_summary()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3ded350",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.4. Select subset of features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e11b7184",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.select(['IATA_CODE']).show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16561536",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.5. Filter feature groups"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1a251d97",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airlines.filter(feature_group_airlines.IATA_CODE == \"EV\").show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2944b0e7",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.6. Apply joins on feature groups\n",
+ "As in Pandas, if the feature has the same name on both feature groups, then you can use the `on=[]` paramter. If they have different names, then you can use the `left_on=[]` and `right_on=[]` paramters:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56bedaff",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from ads.feature_store.common.enums import JoinType\n",
+ "\n",
+ "query = (\n",
+ " feature_group_flights.select()\n",
+ " .join(feature_group_airlines.select(), left_on=['ORIGIN_AIRPORT'], right_on=['IATA_CODE'], join_type=JoinType.LEFT)\n",
+ " .join(feature_group_airports.select(), left_on=['AIRLINE'], right_on=['IATA_CODE'], join_type=JoinType.LEFT)\n",
+ ")\n",
+ "query.show(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5d77bcde",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "query.to_string()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b018652b",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.7. Create dataset from multiple or one feature group\n",
+ "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference.\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1df69889",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset = (\n",
+ " Dataset()\n",
+ " .with_description(\"Combined dataset for flights\")\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_name(\"flights_dataset\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_query(query.to_string())\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5bb3d9ff",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "##### Create Dataset\n",
+ "\n",
+ "Call the ```.create()``` method of the Dataset instance to create a dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b5dd4e45",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset.create()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc64c019",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset.materialise()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fabd82e",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "#### Interoperability with model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14b04a7c",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "model_details = ModelDetails().with_items([\"ocid1.modelcatalog.oc1.unique_ocid\"])\n",
+ "dataset.add_models(model_details)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2305f112",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### Visualise lineage\n",
+ "\n",
+ "Use the ```.show()``` method on the Dataset instance to visualize the lineage of the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bad948c4",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f3125aa8",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset.profile().show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "840a47fa",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset.as_of(version_number=0).show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b3ed236",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.8. Freeform SQL query\n",
+ "Feature store provides a way to query feature store using free flow query. User need to mention `entity id` as the database name and `feature group name` as the table name to query feature store. This functionality can be useful if you need to express more complex queries for your use case"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5d38518f",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "entity_id = entity.id\n",
+ "\n",
+ "sql = (f\"SELECT flights_feature_group.*, airport_feature_group.IATA_CODE \"\n",
+ " f\"FROM `{entity_id}`.flights_feature_group flights_feature_group \"\n",
+ " f\"LEFT JOIN `{entity_id}`.airport_feature_group airport_feature_group \"\n",
+ " f\"ON flights_feature_group.ORIGIN_AIRPORT=airport_feature_group.IATA_CODE\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "425b900b",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store.sql(sql).show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1962f56e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.9. Feature store Entities using YAML\n",
+ "In an ADS feature store module, you can either use the Python programmatic interface or YAML to define feature store entities. Below section describes how to create feature store entities using YAML as an interface."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2734866a",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store_yaml = \"\"\"\n",
+ "apiVersion: v1\n",
+ "kind: featureStore\n",
+ "spec:\n",
+ " displayName: Flights feature store\n",
+ " compartmentId: \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n",
+ " offlineConfig:\n",
+ " metastoreId: \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"\n",
+ "\n",
+ " entity: &flights_entity\n",
+ " - kind: entity\n",
+ " spec:\n",
+ " name: Flights\n",
+ "\n",
+ " featureGroup:\n",
+ " - kind: featureGroup\n",
+ " spec:\n",
+ " entity: *flights_entity\n",
+ " name: flights_feature_group\n",
+ " primaryKeys:\n",
+ " - IATA_CODE\n",
+ " inputFeatureDetails:\n",
+ " - featureType: STRING\n",
+ " name: IATA_CODE\n",
+ " orderNumber: 1\n",
+ " - featureType: STRING\n",
+ " name: AIRPORT\n",
+ " orderNumber: 2\n",
+ " - featureType: STRING\n",
+ " name: CITY\n",
+ " orderNumber: 3\n",
+ " - featureType: STRING\n",
+ " name: STATE\n",
+ " orderNumber: 4\n",
+ " - featureType: STRING\n",
+ " name: COUNTRY\n",
+ " orderNumber: 5\n",
+ " - featureType: FLOAT\n",
+ " name: LATITUDE\n",
+ " orderNumber: 6\n",
+ " - featureType: FLOAT\n",
+ " name: LONGITUDE\n",
+ " orderNumber: 7\n",
+ " - kind: featureGroup\n",
+ " spec:\n",
+ " entity: *flights_entity\n",
+ " name: airlines_feature_group\n",
+ " primaryKeys:\n",
+ " - IATA_CODE\n",
+ " inputFeatureDetails:\n",
+ " - featureType: STRING\n",
+ " name: IATA_CODE\n",
+ " orderNumber: 1\n",
+ " - featureType: STRING\n",
+ " name: AIRPORT\n",
+ " orderNumber: 2\n",
+ " - featureType: STRING\n",
+ " name: CITY\n",
+ " orderNumber: 3\n",
+ " - featureType: STRING\n",
+ " name: STATE\n",
+ " orderNumber: 4\n",
+ " - featureType: STRING\n",
+ " name: COUNTRY\n",
+ " orderNumber: 5\n",
+ " - featureType: FLOAT\n",
+ " name: LATITUDE\n",
+ " orderNumber: 6\n",
+ " - featureType: FLOAT\n",
+ " name: LONGITUDE\n",
+ " orderNumber: 7\n",
+ "\n",
+ " - kind: featureGroup\n",
+ " spec:\n",
+ " entity: *flights_entity\n",
+ " name: airport_feature_group\n",
+ " primaryKeys:\n",
+ " - IATA_CODE\n",
+ " inputFeatureDetails:\n",
+ " - featureType: STRING\n",
+ " name: IATA_CODE\n",
+ " orderNumber: 1\n",
+ " - featureType: STRING\n",
+ " name: AIRLINE\n",
+ " orderNumber: 2\n",
+ " dataset:\n",
+ " - kind: dataset\n",
+ " spec:\n",
+ " name: flights_dataset\n",
+ " entity: *flights_entity\n",
+ " description: \"Dataset for flights\"\n",
+ " query: 'SELECT flight.IATA_CODE, flight.AIRPORT FROM flights_feature_group flight'\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b988a15a",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "registrar = FeatureStoreRegistrar.from_yaml(yaml_string=feature_store_yaml)\n",
+ "registrar.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fee36b0",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 4. References\n",
+ "\n",
+ "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n",
+ "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n",
+ "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n",
+ "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n",
+ "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a9c7006c",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]",
+ "language": "python",
+ "name": "conda-env-fspyspark32_p38_cpu_v1-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.17"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebook_examples/feature_store_quickstart.ipynb b/notebook_examples/feature_store_quickstart.ipynb
new file mode 100644
index 00000000..26db2c76
--- /dev/null
+++ b/notebook_examples/feature_store_quickstart.ipynb
@@ -0,0 +1,870 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "id": "63f5fcad",
+ "metadata": {},
+ "source": [
+ "@notebook{feature_store-quickstart.ipynb,\n",
+ " title: Using feature store for feature ingestion and feature querying,\n",
+ " summary: Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying,\n",
+ " developed_on: fspyspark32_p38_cpu_v1,\n",
+ " keywords: feature store,\n",
+ " license: Universal Permissive License v 1.0\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4664bc7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!odsc conda install -s fspyspark32_p38_cpu_v1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "60881fb2",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.\n",
+ "!pip install --pre --no-deps oracle-ads==2.9.0rc0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "526f6c48",
+ "metadata": {},
+ "source": [
+ "Oracle Data Science service sample notebook.\n",
+ "\n",
+ "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n",
+ "\n",
+ "***\n",
+ "\n",
+ "# Feature store quickstart\n",
+ "by the Oracle Cloud Infrastructure Data Science Service.
\n",
+ "\n",
+ "---\n",
+ "# Overview:\n",
+ "---\n",
+ "Managing many datasets, datasources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all lead to increased model development time and worse model performance. Feature store can be used to solve many of the problems becuase it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data.This notebook demonstrates how to use feature store using a notebook spark session.\n",
+ "\n",
+ "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n",
+ "\n",
+ "## Contents:\n",
+ "\n",
+ "- 1. Introduction\n",
+ "- 2. Pre-requisites to Running this Notebook\n",
+ " - 2.1 Setup\n",
+ " - 2.2 Policies\n",
+ " - 2.3 Authentication\n",
+ " - 2.4 Variables\n",
+ "- 3. Feature store quickstart using APIs\n",
+ " - 3.1 Exploration of data\n",
+ " - 3.2 Create feature store logical entities\n",
+ " - 3.2.1 Feature store\n",
+ " - 3.2.2 Entity\n",
+ " - 3.2.3 Transformation\n",
+ " - 3.2.4 Feature group \n",
+ " - 3.3 Explore feature groups\n",
+ " - 3.4 Create dataset\n",
+ " - 3.3 Explore dataset\n",
+ " - 4. Feature store quickstart using YAML\n",
+ " - 5. References\n",
+ "\n",
+ "---\n",
+ "\n",
+ "**Important:**\n",
+ "\n",
+ "Placeholder text for required values are surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name to `database_name = \"\"` would become `database_name = \"production\"`.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.\n",
+ "\n",
+ "`Citibike` dataset is used in this notebook.You can access the citibike dataset license [here](https://ride.citibikenyc.com/data-sharing-policy)\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce2026c1",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 1. Introduction\n",
+ "\n",
+ "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n",
+ "\n",
+ "Review the following key terms to understand the Data Science feature store:\n",
+ "\n",
+ "\n",
+ "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n",
+ "\n",
+ "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n",
+ "\n",
+ "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n",
+ "\n",
+ "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n",
+ "\n",
+ "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation results and statistics results.\n",
+ "\n",
+ "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n",
+ "\n",
+ "* **Dataset Job**: dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4c99a09",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 2. Pre-requisites to Running this Notebook\n",
+ "\n",
+ "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n",
+ "\n",
+ "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "55a6b373",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.1. Setup\n",
+ "\n",
+ "To set up the environment, a `spark-defaults.conf` must be configured. Data Catalog Metastore id must also be provided.\n",
+ "\n",
+ "\n",
+ "### `spark-defaults.conf`\n",
+ "\n",
+ "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually. However, the `odsc data-catalog config` commandline tool is ideal for setting up the file because it gathers information about your environment, and uses that to build the file.\n",
+ "\n",
+ "The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is needed because settings have default values, or they take values from your notebook session environment. Following are common parameters that you may need to override.\n",
+ "\n",
+ "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is sent with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n",
+ "\n",
+ "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.\n",
+ "\n",
+ "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where to write the file.\n",
+ "\n",
+ "You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --authentication resource_principal --metastore \n",
+ "```\n",
+ "For more assistance, use the following command in a terminal window:\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --help\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31411ccd",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.2. Policies\n",
+ "This section covers the creation of dynamic groups and policies needed to use the service.\n",
+ "\n",
+ "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n",
+ "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a9c2f3f8",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.3. Authentication\n",
+ "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n",
+ "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dae3ada6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import ads\n",
+ "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"http://{api_gateway}/20230101\"})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e05054be",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.4. Variables\n",
+ "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` which is the OCID of the Data Catalog metastore. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage and the metastore id of hive metastore is tied to feature store construct of feature store service."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "42eb13d1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n",
+ "metastore_id = \"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a322c822",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 3. Feature store quick start using APIs\n",
+ "By default the **PySpark 3.2 and Feature Store Python 3.8** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) library. In an ADS feature store module, you can either use the Python programmatic or YAML interface to define feature store entities. Below section describes how to create feature store entities using programmatic interface."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8ff205bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd \n",
+ "from ads.feature_store.feature_store import FeatureStore\n",
+ "from ads.feature_store.dataset import Dataset\n",
+ "from ads.feature_store.feature_group import FeatureGroup\n",
+ "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar\n",
+ "from ads.feature_store.common.enums import ExpectationType\n",
+ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n",
+ "from ads.feature_store.transformation import TransformationMode"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f30f4edd",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.1 Exploration of data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "43343910",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bike_df = pd.read_csv(\"data/201901-citibike-tripdata.csv\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "71ff424a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bike_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0ade4f83",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bike_df.columns = bike_df.columns.str.replace(' ', '')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1af3d7cc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.2. Create feature store logical entities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7397f58c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.2.1 Feature store\n",
+ "\n",
+ "Feature store is the top level entity for feature store service.\n",
+ "Call the ```.create()``` method of the Feature store instance to create a feature store."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "655e04c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_store_resource = (\n",
+ " FeatureStore().\n",
+ " with_description(\"Data consisting of bike riders data\").\n",
+ " with_compartment_id(compartment_id).\n",
+ " with_display_name(\"Bike rides\").\n",
+ " with_offline_config(metastore_id=metastore_id)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fbc492ff",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_store = feature_store_resource.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "48a349dc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.2.2 Entity\n",
+ "An entity is a group of semantically related features. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "51ed55b0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "entity = feature_store.create_entity(\n",
+ " display_name=\"Bike rides\",\n",
+ " description=\"description for bike riders\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7635cfca",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.2.3 Transformation\n",
+ "Transformations in a feature store refers to the operations and processes applied to raw data to create, modify or derive new features that can be used as inputs for ML Models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c507286d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def is_round_trip(bike_df):\n",
+ " bike_df['roundtrip'] = bike_df['startstationid'] == bike_df['endstationid']\n",
+ " return bike_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c181827c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "transformation = feature_store.create_transformation(\n",
+ " transformation_mode=TransformationMode.PANDAS,\n",
+ " source_code_func=is_round_trip,\n",
+ " display_name=\"is_round_trip\",\n",
+ ")\n",
+ "transformation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5c917e7c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.2.4 Feature group\n",
+ "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5345aa39",
+ "metadata": {},
+ "source": [
+ "\n",
+ "##### 3.2.4.1 Associate Expectation Suite\n",
+ "Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model.Feature store allows you to define expectation on the data which is being materialised into feature group and dataset.This is achieved using open source library Great Expectations.\n",
+ "\n",
+ "An Expectation is a verifiable assertion about your data. You can define expectation as below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "babc39c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "expectation_suite = ExpectationSuite(expectation_suite_name=\"feature_definition\")\n",
+ "expectation_suite.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_not_be_null\",\n",
+ " kwargs={\"column\": \"stoptime\"}\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bfcf8653",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike = (\n",
+ " FeatureGroup()\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_primary_keys([\"bikeid\"])\n",
+ " .with_name(\"bike_feature_group\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_schema_details_from_dataframe(bike_df)\n",
+ " .with_expectation_suite(expectation_suite, ExpectationType.LENIENT)\n",
+ " .with_transformation_id(transformation.id)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3fe51b5e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28f95654",
+ "metadata": {},
+ "source": [
+ "\n",
+ "To persist the feature group and save feature data along with the metadata in the feature store, call the `materialise()` method with data frame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a68d2c02",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.materialise(bike_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a723bc8f",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.3. Explore feature groups"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f012acae",
+ "metadata": {},
+ "source": [
+ "You can retrieve feature data in a DataFrame, that can either be used directly to train models or materialized to file(s) for later use to train models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e6e9516e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = feature_group_bike.select() \n",
+ "query.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "23b9704c",
+ "metadata": {},
+ "source": [
+ "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific ingestion job.You can use `to_pandas()` or `to_json()` to view the statistics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0be0b698",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.get_statistics().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "086e9f8a",
+ "metadata": {},
+ "source": [
+ "You can visualize feature statistics with `to_viz()`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4e3c9a53",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.get_statistics().to_viz()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "63f9d642",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.get_statistics().to_viz([\"birthyear\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36ed80f5",
+ "metadata": {},
+ "source": [
+ "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results for a specific ingestion job."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b6b9b759",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.get_validation_output().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "83fd0852",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.get_validation_output().to_summary()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f3eb0dd",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### Visualise lineage\n",
+ "\n",
+ "Use the ```.show()``` method on the FeatureGroup instance to visualize the lineage of the featuregroup."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca36cc7b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_bike.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9151b303",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.4 Create dataset\n",
+ "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bbf4fd15",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query.to_string()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9957c20e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset_resource = (\n",
+ " Dataset()\n",
+ " .with_description(\"Dataset consisting of a subset of features in feature group: bike riders\")\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_name(\"bike_riders_dataset\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_query(query.to_string())\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "677a8061",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset = dataset_resource.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b9fca33",
+ "metadata": {},
+ "source": [
+ "You can call the `materialise()` method of the Dataset instance to load the data to dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d7a6a34",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.materialise()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5c77773c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.5 Explore dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c1832b28",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.as_of(version_number=0).show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "de6c4045",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.get_statistics().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b6da96a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.get_statistics().to_viz()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0f4dcc2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.profile().show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6419d55",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### Visualise lineage\n",
+ "\n",
+ "Use the ```.show()``` method on the Dataset instance to visualize the lineage of the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ee24e1d8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9b9b5cce",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 4. Feature store quickstart using YAML\n",
+ "In an ADS feature store module, you can either use the Python programmatic interface or YAML to define feature store entities. Below section describes how to create feature store entities using YAML as an interface."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7479c28",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_store_yaml = \"\"\"\n",
+ "apiVersion: v1\n",
+ "kind: featureStore\n",
+ "spec:\n",
+ " displayName: Bike feature store\n",
+ " compartmentId: \n",
+ " offlineConfig:\n",
+ " metastoreId: \n",
+ "\n",
+ " entity: &bike_entity\n",
+ " - kind: entity\n",
+ " spec:\n",
+ " name: Bike rides\n",
+ "\n",
+ " featureGroup:\n",
+ " - kind: featureGroup\n",
+ " spec:\n",
+ " entity: *bike_entity\n",
+ " name: bike_feature_group\n",
+ " primaryKeys:\n",
+ " - bikeid\n",
+ " inputFeatureDetails:\n",
+ " - name: \"bikeid\"\n",
+ " featureType: \"INTEGER\"\n",
+ " orderNumber: 1\n",
+ " cast: \"STRING\"\n",
+ " - name: \"endstationlongitude\"\n",
+ " featureType: \"FLOAT\"\n",
+ " orderNumber: 2\n",
+ " cast: \"STRING\"\n",
+ " - name: \"tripduration\"\n",
+ " featureType: \"INTEGER\"\n",
+ " orderNumber: 3\n",
+ " cast: \"STRING\"\n",
+ "\n",
+ " dataset:\n",
+ " - kind: dataset\n",
+ " spec:\n",
+ " name: bike_dataset\n",
+ " entity: *bike_entity\n",
+ " description: \"Dataset for bike\"\n",
+ " query: 'SELECT bike.bikeid, bike.endstationlongitude FROM bike_feature_group bike'\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ebdbb40e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "registrar = FeatureStoreRegistrar.from_yaml(yaml_string=feature_store_yaml)\n",
+ "registrar.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bc2818c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# References\n",
+ "\n",
+ "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n",
+ "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n",
+ "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n",
+ "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n",
+ "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff4d2ad3",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]",
+ "language": "python",
+ "name": "conda-env-fspyspark32_p38_cpu_v1-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.17"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebook_examples/feature_store_schema_evolution.ipynb b/notebook_examples/feature_store_schema_evolution.ipynb
new file mode 100644
index 00000000..6fd1c115
--- /dev/null
+++ b/notebook_examples/feature_store_schema_evolution.ipynb
@@ -0,0 +1,781 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "id": "12ce2509",
+ "metadata": {},
+ "source": [
+ "qweews@notebook{feature_store_schema_evolution.ipynb,\n",
+ " title: Schema Enforcement and Schema Evolution in Feature Store,\n",
+ " summary: Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data.,\n",
+ " developed_on: fspyspark32_p38_cpu_v1,\n",
+ " keywords: feature store, querying ,schema enforcement,schema evolution\n",
+ " license: Universal Permissive License v 1.0\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "59b6b678",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!odsc conda install -s fspyspark32_p38_cpu_v1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "77341f7e",
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2023-05-24T08:26:08.572567Z",
+ "start_time": "2023-05-24T08:26:08.328013Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n",
+ "!pip install --pre --no-deps oracle-ads==2.9.0rc0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eafaf892",
+ "metadata": {},
+ "source": [
+ "Oracle Data Science service sample notebook.\n",
+ "\n",
+ "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n",
+ "\n",
+ "***\n",
+ "\n",
+ "# Schema enforcement and schema evolution\n",
+ "by the Oracle Cloud Infrastructure Data Science Service.
\n",
+ "\n",
+ "---\n",
+ "# Overview:\n",
+ "---\n",
+ "Managing many datasets, data sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training serving skew all lead to increased model development time and poor model performance. Feature store can be used to solve many of the problems becuase it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data. This notebook shows how schema enforcement and schema evolution are carried out in Feature Store\n",
+ "\n",
+ "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
\n",
+ "\n",
+ "## Contents:\n",
+ "\n",
+ "- 1. Introduction\n",
+ "- 2. Pre-requisites to Running this Notebook\n",
+ " - 2.1. Setup\n",
+ " - 2.2. Policies\n",
+ " - 2.3. Authentication\n",
+ " - 2.4. Variables\n",
+ "- 3. Schema enforcement and schema evolution\n",
+ " - 3.1. Exploration of data in feature store\n",
+ " - 3.2. Create feature store logical entities\n",
+ " - 3.3. Schema enforcement\n",
+ " - 3.4. Schema evolution\n",
+ " - 3.5. Ingestion Modes\n",
+ " - 3.5.1. Append\n",
+ " - 3.5.2. Overwrite\n",
+ " - 3.5.3. Upsert\n",
+ " - 3.6. Viewing Feature Group History\n",
+ " - 3.7. Time travel Queries on Feature Group \n",
+ "- 4. References\n",
+ "\n",
+ "---\n",
+ "\n",
+ "**Important:**\n",
+ "\n",
+ "Placeholder text for required values are surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name to `database_name = \"\"` would become `database_name = \"production\"`.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2df44476",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 1. Introduction\n",
+ "\n",
+ "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n",
+ "\n",
+ "Review the following key terms to understand the Data Science feature store:\n",
+ "\n",
+ "\n",
+ "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n",
+ "\n",
+ "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n",
+ "\n",
+ "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or, an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n",
+ "\n",
+ "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n",
+ "\n",
+ "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation and statistics results.\n",
+ "\n",
+ "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n",
+ "\n",
+ "* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation and statistics results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c76e31af",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 2. Pre-requisites to Running this Notebook\n",
+ "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n",
+ "\n",
+ "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1233c93e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.1. Setup\n",
+ "\n",
+ "\n",
+ "### `spark-defaults.conf`\n",
+ "\n",
+ "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually. However, the `odsc data-catalog config` commandline tool is ideal for setting up the file because it gathers information about your environment, and uses that to build the file.\n",
+ "\n",
+ "The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is needed because settings have default values, or they take values from your notebook session environment. Following are common parameters that you may need to override.\n",
+ "\n",
+ "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is sent with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n",
+ "\n",
+ "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.\n",
+ "\n",
+ "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where to write the file.\n",
+ "\n",
+ "You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --authentication resource_principal --metastore \n",
+ "```\n",
+ "For more assistance, use the following command in a terminal window:\n",
+ "\n",
+ "```bash\n",
+ "odsc data-catalog config --help\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24965d4e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.2. Policies\n",
+ "This section covers the creation of dynamic groups and policies needed to use the service.\n",
+ "\n",
+ "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n",
+ "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)\n",
+ "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n",
+ "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c74885f6",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.3. Authentication\n",
+ "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n",
+ "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "24963829",
+ "metadata": {
+ "ExecuteTime": {
+ "start_time": "2023-05-24T08:26:08.577504Z"
+ },
+ "is_executing": true,
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import ads\n",
+ "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0c5e0fd",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 2.4. Variables\n",
+ "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` for offline feature store."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "edaf733c",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n",
+ "metastore_id = \"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c9c2e7c8",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 3. Schema enforcement and schema evolution\n",
+ "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html).Schema enforcement is a Delta Lake feature that prevents you from appending data with a different schema to a table.To change a table's current schema and to accommodate data that is changing over time,schema evolution feature is used while performing an append or overwrite operation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75d9beed",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from ads.feature_store.feature_store import FeatureStore\n",
+ "from ads.feature_store.feature_group import FeatureGroup\n",
+ "from ads.feature_store.model_details import ModelDetails\n",
+ "from ads.feature_store.dataset import Dataset\n",
+ "from ads.feature_store.common.enums import DatasetIngestionMode\n",
+ "\n",
+ "from ads.feature_store.feature_group_expectation import ExpectationType\n",
+ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n",
+ "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ff53923",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.1. Exploration of data in feature store\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f43e2ef0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n",
+ "flights_df = flights_df.head(100)\n",
+ "flights_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "82430a0d",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE']\n",
+ "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n",
+ "airports_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "143c3b29",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n",
+ "airlines_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "00083134",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.2. Create feature store logical entities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f99ae87",
+ "metadata": {},
+ "source": [
+ "#### 3.2.1. Feature Store\n",
+ "Feature store is the top level entity for feature store service"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca0b8bfd",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store_resource = (\n",
+ " FeatureStore().\n",
+ " with_description(\"Data consisting of flights\").\n",
+ " with_compartment_id(compartment_id).\n",
+ " with_display_name(\"flights details\").\n",
+ " with_offline_config(metastore_id=metastore_id)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9704fa85",
+ "metadata": {},
+ "source": [
+ "\n",
+ "##### Create Feature Store\n",
+ "\n",
+ "Call the ```.create()``` method of the Feature store instance to create a feature store."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "de4b205d",
+ "metadata": {
+ "pycharm": {
+ "is_executing": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_store = feature_store_resource.create()\n",
+ "feature_store"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "473f6677",
+ "metadata": {},
+ "source": [
+ "#### 3.2.2. Entity\n",
+ "An entity is a group of semantically related features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3dcf22bf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "entity = feature_store.create_entity(\n",
+ " display_name=\"Flight details schema evolution/enforcement\",\n",
+ " description=\"description for flight details\"\n",
+ ")\n",
+ "entity"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80c9c3be",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.2.3. Feature Group\n",
+ "\n",
+ "Create feature group for airport"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "970161e6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n",
+ "\n",
+ "expectation_suite_airports = ExpectationSuite(\n",
+ " expectation_suite_name=\"test_airports_df\"\n",
+ ")\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_not_be_null\",\n",
+ " kwargs={\"column\": \"IATA_CODE\"},\n",
+ " )\n",
+ ")\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_be_between\",\n",
+ " kwargs={\"column\": \"LATITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "expectation_suite_airports.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"expect_column_values_to_be_between\",\n",
+ " kwargs={\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bc323dd5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports = (\n",
+ " FeatureGroup()\n",
+ " .with_feature_store_id(feature_store.id)\n",
+ " .with_primary_keys([\"IATA_CODE\"])\n",
+ " .with_name(\"airport_feature_group\")\n",
+ " .with_entity_id(entity.id)\n",
+ " .with_compartment_id(compartment_id)\n",
+ " .with_schema_details_from_dataframe(airports_df)\n",
+ " .with_expectation_suite(\n",
+ " expectation_suite=expectation_suite_airports,\n",
+ " expectation_type=ExpectationType.LENIENT,\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2e437522",
+ "metadata": {
+ "collapsed": false,
+ "jupyter": {
+ "outputs_hidden": false
+ }
+ },
+ "outputs": [],
+ "source": [
+ "feature_group_airports.create()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3fc98501",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ec22e95c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.materialise(airports_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed7c012e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.3. Schema enforcement\n",
+ "\n",
+ "Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. For example, a front desk manager at a busy restaurant that only accepts reservations, the schema enforcement checks to see whether each column in the data inserted into the table is in the list of expected columns. Meaning each one has a \"reservation\", and rejects any writes with columns that aren't on the list."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eef566cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE', 'COUNTRY']\n",
+ "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n",
+ "airports_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d620cedf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.with_schema_details_from_dataframe(airports_df)\n",
+ "feature_group_airports.update()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c62e82c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.materialise(airports_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e8cbab63",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.4. Schema evolution\n",
+ "\n",
+ "Schema evolution allows you to change a table's current schema to accommodate data that is changing over time. Typically, it's used when performing an append or overwrite operation to automatically adapt the schema to include one or more new columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d69f3378",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from ads.feature_store.feature_option_details import FeatureOptionDetails\n",
+ "feature_option_details = FeatureOptionDetails().with_feature_option_write_config_details(merge_schema=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "794597c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.materialise(\n",
+ " input_dataframe=airports_df,\n",
+ " feature_option_details=feature_option_details\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bcaa552c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4eca757",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.5. Ingestion modes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ae4c0e5",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.5.1. Append\n",
+ "\n",
+ "In ``append`` mode, new data is added to the existing table. If the table already exists, the new data is appended to it, extending the dataset. This mode is suitable for scenarios where you want to continuously add new records without modifying or deleting existing data. It preserves the existing data and only appends the new data to the end of the table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "24acf8e6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from ads.feature_store.feature_group_job import IngestionMode\n",
+ "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.APPEND)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "91ed98b3",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.5.2. Overwrite\n",
+ "In ``overwrite`` mode, the existing table is replaced entirely with the new data being saved. If the table already exists, it is dropped and a new table is created with the new data. This mode is useful when you want to completely refresh the data in the table with the latest data and discard all previous records."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "690a6136",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from ads.feature_store.feature_group_job import IngestionMode\n",
+ "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.OVERWRITE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1866795c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "#### 3.5.3. Upsert\n",
+ "``Upsert`` mode (merge mode) is used to update existing records in the table based on a primary key or a specified condition. If a record with the same key exists, it is updated with the new data. Otherwise, a new record is inserted. This mode is useful for maintaining and synchronizing data between the source and destination tables while avoiding duplicates."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b2ddd858",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from ads.feature_store.feature_group_job import IngestionMode\n",
+ "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.UPSERT)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5495e3cf",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.6. Viewing Feature Group History\n",
+ "You can call the ``history()`` method of the FeatureGroup instance to show history of the feature group."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "feb1762f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.history().toPandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1fb57d2c",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### 3.7. Time travel Queries on Feature Group\n",
+ "\n",
+ "You can call the ``as_of()`` method of the FeatureGroup instance to get specified point in time and time traveled data.\n",
+ "The ``.as_of()`` method takes the following optional parameter:\n",
+ "\n",
+ "- commit_timestamp: date-time. Commit timestamp for feature group\n",
+ "- version_number: int. Version number for feature group"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1ec4cc00",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.as_of(version_number = 0).show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4ce013ab",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_group_airports.as_of(version_number = 1).show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1abcc338",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# 4. References\n",
+ "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n",
+ "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n",
+ "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n",
+ "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n",
+ "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1157840c",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]",
+ "language": "python",
+ "name": "conda-env-fspyspark32_p38_cpu_v1-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.17"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebook_examples/feature_store_spark_magic.ipynb b/notebook_examples/feature_store_spark_magic.ipynb
new file mode 100644
index 00000000..98d2cb40
--- /dev/null
+++ b/notebook_examples/feature_store_spark_magic.ipynb
@@ -0,0 +1,687 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "id": "8ce4f16c",
+ "metadata": {},
+ "source": [
+ "qweews@notebook{feature_store_spark_magic.ipynb,\n",
+ " title: Data Flow Studio : Big Data Operations in Feature Store.,\n",
+ " summary: Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster.,\n",
+ " developed_on: fspyspark32_p38_cpu_v1,\n",
+ " keywords: feature store, querying,spark magic,data flow\n",
+ " license: Universal Permissive License v 1.0\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "55da3909",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!odsc conda install -s fspyspark32_p38_cpu_v1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5c24e9f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n",
+ "!pip install --pre --no-deps oracle-ads==2.9.0rc0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fe5598bd",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "Oracle Data Science service sample notebook.\n",
+ "\n",
+ "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n",
+ "***\n",
+ "\n",
+ "# Data Flow Studio: Big Data Operations in Feature Store\n",
+ "by the Oracle Cloud Infrastructure Data Science Service.
\n",
+ "\n",
+ "---\n",
+ "# Overview:\n",
+ "\n",
+ "This notebook demonstrates how to run Feature Store on interactive Spark workloads on a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters using Livy (a Spark REST server) in Jupyter notebooks. Data Flow Spark Magic includes a set of magic commands for interactively running Spark code.\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Contents:\n",
+ "\n",
+ "- 1. Introduction\n",
+ "- 2. 2. Pre-requisites to Running this Notebook\n",
+ " - 2.1 Policies\n",
+ " - 2.2 Helpers\n",
+ " - 2.3 Authentication\n",
+ " - 2.4 Variables\n",
+ "- 3. Data Flow Spark Magic\n",
+ " - 3.1. Load Spark Magic Commands and Getting Help\n",
+ " - 3.2. Create DataFlow Session\n",
+ " - 3.3. Data exploration\n",
+ " - 3.4. Create Feature Store Logical Entities\n",
+ " - 3.4.1 Creating a feature store\n",
+ " - 3.4.2 Creating an entity\n",
+ " - 3.4.3 Creating a feature group\n",
+ " - 3.4.4 Materialising a Feature Group\n",
+ " - 3.4.5 Querying a Feature group\n",
+ "- 4. References\n",
+ "\n",
+ "---\n",
+ "\n",
+ "\n",
+ "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cd84d936",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 1. Introduction\n",
+ "\n",
+ "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n",
+ "\n",
+ "Review the following key terms to understand the Data Science feature store:\n",
+ "\n",
+ "\n",
+ "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n",
+ "\n",
+ "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n",
+ "\n",
+ "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or, an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n",
+ "\n",
+ "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n",
+ "\n",
+ "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation and statistics results.\n",
+ "\n",
+ "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n",
+ "\n",
+ "* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation and statistics results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "76acf33b",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 2. Pre-requisites to Running this Notebook\n",
+ "\n",
+ "Data Flow Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n",
+ "\n",
+ "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore id of hive metastore is tied to feature store construct of feature store service."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1e2b6a1",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "## 2.1. Policies\n",
+ "This section covers the creation of dynamic groups and policies needed to use the service.\n",
+ "\n",
+ "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)\n",
+ "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n",
+ "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n",
+ "* [Data Catalog Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2207bfb3",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "## 2.2 Helpers\n",
+ "This helper method is used across the notebook to prepare arguments for the magic commands. This function is particularly useful when you want to pass Python variables as arguments to the spark magic commands."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "32894857",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "\n",
+ "\n",
+ "def prepare_command(command: dict) -> str:\n",
+ " \"\"\"Converts dictionary command to the string formatted commands.\"\"\"\n",
+ " return f\"'{json.dumps(command)}'\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4b610391",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "## 2.3. Authentication\n",
+ "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.
\n",
+ "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. For example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ae1080f3",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import ads\n",
+ "\n",
+ "ads.set_auth(\"resource_principal\") # Supported values: resource_principal, api_key"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3aa57661",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "## 2.4. Variables\n",
+ "To run this notebook, you must provide some information about your tenancy configuration. To connect to the HIVE metastore, replace `` with the OCID for the HIVE metastore.\n",
+ "\n",
+ "To create and run a Data Flow session, you must specify a ``, ``, bucket `` and `` for storing logs. These resources must be in the same compartment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7157113d",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n",
+ "metastore_id = \"\"\n",
+ "logs_bucket_uri = \"\"\n",
+ "\n",
+ "custom_conda_environment_uri = \"oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "426be51d",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 3. Data Flow Spark Magic\n",
+ "Data Flow Spark Magic commands allow you to interactively work with Data Flow Spark clusters (sessions) in Jupyter notebooks using the Livy REST API. The commands provide a set of Jupyter notebook cell magic commands to turn Jupyter into an integrated Spark development environment for remote clusters. \n",
+ "\n",
+ "**Data Flow Spark Magic allows you to:**\n",
+ "\n",
+ "* Run Spark code against a Data Flow remote Spark cluster.\n",
+ "* Create a Data Flow Spark session with SparkContext and HiveContext against Data Flow remote Spark cluster.\n",
+ "* Capture the output of Spark queries as a local Pandas dataframe to interact with other Python libraries (such as matplotlib)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1f403f7",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.1. Load Spark Magic Commands and Getting Help\n",
+ "Data Flow Spark Magic is a JupyterLab extension that you need to activate in your notebook using the `%load_ext dataflow.magics` magic command.
\n",
+ "After the extension is activated, you can use the `%help` command to view the list of supported commands."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d61b5fa",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%load_ext dataflow.magics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec076494",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.2. Create DataFlow Session.\n",
+ "Create a new Data Flow cluster session using the `%create_session` magic command."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1aba2243",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "command = prepare_command(\n",
+ " {\n",
+ " \"compartmentId\": compartment_id,\n",
+ " \"displayName\": \"spark_session_via_notebook\",\n",
+ " \"language\": \"PYTHON\",\n",
+ " \"sparkVersion\": \"3.2.1\",\n",
+ " \"numExecutors\": 8,\n",
+ " \"metastoreId\": metastore_id,\n",
+ " \"driverShape\": \"VM.Standard2.1\",\n",
+ " \"executorShape\": \"VM.Standard2.1\",\n",
+ " \"driverShapeConfig\": {\"ocpus\": 2, \"memoryInGBs\": 16},\n",
+ " \"executorShapeConfig\": {\"ocpus\": 2, \"memoryInGBs\": 16},\n",
+ " \"type\": \"SESSION\",\n",
+ " \"logsBucketUri\": logs_bucket_uri,\n",
+ " \"configuration\": {\n",
+ " \"spark.archives\": custom_conda_environment_uri,\n",
+ " \"fs.oci.client.hostname\": \"https://objectstorage.us-ashburn-1.oraclecloud.com\"\n",
+ " },\n",
+ " }\n",
+ ")\n",
+ "\n",
+ "%create_session -l python -c $command"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0aad36e2",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n",
+ "\n",
+ "import ads\n",
+ "from ads.feature_store.entity import Entity\n",
+ "from ads.feature_store.feature_group import FeatureGroup\n",
+ "from ads.feature_store.feature_group_expectation import ExpectationType\n",
+ "from ads.feature_store.feature_store import FeatureStore\n",
+ "from ads.feature_store.input_feature_detail import FeatureDetail, FeatureType\n",
+ "from ads.feature_store.statistics_config import StatisticsConfig\n",
+ "from ads.feature_store.transformation import TransformationMode\n",
+ "import os\n",
+ "\n",
+ "# Set the Authentications for the feature store operations\n",
+ "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})\n",
+ "\n",
+ "# Variables\n",
+ "compartment_id = \"\"\n",
+ "metastore_id = \"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e9fafb3",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.3. Data exploration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0dde877c",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "df_nyc_tlc = spark.read.parquet(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet\", header=False, inferSchema=True)\n",
+ "df_nyc_tlc = df_nyc_tlc.select(\"vendor_id\", \"pickup_at\", \"dropoff_at\")\n",
+ "\n",
+ "df_nyc_tlc.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6180b15c",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "### 3.4. Create feature store logical entities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c54f9744",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.4.1. Creating a Feature Store\n",
+ "Feature store is the top level entity for feature store service"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "73b8d3a4",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "feature_store_resource = FeatureStore(). \\\n",
+ " with_description(\"Feature Store Description\"). \\\n",
+ " with_compartment_id(compartment_id). \\\n",
+ " with_display_name(\"FeatureStore\"). \\\n",
+ " with_offline_config(metastore_id=metastore_id)\n",
+ "\n",
+ "feature_store = feature_store_resource.create()\n",
+ "feature_store"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0569a4f9",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.4.2. Creating an Entity\n",
+ "An entity is a group of semantically related features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b85d5002",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "entity = feature_store.create_entity()\n",
+ "entity"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0029a900",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.4.3. Creating a Feature group\n",
+ "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eab895fe",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "\n",
+ "# Initialize Expectation Suite\n",
+ "expectation_suite_trans = ExpectationSuite(expectation_suite_name=\"feature_definition\")\n",
+ "expectation_suite_trans.add_expectation(\n",
+ " ExpectationConfiguration(\n",
+ " expectation_type=\"EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL\",\n",
+ " kwargs={\"column\": \"vendor_id\"}\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "stats_config = StatisticsConfig().with_is_enabled(False)\n",
+ "\n",
+ "feature_group = entity.create_feature_group(\n",
+ " primary_keys=[\"vendor_id\"],\n",
+ " schema_details_dataframe=df_nyc_tlc, #infer the schema from the data frame\n",
+ " expectation_suite=expectation_suite_trans,\n",
+ " expectation_type=ExpectationType.LENIENT,\n",
+ " statistics_config=stats_config,\n",
+ " name=\"feature_group_big_data\",\n",
+ ")\n",
+ "\n",
+ "feature_group"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "147916a9",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.4.4. Materialising a Feature Group"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1f26caf3",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "import pandas as pd\n",
+ "df_nyc_tlc = spark.read.parquet(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet\", header=False, inferSchema=True)\n",
+ "df_nyc_tlc = df_nyc_tlc.select(\"vendor_id\", \"pickup_at\", \"dropoff_at\").limit(1000)\n",
+ "\n",
+ "feature_group.materialise(df_nyc_tlc)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f39a2317",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "#### 3.4.5. Querying a Feature Group"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ede99da4",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "feature_group.select().show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a32b4b1e",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "feature_group.select([\"vendor_id\", \"pickup_at\"]).show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6de58a22",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%spark\n",
+ "feature_group.filter(feature_group.vendor_id == \"CMT\").show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c3dff2d1",
+ "metadata": {
+ "pycharm": {
+ "name": "#%% md\n"
+ }
+ },
+ "source": [
+ "\n",
+ "# 4. References\n",
+ "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n",
+ "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n",
+ "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n",
+ "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n",
+ "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5f9babbb",
+ "metadata": {
+ "pycharm": {
+ "name": "#%%\n"
+ }
+ },
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]",
+ "language": "python",
+ "name": "conda-env-fspyspark32_p38_cpu_v1-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.17"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebook_examples/index.json b/notebook_examples/index.json
index 6bf854ea..addc5d3a 100644
--- a/notebook_examples/index.json
+++ b/notebook_examples/index.json
@@ -594,5 +594,74 @@
"summary": "Compare training time between CPU and GPU trained models using XGBoost",
"time_created": "2023-03-30T10:01:38",
"title": "XGBoost with RAPIDS"
+ },
+ {
+ "developed_on": "fspyspark32_p38_cpu_v1",
+ "filename": "feature_store_quickstart.ipynb",
+ "keywords": [
+ "pyspark",
+ "featurestore",
+ "machine learning",
+ "feature transformation",
+ "feature storage",
+ "feature validation",
+ "feature statistics"
+ ],
+ "license": "Universal Permissive License v 1.0",
+ "size": 21304,
+ "summary": "Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying",
+ "time_created": "2023-03-29T11:04:51",
+ "title": "Feature Store Quickstart"
+ },
+ {
+ "developed_on": "fspyspark32_p38_cpu_v1",
+ "filename": "feature_store_querying.ipynb",
+ "keywords": [
+ "pyspark",
+ "featurestore",
+ "feature querying",
+ "feature transformation",
+ "feature storage",
+ "feature validation",
+ "feature statistics"
+ ],
+ "license": "Universal Permissive License v 1.0",
+ "size": 21304,
+ "summary": "Explore Feature Store Functionalities.Transform, Store your Data in Feature Store.Query your data using Feature Store using pandas like interface to query and join",
+ "time_created": "2023-03-29T11:04:51",
+ "title": "Feature store handling querying operations"
+ },
+ {
+ "developed_on": "fspyspark32_p38_cpu_v1",
+ "filename": "feature_store_schema_evolution.ipynb",
+ "keywords": [
+ "pyspark",
+ "featurestore",
+ "feature transformation",
+ "feature storage",
+ "schema evolution"
+ ],
+ "license": "Universal Permissive License v 1.0",
+ "size": 21304,
+ "summary": "Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data",
+ "time_created": "2023-03-29T11:04:51",
+ "title": "Schema Enforcement and Schema Evolution in Feature Store"
+ },
+ {
+ "developed_on": "fspyspark32_p38_cpu_v1",
+ "filename": "feature_store_spark_magic.ipynb",
+ "keywords": [
+ "pyspark",
+ "featurestore",
+ "feature transformation",
+ "feature storage",
+ "schema evolution",
+ "data flow"
+ ],
+ "license": "Universal Permissive License v 1.0",
+ "size": 21304,
+ "summary": "Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster",
+ "time_created": "2023-03-29T11:04:51",
+ "title": "Data Flow Studio : Big Data Operations in Feature Store"
}
]
\ No newline at end of file