In [None]:
# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.
!pip install --pre --no-deps oracle-ads==2.9.0rc0

Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Schema enforcement and schema evolution</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:
---
Managing many datasets, data sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training-serving skew all lead to increased model development time and worse model performance. A feature store is well positioned to solve many of these problems because it provides a centralised way to transform and access data at training and serving time, and helps define a standardised pipeline for ingesting and querying data. This notebook shows how schema enforcement and schema evolution are carried out in Feature Store.

Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8

<div>
    <img src="https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/_images/overview-roles.png"  />
</div>

## Contents:

- <a href='#introduction'>1. Introduction</a>
- <a href='#pre_requisites'>2. Pre-requisites</a>
    - <a href='#setup_setup'>2.1 Setup</a>
    - <a href='#policies_'>2.2 Policies</a>
    - <a href='#authentication'>2.3 Authentication</a>
    - <a href='#variables'>2.4 Variables</a>
- <a href='#schema'>3. Schema enforcement and schema evolution</a>
    - <a href='#dataexploration'>3.1. Exploration of data in feature store</a>
    - <a href='#feature_store'>3.2. Create feature store logical entities</a>
    - <a href='#schema_enforcement'>3.3. Schema enforcement</a>
    - <a href='#schema_evolution'>3.4. Schema evolution</a>
    - <a href='#ingestion_modes'>3.5. Ingestion modes</a>
        - <a href='#append'>3.5.1. Append</a>
        - <a href='#overwrite'>3.5.2. Overwrite</a>
        - <a href='#upsert'>3.5.3. Upsert</a>
    - <a href='#history'>3.6. History</a>
    - <a href='#preview'>3.7. as_of</a>
- <a href='#references'>4. References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.
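
As a quick guard against running a cell with a placeholder still in place, a check along the following lines can help. This is a minimal sketch and not part of ADS: the helper name `is_placeholder` and the pattern it uses are assumptions made for illustration.

```python
import re

# Assumption: a placeholder is a single identifier wrapped in angle brackets,
# for example "<metastore_id>".
PLACEHOLDER = re.compile(r"^<[A-Za-z_][A-Za-z0-9_]*>$")


def is_placeholder(value: str) -> bool:
    """Return True if the value still looks like an unreplaced <placeholder>."""
    return bool(PLACEHOLDER.match(value.strip()))


database_name = "<database_name>"  # not filled in yet
print(is_placeholder(database_name))
print(is_placeholder("production"))
```

Calling the helper on each required variable before the first service call makes a forgotten placeholder fail early instead of surfacing as a confusing service error.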

---

<a id='introduction'></a>
# 1. Introduction

Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.

The following are some key terms that will help you understand OCI Data Science Feature Store:


* **Feature Vector**: A set of feature values for any one primary/identifier key. For example, all or a subset of the features of customer id ‘2536’ can be called one feature vector.

* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.

* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities include customer, product, transaction, review, image, and document.

* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.

* **Feature Group Job**: Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.

* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.

* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results.

<a id='pre_requisites'></a>
# 2. Pre-requisites to Running this Notebook
Notebook Sessions are accessible through the following conda environment: 

* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**

You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. 


<a id='setup_setup'></a>
### 2.1. Setup

<a id='setup_spark-defaults'></a>
### `spark-defaults.conf`

The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually, but the `odsc data-catalog config` command line tool is ideal for setting up the file because it gathers information about your environment and uses that to build the file.

The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is needed because settings have default values, or they take values from your notebook session environment. Following are common parameters that you may need to override.

The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is set with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` option isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.

Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.

The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where the file is written.

You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.

```bash
odsc data-catalog config --authentication resource_principal --metastore <metastore_id>
```
For more assistance, use the following command in a terminal window:

```bash
odsc data-catalog config --help
```
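
After running the tool, it can be useful to confirm from Python which properties ended up in the file. The following is a minimal sketch, not an `odsc` or ADS feature: the helper names are hypothetical, and it assumes the common `key<whitespace>value` line layout with `#` comments. The default directory is the one given above.

```python
import os
from pathlib import Path


def spark_defaults_path(env=os.environ) -> Path:
    """Locate spark-defaults.conf via SPARK_CONF_DIR, with the documented default."""
    conf_dir = env.get("SPARK_CONF_DIR", "/home/datascience/spark_conf_dir")
    return Path(conf_dir) / "spark-defaults.conf"


def parse_spark_defaults(text: str) -> dict:
    """Parse whitespace-separated 'key value' lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


path = spark_defaults_path()
if path.exists():
    # Print only the property names, because values may contain credentials.
    for key in sorted(parse_spark_defaults(path.read_text())):
        print(key)
else:
    print(f"{path} not found; run `odsc data-catalog config` first")
```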

<a id='policies_'></a>
### 2.2. Policies
This section covers the creation of dynamic groups and policies needed to use the service.

* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)
* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)
* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)

<a id="authentication"></a>
### 2.3. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.<br>
To set up authentication, use ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```.

In [None]:
import ads
ads.set_auth(auth="resource_principal", client_kwargs={"fs_service_endpoint": "https://{api_gateway}/20230101"})

<a id="variables"></a>
### 2.4. Variables
To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `<compartment_id>` and a `<metastore_id>` for the offline feature store.

In [None]:
import os

compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
metastore_id = "<metastore_id>"

<a id="schema"></a>
# 3. Schema enforcement and schema evolution
By default, the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html). Schema enforcement is a Delta Lake feature that prevents you from appending data with a different schema to a table. To change a table's current schema and accommodate data that changes over time, the schema evolution feature is used while performing an append or overwrite operation.

In [None]:
import pandas as pd
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.feature_group import FeatureGroup
from ads.feature_store.model_details import ModelDetails
from ads.feature_store.dataset import Dataset
from ads.feature_store.common.enums import DatasetIngestionMode

from ads.feature_store.feature_group_expectation import ExpectationType
from great_expectations.core import ExpectationSuite, ExpectationConfiguration
from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar

<a id='dataexploration'></a>
### 3.1. Exploration of data in feature store

<div>
    <img src="https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/_images/feature_store_demo.jpg" width="700" height="350" />
</div>

In [None]:
flights_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]
flights_df = flights_df.head(100)
flights_df.head()

In [None]:
columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE']
airports_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv")[columns]
airports_df.head()

In [None]:
airlines_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv")
airlines_df.head()

<a id="feature_store"></a>
### 3.2. Create feature store logical entities

#### 3.2.1 Feature Store
Feature store is the top-level entity for the feature store service.

In [None]:
feature_store_resource = (
    FeatureStore().
    with_description("Data consisting of flights").
    with_compartment_id(compartment_id).
    with_display_name("flights details").
    with_offline_config(metastore_id=metastore_id)
)

<a id="create_feature_store"></a>
##### Create Feature Store

Call the ```.create()``` method of the Feature store instance to create a feature store.

In [None]:
feature_store = feature_store_resource.create()
feature_store

#### 3.2.2 Entity
An entity is a group of semantically related features.

In [None]:
entity = feature_store.create_entity(
    display_name="Flight details schema evolution/enforcement",
    description="description for flight details"
)
entity

<a id="create_feature_group_airport"></a>
#### 3.2.3 Feature Group

Create a feature group for airports.

In [None]:
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

expectation_suite_airports = ExpectationSuite(
    expectation_suite_name="test_airports_df"
)
expectation_suite_airports.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "IATA_CODE"},
    )
)
expectation_suite_airports.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "LATITUDE", "min_value": -1.0, "max_value": 1.0},
    )
)

expectation_suite_airports.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "LONGITUDE", "min_value": -1.0, "max_value": 1.0},
    )
)

In [None]:
feature_group_airports = (
    FeatureGroup()
    .with_feature_store_id(feature_store.id)
    .with_primary_keys(["IATA_CODE"])
    .with_name("airport_feature_group")
    .with_entity_id(entity.id)
    .with_compartment_id(compartment_id)
    .with_schema_details_from_dataframe(airports_df)
    .with_expectation_suite(
        expectation_suite=expectation_suite_airports,
        expectation_type=ExpectationType.LENIENT,
    )
)

In [None]:
feature_group_airports.create()

In [None]:
feature_group_airports.show()

In [None]:
feature_group_airports.materialise(airports_df)

<a id="schema_enforcement"></a>
### 3.3. Schema enforcement

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any writes with columns that aren't on the list.
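
The "reservation list" check described above can be sketched in plain Python. This is a conceptual illustration only, not the Delta Lake or ADS implementation: the function name `check_schema` is hypothetical, and it compares column names only, whereas Delta Lake also validates data types.

```python
def check_schema(table_columns, incoming_columns):
    """Raise ValueError if the incoming data has columns the table does not expect."""
    expected = set(table_columns)
    unexpected = [column for column in incoming_columns if column not in expected]
    if unexpected:
        raise ValueError(f"Schema mismatch, unexpected columns: {unexpected}")


table_columns = ["IATA_CODE", "AIRPORT", "CITY", "STATE", "LATITUDE", "LONGITUDE"]

# Same columns as the table: the write is accepted.
check_schema(table_columns, table_columns)

# An extra COUNTRY column has no "reservation", so the write is rejected.
try:
    check_schema(table_columns, table_columns + ["COUNTRY"])
except ValueError as err:
    print(err)
```

The cells that follow exercise the same situation against the feature group by reloading the airports data with an additional `COUNTRY` column.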

In [None]:
columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE', 'COUNTRY']
airports_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv")[columns]
airports_df.head()

In [None]:
feature_group_airports.with_schema_details_from_dataframe(airports_df)
feature_group_airports.update()

In [None]:
feature_group_airports.materialise(airports_df)

<a id="schema_evolution"></a>
### 3.4. Schema evolution

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time. Most commonly, it's used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
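
Conceptually, merging a schema during an append means adding the new columns to the end of the table schema and treating them as null for rows written earlier. The following plain-Python sketch illustrates that idea; the helper names and the sample rows are made up for illustration and are not the Delta Lake or ADS implementation.

```python
def merge_schema(table_columns, incoming_columns):
    """Return the table columns followed by any new incoming columns."""
    merged = list(table_columns)
    for column in incoming_columns:
        if column not in merged:
            merged.append(column)
    return merged


def append_with_merge(rows, columns, new_rows):
    """Append new_rows, widening the schema and back-filling old rows with None."""
    incoming = [column for row in new_rows for column in row]
    merged = merge_schema(columns, incoming)
    combined = [{c: row.get(c) for c in merged} for row in rows + new_rows]
    return merged, combined


# Illustrative rows only; the real data comes from airports.csv.
columns = ["IATA_CODE", "CITY"]
rows = [{"IATA_CODE": "ABE", "CITY": "Allentown"}]
new_rows = [{"IATA_CODE": "ABI", "CITY": "Abilene", "COUNTRY": "USA"}]

columns, rows = append_with_merge(rows, columns, new_rows)
print(columns)
print(rows[0])  # the older row now carries COUNTRY=None
```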

In [None]:
from ads.feature_store.feature_option_details import FeatureOptionDetails
feature_option_details = FeatureOptionDetails().with_feature_option_write_config_details(merge_schema=True)

In [None]:
feature_group_airports.materialise(
    input_dataframe=airports_df,
    feature_option_details=feature_option_details
)

In [None]:
feature_group_airports

<a id="ingestion_modes"></a>
### 3.5. Ingestion modes

<a id="append"></a>
#### 3.5.1. Append

In ``append`` mode, new data is added to the existing table. If the table already exists, the new data is appended to it, extending the dataset. This mode is suitable for scenarios where you want to continuously add new records without modifying or deleting existing data. It preserves the existing data and only appends the new data to the end of the table.

In [None]:
from ads.feature_store.feature_group_job import IngestionMode
feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.APPEND)

<a id="overwrite"></a>
#### 3.5.2. Overwrite
In ``overwrite`` mode, the existing table is replaced entirely with the new data being saved. If the table already exists, it will be dropped and a new table will be created with the new data. This mode is useful when you want to completely refresh the data in the table with the latest data, discarding any previous records.
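
The difference between the two modes can be shown with a list standing in for the table. This is a conceptual sketch with a hypothetical `write` helper and string mode names chosen for illustration; it is not how the feature store persists data.

```python
def write(table, new_rows, mode):
    """Return the table contents after writing new_rows in the given mode."""
    if mode == "append":
        return table + new_rows  # existing rows are kept
    if mode == "overwrite":
        return list(new_rows)  # existing rows are discarded
    raise ValueError(f"Unsupported mode: {mode}")


table = [{"IATA_CODE": "ABE"}]
new_rows = [{"IATA_CODE": "ABI"}]

print(len(write(table, new_rows, "append")))
print(len(write(table, new_rows, "overwrite")))
```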

In [None]:
from ads.feature_store.feature_group_job import IngestionMode
feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.OVERWRITE)

<a id="upsert"></a>
#### 3.5.3. Upsert
``Upsert`` mode, also known as ``merge`` mode, is used to update existing records in the table based on a primary key or a specified condition. If a record with the same key exists, it will be updated with the new data; otherwise, a new record will be inserted. This mode is useful for maintaining and synchronizing data between the source and destination tables while avoiding duplicates.
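
The update-or-insert rule can be sketched with a dictionary keyed on the primary key. This is a conceptual illustration with a hypothetical `upsert` helper and made-up rows; it replaces the whole matching row, which is an assumption of the sketch rather than a statement about how the feature store merges records.

```python
def upsert(table, new_rows, key):
    """Update rows whose key already exists and insert the rest."""
    merged = {row[key]: row for row in table}
    for row in new_rows:
        merged[row[key]] = row  # replaces a match, otherwise adds a new entry
    return list(merged.values())


table = [
    {"IATA_CODE": "ABE", "CITY": "Allentown"},
    {"IATA_CODE": "ABI", "CITY": "Abilene"},
]
new_rows = [
    {"IATA_CODE": "ABI", "CITY": "Abilene, TX"},  # existing key: updated
    {"IATA_CODE": "ABQ", "CITY": "Albuquerque"},  # new key: inserted
]

result = upsert(table, new_rows, key="IATA_CODE")
for row in result:
    print(row)
```

Because rows are matched on the key, running the same upsert twice leaves the table unchanged, which is what makes this mode safe for repeated synchronization.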

In [None]:
from ads.feature_store.feature_group_job import IngestionMode
feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.UPSERT)

<a id="history"></a>
### 3.6. History
You can call the ``history()`` method of the FeatureGroup instance to show the history of the feature group.

In [None]:
feature_group_airports.history().toPandas()

<a id="preview"></a>
### 3.7. as_of

You can call the ``as_of()`` method of the FeatureGroup instance to get time-traveled data as of a specified point in time.
The ``.as_of()`` method takes the following optional parameters:

- commit_timestamp: date-time. Commit timestamp for feature group
- version_number: int. Version number for feature group
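
How the two parameters select a snapshot can be sketched over a small in-memory commit log. Everything here is hypothetical: the `commits` list, the standalone `as_of` function, and the rule that a timestamp resolves to the latest commit at or before it are assumptions made for illustration, not the ADS implementation.

```python
from datetime import datetime

# Hypothetical commit log: (version_number, commit_timestamp, snapshot)
commits = [
    (0, datetime(2023, 1, 1, 9, 0), ["row-a"]),
    (1, datetime(2023, 1, 2, 9, 0), ["row-a", "row-b"]),
]


def as_of(commits, version_number=None, commit_timestamp=None):
    """Return the snapshot for a version, or the latest one at or before a timestamp."""
    if version_number is not None:
        for version, _, snapshot in commits:
            if version == version_number:
                return snapshot
        raise LookupError(f"No such version: {version_number}")
    if commit_timestamp is not None:
        eligible = [c for c in commits if c[1] <= commit_timestamp]
        if not eligible:
            raise LookupError(f"No commit at or before {commit_timestamp}")
        return max(eligible, key=lambda c: c[1])[2]
    return commits[-1][2]  # no argument given: latest snapshot


print(as_of(commits, version_number=0))
print(as_of(commits, commit_timestamp=datetime(2023, 1, 1, 12, 0)))
```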

In [None]:
feature_group_airports.as_of(version_number = 0).show()

In [None]:
feature_group_airports.as_of(version_number = 1).show()

<a id='references'></a>
# 4. References
- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)
- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)