# <font color=red>Using Data Flow Run to create design time entities of OCI Feature Store</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Team </font></p>

***

## Contents:
 - <a href='#intro'>Introduction</a>
     - <a href='#prerequisite'>Setup</a>
         - <a href='#policy'>Policy</a>
         - <a href='#var'>Variables</a>
 - <a href='#jobs'>Create and Run a Data Flow Application</a>
     - <a href='#conf'>Configuring the Job</a>
     - <a href='#appscript'>Application Script</a>
     - <a href='#run'>Run the Data Flow Application</a>
 - <a href='#clean_up'>Clean Up</a>
 - <a href='#ref'>References</a>

---

<a id='intro'></a>
## Introduction 

Data Flow is a hosted Apache Spark service. It starts quickly and scales to process large datasets in parallel. The Accelerated Data Science SDK (ADS) provides a convenient API for creating and maintaining workloads on Data Flow. In this notebook, we use ADS to define a Data Flow job that creates the design-time entities of OCI Feature Store, which can later be used to ingest feature data.

For more information on using ADS with Data Flow, see the [documentation](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/jobs/index.html).

<a id='prerequisite'></a>
### Setup

<a id='policy'></a>
#### Policy

To control who has access to Data Flow, and the type of access for each group of users, you must create policies. See [Data Flow Policies](https://docs.oracle.com/en-us/iaas/data-flow/using/policies.htm) and [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) for more details.

<a id='var'></a>
#### Variables

To run this notebook, you must provide some information about your tenancy configuration. To connect to the metastore, replace `<metastore_id>` with the OCID of the metastore. A Hive Metastore is the central repository of metadata for a Hive cluster. It stores metadata for data structures such as databases, tables, and partitions in a relational database, backed by files on Object Storage.

To create and run a Data Flow application, you must specify a compartment and buckets for storing logs and the Data Flow script. These resources must be in the same compartment.
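Because every value in this notebook starts out as a `<...>` placeholder, it can help to check the settings before submitting anything to Data Flow. The following is a minimal sketch, not part of ADS: the helper name `check_settings` and its rules (names ending in `_id` should look like OCIDs, names ending in `_uri` should start with `oci://`) are assumptions made for illustration, and the metastore OCID shown is a made-up example value.

```python
import re

# Matches any leftover "<...>" placeholder such as "<compartment_id>".
PLACEHOLDER = re.compile(r"<[^<>]+>")


def check_settings(settings):
    """Return a list of problems found in the settings; an empty list means OK.

    Hypothetical helper: the naming rules below are assumptions for this
    notebook, not something enforced by ADS or Data Flow.
    """
    problems = []
    for name, value in settings.items():
        if not value or PLACEHOLDER.search(value):
            problems.append(f"{name}: placeholder not replaced")
        elif name.endswith("_id") and not value.startswith("ocid1."):
            problems.append(f"{name}: expected an OCID starting with 'ocid1.'")
        elif name.endswith("_uri") and not value.startswith("oci://"):
            problems.append(f"{name}: expected a URI starting with 'oci://'")
    return problems


# Example values only; replace them with your own tenancy settings.
settings = {
    "compartment_id": "<compartment_id>",
    "metastore_id": "ocid1.datacatalogmetastore.oc1..example",
    "logs_bucket_uri": "oci://logs@mynamespace/dataflow",
}
for problem in check_settings(settings):
    print(problem)
```

Running the check in the first cell of the notebook surfaces a forgotten placeholder immediately, rather than as a failed Data Flow run later.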

<a id='jobs'></a>
## Create and Run a Data Flow Application

<a id='conf'></a>
### Configuring the Job

The preferred method for running Data Flow applications is to run them as a Job. A Job lets you manage your resources better and isolates the Data Flow application from the notebook.

A `DataFlow` object, which is a subclass of `Infrastructure`, must be created. It defines the metadata related to the Data Flow service, such as `compartment_id` and `logs_bucket_uri`, and it also defines the connection between Data Flow and the metastore.

To define the parameters needed to run the Data Flow job, a `DataFlowRuntime` object is required. It is a subclass of `Runtime` and stores properties related to the script to be run: the location of the Data Flow application script, the bucket the script is uploaded to, and any command-line options needed.

To use a private bucket as the `logs_bucket`, ensure that a Data Flow Service policy has been added. See the [prerequisite step](#prerequisite) and the [policy setup page](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up) for more details.

In the following example, the `dataflow_configs` variable is a `DataFlow` object that holds the compartment OCID, the metastore OCID, the logs bucket URI, the compute shapes and shape configurations for the driver and the executor, and the Spark version.

In [None]:
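from ads.jobs import DataFlow, Job, DataFlowRuntime
import os
import ads

# Use resource principal authentication for all ADS calls.
ads.set_auth("resource_principal")

# Replace the placeholders with the values for your tenancy.
logs_bucket_uri = "oci://<logs_bucket_uri>"
compartment_id = "<compartment_id>"
metastore_id = "<metastore_id>"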

dataflow_configs = (
    DataFlow()
    .with_compartment_id(compartment_id)
    .with_logs_bucket_uri(logs_bucket_uri)
    .with_driver_shape("VM.Standard.E4.Flex")
    .with_driver_shape_config(ocpus=2, memory_in_gbs=32)
    .with_executor_shape("VM.Standard.E4.Flex")
    .with_executor_shape_config(ocpus=4, memory_in_gbs=64)
    .with_spark_version("3.2.1")
    .with_metastore_id(metastore_id)
)

<a id='appscript'></a>
### Application Script

An application script is used to execute the Data Flow job. The following cell creates this script and saves it to local storage. However, Data Flow requires that the script is stored in Object Storage as it cannot access your notebook session. The ADS framework takes care of uploading this script to Object Storage for you.

ADS `DataFlowRuntime` does not currently support `with_service_conda`. You need to publish the conda pack and use it as a custom conda environment.

In [None]:
!odsc conda publish -s  /home/datascience/conda/fspyspark32_p38_cpu_v3 --uri oci://<object storage path> --force

The `runtime_config` variable is a `DataFlowRuntime` object. It contains information about the location of the script and the bucket for the script. The script URI defines the location of the Data Flow application script. This can be on local storage or in Object Storage. If the path is local, then the script bucket must be specified so that the framework can upload the script to the Object Storage bucket. Data Flow requires a script to be available in Object Storage. The URI for buckets must have the following format `oci://<bucket_name>@<namespace>/<prefix>`.

In [None]:
runtime_config = (
    DataFlowRuntime()
    .with_script_uri("feature_store_creation.py")
    .with_script_bucket("oci://<script_bucket_uri>")
    .with_custom_conda("oci://<conda_pack_uri>")
)
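The bucket URI format described above, `oci://<bucket_name>@<namespace>/<prefix>`, is easy to get wrong, for example by omitting the namespace. The following is a minimal sketch of a parser that splits such a URI into its parts and rejects malformed values. The names `OciUri` and `parse_oci_uri` are hypothetical and not part of ADS, and the regular expression is a simplification that does not enforce the bucket naming rules of Object Storage.

```python
import re
from typing import NamedTuple


class OciUri(NamedTuple):
    """The three parts of an oci://<bucket_name>@<namespace>/<prefix> URI."""
    bucket: str
    namespace: str
    prefix: str


# Simplified pattern: a bucket, an "@", a namespace, and an optional prefix.
_OCI_URI = re.compile(
    r"^oci://(?P<bucket>[^@/]+)@(?P<namespace>[^@/]+)(?:/(?P<prefix>.*))?$"
)


def parse_oci_uri(uri: str) -> OciUri:
    """Split an Object Storage URI into bucket, namespace, and prefix.

    Raises ValueError if the URI does not follow the expected format.
    """
    match = _OCI_URI.match(uri)
    if match is None:
        raise ValueError(
            f"{uri!r} does not match oci://<bucket_name>@<namespace>/<prefix>"
        )
    # The prefix is optional; normalize a missing one to an empty string.
    return OciUri(match["bucket"], match["namespace"], match["prefix"] or "")


# Example values only; the bucket and namespace names are made up.
print(parse_oci_uri("oci://scripts@mynamespace/feature_store/"))
```

Calling `parse_oci_uri` on the script bucket and conda pack URIs before building the runtime configuration catches an unreplaced placeholder such as `oci://<script_bucket_uri>`, which raises `ValueError` because it contains no `@<namespace>` part.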

The following cell creates a Job that executes the Data Flow application. The `Job` object needs a name, information about the Data Flow cluster infrastructure, and the runtime configuration. The `.create()` method is used to create the Data Flow application.

In [None]:
df = Job(name="FS DF Application", infrastructure=dataflow_configs, runtime=runtime_config)

In [None]:
df.create(overwrite=True)
print(df)

<a id='run'></a>
### Run the Data Flow Application

To run this Data Flow application, call the `.run()` method. It creates a `DataFlowRun` object. 

In [None]:
df_run = df.run()
print(df_run)

The `.watch()` method on the `DataFlowRun` object accesses the logs and prints them to the screen.

In [None]:
df_run.watch()
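The `.watch()` call above streams logs interactively. In a script, it can be more convenient to poll until the run reaches a final state. The following is a minimal sketch of such a loop. The helper `wait_for_terminal_state` is hypothetical and not part of ADS. It takes any zero-argument callable that returns the current state as a string, and it assumes the final states are named `SUCCEEDED`, `FAILED`, and `CANCELED`. The example drives it with a fake sequence of states so that it runs without a Data Flow connection.

```python
import time

# Assumed names of the final states of a Data Flow run.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_terminal_state(get_status, interval_seconds=30.0,
                            timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state, then return it.

    get_status: zero-argument callable returning the current state string.
    sleep: injectable for testing; defaults to time.sleep.
    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still in state {status!r} after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds


# Simulated run: a fake sequence of states stands in for a real Data Flow run.
fake_states = iter(["ACCEPTED", "IN_PROGRESS", "SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), interval_seconds=0)
print(final_state)
```

With a real run, the callable would need to return the current state of `df_run`. How that state is exposed depends on the ADS version, so check the ADS documentation before wiring it in.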