In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Using Data Catalog (Hive) Metastore with Data Flow</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

This notebook demonstrates how to write and test a Data Flow batch application that uses the Oracle Cloud Infrastructure (OCI) Data Catalog Metastore. [Oracle Cloud Infrastructure (OCI) Data Catalog](https://docs.oracle.com/en-us/iaas/data-catalog/home.htm) is a metadata management service that helps data professionals discover data and support data governance. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets backed by Object Storage. [Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) is a fully managed service for running [Apache Spark](https://spark.apache.org/) applications.

Compatible conda pack: [PySpark 3.0 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 5.0)

---

## Contents:

 - <a href='#intro'>Introduction</a>
     - <a href='#prerequisite'>Setup</a>
         - <a href='#policy'>Policy</a>
         - <a href='#var'>Variables</a>
 - <a href='#appscript'>Application Script</a>
 - <a href='#jobs'>Create and Run a Data Flow Application</a>
     - <a href='#conf'>Configuring the Job</a>
     - <a href='#run'>Run the Data Flow Application</a>
 - <a href='#clean_up'>Clean Up</a>
 - <a href='#ref'>References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.

You can access the `orcl_attrition` dataset license [here](https://oss.oracle.com/licenses/upl).


In [None]:
import ads
import os
import tempfile
import shutil

from ads.jobs.ads_job import Job
from ads.jobs import DataFlow, DataFlowRun, DataFlowRuntime
from uuid import uuid4

ads.set_auth(auth="resource_principal")

<a id='intro'></a>
# Introduction 

[Oracle Cloud Infrastructure (OCI) Data Catalog](https://docs.oracle.com/en-us/iaas/data-catalog/home.htm) is a metadata management service that helps data professionals discover data and support data governance. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository for understanding tables backed by files on Object Storage. [Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) is a fully managed service for running [Apache Spark](https://spark.apache.org/) applications. [Data Science jobs](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/jobs/overview.html) allow you to run customized tasks outside of a notebook session.

As a Data Flow user, you can access the Data Catalog Metastore to securely store and retrieve schema definitions for data assets. For integration with Data Flow, the Metastore provides an invocation endpoint that exposes the Hive Metastore interface.

[Apache Hive](https://hive.apache.org/) is a data warehousing framework that facilitates reading, writing, and managing large datasets residing in distributed systems. A Hive Metastore is the central repository of metadata for a Hive cluster. It stores metadata for data structures such as databases, tables, and partitions in a relational database, while the data itself is backed by files on Object Storage. Apache Spark SQL uses a Hive Metastore for this purpose.

<a id='prerequisite'></a>
## Setup

<a id='policy'></a>
### Policy

To control who has access to Data Flow, and the type of access for each group of users, you must create policies. See [Data Flow Policies](https://docs.oracle.com/en-us/iaas/data-flow/using/policies.htm) and [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) for more details.

<a id='var'></a>
### Variables

To run this notebook, you must provide some information about your tenancy configuration. Replace `<job_name>` with a unique name for the job. To connect to the metastore, replace `<metastore_id>` with the OCID of the metastore. To create and run a Data Flow application, you must specify a compartment and buckets for storing the logs and the Data Flow script. These resources must be in the same compartment.

In [None]:
job_name = "<job_name>"
log_bucket_uri = "oci://<bucket_name>@<namespace>/<prefix>"
metastore_id = "<metastore_id>"
script_bucket = "oci://<bucket_name>@<namespace>/<prefix>"

compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
driver_shape = "VM.Standard2.1"
executor_shape = "VM.Standard2.1"
spark_version = "3.0.2"
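
Before creating any resources, it can help to confirm that every placeholder above has been replaced and that the bucket URIs follow the `oci://<bucket_name>@<namespace>/<prefix>` pattern. The following is a minimal sketch that uses only the standard library. The helper names `find_unset` and `parse_oci_uri` are hypothetical and are not part of ADS, and the regular expression is an assumption about what a well-formed URI looks like, not the service's own validation. The example values are illustrative only.

```python
import re

# Assumed shape of a bucket URI: oci://<bucket_name>@<namespace>/<prefix>
OCI_URI = re.compile(
    r"^oci://(?P<bucket>[^@/]+)@(?P<namespace>[^/]+)(?:/(?P<prefix>.*))?$"
)
# Template placeholders look like <job_name>, <bucket_name>, and so on.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")


def find_unset(settings):
    """Return the names of settings that are empty or still contain a <placeholder>."""
    return [
        name
        for name, value in settings.items()
        if not value or PLACEHOLDER.search(str(value))
    ]


def parse_oci_uri(uri):
    """Split an oci:// URI into bucket, namespace, and prefix, or raise ValueError."""
    match = OCI_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a valid oci:// bucket URI: {uri!r}")
    parts = match.groupdict()
    parts["prefix"] = parts["prefix"] or ""
    return parts


# Illustrative values only; substitute the variables defined in the cell above.
example = {
    "job_name": "metastore-demo",
    "log_bucket_uri": "oci://logs@mytenancy/dataflow",
    "metastore_id": "<metastore_id>",
}
print(find_unset(example))
print(parse_oci_uri(example["log_bucket_uri"]))
```

A check like this covers all of the settings, whereas the guard used in the later cells only tests `metastore_id`.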

<a id='appscript'></a>
# Application Script

An application script is used to execute the Data Flow job. The following cell creates this script and saves it to local storage. However, Data Flow requires that the script is stored in Object Storage as it cannot access your notebook session. The ADS framework takes care of uploading this script to Object Storage for you.

The next cell contains the script in a single string and writes it to local storage. This method works well for small scripts. Larger scripts are better developed outside of the notebook.

The application script uses Employee Attrition data to create a new database, `employee_attrition`, and a table, `orcl_attrition`. This data is loaded from a publicly accessible Object Storage bucket. The metastore manages all the metadata about the new database, while the actual data is copied to your Object Storage bucket. The script performs a query on the database. Finally, the script removes the database from the metastore. This causes the files in Object Storage that are related to the database to be removed.

In [None]:
script = '''
from pyspark.sql import SparkSession

def main():

    database_name = "employee_attrition"
    table_name = "orcl_attrition"

    # Create a Spark session
    spark = SparkSession \\
        .builder \\
        .appName("Python Spark SQL basic example") \\
        .enableHiveSupport() \\
        .getOrCreate()

    # Load a CSV file from a public Object Storage bucket
    df = spark \\
        .read \\
        .format("csv") \\
        .option("header", "true") \\
        .option("multiLine", "true") \\
        .load("oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv")

    print(f"Creating {database_name}")
    spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")

    # Write the data to the database
    df.write.mode("overwrite").saveAsTable(f"{database_name}.{table_name}")

    # Use Spark SQL to read from the database.
    query_result_df = spark.sql(f"""
                                SELECT EducationField, SalaryLevel, JobRole FROM {database_name}.{table_name} limit 10
                                """)

    # Convert the query result DataFrame into JSON format and write it to stdout
    # so that it can be captured in the log.
    print('\\n'.join(query_result_df.toJSON().collect()))
    
    # Clean resources
    spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")

if __name__ == '__main__':
    main()
'''

dataflow_base_folder = tempfile.mkdtemp()
print(f"Data flow directory: {dataflow_base_folder}")

pyspark_file_path = os.path.join(dataflow_base_folder, "example.py")

with open(pyspark_file_path, "w") as f:
    print(script.strip(), file=f)

print(f"Script path: {pyspark_file_path}")
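
Because the script is held in a string, a typo in it would otherwise only surface after the run has started on Data Flow. One inexpensive safeguard is to parse the written file locally before it is uploaded. The sketch below is an assumption about how such a check could look, not part of ADS or Data Flow. `check_script_syntax` is a hypothetical helper, and the demo file stands in for the application script. Parsing does not execute the file, so `pyspark` does not need to be importable, and it only checks syntax against the notebook's Python version.

```python
import ast
import os
import tempfile


def check_script_syntax(path):
    """Parse the file as Python and return (ok, message) without executing it."""
    with open(path, "r") as f:
        source = f.read()
    try:
        ast.parse(source, filename=path)
    except SyntaxError as err:
        return False, f"line {err.lineno}: {err.msg}"
    return True, "ok"


# Illustrative stand-in for the application script written above.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.py")
with open(demo_path, "w") as f:
    f.write(
        "def main():\n"
        "    print('hello')\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    main()\n"
    )

print(check_script_syntax(demo_path))
```

In this notebook, the same function could be called with `pyspark_file_path`.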

<a id='jobs'></a>
# Create and Run a Data Flow Application

<a id='conf'></a>
## Configuring the Job

The preferred method for running a Data Flow application is to run it as a Job. A Job allows you to better manage your resources and isolates the Data Flow application from the notebook. A Job needs two objects. The first is a `DataFlow` object, which is a subclass of `Infrastructure`. It defines the metadata related to the Data Flow service, such as `compartment_id` and `logs_bucket_uri`, and it also defines the connection between Data Flow and the metastore. The second is a `DataFlowRuntime` object, which is a subclass of `Runtime`. It stores the properties related to the script to be run: the location of the Data Flow application script, the bucket that the script is uploaded to, and any command line options needed.

To use a private bucket for `logs_bucket_uri`, ensure that a Data Flow Service policy has been added. See the [prerequisite step](#prerequisite) and the [policy setup page](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up) for more details.

In the following example, the `dataflow_configs` variable is a `DataFlow` object that holds the compartment OCID, the metastore OCID, the log bucket URI, the compute shape for the driver, the compute shape for the executor, and the version of Spark.

In [None]:
if metastore_id != "<metastore_id>":
    dataflow_configs = DataFlow(
        {
            "compartment_id": compartment_id,
            "driver_shape": driver_shape,
            "executor_shape": executor_shape,
            "logs_bucket_uri": log_bucket_uri,
            "metastore_id": metastore_id,
            "spark_version": spark_version,
        }
    )
else:
    print(
        "DataFlow object was not created. Enter configuration values in the Setup section."
    )

The `runtime_config` variable is a `DataFlowRuntime` object. It contains information about the location of the script and the bucket for the script. The script URI defines the location of the Data Flow application script. This can be on local storage or in Object Storage. If the path is local, then the script bucket must be specified so that the framework can upload the script to the Object Storage bucket. Data Flow requires a script to be available in Object Storage. The URI for buckets must have the following format `oci://<bucket_name>@<namespace>/<prefix>`.

In [None]:
if metastore_id != "<metastore_id>":
    runtime_config = (
        DataFlowRuntime()
        .with_script_uri(pyspark_file_path)
        .with_script_bucket(script_bucket)
    )
else:
    print(
        "DataFlowRuntime object was not created. Enter configuration values in the Setup section."
    )

The following cell creates a Job that executes the Data Flow application. The `Job` object needs a name, information about the Data Flow cluster infrastructure, and the runtime configuration. The `.create()` method is used to create the Data Flow application.

In [None]:
if metastore_id != "<metastore_id>":
    df_job = Job(name=job_name, infrastructure=dataflow_configs, runtime=runtime_config)
    df_app = df_job.create()
else:
    print(
        "Job object was not created. Enter configuration values in the Setup section."
    )

<a id='run'></a>
## Run the Data Flow Application

To run this Data Flow application, call the `.run()` method. It returns a `DataFlowRun` object.

In [None]:
if metastore_id != "<metastore_id>":
    df_run = df_app.run()
else:
    print(
        "Job object was not created. Enter configuration values in the Setup section."
    )

The `.watch()` method on the `DataFlowRun` object accesses the logs and prints them to the screen.

In [None]:
if metastore_id != "<metastore_id>":
    df_run.watch()
else:
    print(
        "Job object was not created. Enter configuration values in the Setup section."
    )
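
The application script prints each row of the query result as a JSON object on its own line, so those rows appear in the log alongside other output. If you copy or otherwise capture that output as text, a sketch like the following could turn the rows back into Python dictionaries. `extract_json_rows` is a hypothetical helper, and the sample log is made up: the column names come from the query in the script, but the values are placeholders and not actual dataset contents.

```python
import json


def extract_json_rows(log_text):
    """Return the lines of log_text that parse as JSON objects, as dictionaries."""
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; treat it as ordinary log text
        if isinstance(row, dict):
            rows.append(row)
    return rows


# Made-up log excerpt; the values are placeholders.
sample_log = """Creating employee_attrition
{"EducationField":"field-a","SalaryLevel":"1","JobRole":"role-a"}
{"EducationField":"field-b","SalaryLevel":"2","JobRole":"role-b"}
some other log line
"""

for row in extract_json_rows(sample_log):
    print(row["JobRole"], row["SalaryLevel"])
```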

<a id='clean_up'></a>
# Clean Up

This notebook creates several resources, such as a database with a metastore entry and files in Object Storage. It also creates a Data Flow application and a Job. The Data Flow application script drops the database, which removes the entry in the Data Catalog and deletes the files on Object Storage related to the database. Data Flow cleans up after itself when the run is done. You have to manually clean up the Data Flow application script and the associated log files. The following cells call `.delete()` on the Data Flow run and remove the temporary local directory that holds the script.

In [None]:
if metastore_id != "<metastore_id>":
    df_run.delete()
else:
    print("Skipping, as the metastore_id is not defined.")

In [None]:
shutil.rmtree(dataflow_base_folder)

Use [ocifs](https://ocifs.readthedocs.io/en/latest/unix-operations.html#rm) to clean up the Data Flow log and script buckets.
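
As a rough sketch of what that cleanup could look like, the helper below lists everything under a prefix and deletes it only when asked to. It assumes an fsspec-style file system object that provides `exists()`, `find()`, and `rm()`, which is the interface `ocifs` builds on. `remove_prefix` is a hypothetical name, and how the file system object is constructed depends on how you authenticate, so that part is shown only as a commented-out assumption.

```python
def remove_prefix(fs, uri, dry_run=True):
    """List the objects under uri and, unless dry_run is set, delete them recursively.

    fs is any fsspec-style file system exposing exists(), find(), and rm().
    Returns the list of paths that were found under the prefix.
    """
    if not fs.exists(uri):
        return []
    targets = fs.find(uri)
    if not dry_run:
        fs.rm(uri, recursive=True)
    return targets


# Assumed usage with ocifs (requires the ocifs package and valid OCI credentials;
# the constructor arguments depend on your authentication setup):
# import ocifs
# fs = ocifs.OCIFileSystem()
# print(remove_prefix(fs, "oci://<bucket_name>@<namespace>/<prefix>"))
```

Keeping `dry_run=True` as the default means the first call only reports what would be removed, which is a cautious choice for a bucket that may hold other data.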

<a id='ref'></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Connecting to an Autonomous Database](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)