In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Using Data Catalog Metastore with PySpark</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

This notebook demonstrates how to configure and use PySpark to process data in the Oracle Cloud Infrastructure (OCI) Data Catalog metastore. The [Data Catalog](https://docs.oracle.com/en-us/iaas/data-catalog/home.htm) service is a metadata management service that helps data professionals discover data and supports data governance. The [Data Catalog Hive metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) is part of the Data Catalog service, and it provides schema definitions for objects in structured and unstructured data assets that reside in Object Storage. Use the metastore as a central metadata repository to manage data tables that are backed by files in Object Storage. To access the metastore from the notebook using PySpark, you must configure PySpark. This notebook demonstrates some common operations, such as creating and loading tables from the metastore. It also shows you how to use PySpark to query a table and access the results in a notebook session.

Compatible conda pack: [PySpark 3.0 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 5.0)

---

## Contents:

 - <a href='#intro'>Introduction</a>
 - <a href='#setup'>Setup</a>
     - <a href='#setup_spark-defaults'>`spark-defaults.conf`</a>
     - <a href='#setup_session'>Session Setup</a>
     - <a href='#conda_configuration_dcat_testing'>Testing the Configuration</a>
 - <a href='#write_dcat'>Save the Data to Data Catalog Metastore</a>
 - <a href='#read_dcat'>Read Data from Data Catalog Metastore</a>
 - <a href='#clean_up'>Clean Up</a>
 - <a href='#ref'>References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `orcl_attrition` dataset license [here](https://oss.oracle.com/licenses/upl).


In [None]:
import fsspec
import oci
import os
import pandas as pd
import pyarrow.parquet as pq
import shutil
import tempfile

from ads.common import auth as authutil
from pyspark import SparkConf
from pyspark.sql import SparkSession, Row
from urllib.parse import urlparse

<a id='intro'></a>
# Introduction

Various data professionals, such as data engineers, data scientists, data stewards, and chief data officers, use [Data Catalog](https://docs.oracle.com/en-us/iaas/data-catalog/home.htm) to manage metadata. In the Data Catalog, a data asset represents a data source, such as a database, an object store, a file or document store, a message queue, or an application. One of the most common uses for data scientists is managing metadata about a flat-file database that is backed by Object Storage. This notebook demonstrates how to connect to the Data Catalog, and how to create and read from a database on Object Storage. This database must have its metadata stored in the Data Catalog.

The [Data Catalog Hive metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and semi-structured data assets on Object Storage. The metastore is the central metadata repository for understanding tables that are backed by files on Object Storage. The metastore provides an invocation endpoint that exposes the Hive metastore interface. [Apache Hive](https://hive.apache.org/) is a data warehousing framework that facilitates reading, writing, and managing large datasets residing in distributed systems. A Hive metastore is the central repository of metadata for a Hive cluster. It stores metadata for data structures such as databases, tables, and partitions in a relational database, backed by files maintained in Object Storage. [Apache Spark SQL](https://spark.apache.org/sql/) makes use of a Hive metastore for this purpose.

<a id='setup'></a>
# Setup

To set up the environment, a `spark-defaults.conf` file must be configured. Several variables that define entities such as the Data Catalog metastore and the location of the data warehouse bucket must also be defined.

<a id='setup_spark-defaults'></a>
## `spark-defaults.conf`

The `spark-defaults.conf` file defines the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually, but the `odsc data-catalog config` command line tool is ideal for setting up the file because it gathers information about your environment and uses it to build the file.

The `odsc data-catalog config` command line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command line option is required because the remaining settings have default values or take their values from your notebook session environment. The following are common options that you may need to override.

The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is set with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` option isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.

Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.

The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory in which to write the file.

You need to determine what settings are appropriate for your configuration. However, the following command works for most configurations. Run it in a terminal window:

```bash
odsc data-catalog config --authentication resource_principal --metastore <metastore_id>
```
For more assistance, use the following command in a terminal window:

```bash
odsc data-catalog config --help
```
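
After the tool has run, it can be useful to confirm from the notebook which properties ended up in the file. The following is a minimal sketch, not part of the `odsc` tooling. The helper name `read_spark_defaults` is hypothetical, and the parser assumes that each property is written as a key followed by whitespace and a value. Only the property names are printed because the values may contain credentials when API keys are used.

```python
import os
from pathlib import Path


def read_spark_defaults(conf_dir=None):
    """Return the properties in spark-defaults.conf as a dict (empty if the file is missing)."""
    # Fall back to the default location when SPARK_CONF_DIR is not set.
    conf_dir = conf_dir or os.environ.get(
        "SPARK_CONF_DIR", "/home/datascience/spark_conf_dir"
    )
    conf_file = Path(conf_dir) / "spark-defaults.conf"
    properties = {}
    if not conf_file.is_file():
        return properties
    for line in conf_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: the key and the value are separated by whitespace.
        parts = line.split(None, 1)
        properties[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return properties


properties = read_spark_defaults()
print(f"Region from NB_REGION: {os.environ.get('NB_REGION')}")
print(f"{len(properties)} properties found in spark-defaults.conf")
for key in sorted(properties):
    print(key)
```

If no properties are listed, the file was probably written to a different directory with the `--output` option, or the tool has not been run yet.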

<a id='setup_session'></a>
## Session Setup

The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://<bucket_name>@<namespace_name>/<key>` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore.

In [None]:
# warehouse_uri points to the default location for managed databases and tables
warehouse_uri = "<warehouse_uri>"
metastore_id = "<metastore_id>"
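
A malformed warehouse URI tends to surface only later as a Spark error, so it may help to check the format first. The following sketch splits a URI of the form `oci://<bucket_name>@<namespace_name>/<key>` into its parts. The helper name `parse_oci_uri` and the sample bucket and namespace names are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def parse_oci_uri(uri):
    """Split an oci://<bucket_name>@<namespace_name>/<key> URI into its parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "oci":
        raise ValueError(f"Expected an oci:// URI, got: {uri!r}")
    # The netloc holds "<bucket_name>@<namespace_name>".
    bucket, separator, namespace = parsed.netloc.partition("@")
    if not (bucket and separator and namespace):
        raise ValueError(f"Expected <bucket_name>@<namespace_name> in: {uri!r}")
    return {"bucket": bucket, "namespace": namespace, "key": parsed.path.lstrip("/")}


# Hypothetical values, shown only to illustrate the format.
print(parse_oci_uri("oci://my-bucket@my-namespace/warehouse"))
```

Calling `parse_oci_uri(warehouse_uri)` with the unedited `<warehouse_uri>` placeholder raises a `ValueError`, which makes a missing setting visible before a Spark session is started.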

<a id='conda_configuration_dcat_testing'></a>
## Testing the Configuration

To test the configuration, the next cell connects to the Data Catalog metastore, and requests a list of databases:

In [None]:
if metastore_id != "<metastore_id>":
    spark = (
        SparkSession.builder.appName("Python Spark SQL Hive integration example")
        .config("spark.sql.warehouse.dir", warehouse_uri)
        .config("spark.hadoop.oracle.dcat.metastore.id", metastore_id)
        .enableHiveSupport()
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")

    # show the databases in the warehouse:
    spark.sql("SHOW DATABASES").show()
else:
    spark = None
    print(
        "No connection was made to the Data Catalog Metastore. Enter configuration values in the Setup section."
    )

<a id='write_dcat'></a>
# Save to Data Catalog Metastore

In this section, the connection to the Data Catalog metastore is used to create a database in Object Storage. The metastore manages the metadata about the table, and Object Storage manages the files. PySpark is used to perform the actual operations.

In the next cell, PySpark creates a database. Since PySpark is connected to the metastore, the Data Catalog metastore manages the metadata about the table. Additionally, the PySpark connection is linked to an Object Storage bucket, so the data is written to it. From the perspective of Object Storage, it is only managing files. The metastore doesn't perform the actual operations on the data. That is handled by PySpark, but PySpark depends on the metastore to understand that the files in Object Storage are a database.

After the database is created, PySpark reads data from a publicly accessible Object Storage bucket. The data is then written to the data warehouse Object Storage bucket, and the metadata in the metastore is updated.

In [None]:
database_name = "ODSC_DEMO"
table_name = "ODSC_PYSPARK_METASTORE_DEMO"
file_path = (
    "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv"
)

if spark is not None:
    spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
    spark.sql(f"CREATE DATABASE {database_name}")
    input_dataframe = spark.read.option("header", "true").csv(file_path)
    input_dataframe.write.mode("overwrite").saveAsTable(f"{database_name}.{table_name}")
else:
    print("Database was not created. Enter configuration values in the Setup section.")
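
To confirm that the metastore now knows about the table, the Spark catalog API can be queried. The following is a small sketch with a hypothetical helper name, `table_exists`. It compares names case-insensitively on the assumption that the Hive-compatible metastore stores identifiers in lowercase, so the uppercase names used in this notebook may be listed in lowercase.

```python
def table_exists(spark, database_name, table_name):
    """Return True if the metastore lists table_name in database_name."""
    # Assumption: identifiers may be stored in lowercase, so compare without case.
    tables = spark.catalog.listTables(database_name)
    return any(table.name.lower() == table_name.lower() for table in tables)
```

In this notebook, the helper could be called as `table_exists(spark, database_name, table_name)` once the previous cell has run with a valid configuration.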

<a id='read_dcat'></a>
# Read from Data Catalog Metastore

Once the PySpark connection has been made to the Data Catalog metastore and the Object Storage bucket that backs the data warehouse, you can perform PySpark operations as you would in any other PySpark setup. The following cell uses HiveQL, a SQL-like data manipulation language (DML), to retrieve records from a database that is managed by the Data Catalog metastore.

In [None]:
if spark is not None:
    spark_df = spark.sql(
        f"""
        SELECT EducationField, SalaryLevel, JobRole
        FROM {database_name}.{table_name}
        LIMIT 10
        """
    )
    spark_df.show()
else:
    spark_df = None
    print(
        "No HiveQL query was executed. Enter configuration values in the Setup section."
    )

<a id='query_topd'></a>
## Convert to a Pandas DataFrame

The `spark_df` object doesn't actually contain the record results in memory. Generally, a PySpark dataframe is used when the data is very large. However, it is a common data science workflow pattern to perform computations in PySpark that aggregate and reduce the dataset size so that it can be used in a notebook session.

The built-in PySpark `.toPandas()` method converts a PySpark dataframe to a Pandas dataframe. It is an expensive operation, so use it carefully to minimize the performance impact on your Spark applications. If you need it, especially when the dataframe is fairly large, consider the `PyArrow` optimization for converting between Spark and Pandas dataframes. To use Arrow, set the Spark configuration `spark.sql.execution.arrow.pyspark.enabled` to `true` (this is the Spark 3.0 name; the older `spark.sql.execution.arrow.enabled` is deprecated). This configuration is disabled by default.

In [None]:
if spark is not None:
    pd_df = spark_df.toPandas()
    print(pd_df)
else:
    print(
        "The PySpark DataFrame was not converted to a Pandas DataFrame. Enter configuration values in the Setup section."
    )
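
The aggregate-then-convert pattern described above can be sketched as follows. The helper name `summarize_to_pandas` is hypothetical. It counts rows per group in Spark, so only the small summary is brought into the notebook session, and it optionally enables Arrow first. The sketch assumes a Spark 3.x session and that `pyarrow` is available in the conda environment.

```python
def summarize_to_pandas(spark, qualified_table, group_column, use_arrow=True):
    """Count rows per group in Spark and return the small result as a Pandas DataFrame."""
    if use_arrow:
        # Spark 3.x property name; Spark 2.x used spark.sql.execution.arrow.enabled.
        spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # The aggregation runs in Spark, so only one row per group is collected.
    counts = spark.table(qualified_table).groupBy(group_column).count()
    return counts.toPandas()
```

With the variables defined in this notebook, a call could look like `summarize_to_pandas(spark, f"{database_name}.{table_name}", "JobRole")`, which returns one row per job role instead of the full table.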

<a id='clean_up'></a>
# Clean Up

This notebook created a number of resources, such as a database table and an Apache Spark session. The next cell removes these resources.

In [None]:
if spark is not None:
    spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
    spark.stop()

<a id='ref'></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Connecting to an Autonomous Database](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
- [Using sqlnet.ora file with JDBC](https://stackoverflow.com/questions/63696611/can-the-oracle-jdbc-thin-driver-use-a-sqlnet-ora-file-for-configuration)