<font color=gray>ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font color=red>Using the Autonomous Database with PySpark</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Team</font></p>

***

## Overview:

This notebook demonstrates how to use PySpark to process data in Oracle Cloud Infrastructure (OCI) Object Storage and save the results to an Oracle Autonomous Database (ADB). It also demonstrates how to query data from an ADB using a local PySpark session.

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when you add a database name, `database_name = "<database_name>"` becomes `database_name = "production"`.

## Prerequisites:
- Experience with the topic: Intermediate
- Professional experience: Intermediate


## Objectives:
This notebook covers the following topics:
 - <a href='#intro'>Introduction</a>
     - <a href='#variables'>Set Up the Required Variables</a>
     - <a href='#credentials'>Obtain Credentials from the Vault</a>
     - <a href='#setup_wallet'>Set Up the Wallet</a>
 - <a href='#read_os'>Reading Data from Object Storage</a>
 - <a href='#save_adb'>Save the Data to the Database</a>
 - <a href='#read_adb'>Read from the Database using PySpark</a>
 - <a href='#clean_up'>Clean Up Artifacts</a>
 - <a href='#ref'>References</a>

In [None]:
import base64
import cx_Oracle
import oci
import os
import shutil
import tempfile
import zipfile

from ads.database import connection 
from ads.vault.vault import Vault
from pyspark import SparkConf
from pyspark.sql import SparkSession
from urllib.parse import urlparse

<a id='intro'></a>
# Introduction

It has become common practice to store structured and semi-structured data using services such as Object Storage. This provides a scalable way to store vast quantities of data that can be post-processed. However, a relational database management system (RDBMS) such as the Oracle Autonomous Database provides advantages like ACID compliance, rapid relational joins, support for complex business logic, and more. It is therefore important to be able to access information stored in Object Storage, process that information, and load it into an RDBMS. This notebook demonstrates how to use PySpark, a Python interface to Apache Spark, to perform these operations.

This notebook reads from a publicly accessible Object Storage location. However, an Autonomous Database needs to be configured with permissions to create a table, write to that table, and read from it. The notebook also assumes that the credentials to access the database are stored in the Vault. This is the best practice because it prevents the credentials from being stored locally or in the notebook, where they may be accessible to others. If you do not have credentials stored in the Vault, see the `vault.ipynb` example notebook to guide you through the process of storing them. Once the database credentials are stored in the Vault, you need the OCIDs for the Vault, the encryption key, and the secret.

Autonomous Databases have an additional level of security that is needed to access them: the wallet file. You can obtain the wallet file from your account administrator or download it using the steps that are outlined in [downloading a wallet](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm#access). The wallet file is a ZIP file. This notebook unzips the wallet and updates the configuration settings so you don't have to.

The database connection also needs the TNS name of the database. Your database administrator can give you the TNS name of the database that you have access to.
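
If you are unsure which TNS names are available, the wallet itself lists them. The following is a minimal sketch, assuming the wallet ZIP contains a `tnsnames.ora` file at its top level and that each alias starts at the beginning of a line. The helper names `parse_tns_aliases` and `list_tns_aliases`, and the sample content, are hypothetical and are not part of ADS or of this notebook.

```python
import re
import zipfile

# An alias starts at the beginning of a line and is followed by "=".
_ALIAS = re.compile(r"^([A-Za-z0-9_.-]+)[ \t]*=", re.MULTILINE)


def parse_tns_aliases(text):
    """Return the alias names found in the text of a tnsnames.ora file."""
    return _ALIAS.findall(text)


def list_tns_aliases(wallet_zip_path):
    """Read tnsnames.ora from a wallet ZIP and return its alias names."""
    with zipfile.ZipFile(wallet_zip_path, "r") as wallet:
        text = wallet.read("tnsnames.ora").decode("utf-8")
    return parse_tns_aliases(text)


# Hypothetical tnsnames.ora content with two aliases.
sample = (
    "mydb_high = (description=(address=(host=example.com)))\n"
    "mydb_low = (description=(address=(host=example.com)))\n"
)
print(parse_tns_aliases(sample))
```

With a real wallet, `list_tns_aliases(wallet_path)` would return the names that are candidates for the `tnsname` variable.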

<a id='variables'></a>
## Set Up the Required Variables

The required variables to set up are:

1. `vault_id`, `key_id`, `secret_ocid`: The OCIDs of the vault, the encryption key, and the secret. The secret stores the username and password required to connect to your Autonomous Database in the OCI Vault service. This notebook is designed so that any secret can be stored as long as it is in the form of a dictionary. To store your secret, modify the dictionary. See the `vault.ipynb` example notebook for detailed steps to generate the secret OCID.
1. `tnsname`: A TNS name valid for the database.
1. `wallet_path`: The local path to your wallet ZIP file. See the `autonomous_database.ipynb` example notebook for instructions on accessing the wallet file.

In [None]:
vault_id = "<vault_id>"
key_id = "<key_id>"
secret_ocid = "<secret_ocid>"
tnsname = "<tnsname>"
wallet_path = "<wallet_path>"
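
The cells that follow skip their work when a variable still holds its placeholder. A minimal sketch of that check as a reusable helper is shown here. The names `is_placeholder` and `missing_settings` are hypothetical and are not part of the notebook or of ADS.

```python
import re

# Matches a value that is still an unedited "<name>" placeholder.
_PLACEHOLDER = re.compile(r"<[^<>]+>")


def is_placeholder(value):
    """Return True if value is empty or still looks like "<some_name>"."""
    text = (value or "").strip()
    return not text or bool(_PLACEHOLDER.fullmatch(text))


def missing_settings(**settings):
    """Return the sorted names of settings that have not been filled in."""
    return sorted(name for name, value in settings.items() if is_placeholder(value))


# Example with one real value and one untouched placeholder.
print(missing_settings(tnsname="mydb_high", wallet_path="<wallet_path>"))
```

Calling `missing_settings` with all five variables before running the rest of the notebook gives a single list of what still needs to be set.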

<a id='credentials'></a>
## Obtain Credentials from the Vault

If the `vault_id`, `key_id`, and `secret_ocid` variables have been updated, the notebook obtains a handle to the vault in a variable called `vault`. It then uses the `get_secret()` method to return a dictionary with the user credentials. This approach assumes that the Accelerated Data Science (ADS) library was used to store the secret.

In [None]:
if vault_id != "<vault_id>" and key_id != "<key_id>" and secret_ocid != "<secret_ocid>":
    print("Getting wallet username and password")
    vault = Vault(vault_id=vault_id, key_id=key_id)
    adb_creds = vault.get_secret(secret_ocid)
    user = adb_creds["username"]
    password = adb_creds["password"]
else:
    print("Skipping as it appears that you do not have vault, key, and secret ocid specified.")
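
The cell above expects the secret dictionary to contain the `username` and `password` keys. If the secret was stored with different keys, the lookup fails with a bare `KeyError`. The following sketch shows one way to report every missing key at once. The function name `extract_credentials` and the sample values are hypothetical.

```python
def extract_credentials(secret):
    """Return (username, password) from a secret dictionary.

    Raises KeyError naming every missing or empty key.
    """
    required = ("username", "password")
    missing = [key for key in required if not secret.get(key)]
    if missing:
        raise KeyError("secret is missing: " + ", ".join(missing))
    return secret["username"], secret["password"]


# Example with a made-up secret dictionary.
demo_user, demo_password = extract_credentials(
    {"username": "admin", "password": "example"}
)
print(demo_user)
```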

<a id='setup_wallet'></a>
## Set Up the Wallet

An Autonomous Database requires a wallet file to access the database. The `wallet_path` variable defines the location of this file. The next cell prepares the wallet file to make a connection to the database. It also creates the Autonomous Database connection string, `adb_url`.

In [None]:
def setup_wallet(wallet_path):
    """
    Prepare the ADB wallet file for use in PySpark.

    Extracts the wallet ZIP into a new temporary directory and
    returns the path to that directory.
    """

    temporary_directory = tempfile.mkdtemp()

    # Extract everything locally.
    with zipfile.ZipFile(wallet_path, "r") as zip_ref:
        zip_ref.extractall(temporary_directory)

    return temporary_directory

if wallet_path != "<wallet_path>":
    print("Setting up wallet")
    tns_path = setup_wallet(wallet_path)
else:
    print("Skipping as it appears that you do not have wallet_path specified.")

In [None]:
if "tns_path" in globals() and tnsname != "<tnsname>":
    adb_url = f"jdbc:oracle:thin:@{tnsname}?TNS_ADMIN={tns_path}"
else:
    print("Skipping, as the tns_path or tnsname are not defined.")
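
The connection string has the form `jdbc:oracle:thin:@<tnsname>?TNS_ADMIN=<directory>`, where the directory is the one that holds the extracted wallet files. The following sketch builds the same string and first confirms that `tnsnames.ora` is present in that directory. The function name `build_adb_url`, the alias `mydb_high`, and the throwaway directory are assumptions made for illustration.

```python
import os
import tempfile


def build_adb_url(tnsname, tns_path):
    """Build the JDBC thin URL used in this notebook.

    Raises FileNotFoundError if tns_path has no tnsnames.ora file.
    """
    tns_file = os.path.join(tns_path, "tnsnames.ora")
    if not os.path.isfile(tns_file):
        raise FileNotFoundError(f"tnsnames.ora not found in {tns_path}")
    return f"jdbc:oracle:thin:@{tnsname}?TNS_ADMIN={tns_path}"


# Demonstration with a throwaway directory that stands in for the extracted wallet.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "tnsnames.ora"), "w").close()
    demo_url = build_adb_url("mydb_high", demo_dir)
print(demo_url)
```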

<a id='read_os'></a>
# Reading Data from Object Storage

This notebook uses PySpark to access the Object Storage file. The next cell creates a Spark application called "Python Spark SQL Example" and returns a `SparkSession`. The `SparkSession`, called `sc` in this notebook, is a handle to the Spark application.

The data file that is used is relatively small, so the notebook runs Spark in local mode. That means Spark is running in the notebook session. For larger jobs, we recommend that you use the [Oracle Data Flow](https://www.oracle.com/big-data/data-flow/) service, which is an Oracle-managed Spark service.

In [None]:
# create a spark session
sc = SparkSession \
    .builder \
    .appName("Python Spark SQL Example") \
    .getOrCreate()

This notebook reads a data file that is stored in Oracle Object Storage. Its location is defined with the `file_path` variable. The `read.option().csv()` methods of the `SparkSession` are used to read the CSV file from Object Storage into a data frame.

In [None]:
file_path = "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv"
input_dataframe = sc.read.option("header", "true").csv(file_path)
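
The `file_path` value follows the `oci://<bucket>@<namespace>/<object>` pattern. The following sketch splits such a path into its parts, which can be useful when you point the notebook at your own bucket. The function name `split_oci_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_oci_uri(uri):
    """Split oci://<bucket>@<namespace>/<object> into its three parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "oci" or "@" not in parsed.netloc:
        raise ValueError(f"not an oci://bucket@namespace/object URI: {uri}")
    bucket, namespace = parsed.netloc.split("@", 1)
    return bucket, namespace, parsed.path.lstrip("/")


print(
    split_oci_uri(
        "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv"
    )
)
```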

<a id='save_adb'></a>
# Save the Data to the Database

This notebook creates a table in your database with the name specified by `table_name`. The name should be unique so that it does not conflict with an existing table in your database. If a table with this name already exists, change the value to something that is unique.

In [None]:
table_name = "ODSC_PYSPARK_ADB_DEMO"

if tnsname != "<tnsname>" and "adb_url" in globals():
    print("Saving processed data to " + adb_url)
    properties = {
        # Directory that holds the extracted wallet files (tnsnames.ora).
        "oracle.net.tns_admin": tns_path,
        "password": password,
        "user": user,
    }
    input_dataframe.write.jdbc(
        url=adb_url, table=table_name, properties=properties
    )
else:
    print("Skipping as it appears that you do not have tnsname or adb_url specified.")
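
By default, a Spark JDBC write raises an error when the target table already exists, so running the cell above a second time fails. The following sketch wraps the write in a function with an explicit save mode. The function name `save_to_adb` is hypothetical, and the function accepts any object that exposes `write.jdbc()` in the way a Spark data frame does.

```python
def save_to_adb(dataframe, url, table, user, password, mode="errorifexists"):
    """Write a Spark data frame to a database table over JDBC.

    mode is passed to DataFrameWriter.jdbc(): "errorifexists" fails if
    the table exists, "overwrite" replaces it, and "append" adds rows.
    """
    allowed = {"append", "overwrite", "ignore", "error", "errorifexists"}
    if mode not in allowed:
        raise ValueError(f"unsupported save mode: {mode}")
    properties = {"user": user, "password": password}
    dataframe.write.jdbc(url=url, table=table, mode=mode, properties=properties)
```

For example, `save_to_adb(input_dataframe, adb_url, table_name, user, password, mode="overwrite")` would replace the demo table on each run. Use `overwrite` with care because it discards the existing table contents.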

<a id='read_adb'></a>
# Read from the Database using PySpark

PySpark can be used to load data from the Autonomous Database into a Spark application. The next cell makes a JDBC connection to the database defined using the `adb_url` variable and accesses the table defined with `table_name`. The credentials stored in the vault and previously read into memory are used. Once this command is executed, you can perform Spark operations on the resulting data frame.

This table is relatively small, so the notebook uses PySpark in the notebook session. However, for larger jobs, we recommend that you use the [Oracle Data Flow](https://www.oracle.com/big-data/data-flow/) service.

In [None]:
if "adb_url" in globals():
    output_dataframe = sc.read \
        .format("jdbc") \
        .option("url", adb_url) \
        .option("dbtable", table_name) \
        .option("user", user) \
        .option("password", password) \
        .load()
else:
    print("Skipping as it appears that you do not have adb_url configured.")
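
The `dbtable` option accepts anything that is valid in a SQL `FROM` clause, so a parenthesized subquery can be used to let the database return only part of a table. The following sketch builds the option dictionary for either case. The function name `jdbc_read_options`, the alias `spark_subquery`, and the example URL and credentials are assumptions made for illustration.

```python
def jdbc_read_options(url, user, password, table=None, query=None):
    """Build the option dictionary for a Spark JDBC read.

    Pass either a table name or a SELECT statement. A query is wrapped
    in parentheses with an alias so that it is valid as "dbtable".
    """
    if (table is None) == (query is None):
        raise ValueError("pass exactly one of table or query")
    dbtable = table if query is None else f"({query}) spark_subquery"
    return {"url": url, "dbtable": dbtable, "user": user, "password": password}


# Example values only; replace them with adb_url, user, and password.
options = jdbc_read_options(
    url="jdbc:oracle:thin:@mydb_high?TNS_ADMIN=/tmp/wallet",
    user="admin",
    password="example",
    query="SELECT * FROM ODSC_PYSPARK_ADB_DEMO FETCH FIRST 10 ROWS ONLY",
)
print(options["dbtable"])
```

The dictionary can then be passed to the reader with `sc.read.format("jdbc").options(**options).load()`.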

The database table is loaded into Spark so that you can transform it, build models on it, and much more. In the next cell, the notebook prints the first rows of the table, demonstrating that it was successfully loaded into Spark from the Autonomous Database.

In [None]:
if "output_dataframe" in globals():
    output_dataframe.show()
else:
    print("Skipping as it appears that output_dataframe was not created.")

<a id='clean_up'></a>
# Clean Up Artifacts

This notebook created a number of artifacts: it unzipped the wallet file, created a database table, and started a local Spark session. The next cell removes these resources.

In [None]:
if wallet_path != "<wallet_path>" and "adb_creds" in globals():
    # Make the credentials and wallet available to ADS, then drop the demo table.
    connection.update_repository(key="pyspark_adb", value=adb_creds)
    connection.import_wallet(wallet_path=wallet_path, key="pyspark_adb")
    conn = cx_Oracle.connect(user, password, tnsname)
    cursor = conn.cursor()
    cursor.execute(f"DROP TABLE {table_name}")
    cursor.close()
    conn.close()
else:
    print("Skipping as it appears that you do not have wallet_path or credentials specified.")
    
if "tns_path" in globals():
    shutil.rmtree(tns_path)
    
sc.stop()
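
The `DROP TABLE` statement above is built by placing `table_name` directly into the SQL text. That is safe for the fixed name used in this notebook, but if the name ever comes from user input it is worth validating first. The following sketch accepts only simple unquoted identifiers. The function name `drop_table_statement` is hypothetical, and the 128-character limit is an assumption based on recent Oracle Database versions.

```python
import re

# Unquoted identifier: a letter followed by letters, digits, _, $ or #.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$#]{0,127}")


def drop_table_statement(table_name):
    """Return a DROP TABLE statement, rejecting unsafe table names."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"not a simple table name: {table_name!r}")
    return f"DROP TABLE {table_name}"


print(drop_table_statement("ODSC_PYSPARK_ADB_DEMO"))
```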

<a id='ref'></a>
# References

* [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
* [Using sqlnet.ora file with JDBC](https://stackoverflow.com/questions/63696611/can-the-oracle-jdbc-thin-driver-use-a-sqlnet-ora-file-for-configuration)
* [Connecting to an Autonomous Database](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm)