Oracle Data Science service sample notebook.

Copyright (c) 2021, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Partition an Apache Spark DataFrame from an Autonomous Database</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:

This notebook demonstrates how to extract information from an Oracle Autonomous Database (ADB) and load it into a partitioned PySpark dataframe. It demonstrates two methods. First, it shows you how to extract an entire ADB table, partition it, and load it into a PySpark dataframe. Then, it demonstrates how to create a partitioned Spark dataframe using a [Data Query Language (DQL)](https://en.wikipedia.org/wiki/Data_query_language) statement to extract relational data or subsets of data from an ADB.

Developed on [PySpark 2.4 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 3.0)

---

## Contents:

- <a href='#intro'>Introduction</a>
- <a href='#setup'>Setup</a>
  - <a href='#setup_vars'>Required Variables</a>
  - <a href='#coresite'>`core-site.xml` Configuration</a>
  - <a href='#setup_credentials'>Obtain Credentials from the Vault</a>
  - <a href='#setup_wallet'>Setup the Wallet</a>
- <a href='#read_adb'>Creating Partitions from a Database</a>
  - <a href='#read_table'>Partition a Table</a>
  - <a href='#read_subquery'>Partition a Subquery</a>
- <a href='#ref'>References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `SH.SALES` dataset license [here](https://oss.oracle.com/licenses/upl).

---


In [None]:
import base64
import cx_Oracle
import oci
import os
import shutil
import tempfile
import zipfile

from ads.database import connection
from ads.vault.vault import Vault
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, asc
from urllib.parse import urlparse

<a id='intro'></a>
# Introduction

Oracle Cloud Infrastructure (OCI) [Data Flow](https://www.oracle.com/big-data/data-flow/) with [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) provides a scalable solution to store and process vast quantities of data. However, the source of truth is often a relational database management system (RDBMS) like [ADB](https://www.oracle.com/autonomous-database/). ADB provides advantages like ACID (atomicity, consistency, isolation, durability) compliance, rapid relational joins, support for complex business logic, and more. In contrast, [Apache Spark](http://spark.apache.org/)-based systems provide distributed computing, streaming, graph computation, access to a wide array of machine learning libraries, and more.

To improve parallelism in Apache Spark, it is often best to partition the data across the cluster so that the executors do not have to shuffle data around. Connections between ADB and Apache Spark are made with a [JDBC](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) driver. The properties `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` control the behavior of the partitioning. `partitionColumn` is the column in ADB that the data is partitioned on. Since Apache Spark does not have indexes, it is important to choose a partition column that suits your specific processing requirements. The `partitionColumn` column must be a numeric, date, or timestamp column.

`numPartitions` defines the maximum number of partitions that the data is to be split into and therefore the amount of parallelism in the cluster. It also determines the maximum number of JDBC connections.

The `lowerBound` and `upperBound` properties are used to define the stride. It is a common misconception that they act as a filter when retrieving data, which is not the case. The stride defines the range of data that is stored in each partition. Rows with values outside the bounds are still read; they are placed in the first or last partition.

Since the partition column is a numeric, date, or timestamp, it can be split into strides. The size of a stride is defined by:

```python
stride = (upperBound / numPartitions) - (lowerBound / numPartitions)
```

If `upperBound = 1000`, `lowerBound = 0`, and `numPartitions = 10`, then the stride equals 100.
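The following is a minimal sketch of how the stride maps a value of the partition column to a partition. It is not Spark's own code. The helper names `stride` and `partition_index` are hypothetical, and the use of integer division is an assumption about how the formula above is evaluated for integer bounds. The sketch illustrates that the bounds do not filter rows: values outside the bounds are clamped to the first or last partition.

```python
def stride(lower_bound, upper_bound, num_partitions):
    """Stride as defined above, assuming integer division for integer bounds."""
    return upper_bound // num_partitions - lower_bound // num_partitions


def partition_index(value, lower_bound, upper_bound, num_partitions):
    """Return the index of the partition that a partition-column value lands in."""
    if value is None:
        return 0  # NULL values are read together with the first partition
    step = stride(lower_bound, upper_bound, num_partitions)
    if step <= 0:
        raise ValueError("The bounds and numPartitions must give a positive stride.")
    index = (value - lower_bound) // step
    # Clamp rather than filter: out-of-range values go to the edge partitions.
    return max(0, min(num_partitions - 1, index))


print(stride(0, 1000, 10))                 # 100
print(partition_index(-50, 0, 1000, 10))   # 0, below lowerBound but still read
print(partition_index(250, 0, 1000, 10))   # 2
print(partition_index(5000, 0, 1000, 10))  # 9, above upperBound but still read
```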

If the following query was run using [SparkSQL](https://spark.apache.org/sql/):

```SQL
SELECT * FROM table
```

Assume that `partitionColumn = 'CUST_ID'`, then the query is mutated into the following set of queries to take advantage of the partitions:

```SQL
SELECT * FROM table WHERE CUST_ID IS NULL OR CUST_ID < 100
SELECT * FROM table WHERE CUST_ID >= 100 AND CUST_ID < 200
SELECT * FROM table WHERE CUST_ID >= 200 AND CUST_ID < 300
SELECT * FROM table WHERE CUST_ID >= 300 AND CUST_ID < 400
...
SELECT * FROM table WHERE CUST_ID >= 800 AND CUST_ID < 900
SELECT * FROM table WHERE CUST_ID >= 900
```
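The pattern in the queries above can be reproduced with a short sketch. The function `partition_predicates` is a hypothetical helper written for illustration, not part of PySpark, and the exact text that Spark generates may differ. It builds one `WHERE` predicate per partition from the same four properties.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE predicate per partition, following the pattern above."""
    if num_partitions < 2:
        return []  # a single partition reads the table without a predicate
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit and the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{column} IS NULL OR {upper}")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("CUST_ID", 0, 1000, 10):
    print(f"SELECT * FROM table WHERE {predicate}")
```

Because the first and last predicates are open-ended, every row is assigned to exactly one partition, regardless of the bounds.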

<a id='setup'></a>
# Setup

An ADB must be configured with permissions to read from the preinstalled `SH.SALES` table. This notebook also assumes that the credentials to access the database are stored in the [OCI Vault](https://www.oracle.com/security/cloud-security/key-management/). This is the best practice as it prevents the credentials from being stored locally or in the notebook where they may be accessible to others. If you do not have credentials stored in the Vault, use the `vault.ipynb` example notebook to guide you through the process.

In addition to the user credentials, the ADB requires a wallet file. You can obtain the wallet file from your account administrator or download it using the steps that are outlined in [downloading a wallet](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm#access). The wallet file is a ZIP file. This notebook unzips the wallet and updates the configuration settings so you don't have to.

The database connection also needs the TNS name of the database. Your database administrator can give you the TNS name of the database that you have access to.
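If you already have the wallet ZIP file, the TNS names can usually be read from it. The following is a minimal sketch that assumes the wallet contains a `tnsnames.ora` file at the top level of the archive and that each alias starts a line and is followed by `=` and an opening parenthesis. The function name `list_tns_names` is hypothetical.

```python
import re
import zipfile


def list_tns_names(wallet_zip_path):
    """Return the alias names found in the wallet's tnsnames.ora file."""
    with zipfile.ZipFile(wallet_zip_path, "r") as wallet:
        text = wallet.read("tnsnames.ora").decode("utf-8")
    # An alias is a name at the start of a line, followed by "=" and "(".
    return re.findall(r"^\s*([\w.-]+)\s*=\s*\(", text, flags=re.MULTILINE)
```

For example, `list_tns_names(wallet_path)` returns a list of names, and any one of them can be used as the `tnsname` value.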

<a id='setup_vars'></a>
## Required Variables

Update the next cell with the values for these variables.

1. `vault_id`, `key_id`, `secret_ocid`: The OCIDs of the vault, the encryption key, and the secret. The secret stores the username and password required to connect to your ADB in the Vault service. This notebook is designed so that any secret can be stored as long as it is in a dictionary format. See the `vault.ipynb` example notebook for detailed steps to store your credentials and generate the secret OCID.
1. `tnsname`: A TNS name valid for the database.
1. `wallet_path`: The local path to your wallet ZIP file, see the `autonomous_database.ipynb` example notebook for instructions on accessing the wallet file.

<a id='coresite'></a>
## `core-site.xml` Configuration

1. Before accessing Oracle Cloud Infrastructure (OCI) Object Storage from your local Apache Spark environment, ensure that you have the `core-site.xml` file under `spark_conf_dir` configured properly. It sets the connector properties that are used to connect to Object Storage. The `core-site.xml` file can be configured in the terminal with the following command: ``odsc core-site config -o``
1. Before creating applications in the OCI Data Flow service, ensure that you have configured your tenancy for that service. To do this configuration, follow the steps in the [Data Flow documentation](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#getting_started).

In [None]:
vault_id = "<vault_id>"
key_id = "<key_id>"
secret_ocid = "<secret_ocid>"
tnsname = "<tnsname>"
wallet_path = "<wallet_path>"

In [None]:
if vault_id != "<vault_id>" and key_id != "<key_id>" and secret_ocid != "<secret_ocid>":
    print("Getting wallet username and password")
    vault = Vault(vault_id=vault_id, key_id=key_id)
    adb_creds = vault.get_secret(secret_ocid)
    user = adb_creds["username"]
    password = adb_creds["password"]
else:
    print(
        "Skipping as it appears that you do not have vault, key, and secret ocid specified."
    )

<a id='setup_credentials'></a>
## Obtain Credentials from the Vault

The approach assumes that the Accelerated Data Science (ADS) library was used to store the secret. If the `vault_id`, `key_id`, and `secret_ocid` variables have been updated, then the notebook creates a handle to the Vault in the `vault` variable and calls its `get_secret()` method, which returns a dictionary with the user credentials.
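To make the dictionary format concrete, the following sketch shows a round trip between a credentials dictionary and a text payload. It assumes that the secret content is a base64-encoded JSON document. That is an assumption about the storage format, not a documented contract of this notebook, and the helper names `encode_secret` and `decode_secret` are hypothetical.

```python
import base64
import json


def encode_secret(credentials):
    """Serialize a dictionary to base64-encoded JSON text (assumed format)."""
    return base64.b64encode(json.dumps(credentials).encode("utf-8")).decode("ascii")


def decode_secret(content):
    """Reverse of encode_secret: base64-encoded JSON text back to a dictionary."""
    return json.loads(base64.b64decode(content).decode("utf-8"))


# Placeholder values only; never put real credentials in a notebook.
payload = encode_secret({"username": "<username>", "password": "<password>"})
example_creds = decode_secret(payload)
print(sorted(example_creds))  # ['password', 'username']
```

The keys `username` and `password` match the ones that this notebook reads from the dictionary returned by `get_secret()`.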

<a id='setup_wallet'></a>
## Setup the Wallet

An ADB requires a wallet file to access the database. The `wallet_path` variable defines the location of this file. The next cell prepares the wallet file to make a connection to the database. It also creates the ADB connection string, `adb_url`.

In [None]:
def setup_wallet(wallet_path):
    """
    Extract the ADB wallet ZIP file into a temporary directory for use in PySpark.

    Returns the directory path, which is used as the TNS_ADMIN location.
    """

    temporary_directory = tempfile.mkdtemp()

    # Extract everything locally.
    with zipfile.ZipFile(wallet_path, "r") as zip_ref:
        zip_ref.extractall(temporary_directory)

    return temporary_directory


if wallet_path != "<wallet_path>":
    print("Setting up wallet")
    tns_path = setup_wallet(wallet_path)
else:
    print("Skipping as it appears that you do not have wallet_path specified.")

In [None]:
if "tns_path" in globals() and tnsname != "<tnsname>":
    adb_url = f"jdbc:oracle:thin:@{tnsname}?TNS_ADMIN={tns_path}"
else:
    print("Skipping, as the tns_path or tnsname are not defined.")

<a id='read_adb'></a>
# Creating Partitions from a Database

This section creates an Apache Spark application that pulls data from the `SH.SALES` dataset. This table is built into the ADB. It demonstrates how to use the `dbtable` property in the [JDBC](https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html) driver along with the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` properties to partition the data.

The `dbtable` property accepts anything that is allowed in a `FROM` clause. This can be the full table name or a subquery.

The next cell creates a PySpark application:


In [None]:
# create a spark session
sc = SparkSession.builder.appName("PySpark Partition Example").getOrCreate()
sc.sparkContext.setLogLevel("ERROR")

<a id='read_table'></a>
## Partition a Table

You can partition the entire `SH.SALES` table into 10 partitions based on the `CUST_ID` column. The `read` property of the Spark session is used to read from the ADB, and `option("dbtable", "SH.SALES")` selects the entire database table for import into PySpark.

There are 918,843 records in this dataset and multiple records per customer. The distribution of `CUST_ID` is not uniform, which makes it a challenge to balance the partition sizes. Ideally, you want an equal number of records in each partition. The `lowerBound` is set to 0 and the `upperBound` is set to 14,000. These values are chosen because they do a reasonable job of splitting the records across the partitions so that they are roughly balanced.
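One way to judge whether a choice of bounds is roughly balanced is to compare the largest partition with the average partition size. The following sketch uses a hypothetical helper, `skew_ratio`, and made-up counts for illustration. In practice, the counts would come from the per-partition record count shown later in this notebook.

```python
def skew_ratio(partition_counts):
    """Return the largest partition size divided by the mean partition size."""
    if not partition_counts or sum(partition_counts) == 0:
        raise ValueError("At least one non-empty partition is required.")
    mean = sum(partition_counts) / len(partition_counts)
    return max(partition_counts) / mean


# Hypothetical record counts for four partitions.
balanced = [100, 100, 100, 100]
skewed = [10, 10, 10, 370]

print(skew_ratio(balanced))  # 1.0 means perfectly balanced
print(skew_ratio(skewed))    # larger values mean one partition dominates
```

A ratio close to 1 suggests that the executors receive similar amounts of work. A large ratio suggests trying different `lowerBound` and `upperBound` values.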

In [None]:
if "adb_url" in globals():
    sales_table = (
        sc.read.format("jdbc")
        .option("url", adb_url)
        .option("dbtable", "SH.SALES")
        .option("partitionColumn", "CUST_ID")
        .option("lowerBound", 0)
        .option("upperBound", 14000)
        .option("numPartitions", 10)
        .option("user", user)
        .option("password", password)
        .load()
    )
else:
    print("Skipping as it appears that you do not have adb_url configured.")

Display a subset of the data:

In [None]:
if "sales_table" in globals():
    sales_table.show()

Display the number of partitions in the dataframe:

In [None]:
if "sales_table" in globals():
    print(f"There are {sales_table.rdd.getNumPartitions()} partitions.")

Count the number of records in each partition:


In [None]:
if "sales_table" in globals():
    sales_table.withColumn("partitionId", spark_partition_id()).groupBy(
        "partitionId"
    ).count().orderBy(asc("partitionId")).show()

<a id='read_subquery'></a>
## Partition a Subquery

You can partition the result of a subquery into 10 partitions based on the `CUST_ID` column. Again, the `dbtable` option is used with the `read` property of the Spark session. This example uses `option("dbtable", <SUBQUERY>)`, which contains a subquery. The subquery is an SQL command that returns a record set, and it must be wrapped in parentheses.

In this example, the subquery limits the number of columns and the number of rows that it returns. The subquery is given by:

```SQL
(SELECT AMOUNT_SOLD, CUST_ID FROM SH.SALES WHERE CUST_ID < 1000)
```
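Because a missing pair of parentheses is an easy mistake to make, a small helper can add them when needed. This is a sketch only. The function `as_dbtable` is hypothetical, and it does not account for parentheses that appear inside string literals.

```python
def as_dbtable(query):
    """Wrap a SELECT statement in parentheses for use as the dbtable option."""
    text = query.strip().rstrip(";").strip()
    if _is_wrapped(text):
        return text
    return f"({text})"


def _is_wrapped(text):
    """True when the first '(' is closed by the final ')' of the text."""
    if not (text.startswith("(") and text.endswith(")")):
        return False
    depth = 0
    for position, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                # The opening parenthesis closes here; it must be the last character.
                return position == len(text) - 1
    return False


print(as_dbtable("SELECT AMOUNT_SOLD, CUST_ID FROM SH.SALES WHERE CUST_ID < 1000;"))
```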

In [None]:
subquery = "(SELECT AMOUNT_SOLD, CUST_ID FROM SH.SALES WHERE CUST_ID < 1000)"
if "adb_url" in globals():
    sales_subquery = (
        sc.read.format("jdbc")
        .option("url", adb_url)
        .option("dbtable", subquery)
        .option("partitionColumn", "CUST_ID")
        .option("lowerBound", 0)
        .option("upperBound", 1000)
        .option("numPartitions", 10)
        .option("user", user)
        .option("password", password)
        .load()
    )
else:
    print("Skipping as it appears that you do not have adb_url configured.")

Display a subset of the data. Only the `AMOUNT_SOLD` and `CUST_ID` are returned:

In [None]:
if "sales_subquery" in globals():
    sales_subquery.show()

Display the number of partitions in the dataframe:

In [None]:
if "sales_subquery" in globals():
    print(f"There are {sales_subquery.rdd.getNumPartitions()} partitions.")

Count the number of records in each partition:


In [None]:
if "sales_subquery" in globals():
    sales_subquery.withColumn("partitionId", spark_partition_id()).groupBy(
        "partitionId"
    ).count().orderBy(asc("partitionId")).show()

Stop the PySpark Cluster:

In [None]:
sc.stop()

<a id='ref'></a>
# References

- [ACID](https://en.wikipedia.org/wiki/ACID)
- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [Apache Spark](http://spark.apache.org/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [Downloading a wallet](https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm#access)
- [JDBC driver](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)
- [OCI Data Flow service](https://www.oracle.com/big-data/data-flow/)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [OCI Vault service](https://www.oracle.com/security/cloud-security/key-management/)
- [Oracle Autonomous Database (ADB)](https://www.oracle.com/autonomous-database/)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [Relational database management system (RDBMS)](https://en.wikipedia.org/wiki/Relational_database)
- [SparkSQL](https://spark.apache.org/sql/)