In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

<font color=gray>Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font color=red>Connect to Oracle Big Data Service</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

# Overview:

The Oracle Big Data Service ([BDS](https://docs.oracle.com/en-us/iaas/Content/bigdata/home.htm)) is an Oracle Cloud Infrastructure (OCI) service that is designed for big data use cases and supports Hadoop and Spark. BDS has features such as HDFS and Hive. You can use BDS for short-lived clusters used to tackle specific tasks, and long-lived clusters that manage large data lakes. To connect to BDS from a notebook session, the cluster must have Kerberos enabled. This notebook demonstrates how to configure Kerberos authentication using ADS.

Compatible conda pack: [PySpark 3.0 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 5.0)

---

## Contents:

* <a href='#introduction'>Introduction</a>
* <a href='#setup'>Setup</a>
* <a href='#BDSSecretKeeper'>BDSSecretKeeper</a>
* <a href='#save'>Save to Vault</a>
* <a href='#load'>Load from Vault</a>
* <a href='#connect'>Connect to Resources</a>
    * <a href='#connect_hdfs'>HDFS</a>
    * <a href='#connect_hive'>Hive</a>
* <a href='#clean-up'>Clean Up</a>
* <a href="#ref">References</a>

---

Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.

---


In [None]:
import ads
import fsspec
import os

from ads.bds.auth import refresh_ticket
from ads.secrets.big_data_service import BDSSecretKeeper
from oci.config import from_file
from oci.vault import VaultsClient
from oci.vault.models import ScheduleSecretDeletionDetails
from impala.dbapi import connect

ads.set_auth("resource_principal")

<a id='introduction'></a>
# Introduction

As a network security protocol, Kerberos authenticates requests between hosts. A trusted third party and secret-key cryptography are used to perform the authentication. The protocol has three main entities: the client, the server, and the Key Distribution Center (KDC). The notebook session is the client, and the BDS cluster is the server. The KDC performs authentication and grants tickets, which allow the client and server to verify each other's identities.

At a high level, Kerberos works by having the client make a request to authenticate. The KDC verifies the client's credentials and a number of messages are sent back and forth between the client, server and KDC. Ultimately, the KDC creates a service session key that is shared between the client and the server. This session key is used in the communication that follows.

Kerberos requires `keytab` and `krb5.conf` files. These are specific to each cluster and are obtained from the master node on the BDS cluster.
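
The exchange described above can be illustrated with a deliberately simplified sketch. It is not the real Kerberos protocol: it omits ticket-granting tickets, timestamps, and authenticators, and the `toy_seal` XOR cipher and every name in it are hypothetical and insecure. It only shows the central idea that the KDC shares a long-term key with each party and uses those keys to deliver the same session key to both.

```python
import hashlib
import secrets


def toy_seal(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR up to 32 bytes with SHA-256(key + nonce). Toy cipher, NOT secure.

    Applying it twice with the same key and nonce returns the original data.
    """
    stream = hashlib.sha256(key + nonce).digest()
    if len(data) > len(stream):
        raise ValueError("toy_seal handles at most 32 bytes")
    return bytes(a ^ b for a, b in zip(data, stream))


def kdc_issue_session_key(client_key: bytes, service_key: bytes):
    """Hypothetical KDC step: create one session key and seal a copy for each party."""
    session_key = secrets.token_bytes(32)
    client_nonce = secrets.token_bytes(16)
    service_nonce = secrets.token_bytes(16)
    for_client = (client_nonce, toy_seal(client_key, client_nonce, session_key))
    ticket = (service_nonce, toy_seal(service_key, service_nonce, session_key))
    return for_client, ticket


# Long-term keys known to the KDC. On the client, the keytab plays this role.
client_key = secrets.token_bytes(32)
service_key = secrets.token_bytes(32)

for_client, ticket = kdc_issue_session_key(client_key, service_key)

# Each party unseals its copy with its own long-term key.
client_session_key = toy_seal(client_key, *for_client)
service_session_key = toy_seal(service_key, *ticket)

print(client_session_key == service_session_key)
```

In the notebook, none of this is done by hand: the `keytab` file holds the principal's long-term key, and the Kerberos libraries perform the exchange with the KDC.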

<a id='setup'></a>
# Setup

The best practice is to store the Kerberos credentials and configuration parameters in the vault. By doing this, access can be limited and auditing is supported. Further, it allows for the credentials to be rotated and updated in a single place. You don't have to update each notebook session because they pull the configuration parameters and Kerberos files from the vault as needed. ADS provides the `BDSSecretKeeper` class that makes connecting to BDS, Hive, and HDFS simple and secure. 

You must obtain two files from the master node on the BDS cluster: the `keytab` file and the `krb5.conf` file. Both files must be stored on the block volume in the notebook session. The `krb5.conf` file is in the `/etc` directory of the BDS master node. If you use a vault to store the credentials, the contents of these files are kept in the vault and written to the block volume when the secret is loaded.

For Kerberos authentication, you provide the following values:

- `kerb5_path`: The local path to the `krb5.conf` configuration file.
- `keytab_path`: The local path to the principal's `keytab` file.
- `principal`: The unique identity to which Kerberos can assign tickets.

To connect to the BDS cluster's HDFS, you need the hostname and the port. These aren't strictly needed to access BDS, but most data science workflows require at least occasional access to HDFS:

- `hdfs_host`: The HDFS hostname to use to connect to the HDFS file system on the BDS cluster.
- `hdfs_port`: The HDFS port to use to connect to the HDFS file system on the BDS cluster.

BDS provides access to Hive. To connect to Hive you need the hostname and the port:

- `hive_host`: The Hive hostname for the BDS cluster.
- `hive_port`: The Hive port for the BDS cluster.

To save the configuration parameters in the vault, you provide the OCIDs of the vault and the master encryption key:

- `vault_id` = OCID of the vault.
- `key_id` = OCID of the master encryption key.

If the configuration parameters are already stored in the vault, all you need is the secret OCID that has the necessary information. It's safe to hardcode the secret OCID into a notebook because OCI IAM manages access.

In the following cell, update the values to use to create your secret:

In [None]:
# Kerberos Authentication
kerb5_path = "<kerb5_path>"
keytab_path = "<keytab_path>"
principal = "<principal>"

# HDFS configuration
hdfs_host = "<hdfs_host>"
hdfs_port = "<hdfs_port>"

# HIVE configuration
hive_host = "<hive_host>"
hive_port = "<hive_port>"

# Vault OCIDs
vault_id = "<vault_id>"
key_id = "<key_id>"
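
Before creating the secret, it can help to check which values in the cell above are still placeholders. The following is a minimal sketch of such a check. The `unset_parameters` helper and the example values are hypothetical and are not part of ADS.

```python
import re

# Matches values that are still template placeholders such as "<vault_id>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unset_parameters(params: dict) -> list:
    """Return the names of parameters that are None, empty, or still placeholders."""
    missing = []
    for name, value in params.items():
        text = "" if value is None else str(value).strip()
        if text == "" or PLACEHOLDER.match(text):
            missing.append(name)
    return missing


# Hypothetical values for illustration; in the notebook, pass the variables
# defined in the configuration cell instead.
example = {
    "principal": "training/bdsnode.example.com@EXAMPLE.COM",
    "hdfs_host": "<hdfs_host>",
    "hdfs_port": "9870",
    "vault_id": "<vault_id>",
}
print(unset_parameters(example))
```

Matching any `<...>` value, rather than comparing against each placeholder string individually, keeps the check correct even if a placeholder is renamed.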

<a id='BDSSecretKeeper'></a>
# BDSSecretKeeper

ADS provides the `BDSSecretKeeper` class to manage your BDS secrets. It's used to store the configuration parameters, including the `keytab` and `krb5.conf` files, in the vault. Use `BDSSecretKeeper` together with the `refresh_ticket()` function from `ads.bds.auth` to manage the connection to BDS, HDFS, and Hive.

The following cell creates a `BDSSecretKeeper` object, which contains the configuration parameters to connect with BDS and its services.

In [None]:
if vault_id != "<vault_id>" and key_id != "<key_id>" and principal != "<principal>":
    secret = BDSSecretKeeper(
        vault_id=vault_id,
        key_id=key_id,
        principal=principal,
        hdfs_host=hdfs_host if hdfs_host != "<hdfs_host>" else None,
        hive_host=hive_host if hive_host != "<hive_host>" else None,
        hdfs_port=hdfs_port if hdfs_port != "<hdfs_port>" else None,
        hive_port=hive_port if hive_port != "<hive_port>" else None,
        keytab_path=keytab_path if keytab_path != "<keytab_path>" else None,
        kerb5_path=kerb5_path if kerb5_path != "<kerb5_path>" else None,
    )
else:
    secret = None
    print(
        "BDSSecretKeeper object was not created. Enter configuration values in the Setup section."
    )

<a id='save'></a>
# Save to Vault

The `.save()` method on a `BDSSecretKeeper` object stores the configuration parameters in the vault. You can provide metadata such as a name, description, and tags. The contents of the `keytab` and `krb5.conf` files are saved if the `save_files` parameter is set to `True`. The best practice is to store these files in the vault.

The following cell stores the BDS configuration parameters in the vault.

In [None]:
if secret is not None:
    saved_secret = secret.save(
        name="Sample_BDS_Secret_1",
        description="Demo BDSSecretKeeper secret to connect to BDS resources",
        freeform_tags={"schema": "BDSSecretKeeper"},
        save_files=True,
    )
    print(f"Secret OCID: {saved_secret.secret_id}")
else:
    saved_secret = None
    print(
        "No secrets were saved to the Vault. Enter configuration values in the Setup section."
    )
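
OCI Vault accepts secret content as base64-encoded text, which is why a binary `keytab` file can be kept in a secret alongside text parameters. The following is a conceptual sketch of that kind of round trip using only the standard library. The payload layout and the `pack_secret` and `unpack_secret` names are hypothetical; the format that `BDSSecretKeeper` actually uses may differ.

```python
import base64
import json


def pack_secret(params: dict, keytab_bytes: bytes) -> str:
    """Bundle text parameters and binary file content into one base64 string."""
    payload = dict(params)
    # Binary content is base64-encoded first so that it fits in JSON.
    payload["keytab_content"] = base64.b64encode(keytab_bytes).decode("ascii")
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


def unpack_secret(secret_text: str):
    """Reverse pack_secret: return the parameters and the original file bytes."""
    payload = json.loads(base64.b64decode(secret_text))
    keytab_bytes = base64.b64decode(payload.pop("keytab_content"))
    return payload, keytab_bytes


fake_keytab = bytes([5, 2, 0, 0, 255, 128])  # stand-in bytes, not a real keytab
packed = pack_secret({"principal": "user@EXAMPLE.COM"}, fake_keytab)
params, restored = unpack_secret(packed)
print(params, restored == fake_keytab)
```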

<a id='load'></a>
# Load from Vault

The `.load_secret()` method in the `BDSSecretKeeper` class accepts a secret's OCID and returns the contents of the secret. The default behavior is to save the `keytab` and `krb5.conf` files to the `keytab_path` and `kerb5_path` paths that are specified in the secret. You can change this by passing updated values to the `keytab_path` and `kerb5_path` parameters. Any parameter that can be passed to the `BDSSecretKeeper` constructor can be used in `.load_secret()` to override the values returned from the vault.

The `keytab` and `krb5.conf` files can only be written to the local block storage if they were saved in the vault.

Once the configuration parameters have been obtained and the files are saved, the `refresh_ticket()` function is used to create the Kerberos ticket.

In the following cell, the `.load_secret()` method obtains the configuration parameters from the vault and sets up the `keytab` and `krb5.conf` files. The call to `refresh_ticket()` obtains a Kerberos ticket. The ticket isn't returned from `refresh_ticket()` because it's managed by a process running in the notebook session.

In [None]:
if saved_secret is not None:
    with BDSSecretKeeper.load_secret(saved_secret.secret_id) as config:
        refresh_ticket(principal=config["principal"], keytab_path=config["keytab_path"])
        hdfs_config = {
            "host": config["hdfs_host"],
            "port": config["hdfs_port"],
            "protocol": "webhdfs",
            "kerberos": True,
        }
        hive_config = {
            "host": config["hive_host"],
            "port": config["hive_port"],
        }
else:
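    config = None
    # Define these so the HDFS and Hive cells can test them without raising a NameError.
    hdfs_config = None
    hive_config = None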
    print(
        "No secret was returned from the Vault. Enter configuration values in the Setup section."
    )
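
After the ticket has been requested, one way to confirm that a valid ticket is present is to ask the Kerberos command-line tools. The sketch below assumes that the MIT Kerberos `klist` command is on the `PATH`; its `-s` flag runs silently and sets the exit status to 0 only when the credentials cache is readable and unexpired. The `has_valid_ticket` helper is hypothetical and is not part of ADS.

```python
import subprocess


def has_valid_ticket() -> bool:
    """Return True if `klist -s` reports a readable, unexpired credentials cache."""
    try:
        result = subprocess.run(
            ["klist", "-s"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # klist is not installed or cannot be executed in this environment.
        return False
    return result.returncode == 0


print("Kerberos ticket available:", has_valid_ticket())
```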

<a id='connect'></a>
# Connect to Resources

A successful call to `refresh_ticket()` enables you to connect to resources such as Hive and HDFS.

<a id='connect_hdfs'></a>
## HDFS

The WebHDFS protocol was developed to allow an external application, such as a notebook session, to access or manage files in HDFS on BDS. It's based on an industry-standard RESTful mechanism that doesn't require Java bindings. It supports operations such as reading files, writing to files, making directories, changing permissions, and renaming. It defines a public HTTP REST API that permits clients to access HDFS over the web. Clients can use common tools such as curl and wget to access [HDFS](https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#Document+Conventions).
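
To make the REST API concrete, the sketch below builds WebHDFS request URLs, which follow the pattern `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The `webhdfs_url` helper, the hostname, and the port are hypothetical; use the values for your own cluster, which may also require `https`.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, scheme="http", **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, for example '/user/demo'")
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and port, for illustration only.
list_url = webhdfs_url("bds-master.example.com", 9870, "/user/demo", "LISTSTATUS")
print(list_url)

# On a Kerberos-enabled cluster the HTTP client must authenticate with SPNEGO.
# With curl and a valid ticket, that is typically:
#   curl --negotiate -u : "<url>"
```

Libraries such as `fsspec` issue requests of this form on your behalf, so URLs like these are mainly useful for debugging connectivity.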

Data scientists commonly use `fsspec` to perform file-level operations on HDFS. To obtain a handle for the HDFS file system over WebHDFS, use the `fsspec.filesystem()` method. This method accepts the HDFS hostname and port as keyword arguments, along with the protocol, which must be `webhdfs`. To connect to BDS, Kerberos must be enabled, so `kerberos` is set to `True`. The following cell unpacks the connection parameters that were defined earlier and obtains a file system handle. Use the `fs` file system handle to list the files in the root directory with `fs.ls("/")`.

In [None]:
if (
    hdfs_config is not None
    and hdfs_config["host"] is not None
    and hdfs_config["port"] is not None
):
    fs = fsspec.filesystem(**hdfs_config)
    print(fs.ls("/"))
else:
    fs = None
    print(
        "No connection was made to the HDFS file system. Enter configuration values in the Setup section."
    )

<a id='connect_hive'></a>
## Hive

Apache Hive is a data warehouse system that runs on top of Hadoop. It allows you to run SQL-like queries on your data using the Hive Query Language (HQL). The `connect()` function in the `impala.dbapi` module makes a connection. This connection needs the Hive hostname and port. For a secure BDS cluster, the authentication mechanism must be set to `GSSAPI`, and the Kerberos service name is set to `hive`.

A cursor is generally used to work with HQL. The `.execute()` method runs HQL commands on Hive. For example, this command uses a Hive cursor to get a copy of all the data in the fictitious `my_table` table.

``` python
cursor.execute("SELECT * FROM my_table")
```

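To materialize the results and convert them to a pandas DataFrame, use the following command:

``` python
import pandas as pd

# cursor.description holds one tuple per column; the first item is the column name.
df = pd.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])
```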
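
The cursor returned by `impala.dbapi` follows the Python DB-API 2.0 conventions, so the `execute()`, `description`, and `fetchall()` pattern can be tried without a cluster. The sketch below uses the standard library's `sqlite3` module purely as a stand-in for Hive. The `demo_table` table and its rows are made up, and SQLite's SQL dialect differs from HQL.

```python
import sqlite3

# An in-memory SQLite database stands in for Hive in this illustration.
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()

demo_cursor.execute("CREATE TABLE demo_table (id INTEGER, name TEXT)")
demo_cursor.executemany(
    "INSERT INTO demo_table VALUES (?, ?)", [(1, "alpha"), (2, "beta")]
)

demo_cursor.execute("SELECT * FROM demo_table ORDER BY id")

# Column names come from the first item of each entry in cursor.description.
columns = [col[0] for col in demo_cursor.description]
rows = [dict(zip(columns, row)) for row in demo_cursor.fetchall()]
print(rows)

demo_cursor.close()
demo_connection.close()
```

The names `demo_connection` and `demo_cursor` are used so that the sketch doesn't overwrite a `cursor` variable that points at Hive.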

The following cell makes a connection to Hive and returns a cursor, which is used to list the databases in the Hive cluster.

In [None]:
if (
    hive_config is not None
    and hive_config["host"] is not None
    and hive_config["port"] is not None
):
    cursor = connect(
        host=hive_config["host"],
        port=hive_config["port"],
        auth_mechanism="GSSAPI",
        kerberos_service_name="hive",
    ).cursor()
    cursor.execute("SHOW DATABASES")
    print(cursor.fetchall())
else:
    cursor = None
    print(
        "No connection was made to Hive. Enter configuration values in the Setup section."
    )

<a id='clean-up'></a>
# Clean Up

The following code closes the Hive cursor, removes the `keytab` and `krb5.conf` files from local storage, and schedules the secret for deletion from the vault.

In [None]:
# Disconnect from Hive
if cursor is not None:
    cursor.close()

# Remove the Keytab file
if keytab_path is not None and os.path.exists(keytab_path):
    os.remove(keytab_path)

# Remove the `krb5.conf` file
if kerb5_path is not None and os.path.exists(kerb5_path):
    os.remove(kerb5_path)

# Delete the secret
if saved_secret is not None:
    oci_config = from_file(
        os.path.join(os.path.expanduser("~"), ".oci", "config"), "DEFAULT"
    )
    try:
        VaultsClient(oci_config).schedule_secret_deletion(
            saved_secret.secret_id, ScheduleSecretDeletionDetails()
        )
    except Exception:  # a bare except would also trap KeyboardInterrupt
        print("The secret has already been deleted.")
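
The file removal steps above check that a path exists before deleting it. An alternative sketch is to attempt the deletion and treat a missing file as a normal outcome, which avoids a gap between the check and the removal. The `remove_if_present` helper is hypothetical, and the demonstration runs on a temporary file rather than a real `keytab`.

```python
import os
import tempfile


def remove_if_present(path) -> bool:
    """Delete a file and report whether anything was removed."""
    if not path:
        return False
    try:
        os.remove(os.path.expanduser(path))
    except FileNotFoundError:
        return False
    return True


# Demonstration on a throwaway temporary file.
file_descriptor, demo_path = tempfile.mkstemp()
os.close(file_descriptor)

first_attempt = remove_if_present(demo_path)   # the file exists, so it is removed
second_attempt = remove_if_present(demo_path)  # the file is already gone
print(first_attempt, second_attempt)
```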

<a id="ref"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)