<font color=gray>ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font color=red>Getting Started with Oracle Cloud Infrastructure Data Science</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

## Service Overview

Welcome to Oracle Cloud Infrastructure Data Science Service!

Oracle Cloud Infrastructure Data Science Service is a fully managed platform for data science teams to build, train, and manage machine learning models using Oracle Cloud Infrastructure.

The Data Science Service:

* Provides data scientists with a collaborative, project-driven workspace.
* Enables self-service access to infrastructure for data science workloads.
* Includes Python-centric tools, libraries, and packages developed by the open-source community and the [Oracle Accelerated Data Science Library](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html), which supports the end-to-end lifecycle of predictive models:
    * Data acquisition, profiling, preparation, and visualization.
    * Feature engineering.
    * Model training.
    * Model evaluation, explanation, and interpretation.
    * Model storage through the [Model Catalog](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/manage-models.htm).
    * Model deployment.
* Integrates with the rest of the Oracle Cloud Infrastructure stack, including [Oracle Functions](https://docs.cloud.oracle.com/en-us/iaas/Content/Functions/Concepts/functionsoverview.htm), [Data Flow](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_data_flow.htm), [Autonomous Data Warehouse](https://docs.cloud.oracle.com/en-us/iaas/Content/Database/Concepts/adboverview.htm), [Streaming](https://docs.cloud.oracle.com/en-us/iaas/Content/Streaming/Concepts/streamingoverview.htm), [Vault](https://docs.cloud.oracle.com/en-us/iaas/Content/KeyManagement/Concepts/keyoverview.htm), [Logging](https://docs.cloud.oracle.com/en-us/iaas/Content/Logging/Concepts/loggingoverview.htm#loggingoverview), and [Object Storage](https://docs.cloud.oracle.com/en-us/iaas/Content/Object/Concepts/objectstorageoverview.htm).
* Helps data scientists concentrate on methodology and domain expertise to deliver more models to production.

For more details, check out the [Data Science service guide](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm).

---

## Overview

The PySpark and Data Flow conda allows you to leverage the power of Apache Spark. Use it to access the full computational power of a notebook session through parallel computing. For larger jobs, you can interactively develop Spark applications and submit them to Oracle Data Flow without blocking the notebook session. PySpark MLlib implements a wide collection of powerful machine-learning algorithms. Use the SQL-like language of PySparkSQL to analyze huge amounts of structured and semi-structured data stored on Oracle Object Storage. Speed up your workflow by using sparksql-magic to run PySparkSQL queries directly in the notebook.

In this notebook, we will go through how to authenticate to Oracle Cloud Infrastructure Resources and how to configure the `core-site.xml` file so PySpark can access Object Storage.

---

## Prerequisites:
- Experience with a specific topic: Novice
- Professional experience: None

---

## Objectives:

- <a href='#authentication'>Understanding Authentication to Oracle Cloud Infrastructure Resources from a Notebook Session</a>
    - <a href='#resource_principals'>Authentication with Resource Principals</a>
        - <a href='#resource_principals_ads'>Resource Principals Authentication using the ADS SDK</a>
        - <a href='#resource_principals_oci'>Resource Principals Authentication using the OCI SDK</a>
        - <a href='#resource_principals_cli'>Resource Principals Authentication using the OCI CLI</a>
- <a href='#conda'>Conda</a>
    - <a href='#conda_overview'>Overview</a>
    - <a href='#conda_libraries'>Principal Conda Libraries</a>
    - <a href='#conda_configuration'>Configuration</a>
        - <a href='#odsc_coresite_command'>Configuration of `core-site.xml` Using the `odsc` Command Line Tool</a>
        - <a href='#manually_update_coresite'>Manually Configuring `core-site.xml`</a>
        - <a href='#conda_configuration_testing'>Testing the Configuration</a>
- <a href='#ref'>References</a> 

---

In [None]:
import logging
import warnings
import os
import ads
from ads import set_documentation_mode
from oci.auth.signers import get_resource_principals_signer
from oci.data_science import DataScienceClient
from os import path
from os import cpu_count
from pyspark.sql import SparkSession
import re

set_documentation_mode(False)
warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

<a id='authentication'></a>
# Understanding Authentication to Oracle Cloud Infrastructure Resources from a Notebook Session

When working within a notebook session, the `datascience` user is used. This user does not have an Oracle Cloud Infrastructure Identity and Access Management (IAM) identity, so it has no access to the Oracle Cloud Infrastructure API. To access Oracle Cloud Infrastructure resources, including Data Science projects, models and any other Oracle Cloud Infrastructure service resources from the notebook environment, you must configure either resource principals or API keys. 

PySpark cannot reach Object Storage using resource principals. The only authentication mechanism that allows PySpark to connect to Object Storage is the Oracle Cloud Infrastructure configuration and key files. In addition, neither the configuration file nor the key file can use a passphrase. Refer to the [OCI Documentation on Setting up Keys and Configuration File](https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/devguidesetupprereq.htm) and the example notebook `api_keys.ipynb` for how to set up the configuration file and API keys.

If your configuration and key files must use a passphrase, you can instead download the file from Object Storage to local storage with the OCI Python SDK and then load the local file in the Spark context.
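
The download step described above could look like the following minimal sketch. The helper name `download_object` and its parameters are illustrative and not part of the OCI SDK. The `client` argument is assumed to be an `oci.object_storage.ObjectStorageClient`, whose `get_object()` response exposes the object body as a stream.

```python
from pathlib import Path


def download_object(client, namespace, bucket, object_name, dest):
    """Stream one Object Storage object to a local file and return its path.

    `client` is assumed to be an oci.object_storage.ObjectStorageClient
    (or any object exposing the same get_object() call).
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    response = client.get_object(namespace, bucket, object_name)
    with open(dest, "wb") as f:
        # Write in 1 MiB chunks so a large object is never held fully in memory.
        for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
            f.write(chunk)
    return str(dest)
```

In a notebook session, the client might be created with `oci.object_storage.ObjectStorageClient(oci.config.from_file())`, and the returned path can then be read by Spark as a local file.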


<a id='resource_principals'></a>
## Authentication with Resource Principals 

**Note: If you authenticate with Resource Principals, PySpark will not be able to access Object Storage.** 

Oracle Cloud Infrastructure Data Science enables easy and secure authentication using the notebook session's resource principal to access other Oracle Cloud Infrastructure resources, including Data Science projects and models. Follow the steps below to utilize your notebook session's resource principal.

In advance, a tenancy administrator must write policies that grant the resource principal permission to access other Oracle Cloud Infrastructure resources. See [Manually Configuring Your Tenancy for Data Science](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/configure-tenancy.htm) for more details.

There are two ways to configure the notebook to use resource principals: the `ads` library or the `oci` library. While both libraries provide the required authentication, the `ads` library has been specifically designed for easy operation within a Data Science notebook session.

If you do not wish to take on these library dependencies, it is also possible to use the `oci` command on the command line.

For more details on using resource principals in the Data Science service, see the [ADS Configuration Guide](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/configuration/configuration.html#) and the [Authenticating to the Oracle Cloud Infrastructure APIs from a Notebook Session](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#topic_kxj_znw_pkb).

<a id='resource_principals_ads'></a>
### Resource Principals Authentication using the ADS SDK

The `set_auth()` method sets the proper authentication mechanism for ADS. ADS uses the `oci` SDK to access resources like the model catalog or Object Storage.

Within a notebook session, configure ADS to use a resource principal by running the following in a notebook cell:

In [None]:
ads.set_auth(auth='resource_principal') 

<a id='resource_principals_oci'></a>
### Resource Principals Authentication using the OCI SDK

Within your notebook session, the `oci` library can use the resource principal. This method gives more flexibility but this flexibility is generally not needed within a Data Science notebook session. The following cell demonstrates how to make a basic connection using the default settings.

In [None]:
resource_principal = get_resource_principals_signer() 
dsc = DataScienceClient(config={}, signer=resource_principal)

<a id='resource_principals_cli'></a>
### Resource Principals Authentication using the OCI CLI

Within a notebook session, the Oracle Cloud Infrastructure CLI can authenticate with the resource principal when you pass the `--auth=resource_principal` flag. For example:

In [None]:
cmd = "oci data-science project get --project-id=$PROJECT_OCID --auth=resource_principal 2>&1"
print(os.popen(cmd).read())

If the resource principal is correctly configured, a message similar to the following will be printed.

```{json}
{
  "data": {
    "compartment-id": "ocid1.compartment.oc1..aaaaaaaafl3avkal72rrwuy4m5rumpwh7r4axzzzzzz...",
    "created-by": "ocid1.user.oc1..aaaaaaaabfrlcbiyvjmjvgh3ns6trdyoewxytzzzzzz...",
    "defined-tags": {},
    "description": "my favorite demo project\n",
    "display-name": "demo-project",
    "freeform-tags": {},
    "id": "ocid1.datascienceproject.oc1.iad.aaaaaaaappvg4tp5kmbkurcyeghxaqmaknw3szzzzzz...",
    "lifecycle-state": "ACTIVE",
    "time-created": "2019-11-14T22:29:06.870000+00:00"
  },
  "etag": "b4d66fb733748f3454206d5de6b9acb3634edc804zzzzzz..."
}
```
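
Because the CLI prints JSON, the response can be parsed instead of read by eye. The following is a minimal sketch: `summarize_project` is a hypothetical helper, and `sample_output` is a trimmed stand-in for the response shown above, not real CLI output.

```python
import json

# Trimmed stand-in for the CLI response; only the fields used below are kept.
sample_output = """
{
  "data": {
    "display-name": "demo-project",
    "lifecycle-state": "ACTIVE",
    "time-created": "2019-11-14T22:29:06.870000+00:00"
  }
}
"""


def summarize_project(cli_output):
    """Return (display name, lifecycle state) from the project JSON."""
    data = json.loads(cli_output)["data"]
    return data["display-name"], data["lifecycle-state"]


print(summarize_project(sample_output))  # ('demo-project', 'ACTIVE')
```

If the command fails and prints something other than JSON, `json.loads` raises a `ValueError`, which gives a simple way to detect a misconfigured resource principal.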

<a id='conda'></a>
# Conda

<a id='conda_overview'></a>
## Overview

This conda allows data scientists to leverage Apache Spark. You can set up Spark applications and submit them to Oracle Cloud Infrastructure Data Flow. You can also use PySpark, including PySpark MLlib and PySparkSQL.

<a id='conda_libraries'></a>
## Principal Conda Libraries

Here are some of the libraries included in this conda:

- ads: Partial ADS distribution. This distribution excludes Oracle AutoML and MLX. 
- oraclejdk: Oracle Java Development Kit
- pyspark: Python API for Apache Spark
- scikit-learn: library for building machine learning models including regressions, classifiers and clustering algorithms
- sparksql-magic: library for Spark SQL magic commands for Jupyter Notebook

<a id='conda_configuration'></a>
## Configuration

**To access Oracle Cloud Infrastructure Object Storage the `core-site.xml` file must be configured.**  

`core-site.xml` can be manually configured or configured with the help of the `odsc` program.

<a id='odsc_coresite_command'></a>
### Configuration of `core-site.xml` Using the `odsc` Command Line Tool

If you have an OCI configuration file, you can run `odsc core-site config -o`. By default, this command reads the configuration file stored at `~/.oci/config`, auto-populates `core-site.xml`, and saves the result to `~/spark_conf_dir/core-site.xml`.

The following command line options are available: 
- -a, --authentication Authentication mode. Only api_key is supported.
- -c, --config Path to the oci config file.
- -p, --profile Name of the profile.
- -r, --region Name of the region.
- -o, --overwrite Overwrite core-site.xml.
- -O, --output Output path for core-site.xml.
- -q, --quiet Suppress non-error output.

Run `odsc core-site config --help` on the command line to check the usage of this tool.
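
When the command is run from a notebook cell, it can help to assemble the arguments in Python first. The sketch below only builds the argument list from the long options listed above and does not execute anything. The helper name `odsc_core_site_command` is hypothetical, and it assumes each option takes its value as a separate argument.

```python
import shlex


def odsc_core_site_command(config=None, profile=None, region=None,
                           output=None, overwrite=False, quiet=False):
    """Build the argument list for `odsc core-site config`.

    Only the long options documented above are used; nothing is executed.
    """
    args = ["odsc", "core-site", "config"]
    for flag, value in (("--config", config), ("--profile", profile),
                        ("--region", region), ("--output", output)):
        if value is not None:
            args += [flag, str(value)]
    if overwrite:
        args.append("--overwrite")
    if quiet:
        args.append("--quiet")
    return args


cmd = odsc_core_site_command(profile="DEFAULT", region="us-ashburn-1", overwrite=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the options look right.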

<a id='manually_update_coresite'></a>
### Manually Configuring `core-site.xml`
When the conda package is installed, a templated version of `core-site.xml` is installed. This file can be manually updated.

The `core-site.xml` file has several parameters you need to provide. Here are their descriptions:

`fs.oci.client.hostname`: Address of Object Storage, for example `https://objectstorage.us-ashburn-1.oraclecloud.com`. Replace `us-ashburn-1` with the region you are in.

`fs.oci.client.auth.tenantId`: OCID of your tenancy.

`fs.oci.client.auth.userId`: your user OCID.

`fs.oci.client.auth.fingerprint`: fingerprint for the key pair being used.

`fs.oci.client.auth.pemfilepath`: The full path and file name of the private key used for authentication. 

The values of these parameters can be found inside the OCI configuration file.

The following is an example `core-site.xml` file that has been updated. Place all the parameter values between the `<value>` and `</value>` tags.

```{xml}
<configuration><!-- reference: https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnector.htm -->
  <property>
    <name>fs.oci.client.hostname</name>
    <value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
  </property>
  <!--<property>-->
    <!--<name>fs.oci.client.hostname.myBucket.myNamespace</name>-->
    <!--<value></value>&lt;!&ndash; myBucket@myNamespace &ndash;&gt;-->
  <!--</property>-->
  <property>
    <name>fs.oci.client.auth.tenantId</name>
    <value>ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkzzzzz...</value> 
  </property>
  <property>
    <name>fs.oci.client.auth.userId</name>
    <value>ocid1.user.oc1..aaaaaaaacdxbfmyhe7sxc6iwi73okzuf3src6zzzzzz...</value>
  </property>
  <property>
    <name>fs.oci.client.auth.fingerprint</name>
    <value>01:01:02:03:05:08:13:1b:2e:49:77:c0:01:37:01:f7</value>
  </property>
  <property>
    <name>fs.oci.client.auth.pemfilepath</name>
    <value>/home/datascience/.oci/key.pem</value>
  </property>
</configuration>
```
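
Since these values come from the OCI configuration file, the mapping can also be scripted. The following is a minimal sketch using only the Python standard library; it is not how `odsc` works internally. The helper name `build_core_site` is hypothetical, and deriving the hostname from the `region` entry of the configuration file is an assumption based on the hostname pattern described above.

```python
import configparser
import os
import xml.etree.ElementTree as ET

# OCI config key -> core-site.xml property, following the parameter list above.
KEY_TO_PROPERTY = {
    "tenancy": "fs.oci.client.auth.tenantId",
    "user": "fs.oci.client.auth.userId",
    "fingerprint": "fs.oci.client.auth.fingerprint",
    "key_file": "fs.oci.client.auth.pemfilepath",
}


def build_core_site(config_path, profile="DEFAULT"):
    """Return core-site.xml text built from one profile of an OCI config file."""
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(os.path.expanduser(config_path)):
        raise FileNotFoundError(config_path)
    section = parser[profile]

    # Assumption: the Object Storage endpoint follows the regional pattern.
    properties = {
        "fs.oci.client.hostname":
            f"https://objectstorage.{section['region']}.oraclecloud.com",
    }
    for key, prop_name in KEY_TO_PROPERTY.items():
        value = section[key]
        if key == "key_file":
            value = os.path.expanduser(value)  # the property needs a full path
        properties[prop_name] = value

    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

Calling `build_core_site("~/.oci/config")` returns the XML as a string, which could then be written to `~/spark_conf_dir/core-site.xml`.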


<a id='conda_configuration_testing'></a>
### Testing the Configuration

Set up a Spark session in your PySpark conda environment to test whether the configuration has been set up properly. Run the following cells and make sure there are no errors.

In [None]:
# create a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.cores", str(max(1, cpu_count() - 1))) \
    .config("spark.executor.cores", str(max(1, cpu_count() - 1))) \
    .getOrCreate()
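
The cell above sizes the driver and executor from `cpu_count() - 1`. `os.cpu_count()` can return `None` when the number of CPUs cannot be determined, so a slightly more defensive variant might look like the sketch below. The helper name `spark_core_setting` is hypothetical.

```python
import os


def spark_core_setting(reserve=1):
    """Return the core count as a string, leaving `reserve` cores free.

    os.cpu_count() may return None, so fall back to a single core.
    """
    available = os.cpu_count() or 1
    return str(max(1, available - reserve))
```

It can be used in place of the inline expression, for example `.config("spark.driver.cores", spark_core_setting())`.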

We are going to load a CSV file from a public bucket.

In [None]:
berlin_airbnb = spark\
      .read\
      .format("csv")\
      .option("header", "true")\
      .option("multiLine", "true")\
      .load("oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv")\
      .cache() # cache the dataset to increase computing speed
# register the dataframe as a SQL view so we can run SQL queries on it
berlin_airbnb.createOrReplaceTempView("berlin")
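
The path passed to `load()` follows the `oci://<bucket>@<namespace>/<object path>` form. As a small illustration, the hypothetical helpers below build and split such URIs; they are plain string handling and do not contact Object Storage.

```python
def oci_uri(bucket, namespace, object_path=""):
    """Build an oci://<bucket>@<namespace>/<object_path> URI."""
    return f"oci://{bucket}@{namespace}/{object_path.lstrip('/')}"


def split_oci_uri(uri):
    """Split an oci:// URI into (bucket, namespace, object_path)."""
    prefix = "oci://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an oci:// URI: {uri!r}")
    location, _, object_path = uri[len(prefix):].partition("/")
    bucket, sep, namespace = location.partition("@")
    if not (bucket and sep and namespace):
        raise ValueError(f"expected oci://<bucket>@<namespace>/...: {uri!r}")
    return bucket, namespace, object_path


uri = oci_uri(
    "oow_2019_dataflow_lab",
    "bigdatadatasciencelarge",
    "usercontent/kaggle_berlin_airbnb_listings_summary.csv",
)
print(split_oci_uri(uri))
```

Splitting a URI this way makes it easy to confirm that the bucket and namespace are the ones covered by your `core-site.xml` settings.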

One can also use the sparksql magic to run a query on the view and store the results as a dataframe: 

In [None]:
%load_ext sparksql_magic
%config SparkSql.max_num_rows=20

In [None]:
%%sparksql --cache --view result df 

SELECT latitude, longitude FROM berlin LIMIT 10

<a id='ref'></a>
# References

* [Understanding and Using Conda Environments](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#conda_understand_environments)
* [ADS Configuration Guide](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/configuration/configuration.html#)
* [Authenticating to the Oracle Cloud Infrastructure APIs from a Notebook Session](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#topic_kxj_znw_pkb)
* [Manually Configuring Your Tenancy for Data Science](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/configure-tenancy.htm)
* [Data Science service guide](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
* [Data Flow service guide](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm)
* [Our Data Science & AI Blog](https://blogs.oracle.com/datascience/)
* [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)
* [SparkSQL Magic Documentation](https://github.com/cryeo/sparksql-magic)