
---
    
# <font color="red">Getting Started with Spark Conda Env</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Workshop for Sisal</font></p>

---
---

## Overview

The PySpark and Data Flow conda environment lets you use Apache Spark from a notebook session. Use it to run computations in parallel on the resources available to the notebook session. For larger jobs, you can develop Apache Spark applications interactively and submit them to Oracle Data Flow without blocking the notebook session. PySpark MLlib implements a wide collection of machine-learning algorithms. Use PySparkSQL, a SQL-like language, to analyze large amounts of structured and semi-structured data stored in Oracle Object Storage. Speed up your workflow by using sparksql-magic to run PySparkSQL queries directly in the notebook.

This notebook shows you how to authenticate to OCI resources, and how to configure the `core-site.xml` file so that PySpark can access Object Storage.

---

## Contents:

- <a href='#authentication'>Understanding Authentication to Oracle Cloud Infrastructure Resources from a Notebook Session</a>
    - <a href='#resource_principals'>Authentication with Resource Principals</a>
        - <a href='#resource_principals_ads'>Resource Principals Authentication using the ADS SDK</a>
        - <a href='#resource_principals_oci'>Resource Principals Authentication using the OCI SDK</a>
        - <a href='#resource_principals_cli'>Resource Principals Authentication using the OCI CLI</a>
- <a href='#conda'>Conda</a>
    - <a href='#conda_overview'>Overview</a>
    - <a href='#conda_libraries'>Principal Conda Libraries</a>
    - <a href='#conda_configuration'>Configuration</a>
        - <a href='#coresite_auth_rp'>Authentication with Resource Principals</a>
           - <a href='#odsc_coresite_command_rp'>Configuration of `core-site.xml` Using the `odsc` Command Line Tool</a>
           - <a href='#manually_update_coresite_rp'>Manually Configuring `core-site.xml`</a>
        - <a href='#coresite_auth_api_keys'>Authentication with API Keys</a>
           - <a href='#odsc_coresite_command_api_keys'>Configuration of `core-site.xml` Using the `odsc` Command Line Tool</a>
           - <a href='#manually_update_coresite_api_keys'>Manually Configuring `core-site.xml`</a>
        - <a href='#conda_configuration_testing'>Testing the Configuration</a>
- <a href='#ref'>References</a> 

---

In [1]:
import logging
import warnings
import os
import ads
from oci.auth.signers import get_resource_principals_signer
from oci.data_science import DataScienceClient
from os import path
from os import cpu_count
from pyspark.sql import SparkSession
import re

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

<a id='authentication'></a>
# Understanding Authentication to OCI Resources from a Notebook Session

When working within a notebook session, the `datascience` user is used. This user does not have an OCI Identity and Access Management (IAM) identity, so it has no access to the OCI API. To access OCI service resources, including Data Science projects and models, from a notebook environment, you must configure either resource principals or API keys. 

PySpark can authenticate with Object Storage using resource principals or API keys. The API signing key *cannot* be protected by a passphrase. See [setting up keys and configuration files](https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/devguidesetupprereq.htm) and the `api_keys.ipynb` example notebook.

If you must have a passphrase in your configuration and key files, you can instead download the data file from Object Storage to local storage with the OCI Python SDK, and then load the local file in a Spark context.


<a id='resource_principals'></a>
## Authentication with Resource Principals 

Data Science enables easy and secure authentication using the notebook session's resource principal to access other OCI resources, including Data Science projects and models. The following cells show you how to use your notebook session's resource principal.

In advance, a tenancy administrator must write policies that grant the resource principal permission to access other OCI resources. See [manually configuring your tenancy for Data Science](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/configure-tenancy.htm) for details.

There are two ways to configure the notebook to use resource principals: the `ads` library or the `oci` library. Both provide the required authentication, but the `ads` library is specifically designed for easy operation within a Data Science notebook session.

If you don't want to take on these library dependencies, you can use the `oci` command from the command line.

For more details about using resource principals in the Data Science service, see the [ADS Configuration Guide](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/configuration/configuration.html#), and [authenticating to the OCI APIs from a notebook session](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#topic_kxj_znw_pkb).

<a id='resource_principals_ads'></a>
### Resource Principals Authentication using the ADS SDK

The `set_auth()` method sets the proper authentication mechanism for ADS. ADS uses the `oci` SDK to access resources like the model catalog or Object Storage.

Within a notebook session, you configure the use of a resource principal for the ADS SDK by running this in a notebook cell:

In [2]:
ads.set_auth(auth='resource_principal') 
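
For comparison, the `oci` library route mentioned earlier might look like the sketch below. It uses the `get_resource_principals_signer` and `DataScienceClient` names imported at the top of this notebook. The helper name `make_data_science_client` is hypothetical, and the sketch assumes it runs inside a notebook session whose resource principal already has the required policies.

```python
def make_data_science_client():
    """Create a Data Science client signed by the notebook session's resource principal.

    `make_data_science_client` is a hypothetical helper name used for illustration.
    """
    # Imported inside the function so that defining it does not require `oci`.
    from oci.auth.signers import get_resource_principals_signer
    from oci.data_science import DataScienceClient

    signer = get_resource_principals_signer()
    # No configuration file is read; the signer carries the credentials.
    return DataScienceClient(config={}, signer=signer)


# Inside a notebook session with the right policies in place:
# client = make_data_science_client()
```

The empty `config` dictionary reflects that no API key configuration file is involved when a signer is supplied.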

<a id='conda'></a>
# Conda

<a id='conda_overview'></a>
## Overview

This conda environment allows data scientists to leverage Apache Spark. You can set up Apache Spark applications, and then submit them to Data Flow. You can also use PySpark, including PySpark MLlib and PySparkSQL.

<a id='conda_libraries'></a>
## Principal Conda Libraries

These are some of the libraries included in this conda:

- ads: Partial ADS distribution. This distribution excludes Oracle AutoML and MLX.
- oraclejdk: Oracle Java Development Kit.
- pyspark: Python API for Apache Spark.
- scikit-learn: A library for building machine learning models including regressions, classifiers, and clustering algorithms.
- sparksql-magic: A library of SparkSQL magic commands for Jupyter notebooks.

<a id='conda_configuration'></a>
## Configuration

To access Object Storage, the `core-site.xml` file must be configured.  

`core-site.xml` can be manually configured or configured with the use of the `odsc` program.

<a id='coresite_auth_rp'></a>
### Authentication with Resource Principals

<a id='odsc_coresite_command_rp'></a>
#### Configuration of `core-site.xml` Using the `odsc` Command Line Tool

When authenticated with resource principals, you can run `odsc core-site config -o -a resource_principal`. It automatically populates `core-site.xml`, and saves the file to `~/spark_conf_dir/core-site.xml`. 

You can use these command line options:
- `-a`, `--authentication` Authentication mode. Supports `resource_principal` and `api_key` (default).
- `-r`, `--region` Name of the region.
- `-o`, `--overwrite` Overwrite `core-site.xml`.
- `-O`, `--output` Output path for `core-site.xml`.
- `-q`, `--quiet` Suppress non-error output.

Run `odsc core-site config --help` on the command line to see the full usage of this command.
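
If you prefer to drive the tool from a notebook cell rather than a terminal, the options listed above can be assembled programmatically. The following is a minimal sketch: `build_odsc_command` and `run_odsc` are hypothetical helper names, and only the flags documented in the list above are used.

```python
import shutil
import subprocess


def build_odsc_command(authentication="api_key", region=None, overwrite=False, output=None):
    """Assemble the argument list for `odsc core-site config` (hypothetical helper)."""
    command = ["odsc", "core-site", "config"]
    if overwrite:
        command.append("-o")
    command += ["-a", authentication]
    if region:
        command += ["-r", region]
    if output:
        command += ["-O", output]
    return command


def run_odsc(command):
    """Run the command if `odsc` is on the PATH; return True when it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


command = build_odsc_command(authentication="resource_principal", overwrite=True)
print(" ".join(command))  # odsc core-site config -o -a resource_principal
```

Calling `run_odsc(command)` would then execute the tool, but only where `odsc` is actually installed.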

<a id='manually_update_coresite_rp'></a>
#### Manually Configuring `core-site.xml`
When the conda package is installed, a templated version of `core-site.xml` is also installed. 

This file has to be updated to include the following values:

`fs.oci.client.hostname`: The address of Object Storage. For example, `https://objectstorage.us-ashburn-1.oraclecloud.com`. You have to replace `us-ashburn-1` with the region you are in.

`fs.oci.client.custom.authenticator`: Set the value to `com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator`. 

When using resource principals, these properties don't need to be configured:

- `fs.oci.client.auth.tenantId`
- `fs.oci.client.auth.userId`
- `fs.oci.client.auth.fingerprint`
- `fs.oci.client.auth.pemfilepath`

The following example `core-site.xml` file illustrates using resource principals for authentication to access Object Storage:

```{xml}
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.oci.client.hostname</name>
    <value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
  </property>
  <property>
    <name>fs.oci.client.custom.authenticator</name>
    <value>com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator</value>
  </property>
</configuration>
```

For details, see [HDFS connector for Object Storage: using resource principals for authentication](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnector.htm#hdfs_using_resource_principals_for_authentication).
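
As an illustration of the manual steps above, the same two-property file could be generated for any region with a few lines of Python. This is a sketch, not part of the conda tooling: `resource_principal_core_site` is a hypothetical helper, and it assumes the hostname follows the `https://objectstorage.<region>.oraclecloud.com` pattern shown in the example.

```python
import xml.etree.ElementTree as ET


def resource_principal_core_site(region):
    """Return core-site.xml text for resource principal authentication (hypothetical helper)."""
    properties = {
        # Assumes the hostname pattern from the example above.
        "fs.oci.client.hostname": f"https://objectstorage.{region}.oraclecloud.com",
        "fs.oci.client.custom.authenticator":
            "com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator",
    }
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(configuration, space="  ")
    return '<?xml version="1.0"?>\n' + ET.tostring(configuration, encoding="unicode") + "\n"


print(resource_principal_core_site("us-phoenix-1"))
```

The returned text can be written to the location you use for `core-site.xml`, for example `~/spark_conf_dir/core-site.xml`.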

<a id='coresite_auth_api_keys'></a>
### Authentication with API Keys
<a id='odsc_coresite_command_api_keys'></a>
#### Configuration of `core-site.xml` Using the `odsc` Command Line Tool

With an OCI configuration file, you can run `odsc core-site config -o`. By default, the command uses the OCI configuration file stored in `~/.oci/config`, automatically populates `core-site.xml`, and saves it to `~/spark_conf_dir/core-site.xml`.

You can use these command line options:
- `-a`, `--authentication` Authentication mode. Supports `resource_principal` and `api_key` (default).
- `-c`, `--configuration` Path to the OCI configuration file.
- `-p`, `--profile` Name of the profile.
- `-r`, `--region` Name of the region.
- `-o`, `--overwrite` Overwrite `core-site.xml`.
- `-O`, `--output` Output path for `core-site.xml`.
- `-q`, `--quiet` Suppress non-error output.

Run `odsc core-site config --help` on the command line to see the full usage of this command.

<a id='manually_update_coresite_api_keys'></a>
#### Manually Configuring `core-site.xml`
When the conda environment is installed, a templated version of `core-site.xml` is also installed. You can manually update this file.

You must specify the following `core-site.xml` file parameters:

`fs.oci.client.hostname`: Address of Object Storage. For example, `https://objectstorage.us-ashburn-1.oraclecloud.com`. You must replace `us-ashburn-1` with the region you are in.

`fs.oci.client.auth.tenantId`: OCID of your tenancy.

`fs.oci.client.auth.userId`: Your user OCID.

`fs.oci.client.auth.fingerprint`: Fingerprint for the key pair being used.

`fs.oci.client.auth.pemfilepath`: The full path and file name of the private key used for authentication. 

The values of these parameters are found in the OCI configuration file.

The following is an example `core-site.xml` file that has been updated. Put all the parameter values between the `<value>` and `</value>` tags:

```{xml}
<configuration><!-- reference: https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnector.htm -->
  <property>
    <name>fs.oci.client.hostname</name>
    <value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
  </property>
  <!--<property>-->
    <!--<name>fs.oci.client.hostname.myBucket.myNamespace</name>-->
    <!--<value></value>&lt;!&ndash; myBucket@myNamespace &ndash;&gt;-->
  <!--</property>-->
  <property>
    <name>fs.oci.client.auth.tenantId</name>
    <value>ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkzzzzz...</value> 
  </property>
  <property>
    <name>fs.oci.client.auth.userId</name>
    <value>ocid1.user.oc1..aaaaaaaacdxbfmyhe7sxc6iwi73okzuf3src6zzzzzz...</value>
  </property>
  <property>
    <name>fs.oci.client.auth.fingerprint</name>
    <value>01:01:02:03:05:08:13:1b:2e:49:77:c0:01:37:01:f7</value>
  </property>
  <property>
    <name>fs.oci.client.auth.pemfilepath</name>
    <value>/home/datascience/.oci/key.pem</value>
  </property>
</configuration>
```
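
Because the values come from the OCI configuration file, the mapping can be sketched in Python. The example below is illustrative only: `core_site_properties` is a hypothetical helper, the sample values are placeholders, and it assumes the standard configuration file keys (`user`, `fingerprint`, `tenancy`, `region`, `key_file`) plus the hostname pattern used in the example above.

```python
import configparser
import textwrap

# OCI configuration file key -> core-site.xml property name.
KEY_TO_PROPERTY = {
    "tenancy": "fs.oci.client.auth.tenantId",
    "user": "fs.oci.client.auth.userId",
    "fingerprint": "fs.oci.client.auth.fingerprint",
    "key_file": "fs.oci.client.auth.pemfilepath",
}


def core_site_properties(config_text, profile="DEFAULT"):
    """Map one profile of an OCI configuration file to core-site.xml properties."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(config_text)
    section = parser[profile]
    properties = {
        # Assumes the hostname pattern from the example above.
        "fs.oci.client.hostname": f"https://objectstorage.{section['region']}.oraclecloud.com"
    }
    for key, prop in KEY_TO_PROPERTY.items():
        properties[prop] = section[key]
    return properties


# Placeholder values only; a real file normally lives at ~/.oci/config.
sample = textwrap.dedent("""\
    [DEFAULT]
    user=ocid1.user.oc1..exampleuser
    fingerprint=01:01:02:03:05:08:13:1b:2e:49:77:c0:01:37:01:f7
    tenancy=ocid1.tenancy.oc1..exampletenancy
    region=us-ashburn-1
    key_file=/home/datascience/.oci/key.pem
""")

for name, value in core_site_properties(sample).items():
    print(f"{name} = {value}")
```

Each printed pair corresponds to one `<property>` element in `core-site.xml`.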


<a id='conda_configuration_testing'></a>
### Testing the Configuration

Set up a Spark session in your PySpark conda environment to test whether the configuration has been set up properly. Run the following cells, and ensure that there are no errors.

In [4]:
# create a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.cores", str(1)) \
    .config("spark.executor.cores", str(4)) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
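
If reading from Object Storage fails, it can help to check `core-site.xml` itself before debugging Spark. The sketch below parses the file and lists required properties that are missing or empty. `read_core_site` and `missing_properties` are hypothetical helpers, and the path assumes the default `odsc` output location of `~/spark_conf_dir/core-site.xml`.

```python
import os
import xml.etree.ElementTree as ET

# Properties needed for API key authentication, as listed in the configuration section.
API_KEY_PROPERTIES = [
    "fs.oci.client.auth.tenantId",
    "fs.oci.client.auth.userId",
    "fs.oci.client.auth.fingerprint",
    "fs.oci.client.auth.pemfilepath",
]


def read_core_site(source):
    """Return {name: value} for every <property> in a core-site.xml path or file object."""
    root = ET.parse(source).getroot()
    return {
        prop.findtext("name"): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }


def missing_properties(properties):
    """List required properties that are absent or empty."""
    required = ["fs.oci.client.hostname"]
    # Without a custom authenticator, the API key properties are needed too.
    if not properties.get("fs.oci.client.custom.authenticator"):
        required += API_KEY_PROPERTIES
    return [name for name in required if not properties.get(name)]


core_site = os.path.expanduser("~/spark_conf_dir/core-site.xml")
if os.path.exists(core_site):
    print("Missing or empty:", missing_properties(read_core_site(core_site)))
else:
    print(f"{core_site} not found")
```

An empty list means every property required for the chosen authentication mode has a value; it does not confirm that the values are correct.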

Next, load a CSV file from a public bucket:

In [5]:
#
# result is a Spark DataFrame
#
berlin_airbnb = spark\
      .read\
      .format("csv")\
      .option("header", "true")\
      .option("multiLine", "true")\
      .load("oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv")\
      .cache() # cache the dataset to increase computing speed

# register the dataframe as a SQL view so we can run SQL queries on it
berlin_airbnb.createOrReplaceTempView("berlin")

In [9]:
berlin_airbnb.columns

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


You can also use `sparksql-magic` to run a query on the view, and store the results as a dataframe: 

In [10]:
%load_ext sparksql_magic
%config SparkSql.max_num_rows=20

In [11]:
%%sparksql --cache --view result df 

SELECT name, latitude, longitude FROM berlin LIMIT 10

cache dataframe with lazy load
create temporary view `result`
capture dataframe to local variable `df`


| name | latitude | longitude |
|---|---|---|
| Berlin-Mitte Value! Quiet courtyard/very central | 52.53453732241747 | 13.402556926822387 |
| Prenzlauer Berg close to Mauerpark | 52.54851279221664 | 13.404552826587466 |
| Fabulous Flat in great Location | 52.534996191586714 | 13.417578665333295 |
| BerlinSpot Schoneberg near KaDeWe | 52.498854933130026 | 13.34906453348717 |
| BrightRoom with sunny greenview! | 52.5431572633131 | 13.415091104515707 |
| Geourgeous flat - outstanding views | 52.533030768026826 | 13.416046823956403 |
| Apartment in Prenzlauer Berg | 52.547846407992154 | 13.405562243722455 |
| APARTMENT TO RENT | 52.51051399601544 | 13.457850238106195 |
| In the Heart of Berlin - Kreuzberg | 52.50479227385915 | 13.435101853886051 |


In [12]:
%%sparksql --cache

SELECT count(*) FROM berlin

cache dataframe with lazy load


| count(1) |
|---|
| 22552 |


In [13]:
df.head(10)

[Row(name='Berlin-Mitte Value! Quiet courtyard/very central', latitude='52.53453732241747', longitude='13.402556926822387'),
 Row(name='Prenzlauer Berg close to Mauerpark', latitude='52.54851279221664', longitude='13.404552826587466'),
 Row(name='Fabulous Flat in great Location', latitude='52.534996191586714', longitude='13.417578665333295'),
 Row(name='BerlinSpot Schoneberg near KaDeWe', latitude='52.498854933130026', longitude='13.34906453348717'),
 Row(name='BrightRoom with sunny greenview!', latitude='52.5431572633131', longitude='13.415091104515707'),
 Row(name='Geourgeous flat - outstanding views', latitude='52.533030768026826', longitude='13.416046823956403'),
 Row(name='Apartment in Prenzlauer Berg', latitude='52.547846407992154', longitude='13.405562243722455'),
 Row(name='APARTMENT TO RENT', latitude='52.51051399601544', longitude='13.457850238106195'),
 Row(name='In the Heart of Berlin - Kreuzberg', latitude='52.50479227385915', longitude='13.435101853886051'),
 Row(name='Do

#### Let's try with another file, in one of our buckets

**TIP**: 
* The URL format is `oci://{BUCKET}@{NAMESPACE}/{OBJECT_NAME}`
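
A small sketch of that format in Python may help avoid swapping the bucket and the namespace. `oci_uri` and `split_oci_uri` are hypothetical helper names, not part of any Oracle library.

```python
from urllib.parse import urlparse


def oci_uri(bucket, namespace, object_name):
    """Build an oci:// URI from its three parts."""
    return f"oci://{bucket}@{namespace}/{object_name.lstrip('/')}"


def split_oci_uri(uri):
    """Split an oci:// URI into (bucket, namespace, object_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "oci" or "@" not in parsed.netloc:
        raise ValueError(f"not an oci://bucket@namespace/object URI: {uri}")
    bucket, namespace = parsed.netloc.split("@", 1)
    return bucket, namespace, parsed.path.lstrip("/")


print(oci_uri("drift_input", "frqap2zhtzbe", "reference.csv"))
# oci://drift_input@frqap2zhtzbe/reference.csv
```

The resulting string can be passed directly to `.load(...)` on a Spark reader.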

In [22]:
orcl_attrition = spark\
      .read\
      .format("csv")\
      .option("header", "true")\
      .load("oci://drift_input@frqap2zhtzbe/reference.csv")\
      .cache() # cache the dataset to increase computing speed

In [23]:
# register the dataframe as a SQL view so we can run SQL queries on it
orcl_attrition.createOrReplaceTempView("ATTRITION")

In [24]:
%%sparksql --cache

SELECT count(*) FROM ATTRITION

cache dataframe with lazy load


| count(1) |
|---|
| 1176 |


In [25]:
%%sparksql --cache

SELECT * FROM ATTRITION

cache dataframe with lazy load
only showing top 20 row(s)


| TravelForWork | MonthlyRate | PercentSalaryHike | CommuteLength | SalaryLevel | YearsOnJob | JobInvolvement | PerformanceRating | Gender | TrainingTimesLastYear | YearsSinceLastPromotion | EnvironmentSatisfaction | YearsinIndustry | JobLevel | JobRole | WorkLifeBalance | Age | RelationshipSatisfaction | MaritalStatus | YearsAtCurrentLevel | HourlyRate | MonthlyIncome | OverTime | JobSatisfaction | EducationField | JobFunction | EducationalLevel | NumCompaniesWorked | StockOptionLevel | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| infrequent | 19146 | 22 | 2 | 5640 | 2 | 2 | 4 | Male | 2 | 2 | 4 | 4 | 2 | Manufacturing Director | 1 | 23 | 1 | Married | 2 | 33 | 4775 | No | 4 | Life Sciences | Software Developer | L2 | 6 | 2 | 2 |
| none | 3395 | 23 | 2 | 5678 | 23 | 2 | 4 | Male | 3 | 14 | 3 | 25 | 3 | Healthcare Representative | 2 | 46 | 4 | Married | 15 | 74 | 10748 | No | 3 | Life Sciences | Software Developer | L1 | 3 | 1 | 4 |
| infrequent | 4510 | 18 | 15 | 2022 | 5 | 3 | 3 | Female | 2 | 4 | 2 | 7 | 1 | Research Scientist | 3 | 57 | 1 | Married | 4 | 72 | 4963 | Yes | 2 | Life Sciences | Software Developer | L4 | 9 | 3 | 3 |
| none | 17071 | 16 | 25 | 6782 | 1 | 4 | 3 | Female | 2 | 0 | 2 | 22 | 4 | Sales Executive | 2 | 41 | 4 | Single | 0 | 100 | 13194 | Yes | 2 | Life Sciences | Product Management | L3 | 4 | 0 | 0 |
| infrequent | 18725 | 23 | 10 | 1980 | 4 | 3 | 4 | Male | 4 | 0 | 4 | 10 | 1 | Laboratory Technician | 3 | 52 | 2 | Married | 2 | 96 | 2075 | No | 4 | Life Sciences | Software Developer | L4 | 3 | 2 | 3 |
| infrequent | 17312 | 11 | 3 | 1376 | 22 | 3 | 3 | Male | 2 | 4 | 1 | 24 | 5 | Manager | 2 | 43 | 1 | Married | 6 | 56 | 18880 | No | 3 | Life Sciences | Software Developer | L3 | 5 | 0 | 14 |
| infrequent | 12449 | 12 | 20 | 5406 | 6 | 2 | 3 | Male | 2 | 3 | 4 | 6 | 1 | Laboratory Technician | 3 | 29 | 3 | Married | 5 | 78 | 3196 | No | 1 | Medical | Software Developer | L4 | 1 | 3 | 3 |
| infrequent | 23213 | 18 | 25 | 2160 | 1 | 1 | 3 | Male | 3 | 0 | 3 | 1 | 1 | Laboratory Technician | 1 | 27 | 2 | Single | 0 | 66 | 2340 | Yes | 4 | Technical Degree | Software Developer | L3 | 1 | 0 | 0 |
| infrequent | 20328 | 16 | 12 | 6618 | 0 | 2 | 3 | Female | 3 | 0 | 4 | 6 | 2 | Sales Executive | 3 | 57 | 3 | Married | 0 | 89 | 5380 | No | 1 | Marketing | Product Management | L5 | 4 | 1 | 0 |


In [26]:
%%sparksql --cache --view result df_gender 

SELECT gender, count(*) FROM ATTRITION GROUP BY gender

cache dataframe with lazy load
create temporary view `result`
capture dataframe to local variable `df_gender`


| gender | count(1) |
|---|---|
| Female | 479 |
| Male | 697 |


In [27]:
df_gender.show()

+------+--------+
|gender|count(1)|
+------+--------+
|Female|     479|
|  Male|     697|
+------+--------+

