In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.
!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Spark NLP within Oracle Cloud Infrastructure Data Flow Studio</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---
# Overview:

This notebook demonstrates how to use [Spark NLP](https://www.johnsnowlabs.com/) within a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster.

Compatible conda pack: [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8

---

## Contents:

- <a href='#pre-requisites'>1. Pre-requisites</a>
    - <a href='#policies'>1.2. Policies</a>
    - <a href='#prerequisites_helpers'>1.3. Helpers</a>
    - <a href='#prerequisites_authentication'>1.4. Authentication</a>
    - <a href='#prerequisites_variables'>1.5. Variables</a>
- <a href='#spark_nlp'>2. John Snow Labs Spark NLP</a>
    - <a href='#load_extension'>2.1. Load Data Flow Spark Magic Extension</a>
    - <a href='#spark_nlp_pretrained_models'>2.2. Pre-trained Spark NLP models installation</a>
    - <a href='#custom_conda_environment_publishing'>2.3. Publishing custom PySpark 3.2 and Data Flow conda environment</a>
    - <a href='#new_session_published_conda'>2.4. Create a Data Flow Session with a new `spark.archives` configuration</a>
    - <a href='#simple_anotation_example'>2.5. Simple annotation task example</a>    
- <a href='#cleanup'>3. Clean Up</a> 
- <a href='#ref'>4. References</a>   

---

<a id='pre-requisites'></a>
# 1. Pre-requisites 

Data Flow Sessions are accessible through the following conda environment: 

* **PySpark 3.2 and Data Flow 1.0 (pyspark32_p38_cpu_v1)**

You can customize `pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. 

<a id='policies'></a>
## 1.2. Policies
The following documents describe how to create the dynamic groups and policies needed to use the service.

* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)
* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)

<a id="prerequisites_helpers"></a>
## 1.3. Helpers
This section provides a helper function that is used throughout the notebook to prepare arguments for the magic commands.

In [None]:
import json


def prepare_command(command: dict) -> str:
    """Serialize a command dictionary to a single-quoted JSON string for the magic commands."""
    return f"'{json.dumps(command)}'"
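
To make the helper's output concrete, here is a minimal sketch that repeats the function and applies it to a small dictionary. The keys and values are illustrative only. The result is the JSON text wrapped in single quotes, and it decodes back to the original dictionary once the quotes are removed.

```python
import json


def prepare_command(command: dict) -> str:
    """Serialize a command dictionary to a single-quoted JSON string."""
    return f"'{json.dumps(command)}'"


# Illustrative values only; the real session command is built later in the notebook.
argument = prepare_command({"displayName": "demo", "numExecutors": 2})
print(argument)

# Removing the surrounding quotes gives back valid JSON.
decoded = json.loads(argument.strip("'"))
print(decoded["numExecutors"])
```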

<a id="prerequisites_authentication"></a>
## 1.4. Authentication
The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.<br> 
To set up authentication, call ```ads.set_auth("resource_principal")``` or ```ads.set_auth("api_key")```. 

In [None]:
import ads

ads.set_auth("resource_principal")  # Supported values: resource_principal, api_key
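
As a rough sketch of choosing between the two supported values, the helper below picks `resource_principal` when the runtime appears to expose a resource principal and falls back to `api_key` otherwise. The `choose_auth_mode` name is hypothetical. Using the `OCI_RESOURCE_PRINCIPAL_VERSION` environment variable as the signal is an assumption about the runtime, not something this notebook states.

```python
import os


def choose_auth_mode(env=None) -> str:
    """Return 'resource_principal' if the runtime looks like it has one, else 'api_key'.

    Assumption: resource-principal runtimes export OCI_RESOURCE_PRINCIPAL_VERSION.
    """
    env = os.environ if env is None else env
    if env.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
        return "resource_principal"
    return "api_key"


auth_mode = choose_auth_mode()
print(auth_mode)

# In the notebook, pass the chosen value to ADS:
# import ads
# ads.set_auth(auth_mode)
```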

<a id="prerequisites_variables"></a>
## 1.5. Variables
To run this notebook, you must provide some information about your tenancy configuration. To create and run a Data Flow session, you must specify a compartment (`<compartment_id>`) and an Object Storage bucket (`<logs_bucket_uri>`) for storing logs. These resources must be in the same compartment.

In [None]:
import os

compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
# Assuming you already have a dataflow-logs bucket created in the region where the cluster is running.
# Otherwise specify the bucket where the stdout/err logs will be stored.
logs_bucket_uri = "oci://<bucket_name>@<namespace>/<prefix>"
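
Because the URI above is a template, it can help to check it before creating a session. The following is a minimal sketch of such a check. `parse_oci_uri` and `OciUri` are hypothetical names introduced for illustration and are not part of ADS. The check only confirms the `oci://<bucket>@<namespace>/<prefix>` shape and rejects unfilled placeholders. It does not confirm that the bucket exists.

```python
import re
from typing import NamedTuple


class OciUri(NamedTuple):
    bucket: str
    namespace: str
    prefix: str


_OCI_URI = re.compile(
    r"^oci://(?P<bucket>[^@/]+)@(?P<namespace>[^/]+)(?:/(?P<prefix>.*))?$"
)


def parse_oci_uri(uri: str) -> OciUri:
    """Split an oci:// URI into bucket, namespace, and prefix (hypothetical helper)."""
    if "<" in uri or ">" in uri:
        raise ValueError(f"URI still contains a placeholder: {uri!r}")
    match = _OCI_URI.match(uri)
    if match is None:
        raise ValueError(f"Expected oci://<bucket>@<namespace>/<prefix>, got {uri!r}")
    parts = match.groupdict()
    return OciUri(parts["bucket"], parts["namespace"], parts["prefix"] or "")


# Illustrative bucket and namespace names.
print(parse_oci_uri("oci://dataflow-logs@mytenancy/logs"))
```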

<a id="spark_nlp"></a>
# 2. John Snow Labs Spark NLP 
By default, the **PySpark 3.2 and Data Flow** conda environment includes pre-installed [Matplotlib](https://matplotlib.org/) and [Spark NLP](https://www.johnsnowlabs.com/) libraries. The examples below demonstrate how to prepare a custom conda environment, publish it to Object Storage, and use it within a Data Flow Spark session.

<a id="load_extension"></a>
## 2.1. Load Data Flow Spark Magic Extension
Data Flow Spark Magic is a JupyterLab extension that you activate in your notebook with the `%load_ext dataflow.magics` magic command.
It automatically creates a SparkContext (`sc`) and HiveContext (`sqlContext`) inside any SparkMagic cell. Use the `%help` command to list the supported commands. To view the docstring of a magic command and see which arguments it accepts, add `?` at the end of the command, for instance `%create_session?`.

In [None]:
%load_ext dataflow.magics

<a id="spark_nlp_pretrained_models"></a>
## 2.2. Pre-trained Spark NLP models installation
Data Flow does not support egress to the public internet, so pre-trained Spark NLP models cannot be downloaded dynamically inside a Data Flow session.
Instead, download the models you need from the [model hub](https://nlp.johnsnowlabs.com/models) as zip archives and unzip them into the conda environment folder, so that they are published together with the environment. For example, download the [Explain Document DL Pipeline for English](https://nlp.johnsnowlabs.com/2021/03/23/explain_document_dl_en.html) and upload it to your notebook session, or run the following cell to download the public model directly: 

In [None]:
%%bash 

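cd ~
# Download the pre-trained pipeline archive
curl https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_3.0.0_3.0_1616473268265.zip --output explain_document_dl_en_3.0.0_3.0_1616473268265.zip
# Create the models folder inside the conda environment (-p: no error if it already exists)
mkdir -p /home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models
# Unpack the pipeline into the environment so it is published together with it
unzip explain_document_dl_en_3.0.0_3.0_1616473268265.zip -d /home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models/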
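
Before publishing the environment, it may be worth confirming that the archive unpacked where expected. The sketch below assumes the usual Spark ML persistence layout, in which a saved pipeline folder contains `metadata` and `stages` subfolders, and assumes the zip places those folders directly inside `sparknlp-models`. `looks_like_saved_pipeline` is a hypothetical helper name.

```python
from pathlib import Path


def looks_like_saved_pipeline(model_dir) -> bool:
    """Return True if model_dir contains 'metadata' and 'stages' subfolders.

    Assumption: this mirrors the layout Spark ML uses when saving a pipeline.
    """
    root = Path(model_dir)
    return (root / "metadata").is_dir() and (root / "stages").is_dir()


# Path used by the download cell in this notebook.
models_dir = Path("/home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models")
print(looks_like_saved_pipeline(models_dir))
```

If this prints `False` inside the notebook session, inspect the folder contents before running `odsc conda publish`.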

<a id="custom_conda_environment_publishing"></a>
## 2.3. Publishing custom PySpark 3.2 and Data Flow conda environment
Use the `odsc conda publish` command to publish the conda environment to an Object Storage bucket.<br>
See [Publishing a Conda Environment to an Object Storage Bucket in Your Tenancy](https://docs.oracle.com/en-us/iaas/data-science/using/conda_publishs_object.htm#:~:text=You%20can%20publish%20a%20conda%20environment%20that%20you%20have%20installed,persist%20them%20across%20notebook%20sessions.) for more details about publishing custom conda environments.

In [None]:
%%bash 

odsc conda publish -s pyspark32_p38_cpu_v1

<a id="new_session_published_conda"></a>
## 2.4. Create a Data Flow Session with a new `spark.archives` configuration

Now the published conda environment can be used within a Data Flow session. The path to the published conda environment can be copied from the [Environment Explorer](https://docs.oracle.com/en-us/iaas/data-science/using/conda_viewing.htm). <br>

Example path : `oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda`

To create a new Data Flow session, use the `%create_session` magic command.

In [None]:
custom_conda_environment_uri = "oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda"

command = prepare_command(
    {
        "compartmentId": compartment_id,
        "displayName": "TestDataFlowSessionWithCustomCondaEnvironment",
        "language": "PYTHON",
        "sparkVersion": "3.2.1",
        "numExecutors": 2,
        "driverShape": "VM.Standard.E4.Flex",
        "executorShape": "VM.Standard.E4.Flex",
        "driverShapeConfig": {"ocpus": 2, "memoryInGBs": 32},
        "executorShapeConfig": {"ocpus": 2, "memoryInGBs": 32},
        "logsBucketUri": logs_bucket_uri,
        "type": "SESSION",
        "configuration": {
            "spark.archives": custom_conda_environment_uri,
            "spark.jars.ivy": "/opt/spark/work-dir/conda/.ivy2",
            "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0",
        },
    }
)

%create_session -l python -c $command
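
The `#conda` suffix in `spark.archives` determines where the environment is unpacked in the session: the configuration above and the annotation example in section 2.5 both refer to `/opt/spark/work-dir/conda`. As a minimal sketch of that correspondence, the hypothetical helper below translates a path inside the notebook's conda environment into the path expected in the Data Flow session. The default `work_dir` and `alias` values are taken from the paths used in this notebook.

```python
from pathlib import PurePosixPath


def to_dataflow_path(
    notebook_path: str,
    env_root: str,
    alias: str = "conda",
    work_dir: str = "/opt/spark/work-dir",
) -> str:
    """Map a path under the local conda environment to its Data Flow location.

    Raises ValueError if notebook_path is not inside env_root.
    """
    relative = PurePosixPath(notebook_path).relative_to(env_root)
    return str(PurePosixPath(work_dir) / alias / relative)


print(
    to_dataflow_path(
        "/home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models",
        "/home/datascience/conda/pyspark32_p38_cpu_v1",
    )
)
```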

<a id="simple_anotation_example"></a>
## 2.5. Simple annotation task example

In [None]:
%%spark
 
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
 
# Start SparkSession with Spark NLP
# The start() function has 3 parameters: gpu, m1, and memory
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(m1=True) will start the session with macOS M1 support
# sparknlp.start(memory="16G") to change the default driver memory in SparkSession
spark = sparknlp.start()
 
# Load the pre-trained pipeline from the folder published with the conda environment
pipeline = PretrainedPipeline('explain_document_dl', lang='en', disk_location="/opt/spark/work-dir/conda/sparknlp-models/")
 
# Your testing dataset
text = """
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate and investor who is the co-founder,
executive chairman, chief technology officer (CTO) and former chief executive officer (CEO) of the
American computer technology company Oracle Corporation.[2] As of September 2022, he was listed by
Bloomberg Billionaires Index as the ninth-wealthiest person in the world, with an estimated
fortune of $93 billion.[3] Ellison is also known for his 98% ownership stake in Lanai,
the sixth-largest island in the Hawaiian Archipelago.[4]
"""
 
# Annotate your testing dataset
result = pipeline.annotate(text)
 
# What's in the pipeline
print(list(result.keys()))
 
# Check the results
print(result['entities'])

**Expected result:**
<div class="cell border-box-sizing text_cell rendered">
    <div class="prompt input_prompt"></div>
    <div class="inner_cell">
        <div class="text_cell_render border-box-sizing rendered_html">
            <div class="alert alert-block alert-info" style="background: none; border: 1px solid; padding: 10px">
                <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Info</b><br>
<div style="padding:10px 0px">

```python
['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']
['Lawrence Joseph Ellison', 'American', 'American', 'Oracle Corporation', 'Bloomberg Billionaires Index', 'Ellison', 'Lanai', 'Hawaiian Archipelago']
```
</div>
            </div>
        </div>
    </div>
</div> 
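
The annotation result is a plain dictionary of lists, so it can be processed with ordinary Python. As a small sketch, the snippet below uses the expected `entities` output shown above as sample data, rather than a live pipeline result, and removes repeated mentions while keeping their order.

```python
from collections import Counter

# Sample data copied from the expected result above (not a live pipeline call).
result = {
    "entities": [
        "Lawrence Joseph Ellison", "American", "American", "Oracle Corporation",
        "Bloomberg Billionaires Index", "Ellison", "Lanai", "Hawaiian Archipelago",
    ]
}

# Count how often each entity string occurs.
counts = Counter(result["entities"])

# Remove repeats while preserving first-seen order.
unique_entities = list(dict.fromkeys(result["entities"]))

print(unique_entities)
print(counts.most_common(1))
```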

<a id='cleanup'></a>
# 3. Clean Up
Use the `%stop_session` magic command to stop your active Data Flow session.

In [None]:
%stop_session

<a id='ref'></a>
# 4. References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)
- [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)
- [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)