# Setup 

**Developed on conda environment [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8**


In [None]:
import json
import os 

def prepare_command(command: dict) -> str:
    """Serialize a command dictionary to JSON and wrap it in single quotes."""
    return f"'{json.dumps(command)}'"
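
As a quick check of what `prepare_command` produces, the sketch below calls it on a small, made-up dictionary and then parses the result with `shlex.split`, which follows POSIX shell quoting rules. How `%create_session` parses its `-c` argument is not shown in this notebook, so the `shlex` round trip only illustrates shell-style quoting and is not a description of the magic's internals.

```python
import json
import shlex


def prepare_command(command: dict) -> str:
    """Serialize a command dictionary to JSON and wrap it in single quotes."""
    return f"'{json.dumps(command)}'"


# Made-up payload, used only to show the shape of the output.
sample = {"numExecutors": 2, "type": "SESSION"}
quoted = prepare_command(sample)
print(quoted)  # '{"numExecutors": 2, "type": "SESSION"}'

# A POSIX-style parser strips the outer single quotes and yields one argument
# that is still valid JSON.
(argument,) = shlex.split(quoted)
print(json.loads(argument) == sample)  # True
```

Note that a value containing a single quote would break this simple wrapping. The standard library's `shlex.quote(json.dumps(command))` escapes such values, although whether the magic accepts that form has not been tested here.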

In [None]:
import ads
ads.set_auth("resource_principal") # Supported values: resource_principal, api_key

In [None]:
%load_ext dataflow.magics

In [None]:
compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
# Assumes a dataflow-logs bucket already exists in the region where the cluster runs.
# Otherwise, specify the bucket where the stdout/stderr logs should be stored.
logs_bucket_uri = "oci://dataflow-logs@bigdatadatasciencelarge"
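
The Object Storage URIs in this notebook follow the pattern `oci://<bucket>@<namespace>/<path>`. The sketch below splits such a URI into its parts, which can help catch a misplaced `@` or a missing bucket name before a session is requested. `OciUri` and `split_oci_uri` are hypothetical helpers written for this illustration; they are not part of `ads` or the Data Flow magics.

```python
from typing import NamedTuple


class OciUri(NamedTuple):
    """Parts of an oci://<bucket>@<namespace>/<path> URI (hypothetical helper)."""
    bucket: str
    namespace: str
    path: str


def split_oci_uri(uri: str) -> OciUri:
    """Split an OCI Object Storage URI into bucket, namespace, and object path."""
    prefix = "oci://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}: {uri!r}")
    # Everything up to the first "/" is "<bucket>@<namespace>"; the rest is the path.
    location, _, path = uri[len(prefix):].partition("/")
    bucket, separator, namespace = location.partition("@")
    if not (bucket and separator and namespace):
        raise ValueError(f"Expected <bucket>@<namespace> in {uri!r}")
    return OciUri(bucket, namespace, path)


parts = split_oci_uri("oci://dataflow-logs@bigdatadatasciencelarge")
print(parts.bucket)     # dataflow-logs
print(parts.namespace)  # bigdatadatasciencelarge
```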

# John Snow Labs Spark NLP 

Data Flow does not support egress to the public internet, so pre-trained Spark NLP models cannot be downloaded dynamically inside a Data Flow session. Instead, download the models you need from the [model hub](https://nlp.johnsnowlabs.com/models) as zip archives and unzip them into the conda environment folder, so that they are packaged together with the environment.
For example, download the [Explain Document DL Pipeline for English](https://nlp.johnsnowlabs.com/2021/03/23/explain_document_dl_en.html) model and upload it into your notebook session, or run the following cell to download a public copy of the model:

In [None]:
%%bash 

cd ~
curl https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_3.0.0_3.0_1616473268265.zip --output explain_document_dl_en_3.0.0_3.0_1616473268265.zip
mkdir -p /home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models
unzip explain_document_dl_en_3.0.0_3.0_1616473268265.zip -d /home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models/
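
Before publishing the environment, it can be worth confirming that the archive was unpacked where the pipeline loader will look for it. The sketch below assumes the unzipped model uses the usual layout of a saved Spark ML `PipelineModel`, with `metadata` and `stages` directories at the top level; that assumption has not been checked against this particular archive. `looks_like_saved_pipeline` is a hypothetical helper, and the demo runs against a throwaway directory rather than the real model.

```python
import tempfile
from pathlib import Path


def looks_like_saved_pipeline(model_dir) -> bool:
    """Return True if model_dir contains 'metadata' and 'stages' directories."""
    root = Path(model_dir)
    return (root / "metadata").is_dir() and (root / "stages").is_dir()


# Demo on a temporary directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "sparknlp-models"
    (demo_dir / "metadata").mkdir(parents=True)
    (demo_dir / "stages").mkdir()
    found = looks_like_saved_pipeline(demo_dir)
    missing = looks_like_saved_pipeline(Path(tmp) / "does-not-exist")

print(found, missing)  # True False
```

In the notebook session, the same check would be called with `/home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models`.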

Publish the conda environment, which now includes the unzipped model, so that the Data Flow session can load it:

In [None]:
%%bash 

odsc conda publish -s pyspark32_p38_cpu_v1

Create the Data Flow session cluster, pointing `spark.archives` at the published conda environment:

In [None]:
custom_conda_environment_uri = "oci://conda-env-ds@bigdatadatasciencelarge/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda"

command = prepare_command({
    "compartmentId": compartment_id,
    "displayName": "TestDataFlowSessionWithCustomCondaEnvironment",
    "language": "PYTHON",
    "sparkVersion": "3.2.1",
    "numExecutors": 2,
    "driverShape": "VM.Standard.E4.Flex",
    "executorShape": "VM.Standard.E4.Flex",
    "driverShapeConfig": {"ocpus": 2, "memoryInGBs": 32},
    "executorShapeConfig": {"ocpus": 2, "memoryInGBs": 32},
    "logsBucketUri": logs_bucket_uri,
    "type": "SESSION",
    "configuration": {
        "spark.archives": custom_conda_environment_uri,
        "fs.oci.client.hostname": "https://objectstorage.us-ashburn-1.oraclecloud.com",
        "spark.jars.ivy": "/opt/spark/work-dir/conda/.ivy2",
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0"
    }
})

%create_session -l python -c $command
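
The `#conda` suffix on the `spark.archives` URI tells Spark to unpack the archive into a directory named `conda` inside the working directory. This is why files placed under the local conda environment folder appear under `/opt/spark/work-dir/conda/` in the session. The sketch below makes that mapping explicit. `session_path` is a hypothetical helper, the default working directory is taken from the paths used elsewhere in this notebook, and the assumption that the published archive mirrors the local environment folder has not been verified independently.

```python
from pathlib import PurePosixPath


def session_path(notebook_path: str, conda_root: str, archive_uri: str,
                 work_dir: str = "/opt/spark/work-dir") -> str:
    """Map a file inside the local conda environment to its path in the session."""
    # The text after "#" is the directory name Spark unpacks the archive into.
    _, separator, alias = archive_uri.rpartition("#")
    if not separator or not alias:
        raise ValueError(f"No '#<alias>' suffix in {archive_uri!r}")
    relative = PurePosixPath(notebook_path).relative_to(conda_root)
    return str(PurePosixPath(work_dir) / alias / relative)


mapped = session_path(
    "/home/datascience/conda/pyspark32_p38_cpu_v1/sparknlp-models",
    "/home/datascience/conda/pyspark32_p38_cpu_v1",
    "oci://conda-env-ds@bigdatadatasciencelarge/conda_environments/cpu/"
    "PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda",
)
print(mapped)  # /opt/spark/work-dir/conda/sparknlp-models
```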

Try a simple annotation task: 

In [None]:
%%spark
 
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
 
# Start a SparkSession with Spark NLP
# Commonly used start() parameters: gpu, m1, and memory
# sparknlp.start(gpu=True) starts the session with GPU support
# sparknlp.start(m1=True) starts the session with macOS M1 support
# sparknlp.start(memory="16G") changes the default driver memory of the SparkSession
spark = sparknlp.start()
 
# Load the pre-trained pipeline from the copy unpacked with the conda environment (no download)
pipeline = PretrainedPipeline('explain_document_dl', lang='en', disk_location="/opt/spark/work-dir/conda/sparknlp-models/")
 
# Your testing dataset
text = """
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate and investor who is the co-founder,
executive chairman, chief technology officer (CTO) and former chief executive officer (CEO) of the
American computer technology company Oracle Corporation.[2] As of September 2022, he was listed by
Bloomberg Billionaires Index as the ninth-wealthiest person in the world, with an estimated
fortune of $93 billion.[3] Ellison is also known for his 98% ownership stake in Lanai,
the sixth-largest island in the Hawaiian Archipelago.[4]
"""
 
# Annotate your testing dataset
result = pipeline.annotate(text)
 
# What's in the pipeline
print(list(result.keys()))
 
# Check the results
print(result['entities'])

**Expected result:**

```python
['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']
['Lawrence Joseph Ellison', 'American', 'American', 'Oracle Corporation', 'Bloomberg Billionaires Index', 'Ellison', 'Lanai', 'Hawaiian Archipelago']
```
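
The dictionary returned by `annotate` maps each output column to a list of strings, so token-level columns can be paired up with `zip`. The sketch below assumes that the `token`, `pos`, and `ner` lists are aligned one-to-one, which should be confirmed on real output. The `sample` dictionary is made up to mirror the shape of the result and is not actual pipeline output; `tag_tokens` is a hypothetical helper.

```python
def tag_tokens(result: dict, tag_key: str) -> list:
    """Pair each token with its tag from a token-aligned column such as 'pos' or 'ner'."""
    tokens = result["token"]
    tags = result[tag_key]
    if len(tokens) != len(tags):
        raise ValueError(f"'token' and {tag_key!r} have different lengths")
    return list(zip(tokens, tags))


# Made-up stand-in with the same shape as the dictionary returned by annotate().
sample = {
    "token": ["Ellison", "founded", "Oracle"],
    "pos": ["NNP", "VBD", "NNP"],
    "ner": ["B-PER", "O", "B-ORG"],
}

print(tag_tokens(sample, "pos"))  # [('Ellison', 'NNP'), ('founded', 'VBD'), ('Oracle', 'NNP')]
print(tag_tokens(sample, "ner"))  # [('Ellison', 'B-PER'), ('founded', 'O'), ('Oracle', 'B-ORG')]
```

In the session, the same helper would be called inside a `%%spark` cell as `tag_tokens(result, 'pos')`.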

# Session Termination 

In [None]:
%stop_session