## CML 2 CDE Tutorial

##### CDP users can develop Spark Jobs interactively in CML without losing the benefits of CDE, the Spark purpose-built Data Service in the Cloudera Data Platform

* CDE offers a rich CLI and API, which are typically used to manage workflows with CI/CD tools
* The CLI and API can be used within a CML Session
* You can automate deployments to CDE and version project artifacts with CML's API V2

### Part 1: A simple PySpark Job in CML

#### You can prototype your PySpark (or Scala Spark) Jobs interactively with CML Sessions

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
import sys
import os
import json

In [1]:
spark = SparkSession \
    .builder \
    .appName("Pyspark PPP ETL") \
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", os.environ["REGION"])\
    .config("spark.yarn.access.hadoopFileSystems", os.environ["STORAGE"])\
    .getOrCreate()  
    
#Path of our file in S3
input_path = os.environ["STORAGE"] + '/datalake/cde-demo/PPP-Sub-150k-TX.csv'

#This is to deal with tables existing before running this code. Not needed if you're starting fresh.
#spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

#Bring data into Spark from S3 Bucket
base_df=spark.read.option("header","true").option("inferSchema","true").csv(input_path)
#Print schema so we can see what we're working with
print(f"printing schema")
base_df.printSchema()

#Filter out only the columns we actually care about
filtered_df = base_df.select("LoanAmount", "City", "State", "Zip", "BusinessType", "NonProfit", "JobsRetained", "DateApproved", "Lender")

#This is a Texas only dataset but lets do a quick count to feel good about it
print(f"How many TX records did we get?")
tx_cnt = filtered_df.count()
print(f"We got: {tx_cnt} ")

#Create the database if it doesnt exist
print(f"Creating TexasPPP Database \n")
spark.sql("CREATE DATABASE IF NOT EXISTS TexasPPP")
spark.sql("SHOW databases").show()

print(f"Inserting Data into TexasPPP.loan_data table \n")

#insert the data
filtered_df.\
  write.\
  mode("append").\
  saveAsTable("TexasPPP"+'.'+"loan_data", format="parquet")

#Another sanity check to make sure we inserted the right amount of data
print(f"Number of records \n")
spark.sql("Select count(*) as RecordCount from TexasPPP.loan_data").show()

print(f"Retrieve 15 records for validation \n")
spark.sql("Select * from TexasPPP.loan_data limit 15").show()

Setting spark.hadoop.yarn.resourcemanager.principal to pauldefusco
                                                                                

printing schema
root
 |-- LoanAmount: double (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zip: integer (nullable = true)
 |-- NAICSCode: integer (nullable = true)
 |-- BusinessType: string (nullable = true)
 |-- RaceEthnicity: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Veteran: string (nullable = true)
 |-- NonProfit: string (nullable = true)
 |-- JobsRetained: integer (nullable = true)
 |-- DateApproved: string (nullable = true)
 |-- Lender: string (nullable = true)
 |-- CD: string (nullable = true)

How many TX records did we get?


                                                                                

We got: 337237 
Creating TexasPPP Database 



Hive Session ID = d577966e-7c97-41eb-8f54-9895a15d107a


+--------------------+
|        databaseName|
+--------------------+
|         01_car_data|
|           01_car_dw|
|             airline|
|          airline_dw|
|            airlines|
|        airlines_csv|
|       airlines_csv1|
|   airlines_csv_vish|
|    airlines_iceberg|
|   airlines_iceberg1|
|airlines_iceberg_...|
|          airquality|
|              apache|
|          atlas_demo|
|            bankdemo|
|              bhagan|
|             cdedemo|
|        cdp_overview|
|            climrisk|
|                cnav|
+--------------------+
only showing top 20 rows

Inserting Data into TexasPPP.loan_data table 



                                                                                

Number of records 



                                                                                

+-----------+
|RecordCount|
+-----------+
|    1011711|
+-----------+

Retrieve 15 records for validation 




+----------+-------------+-----+-----+--------------------+---------+------------+------------+--------------------+
|LoanAmount|         City|State|  Zip|        BusinessType|NonProfit|JobsRetained|DateApproved|              Lender|
+----------+-------------+-----+-----+--------------------+---------+------------+------------+--------------------+
|   20000.0|        ALVIN|   TX|77511| Sole Proprietorship|     null|          11|  04/06/2020|Texas Advantage C...|
|   20000.0|       HEARNE|   TX|77859|Limited  Liabilit...|     null|           1|  04/30/2020|First National Ba...|
|   20000.0|     CLEBURNE|   TX|76033|Subchapter S Corp...|     null|        null|  04/30/2020|Commonwealth Busi...|
|   20000.0|      HOUSTON|   TX|77006|Subchapter S Corp...|     null|           4|  04/14/2020|     Allegiance Bank|
|   20000.0|   ROPESVILLE|   TX|79358|         Corporation|     null|           4|  04/06/2020|          Vista Bank|
|   20000.0|      HOUSTON|   TX|77080| Sole Proprietorship|     

                                                                                

### Part 2: Working with the CDE CLI

#### You can submit the same PySpark Job into your CDE Virtual Cluster with the CDE CLI

#### Open the CML Session Terminal and execute the following Spark Submit. You will need to enter your CDP Workload Password

<code> cde spark submit /home/cdsw/Data_Extraction_Sub_150k.py </code>

![alt text](images/cml2cde_2.png)

#### Navigate to your CDE Virtual Cluster's Job Runs page and validate that the job has been created

![alt text](images/cml2cde_3.png)

#### Important Takeaways

* We pasted the same PySpark code we executed in the previous cells into "/home/cdsw/Data_Extraction_Sub_150k.py". No code changes were required.
* The CLI was installed beforehand. It is tied to a specific CDE Virtual Cluster, because installing it requires setting the VC JOBS API URL.
* Notice that although the Spark Job runs in CDE, CDE does not create a reusable CDE Job for it. You can only rerun it by resubmitting the Spark Submit above.
* However, you can use the CDE CLI to create CDE Jobs, Resources, and more. This makes it easier to monitor and troubleshoot the same Spark Job across multiple runs.

#### Learning to Use the CLI

#### For a full reference to the CLI, please visit this site: https://docs.cloudera.com/data-engineering/cloud/cli-access/topics/cde-cli-reference.html

#### Once you have identified a CLI command you want to execute, you can use the terminal to familiarize yourself with all of its flags

![alt text](images/cml2cde_4.png)

### More CDE CLI Command examples

##### Using the CML Session Terminal, execute the following in this order

#### Create a CDE Resource

##### A CDE Resource allows you to save and reuse CDE Job related artifacts  
<code>cde resource create --name cml2cde_cli_resource </code>

#### Upload a File to a Resource

<code>cde resource upload --name cml2cde_cli_resource --local-path "/home/cdsw/Data_Extraction_Sub_150k.py" --resource-path "Data_Extraction_Sub_150k.py"</code>

#### Create a CDE Job with the Resource File

##### The Job command has a lot of options. For example, you can create a CDE Job with the CDE Resource file and run it on a schedule.
<code>cde job create --name "cml2cde_cli_job" --type "spark" \
                --application-file "Data_Extraction_Sub_150k.py" \
                --cron-expression "0 */1 * * *" \
                --schedule-enabled "true" \
                --schedule-start "2022-04-29" \
                --schedule-end "2022-05-02" \
                --mount-1-resource "cml2cde_cli_resource"</code>
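
The `--cron-expression` value above follows the standard five-field cron layout. As a reading aid, here is a minimal Python sketch that labels each field. The helper name `describe_cron` is hypothetical and is not part of the CDE CLI.

```python
# Hypothetical helper for illustration only; not part of the CDE CLI.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map each of the five cron fields to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# The expression used in the job definition above.
fields = describe_cron("0 */1 * * *")
print(fields)
```

Read this way, `0 */1 * * *` means minute 0 of every hour, so the job is scheduled to run at the top of each hour between the schedule start and end dates.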

#### Search for CDE Jobs based on attribute

##### You can use attributes for your search. In this case, we search by name.
<code>cde job list --filter 'name[like]%cml2cde%'</code>

#### List all CDE Job Runs

##### This lists the Job Runs in your CDE Virtual Cluster
<code>cde run list</code>

#### Describe CDE Job Run

#### Replace the integer below (e.g. 47) with your Job Run ID
<code>cde run describe --id 47</code>

#### Create a CDE Job with Custom Spark Log Level

#### Using the log-level parameter, you can choose any of the following: (TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF)
<code>cde job create --name "cml2cde_cli_job_custom_log_level" --type "spark" \
                --application-file "Data_Extraction_Sub_150k.py" \
                --log-level "DEBUG" \
                --schedule-enabled "false" \
                --mount-1-resource "cml2cde_cli_resource"</code>

#### Trigger execution of the "cml2cde_cli_job_custom_log_level" Job:
<code>cde job run --name "cml2cde_cli_job_custom_log_level" --application-file "Data_Extraction_Sub_150k.py"</code>

#### Navigate back to the corresponding CDE Job Run, pick a log type of your preference, and validate that the log level has now changed.

![alt text](images/cml2cde_6.png)

#### Collect CDE Job Run Logs

##### You can download the Spark Logs you would have access to in CDE. Notice you have more options e.g. executor logs
<code>cde run logs --type "driver/stdout" --id 47</code>

#### You can modify the log type to any of the available tabs in the corresponding CDE Job Run page. For example:
* <code>"driver/stderr" or "driver/stdout"</code>
* <code>"executor id/stdout"</code>

![alt text](images/cml2cde_5.png)
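
If you want to collect several log types from a notebook cell rather than the terminal, one option is to build the command as an argument list. The sketch below uses only the `--type` and `--id` flags shown above. The helper name `build_run_logs_command` is hypothetical, and actually running the command assumes the CDE CLI is installed and configured in your session.

```python
import shlex


def build_run_logs_command(run_id, log_type="driver/stdout"):
    """Return the argument list for `cde run logs`, using only the flags shown above."""
    return ["cde", "run", "logs", "--type", log_type, "--id", str(run_id)]


# Build (but do not execute) the command for the driver's stderr log of run 47.
cmd = build_run_logs_command(47, "driver/stderr")
print(shlex.join(cmd))
```

You could then pass `cmd` to `subprocess.run(cmd, check=True)`. Keeping the arguments in a list avoids shell quoting issues for log types that contain spaces.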

### Part 3: Working with the CDE API

##### To run the following commands you need to have set the JOBS_API_URL environment variable. Please ensure you have followed the instructions in the README file.
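
Before running the cells in this part, it can help to confirm that the variables they read are present. The following is a minimal sketch. The variable names come from the cells in this notebook, while the helper name `missing_variables` and the example values are made up for illustration.

```python
import os

# Variable names read by the cells in this notebook.
REQUIRED_VARS = ("JOBS_API_URL", "WORKLOAD_USER", "WORKLOAD_PASSWORD")


def missing_variables(env=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Stand-in mapping for illustration; the URL is a placeholder, not a real endpoint.
example_env = {
    "JOBS_API_URL": "https://example.invalid/dex/api/v1",
    "WORKLOAD_USER": "someuser",
}
print(missing_variables(example_env))
```

Calling `missing_variables()` with no arguments checks the real session environment. An empty list means all three variables are set.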

#### You can submit the same PySpark Job into your CDE Virtual Cluster with the CDE API

In [7]:
import os
import json

In [8]:
def set_cde_token():
    rep = os.environ["JOBS_API_URL"].split("/")[2].split(".")[0]
    os.environ["GET_TOKEN_URL"] = os.environ["JOBS_API_URL"].replace(rep, "service").replace("dex/api/v1", "gateway/authtkn/knoxtoken/api/v1/token")
    token_json = !curl -u $WORKLOAD_USER:$WORKLOAD_PASSWORD $GET_TOKEN_URL
    os.environ["ACCESS_TOKEN"] = json.loads(token_json[5])["access_token"]
    return json.loads(token_json[5])["access_token"]

tok = set_cde_token()
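
The function above does two things: it derives the token URL from `JOBS_API_URL` through string replacement, and it reads the token from a fixed line (`token_json[5]`) of the curl output. That line index depends on how many progress lines curl prints. The sketch below mirrors the first step without touching the network, and shows one possible way to find the token line without a fixed index. The helper names, the example URL, and the dummy token are all hypothetical.

```python
import json


def derive_token_url(jobs_api_url):
    """Mirror the string replacements in set_cde_token without calling the network."""
    host_prefix = jobs_api_url.split("/")[2].split(".")[0]
    return (jobs_api_url
            .replace(host_prefix, "service")
            .replace("dex/api/v1", "gateway/authtkn/knoxtoken/api/v1/token"))


def extract_access_token(output_lines):
    """Return access_token from the first line that parses as a JSON object containing it."""
    for line in output_lines:
        try:
            payload = json.loads(line)
        except ValueError:
            continue  # progress output and other non-JSON lines are skipped
        if isinstance(payload, dict) and "access_token" in payload:
            return payload["access_token"]
    raise ValueError("no access_token found in curl output")


# Placeholder URL and token, for illustration only.
example_url = "https://abc123.cde-demo.example.invalid/dex/api/v1"
token_url = derive_token_url(example_url)
print(token_url)

sample_output = ["  % Total    % Received", '{"access_token": "dummy-token"}']
print(extract_access_token(sample_output))
```

Adding `-s` to the curl call would also suppress the progress lines, but scanning for the JSON line works either way.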

#### You can perform the same commands you executed above with the CLI. You can execute each command from the terminal, or in a notebook cell with the "!" prefix.

#### Submit a request to create a CDE Resource

In [4]:
# Create a CDE Resource with the API
!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X POST \
  "$JOBS_API_URL/resources" -H "Content-Type: application/json" \
  -d "{ \"name\": \"cml2cde_api_resource\"}"

#### Validate that the resource has been created in your CDE Virtual Cluster by visiting the Resources tab again

![alt text](images/cml2cde_7.png)

#### Check that the CDE resource has been created

In [5]:
!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X GET "$JOBS_API_URL/resources/cml2cde_api_resource"

{"name":"cml2cde_api_resource","type":"files","created":"2022-09-01T20:26:38Z","modified":"2022-09-01T20:26:38Z","lastUsed":"0001-01-01T00:00:00Z","retentionPolicy":"keep_indefinitely","status":"ready"}

#### Upload the Spark Job to the CDE Resource

In [10]:
!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X PUT \
  "$JOBS_API_URL/resources/cml2cde_api_resource/Data_Extraction_Sub_150k.py" \
  -F "file=@/home/cdsw/cml2cde_tutorial_code/Data_Extraction_Sub_150k.py"

#### Create a CDE Job with the Uploaded File

In [11]:
# Create the CDE Job from the resource
!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X POST "$JOBS_API_URL/jobs" \
          -H "accept: application/json" \
          -H "Content-Type: application/json" \
          -d "{ \"name\": \"cml2cde_api_job\", \"type\": \"spark\", \"retentionPolicy\": \"keep_indefinitely\", \"mounts\": [ { \"dirPrefix\": \"/\", \"resourceName\": \"cml2cde_api_resource\" } ], \"spark\": { \"file\": \"Data_Extraction_Sub_150k.py\"},\"schedule\": { \"enabled\": false} }"
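
The escaped JSON string in the `-d` argument is easy to get wrong. As an alternative sketch, the same payload can be written as a Python dictionary and serialized with `json.dumps`. The field names and values below are copied from the curl command above, and nothing is sent to CDE here.

```python
import json

# Same fields as the curl payload above, expressed as a Python dict.
job_definition = {
    "name": "cml2cde_api_job",
    "type": "spark",
    "retentionPolicy": "keep_indefinitely",
    "mounts": [{"dirPrefix": "/", "resourceName": "cml2cde_api_resource"}],
    "spark": {"file": "Data_Extraction_Sub_150k.py"},
    "schedule": {"enabled": False},
}

# json.dumps converts Python's False to JSON's false and handles all quoting.
payload = json.dumps(job_definition)
print(payload)
```

The resulting string can serve as the request body, either for curl or with the Python requests library as in Part 4.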

#### Execute the CDE Job

In [12]:
!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X POST "$JOBS_API_URL/jobs/cml2cde_api_job/run"

{"id":48}

#### As with the CDE CLI, the CDE API allows you to do a lot more than the above. 
#### The best way to start is to visit the Swagger page: go to the CDE Virtual Cluster Details page, then open the API DOC link as shown below.

![alt text](images/cml2cde_8.png)

![alt text](images/cml2cde_9.png)

#### You can build API commands for your cluster from there. For example, click on the “GET/jobs” method under the “jobs” section.

![alt text](images/cml2cde_10.png)

#### Click on the “Try it out” button. 

#### Next, try entering a few options in the provided fields. For example, ensure the latestjob flag is set to "true" and enter the string “name[eq]cml2cde_api_job” in the first field.

#### Finally, click "Execute"

![alt text](images/cml2cde_11.png)

#### Notice the "Request URL" field has been populated for you. You can use this to construct your next request, as shown below. Also notice a response preview is provided.

![alt text](images/cml2cde_12.png)

#### Using the highlighted portion of the "Request URL" we can construct a new request as shown below. 

In [13]:
# Retrieve the CDE Job and its latest run info
latest_job_json = !curl -H "Authorization: Bearer $ACCESS_TOKEN" -X GET "$JOBS_API_URL/jobs?latestjob=true&filter=name%5Beq%5Dcml2cde_api_job&limit=20&offset=0&orderby=name&orderasc=true" \
          -H "accept: application/json" \
          -H "Content-Type: application/json"
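
The `%5B` and `%5D` sequences in the request URL are the percent-encoded forms of `[` and `]` in the filter `name[eq]cml2cde_api_job`. If you build such URLs yourself, the standard library can do the encoding. The sketch below reproduces the query string from the request above and makes no network call.

```python
from urllib.parse import urlencode

# Query parameters taken from the Swagger-generated request URL above.
params = {
    "latestjob": "true",
    "filter": "name[eq]cml2cde_api_job",
    "limit": 20,
    "offset": 0,
    "orderby": "name",
    "orderasc": "true",
}

# urlencode percent-encodes reserved characters such as [ and ].
query_string = urlencode(params)
print(query_string)
```

Appending `"?" + query_string` to `$JOBS_API_URL/jobs` gives the same URL used in the curl command.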

In [14]:
json.loads(latest_job_json[-1])

{'jobs': [{'name': 'cml2cde_api_job',
   'type': 'spark',
   'created': '2022-09-01T20:28:32Z',
   'modified': '2022-09-01T20:28:32Z',
   'lastUsed': '2022-09-01T20:28:39Z',
   'mounts': [{'resourceName': 'cml2cde_api_resource', 'dirPrefix': '/'}],
   'spark': {'file': 'Data_Extraction_Sub_150k.py'},
   'retentionPolicy': 'keep_indefinitely',
   'schedule': {'enabled': False, 'user': 'pauldefusco'},
   'latestRunInfo': {'id': 48,
    'job': 'cml2cde_api_job',
    'type': 'spark',
    'status': 'starting',
    'mounts': [{'resourceName': 'cml2cde_api_resource', 'dirPrefix': '/'}],
    'spark': {'spec': {'file': 'Data_Extraction_Sub_150k.py'},
     'submitID': '22'},
    'user': 'pauldefusco',
    'started': '2022-09-01T20:28:39Z',
    'ended': '0001-01-01T00:00:00Z',
    'identity': {'disableRoleProxy': True, 'role': 'instance'}}}],
 'meta': {'hasNext': False, 'limit': 20, 'offset': 0, 'count': 1}}
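
The part of this response you will most often need is the run id and status under `latestRunInfo`. Here is a small sketch that pulls them out. The helper name `latest_run_status` is hypothetical, and the sample dictionary is a trimmed copy of the response above.

```python
def latest_run_status(response):
    """Return (run id, status) for the first job in a /jobs response, or None if absent."""
    jobs = response.get("jobs", [])
    if not jobs or "latestRunInfo" not in jobs[0]:
        return None
    info = jobs[0]["latestRunInfo"]
    return info["id"], info["status"]


# Trimmed copy of the response shown above.
sample = {
    "jobs": [{"name": "cml2cde_api_job",
              "latestRunInfo": {"id": 48, "status": "starting"}}],
    "meta": {"count": 1},
}
print(latest_run_status(sample))
```

In the notebook you could call `latest_run_status(json.loads(latest_job_json[-1]))` and repeat the request until the status changes.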

### Part 4: Working with CDE in Python

#### You can reconstruct your CDE API requests with the Python requests library. Notice you will need a CDE Token, as you did with the CDE API.

In [9]:
def set_cde_token():
    rep = os.environ["JOBS_API_URL"].split("/")[2].split(".")[0]
    os.environ["GET_TOKEN_URL"] = os.environ["JOBS_API_URL"].replace(rep, "service").replace("dex/api/v1", "gateway/authtkn/knoxtoken/api/v1/token")
    token_json = !curl -u $WORKLOAD_USER:$WORKLOAD_PASSWORD $GET_TOKEN_URL
    os.environ["ACCESS_TOKEN"] = json.loads(token_json[5])["access_token"]
    return json.loads(token_json[5])["access_token"]

In [10]:
tok = set_cde_token()

#### Browse Resources

In [17]:
import requests

In [18]:
url = os.environ["JOBS_API_URL"] + "/resources"
myobj = {"name": "cml2cde_python"}
headers = {"Authorization": f'Bearer {tok}', 
          "Content-Type": "application/json"}

## Only showing two entries near the end of the list (the slice [-3:-1] excludes the final entry)
x = requests.get(url, headers=headers)
x.json()["resources"][-3:-1]

# CDE API Equivalent
#!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X GET \
#   "$JOBS_API_URL/resources"

[{'name': 'FirstDag',
  'type': 'files',
  'signature': '70b62c9689bf594137762fa855e8ee434c95ead6',
  'created': '2022-08-31T20:32:16Z',
  'modified': '2022-09-01T00:21:33Z',
  'lastUsed': '2022-09-01T00:21:58Z',
  'retentionPolicy': 'keep_indefinitely',
  'status': 'ready'},
 {'name': 'FlightsProcessIcebergJAR',
  'type': 'files',
  'signature': 'c6dbf07fa5925cb5778200391d0a987dc45d1d2d',
  'created': '2022-07-28T21:13:37Z',
  'modified': '2022-07-28T21:13:37Z',
  'lastUsed': '2022-07-28T21:14:36Z',
  'retentionPolicy': 'keep_indefinitely',
  'status': 'ready'}]

#### Create a Resource

In [19]:
url = os.environ["JOBS_API_URL"] + "/resources"
myobj = {"name": "cml2cde_python_resource"}
data_to_send = json.dumps(myobj).encode("utf-8")
headers = {"Authorization": f'Bearer {tok}', 
          "Content-Type": "application/json"}

x = requests.post(url, data=data_to_send, headers=headers)
print(x.status_code)

# CDE API Equivalent
#!curl -H "Authorization: Bearer $ACCESS_TOKEN" -X POST \
#  "$JOBS_API_URL/resources" -H "Content-Type: application/json" \
#  -d "{ \"name\": \"cml2cde_python_resource\"}"

201
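
When you make several such calls, it can be convenient to assemble the request arguments in one place. The sketch below builds keyword arguments for `requests.post` without sending anything. The helper name `resource_request_kwargs`, the example URL, and the dummy token are assumptions for illustration.

```python
def resource_request_kwargs(jobs_api_url, token, resource_name):
    """Build keyword arguments for requests.post; nothing is sent here."""
    return {
        "url": jobs_api_url.rstrip("/") + "/resources",
        "json": {"name": resource_name},
        "headers": {"Authorization": f"Bearer {token}"},
    }


# Placeholder values for illustration only.
kwargs = resource_request_kwargs(
    "https://example.invalid/dex/api/v1", "dummy-token", "cml2cde_python_resource"
)
print(kwargs["url"])
```

With a real URL and token, `requests.post(**kwargs)` would submit the request. Passing the body through the `json` argument lets requests serialize the dictionary and set the `Content-Type: application/json` header for you.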


#### You can build more Python Requests with the help of the CDE API Documentation as you did in Part 3.

#### If used in conjunction with CML API V2, this is probably the easiest way to build CDE Pipelines in CML. We will see an example in the next notebook.

### Part 5: Closing the Loop

#### While you can submit Spark Jobs in any of the ways above, it's important to keep in mind that the outputs of CDE Jobs can be retrieved from CML just as easily.

#### To prove this, execute the "Validation" PySpark Job with the CDE CLI. The Python file is in the home directory. For your convenience, the code is below.

#### Execute the Spark Submit from the CML Session terminal.
<code>cde spark submit /home/cdsw/Validation.py</code>

In [24]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
import sys

spark = SparkSession \
    .builder \
    .appName("Pyspark PPP ETL") \
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2")\
    .config("spark.yarn.access.hadoopFileSystems", "s3a://demo-aws-go02")\
    .getOrCreate()  
    
print(f"Retrieve 1000 records for validation \n")
df = spark.sql("Select * from TexasPPP.loan_data limit 1000")

df.\
  write.\
  mode("append").\
  saveAsTable("ValidationTable", format="parquet")

Retrieve 1000 records for validation 



                                                                                

#### Next, execute the following cell and observe that the data is readily available in CML

In [25]:
#### You may have lost your Spark Session by now. If that is the case, recreate one:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
import sys

spark = SparkSession \
    .builder \
    .appName("Pyspark PPP ETL") \
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", os.environ["REGION"])\
    .config("spark.yarn.access.hadoopFileSystems", os.environ["STORAGE"])\
    .getOrCreate()

In [26]:
print(f"Retrieve Data from the New table \n")
spark.sql("Select * from ValidationTable").show()

Retrieve Data from the New table 

+----------+--------------+-----+-----+--------------------+---------+------------+------------+--------------------+
|LoanAmount|          City|State|  Zip|        BusinessType|NonProfit|JobsRetained|DateApproved|              Lender|
+----------+--------------+-----+-----+--------------------+---------+------------+------------+--------------------+
| 149997.07|   COLLEYVILLE|   TX|76034|         Corporation|     null|          14|  04/28/2020|    Bank of the West|
|  149990.0|        LAMESA|   TX|79331|Limited  Liabilit...|     null|          12|  04/13/2020|   First United Bank|
|  149965.0|CORPUS CHRISTI|   TX|78418|         Corporation|     null|           1|  04/15/2020|American Bank, Na...|
|  149956.0|        AUSTIN|   TX|78704|         Corporation|     null|          11|  04/10/2020|  PlainsCapital Bank|
|  149952.0|     KERRVILLE|   TX|78028|         Corporation|     null|          12|  04/08/2020|Texas Hill Countr...|
|  149942.0|   SAN AN