## CML 2 CDE Pipeline

#### You can use CML API v2 and other Python libraries to build CML2CDE pipelines for the following purposes:
* Optimizing Spark Jobs
* Versioning Spark Jobs
* Decreasing the cost per Spark Job by executing Jobs in a multicloud pattern

![alt text](images/cml2cde_1.png)

In [3]:
import os
import requests

#### CML API v2 can be imported as shown below. For a full tutorial please visit: https://github.com/pdefusco/CML_AMP_APIv2/blob/master/CMLAPI.ipynb

In [4]:
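try:
    import cmlapi
except ModuleNotFoundError:
    # cmlapi is not installed yet: the workspace serves the client package itself.
    # Swap the trailing "1" of the v1 API URL for "2" to get the v2 base URL.
    import os
    cluster = os.getenv("CDSW_API_URL")[:-1]+"2"
    !pip3 install {cluster}/python.tar.gz
    import cmlapi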

from cmlapi.utils import Cursor
import string
import random
import json

try:
    client = cmlapi.default_client()
except ValueError:
    print("Could not create a client. If this code is not being run in a CML session, please include the keyword arguments \"url\" and \"cml_api_key\".")
    raise  # the cells below need `client`, so stop here rather than fail later with a NameError
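The installation cell above derives the API v2 base URL by replacing the last character of `CDSW_API_URL`. As a minimal sketch of that string manipulation, assuming the variable ends in `v1` (as the slicing implies), the helper below makes the assumption explicit and fails loudly otherwise. The function name `api_v2_base` and the example URL are hypothetical and used for illustration only.

```python
def api_v2_base(api_v1_url: str) -> str:
    """Swap the trailing version digit of an API v1 URL for '2'."""
    if not api_v1_url.endswith("1"):
        # The slicing trick only makes sense for a URL that ends in "...v1".
        raise ValueError(f"expected a URL ending in '1', got {api_v1_url!r}")
    return api_v1_url[:-1] + "2"


# Hypothetical workspace URL, for illustration only.
example_url = "https://ml-workspace.example.com/api/v1"
package_url = api_v2_base(example_url) + "/python.tar.gz"
print(package_url)
```

In a CML session the input would come from `os.getenv("CDSW_API_URL")` rather than from a literal string.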

#### Obtain CML Project ID

In [5]:
### Set Project ID ###
    
project_id = os.environ["CDSW_PROJECT_ID"]
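Reading `CDSW_PROJECT_ID` directly raises a bare `KeyError` when the notebook runs outside a CML session. The sketch below shows one possible way to turn that into a clearer message. The helper `require_env` is hypothetical and not part of `cmlapi`; the `environ` argument exists only so the example can run with a plain dictionary.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a descriptive error."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Run this notebook inside a CML session "
            "or export the variable before starting."
        )
    return value


# Illustration with a stand-in mapping instead of the real environment.
fake_env = {"CDSW_PROJECT_ID": "example-project-id"}
print(require_env("CDSW_PROJECT_ID", fake_env))
```

Inside a CML session, `require_env("CDSW_PROJECT_ID")` reads from the real environment.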

#### Create a CML Job for Script 3 A

In [6]:
# Define the first job of the pipeline (script 3_A, "create_cde_job"). The rest of this notebook refers to it as the "grandparent job". The parameter "runtime_identifier" is needed if this is running in a runtimes project.
grandparent_job_body = cmlapi.CreateJobRequest(
    project_id = project_id,
    name = "create_cde_job",
    script = "cml2cde_pipeline_code/3_A_create_cde_job.py",
    cpu = 1.0,
    memory = 2.0,
    runtime_identifier = "docker.repository.cloudera.com/cdsw/ml-runtime-workbench-python3.7-standard:2021.09.1-b5", 
    runtime_addon_identifiers = ["spark311-13-hf1"]
)

In [7]:
# Create this job within the project specified by the project_id parameter.
grandparent_job = client.create_job(grandparent_job_body, project_id)

#### Create a CML Job for Script 3 B

In [8]:
# Define the second job of the pipeline (script 3_B, "run_cde_job"). The rest of this notebook refers to it as the "parent job". The parameter "runtime_identifier" is needed if this is running in a runtimes project.
parent_job_body = cmlapi.CreateJobRequest(
    project_id = project_id,
    name = "run_cde_job",
    script = "cml2cde_pipeline_code/3_B_run_cde_job.py",
    cpu = 1.0,
    memory = 2.0,
    runtime_identifier = "docker.repository.cloudera.com/cdsw/ml-runtime-workbench-python3.7-standard:2021.09.1-b5", 
    runtime_addon_identifiers = ["spark311-13-hf1"]
)

In [9]:
# Create this job within the project specified by the project_id parameter.
parent_job = client.create_job(parent_job_body, project_id)
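The two job definitions above differ only in `name` and `script`. As a sketch of how the shared settings could be factored out, the hypothetical helper `job_request_kwargs` below builds the keyword arguments as a plain dictionary, so it runs without `cmlapi`. The optional `parent_job_id` key is an assumption: to my understanding it is the `CreateJobRequest` field used to chain a job to a parent, but check it against your installed `cmlapi` version before relying on it.

```python
RUNTIME = "docker.repository.cloudera.com/cdsw/ml-runtime-workbench-python3.7-standard:2021.09.1-b5"
ADDONS = ["spark311-13-hf1"]


def job_request_kwargs(project_id, name, script, cpu=1.0, memory=2.0, parent_job_id=None):
    """Build the keyword arguments shared by the job definitions in this notebook."""
    kwargs = {
        "project_id": project_id,
        "name": name,
        "script": script,
        "cpu": cpu,
        "memory": memory,
        "runtime_identifier": RUNTIME,
        "runtime_addon_identifiers": list(ADDONS),
    }
    if parent_job_id is not None:
        # Assumption: field name for declaring a dependent job; verify before use.
        kwargs["parent_job_id"] = parent_job_id
    return kwargs


# "example-project-id" is a placeholder for the real project ID.
create_kwargs = job_request_kwargs(
    "example-project-id", "create_cde_job", "cml2cde_pipeline_code/3_A_create_cde_job.py"
)
run_kwargs = job_request_kwargs(
    "example-project-id", "run_cde_job", "cml2cde_pipeline_code/3_B_run_cde_job.py"
)
print(create_kwargs["name"], run_kwargs["name"])
```

In a CML session the dictionaries would be unpacked into the request, for example `cmlapi.CreateJobRequest(**create_kwargs)`.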

#### Create a CML Job for Script 3 C

In [10]:
# Optional third job (script 3_C, "github_backup"). It is commented out, so it is not created in this walkthrough. The parameter "runtime_identifier" is needed if this is running in a runtimes project.
#job_body = cmlapi.CreateJobRequest(
#    project_id = project_id,
#    name = "github_backup",
#    script = "cml2cde_pipeline_code/3_C_github_backup.py",
#    cpu = 2.0,
#    memory = 4.0,
#    runtime_identifier = "docker.repository.cloudera.com/cdsw/ml-runtime-workbench-python3.7-standard:2021.09.1-b5", 
#    runtime_addon_identifiers = ["spark311-13-hf1"]
#)

In [11]:
# Create this job within the project specified by the project_id parameter.
#job = client.create_job(job_body, project_id)

#### Navigate back to the CML Project Home Page and notice there are two new CML Jobs.

#### These CML Jobs have not run yet. Return to this notebook once you have validated this.

![alt text](images/cml2cde_13.png)

#### Run the CML Job Pipeline

In [12]:
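# Trigger a run of the grandparent job (create_cde_job) and pass the CDE settings as environment variables.
# Dependent jobs, if any, run after their parent job succeeds. The two jobs in this notebook were created
# independently, so this run does not start the parent job; that job is triggered explicitly in a later cell.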
jobrun_body = cmlapi.CreateJobRunRequest(project_id, 
                                         grandparent_job.id, 
                                         environment={"JOBS_API_URL":os.environ["JOBS_API_URL"], 
                                                    "CDE_Resource_Name":"cml2cde_cicd_resource", 
                                                    "CDE_Job_Name":"cml2cde_cicd_job", 
                                                    "WORKLOAD_USER":os.environ["WORKLOAD_USER"],
                                                    "WORKLOAD_PASSWORD":os.environ["WORKLOAD_PASSWORD"]}
                                        )
job_run = client.create_job_run(jobrun_body, project_id, grandparent_job.id)
run_id = job_run.id

This time we have triggered CML Job execution. Navigate back to the CML Project Home and validate this.

![alt text](images/cml2cde_14.png)

#### Tip: environment variables set when triggering a CML Job run are not saved in the CML Job Definition, so they are not visible when you open the Job Definition.

In [13]:
# Trigger a run of the parent job (run_cde_job) with the same environment variables.
# The jobs were created independently, so this job has to be triggered explicitly.
jobrun_body = cmlapi.CreateJobRunRequest(project_id, 
                                         parent_job.id, 
                                         environment={"JOBS_API_URL":os.environ["JOBS_API_URL"], 
                                                    "CDE_Resource_Name":"cml2cde_cicd_resource", 
                                                    "CDE_Job_Name":"cml2cde_cicd_job", 
                                                    "WORKLOAD_USER":os.environ["WORKLOAD_USER"],
                                                    "WORKLOAD_PASSWORD":os.environ["WORKLOAD_PASSWORD"]}
                                        )
job_run = client.create_job_run(jobrun_body, project_id, parent_job.id)
run_id = job_run.id

When you go back to the CML Project Home, observe that the second job has also launched.
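Both job runs above pass the same set of environment variables. The sketch below shows one way that dictionary could be assembled once, with a check for missing variables and a masked view that is safe to print. The helpers `cde_run_environment` and `masked` are hypothetical, and the sample values are placeholders rather than real credentials.

```python
def cde_run_environment(source_env,
                        resource_name="cml2cde_cicd_resource",
                        job_name="cml2cde_cicd_job"):
    """Collect the variables both job runs in this notebook pass to CML."""
    required = ("JOBS_API_URL", "WORKLOAD_USER", "WORKLOAD_PASSWORD")
    missing = [key for key in required if not source_env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    env = {key: source_env[key] for key in required}
    env["CDE_Resource_Name"] = resource_name
    env["CDE_Job_Name"] = job_name
    return env


def masked(env):
    """Return a copy that hides any password value, for logging."""
    return {key: ("***" if "PASSWORD" in key else value) for key, value in env.items()}


# Placeholder values for illustration; in CML these come from os.environ.
sample = {
    "JOBS_API_URL": "https://cde.example.com/dex/api/v1",
    "WORKLOAD_USER": "example_user",
    "WORKLOAD_PASSWORD": "example_password",
}
run_env = cde_run_environment(sample)
print(masked(run_env))
```

In the notebook, `cde_run_environment(os.environ)` would produce the value passed as `environment=` to `cmlapi.CreateJobRunRequest`.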

### Looking into the CDE Jobs
   * You created two CML Jobs with APIv2. Each CML Job contained calls to CDE via its API.
   * Open scripts 3_A and 3_B located in the cml2cde_pipeline_code folder and validate this.
   * Notice you didn't just create a single job. You created a system to transfer PySpark scripts from CML into CDE Resources before running the corresponding CDE Jobs.
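The flow described in the bullets above can be outlined as a sequence of REST calls against `JOBS_API_URL`. The sketch below only builds and prints that plan and sends nothing. The endpoint paths are my assumption about the CDE Jobs API, the function `cde_call_plan` and the script name `example_spark_job.py` are hypothetical, and scripts 3_A and 3_B remain the authoritative reference for what the pipeline actually does.

```python
def cde_call_plan(jobs_api_url, resource_name, job_name, script_name):
    """List the assumed CDE calls in the order a CML2CDE pipeline would issue them."""
    base = jobs_api_url.rstrip("/")
    return [
        # Assumed endpoints; confirm against the CDE Jobs API documentation.
        ("POST", f"{base}/resources", "create a files resource"),
        ("PUT", f"{base}/resources/{resource_name}/{script_name}", "upload the PySpark script"),
        ("POST", f"{base}/jobs", "define the CDE Job that uses the resource"),
        ("POST", f"{base}/jobs/{job_name}/run", "trigger a run of the CDE Job"),
    ]


plan = cde_call_plan(
    "https://cde.example.com/dex/api/v1/",  # placeholder JOBS_API_URL
    "cml2cde_cicd_resource",
    "cml2cde_cicd_job",
    "example_spark_job.py",                 # hypothetical script name
)
for method, url, purpose in plan:
    print(f"{method:4} {url}  # {purpose}")
```

Keeping the plan as data makes the split between the two CML Jobs easy to see: the first three calls would belong to the job that creates the CDE Job, and the last call to the job that runs it.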

#### Navigate to the CDE Virtual Cluster Jobs Page and validate CDE Job execution.

![alt text](images/cml2cde_15.png)

![alt text](images/cml2cde_16.png)

![alt text](images/cml2cde_17.png)