pdefusco/CML2CDE


CDE CI/CD pipelines with CML

Project Summary

This tutorial demonstrates how to build CI/CD pipelines that prototype Spark jobs in the Cloudera Machine Learning (CML) Service and deploy them to the Cloudera Data Engineering (CDE) Service for more sophisticated Spark job management, visibility, and troubleshooting.

This tutorial is divided into the following sections:

  1. Installing the CDE CLI
  2. A CDE CLI and API tutorial
  3. A simple CDE CI/CD pipeline with CML Jobs and CML API v2
  4. More Python Requests with CDE examples
  5. A demonstration of how to convert Spark Oozie Actions to Spark CDE Jobs programmatically
  6. A demonstration of how to convert Oozie Workflows to Airflow CDE Jobs programmatically
  7. A full Oozie-to-CDE migration example using the oozie2cde module

Prerequisites

This project requires access to a CML Workspace and a CDE Virtual Cluster. These can be in either CDP Public Cloud or CDP Private Cloud.

Some familiarity with Python, CLIs and APIs in general, and Jupyter Notebooks is recommended. However, no coding is required beyond executing the provided notebooks.

If you are completely new to CML and would like a quick intro to creating Projects, Sessions, using Spark and more, please start with this repository.

CML Project Setup

Navigate to the CDP Management Console and then open CML. Take note of the region your workspace is in, for example “us-east-2”.

Save this to your editor, you will need this later as the REGION environment variable.


Navigate to your CML Workspace and create a new project.


Select the Git Option and paste the following URL:

https://github.com/pdefusco/CML2CDE.git

There is no need to customize the Runtime Setup settings at the bottom of the page. Default option values are fine.

Within the project, launch a CML Session with the following settings:

Session Name: Setup Session (anything is fine here)
Editor: Workbench
Kernel: Python 3.7 or above
Enable Spark: disabled
Resource Profile: 2 vCPU / 4 GiB Memory - 0 GPUs
Other settings: default values are fine.


Next, open script “0_Setup.py” and hit the play button at the top of the screen to run all code at once.


This script will print the Cloud Storage environment variable to the screen, download all data for the tutorial, and move it from the local /home/cdsw folder to a new Cloud Storage folder.

This works for both AWS and Azure without edits.

Follow along while the code executes on the right side of the screen. When the prompt turns back to green, enter a short snippet to print the Cloud Storage path (a sketch follows below) and copy the output to your editor.

Save this to your editor, you will need this later as the STORAGE environment variable.
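The exact snippet is not reproduced here; a minimal sketch, assuming 0_Setup.py leaves a STORAGE variable defined in the running Workbench session, would be:

```python
# Hypothetical: assumes 0_Setup.py defined STORAGE in this session.
# Echo it so the path can be copied into your editor.
print(STORAGE)
```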


Next, navigate to your CDE Virtual Cluster. Open the Cluster Details page as shown below.


Copy the value for the JOBS_API_URL and save it to your editor, you will need this later as the JOBS_API_URL environment variable.


Now, save the values you pasted in your editor as CML Project Environment Variables.

Navigate to the “Project Settings” menu and open the “Advanced” tab. Create one entry per value you saved earlier, using the keys REGION, STORAGE, and JOBS_API_URL.


Finally, go back to CML and end your current Workbench Editor session. You won’t need it anymore.


Part 1: Installing the CDE CLI

Navigate back to the CML Project home. Launch a new session with the following settings:

Session Name: “JupyterLab Session”
Editor: JupyterLab
Kernel: Python 3.7 or higher
Enable Spark: disabled
Resource Profile: 2 vCPU / 4 GiB Memory - 0 GPUs
Other settings: you can leave the remaining settings, such as Edition and Version, at their default values.


Open Notebook “1_CDE_CLI_Install.ipynb”. The notebook shows you how to install the CDE CLI. Notice that although we will complete this process in CML, it can just as easily be done on your local machine.

Run all cells by clicking on “Run” -> “Run All Cells”. No changes to the code are required.

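If you prefer to try the CLI outside the notebook, the sketch below mirrors the typical setup: it writes the two documented config keys (user and vcluster-endpoint) to ~/.cde/config.yaml and runs a first command. The workload user name is a placeholder, and the sketch assumes the cde binary is already on your PATH.

```python
import os
import pathlib
import subprocess

# Placeholder: replace with your CDP workload user name.
user = "your-workload-user"
# The Virtual Cluster endpoint is the JOBS_API_URL saved earlier.
vcluster_endpoint = os.environ["JOBS_API_URL"]

# The CDE CLI reads its defaults from ~/.cde/config.yaml.
config_dir = pathlib.Path.home() / ".cde"
config_dir.mkdir(exist_ok=True)
(config_dir / "config.yaml").write_text(
    f"user: {user}\nvcluster-endpoint: {vcluster_endpoint}\n"
)

# Smoke test: list the jobs in the Virtual Cluster (the CLI prompts
# for your workload password). Assumes cde is already on PATH.
subprocess.run(["cde", "job", "list"], check=True)
```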

Part 2: A CDE CLI and API tutorial

Next, open Notebook “2_CML2CDE_Tutorial.ipynb”. This notebook provides examples of the following:

  • Running a simple PySpark job in CML from the CML Session Notebook
  • Sending Spark-Submit commands via the CDE CLI from the CML Session Terminal
  • Working with the CDE API from the CML Session Notebook
  • Composing the same API requests with the Python Requests library
  • Submitting a final PySpark job in CML from the same CML Session Notebook to effortlessly read the same data that was processed in CDE

Walk through the notebook and follow along in each cell. Detailed instructions are provided in the Notebook Markdown cells.

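As a taste of what the notebook covers, here is a minimal sketch of the CDE API authentication flow: exchange CDP workload credentials for a short-lived token, then call the Jobs API. The hostname rewrite assumes the standard CDE Virtual Cluster endpoint layout, and WORKLOAD_USER / WORKLOAD_PASSWORD are placeholder variable names.

```python
import os
import requests

# e.g. https://<app>.cde-<id>.<env>.cloudera.site/dex/api/v1
jobs_api_url = os.environ["JOBS_API_URL"]

# Derive the token endpoint by swapping the first hostname label for
# "service" (assumes the standard Virtual Cluster endpoint layout).
host = jobs_api_url.split("/")[2]
token_url = ("https://service." + host.split(".", 1)[1]
             + "/gateway/authtkn/knoxtoken/api/v1/token")

# Exchange CDP workload credentials for a short-lived JWT.
resp = requests.get(
    token_url,
    auth=(os.environ["WORKLOAD_USER"], os.environ["WORKLOAD_PASSWORD"]),
)
resp.raise_for_status()
token = resp.json()["access_token"]

# List the Spark jobs defined in the Virtual Cluster.
jobs = requests.get(
    jobs_api_url + "/jobs",
    headers={"Authorization": f"Bearer {token}"},
)
print(jobs.json())
```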

Part 3: A simple CDE CI/CD pipeline with CML Jobs and CML API v2

API v2 is CML's native Python library for building CI/CD and ML Ops pipelines. This notebook uses API v2 to build a CI/CD pipeline made of CDE Jobs. Specifically, in this notebook you will:

  • Instantiate two CML Jobs and trigger their execution with APIv2
  • Each CML Job will include one or more API calls to CDE. Together, the two jobs constitute a workflow that moves PySpark jobs from CML into CDE Resources.

For more on CML API v2, please visit the “Related Demos and Tutorials” section below.

Open Notebook “3_CML2CDE_Pipeline.ipynb” and follow the instructions in the comments.
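For orientation, here is a minimal sketch of the API v2 calls involved in creating and triggering a CML Job from inside a session. The script name is a placeholder, and the runtime is picked naively from the list the API returns; the notebook is the authoritative version.

```python
import os
import cmlapi

# Inside a CML session, default_client() picks up the workspace URL
# and API key automatically.
client = cmlapi.default_client()
project_id = os.environ["CDSW_PROJECT_ID"]

# Pick a runtime for the job (naive choice: the first one listed).
runtime = client.list_runtimes().runtimes[0]

# "cde_deploy.py" is a placeholder script name.
job = client.create_job(
    cmlapi.CreateJobRequest(
        project_id=project_id,
        name="deploy-to-cde",
        script="cde_deploy.py",
        runtime_identifier=runtime.image_identifier,
    ),
    project_id=project_id,
)

# Trigger a run of the job we just created.
run = client.create_job_run(
    cmlapi.CreateJobRunRequest(), project_id=project_id, job_id=job.id
)
print(run.status)
```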

Part 4: CDE with Python Requests

This notebook showcases how to convert more CDE API calls into Python Requests methods. Detailed instructions are included in the notebook. You can reuse the code as-is or use it as inspiration for more Python workflows with CDE.

Open Notebook “4_Python2CDE.ipynb” and follow the instructions in the comments.
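To illustrate the request shapes involved, the sketch below creates a files resource, uploads a PySpark script to it, creates a Spark job that mounts the resource, and runs the job. The endpoint paths and payload fields follow the CDE API reference to the best of my knowledge; the token is obtained as in the Part 2 sketch, and "etl.py" is a placeholder file name.

```python
import os
import requests

jobs_api_url = os.environ["JOBS_API_URL"]
token = os.environ["CDE_TOKEN"]  # obtained as in the Part 2 sketch
headers = {"Authorization": f"Bearer {token}"}

# 1. Create a files resource to hold the application code.
requests.post(jobs_api_url + "/resources",
              headers=headers, json={"name": "my_scripts", "type": "files"})

# 2. Upload a PySpark script into the resource.
with open("etl.py", "rb") as f:
    requests.put(jobs_api_url + "/resources/my_scripts/etl.py",
                 headers=headers, files={"file": f})

# 3. Create a Spark job that mounts the resource.
job_payload = {
    "name": "my_etl_job",
    "type": "spark",
    "mounts": [{"resourceName": "my_scripts"}],
    "spark": {"file": "etl.py", "executorCores": 2, "executorMemory": "4g"},
}
requests.post(jobs_api_url + "/jobs", headers=headers, json=job_payload)

# 4. Trigger a run of the job by name.
run = requests.post(jobs_api_url + "/jobs/my_etl_job/run", headers=headers)
print(run.json())
```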

Part 5: Converting Spark Oozie Actions to Spark CDE Jobs Programmatically

Oozie is a legacy Hadoop workflow-orchestration project. It gained a lot of traction as a tool for scheduling complex MapReduce job sequences and was later extended to support Spark jobs via "Spark Oozie Actions". It has largely been replaced by Apache Airflow, which is natively available in every CDE Virtual Cluster with no installation required. The notebook showcases how to convert Oozie Spark Actions into Spark CDE Jobs.

In order to use it, you need:

  • Spark job files (jar, Python, or Java)
  • The XML files corresponding to the Spark Oozie Actions you want to migrate

Open Notebook “5_Oozie2CDESpark.ipynb” and follow the instructions in the comments.
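The core idea is simple enough to sketch: parse the action XML, pull out the fields of the <spark> element, and map them onto a CDE job payload. This is a hypothetical minimal converter, not the notebook's implementation; real workflows need broader field and control-flow coverage.

```python
import xml.etree.ElementTree as ET

def localname(tag):
    """Strip the XML namespace, so the Oozie schema version doesn't matter."""
    return tag.rsplit("}", 1)[-1]

def spark_action_to_cde_job(xml_path, job_name, resource_name):
    """Map the fields of an Oozie <spark> action onto a CDE job payload.

    Hypothetical minimal converter: it only handles <jar> and <name>.
    """
    root = ET.parse(xml_path).getroot()
    fields = {}
    for el in root.iter():
        if localname(el.tag) in ("jar", "name", "spark-opts"):
            fields[localname(el.tag)] = (el.text or "").strip()

    return {
        "name": job_name,
        "type": "spark",
        "mounts": [{"resourceName": resource_name}],
        "spark": {"file": fields.get("jar")},
    }

# Example: payload = spark_action_to_cde_job("workflow.xml", "wf_job", "my_scripts")
```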

Part 6: Converting Oozie Workflows to Airflow CDE Jobs Programmatically

Continuing from notebook 5, this notebook implements a series of methods to convert Oozie Workflows to Airflow CDE DAGs.

Open notebook “6_Oozie2CDEAirflow.ipynb” and follow the instructions in the comments.
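Conceptually, each converted Oozie action becomes a CDEJobRunOperator task in a generated Airflow DAG. Below is a hypothetical sketch of that rendering step for a linear chain of jobs; the CDEJobRunOperator import path is the one used by CDE's embedded Airflow, and the notebook's actual converter handles real workflow control flow.

```python
DAG_TEMPLATE = '''from airflow import DAG
from dateutil import parser
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

dag = DAG(
    "{dag_id}",
    start_date=parser.isoparse("2021-01-01T00:00:00Z"),
    schedule_interval=None,
    catchup=False,
)

{tasks}

{chain}
'''

def render_airflow_dag(dag_id, cde_job_names):
    """Render a linear Airflow DAG with one CDEJobRunOperator per CDE job.

    Hypothetical sketch: assumes the CDE jobs already exist in the
    Virtual Cluster under the given names.
    """
    tasks = "\n".join(
        f't{i} = CDEJobRunOperator(task_id="{name}", job_name="{name}", dag=dag)'
        for i, name in enumerate(cde_job_names)
    )
    chain = " >> ".join(f"t{i}" for i in range(len(cde_job_names)))
    return DAG_TEMPLATE.format(dag_id=dag_id, tasks=tasks, chain=chain)

# Example: print(render_airflow_dag("oozie_wf", ["ingest_job", "transform_job"]))
```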

Part 7: Using the oozie2cde Module

Building on Notebooks 4 through 6, this notebook puts everything together in a module called "oozie2cde". The module lives in the "oozie2cde" folder and is easy to reuse on your local machine.

Open notebook “7_Using_Oozie2CDE.ipynb” and follow the instructions in the comments.

Conclusions

In this tutorial you have built a workflow using CML Sessions and CDE Jobs. The main takeaways are:

  • CML and CDE can both be used to build Spark ML and ETL Pipelines. Each has its advantages but you don’t have to choose.
  • CDP is foundationally built on SDX: CDE and CML datasets and workloads are all tracked by SDX.
  • This implies that any results from Spark Jobs running in CDE are available to CML users, and vice versa.
  • You can close the loop and switch back and forth between CML and CDE at will.
  • Of course, if you prefer not to share your datasets with other users, Ranger and SDX will allow you to impose such limitations.

Related Demos and Tutorials

If you are evaluating CML you may also benefit from testing the following demos:

  • Telco Churn Demo: Build an End to End ML Project in CML and Increase ML Explainability with the LIME Library
  • Learn how to use Cloudera Applied ML Prototypes to discover more projects using MLFlow, Streamlit, Tensorflow, PyTorch and many more popular libraries
  • CSA2CML: Build a real time anomaly detection dashboard with Flink, CML, and Streamlit
  • SDX2CDE: Explore ML Governance and Security features in SDX to increase legal compliance and enhance ML Ops best practices
  • API v2: Familiarize yourself with API v2, CML's go-to Python library for ML Ops and DevOps
  • MLOps: Explore a detailed ML Ops pipeline powered by Apache Iceberg

Next Steps

The pattern of using CML and CDE in tandem opens up possibilities for more advanced use cases.

  • CML Jobs can be used as a CI/CD management system in conjunction with any 3rd party API.
  • CML APIv2 has a rich set of methods allowing you to embed advanced ML-driven decision making as part of your DevOps process.
  • The CDE API is much richer than we've seen above. You can download logs, override Spark Job Configurations such as Physical Resources, and a lot more.
  • Thus, you can build automated Spark tuning and troubleshooting pipelines that move challenging Spark jobs into CDE for improved job visibility.
  • Apache Airflow is native in CDE. You can also deploy Airflow in CML Applications.
  • You can build and prototype Airflow DAGs in CML and push them to CDE in ways similar to the pipeline shown in this demo.
