"cells": [
{
"cell_type": "markdown",
"source": "# Packaging in Spark\r\n",
"metadata": {}
"source": [
"<p align=\"center\">\n",
"<img src =\"https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true\" width=\"250\" align=\"center\">\n",
"</p>\n",
"\n",
"# **Spark Package Management in SQL Server 2019 Big Data Clusters**\n",
"This guide covers installing packages and submitting jobs to a SQL Server 2019 Big Data Cluster using Spark.\n",
"* Built-In Tools\n",
"* Install Packages from a Maven Repository onto the Spark Cluster at Runtime\n",
"* Import .jar from HDFS for use at runtime\n",
"* Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
"* Install Python Packages at Runtime for use with PySpark \n",
"* Submit local .jar or python file\n",
"<!-- <span style=\"color:red\"><font size=\"3\">Please press the \"Run Cells\" button to run the notebook</font></span> -->"
],
"metadata": {
"azdata_cell_guid": "cbc8ced8-8931-4302-b252-7e7e478a16d4"
}
},
{
"cell_type": "markdown",
"source": "## Use Case 1: I can have key packages in boxed\r\n - All pacakges that come with spark and hadoop distribution\r\n - Python3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ml packages\r\n - R and supporting pacakges as part of MRO\r\n - sparklyr\r\n\r\n \r\n ",
"metadata": {}
"source": [
"# Built-in Tools\n",
"* Spark and Hadoop base packages\n",
"* Python 3.5 and Python 2.7\n",
"* Pandas, Sklearn, Numpy, and other data processing packages.\n",
"* R and MRO packages\n",
"* Sparklyr\n",
""
],
"metadata": {
"azdata_cell_guid": "2fc8a069-115e-4d9b-bedc-5c55f79466b1"
}
},
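{
"cell_type": "markdown",
"source": [
"As a quick sanity check (a sketch, not part of the original guide), the cell below reports the active Python version and which of the bundled data processing packages are importable in the current session:\n",
"\n",
"```python\n",
"import importlib.util\n",
"import sys\n",
"\n",
"# Report the interpreter the session is using (3.5 or 2.7 on the cluster).\n",
"print(\"Python {}.{}\".format(*sys.version_info[:2]))\n",
"\n",
"# Check which of the bundled data processing packages are importable here.\n",
"for pkg in [\"pandas\", \"numpy\", \"sklearn\"]:\n",
"    found = importlib.util.find_spec(pkg) is not None\n",
"    print(pkg, \"available\" if found else \"missing\")\n",
"```\n",
""
],
"metadata": {}
},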
{
"cell_type": "markdown",
"source": "## Use Case 2: I can install pacakges from maven repo to my spark cluster\r\nMaven central is a source of lot of packages. A lot of spark ecosystem pacakges are availble there. These pacakages can be installed to your spark cluster using notebook cell configuration at the start of your spark session.\r\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}",
"metadata": {
"language": "scala"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "Current session configs: <tt>{'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}</tt><br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "No active sessions."
},
"metadata": {}
}
"source": [
"# Install Packages from a Maven Repository onto the Spark Cluster at Runtime\r\n",
"Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your spark session. Before starting a spark session in Azure Data Studio, run the following code:\r\n",
"\r\n",
"```\r\n",
"%%configure -f` \\\r\n",
"{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}\r\n",
"```\r\n",
""
],
"execution_count": 3
"metadata": {
"azdata_cell_guid": "a0fecc05-f094-4dda-9afe-0de8ddad87eb"
}
},
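{
"cell_type": "markdown",
"source": [
"Multiple Maven coordinates can be supplied as a comma-separated list in the same setting. The second coordinate below is only a placeholder showing the `groupId:artifactId:version` syntax, not a real package:\r\n",
"\r\n",
"```\r\n",
"%%configure -f\r\n",
"{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1,groupId:artifactId:version\"}}\r\n",
"```\r\n",
""
],
"metadata": {}
},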
{
"cell_type": "code",
"source": "import com.microsoft.azure.eventhubs._",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "import com.microsoft.azure.eventhubs._\n"
}
"cell_type": "markdown",
"source": [
"# Import .jar from HDFS for use at runtime\n",
"\n",
"Import jar at runtime through Azure Data Studio notebook cell configuration.\n",
"\n",
"```\n",
"%%configure -f\n",
"{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
"```\n",
""
],
"execution_count": 5
"metadata": {
"azdata_cell_guid": "c5e65fa2-faf0-4e22-aac1-69d7ff8c9989"
}
},
{
"cell_type": "markdown",
"source": "## Use Case 3: I have a local jar that i want to run in the spark cluster\r\nAs a user you may build your own customer pacakges that want to run as part of your spark jobs. These pacakges can be uploaded as HDFS and using a notebook configuration spark can consume these pacakges in a jar.\r\n\r\n\r\n",
"metadata": {}
"source": [
"# Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
"\n",
"```\n",
"%%configure -f\n",
"{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
"```\n",
""
],
"metadata": {
"azdata_cell_guid": "6fc4085f-e142-4355-b215-148dbf6c5b86"
}
},
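{
"cell_type": "markdown",
"source": [
"The two settings can also be combined in a single session configuration. This sketch assumes the same example path and Maven coordinate used above:\n",
"\n",
"```\n",
"%%configure -f\n",
"{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\",\n",
"          \"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}\n",
"```\n",
""
],
"metadata": {}
},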
{
"cell_type": "code",
"source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}",
"metadata": {},
"outputs": [],
"execution_count": 0
"cell_type": "markdown",
"source": [
"# Install Python Packages at Runtime for use with PySpark\n",
"\n",
"The following code can be used to install packages on each executor node at runtime. \\\n",
"**Note**: This installation is temporary, and must be performed each time a new Spark session is invoked.\n",
"\n",
"``` Python\n",
"import subprocess\n",
"\n",
"# Install TensorFlow\n",
"stdout = subprocess.check_output(\n",
" \"pip3 install tensorflow\",\n",
" stderr=subprocess.STDOUT,\n",
" shell=True).decode(\"utf-8\")\n",
"print(stdout)\n",
"```"
],
"metadata": {
"azdata_cell_guid": "07944b55-7266-4fcd-8e9b-9fd6cb8cfef5"
}
},
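{
"cell_type": "markdown",
"source": [
"Because the installation must be repeated for every new session, it can help to guard it with an import check. The helper below is a sketch (`ensure_package` is not part of the cluster tooling); it only invokes pip3 when the package is missing:\n",
"\n",
"```python\n",
"import importlib.util\n",
"import subprocess\n",
"\n",
"def ensure_package(module_name, pip_name=None):\n",
"    \"\"\"Install a package with pip3 only if it is not already importable.\"\"\"\n",
"    if importlib.util.find_spec(module_name) is None:\n",
"        subprocess.check_output(\n",
"            \"pip3 install \" + (pip_name or module_name),\n",
"            stderr=subprocess.STDOUT,\n",
"            shell=True)\n",
"    return importlib.util.find_spec(module_name) is not None\n",
"\n",
"# 'json' ships with Python, so no install is triggered by this call.\n",
"print(ensure_package(\"json\"))\n",
"```\n",
""
],
"metadata": {}
},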
{
"cell_type": "code",
"source": "import com.my.mycodeJar._",
"metadata": {},
"outputs": [],
"execution_count": 0
"cell_type": "markdown",
"source": [
"# Submit local .jar or python file\r\n",
"One of the key scenarios for big data clusters is the ability to submit Spark jobs for SQL Server. The Spark job submission feature allows you to submit a local Jar or Py files with references to SQL Server 2019 big data cluster. It also enables you to execute a Jar or Py files, which are already located in the HDFS file system.\r\n",
"\r\n",
"* [Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job?view=sqlallproducts-allversions)\r\n",
"* [Submit Spark jobs on SQL Server Big Data Clusters in IntelliJ](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-intellij-tool-plugin?view=sqlallproducts-allversions)\r\n",
"* [Submit Spark jobs on SQL Server big data cluster in Visual Studio Code](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-hive-tools-vscode?view=sqlallproducts-allversions)\r\n",
""
],
"metadata": {
"azdata_cell_guid": "7d1b55c0-1961-45f7-8449-a24a913106e4"
}
}
]
}