diff --git a/samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb b/samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb index 2f1d39ac8b..bd9b8dedf6 100644 --- a/samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb +++ b/samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb @@ -16,76 +16,125 @@ "cells": [ { "cell_type": "markdown", - "source": "# Packaging in Spark\r\n", - "metadata": {} + "source": [ + "

\n", + "\n", + "

\n", + "\n", + "# **Spark Package Management in SQL Server 2019 Big Data Clusters**\n", + "This guide covers installing packages and submitting jobs to a SQL Server 2019 Big Data Cluster using Spark.\n", + "* Built-In Tools\n", + "* Install Packages from a Maven Repository onto the Spark Cluster at Runtime\n", + "* Import .jar from HDFS for use at runtime\n", + "* Import .jar at runtime through Azure Data Studio notebook cell configuration\n", + "* Install Python Packages at Runtime for use with PySpark \n", + "* Submit local .jar or python file\n", + "" + ], + "metadata": { + "azdata_cell_guid": "cbc8ced8-8931-4302-b252-7e7e478a16d4" + } }, { "cell_type": "markdown", - "source": "## Use Case 1: I can have key packages in boxed\r\n - All pacakges that come with spark and hadoop distribution\r\n - Python3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ml packages\r\n - R and supporting pacakges as part of MRO\r\n - sparklyr\r\n\r\n \r\n ", - "metadata": {} + "source": [ + "# Built-in Tools\n", + "* Spark and Hadoop base packages\n", + "* Python 3.5 and Python 2.7\n", + "* Pandas, Sklearn, Numpy, and other data processing packages.\n", + "* R and MRO packages\n", + "* Sparklyr\n", + "" + ], + "metadata": { + "azdata_cell_guid": "2fc8a069-115e-4d9b-bedc-5c55f79466b1" + } }, { "cell_type": "markdown", - "source": "## Use Case 2: I can install pacakges from maven repo to my spark cluster\r\nMaven central is a source of lot of packages. A lot of spark ecosystem pacakges are availble there. These pacakages can be installed to your spark cluster using notebook cell configuration at the start of your spark session.\r\n", - "metadata": {} - }, - { - "cell_type": "code", - "source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}", - "metadata": { - "language": "scala" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": "", - "text/html": "Current session configs: {'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}
" - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": "", - "text/html": "No active sessions." - }, - "metadata": {} - } + "source": [ + "# Install Packages from a Maven Repository onto the Spark Cluster at Runtime\r\n", + "Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your spark session. Before starting a spark session in Azure Data Studio, run the following code:\r\n", + "\r\n", + "```\r\n", + "%%configure -f` \\\r\n", + "{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}\r\n", + "```\r\n", + "" ], - "execution_count": 3 + "metadata": { + "azdata_cell_guid": "a0fecc05-f094-4dda-9afe-0de8ddad87eb" + } }, { - "cell_type": "code", - "source": "import com.microsoft.azure.eventhubs._", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "import com.microsoft.azure.eventhubs._\n" - } + "cell_type": "markdown", + "source": [ + "# Import .jar from HDFS for use at runtime\n", + "\n", + "Import jar at runtime through Azure Data Studio notebook cell configuration.\n", + "\n", + "```\n", + "%%configure -f\n", + "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n", + "```\n", + "" ], - "execution_count": 5 + "metadata": { + "azdata_cell_guid": "c5e65fa2-faf0-4e22-aac1-69d7ff8c9989" + } }, { "cell_type": "markdown", - "source": "## Use Case 3: I have a local jar that i want to run in the spark cluster\r\nAs a user you may build your own customer pacakges that want to run as part of your spark jobs. These pacakges can be uploaded as HDFS and using a notebook configuration spark can consume these pacakges in a jar.\r\n\r\n\r\n", - "metadata": {} + "source": [ + "# Import .jar at runtime through Azure Data Studio notebook cell configuration\n", + "\n", + "```\n", + "%%configure -f\n", + "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n", + "```\n", + "" + ], + "metadata": { + "azdata_cell_guid": "6fc4085f-e142-4355-b215-148dbf6c5b86" + } }, { - "cell_type": "code", - "source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}", - "metadata": {}, - "outputs": [], - "execution_count": 0 + "cell_type": "markdown", + "source": [ + "# Install Python Packages at Runtime for use with PySpark\n", + "\n", + "The following code can be used to install packages on each executor node at runtime. \\\n", + "**Note**: This installation is temporary, and must be performed each time a new Spark session is invoked.\n", + "\n", + "``` Python\n", + "import subprocess\n", + "\n", + "# Install TensorFlow\n", + "stdout = subprocess.check_output(\n", + " \"pip3 install tensorflow\",\n", + " stderr=subprocess.STDOUT,\n", + " shell=True).decode(\"utf-8\")\n", + "print(stdout)\n", + "```" + ], + "metadata": { + "azdata_cell_guid": "07944b55-7266-4fcd-8e9b-9fd6cb8cfef5" + } }, { - "cell_type": "code", - "source": "import com.my.mycodeJar._", - "metadata": {}, - "outputs": [], - "execution_count": 0 + "cell_type": "markdown", + "source": [ + "# Submit local .jar or python file\r\n", + "One of the key scenarios for big data clusters is the ability to submit Spark jobs for SQL Server. The Spark job submission feature allows you to submit a local Jar or Py files with references to SQL Server 2019 big data cluster. 
It also enables you to execute .jar or .py files that are already located in the HDFS file system.\r\n",
+            "\r\n",
+            "* [Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job?view=sqlallproducts-allversions)\r\n",
+            "* [Submit Spark jobs on SQL Server Big Data Clusters in IntelliJ](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-intellij-tool-plugin?view=sqlallproducts-allversions)\r\n",
+            "* [Submit Spark jobs on SQL Server Big Data Clusters in Visual Studio Code](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-hive-tools-vscode?view=sqlallproducts-allversions)\r\n",
+            ""
        ],
+        "metadata": {
+            "azdata_cell_guid": "7d1b55c0-1961-45f7-8449-a24a913106e4"
+        }
    }
]
}
\ No newline at end of file
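A quick way to confirm that a runtime `pip3 install` (such as the TensorFlow example in the notebook above) actually reached the executors is to run a small PySpark check in the same session. The following is a minimal, hypothetical sketch, not part of the sample notebook: it assumes an active PySpark session (the `spark` object predefined in an Azure Data Studio PySpark notebook) and that `tensorflow` was installed earlier in the session.

```python
from pyspark.sql import SparkSession

# In an Azure Data Studio PySpark notebook a session already exists;
# getOrCreate() simply reuses it. Names here are illustrative.
spark = SparkSession.builder.getOrCreate()

def report_tensorflow(partition):
    # Runs inside executor tasks, so the import below exercises the
    # executor's Python environment rather than the driver's.
    try:
        import tensorflow as tf
        yield "tensorflow " + tf.__version__
    except ImportError:
        yield "tensorflow not installed"

# Spread a few small partitions across the cluster and collect the reports.
results = (spark.sparkContext
                .parallelize(range(8), 8)
                .mapPartitions(report_tensorflow)
                .collect())
print(results)
```

If any partition reports the package as missing, the install cell needs to be rerun in the current session; as the note in the notebook states, runtime installs do not persist across Spark sessions.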