18 changes: 11 additions & 7 deletions samples/features/sql-big-data-cluster/spark/README.md
@@ -2,7 +2,7 @@

A SQL Server Big Data Cluster bundles Spark and HDFS together with SQL Server. The Azure Data Studio IDE provides built-in notebooks that enable data scientists and data engineers to run Spark notebooks and jobs in Python, R, or Scala against the Big Data Cluster. This folder contains sample Spark notebooks for using Spark in a SQL Server Big Data Cluster.

-## Folder contents
+## Contents

[PySpark Hello World](dataloading/hello_PySpark.ipynb)

@@ -14,14 +14,18 @@ SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azu

[Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)

-[Data Transfer - Spark to SQL using MSSQL Spark connector](spark_to_sql/mssql_spark_connector.ipynb/)
+[Data Transfer - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector.ipynb/)

-## Instructions on how to run in Azure Data Studio
+[Configure - Configure a Spark session using a notebook](config-install/configure_spark_session.ipynb/)

+[Install - Install 3rd party packages](config-install/installpackage_Spark.ipynb/)

-[data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
+[RESTful Access - Access Spark in BDC via the RESTful Livy APIs](restful-api-accessn/accessing_spark_via_livy.ipynb/)

+## Instructions on how to run in Azure Data Studio

-2. From Azure Data Studio Connect to the SQL Server Master instance in a big data cluster.
+1. From Azure Data Studio, connect to the SQL Server master instance in a big data cluster.

-3. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and open the notebook in Azure Data Studio. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required set the relevant “Kernel” ( e.g **PySpark3** ) and **Attach to** needs to be the IP address of your big data cluster endpoint.
+2. Right-click the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and open the notebook in Azure Data Studio. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required, set the relevant “Kernel” (e.g. **PySpark3**), and set **Attach to** to the IP address of your big data cluster endpoint.

-4. Run each cell in the Notebook sequentially.
+3. Run each cell in the notebook sequentially.
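
For the RESTful Livy entry added above, here is a minimal sketch of what a raw Livy call looks like from outside Azure Data Studio. The gateway URL path, credentials, and `verify=False` are illustrative assumptions, not values from this PR; the linked notebook is the supported walkthrough.

```python
# Hypothetical example: drive a BDC Spark session over the Livy REST API.
import json
import requests

# Assumed endpoint shape for a Big Data Cluster gateway; substitute your own.
LIVY = "https://<gateway-ip>:30443/gateway/default/livy/v1"
AUTH = ("<knox-user>", "<password>")  # placeholder credentials
HEADERS = {"Content-Type": "application/json"}

# Create a PySpark session.
session = requests.post(
    f"{LIVY}/sessions",
    auth=AUTH,
    headers=HEADERS,
    data=json.dumps({"kind": "pyspark"}),
    verify=False,  # self-signed gateway cert; use a proper CA bundle in practice
).json()

# Run one statement in that session (poll the statement until it completes).
stmt = requests.post(
    f"{LIVY}/sessions/{session['id']}/statements",
    auth=AUTH,
    headers=HEADERS,
    data=json.dumps({"code": "print(spark.version)"}),
    verify=False,
).json()
print(stmt)
```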
94 changes: 94 additions & 0 deletions samples/features/sql-big-data-cluster/spark/config-install/configure_spark_session.ipynb
@@ -0,0 +1,94 @@
{
"metadata": {
"kernelspec": {
"name": "pyspark3kernel",
"display_name": "PySpark3"
},
"language_info": {
"name": "pyspark3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": "# Configuring a Spark session using configure-f\r\nRefer to [Spark Configurations](https://spark.apache.org/docs/latest/configuration.html) for specific parameters",
"metadata": {}
},
{
"cell_type": "code",
"source": "%%configure -f\r\n{\"conf\": {\r\n \"spark.executor.memory\": \"4g\",\r\n \"spark.driver.memory\": \"4g\",\r\n \"spark.executor.cores\": 2,\r\n \"spark.driver.cores\": 1,\r\n \"spark.executor.instances\": 4\r\n }\r\n}",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "Current session configs: <tt>{'conf': {'spark.executor.memory': '4g', 'spark.driver.memory': '4g', 'spark.executor.cores': 2, 'spark.driver.cores': 1, 'spark.executor.instances': 4}, 'kind': 'pyspark3'}</tt><br>"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>93</td><td>application_1558765999724_0190</td><td>pyspark</td><td>idle</td><td><a target=\"_blank\" href=\"https://10.193.16.144:30443/gateway/default/yarn/proxy/application_1558765999724_0190/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.test.svc.cluster.local:8042/node/containerlogs/container_1558765999724_0190_01_000001/root\">Link</a></td><td></td></tr></table>"
},
"metadata": {},
"output_type": "display_data"
}
],
"execution_count": 3
},
{
"cell_type": "code",
"source": "datafile = \"/spark_data/AdultCensusIncome.csv\"\r\ndf = spark.read.format('csv').options(header='true', inferSchema='true').load(datafile)\r\n\r\ndf.show(5)",
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": "Starting Spark application\n",
"output_type": "stream"
},
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>96</td><td>application_1558765999724_0193</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"https://10.193.16.144:30443/gateway/default/yarn/proxy/application_1558765999724_0193/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-0.storage-0-svc.test.svc.cluster.local:8042/node/containerlogs/container_1558765999724_0193_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": "SparkSession available as 'spark'.\n",
"output_type": "stream"
},
{
"name": "stdout",
"text": "+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\n|age| workclass| fnlwgt| education| education-num| marital-status| occupation| relationship| race| sex| capital-gain| capital-loss| hours-per-week| native-country| income|\n+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\n| 39| State-gov| 77516.0| Bachelors| 13.0| Never-married| Adm-clerical| Not-in-family| White| Male| 2174.0| 0.0| 40.0| United-States| <=50K|\n| 50| Self-emp-not-inc| 83311.0| Bachelors| 13.0| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0.0| 0.0| 13.0| United-States| <=50K|\n| 38| Private|215646.0| HS-grad| 9.0| Divorced| Handlers-cleaners| Not-in-family| White| Male| 0.0| 0.0| 40.0| United-States| <=50K|\n| 53| Private|234721.0| 11th| 7.0| Married-civ-spouse| Handlers-cleaners| Husband| Black| Male| 0.0| 0.0| 40.0| United-States| <=50K|\n| 28| Private|338409.0| Bachelors| 13.0| Married-civ-spouse| Prof-specialty| Wife| Black| Female| 0.0| 0.0| 40.0| Cuba| <=50K|\n+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\nonly showing top 5 rows",
"output_type": "stream"
}
],
"execution_count": 4
},
{
"cell_type": "code",
"source": "from pyspark import SparkConf\r\nfrom pyspark.sql import SparkSession\r\n\r\ndef isConfiguredItem(cfg_items):\r\n if(cfg_items == 'spark.executor.instances' or cfg_items == 'spark.executor.memory' or \\\r\n cfg_items == 'spark.executor.cores' or cfg_items == 'spark.driver.memory' or \\\r\n cfg_items == 'spark.driver.cores'):\r\n return True\r\n\r\nspark = SparkSession.builder.getOrCreate()\r\nconf = SparkConf().getAll()\r\n\r\nfor cfg_items in conf:\r\n if(isConfiguredItem(cfg_items[0])):\r\n print(cfg_items)\r\n\r\n",
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": "('spark.executor.instances', '4')\n('spark.driver.memory', '4g')\n('spark.driver.cores', '1')\n('spark.executor.memory', '4g')\n('spark.executor.cores', '2')",
"output_type": "stream"
}
],
"execution_count": 21
}
]
}
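
The `%%configure -f` cell above applies those resource settings when Livy creates the session. Outside a notebook, the same settings can be passed through the standard `SparkSession.builder.config` API. A minimal sketch (the app name is made up; executor and driver sizing must be set before the session starts):

```python
from pyspark.sql import SparkSession

# Same resource settings as the %%configure cell, applied via the builder.
spark = (
    SparkSession.builder
    .appName("configured-session")  # illustrative name
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.driver.cores", "1")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Verify the settings took effect, mirroring the notebook's check.
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith(("spark.executor.", "spark.driver.")):
        print((key, value))
```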
91 changes: 91 additions & 0 deletions samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb
@@ -0,0 +1,91 @@
{
"metadata": {
"kernelspec": {
"name": "sparkkernel",
"display_name": "Spark | Scala"
},
"language_info": {
"name": "scala",
"mimetype": "text/x-scala",
"codemirror_mode": "text/x-scala",
"pygments_lexer": "scala"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": "# Packaging in Spark\r\n",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Use Case 1: I can have key packages in boxed\r\n - All pacakges that come with spark and hadoop distribution\r\n - Python3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ml packages\r\n - R and supporting pacakges as part of MRO\r\n - sparklyr\r\n\r\n \r\n ",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Use Case 2: I can install pacakges from maven repo to my spark cluster\r\nMaven central is a source of lot of packages. A lot of spark ecosystem pacakges are availble there. These pacakages can be installed to your spark cluster using notebook cell configuration at the start of your spark session.\r\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}",
"metadata": {
"language": "scala"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "Current session configs: <tt>{'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}</tt><br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "No active sessions."
},
"metadata": {}
}
],
"execution_count": 3
},
{
"cell_type": "code",
"source": "import com.microsoft.azure.eventhubs._",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "import com.microsoft.azure.eventhubs._\n"
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"source": "## Use Case 3: I have a local jar that i want to run in the spark cluster\r\nAs a user you may build your own customer pacakges that want to run as part of your spark jobs. These pacakges can be uploaded as HDFS and using a notebook configuration spark can consume these pacakges in a jar.\r\n\r\n\r\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}",
"metadata": {},
"outputs": [],
"execution_count": 0
},
{
"cell_type": "code",
"source": "import com.my.mycodeJar._",
"metadata": {},
"outputs": [],
"execution_count": 0
}
]
}
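
Outside a notebook, the two mechanisms this notebook demonstrates map onto the standard `spark.jars.packages` (Maven coordinates) and `spark.jars` (local or HDFS jars) settings. A sketch reusing the coordinate and jar path from the cells above:

```python
from pyspark.sql import SparkSession

# Pull a package from Maven Central and add a prebuilt jar; both must be set
# before the session starts so driver and executors share the classpath.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1")
    .config("spark.jars", "/jar/mycodeJar.jar")  # jar path from the notebook
    .getOrCreate()
)
```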