Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions samples/features/sql-big-data-cluster/spark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,11 @@ SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azu

[Data Loading - Transforming CSV to Parquet](data-loading/transform-csv-files.ipynb/)

[Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)
[Data Virtualization - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector_non_ad_pyspark.ipynb/)

[Data Transfer - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector_non_ad_pyspark.ipynb/)
[Data Virtualization - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)

[Data Virtualization - Spark with external Object Stores](data-virtualization/spark_external_object_store.ipynb/)

[Configure - Configure a spark session using a notebook](config-install/configure_spark_session.ipynb/)

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,315 @@
{
"metadata": {
"kernelspec": {
"name": "pysparkkernel",
"display_name": "PySpark",
"language": ""
},
"language_info": {
"name": "pyspark",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": [
"# Spark read and write on remote Object Stores\n",
"\n",
"By loading the right libraries, Spark is able to both read and write to external Object Stores in a distributed way.\n",
"\n",
"SQL Server Server Big Data Clusters ships libraries to access S3 and ADLS Gen2 protocols.\n",
"\n",
"Libraries are updated with each cumulative update, please make sure to list the available libraries. To list S3 protocol libraries use the following:\n",
"\n",
"```\n",
"kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c \"ls /opt/hadoop/share/hadoop/tools/lib/*aws*\"\n",
"```\n",
"\n",
"If your scenario requires a library either unavailable or version incompatible with what is shipped with Big Data Clusters, you have some options:\n",
"\n",
"1. Use a session based configure cell with dynamic library loading on Notebooks or Jobs.\n",
"2. Copy the additional libraries to a known HDFS on BDC and reference that at session configuration.\n",
"\n",
"These two scenarios are described in detail in the [Manage libraries](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-install-packages?view=sql-server-ver15) and [Submit Spark jobs by using command-line tools](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-command-line?view=sql-server-ver15) articles.\n",
"\n",
"## Step 1 - Configure access to the remote storage\n",
"\n",
"In this example we will access a remote S3 protocol object store. \n",
" \n",
"The example considers a [MinIO](https://min.io/) object store service, but would would work with other S3 protocol providers.\n",
"\n",
"Please check your S3 object store provider documentation to understand which libraries are required.\n",
"\n",
"With that information at hand, configure your notebook session or job to use the right library like the example bellow."
],
"metadata": {
"azdata_cell_guid": "32339638-a7ad-465b-b991-17c33798e5d5"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"%%configure -f \\\r\n",
"{\r\n",
" \"conf\": {\r\n",
" \"spark.driver.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
" \"spark.executor.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
" \"spark.hadoop.fs.s3a.buffer.dir\": \"/var/opt/yarnuser\"\r\n",
" }\r\n",
"}"
],
"metadata": {
"azdata_cell_guid": "64591678-3192-4005-9bff-edd8d05bd982"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"spark"
],
"metadata": {
"azdata_cell_guid": "6b343b63-8876-4e58-8058-d2f5798c50dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Step 2 - Add in access tokens to access the remote storage dynamically\r\n",
"\r\n",
"Follow your S3 provider security documentation to change the following cells to correctly configure Spark to connect to the endpoint."
],
"metadata": {
"azdata_cell_guid": "235198a1-5a97-404f-8067-929ed35d8af1"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"access_key=\"YOUR_ACCESS_KEY\"\r\n",
"secret=\"YOUR_SECRET\""
],
"metadata": {
"azdata_cell_guid": "9d2880f4-3e74-474e-a955-9b4206c2653b",
"tags": []
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", access_key)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", secret)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", \"YOUR_ENDPOINT\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"spark.hadoop.fs.s3a.buffer.dir\", \"/var/opt/yarnuser\") # Temp dir for writes back to S3\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
],
"metadata": {
"azdata_cell_guid": "97fb73a6-af3c-47d2-8bd5-13c9999135dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Spark read and write patterns\n",
"\n",
"Use the following examples to cover a range of read and write scenarios to remote object stores.\n",
"\n",
"### Read from external S3 and write to BDC HDFS as table"
],
"metadata": {
"azdata_cell_guid": "3d34444e-026e-44ea-ac79-24e19fe6b4a0"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"df = spark.read.csv(\"s3a://NYC-Cab/fhv_tripdata_2015-01.csv\", header=True)"
],
"metadata": {
"azdata_cell_guid": "273a3ff7-0f2e-41a3-a229-0d3eda1c95d6"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.count()"
],
"metadata": {
"azdata_cell_guid": "c966b240-2e2c-4bb3-9f22-1dacf491d72a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"/securelake/fhv_tripdata_2015-01\")"
],
"metadata": {
"azdata_cell_guid": "4bc6608d-da67-423e-8238-3ec186b7396a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE tripdata"
],
"metadata": {
"azdata_cell_guid": "1ea7a5ed-9567-4c21-932d-99b05af79c4f",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata\r\n",
"USING parquet\r\n",
"LOCATION '/securelake/fhv_tripdata_2015-01'"
],
"metadata": {
"azdata_cell_guid": "9c3c2ce9-1796-4d32-be3f-c522425306b9"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select count(*) from tripdata"
],
"metadata": {
"azdata_cell_guid": "9fa76a3b-60ff-4868-8184-956191d0d96c"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select * from tripdata limit 10"
],
"metadata": {
"azdata_cell_guid": "0a99df82-8a4f-479e-8681-b406883646ac"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Write back to S3 as parquet"
],
"metadata": {
"azdata_cell_guid": "9f540e0c-b18c-436d-9a88-e123382d9778"
}
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"s3a://NYC-Cab/fhv_tripdata_2015-01-3\")"
],
"metadata": {
"azdata_cell_guid": "5cdeaf5c-3c29-4fbf-97f9-110a803be1d0"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Create external table on S3\n",
"\n",
"This example virtualizes a folder on external object store as a Hive table."
],
"metadata": {
"azdata_cell_guid": "e734b5ae-1b6e-4209-abad-94f9226e9af6"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "3f5b308e-d528-4c39-afe1-e4924ce38db8",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata_s3\r\n",
"USING parquet\r\n",
"LOCATION 's3a://NYC-Cab/fhv_tripdata_2015-01-3'"
],
"metadata": {
"azdata_cell_guid": "1229cff9-7457-48c1-9f0a-dc12ae0ad031"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select count(*) from tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "ef4f6b6f-ac19-4704-9436-e1a590cf64fc"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select * from tripdata_s3 limit 10"
],
"metadata": {
"azdata_cell_guid": "1878f87c-ef41-4312-acb1-c97de021a817"
},
"outputs": [],
"execution_count": null
}
]
}
Binary file removed samples/manage/.DS_Store
Binary file not shown.
Binary file removed samples/manage/sql-assessment-api/.DS_Store
Binary file not shown.