Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
{
"metadata": {
"kernelspec": {
"name": "pyspark3kernel",
"display_name": "PySpark3"
},
"language_info": {
"name": "pyspark3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "code",
"source": "print(\"Hello World! \")\r\n\r\nimport sys\r\nprint(\"Python version \",sys.version)\r\n\r\n#Run some python in notebook\r\nnum = [i*i for i in range(0,20)]\r\nprint(\"My squared numbers \", num)\r\n",
"metadata": {
"language": "python"
},
"outputs": [
{
"name": "stdout",
"text": "Hello World! \nPython version 3.5.2 (default, Nov 12 2018, 13:43:14) \n[GCC 5.4.0 20160609]\nMy squared numbers [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]",
"output_type": "stream"
}
],
"execution_count": 1
}
]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
{
"metadata": {
"kernelspec": {
"name": "sparkkernel",
"display_name": "Spark | Scala"
},
"language_info": {
"name": "scala",
"mimetype": "text/x-scala",
"codemirror_mode": "text/x-scala",
"pygments_lexer": "scala"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "code",
"source": "object HelloWorld {\r\n  def main(args: Array[String]): Unit = {\r\n    println(\"Hello Spark Scala\")\r\n  }\r\n}\r\n",
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": "Starting Spark application\n",
"output_type": "stream"
},
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>3</td><td>application_1554316083160_0004</td><td>spark</td><td>idle</td><td><a target=\"_blank\" href=\"https://52.191.187.81:30443/gateway/default/yarn/proxy/application_1554316083160_0004/\">Link</a></td><td><a target=\"_blank\" href=\"http://mssql-storage-pool-default-1.service-storage-pool-default.ctp24.svc.cluster.local:8042/node/containerlogs/container_1554316083160_0004_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": "SparkSession available as 'spark'.\n",
"output_type": "stream"
},
{
"name": "stdout",
"text": "defined object HelloWorld\n",
"output_type": "stream"
}
],
"execution_count": 2
}
]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
{
"metadata": {
"kernelspec": {
"name": "sparkrkernel",
"display_name": "Spark | R"
},
"language_info": {
"name": "sparkR",
"mimetype": "text/x-rsrc",
"codemirror_mode": "text/x-rsrc",
"pygments_lexer": "r"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "code",
"source": "print(\"Hello SparkR\")\r\n\r\nhead(iris)\r\n",
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": "[1] \"Hello SparkR\"\n Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n1 5.1 3.5 1.4 0.2 setosa\n2 4.9 3.0 1.4 0.2 setosa\n3 4.7 3.2 1.3 0.2 setosa\n4 4.6 3.1 1.5 0.2 setosa\n5 5.0 3.6 1.4 0.2 setosa\n6 5.4 3.9 1.7 0.4 setosa",
"output_type": "stream"
}
],
"execution_count": 5
}
]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
{
"metadata": {
"kernelspec": {
"name": "pyspark3kernel",
"display_name": "PySpark3"
},
"language_info": {
"name": "pyspark3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large-scale ETL in Spark followed by writing the processed data to SQL Server. The following sample shows \r\n- reading an HDFS file, \r\n- doing some basic processing on it, and \r\n- writing the processed data to a SQL Server table.\r\n\r\nThis sample needs a database that already exists in SQL Server. Here we use the database name \"MyTestDatabase\", which can be created with the SQL statements below.\r\n\r\n``` sql\r\nCREATE DATABASE MyTestDatabase\r\nGO \r\n``` \r\n ",
"metadata": {}
},
{
"cell_type": "code",
"source": "\r\n#Read a file and then write it to the SQL table\r\ndatafile = \"/spark_data/AdultCensusIncome.csv\"\r\ndf = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)\r\ndf.show(5)\r\n",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
}
],
"execution_count": 8
},
{
"cell_type": "code",
"source": "\r\n#Process this data. Very simple data cleanup steps. Replacing \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in df.columns]\r\ndf = df.toDF(*columns_new)\r\ndf.show(5)\r\n\r\n",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
}
],
"execution_count": 9
},
{
"cell_type": "code",
"source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use built-in JDBC connector to write to SQL Server master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://mssql-master-pool-0.service-master-pool\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\n#Target table; the write and read cells both refer to it as dbtable\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n    df.write \\\r\n        .format(\"jdbc\") \\\r\n        .mode(\"overwrite\") \\\r\n        .option(\"url\", url) \\\r\n        .option(\"dbtable\", dbtable) \\\r\n        .option(\"user\", user) \\\r\n        .option(\"password\", password) \\\r\n        .save()\r\nexcept Exception as error:\r\n    #Catch broadly: JDBC failures are not raised as ValueError\r\n    print(\"JDBC Write failed\", error)\r\nelse:\r\n    print(\"JDBC Write done \")\r\n",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Use built-in JDBC connector to write to SQL Server master instance in Big data \nurl is jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;\nJDBC Write done"
}
],
"execution_count": 10
},
{
"cell_type": "code",
"source": "#Read to Spark from SQL table using JDBC\r\nprint(\"read data from SQL server table \")\r\njdbcDF = spark.read \\\r\n    .format(\"jdbc\") \\\r\n    .option(\"url\", url) \\\r\n    .option(\"dbtable\", dbtable) \\\r\n    .option(\"user\", user) \\\r\n    .option(\"password\", password) \\\r\n    .load()\r\n\r\njdbcDF.show(5)",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
}
],
"execution_count": 13
}
]
}
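Review note: the two string-handling steps in the last notebook (renaming columns and assembling the JDBC URL) can be checked without a Spark session. The sketch below is a plain-Python illustration only; the helper names `sanitize_columns` and `build_jdbc_url` are hypothetical and do not appear in the notebook, which inlines this logic. The server and database names are the ones the notebook uses.

```python
# Hypothetical helpers for illustration; the notebook inlines this logic
# instead of defining functions.

def sanitize_columns(columns):
    """Replace '-' with '_' in each column name, as the cleanup cell does."""
    return [col.replace("-", "_") for col in columns]


def build_jdbc_url(servername, dbname):
    """Assemble the SQL Server JDBC URL the same way the write cell does."""
    return servername + ";" + "databaseName=" + dbname + ";"


# A few of the column names from AdultCensusIncome.csv
columns_new = sanitize_columns(["age", "education-num", "hours-per-week"])

url = build_jdbc_url(
    "jdbc:sqlserver://mssql-master-pool-0.service-master-pool",
    "MyTestDatabase",
)

print(columns_new)
print("url is", url)
```

Keeping these steps as small functions would let the column cleanup and the URL format be unit-tested before the notebook is run against a cluster.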