Merged
@@ -16,18 +16,23 @@
{
"cell_type": "markdown",
"source": [
"![Microsoft](https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft-small-logo.png)\r\n",
"<p align=\"center\">\r\n",
"<img src =\"https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true\" width=\"250\" align=\"center\">\r\n",
"</p>\r\n",
"\r\n",
"# **Twitter Streaming with SQL Server & Spark**\r\n",
"\r\n",
"In this notebook, we will go through the process of using Spark to stream tweets from the Twitter API, and then stream the resulting data into the SQL Server data pool. Once the data is in the data pool, we will perform queries on it using T-SQL or the Spark-SQL connector. \r\n",
"\r\n",
"## **Steps**\r\n",
"1. [Create a Twitter Developer Account](https://developer.twitter.com/en/apply-for-access.html).\r\n",
"2. Setup\r\n",
" 1. Create 'TwitterData' database and retrieve server hostname.\r\n",
" 2. Change kernel from \"SQL\" to \"Spark | Scala\".\r\n",
" 3. Import packages.\r\n",
" 4. Enter required parameters.\r\n",
" 1. Create 'TwitterData' database.\r\n",
" 2. Create an External Data Source 'TweetsDataSource'.\r\n",
" 3. Create an External Table 'Tweets'.\r\n",
" 4. Change kernel from \"SQL\" to \"Spark | Scala\".\r\n",
" 5. Import packages.\r\n",
" 6. Enter required parameters.\r\n",
"3. Define and create a TwitterStream object.\r\n",
"4. Start the TwitterStream.\r\n",
"5. Validate streaming data.\r\n",
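The "Create 'TwitterData' database" cell from step 2.1 is collapsed out of this diff. As a hedged sketch only (the real cell may differ), an idempotent version of that step would look like the other setup cells in this PR:

```sql
-- Hypothetical sketch of step 2.1; the actual cell is not shown in this diff.
-- Create the TwitterData database in the master instance only if it is missing.
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'TwitterData')
    CREATE DATABASE TwitterData;
GO
```

The `IF NOT EXISTS` guard matches the pattern the PR uses for the external data source and external table cells, so re-running the notebook does not fail on the second pass.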
@@ -53,10 +58,12 @@
"cell_type": "markdown",
"source": [
"## **2. Setup**\n",
"1. Create a database in the SQL Server master instance called 'TwitterData', and retrieve server hostname. \n",
"2. Change the Kernel from \"SQL\" to \"Spark | Scala\".\n",
"3. Import Java packages.\n",
"4. Specify setup parameters"
"1. Create a database in the SQL Server master instance named 'TwitterData'.\n",
"2. Create an External Data Source to the Data Pool named 'TweetsDataSource'.\n",
"3. Create an External Table in the Data Pool named 'Tweets'.\n",
"4. Change the Kernel from \"SQL\" to \"Spark | Scala\".\n",
"5. Import Java packages.\n",
"6. Specify setup parameters"
],
"metadata": {
"azdata_cell_guid": "514963d4-c9eb-42a7-bd81-c6735f79d647"
@@ -112,7 +119,59 @@
{
"cell_type": "markdown",
"source": [
"### **2.2 Change the kernel from \"SQL\" to \"Spark | Scala\"**\n",
"### **2.2 Create External Data Source 'TweetsDataSource'**"
],
"metadata": {
"azdata_cell_guid": "03542af4-1e39-4049-a982-a44fce4cebd4"
}
},
{
"cell_type": "code",
"source": [
"USE TwitterData\n",
"GO\n",
"\n",
"IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'TweetsDataSource')\n",
" CREATE EXTERNAL DATA SOURCE TweetsDataSource\n",
" WITH (LOCATION = 'sqldatapool://controller-svc/default');"
],
"metadata": {
"azdata_cell_guid": "b01e9faf-d701-4a5e-95a3-7afb66b1249b"
},
"outputs": [],
"execution_count": 0
},
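To confirm the cell above took effect, the external data source can be inspected through the `sys.external_data_sources` catalog view. This check is not part of the original notebook, just a sketch:

```sql
-- Optional sanity check (not in the original notebook): verify the
-- external data source exists and points at the data pool endpoint.
SELECT name, location
FROM sys.external_data_sources
WHERE name = 'TweetsDataSource';
```

The `location` column should show the `sqldatapool://controller-svc/default` value used in the `CREATE EXTERNAL DATA SOURCE` statement.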
{
"cell_type": "markdown",
"source": [
"### **2.3 Create External Table 'Tweets'**"
],
"metadata": {
"azdata_cell_guid": "a2576ce9-bd62-4138-937c-f5ccdfe0834e"
}
},
{
"cell_type": "code",
"source": [
"IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'Tweets')\n",
" CREATE EXTERNAL TABLE [Tweets]\n",
"(\"screen_name\" NVARCHAR(MAX), \"createdAt\" DATETIME, \"num_followers\" BIGINT, \"text\" NVARCHAR(MAX))\n",
" WITH\n",
" (\n",
" DATA_SOURCE = TweetsDataSource,\n",
" DISTRIBUTION = ROUND_ROBIN\n",
" );"
],
"metadata": {
"azdata_cell_guid": "e80447c6-92a8-459f-aa17-517b89bd5fed"
},
"outputs": [],
"execution_count": 0
},
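Once the Spark streaming job in the later steps starts writing into the data pool, the external table defined above can be queried from the master instance with plain T-SQL. A sketch of such a spot check (not part of the original notebook):

```sql
-- Optional preview (not in the original notebook): show the ten most
-- recent tweets streamed into the data pool via the external table.
SELECT TOP 10 "screen_name", "createdAt", "num_followers", "text"
FROM Tweets
ORDER BY "createdAt" DESC;
```

`ROUND_ROBIN` distribution spreads rows evenly across the data pool instances, which suits append-only streaming ingest where no join key dominates.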
{
"cell_type": "markdown",
"source": [
"### **2.4 Change the kernel from \"SQL\" to \"Spark | Scala\"**\n",
"At the top of the editor, click the Kernel dropdown menu and change the kernel from \"SQL\" to \"Spark | Scala\". This will update the notebook language, and allow you to proceed with the next steps."
],
"metadata": {
@@ -122,7 +181,7 @@
{
"cell_type": "markdown",
"source": [
"### **2.3 Import packages**"
"### **2.5 Import packages**"
],
"metadata": {
"azdata_cell_guid": "04406211-4b11-4be8-b0da-e8ade7e6bdfc"
@@ -157,7 +216,7 @@
{
"cell_type": "markdown",
"source": [
"### **2.4 Parameters**\r\n",
"### **2.6 Parameters**\r\n",
"Enter the required parameters for the Spark streaming job to connect to SQL Server.\r\n",
"\r\n",
"In this example, the connection is made from Spark to the SQL Server master instance using the internal DNS name (Ex: master-0.master-svc) and port (1433). Alternatively, and especially if you are using a highly available Always On Availability Group, you can connect to the Kubernetes service that exposes the primary node of the Always On Availability Group.\r\n",
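As a side note on the internal DNS name discussed above: the PR removes the old "retrieve server hostname" step, but the name the master instance reports about itself can still be cross-checked with one line of T-SQL. This is an aside, not part of the notebook:

```sql
-- Returns the name the SQL Server master instance knows itself by,
-- e.g. the pod hostname in a Big Data Cluster deployment.
SELECT @@SERVERNAME AS server_name;
```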