
October 2021 Update #100

Merged
merged 20 commits into main from issue99 on Nov 18, 2021

Conversation

feaselkl
Collaborator

Enhancements based on #99

These changes will also allow us to close #97, #96, #79, and #1 as fixed.

feaselkl and others added 5 commits October 16, 2021 19:58
- Created ARM template for non-optional resources.
- Updated whiteboard design session based on upcoming lab changes.
- Update lab notebooks to support Databricks REST API for scoring
- Remove Azure ML requirements and comments
- Update UI images and simplify exercise flow
@DawnmarieDesJardins changed the title from Issue99 to October 2021 Update on Oct 29, 2021
@OrrinEdenfield

I like the change to use an ARM template for the pre-lab setup. I'd suggest hosting this ARM template in the repo and allowing users to click to deploy from the GitHub Markdown page using the Custom Deployment capability of the Azure Portal. Parameterize everything so this becomes easy for the user.
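
Roughly speaking, the click-to-deploy link is just the raw template URL, URL-encoded into the portal's Custom Deployment deep link; a quick sketch in Python (the template path is hypothetical, assuming it lands in the main branch):

```python
from urllib.parse import quote

# Hypothetical raw URL of the ARM template once it is merged to main.
template_url = (
    "https://raw.githubusercontent.com/microsoft/"
    "MCW-Big-data-and-visualization/main/azuredeploy.json"
)

# The Azure Portal "Custom deployment" deep link takes the URL-encoded
# template location after #create/Microsoft.Template/uri/.
deploy_link = (
    "https://portal.azure.com/#create/Microsoft.Template/uri/"
    + quote(template_url, safe="")
)

# Markdown for a "Deploy to Azure" button on the lab's Markdown page.
print(f"[![Deploy to Azure](https://aka.ms/deploytoazurebutton)]({deploy_link})")
```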

Alternatively, the non-ARM deployment route forces the user to gain experience setting up these services (at least in the portal). If the user goes that route, then a step-by-step guide using an Azure CLI script would also be a terrific addition.

@OrrinEdenfield

I'd suggest using Synapse Analytics, which would simplify the solution:

- ADF becomes Synapse Pipelines
- Blob Storage becomes ADLS Gen2
- Azure Databricks becomes Synapse Spark (though I would recommend keeping Azure ML and not replacing it with Azure Databricks; if there is a specific scenario that can't be done using AML, then possibly keep Azure Databricks)
- Azure SQL DB becomes Synapse SQL Serverless

I don't see a need for ADF over Synapse Pipelines (we're not running SSIS packages, for example). Using Synapse also makes it much easier to import sample tables (I'm not sure if these specific datasets are available) via the Knowledge Center > Datasets, which integrates with the Azure Open Datasets catalog.

@OrrinEdenfield

I recommend making sure all the Spark code is stored in a non-proprietary format like *.ipynb and not .dbc. Even using .py would be better. This will ensure code can be used in other environments.

In addition to that, please ensure the code in the notebooks also works on Synapse Spark clusters (i.e., no Databricks-specific code).
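
As a rough illustration of what "no Databricks-specific code" means in practice, a sketch of portable PySpark that avoids display(), dbutils, and /mnt mount paths (the storage path and column names are made up):

```python
# Portable PySpark: runs on Databricks or Synapse Spark without changes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read via an abfss:// URI rather than a Databricks-mounted /mnt path
# (mounts are a Databricks-specific concept).
flights = spark.read.option("header", "true").csv(
    "abfss://data@examplestorage.dfs.core.windows.net/flights/"
)

delays = (
    flights.groupBy("OriginAirportCode")
           .agg(F.avg(F.col("DepDelay").cast("double")).alias("AvgDepDelay"))
)

# Use show() instead of the Databricks-only display() helper.
delays.show(10, truncate=False)
```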

@OrrinEdenfield

Is there a reason we're downloading data from the internet and then using the SHIR/VM to integrate the data? Why not just host the data in the GitHub repo and then build a pipeline to retrieve it directly?

I've seen so many customers take more complex approaches like the one described in this lab instead of the more direct, straightforward route of pulling the data from the internet and staging it directly in ADLS Gen2. Downloading the data only to then upload it makes no sense.

If the purpose of this part of the lab is to help the user learn about the SHIR, then perhaps a better way would be to generate data locally rather than download it. This would better simulate a real-world experience. The user could be given instructions on how to generate data locally in Excel, or some kind of simple data generator could be provided (like a simulated temperature sensor or something).
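
A minimal sketch of the kind of generator I mean, a simulated temperature sensor writing a local CSV for the SHIR to pick up (the file name and schema are made up):

```python
# Simulated temperature sensor: writes a local CSV for the SHIR to ingest.
import csv
import random
from datetime import datetime, timedelta

random.seed(42)  # a fixed seed keeps learner output close to lab screenshots

start = datetime(2021, 10, 1)
with open("sensor_readings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SensorId", "ReadingTime", "TemperatureC"])
    for minute in range(1440):  # one day of per-minute readings
        for sensor_id in range(1, 6):
            reading_time = start + timedelta(minutes=minute)
            temperature = round(random.gauss(21.0, 2.5), 2)
            writer.writerow([sensor_id, reading_time.isoformat(), temperature])
```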

@OrrinEdenfield

All data users in Azure should use ADLS Gen2 and not Blob Storage. Blob Storage causes issues down the road most of the time (especially in ML scenarios, due to its lack of a filesystem: no empty directories).
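
To illustrate the filesystem point, a small sketch using the azure-storage-file-datalake SDK, where an empty directory is a first-class object (the account, container, and path are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute a real storage account and container name.
service = DataLakeServiceClient(
    account_url="https://examplestorage.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("data")

# Creating an empty directory is a first-class operation with the
# hierarchical namespace; flat Blob Storage has no equivalent.
filesystem.create_directory("raw/sensor-data/2021/10")
```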

@OrrinEdenfield

Power BI connecting to Azure Databricks becomes expensive. I'd recommend having the Spark process write output data to ADLS Gen2 as Apache Parquet files and then using a Synapse SQL Serverless table as the source for Power BI. Especially for a lab, this will be much less costly, and it lays a better foundation for the user by not using Azure Databricks as a data source for reporting.
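
A rough sketch of the Spark half of that suggestion, writing the scored output as Parquet to ADLS Gen2 so a serverless SQL table or view over the folder can serve Power BI (paths and columns are placeholders):

```python
# Persist scored output as Parquet in ADLS Gen2 so a Synapse SQL Serverless
# table/view over this folder (rather than a running Databricks cluster)
# can be the Power BI source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

scored = spark.read.parquet(
    "abfss://data@examplestorage.dfs.core.windows.net/staging/scored/"
)

(scored.write
       .mode("overwrite")
       .partitionBy("FlightDate")   # hypothetical partition column
       .parquet("abfss://data@examplestorage.dfs.core.windows.net/curated/flight-delays/"))
```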

@feaselkl
Collaborator Author

feaselkl commented Nov 4, 2021

Thanks for your feedback, @OrrinEdenfield! I appreciate the time you've put into this and wanted to respond to each element in turn.

> I like the change to use an ARM template for the pre-lab setup. I'd suggest hosting this ARM template in the repo and allowing users to click to deploy from the GitHub Markdown page using the Custom Deployment capability of the Azure Portal. Parameterize everything so this becomes easy for the user.

It will be hosted in the repo with a click-to-deploy option. But the PR needs to go out before click-to-deploy can work because it relies on the ARM template already being in the main branch. So the short answer is, we'll do exactly that.

> I'd suggest using Synapse Analytics, which would simplify the solution.

This was discussed in an SME meeting in which the answer was to stay on Databricks, with a recommendation to create a separate Synapse-related workshop in the future. I think there's a lot of value in a Synapse + AI workshop, but rewriting this one to become a Synapse-based workshop wasn't in budget.

> I recommend making sure all the Spark code is stored in a non-proprietary format like *.ipynb and not .dbc. Even using .py would be better. This will ensure code can be used in other environments.

They're in .dbc format because we have several folders set up and want to keep the instructions as uncomplicated as possible. I should be able to change this to a .zip file of .ipynb notebooks but will need to confirm that the behavior is the same. I'll give that a try and incorporate it as long as there's no big problem with it. At the very least, doing this will allow learners to unpack the zip file and review the contents locally.
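
For what it's worth, a sketch of one way to produce the .ipynb copies via the Databricks Workspace API (the host, token, and folder are placeholders, assuming the notebooks live in a /BigDataVis workspace folder):

```python
# Export each notebook under a workspace folder as a Jupyter .ipynb file,
# using the Databricks Workspace API.
import base64
import os
import requests

HOST = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder workspace URL
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

listing = requests.get(
    f"{HOST}/api/2.0/workspace/list",
    headers=HEADERS,
    params={"path": "/BigDataVis"},      # hypothetical workspace folder
).json()

for obj in listing.get("objects", []):
    if obj["object_type"] != "NOTEBOOK":
        continue
    exported = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers=HEADERS,
        params={"path": obj["path"], "format": "JUPYTER"},
    ).json()
    filename = obj["path"].rsplit("/", 1)[-1] + ".ipynb"
    with open(filename, "wb") as f:
        f.write(base64.b64decode(exported["content"]))
```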

> In addition to that, please ensure the code in the notebooks also works on Synapse Spark clusters (i.e., no Databricks-specific code).

I'll do what I can on this front, but there are going to be some Databricks-specific segments because we deploy a Databricks REST API as part of the lab. The other thing I see immediately is SparkR, which is not available in Synapse, and I'll see what else is in there.
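
For context, the scoring calls would look roughly like the following if we go the MLflow model serving route on Databricks; the endpoint, model name, columns, and payload shape are assumptions rather than the lab's actual code:

```python
# Hedged sketch: score rows against a Databricks-hosted model serving
# endpoint. The exact request format depends on the MLflow version in use.
import os
import requests

HOST = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Pandas "split" orientation payload; newer MLflow versions wrap this in a
# "dataframe_split" key instead.
payload = {
    "columns": ["OriginAirportCode", "Month", "DayofMonth", "WindSpeed"],
    "data": [["SEA", 10, 15, 7.0]],
}

response = requests.post(
    f"{HOST}/model/flight-delay-model/1/invocations",   # hypothetical model name/version
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(response.json())
```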

As a quick note, I realized that the URL in the lab for the Databricks .dbc file is the old one, so you might have seen the original notebooks. The most recent one is https://github.com/microsoft/MCW-Big-data-and-visualization/blob/issue99/BigDataVis.dbc?raw=true and I'll switch issue99 over to main with the PR.

> Is there a reason we're downloading data from the internet and then using the SHIR/VM to integrate the data? Why not just host the data in the GitHub repo and then build a pipeline to retrieve it directly?

The idea is to represent data stored on-premises, using the Integration Runtime to make it available to Azure services. I'm not sure if there are any other MCWs which cover Integration Runtime setup, so I'd be hesitant to drop it entirely. I'll take a peek to see if there are any others which include an Integration Runtime exercise and, if so, can make this change.

As far as local data generation goes, that randomness introduces variance between screenshots and what learners see in their own runs. That's certainly not a deal-killer on its own--after all, we have other workshops in which we generate random data--but does it add anything to the experience? I'm not sure it does but am open to your thoughts on it.

> All data users in Azure should use ADLS Gen2 and not Blob Storage. Blob Storage causes issues down the road most of the time (especially in ML scenarios, due to its lack of a filesystem: no empty directories).

There was a discussion about this specifically, with one of the members of the Synapse team recommending we not use ADLS Gen2 (and therefore Synapse Pipelines, which we had considered as a stand-alone replacement for ADF) for this scenario due to potential "provisioning and security issues," as well as the fact that we aren't taking advantage of any of the benefits of ADLS during the workshop. I can update the Whiteboard Design Session to note the benefits of ADLS Gen2 in a production scenario.

> Power BI connecting to Azure Databricks becomes expensive. I'd recommend having the Spark process write output data to ADLS Gen2 as Apache Parquet files and then using a Synapse SQL Serverless table as the source for Power BI. Especially for a lab, this will be much less costly, and it lays a better foundation for the user by not using Azure Databricks as a data source for reporting.

I can add a note about this in the Whiteboard Design Session, presenting it as a good idea when thinking about a move to production. For the lab itself, we use the learner-built Databricks cluster in Exercise 6 and then Power BI in Exercise 7, so additional cost wouldn't be a concern here: the cluster is still running, and its auto-shutdown time is longer than the expected time remaining in the lab (2 hours vs. 50 minutes).

@DawnmarieDesJardins merged commit 16fb265 into main on Nov 18, 2021