
October 2021 Update #100

Merged
merged 20 commits into main from issue99 on Nov 18, 2021

Conversation

feaselkl
Collaborator

Enhancements based on #99

These changes will also allow us to close #97, #96, #79, and #1 as fixed.

feaselkl and others added 5 commits October 16, 2021 19:58
- Created ARM template for non-optional resources.
- Updated whiteboard design session based on upcoming lab changes.
- Update lab notebooks to support Databricks REST API for scoring
- Remove Azure ML requirements and comments
- Update UI images and simplify exercise flow
@DawnmarieDesJardins changed the title from Issue99 to October 2021 Update on Oct 29, 2021
@OrrinEdenfield

I like the change to use an ARM template for the pre-lab setup. I'd suggest hosting this ARM template in the repo and allowing users to click to deploy from the GitHub Markdown page using the Custom Deployment capability of the Azure Portal. Parameterize everything so this becomes easy for the user.
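
Roughly speaking, the click-to-deploy link is just the raw template URL, URL-encoded into the portal's Custom Deployment deep link; a quick sketch in Python (the template path is hypothetical, assuming it lands in the main branch):

```python
from urllib.parse import quote

# Hypothetical raw URL of the ARM template once it is merged to main.
template_url = (
    "https://raw.githubusercontent.com/microsoft/"
    "MCW-Big-data-and-visualization/main/azuredeploy.json"
)

# The Azure Portal "Custom deployment" deep link takes the URL-encoded
# template location after #create/Microsoft.Template/uri/.
deploy_link = (
    "https://portal.azure.com/#create/Microsoft.Template/uri/"
    + quote(template_url, safe="")
)

# Markdown for a "Deploy to Azure" button on the lab's Markdown page.
print(f"[![Deploy to Azure](https://aka.ms/deploytoazurebutton)]({deploy_link})")
```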

Alternatively, the non-ARM deployment route forces the user to gain experience setting up these services (at least in the portal). If the user goes that route, then a step-by-step guide using an Azure CLI script would also be a terrific addition.

@OrrinEdenfield

I'd suggest using Synapse Analytics, which would simplify the solution:

- ADF becomes Synapse Pipelines
- Blob Storage becomes ADLS Gen2
- Azure Databricks becomes Synapse Spark (though I would recommend keeping Azure ML and not replacing it with Azure Databricks; if there is a specific scenario that can't be done using AML, then possibly keep Azure Databricks)
- Azure SQL DB becomes Synapse SQL Serverless

I don't see a need for ADF over Synapse Pipelines (we're not running SSIS packages, for example). Using Synapse also makes it much easier to import sample tables (I'm not sure if these specific datasets are available) via the Knowledge Center > Datasets, which integrates with the Azure Open Datasets catalog.

@OrrinEdenfield

I recommend making sure all the Spark code is stored in a non-proprietary format like *.ipynb and not .dbc. Even using .py would be better. This will ensure code can be used in other environments.

In addition to that, please ensure the code in the notebooks also works on Synapse Spark clusters (i.e., no Databricks-specific code).
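
As a rough illustration of what "no Databricks-specific code" means in practice, a sketch of portable PySpark that avoids display(), dbutils, and /mnt mount paths (the storage path and column names are made up):

```python
# Portable PySpark: runs on Databricks or Synapse Spark without changes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read via an abfss:// URI rather than a Databricks-mounted /mnt path
# (mounts are a Databricks-specific concept).
flights = spark.read.option("header", "true").csv(
    "abfss://data@examplestorage.dfs.core.windows.net/flights/"
)

delays = (
    flights.groupBy("OriginAirportCode")
           .agg(F.avg(F.col("DepDelay").cast("double")).alias("AvgDepDelay"))
)

# Use show() instead of the Databricks-only display() helper.
delays.show(10, truncate=False)
```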

@OrrinEdenfield

Is there a reason we're downloading data from the internet and then using the SHIR/VM to integrate the data? Why not just host the data in the GitHub repo and then build a pipeline to retrieve it directly?

I've seen so many customers take more complex approaches like the one described in this lab instead of the more direct, straightforward route of pulling the data from the internet and staging it directly in ADLS Gen2. Downloading the data only to then upload it makes no sense.

If the purpose of this part of the lab is to help the user learn about the SHIR, then perhaps a better way would be to generate data locally rather than download it. This would better simulate a real-world experience. The user could be given instructions on how to generate data locally in Excel, or some kind of simple data generator could be provided (like a simulated temperature sensor or something).
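
A minimal sketch of the kind of generator I mean, a simulated temperature sensor writing a local CSV for the SHIR to pick up (the file name and schema are made up):

```python
# Simulated temperature sensor: writes a local CSV for the SHIR to ingest.
import csv
import random
from datetime import datetime, timedelta

random.seed(42)  # a fixed seed keeps learner output close to lab screenshots

start = datetime(2021, 10, 1)
with open("sensor_readings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SensorId", "ReadingTime", "TemperatureC"])
    for minute in range(1440):  # one day of per-minute readings
        for sensor_id in range(1, 6):
            reading_time = start + timedelta(minutes=minute)
            temperature = round(random.gauss(21.0, 2.5), 2)
            writer.writerow([sensor_id, reading_time.isoformat(), temperature])
```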

@OrrinEdenfield

All data users in Azure should use ADLS Gen2 and not Blob Storage. Blob Storage causes issues down the road most of the time (especially in ML scenarios, due to its lack of a filesystem: no empty directories).
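
To illustrate the filesystem point, a small sketch using the azure-storage-file-datalake SDK, where an empty directory is a first-class object (the account, container, and path are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute a real storage account and container name.
service = DataLakeServiceClient(
    account_url="https://examplestorage.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("data")

# Creating an empty directory is a first-class operation with the
# hierarchical namespace; flat Blob Storage has no equivalent.
filesystem.create_directory("raw/sensor-data/2021/10")
```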

@OrrinEdenfield

Power BI connecting to Azure Databricks becomes expensive. I'd recommend having the Spark process write output data to ADLS Gen2 as Apache Parquet files and then using a Synapse SQL Serverless table as the source for Power BI. Especially for a lab, this will be much less costly, and it lays a better foundation for the user by not using Azure Databricks as a data source for reporting.
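
A rough sketch of the Spark half of that suggestion, writing the scored output as Parquet to ADLS Gen2 so a serverless SQL table or view over the folder can serve Power BI (paths and columns are placeholders):

```python
# Persist scored output as Parquet in ADLS Gen2 so a Synapse SQL Serverless
# table/view over this folder (rather than a running Databricks cluster)
# can be the Power BI source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

scored = spark.read.parquet(
    "abfss://data@examplestorage.dfs.core.windows.net/staging/scored/"
)

(scored.write
       .mode("overwrite")
       .partitionBy("FlightDate")   # hypothetical partition column
       .parquet("abfss://data@examplestorage.dfs.core.windows.net/curated/flight-delays/"))
```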

@feaselkl
Collaborator Author

feaselkl commented Nov 4, 2021

Thanks for your feedback, @OrrinEdenfield! I appreciate the time you've put into this and wanted to respond to each element in turn.

> I like the change to use an ARM template for the pre-lab setup. I'd suggest hosting this ARM template in the repo and allowing users to click to deploy from the GitHub Markdown page using the Custom Deployment capability of the Azure Portal. Parameterize everything so this becomes easy for the user.

It will be hosted in the repo with a click-to-deploy option. But the PR needs to go out before click-to-deploy can work because it relies on the ARM template already being in the main branch. So the short answer is, we'll do exactly that.

> I'd suggest using Synapse Analytics, which would simplify the solution.

This was discussed in an SME meeting in which the answer was to stay on Databricks, with a recommendation to create a separate Synapse-related workshop in the future. I think there's a lot of value in a Synapse + AI workshop, but rewriting this one to become a Synapse-based workshop wasn't in budget.

> I recommend making sure all the Spark code is stored in a non-proprietary format like *.ipynb and not .dbc. Even using .py would be better. This will ensure code can be used in other environments.

They're in .dbc format because we have several folders set up and want to keep the instructions as uncomplicated as possible. I should be able to change this to a .zip file of .ipynb notebooks but will need to confirm that the behavior is the same. I'll give that a try and incorporate it as long as there's no big problem with it. At the very least, doing this will allow learners to unpack the zip file and review the contents locally.
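
For what it's worth, a sketch of one way to produce the .ipynb copies via the Databricks Workspace API (the host, token, and folder are placeholders, assuming the notebooks live in a /BigDataVis workspace folder):

```python
# Export each notebook under a workspace folder as a Jupyter .ipynb file,
# using the Databricks Workspace API.
import base64
import os
import requests

HOST = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder workspace URL
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

listing = requests.get(
    f"{HOST}/api/2.0/workspace/list",
    headers=HEADERS,
    params={"path": "/BigDataVis"},      # hypothetical workspace folder
).json()

for obj in listing.get("objects", []):
    if obj["object_type"] != "NOTEBOOK":
        continue
    exported = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers=HEADERS,
        params={"path": obj["path"], "format": "JUPYTER"},
    ).json()
    filename = obj["path"].rsplit("/", 1)[-1] + ".ipynb"
    with open(filename, "wb") as f:
        f.write(base64.b64decode(exported["content"]))
```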

> In addition to that, please ensure the code in the notebooks also works on Synapse Spark clusters (i.e., no Databricks-specific code).

I'll do what I can on this front, but there are going to be some Databricks-specific segments because we deploy a Databricks REST API as part of the lab. The other thing I see immediately is SparkR, which is not available in Synapse, and I'll see what else is in there.
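
For context, the scoring calls would look roughly like the following if we go the MLflow model serving route on Databricks; the endpoint, model name, columns, and payload shape are assumptions rather than the lab's actual code:

```python
# Hedged sketch: score rows against a Databricks-hosted model serving
# endpoint. The exact request format depends on the MLflow version in use.
import os
import requests

HOST = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Pandas "split" orientation payload; newer MLflow versions wrap this in a
# "dataframe_split" key instead.
payload = {
    "columns": ["OriginAirportCode", "Month", "DayofMonth", "WindSpeed"],
    "data": [["SEA", 10, 15, 7.0]],
}

response = requests.post(
    f"{HOST}/model/flight-delay-model/1/invocations",   # hypothetical model name/version
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(response.json())
```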

As a quick note, I realized that the URL in the lab for the Databricks .dbc file is the old one, so you might have seen the original notebooks. The most recent one is https://github.com/microsoft/MCW-Big-data-and-visualization/blob/issue99/BigDataVis.dbc?raw=true and I'll switch issue99 over to main with the PR.

> Is there a reason we're downloading data from the internet and then using the SHIR/VM to integrate the data? Why not just host the data in the GitHub repo and then build a pipeline to retrieve it directly?

The idea is to represent data stored on-premises, using the Integration Runtime to make it available to Azure services. I'm not sure if there are any other MCWs which cover Integration Runtime setup, so I'd be hesitant to drop it entirely. I'll take a peek to see if there are any others which include an Integration Runtime exercise and, if so, can make this change.

As far as local data generation goes, that randomness introduces variance between screenshots and what learners see in their own runs. That's certainly not a deal-killer on its own--after all, we have other workshops in which we generate random data--but does it add anything to the experience? I'm not sure it does but am open to your thoughts on it.

> All data users in Azure should use ADLS Gen2 and not Blob Storage. Blob Storage causes issues down the road most of the time (especially in ML scenarios, due to its lack of a filesystem: no empty directories).

There was a discussion about this specifically, with one of the members of the Synapse team recommending we not use ADLS Gen2 (and therefore Synapse Pipelines, which we had considered as a stand-alone replacement for ADF) for this scenario due to potential "provisioning and security issues," as well as the fact that we aren't taking advantage of any of the benefits of ADLS during the workshop. I can update the Whiteboard Design Session to note the benefits of ADLS Gen2 in a production scenario.

> Power BI connecting to Azure Databricks becomes expensive. I'd recommend having the Spark process write output data to ADLS Gen2 as Apache Parquet files and then using a Synapse SQL Serverless table as the source for Power BI. Especially for a lab, this will be much less costly, and it lays a better foundation for the user by not using Azure Databricks as a data source for reporting.

I can add a note about this in the Whiteboard Design Session, presenting it as a good idea when thinking about a move to production. For the lab itself, we use the learner-built Databricks cluster in Exercise 6 and then Power BI in Exercise 7, so additional cost wouldn't be a concern here: the cluster is still running, and its auto-shutdown time is longer than the expected time remaining in the lab (2 hours vs. 50 minutes).

@DawnmarieDesJardins merged commit 16fb265 into main on Nov 18, 2021