- Created ARM template for non-optional resources.
- Updated whiteboard design session based on upcoming lab changes.
I like the change to use an ARM template for the pre-lab setup. I'd suggest that this ARM template be hosted in the repo and allow users to click to deploy from the GitHub Markdown page using the Custom Deployment capability of the Azure portal. Parameterize everything so this becomes easy for the user. Alternatively, the non-ARM deployment route forces the user to gain experience setting up these services (at least in the portal). If the user goes that route, then a step-by-step Azure CLI script would also be a terrific addition.
I'd suggest using Synapse Analytics, which simplifies the solution: ADF becomes Synapse Pipelines. I don't see a need for ADF over Synapse Pipelines (we're not running SSIS packages, for example). Using Synapse also makes it much easier to import sample tables (I'm not sure if these specific datasets are available) via Knowledge Center > Datasets, which integrates with the Azure Open Datasets catalog.
I recommend making sure all the Spark code is stored in a non-proprietary format like .ipynb and not .dbc. Even using .py would be better. This will ensure the code can be used in other environments. In addition, please ensure the code in the notebooks also works on Synapse Spark clusters (no Databricks-specific code).
Is there a reason we're downloading data from the internet and then using the SHIR/VM to integrate the data? Why not just host the data in the GitHub repo and then build a pipeline to retrieve the data directly? I've seen so many customers take more complex paths like the one described in this lab versus the more direct, straightforward way of just pulling the data from the internet and staging it directly in ADLS Gen2. Downloading the data only to then upload it makes no sense. If the purpose of this part of the lab is to help the user learn about the SHIR, then perhaps a better way would be to generate data locally rather than download it. This would better simulate a real-world experience. The user could be given instructions on how to generate data locally in Excel, or use some kind of unique data generator provided with the lab (like a simulated temperature sensor or something similar).
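The local-generation idea could be as simple as a small script the learner runs to emit simulated temperature-sensor readings to a CSV file for the SHIR to pick up. A minimal sketch, assuming nothing about the lab's actual schema (the file name, column names, and value ranges here are illustrative):

```python
import csv
import random
from datetime import datetime, timedelta

def generate_readings(n, start=None, seed=42):
    """Yield n simulated temperature-sensor readings as dicts.

    A fixed seed keeps runs reproducible, which helps keep learner
    output close to any screenshots in the lab guide.
    """
    rng = random.Random(seed)
    ts = start or datetime(2021, 1, 1)
    for i in range(n):
        yield {
            "sensor_id": f"sensor-{rng.randint(1, 5):02d}",
            "timestamp": (ts + timedelta(minutes=i)).isoformat(),
            "temperature_c": round(20.0 + rng.gauss(0, 2.5), 2),
        }

def write_csv(path, n=100):
    """Write n simulated readings to a CSV file; returns the row count."""
    rows = list(generate_readings(n))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A pipeline could then point the SHIR's file-system linked service at the directory this script writes into.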
All data users in Azure should use ADLS Gen2 and not Blob storage. Blob causes issues down the road most times (especially in ML scenarios, due to its lack of a filesystem [no empty directories]).
Power BI connecting to Azure Databricks becomes expensive. I'd recommend the Spark process write output data to ADLS Gen2 as Apache Parquet files and then use a Synapse SQL serverless table as the source for Power BI. Especially for a lab, this will be much less cost-prohibitive, and it lays a better foundation for the user not to use ADB as a data source for reporting.
Thanks for your feedback, @OrrinEdenfield! I appreciate the time you've put into this and wanted to respond to each element in turn.
It will be hosted in the repo with a click-to-deploy option. But the PR needs to go out before click-to-deploy can work because it relies on the ARM template already being in the main branch. So the short answer is, we'll do exactly that.
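For what it's worth, the click-to-deploy link is just the raw template URL, percent-encoded, appended to a fixed Azure portal prefix, which is why the template has to exist in the main branch first. A sketch of building that link (the repo path and template file name below are assumptions until the PR lands):

```python
from urllib.parse import quote

PORTAL_PREFIX = "https://portal.azure.com/#create/Microsoft.Template/uri/"

def deploy_to_azure_link(raw_template_url):
    """Build a 'Deploy to Azure' portal link for a hosted ARM template.

    The portal expects the raw template URL percent-encoded, including
    its '/' characters, hence safe="".
    """
    return PORTAL_PREFIX + quote(raw_template_url, safe="")

# Hypothetical path -- the actual file name may differ once merged.
link = deploy_to_azure_link(
    "https://raw.githubusercontent.com/microsoft/"
    "MCW-Big-data-and-visualization/main/azuredeploy.json"
)
```

The resulting link can back a "Deploy to Azure" badge in the README.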
This was discussed in an SME meeting in which the answer was to stay on Databricks, with a recommendation to create a separate Synapse-related workshop in the future. I think there's a lot of value in a Synapse + AI workshop, but rewriting this one to become a Synapse-based workshop wasn't in budget.
They're in .dbc format because we have several folders set up and want to keep instructions as uncomplicated as possible. I should be able to change this to a .zip file of .ipynb notebooks but will need to confirm that behavior is the same. I'll give that a try and incorporate as long as there's no big problem with it. At least doing this will allow learners to unpack the zip file and review contents locally.
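Since .ipynb is plain JSON, packaging the notebooks as a zip could be scripted with nothing but the standard library. A minimal sketch (the notebook name and cell contents are illustrative, and this builds only a bare-bones nbformat-4 skeleton, not a full export):

```python
import json
import zipfile

def minimal_notebook(source_lines):
    """Return a minimal nbformat-4 notebook dict with one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [{
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": source_lines,
        }],
    }

def bundle_notebooks(zip_path, notebooks):
    """Write {filename: notebook-dict} entries into one zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, nb in notebooks.items():
            zf.writestr(name, json.dumps(nb, indent=1))
```

Because the archive members are ordinary JSON files, learners can unpack the zip and review each notebook's contents in any editor before importing.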
I'll do what I can on this front, but there are going to be some Databricks-specific segments because we use the Databricks REST API as part of the lab. The other thing I see immediately is SparkR, which is not available in Synapse, and I'll see what else is in there. As a quick note, I realized that the URL in the lab for the Databricks .dbc file is the old one, so you might have seen the original notebooks. The most recent one is https://github.com/microsoft/MCW-Big-data-and-visualization/blob/issue99/BigDataVis.dbc?raw=true and I'll switch it over.
The idea is to represent data stored on-premises, using the Integration Runtime to make it available to Azure services. I'm not sure if there are any other MCWs which cover Integration Runtime setup and thus would be hesitant to drop it entirely. I'll take a peek to see if there are any others which include an Integration Runtime exercise and if so, can make this change. As far as local data generation goes, that randomness introduces variance between screenshots and what learners see in their own runs. That's certainly not a deal-killer on its own--after all, we have other workshops in which we generate random data--but does it add anything to the experience? I'm not sure it does but am open to your thoughts on it.
There was a discussion about this specifically, with one of the members of the Synapse team recommending we not use ADLS Gen2 (and therefore Synapse Pipelines, which we had considered as a stand-alone replacement for ADF) for this scenario due to potential difficulties with "provisioning and security issues," as well as that we aren't taking advantage of any of the benefits to ADLS during the workshop. I can update the Whiteboard Design Session to note the benefits of ADLS Gen2 in a production scenario.
I can add a note about this in the Whiteboard Design Session, including it as a good idea when thinking about a move to production. For the lab itself, we use the learner-built Databricks cluster in Exercise 6 and then Power BI in Exercise 7, so additional cost wouldn't be a concern here--the cluster is still running, and the auto-shutdown time is longer than the expected time remaining in the lab (2 hours vs. 50 minutes).
Commits:
- …OL - Big data analytics and visualization.md: New workshop title
- … HOL step-by-step - Big data analytics and visualization.md: Title change.
- QC - removed retirement survey
- QC pass - updated template.
- QC pass and template update
- …er guide - Big data analytics and visualization.md: Title change
- …nt guide - Big data analytics and visualization.md: Title change
Enhancements based on #99
These changes will also allow us to close #97, #96, #79, and #1 as fixed.