Project Description

Azure Databricks is an Apache Spark-based big data analytics and machine learning framework optimized for the Microsoft Azure Cloud. Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

This repository consists of seven labs that are designed to help you to understand how to use Azure Databricks for different use cases, including:

Analysis of structured and unstructured data.
Analysis of streaming data.
Machine Learning (ML) and Machine Learning Operations (MLOPs)

The seven labs are implemented as seven independent Azure Databricks notebooks as described below.

Lab-01_Data Engineering

This tutorial helps you understand how to use Azure Databricks Spark to prepare raw data for analytics.

Lab-02_SparkSQL & Performance Optimization

This tutorial help you to understand the capabilities and features of Spark SQL and the various performance options provided by Azure Databricks.

Lab-03_Machine Learning

This tutorial helps you understand the capabilities and features of Azure Spark MLlib for machine learning. It shows how to construct the end-to-end process for building and refining a machine learning model.

Lab-04_Real-time Stream Analytics

This tutorial helps you understand Azure Databricks Spark Structured Streaming. It shows the end-to-end process starting with data ingestion into a Azure Databricks cluster in near real-time, through analysis of the the streaming data and integration with machine learning.

Lab-05_Databricks Delta

This tutorial is helps you understand the features and capabilities of Azure Databricks Delta. Azure Databricks Delta is a next-generation unified analytics engine built on Apache Spark™. It provides ACID transactions, optimized layouts and indexes to enable big data use cases, from batch and streaming ingests, fast interactive queries to machine learning.

Lab-06_Data orchestration using Azure Data Factory

This tutorial helps you understand how Azure Data Factory (ADF) can be used with Azure Databricks, to create and automate piplines.

Lab-07_MLOps with Azure Databricks

This tutorial helps you understand how to use the MLOps approach to automate machine learning with Azure Databricks. MLOps is a practice for collaboration and communication between data scientists and operations professionals to help manage production ML (or deep learning) lifecycle. MLOps in Azure provided through the integration of two Azure Services:

Azure Machine Learning Service (Azure ML Service)
Azure DevOps

Getting Started

Follow these instruction to download the Azure Databricks lab notebooks from this reposiory and import them into Azure Databricks.

Step 1: Create an Azure Databricks workspace using the Azure portal

1.In the Azure portal, select Create a resource > Data + Analytics > Azure Databricks.

2.Under Azure Databricks Service, provide the values to create a Azure Databricks workspace.

3.Provide the following values:

a. Workspace name : A name for your Azure Databricks workspace

b. Subscription : From the drop-down, select your Azure subscription.

c. Resource group : Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.

d. Location : Specify the location of your Azure Databricks cluster.

e. Pricing Tier : Choose between Standard or Premium.

You can choose the Select Pin to Dashboard option to pin the resource to the Azure Dashboard

Then click Create

4.The workspace creation takes a few minutes. During workspace creation, the portal displays "Submitting deployment for Azure Databricks" tile on the right side. You may need to scroll right on your dashboard to see the tile. There is also a progress bar displayed near the top of the screen. You can watch either area for progress.

5.In the Azure portal, go to the Azure Databricks workspace that you created, and then click Launch Workspace.

6.You are now inside your Databricks workspace.

Step 2: Download the labs from the Bitbucket repository.

1.Go to Download tab and click on Download Repository.

2.The downloaded file is in the zip format.

3.If you extract the ZIP file, you will see the different lab notebooks. All the notebooks are of type .dbc, which stands for Databricks Archive.

Step 3: Import the Notebooks into the Azure Databricks Workspace.

1.Click the Workspace button or the Home button in the sidebar. Select Import as shown in the screenshot. Alternatively, next to any folder, click the right side of the text and select Import.

3.Browse to the location of your lab DBC file, and click import.

Step 4: Creating a Databricks Cluster

In Databricks you can create two different types of clusters:

1.Standard : Standard clusters are the default and can be used with Python, R, Scala, and SQL.

2.High Concurrency : High-concurrency clusters are tuned to provide the efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users.

We'll be using a Standard cluster for our Labs.

1.Click the clusters icon in the sidebar.

2.Click the Create Cluster button at the top of the page.

3.Click the Create button. The cluster list page shows the status of the new cluster.

a. On the Create Cluster page, specify the cluster name QS.

b. select 5.2 (Scala 2.11, Spark 2.4.0) in the Databricks Runtime Version drop-down.

c.Click Create Cluster

Step 6: Attach the Notebook to your cluster.

Go to the Databricks WorkSpace

Go to your specific notebook in the workspace.

Click on detached written on the top-left corner, and you'll see a list of active clusters.
Attach your notebook to a running cluster. A green icon shall appear on top, meaning our notebook is now attached to our cluster and is ready to be executed.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
Lab-01_ Data Engineering.dbc		Lab-01_ Data Engineering.dbc
Lab-02_ SparkSQL & Performance Optimisation.dbc		Lab-02_ SparkSQL & Performance Optimisation.dbc
Lab-03_ Machine Learning.dbc		Lab-03_ Machine Learning.dbc
Lab-04_ Real-time Stream Analytics.dbc		Lab-04_ Real-time Stream Analytics.dbc
Lab-05_ Databricks Delta.dbc		Lab-05_ Databricks Delta.dbc
Lab-06_ Data orchestration using Azure Data Factory.dbc		Lab-06_ Data orchestration using Azure Data Factory.dbc
Lab-07_Azure MLOps.dbc		Lab-07_Azure MLOps.dbc
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lab-01_ Data Engineering.dbc

Lab-01_ Data Engineering.dbc

Lab-02_ SparkSQL & Performance Optimisation.dbc

Lab-02_ SparkSQL & Performance Optimisation.dbc

Lab-03_ Machine Learning.dbc

Lab-03_ Machine Learning.dbc

Lab-04_ Real-time Stream Analytics.dbc

Lab-04_ Real-time Stream Analytics.dbc

Lab-05_ Databricks Delta.dbc

Lab-05_ Databricks Delta.dbc

Lab-06_ Data orchestration using Azure Data Factory.dbc

Lab-06_ Data orchestration using Azure Data Factory.dbc

Lab-07_Azure MLOps.dbc

Lab-07_Azure MLOps.dbc

README.md

README.md

Repository files navigation

Project Description

Lab-01_Data Engineering

Lab-02_SparkSQL & Performance Optimization

Lab-03_Machine Learning

Lab-04_Real-time Stream Analytics

Lab-05_Databricks Delta

Lab-06_Data orchestration using Azure Data Factory

Lab-07_MLOps with Azure Databricks

Getting Started

Step 1: Create an Azure Databricks workspace using the Azure portal

Step 2: Download the labs from the Bitbucket repository.

Step 3: Import the Notebooks into the Azure Databricks Workspace.

Step 4: Creating a Databricks Cluster

Step 6: Attach the Notebook to your cluster.

Your Databricks Environment is now setup and you can proceed with the labs.

About

Releases

Packages

Contributors 2

nthacker/Azure-Databricks-Labs

Folders and files

Latest commit

History

Repository files navigation

Project Description

Lab-01_Data Engineering

Lab-02_SparkSQL & Performance Optimization

Lab-03_Machine Learning

Lab-04_Real-time Stream Analytics

Lab-05_Databricks Delta

Lab-06_Data orchestration using Azure Data Factory

Lab-07_MLOps with Azure Databricks

Getting Started

Step 1: Create an Azure Databricks workspace using the Azure portal

Step 2: Download the labs from the Bitbucket repository.

Step 3: Import the Notebooks into the Azure Databricks Workspace.

Step 4: Creating a Databricks Cluster

Step 6: Attach the Notebook to your cluster.

Your Databricks Environment is now setup and you can proceed with the labs.

About

Resources

Stars

Watchers

Forks