Workshop: SQL Server Big Data Clusters - Architecture
A Microsoft Course from the SQL Server team
- About this Workshop
- Business Applications of this Workshop
- Technologies used in this Workshop
- Before Taking this Workshop
- Workshop Details
- Related Workshops
- Workshop Modules
- Next Steps
Welcome to this Microsoft solutions workshop on the architecture of SQL Server Big Data Clusters. In this workshop, you'll learn how SQL Server Big Data Clusters (BDC) implement large-scale data processing and machine learning, how to select and plan for the proper architecture to train your machine learning models using Python, R, Java, or SparkML, how to operationalize these models, and how to deploy your intelligent apps side-by-side with their data.
The focus of this workshop is to understand how to deploy an on-premises or local environment of a big data cluster, and understand the components of the big data solution architecture.
You'll start by understanding the concepts of big data analytics, and you'll get an overview of the technologies (such as containers, container orchestration, Spark and HDFS, machine learning, and other technologies) that you will use throughout the workshop. Next, you'll understand the architecture of a BDC. You'll learn how to create external tables over other data sources to unify your data, and how to use Spark to run big queries over your data in HDFS or do data preparation. You'll review a complete solution for an end-to-end scenario, with a focus on how to extrapolate what you have learned to create other solutions for your organization.
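To make the idea of an external table over HDFS data more concrete, here is a minimal sketch in Python that only renders the T-SQL text for a `CREATE EXTERNAL TABLE` statement. The table name, columns, HDFS path, data source name, and file format name are all hypothetical placeholders, and the sketch assumes that the external data source and file format have already been created in the SQL Server instance.

```python
def render_external_table(table, columns, location,
                          data_source="SqlStoragePool",  # assumed to exist already
                          file_format="csv_file"):       # assumed to exist already
    """Render a T-SQL CREATE EXTERNAL TABLE statement as a string.

    Illustration only: names are interpolated without escaping,
    so pass trusted values and review the output before running it.
    """
    column_lines = ",\n".join(
        f"    [{name}] {sql_type}" for name, sql_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE [{table}] (\n"
        f"{column_lines}\n"
        ")\n"
        "WITH (\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    LOCATION = '{location}',\n"
        f"    FILE_FORMAT = {file_format}\n"
        ");"
    )


if __name__ == "__main__":
    # Hypothetical table over a hypothetical HDFS directory.
    print(render_external_table(
        "web_clickstreams_hdfs",
        [("wcs_user_sk", "BIGINT"), ("wcs_click_date_sk", "BIGINT")],
        "/clickstream_data",
    ))
```

The rendered statement can be pasted into a query window in Azure Data Studio. Once the table exists, it can be queried and joined like any other table, which is what lets you unify data that stays in its original location.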
This README.MD file explains how the workshop is laid out, what you will learn, and the technologies you will use in this solution.
In this workshop you'll learn:
- When to use Big Data technology
- The components and technologies of Big Data processing
- Abstractions such as Containers and Container Management as they relate to SQL Server and Big Data
- Planning and architecting an on-premises, in-cloud, or hybrid big data solution with SQL Server
- How to install SQL Server big data clusters on-premises and in the Azure Kubernetes Service (AKS)
- How to work with Apache Spark
- The Data Science Process to create an end-to-end solution
- How to work with the tooling for BDC (Azure Data Studio)
- Monitoring and managing the BDC
- Security considerations
Starting in SQL Server 2019, big data clusters allow for large-scale, near real-time processing of data over the HDFS file system and other data sources. They also leverage the Apache Spark framework, which is integrated into a single environment for management, monitoring, and security. This means that organizations can implement everything from queries to analysis to Machine Learning and Artificial Intelligence within SQL Server, over large-scale, heterogeneous data. SQL Server big data clusters can be implemented fully on-premises, in the cloud using a Kubernetes service such as Azure's AKS, or in a hybrid fashion. This allows for full, partial, and mixed security and control as desired.
The goal of this workshop is to train the team tasked with architecting and implementing SQL Server big data clusters in the planning, creation, and delivery of a system designed to be used for large-scale data analytics. Since there are multiple technologies and concepts within this solution, the workshop uses multiple types of exercises to prepare the students for this implementation.
The concepts and skills taught in this workshop form the starting points for:
- Data Professionals and DevOps teams, to implement and operate a SQL Server big data cluster system.
- Solution Architects and Developers, to understand how to put together an end-to-end solution.
- Data Scientists, to understand the environment used to analyze and solve specific predictive problems.
Businesses require near real-time insights from ever-larger sets of data from a variety of sources. Large-scale data ingestion requires scale-out storage and processing in ways that allow fast response times. In addition to simply querying this data, organizations want full analysis and even predictive capabilities over their data.
Some industry examples of big data processing are in Retail (demand prediction, market-basket analysis), Finance (fraud detection, customer segmentation), Healthcare (fiscal control analytics, disease prevention prediction and classification, clinical trials optimization), Public Sector (revenue prediction, education effectiveness analysis), Manufacturing (predictive maintenance, anomaly detection), and Agriculture (food safety analysis, crop forecasting), to name just a few.
The solution includes the following technologies. You are not limited to these, but they form the basis of the workshop. At the end of the workshop you will learn how to extrapolate these components into other solutions. You will cover these at an overview level, with references to much deeper training provided.
|Linux||Operating system used in Containers and Container Orchestration|
|Containers||Encapsulation level for the SQL Server big data cluster architecture|
|Container Orchestration (such as Kubernetes)||Management, control plane for Containers|
|Microsoft Azure||Cloud environment for services|
|Azure Kubernetes Service (AKS)||Kubernetes as a Service|
|Apache HDFS||Scale-out storage subsystem|
|Apache Knox||The Knox Gateway provides a single access point for all REST interactions, used for security|
|Apache Livy||Job submission system for Apache Spark|
|Apache Spark||In-memory large-scale, scale-out data processing architecture used by SQL Server|
|Python, R, Java, SparkML||ML/AI programming languages used for Machine Learning and AI Model creation|
|Azure Data Studio||Tooling for SQL Server, HDFS, Big Data cluster management, T-SQL, R, Python, and SparkML languages|
|SQL Server Machine Learning Services||R, Python and Java extensions for SQL Server|
|Microsoft Team Data Science Process (TDSP)||Project, Development, Control and Management framework|
|Monitoring and Management||Dashboards, logs, APIs and other constructs to manage and monitor the solution|
|Security||RBAC, Keys, Secrets, VNETs and Compliance for the solution|
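As an illustration of how two of the components above fit together, the following sketch builds (but does not send) an HTTP request that submits a Spark batch job through Apache Livy. The `/batches` endpoint and the `file`, `args`, and `name` fields come from the Livy REST API. The gateway URL, job path, and job name are hypothetical placeholders, and authentication through the Knox gateway is deliberately left out.

```python
import json
import urllib.request


def build_livy_batch_request(livy_base_url, app_file, args=None, name=None):
    """Build a POST request for Livy's /batches endpoint without sending it."""
    payload = {"file": app_file}       # path to the Spark application
    if args:
        payload["args"] = list(args)   # command-line arguments for the job
    if name:
        payload["name"] = name         # optional display name for the batch
    return urllib.request.Request(
        url=livy_base_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical gateway address and job file, for illustration only.
    request = build_livy_batch_request(
        "https://gateway.example:30443/gateway/default/livy/v1",
        "/jobs/prepare_data.py",
        args=["2019"],
        name="prepare-data",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch safe to run anywhere. In a real cluster you would add the credentials your gateway requires and pass the request to `urllib.request.urlopen`.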
Condensed Lab: If you have already completed the pre-requisites for this course and are familiar with the technologies listed above, you can jump to a Jupyter Notebooks-based tutorial located here. Load these with Azure Data Studio, starting with bdc_tutorial_00.ipynb.
You'll need a local system that you are able to install software on. The workshop demonstrations and examples all use Microsoft Windows as the operating system. Optionally, you can install the software on a Microsoft Azure Virtual Machine (VM) and work with the solution there.
You must have a Microsoft Azure account with the ability to create assets, specifically the Azure Kubernetes Service (AKS).
This workshop expects that you understand data structures and working with SQL Server and computer networks. This workshop does not expect you to have any prior data science knowledge, but a basic knowledge of statistics and data science is helpful in the Data Science sections. Knowledge of SQL Server, Azure Data and AI services, Python, and Jupyter Notebooks is recommended. AI techniques are implemented in Python packages. Solution templates are implemented using Azure services, development tools, and SDKs. You should have a basic understanding of working with the Microsoft Azure Platform.
If you are new to these, here are a few references you can complete prior to class:
A full prerequisites document is located here. These instructions should be completed before the workshop starts, since you will not have time to cover these in class. Remember to turn off any Virtual Machines from the Azure Portal when not taking the class so that you do not incur charges (shutting down the machine in the VM itself is not sufficient).
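If you prefer the command line to the Azure Portal, the Azure CLI command `az vm deallocate` also stops a VM and releases its compute resources. The sketch below wraps that command in Python. The resource group and VM names are hypothetical placeholders, and the helper defaults to a dry run that only returns the command text. It assumes the Azure CLI is installed and that you are already signed in.

```python
import shlex
import subprocess


def deallocate_command(resource_group, vm_name):
    """Return the Azure CLI argument list that deallocates one VM."""
    return [
        "az", "vm", "deallocate",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]


def deallocate_vm(resource_group, vm_name, dry_run=True):
    """Return the command text and, unless dry_run is set, also run it."""
    command = deallocate_command(resource_group, vm_name)
    if not dry_run:
        # Raises CalledProcessError if the CLI reports a failure.
        subprocess.run(command, check=True)
    return shlex.join(command)


if __name__ == "__main__":
    # Hypothetical names; replace with your own resource group and VM.
    print(deallocate_vm("bdc-workshop-rg", "bdc-workshop-vm"))
```

Deallocating stops compute billing for the VM, but storage for its disks may still be charged, so delete resources you no longer need once the workshop is over.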
This workshop uses Azure Data Studio, Microsoft Azure AKS, and SQL Server (2019 and higher) with a focus on architecture and implementation.
|Primary Audience:||System Architects and Data Professionals tasked with implementing Big Data, Machine Learning and AI solutions|
|Secondary Audience:||Security Architects, Developers, and Data Scientists|
- Technical guide to the Cortana Intelligence Solution Template for predictive maintenance in aerospace and other businesses
This is a modular workshop, and in each section, you'll learn concepts, technologies and processes to help you complete the solution.
|01 - The Big Data Landscape||Overview of the workshop, problem space, solution options and architectures|
|02 - SQL Server BDC Components||Abstraction levels, frameworks, architectures and components within SQL Server big data clusters|
|03 - Planning, Installation||Mapping the requirements to the architecture design, constraints, and diagrams|
|04 - Operationalization||Connecting applications to the solution; DDL, DML, DCL|
|05 - Management and||Tools and processes to manage the big data cluster|
|06 - Security||Access and Authentication to the various levels of the solution|
Next, continue to prerequisites