Institute for Advanced Analytics

Distributed Analytics & Machine Learning - Dan Zaratsian, March 2021

IAA Module - Session 1 - Distributed Services and Platform Overview

Asset Directory

Slides

Introduction and Module Agenda
Distributed Computing
Walk-through of Tools and Services for Big Data
Distributed Architectures and Use Cases
Google Colab Notebook Environment
Google BigQuery Sandbox

IAA Module - Session 2 - SQL and NoSQL Services

Asset Directory

Slides

Hadoop 101
Intro to Apache Hive
Apache Hive Syntax and Schema Design
Intro to Apache HBase and Apache Phoenix (NoSQL)
Apache HBase Schema Design & Best Practices
Apache Phoenix Syntax
Intro to Apache SparkSQL
Apache SparkSQL
BigQuery (Serverless SQL)
Google Cloud Firestore (NoSQL)

Assignment

Assignment 1 SQL - Solution
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com
Assignment 2 NoSQL - Solution
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com

IAA Module - Session 3 - Spark Data Processing & Machine Learning

Asset Directory

Slides

Apache Spark Overview
Spark Machine Learning (MLlib)
ML Pipelines
Building and deploying Spark machine learning models
Considerations for ML in distributed environments
Spark Best Practices and Tuning
Spark Code Walk-through (within Google Colab)

Assignment

Assignment 3
- Due on Friday, April 2
- Please complete as an individual assignment
- Email your code to d.zaratsian@gmail.com

IAA Module - Session 4 - SparkML & Scikit-learn Model Deployment

NOTE: This week's slides are a continuation of Session 3

Spark Pipeline Components
Spark Best Practices
Deploying / Submitting Spark Applications
Scikit-learn Model Training (with NFL Notebook)
Scikit-learn Model Deployment Process

IAA Module - Session 5 - Realtime, Streaming Systems

Asset Directory

Slides

Apache Kafka
Google Pub/Sub
Demo of Pub/Sub
Spark Streaming
Demo of Spark Streaming
Apache Beam (Google Dataflow)

IAA Module - Session 6 - CloudML & Serverless Deployments

Asset Directory

Slides

Overview of Google Cloud
BigQueryML
AutoML
Serverless functions with Google Cloud Functions
Container-Based Deployments

Assignment

Assignment 4 - SparkML or Docker Container
- Due on Wednesday, April 14, 2021
- Additional Docker content will be covered on Friday
- Email me with any questions regarding the assignment.
- Please submit your code by email to d.zaratsian@gmail.com

kevcraig/iaa_2021
