Distributed Analytics & Machine Learning - Dan Zaratsian, March 2021
- Introduction and Module Agenda
- Distributed Computing
- Walk-through of Tools and Services for Big Data
- Distributed Architectures and Use Cases
- Google Colab Notebook Environment
- Google BigQuery Sandbox
- Hadoop 101
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Intro to Apache HBase and Apache Phoenix (NoSQL)
- Apache HBase Schema Design & Best Practices
- Apache Phoenix Syntax
- Intro to Apache SparkSQL
- Apache SparkSQL
- BigQuery (Serverless SQL)
- Google Cloud Firestore (NoSQL)
Assignment
-
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com
-
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com
- Apache Spark Overview
- Spark Machine Learning (MLlib)
- ML Pipelines
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
- Spark Code Walk-through (within Google Colab)
Assignment
- Assignment 3
- Due on Friday, April 2
- Please complete as an individual assignment
- Email your code to d.zaratsian@gmail.com
NOTE: Slides from this week were a continuation of Session 3
- Spark Pipeline Components
- Spark Best Practices
- Deploying / Submitting Spark Applications
- Scikit-learn Model Training (with NFL Notebook)
- Scikit-learn Model Deployment Process
- Apache Kafka
- Google PubSub
- Demo of PubSub
- Spark Streaming
- Demo of Spark Streaming
- Apache Beam (Google Dataflow)
- Overview of Google Cloud
- BigQueryML
- AutoML
- Serverless functions with Google Cloud Functions
- Container Based Deployments
Assignment
- Assignment 4 - SparkML or Docker Container
- Due on Wednesday, April 14, 2021
- Additional Docker content will be covered on Friday
- Email me with any questions regarding the assignment.
- Please submit your code by email to d.zaratsian@gmail.com