ozkary/Data-Engineering-Bootcamp

Overview

This repository follows a data engineering bootcamp. As the course progresses, I add my thoughts, approach, and solutions.

Technologies

  • Docker, Docker Hub
  • Terraform
  • Python
  • Git, GitHub, GitHub Codespaces
  • Google Cloud

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

Week 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Introduction to Prefect
  • ETL with GCP & Prefect
  • Parametrizing workflows
  • Prefect Cloud and additional resources
  • Homework

Week 3: Data Warehouse

  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Prefect and Airflow
  • BigQuery Machine Learning

Week 4: Analytics Engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

Week 5: Batch Processing with Spark

  • Batch data processing
  • What is Spark
    • Spark Dataframes
    • Spark SQL
    • Internals: GroupBy and joins

Week 6: Data Streaming with Kafka

  • Kafka Actors
    • Topic
    • Consumer
    • Producer
  • Streams vs State
  • Aggregates
  • Streaming with Spark
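
To make the "Streams vs State" and "Aggregates" topics a little more concrete, here is a minimal pure-Python sketch. It does not use Kafka or Spark and is not taken from the course material. The event shape `(key, value)`, the function name `running_totals`, and the sample data are assumptions made up for illustration. The stream is the sequence of individual events, and the state is the aggregate that is updated as each event arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (key, value). Not a Kafka message format.
Event = Tuple[str, int]


def running_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Consume a stream of events and yield the aggregate state after each one."""
    state: Dict[str, int] = defaultdict(int)
    for key, value in events:
        # The stream carries one event at a time; the state accumulates them.
        state[key] += value
        # Yield a copy so earlier snapshots are not mutated by later events.
        yield dict(state)


# Made-up sample events standing in for messages read from a topic.
events = [("yellow", 2), ("green", 1), ("yellow", 3)]
snapshots = list(running_totals(events))
final_state = snapshots[-1]
print(final_state)
```

Each snapshot is the state at a point in time, while the list of events is the stream that produced it. In a real pipeline the aggregation would typically be handled by the streaming framework rather than by hand.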