mjrulesamrat/gcp-data-engineer

Professional Data Engineer

A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data.

A data engineer should be able to design, build, operationalize, secure, and monitor data processing systems, with a particular emphasis on:

- security and compliance; 
- scalability and efficiency; 
- reliability and fidelity; 
- flexibility and portability. 

A data engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

These skills break down into the following four parts:

  1. Designing data processing systems

    • Selecting the appropriate storage technologies
    • Designing data pipelines
    • Designing a data processing solution
    • Migrating data warehousing and data processing
  2. Building and operationalizing data processing systems

    • Building and operationalizing storage systems
    • Building and operationalizing pipelines
    • Building and operationalizing processing infrastructure
  3. Operationalizing machine learning models

    • Deploying an ML pipeline
    • Leveraging pre-built ML models as a service
    • Choosing the appropriate training and serving infrastructure
    • Measuring, monitoring, and troubleshooting machine learning models
  4. Ensuring solution quality

    • Designing for security and compliance
    • Ensuring scalability and efficiency
    • Ensuring reliability and fidelity
    • Ensuring flexibility and portability

Further services breakdown

Managed Databases

  • Cloud SQL
  • Datastore
  • Bigtable
  • Cloud Spanner

Data Engineering Architecture

  • Realtime messaging with Pub/Sub
  • Data Pipelines with Cloud Dataflow
  • Dataproc

Analyzing Data and enabling Machine Learning

  • BigQuery
  • AI Platform
  • Pretrained ML APIs
  • Datalab

Data Visualization

  • Dataprep
  • Data Studio

Monitoring & Orchestration

  • Cloud Composer

Google Cloud services

These are Google's supporting services that help you work with the Big Data and ML services provided by Google Cloud.

Cloud Compute Engine

  • Scalable and high-performance virtual machines

Cloud IAM roles

  • Fine-grained access control and visibility for centrally managing cloud resources.

Cloud Monitoring & Cloud Logging

  • Monitoring for applications on Google Cloud and AWS.

  • Logging for applications on Google Cloud and AWS.

1. Cloud SQL (Relational)

  • Fully managed relational database service for MySQL, PostgreSQL, and SQL Server

  • Use cases: WordPress sites, application backends, game states, CRM tools, and existing MySQL, PostgreSQL, or Microsoft SQL Server workloads

  • Comparable services: AWS RDS, AWS Aurora, Azure Database, Azure SQL Database

2. Cloud Storage (Objects)

  • Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

  • Globally unique bucket name.
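Because bucket names share one global namespace, it can help to check a candidate name locally before asking the service to create it. The sketch below covers only a simplified subset of the naming rules (lowercase letters, digits, dashes, underscores, and dots; a letter or digit at both ends; 3 to 63 characters; no reserved `goog` prefix). The function name is hypothetical, and the sketch conservatively rejects the longer dotted names that the service itself may allow. Global uniqueness cannot be checked locally; only the service can confirm it when the bucket is created.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules.
# Assumption: dotted names longer than 63 characters and other reserved
# words are out of scope for this sketch.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified local checks."""
    if name.startswith("goog"):
        return False  # the "goog" prefix is reserved
    # fullmatch enforces the 3-63 length and the start/end character rules
    return _BUCKET_NAME.fullmatch(name) is not None
```

A name that passes this check can still be rejected at creation time if another project already owns it.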

3. Cloud Spanner (Relational)

  • Fully managed, scalable, relational database service for regional and global application data

  • Cloud Spanner is a scalable relational database service built to support transactions, strong consistency, and high availability across regions and continents.

  • Comparable services: Cassandra (with CQL), AWS Aurora, AWS DynamoDB, Azure Cosmos DB

4. Firebase Realtime Database (No-SQL)

  • The Firebase Realtime Database is a cloud-hosted NoSQL database that lets you store and sync data between your users in real time.

  • Comparable services: MongoDB, AWS DynamoDB, Azure Cosmos DB

5. Cloud Firestore (No-SQL)

  • Cloud Firestore is a fast, fully managed, serverless, cloud-native NoSQL document database.

  • Enterprise-grade, scalable NoSQL

  • Sync data across devices, on or offline

  • Comparable services: MongoDB, AWS DynamoDB, Azure Cosmos DB

7. Cloud Memorystore (No-SQL)

  • Cloud Memorystore is a fully managed in-memory data store service for Redis built on scalable, more secure, and highly available infrastructure.

  • Easy lift-and-shift migration of applications from open-source Redis to Memorystore.

  • Comparable services: AWS ElastiCache, Azure Cache for Redis

8. Big Data Ecosystem

  • MapReduce
  • Apache Hadoop & HDFS
  • Apache Spark
  • Apache Pig
  • Apache Tez
  • Apache Kafka
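As a reminder of the MapReduce model that the rest of this ecosystem builds on, here is a minimal single-process word count. It is only a sketch: the function names are illustrative, and a real framework such as Hadoop distributes the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))
```

Calling `word_count(["big data", "Big query"])` counts `big` twice and the other two words once each.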

9. Cloud Pub/Sub

  • Global messaging and event ingestion

  • Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications.

  • Decouple background data and event processing from the code that handles user-facing requests

  • Streamed events, IoT data, and metrics can be ingested into Cloud Pub/Sub
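The decoupling idea can be sketched without any cloud dependency. The toy broker below is not the Pub/Sub client API; `InMemoryBroker` and its methods are hypothetical names that only illustrate how a publisher writes to a topic without knowing which subscriptions exist, while each subscription receives its own copy of every message.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class InMemoryBroker:
    """Toy stand-in for a messaging service (illustration only)."""

    def __init__(self) -> None:
        # topic name -> {subscription name -> queue of pending messages}
        self._subscriptions: Dict[str, Dict[str, Deque[str]]] = defaultdict(dict)

    def subscribe(self, topic: str, subscription: str) -> None:
        # Register a subscription; existing queues are left untouched.
        self._subscriptions[topic].setdefault(subscription, deque())

    def publish(self, topic: str, message: str) -> None:
        # Fan out: every subscription on the topic gets its own copy.
        for queue in self._subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic: str, subscription: str) -> Optional[str]:
        # Return the oldest pending message, or None if the queue is empty.
        queue = self._subscriptions[topic][subscription]
        return queue.popleft() if queue else None


broker = InMemoryBroker()
broker.subscribe("events", "billing")
broker.subscribe("events", "analytics")
broker.publish("events", "order-created")
```

After the publish call, both the `billing` and `analytics` subscriptions can pull `order-created` independently, which is what lets background processing evolve separately from the code that handles user-facing requests.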

10. Cloud Dataflow

  • Managed Apache Beam: fast, unified stream and batch data processing

  • Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. With its serverless approach to resource provisioning and management, you have access to virtually limitless capacity to solve your biggest data processing challenges, while paying only for what you use.

  • Horizontal autoscaling of worker resources to maximize resource utilization

  • Can be connected to Pub/Sub to process data in batch or streaming mode
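"Unified stream and batch" means the same transform can run over a bounded dataset or an unbounded stream. The sketch below illustrates that idea in plain Python rather than with the Apache Beam SDK: one hypothetical function, `count_per_window`, counts keys per fixed time window and accepts either a list (batch) or a generator (stream). It assumes events arrive ordered by event time, which real pipelines cannot assume; Beam handles late and out-of-order data with watermarks and triggers.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

Event = Tuple[int, str]  # (event time in seconds, key)


def count_per_window(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Counter]]:
    """Yield (window_start, counts) each time the input enters a new window.

    Assumption: events are ordered by event time.
    """
    current_start: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, key in events:
        # Align the timestamp to the start of its fixed window.
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, counts  # the previous window is complete
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts  # flush the last window of a bounded input
```

Because the function only iterates over its input, it produces results window by window for a stream and produces the same windows for a finite list.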

11. Cloud Dataproc

  • Managed Apache Spark and Hadoop clusters

  • Also supports Apache Pig and Apache Hive

  • Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

12. Cloud Bigtable (No-SQL)

  • Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.

  • Distributed storage, with each row indexed by a single row key

  • Suited to running large analytical workloads and building low-latency applications

  • Comparable services: HBase, Cassandra, AWS DynamoDB, Azure Cosmos DB
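Bigtable keeps rows sorted lexicographically by row key, so the key layout decides which rows sit next to each other and how cheap a range scan is. The sketch below shows one commonly used layout, an entity id followed by a reversed timestamp, so that the newest row for a device sorts first. The function name, the `#` separator, and the 13-digit millisecond timestamp are assumptions made for illustration.

```python
MAX_TIMESTAMP_MS = 10**13 - 1  # assumption: 13-digit millisecond timestamps


def make_row_key(device_id: str, timestamp_ms: int) -> bytes:
    """Build a row key of the form `<device_id>#<reversed timestamp>`."""
    # Subtracting from the maximum makes newer timestamps sort first.
    reversed_ts = MAX_TIMESTAMP_MS - timestamp_ms
    # Zero-pad so lexicographic order matches numeric order.
    return f"{device_id}#{reversed_ts:013d}".encode("utf-8")


older = make_row_key("sensor-1", 1700000000000)
newer = make_row_key("sensor-1", 1700000001000)
other = make_row_key("sensor-2", 1700000001000)
ordered = sorted([older, other, newer])
```

Here `ordered` lists the newer `sensor-1` row, then the older `sensor-1` row, then the `sensor-2` row, so a prefix scan on `sensor-1#` would return that device's readings from newest to oldest.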

13. Cloud BigQuery

  • BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. BigQuery is NoOps: there is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of the pay-as-you-go model.

  • Serverless, real-time analytics, advanced and predictive analytics, large-scale events, and enterprises

  • Comparable services: AWS Redshift, Snowflake, and Azure SQL Data Warehouse (now Azure Synapse Analytics)

14. Cloud Datalab

  • Use Cloud Datalab to easily explore, visualize, analyze, and transform data interactively using familiar languages such as Python and SQL. Pre-installed Jupyter introductory, sample, and tutorial notebooks show you how to:

    • Access, analyze, monitor, and visualize data

    • Use notebooks with Python, TensorFlow machine learning, and the Google Analytics, Google BigQuery, and Google Charts APIs

    • Store these notebooks in Cloud Storage (GCS) and reopen them at any time

15. Cloud Data Studio

  • Serverless BI reporting and dashboards

  • Google Data Studio is a fully managed visual analytics service that can help anyone in your organization unlock insights from data through easy-to-create and interactive dashboards that inspire smarter business decision-making.

  • When Data Studio is combined with BigQuery BI Engine, an in-memory analysis service, data exploration and visual interactivity reach sub-second speeds, over massive datasets.

16. Cloud Composer

  • Cloud Composer is a managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.

  • Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure.
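An Airflow workflow is a directed acyclic graph (DAG) of tasks, and the scheduler runs a task only after the tasks it depends on have finished. The sketch below illustrates that ordering with Python's standard-library `graphlib` rather than the Airflow API; the task names and helper functions are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline: Dict[str, Set[str]] = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag: Dict[str, Set[str]]) -> bool:
    """A workflow is only schedulable if its dependencies contain no cycle."""
    try:
        run_order(dag)
        return True
    except CycleError:
        return False
```

For this linear pipeline, `run_order(pipeline)` returns the tasks in the order extract, transform, load, report, while a graph in which two tasks depend on each other fails the `is_acyclic` check.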


Understanding Machine Learning

1. TensorFlow

2. Pre-trained ML Cloud APIs

3. AutoML platform by Google Cloud

4. Operationalizing ML models with Google Cloud services


Ensuring Quality (Data security & industry regulations)

1. Data Security

2. Data Privacy

3. Regulations

4. IAM roles to achieve proper security

About

My public notes on the path to complete Google Cloud Professional Data Engineer Certifications
