SQL, NoSQL, Apache Spark, PySpark, Hadoop, Data Cleaning, EDA, ETL, Data Warehouse, Data Mining, Data Crawling, Pipelines, Architectures, Data Structures
Data Engineering Roadmap
Programming Languages: Python, Scala, Java
Operating Systems & Scripting: Linux, Unix, Shell Scripting
Data Structures & Algorithms (Average Level, No Hard Level): Arrays, Strings, Linked List, Stack, Queue, Tree (Basics), Graph (Basics), Dynamic Programming, Searching, Sorting
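As a small illustration of the "Searching" item at the level this roadmap targets, here is a minimal binary search sketch in Python. The function name `binary_search` and the sample list are assumptions made up for illustration; they are not part of the roadmap.

```python
from typing import Optional, Sequence


def binary_search(items: Sequence[int], target: int) -> Optional[int]:
    """Return the index of target in a sorted sequence, or None if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return None


# Hypothetical sample data; the input must already be sorted.
sorted_ids = [3, 8, 15, 23, 42, 57]
found_at = binary_search(sorted_ids, 23)
missing = binary_search(sorted_ids, 5)
```

Each step halves the search range, so lookups take O(log n) comparisons instead of the O(n) of a linear scan.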
Core Basics of DBMS: DDL, DCL, DML, Integrity Constraints, Data Schema, Basic Operations, ACID Properties, Transactions, Concurrency Control, Deadlock, Indexing, Hashing, Normalization Forms, Views, Stored Procedures, ER Diagrams
SQL Scripting: Transactional Databases (MySQL, PostgreSQL), All Types of Joins, Nested Queries, GROUP BY, Use of CASE WHEN Statements, Window Functions
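The SQL topics above (joins, GROUP BY, CASE WHEN, window functions) can be tried out without installing a database server. The sketch below uses SQLite through Python's built-in `sqlite3` module as a stand-in for MySQL or PostgreSQL; the `customers` and `orders` tables and their rows are hypothetical, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 30.0), (3, 2, 75.0);
""")

# LEFT JOIN keeps customers with no orders; GROUP BY aggregates per customer;
# CASE WHEN labels each customer by total spend.
summary = conn.execute("""
    SELECT c.name,
           COALESCE(SUM(o.amount), 0) AS total,
           CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high' ELSE 'low' END AS tier
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

# Window function: running total of order amounts within each customer.
running = conn.execute("""
    SELECT customer_id, id, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
    FROM orders
    ORDER BY customer_id, id
""").fetchall()
conn.close()
```

The same queries should run on MySQL 8+ and PostgreSQL with little or no change, though that is worth verifying against the specific server version.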
NoSQL Databases: HBase, DataStax Cassandra (Recommended), Elasticsearch, MongoDB
Data Exploration Libraries: Pandas, NumPy
Data Warehousing Concepts: OLAP vs OLTP, Dimension Tables, Fact Tables, Star Schema, Snowflake Schema, Warehouse Design Questions, and many more topics
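To make the fact table, dimension table, and star schema items concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are all hypothetical; a real warehouse would use a dedicated engine and far richer dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 20240101, 2, 2000.0),
                                  (2, 20240101, 1, 300.0),
                                  (1, 20240201, 1, 1000.0);
""")

# A typical OLAP-style query: join the fact table to its dimensions,
# then aggregate a measure by dimension attributes.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
conn.close()
```

In a snowflake schema, `category` would move out of `dim_product` into its own table, trading one extra join for less duplicated data.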
Basic Terminologies in BigData: What is BigData?, 5 V’s of BigData, Distributed Computation, Distributed Storage, Vertical vs Horizontal Scaling, Commodity Hardware, Clusters, File Formats (CSV, JSON, Avro, Parquet, ORC), Types of Data (Structured, Unstructured, Semi-structured)
BigData Frameworks:
  Apache Hadoop (understanding the architecture is most important): HDFS, MapReduce, YARN
  Apache Hive: Loading data in different file formats, Internal Tables, External Tables, Querying table data stored in HDFS, Partitioning, Bucketing, Map-Side Join, Sort-Merge Join, UDFs in Hive, SerDe in Hive
  Apache Spark (Most Important): Spark Core, Spark SQL, Spark Streaming
  Apache Sqoop
  Apache NiFi
  Apache Flume
Workflow Schedulers, Dependency Management: Apache Airflow, Azkaban
Messaging Queue Frameworks: Apache Kafka
Dashboarding Tools: Tableau, Power BI, Grafana, Kibana (part of ELK: Elasticsearch, Logstash, Kibana)
BigData Services in Cloud (AWS):
  On-demand Machines: AWS EC2
  Access Management: AWS IAM
  Storing and Accessing Credentials: AWS Secrets Manager
  Distributed File Storage: AWS S3
  Transactional Database Services: AWS RDS
  Query and Warehouse Services: AWS Athena (querying data in S3), AWS Redshift (Data Warehousing)
  NoSQL Database Services: AWS DynamoDB
  Serverless: AWS Lambda
  ETL Services: AWS Glue
  Scheduler: AWS CloudWatch
  Distributed Data Computation: AWS EMR
  Messaging Queue: AWS SNS, AWS SQS
  Real-Time Data Processing: AWS Kinesis