prafulacharya/Data_Engineering

Data_Engineering

SQL, NoSQL, Apache Spark, PySpark, Hadoop, Data Cleaning, EDA, ETL, Data Warehouse, Data Mining, Data Crawling, Pipelines, Architectures, Data Structures

Data Engineering Roadmap

Programming Languages: Python, Scala, Java

Operating Systems & Scripting: Linux, Unix, Shell Scripting

Data Structures & Algorithms (average level, no hard-level problems): Arrays, Strings, Linked List, Stack, Queue, Tree (basics), Graph (basics), Dynamic Programming, Searching, Sorting

Core Basics of DBMS: DDL, DCL, DML, Integrity Constraints, Data Schema, Basic Operations, ACID Properties, Transactions, Concurrency Control, Deadlock, Indexing, Hashing, Normalization Forms, Views, Stored Procedures, ER Diagrams
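As one illustration of the ACID and transaction topics above, here is a minimal sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the sample balances are hypothetical and exist only for this example; a production database such as MySQL or PostgreSQL would use its own driver but follow the same idea.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; either both updates apply or neither does."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # made in the first statement has been rolled back as well.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits `bob` before the debit on `alice` fails, so the unchanged balance for `bob` afterwards shows the rollback undoing a partially applied transaction.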

SQL Scripting: Transactional Databases (MySQL, PostgreSQL), All Types of Joins, Nested Queries, Group By, Use of Case When Statements, Window Functions
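The SQL topics listed above can be tried out without installing a database server. The sketch below runs them against an in-memory SQLite database through Python's `sqlite3` module, which is an assumption made for convenience (the roadmap names MySQL and PostgreSQL). The `customers` and `orders` tables and their rows are made up for the example, and the window function needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (1, 1, 120), (2, 1, 80), (3, 2, 40);
""")

# LEFT JOIN keeps customers without orders; GROUP BY aggregates per customer;
# CASE WHEN turns the aggregate into a label.
summary = conn.execute("""
SELECT c.name,
       COALESCE(SUM(o.amount), 0) AS total,
       CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high'
            WHEN COUNT(o.id) = 0 THEN 'none'
            ELSE 'low'
       END AS tier
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()

# Nested query: orders larger than the average order amount.
above_avg = conn.execute("""
SELECT id FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders)
ORDER BY id
""").fetchall()

# Window function: running total of order amounts within each customer.
running = conn.execute("""
SELECT customer_id, id, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
FROM orders
ORDER BY customer_id, id
""").fetchall()

print(summary)
print(above_avg)
print(running)
```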

NoSQL Databases: HBase, DataStax Cassandra (recommended), Elasticsearch, MongoDB

Data Exploration Libraries: Pandas, NumPy

Data Warehousing Concepts: OLAP vs OLTP, Dimension Tables, Fact Tables, Star Schema, Snowflake Schema, Warehouse Design Questions, and many more topics
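To make the fact table, dimension table, and star schema terms above concrete, here is a small sketch built on an in-memory SQLite database. The `fact_sales`, `dim_product`, and `dim_date` tables and all of their rows are invented for illustration; a real warehouse would be far larger and would normally live in a dedicated warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);

-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    revenue INTEGER
);

INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES
    (20240115, '2024-01-15', '2024-01'), (20240210, '2024-02-10', '2024-02');
INSERT INTO fact_sales VALUES
    (1, 20240115, 2, 2000), (2, 20240115, 1, 500),
    (3, 20240210, 4, 800), (1, 20240210, 1, 1000);
""")

# A typical OLAP-style question: revenue sliced by two dimension attributes.
revenue_by_category_month = conn.execute("""
SELECT p.category, d.month, SUM(f.revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()

print(revenue_by_category_month)
```

In a snowflake schema, `category` would move out of `dim_product` into its own table referenced by a key, trading an extra join for less repeated data.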

Basic Terminologies in Big Data: What is Big Data?, 5 V's of Big Data, Distributed Computation, Distributed Storage, Vertical vs Horizontal Scaling, Commodity Hardware, Clusters
  File Formats: CSV, JSON, Avro, Parquet, ORC
  Types of Data: Structured, Unstructured, Semi-structured
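The difference between structured and semi-structured data is easy to see with two of the file formats above. This sketch uses only the Python standard library, and the sample records are made up. Avro, Parquet, and ORC need third-party libraries and are left out here.

```python
import csv
import io
import json

# Structured: CSV rows share one fixed set of columns, and every value
# arrives as a string unless it is converted explicitly.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ben,Oslo\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each JSON record carries its own keys, can nest other
# objects or lists, and keeps basic types such as numbers.
json_lines = (
    '{"id": 1, "name": "Asha", "tags": ["vip"]}\n'
    '{"id": 2, "name": "Ben", "address": {"city": "Oslo"}}\n'
)
records = [json.loads(line) for line in json_lines.splitlines()]

csv_columns = [sorted(row) for row in rows]
json_keys = [sorted(record) for record in records]
print(csv_columns)
print(json_keys)
```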

Big Data Frameworks:
  Apache Hadoop (understanding the architecture is most important): HDFS, MapReduce, YARN
  Apache Hive: How to load data in different file formats, Internal Tables, External Tables, Querying table data stored in HDFS, Partitioning, Bucketing, Map-Side Join, Sort-Merge Join, UDFs in Hive, SerDe in Hive
  Apache Spark (most important): Spark Core, Spark SQL, Spark Streaming
  Apache Sqoop
  Apache NiFi
  Apache Flume
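The MapReduce model named above can be sketched in plain Python before touching a cluster. This is a single-process illustration of the map, shuffle, and reduce steps for a word count, not Hadoop's actual API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

On a real cluster each line (or block of lines) would be mapped on the node that stores it, and the shuffle would move pairs across the network so that all values for one key reach the same reducer.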

Workflow Schedulers, Dependency Management: Apache Airflow, Azkaban
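Workflow schedulers model a pipeline as a directed acyclic graph of tasks and run each task only after its dependencies finish. The sketch below shows just that ordering idea with the standard-library `graphlib` module (Python 3.9+). It is not Airflow or Azkaban code, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "refresh_dashboard": {"load_warehouse"},
    "send_report": {"load_warehouse"},
}


def run_order(deps):
    """Return one valid execution order, or raise CycleError if none exists."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)

# A circular dependency cannot be scheduled, so it is rejected up front.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

`refresh_dashboard` and `send_report` do not depend on each other, so a scheduler is free to run them in parallel once `load_warehouse` completes.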

Messaging Queue Frameworks: Apache Kafka

Dashboarding Tools: Tableau, Power BI, Grafana, Kibana (part of ELK: Elasticsearch, Logstash, Kibana)

Big Data Services in Cloud (AWS):
  On-demand Machines: AWS EC2
  Access Management: AWS IAM
  Storing and Accessing Credentials: AWS Secrets Manager
  Distributed File Storage: AWS S3
  Transactional Database Services: AWS RDS, AWS Athena, AWS Redshift (data warehousing)
  NoSQL Database Services: AWS DynamoDB
  Serverless: AWS Lambda
  ETL Services: AWS Glue
  Scheduler: AWS CloudWatch
  Distributed Data Computation: AWS EMR
  Messaging Queue: AWS SNS, AWS SQS
  Real-Time Data Processing: AWS Kinesis

About

Data Science, Statistics, SQL and Data Analysis.
