This course includes multiple sections and is mainly focused on the Databricks Data Engineer certification exam. It contains the following tutorials:
- Spark SQL ETL
- PySpark ETL
All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets
Follow the article below to learn how to clone this repository into your Databricks workspace:
https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/
This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, including (a short Spark SQL sketch follows the list):
- Create, Select, Update, Delete tables
- Create database
- Filtering data
- Group by & aggregation
- Ordering
- SQL joins
- Common table expression (CTE)
- External tables
- Subqueries
- Views & temp views
- UNION, INTERSECT, EXCEPT keywords
- Versioning, time travel & optimization
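To give a flavour of the first few topics, here is a minimal sketch using spark.sql from a Databricks notebook. The database, table, and column names are placeholders, not the course's actual demo objects:

```python
# Create a database and a managed Delta table, then insert a couple of rows
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.customers (
        customer_id INT,
        country     STRING
    ) USING DELTA
""")
spark.sql("INSERT INTO demo_db.customers VALUES (1, 'IN'), (2, 'US')")

# Filtering, grouping & aggregation, and ordering in one query
spark.sql("""
    SELECT country, COUNT(*) AS customer_count
    FROM demo_db.customers
    GROUP BY country
    ORDER BY customer_count DESC
""").show()
```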
This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform & Load: we will see how to extract data from various sources, process it, and finally load the processed data into our destination. A minimal PySpark sketch follows the topic list below.
This course includes:
- Read files
- Schema handling
- Handling JSON files
- Write files
- Basic transformations
- Partitioning
- Caching
- Joins
- Missing value handling
- Data profiling
- Date & time functions
- String functions
- Deduplication
- Grouping & aggregation
- User defined functions
- Ordering data
- Case study - sales order analysis
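As a taste of what these lessons cover, here is a minimal extract-transform-load sketch. It is not one of the course notebooks; the file paths and column names are illustrative placeholders only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# Extract: read a CSV file with a header row (hypothetical path)
orders = spark.read.option("header", True).csv("/FileStore/datasets/orders.csv")

# Transform: basic cleanup, deduplication, date handling, grouping & ordering
daily_totals = (
    orders
    .dropDuplicates()
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy("order_date")
)

# Load: write the result back out, partitioned by date (hypothetical path)
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet(
    "/FileStore/output/daily_totals"
)
```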
You can download all the notebooks from our GitHub repo: https://github.com/martandsingh/ApacheSpark
Facebook: https://www.facebook.com/codemakerz
Email: martandsays@gmail.com
You will see the initial_setup & clean_up notebooks referenced in every notebook. It is mandatory to run both scripts in the defined order: the initial_setup notebook creates all the tables & databases required for the demo, and after you finish a lesson notebook, executing the clean_up notebook removes all of those database objects.
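In a Databricks notebook this usually means a %run cell at the top and another at the end. Each %run must sit alone in its own cell, and the relative paths below are an assumption about where the notebooks live in your workspace:

```
%run ./initial_setup
```

...run your lesson cells, then finish with:

```
%run ./clean_up
```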
pyspark_init_setup - this notebook copies the datasets from my GitHub repo to DBFS and also generates the used car parquet dataset. All the datasets will be available at:
/FileStore/datasets
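Once pyspark_init_setup has run, you can sanity-check the copied files from any notebook. The sketch below lists the folder and reads one file back; the parquet file name is only a hypothetical example, so check the listing for the real names:

```python
# List the files that pyspark_init_setup copied into DBFS
display(dbutils.fs.ls("/FileStore/datasets"))

# Read one of them back - the file name here is a hypothetical example
used_cars = spark.read.parquet("/FileStore/datasets/used_cars.parquet")
used_cars.printSchema()
```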