GitHub - osin-vladimir/architect_big_data_solutions_with_spark: code, labs and lectures for the course

Architect Big Data Solutions with Apache Spark

Introduction

This repository contains lectures and code for a course that gives a gentle introduction to building distributed big data pipelines with Apache Spark. Apache Spark is an open-source data processing engine for engineers and analysts that includes an optimized general execution runtime and a set of standard libraries for building data pipelines, advanced algorithms, and more. Spark is rapidly becoming the compute engine of choice for big data. Spark programs are typically more concise than equivalent Hadoop MapReduce jobs and, for some workloads, can run 10-100 times faster. As companies realize this, Spark developers are becoming increasingly valued.

In this course we will cover both the architectural and the practical side of using Apache Spark to implement big data solutions. We will use Spark Core, Spark SQL, Spark Streaming, and Spark ML to implement advanced analytics and machine learning algorithms in a production-like data pipeline. The course will help you master designing solutions for common big data tasks such as creating batch and real-time data processing pipelines, doing machine learning at scale, deploying machine learning models into a production environment, and much more!

Content

Introduction [lecture 1] [labs] [pyspark Python cheat sheet]
SQL and DataFrame [labs] [pyspark SQL cheat sheet]
Batch Processing [lecture 2] [lecture 3]
Stream Processing [lecture 4] [lecture 5] [labs]
Machine Learning [lecture 6] [labs]

Computational Resources

Please register for the Community Edition of Databricks here.
Please register for a free-tier AWS account here.

Data Sources

You can find data and additional information from the links below:

Note: For your convenience, the data has already been downloaded to the Datasets folder of this repository.

Note: You can upload data to Databricks directly or use an AWS S3 bucket for storage.
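
As a rough illustration of the S3 option, here is a minimal PySpark sketch for loading a CSV file from a bucket into a DataFrame. The bucket name, the object key, and the helper names `s3_uri`, `read_csv_from_s3`, and `demo` are all hypothetical and not part of this repository. It also assumes S3 credentials are already configured for your cluster and that the file has a header row.

```python
def s3_uri(bucket: str, key: str, scheme: str = "s3a") -> str:
    """Build an S3 URI such as s3a://bucket/path/file.csv (hypothetical helper)."""
    return f"{scheme}://{bucket.strip('/')}/{key.lstrip('/')}"


def read_csv_from_s3(spark, bucket: str, key: str):
    """Load a CSV object from S3 into a Spark DataFrame.

    `spark` is an existing SparkSession. Assumes the file has a header row;
    column types are inferred, which costs an extra pass over the data.
    """
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(s3_uri(bucket, key))
    )


def demo():
    """Example usage; not called automatically because it needs a Spark runtime."""
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read-sketch").getOrCreate()
    # Placeholder bucket and key: replace with your own.
    df = read_csv_from_s3(spark, "my-course-bucket", "Datasets/example.csv")
    df.printSchema()
```

In a Databricks notebook a `spark` session is already available, so you can call `read_csv_from_s3(spark, ...)` directly instead of building a session yourself.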

Additional Resources

We provide links to useful cheat sheets and books to make the course as smooth as possible:

Course Initiative:

If you like the initiative, please star or fork this repository and feel free to contribute with pull requests.

