Data Algorithms with Spark by Mahmoud Parsian
"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..." Dr. Matei Zaharia, Original Creator of Apache Spark
Foreword by Dr. Matei Zaharia (Original Creator of Apache Spark)
Author: Mahmoud Parsian
- This new O'Reilly book is the successor edition of Data Algorithms (also published by O'Reilly)
- This book uses PySpark (much simpler and more readable)
- @OReillyMedia: Data Algorithms with Spark, by @mahmoudparsian
- Author contact: [ Email ] [ Mahmoud Parsian @LinkedIn ][ Mahmoud Parsian @GitHub ]
- This GitHub repository will host all source code and scripts for Data Algorithms with Spark
- Chapter solutions are provided in PySpark and Scala
  - PySpark solutions are provided by Mahmoud Parsian
  - Scala solutions are provided by Deepak Kumar and Biman Mandal

All programs are tested with the following software:

Spark | Python | Scala | Java |
---|---|---|---|
Apache Spark 3.4.0 | Python 3.10.5 | Scala 2.13 | Java 11 |

Chapter | Title |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Chapter 1 | Introduction to Data Algorithms |
Chapter 2 | Transformations in Action |
Chapter 3 | Mapper Transformations |
Chapter 4 | Reductions in Spark |
Chapter 5 | Partitioning Data |
Chapter 6 | Graph Algorithms |
Chapter 7 | Interacting with External Data Sources |
Chapter 8 | Ranking Algorithms |
Chapter 9 | Fundamental Data Design Patterns |
Chapter 10 | Common Data Design Patterns |
Chapter 11 | Join Design Patterns |
Chapter 12 | Feature Engineering in PySpark |

Bonus Chapter | Title / Description |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Word Count | Solutions for Word Count using RDDs and DataFrames |
Anagrams | Find words that are anagrams |
Lambda Expressions | Using Lambda Expressions in PySpark programs |
TF-IDF | Term Frequency - Inverse Document Frequency |
K-mers | K-mers for DNA Sequences |
Correlation | All vs. All Correlation |
Mapping Partitions | mapPartitions() Complete Example |
UDF | User-Defined Function Examples |
DataFrames Transformations | Examples of creating and transforming DataFrames |
DataFrames Tutorials | DataFrames tutorials: from collections and CSV text files |
Join Operations | Examples of joining RDDs and DataFrames |
PySpark Tutorial 101 | Examples of using PySpark RDDs and DataFrames |
Physical Data Partitioning | Tutorial on physical data partitioning |
Monoids and Combiners | Monoid as a Design Principle |
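
As a rough illustration of the "Monoids and Combiners" topic listed above, here is a minimal sketch in plain Python (no Spark required). It is not taken from the book or this repository: the data, the `combine` helper, and the two-partition layout are hypothetical, chosen only to show why a combiner should be associative and have an identity element. Averaging per-partition averages fails that requirement, while combining `(sum, count)` pairs satisfies it.

```python
from functools import reduce

# Hypothetical data: the values for one key, split across two partitions.
partitions = [[1, 2, 3], [4, 5]]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Naive combiner: the mean of per-partition means.
# This is not a monoid operation, so the answer depends on how
# the data happens to be partitioned.
naive = mean([mean(p) for p in partitions])


def combine(a, b):
    """Associative combiner over (sum, count) pairs; identity is (0, 0)."""
    return (a[0] + b[0], a[1] + b[1])


# Monoid-friendly approach: reduce each partition to a (sum, count) pair,
# merge the pairs, and divide only once at the very end.
partials = [(sum(p), len(p)) for p in partitions]
total, count = reduce(combine, partials, (0, 0))
correct = total / count

print(naive)    # 3.25 (wrong: partitions have different sizes)
print(correct)  # 3.0  (matches the mean of [1, 2, 3, 4, 5])
```

In PySpark, the same `(sum, count)` pairing is the kind of structure one would typically pass through a reduction such as `reduceByKey()`, with the final division done in a later mapping step.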