# About this book

## Who should read this book

**_Data Science with Python and Dask_** was written primarily for beginner to intermediate data scientists, data engineers, and analysts, specifically those who have not yet mastered working with datasets that push the limits of a single machine. Prior experience with other distributed frameworks (such as PySpark) is not necessary, but readers who have it can still benefit by comparing Dask's capabilities and ergonomics with those of the frameworks they already know. Various articles and documentation are available online, but none focus on using Dask for data science as comprehensively as this book does.

## How this book is organized: A roadmap

This book has 11 chapters organized into three parts.

Part 1 lays the foundation in scalable computing and provides a few simple examples of how Dask uses these concepts to scale out workloads.

- Chapter 01: Introduces Dask and the concept of **directed acyclic graphs** (DAGs).
- Chapter 02: Explains how Dask uses DAGs to distribute work across multiple CPU cores and even physical machines. It covers how to visualize the DAGs automatically generated by the task scheduler and how the task scheduler divides up resources to process data efficiently.
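
To make the DAG idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+), not Dask itself. The task names and the `execution_waves` helper are hypothetical and chosen purely for illustration. It shows how a dependency graph determines which tasks may run at the same time, which is the kind of information a task scheduler works from.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key is a task, and each value is the set of
# tasks that must finish before it can start.
dependencies = {
    "load_a": set(),
    "load_b": set(),
    "clean_a": {"load_a"},
    "clean_b": {"load_b"},
    "join": {"clean_a", "clean_b"},
    "summarize": {"join"},
}


def execution_waves(graph):
    """Group tasks into waves of tasks whose dependencies are all satisfied.

    Tasks within one wave do not depend on each other, so they could be
    handed to separate workers and run in parallel.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this toy graph the two load tasks form the first wave and the two clean tasks form the second, so each pair could run concurrently, while `join` and `summarize` must wait for everything upstream. A real scheduler also has to weigh data locality, memory, and worker availability, which this sketch ignores.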

Part 2 covers common data cleaning, analysis, and visualization tasks with structured data using the Dask DataFrame construct.

- Chapter 03: Describes the conceptual design of Dask DataFrames and how they abstract and parallelize Pandas DataFrames.
- Chapter 04: Discusses how to create Dask DataFrames from various data sources and formats, such as text files, databases, S3, and Parquet files.
- Chapter 05: Takes a deep dive into using DataFrames to clean and transform datasets. It covers sorting, filtering, dealing with missing values, joining datasets, and writing DataFrames in several file formats.
- Chapter 06: Covers using built-in aggregate functions (such as **sum** and **mean**), as well as writing your own aggregate and window functions. It also discusses how to produce basic descriptive statistics.
- Chapter 07: Demonstrates how to create basic visualizations, such as pairplots and heatmaps.
- Chapter 08: Builds on Chapter 07 with advanced visualizations that add interactivity and geographic features.

Part 3 covers advanced topics in Dask, such as unstructured data, machine learning, and building scalable workloads.

- Chapter 09: Demonstrates how to parse, clean, and analyze unstructured data using Dask Bags and Arrays.
- Chapter 10: Shows how to build machine learning models from Dask data sources, and how to test and persist trained models.
- Chapter 11: Completes the book by walking through how to set up a Dask cluster on AWS using Docker.

## About the code

The code accompanying this book is available at:

- https://github.com/jcdaniel91/data-science-python-dask
- https://www.manning.com/downloads/1746

